
Leveraging Gemini Models for Multimodal Queries in Node.js

In this video, I provide a detailed guide on how to utilize the new Gemini series, including Gemini Flash and Gemini Pro, to handle multiple file types like audio, video, images, and text within a single query, taking advantage of a massive context window of up to a million tokens. I'll explain the capabilities and the interesting use cases enabled by these models, such as comparing different media types. Furthermore, I'll cover the pricing details, including a competitive cost structure and a free tier option. Additionally, I'll include a step-by-step coding tutorial on setting up and making requests to the models, leveraging Google AI Studio and GitHub resources for easier implementation. Lastly, I'll highlight the difference in performance and cost between the Gemini Flash and Pro models through practical examples.

00:00 Introduction to Gemini Series Models
00:17 Exploring Gemini Flash: Capabilities and Use Cases
01:01 Understanding the Context Window and Its Potential
02:00 Pricing and Accessibility of Gemini Models
02:50 Getting Started: Tools and Resources
03:08 Step-by-Step Coding Tutorial
05:43 Demonstrating Gemini's Capabilities with Examples
09:10 Conclusion and GitHub Resources

Repo: github.com/developersdigest/gemini-flash-api
---
type: transcript
date: 2024-05-21
youtube_id: TJOrVx8ewpY
---

# Transcript: Gemini Flash API: 10-Minute Multimodal Crash Course

In this video I'm going to run through an example of how you can get set up with the new Gemini series of models, including Gemini Flash and Gemini Pro, and how you can leverage that huge context window of up to a million tokens. I'm going to show you how to upload audio, video, images, and text all within one query.

Before I get into that, I first want to dive into Gemini Flash itself and why it's really interesting. This is a model that allows you to have up to a million tokens of context while also adding in all of those different file types within that million tokens: this can be videos like I mentioned, this can be audio, this can be images. The thing that's interesting is being able to pass in all of these different modalities at once. You can have some pretty interesting use cases: you can ask it to compare different things between, say, the video and the images, or compare the different images, or what have you. The possibilities opened up by native support for all of these modalities within one model are really powerful.

Just to give you an idea of how many tokens this is, there's a really good example on the DeepMind website: a million tokens is about an hour of video, 11 hours of audio, codebases with more than 30,000 lines of code, or over 700,000 words. I was playing around with this, and as a plain text input I passed in essentially the entire HTML document of what's known as an S-1 filing. This got pretty close to the million tokens of context, and the document is about 300 pages long. You're able to ask very specific questions about it, whether with the Gemini 1.5 Pro model or the Gemini Flash model. One thing I do want to note: if you're pushing the token context window, the responses are going to be very slow. But what I found interesting is that if you have background processes where you're just passing in these long documents and latency isn't a huge factor, you can have all of these documents being summarized in the background, and you don't have to worry about setting up any retrieval-augmented generation.

The other thing that really stands out, and what a lot of people were excited about, is the pricing of the model. One of the flagship pricing metrics that came out of the announcement was 35 cents per million tokens of context that you pass in. While this pricing is extremely competitive, there's also a free tier where you can pass in that million tokens of context per minute. You can break up that million tokens per minute into 15 requests per minute and ultimately 1,500 requests per day. The trade-off is that your data is going to be used to improve their products, whether that means training their models or what have you; you can learn more here if you'd like. But it's really nice that it gives you that option if you're just looking to play around with this or explore it as a potential option without having to incur any costs.

In terms of other resources, if you just want to try this out, you can go over to aistudio.google.com, where you have a little playground to play around with the models. Within it you have the Gemini 1.5 Flash model with the million tokens of context as well as the Gemini Pro model, and you can also grab your API key by clicking the button there.

Now, in terms of the actual coding portion, I'm going to run through this relatively quickly, and I'll also throw it up on GitHub if you just want to pull it down and get started with it. The first thing we're going to do is import a couple of modules. You'll have to install these two packages; you can use bun or npm or pnpm or whatever you're using, so you can just `bun i` this and paste the subsequent string as well. Once you have that, we're going to be using the path module, since we're going to be reaching for a few different files within our directory. We're going to set up an API key: you can grab that API key from Google AI Studio and then put it within your .env as your Gemini API key, just `GEMINI_API_KEY=` right within that .env, and save it out. From there, we're going to establish that we're using the GoogleAIFileManager, which is how we upload these files, and then we're going to be using the generative AI package and passing in our API key.

All that the first function does is take the file name as well as the MIME type. For each file there's a MIME type; if you're not familiar, that's the type of the file, say if it's a PDF or an MP4 or what have you. It's a simple little function, and all it's doing is uploading that file to Google before we actually process the request for inference.

In the next step, we're essentially going to wait until all of those files have been successfully uploaded, since these files can be really big if you're uploading a whole movie or something like that. All this is really doing is checking whether the files from the previous step were uploaded. Within their documentation they have this while loop that's essentially looking to see whether the files have been processed: if they're processed, continue on to the next step; otherwise, check again at a particular interval.

From there, this is where we configure our generative model. The one thing I did want to point out is that system instructions work a little differently in the Gemini API than in something like the OpenAI API: you have to set up the system instructions right when you configure the generative model. And if you want to swap in the Gemini 1.5 Pro model, you can also swap out the model string here. From there, there are a number of optional configurations that you can pass in when you actually invoke the chat. We're just going to declare some of the optional values that we'll pass into the model; you don't have to specify these, but this gives you an idea of how you can do it.

From there, we set up a simple run function, which wraps our entire application. I have four different files here: an image of a simple spreadsheet, an image of two cameras, a simple audio file, and a simple video file of the Earth spinning. These are the files that we upload and then send in for inference. For each of these files, we upload the file; the arguments the upload-to-Gemini function takes are the file name as well as the MIME type, like we talked about. Then from there we wait for all the files to be complete.
Once that's done, we start our chat session. In this example this is a chat interface, and there are some particular rules you do have to follow to set it up. Within our chat session, all we have to do to pass all of these different modalities into our input is declare that the role is going to be user. You cannot pass in a system message as the first message, or anywhere in the history for that matter; you have to put the system message where you configure the generative model. It's not like OpenAI, where you pass in the system message as your first message. The other thing to note, just as an aside, is that you can't pass in a model message as the first message within the history array either. So long as you follow that, all you have to do to add a file type to your input is specify it within the parts array: you can pass in audio, a couple of images, and a video, and that's it. We're passing in the file data, the MIME type, and then the file URI.

From there you can go ahead and query it. In this case I'll ask it to describe the spreadsheet, which is image one; we have a very rudimentary spreadsheet here. I set it up so you can see the number of tokens you're using: we're using about 11,000 tokens of context. Within the response we see that the spreadsheet shows exam results from four students, and we can see that we do have four students, and it's breaking it down: we have Carol, John, Eden, and James and all of the different subjects. Pretty amazing.

If I ask another question, "what is this video about," and run this again, it's going to upload all of those different files. One thing I did want to point out is that if you want to swap over to the Pro model, all you have to do is change flash to pro within the model string, save the file, and then you can see how the Pro model responds. Here we see that the video is a time lapse of the Earth at night; it was really nailing these, so it is doing really well with these questions.

Now I try with the Pro model and say "describe in great detail all of these images, videos, and audio," and run that. One thing to note with the Pro model is that it's about 10 times the price of the Flash model, but there is also a free tier where you can pass in up to 32,000 tokens of context per minute, so you can use the Pro model for free; within the example we saw that we're using about 11,000 tokens of context. If I put in another message and say "describe in detail the differences between everything I passed in" and run that, it's hopefully going to give me a good depiction of the differences between all of these files: we have images, we have audio, and we also have that video. Here it's asking clarifying questions, which within a chat application can be really helpful.

I'm going to throw this up on GitHub so you can go ahead and play around with it. If you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.
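The chat-session shape described above (user role first, no system message in history, files declared via `fileData` entries in the parts array) might look like this. The `filePart` helper and the prompt strings are my own illustrative choices, not from the video:

```javascript
// Build the `fileData` part the Gemini chat API expects for an uploaded file
// (my own small helper for readability).
function filePart(file) {
  return { fileData: { mimeType: file.mimeType, fileUri: file.uri } };
}

// Start a chat whose first history turn is a `user` message carrying all the
// uploaded files; system text lives on the model config, not in this history.
async function chatAboutFiles(model, files) {
  const chat = model.startChat({
    // Optional tuning values; you don't have to specify these.
    generationConfig: { temperature: 1, maxOutputTokens: 8192 },
    history: [
      {
        role: "user",
        parts: [...filesMapToParts(files), { text: "Here are my files." }],
      },
    ],
  });
  const result = await chat.sendMessage("Describe the spreadsheet image in detail.");
  return result.response.text();
}

function filesMapToParts(files) {
  return files.map(filePart);
}

// Usage (assumes `model` and `files` come from the upload step):
//   const answer = await chatAboutFiles(model, files);
//   console.log(answer);
```

`filePart` only forwards the `uri` and `mimeType` returned by the upload step, which is what keeps the request small even when the underlying video is large.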