
In this video, I demonstrate how to set up and deploy a Llama 3.1, Phi, Mistral, or Gemma 2 model using Ollama on a GPU-enabled AWS EC2 instance. Starting from scratch, I guide you through the entire process on AWS, including instance setup, selecting the appropriate AMI, configuring the instance, and setting up the environment with CUDA drivers. We also cover installing Go, cloning a simple Go server, configuring API keys, and securing the server for persistent deployment. By the end, you'll have a functional, customizable setup to run your own AI models efficiently and economically. Steps include selecting the appropriate instance type, setting up SSH, installing dependencies, running Ollama, and securing the web service. Whether you're a developer looking to integrate AI or just getting started, this tutorial will help you achieve a smooth deployment.

Repo: https://github.com/developersdigest/aws-ec2-cuda-ollama
Ollama: https://ollama.com/

00:00 Introduction to Deploying Llama 3.1, Phi, Mistral, Gemma 2
00:52 Setting Up Your EC2 Instance
02:25 Configuring Your Instance and Storage
03:28 Connecting to Your Instance via SSH
04:08 Installing Dependencies and Cloning the Repository
05:05 Running the Model and Setting Up the Server
05:58 Configuring Security and Testing the Endpoint
07:33 Ensuring Server Persistence
08:53 Conclusion and Final Thoughts
---
type: transcript
date: 2024-08-03
youtube_id: SAhUc9ywIiw
---

# Transcript: Deploy ANY Open-Source LLM with Ollama on an AWS EC2 + GPU in 10 Min (Llama-3.1, Gemma-2 etc.)

In this video I'm going to show you how to deploy Llama 3.1, Phi, Mistral, or Gemma 2, all through Ollama, on a GPU-enabled EC2 instance on AWS. I'll show you completely from scratch how to set this up in AWS, and by the end of the video you'll have a nice, clean Go script, so whether you want to add API keys or build on top of it, you'll be able to do all of that. Just to show you quickly how it will work: through our Go script we'll have a really basic OpenAI-compatible server where we can pass in our base URL, the model, the messages, as well as the stream option. So by the end of the video you'll have a base URL, you'll be able to set up authentication with your API key, and you'll have a simple OpenAI-compatible schema for interacting with your API. That was just a quick demonstration of how it works; without further ado, let's get into it.

To get started, once you're in the AWS console, you can search for EC2 in the search box if you don't have it on your homepage. From there, click Launch Instance. In this case we can call it "ollama-gpu-server", or whatever you want, really. Then we're going to browse the AMIs and search for "deep learning". The reason we're using an AMI is that it makes it really easy to set up all the CUDA drivers and everything else you need to leverage the GPU attached to your EC2 instance. If we didn't do this, you could still set it all up, but there would be a handful more steps to install the various drivers and make sure everything is configured.
The nice thing with this is there's less room for error. You can just search for "deep learning base"; it should be the one at the top, but just to confirm, it's the Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04). Go ahead and select that and continue on. Next, for the instance type itself, if you search for "g4" it should pop right up; we're going to use a g4dn.xlarge. You can see it has 4 vCPUs and 16 GB of memory, and you can see the pricing there as well. Once we have that set up, select your key pair. If you don't have one, you can create a new one and save out a .pem file; put it somewhere safe on your computer. You can name it whatever you'd like; we'll use it in a later step. Then we're going to allow HTTPS as well as HTTP traffic for now, and configure the storage. In this case I'm just going to leave it at 65 GB. The one thing to be mindful of when selecting the amount of storage is how many models you're going to have, and how big they are: Llama 3.1 is 4.7 GB, for example, but the 70B version is 40 GB. If you just want a few of the smaller models, you'll be able to get by with something like 65 GB, and you can also increase this later if you run out of space and want to provision more storage. Once we have that, we're good to launch our instance; it will just take a moment. If you don't get a success screen, don't worry: you'll get a URL you can go to where you request access to more vCPUs. Just make sure you select the proper EC2 type (in this case the G series), specify the region you're going to be using, and put in a couple of lines on what you're using it for. Once you have that, you can go ahead and connect to your instance.
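If you prefer the AWS CLI, the console steps above roughly correspond to a single `run-instances` call. This is only a sketch: the AMI ID, key-pair name, and security-group name below are placeholders, and you'll need to look up the Deep Learning Base GPU AMI ID for your region.

```shell
# Hypothetical CLI equivalent of the console walkthrough above.
# ami-0123456789abcdef0, my-key-pair, and my-ollama-sg are placeholders.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g4dn.xlarge \
  --key-name my-key-pair \
  --security-groups my-ollama-sg \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":65}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama-gpu-server}]'
```

Either way you create the instance, the vCPU quota note above still applies: new accounts often need to request a G-series quota increase before this will launch.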
This is how we're going to SSH into the server. Go to the folder where you have that .pem key saved (in this case I just put my keys temporarily on the desktop), and run `sudo ssh` -- or, depending on how your computer is set up, you may be able to just `ssh` -- and log in. The first time you connect to the instance you'll have to accept the host key and say yes, just to make sure it's an authentic request. As soon as we see "ubuntu" and the IP on the bottom line, we're good to go. We'll then run an update and upgrade everything on the instance: even though it comes preconfigured, you should update it in case there are any security updates or small changes to how the libraries work. There are a handful of things that need to be restarted. Once that's done, we're going to install Go. One thing I'll mention: I'll put a README with all the steps from this video in the repository, so you can just walk through and set this up. Again, we're just restarting here. A couple of last things: we're going to install Ollama, and finally we'll `git clone` the simple Go server that we can configure with API keys and build on further. Once we have that, let's `ls`, and we see the repo from GitHub; we'll `cd` into it. You can `ls` in here to see what's inside: a simple main.go file as well as a go.mod. In this case, let's try it with a really small and quick model, the Gemma 2 2B. You can search Ollama for the model you want to run and paste the command into your terminal. The first time you run the command it will download the model; depending on how big the model is, it might take a little while to download and spin everything up.
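The setup steps above boil down to a few commands. This is a sketch; check the repository's README for the exact steps and versions used in the video.

```shell
# Bring the preconfigured AMI up to date.
sudo apt update && sudo apt upgrade -y

# Go from the Ubuntu repos is enough to run the example server;
# for a newer toolchain, install a tarball from go.dev instead.
sudo apt install -y golang-go

# Official Ollama install script from ollama.com.
curl -fsSL https://ollama.com/install.sh | sh

# Clone the example Go server and pull a small model to test with.
git clone https://github.com/developersdigest/aws-ec2-cuda-ollama
cd aws-ec2-cuda-ollama
ollama run gemma2:2b
```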
As soon as it's downloaded, you'll be able to interact with it directly in the terminal: if I just say "hello world", we have a response streaming back to us. Now that we know Ollama is running, we're going to `go run main.go` to fire up the server, and we see it's running on port 8080. What we need to do next is go back to the instance page where we connected: go down to where it says Security, click the security group, and edit the inbound rules. We'll add a new rule, leave it as Custom TCP, and set the port to 8080. You can restrict this to just your IP if you'd like; otherwise you can expose it -- but mind you, if it's exposed like this, it is accessible on the internet. Save that out, open a new terminal, go back to your instances, click the instance details, and copy the public IP address; that's what we'll use as the base URL to query our endpoint. Walking through the request: we're using the chat completions endpoint with a Bearer token of "demo". This demo token is just hardcoded in the Go file right now; you could swap it out for something unique. Obviously there are a few more steps to make this really secure -- I'm just showing a basic implementation here -- but someone would at least have had to guess this token to interact with our server. From there, you can pass in the model string and your messages, just like you would with the OpenAI API, and specify whether you want streaming true or false. Then we can test out the endpoint.
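A request like the one described above might look as follows. This is a sketch: the IP address is a placeholder for your instance's public IP, the `/v1/chat/completions` path is assumed from the server's OpenAI compatibility, and "demo" is the hardcoded token from the video.

```shell
# Replace 203.0.113.10 with your instance's public IP address.
curl http://203.0.113.10:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer demo" \
  -d '{
    "model": "gemma2:2b",
    "messages": [{"role": "user", "content": "Hello world"}],
    "stream": true
  }'
```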
And there you see we now have a streaming response back; that's pretty much it for the basic setup. There are a couple of other things I wanted to show you. If I hop back to the server and close out that Go process: say we want some persistence, so that if the server stops for whatever reason, or errors out and exits, it comes back up. We can `sudo vim` into an ollama-api.service file. In here we're going to put in a few things. If you're not familiar with Vim, press `i` to insert, then paste in the script (which I'll put in the GitHub repository). Once we have that, press Escape and type `:wq!` -- that's how you save things out in Vim. Just to run through the unit: we're pointing at that main.go file, so when the service starts it runs the `go run` command on it, and we also set the working directory, the user, what happens on restart, etc. We write that out, and then with `sudo systemctl` we can enable the service and start it as well. We see that it's been created and the service starts. Now if I go back to where I was interacting with the API on my local machine and run it again, it's being served by the service, and I didn't have to go into the folder and `go run main` like you saw me do previously. That's pretty much it for this video. You can play around with the main.go file -- if you want to swap out the API key, you can do it in there -- and there are a number of different things you can do from here. I did this in Go, but you could set it up with whatever you like: something like Node.js with an Express server, or a Bun server. You could even leverage something like LangChain in an implementation like this.
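The actual unit file ships in the repository; the sketch below just illustrates the shape of such a systemd service. The file path, user, and working directory are assumptions based on a default Ubuntu instance.

```ini
# /etc/systemd/system/ollama-api.service -- illustrative sketch only;
# use the unit file from the repository. Paths and user are assumptions.
[Unit]
Description=Go API server for Ollama
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/aws-ec2-cuda-ollama
ExecStart=/usr/bin/go run main.go
Restart=always

[Install]
WantedBy=multi-user.target
```

After writing the unit, `sudo systemctl daemon-reload`, then `sudo systemctl enable ollama-api` and `sudo systemctl start ollama-api` (or `enable --now` in one step) register it to start on boot and start it immediately.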
All of a sudden, with this, you have your own GPU, so you aren't metered on every single token or incurring a cost with all of these different hosted providers; you can just run it yourself and have that autonomy. The other nice thing with Ollama is that you can manage it as if it were on your local machine, and usually, within minutes or hours of a major open-source release, the new models are available in Ollama. So you can imagine that once you have this all set up, you can just run the newest model and have it on your own hosted endpoint. That's pretty much it for this video. If you found it useful, please like, comment, share, and subscribe. Otherwise, until the next one!