
Optimize Your LLM Application with Upstash Semantic Cache

In this video, I'll show you how to set up a semantic cache to improve the performance of your LLM application, reducing response times from seconds to milliseconds. I'll explain the benefits of semantic caching, like lowering inference and API costs, and achieving faster, more deterministic results. I'll be using Upstash's new AI offerings to implement this caching strategy. From creating a vector database and setting up environment variables to coding in VS Code and integrating with an answer engine, this step-by-step guide will walk you through the entire process. By the end, you'll have an advanced understanding of how to leverage semantic caching to make your applications more efficient and cost-effective.

Links:
https://upstash.com/
https://github.com/upstash/semantic-cache
https://github.com/developersdigest/llm-answer-engine/

00:00 Introduction to Semantic Caching
00:09 Understanding the Benefits and Costs of LLM Applications
00:48 Setting Up with Upstash
01:08 Creating a Vector Database in Upstash
01:57 Project Setup in VS Code
02:40 Implementing Semantic Cache in Your Application
03:12 Exploring Semantic Similarity and Cache Mechanics
04:14 Practical Example: Setting Up Semantic Cache
05:29 Integrating Semantic Cache with the Answer Engine
08:17 Frontend Integration and Cache Management
12:47 Conclusion and Thanks
---
type: transcript
date: 2024-05-25
youtube_id: iF-npWXuKCQ
---

# Transcript: Make Your LLM App Lightning Fast

In this video I'm going to show you the easiest way to set up a semantic cache within your application. Instead of responding in seconds, it's going to be responding in milliseconds. To break down some of the benefits of using a semantic cache in this particular application: one of the most expensive pieces of an LLM application is the cost of inference. Whether you're using GPT-4o, Gemini Pro, or Anthropic, even the cheaper models can get incredibly expensive at scale. In the case of something like an answer engine, or something like Perplexity, say you had a query that a lot of people are going to be asking every single day. The way it's set up in this application, it's not just caching the LLM response; it's also caching the results I get back from the search engine APIs, such as the sources, the videos, the images, as well as the follow-up questions. I thought this was a perfect example of how you can use something like a semantic cache.

One thing I wanted to point out about Upstash, since I'm going to be working with them on some content: Upstash has a ton coming down the pike in terms of new AI offerings, and they're going to be simplifying a lot of different pieces. This is just one really great implementation of one of the packages they offer. All you have to do to set this up is go over to Upstash and create an index; it's really straightforward. I'll just name it "example-vector" and select US East. One of the unique pieces is that you actually select your embedding model directly within the interface; it will select the dimensions for you, and then you can choose the metric used to measure the distance between those vector relationships. From there you can select your plan, which gives you 10,000 updates and 10,000 queries per day, so we'll go ahead and select that.

Once that's selected, all you'll need for the example is to head over to the .env, because we'll be using both of those values within our application. I want to show you how to set this up with the answer engine like I just showed you, but I also want to show you a basic example of how to set it up. Now that our vector database is all set up, we can go over to VS Code, and I'm just going to create a new project with `bun init -y`. The first thing I'll do is create the .env and paste in our environment variables, and once that's done you can close it out. Now that our environment variables are all set up, all we have to do is install a couple of packages: you can grab the install scripts from the readme and `npm install` them (you can also use bun if you'd like), and then we'll expand this code a little bit. Now we have everything we need to get started.

All right, now that we have our code set up, I'm going to run through exactly what's happening here. Once you have it set up, it just works; you don't need to manually paste in any more strings or anything like that. You can test it right off the bat: before we run through exactly what the code is doing, we'll just run the script. The first thing we do is declare the index pointing to our vector database; then, within our semantic cache, we pass in that index and declare the minimum proximity required to return a cache hit.
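The project setup steps described above can be sketched as a few shell commands. The two package names are the real Upstash packages; the `.env` values are placeholders you'd copy from your own Upstash console, not real credentials:

```shell
# Create a new Bun project (npm init also works)
bun init -y

# Install the semantic cache package and the vector client
bun add @upstash/semantic-cache @upstash/vector

# .env — copy both values from your index's page in the Upstash console
# (the URL and token below are placeholders)
cat >> .env <<'EOF'
UPSTASH_VECTOR_REST_URL="https://<your-index>.upstash.io"
UPSTASH_VECTOR_REST_TOKEN="<your-token>"
EOF
```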
Essentially, how this works: you can think of it as the relatedness between two items. If you think of zoo animals, a tiger and a lion are going to have a much closer semantic similarity than if you said something like a Toyota or a Tesla or a type of car; cars would be grouped together within their own semantic similarity. Essentially, what embeddings do is take that similarity and group items — queries, in this case — that are close to one another. If we ask "what's the controversy with that Sky voice that OpenAI released?", it's going to plot that query somewhere — you can almost think of it as a three-dimensional box — and then for any subsequent queries, when we do that lookup within our vector database, it's going to see if there are any that are above this threshold. You can play around with the proximity: if you want it to be a little more strict you can dial it up, or you can turn it down if you'd like.

The example here is a simplified version of what I showed you at the outset of the video. The way this works for the semantic cache is that we embed this line here: it's sent to that embeddings model, and once it's returned it's stored within our vector database. Once we have that item stored, we can look up that key and it will return the result we set here. The other benefit is that it can make the output a lot more deterministic. There are also some drawbacks: if you're caching things that aren't a good response, you'll need a mechanism to actually clear that response — presumably you'd want to set something like that up. But the benefits of something more deterministic are that it's more predictable, it gives you the ability to save on inference cost as well as any other API costs, and it gives you that improved speed as well, which is obviously really nice.

Next we have this synthetic delay, which is there to allow time for the embedding to be created and stored within our vector database. Once that's all set up, we ask "what is Turkey's capital?", and since it's similar to the line we cached, it returns the result "Ankara". Then we just have a simple delay helper function, and that's essentially how it works.

If I break down what we're going to set up in a moment within the answer engine project: the first argument is the input the user puts in, and for the second argument we wait for all of those responses to come back — all of the sources, the images, the videos, the LLM response, the follow-up questions, as well as whether any function calls were invoked. At the very bottom, after the follow-up questions, we just JSON.stringify all of that payload. It's actually not that big in terms of what we're saving in the database; it's a relatively light load, and it comes with the trade-off of being significantly cheaper as well as considerably faster.

Next, I'm going to dive into setting this up within the answer engine itself. Everything I'm about to show you, you can pull down from the llm-answer-engine repo, which I'll link in the description of the video. If you're not familiar with the project, there are also some other videos you can check out if you're interested.
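To make "minimum proximity" concrete, here's a toy sketch of the idea. This is not the Upstash internals — real embeddings have hundreds or thousands of dimensions, and the three-dimensional vectors below are made up purely for illustration — but it shows how a similarity score between embedding vectors gets compared against a threshold to decide hit or miss:

```typescript
// Cosine similarity: the "relatedness" score between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical 3-d "embeddings", invented for this example: the two zoo
// animals land close together in the box, the car lands far away.
const tiger = [0.9, 0.8, 0.1];
const lion  = [0.85, 0.82, 0.15];
const tesla = [0.1, 0.2, 0.95];

const minProximity = 0.95; // dial up for stricter matching, down for looser

console.log(cosineSimilarity(tiger, lion) >= minProximity);  // true  — cache hit
console.log(cosineSimilarity(tiger, tesla) >= minProximity); // false — cache miss
```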
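The basic example walked through above looks roughly like this — a sketch following the shape of the `@upstash/semantic-cache` README. It reads the two environment variables from your Upstash console and makes network calls, so treat it as illustrative rather than something you can run without credentials:

```typescript
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

// Index reads UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN
// from the environment (the values pasted into .env earlier).
const index = new Index();

// minProximity is the threshold a lookup must clear to count as a cache hit.
const semanticCache = new SemanticCache({ index, minProximity: 0.95 });

// Simple delay helper, used to give the embedding time to be stored.
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function main() {
  // This string is embedded, stored in the vector database, and mapped
  // to the value "Ankara".
  await semanticCache.set("Capital of Turkey", "Ankara");
  await delay(1000); // synthetic delay while the embedding is created

  // A semantically similar (not identical) query should return the
  // cached value.
  const result = await semanticCache.get("What is Turkey's capital?");
  console.log(result); // expected: "Ankara"
}

main();
```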
Now, within the repo: to set this up within the answer engine, you'll need both the UPSTASH_VECTOR_REST_URL as well as the UPSTASH_VECTOR_REST_TOKEN, like you saw in the previous example, and you grab them from the Upstash console like you just saw. The next thing we do, within the config.tsx, is add another key, useSemanticCache, that you can turn on or off depending on whether you want to use the cache. If you don't want to use this within the llm-answer-engine, you can just set it to false and it will default to not using it.

First, within our action.tsx, we add the new imports for both the semantic cache and the Index from @upstash/vector. From there we look at our configuration object, like you just saw, and check whether we're using it; if we are, we plug it in here. Just like in the previous example, you can change the minimum proximity, and this is where you pass in your index and everything else that references the semantic cache on the back end.

Within our action.tsx, we have a condition that wraps these pieces, just to make it easy and flexible: you can turn it off and on if you want to test some things without it, which gives you some flexibility without having to go into the code and actually replace things. The first thing our action does is check whether the rate limit has been met; that's the very first condition as soon as our server action is invoked. Immediately after that, assuming the rate limit isn't met, we check the user message to see if there's a semantic cache hit, and if there is one, we stream that cached response back to our client. Once we have that set up, we run through the myAction function, where we set up a few different things, and then we declare this new clear-semantic-cache button, which you'll be able to click on the front end as a way to invalidate the cache and delete that vector storage. The reason I wanted to include that: let's say you have a query that returns a bad message from the LLM for whatever reason. This ensures that, just in case we get a bad response, anyone can click a simple button to invalidate that response, and the subsequent message will skip the cache and generate fresh from the LLM, the sources, and all of that.

To circle back on what our action is doing: first we do a rate limit check, if you're using rate limiting, and immediately after that we check whether we're using the semantic cache. We get the user message and see whether there's a semantic cache hit. If there is, the cached data contains the whole JSON payload that renders the entire view — that's how it generates all the different sources, the response, the follow-up questions, the images, and the videos. The nice thing with this is that it's not just the LLM response being cached: there are a couple of different LLM calls within the response of the message and the follow-up questions, plus the calls to the search engine APIs for the images, the videos, and the search queries themselves. So that's one of the nice things about how this is set up: you're able to stream everything, all of the results, back. We can skip through most of the streamable portions; it's pretty straightforward, we're just sending all the different results to the front end as they come back to us.

If we go down to the bottom, we create an object called dataToCache, and within this object we have all of the different values we're going to store within our semantic cache. Then this is the line that actually invokes the method to set our semantic cache entry: we pass the user message, and then this payload that we stringify and store.
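The action-level flow described above — rate limit first, then cache lookup, then the full pipeline with a cache fill at the end — can be sketched like this. The config flag and function names mirror the walkthrough but are assumptions, not the repo's exact code, and it needs Upstash credentials to actually run:

```typescript
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

// Hypothetical stand-in for the useSemanticCache key in config.tsx.
const config = { useSemanticCache: true };

// Only instantiate the cache when the config flag is on, so the rest of
// the action can be toggled without code changes.
const semanticCache = config.useSemanticCache
  ? new SemanticCache({ index: new Index(), minProximity: 0.95 })
  : undefined;

async function myAction(userMessage: string): Promise<string> {
  // 1. Rate limit check would run first (omitted in this sketch).

  // 2. On a semantic cache hit, stream the cached payload straight back:
  //    milliseconds instead of seconds, and no LLM or search API cost.
  if (semanticCache) {
    const cached = await semanticCache.get(userMessage);
    if (cached) return cached;
  }

  // 3. Otherwise run the full pipeline (search APIs + LLM calls), then
  //    cache the stringified payload under the user's message.
  const payload = JSON.stringify({
    llmResponse: "...",
    followUpQuestions: [],
    sources: [],
  });
  if (semanticCache) await semanticCache.set(userMessage, payload);
  return payload;
}
```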
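The dataToCache object and the stringify step described above amount to a simple round trip: serialize everything once when the response completes, parse it straight back into state on a cache hit. Here's a minimal sketch — the field names are illustrative, based on what the walkthrough mentions, not the repo's exact interfaces:

```typescript
// Illustrative shape of everything cached per user query.
interface CachedAnswer {
  llmResponse: string;
  followUpQuestions: string[];
  sources: { title: string; url: string }[];
  images: string[];
  videos: string[];
}

// On a completed response, the whole view's data is stringified once...
function serializeForCache(data: CachedAnswer): string {
  return JSON.stringify(data);
}

// ...and on a cache hit, the frontend parses it straight back into state.
function parseCachedAnswer(raw: string): CachedAnswer {
  return JSON.parse(raw) as CachedAnswer;
}

// Example payload (contents invented for illustration).
const dataToCache: CachedAnswer = {
  llmResponse: "OpenAI paused the Sky voice after the controversy...",
  followUpQuestions: ["What did OpenAI say in response?"],
  sources: [{ title: "Example source", url: "https://example.com" }],
  images: [],
  videos: [],
};

// Round trip: what's stored under the user message is exactly what
// renders the view.
const restored = parseCachedAnswer(serializeForCache(dataToCache));
console.log(restored.llmResponse === dataToCache.llmResponse); // true
```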
The last couple of things we set up in our action.tsx relate to the button that clears our server cache. We have a very simple server action where we pass in the string of the user message, and if that button is clicked and this method is invoked, we delete that message from our semantic cache — so we don't keep those bad results, or maybe there's just a result you don't like or want to hear a different way; whatever it might be, this is the method we use to actually invalidate that cache. Since we're using the Vercel AI SDK, within your actions here you can add that method we just declared above.

Now once you have that, the lion's share of the work is done. All we really need to do to set this up on the front end is parse the payload we get back from our semantic cache from Upstash. If it matches that cached-data type, we parse the JSON payload, and as soon as all of those different parts are parsed, it puts them in state and renders them on the screen for us. There's also support for those conditionally rendered UI components, if you happened to catch the video where I covered function calling; this is essentially that part as well. There are a couple of other minor pieces to set up on the front end: basically going through and setting up all of the different interfaces to make sure we have the relevant keys and that everything satisfies what we need for TypeScript.

The last thing I wanted to point out is within our LLM response component, where we have a new prop we're passing in: semanticCacheKey. All we're really adding here is the ability to clear that semantic cache. We add just a couple of things within our LLM response component: the button to actually clear the cache, which goes at the bottom of the response container, and a really simple modal that pops up once you've clicked it, with a basic method showing that the user message is being cleared and invalidated. In terms of the button itself, you can name it whatever you want; I just named it "Clear response from cache". It fires the handleClearCache method, and that's where we pass in our semanticCacheKey. That's it to get set up and running.

So in this video you saw a really basic example of how to get started and hopefully get comfortable with using it, and by the end you had a bit more of an advanced use case on how you can leverage this package. I just wanted to show you another tool in the toolkit for making your LLM app more performant, cheaper, more reliable, and more deterministic. I wanted to thank Upstash for allowing me to collaborate on this content and showcase this great work and how you can leverage it within your application. That's it for this video. If you found it useful, please like, comment, share, and subscribe. Otherwise, until the next one.
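The cache-invalidation path described above might look roughly like this. `semanticCache.delete` is the package's deletion method, but the server action name, the handler name, and the prop wiring are assumptions mirroring the walkthrough, not the repo's exact code:

```typescript
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

const semanticCache = new SemanticCache({
  index: new Index(),
  minProximity: 0.95,
});

// Server action (name assumed): invalidate the cached entry for one user
// message, so the next identical query regenerates fresh from the LLM.
async function clearSemanticCache(userMessage: string): Promise<void> {
  await semanticCache.delete(userMessage);
}

// Frontend handler (name assumed): wired to the "Clear response from cache"
// button, receiving the semanticCacheKey prop passed down to the LLM
// response component.
async function handleClearCache(semanticCacheKey: string): Promise<void> {
  await clearSemanticCache(semanticCacheKey);
  // A simple modal could confirm here that the entry was cleared.
}
```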