
In this video, we delve into Anthropic's latest release, prompt caching with Claude, which comes right after Google's context caching announcement for Gemini 1.5. We explore the benefits of this technology, particularly for developers using Claude 3.5 Sonnet, and its potential to save both time and cost across various use cases. Discover how prompt caching can enhance applications like conversational agents, coding assistants, and long document processing by significantly reducing latency and cost.

00:00 Introduction to Prompt Caching with Claude
00:46 Overview of Prompt Caching Benefits
01:44 Use Cases for Prompt Caching
04:34 Technical Details and Performance Metrics
05:53 Pricing and Availability
07:25 Implementation and Best Practices
10:04 Conclusion and Next Steps
---
type: transcript
date: 2024-08-18
youtube_id: _n7d_KiHyfk
---

# Transcript: Prompt Caching: Anthropic Claude 3.5 Sonnet's Game-Changing Update!

Anthropic just released prompt caching with Claude. This is on the heels of Google's announcement of context caching just a few weeks ago, which is available in Gemini 1.5 Flash as well as Gemini 1.5 Pro, and it's also something we've seen from DeepSeek, a company with a ton of really great open-source models that are particularly geared toward coding. The first thing to note is that this is really interesting for Claude 3.5 Sonnet, because it's arguably one of the best models out there, especially for developers; a lot of developers swear by Claude 3.5 Sonnet as simply the best model available, at least at the time of recording this video. So without further ado, I'm just going to go over the blog post a little. Prompt caching, which enables developers to cache frequently used context between API calls, is now available in the Anthropic API. One thing to note: the post doesn't mention whether this is available on AWS, where they also have Claude 3.5 Sonnet, or whether it's going to be available on GCP. Right off the bat it looks like it's going to be available through the Anthropic API; hopefully it will come to other vendors like AWS in time. With prompt caching, customers can provide Claude with more background knowledge and example outputs while reducing cost by up to 90% and latency by up to 85% for long prompts. Prompt caching is available today in public beta for Claude 3.5 Sonnet as well as Claude 3 Haiku, with Claude 3 Opus support coming soon. They mention a number of different use cases where prompt caching can be useful. One example is conversational agents: say you have a chat history that's particularly long. The benefit of prompt caching is that as soon as you pass in that key-value pair, essentially passing in a value of ephemeral, to add it to that context
cache, all of a sudden, for those subsequent prompts, you're going to be reading from that cache, so long as the content is still in it. Since it is an ephemeral cache, this isn't something that's going to persist for a long period of time. Say you're within ChatGPT, Perplexity, the Claude interface, or whatever it might be: those different interfaces all do it a little differently, but they essentially pass a portion of the chat history into the LLM with each subsequent turn. If you're able to have those historic messages, or even just the most recent messages, stored in this context cache, it's going to save you a ton of money, because you won't have to re-send all of those tokens that have already been processed by the LLM. That's one example, with a conversational agent, where it could be useful. Coding assistants are obviously a huge one, because there are going to be a ton of use cases where you pass in a lot of context. Say you pass an entire repo into the context window; we're getting to the point where you're going to be able to do those types of things. The limit on Claude right now is 200,000 tokens of input, which is a pretty decent amount of code, and I'd imagine in a year or two we're probably going to be in the millions in terms of the amount of context you can pass in. We already have Gemini, where you can pass in up to 2 million tokens through Google's API right now, and I'd imagine that whether it's OpenAI or Anthropic, we're going to be able to pass in more and more context. Long document processing is obviously going to be a huge one: say you pass in something like an SEC filing and you want to ask questions about it; that could be a good use case. Detailed instruction sets are another: say you have a chatbot or something like that on a website, and you have something particularly long, whether it's a system prompt in combination with the context of whatever
you're feeding to it. If you have that cached and you don't need to continually send it in and be billed at the full rate, you're obviously going to save both the time and the cost associated with it. A few others here: agentic search, talking to books, papers, documentation, podcast transcripts, etc. There are just a ton of really useful use cases, so this is really going to shine in applications where you're passing in a lot of context, whether it's chat history or some of these other examples I mentioned. Here's the table where they break it all down. Chatting with a book of 100,000 tokens cached with a prompt: the latency without any caching, the time to first token, would be 11.5 seconds; it has to process all of the information you're sending it. The time to first token is essentially the first token, that first word, you get back from the LLM. That's quite a long time for a long prompt; obviously 100,000 tokens is a ton, but it's still a significant amount of time. With context caching, that drops to 2.4 seconds, which is almost 80% faster, and the cost reduction is 90%, so this is pretty much a no-brainer for leveraging the prompt caching feature. They also have examples of many-shot prompting as well as multi-turn conversation. Many-shot prompting with 10,000 tokens would be 1.6 seconds for the time to first token, with an 86% cost reduction for the subsequent calls as well as a 31% increase in speed. For a 10-turn conversation with a long system prompt, the initial time to first token could be 10 seconds; with caching, 2.5 seconds, an improvement of 75%, as well as a cost reduction of 53%. Cached prompts are priced based on the number of tokens you cache and how frequently you use that content. Writing to the cache costs 25% more than the base input token price for any given model, while using the cached content is
significantly cheaper, costing only 10% of the base input token price. To give you an idea: Claude 3.5 Sonnet is $3 per million input tokens, and the prompt caching write is slightly more, 25% more at $3.75 per million tokens, for that initial caching of the tokens. Then once you're actually querying against it, reads are 30 cents per million tokens. The one thing to note is that output is still $15 per million tokens. Now, for Claude 3 Opus, I don't imagine a lot of people are necessarily going to be using this right now, at least not until maybe Claude 3.5 Opus comes out, because it is significantly more expensive. That being said, prompt caching is available on Haiku, and it is very cheap to read the cache: 3 cents per million tokens of cache read, with output at only $1.25 per million tokens. Claude 3 Haiku is a very competent model, and it's also a very quick model, so it's a GPT-4o mini type of competitor. Maybe not quite as capable as GPT-4o mini, but mind you, with this new prompt caching feature available, it might make sense. Getting started with the API is really straightforward: all you need to do is pass the cache_control key into the message you want cached. Say you have a really long block of text or code or whatever it is; you specify that it's of type ephemeral, and that's what ultimately gets cached. You can decide what, and how much, you want to cache simply by passing this key-value pair into your different messages. The interesting thing is that the cache has a five-minute lifetime, and it's refreshed each time the cached content is used, so you have to keep the cache warm by using it and interacting with it. They probably did this to go a different route than Google, which charges per hour for the number of tokens stored in their system. With that being said, maybe developers will come up
with a creative solution to keep this cache "warm," like a cron job asking the LLM to respond back with a single word every four minutes or so. Just a few other things: the minimum cacheable prompt length is 1,024 tokens for Claude 3.5 Sonnet as well as Claude 3 Opus (mind you, Opus support isn't quite available yet), and 2,048 tokens for Claude 3 Haiku. They also mention some best practices: to optimize prompt caching performance, cache stable, reusable content like system instructions, background information, large context, or frequent tool definitions. You're also able to monitor performance through the API; there are a couple of keys you can access, the number of tokens written to the cache when creating a new entry, as well as the number of tokens read from it. There are some examples in the docs as well: a large-context example, and a tool-definition caching example. Say you have a ton of different tool-calling definitions: you can use this with function calling too, so if you have a number of different tools, you can cache those tools and not have to continually pass them in. Being able to use it on tools is a really interesting thing to see as well. And in terms of a multi-turn conversation, just to give you an idea of how it could work: "Hello, can you tell me about the solar system?" You can set cache_control there, get the response back without cache controls, and then each subsequent user message just sets cache_control on the latest message. That gives you an idea of how you could use it in a multi-turn conversation, where you're continually caching all of the different things you're telling it. Otherwise, that's pretty much it for this video. If you're interested, you can click through their documentation, which is really great; I'd encourage you to check it out and try it. If you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one!
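To make the cache_control pattern described in the video concrete, here's a minimal sketch of what a Messages API payload could look like. The model string, the placeholder context, and the helper function are illustrative; check Anthropic's documentation for current model names.

```python
# Sketch of a prompt-caching request payload. Everything up to and including
# the block that carries "cache_control" is written to (or read from) the cache.

LONG_CONTEXT = "Full text of an SEC filing, a repo, or a long system prompt... " * 100

def build_cached_request(user_message: str, cached_context: str) -> dict:
    """Build a Messages API payload that marks the long context as cacheable."""
    return {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "system": [
            # Short, stable instructions come first.
            {"type": "text", "text": "You are a helpful assistant."},
            # The long context is flagged as ephemeral, the key-value pair
            # the video describes.
            {
                "type": "text",
                "text": cached_context,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_cached_request("Summarize the filing.", LONG_CONTEXT)
```

With the official Python SDK you'd pass these fields to `client.messages.create()`; at the time of recording, the public beta also required the `anthropic-beta: prompt-caching-2024-07-31` request header, though that may no longer be needed.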
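The Sonnet prices quoted above ($3.00 base input, $3.75 cache write, $0.30 cache read, all per million tokens) make the savings easy to estimate with a quick back-of-envelope model. The token counts and call counts here are made up for illustration, and this ignores output tokens, which are billed at the normal rate either way:

```python
# Back-of-envelope input-cost model using the Sonnet prices from the video.
BASE_INPUT = 3.00    # $ per million input tokens
CACHE_WRITE = 3.75   # $ per million tokens written to the cache (base + 25%)
CACHE_READ = 0.30    # $ per million tokens read from the cache (10% of base)

def cost_without_cache(context_mtok: float, calls: int) -> float:
    # Every call re-sends the full context at the base rate.
    return context_mtok * BASE_INPUT * calls

def cost_with_cache(context_mtok: float, calls: int) -> float:
    # First call writes the cache; the rest read it, assuming the
    # five-minute window never lapses between calls.
    return context_mtok * (CACHE_WRITE + CACHE_READ * (calls - 1))

# e.g. a 100k-token (0.1 MTok) document queried 20 times in a row:
uncached = cost_without_cache(0.1, 20)  # 0.1 * 3.00 * 20  = $6.00
cached = cost_with_cache(0.1, 20)       # 0.1 * (3.75 + 0.30 * 19) = $0.945
```

In this hypothetical, caching cuts input cost by roughly 84%, broadly in line with the reductions quoted in the announcement.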
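The two monitoring keys mentioned above are `cache_creation_input_tokens` and `cache_read_input_tokens`, which appear alongside `input_tokens` in the response's `usage` object. A small helper like this (the sample numbers are invented) could track how much of your input is actually being served from the cache:

```python
# Compute what fraction of input tokens came from the cache, using the
# usage fields returned by the API.

def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens that were served from the cache."""
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = written + read + fresh
    return read / total if total else 0.0

# First call writes the cache; a follow-up within five minutes reads it.
first_call = {"input_tokens": 50, "cache_creation_input_tokens": 100_000,
              "cache_read_input_tokens": 0}
followup = {"input_tokens": 50, "cache_creation_input_tokens": 0,
            "cache_read_input_tokens": 100_000}
```

A hit rate near 1.0 on follow-up calls means the cache is being kept warm; a rate near 0.0 suggests the five-minute lifetime is lapsing between requests.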
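The multi-turn pattern at the end can be sketched as moving the cache_control marker to the newest user message on each turn, so the whole conversation prefix up to that point is cached for the next call. The helper below is a hypothetical illustration of that bookkeeping, not an official SDK utility:

```python
# Move the ephemeral cache marker to the newest user turn on each call.

def with_cache_marker(history: list[dict], new_user_text: str) -> list[dict]:
    """Append a user turn and mark it (and the prefix before it) cacheable."""
    # Rebuild earlier turns without any marker; only the newest needs it.
    cleaned = [
        {"role": m["role"],
         "content": [{"type": "text", "text": b["text"]} for b in m["content"]]}
        for m in history
    ]
    cleaned.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": new_user_text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    return cleaned

turn1 = with_cache_marker([], "Hello, can you tell me about the solar system?")
turn1.append({"role": "assistant",
              "content": [{"type": "text", "text": "Sure! The solar system..."}]})
turn2 = with_cache_marker(turn1, "Tell me more about Mars.")
```

Each `turnN` list would be sent as the `messages` field of the next request, so every turn re-reads the cached prefix instead of re-billing it at the full input rate.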