
TL;DR
How KV caching speeds up LLM inference - the math, the code, the memory tradeoffs, and when it stops helping. Every dev running local models hits this wall.
You spin up a 7B model on a decent GPU. The first generation feels fast. Then you push the context to a few thousand tokens and the throughput collapses. You profile and the answer is the same one a thousand devs have arrived at independently: most of your forward passes are recomputing attention over tokens you already saw.
This is the KV-cache wall. It is the canonical bottleneck of transformer inference, the first real performance lesson when you stop calling hosted APIs and start running models yourself, and the topic of one of the clearest explainers Hugging Face has shipped this year - not-lain's KV caching post.
This piece is the developer's version of that explainer. Less math, more code, more focus on the engineering decisions you actually make when you ship.
A transformer generates one token at a time. Each new token attends to every previous token in the sequence. If your context is one thousand tokens and you generate the one-thousand-and-first, the model needs the keys and values for all one thousand previous tokens to compute attention.
The naive implementation recomputes those keys and values on every generation step. You feed the whole sequence through the model again, throw away most of the output, and keep only the new token. This is O(n^2) work to generate n tokens.
The KV cache is the obvious fix. After the first forward pass, you have computed the keys and values for the prompt. Cache them in memory. On the next step, you only run the new token through the model, attend it against the cached keys and values, and produce one new pair of cached entries. The whole generation becomes O(n) instead of O(n^2).
The speedup is enormous. For a sequence length of 2,048 with a 7B model, KV caching can take you from "this is unusable" to "this is real-time" on the same GPU.
If you are using Hugging Face Transformers, KV caching is on by default in generate(). The interesting work happens when you build your own inference loop and need to reason about it directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = "The KV cache stores"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

past_key_values = None
generated = input_ids
max_new = 50

with torch.no_grad():
    for _ in range(max_new):
        # First step: feed the whole prompt. Every later step: feed only the
        # newest token, because everything before it is already in the cache.
        if past_key_values is None:
            inputs = generated
        else:
            inputs = generated[:, -1:]
        out = model(
            input_ids=inputs,
            past_key_values=past_key_values,
            use_cache=True,
        )
        # The returned cache now also contains the keys/values for the
        # token(s) we just processed.
        past_key_values = out.past_key_values
        # Greedy decoding: pick the most likely next token.
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)

print(tokenizer.decode(generated[0]))
The shape of the cache is what matters. past_key_values holds one entry per transformer layer (a plain tuple in the legacy format, a Cache object in recent Transformers releases, but the layout is the same). Each entry is two tensors - the keys and the values - of shape [batch, num_kv_heads, seq_len, head_dim]. The seq_len dimension grows by one with every generation step.
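You can see the layout directly by poking at the cache left over from the loop above. A minimal sketch - the to_legacy_cache() call is an assumption about recent Transformers versions (older ones return the per-layer tuple directly), hence the hasattr guard:

layers = (
    past_key_values.to_legacy_cache()
    if hasattr(past_key_values, "to_legacy_cache")
    else past_key_values
)
keys, values = layers[0]   # first transformer layer
print(len(layers))         # number of layers
print(keys.shape)          # [batch, num_kv_heads, seq_len, head_dim]
print(values.shape)        # same shape as the keys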
That growing dimension is the catch. The cache scales linearly with sequence length, and the constant is not small. For a 7B-class model with 32 layers, 32 heads, head_dim 128, in float16, the math is:
2 (keys and values) * 32 layers * 32 heads * 128 head_dim * 2 bytes = 524,288 bytes ≈ 512 KB per token
At 2,048 tokens of context, that is one gigabyte of KV cache per request. At 32K context, sixteen gigabytes. The KV cache, not the model weights, is what fills your VRAM in long-context inference.
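That arithmetic is worth keeping as a function. A minimal sketch - kv_bytes_per_token is our own helper, not a library call - with the 7B-class constants above as defaults; swap in your model's num_hidden_layers, num_key_value_heads, head_dim, and cache dtype to get your own numbers:

def kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Two tensors per layer (keys and values), one head_dim-sized vector
    # per kv head per token, dtype_bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()
print(per_token / 1024)              # 512.0 KB per token
print(per_token * 2048 / 1024**3)    # 1.0 GB at 2,048 tokens of context
print(per_token * 32_768 / 1024**3)  # 16.0 GB at 32K tokens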
Once you internalize that the KV cache is most of your memory budget, a lot of optimization decisions snap into focus.
Quantization of the cache itself. You can store the cache in int8 or int4 instead of bfloat16. The accuracy hit is usually small for chat workloads. Hugging Face Transformers supports int8 cache out of the box - flip a config flag, halve your memory.
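In practice that looks roughly like the sketch below, which assumes the quantized-cache path in recent Transformers versions (cache_implementation="quantized" plus a cache_config dict, with the quanto backend installed); supported backends and bit widths vary by release, so check the KV cache docs for the version you are on:

# Sketch: quantized KV cache via generate(). Assumes a recent Transformers
# release with quantized-cache support and the `quanto` package installed.
inputs = tokenizer("Summarize the KV cache in one paragraph.", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))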
Multi-query and grouped-query attention. Modern model architectures - Llama 3, Mistral, Qwen - share key and value heads across multiple query heads. Grouped-query attention with eight kv heads instead of thirty-two cuts the cache by 4x with minimal quality cost. This is why a 70B Llama 3 has roughly the same KV-cache footprint per token as a 7B from a couple of years ago.
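Plugging GQA into the kv_bytes_per_token sketch from above makes the cut concrete:

print(kv_bytes_per_token(num_kv_heads=32) / 1024)  # 512.0 KB per token (MHA)
print(kv_bytes_per_token(num_kv_heads=8) / 1024)   # 128.0 KB per token (GQA), a 4x cut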
Paged attention. The vLLM project's contribution. Instead of allocating one contiguous KV cache buffer per request, vLLM allocates fixed-size pages and looks them up by a virtual address per request. This eliminates the fragmentation that wastes memory in batched serving and is the single biggest reason vLLM dominates self-hosted inference today. If you are serving more than one request at a time, you should not be writing your own attention loop - you should be running vLLM or a successor.
Sliding window attention. Some architectures attend only over a window of recent tokens. The KV cache becomes a fixed-size ring buffer instead of a growing array. Mistral popularized this. The cost is that the model genuinely cannot see beyond the window, so anything that depends on long-range structure has to be summarized into recent tokens.
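A minimal sketch of the idea - not any framework's actual implementation - is that token position p always writes into slot p % window, overwriting whatever just slid out of the window:

import torch

window = 4096   # sliding window size (Mistral 7B shipped with 4096)
batch, num_kv_heads, head_dim = 1, 8, 128

# Fixed-size ring buffers instead of tensors that grow with the sequence.
k_cache = torch.zeros(batch, num_kv_heads, window, head_dim)
v_cache = torch.zeros(batch, num_kv_heads, window, head_dim)

def write(pos, k, v):
    # The token at absolute position `pos` lands in slot pos % window,
    # overwriting the token that fell out of the window.
    slot = pos % window
    k_cache[:, :, slot] = k
    v_cache[:, :, slot] = v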
The KV cache discard problem. For very long contexts, the cache may not fit even with quantization. You then have to choose what to evict. Options range from naive (drop the oldest tokens), to clever (the H2O paper, attention-score-based eviction), to extreme (re-summarize the dropped region into a much shorter prefix). Production systems usually pick a static policy - keep the system prompt, keep the last N turns - and call it good.
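A static prefix-plus-tail policy is a few lines of tensor slicing in the legacy tuple format - a sketch of the idea, not a drop-in for any serving framework:

import torch

def evict(cache, keep_prefix, keep_tail):
    # cache: per-layer (keys, values) pairs, each of shape
    # [batch, num_kv_heads, seq_len, head_dim]. Keep the first `keep_prefix`
    # tokens (system prompt) and the last `keep_tail` tokens (recent turns).
    pruned = []
    for k, v in cache:
        k = torch.cat([k[:, :, :keep_prefix], k[:, :, -keep_tail:]], dim=2)
        v = torch.cat([v[:, :, :keep_prefix], v[:, :, -keep_tail:]], dim=2)
        pruned.append((k, v))
    return tuple(pruned)

One caveat: rotary position information is already baked into the cached keys, so dropping the middle leaves a positional gap the model never saw in training - one more reason production policies tend to stay simple and conservative.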
Cache reuse across requests is real and underused. If many requests share a long system prompt, you can compute the cache for the system prompt once and reuse it across all requests. This is the prefix caching feature in vLLM and SGLang. For agent workloads with a long system prompt and short user turns, prefix caching can halve your cost per request.
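In vLLM this is a single engine flag - a minimal sketch using the offline LLM API (the server exposes the same option as --enable-prefix-caching):

from vllm import LLM, SamplingParams

# Requests that share the long system prompt reuse its cached prefix
# instead of recomputing it on every call.
llm = LLM(model="meta-llama/Llama-3.2-3B", enable_prefix_caching=True)

system_prompt = "You are a support agent for ExampleCo. ..."  # long, shared prefix
params = SamplingParams(max_tokens=128)

for user_turn in ["Where is my order?", "Cancel my subscription."]:
    out = llm.generate([system_prompt + "\n" + user_turn], params)
    print(out[0].outputs[0].text)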
Batch size and cache size are linked. The KV cache is per-request. If you serve eight requests at once, you have eight caches in memory. The maximum batch size is bounded by (VRAM - model_weights) / (cache_per_token * max_seq_len). Profile this before you size your hardware.
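Worked through with illustrative numbers - a 24 GB card, roughly 14 GB of bf16 weights for a 7B, the 512 KB-per-token figure from above, 4K max context - and ignoring activation and framework overhead, so treat the result as an upper bound:

vram_gb = 24
weights_gb = 14            # ~7B parameters in bf16
cache_per_token_mb = 0.5   # from the per-token math above
max_seq_len = 4096

cache_per_request_gb = cache_per_token_mb * max_seq_len / 1024  # 2 GB per request
max_batch = (vram_gb - weights_gb) // cache_per_request_gb
print(max_batch)  # 5.0 -> about five concurrent full-length requests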
Speculative decoding interacts in non-obvious ways. With a draft model proposing tokens that the target model verifies, you have two caches running in lockstep, and rolling back the target cache when the draft is wrong is fiddly. Most frameworks handle this for you. If you implement it yourself, double-check the rollback path.
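The rollback itself is conceptually just trimming the seq_len dimension back to the accepted prefix - a sketch in the legacy tuple format:

def rollback(cache, n_accepted):
    # Trim the target model's cache back to the tokens that survived
    # verification; seq_len is dimension 2 of each tensor.
    return tuple(
        (k[:, :, :n_accepted], v[:, :, :n_accepted])
        for k, v in cache
    )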
Streaming output and the cache. The cache is mutated in place by generate(). If you stream tokens to a client and the client disconnects, you need to release the cache memory promptly. Plumbing the cancel path through your inference server is a real source of memory leaks.
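The shape of the fix is always the same: put the release on every exit path, not just the happy one. A generic sketch, not tied to any particular server framework - a client disconnect shows up here as the consumer abandoning the generator, which still runs the finally block:

import torch

def stream_tokens(model, tokenizer, prompt, max_new=256):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    generated = input_ids
    try:
        for _ in range(max_new):
            inputs = generated if past_key_values is None else generated[:, -1:]
            with torch.no_grad():
                out = model(input_ids=inputs, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=1)
            yield tokenizer.decode(next_token[0])
    finally:
        # Runs on normal completion and on abandonment alike: drop the only
        # reference to the cache so the allocator can reclaim the VRAM.
        past_key_values = None
        torch.cuda.empty_cache()  # optional: hand freed blocks back to the driver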
The honest perspective from someone who has shipped a few agent products. KV caching is not a feature you turn on - it is the air everything else breathes. Every other inference optimization assumes it. Continuous batching, speculative decoding, prefix caching, paged attention - all of them are reorganizations of the KV cache.
For most application devs, the right move is not to write your own KV-cache logic. The right move is to pick a serving framework that has solved it and to understand the framework's trade space. vLLM is the default for self-hosted Llama-class models. SGLang is the more aggressive option for very long contexts and structured generation. TensorRT-LLM is the right answer if you live on NVIDIA hardware and need every last token per second.
The case for understanding KV caching deeply is when you start doing things the framework was not designed for. Long-context retrieval pipelines where you want to splice in cached prefixes per query. Multi-tenant agent serving where you want to share cache across users with the same system prompt. Local on-device inference where the cache is the dominant memory cost and you cannot afford a generic policy.
If you are running models locally, even casually, the rule of thumb is: most of your VRAM is going to be cache, not weights, and the choice of architecture matters more than the choice of model size for inference economics.
We have been profiling KV-cache behavior inside Traces, our agent-run timeline tool. Traces was originally built to render Claude Code transcripts as a stepped UI. Adding self-hosted-model traces meant we had to surface KV-cache utilization as a first-class metric, because for self-hosted runs the cache is the most predictive number for "is this run going to OOM."
The pattern that has worked is to log, per turn, the prefix cache hit ratio, the live cache size in MB, and the maximum-allowed cache size for the request. Plot those over the run and you can immediately see when a request is cache-bound versus compute-bound versus prompt-bound. This kind of observability has been hard to get out of self-hosted serving frameworks; building it in turned out to be a small but valuable feature.
For workloads that need persistent agent state, AgentFS is where the prefix-cache idea pays off. AgentFS gives agents a durable workspace across runs, and most of what an agent does in a given session involves the same long system prompt, the same toolset descriptions, and the same workspace summary. Caching the prefix once per agent and reusing it across turns inside a session is the difference between five-second latency and one-second latency on a self-hosted setup. The same idea generalizes to any product with a stable long prefix.
The interesting open questions.
Cache compression beyond int8. Recent research is pushing into 4-bit and even 2-bit cache quantization with minimal quality cost. If those numbers hold up in production, the effective memory budget for long-context inference doubles or quadruples without new hardware.
Cache-aware routing. Multi-tenant inference systems are starting to route requests to the GPU that already has a relevant prefix cached. The economic logic is obvious. The implementation is gnarly because you need a global view of which caches live where.
KV-cache sharing across models. If you have a chain of models - a small fast model proposing, a large slow model verifying, a re-ranker checking - sharing the KV computation across them is an open research direction. The constraint is that the architectures have to match.
Hardware support. The newest accelerators are starting to ship with KV-cache-specific memory hierarchies - dedicated cache-only HBM stacks, faster cache-to-compute paths. The hardware is racing to catch up to the access pattern that transformer inference actually exhibits, and the next generation of GPUs will have very different KV-cache economics from the current one.
We are running a deeper hands-on walkthrough on YouTube - building a tiny inference loop with explicit KV cache management, then bolting on prefix caching, then comparing against vLLM. Watching the mental model assemble in code is, in our experience, the fastest way to actually internalize this. Once you have built it once, you stop being surprised by inference performance forever.