
TL;DR
How KV caching speeds up LLM inference - the math, the code, the memory tradeoffs, and when it stops helping. Every dev running local models hits this wall.
You spin up a 7B model on a decent GPU. The first generation feels fast. Then you push the context to a few thousand tokens and the throughput collapses. You profile and the answer is the same one a thousand devs have arrived at independently: most of your forward passes are recomputing attention over tokens you already saw.
This is the KV-cache wall. It is the canonical bottleneck of transformer inference, the first real performance lesson when you stop calling hosted APIs and start running models yourself, and the topic of one of the clearest explainers Hugging Face has shipped this year - not-lain's KV caching post.
This piece is the developer's version of that explainer. Less math, more code, more focus on the engineering decisions you actually make when you ship.
A transformer generates one token at a time. Each new token attends to every previous token in the sequence. If your context is one thousand tokens and you generate the one-thousand-and-first, the model needs the keys and values for all one thousand previous tokens to compute attention.
The naive implementation recomputes those keys and values on every generation step. You feed the whole sequence through the model again, throw away most of the output, and keep only the new token. This is O(n^2) work to generate n tokens.
The KV cache is the obvious fix. After the first forward pass, you have computed the keys and values for the prompt. Cache them in memory. On the next step, you only run the new token through the model, attend it against the cached keys and values, and produce one new pair of cached entries. The whole generation becomes O(n) instead of O(n^2).
The speedup is enormous. For a sequence length of 2,048 with a 7B model, KV caching can take you from "this is unusable" to "this is real-time" on the same GPU.
If you are using Hugging Face Transformers, KV caching is on by default in generate(). The interesting work happens when you build your own inference loop and need to reason about it directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = "The KV cache stores"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

past_key_values = None
generated = input_ids
max_new = 50

with torch.no_grad():
    for _ in range(max_new):
        # First step: feed the whole prompt. Every later step: feed only the
        # newest token, because everything before it is already in the cache.
        if past_key_values is None:
            inputs = generated
        else:
            inputs = generated[:, -1:]
        out = model(
            input_ids=inputs,
            past_key_values=past_key_values,
            use_cache=True,
        )
        # The returned cache now also contains the keys/values for the
        # token(s) we just processed.
        past_key_values = out.past_key_values
        # Greedy decoding: pick the most likely next token.
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)

print(tokenizer.decode(generated[0]))
The shape of the cache is what matters. past_key_values holds one entry per transformer layer (a plain tuple in the legacy format, a Cache object in recent Transformers releases, but the layout is the same). Each entry is two tensors - the keys and the values - of shape [batch, num_kv_heads, seq_len, head_dim]. The seq_len dimension grows by one with every generation step.
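You can see the layout directly by poking at the cache left over from the loop above. A minimal sketch - the to_legacy_cache() call is an assumption about recent Transformers versions (older ones return the per-layer tuple directly), hence the hasattr guard:

layers = (
    past_key_values.to_legacy_cache()
    if hasattr(past_key_values, "to_legacy_cache")
    else past_key_values
)
keys, values = layers[0]   # first transformer layer
print(len(layers))         # number of layers
print(keys.shape)          # [batch, num_kv_heads, seq_len, head_dim]
print(values.shape)        # same shape as the keys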
That growing dimension is the catch. The cache scales linearly with sequence length, and the constant is not small. For a 7B-class model with 32 layers, 32 heads, head_dim 128, in float16, the math is:
2 (keys and values) * 32 layers * 32 heads * 128 head_dim * 2 bytes = 524,288 bytes ≈ 512 KB per token
At 2,048 tokens of context, that is one gigabyte of KV cache per request. At 32K context, sixteen gigabytes. The KV cache, not the model weights, is what fills your VRAM in long-context inference.
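That arithmetic is worth keeping as a function. A minimal sketch - kv_bytes_per_token is our own helper, not a library call - with the 7B-class constants above as defaults; swap in your model's num_hidden_layers, num_key_value_heads, head_dim, and cache dtype to get your own numbers:

def kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Two tensors per layer (keys and values), one head_dim-sized vector
    # per kv head per token, dtype_bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()
print(per_token / 1024)              # 512.0 KB per token
print(per_token * 2048 / 1024**3)    # 1.0 GB at 2,048 tokens of context
print(per_token * 32_768 / 1024**3)  # 16.0 GB at 32K tokens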
Once you internalize that the KV cache is most of your memory budget, a lot of optimization decisions snap into focus.
Quantization of the cache itself. You can store the cache in int8 or int4 instead of bfloat16. The accuracy hit is usually small for chat workloads. Hugging Face Transformers supports int8 cache out of the box - flip a config flag, halve your memory.
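In practice that looks roughly like the sketch below, which assumes the quantized-cache path in recent Transformers versions (cache_implementation="quantized" plus a cache_config dict, with the quanto backend installed); supported backends and bit widths vary by release, so check the KV cache docs for the version you are on:

# Sketch: quantized KV cache via generate(). Assumes a recent Transformers
# release with quantized-cache support and the `quanto` package installed.
inputs = tokenizer("Summarize the KV cache in one paragraph.", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))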
Multi-query and grouped-query attention. Modern model architectures - Llama 3, Mistral, Qwen - share key and value heads across multiple query heads. Grouped-query attention with eight kv heads instead of thirty-two cuts the cache by 4x with minimal quality cost. This is why a 70B Llama 3 has roughly the same KV-cache footprint per token as a 7B from a couple of years ago.
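Plugging GQA into the kv_bytes_per_token sketch from above makes the cut concrete:

print(kv_bytes_per_token(num_kv_heads=32) / 1024)  # 512.0 KB per token (MHA)
print(kv_bytes_per_token(num_kv_heads=8) / 1024)   # 128.0 KB per token (GQA), a 4x cut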
Paged attention. The vLLM project's contribution. Instead of allocating one contiguous KV cache buffer per request, vLLM allocates fixed-size pages and looks them up by a virtual address per request. This eliminates the fragmentation that wastes memory in batched serving and is the single biggest reason vLLM dominates self-hosted inference today. If you are serving more than one request at a time, you should not be writing your own attention loop - you should be running vLLM or a successor.
Sliding window attention. Some architectures attend only over a window of recent tokens. The KV cache becomes a fixed-size ring buffer instead of a growing array. Mistral popularized this. The cost is that the model genuinely cannot see beyond the window, so anything that depends on long-range structure has to be summarized into recent tokens.
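A minimal sketch of the idea - not any framework's actual implementation - is that token position p always writes into slot p % window, overwriting whatever just slid out of the window:

import torch

window = 4096   # sliding window size (Mistral 7B shipped with 4096)
batch, num_kv_heads, head_dim = 1, 8, 128

# Fixed-size ring buffers instead of tensors that grow with the sequence.
k_cache = torch.zeros(batch, num_kv_heads, window, head_dim)
v_cache = torch.zeros(batch, num_kv_heads, window, head_dim)

def write(pos, k, v):
    # The token at absolute position `pos` lands in slot pos % window,
    # overwriting the token that fell out of the window.
    slot = pos % window
    k_cache[:, :, slot] = k
    v_cache[:, :, slot] = v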
The KV cache discard problem. For very long contexts, the cache may not fit even with quantization. You then have to choose what to evict. Options range from naive (drop the oldest tokens), to clever (the H2O paper, attention-score-based eviction), to extreme (re-summarize the dropped region into a much shorter prefix). Production systems usually pick a static policy - keep the system prompt, keep the last N turns - and call it good.
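A static prefix-plus-tail policy is a few lines of tensor slicing in the legacy tuple format - a sketch of the idea, not a drop-in for any serving framework:

import torch

def evict(cache, keep_prefix, keep_tail):
    # cache: per-layer (keys, values) pairs, each of shape
    # [batch, num_kv_heads, seq_len, head_dim]. Keep the first `keep_prefix`
    # tokens (system prompt) and the last `keep_tail` tokens (recent turns).
    pruned = []
    for k, v in cache:
        k = torch.cat([k[:, :, :keep_prefix], k[:, :, -keep_tail:]], dim=2)
        v = torch.cat([v[:, :, :keep_prefix], v[:, :, -keep_tail:]], dim=2)
        pruned.append((k, v))
    return tuple(pruned)

One caveat: rotary position information is already baked into the cached keys, so dropping the middle leaves a positional gap the model never saw in training - one more reason production policies tend to stay simple and conservative.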
Cache reuse across requests is real and underused. If many requests share a long system prompt, you can compute the cache for the system prompt once and reuse it across all requests. This is the prefix caching feature in vLLM and SGLang. For agent workloads with a long system prompt and short user turns, prefix caching can halve your cost per request.
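In vLLM this is a single engine flag - a minimal sketch using the offline LLM API (the server exposes the same option as --enable-prefix-caching):

from vllm import LLM, SamplingParams

# Requests that share the long system prompt reuse its cached prefix
# instead of recomputing it on every call.
llm = LLM(model="meta-llama/Llama-3.2-3B", enable_prefix_caching=True)

system_prompt = "You are a support agent for ExampleCo. ..."  # long, shared prefix
params = SamplingParams(max_tokens=128)

for user_turn in ["Where is my order?", "Cancel my subscription."]:
    out = llm.generate([system_prompt + "\n" + user_turn], params)
    print(out[0].outputs[0].text)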
Batch size and cache size are linked. The KV cache is per-request. If you serve eight requests at once, you have eight caches in memory. The maximum batch size is bounded by (VRAM - model_weights) / (cache_per_token * max_seq_len). Profile this before you size your hardware.
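Worked through with illustrative numbers - a 24 GB card, roughly 14 GB of bf16 weights for a 7B, the 512 KB-per-token figure from above, 4K max context - and ignoring activation and framework overhead, so treat the result as an upper bound:

vram_gb = 24
weights_gb = 14            # ~7B parameters in bf16
cache_per_token_mb = 0.5   # from the per-token math above
max_seq_len = 4096

cache_per_request_gb = cache_per_token_mb * max_seq_len / 1024  # 2 GB per request
max_batch = (vram_gb - weights_gb) // cache_per_request_gb
print(max_batch)  # 5.0 -> about five concurrent full-length requests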
Speculative decoding interacts in non-obvious ways. With a draft model proposing tokens that the target model verifies, you have two caches running in lockstep, and rolling back the target cache when the draft is wrong is fiddly. Most frameworks handle this for you. If you implement it yourself, double-check the rollback path.
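The rollback itself is conceptually just trimming the seq_len dimension back to the accepted prefix - a sketch in the legacy tuple format:

def rollback(cache, n_accepted):
    # Trim the target model's cache back to the tokens that survived
    # verification; seq_len is dimension 2 of each tensor.
    return tuple(
        (k[:, :, :n_accepted], v[:, :, :n_accepted])
        for k, v in cache
    )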
Streaming output and the cache. The cache is mutated in place by generate(). If you stream tokens to a client and the client disconnects, you need to release the cache memory promptly. Plumbing the cancel path through your inference server is a real source of memory leaks.
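The shape of the fix is always the same: put the release on every exit path, not just the happy one. A generic sketch, not tied to any particular server framework - a client disconnect shows up here as the consumer abandoning the generator, which still runs the finally block:

import torch

def stream_tokens(model, tokenizer, prompt, max_new=256):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    generated = input_ids
    try:
        for _ in range(max_new):
            inputs = generated if past_key_values is None else generated[:, -1:]
            with torch.no_grad():
                out = model(input_ids=inputs, past_key_values=past_key_values, use_cache=True)
            past_key_values = out.past_key_values
            next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=1)
            yield tokenizer.decode(next_token[0])
    finally:
        # Runs on normal completion and on abandonment alike: drop the only
        # reference to the cache so the allocator can reclaim the VRAM.
        past_key_values = None
        torch.cuda.empty_cache()  # optional: hand freed blocks back to the driver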
The honest perspective from someone who has shipped a few agent products. KV caching is not a feature you turn on - it is the air everything else breathes. Every other inference optimization assumes it. Continuous batching, speculative decoding, prefix caching, paged attention - all of them are reorganizations of the KV cache.
For most application devs, the right move is not to write your own KV-cache logic. The right move is to pick a serving framework that has solved it and to understand the framework's trade space. vLLM is the default for self-hosted Llama-class models. SGLang is the more aggressive option for very long contexts and structured generation. TensorRT-LLM is the right answer if you live on NVIDIA hardware and need every last token per second.
The case for understanding KV caching deeply is when you start doing things the framework was not designed for. Long-context retrieval pipelines where you want to splice in cached prefixes per query. Multi-tenant agent serving where you want to share cache across users with the same system prompt. Local on-device inference where the cache is the dominant memory cost and you cannot afford a generic policy.
If you are running models locally, even casually, the rule of thumb is: most of your VRAM is going to be cache, not weights, and the choice of architecture matters more than the choice of model size for inference economics.
We have been profiling KV-cache behavior inside Traces, our agent-run timeline tool. Traces was originally built to render Claude Code transcripts as a stepped UI. Adding self-hosted-model traces meant we had to surface KV-cache utilization as a first-class metric, because for self-hosted runs the cache is the most predictive number for "is this run going to OOM."
The pattern that has worked is to log, per turn, the prefix cache hit ratio, the live cache size in MB, and the maximum-allowed cache size for the request. Plot those over the run and you can immediately see when a request is cache-bound versus compute-bound versus prompt-bound. This kind of observability has been hard to get out of self-hosted serving frameworks; building it in turned out to be a small but valuable feature.
For workloads that need persistent agent state, AgentFS is where the prefix-cache idea pays off. AgentFS gives agents a durable workspace across runs, and most of what an agent does in a given session involves the same long system prompt, the same toolset descriptions, and the same workspace summary. Caching the prefix once per agent and reusing it across turns inside a session is the difference between five-second latency and one-second latency on a self-hosted setup. The same idea generalizes to any product with a stable long prefix.
The interesting open questions.
Cache compression beyond int8. Recent research is pushing into 4-bit and even 2-bit cache quantization with minimal quality cost. If those numbers hold up in production, the effective memory budget for long-context inference doubles or quadruples without new hardware.
Cache-aware routing. Multi-tenant inference systems are starting to route requests to the GPU that already has a relevant prefix cached. The economic logic is obvious. The implementation is gnarly because you need a global view of which caches live where.
KV-cache sharing across models. If you have a chain of models - a small fast model proposing, a large slow model verifying, a re-ranker checking - sharing the KV computation across them is an open research direction. The constraint is that the architectures have to match.
Hardware support. The newest accelerators are starting to ship with KV-cache-specific memory hierarchies - dedicated cache-only HBM stacks, faster cache-to-compute paths. The hardware is racing to catch up to the access pattern that transformer inference actually exhibits, and the next generation of GPUs will have very different KV-cache economics from the current one.
We are running a deeper hands-on walkthrough on YouTube - building a tiny inference loop with explicit KV cache management, then bolting on prefix caching, then comparing against vLLM. Watching the mental model assemble in code is, in our experience, the fastest way to actually internalize this. Once you have built it once, you stop being surprised by inference performance forever.