LLM Engineering
Cut latency and token cost by structuring prompts so the stable prefix (system, tools, long context) is cached and only the tail changes per request.
1 file
Description
Cut latency and token cost by structuring prompts so the stable prefix (system, tools, long context) is cached and only the tail changes per request.
Repeated calls that share a large prefix: a chat with a long system prompt, an agent with many tool definitions, or a RAG pipeline that resends the same reference document across turns.
Caching works on a prefix. The model caches everything up to a marked boundary; on the next call, if the prefix is byte-identical, that portion is read from cache at a fraction of the cost and latency. So the win comes from ordering the prompt stable-first, volatile-last.
Check the usage numbers the API returns per call. You want cache-read tokens to dominate on the second and later calls. If cache creation happens every call, something in your prefix is changing - diff two consecutive raw request bodies.
Related
Added 2026-07-01. Back to the Skill Library.

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.