LLM Engineering

Prompt Caching Optimization

Cut latency and token cost by structuring prompts so the stable prefix (system, tools, long context) is cached and only the tail changes per request.

prompt-cachinglatencycostllmoptimization

1 file

Description

Cut latency and token cost by structuring prompts so the stable prefix (system, tools, long context) is cached and only the tail changes per request.

Install

SKILL.md

name: prompt-caching-optimization description: Use when an LLM feature sends a large, mostly-stable prompt on every request (long system prompt, tool definitions, retrieved context) and you want to cut cost and latency with prompt caching.

Prompt Caching Optimization

When to trigger

Repeated calls that share a large prefix: a chat with a long system prompt, an agent with many tool definitions, or a RAG pipeline that resends the same reference document across turns.

The core idea

Caching works on a prefix. The model caches everything up to a marked boundary; on the next call, if the prefix is byte-identical, that portion is read from cache at a fraction of the cost and latency. So the win comes from ordering the prompt stable-first, volatile-last.

Structure the prompt

Put the invariant content first and in a fixed order: system instructions, tool definitions, then long shared context (documents, examples).
Mark the cache boundary at the end of the stable region so everything before it is cacheable.
Put the per-request content (the user's new message, the current query) after the boundary.

Rules that make or break the cache

The cached prefix must be byte-identical across calls. A timestamp, a request id, or a reordered tool list at the top invalidates the cache every time.
Caches are time-limited. Bursty, back-to-back traffic benefits most; a request every few minutes may find the cache expired.
Order matters more than size. A small dynamic field placed early poisons a large stable prefix behind it.

Verify it works

Check the usage numbers the API returns per call. You want cache-read tokens to dominate on the second and later calls. If cache creation happens every call, something in your prefix is changing - diff two consecutive raw request bodies.

Pitfalls

Injecting the current date or a session id into the system prompt silently disables caching. Move it below the boundary.
Rebuilding tool definitions from a map with non-deterministic key order changes the bytes. Serialize deterministically.
Do not assume a cache hit. Measure it from the returned token counts before claiming a cost win.

/tools/ai-cost-calculator

Added 2026-07-01. Back to the Skill Library.

Get Smarter About AI Dev

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.

One email per weekReal code, not theoryFree forever

Structure the prompt

Put the invariant content first and in a fixed order: system instructions, tool definitions, then long shared context (documents, examples).

Mark the cache boundary at the end of the stable region so everything before it is cacheable.

Put the per-request content (the user's new message, the current query) after the boundary.

Rules that make or break the cache

The cached prefix must be byte-identical across calls. A timestamp, a request id, or a reordered tool list at the top invalidates the cache every time.

Caches are time-limited. Bursty, back-to-back traffic benefits most; a request every few minutes may find the cache expired.

Order matters more than size. A small dynamic field placed early poisons a large stable prefix behind it.

Pitfalls

Injecting the current date or a session id into the system prompt silently disables caching. Move it below the boundary.

Rebuilding tool definitions from a map with non-deterministic key order changes the bytes. Serialize deterministically.

Do not assume a cache hit. Measure it from the returned token counts before claiming a cost win.