
TL;DR
A production-grade RAG pipeline with Claude. Chunking that survives real documents, retrieval tuning that actually moves the needle, citation tracking, and the prompt caching trick that makes RAG cheap enough to ship.
If you have proprietary data and you want a model to answer questions about it, you have three options. Few-shot in the prompt. Retrieval-augmented generation. Fine-tuning.
For 90 percent of teams, the right answer is RAG. Few-shot dies the moment your knowledge base outgrows the context window. Fine-tuning is expensive, slow to iterate on, and changes the model's behavior in ways that are hard to predict and harder to undo. RAG keeps the base model unchanged, scales to arbitrary document counts, and lets you update the knowledge base by re-indexing instead of retraining.
The case for fine-tuning is narrower than people think. Fine-tune when you need a specific output format the base model resists, or when you have millions of high-quality examples and latency matters more than freshness. Everything else is RAG, and Claude is genuinely good at the synthesis step that turns retrieved chunks into a grounded answer.
We covered the conceptual basics in "what is RAG." This post is the implementation - the parts that don't show up in the marketing diagrams.
Every RAG team I have worked with underestimates chunking. Then they spend three weeks tuning their retriever before realizing the problem was upstream.
The naive approach is fixed-size chunks. Split your document every 1000 tokens. This breaks the moment a sentence, a code block, or a table spans a boundary. Your retriever returns half a thought. Claude makes up the other half. You blame the embeddings.
The right approach is semantic chunking with hierarchy. Three rules.
Respect document structure. Markdown headings, HTML sections, code fences, table boundaries. These are pre-existing semantic units. Use them as primary chunk boundaries. Fall back to paragraph breaks, then to sentence breaks, then to fixed token windows only as a last resort.
Keep parent context. Every chunk should carry metadata about what document it came from, what section, what the surrounding heading hierarchy was. When you retrieve a chunk that says "the limit is 100 requests per minute," Claude needs to know whether that was about the free tier or the enterprise tier. Stuff the heading path into a context field on each chunk.
Overlap, but small. A 10-15 percent token overlap between adjacent chunks helps with the case where the answer straddles a boundary. More overlap wastes embedding budget. Less overlap loses answers.
interface Chunk {
  id: string;
  text: string;
  documentId: string;
  headingPath: string[];
  position: number;
  metadata: Record<string, string>;
}

function chunkMarkdown(doc: string, docId: string): Chunk[] {
  const sections = splitByHeadings(doc);
  const chunks: Chunk[] = [];
  let position = 0;

  for (const section of sections) {
    const tokens = estimateTokens(section.body);
    if (tokens < 1200) {
      // Section fits in one chunk: keep it whole, heading path and all.
      chunks.push({
        id: `${docId}:${position}`,
        text: section.body,
        documentId: docId,
        headingPath: section.headings,
        position: position++,
        metadata: { tokenCount: String(tokens) },
      });
    } else {
      // Oversized section: fall back to ~1000-token paragraph chunks with 150-token overlap.
      const subs = splitByParagraphs(section.body, 1000, 150);
      for (const sub of subs) {
        chunks.push({
          id: `${docId}:${position}`,
          text: sub,
          documentId: docId,
          headingPath: section.headings,
          position: position++,
          metadata: { tokenCount: String(estimateTokens(sub)) },
        });
      }
    }
  }
  return chunks;
}
This is not glamorous code. It is the code that determines whether your RAG works.
Pure vector search loses to hybrid search on real workloads. The reason is that embeddings are good at semantic similarity and bad at exact-match recall. A user query for "error code E47" needs to find the chunk that contains the literal string "E47," and to an embedding model, "E47" and every other error code look like roughly the same vector.
Hybrid search runs both: BM25 (or a similar keyword index) and a vector search, then fuses the rankings. Reciprocal Rank Fusion is the simplest fusion algorithm and it works.
interface ScoredChunk {
  chunk: Chunk;
  score: number;
}

function reciprocalRankFusion(
  rankings: ScoredChunk[][],
  k = 60
): ScoredChunk[] {
  const scores = new Map<string, { chunk: Chunk; score: number }>();
  for (const ranking of rankings) {
    ranking.forEach((item, rank) => {
      const existing = scores.get(item.chunk.id);
      // RRF ignores raw scores and fuses on rank alone: 1 / (k + rank).
      const fused = 1 / (k + rank + 1);
      if (existing) existing.score += fused;
      else scores.set(item.chunk.id, { chunk: item.chunk, score: fused });
    });
  }
  return Array.from(scores.values()).sort((a, b) => b.score - a.score);
}

async function retrieve(query: string, topK = 8): Promise<Chunk[]> {
  // Over-fetch from each index, fuse the rankings, then cut to topK.
  const [vectorHits, keywordHits] = await Promise.all([
    vectorSearch(query, topK * 2),
    keywordSearch(query, topK * 2),
  ]);
  const fused = reciprocalRankFusion([vectorHits, keywordHits]);
  return fused.slice(0, topK).map((s) => s.chunk);
}
The next lever is reranking. Pull a top-30 from the hybrid retriever, then run a cross-encoder reranker (Cohere Rerank, BGE Reranker, or a small Claude prompt) over those 30 to pick the best 8. Reranking is expensive per query but it eliminates the long tail of retrieval misses where the right answer was at rank 12 and got dropped.
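Here is a minimal sketch of the Claude-prompt variant of that reranker, reusing the client and Chunk type from above. The prompt shape and the JSON-array output format are our assumptions, not a fixed API, and a dedicated cross-encoder will be cheaper at volume.

async function rerankWithClaude(
  question: string,
  candidates: Chunk[],
  keep = 8
): Promise<Chunk[]> {
  const listing = candidates
    .map((c) => `<candidate id="${c.id}">\n${c.text.slice(0, 600)}\n</candidate>`)
    .join("\n");
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    system:
      "You rank retrieval candidates by relevance to a question. " +
      "Respond with only a JSON array of candidate ids, best first.",
    messages: [
      {
        role: "user",
        content: `Question: ${question}\n\nCandidates:\n${listing}\n\nReturn the ${keep} most relevant ids.`,
      },
    ],
  });
  const text = response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  // Fall back to the fused order if the model returns something unparseable.
  let ids: string[];
  try {
    ids = JSON.parse(text);
  } catch {
    return candidates.slice(0, keep);
  }
  const byId = new Map(candidates.map((c) => [c.id, c] as const));
  return ids
    .map((id) => byId.get(id))
    .filter((c): c is Chunk => Boolean(c))
    .slice(0, keep);
}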
Don't skip the eval. Build a set of 50 question/answer pairs from your real data. Measure recall at 10 and answer correctness on every retrieval change. Most "improvements" don't improve anything when you measure them.
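The recall-at-k half of that eval is a few lines. A sketch, assuming each test case records the chunk IDs that should have come back (the goldChunkIds field is an assumption about your eval format, not a standard):

interface EvalCase {
  question: string;
  goldChunkIds: string[];
}

async function recallAtK(cases: EvalCase[], k = 10): Promise<number> {
  let hits = 0;
  let total = 0;
  for (const c of cases) {
    const retrieved = new Set((await retrieve(c.question, k)).map((x) => x.id));
    // A gold chunk counts as a hit if it appears anywhere in the top k.
    hits += c.goldChunkIds.filter((id) => retrieved.has(id)).length;
    total += c.goldChunkIds.length;
  }
  return hits / total;
}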
Retrieval is half the problem. The generation prompt is the other half.
Three things go into a RAG prompt for Claude.
A system message that tells Claude its job: answer the user's question using only the provided sources. If the sources don't contain the answer, say so explicitly. Do not use prior knowledge. Cite sources by ID.
The retrieved chunks, formatted with clear delimiters and an explicit ID per chunk so citations work.
The user's question.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You answer questions using only the provided sources.
Rules:
- Cite every claim with [source-id].
- If the sources do not contain enough information to answer, respond exactly: "The provided sources do not contain enough information to answer this question."
- Do not use general knowledge.
- Quote directly when precision matters.`;

function formatSources(chunks: Chunk[]): string {
  return chunks
    .map(
      (c) =>
        `<source id="${c.id}" path="${c.headingPath.join(" > ")}">\n${c.text}\n</source>`
    )
    .join("\n\n");
}

async function generate(question: string, chunks: Chunk[]) {
  return await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        // The stable system prompt is marked for prompt caching; see the cost section below.
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        content: `Sources:\n\n${formatSources(chunks)}\n\nQuestion: ${question}`,
      },
    ],
  });
}
The XML-style source tags matter. Claude is trained to respect them as structural delimiters, and it cites by attribute when you ask it to. The "respond exactly" instruction is also load-bearing - without it, Claude will reach for prior knowledge when sources are thin and tell you with confidence things that aren't in your corpus.
Citations in the output are necessary but not sufficient. The full audit trail in production looks like this for every query:
The user question. The retrieved chunk IDs and their relevance scores. The chunks the model actually cited in its response (parsed out of the output). The final answer text.
Log all four. When a user reports a wrong answer, you can immediately see whether the failure was retrieval (right chunks not retrieved), grounding (right chunks retrieved but model ignored them), or hallucination (model cited a chunk that doesn't say what it claimed).
function extractCitations(text: string): string[] {
  // Pull out [source-id] citations and dedupe. Assumes lowercase doc IDs;
  // widen the character class if yours differ.
  const matches = text.match(/\[([a-z0-9:.-]+)\]/g) ?? [];
  return [...new Set(matches.map((m) => m.slice(1, -1)))];
}

async function answerWithAudit(question: string) {
  const chunks = await retrieve(question);
  const response = await generate(question, chunks);
  const text = response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return {
    question,
    retrievedIds: chunks.map((c) => c.id),
    citedIds: extractCitations(text),
    answer: text,
    usage: response.usage,
  };
}
The diff between retrievedIds and citedIds is your most useful debugging signal. If the model cited zero retrieved chunks but produced an answer, that is hallucination, full stop.
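One way to turn that diff into an automatic flag on every audit record. A sketch; the labels are ours, not a standard taxonomy:

type GroundingFlag = "grounded" | "partially-grounded" | "ungrounded";

function groundingFlag(retrievedIds: string[], citedIds: string[]): GroundingFlag {
  // No citations at all but an answer exists: treat as ungrounded.
  if (citedIds.length === 0) return "ungrounded";
  const retrieved = new Set(retrievedIds);
  const valid = citedIds.filter((id) => retrieved.has(id));
  if (valid.length === citedIds.length) return "grounded";
  // Some citations point at chunks that were never retrieved: suspect.
  return "partially-grounded";
}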
The single biggest cost optimization for production RAG is prompt caching on the system prompt and any stable context (reference docs, glossaries, persona). For a chatbot that answers from a knowledge base, the system prompt and instructions don't change between queries. Cache them.
Cached reads cost 10 percent of normal input. For a 2k-token system prompt that gets called 10,000 times a day, that is the difference between a real bill and a footnote. Note that the retrieved chunks themselves don't cache well because they vary per query, but the scaffolding around them does.
The full pattern: cache system prompt as one block, put dynamic chunks in the user message, keep the structure stable so cache prefix matching works on every call. For RAG specifically the caching savings often dwarf the embedding and vector DB costs.
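A sketch of that shape with a second stable block. The glossary text stands in for whatever reference material you keep constant across queries; marking the last stable block is enough, since the cache breakpoint covers the whole prefix before it.

async function generateCached(question: string, chunks: Chunk[], glossary: string) {
  return await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      { type: "text", text: SYSTEM_PROMPT },
      {
        type: "text",
        text: `Reference glossary:\n${glossary}`,
        // One breakpoint after the last stable block caches everything before it.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        // The retrieved chunks change per query, so they stay outside the cached prefix.
        content: `Sources:\n\n${formatSources(chunks)}\n\nQuestion: ${question}`,
      },
    ],
  });
}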
End-to-end RAG latency breaks down roughly: embedding the query (50-200ms), vector search (20-100ms), keyword search (10-50ms), reranking (200-500ms), Claude generation (1-3s for short answers, 3-10s for long). The generation dominates. Optimizing anything else first is premature.
The two highest-leverage latency wins are streaming the response (start showing tokens at 800ms instead of waiting 3s for the full answer) and parallelizing retrieval calls with Promise.all. Both are free wins.
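A minimal streaming version of generate() using the SDK's streaming helper: same request shape, tokens surfaced as they arrive instead of after the full answer.

async function generateStreaming(question: string, chunks: Chunk[]) {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [
      {
        role: "user",
        content: `Sources:\n\n${formatSources(chunks)}\n\nQuestion: ${question}`,
      },
    ],
  });
  // Forward text deltas to the client as they arrive.
  stream.on("text", (delta) => process.stdout.write(delta));
  return await stream.finalMessage();
}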
Throughput hits walls in two places. The vector DB starts choking past a certain QPS depending on which one you picked. And Anthropic rate limits cap your generation throughput. Both need monitoring. Both want exponential backoff with jitter on retries, which we wrote up in Claude API reliability.
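The retry shape we mean, in sketch form: full jitter on an exponential base, retrying only on responses we consider retryable (the status codes checked here are an assumption about your error policy).

async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      const retryable = status === 429 || status === 500 || status === 529;
      if (!retryable || attempt >= maxAttempts - 1) throw err;
      // Exponential backoff with full jitter: sleep somewhere in [0, 1s * 2^attempt).
      const delay = Math.random() * 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const response = await withRetries(() => generate(question, chunks));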
Cost monitoring is the part teams skip until the bill comes. Track tokens per query (input from chunks, output from generation), retrieval cost, and per-user cost. We watch this on agent-finops for our own RAG endpoints. The p99 cost user is usually 50x the median and is usually a bot. Catch them early.
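A sketch of the per-query cost rollup. Prices are passed in rather than hard-coded because they vary by model and change over time; cache-write tokens are omitted here for brevity.

interface TokenRates {
  inputPerMTok: number; // USD per million uncached input tokens
  outputPerMTok: number; // USD per million output tokens
  cacheReadPerMTok: number; // USD per million cache-read input tokens
}

function queryCostUSD(
  usage: {
    input_tokens: number;
    output_tokens: number;
    cache_read_input_tokens?: number | null;
  },
  rates: TokenRates
): number {
  const cached = usage.cache_read_input_tokens ?? 0;
  return (
    (usage.input_tokens * rates.inputPerMTok +
      usage.output_tokens * rates.outputPerMTok +
      cached * rates.cacheReadPerMTok) /
    1_000_000
  );
}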
For replay and debugging the answers that don't look right, tracetrail lets us step through retrieval and generation with the original chunk set so we can see whether the bug was upstream or in the prompt itself.
If you want a deeper walkthrough, the DevDigest YouTube build of a better RAG pipeline goes through the same architecture end to end with live debugging.
A working RAG system is mostly chunking, retrieval tuning, prompt discipline, and operational hygiene. Claude is excellent at the synthesis step. The job is to feed it the right context and verify what comes out. Get those pieces right and the rest is plumbing.