
TL;DR
A production-grade RAG pipeline with Claude. Chunking that survives real documents, retrieval tuning that actually moves the needle, citation tracking, and the prompt caching trick that makes RAG cheap enough to ship.
If you have proprietary data and you want a model to answer questions about it, you have three options. Few-shot in the prompt. Retrieval-augmented generation. Fine-tuning.
For 90 percent of teams, the right answer is RAG. Few-shot dies the moment your knowledge base outgrows the context window. Fine-tuning is expensive, slow to iterate on, and changes the model's behavior in ways that are hard to predict and harder to undo. RAG keeps the base model unchanged, scales to arbitrary document counts, and lets you update the knowledge base by re-indexing instead of retraining.
The case for fine-tuning is narrower than people think. Fine-tune when you need a specific output format the base model resists, or when you have millions of high-quality examples and latency matters more than freshness. Everything else is RAG, and Claude is genuinely good at the synthesis step that turns retrieved chunks into a grounded answer.
We covered the conceptual basics in "what is RAG." This post is the implementation - the parts that don't show up in the marketing diagrams.
Every RAG team I have worked with underestimates chunking. Then they spend three weeks tuning their retriever before realizing the problem was upstream.
The naive approach is fixed-size chunks. Split your document every 1000 tokens. This breaks the moment a sentence, a code block, or a table spans a boundary. Your retriever returns half a thought. Claude makes up the other half. You blame the embeddings.
The right approach is semantic chunking with hierarchy. Three rules.
Respect document structure. Markdown headings, HTML sections, code fences, table boundaries. These are pre-existing semantic units. Use them as primary chunk boundaries. Fall back to paragraph breaks, then to sentence breaks, then to fixed token windows only as a last resort.
Keep parent context. Every chunk should carry metadata about what document it came from, what section, what the surrounding heading hierarchy was. When you retrieve a chunk that says "the limit is 100 requests per minute," Claude needs to know whether that was about the free tier or the enterprise tier. Stuff the heading path into a context field on each chunk.
Overlap, but small. A 10-15 percent token overlap between adjacent chunks helps with the case where the answer straddles a boundary. More overlap wastes embedding budget. Less overlap loses answers.
interface Chunk {
  id: string;
  text: string;
  documentId: string;
  headingPath: string[];
  position: number;
  metadata: Record<string, string>;
}

function chunkMarkdown(doc: string, docId: string): Chunk[] {
  const sections = splitByHeadings(doc);
  const chunks: Chunk[] = [];
  let position = 0;

  for (const section of sections) {
    const tokens = estimateTokens(section.body);
    if (tokens < 1200) {
      // Section fits in one chunk: keep it whole, heading path and all.
      chunks.push({
        id: `${docId}:${position}`,
        text: section.body,
        documentId: docId,
        headingPath: section.headings,
        position: position++,
        metadata: { tokenCount: String(tokens) },
      });
    } else {
      // Oversized section: fall back to ~1000-token paragraph chunks with 150-token overlap.
      const subs = splitByParagraphs(section.body, 1000, 150);
      for (const sub of subs) {
        chunks.push({
          id: `${docId}:${position}`,
          text: sub,
          documentId: docId,
          headingPath: section.headings,
          position: position++,
          metadata: { tokenCount: String(estimateTokens(sub)) },
        });
      }
    }
  }
  return chunks;
}
This is not glamorous code. It is the code that determines whether your RAG works.
Pure vector search loses to hybrid search on real workloads. The reason is that embeddings are good at semantic similarity and bad at exact-match recall. A user query for "error code E47" needs to find the chunk that contains the literal string "E47," and to an embedding model, "E47" and every other error code look like roughly the same vector.
Hybrid search runs both: BM25 (or a similar keyword index) and a vector search, then fuses the rankings. Reciprocal Rank Fusion is the simplest fusion algorithm and it works.
interface ScoredChunk {
  chunk: Chunk;
  score: number;
}

function reciprocalRankFusion(
  rankings: ScoredChunk[][],
  k = 60
): ScoredChunk[] {
  const scores = new Map<string, { chunk: Chunk; score: number }>();
  for (const ranking of rankings) {
    ranking.forEach((item, rank) => {
      const existing = scores.get(item.chunk.id);
      // RRF ignores raw scores and fuses on rank alone: 1 / (k + rank).
      const fused = 1 / (k + rank + 1);
      if (existing) existing.score += fused;
      else scores.set(item.chunk.id, { chunk: item.chunk, score: fused });
    });
  }
  return Array.from(scores.values()).sort((a, b) => b.score - a.score);
}

async function retrieve(query: string, topK = 8): Promise<Chunk[]> {
  // Over-fetch from each index, fuse the rankings, then cut to topK.
  const [vectorHits, keywordHits] = await Promise.all([
    vectorSearch(query, topK * 2),
    keywordSearch(query, topK * 2),
  ]);
  const fused = reciprocalRankFusion([vectorHits, keywordHits]);
  return fused.slice(0, topK).map((s) => s.chunk);
}
The next lever is reranking. Pull a top-30 from the hybrid retriever, then run a cross-encoder reranker (Cohere Rerank, BGE Reranker, or a small Claude prompt) over those 30 to pick the best 8. Reranking is expensive per query but it eliminates the long tail of retrieval misses where the right answer was at rank 12 and got dropped.
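Here is a minimal sketch of the Claude-prompt variant of that reranker, reusing the client and Chunk type from above. The prompt shape and the JSON-array output format are our assumptions, not a fixed API, and a dedicated cross-encoder will be cheaper at volume.

async function rerankWithClaude(
  question: string,
  candidates: Chunk[],
  keep = 8
): Promise<Chunk[]> {
  const listing = candidates
    .map((c) => `<candidate id="${c.id}">\n${c.text.slice(0, 600)}\n</candidate>`)
    .join("\n");
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 256,
    system:
      "You rank retrieval candidates by relevance to a question. " +
      "Respond with only a JSON array of candidate ids, best first.",
    messages: [
      {
        role: "user",
        content: `Question: ${question}\n\nCandidates:\n${listing}\n\nReturn the ${keep} most relevant ids.`,
      },
    ],
  });
  const text = response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  // Fall back to the fused order if the model returns something unparseable.
  let ids: string[];
  try {
    ids = JSON.parse(text);
  } catch {
    return candidates.slice(0, keep);
  }
  const byId = new Map(candidates.map((c) => [c.id, c] as const));
  return ids
    .map((id) => byId.get(id))
    .filter((c): c is Chunk => Boolean(c))
    .slice(0, keep);
}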
Don't skip the eval. Build a set of 50 question/answer pairs from your real data. Measure recall at 10 and answer correctness on every retrieval change. Most "improvements" don't improve anything when you measure them.
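The recall-at-k half of that eval is a few lines. A sketch, assuming each test case records the chunk IDs that should have come back (the goldChunkIds field is an assumption about your eval format, not a standard):

interface EvalCase {
  question: string;
  goldChunkIds: string[];
}

async function recallAtK(cases: EvalCase[], k = 10): Promise<number> {
  let hits = 0;
  let total = 0;
  for (const c of cases) {
    const retrieved = new Set((await retrieve(c.question, k)).map((x) => x.id));
    // A gold chunk counts as a hit if it appears anywhere in the top k.
    hits += c.goldChunkIds.filter((id) => retrieved.has(id)).length;
    total += c.goldChunkIds.length;
  }
  return hits / total;
}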
Retrieval is half the problem. The generation prompt is the other half.
Three things go into a RAG prompt for Claude.
A system message that tells Claude its job: answer the user's question using only the provided sources. If the sources don't contain the answer, say so explicitly. Do not use prior knowledge. Cite sources by ID.
The retrieved chunks, formatted with clear delimiters and an explicit ID per chunk so citations work.
The user's question.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const SYSTEM_PROMPT = `You answer questions using only the provided sources.
Rules:
- Cite every claim with [source-id].
- If the sources do not contain enough information to answer, respond exactly: "The provided sources do not contain enough information to answer this question."
- Do not use general knowledge.
- Quote directly when precision matters.`;

function formatSources(chunks: Chunk[]): string {
  return chunks
    .map(
      (c) =>
        `<source id="${c.id}" path="${c.headingPath.join(" > ")}">\n${c.text}\n</source>`
    )
    .join("\n\n");
}

async function generate(question: string, chunks: Chunk[]) {
  return await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        // The stable system prompt is marked for prompt caching; see the cost section below.
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        content: `Sources:\n\n${formatSources(chunks)}\n\nQuestion: ${question}`,
      },
    ],
  });
}
The XML-style source tags matter. Claude is trained to respect them as structural delimiters, and it cites by attribute when you ask it to. The "respond exactly" instruction is also load-bearing - without it, Claude will reach for prior knowledge when sources are thin and tell you with confidence things that aren't in your corpus.
Citations in the output are necessary but not sufficient. The full audit trail in production looks like this for every query:
The user question. The retrieved chunk IDs and their relevance scores. The chunks the model actually cited in its response (parsed out of the output). The final answer text.
Log all four. When a user reports a wrong answer, you can immediately see whether the failure was retrieval (right chunks not retrieved), grounding (right chunks retrieved but model ignored them), or hallucination (model cited a chunk that doesn't say what it claimed).
function extractCitations(text: string): string[] {
  // Pull out [source-id] citations and dedupe. Assumes lowercase doc IDs;
  // widen the character class if yours differ.
  const matches = text.match(/\[([a-z0-9:.-]+)\]/g) ?? [];
  return [...new Set(matches.map((m) => m.slice(1, -1)))];
}

async function answerWithAudit(question: string) {
  const chunks = await retrieve(question);
  const response = await generate(question, chunks);
  const text = response.content
    .filter((b): b is Anthropic.TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
  return {
    question,
    retrievedIds: chunks.map((c) => c.id),
    citedIds: extractCitations(text),
    answer: text,
    usage: response.usage,
  };
}
The diff between retrievedIds and citedIds is your most useful debugging signal. If the model cited zero retrieved chunks but produced an answer, that is hallucination, full stop.
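One way to turn that diff into an automatic flag on every audit record. A sketch; the labels are ours, not a standard taxonomy:

type GroundingFlag = "grounded" | "partially-grounded" | "ungrounded";

function groundingFlag(retrievedIds: string[], citedIds: string[]): GroundingFlag {
  // No citations at all but an answer exists: treat as ungrounded.
  if (citedIds.length === 0) return "ungrounded";
  const retrieved = new Set(retrievedIds);
  const valid = citedIds.filter((id) => retrieved.has(id));
  if (valid.length === citedIds.length) return "grounded";
  // Some citations point at chunks that were never retrieved: suspect.
  return "partially-grounded";
}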
The single biggest cost optimization for production RAG is prompt caching on the system prompt and any stable context (reference docs, glossaries, persona). For a chatbot that answers from a knowledge base, the system prompt and instructions don't change between queries. Cache them.
Cached reads cost 10 percent of normal input. For a 2k-token system prompt that gets called 10,000 times a day, that is the difference between a real bill and a footnote. Note that the retrieved chunks themselves don't cache well because they vary per query, but the scaffolding around them does.
The full pattern: cache system prompt as one block, put dynamic chunks in the user message, keep the structure stable so cache prefix matching works on every call. For RAG specifically the caching savings often dwarf the embedding and vector DB costs.
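A sketch of that shape with a second stable block. The glossary text stands in for whatever reference material you keep constant across queries; marking the last stable block is enough, since the cache breakpoint covers the whole prefix before it.

async function generateCached(question: string, chunks: Chunk[], glossary: string) {
  return await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      { type: "text", text: SYSTEM_PROMPT },
      {
        type: "text",
        text: `Reference glossary:\n${glossary}`,
        // One breakpoint after the last stable block caches everything before it.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        // The retrieved chunks change per query, so they stay outside the cached prefix.
        content: `Sources:\n\n${formatSources(chunks)}\n\nQuestion: ${question}`,
      },
    ],
  });
}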
End-to-end RAG latency breaks down roughly: embedding the query (50-200ms), vector search (20-100ms), keyword search (10-50ms), reranking (200-500ms), Claude generation (1-3s for short answers, 3-10s for long). The generation dominates. Optimizing anything else first is premature.
The two highest-leverage latency wins are streaming the response (start showing tokens at 800ms instead of waiting 3s for the full answer) and parallelizing retrieval calls with Promise.all. Both are free wins.
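A minimal streaming version of generate() using the SDK's streaming helper: same request shape, tokens surfaced as they arrive instead of after the full answer.

async function generateStreaming(question: string, chunks: Chunk[]) {
  const stream = client.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [
      {
        role: "user",
        content: `Sources:\n\n${formatSources(chunks)}\n\nQuestion: ${question}`,
      },
    ],
  });
  // Forward text deltas to the client as they arrive.
  stream.on("text", (delta) => process.stdout.write(delta));
  return await stream.finalMessage();
}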
Throughput hits walls in two places. The vector DB starts choking past a certain QPS depending on which one you picked. And Anthropic rate limits cap your generation throughput. Both need monitoring. Both want exponential backoff with jitter on retries, which we wrote up in Claude API reliability.
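The retry shape we mean, in sketch form: full jitter on an exponential base, retrying only on responses we consider retryable (the status codes checked here are an assumption about your error policy).

async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number }).status;
      const retryable = status === 429 || status === 500 || status === 529;
      if (!retryable || attempt >= maxAttempts - 1) throw err;
      // Exponential backoff with full jitter: sleep somewhere in [0, 1s * 2^attempt).
      const delay = Math.random() * 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: const response = await withRetries(() => generate(question, chunks));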
Cost monitoring is the part teams skip until the bill comes. Track tokens per query (input from chunks, output from generation), retrieval cost, and per-user cost. We watch this on agent-finops for our own RAG endpoints. The p99 cost user is usually 50x the median and is usually a bot. Catch them early.
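A sketch of the per-query cost rollup. Prices are passed in rather than hard-coded because they vary by model and change over time; cache-write tokens are omitted here for brevity.

interface TokenRates {
  inputPerMTok: number; // USD per million uncached input tokens
  outputPerMTok: number; // USD per million output tokens
  cacheReadPerMTok: number; // USD per million cache-read input tokens
}

function queryCostUSD(
  usage: {
    input_tokens: number;
    output_tokens: number;
    cache_read_input_tokens?: number | null;
  },
  rates: TokenRates
): number {
  const cached = usage.cache_read_input_tokens ?? 0;
  return (
    (usage.input_tokens * rates.inputPerMTok +
      usage.output_tokens * rates.outputPerMTok +
      cached * rates.cacheReadPerMTok) /
    1_000_000
  );
}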
For replay and debugging the answers that don't look right, tracetrail lets us step through retrieval and generation with the original chunk set so we can see whether the bug was upstream or in the prompt itself.
If you want a deeper walkthrough, the DevDigest YouTube build of a better RAG pipeline goes through the same architecture end to end with live debugging.
A working RAG system is mostly chunking, retrieval tuning, prompt discipline, and operational hygiene. Claude is excellent at the synthesis step. The job is to feed it the right context and verify what comes out. Get those pieces right and the rest is plumbing.