Large language models know a lot, but they do not know your data. They cannot answer questions about your company's internal docs, your product's knowledge base, or anything that happened after their training cutoff. Fine-tuning is expensive and produces a frozen snapshot. RAG solves this without touching the model at all.
Retrieval Augmented Generation (RAG) is a technique where you retrieve relevant context from a knowledge base at query time, then pass that context to the LLM alongside the user's question. The model generates its response grounded in your data. No training runs. No GPU clusters. Just search and prompt construction.
This is the single most practical technique for making AI models useful with private or dynamic data. If you have ever wanted an AI that can answer questions about your docs, your codebase, or your product catalog, RAG is how you build it.
The RAG pipeline has three steps: embed, retrieve, generate. Every RAG system, from a weekend prototype to a production deployment, follows this pattern.
```
User Question
      |
      v
[1. EMBED]    Convert question to a vector embedding
      |
      v
[2. RETRIEVE] Search vector store for similar document chunks
      |
      v
[3. GENERATE] Pass retrieved chunks + question to the LLM
      |
      v
Answer (grounded in your data)
```
Before RAG can work, your documents need to be converted into vector embeddings. An embedding is a numerical representation of text, a list of numbers (typically 1024 or 1536 dimensions) that captures the semantic meaning of a passage.
You split your documents into chunks, run each chunk through an embedding model, and store the resulting vectors in a database. At query time, you embed the user's question using the same model. This gives you a vector you can compare against your stored document vectors.
```typescript
import { embed, embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

const embeddingModel = openai.embedding("text-embedding-3-small");

// Embed your documents (do this once, at ingestion time)
const chunks = splitIntoChunks(documents, { maxTokens: 512 });

const { embeddings } = await embedMany({
  model: embeddingModel,
  values: chunks.map((c) => c.text),
});

// Store chunks + embeddings in your vector database
await vectorStore.upsert(
  chunks.map((chunk, i) => ({
    id: chunk.id,
    text: chunk.text,
    embedding: embeddings[i],
    metadata: { source: chunk.source, section: chunk.section },
  }))
);
```
When a user asks a question, you embed their query and search for the most similar document chunks. This is called similarity search, and it is the core of what makes RAG work. Chunks that are semantically close to the question score high. Chunks that are unrelated score low.
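Cosine similarity, the usual scoring function behind this search, measures the angle between two vectors: 1 means the same direction, 0 means unrelated. Your vector database computes it for you, but the math fits in a few lines of plain TypeScript:

```typescript
// Cosine similarity between two vectors of the same length:
// dot product divided by the product of the vector magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]); // identical direction -> 1
cosineSimilarity([1, 0], [0, 1]); // orthogonal -> 0
```

Real embeddings have hundreds or thousands of dimensions, but the scoring works exactly the same way.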
```typescript
// Embed the user's query
const { embedding: queryEmbedding } = await embed({
  model: embeddingModel,
  value: "How do I configure authentication?",
});

// Find the top 5 most relevant chunks
const results = await vectorStore.search(queryEmbedding, {
  topK: 5,
  filter: { source: "documentation" },
});
```
The topK parameter controls how many chunks you retrieve. More chunks means more context for the model, but also more tokens and higher latency. Five to ten chunks is a good starting point for most use cases.
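To demystify what the store does with topK, here is a toy in-memory version. It assumes unit-length embeddings (OpenAI's embedding models return normalized vectors), so a dot product equals cosine similarity; production databases replace this linear scan with an approximate-nearest-neighbor index:

```typescript
type StoredChunk = { id: string; text: string; embedding: number[] };

// Dot product; equal to cosine similarity when both vectors have length 1.
function dot(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Brute-force top-K: score every chunk, sort descending, keep the best K.
function topK(query: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return chunks
    .map((c) => ({ c, score: dot(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}
```

This is a sketch for intuition only; the `topK` and `StoredChunk` names are hypothetical, not part of any library's API.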
Pass the retrieved chunks to the LLM along with the user's question. The model generates a response grounded in the provided context instead of relying solely on its training data.
```typescript
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const context = results
  .map((r) => `[Source: ${r.metadata.source}]\n${r.text}`)
  .join("\n\n");

const { text } = await generateText({
  model: anthropic("claude-sonnet-4-6"),
  system: `You are a helpful assistant. Answer questions based on the provided context.
If the context does not contain enough information to answer, say so.
Do not make up information that is not in the context.`,
  prompt: `Context:\n${context}\n\nQuestion: How do I configure authentication?`,
});
```
That is the entire pipeline. Embed your docs, search for relevant chunks, feed them to the model. Everything else in RAG is an optimization on top of these three steps.
Three approaches exist for getting AI models to use specific knowledge. Each has different tradeoffs.
| Approach | Best For | Cost | Latency | Data Freshness |
|---|---|---|---|---|
| RAG | Dynamic knowledge bases, large document sets, data that changes | Low | Medium | Real-time |
| Fine-tuning | Changing model behavior, style, or domain-specific reasoning | High | Low | Frozen snapshot |
| Prompt engineering | Small context, task instructions, formatting rules | Free | Low | Per-request |
Use RAG when you have a large corpus of documents that changes over time. Product docs, knowledge bases, legal documents, research papers. The data is too large to fit in a single prompt, and it updates frequently enough that fine-tuning would be stale within weeks.
Use fine-tuning when you need the model to behave differently, not just know different things. If you want it to write in a specific voice, follow domain conventions, or handle a specialized format, fine-tuning changes the model itself. But it is expensive, slow, and produces a snapshot that does not update.
Use prompt engineering when the context fits in the prompt. If your entire knowledge base is a few pages of instructions, just put it in the system prompt. No infrastructure needed.
In practice, most production systems combine all three. Prompt engineering for behavior instructions, RAG for dynamic knowledge, and occasionally fine-tuning for domain adaptation.
Here is a production-ready RAG implementation using the Vercel AI SDK with a vector store. This example uses Supabase with pgvector, but the pattern works with any vector database.
```typescript
import { generateText, embed, embedMany } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_KEY!
);

const embeddingModel = openai.embedding("text-embedding-3-small");

// --- Ingestion: run once when documents change ---
async function ingestDocuments(
  docs: { id: string; text: string; source: string }[]
) {
  const chunks = docs.flatMap((doc) =>
    splitIntoChunks(doc.text, { maxTokens: 512 }).map((chunk, i) => ({
      id: `${doc.id}-${i}`,
      text: chunk,
      source: doc.source,
    }))
  );

  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: chunks.map((c) => c.text),
  });

  const rows = chunks.map((chunk, i) => ({
    id: chunk.id,
    content: chunk.text,
    embedding: embeddings[i],
    metadata: { source: chunk.source },
  }));

  await supabase.from("documents").upsert(rows);
}

// --- Query: run on every user request ---
async function queryRAG(question: string): Promise<string> {
  // 1. Embed the question
  const { embedding } = await embed({
    model: embeddingModel,
    value: question,
  });

  // 2. Retrieve relevant chunks
  const { data: chunks } = await supabase.rpc("match_documents", {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: 5,
  });

  if (!chunks || chunks.length === 0) {
    return "I could not find any relevant information to answer that question.";
  }

  // 3. Generate a grounded response
  const context = chunks
    .map((c: any) => c.content)
    .join("\n\n---\n\n");

  const { text } = await generateText({
    model: anthropic("claude-sonnet-4-6"),
    system: `Answer the user's question based only on the provided context.
Cite which section the information comes from when possible.
If the context does not contain the answer, say so clearly.`,
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });

  return text;
}
```
The match_documents function is a Postgres function that performs cosine similarity search using pgvector. You create it once in your database:
```sql
create or replace function match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int
) returns table (
  id text,
  content text,
  metadata jsonb,
  similarity float
) language sql stable as $$
  select
    id, content, metadata,
    1 - (embedding <=> query_embedding) as similarity
  from documents
  where 1 - (embedding <=> query_embedding) > match_threshold
  order by embedding <=> query_embedding
  limit match_count;
$$;
```
Your vector database is the retrieval engine. The choice matters less than you think for getting started, but it matters a lot at scale.
Supabase pgvector is the easiest path if you already use Postgres. Add the pgvector extension, create an embedding column, and query with cosine similarity. No new infrastructure. Works well up to a few million vectors.
Pinecone is a managed vector database built for this use case. Handles billions of vectors, supports metadata filtering, and scales without you thinking about it. Good for production workloads where you do not want to manage infrastructure.
Convex vector search integrates vector search directly into your Convex backend. If you are already using Convex for your app, this keeps everything in one place. Define a vector index on a table and query it with a single function call.
Weaviate is an open-source vector database with built-in vectorization. You can send it raw text and it handles the embedding step for you. Useful if you want the database to manage the embedding pipeline.
For most TypeScript projects, start with pgvector or Convex. You can always migrate to a dedicated vector database later if you outgrow it.
RAG gets more powerful when you combine it with AI agents. Instead of a fixed retrieve-then-generate pipeline, you give the agent a search tool and let it decide when and how to use it.
```typescript
import { generateText, tool, embed } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Reuses the `supabase` client from the example above.
const { text } = await generateText({
  model: anthropic("claude-sonnet-4-6"),
  maxSteps: 5,
  system:
    "You are a helpful assistant with access to a knowledge base. Search it when you need information to answer the user's question.",
  tools: {
    searchKnowledgeBase: tool({
      description: "Search the knowledge base for relevant information",
      parameters: z.object({
        query: z.string().describe("Search query"),
        filter: z
          .enum(["docs", "api-reference", "tutorials", "all"])
          .describe("Category to search in")
          .default("all"),
      }),
      execute: async ({ query, filter }) => {
        const { embedding } = await embed({
          model: openai.embedding("text-embedding-3-small"),
          value: query,
        });
        const { data } = await supabase.rpc("match_documents", {
          query_embedding: embedding,
          match_threshold: 0.7,
          match_count: 5,
        });
        return data?.map((d: any) => d.content) ?? [];
      },
    }),
  },
  prompt: userQuestion,
});
```
With maxSteps: 5, the model can search multiple times with different queries, refine its search based on initial results, and then synthesize a comprehensive answer. This is significantly more capable than a single-shot retrieve-and-generate pipeline because the model can reason about what information it still needs.
RAG looks simple in diagrams but has real failure modes in production. Here are the ones that bite most teams.
If your chunks are too large, the retrieved context contains too much noise. The relevant sentence gets buried in paragraphs of unrelated text, and the model either misses it or gets confused by contradictory information. If chunks are too small, they lack the surrounding context needed to be useful. A sentence fragment about "the configuration file" is meaningless without knowing which configuration file.
Start with 300 to 500 tokens per chunk. Overlap consecutive chunks by 50 to 100 tokens so you do not split a concept across two chunks. Adjust based on your data. Technical documentation with dense information benefits from smaller chunks. Narrative content works better with larger ones.
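The overlap strategy is easy to sketch. The article's `splitIntoChunks` helper is not shown, so this simplified version uses words as a stand-in for tokens (a real implementation would count tokens with the embedding model's tokenizer):

```typescript
// Split text into chunks of `size` words, with each chunk overlapping
// the previous one by `overlap` words so a concept that spans a chunk
// boundary survives intact in at least one chunk. Requires size > overlap.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break;
  }
  return chunks;
}
```

With `size: 4` and `overlap: 2`, every chunk repeats the last two words of its predecessor, which is exactly the 50-to-100-token overlap described above at word scale.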
Similarity search alone is not enough. If you have documentation for multiple products or API versions, a query about "authentication" will return chunks from every product. Attach metadata to every chunk: product, version, date, section. Filter before or during similarity search.
```typescript
const results = await vectorStore.search(embedding, {
  topK: 5,
  filter: {
    product: "my-api",
    version: "v3",
  },
});
```
This is the difference between a RAG system that kind of works and one that gives accurate answers.
When no chunks pass the similarity threshold, your system needs to say "I do not know" instead of hallucinating. Set a minimum similarity score and handle the case where nothing matches.
```typescript
const relevantChunks = results.filter((r) => r.similarity > 0.7);

if (relevantChunks.length === 0) {
  return "I could not find relevant information to answer that question. Try rephrasing or ask about a different topic.";
}
```
Never pass an empty context to the model and hope for the best. The model will generate a plausible-sounding answer from its training data, and the user will think it came from your knowledge base.
Cosine similarity measures how close two vectors are in embedding space. It does not measure whether a chunk actually answers the question. A chunk about "how to configure authentication in Django" will score high for "how to configure authentication in Express" because the embeddings are semantically close. But the content is wrong for the user's stack.
Combine similarity search with keyword matching (hybrid search), metadata filtering, and a reranking step if accuracy matters. Some vector databases support hybrid search natively. For others, you can implement it in your retrieval function by merging results from vector search and full-text search.
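One common way to merge the two result lists is reciprocal rank fusion (RRF): each document scores 1/(k + rank) for every ranked list it appears in, so documents that both retrievers rank well rise to the top. A minimal sketch (the `rrf` function name is illustrative, not a library API):

```typescript
// Reciprocal rank fusion over ranked lists of document IDs.
// k (commonly 60) dampens the advantage of top-ranked positions.
function rrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is 0-based, so the top result contributes 1 / (k + 1).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF only needs ranks, not scores, it sidesteps the problem that cosine similarities and keyword-match scores live on incomparable scales.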
If your documents change but your embeddings do not, the model answers questions using outdated information. Build an ingestion pipeline that re-embeds documents when they change. Track document versions and only re-embed modified chunks. This is unglamorous infrastructure work, but it determines whether your RAG system stays accurate over time.
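A cheap way to detect which chunks actually changed is to store a content hash next to each chunk and compare hashes on the next ingestion run. A sketch using Node's built-in crypto module (the `storedHashes` map stands in for a lookup against your database):

```typescript
import { createHash } from "node:crypto";

// Hash the chunk text; if it matches the hash stored at the last
// ingestion, the chunk is unchanged and its embedding can be reused.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Return only the chunks that are new or whose text changed.
function chunksToReembed(
  chunks: { id: string; text: string }[],
  storedHashes: Map<string, string> // id -> hash from the previous run
): { id: string; text: string }[] {
  return chunks.filter((c) => storedHashes.get(c.id) !== contentHash(c.text));
}
```

Skipping unchanged chunks keeps re-ingestion cheap enough to run on every deploy, which is what keeps the index fresh in practice.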
RAG is the foundation. Once you have the basic pipeline working, you can layer on more sophisticated techniques: reranking retrieved chunks for better precision, using hybrid search that combines vector similarity with keyword matching, or building agentic RAG where the model iteratively searches and refines its results.
For the SDK used in this guide, see the full Vercel AI SDK guide. For vector storage that integrates with a reactive backend, check out Convex. And for building autonomous agents that use RAG as one of many tools, read How to Build AI Agents in TypeScript.
Start with a small document set, 10 to 20 pages of your own docs or a project README. Get the pipeline running end to end. Then scale from there. You will learn more about RAG's tradeoffs by building a working system than by reading about architectures you will never implement.