115 terms and counting. Written for developers building with AI, not researchers reading papers.
The ability of an AI agent to retain information across turns, sessions, or tasks. Short-term memory lives in the current context window. Long-term memory persists across sessions by writing facts to files, databases, or vector stores. Claude Code uses auto memory to remember project details, debugging insights, and user preferences without being told to save them explicitly.
A multi-agent pattern where many lightweight agents work on sub-tasks simultaneously without a central orchestrator. Each agent operates independently on its assigned piece, and results are aggregated when all agents complete. Swarms trade central coordination for parallelism, making them effective for tasks that decompose into independent units like testing multiple files, researching multiple topics, or auditing different parts of a codebase.
A development workflow where AI agents write, test, and iterate on code autonomously. Instead of suggesting completions, the agent plans multi-step tasks, runs commands, reads errors, and fixes them in a loop until the job is done.
The core cycle that powers AI agents: observe the current state, think about what to do next, act by calling a tool or generating output, then repeat. Each iteration feeds the result of the last action back into the model's context so it can decide the next step. This observe-think-act pattern is what separates agents from single-shot prompt-response interactions.
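A minimal sketch of the loop in TypeScript, where `callModel` and `runTool` are hypothetical stand-ins for your provider SDK and tool runtime:

```ts
type Step =
  | { type: "tool_call"; name: string; args: unknown }
  | { type: "final"; text: string };

// Hypothetical stand-ins for a provider SDK and a tool runtime.
declare function callModel(history: string[]): Promise<Step>;
declare function runTool(name: string, args: unknown): Promise<string>;

async function agentLoop(task: string): Promise<string> {
  const history: string[] = [task]; // observe: start from the task
  while (true) {
    const step = await callModel(history);       // think: model picks the next step
    if (step.type === "final") return step.text; // done
    const result = await runTool(step.name, step.args);     // act: call the tool
    history.push(`Tool ${step.name} returned: ${result}`);  // feed the result back
  }
}
```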
Software that uses a large language model to reason about goals, break them into steps, call external tools, and act on results without human intervention at each step. Agents differ from chatbots because they maintain a plan and execute it across multiple turns.
A secret string that authenticates your application with an external service. AI providers like OpenAI, Anthropic, and Google issue API keys so your code can send prompts and receive completions programmatically.
The core technique inside transformers that lets a model weigh the relevance of every token relative to every other token in a sequence. Instead of processing input left-to-right, attention computes relevance scores across all positions at once, allowing the model to focus on the most important parts of the context regardless of distance.
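The standard scaled dot-product attention from the original Transformer paper, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```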
A feature in Claude Code that automatically remembers important information across sessions without explicit user action. When Claude Code discovers something useful during a session - a build command, a debugging insight, a project convention - it can store it in memory files that load automatically in future sessions. Auto memory reduces the need to repeat context and makes the agent smarter over time for your specific project.
Running a coding agent in a mode where it completes entire features without pausing for human approval. The agent reads the codebase, writes code, runs tests, and commits - all from a single prompt.
A flow-control mechanism that prevents an agent pipeline from overwhelming downstream systems. When a fast-producing agent generates work faster than a slow-consuming agent can process it, backpressure signals the producer to slow down or queue work. In multi-agent systems, backpressure prevents token budget exhaustion, API rate limit violations, and runaway costs by throttling agent activity based on available capacity.
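A minimal sketch of one way to implement this in TypeScript - a counting semaphore (the class here is illustrative, not from any framework) caps how many agent tasks run at once, so producers wait when capacity is exhausted:

```ts
class Semaphore {
  private waiters: (() => void)[] = [];
  constructor(private slots: number) {}
  async acquire(): Promise<void> {
    if (this.slots > 0) { this.slots--; return; }
    // No capacity: block until a running task releases its slot.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next();      // hand the slot to the next waiter
    else this.slots++;     // or return it to the pool
  }
}

const limiter = new Semaphore(4); // at most 4 agent calls in flight

async function runWithBackpressure(task: () => Promise<void>) {
  await limiter.acquire();
  try { await task(); } finally { limiter.release(); }
}
```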
Processing multiple prompts or inputs through a model simultaneously rather than one at a time. Batch inference reduces per-request overhead and can significantly lower costs with providers that offer batch pricing (often 50% cheaper than real-time). It is the standard approach for processing large datasets, running evaluations, and executing offline workloads where real-time response is not required.
Standardized tests that measure model performance on tasks like code generation, math, reasoning, and instruction following. Common benchmarks include SWE-bench, HumanEval, MMLU, and GPQA. They help developers compare models, but real-world performance often differs from benchmark scores.
Software that compiles, bundles, or transforms source code into production-ready output. In the AI dev space, build tools like Turbopack, Vite, and esbuild handle TypeScript compilation, module bundling, and hot reload for the frameworks that power AI applications.
The process of splitting large documents into smaller, overlapping segments for embedding and retrieval in RAG systems. Chunk size and overlap strategy directly affect retrieval quality. Too large and you lose precision. Too small and you lose context. Common strategies include fixed-size chunks (500-1000 tokens), sentence-based splitting, and recursive character splitting that respects document structure like headings and paragraphs.
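A minimal fixed-size chunker sketch in TypeScript - sizes here are in characters for simplicity, though production chunkers usually count tokens:

```ts
// Split text into overlapping chunks. With size 1000 and overlap 200,
// each chunk starts 800 characters after the previous one.
function chunk(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}
```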
Anthropic's highest-tier subscription plan for Claude, available at $100/month or $200/month. Max provides significantly higher usage limits than Pro - more messages, more Claude Code usage, and access to the most capable Claude models. The $200/month tier is the standard choice for professional developers using Claude Code heavily, as it removes most practical usage limits for coding workflows.
A prompting technique where the model is asked to show its step-by-step reasoning before arriving at a final answer. CoT improves accuracy on math, logic, and coding tasks by forcing the model to decompose problems rather than jumping to conclusions. Reasoning models like o1 and o3 use chain-of-thought internally as part of their training.
A markdown file placed in your project root that configures Claude Code's behavior. It defines project rules, coding conventions, file structure, and custom instructions that persist across sessions - acting as a project-specific system prompt for your AI coding agent.
Anthropic's official CLI tool for agentic coding. It runs in your terminal, reads your codebase, edits files, runs commands, and iterates on tasks autonomously. It uses CLAUDE.md for project config and supports sub-agents, hooks, and MCP integrations.
A text-based interface for interacting with software by typing commands. Many AI coding tools - Claude Code, Gemini CLI, Codex - run as CLIs because they integrate directly into the developer's terminal workflow alongside git, npm, and other standard tools.
A command in Claude Code that compresses the current conversation history to free up context window space. When a session gets long, the context window fills up and the agent may start losing track of earlier information. Running /compact summarizes the conversation so far into a condensed form, preserving key decisions and state while reclaiming tokens for new work. It is essential for long coding sessions.
An alignment technique developed by Anthropic where an AI model is trained to follow a set of principles (a constitution) that guide its behavior. Instead of relying solely on human feedback for every edge case, the model uses its own reasoning to evaluate and revise responses against the stated principles. Constitutional AI is a core part of how Claude is trained to be helpful, harmless, and honest.
The discipline of designing what information goes into a model's context window and how it is structured. Context engineering goes beyond prompt engineering by managing system prompts, retrieved documents, tool results, conversation history, and memory to give the model exactly the right information at the right time. It is the difference between a model that kind of works and one that works reliably.
NVIDIA's parallel computing platform that lets software run computations on NVIDIA GPUs. CUDA is the foundation of GPU-accelerated AI inference and training. When you run a local model with Ollama or LM Studio and it uses your NVIDIA GPU, CUDA is doing the heavy lifting. AMD GPUs use ROCm as an alternative, and Apple Silicon uses Metal.
The maximum amount of text (measured in tokens) that a model can process in a single request. Larger context windows let agents read more code at once. Modern models range from 128K to over 1M tokens, but effective use of context still matters more than raw size.
A sequence of automated steps that move and transform data from source to destination. In AI applications, data pipelines handle document ingestion, chunking, embedding generation, and vector store population for RAG systems. A well-built pipeline keeps your knowledge base current by processing new documents as they arrive, re-embedding when models change, and cleaning stale data.
An AI agent that can perform extended, multi-step research or coding tasks by spawning sub-tasks, searching the web, reading documents, and synthesizing findings. Deep agents run for minutes or hours rather than seconds, tackling problems too complex for a single prompt-response cycle.
A multi-agent architecture where a manager agent breaks a task into sub-tasks and assigns each one to a specialized worker agent. The manager coordinates results and handles failures. This pattern is used in Claude Code's sub-agent system (via the Task tool), CrewAI's hierarchical process, and LangGraph's orchestrator nodes. It is the most common way to parallelize agent work.
A class of generative models that learn to create data by reversing a gradual noising process. During training, the model learns to remove noise from corrupted data step by step. During generation, it starts from pure random noise and iteratively denoises it into coherent output. Diffusion models power image generators like Stable Diffusion, DALL-E, and Midjourney.
A training technique that aligns language models with human preferences without needing a separate reward model. Unlike RLHF, which trains a reward model first and then optimizes against it, DPO directly optimizes the language model using pairs of preferred and dispreferred outputs. DPO is simpler to implement, more stable to train, and has become a popular alternative to RLHF for model alignment.
A training technique where a smaller "student" model learns to replicate the behavior of a larger "teacher" model. The student is trained on the teacher's outputs rather than raw data, inheriting much of the larger model's capability at a fraction of the size and inference cost. Distillation is how many fast, lightweight models are created from frontier models.
A Claude feature that gives the model a dedicated thinking phase before producing its visible response. During extended thinking, the model works through complex problems step by step internally, similar to chain-of-thought reasoning but happening behind the scenes. Extended thinking improves performance on math, code, and multi-step reasoning tasks. It uses additional tokens but produces more reliable answers for hard problems.
Numerical vector representations of text that capture semantic meaning. Similar concepts have vectors that are close together in high-dimensional space. Embeddings power semantic search, RAG systems, and recommendation engines by letting you find related content without exact keyword matches.
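A sketch of the core similarity measure - cosine similarity between two embedding vectors, where values near 1 mean similar meaning and values near 0 mean unrelated:

```ts
// Cosine similarity: the dot product of two vectors divided by
// the product of their magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```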
Serverless functions that run on CDN nodes close to the user rather than in a central data center. Platforms like Vercel and Cloudflare Workers use edge functions to reduce latency for API routes, middleware, and AI inference endpoints.
The systematic process of testing an AI model's performance against a defined set of inputs and expected outputs. Evals measure whether a model is actually good at the task you care about, not just at public benchmarks. They can be automated (comparing outputs to ground truth) or human-judged (rating quality on a rubric). Running evals before and after changes is how teams catch regressions and validate improvements.
An agent orchestration pattern where a coordinator splits work into parallel sub-tasks (fan-out), distributes them to multiple agents, waits for all results, and then combines them into a single output (fan-in). This pattern maximizes throughput for tasks with independent sub-problems. Claude Code uses fan-out/fan-in when spawning sub-agents for parallel research, coding, and testing tasks.
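A minimal sketch in TypeScript, where `runAgent` is a hypothetical stand-in for spawning a sub-agent on one sub-task:

```ts
// Hypothetical: spawn one sub-agent on one sub-task.
declare function runAgent(subtask: string): Promise<string>;

async function fanOutFanIn(subtasks: string[]): Promise<string> {
  const results = await Promise.all(subtasks.map(runAgent)); // fan-out
  return results.join("\n---\n");                            // fan-in: combine
}
```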
A prompting technique where you include a small number of input-output examples in the prompt to show the model the pattern you want it to follow. Instead of describing the format in words, you demonstrate it with 2-5 examples. Few-shot learning reliably improves output quality for classification, formatting, and domain-specific tasks without any model training.
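A sketch of what this looks like in practice - two demonstrations establish the label format before the real input:

```ts
const prompt = `Classify the sentiment as positive or negative.

Review: "Setup took five minutes and everything just worked."
Sentiment: positive

Review: "Crashed twice before I finished onboarding."
Sentiment: negative

Review: "The docs answered every question I had."
Sentiment:`;
```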
The process of training a pre-existing model on a custom dataset to specialize its behavior. Fine-tuning adjusts model weights for specific tasks - like classifying support tickets or generating code in a particular framework - without training from scratch.
A model capability where the LLM outputs structured JSON describing which function to call and with what arguments, rather than plain text. This lets AI applications reliably trigger actions like database queries, API calls, or tool use based on natural language input.
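A sketch of the round trip - the exact request shape varies by provider, but most follow this JSON Schema style:

```ts
// Tool definitions sent alongside the prompt.
const tools = [{
  name: "get_weather",
  description: "Get the current weather for a city",
  input_schema: {
    type: "object",
    properties: { city: { type: "string" } },
    required: ["city"],
  },
}];

// Instead of prose, the model returns structured JSON like:
// { "name": "get_weather", "arguments": { "city": "Berlin" } }
// Your code validates the arguments, calls the real function,
// and sends the result back to the model.
```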
A binary file format for storing quantized language models, designed for efficient local inference with llama.cpp and tools built on it. GGUF files contain model weights, tokenizer data, and metadata in a single file. Both Ollama and LM Studio use GGUF as their primary model format. When you download a model from Hugging Face for local use, you are typically downloading a GGUF file.
AI systems that create new content - text, images, code, audio, video - rather than just classifying or analyzing existing data. LLMs like Claude and GPT are generative models trained to predict and produce sequences of tokens based on input prompts.
The practice of structuring web content so AI models cite and surface it in their responses. While SEO targets search engine rankings, GEO targets AI-generated answers by using clear definitions, structured data (JSON-LD), and authoritative formatting that models can extract.
Connecting a model's responses to verified, external data sources rather than relying solely on its training data. Grounding techniques include RAG, tool use, and web search - they reduce hallucinations by giving the model facts to reference instead of generating from memory alone.
Safety constraints and validation layers applied to AI model inputs and outputs. Guardrails can block harmful content, enforce output formats, prevent prompt injection, filter sensitive data, and keep responses on-topic. They are typically implemented as middleware that wraps model calls rather than modifications to the model itself.
A design pattern where AI systems include explicit checkpoints for human review, approval, or correction before proceeding. Instead of running fully autonomously, the agent pauses at critical decision points and presents its plan or output for human validation. This pattern is essential for high-stakes tasks like production deployments, financial transactions, and content publishing where AI errors have real consequences.
A retrieval strategy that combines keyword-based search (BM25, TF-IDF) with semantic vector search (embeddings) to get the best of both approaches. Keyword search excels at exact matches and rare terms. Semantic search excels at finding conceptually related content. Hybrid search runs both in parallel and merges the results using reciprocal rank fusion or similar techniques. It consistently outperforms either approach alone in RAG systems.
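A minimal reciprocal rank fusion sketch in TypeScript - k = 60 is the conventional smoothing constant from the original RRF paper:

```ts
// Merge ranked result lists (e.g. one from BM25, one from vector
// search) by summing 1 / (k + rank) for each document.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}
```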
When a model generates confident-sounding information that is factually incorrect or fabricated. Hallucinations happen because LLMs predict plausible-sounding text, not verified facts. Techniques like RAG, grounding, and structured output help reduce but do not eliminate hallucinations.
User-defined shell commands that run automatically at specific points in the Claude Code lifecycle - before a tool executes, after a tool completes, or when a notification fires. Hooks let you enforce project rules, run linters, or trigger custom workflows without modifying the agent itself.
The ability of a language model to learn new tasks from examples or instructions provided in the prompt, without any weight updates or training. When you include examples in a prompt and the model follows the pattern, that is in-context learning. It is why few-shot prompting works and why well-structured context dramatically improves model performance. The quality of in-context learning scales with model size and context window length.
A software application that combines a code editor, debugger, terminal, and tooling into one workspace. VS Code, Cursor, and Zed are popular IDEs. AI coding assistants increasingly integrate into IDEs or replace them entirely with CLI-based workflows.
The date after which a model has no training data. Information published after the knowledge cutoff is invisible to the model unless provided through context (RAG, tool use, or web search). Knowing a model's cutoff date helps you decide when to supplement it with retrieved information. For example, asking a model about events after its cutoff without grounding will likely produce hallucinated or outdated answers.
A structured repository of information that an AI system can query to answer questions or provide context. In AI applications, knowledge bases are often backed by vector databases and used in RAG pipelines, letting models access up-to-date, domain-specific facts that were not part of their training data.
The compressed, high-dimensional representation that a neural network learns internally. Each point in latent space encodes a meaningful combination of features from the training data. Navigating latent space is how generative models interpolate between concepts, and it is why models can produce novel outputs that blend characteristics of things they have seen before.
A neural network trained on massive text datasets that can generate, summarize, translate, and reason about language. Models like Claude, GPT, Gemini, and Llama are LLMs. They power chatbots, coding agents, search tools, and most modern AI applications.
A parameter-efficient fine-tuning method that trains a small set of adapter weights instead of modifying the full model. LoRA makes fine-tuning practical on consumer hardware and is widely used in the open-source model community for creating specialized model variants.
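The core LoRA update from the original paper: the frozen pretrained matrix W is augmented with a trainable low-rank product BA, scaled by alpha over r:

```latex
W' = W + \frac{\alpha}{r}\,BA,\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```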
An application that connects to MCP servers to use the tools, resources, and prompts they expose. Claude Code, Cursor, Windsurf, and the Claude desktop app are all MCP clients. A client discovers what capabilities a server offers, presents them to the AI model, and routes the model's tool calls to the appropriate server. A single client can connect to multiple MCP servers simultaneously.
An open protocol created by Anthropic that standardizes how AI applications connect to external data sources and tools. MCP servers expose resources, tools, and prompts through a common interface so any MCP-compatible client can use them without custom integration code.
A read-only data source exposed by an MCP server. Resources provide context to the AI model - things like configuration files, database records, documentation, or API state. Unlike tools (which perform actions), resources are passive data that the model reads for information. Resources are identified by URIs and can return text, JSON, or binary content.
A program that exposes tools, resources, and prompt templates to AI clients through the Model Context Protocol. MCP servers can provide access to databases, APIs, file systems, deployment pipelines, or any external service. They communicate with clients over stdio (spawned as a child process) or HTTP/SSE (running as a web service). Anyone can build and publish an MCP server.
A function exposed by an MCP server that an AI agent can call to perform an action. Tools have a name, description, input schema (defined with Zod or JSON Schema), and a handler function. When an AI model decides to use a tool, the MCP client validates the inputs, calls the server, and returns the result to the model. Tools are the most commonly used MCP capability.
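A minimal tool sketch using the official TypeScript SDK - method names have shifted across SDK versions (newer releases use registerTool), so treat this as a shape rather than the definitive API:

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "demo-server", version: "1.0.0" });

// Register a tool: name, description, input schema, handler.
server.tool(
  "add",
  "Add two numbers",
  { a: z.number(), b: z.number() },
  async ({ a, b }) => ({
    content: [{ type: "text", text: String(a + b) }],
  })
);

// Serve over stdio so a local MCP client can spawn this process.
await server.connect(new StdioServerTransport());
```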
The communication layer between an MCP client and server. The two standard transports are stdio (the client spawns the server as a child process and communicates via stdin/stdout) and HTTP with Server-Sent Events (the server runs as a web service). Stdio is simpler and used for local development. HTTP/SSE is used for remote servers, shared team infrastructure, and production deployments.
A model architecture that routes each input to a small subset of specialized sub-networks ("experts") rather than activating the entire model. A gating network decides which experts handle each token, so the model can have a massive total parameter count while only using a fraction of them per inference pass. MoE powers models like Mixtral and, reportedly, GPT-4, delivering strong performance at lower compute cost than dense models of equivalent size.
Architectures where multiple AI agents collaborate on a task, each handling a specialized role. One agent might research while another writes code and a third reviews it. Multi-agent patterns include orchestrator-worker, pipeline, and swarm topologies.
A phenomenon where AI models trained on AI-generated data progressively lose quality and diversity over generations. If Model A generates training data for Model B, and Model B generates training data for Model C, each generation drifts further from the original data distribution. Rare patterns and minority viewpoints disappear first. Model collapse is a growing concern as AI-generated content becomes a larger share of internet data, and it is one reason human-curated training data remains valuable.
An extension of RAG that retrieves and processes not just text but also images, tables, code snippets, diagrams, and other non-text content. A multimodal RAG system might embed images alongside text, retrieve relevant diagrams when answering questions about architecture, or extract data from tables in PDF documents. As models become more capable of understanding multiple modalities, multimodal RAG closes the gap between what a model can process and what a knowledge base contains.
AI models that can process and generate more than one type of data - text, images, audio, video, or code. A multi-modal model can analyze a screenshot, read the text in it, and generate code that reproduces the UI, all in a single interaction.
A computing architecture loosely inspired by biological neurons, made up of layers of interconnected nodes that transform input data through learned weights and activation functions. Neural networks are the foundation of modern AI. Stacking many layers creates "deep" neural networks, which is where the term deep learning comes from.
The field of AI focused on enabling computers to understand, interpret, and generate human language. NLP covers everything from tokenization and sentiment analysis to machine translation and conversational AI. LLMs represent the current state of the art in NLP, but the field also includes older techniques like TF-IDF, named entity recognition, and dependency parsing.
A multi-agent architecture where a central orchestrator agent receives a task, decomposes it into sub-tasks, assigns each to a specialized worker agent, monitors progress, and synthesizes results. The orchestrator handles planning and coordination while workers handle execution. This is the most structured multi-agent pattern and is used by Claude Code (parent agent + sub-agents), CrewAI (hierarchical process), and LangGraph (orchestrator node).
AI models released with publicly available weights that anyone can download, run, fine-tune, and deploy. Models like Llama, Qwen, Mistral, and DeepSeek offer alternatives to closed APIs, enabling local inference, customization, and full control over your AI stack.
The coordination layer that manages how AI agents, tools, and data sources work together in a pipeline. Orchestration handles routing prompts to the right model, managing context across steps, retrying failures, and combining results from parallel sub-tasks.
The approval system that controls what actions Claude Code can take on your system. By default, Claude Code asks for permission before writing files, running commands, or accessing the network. You can configure permission levels from strict (every action requires approval) to permissive (safe operations are auto-approved) to autonomous (all operations run without prompting). Settings files and hooks provide fine-grained control over which specific actions require approval.
A metric that measures how well a language model predicts a sequence of tokens. Lower perplexity means the model is less "surprised" by the text and assigns higher probability to the correct next tokens. Perplexity is commonly used to compare language models during training and evaluation, though it does not always correlate perfectly with real-world task performance.
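For a sequence of N tokens, perplexity is the exponentiated average negative log-likelihood:

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)
```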
The phase in an agent's workflow where it analyzes a task, breaks it into steps, and determines the order of operations before taking any action. Good planning reduces wasted tool calls and dead ends. Planning can be explicit (the agent writes out a plan before executing) or implicit (the model reasons through steps internally). Claude Code's plan mode lets users review the agent's plan before it starts making changes.
A provider feature that caches the processing of static prompt prefixes so repeated requests with the same system prompt or context pay the computation cost only once. Subsequent requests that share the cached prefix skip re-processing, resulting in lower latency and reduced costs (typically 90% cheaper for cached tokens). Prompt caching is especially valuable for RAG systems, agent loops, and any application that sends the same large context repeatedly.
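A sketch of how this looks with the Anthropic TypeScript SDK - the model id is illustrative, and the cache_control shape follows Anthropic's prompt-caching docs at the time of writing, so verify against current documentation:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const LARGE_STATIC_CONTEXT = "...your docs, schema, or codebase summary...";

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514", // illustrative model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LARGE_STATIC_CONTEXT,
      // Mark the static prefix as cacheable; later requests that share
      // this exact prefix are served from cache at a reduced token rate.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Summarize the key points." }],
});

console.log(response.content);
```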
The practice of designing and iterating on prompts to get consistent, high-quality outputs from AI models. Good prompt engineering involves clear instructions, examples (few-shot), structured output formats, and systematic testing - not guesswork.
An attack where malicious input tricks an AI model into ignoring its instructions and following attacker-supplied commands instead. Direct injection embeds instructions in user input. Indirect injection hides instructions in external data the model reads (web pages, documents, emails). Prompt injection is the most significant security risk in AI applications. Defenses include input sanitization, output filtering, privilege separation, and guardrail layers.
A reusable prompt structure exposed by an MCP server. Prompt templates define a prompt skeleton with parameters that users fill in, standardizing how AI interacts with a specific domain. For example, a code review prompt template might accept code and language as parameters and produce a structured review prompt. Templates help enforce consistency across teams and reduce the need for manual prompt crafting.
A company or service that hosts AI models and exposes them through an API. Major providers include Anthropic (Claude), OpenAI (GPT), Google (Gemini), and Meta (Llama via third parties). Frameworks like the Vercel AI SDK abstract over providers so you can switch models without rewriting code.
The process of reducing the numerical precision of a model's weights, typically from 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower. Quantization dramatically reduces model size and memory usage, making it possible to run large models on consumer GPUs and edge devices. The trade-off is a small loss in output quality, though modern quantization techniques like GPTQ and AWQ minimize this gap.
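A toy symmetric int8 quantization sketch in TypeScript - real quantizers work per-layer or per-channel with calibration, but the core idea is just rescaling:

```ts
// Map floats in [-absMax, absMax] onto integers in [-127, 127].
function quantize(weights: number[]): { q: Int8Array; scale: number } {
  const absMax = Math.max(...weights.map(Math.abs));
  const scale = absMax / 127 || 1; // avoid division by zero on all-zero input
  const q = Int8Array.from(weights, (w) => Math.round(w / scale));
  return { q, scale };
}

// Recover approximate floats; the rounding error is the quality cost.
function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, (v) => v * scale);
}
```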
An agent architecture that interleaves Reasoning and Acting in a loop. The agent reasons about what to do next, takes an action (calls a tool), observes the result, and then reasons again about the next step. This reason-act-observe cycle repeats until the task is complete. ReAct combines the strengths of chain-of-thought reasoning with tool use, and it is the foundational pattern behind most modern AI agents including Claude Code.
A pattern that improves LLM responses by retrieving relevant documents from an external knowledge base and injecting them into the prompt before generation. RAG gives the model up-to-date, domain-specific context without fine-tuning, reducing hallucinations and keeping responses grounded in real data.
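A minimal sketch of the pattern, where `retrieve` and `generate` are hypothetical stand-ins for your vector store query and model call:

```ts
// Hypothetical stand-ins for a vector store and a model call.
declare function retrieve(query: string, topK: number): Promise<string[]>;
declare function generate(prompt: string): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  const docs = await retrieve(question, 5); // find the most relevant chunks
  const prompt = `Answer using only the context below.

Context:
${docs.join("\n---\n")}

Question: ${question}`;
  return generate(prompt); // generation grounded in retrieved facts
}
```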
A technique where an AI agent evaluates its own output before returning it to the user. After generating a response, the agent prompts itself to critique the result for accuracy, completeness, and quality, then revises based on its own feedback. Reflection adds an extra inference step but significantly improves output quality. Some agents run multiple reflection cycles, progressively refining their work.
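A minimal single-cycle sketch, with `generate` as a hypothetical model call:

```ts
declare function generate(prompt: string): Promise<string>;

async function withReflection(task: string): Promise<string> {
  const draft = await generate(task);
  // Ask the model to critique its own draft...
  const critique = await generate(
    `Critique this answer for accuracy and completeness:\n${draft}`
  );
  // ...then revise the draft using that critique.
  return generate(
    `Revise the answer using the critique.\nAnswer: ${draft}\nCritique: ${critique}`
  );
}
```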
Models specifically trained or prompted to show their step-by-step thinking process before producing a final answer. Models like o1, o3, and Claude with extended thinking use chain-of-thought reasoning to tackle complex math, logic, and coding problems more reliably.
The process of finding relevant documents, passages, or data from a knowledge base in response to a query. In AI applications, retrieval typically combines embedding-based semantic search with keyword search to find the most relevant context for a model. The quality of retrieval directly determines the quality of RAG systems. Better retrieval means fewer hallucinations and more accurate, grounded responses.
A training technique that fine-tunes a model using human preference judgments. Humans rank model outputs from best to worst, and those rankings train a reward model. The language model is then optimized via reinforcement learning to produce outputs the reward model scores highly. RLHF is a key step in making raw pre-trained models helpful, harmless, and aligned with human intent.
The surrounding code and infrastructure that turns a raw language model into a useful application. A scaffold includes the prompt template, retrieval pipeline, tool definitions, output parsing, error handling, retry logic, and user interface. The model is the engine, but the scaffold is the car. Most of the engineering effort in AI applications goes into scaffolding, not model selection.
A search method that finds results based on meaning rather than exact keyword matches. Semantic search works by converting text into vector embeddings and finding the closest vectors in a database. A search for 'how to fix a broken deployment' will find documents about 'troubleshooting production rollbacks' even though they share no keywords. Semantic search is the retrieval backbone of most RAG systems.
Custom slash commands defined in markdown files that extend Claude Code's capabilities for specific workflows. Skills are stored in .claude/commands/ and can be project-specific or global. Each skill file contains instructions that Claude Code follows when the skill is invoked. Skills let teams codify common workflows - deployment procedures, code review checklists, content creation pipelines - as reusable, shareable commands.
The default communication method between MCP clients and servers. The client spawns the server as a child process and sends JSON-RPC messages over stdin/stdout. Stdio transport requires no network configuration, works entirely locally, and is the simplest way to connect an MCP server. It is used by Claude Code, Cursor, and most MCP clients for local development workflows.
Delivering model output token-by-token as it is generated rather than waiting for the full response. Streaming improves perceived latency in chat interfaces and coding agents. The Vercel AI SDK and most provider SDKs support streaming responses out of the box.
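A sketch using the Vercel AI SDK - the model id is illustrative, and the exact API shape varies by SDK version:

```ts
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const result = streamText({
  model: anthropic("claude-sonnet-4-20250514"), // illustrative model id
  prompt: "Explain streaming in one paragraph.",
});

// Print each text delta as soon as the model emits it.
for await (const delta of result.textStream) {
  process.stdout.write(delta);
}
```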
Constraining a model to respond in a specific format - typically JSON matching a defined schema. Structured outputs eliminate parsing failures and make AI responses reliable enough to pipe directly into application logic. Zod schemas are commonly used to define the expected shape.
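A sketch with the Vercel AI SDK and Zod - the model id is illustrative:

```ts
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// The schema both constrains the model's output and validates it.
const { object } = await generateObject({
  model: anthropic("claude-sonnet-4-20250514"), // illustrative model id
  schema: z.object({
    title: z.string(),
    tags: z.array(z.string()),
    sentiment: z.enum(["positive", "neutral", "negative"]),
  }),
  prompt: "Summarize this support ticket: 'App crashes on login since v2.3'",
});

// object is fully typed: { title: string; tags: string[]; sentiment: ... }
console.log(object.sentiment);
```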
Lightweight AI agents spawned by a parent agent to handle a specific sub-task in parallel. Claude Code uses sub-agents (via the Task tool) to divide work - one sub-agent researches while another writes code - then the parent synthesizes the results.
Training data generated by AI models rather than collected from real-world sources. Synthetic data is used to augment scarce datasets, create evaluation benchmarks, train specialized models, and generate diverse examples for fine-tuning. High-quality synthetic data from capable models can train smaller models to punch above their weight. The risk is model collapse if synthetic data replaces real data entirely across training generations.
A benchmark that evaluates AI coding agents on real-world software engineering tasks pulled from GitHub issues. Each task requires the agent to read a codebase, understand the bug or feature request, and produce a working patch. SWE-bench has become the standard measure for how well AI agents can do actual software development, not just isolated code generation.
Hidden instructions prepended to every conversation that define an AI model's behavior, personality, and constraints. System prompts set the rules that the model follows - like response format, tone, and what topics to avoid. CLAUDE.md files serve a similar purpose for coding agents.
The process of breaking a complex goal into smaller, manageable sub-tasks that an agent can execute individually. Good task decomposition is often the difference between an agent that succeeds and one that gets lost. Decomposition can be done by the user (explicitly listing steps), the agent (planning phase), or the framework (CrewAI tasks, LangGraph nodes). The quality of decomposition directly affects agent performance.
A parameter (typically 0 to 2) that controls how random or creative a model's output is. Low temperature (0-0.3) produces focused, deterministic responses ideal for code generation. High temperature (0.7-1.5) produces more varied, creative outputs better for brainstorming.
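Under the hood, temperature divides the logits before the softmax - a toy TypeScript illustration:

```ts
// T < 1 sharpens the distribution (more deterministic);
// T > 1 flattens it (more varied output).
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```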
The component that converts raw text into tokens (and back) for a language model. Different models use different tokenizers with different vocabularies, which is why the same text produces different token counts across models. Understanding your tokenizer matters for cost estimation, context window management, and prompt optimization. BPE (Byte Pair Encoding) is the most common tokenization algorithm used by modern LLMs.
The basic unit of text that LLMs process. A token is roughly 3-4 characters or about 0.75 words in English. Models have token limits for input (context window) and output (max completion). API pricing is typically measured per million tokens.
A pattern where an AI agent uses the output of one tool call as the input for the next, building a multi-step pipeline of actions. For example, an agent might search for a file (tool 1), read its contents (tool 2), modify the code (tool 3), and run tests (tool 4). Each step depends on the previous result. Tool chaining is the mechanism that turns single-tool agents into capable, multi-step problem solvers.
A model capability where the LLM can invoke external tools - running code, searching the web, reading files, calling APIs - as part of generating a response. Tool use turns a passive text generator into an active agent that can interact with the real world.
Two methods for controlling the randomness of model output during token generation. Top-K sampling limits the model to choosing from the K most likely next tokens. Top-P (nucleus) sampling limits the model to the smallest set of tokens whose cumulative probability exceeds P. Both work alongside temperature to balance creativity and coherence. Top-P 0.9 with temperature 0.7 is a common starting point for general text generation. Lower values produce more focused output.
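A toy nucleus (top-p) sampling sketch in TypeScript over a token probability table - real implementations operate on logits inside the inference engine:

```ts
function sampleTopP(probs: Map<string, number>, topP = 0.9): string {
  // Sort tokens by probability, descending.
  const sorted = [...probs.entries()].sort((a, b) => b[1] - a[1]);

  // Keep the smallest set whose cumulative probability exceeds topP.
  const nucleus: [string, number][] = [];
  let cumulative = 0;
  for (const [token, p] of sorted) {
    nucleus.push([token, p]);
    cumulative += p;
    if (cumulative >= topP) break;
  }

  // Renormalize and sample from the nucleus.
  let r = Math.random() * cumulative;
  for (const [token, p] of nucleus) {
    r -= p;
    if (r <= 0) return token;
  }
  return nucleus[nucleus.length - 1][0]; // float-rounding fallback
}
```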
The technique of taking a model trained on one task and adapting it for a different but related task. In AI, this usually means starting with a pre-trained language model (which learned general language understanding from massive text data) and fine-tuning it on domain-specific data. Transfer learning is why you do not need to train a model from scratch for every application. The pre-trained model already understands language; you just teach it your specific domain.
The neural network architecture behind virtually all modern large language models. Introduced in the 2017 paper "Attention Is All You Need," transformers use self-attention mechanisms to process all tokens in a sequence simultaneously rather than one at a time. This parallelism makes them vastly more efficient to train than previous architectures like RNNs and LSTMs, and is the reason LLMs were able to scale to billions of parameters.
A category of machine learning where models learn patterns from data without labeled examples or explicit correct answers. The model discovers structure on its own, such as clusters, correlations, or compressed representations. LLM pre-training is a form of unsupervised (or self-supervised) learning, where the model learns to predict the next token from massive amounts of unlabeled text.
A storage system purpose-built for saving, indexing, and querying vector embeddings at scale. Vector stores power the retrieval step in RAG pipelines by enabling fast similarity search across millions of embedded documents. Options range from hosted services (Pinecone, Weaviate) to database extensions (pgvector for PostgreSQL) to in-memory libraries (FAISS, Annoy). The choice depends on scale, latency requirements, and infrastructure preferences.
A database optimized for storing and querying high-dimensional vectors (embeddings). Vector databases like Pinecone, Weaviate, and pgvector enable fast similarity search, powering RAG pipelines, semantic search, and recommendation systems at scale.
A development approach where you describe what you want in natural language and let an AI agent handle the implementation details. The developer focuses on the overall direction and feel of the project rather than writing every line. The term was coined by Andrej Karpathy.
A technique for processing text that exceeds a model's context window by moving a fixed-size window across the input, processing each chunk, and combining the results. In coding agents, a sliding window lets the model work through large files section by section. In RAG systems, overlapping windows during chunking ensure that no information is lost at chunk boundaries. The overlap size (typically 10-20% of the window) is a key parameter.
The numerical parameters inside a neural network that are learned during training. Weights determine how the network transforms input into output. When people say a model has 70 billion parameters, they mean 70 billion weights. Releasing model weights publicly is what makes open-source AI models possible, since anyone with the weights can run inference without depending on an API provider.
A git feature that lets multiple working directories share a single repository, each checked out to a different branch. Claude Code uses worktrees to run multiple agents simultaneously without conflicts, letting them work on different features in parallel and merge results back.
A model's ability to perform a task it was not explicitly trained on, using only the instructions in the prompt with no examples. When you ask a model to classify sentiment, translate text, or summarize an article without providing any examples, that is zero-shot learning. Modern LLMs are remarkably capable zero-shot learners because their broad training lets them generalize to new tasks from instructions alone. When zero-shot performance is insufficient, adding examples (few-shot learning) usually closes the gap.
A TypeScript-first schema declaration and validation library. In AI development, Zod defines the shape of structured outputs from LLMs, validates API payloads, and ensures type safety at runtime. It is the standard schema tool in the Vercel AI SDK and most TypeScript AI stacks.
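A minimal sketch of the core workflow - define a schema once, derive the static type, and validate untrusted data at runtime (`raw` is a hypothetical JSON string):

```ts
import { z } from "zod";

const Ticket = z.object({
  id: z.number(),
  priority: z.enum(["low", "medium", "high"]),
});

type Ticket = z.infer<typeof Ticket>; // static type derived from the schema

declare const raw: string;                    // hypothetical untrusted input
const ticket = Ticket.parse(JSON.parse(raw)); // throws if the shape is wrong
```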
