TL;DR
AI agents that reflect on failures, accumulate skills, and get better with every session. Reflection patterns, memory architectures, skill extraction, and working code examples for building agents that actually learn.
Most AI agents are goldfish. They execute a task, succeed or fail, and forget everything. The next time they encounter the same problem, they make the same mistakes. Every session starts from zero.
Self-improving agents break this cycle. They reflect on what happened, extract lessons from successes and failures, store those lessons in accessible formats, and retrieve them when relevant. Over time, they get measurably better at their tasks.
This is not speculative AI research. These patterns are running in production today. Here is how to build agents that actually learn.
Reflection is the mechanism that converts experience into knowledge. After an agent completes a task (or fails at one), a reflection step analyzes what happened and extracts transferable lessons.
The simplest reflection pattern has three steps: execute, evaluate, extract.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface Reflection {
  task: string;
  outcome: "success" | "failure" | "partial";
  lessons: string[];
  confidence: number;
  timestamp: string;
}

async function executeWithReflection(
  task: string,
  context: string
): Promise<{ result: string; reflection: Reflection }> {
  // Step 1: Execute the task
  const executionResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 4096,
    system: context,
    messages: [{ role: "user", content: task }],
  });

  const result =
    executionResult.content[0].type === "text"
      ? executionResult.content[0].text
      : "";

  // Step 2: Evaluate the outcome
  const evaluationResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Evaluate this task execution:
Task: ${task}
Result: ${result}
Rate the outcome as "success", "failure", or "partial".
Extract 1-3 specific, actionable lessons learned.
Rate your confidence in each lesson from 0.0 to 1.0.
Respond as JSON:
{
  "outcome": "success|failure|partial",
  "lessons": ["lesson 1", "lesson 2"],
  "confidence": 0.85
}`,
      },
    ],
  });

  // Parse defensively: a malformed response should not crash the loop
  let evaluation: Partial<Reflection> = {};
  try {
    evaluation = JSON.parse(
      evaluationResult.content[0].type === "text"
        ? evaluationResult.content[0].text
        : "{}"
    );
  } catch {
    evaluation = {};
  }

  const reflection: Reflection = {
    task,
    outcome: evaluation.outcome ?? "partial",
    lessons: evaluation.lessons ?? [],
    confidence: evaluation.confidence ?? 0,
    timestamp: new Date().toISOString(),
  };

  return { result, reflection };
}
This is the foundation. The agent does work, then a separate evaluation pass examines the work and extracts lessons. The evaluation pass uses the same model but with a different prompt focused on analysis rather than execution.
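One practical hazard in the evaluation pass: models sometimes wrap JSON in markdown fences or add surrounding prose, which a bare `JSON.parse` cannot handle. A tolerant extraction helper can absorb that. This is a sketch; `safeParseJson` is a hypothetical name, not part of any SDK:

```typescript
// Hypothetical helper: tolerant JSON extraction from model output.
// Strips markdown code fences, isolates the first {...} span, and
// falls back to a caller-supplied default instead of throwing.
function safeParseJson<T>(raw: string, fallback: T): T {
  // Remove ```json ... ``` fences if present
  const cleaned = raw.replace(/```(?:json)?/g, "").trim();
  // Grab the outermost {...} span in case the model added prose around it
  const start = cleaned.indexOf("{");
  const end = cleaned.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) return fallback;
  try {
    return JSON.parse(cleaned.slice(start, end + 1)) as T;
  } catch {
    return fallback;
  }
}
```

Dropping this in wherever the examples call `JSON.parse` on model output keeps a single malformed response from aborting the reflection loop.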
A common mistake is asking the agent to execute and reflect simultaneously. "Do this task and also think about what you learned." This produces worse execution and worse reflection because the model splits its attention.
Separation works better for two reasons:
Cognitive clarity. The execution pass focuses entirely on the task. The evaluation pass focuses entirely on analysis. Neither is compromised by the other.
Different system prompts. The execution pass might have a system prompt optimized for coding ("You are an expert TypeScript developer"). The evaluation pass has a system prompt optimized for analysis ("You are a quality analyst reviewing agent performance"). Specialization improves both outputs.
Reflection produces lessons. Memory stores them for retrieval. The architecture of your memory system determines whether lessons are available when they matter.
The simplest approach: store all reflections in a single JSON file, read at session start.
import { readFile, writeFile } from "fs/promises";

const MEMORY_PATH = ".agent/memory.json";

interface MemoryStore {
  reflections: Reflection[];
  skills: Skill[];
  corrections: Correction[];
}

async function loadMemory(): Promise<MemoryStore> {
  try {
    const data = await readFile(MEMORY_PATH, "utf-8");
    return JSON.parse(data);
  } catch {
    return { reflections: [], skills: [], corrections: [] };
  }
}

async function saveReflection(reflection: Reflection): Promise<void> {
  const memory = await loadMemory();
  memory.reflections.push(reflection);

  // Prune low-confidence reflections when memory gets large
  if (memory.reflections.length > 200) {
    memory.reflections = memory.reflections
      .sort((a, b) => b.confidence - a.confidence)
      .slice(0, 150);
  }

  await writeFile(MEMORY_PATH, JSON.stringify(memory, null, 2));
}

async function getRelevantMemories(task: string): Promise<string> {
  const memory = await loadMemory();

  // Simple keyword matching for retrieval
  const words = task.toLowerCase().split(/\s+/);
  const relevant = memory.reflections.filter((r) =>
    words.some(
      (w) =>
        r.task.toLowerCase().includes(w) ||
        r.lessons.some((l) => l.toLowerCase().includes(w))
    )
  );

  return relevant
    .slice(0, 10)
    .map(
      (r) =>
        `[${r.outcome}] ${r.task}: ${r.lessons.join("; ")} (confidence: ${r.confidence})`
    )
    .join("\n");
}
Flat file memory works for agents with fewer than a few hundred reflections. Beyond that, keyword matching becomes unreliable and loading the entire file wastes tokens.
When to use: Personal agents, project-specific agents, any system where the total reflection count stays under 500.
Split memory into categories so the agent loads only relevant context.
interface StructuredMemory {
  technical: {
    bugs: Reflection[];
    patterns: Reflection[];
    performance: Reflection[];
  };
  process: {
    planning: Reflection[];
    testing: Reflection[];
    deployment: Reflection[];
  };
  domain: {
    [key: string]: Reflection[];
  };
  // Index signature so categories can be addressed dynamically below
  [top: string]: { [sub: string]: Reflection[] };
}

async function categorizeAndStore(reflection: Reflection): Promise<void> {
  const memory = await loadStructuredMemory();

  // Use the model to categorize the reflection
  const categoryResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Categorize this lesson into one category.
Lesson: ${reflection.lessons.join("; ")}
Task: ${reflection.task}
Categories:
- technical/bugs: Bug fixes and error handling
- technical/patterns: Code patterns and architecture
- technical/performance: Performance optimization
- process/planning: Task planning and decomposition
- process/testing: Testing strategies
- process/deployment: Deployment and operations
- domain/{topic}: Domain-specific knowledge
Respond with just the category path, e.g., "technical/bugs"`,
      },
    ],
  });

  const category =
    categoryResult.content[0].type === "text"
      ? categoryResult.content[0].text.trim()
      : "domain/general";

  // Store in the appropriate category
  const [top, sub] = category.split("/");
  if (!memory[top]) memory[top] = {};
  if (!memory[top][sub]) memory[top][sub] = [];
  memory[top][sub].push(reflection);

  await saveStructuredMemory(memory);
}

async function retrieveByCategory(
  categories: string[]
): Promise<Reflection[]> {
  const memory = await loadStructuredMemory();
  const results: Reflection[] = [];

  for (const category of categories) {
    const [top, sub] = category.split("/");
    if (memory[top]?.[sub]) {
      results.push(...memory[top][sub]);
    }
  }

  return results.sort((a, b) => b.confidence - a.confidence).slice(0, 20);
}
Structured memory solves the relevance problem. When the agent is debugging, it loads technical/bugs. When it is deploying, it loads process/deployment. Irrelevant memories stay out of the context window.
When to use: Agents that handle diverse tasks across multiple domains. The categorization overhead is worth it when the total memory exceeds a few hundred entries.
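One failure mode worth guarding against: the categorization call trusts the model to return a well-formed category path, and an off-list response would create a stray bucket. A small normalizer is a cheap safeguard. This is a sketch; the category list mirrors the prompt above, and the `domain/general` fallback is an assumption:

```typescript
// Known category paths from the categorization prompt; domain/* is open-ended.
const KNOWN_CATEGORIES = new Set([
  "technical/bugs",
  "technical/patterns",
  "technical/performance",
  "process/planning",
  "process/testing",
  "process/deployment",
]);

// Hypothetical helper: normalize a model-returned category path,
// falling back to "domain/general" for anything unrecognized.
function normalizeCategory(raw: string): string {
  const path = raw.trim().toLowerCase();
  if (KNOWN_CATEGORIES.has(path)) return path;
  // Allow arbitrary domain/{topic} paths, as the prompt permits
  if (/^domain\/[a-z0-9-]+$/.test(path)) return path;
  return "domain/general";
}
```

Running the model's answer through `normalizeCategory` before the `split("/")` keeps the memory tree from accumulating junk branches.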
For large memory stores, use vector embeddings to find semantically relevant memories.
interface EmbeddedReflection extends Reflection {
  embedding: number[];
}

async function embedReflection(
  reflection: Reflection
): Promise<EmbeddedReflection> {
  const text = `${reflection.task} ${reflection.lessons.join(" ")}`;
  // Using a local embedding model or API
  const embedding = await getEmbedding(text);
  return { ...reflection, embedding };
}

async function findSimilar(
  query: string,
  topK: number = 5
): Promise<Reflection[]> {
  const queryEmbedding = await getEmbedding(query);
  const allMemories = await loadEmbeddedMemories();

  // Cosine similarity search
  const scored = allMemories.map((m) => ({
    reflection: m,
    score: cosineSimilarity(queryEmbedding, m.embedding),
  }));

  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.reflection);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Embedding-based retrieval finds semantically similar memories even when the exact keywords differ. A lesson about "database connection pooling" retrieves when the agent encounters "Postgres timeout errors" because the embeddings are close in semantic space.
When to use: Agents with thousands of reflections, or agents that handle tasks where keyword matching is insufficient.
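To see the ranking math in isolation, here is the same cosine-similarity scoring applied to toy three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions; the vectors and labels below are illustrative only, and `cosineSimilarity` is repeated so the snippet stands alone:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy memory store: lessons paired with made-up embedding vectors.
const memories = [
  { lesson: "pool database connections", embedding: [0.9, 0.1, 0.0] },
  { lesson: "cache API responses", embedding: [0.1, 0.9, 0.1] },
  { lesson: "retry Postgres timeouts", embedding: [0.8, 0.2, 0.1] },
];

// A query embedding that sits close to the database-related lessons.
const query = [0.85, 0.15, 0.05];

const ranked = memories
  .map((m) => ({ ...m, score: cosineSimilarity(query, m.embedding) }))
  .sort((a, b) => b.score - a.score);
// Both database lessons score near 1.0; the caching lesson scores far lower.
```

This is the mechanism behind the "Postgres timeout" example: both database lessons land near the query in vector space even though they share no exact keywords with each other.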
Reflections are raw experience. Skills are refined knowledge. The skill extraction process converts multiple related reflections into a reusable, structured skill.
interface Skill {
  name: string;
  description: string;
  steps: string[];
  constraints: string[];
  confidence: number;
  sourceReflections: number;
  lastUpdated: string;
}

async function extractSkill(
  reflections: Reflection[],
  domain: string
): Promise<Skill> {
  const reflectionText = reflections
    .map(
      (r) =>
        `Task: ${r.task}\nOutcome: ${r.outcome}\nLessons: ${r.lessons.join("; ")}`
    )
    .join("\n\n");

  const result = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: `Analyze these ${reflections.length} related experiences in the "${domain}" domain and extract a reusable skill.
${reflectionText}
Create a skill with:
1. A clear name (verb + noun, e.g., "Debug Database Connections")
2. A one-sentence description
3. Ordered steps that work reliably
4. Constraints (things to avoid, based on failures)
5. Confidence level (0.0-1.0) based on how many successes vs failures
Respond as JSON:
{
  "name": "...",
  "description": "...",
  "steps": ["step 1", "step 2"],
  "constraints": ["never do X", "always check Y"],
  "confidence": 0.85
}`,
      },
    ],
  });

  const skillData = JSON.parse(
    result.content[0].type === "text" ? result.content[0].text : "{}"
  );

  return {
    ...skillData,
    sourceReflections: reflections.length,
    lastUpdated: new Date().toISOString(),
  };
}
The skill extraction prompt does the heavy lifting: it reads multiple experiences and synthesizes them into a repeatable procedure. The constraints are particularly valuable because they encode failure modes. An agent with a skill that says "never use raw SQL for schema changes" will not make that mistake even if its base training would suggest it.
Skills are not static. As the agent accumulates more experience, skills should be updated with new steps, refined constraints, and adjusted confidence levels.
async function evolveSkill(
  existing: Skill,
  newReflections: Reflection[]
): Promise<Skill> {
  const reflectionText = newReflections
    .map(
      (r) =>
        `Task: ${r.task}\nOutcome: ${r.outcome}\nLessons: ${r.lessons.join("; ")}`
    )
    .join("\n\n");

  const result = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: `Update this existing skill based on new experiences.
Current skill:
Name: ${existing.name}
Steps: ${existing.steps.join("\n")}
Constraints: ${existing.constraints.join("\n")}
Confidence: ${existing.confidence}
Based on: ${existing.sourceReflections} experiences
New experiences:
${reflectionText}
Update the skill:
- Add new steps if the new experiences reveal missing procedures
- Add new constraints if failures reveal new pitfalls
- Remove steps that new experience shows are unnecessary
- Adjust confidence based on success/failure ratio
- Keep what works, fix what doesn't
Respond as the updated skill JSON.`,
      },
    ],
  });

  const updated = JSON.parse(
    result.content[0].type === "text" ? result.content[0].text : "{}"
  );

  return {
    ...updated,
    sourceReflections: existing.sourceReflections + newReflections.length,
    lastUpdated: new Date().toISOString(),
  };
}
Skill evolution is where the compounding effect becomes visible. A skill based on 5 reflections is rough and general. The same skill after 50 reflections is specific, battle-tested, and reliable. Each new experience either reinforces existing steps (increasing confidence) or reveals gaps (adding new steps or constraints).
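The confidence adjustment can also be computed deterministically instead of asking the model for it. One option is a Laplace-smoothed success rate, which keeps a skill from hitting 0.0 or 1.0 on thin evidence. This formula is an assumption chosen for illustration, not something the article prescribes:

```typescript
// Hypothetical confidence estimator: Laplace-smoothed success rate.
// Adding one virtual success and one virtual failure keeps early
// estimates conservative: a skill with no evidence starts at 0.5.
function skillConfidence(successes: number, failures: number): number {
  return (successes + 1) / (successes + failures + 2);
}
```

With 4 successes and 1 failure the estimate is 5/7 ≈ 0.71; with 40 and 10 it tightens to 41/52 ≈ 0.79, reflecting the larger evidence base behind the same ratio.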
Over time, an agent accumulates a library of skills that covers its common tasks. The library structure matters for retrieval.
interface SkillLibrary {
  skills: Map<string, Skill>;
  index: Map<string, string[]>; // keyword -> skill names
}

async function findApplicableSkills(
  task: string,
  library: SkillLibrary
): Promise<Skill[]> {
  const words = task.toLowerCase().split(/\s+/);
  const candidateNames = new Set<string>();

  for (const word of words) {
    const matches = library.index.get(word) || [];
    matches.forEach((name) => candidateNames.add(name));
  }

  const candidates = Array.from(candidateNames)
    .map((name) => library.skills.get(name))
    .filter((s): s is Skill => s !== undefined);

  // Sort by confidence, then by recency
  return candidates
    .sort(
      (a, b) =>
        b.confidence - a.confidence ||
        new Date(b.lastUpdated).getTime() - new Date(a.lastUpdated).getTime()
    )
    .slice(0, 3);
}
The agent consults its skill library before starting a task. If relevant skills exist, they provide a starting procedure and known constraints. If no skills exist, the agent operates from its base capabilities and generates new reflections that may seed future skills.
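The keyword index in `SkillLibrary` (keyword → skill names) has to be populated somewhere. A minimal builder might tokenize each skill's name and description; this is a sketch, and the tokenization rules and short-word filter are assumptions:

```typescript
interface IndexableSkill {
  name: string;
  description: string;
}

// Hypothetical index builder: map each significant lowercase word in a
// skill's name and description to the names of skills that mention it.
function buildIndex(skills: IndexableSkill[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const skill of skills) {
    const words = `${skill.name} ${skill.description}`
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w.length > 2); // drop very short tokens
    // Deduplicate so a skill is indexed once per word
    for (const word of new Set(words)) {
      const entry = index.get(word) ?? [];
      entry.push(skill.name);
      index.set(word, entry);
    }
  }
  return index;
}
```

Rebuilding the index whenever a skill is added or evolved keeps retrieval consistent with the library's contents.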
Corrections are the highest-signal input for self-improvement. When a human corrects an agent, that correction represents a gap between the agent's behavior and the desired behavior.
interface Correction {
  context: string;
  agentBehavior: string;
  humanCorrection: string;
  category: string;
  timestamp: string;
  applied: boolean;
}

async function processCorrection(
  context: string,
  agentBehavior: string,
  humanCorrection: string
): Promise<Correction> {
  // Categorize the correction
  const categoryResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Categorize this correction:
Agent did: ${agentBehavior}
Human corrected to: ${humanCorrection}
Categories: style, logic, architecture, security, performance, convention, other
Respond with just the category.`,
      },
    ],
  });

  const category =
    categoryResult.content[0].type === "text"
      ? categoryResult.content[0].text.trim()
      : "other";

  const correction: Correction = {
    context,
    agentBehavior,
    humanCorrection,
    category,
    timestamp: new Date().toISOString(),
    applied: false,
  };

  await storeCorrection(correction);

  // Check if we have enough corrections in this category to update a skill
  const categoryCorrections = await getCorrectionsByCategory(category);
  if (categoryCorrections.length >= 3) {
    await updateSkillFromCorrections(category, categoryCorrections);
  }

  return correction;
}
The threshold of three corrections before updating a skill prevents overreaction to a single data point. One correction might be situational. Three corrections in the same category indicate a systematic gap.
Not all corrections age equally. A correction from yesterday is more relevant than one from three months ago because the codebase, the developer's preferences, and the project conventions may have changed.
function correctionWeight(correction: Correction): number {
  const ageInDays =
    (Date.now() - new Date(correction.timestamp).getTime()) /
    (1000 * 60 * 60 * 24);
  // Half-life of 30 days
  return Math.exp(-0.693 * (ageInDays / 30));
}

async function getWeightedCorrections(
  category: string
): Promise<Correction[]> {
  const corrections = await getCorrectionsByCategory(category);
  return corrections
    .map((c) => ({ ...c, weight: correctionWeight(c) }))
    .filter((c) => c.weight > 0.1) // Discard corrections older than ~100 days
    .sort((a, b) => b.weight - a.weight);
}
The 30-day half-life means recent corrections weigh heavily while old ones fade. This prevents the agent from following outdated preferences.
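The constant 0.693 is ln 2, which is what makes 30 days a half-life. Plugging concrete ages into the same formula confirms the curve, including the roughly 100-day cutoff that the 0.1 filter implies:

```typescript
// Exponential decay with a 30-day half-life: weight = exp(-ln2 * age / 30).
function decayWeight(ageInDays: number): number {
  return Math.exp(-0.693 * (ageInDays / 30));
}

const today = decayWeight(0); // 1.0: a fresh correction carries full weight
const oneHalfLife = decayWeight(30); // ~0.5 after one half-life
const threeMonths = decayWeight(90); // ~0.125, still above the 0.1 cutoff
const fourMonths = decayWeight(120); // ~0.06, pruned by the 0.1 filter
```

Solving exp(-0.693 · t/30) = 0.1 gives t ≈ 99.7 days, which matches the "~100 days" comment in the filter.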
Here is the complete self-improving agent loop, combining execution, reflection, memory, and skill accumulation.
async function selfImprovingAgent(task: string): Promise<string> {
  // 1. Load relevant context
  const memories = await getRelevantMemories(task);
  const skills = await findApplicableSkills(task, await loadSkillLibrary());
  const corrections = await getRecentCorrections(task);

  // 2. Build enhanced context
  const context = buildContext(memories, skills, corrections);

  // 3. Execute with enhanced context
  const { result, reflection } = await executeWithReflection(task, context);

  // 4. Store the reflection
  await saveReflection(reflection);

  // 5. Check if any skills need updating
  if (reflection.lessons.length > 0) {
    const relatedReflections = await findRelatedReflections(reflection);
    if (relatedReflections.length >= 5) {
      const existingSkill = await findMatchingSkill(task);
      if (existingSkill) {
        // Persist the evolved skill; discarding the return value
        // would throw away the update
        const evolved = await evolveSkill(existingSkill, [reflection]);
        await addToSkillLibrary(evolved);
      } else {
        const newSkill = await extractSkill(relatedReflections, task);
        await addToSkillLibrary(newSkill);
      }
    }
  }

  return result;
}

function buildContext(
  memories: string,
  skills: Skill[],
  corrections: Correction[]
): string {
  let context = "You are an AI agent that learns from experience.\n\n";

  if (skills.length > 0) {
    context += "## Relevant Skills\n\n";
    for (const skill of skills) {
      context += `### ${skill.name}\n`;
      context += `${skill.description}\n`;
      context += `Steps:\n${skill.steps.map((s, i) => `${i + 1}. ${s}`).join("\n")}\n`;
      context += `Constraints:\n${skill.constraints.map((c) => `- ${c}`).join("\n")}\n\n`;
    }
  }

  if (memories) {
    context += "## Relevant Past Experiences\n\n";
    context += memories + "\n\n";
  }

  if (corrections.length > 0) {
    context += "## Recent Corrections\n\n";
    for (const c of corrections.slice(0, 5)) {
      context += `- Instead of: ${c.agentBehavior}\n Do: ${c.humanCorrection}\n`;
    }
    context += "\n";
  }

  return context;
}
The loop runs on every task execution. Over time, the agent's context becomes enriched with relevant skills, past experiences, and human corrections. Each interaction contributes to the next one.
Self-improvement claims require measurement. Here are concrete metrics to track.
Track whether the agent's first attempt at a task is accepted without correction.
interface PerformanceMetrics {
  totalTasks: number;
  firstAttemptSuccesses: number;
  correctionsReceived: number;
  averageConfidence: number;
  skillCount: number;
}

function successRate(metrics: PerformanceMetrics): number {
  return metrics.firstAttemptSuccesses / metrics.totalTasks;
}
A self-improving agent should show increasing first-attempt success rate over time. If the rate is flat, the memory and skill systems are not being retrieved effectively.
Plot corrections per task over time. A declining trend means the agent is learning from previous corrections and not repeating mistakes.
Track what percentage of tasks have applicable skills in the library. Higher coverage means fewer tasks where the agent operates from base capabilities alone.
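A single overall success rate can hide recent improvement behind a long history. A rolling-window variant makes the trend visible; this is a sketch, and the default window of 20 tasks is an assumption:

```typescript
// Hypothetical trend metric: first-attempt success rate over a sliding
// window of the most recent outcomes (true = accepted without correction).
function rollingSuccessRate(outcomes: boolean[], window: number = 20): number[] {
  const rates: number[] = [];
  for (let i = window; i <= outcomes.length; i++) {
    const slice = outcomes.slice(i - window, i);
    rates.push(slice.filter(Boolean).length / window);
  }
  return rates;
}
```

An agent that is actually learning produces a rising series; a flat series points at retrieval, since the memories exist but are not reaching the context.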
Unbounded memory eventually degrades performance. Old, low-confidence reflections add noise without value. Prune aggressively: cap the total reflection count and keep only the highest-confidence entries, as the saveReflection example above does.
When two skills give contradictory advice, the agent needs a resolution strategy. The simplest approach: prefer the skill with higher confidence and more source reflections. A skill based on 30 experiences at 0.9 confidence overrides one based on 3 experiences at 0.6 confidence.
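That preference can be encoded as a comparator. The exact weighting is a design choice; the sketch below multiplies confidence by the log of the evidence count (an assumption) so that 30 experiences at 0.9 decisively beat 3 experiences at 0.6:

```typescript
interface SkillEvidence {
  name: string;
  confidence: number;
  sourceReflections: number;
}

// Hypothetical conflict resolver: weight confidence by the log of the
// evidence count, so well-tested skills outrank lightly-tested ones.
function preferredSkill(a: SkillEvidence, b: SkillEvidence): SkillEvidence {
  const score = (s: SkillEvidence) =>
    s.confidence * Math.log2(s.sourceReflections + 1);
  return score(a) >= score(b) ? a : b;
}
```

Here the battle-tested skill scores 0.9 · log2(31) ≈ 4.5 against 0.6 · log2(4) = 1.2 for the newer one.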
Self-improvement systems must preserve human authority. If a developer explicitly overrides a skill or correction, that override takes precedence. The system should not argue with direct instructions, even if its learned behavior disagrees.
interface Override {
  skillName: string;
  originalBehavior: string;
  overrideBehavior: string;
  reason: string;
  permanent: boolean;
}
Permanent overrides modify the skill directly. Temporary overrides apply for the current session only. Both are tracked so the agent can distinguish between "the human always wants this" and "the human wanted this once."
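One way to enforce that distinction is a merge pass before a skill reaches the context builder: append each applicable override as an extra constraint, applying permanent overrides always and temporary ones only when building the current session's copy. A minimal sketch; `applyOverrides` and the constraint-appending merge rule are assumptions:

```typescript
interface Override {
  skillName: string;
  originalBehavior: string;
  overrideBehavior: string;
  reason: string;
  permanent: boolean;
}

interface SkillEntry {
  name: string;
  constraints: string[];
}

// Hypothetical merge: permanent overrides always apply; temporary ones
// apply only when inSession is true (building this session's copy).
function applyOverrides(
  skill: SkillEntry,
  overrides: Override[],
  inSession: boolean
): SkillEntry {
  const applicable = overrides.filter(
    (o) => o.skillName === skill.name && (o.permanent || inSession)
  );
  if (applicable.length === 0) return skill;
  return {
    ...skill,
    constraints: [
      ...skill.constraints,
      ...applicable.map(
        (o) =>
          `Override: do "${o.overrideBehavior}" instead of "${o.originalBehavior}"`
      ),
    ],
  };
}
```

Because the merge is non-destructive, the stored skill keeps its learned constraints while the human's override always wins in the prompt the agent actually sees.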
A self-improving agent that runs 20 tasks per day and extracts lessons from 10% of them accumulates 2 new reflections daily. After a month, it has 60 reflections. After six months, 360. These reflections condense into 20 to 40 skills that cover the agent's most common tasks.
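The accumulation arithmetic is easy to sanity-check:

```typescript
// Reflections accumulated = tasks per day x lesson-extraction rate x days.
function projectedReflections(
  tasksPerDay: number,
  lessonRate: number,
  days: number
): number {
  return Math.round(tasksPerDay * lessonRate * days);
}

projectedReflections(20, 0.1, 30); // 60 after a month
projectedReflections(20, 0.1, 180); // 360 after six months
```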
The agent after six months of accumulation is fundamentally different from the agent on day one. It has seen the failure modes, learned the conventions, absorbed the corrections, and encoded the solutions. Every session is faster and more accurate than the last.
This is the promise of self-improving AI agents. Not artificial general intelligence. Not consciousness. Just systems that remember what worked, remember what failed, and apply those memories to the next task. The implementation is straightforward. The compounding effect is not.