TL;DR
AI agents that reflect on failures, accumulate skills, and get better with every session. Reflection patterns, memory architectures, skill extraction, and working code examples for building agents that actually learn.
Most AI agents are goldfish. They execute a task, succeed or fail, and forget everything. The next time they encounter the same problem, they make the same mistakes. Every session starts from zero.
Self-improving agents break this cycle. They reflect on what happened, extract lessons from successes and failures, store those lessons in accessible formats, and retrieve them when relevant. Over time, they get measurably better at their tasks.
This is not speculative AI research. These patterns are running in production today. Here is how to build agents that actually learn.
Reflection is the mechanism that converts experience into knowledge. After an agent completes a task (or fails at one), a reflection step analyzes what happened and extracts transferable lessons.
The simplest reflection pattern has three steps: execute, evaluate, extract.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface Reflection {
  task: string;
  outcome: "success" | "failure" | "partial";
  lessons: string[];
  confidence: number;
  timestamp: string;
}

async function executeWithReflection(
  task: string,
  context: string
): Promise<{ result: string; reflection: Reflection }> {
  // Step 1: Execute the task
  const executionResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 4096,
    system: context,
    messages: [{ role: "user", content: task }],
  });

  const result =
    executionResult.content[0].type === "text"
      ? executionResult.content[0].text
      : "";

  // Step 2: Evaluate the outcome
  const evaluationResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Evaluate this task execution:
Task: ${task}
Result: ${result}
Rate the outcome as "success", "failure", or "partial".
Extract 1-3 specific, actionable lessons learned.
Rate your confidence in each lesson from 0.0 to 1.0.
Respond as JSON:
{
  "outcome": "success|failure|partial",
  "lessons": ["lesson 1", "lesson 2"],
  "confidence": 0.85
}`,
      },
    ],
  });

  // Parse defensively: a malformed response should not crash the loop
  let evaluation: Partial<Reflection> = {};
  try {
    evaluation = JSON.parse(
      evaluationResult.content[0].type === "text"
        ? evaluationResult.content[0].text
        : "{}"
    );
  } catch {
    evaluation = {};
  }

  const reflection: Reflection = {
    task,
    outcome: evaluation.outcome ?? "partial",
    lessons: evaluation.lessons ?? [],
    confidence: evaluation.confidence ?? 0,
    timestamp: new Date().toISOString(),
  };

  return { result, reflection };
}
This is the foundation. The agent does work, then a separate evaluation pass examines the work and extracts lessons. The evaluation pass uses the same model but with a different prompt focused on analysis rather than execution.
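One practical hazard in the evaluation pass: models sometimes wrap JSON in markdown fences or add surrounding prose, which a bare `JSON.parse` cannot handle. A tolerant extraction helper can absorb that. This is a sketch; `safeParseJson` is a hypothetical name, not part of any SDK:

```typescript
// Hypothetical helper: tolerant JSON extraction from model output.
// Strips markdown code fences, isolates the first {...} span, and
// falls back to a caller-supplied default instead of throwing.
function safeParseJson<T>(raw: string, fallback: T): T {
  // Remove ```json ... ``` fences if present
  const cleaned = raw.replace(/```(?:json)?/g, "").trim();
  // Grab the outermost {...} span in case the model added prose around it
  const start = cleaned.indexOf("{");
  const end = cleaned.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) return fallback;
  try {
    return JSON.parse(cleaned.slice(start, end + 1)) as T;
  } catch {
    return fallback;
  }
}
```

Dropping this in wherever the examples call `JSON.parse` on model output keeps a single malformed response from aborting the reflection loop.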
A common mistake is asking the agent to execute and reflect simultaneously. "Do this task and also think about what you learned." This produces worse execution and worse reflection because the model splits its attention.
Separation works better for two reasons:
Cognitive clarity. The execution pass focuses entirely on the task. The evaluation pass focuses entirely on analysis. Neither is compromised by the other.
Different system prompts. The execution pass might have a system prompt optimized for coding ("You are an expert TypeScript developer"). The evaluation pass has a system prompt optimized for analysis ("You are a quality analyst reviewing agent performance"). Specialization improves both outputs.
Reflection produces lessons. Memory stores them for retrieval. The architecture of your memory system determines whether lessons are available when they matter.
The simplest approach: store all reflections in a single JSON file, read at session start.
import { readFile, writeFile } from "fs/promises";

const MEMORY_PATH = ".agent/memory.json";

interface MemoryStore {
  reflections: Reflection[];
  skills: Skill[];
  corrections: Correction[];
}

async function loadMemory(): Promise<MemoryStore> {
  try {
    const data = await readFile(MEMORY_PATH, "utf-8");
    return JSON.parse(data);
  } catch {
    return { reflections: [], skills: [], corrections: [] };
  }
}

async function saveReflection(reflection: Reflection): Promise<void> {
  const memory = await loadMemory();
  memory.reflections.push(reflection);

  // Prune low-confidence reflections when memory gets large
  if (memory.reflections.length > 200) {
    memory.reflections = memory.reflections
      .sort((a, b) => b.confidence - a.confidence)
      .slice(0, 150);
  }

  await writeFile(MEMORY_PATH, JSON.stringify(memory, null, 2));
}

async function getRelevantMemories(task: string): Promise<string> {
  const memory = await loadMemory();

  // Simple keyword matching for retrieval
  const words = task.toLowerCase().split(/\s+/);
  const relevant = memory.reflections.filter((r) =>
    words.some(
      (w) =>
        r.task.toLowerCase().includes(w) ||
        r.lessons.some((l) => l.toLowerCase().includes(w))
    )
  );

  return relevant
    .slice(0, 10)
    .map(
      (r) =>
        `[${r.outcome}] ${r.task}: ${r.lessons.join("; ")} (confidence: ${r.confidence})`
    )
    .join("\n");
}
Flat file memory works for agents with fewer than a few hundred reflections. Beyond that, keyword matching becomes unreliable and loading the entire file wastes tokens.
When to use: Personal agents, project-specific agents, any system where the total reflection count stays under 500.
Split memory into categories so the agent loads only relevant context.
interface StructuredMemory {
  technical: {
    bugs: Reflection[];
    patterns: Reflection[];
    performance: Reflection[];
  };
  process: {
    planning: Reflection[];
    testing: Reflection[];
    deployment: Reflection[];
  };
  domain: {
    [key: string]: Reflection[];
  };
  // Index signature so categories can be addressed dynamically below
  [top: string]: { [sub: string]: Reflection[] };
}

async function categorizeAndStore(reflection: Reflection): Promise<void> {
  const memory = await loadStructuredMemory();

  // Use the model to categorize the reflection
  const categoryResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Categorize this lesson into one category.
Lesson: ${reflection.lessons.join("; ")}
Task: ${reflection.task}
Categories:
- technical/bugs: Bug fixes and error handling
- technical/patterns: Code patterns and architecture
- technical/performance: Performance optimization
- process/planning: Task planning and decomposition
- process/testing: Testing strategies
- process/deployment: Deployment and operations
- domain/{topic}: Domain-specific knowledge
Respond with just the category path, e.g., "technical/bugs"`,
      },
    ],
  });

  const category =
    categoryResult.content[0].type === "text"
      ? categoryResult.content[0].text.trim()
      : "domain/general";

  // Store in the appropriate category
  const [top, sub] = category.split("/");
  if (!memory[top]) memory[top] = {};
  if (!memory[top][sub]) memory[top][sub] = [];
  memory[top][sub].push(reflection);

  await saveStructuredMemory(memory);
}

async function retrieveByCategory(
  categories: string[]
): Promise<Reflection[]> {
  const memory = await loadStructuredMemory();
  const results: Reflection[] = [];

  for (const category of categories) {
    const [top, sub] = category.split("/");
    if (memory[top]?.[sub]) {
      results.push(...memory[top][sub]);
    }
  }

  return results.sort((a, b) => b.confidence - a.confidence).slice(0, 20);
}
Structured memory solves the relevance problem. When the agent is debugging, it loads technical/bugs. When it is deploying, it loads process/deployment. Irrelevant memories stay out of the context window.
When to use: Agents that handle diverse tasks across multiple domains. The categorization overhead is worth it when the total memory exceeds a few hundred entries.
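One failure mode worth guarding against: the categorization call trusts the model to return a well-formed category path, and an off-list response would create a stray bucket. A small normalizer is a cheap safeguard. This is a sketch; the category list mirrors the prompt above, and the `domain/general` fallback is an assumption:

```typescript
// Known category paths from the categorization prompt; domain/* is open-ended.
const KNOWN_CATEGORIES = new Set([
  "technical/bugs",
  "technical/patterns",
  "technical/performance",
  "process/planning",
  "process/testing",
  "process/deployment",
]);

// Hypothetical helper: normalize a model-returned category path,
// falling back to "domain/general" for anything unrecognized.
function normalizeCategory(raw: string): string {
  const path = raw.trim().toLowerCase();
  if (KNOWN_CATEGORIES.has(path)) return path;
  // Allow arbitrary domain/{topic} paths, as the prompt permits
  if (/^domain\/[a-z0-9-]+$/.test(path)) return path;
  return "domain/general";
}
```

Running the model's answer through `normalizeCategory` before the `split("/")` keeps the memory tree from accumulating junk branches.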
For large memory stores, use vector embeddings to find semantically relevant memories.
interface EmbeddedReflection extends Reflection {
  embedding: number[];
}

async function embedReflection(
  reflection: Reflection
): Promise<EmbeddedReflection> {
  const text = `${reflection.task} ${reflection.lessons.join(" ")}`;
  // Using a local embedding model or API
  const embedding = await getEmbedding(text);
  return { ...reflection, embedding };
}

async function findSimilar(
  query: string,
  topK: number = 5
): Promise<Reflection[]> {
  const queryEmbedding = await getEmbedding(query);
  const allMemories = await loadEmbeddedMemories();

  // Cosine similarity search
  const scored = allMemories.map((m) => ({
    reflection: m,
    score: cosineSimilarity(queryEmbedding, m.embedding),
  }));

  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.reflection);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Embedding-based retrieval finds semantically similar memories even when the exact keywords differ. A lesson about "database connection pooling" retrieves when the agent encounters "Postgres timeout errors" because the embeddings are close in semantic space.
When to use: Agents with thousands of reflections, or agents that handle tasks where keyword matching is insufficient.
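To see the ranking math in isolation, here is the same cosine-similarity scoring applied to toy three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions; the vectors and labels below are illustrative only, and `cosineSimilarity` is repeated so the snippet stands alone:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy memory store: lessons paired with made-up embedding vectors.
const memories = [
  { lesson: "pool database connections", embedding: [0.9, 0.1, 0.0] },
  { lesson: "cache API responses", embedding: [0.1, 0.9, 0.1] },
  { lesson: "retry Postgres timeouts", embedding: [0.8, 0.2, 0.1] },
];

// A query embedding that sits close to the database-related lessons.
const query = [0.85, 0.15, 0.05];

const ranked = memories
  .map((m) => ({ ...m, score: cosineSimilarity(query, m.embedding) }))
  .sort((a, b) => b.score - a.score);
// Both database lessons score near 1.0; the caching lesson scores far lower.
```

This is the mechanism behind the "Postgres timeout" example: both database lessons land near the query in vector space even though they share no exact keywords with each other.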
Reflections are raw experience. Skills are refined knowledge. The skill extraction process converts multiple related reflections into a reusable, structured skill.
interface Skill {
  name: string;
  description: string;
  steps: string[];
  constraints: string[];
  confidence: number;
  sourceReflections: number;
  lastUpdated: string;
}

async function extractSkill(
  reflections: Reflection[],
  domain: string
): Promise<Skill> {
  const reflectionText = reflections
    .map(
      (r) =>
        `Task: ${r.task}\nOutcome: ${r.outcome}\nLessons: ${r.lessons.join("; ")}`
    )
    .join("\n\n");

  const result = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: `Analyze these ${reflections.length} related experiences in the "${domain}" domain and extract a reusable skill.
${reflectionText}
Create a skill with:
1. A clear name (verb + noun, e.g., "Debug Database Connections")
2. A one-sentence description
3. Ordered steps that work reliably
4. Constraints (things to avoid, based on failures)
5. Confidence level (0.0-1.0) based on how many successes vs failures
Respond as JSON:
{
  "name": "...",
  "description": "...",
  "steps": ["step 1", "step 2"],
  "constraints": ["never do X", "always check Y"],
  "confidence": 0.85
}`,
      },
    ],
  });

  const skillData = JSON.parse(
    result.content[0].type === "text" ? result.content[0].text : "{}"
  );

  return {
    ...skillData,
    sourceReflections: reflections.length,
    lastUpdated: new Date().toISOString(),
  };
}
The skill extraction prompt does the heavy lifting: it reads multiple experiences and synthesizes them into a repeatable procedure. The constraints are particularly valuable because they encode failure modes. An agent with a skill that says "never use raw SQL for schema changes" will not make that mistake even if its base training would suggest it.
Skills are not static. As the agent accumulates more experience, skills should be updated with new steps, refined constraints, and adjusted confidence levels.
async function evolveSkill(
  existing: Skill,
  newReflections: Reflection[]
): Promise<Skill> {
  const reflectionText = newReflections
    .map(
      (r) =>
        `Task: ${r.task}\nOutcome: ${r.outcome}\nLessons: ${r.lessons.join("; ")}`
    )
    .join("\n\n");

  const result = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: `Update this existing skill based on new experiences.
Current skill:
Name: ${existing.name}
Steps: ${existing.steps.join("\n")}
Constraints: ${existing.constraints.join("\n")}
Confidence: ${existing.confidence}
Based on: ${existing.sourceReflections} experiences
New experiences:
${reflectionText}
Update the skill:
- Add new steps if the new experiences reveal missing procedures
- Add new constraints if failures reveal new pitfalls
- Remove steps that new experience shows are unnecessary
- Adjust confidence based on success/failure ratio
- Keep what works, fix what doesn't
Respond as the updated skill JSON.`,
      },
    ],
  });

  const updated = JSON.parse(
    result.content[0].type === "text" ? result.content[0].text : "{}"
  );

  return {
    ...updated,
    sourceReflections: existing.sourceReflections + newReflections.length,
    lastUpdated: new Date().toISOString(),
  };
}
Skill evolution is where the compounding effect becomes visible. A skill based on 5 reflections is rough and general. The same skill after 50 reflections is specific, battle-tested, and reliable. Each new experience either reinforces existing steps (increasing confidence) or reveals gaps (adding new steps or constraints).
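The confidence adjustment can also be computed deterministically instead of asking the model for it. One option is a Laplace-smoothed success rate, which keeps a skill from hitting 0.0 or 1.0 on thin evidence. This formula is an assumption chosen for illustration, not something the article prescribes:

```typescript
// Hypothetical confidence estimator: Laplace-smoothed success rate.
// Adding one virtual success and one virtual failure keeps early
// estimates conservative: a skill with no evidence starts at 0.5.
function skillConfidence(successes: number, failures: number): number {
  return (successes + 1) / (successes + failures + 2);
}
```

With 4 successes and 1 failure the estimate is 5/7 ≈ 0.71; with 40 and 10 it tightens to 41/52 ≈ 0.79, reflecting the larger evidence base behind the same ratio.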
Over time, an agent accumulates a library of skills that covers its common tasks. The library structure matters for retrieval.
interface SkillLibrary {
  skills: Map<string, Skill>;
  index: Map<string, string[]>; // keyword -> skill names
}

async function findApplicableSkills(
  task: string,
  library: SkillLibrary
): Promise<Skill[]> {
  const words = task.toLowerCase().split(/\s+/);
  const candidateNames = new Set<string>();

  for (const word of words) {
    const matches = library.index.get(word) || [];
    matches.forEach((name) => candidateNames.add(name));
  }

  const candidates = Array.from(candidateNames)
    .map((name) => library.skills.get(name))
    .filter((s): s is Skill => s !== undefined);

  // Sort by confidence, then by recency
  return candidates
    .sort(
      (a, b) =>
        b.confidence - a.confidence ||
        new Date(b.lastUpdated).getTime() - new Date(a.lastUpdated).getTime()
    )
    .slice(0, 3);
}
The agent consults its skill library before starting a task. If relevant skills exist, they provide a starting procedure and known constraints. If no skills exist, the agent operates from its base capabilities and generates new reflections that may seed future skills.
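The keyword index in `SkillLibrary` (keyword → skill names) has to be populated somewhere. A minimal builder might tokenize each skill's name and description; this is a sketch, and the tokenization rules and short-word filter are assumptions:

```typescript
interface IndexableSkill {
  name: string;
  description: string;
}

// Hypothetical index builder: map each significant lowercase word in a
// skill's name and description to the names of skills that mention it.
function buildIndex(skills: IndexableSkill[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const skill of skills) {
    const words = `${skill.name} ${skill.description}`
      .toLowerCase()
      .split(/\W+/)
      .filter((w) => w.length > 2); // drop very short tokens
    // Deduplicate so a skill is indexed once per word
    for (const word of new Set(words)) {
      const entry = index.get(word) ?? [];
      entry.push(skill.name);
      index.set(word, entry);
    }
  }
  return index;
}
```

Rebuilding the index whenever a skill is added or evolved keeps retrieval consistent with the library's contents.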
Corrections are the highest-signal input for self-improvement. When a human corrects an agent, that correction represents a gap between the agent's behavior and the desired behavior.
interface Correction {
  context: string;
  agentBehavior: string;
  humanCorrection: string;
  category: string;
  timestamp: string;
  applied: boolean;
}

async function processCorrection(
  context: string,
  agentBehavior: string,
  humanCorrection: string
): Promise<Correction> {
  // Categorize the correction
  const categoryResult = await client.messages.create({
    model: "claude-sonnet-4-6-20260409",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Categorize this correction:
Agent did: ${agentBehavior}
Human corrected to: ${humanCorrection}
Categories: style, logic, architecture, security, performance, convention, other
Respond with just the category.`,
      },
    ],
  });

  const category =
    categoryResult.content[0].type === "text"
      ? categoryResult.content[0].text.trim()
      : "other";

  const correction: Correction = {
    context,
    agentBehavior,
    humanCorrection,
    category,
    timestamp: new Date().toISOString(),
    applied: false,
  };

  await storeCorrection(correction);

  // Check if we have enough corrections in this category to update a skill
  const categoryCorrections = await getCorrectionsByCategory(category);
  if (categoryCorrections.length >= 3) {
    await updateSkillFromCorrections(category, categoryCorrections);
  }

  return correction;
}
The threshold of three corrections before updating a skill prevents overreaction to a single data point. One correction might be situational. Three corrections in the same category indicate a systematic gap.
Not all corrections age equally. A correction from yesterday is more relevant than one from three months ago because the codebase, the developer's preferences, and the project conventions may have changed.
function correctionWeight(correction: Correction): number {
  const ageInDays =
    (Date.now() - new Date(correction.timestamp).getTime()) /
    (1000 * 60 * 60 * 24);
  // Half-life of 30 days
  return Math.exp(-0.693 * (ageInDays / 30));
}

async function getWeightedCorrections(
  category: string
): Promise<Correction[]> {
  const corrections = await getCorrectionsByCategory(category);
  return corrections
    .map((c) => ({ ...c, weight: correctionWeight(c) }))
    .filter((c) => c.weight > 0.1) // Discard corrections older than ~100 days
    .sort((a, b) => b.weight - a.weight);
}
The 30-day half-life means recent corrections weigh heavily while old ones fade. This prevents the agent from following outdated preferences.
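The constant 0.693 is ln 2, which is what makes 30 days a half-life. Plugging concrete ages into the same formula confirms the curve, including the roughly 100-day cutoff that the 0.1 filter implies:

```typescript
// Exponential decay with a 30-day half-life: weight = exp(-ln2 * age / 30).
function decayWeight(ageInDays: number): number {
  return Math.exp(-0.693 * (ageInDays / 30));
}

const today = decayWeight(0); // 1.0: a fresh correction carries full weight
const oneHalfLife = decayWeight(30); // ~0.5 after one half-life
const threeMonths = decayWeight(90); // ~0.125, still above the 0.1 cutoff
const fourMonths = decayWeight(120); // ~0.06, pruned by the 0.1 filter
```

Solving exp(-0.693 · t/30) = 0.1 gives t ≈ 99.7 days, which matches the "~100 days" comment in the filter.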
Here is the complete self-improving agent loop, combining execution, reflection, memory, and skill accumulation.
async function selfImprovingAgent(task: string): Promise<string> {
  // 1. Load relevant context
  const memories = await getRelevantMemories(task);
  const skills = await findApplicableSkills(task, await loadSkillLibrary());
  const corrections = await getRecentCorrections(task);

  // 2. Build enhanced context
  const context = buildContext(memories, skills, corrections);

  // 3. Execute with enhanced context
  const { result, reflection } = await executeWithReflection(task, context);

  // 4. Store the reflection
  await saveReflection(reflection);

  // 5. Check if any skills need updating
  if (reflection.lessons.length > 0) {
    const relatedReflections = await findRelatedReflections(reflection);
    if (relatedReflections.length >= 5) {
      const existingSkill = await findMatchingSkill(task);
      if (existingSkill) {
        // Persist the evolved skill; discarding the return value
        // would throw away the update
        const evolved = await evolveSkill(existingSkill, [reflection]);
        await addToSkillLibrary(evolved);
      } else {
        const newSkill = await extractSkill(relatedReflections, task);
        await addToSkillLibrary(newSkill);
      }
    }
  }

  return result;
}

function buildContext(
  memories: string,
  skills: Skill[],
  corrections: Correction[]
): string {
  let context = "You are an AI agent that learns from experience.\n\n";

  if (skills.length > 0) {
    context += "## Relevant Skills\n\n";
    for (const skill of skills) {
      context += `### ${skill.name}\n`;
      context += `${skill.description}\n`;
      context += `Steps:\n${skill.steps.map((s, i) => `${i + 1}. ${s}`).join("\n")}\n`;
      context += `Constraints:\n${skill.constraints.map((c) => `- ${c}`).join("\n")}\n\n`;
    }
  }

  if (memories) {
    context += "## Relevant Past Experiences\n\n";
    context += memories + "\n\n";
  }

  if (corrections.length > 0) {
    context += "## Recent Corrections\n\n";
    for (const c of corrections.slice(0, 5)) {
      context += `- Instead of: ${c.agentBehavior}\n Do: ${c.humanCorrection}\n`;
    }
    context += "\n";
  }

  return context;
}
The loop runs on every task execution. Over time, the agent's context becomes enriched with relevant skills, past experiences, and human corrections. Each interaction contributes to the next one.
Self-improvement claims require measurement. Here are concrete metrics to track.
Track whether the agent's first attempt at a task is accepted without correction.
interface PerformanceMetrics {
  totalTasks: number;
  firstAttemptSuccesses: number;
  correctionsReceived: number;
  averageConfidence: number;
  skillCount: number;
}

function successRate(metrics: PerformanceMetrics): number {
  return metrics.firstAttemptSuccesses / metrics.totalTasks;
}
A self-improving agent should show increasing first-attempt success rate over time. If the rate is flat, the memory and skill systems are not being retrieved effectively.
Plot corrections per task over time. A declining trend means the agent is learning from previous corrections and not repeating mistakes.
Track what percentage of tasks have applicable skills in the library. Higher coverage means fewer tasks where the agent operates from base capabilities alone.
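A single overall success rate can hide recent improvement behind a long history. A rolling-window variant makes the trend visible; this is a sketch, and the default window of 20 tasks is an assumption:

```typescript
// Hypothetical trend metric: first-attempt success rate over a sliding
// window of the most recent outcomes (true = accepted without correction).
function rollingSuccessRate(outcomes: boolean[], window: number = 20): number[] {
  const rates: number[] = [];
  for (let i = window; i <= outcomes.length; i++) {
    const slice = outcomes.slice(i - window, i);
    rates.push(slice.filter(Boolean).length / window);
  }
  return rates;
}
```

An agent that is actually learning produces a rising series; a flat series points at retrieval, since the memories exist but are not reaching the context.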
Unbounded memory eventually degrades performance. Old, low-confidence reflections add noise without value. Prune aggressively: cap the total reflection count and keep only the highest-confidence entries, as the saveReflection example above does.
When two skills give contradictory advice, the agent needs a resolution strategy. The simplest approach: prefer the skill with higher confidence and more source reflections. A skill based on 30 experiences at 0.9 confidence overrides one based on 3 experiences at 0.6 confidence.
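That preference can be encoded as a comparator. The exact weighting is a design choice; the sketch below multiplies confidence by the log of the evidence count (an assumption) so that 30 experiences at 0.9 decisively beat 3 experiences at 0.6:

```typescript
interface SkillEvidence {
  name: string;
  confidence: number;
  sourceReflections: number;
}

// Hypothetical conflict resolver: weight confidence by the log of the
// evidence count, so well-tested skills outrank lightly-tested ones.
function preferredSkill(a: SkillEvidence, b: SkillEvidence): SkillEvidence {
  const score = (s: SkillEvidence) =>
    s.confidence * Math.log2(s.sourceReflections + 1);
  return score(a) >= score(b) ? a : b;
}
```

Here the battle-tested skill scores 0.9 · log2(31) ≈ 4.5 against 0.6 · log2(4) = 1.2 for the newer one.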
Self-improvement systems must preserve human authority. If a developer explicitly overrides a skill or correction, that override takes precedence. The system should not argue with direct instructions, even if its learned behavior disagrees.
interface Override {
  skillName: string;
  originalBehavior: string;
  overrideBehavior: string;
  reason: string;
  permanent: boolean;
}
Permanent overrides modify the skill directly. Temporary overrides apply for the current session only. Both are tracked so the agent can distinguish between "the human always wants this" and "the human wanted this once."
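One way to enforce that distinction is a merge pass before a skill reaches the context builder: append each applicable override as an extra constraint, applying permanent overrides always and temporary ones only when building the current session's copy. A minimal sketch; `applyOverrides` and the constraint-appending merge rule are assumptions:

```typescript
interface Override {
  skillName: string;
  originalBehavior: string;
  overrideBehavior: string;
  reason: string;
  permanent: boolean;
}

interface SkillEntry {
  name: string;
  constraints: string[];
}

// Hypothetical merge: permanent overrides always apply; temporary ones
// apply only when inSession is true (building this session's copy).
function applyOverrides(
  skill: SkillEntry,
  overrides: Override[],
  inSession: boolean
): SkillEntry {
  const applicable = overrides.filter(
    (o) => o.skillName === skill.name && (o.permanent || inSession)
  );
  if (applicable.length === 0) return skill;
  return {
    ...skill,
    constraints: [
      ...skill.constraints,
      ...applicable.map(
        (o) =>
          `Override: do "${o.overrideBehavior}" instead of "${o.originalBehavior}"`
      ),
    ],
  };
}
```

Because the merge is non-destructive, the stored skill keeps its learned constraints while the human's override always wins in the prompt the agent actually sees.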
A self-improving agent that runs 20 tasks per day and extracts lessons from 10% of them accumulates 2 new reflections daily. After a month, it has 60 reflections. After six months, 360. These reflections condense into 20 to 40 skills that cover the agent's most common tasks.
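The accumulation arithmetic is easy to sanity-check:

```typescript
// Reflections accumulated = tasks per day x lesson-extraction rate x days.
function projectedReflections(
  tasksPerDay: number,
  lessonRate: number,
  days: number
): number {
  return Math.round(tasksPerDay * lessonRate * days);
}

projectedReflections(20, 0.1, 30); // 60 after a month
projectedReflections(20, 0.1, 180); // 360 after six months
```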
The agent after six months of accumulation is fundamentally different from the agent on day one. It has seen the failure modes, learned the conventions, absorbed the corrections, and encoded the solutions. Every session is faster and more accurate than the last.
This is the promise of self-improving AI agents. Not artificial general intelligence. Not consciousness. Just systems that remember what worked, remember what failed, and apply those memories to the next task. The implementation is straightforward. The compounding effect is not.