
AI Agents Deep Dive (7 parts)
TL;DR
AI agents fail in ways traditional debugging cannot catch. Here are the tools and patterns for finding and fixing broken agent loops, tool failures, and context issues.
Traditional debugging is about finding where code breaks. Agent debugging is about finding where reasoning breaks. The code runs fine. The model just made the wrong decision.
Here are the patterns that actually work.
You need visibility into three things: every tool call the agent makes, how full the context window is, and the reasoning behind each decision.
Without all three, you are guessing.
Log every tool call with structured data. Not just "tool called" - the full input, output, and timing.
interface ToolLog {
  tool: string;
  input: Record<string, unknown>;
  output: unknown;
  durationMs: number;
  timestamp: number;
  step: number;
}
function wrapTool<T>(name: string, fn: (input: T) => Promise<unknown>) {
  return async (input: T, step: number): Promise<{ result: unknown; log: ToolLog }> => {
    const start = Date.now();
    try {
      const result = await fn(input);
      const log: ToolLog = {
        tool: name,
        input: input as Record<string, unknown>,
        output: result,
        durationMs: Date.now() - start,
        timestamp: start,
        step,
      };
      return { result, log };
    } catch (error) {
      const log: ToolLog = {
        tool: name,
        input: input as Record<string, unknown>,
        output: { error: String(error) },
        durationMs: Date.now() - start,
        timestamp: start,
        step,
      };
      return { result: null, log };
    }
  };
}
When an agent goes wrong, you can trace the exact sequence: step 3 called search_files with the wrong query, got no results, then hallucinated the file content.
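As a self-contained sketch, here is the wrapper applied to a hypothetical search_files tool. The tool, its file list, and the query are illustrative, not from a real library; the wrapper is re-declared compactly so the example runs on its own.

```typescript
// Compact re-declaration of the wrapper so this sketch runs standalone.
type ToolLog = {
  tool: string; input: Record<string, unknown>; output: unknown;
  durationMs: number; timestamp: number; step: number;
};

function wrapTool<T>(name: string, fn: (input: T) => Promise<unknown>) {
  return async (input: T, step: number) => {
    const start = Date.now();
    let output: unknown;
    let result: unknown = null;
    try {
      result = await fn(input);
      output = result;
    } catch (error) {
      output = { error: String(error) };
    }
    const log: ToolLog = {
      tool: name, input: input as Record<string, unknown>, output,
      durationMs: Date.now() - start, timestamp: start, step,
    };
    return { result, log };
  };
}

// Hypothetical tool: a fake file search over a fixed list.
const searchFiles = wrapTool("search_files", async (input: { query: string }) => {
  const files = ["src/agent.ts", "src/tools.ts", "README.md"];
  return files.filter((f) => f.includes(input.query));
});

searchFiles({ query: "agent" }, 3).then(({ result, log }) =>
  console.log(log.tool, log.step, result)
);
```

The returned log carries the step number, so a failing run can be lined up against its trace entry directly.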
The most common agent failure is context overflow. The agent loses important information because the context window filled up with tool outputs.
// Assumes Message and ContextSnapshot types and an estimateTokens helper are defined elsewhere.
function trackContext(messages: Message[]): ContextSnapshot {
  const totalTokens = estimateTokens(messages);
  const breakdown = messages.map((m) => ({
    role: m.role,
    tokens: estimateTokens([m]),
    preview: m.content.slice(0, 100),
  }));
  return {
    totalTokens,
    maxTokens: 200_000,
    utilization: totalTokens / 200_000,
    breakdown,
    warning: totalTokens > 150_000 ? "Context 75%+ full" : null,
  };
}
If your agent starts failing after 10+ steps, it is almost always context overflow. The fix: summarize intermediate results instead of keeping raw tool outputs.
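One way to apply that fix, sketched under assumptions: old, oversized tool messages get compacted while the last few messages stay verbatim. The threshold and the keep-recent window are illustrative defaults, and the truncation stands in for a real summarization call to a cheap model.

```typescript
interface Message { role: string; content: string }

// Replace large, old tool outputs so the context keeps the original task and
// recent results instead of every raw intermediate blob.
function compactToolOutputs(
  messages: Message[],
  maxChars = 2_000,   // assumed per-message threshold
  keepRecent = 3      // always keep the last few messages verbatim
): Message[] {
  return messages.map((m, i) => {
    const isOld = i < messages.length - keepRecent;
    if (m.role === "tool" && isOld && m.content.length > maxChars) {
      // In practice, call a cheap model to summarize; truncation is a placeholder.
      return { ...m, content: m.content.slice(0, maxChars) + "\n[truncated; full output in trace]" };
    }
    return m;
  });
}
```

Run this before each model call, not after the context is already full.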
Before each action, ask the agent to explain its reasoning in structured form.
import { z } from "zod";

const decisionSchema = z.object({
  observation: z.string().describe("What I see in the current state"),
  reasoning: z.string().describe("Why I chose this action"),
  action: z.string().describe("What I will do next"),
  confidence: z.number().min(0).max(1).describe("How confident I am"),
  alternatives: z.array(z.string()).describe("Other actions I considered"),
});
When confidence drops below 0.5, you know exactly where the agent got uncertain. This is where human review adds the most value.
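A sketch of that gate in code. The Decision type mirrors the schema above; the execute and review callbacks and the 0.5 threshold are illustrative assumptions, not a fixed API.

```typescript
interface Decision {
  observation: string;
  reasoning: string;
  action: string;
  confidence: number; // 0..1
  alternatives: string[];
}

// Route low-confidence decisions to a human reviewer instead of executing
// them blindly. Threshold and callbacks are placeholders.
function gateDecision(
  decision: Decision,
  execute: (d: Decision) => string,
  review: (d: Decision) => string,
  threshold = 0.5
): string {
  return decision.confidence < threshold ? review(decision) : execute(decision);
}
```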
Save the full agent trajectory so you can replay it.
import fs from "node:fs/promises";

interface AgentTrajectory {
  task: string;
  steps: {
    thought: string;
    action: string;
    toolInput: unknown;
    toolOutput: unknown;
    contextTokens: number;
  }[];
  outcome: "success" | "failure" | "timeout";
  totalSteps: number;
  totalDurationMs: number;
}

// Save trajectory
async function saveTrajectory(trajectory: AgentTrajectory) {
  // Slugify the task so the filename stays filesystem-safe.
  const slug = trajectory.task.slice(0, 30).replace(/[^a-zA-Z0-9-]/g, "_");
  const id = `${Date.now()}-${slug}`;
  await fs.writeFile(
    `./traces/${id}.json`,
    JSON.stringify(trajectory, null, 2)
  );
}
When a similar task fails, diff the successful trajectory against the failing one. The divergence point is usually the bug.
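Finding that divergence point can be sketched as a simple walk over two step lists. Comparing action plus tool input is one possible heuristic (outputs are expected to differ legitimately); the Step shape mirrors the trajectory steps above.

```typescript
interface Step {
  thought: string;
  action: string;
  toolInput: unknown;
  toolOutput: unknown;
  contextTokens: number;
}

// Return the index of the first step where two trajectories diverge,
// or -1 if they match step for step.
function findDivergence(good: Step[], bad: Step[]): number {
  const len = Math.min(good.length, bad.length);
  for (let i = 0; i < len; i++) {
    if (
      good[i].action !== bad[i].action ||
      JSON.stringify(good[i].toolInput) !== JSON.stringify(bad[i].toolInput)
    ) {
      return i;
    }
  }
  // One trajectory is a prefix of the other; divergence is where the shorter ends.
  return good.length === bad.length ? -1 : len;
}
```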
If you are using Claude Code, hooks give you deterministic debugging points.
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": ".*",
        "command": "echo \"Tool: $TOOL_NAME | Exit: $EXIT_CODE\" >> /tmp/claude-debug.log"
      }
    ],
    "Stop": [
      {
        "command": "echo \"Session ended at $(date)\" >> /tmp/claude-debug.log"
      }
    ]
  }
}
Every tool call gets logged. Every session end gets recorded. Review the log when something goes wrong.
Infinite loops. The agent keeps retrying the same action. Fix: add a step counter and bail after N attempts.
Tool misuse. The agent calls a tool with the wrong arguments. Fix: improve tool descriptions and add input validation.
Context poisoning. A large tool output fills the context with irrelevant data. Fix: truncate or summarize tool outputs before adding to context.
Premature termination. The agent thinks it is done but it is not. Fix: add verification steps that check the actual result against the original task.
Wrong tool selection. The agent picks the wrong tool for the job. Fix: make tool descriptions more specific about when to use each tool.
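The first fix above (a step counter with a bail-out) pairs naturally with repeat detection. Here is a sketch; the class shape, the 25-step cap, and the 3-repeat threshold are illustrative defaults.

```typescript
// Guard against runaway agents: cap total steps and break when the same
// tool+input signature keeps repeating.
class LoopGuard {
  private history: string[] = [];
  constructor(private maxSteps = 25, private maxRepeats = 3) {}

  // Returns a reason to stop, or null to continue.
  check(tool: string, input: unknown): string | null {
    const signature = `${tool}:${JSON.stringify(input)}`;
    this.history.push(signature);
    if (this.history.length > this.maxSteps) {
      return `Exceeded ${this.maxSteps} steps`;
    }
    const repeats = this.history.filter((s) => s === signature).length;
    if (repeats >= this.maxRepeats) {
      return `Repeated ${signature} ${repeats} times; likely stuck`;
    }
    return null;
  }
}
```

Call check before executing each tool; on a non-null result, stop the loop and escalate to a human instead of retrying.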
Not every agent failure needs a code fix. Sometimes the right answer is human review at critical points: before destructive actions, when confidence is low, after repeated failures, and before the agent declares a task complete.
The best agent systems are not fully autonomous. They are autonomous for the easy parts and interactive for the hard parts.
What is the most common agent failure mode?
Context overflow. After enough tool calls, the context window fills with intermediate results and the agent loses track of the original task. The fix is summarizing intermediate results and managing context deliberately.
How do I debug a Claude Code agent session?
Use hooks to log every tool call. Add a PostToolUse hook that records the tool name, input, and exit code. Review the log file to trace the exact decision sequence. The /transcript command also helps.
Do I need structured logging, or is plain text enough?
Yes, you need structure. Structured tool logs (JSON with tool name, input, output, duration, step number) are essential. You can filter, query, and diff them. Plain text logs are almost useless for multi-step agent debugging.
How do I stop an agent from looping forever?
Add a max step counter and a loop detector. Track the last N actions - if the same tool+input combination appears 3 times, break the loop and ask for human input.
When should a human be in the loop?
Before destructive actions, when the agent's confidence is low, after consecutive failures, and before declaring a task complete. The goal is not to remove the human - it is to minimize unnecessary interruptions while keeping critical checkpoints.