
AI Agents Deep Dive (7 parts)
TL;DR
AI agents fail in ways traditional debugging cannot catch. Here are the tools and patterns for finding and fixing broken agent loops, tool failures, and context issues.
Traditional debugging is about finding where code breaks. Agent debugging is about finding where reasoning breaks. The code runs fine. The model just made the wrong decision.
Here are the patterns that actually work.
You need visibility into three things: every tool call the agent makes, how full the context window is, and the reasoning behind each decision.
Without all three, you are guessing.
Log every tool call with structured data. Not just "tool called" - the full input, output, and timing.
interface ToolLog {
  tool: string;
  input: Record<string, unknown>;
  output: unknown;
  durationMs: number;
  timestamp: number;
  step: number;
}
function wrapTool<T>(name: string, fn: (input: T) => Promise<unknown>) {
  return async (input: T, step: number): Promise<{ result: unknown; log: ToolLog }> => {
    const start = Date.now();
    try {
      const result = await fn(input);
      const log: ToolLog = {
        tool: name,
        input: input as Record<string, unknown>,
        output: result,
        durationMs: Date.now() - start,
        timestamp: start,
        step,
      };
      return { result, log };
    } catch (error) {
      const log: ToolLog = {
        tool: name,
        input: input as Record<string, unknown>,
        output: { error: String(error) },
        durationMs: Date.now() - start,
        timestamp: start,
        step,
      };
      return { result: null, log };
    }
  };
}
When an agent goes wrong, you can trace the exact sequence: step 3 called search_files with the wrong query, got no results, then hallucinated the file content.
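As a self-contained sketch, here is the wrapper applied to a hypothetical search_files tool. The tool, its file list, and the query are illustrative, not from a real library; the wrapper is re-declared compactly so the example runs on its own.

```typescript
// Compact re-declaration of the wrapper so this sketch runs standalone.
type ToolLog = {
  tool: string; input: Record<string, unknown>; output: unknown;
  durationMs: number; timestamp: number; step: number;
};

function wrapTool<T>(name: string, fn: (input: T) => Promise<unknown>) {
  return async (input: T, step: number) => {
    const start = Date.now();
    let output: unknown;
    let result: unknown = null;
    try {
      result = await fn(input);
      output = result;
    } catch (error) {
      output = { error: String(error) };
    }
    const log: ToolLog = {
      tool: name, input: input as Record<string, unknown>, output,
      durationMs: Date.now() - start, timestamp: start, step,
    };
    return { result, log };
  };
}

// Hypothetical tool: a fake file search over a fixed list.
const searchFiles = wrapTool("search_files", async (input: { query: string }) => {
  const files = ["src/agent.ts", "src/tools.ts", "README.md"];
  return files.filter((f) => f.includes(input.query));
});

searchFiles({ query: "agent" }, 3).then(({ result, log }) =>
  console.log(log.tool, log.step, result)
);
```

The returned log carries the step number, so a failing run can be lined up against its trace entry directly.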
The most common agent failure is context overflow. The agent loses important information because the context window filled up with tool outputs.
// Assumes Message and ContextSnapshot types and an estimateTokens helper are defined elsewhere.
function trackContext(messages: Message[]): ContextSnapshot {
  const totalTokens = estimateTokens(messages);
  const breakdown = messages.map((m) => ({
    role: m.role,
    tokens: estimateTokens([m]),
    preview: m.content.slice(0, 100),
  }));
  return {
    totalTokens,
    maxTokens: 200_000,
    utilization: totalTokens / 200_000,
    breakdown,
    warning: totalTokens > 150_000 ? "Context 75%+ full" : null,
  };
}
If your agent starts failing after 10+ steps, it is almost always context overflow. The fix: summarize intermediate results instead of keeping raw tool outputs.
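One way to apply that fix, sketched under assumptions: old, oversized tool messages get compacted while the last few messages stay verbatim. The threshold and the keep-recent window are illustrative defaults, and the truncation stands in for a real summarization call to a cheap model.

```typescript
interface Message { role: string; content: string }

// Replace large, old tool outputs so the context keeps the original task and
// recent results instead of every raw intermediate blob.
function compactToolOutputs(
  messages: Message[],
  maxChars = 2_000,   // assumed per-message threshold
  keepRecent = 3      // always keep the last few messages verbatim
): Message[] {
  return messages.map((m, i) => {
    const isOld = i < messages.length - keepRecent;
    if (m.role === "tool" && isOld && m.content.length > maxChars) {
      // In practice, call a cheap model to summarize; truncation is a placeholder.
      return { ...m, content: m.content.slice(0, maxChars) + "\n[truncated; full output in trace]" };
    }
    return m;
  });
}
```

Run this before each model call, not after the context is already full.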
Before each action, ask the agent to explain its reasoning in structured form.
import { z } from "zod";

const decisionSchema = z.object({
  observation: z.string().describe("What I see in the current state"),
  reasoning: z.string().describe("Why I chose this action"),
  action: z.string().describe("What I will do next"),
  confidence: z.number().min(0).max(1).describe("How confident I am"),
  alternatives: z.array(z.string()).describe("Other actions I considered"),
});
When confidence drops below 0.5, you know exactly where the agent got uncertain. This is where human review adds the most value.
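A sketch of that gate in code. The Decision type mirrors the schema above; the execute and review callbacks and the 0.5 threshold are illustrative assumptions, not a fixed API.

```typescript
interface Decision {
  observation: string;
  reasoning: string;
  action: string;
  confidence: number; // 0..1
  alternatives: string[];
}

// Route low-confidence decisions to a human reviewer instead of executing
// them blindly. Threshold and callbacks are placeholders.
function gateDecision(
  decision: Decision,
  execute: (d: Decision) => string,
  review: (d: Decision) => string,
  threshold = 0.5
): string {
  return decision.confidence < threshold ? review(decision) : execute(decision);
}
```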
Save the full agent trajectory so you can replay it.
import fs from "node:fs/promises";

interface AgentTrajectory {
  task: string;
  steps: {
    thought: string;
    action: string;
    toolInput: unknown;
    toolOutput: unknown;
    contextTokens: number;
  }[];
  outcome: "success" | "failure" | "timeout";
  totalSteps: number;
  totalDurationMs: number;
}

// Save trajectory
async function saveTrajectory(trajectory: AgentTrajectory) {
  // Slugify the task so the filename stays filesystem-safe.
  const slug = trajectory.task.slice(0, 30).replace(/[^a-zA-Z0-9-]/g, "_");
  const id = `${Date.now()}-${slug}`;
  await fs.writeFile(
    `./traces/${id}.json`,
    JSON.stringify(trajectory, null, 2)
  );
}
When a similar task fails, diff the successful trajectory against the failing one. The divergence point is usually the bug.
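Finding that divergence point can be sketched as a simple walk over two step lists. Comparing action plus tool input is one possible heuristic (outputs are expected to differ legitimately); the Step shape mirrors the trajectory steps above.

```typescript
interface Step {
  thought: string;
  action: string;
  toolInput: unknown;
  toolOutput: unknown;
  contextTokens: number;
}

// Return the index of the first step where two trajectories diverge,
// or -1 if they match step for step.
function findDivergence(good: Step[], bad: Step[]): number {
  const len = Math.min(good.length, bad.length);
  for (let i = 0; i < len; i++) {
    if (
      good[i].action !== bad[i].action ||
      JSON.stringify(good[i].toolInput) !== JSON.stringify(bad[i].toolInput)
    ) {
      return i;
    }
  }
  // One trajectory is a prefix of the other; divergence is where the shorter ends.
  return good.length === bad.length ? -1 : len;
}
```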
If you are using Claude Code, hooks give you deterministic debugging points.
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": ".*",
        "command": "echo \"Tool: $TOOL_NAME | Exit: $EXIT_CODE\" >> /tmp/claude-debug.log"
      }
    ],
    "Stop": [
      {
        "command": "echo \"Session ended at $(date)\" >> /tmp/claude-debug.log"
      }
    ]
  }
}
Every tool call gets logged. Every session end gets recorded. Review the log when something goes wrong.
Infinite loops. The agent keeps retrying the same action. Fix: add a step counter and bail after N attempts.
Tool misuse. The agent calls a tool with the wrong arguments. Fix: improve tool descriptions and add input validation.
Context poisoning. A large tool output fills the context with irrelevant data. Fix: truncate or summarize tool outputs before adding to context.
Premature termination. The agent thinks it is done but it is not. Fix: add verification steps that check the actual result against the original task.
Wrong tool selection. The agent picks the wrong tool for the job. Fix: make tool descriptions more specific about when to use each tool.
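The first fix above (a step counter with a bail-out) pairs naturally with repeat detection. Here is a sketch; the class shape, the 25-step cap, and the 3-repeat threshold are illustrative defaults.

```typescript
// Guard against runaway agents: cap total steps and break when the same
// tool+input signature keeps repeating.
class LoopGuard {
  private history: string[] = [];
  constructor(private maxSteps = 25, private maxRepeats = 3) {}

  // Returns a reason to stop, or null to continue.
  check(tool: string, input: unknown): string | null {
    const signature = `${tool}:${JSON.stringify(input)}`;
    this.history.push(signature);
    if (this.history.length > this.maxSteps) {
      return `Exceeded ${this.maxSteps} steps`;
    }
    const repeats = this.history.filter((s) => s === signature).length;
    if (repeats >= this.maxRepeats) {
      return `Repeated ${signature} ${repeats} times; likely stuck`;
    }
    return null;
  }
}
```

Call check before executing each tool; on a non-null result, stop the loop and escalate to a human instead of retrying.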
Not every agent failure needs a code fix. Sometimes the right answer is human review at critical points: before destructive actions, when confidence is low, after repeated failures, and before the agent declares a task complete.
The best agent systems are not fully autonomous. They are autonomous for the easy parts and interactive for the hard parts.
What is the most common agent failure mode?
Context overflow. After enough tool calls, the context window fills with intermediate results and the agent loses track of the original task. The fix is summarizing intermediate results and managing context deliberately.
How do I debug a Claude Code agent session?
Use hooks to log every tool call. Add a PostToolUse hook that records the tool name, input, and exit code. Review the log file to trace the exact decision sequence. The /transcript command also helps.
Do I need structured logging, or is plain text enough?
Yes, you need structure. Structured tool logs (JSON with tool name, input, output, duration, step number) are essential. You can filter, query, and diff them. Plain text logs are almost useless for multi-step agent debugging.
How do I stop an agent from looping forever?
Add a max step counter and a loop detector. Track the last N actions - if the same tool+input combination appears 3 times, break the loop and ask for human input.
When should a human be in the loop?
Before destructive actions, when the agent's confidence is low, after consecutive failures, and before declaring a task complete. The goal is not to remove the human - it is to minimize unnecessary interruptions while keeping critical checkpoints.