
TL;DR
Master tool use in the Claude API. Schema design, retry logic, multi-step loops, and the failure modes that only show up at 10k calls a day.
Tool use is the feature that turns Claude from a chatbot into an agent. It is also the feature that, deployed casually, fails silently in ways that are hard to root-cause. We have run Claude tool use through production paths handling tens of thousands of daily calls across DD products, and almost every outage has been traced to one of a small number of patterns: ambiguous schemas, missing error handling on the executor side, runaway loops, or tools the model thought existed.
This is the production playbook. Schema design, the execution layer, multi-step loops, security, and what to monitor. Code samples are TypeScript with the official Anthropic SDK because that is what most of our deployed agents run on.
We walked through a live build of one of these in our Building Reliable Claude Agents video. This is the deeper writeup.
You pass tools to messages.create. Claude either responds with text, or with one or more tool_use blocks. You execute the tool, send back a tool_result block in the next user message, and the loop continues until Claude stops requesting tools.
The thing nobody tells you up front: Claude can hallucinate a tool call. It is rare on Sonnet 4.5 and above, but it happens, especially when your tool schemas overlap or when the user request is ambiguous. Your executor has to handle "tool name not found" as a normal case, not a crash. We will get to that.
A minimal correct loop looks like this:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description:
      "Get the current weather for a specific city. Returns temperature in Celsius and a short conditions string.",
    input_schema: {
      type: "object",
      properties: {
        city: {
          type: "string",
          description: "City name, e.g. 'San Francisco' or 'Tokyo'",
        },
      },
      required: ["city"],
    },
  },
];

async function runAgent(userMessage: string) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage },
  ];

  for (let iter = 0; iter < 10; iter++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      tools,
      messages,
    });

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason !== "tool_use") {
      return response;
    }

    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const result = await executeTool(block.name, block.input);
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: JSON.stringify(result),
        is_error: result.error !== undefined,
      });
    }
    messages.push({ role: "user", content: toolResults });
  }
  throw new Error("Max iterations exceeded");
}
Note three deliberate choices: a hard iteration cap, is_error set on the result when the tool fails, and tool_use_id matched correctly per call. Skip any of these and you are one bad day from an outage.
Schema quality is the single biggest predictor of tool-use reliability. The model picks tools based on names, descriptions, and parameter docs. If two tools sound similar, it will pick wrong, and the failure is invisible until a user complains.
Bad:
{ name: "search", description: "Search for information." }
{ name: "lookup", description: "Look up information." }
Good:
{
  name: "search_internal_kb",
  description:
    "Search Acme's internal knowledge base of product docs and runbooks. Use for questions about Acme features, APIs, or internal processes. Do not use for general web search.",
}
{
  name: "search_web",
  description:
    "Search the public web via Google. Use for current events, third-party software, or anything not covered by the internal KB.",
}
Rules of thumb we have converged on:
- Name tools with an explicit verb and object: get_user, get_user_by_email, list_users_in_org — not get, find, lookup.
- Use enum aggressively on string parameters with a fixed set of valid values. The model respects enums far more reliably than prose constraints.

A diagnostic worth running: take your tool list, paste it into Claude with the user message "which tool would you call for X?" for ten realistic prompts. If it picks wrong on any, the schemas are ambiguous.
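The enum rule in schema form. The tool name, values, and description here are illustrative, not from any real API:

```typescript
// A status parameter constrained by an enum. Claude sticks to these
// values far more reliably than to prose like "must be open or closed".
const listTicketsTool = {
  name: "list_tickets",
  description: "List support tickets filtered by status.",
  input_schema: {
    type: "object" as const,
    properties: {
      status: {
        type: "string",
        enum: ["open", "pending", "closed"],
        description: "Ticket status to filter by.",
      },
    },
    required: ["status"],
  },
};
```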
The naive executor is tools[name](input). The production executor handles: unknown tool names, schema validation, timeouts, retries, structured error responses, and logging. Here is the shape we run.
import { z, ZodSchema } from "zod";

interface ToolDef<I, O> {
  name: string;
  schema: ZodSchema<I>;
  handler: (input: I) => Promise<O>;
  timeoutMs: number;
}

const registry = new Map<string, ToolDef<any, any>>();

function register<I, O>(def: ToolDef<I, O>) {
  registry.set(def.name, def);
}

async function executeTool(name: string, rawInput: unknown) {
  const def = registry.get(name);
  if (!def) {
    // Hallucinated tool names are a normal case, not a crash.
    return { error: `Unknown tool: ${name}. Available: ${[...registry.keys()].join(", ")}` };
  }

  const parsed = def.schema.safeParse(rawInput);
  if (!parsed.success) {
    return { error: `Invalid input: ${parsed.error.message}` };
  }

  const start = Date.now();
  try {
    const result = await Promise.race([
      def.handler(parsed.data),
      new Promise((_, rej) =>
        setTimeout(() => rej(new Error("timeout")), def.timeoutMs)
      ),
    ]);
    // log() is whatever structured logger you already run.
    log("tool_call", { name, durationMs: Date.now() - start, ok: true });
    return { data: result };
  } catch (err) {
    log("tool_call", { name, durationMs: Date.now() - start, ok: false, err: String(err) });
    return { error: `Tool failed: ${err instanceof Error ? err.message : String(err)}` };
  }
}
The critical detail: errors come back as data, not exceptions. When you send is_error: true in the tool_result, Claude reads the error message and usually does the sensible thing — retries with corrected input, picks a different tool, or tells the user. Throwing an exception kills the loop.
This is also where you do retries. Transient network errors on a downstream API should retry inside the handler with exponential backoff. Permanent errors (4xx, validation) should return the error to Claude and let the model decide. The mental model: the handler is responsible for transient retries, Claude is responsible for semantic recovery.
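That split can be sketched with a small retry helper inside the handler. The helper name, option shape, and backoff numbers are illustrative, not part of the SDK:

```typescript
// Retry transient failures inside the handler with exponential backoff;
// permanent failures should instead come back to Claude as structured
// errors. `withRetry` is a hypothetical helper, not an SDK function.
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { maxAttempts: number; baseDelayMs: number }
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < opts.maxAttempts - 1) {
        // Exponential backoff: base, 2x base, 4x base, ...
        const delay = opts.baseDelayMs * 2 ** attempt;
        await new Promise((res) => setTimeout(res, delay));
      }
    }
  }
  throw lastErr;
}
```

A handler would wrap only the downstream network call in withRetry, and still return 4xx-style failures as `{ error }` so Claude can do the semantic recovery.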
Real agents chain calls. The user asks "summarize last week's support tickets and email me the top three categories." That is list_tickets → categorize → send_email. Three sequential tool calls with state flowing between them.
Two failure modes show up here:
- The model loses track of state accumulated across earlier calls as the transcript grows. Mitigation: have tools return a current_state text block between tool calls, or have the orchestrator append a synthetic "so far you have learned: ..." message every N iterations.
- A single agent loop stops scaling. For workflows above ~5 steps, consider not solving the orchestration with a single agent loop. Decompose into a planner that emits a DAG and an executor that runs nodes. We covered this pattern in Seven AI Agent Orchestration Patterns.
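The state-reinjection mitigation can be sketched as a small orchestrator hook. The message shape and the summarize helper are hypothetical; you would plug this into your own loop:

```typescript
// Every N iterations, append a synthetic user message restating what the
// agent has learned so far, so accumulated state survives a long
// transcript. `summarize` is a hypothetical helper you would implement.
type Msg = { role: "user" | "assistant"; content: string };

function maybeReinjectState(
  messages: Msg[],
  iter: number,
  everyN: number,
  summarize: (msgs: Msg[]) => string
): Msg[] {
  if (iter > 0 && iter % everyN === 0) {
    messages.push({
      role: "user",
      content: `So far you have learned: ${summarize(messages)}`,
    });
  }
  return messages;
}
```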
A tool is anything Claude can invoke. The user can influence what Claude invokes. So a user can, transitively, influence your tools. This is prompt injection 101 and it bites every team that ships tool use without thinking about it.
Hardening checklist we apply to every tool:
- Scope permissions at the executor, not in the prompt: a read_file tool should be scoped to a sandbox directory. A query_db tool should only see specific tables. Never trust the LLM to constrain itself.

The red-team test: assume the user message is hostile. Can they extract data they should not see, or trigger an action they should not be able to trigger? If yes, scope the tool tighter.
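A minimal sketch of executor-side scoping for a read_file tool, assuming a fixed sandbox root (the constant and function names are illustrative):

```typescript
import * as path from "node:path";

// Resolve the requested path and refuse anything that escapes the sandbox.
// Enforced in code, so no prompt content can read outside SANDBOX_ROOT.
// SANDBOX_ROOT is an illustrative constant.
const SANDBOX_ROOT = path.resolve("/srv/agent-sandbox");

function resolveSandboxed(requested: string): string {
  const resolved = path.resolve(SANDBOX_ROOT, requested);
  // The path.sep guard stops "/srv/agent-sandbox-evil" from matching.
  if (resolved !== SANDBOX_ROOT && !resolved.startsWith(SANDBOX_ROOT + path.sep)) {
    throw new Error(`Path escapes sandbox: ${requested}`);
  }
  return resolved;
}
```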
Performance and reasoning quality degrade as the tool count grows: every extra definition adds input tokens and another chance for the model to pick the wrong tool.
For agents with large tool surfaces (e.g., MCP servers exposing 50+ resources each), use a two-stage pattern: first call selects a tool category with a small fixed tool list, second call exposes only the tools in that category. This is roughly how Claude Code handles its bundled toolset internally.
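The two-stage pattern can be sketched like this. The category names, tool shapes, and function names are illustrative:

```typescript
// Two-stage tool exposure: the first call sees only a category selector,
// the second call sees only the tools in the chosen category.
interface Tool { name: string; description: string }

const toolsByCategory: Record<string, Tool[]> = {
  tickets: [{ name: "list_tickets", description: "List support tickets" }],
  email: [{ name: "send_email", description: "Send an email" }],
};

// Stage 1: a single selector tool whose enum is the category list.
function selectorTool() {
  return {
    name: "select_tool_category",
    description: "Pick which category of tools is needed for this request.",
    input_schema: {
      type: "object" as const,
      properties: {
        category: { type: "string", enum: Object.keys(toolsByCategory) },
      },
      required: ["category"],
    },
  };
}

// Stage 2: expose only that category's tools on the follow-up request.
function toolsForCategory(category: string): Tool[] {
  return toolsByCategory[category] ?? [];
}
```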
Cost-wise, every tool definition lives in the system prompt and ships on every request. Cache them. The prompt caching guide covers exactly how to put tool definitions inside a cache block so a 50-tool agent does not bleed money on input tokens.
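A sketch of the caching move, assuming the API's cache_control breakpoint on the last tool definition caches the whole tool-array prefix (check the caching guide for the current rules):

```typescript
// Mark the last tool with a cache breakpoint so the entire tool-definition
// prefix is reused across requests instead of re-billed as fresh input.
// `tools` is your full tool array; the helper name is illustrative.
function withCachedTools<T extends object>(tools: T[]): T[] {
  return tools.map((tool, i) =>
    i === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } }
      : tool
  );
}
```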
Tool-level metrics catch almost every tool-use regression in production. The first to watch is the unknown-tool rate: how often executeTool returns "Unknown tool". It should be near zero; spikes mean someone deployed a tool list change that broke a code path.

The easy mistake is monitoring only the agent's overall success rate. By the time that drops, you have hours of broken sessions. Tool-level metrics catch problems within minutes.
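A minimal in-process sketch of tracking per-tool outcomes; in production you would emit these to your metrics backend instead. All names here are illustrative:

```typescript
// Count per-tool outcomes so you can alert on the unknown-tool and
// error rates per tool name, not just overall agent success.
type Outcome = "ok" | "error" | "unknown_tool";

const counters = new Map<string, number>();

function recordToolOutcome(name: string, outcome: Outcome): void {
  const key = `${name}:${outcome}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

function rate(name: string, outcome: Outcome): number {
  const total = [...counters.entries()]
    .filter(([k]) => k.startsWith(`${name}:`))
    .reduce((sum, [, v]) => sum + v, 0);
  return total === 0 ? 0 : (counters.get(`${name}:${outcome}`) ?? 0) / total;
}
```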
Tool use is a sharp edge. Treat it like one and it scales cleanly. For the next layer up — long-running, stateful integrations — see our guide to building MCP servers.