
TL;DR
Master tool use in the Claude API. Schema design, retry logic, multi-step loops, and the failure modes that only show up at 10k calls a day.
Tool use is the feature that turns Claude from a chatbot into an agent. It is also the feature that, deployed casually, fails silently in ways that are hard to root-cause. We have run Claude tool use through production paths handling tens of thousands of daily calls across DD products, and almost every outage has been traced to one of a small number of patterns: ambiguous schemas, missing error handling on the executor side, runaway loops, or tools the model thought existed.
This is the production playbook. Schema design, the execution layer, multi-step loops, security, and what to monitor. Code samples are TypeScript with the official Anthropic SDK because that is what most of our deployed agents run on.
We walked through a live build of one of these in our Building Reliable Claude Agents video. This is the deeper writeup.
You pass tools to messages.create. Claude either responds with text, or with one or more tool_use blocks. You execute the tool, send back a tool_result block in the next user message, and the loop continues until Claude stops requesting tools.
The thing nobody tells you up front: Claude can hallucinate a tool call. It is rare on Sonnet 4.5 and above, but it happens, especially when your tool schemas overlap or when the user request is ambiguous. Your executor has to handle "tool name not found" as a normal case, not a crash. We will get to that.
A minimal correct loop looks like this:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description:
      "Get the current weather for a specific city. Returns temperature in Celsius and a short conditions string.",
    input_schema: {
      type: "object",
      properties: {
        city: {
          type: "string",
          description: "City name, e.g. 'San Francisco' or 'Tokyo'",
        },
      },
      required: ["city"],
    },
  },
];

async function runAgent(userMessage: string) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage },
  ];

  for (let iter = 0; iter < 10; iter++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-5",
      max_tokens: 1024,
      tools,
      messages,
    });

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason !== "tool_use") {
      return response;
    }

    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const result = await executeTool(block.name, block.input);
      toolResults.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: JSON.stringify(result),
        is_error: result.error !== undefined,
      });
    }
    messages.push({ role: "user", content: toolResults });
  }
  throw new Error("Max iterations exceeded");
}
Note three deliberate choices: a hard iteration cap, is_error set on the result when the tool fails, and tool_use_id matched correctly per call. Skip any of these and you are one bad day from an outage.
Schema quality is the single biggest predictor of tool-use reliability. The model picks tools based on names, descriptions, and parameter docs. If two tools sound similar, it will pick wrong, and the failure is invisible until a user complains.
Bad:
{ name: "search", description: "Search for information." }
{ name: "lookup", description: "Look up information." }
Good:
{
  name: "search_internal_kb",
  description:
    "Search Acme's internal knowledge base of product docs and runbooks. Use for questions about Acme features, APIs, or internal processes. Do not use for general web search.",
}
{
  name: "search_web",
  description:
    "Search the public web via Google. Use for current events, third-party software, or anything not covered by the internal KB.",
}
Rules of thumb we have converged on:
- Name tools with an explicit verb and object: get_user, get_user_by_email, list_users_in_org — not get, find, lookup.
- Use enum aggressively on string parameters with a fixed set of valid values. The model respects enums far more reliably than prose constraints.

A diagnostic worth running: take your tool list, paste it into Claude with the user message "which tool would you call for X?" for ten realistic prompts. If it picks wrong on any, the schemas are ambiguous.
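The enum rule in schema form. The tool name, values, and description here are illustrative, not from any real API:

```typescript
// A status parameter constrained by an enum. Claude sticks to these
// values far more reliably than to prose like "must be open or closed".
const listTicketsTool = {
  name: "list_tickets",
  description: "List support tickets filtered by status.",
  input_schema: {
    type: "object" as const,
    properties: {
      status: {
        type: "string",
        enum: ["open", "pending", "closed"],
        description: "Ticket status to filter by.",
      },
    },
    required: ["status"],
  },
};
```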
The naive executor is tools[name](input). The production executor handles: unknown tool names, schema validation, timeouts, retries, structured error responses, and logging. Here is the shape we run.
import { z, ZodSchema } from "zod";

interface ToolDef<I, O> {
  name: string;
  schema: ZodSchema<I>;
  handler: (input: I) => Promise<O>;
  timeoutMs: number;
}

const registry = new Map<string, ToolDef<any, any>>();

function register<I, O>(def: ToolDef<I, O>) {
  registry.set(def.name, def);
}

async function executeTool(name: string, rawInput: unknown) {
  const def = registry.get(name);
  if (!def) {
    // Hallucinated tool names are a normal case, not a crash.
    return { error: `Unknown tool: ${name}. Available: ${[...registry.keys()].join(", ")}` };
  }

  const parsed = def.schema.safeParse(rawInput);
  if (!parsed.success) {
    return { error: `Invalid input: ${parsed.error.message}` };
  }

  const start = Date.now();
  try {
    const result = await Promise.race([
      def.handler(parsed.data),
      new Promise((_, rej) =>
        setTimeout(() => rej(new Error("timeout")), def.timeoutMs)
      ),
    ]);
    // log() is whatever structured logger you already run.
    log("tool_call", { name, durationMs: Date.now() - start, ok: true });
    return { data: result };
  } catch (err) {
    log("tool_call", { name, durationMs: Date.now() - start, ok: false, err: String(err) });
    return { error: `Tool failed: ${err instanceof Error ? err.message : String(err)}` };
  }
}
The critical detail: errors come back as data, not exceptions. When you send is_error: true in the tool_result, Claude reads the error message and usually does the sensible thing — retries with corrected input, picks a different tool, or tells the user. Throwing an exception kills the loop.
This is also where you do retries. Transient network errors on a downstream API should retry inside the handler with exponential backoff. Permanent errors (4xx, validation) should return the error to Claude and let the model decide. The mental model: the handler is responsible for transient retries, Claude is responsible for semantic recovery.
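That split can be sketched with a small retry helper inside the handler. The helper name, option shape, and backoff numbers are illustrative, not part of the SDK:

```typescript
// Retry transient failures inside the handler with exponential backoff;
// permanent failures should instead come back to Claude as structured
// errors. `withRetry` is a hypothetical helper, not an SDK function.
async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { maxAttempts: number; baseDelayMs: number }
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < opts.maxAttempts - 1) {
        // Exponential backoff: base, 2x base, 4x base, ...
        const delay = opts.baseDelayMs * 2 ** attempt;
        await new Promise((res) => setTimeout(res, delay));
      }
    }
  }
  throw lastErr;
}
```

A handler would wrap only the downstream network call in withRetry, and still return 4xx-style failures as `{ error }` so Claude can do the semantic recovery.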
Real agents chain calls. The user asks "summarize last week's support tickets and email me the top three categories." That is list_tickets → categorize → send_email. Three sequential tool calls with state flowing between them.
Two failure modes show up here:
- The model loses track of state accumulated across earlier calls as the transcript grows. Mitigation: have tools return a current_state text block between tool calls, or have the orchestrator append a synthetic "so far you have learned: ..." message every N iterations.
- A single agent loop stops scaling. For workflows above ~5 steps, consider not solving the orchestration with a single agent loop. Decompose into a planner that emits a DAG and an executor that runs nodes. We covered this pattern in Seven AI Agent Orchestration Patterns.
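The state-reinjection mitigation can be sketched as a small orchestrator hook. The message shape and the summarize helper are hypothetical; you would plug this into your own loop:

```typescript
// Every N iterations, append a synthetic user message restating what the
// agent has learned so far, so accumulated state survives a long
// transcript. `summarize` is a hypothetical helper you would implement.
type Msg = { role: "user" | "assistant"; content: string };

function maybeReinjectState(
  messages: Msg[],
  iter: number,
  everyN: number,
  summarize: (msgs: Msg[]) => string
): Msg[] {
  if (iter > 0 && iter % everyN === 0) {
    messages.push({
      role: "user",
      content: `So far you have learned: ${summarize(messages)}`,
    });
  }
  return messages;
}
```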
A tool is anything Claude can invoke. The user can influence what Claude invokes. So a user can, transitively, influence your tools. This is prompt injection 101 and it bites every team that ships tool use without thinking about it.
Hardening checklist we apply to every tool:
- Scope permissions at the executor, not in the prompt: a read_file tool should be scoped to a sandbox directory. A query_db tool should only see specific tables. Never trust the LLM to constrain itself.

The red-team test: assume the user message is hostile. Can they extract data they should not see, or trigger an action they should not be able to trigger? If yes, scope the tool tighter.
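A minimal sketch of executor-side scoping for a read_file tool, assuming a fixed sandbox root (the constant and function names are illustrative):

```typescript
import * as path from "node:path";

// Resolve the requested path and refuse anything that escapes the sandbox.
// Enforced in code, so no prompt content can read outside SANDBOX_ROOT.
// SANDBOX_ROOT is an illustrative constant.
const SANDBOX_ROOT = path.resolve("/srv/agent-sandbox");

function resolveSandboxed(requested: string): string {
  const resolved = path.resolve(SANDBOX_ROOT, requested);
  // The path.sep guard stops "/srv/agent-sandbox-evil" from matching.
  if (resolved !== SANDBOX_ROOT && !resolved.startsWith(SANDBOX_ROOT + path.sep)) {
    throw new Error(`Path escapes sandbox: ${requested}`);
  }
  return resolved;
}
```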
Performance and reasoning quality degrade as the tool count grows: every extra definition adds input tokens and another chance for the model to pick the wrong tool.
For agents with large tool surfaces (e.g., MCP servers exposing 50+ resources each), use a two-stage pattern: first call selects a tool category with a small fixed tool list, second call exposes only the tools in that category. This is roughly how Claude Code handles its bundled toolset internally.
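The two-stage pattern can be sketched like this. The category names, tool shapes, and function names are illustrative:

```typescript
// Two-stage tool exposure: the first call sees only a category selector,
// the second call sees only the tools in the chosen category.
interface Tool { name: string; description: string }

const toolsByCategory: Record<string, Tool[]> = {
  tickets: [{ name: "list_tickets", description: "List support tickets" }],
  email: [{ name: "send_email", description: "Send an email" }],
};

// Stage 1: a single selector tool whose enum is the category list.
function selectorTool() {
  return {
    name: "select_tool_category",
    description: "Pick which category of tools is needed for this request.",
    input_schema: {
      type: "object" as const,
      properties: {
        category: { type: "string", enum: Object.keys(toolsByCategory) },
      },
      required: ["category"],
    },
  };
}

// Stage 2: expose only that category's tools on the follow-up request.
function toolsForCategory(category: string): Tool[] {
  return toolsByCategory[category] ?? [];
}
```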
Cost-wise, every tool definition lives in the system prompt and ships on every request. Cache them. The prompt caching guide covers exactly how to put tool definitions inside a cache block so a 50-tool agent does not bleed money on input tokens.
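A sketch of the caching move, assuming the API's cache_control breakpoint on the last tool definition caches the whole tool-array prefix (check the caching guide for the current rules):

```typescript
// Mark the last tool with a cache breakpoint so the entire tool-definition
// prefix is reused across requests instead of re-billed as fresh input.
// `tools` is your full tool array; the helper name is illustrative.
function withCachedTools<T extends object>(tools: T[]): T[] {
  return tools.map((tool, i) =>
    i === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } }
      : tool
  );
}
```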
Tool-level metrics catch almost every tool-use regression in production. The first to watch is the unknown-tool rate: how often executeTool returns "Unknown tool". It should be near zero; spikes mean someone deployed a tool list change that broke a code path.

The easy mistake is monitoring only the agent's overall success rate. By the time that drops, you have hours of broken sessions. Tool-level metrics catch problems within minutes.
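A minimal in-process sketch of tracking per-tool outcomes; in production you would emit these to your metrics backend instead. All names here are illustrative:

```typescript
// Count per-tool outcomes so you can alert on the unknown-tool and
// error rates per tool name, not just overall agent success.
type Outcome = "ok" | "error" | "unknown_tool";

const counters = new Map<string, number>();

function recordToolOutcome(name: string, outcome: Outcome): void {
  const key = `${name}:${outcome}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

function rate(name: string, outcome: Outcome): number {
  const total = [...counters.entries()]
    .filter(([k]) => k.startsWith(`${name}:`))
    .reduce((sum, [, v]) => sum + v, 0);
  return total === 0 ? 0 : (counters.get(`${name}:${outcome}`) ?? 0) / total;
}
```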
Tool use is a sharp edge. Treat it like one and it scales cleanly. For the next layer up — long-running, stateful integrations — see our guide to building MCP servers.