
TL;DR
How to ship Claude's vision API in production. OCR, charts, UI audits, real cost numbers, TypeScript SDK code, and the gotchas that bite at 100k images a month.
The first time you hand Claude a screenshot and it correctly extracts every field on a messy invoice, the temptation is to stick that in a loop and bill your way to a finished product. Then the bill arrives, half your images came back with subtle hallucinations, and you discover the support team has been emailing you screenshots that are actually photos of laptop screens with glare on them.
Claude's vision API is genuinely strong on structured imagery - documents, charts, diagrams, UI screenshots, code on screens. It is genuinely weak on the failure modes you only notice in production: low-resolution thumbnails, handwriting, rotated photos, images where the relevant content is 5% of the pixels. This guide covers the production patterns we use to ship vision pipelines that actually hold up at volume.
We walked through some of the more counterintuitive tricks in our Vision API Tricks: Extract Data from Screenshots video on YouTube. This post is the long-form companion.
Claude reliably crushes:

- Printed documents, invoices, and receipts
- Charts and diagrams
- UI screenshots
- Code on screens

Claude struggles with, in our experience:

- Low-resolution thumbnails
- Handwriting
- Rotated or skewed photos
- Images where the relevant content is a small fraction of the pixels
Treat the second list as warning signs, not deal-breakers. With the right preprocessing and prompting, most of them become usable. Without it, you ship hallucinations.
Vision pricing is token-based, but the token count is computed from the image dimensions. The exact formula is roughly tokens ~ (width x height) / 750, applied after Anthropic's resizing logic, which scales images down to fit within 1568px on the long edge. Two practical consequences:

- Downscaling before upload directly cuts cost; a full-resolution phone photo costs several times what its 1568px version does, usually with no accuracy loss on text-heavy content.
- Anything larger than 1568px on the long edge gets resized server-side anyway, so sending full-resolution originals wastes bandwidth without buying accuracy.
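The arithmetic above is easy to get wrong at budgeting time, so here is a minimal sketch of a cost estimator. It assumes the (width x height) / 750 rule and the 1568px long-edge resize described above; the dollars-per-million-tokens figure is an assumption you should check against current Sonnet pricing before relying on it.

```typescript
// Back-of-envelope image token and cost estimator.
const MAX_EDGE = 1568;            // long-edge cap applied by Anthropic's resizing
const TOKENS_PER_PIXEL = 1 / 750; // tokens ~ (w x h) / 750
const INPUT_USD_PER_MTOK = 3.0;   // ASSUMED Sonnet input price; verify before use

export function estimateImageTokens(width: number, height: number): number {
  // Scale down so both edges fit within MAX_EDGE; never upscale.
  const scale = Math.min(1, MAX_EDGE / width, MAX_EDGE / height);
  const w = Math.round(width * scale);
  const h = Math.round(height * scale);
  return Math.ceil(w * h * TOKENS_PER_PIXEL);
}

export function estimateImageCostUSD(width: number, height: number): number {
  return (estimateImageTokens(width, height) / 1_000_000) * INPUT_USD_PER_MTOK;
}
```

A 3024x4032 phone photo resizes to 1176x1568 and lands around 2,459 tokens, well under a cent at the assumed rate, which is consistent with the per-receipt numbers later in this guide.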
Supported formats are JPEG, PNG, GIF, and WebP. You can pass a URL or base64. URL is preferable when you have a CDN; base64 is fine for pipelines that already have the bytes in memory.
For most production workloads, the right preprocessing pipeline is:

1. Apply EXIF rotation so the orientation is baked into the bytes.
2. Downscale so the long edge is at most 1568px, never upscaling.
3. Re-encode as JPEG at quality around 85.
This typically cuts image costs 40 to 70% with no measurable accuracy loss on text-heavy imagery.
Here is a minimal but production-shaped vision call using the official Anthropic SDK. It accepts a base64 image, asks for structured extraction, and uses prompt caching on the instruction prefix so a high-volume pipeline does not pay full input cost on every call.
import Anthropic from "@anthropic-ai/sdk";
import sharp from "sharp";

const client = new Anthropic();

async function preprocessImage(buf: Buffer): Promise<{ data: string; mediaType: "image/jpeg" }> {
  const out = await sharp(buf)
    .rotate() // bake EXIF orientation into the bytes
    .resize({ width: 1568, height: 1568, fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 85 })
    .toBuffer();
  return { data: out.toString("base64"), mediaType: "image/jpeg" };
}

const EXTRACTION_PROMPT = `You are a precise document extractor. Return only valid JSON matching:
{ "vendor": string, "date": string, "total": number, "line_items": [{"description": string, "amount": number}] }
If a field is unreadable, use null. Do not invent values.`;

export async function extractInvoice(imageBytes: Buffer) {
  const { data, mediaType } = await preprocessImage(imageBytes);
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: EXTRACTION_PROMPT,
        cache_control: { type: "ephemeral" }, // cache the stable instruction prefix
      },
    ],
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: mediaType, data },
          },
          { type: "text", text: "Extract this invoice." },
        ],
      },
    ],
  });
  const text = response.content.find((b) => b.type === "text");
  if (!text || text.type !== "text") throw new Error("no text response");
  return JSON.parse(text.text);
}
A few non-obvious things this shape captures:

- The instruction lives in system with cache_control, not in the user turn. Vision pipelines tend to have a stable instruction and a varying image; cache the stable part.
- Bare JSON.parse on the text block is the fragile part. Add a <json> tag instruction or use a tool-use schema to lock it down.

For schema-locked extraction, use tool use. Define a tool with the exact JSON Schema you want, force tool_choice to that tool, and you get back a guaranteed-valid object. We covered the details in tool use patterns.
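To make the schema-locked variant concrete, here is a sketch of the request shape. The tool name record_invoice and the helper buildExtractionParams are illustrative, not part of the SDK; the returned object is what you would pass to client.messages.create.

```typescript
// Schema-locked invoice extraction via forced tool use (sketch).
// The tool name and schema are illustrative assumptions.
const invoiceTool = {
  name: "record_invoice",
  description: "Record the fields extracted from an invoice image.",
  input_schema: {
    type: "object" as const,
    properties: {
      vendor: { type: ["string", "null"] },
      date: { type: ["string", "null"] },
      total: { type: ["number", "null"] },
      line_items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            description: { type: "string" },
            amount: { type: "number" },
          },
          required: ["description", "amount"],
        },
      },
    },
    required: ["vendor", "date", "total", "line_items"],
  },
};

export function buildExtractionParams(imageBase64: string) {
  return {
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    tools: [invoiceTool],
    // Forcing tool_choice means the reply is a tool_use block whose
    // input matches input_schema -- no JSON.parse on free text.
    tool_choice: { type: "tool" as const, name: "record_invoice" },
    messages: [
      {
        role: "user" as const,
        content: [
          {
            type: "image" as const,
            source: {
              type: "base64" as const,
              media_type: "image/jpeg" as const,
              data: imageBase64,
            },
          },
          { type: "text" as const, text: "Extract this invoice." },
        ],
      },
    ],
  };
}
```

Usage would look like `const response = await client.messages.create(buildExtractionParams(data));`, then read the input off the tool_use block in response.content.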
1. Invoice and receipt OCR. Vision plus a strict schema delivers extraction accuracy in the high 90s on real-world receipts - better than most dedicated OCR services on messy inputs, because Claude can reason about layout. Cost runs about $0.005 to $0.015 per receipt depending on resolution.
2. Bug-report screenshot triage. Customer pastes a screenshot, vision extracts the visible error message, the URL bar, the browser, and the visible state. Auto-tagged into the issue tracker. We have seen support teams cut triage time by 60% on this single workflow.
3. UI accessibility audits. Feed a screenshot of a page, ask for WCAG violations. Claude is shockingly good at "this contrast looks bad," "the touch target is too small," "this label is missing for the icon." Not a replacement for axe-core, but an excellent complement on visual issues automated tools miss.
4. Chart-to-data extraction. Bar charts, line charts, scatter plots - Claude returns reasonable JSON of the underlying series. Watch out: numbers are estimates derived from pixels, not ground truth. Use it for "what is this chart showing," not for downstream analytics.
5. Diagram-to-code. Hand-drawn architecture diagrams or flowcharts converted to mermaid, plantuml, or actual code stubs. Underrated workflow for design reviews.
Volume changes everything. At one image per minute you can ignore the failure modes; at 100k a month they dominate.
The pipeline shape we have ended up with after several iterations:
[ingest] -> [validate] -> [preprocess] -> [hash + cache lookup]
-> [vision call w/ retries] -> [schema validate]
-> [confidence gate] -> [human review or auto-accept]
-> [store + index]
Each step earns its keep:

- Validate rejects images over the 5MB / 8000px limits before you pay for a failed call.
- Preprocess bakes in EXIF rotation and downscales to the 1568px sweet spot.
- Hash + cache lookup dedupes identical bytes so repeat uploads cost nothing.
- Retries absorb bursty overload errors with exponential backoff and jitter.
- Schema validate rejects malformed or truncated JSON before it reaches storage.
- Confidence gate: ask the model to emit a confidence field, and route low-confidence outputs to human review. This is the single highest-impact reliability lever.

For monitoring spend across a high-volume vision pipeline, CodeBurn surfaces per-route token cost so you can tell instantly when an image-preprocessing regression doubles your bill.
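The schema-validate and confidence-gate steps can be sketched as one pure routing function. The Extraction shape and the 0.8 threshold are assumptions; tune the threshold against your own review-queue capacity.

```typescript
// Sketch of the schema-validate + confidence-gate pipeline steps.
// Field shape and threshold are illustrative assumptions.
interface Extraction {
  vendor: string | null;
  total: number | null;
  confidence: number; // model-reported 0..1, prompted for explicitly
}

type Verdict =
  | { route: "auto-accept"; value: Extraction }
  | { route: "human-review"; reason: string };

export function gate(raw: unknown, threshold = 0.8): Verdict {
  const r = raw as Partial<Extraction> | null;
  // Schema validation: reject malformed or truncated output outright.
  if (
    r === null ||
    typeof r !== "object" ||
    typeof r.confidence !== "number" ||
    (r.vendor !== null && typeof r.vendor !== "string") ||
    (r.total !== null && typeof r.total !== "number")
  ) {
    return { route: "human-review", reason: "schema violation" };
  }
  // Confidence gate: low-confidence extractions go to a human.
  if (r.confidence < threshold) {
    return { route: "human-review", reason: `confidence ${r.confidence}` };
  }
  return { route: "auto-accept", value: r as Extraction };
}
```

The useful property is that every output lands in exactly one of two queues, so "how often do we auto-accept" becomes a metric you can watch.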
The errors you will actually hit:

- invalid_request_error: image too large - you sent over 5MB or over 8000px on a side. Preprocess.
- overloaded_error - vision endpoints get bursty. Exponential backoff with jitter; three retries is enough.
- Truncated JSON - set max_tokens high enough. For invoice extraction, 2000 is safer than 1024.

You can pass multiple images in a single user turn. The model treats them as an ordered sequence. Common uses include multi-page documents and passing video key frames as separate images.
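The backoff-with-jitter pattern from the error list above is small enough to sketch in full. The delay schedule and the isRetryable predicate are assumptions; the Anthropic SDK also has configurable built-in retries you can lean on instead.

```typescript
// Exponential backoff with full jitter for bursty overload errors (sketch).
export function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  // Full jitter: uniform over [0, min(cap, base * 2^attempt)).
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

export async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt or on non-retryable errors.
      if (attempt + 1 >= maxAttempts || !isRetryable(err)) throw err;
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

Wrap the vision call with `withRetries(() => extractInvoice(bytes), isOverloadedError)`, where isOverloadedError checks for the overloaded_error type on the thrown API error.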
Cost scales linearly with images. A four-image call costs roughly four times a one-image call (plus a small fixed overhead). For long documents, multi-image extraction in a single call is usually cheaper and more accurate than calling once per page, because the model can reason across pages.
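A multi-page call is just one user turn with several image blocks. Here is a sketch of building that content array; the interleaved "Page N" labels are an assumption that makes cross-page references in the prompt unambiguous.

```typescript
// Building a single multi-image user turn for multi-page extraction (sketch).
type Block =
  | { type: "text"; text: string }
  | {
      type: "image";
      source: { type: "base64"; media_type: "image/jpeg"; data: string };
    };

export function buildMultiPageContent(pagesBase64: string[]): Block[] {
  const blocks: Block[] = [];
  pagesBase64.forEach((data, i) => {
    // Label each page so the model can reference pages by number.
    blocks.push({ type: "text", text: `Page ${i + 1}:` });
    blocks.push({
      type: "image",
      source: { type: "base64", media_type: "image/jpeg", data },
    });
  });
  blocks.push({
    type: "text",
    text: "Extract the invoice across all pages. Return only JSON.",
  });
  return blocks;
}
```

The result goes in as the content of a single user message, so the model sees the pages in order and can reconcile a total on page 3 against line items on page 1.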
1. EXIF orientation is not auto-applied. A photo taken in portrait will land at the API rotated unless you bake the orientation into the bytes. We have seen entire pipelines fail because the iPhone landscape vs portrait flag was ignored.
2. Animated GIFs are sampled, not analyzed frame-by-frame. You typically get the first frame. If you need video analysis, pass key frames as separate images.
3. The model will describe what it sees if you don't tell it not to. Open-ended prompts like "what is this" produce flowery prose. For extraction, be explicit: return JSON only, no commentary.
4. Image content counts against the cache. A cached system prompt plus an uncached image is the right pattern. You cannot meaningfully cache the image itself unless the exact same bytes recur, in which case do it via hash dedup at your layer.
5. PII in images is a real compliance issue. Vision will happily extract SSNs, account numbers, faces. If you process user-uploaded images, run a redaction or detection pass and have an explicit data retention policy. Anthropic's data is not used for training by default on the API, but your own logs probably retain images longer than you think.
6. Resolution requirements vary by task. UI screenshots can be downscaled aggressively. Tiny text in a fax-quality scan needs the full resolution. Don't blanket-downscale everything; route by content type.
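The hash dedup mentioned in gotcha 4 is a few lines at your layer. This sketch uses an in-memory Map as the store, which is an assumption; production would use Redis or a database keyed the same way.

```typescript
import { createHash } from "node:crypto";

// Byte-level dedup for repeat uploads: identical bytes hit your own
// cache instead of the API. Map store is an illustrative assumption.
const resultCache = new Map<string, unknown>();

export function imageKey(bytes: Buffer): string {
  // Hash the post-preprocessing bytes so two uploads of the same photo
  // normalize to the same key.
  return createHash("sha256").update(bytes).digest("hex");
}

export async function extractWithDedup(
  bytes: Buffer,
  extract: (bytes: Buffer) => Promise<unknown>
): Promise<unknown> {
  const key = imageKey(bytes);
  const hit = resultCache.get(key);
  if (hit !== undefined) return hit; // repeat upload: zero API cost
  const result = await extract(bytes);
  resultCache.set(key, result);
  return result;
}
```

Hashing after preprocessing matters: two uploads of the same photo at different JPEG qualities only dedupe if your pipeline normalizes them to identical bytes first.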
For 100k+ images a month, three things matter more than the SDK call:

1. Aggressive preprocessing - downscale and re-encode before upload; it is the biggest direct cost lever.
2. Prompt caching - put the stable instruction prefix in system with cache_control so you are not paying full input cost on every call.
3. Cost attribution - track token spend per route so a preprocessing regression shows up the day it ships, not on the invoice.

Vision is the most under-used feature in the Claude API among teams that already have text working. The ones who get it right at scale treat it as an industrial pipeline, not a model call. Preprocess, cache, validate, attribute, and you will ship a vision feature that holds up.
For more on shipping Claude in production, see our writeups on prompt caching and tool use patterns.