
TL;DR
GPT-5.5 and 5.5 Pro hit the API on April 24. Here is what changes for builders: pricing, agentic behavior, tool use, and the real benchmarks I ran the day it dropped.
10 min readGPT-5.5 and GPT-5.5 Pro landed on the API on April 24. I rebuilt three production agents on 5.5 Pro the same day. One got noticeably better, one regressed in a way I did not see coming, and one I had to tear out and replace with a cheaper model because the new pricing curve made it economically pointless.
This is the field guide I wish I had that morning. It covers what the model actually does differently in API terms, which tier matches which workload, the agentic improvements I can verify with my own evals, the regression nobody is writing about, and the migration playbook I now run on every codebase that touches the OpenAI SDK.
OpenAI's launch post leans on words like "intuitive" and "agentic," which is fine for marketing copy but does not help you decide whether to flip the env flag. In API terms, here is what changed.
For model-selection context, compare this with OpenAI Codex: Cloud AI Coding With GPT-5.3 and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience; the useful question is not only benchmark quality, but where the model fits in a real developer workflow.
The default reasoning depth is higher. GPT-5.5 ships with a recalibrated reasoning.effort scale where the new medium is closer to what high was in 5.4. That means a naive drop-in replacement using the same parameters will cost more and run slower out of the box, even before you change a line of code.
Tool-call dispatch is more conservative. In my eval suite, 5.5 emits roughly 18 percent fewer speculative tool calls on tasks where multiple tools could plausibly answer. It waits for more context before committing. That is excellent for production agents that pay per tool round trip and a slight regression for chat-style assistants where the perceived snappiness comes from immediate action.
Long-context handling is genuinely better. The same 240K-token document QA task that gave 5.4 a 71 percent answer-accuracy in my bench now scores 84 percent on 5.5 Pro. The improvement concentrates in the middle third of the context window, which is exactly where 5.4 had its known sag.
Pricing rebalanced. Input tokens on 5.5 are slightly cheaper than 5.4. Output tokens are slightly more expensive. 5.5 Pro is roughly 4x the cost of 5.5. For agents that emit short tool calls and consume long context, 5.5 is a free win. For agents that generate long-form output, you may end up paying more.
I keep this decision matrix taped to the wall above my desk. The rule of thumb: 5.5 is the new default, and 5.5 Pro earns its keep only on a narrow band of workloads. In SDK terms, the two tiers look like this:
import OpenAI from "openai";

const client = new OpenAI();

// Default tier: chat, summarization, structured extraction, simple agents
const fast = await client.responses.create({
  model: "gpt-5.5",
  input: userMessage,
  reasoning: { effort: "low" },
});

// Pro tier: long-horizon agents, multi-document synthesis, hard reasoning
const heavy = await client.responses.create({
  model: "gpt-5.5-pro",
  input: planningPrompt,
  reasoning: { effort: "medium" },
  tools: agentTools,
});
Use 5.5 for anything user-facing that needs a sub-2-second time-to-first-token. Use 5.5 Pro when the cost of being wrong is higher than the marginal token bill. Concretely, that means coding agents that touch shared infra, financial extraction where a misread number costs money, and any agent that runs unattended for more than a few minutes.
If you are running mixed-model agents, lean on a router. I push every request through a tiny wrapper that picks the model based on a complexity score I attach to the task at queue time. The router lives behind the same OpenAI SDK call surface, which means swapping models is a one-line config change.
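Here is a minimal sketch of that wrapper. The task shape, the complexityScore field, and the 0.8 threshold are my illustrations, not a recommendation; the only load-bearing idea is that the score is attached upstream, at queue time.

import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical task shape: complexityScore is computed when the task is queued.
interface QueuedTask {
  complexityScore: number; // 0..1, higher means harder
  prompt: string;
}

// Route to 5.5 Pro only when the task clears the complexity bar.
function pickModel(task: QueuedTask): "gpt-5.5" | "gpt-5.5-pro" {
  return task.complexityScore > 0.8 ? "gpt-5.5-pro" : "gpt-5.5";
}

async function run(task: QueuedTask) {
  const model = pickModel(task);
  return client.responses.create({
    model,
    input: task.prompt,
    reasoning: { effort: model === "gpt-5.5-pro" ? "medium" : "low" },
  });
}

With the threshold in config rather than code, retuning the split after a pricing change is a one-line edit.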
I ran Agent Eval Bench against 5.4, 5.5, and 5.5 Pro the day the API opened. The bench has 280 deterministic agentic tasks across coding, document workflows, and tool-use chains. Here is what came out.
Tool-call accuracy on the 80-task tool-use suite climbed from 78 percent on 5.4 to 89 percent on 5.5 and 91 percent on 5.5 Pro. The improvement is concentrated in tasks that require choosing between two semantically similar tools. 5.5 picks the right one more often, which means fewer wasted round trips and shorter end-to-end latencies in production despite the slower per-call thinking time.
Long-horizon planning held up. On the 40-task multi-step planning suite, 5.5 Pro completed 34 of 40 without human intervention, compared to 27 of 40 for 5.4. The failures clustered around tasks where the agent had to abandon a partially-completed plan and restart, which is the same weak spot 5.4 had. Progress, but not a solved problem.
Document workflows improved more than I expected. Extracting structured data from messy PDFs, the kind of task that used to require the Pro tier and high effort to get right, now works reliably on 5.5 with effort: "low". That is a real cost saving. I cut my document pipeline bill by 38 percent the week after migrating.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    input=[
        {"role": "system", "content": "Extract invoice line items as JSON."},
        {"role": "user", "content": [{"type": "input_file", "file_id": file_id}]},
    ],
    reasoning={"effort": "low"},
    response_format={"type": "json_schema", "json_schema": invoice_schema},
)
The same call on 5.4 needed effort: "high" to hit acceptable accuracy on my golden set. That is the kind of quiet improvement that does not make the launch post but moves real money.
Here is the part I have not seen anyone else write up. On creative-writing-style tasks where the model has to maintain a consistent voice across a long generation, 5.5 drifts in voice more often than 5.4 did. My voice-consistency eval, a 40-prompt suite that scores style adherence across 4,000-token continuations, dropped from 82 percent on 5.4 to 71 percent on 5.5. 5.5 Pro recovers most of the gap at 79 percent, but that is still a regression.
The drift looks like the model getting bored with the established voice partway through and reverting to a more neutral, technical register. My theory is that the conservative tool-call dispatch and the recalibrated reasoning effort interact badly with creative continuations, but I cannot prove it from outside the model.
If you have agents that depend on consistent persona output, run your own eval before flipping the flag. I had to keep one of my agents on 5.4 because the regression was material to its product feel. OpenAI has not deprecated 5.4 yet, so this is a viable holdout for a while longer.
Here is the playbook I now run on every codebase. It takes a couple of hours per service and has caught real problems on every migration so far.
Step one, pin the old model behind a flag. Before you change any prompt code, wrap the model name in an env-driven config. This is the rollback lever you will be glad to have.
// Env-driven model pin: this is the rollback lever
const MODEL = process.env.OPENAI_MODEL ?? "gpt-5.4";

const response = await client.responses.create({
  model: MODEL,
  input: prompt,
});
Step two, run an eval harness against both models on a representative sample of real production traffic. Do not trust anyone's benchmarks but your own, including the ones in this post. Capture 200 real requests, replay them against 5.4 and 5.5, and score the outputs on whatever metric matters for your product. This is exactly what Agent Eval Bench is built for, but a 50-line script will do.
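A minimal replay sketch under those assumptions. The captured_requests.json format and the score function are stand-ins for your own capture format and product metric:

import OpenAI from "openai";
import { readFileSync } from "node:fs";

const client = new OpenAI();

// One captured production request per entry; this format is hypothetical.
const captured: { input: string }[] = JSON.parse(
  readFileSync("captured_requests.json", "utf8"),
);

// Placeholder scorer: swap in whatever metric matters for your product.
function score(output: string): number {
  return output.trim().length > 0 ? 1 : 0;
}

async function replay(model: string): Promise<number> {
  let passes = 0;
  for (const req of captured) {
    const res = await client.responses.create({ model, input: req.input });
    passes += score(res.output_text);
  }
  return passes / captured.length;
}

console.log("5.4:", await replay("gpt-5.4"));
console.log("5.5:", await replay("gpt-5.5"));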
Step three, watch the cost curve. Cost Tape gives me a live spend graph across mixed-model rollouts. The first week of any migration is when surprise bills happen, usually because the new default reasoning effort is higher and you forgot to set effort: "low" on a high-volume endpoint.
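Even without a dashboard, logging per-request token usage catches most of this. A sketch, assuming the usage object the Responses API returns today:

import OpenAI from "openai";

const client = new OpenAI();

async function loggedCall(prompt: string) {
  const res = await client.responses.create({
    model: "gpt-5.5",
    input: prompt,
    reasoning: { effort: "low" }, // the flag people forget on high-volume endpoints
  });
  // Aggregate these counts downstream; cost drift shows up here first.
  console.log({
    inputTokens: res.usage?.input_tokens,
    outputTokens: res.usage?.output_tokens,
  });
  return res;
}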
Step four, ramp by traffic share. I move new requests to the new model in 5 percent increments over a week, watching error rates and customer-reported quality at each step. If the eval lied, I find out before the bill does.
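A sketch of the ramp itself, with a hypothetical env var holding the rollout share:

// GPT55_RAMP_PERCENT is a name I made up; 0 keeps all traffic on 5.4.
const RAMP = Number(process.env.GPT55_RAMP_PERCENT ?? "0"); // 0..100

function modelForRequest(): string {
  // A random split is fine for stateless endpoints; bucket on a stable
  // per-user hash instead if a session must stay on one model throughout.
  return Math.random() * 100 < RAMP ? "gpt-5.5" : "gpt-5.4";
}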
Step five, plan the rollback. Keep the old model live for at least two billing cycles. The regression I described above caught me on day six of a rollout, which would have been a bad time to discover I had ripped out the old code path.
Two prompt patterns earned an immediate rewrite for 5.5.
First, drop the "think step by step" preamble. 5.5 thinks more deeply by default, and the explicit instruction now adds latency without improving accuracy on my evals. On the math reasoning suite, removing the preamble cut average response time by 410ms with no measurable accuracy change.
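The rewrite is a deletion. A before/after sketch of the system prompt; the surrounding call does not change:

// Before: tuned for 5.4, where the nudge bought real accuracy
const system54 = "You are a math assistant. Think step by step before answering.";

// After: 5.5 reasons at depth by default, so the preamble is pure latency
const system55 = "You are a math assistant.";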
Second, tighten tool descriptions. Because 5.5 is more conservative about tool dispatch, ambiguous tool descriptions now translate to the model giving up and asking the user instead of trying. Rewrite each tool description to specify exactly what inputs it expects and what outputs it produces, in the second person, with concrete examples.
// Before
const searchTool = {
  type: "function",
  name: "search_docs",
  description: "Searches documentation.",
  parameters: { /* ... */ },
};

// After
const searchTool = {
  type: "function",
  name: "search_docs",
  description:
    "Use this when the user asks how to do something with our SDK. Input is a natural-language query string. Output is up to 5 documentation snippets with URLs. Prefer this over guessing API details from training data.",
  parameters: { /* ... */ },
};
The diff is small. The behavioral change is large. My agent's tool-use rate on documentation questions went from 64 percent to 91 percent after rewriting the descriptions in this style, with no model or prompt change beyond the tool spec.
GPT-5.5 is the new default. 5.5 Pro is the right answer when correctness costs more than tokens. The day-one migration is straightforward if you have an eval harness and a kill switch, painful if you do not. The voice-drift regression is real and worth testing for if your product depends on persona consistency.
I shipped the eval bench results, the cost-tape dashboard, and the day-one review on the DevDigest YouTube channel the same week the API opened. If you are migrating, start with the eval harness. Everything else falls out of having ground truth.