
TL;DR
GPT-5.5 and 5.5 Pro hit the API on April 24. Here is what changes for builders: pricing, agentic behavior, tool use, and the real benchmarks I ran the day it dropped.
10 min readGPT-5.5 and GPT-5.5 Pro landed on the API on April 24. I rebuilt three production agents on 5.5 Pro the same day. One got noticeably better, one regressed in a way I did not see coming, and one I had to tear out and replace with a cheaper model because the new pricing curve made it economically pointless.
This is the field guide I wish I had that morning. It covers what the model actually does differently in API terms, which tier matches which workload, the agentic improvements I can verify with my own evals, the regression nobody is writing about, and the migration playbook I now run on every codebase that touches the OpenAI SDK.
OpenAI's launch post leans on words like "intuitive" and "agentic," which is fine for marketing copy but does not help you decide whether to flip the env flag. In API terms, here is what changed.
For model-selection context, compare this with OpenAI Codex: Cloud AI Coding With GPT-5.3 and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience; the useful question is not only benchmark quality, but where the model fits in a real developer workflow.
The default reasoning depth is higher. GPT-5.5 ships with a recalibrated reasoning.effort scale where the new medium is closer to what high was in 5.4. That means a naive drop-in replacement using the same parameters will cost more and run slower out of the box, even before you change a line of code.
Tool-call dispatch is more conservative. In my eval suite, 5.5 emits roughly 18 percent fewer speculative tool calls on tasks where multiple tools could plausibly answer. It waits for more context before committing. That is excellent for production agents that pay per tool round trip and a slight regression for chat-style assistants where the perceived snappiness comes from immediate action.
Long-context handling is genuinely better. The same 240K-token document QA task that gave 5.4 a 71 percent answer-accuracy in my bench now scores 84 percent on 5.5 Pro. The improvement concentrates in the middle third of the context window, which is exactly where 5.4 had its known sag.
Pricing rebalanced. Input tokens on 5.5 are slightly cheaper than 5.4. Output tokens are slightly more expensive. 5.5 Pro is roughly 4x the cost of 5.5. For agents that emit short tool calls and consume long context, 5.5 is a free win. For agents that generate long-form output, you may end up paying more.
I keep this decision matrix taped to the wall above my desk. The rule of thumb: 5.5 is the new default, and 5.5 Pro earns its keep only on a narrow band of workloads. In SDK terms, the two tiers look like this:
import OpenAI from "openai";

const client = new OpenAI();

// Default tier: chat, summarization, structured extraction, simple agents
const fast = await client.responses.create({
  model: "gpt-5.5",
  input: userMessage,
  reasoning: { effort: "low" },
});

// Pro tier: long-horizon agents, multi-document synthesis, hard reasoning
const heavy = await client.responses.create({
  model: "gpt-5.5-pro",
  input: planningPrompt,
  reasoning: { effort: "medium" },
  tools: agentTools,
});
Use 5.5 for anything user-facing that needs a sub-2-second time-to-first-token. Use 5.5 Pro when the cost of being wrong is higher than the marginal token bill. Concretely, that means coding agents that touch shared infra, financial extraction where a misread number costs money, and any agent that runs unattended for more than a few minutes.
If you are running mixed-model agents, lean on a router. I push every request through a tiny wrapper that picks the model based on a complexity score I attach to the task at queue time. The router lives behind the same OpenAI SDK call surface, which means swapping models is a one-line config change.
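Here is a minimal sketch of that wrapper. The task shape, the complexityScore field, and the 0.8 threshold are my illustrations, not a recommendation; the only load-bearing idea is that the score is attached upstream, at queue time.

import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical task shape: complexityScore is computed when the task is queued.
interface QueuedTask {
  complexityScore: number; // 0..1, higher means harder
  prompt: string;
}

// Route to 5.5 Pro only when the task clears the complexity bar.
function pickModel(task: QueuedTask): "gpt-5.5" | "gpt-5.5-pro" {
  return task.complexityScore > 0.8 ? "gpt-5.5-pro" : "gpt-5.5";
}

async function run(task: QueuedTask) {
  const model = pickModel(task);
  return client.responses.create({
    model,
    input: task.prompt,
    reasoning: { effort: model === "gpt-5.5-pro" ? "medium" : "low" },
  });
}

With the threshold in config rather than code, retuning the split after a pricing change is a one-line edit.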
I ran Agent Eval Bench against 5.4, 5.5, and 5.5 Pro the day the API opened. The bench has 280 deterministic agentic tasks across coding, document workflows, and tool-use chains. Here is what came out.
Tool-call accuracy on the 80-task tool-use suite climbed from 78 percent on 5.4 to 89 percent on 5.5 and 91 percent on 5.5 Pro. The improvement is concentrated in tasks that require choosing between two semantically similar tools. 5.5 picks the right one more often, which means fewer wasted round trips and shorter end-to-end latencies in production despite the slower per-call thinking time.
Long-horizon planning held up. On the 40-task multi-step planning suite, 5.5 Pro completed 34 of 40 without human intervention, compared to 27 of 40 for 5.4. The failures clustered around tasks where the agent had to abandon a partially-completed plan and restart, which is the same weak spot 5.4 had. Progress, but not a solved problem.
Document workflows improved more than I expected. Extracting structured data from messy PDFs, the kind of task that used to require the Pro tier and high effort to get right, now works reliably on 5.5 with effort: "low". That is a real cost saving. I cut my document pipeline bill by 38 percent the week after migrating.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    input=[
        {"role": "system", "content": "Extract invoice line items as JSON."},
        {"role": "user", "content": [{"type": "input_file", "file_id": file_id}]},
    ],
    reasoning={"effort": "low"},
    response_format={"type": "json_schema", "json_schema": invoice_schema},
)
The same call on 5.4 needed effort: "high" to hit acceptable accuracy on my golden set. That is the kind of quiet improvement that does not make the launch post but moves real money.
Here is the part I have not seen anyone else write up. On creative-writing-style tasks where the model has to maintain a consistent voice across a long generation, 5.5 drifts in voice more often than 5.4 did. My voice-consistency eval, a 40-prompt suite that scores style adherence across 4,000-token continuations, dropped from 82 percent on 5.4 to 71 percent on 5.5. 5.5 Pro recovers most of the gap at 79 percent, but that is still a regression.
The drift looks like the model getting bored with the established voice partway through and reverting to a more neutral, technical register. My theory is that the conservative tool-call dispatch and the recalibrated reasoning effort interact badly with creative continuations, but I cannot prove it from outside the model.
If you have agents that depend on consistent persona output, run your own eval before flipping the flag. I had to keep one of my agents on 5.4 because the regression was material to its product feel. OpenAI has not deprecated 5.4 yet, so this is a viable holdout for a while longer.
Here is the playbook I now run on every codebase. It takes a couple of hours per service and has caught real problems on every migration so far.
Step one, pin the old model behind a flag. Before you change any prompt code, wrap the model name in an env-driven config. This is the rollback lever you will be glad to have.
// Env-driven model pin: this is the rollback lever
const MODEL = process.env.OPENAI_MODEL ?? "gpt-5.4";

const response = await client.responses.create({
  model: MODEL,
  input: prompt,
});
Step two, run an eval harness against both models on a representative sample of real production traffic. Do not trust anyone's benchmarks but your own, including the ones in this post. Capture 200 real requests, replay them against 5.4 and 5.5, and score the outputs on whatever metric matters for your product. This is exactly what Agent Eval Bench is built for, but a 50-line script will do.
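A minimal replay sketch under those assumptions. The captured_requests.json format and the score function are stand-ins for your own capture format and product metric:

import OpenAI from "openai";
import { readFileSync } from "node:fs";

const client = new OpenAI();

// One captured production request per entry; this format is hypothetical.
const captured: { input: string }[] = JSON.parse(
  readFileSync("captured_requests.json", "utf8"),
);

// Placeholder scorer: swap in whatever metric matters for your product.
function score(output: string): number {
  return output.trim().length > 0 ? 1 : 0;
}

async function replay(model: string): Promise<number> {
  let passes = 0;
  for (const req of captured) {
    const res = await client.responses.create({ model, input: req.input });
    passes += score(res.output_text);
  }
  return passes / captured.length;
}

console.log("5.4:", await replay("gpt-5.4"));
console.log("5.5:", await replay("gpt-5.5"));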
Step three, watch the cost curve. Cost Tape gives me a live spend graph across mixed-model rollouts. The first week of any migration is when surprise bills happen, usually because the new default reasoning effort is higher and you forgot to set effort: "low" on a high-volume endpoint.
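Even without a dashboard, logging per-request token usage catches most of this. A sketch, assuming the usage object the Responses API returns today:

import OpenAI from "openai";

const client = new OpenAI();

async function loggedCall(prompt: string) {
  const res = await client.responses.create({
    model: "gpt-5.5",
    input: prompt,
    reasoning: { effort: "low" }, // the flag people forget on high-volume endpoints
  });
  // Aggregate these counts downstream; cost drift shows up here first.
  console.log({
    inputTokens: res.usage?.input_tokens,
    outputTokens: res.usage?.output_tokens,
  });
  return res;
}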
Step four, ramp by traffic share. I move new requests to the new model in 5 percent increments over a week, watching error rates and customer-reported quality at each step. If the eval lied, I find out before the bill does.
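A sketch of the ramp itself, with a hypothetical env var holding the rollout share:

// GPT55_RAMP_PERCENT is a name I made up; 0 keeps all traffic on 5.4.
const RAMP = Number(process.env.GPT55_RAMP_PERCENT ?? "0"); // 0..100

function modelForRequest(): string {
  // A random split is fine for stateless endpoints; bucket on a stable
  // per-user hash instead if a session must stay on one model throughout.
  return Math.random() * 100 < RAMP ? "gpt-5.5" : "gpt-5.4";
}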
Step five, plan the rollback. Keep the old model live for at least two billing cycles. The regression I described above caught me on day six of a rollout, which would have been a bad time to discover I had ripped out the old code path.
Two prompt patterns earned an immediate rewrite for 5.5.
First, drop the "think step by step" preamble. 5.5 thinks more deeply by default, and the explicit instruction now adds latency without improving accuracy on my evals. On the math reasoning suite, removing the preamble cut average response time by 410ms with no measurable accuracy change.
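The rewrite is a deletion. A before/after sketch of the system prompt; the surrounding call does not change:

// Before: tuned for 5.4, where the nudge bought real accuracy
const system54 = "You are a math assistant. Think step by step before answering.";

// After: 5.5 reasons at depth by default, so the preamble is pure latency
const system55 = "You are a math assistant.";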
Second, tighten tool descriptions. Because 5.5 is more conservative about tool dispatch, ambiguous tool descriptions now translate to the model giving up and asking the user instead of trying. Rewrite each tool description to specify exactly what inputs it expects and what outputs it produces, in the second person, with concrete examples.
// Before
const searchTool = {
  type: "function",
  name: "search_docs",
  description: "Searches documentation.",
  parameters: { /* ... */ },
};

// After
const searchTool = {
  type: "function",
  name: "search_docs",
  description:
    "Use this when the user asks how to do something with our SDK. Input is a natural-language query string. Output is up to 5 documentation snippets with URLs. Prefer this over guessing API details from training data.",
  parameters: { /* ... */ },
};
The diff is small. The behavioral change is large. My agent's tool-use rate on documentation questions went from 64 percent to 91 percent after rewriting the descriptions in this style, with no model or prompt change beyond the tool spec.
GPT-5.5 is the new default. 5.5 Pro is the right answer when correctness costs more than tokens. The day-one migration is straightforward if you have an eval harness and a kill switch, painful if you do not. The voice-drift regression is real and worth testing for if your product depends on persona consistency.
I shipped the eval bench results, the cost-tape dashboard, and the day-one review on the DevDigest YouTube channel the same week the API opened. If you are migrating, start with the eval harness. Everything else falls out of having ground truth.