
TL;DR
GPT-5.5-Codex merges the Codex and GPT-5 post-training stacks. Here is what the unified model means for real coding agents - latency, costs, and prompt rewrites.
I migrated three production coding agents from gpt-5-codex to gpt-5.5-codex over a single weekend. One was a multi-file refactor bot that runs against a 400k LOC monorepo. One was a PR triage agent that comments on every pull request before a human looks at it. The third was an internal CLI that scaffolds boilerplate from JIRA tickets.
What follows is the real diff: token cost, p95 latency, PR acceptance rate, and the four prompt scaffolds I kept versus the ones I had to throw away. If you are sitting on the fence about migrating, this is the writeup I wish I had on Friday night.
The headline is not the version bump. The headline is that OpenAI has merged its two parallel post-training stacks into one. For most of 2025, gpt-5-codex and gpt-5 were trained for different optimization targets. Codex variants were tuned for long-horizon, tool-using, file-editing workflows. The base GPT-5 line was tuned for reasoning, instruction following, and general chat.
gpt-5.5-codex is the first model that inherits both. The instruction-following work that landed in 5.5 base shows up in the Codex variant immediately, and the multi-file editing behavior that Codex pioneered now informs how the base model handles code. In practice this means fewer surprises when you mix prompt patterns from your chat stack with patterns from your agent stack.
For builders this matters because it lowers the cost of standardizing on one model across product surfaces. I now run the same model behind my CLI, my PR bot, and my customer-facing chat features. One eval suite, one cost line, one prompt library.
The model string change itself is trivial. The OpenAI Python SDK migration was three lines per agent.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5-codex",
    input=[
        {"role": "developer", "content": "You are a senior backend engineer reviewing pull requests for a Next.js monorepo. You always cite file paths and line numbers."},
        {"role": "user", "content": "Review the diff in pr_12894.patch and flag any issues with auth, database queries, or type safety."},
    ],
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
    reasoning={"effort": "medium"},
)

print(response.output_text)
Two things to watch when migrating. First, any code that pinned reasoning.effort to high for gpt-5-codex should drop to medium first and benchmark. The 5.5 model is more efficient at the same effort level, and several of my agents got better outputs at lower effort, not higher. Second, deprecated parameters from older Codex preview models such as max_completion_tokens should be replaced with max_output_tokens in the Responses API. The SDK will warn, not error, so it is easy to miss.
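Here is a minimal sketch of both changes, reusing the client from the snippet above. The parameter names are the Responses API's; the effort level and token cap are starting points to benchmark, not recommendations.

# Before, on gpt-5-codex (the SDK warns on the deprecated parameter but still runs):
#   client.responses.create(model="gpt-5-codex", ..., reasoning={"effort": "high"}, max_completion_tokens=2048)

# After, on gpt-5.5-codex: drop one effort level and rename the token cap.
response = client.responses.create(
    model="gpt-5.5-codex",
    input="Summarize why the billing rollover test is failing.",
    reasoning={"effort": "medium"},  # benchmark before going back up to "high"
    max_output_tokens=2048,          # replaces the deprecated max_completion_tokens
)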
The system-prompt rewrites that paid off were small but consistent. I removed the explicit "think step by step" scaffolding that I had baked into gpt-5-codex prompts. In 5.5 it is redundant and occasionally produces verbose preambles that you then have to strip in post. I also tightened my output-format instructions, because the new model follows JSON schemas and structured-output requests with noticeably less drift.
If you keep prompts in version control, and you should, now is the time to tag the pre-migration prompt set. We use Promptlock to version prompts across model migrations exactly so the rollback is one command, not a git archaeology session. The diff between my gpt-5-codex and gpt-5.5-codex prompt branches is genuinely useful as a reference.
These numbers come from production traffic over a seven-day window on each side of the migration. Prompts are identical except where I noted changes above, and the tools and eval set are the same. Token counts sum input and output.
| Agent | Tokens before | Tokens after | p95 before | p95 after | PR accept before | PR accept after |
|---|---|---|---|---|---|---|
| Refactor bot | 41.2M | 33.8M | 18.4s | 14.1s | 61% | 72% |
| PR triage | 12.6M | 11.9M | 6.2s | 5.0s | n/a | n/a |
| Boilerplate CLI | 3.9M | 3.4M | 9.8s | 7.6s | 88% | 91% |
A few honest caveats. The refactor bot saw the biggest gain because it benefits most from longer-horizon planning, which is where 5.5 made the cleanest jump. The PR triage agent was already cheap and fast, so the absolute delta is small. The boilerplate CLI is so structured that the model improvements are almost noise. The win there is consistency, not capability.
I track all of this from my status bar with Cost Tape, which broke the migration cost delta out per agent in real time. Watching the line chart drop on the refactor bot for three straight days was the moment I committed to rolling out fully.
Multi-file edits are the thing I want to talk about first, because this was the longest-standing pain point with the 5-codex line. The previous model would correctly identify that a refactor required changes in five files, then make the edits inconsistently: naming a renamed function correctly in three places and using the old name in the other two. 5.5 holds the rename across the entire edit batch with much higher reliability. My refactor bot now lands cross-file renames on the first try in roughly four out of five attempts, up from about half.
Long-horizon tasks are the second improvement. Tasks that span more than 20 tool calls used to drift. The model would forget early constraints, contradict its own plan, or revisit the same file three times. 5.5 holds the plan better. I am no longer adding "remember the original requirement" reminders into my system prompts every five turns.
Instruction following on ambiguous tickets is the third. JIRA tickets are a hazard surface for any coding agent because they are written by humans who already share context with the reader. 5.5 asks fewer clarifying questions and makes better default assumptions when the ticket is underspecified. When my boilerplate CLI sees a ticket like "add the new endpoint for the marketing team", the model now correctly infers the file layout, the route convention, and the test pattern from the rest of the repo without me having to spell it out in the system prompt.
For a side-by-side terminal recording of the same multi-file refactor task running on gpt-5-codex and gpt-5.5-codex, the DevDigest YouTube hands-on video is worth ten minutes of your time. Watching the two agents run on a split screen tells you more than any benchmark table.
I want to be honest about the failure modes because the rollout-everything-on-Monday energy on Twitter does not match my experience.
Config drift is the first real bug class. When a repo has both a pnpm-workspace.yaml and a stale lerna.json, 5.5 will sometimes follow the lerna config and produce commands that no longer apply. The fix is the same as it was on the previous model: tell the agent which config file is canonical in the developer message, and verify before letting it run scripts.
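The fix in code is one extra sentence in the developer message. A minimal sketch, with a hypothetical repo layout, reusing the client from earlier:

# Hypothetical repo: pnpm-workspace.yaml is canonical, lerna.json is stale.
developer_msg = (
    "This repository is a pnpm workspace. pnpm-workspace.yaml is the canonical "
    "workspace config. lerna.json is stale; never read it or follow its scripts. "
    "Use pnpm commands only."
)

response = client.responses.create(
    model="gpt-5.5-codex",
    input=[
        {"role": "developer", "content": developer_msg},
        {"role": "user", "content": "Add a typecheck script to packages/api."},
    ],
)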
Confidently wrong refactors are the second. The model is now better at multi-file edits, which paradoxically makes it more dangerous when it is wrong. A confident sweep across 12 files with a subtly incorrect type signature is harder to catch in review than a hesitant attempt at three files. The countermeasure is unchanged: run the test suite before accepting, and make your CI block on failures.
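The gate belongs in the agent loop as well as in CI. A minimal sketch, assuming a pnpm test script; swap in whatever your suite's actual entry point is:

import subprocess

def tests_pass(repo_dir: str) -> bool:
    # Non-zero exit means the patch gets reverted before a human ever reviews it.
    result = subprocess.run(["pnpm", "test"], cwd=repo_dir)
    return result.returncode == 0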
Cost cliffs on long contexts are the third. Pricing on 5.5-codex scales with input tokens as expected, but the model is more willing to read entire files end to end when it could have grepped. If you give it filesystem tools without rate limits, you will see the bill jump on agents that were previously cautious. I added a hard cap on file reads per task in my agent loop and the daily spend dropped back into expected territory.
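The cap itself is a few lines in the tool layer. A minimal sketch; the limit of 40 reads is tuned for my repos, not a recommendation, and read_file stands in for whatever filesystem tool your loop exposes to the model:

MAX_READS_PER_TASK = 40  # tuned on my traffic; measure your own

class ReadBudgetExceeded(Exception):
    pass

def make_read_tool(limit: int = MAX_READS_PER_TASK):
    reads = 0
    def read_file(path: str) -> str:
        nonlocal reads
        reads += 1
        if reads > limit:
            # Fail the task loudly instead of letting the model read the repo end to end.
            raise ReadBudgetExceeded(f"{reads} reads in one task (cap {limit})")
        with open(path, encoding="utf-8") as f:
            return f.read()
    return read_file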
Verdict: migrate. The latency win alone justifies the move for any user-facing agent. The cost delta is real if your traffic skews toward refactor-style work. The risk of confidently-wrong sweeps is manageable with discipline you should already have.
Four prompt scaffolds survived migration intact and I will keep using them across whatever ships next.
The first is the role-with-codebase-anchor pattern. State the role, name the codebase, and pin one or two architectural facts the model needs to behave correctly. This worked on gpt-5-codex and works equally well on 5.5.
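In practice it is a short constant. The repo facts below are placeholders; pin whichever one or two facts your agent keeps getting wrong without them:

# Role + codebase anchor: name the role, name the repo, pin the load-bearing facts.
ANCHOR = (
    "You are a senior backend engineer working in the acme-platform monorepo. "
    "Two facts govern every change: all database access goes through packages/db "
    "(never raw SQL in route handlers), and every route lives under "
    "apps/api/src/routes with a colocated .test.ts file."
)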
The second is the cite-file-and-line discipline. Always require the model to cite the exact file path and line number when making claims about code. This kills hallucinated references on any model and is even cheaper to enforce on 5.5 because the model resists drifting from the requirement.
The third is the plan-then-execute split. Have the model emit a plan first, log it, then execute against the plan. The plan is invaluable for postmortems when an agent goes wrong, and 5.5 produces visibly better plans than its predecessor.
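A minimal two-call sketch of the split, reusing the client from earlier; ANCHOR is the scaffold from the first pattern, and task stands in for your ticket text:

# Call 1: plan only. Persist the plan before any edits happen.
plan = client.responses.create(
    model="gpt-5.5-codex",
    input=[
        {"role": "developer", "content": ANCHOR + " Output a numbered plan only. Do not edit any files."},
        {"role": "user", "content": task},
    ],
).output_text
print(f"plan:\n{plan}")  # in production this goes to your task log, not stdout

# Call 2: execute against the logged plan.
result = client.responses.create(
    model="gpt-5.5-codex",
    input=[
        {"role": "developer", "content": ANCHOR + " Execute this plan exactly: " + plan},
        {"role": "user", "content": task},
    ],
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
)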
The fourth is the structured-output-or-fail rule. If a downstream consumer expects JSON, declare the schema and reject anything that does not match. The 5.5 model is forgiving enough that this rarely triggers, but the contract has saved me twice this month already.
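Declaring the contract in the Responses API looks like this; the schema is a toy, and json.loads raising on anything malformed is the "or fail" part:

import json

schema = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["info", "warn", "block"]},
        "file": {"type": "string"},
        "line": {"type": "integer"},
        "finding": {"type": "string"},
    },
    "required": ["severity", "file", "line", "finding"],
    "additionalProperties": False,
}

response = client.responses.create(
    model="gpt-5.5-codex",
    input="Review pr_12894.patch and report the single worst issue.",
    text={"format": {"type": "json_schema", "name": "review_finding", "schema": schema, "strict": True}},
)

finding = json.loads(response.output_text)  # raises on drift instead of passing garbage downstream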
If you are mid-migration, the playbook is straightforward. Tag your prompt set, swap the model string, drop reasoning effort one level, run your eval suite, watch your cost dashboard for 48 hours, and only then roll out broadly. The unified stack is real, the gains are real, and the only thing left is the discipline of measuring it on your own traffic instead of trusting any benchmark, including the table I wrote above.