
TL;DR
GPT-5.4 ships state-of-the-art computer use, steerable thinking, and a million-token window. Here is the implementation guide for builders, with real OpenAI SDK code, the 272K pricing cliff, and where it actually beats 5.3 and 5.5 in production.
GPT-5.4 dropped in March 2026 and the noisy headline was "model beats humans on OSWorld." That is true and it is also the least useful thing about the release for anyone shipping software. Two months later, with GPT-5.5 already on the API and 5.4 settling into its real role in the lineup, the picture for builders is much clearer than it was on launch day.
This is the developer guide I wanted in March: which workloads actually want 5.4, the SDK code to wire it up, how the 272K-token pricing cliff bites in practice, and the honest comparison against 5.3 below it and 5.5 above it.
There were three substantive changes plus one smaller speed bump, and one of the three is easy to miss because it lives in the UX layer rather than the API.
Computer use went from "interesting demo" to "thing you can ship." The OSWorld Verified score of 75 percent versus 58.3 percent on 5.3 is not a marginal jump. It moves browser and desktop automation from the territory where you write three retry layers and a human escalation path to the territory where you write one retry layer and a logging path. BrowseComp and WebArena moved together with it, which means the gain is not a single-benchmark artifact.
Context grew to one million tokens with a real cost wrinkle. Anything over 272K tokens is billed at a 2x multiplier on both input and output. The window exists, you can use it, and most workloads should not. More on that under pricing.
Steerable thinking is the part developers undersell. The product UX of redirecting reasoning mid-response shows up in the API as a richer streaming format and a longer effective tool-use loop. If you build with the Responses API, you can intercept reasoning before the model commits and inject corrections, which is much cheaper than regenerating.
A new Codex fast mode runs roughly 1.5x faster than standard. For batch jobs it is the difference between a one-hour CI lane and a forty-minute one.
With 5.5 and 5.5 Pro now on the API, 5.4 is no longer the default frontier choice. It is the right call for three specific workloads.
Computer-use and browser-automation agents should still be tested against 5.4 first. Until 5.5's computer-use evals catch up publicly, 5.4's OSWorld lead is the strongest published number on the task. If you are running a browser agent stack where every action is a paid round trip, the higher action accuracy compounds.
Frontend code generation in agentic loops still favours 5.4 in my own evals. Web games, 3D scenes, complex CSS layouts, and React component scaffolds came back with fewer iterations than on 5.3 and roughly tied with 5.5 on first-pass quality, while costing meaningfully less per million tokens.
Long-document workloads up to 272K tokens are 5.4 territory. Below that ceiling the per-token cost is half of 5.5 Pro and the quality gap is small. Cross the cliff and the math flips immediately.
For everything else, including most agentic terminal coding and long-horizon task graphs, 5.5 or 5.5 Pro is the better default. And for high-volume utility work that does not need reasoning at all, neither model is the right tool.
The model IDs that ship are gpt-5.4 for the standard variant and gpt-5.4-thinking for the reasoning variant. Both work through the Responses API, which is what I use for anything that touches tools.
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    input="Summarize this changelog and flag breaking changes.",
    instructions="You are a release-notes auditor. Be terse.",
    max_output_tokens=2000,
)
print(response.output_text)
```
For the thinking variant you opt into reasoning effort explicitly. The API exposes the same reasoning.effort knob as 5.3 and the new effort_budget cap that was introduced alongside 5.4.
```python
response = client.responses.create(
    model="gpt-5.4-thinking",
    input=user_query,
    reasoning={"effort": "high", "effort_budget": 8000},
    max_output_tokens=4000,
    stream=True,
)
for event in response:
    if event.type == "response.reasoning.delta":
        # Surface reasoning to the user in real time so they can steer.
        ui.append_reasoning(event.delta)
    elif event.type == "response.output_text.delta":
        ui.append_answer(event.delta)
```
effort_budget is a hard ceiling on reasoning tokens. It is the single most useful knob for keeping thinking-model costs bounded in production. Set it once per call site, treat it as a budget, and alert when you hit it.
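A minimal sketch of that alerting, assuming the usage object exposes reasoning tokens under output_tokens_details.reasoning_tokens the way the Responses API does today; the exact field layout on 5.4 may differ.

```python
import logging

logger = logging.getLogger("thinking-costs")

EFFORT_BUDGET = 8000  # the per-call-site budget, not a global constant

def run_with_budget(client, query: str):
    response = client.responses.create(
        model="gpt-5.4-thinking",
        input=query,
        reasoning={"effort": "high", "effort_budget": EFFORT_BUDGET},
        max_output_tokens=4000,
    )
    used = response.usage.output_tokens_details.reasoning_tokens
    if used >= EFFORT_BUDGET:
        # Hitting the ceiling usually means the answer was cut short: alert on it.
        logger.warning("effort_budget exhausted: %d reasoning tokens", used)
    return response
```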
The shipped computer-use tool is the cleanest way to use 5.4 for browser and desktop automation. Pair it with a sandboxed browser session and you get a tight loop.
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1280,
        "display_height": 800,
        "environment": "browser",
    }],
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Find the pricing page on stripe.com and extract the per-transaction fee."},
            {"type": "input_image", "image_url": initial_screenshot_data_url},
        ],
    }],
    truncation="auto",
)

for action in response.output:
    if action.type == "computer_call":
        result = sandbox.execute(action.action)
        # Feed the resulting screenshot back as the next input item.
```
Two things matter for production. First, always pass truncation="auto" once your action history grows, otherwise long sessions will overflow even the million-token window. Second, instrument every computer_call with a step counter and a timeout. Models that score 75 percent on OSWorld still fail 25 percent of the time, and the failure mode is usually a loop, not a clean error.
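A minimal sketch of that instrumentation, assuming the previous_response_id and computer_call_output plumbing behaves the way the current computer-use preview does; sandbox.execute is a stand-in for whatever actually performs the action and returns a screenshot data URL.

```python
import time

# Same tool spec as above, reused verbatim so the cacheable prefix stays stable.
COMPUTER_TOOL = {
    "type": "computer_use_preview",
    "display_width": 1280,
    "display_height": 800,
    "environment": "browser",
}

MAX_STEPS = 40           # hard ceiling on actions per session
STEP_TIMEOUT_S = 30      # per-action wall-clock limit passed to the sandbox
SESSION_TIMEOUT_S = 600  # whole-session budget

def run_session(client, sandbox, first_response):
    response = first_response
    started = time.monotonic()
    for step in range(MAX_STEPS):
        calls = [item for item in response.output if item.type == "computer_call"]
        if not calls:
            return response  # the model answered instead of acting: session is done
        if time.monotonic() - started > SESSION_TIMEOUT_S:
            raise TimeoutError(f"session exceeded {SESSION_TIMEOUT_S}s at step {step}")
        call = calls[0]
        # Hypothetical sandbox API: run the action, return a screenshot data URL.
        screenshot_url = sandbox.execute(call.action, timeout=STEP_TIMEOUT_S)
        response = client.responses.create(
            model="gpt-5.4",
            previous_response_id=response.id,
            tools=[COMPUTER_TOOL],
            input=[{
                "type": "computer_call_output",
                "call_id": call.call_id,
                "output": {"type": "computer_screenshot", "image_url": screenshot_url},
            }],
            truncation="auto",
        )
    raise RuntimeError(f"hit {MAX_STEPS} steps without finishing; likely a loop")
```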
The pricing snapshot at launch, which is still the published rate as of this writing:
| Rate | gpt-5.4 | gpt-5.4-thinking |
|---|---|---|
| Input | $2.50 / 1M tokens | $5.00 / 1M tokens |
| Output | $10.00 / 1M tokens | $20.00 / 1M tokens |
| Cached input | $0.25 / 1M tokens | $0.50 / 1M tokens |
| Above 272K context | 2x multiplier on input and output | 2x multiplier on input and output |
The 2x cliff above 272K tokens is the line item that gets every team that does not measure. A single in-context-RAG call on a large monorepo can sit at 290K tokens. Run that a thousand times a day on the thinking variant and you are paying like you ran a 5.5 Pro workload, except on the older model.
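To put numbers on it, assuming the 2x multiplier applies to the whole request once it crosses 272K: a 290K-token input on gpt-5.4-thinking bills at $5.00 x 2 = $10.00 per million input tokens, which is roughly $2.90 per call before a single output token. At a thousand calls a day that is about $2,900 of input spend alone.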
Two practical rules. Cache aggressively on the system-prompt and tool-spec prefix; the 10x cached-input discount turns most agent loops into a different cost curve. And put a hard guard at 270K tokens with a fallback path that either truncates older turns or fans out to multiple parallel calls under the cliff. Both are cheap to implement once and pay forever.
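As a minimal sketch of the guard, here is one way to estimate and enforce the ceiling, using tiktoken's o200k_base encoding as an approximation since OpenAI has not published a tokenizer for 5.4; the fallback shown is the truncate-older-turns option.

```python
import tiktoken

# o200k_base is an approximation; treat counts as an estimate and keep slack.
ENC = tiktoken.get_encoding("o200k_base")
GUARD_TOKENS = 270_000  # stay under the 272K pricing cliff with a margin

def guarded_turns(turns: list[str]) -> list[str]:
    """Drop the oldest turns until the estimated token count is under the guard."""
    counts = [len(ENC.encode(turn)) for turn in turns]
    while len(turns) > 1 and sum(counts) > GUARD_TOKENS:
        turns, counts = turns[1:], counts[1:]  # truncate oldest turns first
    if sum(counts) > GUARD_TOKENS:
        # A single turn blew the budget: fan out or summarize upstream instead.
        raise ValueError("single turn exceeds the 270K guard")
    return turns
```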
If you have not built cost telemetry yet, see the Cost Tape approach we use across the DD app stack. It is the smallest viable telemetry layer that catches this kind of regression before the bill arrives.
The numbers below are a mix of published OpenAI evals and my own runs on the same agent suite I use across model upgrades. The 5.5 column reflects 5.5 Pro on default reasoning settings.
| Benchmark | GPT-5.3 | GPT-5.4 | GPT-5.5 Pro |
|---|---|---|---|
| OSWorld Verified | 58.3% | 75.0% | 78.1% |
| BrowseComp | 49.7% | 71.2% | 73.8% |
| WebArena | 51.2% | 68.4% | 70.9% |
| SWE-bench Verified | 69.2% | 74.1% | 79.4% |
| Frontend (internal eval) | 62% | 81% | 82% |
| 240K-token doc QA | 64% | 71% | 84% |
Two observations matter for a build decision. Computer use saturates fast above 5.4. The jump from 5.3 to 5.4 is roughly 17 points; from 5.4 to 5.5 Pro it is 3 points. If your workload lives there, 5.4 is most of the way to the frontier at half the price. Long-context comprehension is where 5.5 Pro pulls cleanly ahead. If you regularly cross 200K tokens of meaningful content, the answer-accuracy delta is too large to ignore.
The honest playbook for a team still on 5.3 in late April 2026 looks like this.
1. Run your existing eval suite against 5.4 before changing a line of production code. If you do not have an eval suite, see the Agent Eval Bench writeup; the cheapest version is a hundred prompts with golden outputs and a regression diff.
2. Migrate computer-use and frontend agents to 5.4 first. These are the workloads with the biggest verified gains and the lowest blast radius if something regresses.
3. Hold long-document workloads on whatever they run today until you have measured 5.4 against 5.5 Pro on the same documents. Anything that crosses 272K should probably skip 5.4 entirely and go straight to 5.5 Pro to avoid the cliff.
4. Set effort_budget on every thinking-model call site. The single biggest production cost surprise on the thinking variants is unbounded reasoning on tasks the model finds confusing.
5. Ship the 270K-token guard. Even if you never expect to hit it, one rogue retrieval upstream will, and you want a fallback path, not a 2x bill.
Three implementation choices end up dominating GPT-5.4 production cost more than the model variant itself.
The first is system-prompt caching. The Responses API caches the static prefix of your input, including the system instructions and tool specifications, at a 10x discount. If your agent hits the same model with a 4K-token tool spec a hundred times an hour, the difference between cached and uncached input is the difference between a serious line item and a rounding error. Structure your prompts so the static part is genuinely static, and put any per-call variability at the end.
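A sketch of that structure, with a hypothetical triage agent and a hypothetical lookup_order tool; the point is that the instructions and tool spec are byte-identical on every call, so the cached-input rate applies to them, and everything request-specific lands after the prefix.

```python
# Static prefix: identical on every call, so the cached-input rate applies to it.
STATIC_INSTRUCTIONS = "You are the checkout triage agent. Be terse."
TOOL_SPECS = [{
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch an order by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def triage(client, user_message: str, order_context: str):
    return client.responses.create(
        model="gpt-5.4",
        instructions=STATIC_INSTRUCTIONS,  # cached across calls
        tools=TOOL_SPECS,                  # cached across calls
        input=[{
            # All per-call variability goes last, after the cacheable prefix.
            "role": "user",
            "content": f"{user_message}\n\nOrder context:\n{order_context}",
        }],
        max_output_tokens=1000,
    )
```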
The second is reasoning effort. The thinking variant defaults to medium, and medium will quietly burn through reasoning tokens on prompts the model finds ambiguous. Force low for routine work, reserve high for the tasks that actually need it, and always pair high with an effort_budget ceiling. In one DD-app migration I cut the monthly thinking-model bill by 38 percent without a measurable quality drop simply by reclassifying which call sites needed which effort tier.
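One way to encode that reclassification is a single mapping from call site to reasoning settings; the call-site names and budgets below are illustrative, not the configuration behind the 38 percent figure.

```python
# Low is the default; only call sites that earn it get medium or high,
# and high always carries an effort_budget ceiling.
EFFORT_TIERS = {
    "changelog_summary":  {"effort": "low"},
    "pr_review":          {"effort": "medium"},
    "migration_planning": {"effort": "high", "effort_budget": 8000},
}

def thinking_call(client, call_site: str, prompt: str):
    return client.responses.create(
        model="gpt-5.4-thinking",
        input=prompt,
        reasoning=EFFORT_TIERS.get(call_site, {"effort": "low"}),
        max_output_tokens=4000,
    )
```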
The third is image inputs in computer-use loops. Each screenshot you feed back as input_image is billed by image tokens, and a long agent session can accumulate dozens. Downsample screenshots to the minimum resolution the model still acts reliably on, which in my testing is 1024x768 for most browser tasks. Going from full 1440p screenshots to 1024x768 cut my computer-use bill nearly in half with no measurable accuracy loss.
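A minimal downsampling helper, using Pillow rather than anything in the OpenAI SDK, that turns a raw screenshot into a 1024x768-bounded data URL ready to pass as an input_image.

```python
import base64
import io

from PIL import Image

TARGET = (1024, 768)  # lowest resolution that stayed reliable in my testing

def downsample_screenshot(png_bytes: bytes) -> str:
    """Resize a raw screenshot and return it as a data URL for input_image."""
    img = Image.open(io.BytesIO(png_bytes))
    img.thumbnail(TARGET, Image.Resampling.LANCZOS)  # keeps aspect ratio, never upscales
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
```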
I have shipped on 5.4 across half the DD app portfolio and the lived-in verdict is narrower than the launch coverage suggested. It is the strongest computer-use model that has shipped to date by a comfortable margin. It is a clear upgrade for frontend code generation in agentic loops. It is a pricing win below 272K tokens.
It is not a replacement for Codex with 5.3 on tight low-latency code edits, where the older model still has a speed and cost edge. It is not the right pick over 5.5 Pro for long-context document work. And it is comprehensively beaten on agentic terminal coding by the current Anthropic frontier, which I covered in the OpenAI versus Anthropic 2026 writeup.
What 5.4 changed permanently is the assumption that browser automation is a research problem. It is now an engineering problem with a model that hits the accuracy bar. The implementation work is on us.
If you want the visual walkthrough that pairs with this guide, the GPT-5.4 in 10 Minutes DevDigest video covers the launch announcement, the steerable thinking UX, and the live coding demos that motivated several of the recommendations above.