
TL;DR
GPT-5.4 ships state-of-the-art computer use, steerable thinking, and a million-token window. Here is the implementation guide for builders, with real OpenAI SDK code, the 272K pricing cliff, and where it actually beats 5.3 and 5.5 in production.
| Resource | Link |
|---|---|
| OpenAI Platform Overview | platform.openai.com |
| OpenAI Models Documentation | platform.openai.com/docs/models |
| OpenAI API Pricing | openai.com/api/pricing |
| OpenAI Responses API | platform.openai.com/docs/api-reference/responses |
| OpenAI Computer Use Guide | platform.openai.com/docs/guides/tools-computer-use |
| OpenAI Python SDK | github.com/openai/openai-python |
| OpenAI Changelog | platform.openai.com/docs/changelog |
Last updated: May 23, 2026. Verify model availability, pricing, and API details against the official OpenAI documentation before production deployment.
GPT-5.4 dropped in March 2026 and the noisy headline was "model beats humans on OSWorld." That is true and it is also the least useful thing about the release for anyone shipping software. Two months later, with GPT-5.5 already on the API and 5.4 settling into its real role in the lineup, the picture for builders is much clearer than it was on launch day.
This is the developer guide I wanted in March: which workloads actually want 5.4, the SDK code to wire it up, how the 272K-token pricing cliff bites in practice, and the honest comparison against 5.3 below it and 5.5 above it.
There were three substantive changes, and one of them is easy to miss because it lives in the UX layer rather than the API.
Computer use went from "interesting demo" to "thing you can ship." The OSWorld Verified score of 75 percent versus 58.3 percent on 5.3 is not a marginal jump. It moves browser and desktop automation from the territory where you write three retry layers and a human escalation path to the territory where you write one retry layer and a logging path. BrowseComp and WebArena moved together with it, which means the gain is not a single-benchmark artifact.
Context grew to one million tokens with a real cost wrinkle. Anything over 272K tokens is billed at a 2x multiplier on both input and output. The window exists, you can use it, and most workloads should not. More on that under pricing.
Steerable thinking is the part developers undersell. The product UX of redirecting reasoning mid-response shows up in the API as a richer streaming format and a longer effective tool-use loop. If you build with the Responses API, you can intercept reasoning before the model commits and inject corrections, which is much cheaper than regenerating.
A new Codex fast mode runs roughly 1.5x faster than standard. For batch jobs it is the difference between a one-hour CI lane and a forty-minute one.
With 5.5 and 5.5 Pro now on the API, 5.4 is no longer the default frontier choice. It is the right call for three specific workloads. For how 5.4 stacks up against its direct mid-tier rivals, see the GPT-5.4 vs Gemini 3.1 Pro vs DeepSeek V4 Pro shootout.
Computer-use and browser-automation agents should still be tested against 5.4 first. Until 5.5's computer-use evals catch up publicly, 5.4's OSWorld lead is the strongest published number on the task. If you are running a browser agent stack where every action is a paid round trip, the higher action accuracy compounds.
Frontend code generation in agentic loops still favours 5.4 in my own evals. Web games, 3D scenes, complex CSS layouts, and React component scaffolds came back with fewer iterations than on 5.3 and roughly tied with 5.5 on first-pass quality, while costing meaningfully less per million tokens.
Long-document workloads up to 272K tokens are 5.4 territory. Below that ceiling the per-token cost is half of 5.5 Pro and the quality gap is small. Cross the cliff and the math flips immediately.
For everything else, including most agentic terminal coding and long-horizon task graphs, 5.5 or 5.5 Pro is the better default. And for high-volume utility work that does not need reasoning at all, neither model is the right tool.
The model IDs that ship are gpt-5.4 for the standard variant and gpt-5.4-thinking for the reasoning variant. Both work through the Responses API, which is what I use for anything that touches tools.
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-5.4",
input="Summarize this changelog and flag breaking changes.",
instructions="You are a release-notes auditor. Be terse.",
max_output_tokens=2000,
)
print(response.output_text)
For the thinking variant you opt into reasoning effort explicitly. The API exposes the same reasoning.effort knob as 5.3 and the new effort_budget cap that was introduced alongside 5.4.
response = client.responses.create(
model="gpt-5.4-thinking",
input=user_query,
reasoning={"effort": "high", "effort_budget": 8000},
max_output_tokens=4000,
stream=True,
)
for event in response:
if event.type == "response.reasoning.delta":
# Surface reasoning to the user in real time so they can steer.
ui.append_reasoning(event.delta)
elif event.type == "response.output_text.delta":
ui.append_answer(event.delta)
effort_budget is a hard ceiling on reasoning tokens. It is the single most useful knob for keeping thinking-model costs bounded in production. Set it once per call site, treat it as a budget, and alert when you hit it.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Apr 29, 2026 • 9 min read
Apr 29, 2026 • 12 min read
Apr 29, 2026 • 8 min read
Apr 29, 2026 • 11 min read
The shipped computer-use tool is the cleanest way to use 5.4 for browser and desktop automation. Pair it with a sandboxed browser session and you get a tight loop.
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-5.4",
tools=[{
"type": "computer_use_preview",
"display_width": 1280,
"display_height": 800,
"environment": "browser",
}],
input=[{
"role": "user",
"content": [
{"type": "input_text", "text": "Find the pricing page on stripe.com and extract the per-transaction fee."},
{"type": "input_image", "image_url": initial_screenshot_data_url},
],
}],
truncation="auto",
)
for action in response.output:
if action.type == "computer_call":
result = sandbox.execute(action.action)
# Feed the resulting screenshot back as the next input item.
Two things matter for production. First, always pass truncation="auto" once your action history grows, otherwise long sessions will overflow even the million-token window. Second, instrument every computer_call with a step counter and a timeout. Models that score 75 percent on OSWorld still fail 25 percent of the time, and the failure mode is usually a loop, not a clean error.
The pricing snapshot at launch, which is still the published rate as of this writing:
gpt-5.4
Input: $2.50 / 1M tokens
Output: $10.00 / 1M tokens
Cached input: $0.25 / 1M tokens
Above 272K ctx: 2x multiplier on input and output
gpt-5.4-thinking
Input: $5.00 / 1M tokens
Output: $20.00 / 1M tokens
Cached input: $0.50 / 1M tokens
Above 272K ctx: 2x multiplier on input and output
The 2x cliff above 272K tokens is the line item that gets every team that does not measure. A single in-context-RAG call on a large monorepo can sit at 290K tokens. Run that a thousand times a day on the thinking variant and you are paying like you ran a 5.5 Pro workload, except on the older model.
Two practical rules. Cache aggressively on the system-prompt and tool-spec prefix; the 10x cached-input discount turns most agent loops into a different cost curve. And put a hard guard at 270K tokens with a fallback path that either truncates older turns or fans out to multiple parallel calls under the cliff. Both are cheap to implement once and pay forever.
If you have not built cost telemetry yet, see the Cost Tape approach we use across the DD app stack. It is the smallest viable telemetry layer that catches this kind of regression before the bill arrives.
The numbers below are a mix of published OpenAI evals and my own runs on the same agent suite I use across model upgrades. The 5.5 column reflects 5.5 Pro on default reasoning settings.
| Benchmark | GPT-5.3 | GPT-5.4 | GPT-5.5 Pro |
|---|---|---|---|
| OSWorld Verified | 58.3% | 75.0% | 78.1% |
| BrowseComp | 49.7% | 71.2% | 73.8% |
| WebArena | 51.2% | 68.4% | 70.9% |
| SWE-bench Verified | 69.2% | 74.1% | 79.4% |
| Frontend (internal eval) | 62% | 81% | 82% |
| 240K-token doc QA | 64% | 71% | 84% |
Two observations matter for a build decision. Computer use saturates fast above 5.4. The jump from 5.3 to 5.4 is roughly 17 points; from 5.4 to 5.5 Pro it is 3 points. If your workload lives there, 5.4 is most of the way to the frontier at half the price. Long-context comprehension is where 5.5 Pro pulls cleanly ahead. If you regularly cross 200K tokens of meaningful content, the answer-accuracy delta is too large to ignore.
The honest playbook for a team still on 5.3 in late April 2026 looks like this.
Run your existing eval suite against 5.4 before changing a line of production code. If you do not have an eval suite, see the Agent Eval Bench writeup; the cheapest version is a hundred prompts with golden outputs and a regression diff.
Migrate computer-use and frontend agents to 5.4 first. These are the workloads with the biggest verified gains and the lowest blast radius if something regresses.
Hold long-document workloads on whatever they run today until you have measured 5.4 against 5.5 Pro on the same documents. Anything that crosses 272K should probably skip 5.4 entirely and go straight to 5.5 Pro to avoid the cliff.
Set effort_budget on every thinking-model call site. The single biggest production cost surprise on the thinking variants is unbounded reasoning on tasks the model finds confusing.
Ship the 270K-token guard. Even if you never expect to hit it, one rogue retrieval upstream will, and you want a fallback path, not a 2x bill.
Three implementation choices end up dominating GPT-5.4 production cost more than the model variant itself.
The first is system-prompt caching. The Responses API caches the static prefix of your input, including the system instructions and tool specifications, at a 10x discount. If your agent hits the same model with a 4K-token tool spec a hundred times an hour, the difference between cached and uncached input is the difference between a serious line item and a rounding error. Structure your prompts so the static part is genuinely static, and put any per-call variability at the end.
The second is reasoning effort. The thinking variant defaults to medium, and medium will quietly burn through reasoning tokens on prompts the model finds ambiguous. Force low for routine work, reserve high for the tasks that actually need it, and always pair high with an effort_budget ceiling. In one DD-app migration I cut the monthly thinking-model bill by 38 percent without a measurable quality drop simply by reclassifying which call sites needed which effort tier.
The third is image inputs in computer-use loops. Each screenshot you feed back as input_image is billed by image tokens, and a long agent session can accumulate dozens. Downsample screenshots to the minimum resolution the model still acts reliably on, which in my testing is 1024x768 for most browser tasks. Going from full 1440p screenshots to 1024x768 cut my computer-use bill nearly in half with no measurable accuracy loss.
I have shipped on 5.4 across half the DD app portfolio and the lived-in verdict is narrower than the launch coverage suggested. It is the strongest computer-use model that has shipped to date by a comfortable margin. It is a clear upgrade for frontend code generation in agentic loops. It is a pricing win below 272K tokens.
It is not a replacement for Codex with 5.3 on tight low-latency code edits, where the older model still has a speed and cost edge. It is not the right pick over 5.5 Pro for long-context document work. And it is comprehensively beaten on agentic terminal coding by the current Anthropic frontier, which I covered in the OpenAI versus Anthropic 2026 writeup.
What 5.4 changed permanently is the assumption that browser automation is a research problem. It is now an engineering problem with a model that hits the accuracy bar. The implementation work is on us.
GPT-5.4 is OpenAI's March 2026 model release with three major improvements over GPT-5.3: computer use went from demo quality to production-ready with a 75 percent OSWorld Verified score versus 58.3 percent on 5.3, context expanded to one million tokens with a 2x pricing multiplier above 272K, and steerable thinking lets developers intercept and redirect reasoning mid-response. For browser automation and frontend code generation, 5.4 is a significant upgrade. For long-context document work, 5.5 Pro is usually the better choice.
GPT-5.4 costs $2.50 per million input tokens and $10.00 per million output tokens, with cached input at $0.25 per million. The thinking variant doubles those rates. The critical detail is the 272K-token cliff: any context above 272K tokens is billed at 2x rates on both input and output. Below that cliff, 5.4 is half the price of 5.5 Pro. Above it, the cost advantage disappears.
Use GPT-5.4 for computer use and browser automation where the OSWorld accuracy lead still holds, for frontend code generation in agentic loops where it matches 5.5 at lower cost, and for long-document workloads under 272K tokens. Use GPT-5.5 Pro for document QA above 200K tokens where accuracy matters. Stay on GPT-5.3 or 5.5 standard for tight low-latency code edits where speed and cost trump capability. For agentic terminal coding, Anthropic's frontier models still lead.
Pass the computer_use_preview tool type to the Responses API with display dimensions and environment set to browser. Always include truncation="auto" to prevent context overflow in long sessions. Instrument every computer_call with a step counter and timeout because the model still fails 25 percent of the time, usually in loops rather than clean errors. Downsample screenshots to 1024x768 to cut image token costs nearly in half without measurable accuracy loss.
The effort_budget parameter caps reasoning tokens on the GPT-5.4-thinking variant. Without it, the model can burn through expensive reasoning tokens on ambiguous prompts. Set effort_budget on every thinking-model call site as a hard ceiling, treat it as a budget, and alert when you hit it. In one migration, reclassifying which call sites needed which effort tier cut monthly thinking-model costs by 38 percent without quality loss.
Cache aggressively on the system prompt and tool spec prefix to get the 10x cached-input discount. Put a hard guard at 270K tokens with a fallback path that either truncates older conversation turns or fans out to multiple parallel calls under the cliff. Even if you never expect to hit the cliff, one rogue retrieval upstream will, and you want a fallback path rather than a 2x bill.
GPT-5.4 excels at frontend code generation where web games, 3D scenes, complex CSS layouts, and React component scaffolds come back with fewer iterations than 5.3 and roughly match 5.5 on first-pass quality at lower cost. For agentic terminal coding and multi-file refactors, Anthropic's Claude models still lead. For tight low-latency code edits, GPT-5.3 or Codex in fast mode may be faster and cheaper.
Use the OpenAI Responses API with the gpt-5.4 or gpt-5.4-thinking model ID. The standard variant handles most workloads. The thinking variant adds reasoning with configurable effort levels and budget caps. For computer use, add the computer_use_preview tool type. Stream responses with stream=True to surface reasoning in real time so users can steer the model mid-response.
If you want the visual walkthrough that pairs with this guide, the GPT-5.4 in 10 Minutes DevDigest video covers the launch announcement, the steerable thinking UX, and the live coding demos that motivated several of the recommendations above.
Read next
GPT-5.5 and 5.5 Pro hit the API on April 24. Here is what changes for builders: pricing, agentic tasks, tool-use, and the real benchmarks I ran the day it dropped.
11 min readCodex works from the terminal, cloud tasks, IDEs, GitHub, Slack, and Linear. Here is how to use it and how it compares to Claude Code.
5 min readComplete pricing breakdown for every major AI coding tool. Claude Code, Cursor, Copilot, Windsurf, Codex, Augment, and more. Free tiers, pro plans, hidden costs, and what you actually get for your money.
12 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Multi-agent orchestration framework built on the OpenAI Agents SDK. Define agent roles, typed tools, and directional com...
View ToolAnthropic's Python SDK for building production agent systems. Tool use, guardrails, agent handoffs, and orchestration. R...
View ToolTypeScript-first AI agent framework. Agents, tools, memory, workflows, RAG, evals, tracing, MCP, and production deployme...
View ToolOpenAI's coding agent for terminal, cloud, IDE, GitHub, Slack, and Linear workflows. Reads repos, edits files, runs comm...
View ToolConfigure Claude Code for maximum productivity -- CLAUDE.md, sub-agents, MCP servers, and autonomous workflows.
AI AgentsStep-by-step guide to building an MCP server in TypeScript - from project setup to tool definitions, resource handling, testing, and deployment.
AI AgentsWhat MCP servers are, how they work, and how to build your own in 5 minutes.
AI Agents
GPT-5.5 and 5.5 Pro hit the API on April 24. Here is what changes for builders: pricing, agentic tasks, tool-use, and th...

Codex works from the terminal, cloud tasks, IDEs, GitHub, Slack, and Linear. Here is how to use it and how it compares t...

Complete pricing breakdown for every major AI coding tool. Claude Code, Cursor, Copilot, Windsurf, Codex, Augment, and m...

OpenAI is sunsetting the Assistants API in 2026. Here is a tested migration plan to the Responses API - code, state, t...

Two platforms, two philosophies. Here is how Anthropic and OpenAI compare on APIs, SDKs, documentation, pricing, and the...

Configurable memory, sandbox-aware orchestration, Codex-like filesystem tools. Here is how the new Agents SDK actually b...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.