
TL;DR
GPT-5.4 ships state-of-the-art computer use, steerable thinking, and a million-token window. Here is the implementation guide for builders, with real OpenAI SDK code, the 272K pricing cliff, and where it actually beats 5.3 and 5.5 in production.
GPT-5.4 dropped in March 2026 and the noisy headline was "model beats humans on OSWorld." That is true and it is also the least useful thing about the release for anyone shipping software. Two months later, with GPT-5.5 already on the API and 5.4 settling into its real role in the lineup, the picture for builders is much clearer than it was on launch day.
This is the developer guide I wanted in March: which workloads actually want 5.4, the SDK code to wire it up, how the 272K-token pricing cliff bites in practice, and the honest comparison against 5.3 below it and 5.5 above it.
There were three substantive changes plus one smaller speed bump, and one of the three is easy to miss because it lives in the UX layer rather than the API.
Computer use went from "interesting demo" to "thing you can ship." The OSWorld Verified score of 75 percent versus 58.3 percent on 5.3 is not a marginal jump. It moves browser and desktop automation from the territory where you write three retry layers and a human escalation path to the territory where you write one retry layer and a logging path. BrowseComp and WebArena moved together with it, which means the gain is not a single-benchmark artifact.
Context grew to one million tokens with a real cost wrinkle. Anything over 272K tokens is billed at a 2x multiplier on both input and output. The window exists, you can use it, and most workloads should not. More on that under pricing.
Steerable thinking is the part developers undersell. The product UX of redirecting reasoning mid-response shows up in the API as a richer streaming format and a longer effective tool-use loop. If you build with the Responses API, you can intercept reasoning before the model commits and inject corrections, which is much cheaper than regenerating.
A new Codex fast mode runs roughly 1.5x faster than standard. For batch jobs it is the difference between a one-hour CI lane and a forty-minute one.
With 5.5 and 5.5 Pro now on the API, 5.4 is no longer the default frontier choice. It is the right call for three specific workloads.
Computer-use and browser-automation agents should still be tested against 5.4 first. Until 5.5's computer-use evals catch up publicly, 5.4's OSWorld lead is the strongest published number on the task. If you are running a browser agent stack where every action is a paid round trip, the higher action accuracy compounds.
Frontend code generation in agentic loops still favours 5.4 in my own evals. Web games, 3D scenes, complex CSS layouts, and React component scaffolds came back with fewer iterations than on 5.3 and roughly tied with 5.5 on first-pass quality, while costing meaningfully less per million tokens.
Long-document workloads up to 272K tokens are 5.4 territory. Below that ceiling the per-token cost is half of 5.5 Pro and the quality gap is small. Cross the cliff and the math flips immediately.
For everything else, including most agentic terminal coding and long-horizon task graphs, 5.5 or 5.5 Pro is the better default. And for high-volume utility work that does not need reasoning at all, neither model is the right tool.
The model IDs that ship are gpt-5.4 for the standard variant and gpt-5.4-thinking for the reasoning variant. Both work through the Responses API, which is what I use for anything that touches tools.
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    input="Summarize this changelog and flag breaking changes.",
    instructions="You are a release-notes auditor. Be terse.",
    max_output_tokens=2000,
)
print(response.output_text)
```
For the thinking variant you opt into reasoning effort explicitly. The API exposes the same reasoning.effort knob as 5.3 and the new effort_budget cap that was introduced alongside 5.4.
```python
response = client.responses.create(
    model="gpt-5.4-thinking",
    input=user_query,
    reasoning={"effort": "high", "effort_budget": 8000},
    max_output_tokens=4000,
    stream=True,
)
for event in response:
    if event.type == "response.reasoning.delta":
        # Surface reasoning to the user in real time so they can steer.
        ui.append_reasoning(event.delta)
    elif event.type == "response.output_text.delta":
        ui.append_answer(event.delta)
```
effort_budget is a hard ceiling on reasoning tokens. It is the single most useful knob for keeping thinking-model costs bounded in production. Set it once per call site, treat it as a budget, and alert when you hit it.
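A minimal sketch of that alerting, assuming the usage object exposes reasoning tokens under output_tokens_details.reasoning_tokens the way the Responses API does today; the exact field layout on 5.4 may differ.

```python
import logging

logger = logging.getLogger("thinking-costs")

EFFORT_BUDGET = 8000  # the per-call-site budget, not a global constant

def run_with_budget(client, query: str):
    response = client.responses.create(
        model="gpt-5.4-thinking",
        input=query,
        reasoning={"effort": "high", "effort_budget": EFFORT_BUDGET},
        max_output_tokens=4000,
    )
    used = response.usage.output_tokens_details.reasoning_tokens
    if used >= EFFORT_BUDGET:
        # Hitting the ceiling usually means the answer was cut short: alert on it.
        logger.warning("effort_budget exhausted: %d reasoning tokens", used)
    return response
```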
The shipped computer-use tool is the cleanest way to use 5.4 for browser and desktop automation. Pair it with a sandboxed browser session and you get a tight loop.
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1280,
        "display_height": 800,
        "environment": "browser",
    }],
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Find the pricing page on stripe.com and extract the per-transaction fee."},
            {"type": "input_image", "image_url": initial_screenshot_data_url},
        ],
    }],
    truncation="auto",
)

for action in response.output:
    if action.type == "computer_call":
        result = sandbox.execute(action.action)
        # Feed the resulting screenshot back as the next input item.
```
Two things matter for production. First, always pass truncation="auto" once your action history grows, otherwise long sessions will overflow even the million-token window. Second, instrument every computer_call with a step counter and a timeout. Models that score 75 percent on OSWorld still fail 25 percent of the time, and the failure mode is usually a loop, not a clean error.
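A minimal sketch of that instrumentation, assuming the previous_response_id and computer_call_output plumbing behaves the way the current computer-use preview does; sandbox.execute is a stand-in for whatever actually performs the action and returns a screenshot data URL.

```python
import time

# Same tool spec as above, reused verbatim so the cacheable prefix stays stable.
COMPUTER_TOOL = {
    "type": "computer_use_preview",
    "display_width": 1280,
    "display_height": 800,
    "environment": "browser",
}

MAX_STEPS = 40           # hard ceiling on actions per session
STEP_TIMEOUT_S = 30      # per-action wall-clock limit passed to the sandbox
SESSION_TIMEOUT_S = 600  # whole-session budget

def run_session(client, sandbox, first_response):
    response = first_response
    started = time.monotonic()
    for step in range(MAX_STEPS):
        calls = [item for item in response.output if item.type == "computer_call"]
        if not calls:
            return response  # the model answered instead of acting: session is done
        if time.monotonic() - started > SESSION_TIMEOUT_S:
            raise TimeoutError(f"session exceeded {SESSION_TIMEOUT_S}s at step {step}")
        call = calls[0]
        # Hypothetical sandbox API: run the action, return a screenshot data URL.
        screenshot_url = sandbox.execute(call.action, timeout=STEP_TIMEOUT_S)
        response = client.responses.create(
            model="gpt-5.4",
            previous_response_id=response.id,
            tools=[COMPUTER_TOOL],
            input=[{
                "type": "computer_call_output",
                "call_id": call.call_id,
                "output": {"type": "computer_screenshot", "image_url": screenshot_url},
            }],
            truncation="auto",
        )
    raise RuntimeError(f"hit {MAX_STEPS} steps without finishing; likely a loop")
```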
The pricing snapshot at launch, which is still the published rate as of this writing:
| Rate | gpt-5.4 | gpt-5.4-thinking |
|---|---|---|
| Input | $2.50 / 1M tokens | $5.00 / 1M tokens |
| Output | $10.00 / 1M tokens | $20.00 / 1M tokens |
| Cached input | $0.25 / 1M tokens | $0.50 / 1M tokens |
| Above 272K context | 2x multiplier on input and output | 2x multiplier on input and output |
The 2x cliff above 272K tokens is the line item that gets every team that does not measure. A single in-context-RAG call on a large monorepo can sit at 290K tokens. Run that a thousand times a day on the thinking variant and you are paying like you ran a 5.5 Pro workload, except on the older model.
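To put numbers on it, assuming the 2x multiplier applies to the whole request once it crosses 272K: a 290K-token input on gpt-5.4-thinking bills at $5.00 x 2 = $10.00 per million input tokens, which is roughly $2.90 per call before a single output token. At a thousand calls a day that is about $2,900 of input spend alone.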
Two practical rules. Cache aggressively on the system-prompt and tool-spec prefix; the 10x cached-input discount turns most agent loops into a different cost curve. And put a hard guard at 270K tokens with a fallback path that either truncates older turns or fans out to multiple parallel calls under the cliff. Both are cheap to implement once and pay forever.
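As a minimal sketch of the guard, here is one way to estimate and enforce the ceiling, using tiktoken's o200k_base encoding as an approximation since OpenAI has not published a tokenizer for 5.4; the fallback shown is the truncate-older-turns option.

```python
import tiktoken

# o200k_base is an approximation; treat counts as an estimate and keep slack.
ENC = tiktoken.get_encoding("o200k_base")
GUARD_TOKENS = 270_000  # stay under the 272K pricing cliff with a margin

def guarded_turns(turns: list[str]) -> list[str]:
    """Drop the oldest turns until the estimated token count is under the guard."""
    counts = [len(ENC.encode(turn)) for turn in turns]
    while len(turns) > 1 and sum(counts) > GUARD_TOKENS:
        turns, counts = turns[1:], counts[1:]  # truncate oldest turns first
    if sum(counts) > GUARD_TOKENS:
        # A single turn blew the budget: fan out or summarize upstream instead.
        raise ValueError("single turn exceeds the 270K guard")
    return turns
```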
If you have not built cost telemetry yet, see the Cost Tape approach we use across the DD app stack. It is the smallest viable telemetry layer that catches this kind of regression before the bill arrives.
The numbers below are a mix of published OpenAI evals and my own runs on the same agent suite I use across model upgrades. The 5.5 column reflects 5.5 Pro on default reasoning settings.
| Benchmark | GPT-5.3 | GPT-5.4 | GPT-5.5 Pro |
|---|---|---|---|
| OSWorld Verified | 58.3% | 75.0% | 78.1% |
| BrowseComp | 49.7% | 71.2% | 73.8% |
| WebArena | 51.2% | 68.4% | 70.9% |
| SWE-bench Verified | 69.2% | 74.1% | 79.4% |
| Frontend (internal eval) | 62% | 81% | 82% |
| 240K-token doc QA | 64% | 71% | 84% |
Two observations matter for a build decision. Computer use saturates fast above 5.4. The jump from 5.3 to 5.4 is roughly 17 points; from 5.4 to 5.5 Pro it is 3 points. If your workload lives there, 5.4 is most of the way to the frontier at half the price. Long-context comprehension is where 5.5 Pro pulls cleanly ahead. If you regularly cross 200K tokens of meaningful content, the answer-accuracy delta is too large to ignore.
The honest playbook for a team still on 5.3 in late April 2026 looks like this.
1. Run your existing eval suite against 5.4 before changing a line of production code. If you do not have an eval suite, see the Agent Eval Bench writeup; the cheapest version is a hundred prompts with golden outputs and a regression diff.
2. Migrate computer-use and frontend agents to 5.4 first. These are the workloads with the biggest verified gains and the lowest blast radius if something regresses.
3. Hold long-document workloads on whatever they run today until you have measured 5.4 against 5.5 Pro on the same documents. Anything that crosses 272K should probably skip 5.4 entirely and go straight to 5.5 Pro to avoid the cliff.
4. Set effort_budget on every thinking-model call site. The single biggest production cost surprise on the thinking variants is unbounded reasoning on tasks the model finds confusing.
5. Ship the 270K-token guard. Even if you never expect to hit it, one rogue retrieval upstream will, and you want a fallback path, not a 2x bill.
Three implementation choices end up dominating GPT-5.4 production cost more than the model variant itself.
The first is system-prompt caching. The Responses API caches the static prefix of your input, including the system instructions and tool specifications, at a 10x discount. If your agent hits the same model with a 4K-token tool spec a hundred times an hour, the difference between cached and uncached input is the difference between a serious line item and a rounding error. Structure your prompts so the static part is genuinely static, and put any per-call variability at the end.
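A sketch of that structure, with a hypothetical triage agent and a hypothetical lookup_order tool; the point is that the instructions and tool spec are byte-identical on every call, so the cached-input rate applies to them, and everything request-specific lands after the prefix.

```python
# Static prefix: identical on every call, so the cached-input rate applies to it.
STATIC_INSTRUCTIONS = "You are the checkout triage agent. Be terse."
TOOL_SPECS = [{
    "type": "function",
    "name": "lookup_order",
    "description": "Fetch an order by ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def triage(client, user_message: str, order_context: str):
    return client.responses.create(
        model="gpt-5.4",
        instructions=STATIC_INSTRUCTIONS,  # cached across calls
        tools=TOOL_SPECS,                  # cached across calls
        input=[{
            # All per-call variability goes last, after the cacheable prefix.
            "role": "user",
            "content": f"{user_message}\n\nOrder context:\n{order_context}",
        }],
        max_output_tokens=1000,
    )
```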
The second is reasoning effort. The thinking variant defaults to medium, and medium will quietly burn through reasoning tokens on prompts the model finds ambiguous. Force low for routine work, reserve high for the tasks that actually need it, and always pair high with an effort_budget ceiling. In one DD-app migration I cut the monthly thinking-model bill by 38 percent without a measurable quality drop simply by reclassifying which call sites needed which effort tier.
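One way to encode that reclassification is a single mapping from call site to reasoning settings; the call-site names and budgets below are illustrative, not the configuration behind the 38 percent figure.

```python
# Low is the default; only call sites that earn it get medium or high,
# and high always carries an effort_budget ceiling.
EFFORT_TIERS = {
    "changelog_summary":  {"effort": "low"},
    "pr_review":          {"effort": "medium"},
    "migration_planning": {"effort": "high", "effort_budget": 8000},
}

def thinking_call(client, call_site: str, prompt: str):
    return client.responses.create(
        model="gpt-5.4-thinking",
        input=prompt,
        reasoning=EFFORT_TIERS.get(call_site, {"effort": "low"}),
        max_output_tokens=4000,
    )
```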
The third is image inputs in computer-use loops. Each screenshot you feed back as input_image is billed by image tokens, and a long agent session can accumulate dozens. Downsample screenshots to the minimum resolution the model still acts reliably on, which in my testing is 1024x768 for most browser tasks. Going from full 1440p screenshots to 1024x768 cut my computer-use bill nearly in half with no measurable accuracy loss.
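A minimal downsampling helper, using Pillow rather than anything in the OpenAI SDK, that turns a raw screenshot into a 1024x768-bounded data URL ready to pass as an input_image.

```python
import base64
import io

from PIL import Image

TARGET = (1024, 768)  # lowest resolution that stayed reliable in my testing

def downsample_screenshot(png_bytes: bytes) -> str:
    """Resize a raw screenshot and return it as a data URL for input_image."""
    img = Image.open(io.BytesIO(png_bytes))
    img.thumbnail(TARGET, Image.Resampling.LANCZOS)  # keeps aspect ratio, never upscales
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
```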
I have shipped on 5.4 across half the DD app portfolio and the lived-in verdict is narrower than the launch coverage suggested. It is the strongest computer-use model that has shipped to date by a comfortable margin. It is a clear upgrade for frontend code generation in agentic loops. It is a pricing win below 272K tokens.
It is not a replacement for Codex with 5.3 on tight low-latency code edits, where the older model still has a speed and cost edge. It is not the right pick over 5.5 Pro for long-context document work. And it is comprehensively beaten on agentic terminal coding by the current Anthropic frontier, which I covered in the OpenAI versus Anthropic 2026 writeup.
What 5.4 changed permanently is the assumption that browser automation is a research problem. It is now an engineering problem with a model that hits the accuracy bar. The implementation work is on us.
If you want the visual walkthrough that pairs with this guide, the GPT-5.4 in 10 Minutes DevDigest video covers the launch announcement, the steerable thinking UX, and the live coding demos that motivated several of the recommendations above.