Harness Engineering Makes Tokens a Systems Budget

Official Sources#

Source	Description
OpenAI - Harness engineering: Leveraging Codex in an agent-first world	OpenAI's June 2026 writeup on agent harnesses, scaffolding, tests, and feedback loops
HN discussion	Hacker News discussion around the OpenAI harness engineering article
Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering	Research paper measuring token use across agentic software engineering tasks
OpenAI Codex docs	Official Codex product and developer documentation
OpenAI Codex changelog	Official Codex release notes for current product behavior

OpenAI's harness engineering post hit the Hacker News front page today, and the headline is easy to flatten into "Codex works better with good tests."

That is true, but too small.

The more useful read is that agentic coding is becoming a systems engineering problem. The agent is only one component. The rest of the system is prompt scaffolding, repo setup, task routing, tool access, test selection, human review, and feedback capture.

Once you see it that way, tokens stop being a vague AI bill. Tokens become a systems budget.

A coding agent spends tokens to understand the repo, choose a plan, read files, call tools, inspect failures, rewrite code, explain its diff, and respond to review. A new paper, Tokenomics, puts numbers behind that intuition by studying where tokens are consumed in agentic software engineering workflows. The important point is not the exact split for your repo. It is that token use has structure.

If token use has structure, you can instrument it.

If you can instrument it, you can improve it.

Harnesses Are the New IDE Settings#

The old AI coding workflow was mostly personal preference:

which model you like
whether you use Cursor, Claude Code, Codex, or another agent
how much context you paste
whether you ask for tests
how carefully you read the diff

That still matters, but it does not scale to teams.

OpenAI's harness engineering framing says the durable unit is the harness around the agent. That means the repeatable environment that tells the agent what work is allowed, where context lives, how to run checks, how to recover from errors, and what evidence it must leave behind.

That connects directly to the last few weeks of DevDigest coverage: security agents need repro harnesses, AI code attribution needs defect forensics, agent memory needs a context ledger, and agent containment needs a capability ledger. Each post is a different face of the same shift.

Agent quality is no longer just "which model is smartest?"

It is:

what context did the system provide?
what work did the agent attempt?
what checks ran?
what tokens were spent where?
what proof did the agent leave?
what changed after review?

The harness is where those questions become enforceable.

The HN Pushback Is Right#

The Hacker News thread is useful because it does not treat the article as magic. The skeptical version is basically: this works when you have enough scaffolding, enough tests, enough infrastructure, and enough patience to build a specialized workflow. It is not the same as dropping a generic agent into an arbitrary repo and expecting compounding returns.

That criticism is correct.

It is also the point.

Most teams should not expect a coding agent to walk into a messy codebase, infer the product, infer the test policy, infer the deploy constraints, infer the review culture, and consistently produce good work. Humans do not do that either. Good teams onboard people into local constraints.

The agent harness is that onboarding system for a coding agent. Unlike a one-off prompt, it keeps working after the first session.

The bad version is a pile of prompt text:

Text

Be careful. Run tests. Follow our style. Do not break things.

The better version is executable:

Text

Read AGENTS.md.
Use pnpm typecheck, pnpm lint, and pnpm test for this package.
Never edit generated files.
When touching auth, run the auth route smoke.
Return the failing command if blocked.
Attach the diff and verification receipt.
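
To make "executable" concrete, here is a minimal sketch of a check runner that a harness could wrap around those instructions. It assumes the checks are plain command lines and that stopping at the first failure is acceptable. The names `run_checks` and `VerificationReceipt` are hypothetical and do not come from Codex or any other tool.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VerificationReceipt:
    """Outcome of running the harness checks in order."""
    passed: list = field(default_factory=list)
    failed_command: Optional[str] = None  # first command that failed, if any
    output_tail: str = ""                 # last lines of the failing command's output


def run_checks(commands, tail_lines=20):
    """Run each check (a list of argv lists) in order; stop at the first failure."""
    receipt = VerificationReceipt()
    for argv in commands:
        label = " ".join(argv)
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError as exc:
            # The tool itself is missing: report that as the blocking command.
            receipt.failed_command = label
            receipt.output_tail = str(exc)
            return receipt
        if result.returncode != 0:
            lines = (result.stdout + result.stderr).splitlines()
            receipt.failed_command = label
            receipt.output_tail = "\n".join(lines[-tail_lines:])
            return receipt
        receipt.passed.append(label)
    return receipt
```

Calling `run_checks([["pnpm", "typecheck"], ["pnpm", "lint"], ["pnpm", "test"]])` would return a receipt whose `failed_command` is either `None` or the exact command the agent should report when it is blocked.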

The best version is measured. It records enough about each run to show whether the harness made the agent faster, cheaper, or more reliable.
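
One way to check "faster, cheaper, or more reliable" is to group runs by harness version and compare a few averages. The sketch below assumes you already log total tokens, wall-clock time, and whether the diff was accepted for each run. The `RunRecord` fields and the `accepted_per_mtok` metric are illustrative choices, not something the OpenAI post or the Tokenomics paper defines.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    harness_version: str  # hypothetical label, e.g. a tag on the harness config
    tokens: int           # total tokens for the run
    wall_seconds: float   # start-to-finish time for the run
    accepted: bool        # did the diff survive review?


def summarize(runs):
    """Group runs by harness version and report speed, cost, and reliability."""
    grouped = {}
    for run in runs:
        grouped.setdefault(run.harness_version, []).append(run)

    summary = {}
    for version, items in grouped.items():
        accepted = sum(1 for r in items if r.accepted)
        total_tokens = sum(r.tokens for r in items)
        summary[version] = {
            "runs": len(items),
            "mean_seconds": sum(r.wall_seconds for r in items) / len(items),
            "mean_tokens": total_tokens / len(items),
            "acceptance_rate": accepted / len(items),
            # Accepted runs per million tokens spent (0.0 if nothing was spent).
            "accepted_per_mtok": accepted * 1_000_000 / total_tokens if total_tokens else 0.0,
        }
    return summary
```

With only a handful of runs per version the differences are noisy, so treat the output as a prompt for review rather than a verdict.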

Token Budgets Belong in the Harness#

Most AI cost discussions are account-level:

how many seats
which plan
which model
how many requests
how much spend this month

That is useful for finance. It is too coarse for engineering.

For agentic coding, the more interesting budget is task-level:

how many tokens went to repo exploration?
how many went to reading irrelevant files?
how many went to repeated failing commands?
how many went to long explanations that nobody needed?
how many went to useful verification?
how many went to review response?

That is where the Tokenomics paper is helpful. It pushes the conversation away from "agents are expensive" and toward "which parts of the workflow are expensive, and are they buying reliability?"

Some token spend is good. A coding agent that spends more tokens reading the right files before a dangerous migration may save hours of review. A security agent that spends more tokens building a proof of concept may prevent a fake finding. A refactor agent that spends extra context on tests may avoid a subtle regression.

Some token spend is waste. Reading the same files every run because memory is missing is waste. Re-running the wrong command ten times is waste. Producing a long executive summary for a one-line CSS fix is waste. Searching the whole repo when a task map already exists is waste.

The harness should separate those categories.
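
A minimal sketch of that separation, assuming each run emits events as `(phase, tokens)` pairs. The phase labels and the useful/waste split below are assumptions taken from the examples above. A real harness would define its own labels, and a human would still have to decide which reads were irrelevant.

```python
from collections import Counter

# Assumed phase labels; adjust to whatever your harness actually records.
USEFUL = {"relevant_read", "implementation", "verification", "review_response"}
WASTE = {"irrelevant_read", "repeated_failure", "unneeded_explanation"}


def split_spend(events):
    """Sum tokens per phase, then roll phases up into useful/waste/unclassified."""
    by_phase = Counter()
    for phase, tokens in events:
        by_phase[phase] += tokens

    rollup = {"useful": 0, "waste": 0, "unclassified": 0}
    for phase, tokens in by_phase.items():
        if phase in USEFUL:
            rollup["useful"] += tokens
        elif phase in WASTE:
            rollup["waste"] += tokens
        else:
            rollup["unclassified"] += tokens

    total = sum(by_phase.values())
    return {
        "by_phase": dict(by_phase),
        "rollup": rollup,
        "waste_share": rollup["waste"] / total if total else 0.0,
    }
```

Keeping an explicit "unclassified" bucket matters: anything the harness cannot label stays visible instead of being silently counted as useful.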

A Practical Token Receipt#

You do not need a perfect observability stack to start. Add a lightweight receipt to every serious agent run:

YAML

task: "Add invoice CSV export"
agent: "codex"
model: "gpt-5.3"
scope:
  files_changed: 4
  tests_run:
    - "pnpm typecheck"
    - "pnpm test -- invoice"
budget:
  rough_token_shape:
    exploration: "medium"
    implementation: "low"
    verification: "high"
    review_response: "low"
evidence:
  passed:
    - "typecheck"
    - "invoice tests"
  failed: []
follow_up:
  - "Add browser smoke for download filename"

That is intentionally simple. The point is not exact accounting on day one. The point is to make every run reviewable as a systems event.
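
If you want the receipt to gate review instead of sitting in a log, it helps to give it a typed shape. The sketch below mirrors the fields of the YAML above as a Python dataclass and adds one hypothetical helper, `review_problems`, that lists the reasons a receipt is not ready for review. It serializes to JSON because the Python standard library has no YAML parser; the field names follow the example, and everything else is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field

LEVELS = ("low", "medium", "high")  # the rough scale used in the receipt above


@dataclass
class Receipt:
    task: str
    agent: str
    model: str
    files_changed: int
    tests_run: list
    rough_token_shape: dict
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    follow_up: list = field(default_factory=list)


def review_problems(receipt):
    """Return reasons the receipt is not yet reviewable; an empty list means ready."""
    problems = []
    if not receipt.tests_run:
        problems.append("no verification commands recorded")
    if receipt.failed:
        problems.append("failing checks: " + ", ".join(receipt.failed))
    if not receipt.passed and not receipt.failed:
        problems.append("no evidence recorded")
    for phase, level in receipt.rough_token_shape.items():
        if level not in LEVELS:
            problems.append(f"unknown token level for {phase}: {level!r}")
    return problems


def to_json(receipt):
    """Serialize the receipt so it can be attached to a pull request or stored."""
    return json.dumps(asdict(receipt), indent=2)
```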

Over time, you can make the receipt richer:

capture actual token counts from the provider or gateway
record tool-call counts
tag repeated command failures
mark files that were read but never used
compare harness versions against acceptance rate
track which prompts reduce review churn
route low-risk tasks to cheaper models
escalate hard tasks only when the receipt says the cheap path failed
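
Three of those items are cheap to compute from a run log. The sketch below assumes the harness records each command with its exit code and keeps a list of files the agent read. The function names are hypothetical, and the escalation rule (any failed check or any repeated failure) is one possible policy, not a recommendation from the sources.

```python
from collections import Counter


def repeated_failures(commands, threshold=2):
    """Commands (exact string match) that exited non-zero at least `threshold` times."""
    failures = Counter(cmd for cmd, exit_code in commands if exit_code != 0)
    return {cmd: n for cmd, n in failures.items() if n >= threshold}


def unused_reads(files_read, files_used):
    """Files the agent read but never used, in first-read order, without repeats."""
    unused = []
    for path in files_read:
        if path not in files_used and path not in unused:
            unused.append(path)
    return unused


def should_escalate(checks_failed, repeats):
    """Assumed policy: move to a stronger model only if the cheap run's receipt shows failure."""
    return bool(checks_failed) or bool(repeats)
```

What counts as a "used" file is a judgment call. A simple starting point is the set of files that appear in the final diff, at the cost of flagging files that were read only for context.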

This is where Codex as a cloud and terminal agent, Claude Code memory, and team-level coding-agent policies start to converge. The product boundary is not the chat box. It is the run ledger.

The Take#

Harness engineering is not just "write better prompts for Codex."

It is the discipline of making agentic software work measurable, repeatable, and reviewable. Tests are part of it. Instructions are part of it. Sandboxes are part of it. Memory is part of it. Token budgets are part of it.

The teams that get real leverage from coding agents will not be the teams that simply buy more model access. They will be the teams that can answer three questions after every agent run:

What did the agent spend attention on?
What evidence did the system collect?
What changed in the harness so the next run is cheaper or more reliable?

That is the difference between agent usage and agent engineering.

FAQ#

What is harness engineering for AI coding agents?#

Harness engineering is the practice of building the repeatable environment around an AI coding agent: instructions, repo setup, tools, tests, sandboxes, review rules, and feedback loops. The harness makes agent work measurable instead of depending only on one-off prompts.

Why do token budgets matter for coding agents?#

Token budgets show where an agent spends attention. They help teams separate useful effort, such as reading the right files and running verification, from waste, such as repeated failed commands or irrelevant repo exploration.

Is this only relevant to Codex?#

No. OpenAI used Codex to explain the pattern, but the same harness idea applies to Claude Code, Cursor agents, OpenCode, custom MCP workflows, and any agent that reads a repo, edits files, runs tools, and returns a diff.

Should teams optimize for fewer tokens?#

Not blindly. The goal is better reliability per token, not the lowest possible token count. A good harness spends more where evidence matters and less where the agent is repeating avoidable work.
