
TL;DR
OpenAI's harness engineering post and new token-use research point to the same lesson: agentic coding teams need token budgets, receipts, and eval loops, not vibes.
| Source | Description |
|---|---|
| OpenAI - Harness engineering: Leveraging Codex in an agent-first world | OpenAI's June 2026 writeup on agent harnesses, scaffolding, tests, and feedback loops |
| HN discussion | Hacker News discussion around the OpenAI harness engineering article |
| Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering | Research paper measuring token use across agentic software engineering tasks |
| OpenAI Codex docs | Official Codex product and developer documentation |
| OpenAI Codex changelog | Official Codex release notes for current product behavior |

OpenAI's harness engineering post hit the Hacker News front page today, and the headline is easy to flatten into "Codex works better with good tests."
That is true, but too small.
The more useful read is that agentic coding is becoming a systems engineering problem. The agent is only one component. The rest of the system is prompt scaffolding, repo setup, task routing, tool access, test selection, human review, and feedback capture.
Once you see it that way, tokens stop being a vague AI bill. Tokens become a systems budget.
A coding agent spends tokens to understand the repo, choose a plan, read files, call tools, inspect failures, rewrite code, explain its diff, and respond to review. A new paper, Tokenomics, puts numbers behind that intuition by studying where tokens are consumed in agentic software engineering workflows. The important point is not the exact split for your repo. It is that token use has structure.
If token use has structure, you can instrument it.
If you can instrument it, you can improve it.
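As a minimal sketch of what that instrumentation could look like, the snippet below sums token counts per workflow phase and turns them into shares. The phase names, the `(phase, tokens)` event shape, and the example numbers are all assumptions made for illustration. They are not taken from the Tokenomics paper or from any Codex API.

```python
# Hypothetical phase names, loosely following the activities listed above.
PHASES = ("exploration", "planning", "implementation", "verification", "review_response")


def tokens_by_phase(events):
    """Sum token counts per phase from an iterable of (phase, tokens) pairs."""
    totals = {phase: 0 for phase in PHASES}
    for phase, tokens in events:
        if phase not in totals:
            raise ValueError(f"unknown phase: {phase}")
        totals[phase] += tokens
    return totals


def phase_shares(totals):
    """Convert absolute totals into fractions of the whole run."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {phase: 0.0 for phase in totals}
    return {phase: count / grand_total for phase, count in totals.items()}


# Made-up numbers, for illustration only.
example_events = [
    ("exploration", 1200),
    ("implementation", 800),
    ("verification", 1500),
    ("exploration", 500),
]
example_totals = tokens_by_phase(example_events)
example_shares = phase_shares(example_totals)
```

Even a rough split like this is enough to ask whether exploration is growing from run to run or whether verification is where the budget actually goes.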
The old AI coding workflow was mostly personal preference:
That still matters, but it does not scale to teams.
OpenAI's harness engineering framing says the durable unit is the harness around the agent. That means the repeatable environment that tells the agent what work is allowed, where context lives, how to run checks, how to recover from errors, and what evidence it must leave behind.
That connects directly to the last few weeks of DevDigest coverage: security agents need repro harnesses, AI code attribution needs defect forensics, agent memory needs a context ledger, and agent containment needs a capability ledger. Each post is a different face of the same shift.
Agent quality is no longer just "which model is smartest?"
It is:
The harness is where those questions become enforceable.
The Hacker News thread is useful because it does not treat the article as magic. The skeptical version is basically: this works when you have enough scaffolding, enough tests, enough infrastructure, and enough patience to build a specialized workflow. It is not the same as dropping a generic agent into an arbitrary repo and expecting compounding returns.
That criticism is correct.
It is also the point.
Most teams should not expect a coding agent to walk into a messy codebase, infer the product, infer the test policy, infer the deploy constraints, infer the review culture, and consistently produce good work. Humans do not do that either. Good teams onboard people into local constraints.
The agent harness is the same kind of onboarding, written for software, and it keeps working after the first session.
The bad version is a pile of prompt text:
Be careful. Run tests. Follow our style. Do not break things.
The better version is executable:
Read AGENTS.md.
Use pnpm typecheck, pnpm lint, and pnpm test for this package.
Never edit generated files.
When touching auth, run the auth route smoke.
Return the failing command if blocked.
Attach the diff and verification receipt.
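One way to back those instructions with something executable is a small check runner. This is a sketch under assumptions: `run_checks` and the receipt keys are hypothetical names, and the only things taken from the instructions above are the three pnpm commands and the rule to return the failing command when blocked.

```python
import subprocess

# The three package checks named in the instructions above.
PACKAGE_CHECKS = [
    ("typecheck", ["pnpm", "typecheck"]),
    ("lint", ["pnpm", "lint"]),
    ("test", ["pnpm", "test"]),
]


def run_checks(commands, cwd=None):
    """Run each (name, argv) check in order and stop at the first failure.

    Returns a small receipt: which checks passed, which failed, and the
    exact command the run is blocked on, if any.
    """
    receipt = {"passed": [], "failed": [], "blocked_on": None}
    for name, argv in commands:
        try:
            result = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
            succeeded = result.returncode == 0
        except FileNotFoundError:
            # The tool is not installed; treat that as a failed check.
            succeeded = False
        if succeeded:
            receipt["passed"].append(name)
        else:
            receipt["failed"].append(name)
            receipt["blocked_on"] = " ".join(argv)
            break
    return receipt
```

Calling `run_checks(PACKAGE_CHECKS, cwd="path/to/package")` would produce the verification receipt to attach next to the diff. Stopping at the first failure is a design choice here; a team that wants the full picture could keep going and collect every failure instead.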
The best version is measured. It can tell whether the harness made the agent faster, cheaper, or more reliable.
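Measuring that can start very small. The sketch below assumes you have logged, for each run, whether it passed its checks, how many tokens it used, and how long it took. The field names and the `summarize` helper are hypothetical, and "tokens per pass" is just one possible way to express cost per successful outcome.

```python
from statistics import mean


def summarize(runs):
    """Summarize a list of runs, each a dict with 'passed', 'tokens', 'seconds'."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(1 for run in runs if run["passed"])
    total_tokens = sum(run["tokens"] for run in runs)
    return {
        # Reliability: share of runs whose checks passed.
        "pass_rate": passes / len(runs),
        # Speed: average wall-clock time per run.
        "mean_seconds": mean(run["seconds"] for run in runs),
        # Cost: all tokens spent, including failed runs, per passing run.
        "tokens_per_pass": total_tokens / passes if passes else float("inf"),
    }


def compare(baseline_runs, harness_runs):
    """Return both summaries side by side for a before/after comparison."""
    return {"baseline": summarize(baseline_runs), "harness": summarize(harness_runs)}
```

Counting the tokens from failed runs in `tokens_per_pass` is deliberate in this sketch: a harness that uses more tokens per run can still be cheaper per working change if it fails less often.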
Most AI cost discussions are account-level:
That is useful for finance. It is too coarse for engineering.
For agentic coding, the more interesting budget is task-level:
That is where the Tokenomics paper is helpful. It pushes the conversation away from "agents are expensive" and toward "which parts of the workflow are expensive, and are they buying reliability?"
Some token spend is good. A coding agent that spends more tokens reading the right files before a dangerous migration may save hours of review. A security agent that spends more tokens building a proof of concept may prevent a fake finding. A refactor agent that spends extra context on tests may avoid a subtle regression.
Some token spend is waste. Reading the same files every run because memory is missing is waste. Re-running the wrong command ten times is waste. Producing a long executive summary for a one-line CSS fix is waste. Searching the whole repo when a task map already exists is waste.
The harness should separate those categories.
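A first pass at that separation can be a plain scan of the run log. The sketch below flags two of the waste patterns described above: the same file read over and over, and the same command failing repeatedly. The event shape (`kind`, `target`, `ok`) and the `find_waste` name are assumptions for illustration, not a format any agent emits today.

```python
from collections import Counter


def find_waste(events, max_repeats=2):
    """Flag targets that were read, or failed as commands, more than max_repeats times.

    Each event is a dict with 'kind' ("read" or "command") and 'target'
    (a file path or a command string). Command events also carry 'ok'.
    """
    reads = Counter(e["target"] for e in events if e["kind"] == "read")
    failures = Counter(
        e["target"]
        for e in events
        if e["kind"] == "command" and not e.get("ok", True)
    )
    return {
        "repeated_reads": sorted(t for t, n in reads.items() if n > max_repeats),
        "repeated_failures": sorted(t for t, n in failures.items() if n > max_repeats),
    }
```

The threshold is arbitrary. Rereading a file after editing it is often legitimate, so a flag here is a prompt for review, not a verdict.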
You do not need a perfect observability stack to start. Add a lightweight receipt to every serious agent run:
task: "Add invoice CSV export"
agent: "codex"
model: "gpt-5.3"
scope:
  files_changed: 4
  tests_run:
    - "pnpm typecheck"
    - "pnpm test -- invoice"
budget:
  rough_token_shape:
    exploration: "medium"
    implementation: "low"
    verification: "high"
    review_response: "low"
evidence:
  passed:
    - "typecheck"
    - "invoice tests"
  failed: []
follow_up:
  - "Add browser smoke for download filename"
That is intentionally simple. The point is not exact accounting on day one. The point is to make every run reviewable as a systems event.
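If you want the receipt to be checkable by a script as well as by a reviewer, it can be mirrored in a small data structure. This is a sketch, assuming the fields shown in the receipt above. The `RunReceipt` class, its `problems` method, and the low/medium/high buckets as a closed set are illustrative choices, not part of any tool.

```python
import json
from dataclasses import asdict, dataclass, field

LEVELS = {"low", "medium", "high"}  # coarse buckets, as in the receipt above


@dataclass
class RunReceipt:
    task: str
    agent: str
    model: str
    files_changed: int
    tests_run: list
    rough_token_shape: dict
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    follow_up: list = field(default_factory=list)

    def problems(self):
        """Return the reasons this receipt is not ready for review, if any."""
        issues = []
        if not self.tests_run:
            issues.append("no tests were run")
        for phase, level in self.rough_token_shape.items():
            if level not in LEVELS:
                issues.append(f"unknown token level for {phase}: {level!r}")
        if self.failed:
            issues.append("failed checks: " + ", ".join(self.failed))
        return issues

    def to_json(self):
        """Serialize the receipt so it can be stored next to the diff."""
        return json.dumps(asdict(self), indent=2)


receipt = RunReceipt(
    task="Add invoice CSV export",
    agent="codex",
    model="gpt-5.3",
    files_changed=4,
    tests_run=["pnpm typecheck", "pnpm test -- invoice"],
    rough_token_shape={
        "exploration": "medium",
        "implementation": "low",
        "verification": "high",
        "review_response": "low",
    },
    passed=["typecheck", "invoice tests"],
    follow_up=["Add browser smoke for download filename"],
)
```

An empty list from `receipt.problems()` means the run left the minimum evidence behind. JSON is used here only because it ships with Python; the same fields would serialize to YAML with a third-party library.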
Over time, you can make the receipt richer:
This is where Codex as a cloud and terminal agent, Claude Code memory, and team-level coding-agent policies start to converge. The product boundary is not the chat box. It is the run ledger.
Harness engineering is not just "write better prompts for Codex."
It is the discipline of making agentic software work measurable, repeatable, and reviewable. Tests are part of it. Instructions are part of it. Sandboxes are part of it. Memory is part of it. Token budgets are part of it.
The teams that get real leverage from coding agents will not be the teams that simply buy more model access. They will be the teams that can answer three questions after every agent run:
That is the difference between agent usage and agent engineering.
Harness engineering is the practice of building the repeatable environment around an AI coding agent: instructions, repo setup, tools, tests, sandboxes, review rules, and feedback loops. The harness makes agent work measurable instead of depending only on one-off prompts.
Token budgets show where an agent spends attention. They help teams separate useful effort, such as reading the right files and running verification, from waste, such as repeated failed commands or irrelevant repo exploration.
Harness engineering is not specific to Codex. OpenAI used Codex to explain the pattern, but the same harness idea applies to Claude Code, Cursor agents, OpenCode, custom MCP workflows, and any agent that reads a repo, edits files, runs tools, and returns a diff.
Teams should not minimize token use blindly. The goal is better reliability per token, not the lowest possible token count. A good harness spends more where evidence matters and less where the agent is repeating avoidable work.