Agent Workflows as Code: Why State Machines Beat Prompt Checklists

Prompts can describe a workflow. They cannot enforce one.

That is the sharp lesson from the latest agent tooling wave. OpenAI is moving production agent work away from hosted visual surfaces and toward the Agents SDK. LangChain is writing about custom harnesses as the scaffolding around the model. Aharness, a new Codex-focused project on GitHub and Hacker News, makes the argument more explicit: encode coding-agent workflows as finite state machines with typed gates, validated evidence, controlled transitions, repair paths, and inspectable logs.

That is the right direction.

The next useful abstraction is not a longer prompt. It is a workflow runtime the agent cannot casually ignore.

Last updated: June 23, 2026

What Is New Here

The fresh signal is Aharness, described as a workflow harness for Codex. Its pitch is narrow and practical: agent workflows should be finite state machines written in TypeScript, with states that define what Codex may do next and transitions that require validated exits.

The Show HN thread frames the problem directly: models are capable enough for longer autonomous work, but process drift and context management are now the failure modes. Prompts and skills describe the process; they do not enforce it.

That is the distinction worth writing down.

It also fits the larger context from the last few days:

OpenAI Agent Builder and Evals are on a shutdown path, which pushes production agent logic toward code and repo-owned evals.
LangChain's custom agent harness post argues that production agents need middleware for retries, policies, human approvals, cost limits, and task-specific scaffolding.
OpenAI's Agents SDK docs emphasize typed application code, direct tool control, custom storage, state, guardrails, human review, and observability.

The direction is consistent: serious agent workflows are becoming software artifacts.

The Take: Process Belongs in Runtime, Not Prompt Memory

A prompt checklist can say:

Plan first.
Only edit the requested files.
Run tests.
Attach evidence.
Stop if tests fail twice.
Ask before risky changes.

That is better than nothing. It is also easy for an agent to forget, reinterpret, or satisfy with a weak summary.

A workflow runtime can enforce:

the agent cannot leave planning until it submits an accepted plan;
implementation cannot start until scope is declared;
verification cannot pass without command output;
repair can only loop a fixed number of times;
final reporting cannot happen until evidence is attached;
risky actions require a human gate.

That is a different class of control.
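
To make one of those rules concrete, here is a minimal sketch of a verify gate that refuses to pass without command output. The names (`CommandEvidence`, `GateResult`, `verifyGate`) are hypothetical and not taken from Aharness or any SDK; the sketch also assumes the runtime, not the agent, captured the exit code and output.

```typescript
// Hypothetical evidence shape; assumed to be captured by the runtime itself.
interface CommandEvidence {
  command: string;  // the command that was run
  exitCode: number; // process exit code
  output: string;   // captured stdout/stderr
}

type GateResult =
  | { ok: true }
  | { ok: false; reason: string };

// The gate inspects evidence, not the agent's summary of the evidence.
function verifyGate(evidence: CommandEvidence | undefined): GateResult {
  if (evidence === undefined) {
    return { ok: false, reason: "no command evidence submitted" };
  }
  if (evidence.output.trim() === "") {
    return { ok: false, reason: "command output is empty" };
  }
  if (evidence.exitCode !== 0) {
    return { ok: false, reason: `command exited with code ${evidence.exitCode}` };
  }
  return { ok: true };
}

// A bare "tests passed" claim carries no evidence, so it is rejected.
const summaryOnly = verifyGate(undefined);

// A transcript with exit code 0 is accepted.
const withTranscript = verifyGate({
  command: "npm test",
  exitCode: 0,
  output: "12 passing",
});
```

The point of the shape is that "tests passed" stops being a sentence the agent writes and becomes a value the runtime checks.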

This is the same reason long-running agents need harnesses. The model can do more work now. The surrounding system has to decide what counts as a valid move.

Why Finite State Machines Fit Agent Work

A finite state machine sounds academic until you map it to a coding-agent run.

State      Allowed exits                                    Evidence required
---------  -----------------------------------------------  -----------------------------------
intake     accept task, request clarification, reject task  task contract
plan       approve plan, revise plan                        scoped plan and file boundaries
implement  move to verify, request tool approval            diff summary
verify     pass, fail, repair                               command output and failing logs
repair     retry verify, escalate, stop                     fix attempt and retry count
final      close run                                        receipt with changes, checks, risks

That is already how good human-led agent sessions work. The difference is whether the structure lives in the operator's head or in code.
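
Putting that structure in code can be as small as a lookup table. The sketch below reuses the state and exit names from the table above; the target state of each exit, and the extra `escalated`, `stopped`, and `closed` states, are my assumptions for illustration, not something any of the cited projects specifies.

```typescript
type RunState =
  | "intake" | "plan" | "implement" | "verify" | "repair" | "final"
  | "escalated" | "stopped" | "closed";

// Exit names follow the table; the targets are assumed for illustration.
const transitions: Record<RunState, Record<string, RunState>> = {
  intake: {
    "accept task": "plan",
    "request clarification": "intake",
    "reject task": "stopped",
  },
  plan: { "approve plan": "implement", "revise plan": "plan" },
  implement: { "move to verify": "verify", "request tool approval": "implement" },
  verify: { pass: "final", fail: "stopped", repair: "repair" },
  repair: { "retry verify": "verify", escalate: "escalated", stop: "stopped" },
  final: { "close run": "closed" },
  escalated: {},
  stopped: {},
  closed: {},
};

// Returns the next state, or throws if the workflow does not expose that exit.
function nextState(current: RunState, exit: string): RunState {
  const exits = transitions[current];
  if (!Object.prototype.hasOwnProperty.call(exits, exit)) {
    throw new Error(`exit "${exit}" is not allowed from state "${current}"`);
  }
  return exits[exit];
}

const afterAccept = nextState("intake", "accept task"); // "plan"

// There is no path from intake straight to a closed run.
let skipBlocked = false;
try {
  nextState("intake", "close run");
} catch {
  skipBlocked = true;
}
```

Nothing here is clever. The value is that the legal moves are now data that can be diffed, reviewed, and tested.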

State machines give agent runs three useful properties:

Controlled transitions. The agent can only move to states the workflow exposes. If there is no direct path from intake to final, the agent cannot skip planning and verification by writing a confident closeout.

Typed submissions. Each state can require a specific shape of evidence: a plan object, a file list, a command transcript, a test result, or a risk note. Natural language becomes input to a verifier, not the verifier itself.

Repair paths. Failure can be part of the workflow instead of an exception. A failed test can move the run to repair with a retry budget, or to escalation if the same failure repeats.

That makes the workflow inspectable after the fact. You can ask where the run stalled, which gate failed, which evidence was missing, and whether the agent followed the process.
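
A rough sketch of the repair path and the after-the-fact log, under stated assumptions: failures are fingerprinted by a `signature` string (for example, the failing test name), and every gate check is recorded as a `GateEvent`. Both shapes and all function names are hypothetical.

```typescript
type RepairDecision = "retry" | "escalate" | "stop";

interface VerifyFailure {
  attempt: number;
  signature: string; // assumed fingerprint of the failure
}

// Called after each failed verify. `failures` holds every failure so far, oldest first.
function decideRepair(failures: VerifyFailure[], maxAttempts: number): RepairDecision {
  const last = failures[failures.length - 1];
  const previous = failures[failures.length - 2];
  if (last !== undefined && previous !== undefined && last.signature === previous.signature) {
    return "escalate"; // the same failure repeated: hand off instead of guessing again
  }
  if (failures.length >= maxAttempts) {
    return "stop"; // retry budget exhausted
  }
  return "retry";
}

interface GateEvent {
  state: string;
  gate: string;
  passed: boolean;
  missingEvidence: string[];
}

// Answers "which gate failed, and what was missing?" from the recorded run.
function failedGates(events: GateEvent[]): GateEvent[] {
  return events.filter((e) => !e.passed);
}

// Made-up sample log for illustration.
const sampleRunLog: GateEvent[] = [
  { state: "plan", gate: "scoped plan", passed: true, missingEvidence: [] },
  { state: "verify", gate: "verification output", passed: false, missingEvidence: ["command output"] },
];

const stalledAt = failedGates(sampleRunLog).map((e) => `${e.state}: ${e.gate}`);
// stalledAt is ["verify: verification output"]
```

Checking for a repeated failure before checking the budget is a design choice: a run that is stuck on the same error should reach a human before it burns its remaining retries.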

Skills Still Matter, But They Are Not Enough

This is not an argument against skills.

Skills are useful because they package operating knowledge. A good skill can teach an agent how your team debugs flaky tests, writes release notes, reviews migrations, or handles UI QA. That is why skills beat prompts for repeatable work.

But a skill is still mostly instruction. It tells the agent what good looks like.

A workflow runtime tells the agent what moves are allowed.

You want both:

AGENTS.md for repo context;
skills for reusable methods;
MCP and CLI tools for observation and action;
state machines for process control;
eval receipts for outcome comparison.

That is the stack the post-visual-builder world is converging on.

Where LangChain's Harness Pattern Fits

LangChain's custom harness post uses different language, but the problem is similar. The post defines a harness as scaffolding around the model that connects it to the real world. It specifically calls out middleware for retries, fallbacks, policy enforcement, PII handling, approval gates, steering, cost limits, and prompt caching.

That is harness thinking.

The useful part is "task-harness fit." A customer service agent, coding agent, data agent, and legal review agent should not share one generic runtime. They need different gates, tools, logs, and failure paths.

State machines are one way to make that fit explicit. Middleware is another. LangGraph is another. The common point is that process moves out of invisible prompt wording and into something engineers can inspect.

This is where agent eval receipts matter. Once the workflow is code, you can compare versions:

Did the new gate reduce bad final reports?
Did typed evidence increase pass rates?
Did repair loops save human review time or burn tokens?
Did stricter transitions make exploratory work worse?

Those are answerable questions.
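
As a sketch of what answering them could look like, assume each run emits a receipt tagged with the workflow version it ran under. The `RunReceipt` fields and the sample numbers below are invented for illustration, not measurements.

```typescript
// Hypothetical receipt shape; field names are assumptions.
interface RunReceipt {
  workflowVersion: string;
  passed: boolean;     // did the run end with a passing verification
  repairLoops: number; // how many times the run entered repair
  tokensUsed: number;
}

interface VersionSummary {
  runs: number;
  passRate: number;
  avgRepairLoops: number;
  avgTokens: number;
}

// Groups receipts by workflow version and computes simple averages.
function summarizeByVersion(receipts: RunReceipt[]): Map<string, VersionSummary> {
  const grouped = new Map<string, RunReceipt[]>();
  for (const receipt of receipts) {
    const bucket = grouped.get(receipt.workflowVersion) ?? [];
    bucket.push(receipt);
    grouped.set(receipt.workflowVersion, bucket);
  }

  const summary = new Map<string, VersionSummary>();
  grouped.forEach((runs, version) => {
    const n = runs.length;
    summary.set(version, {
      runs: n,
      passRate: runs.filter((r) => r.passed).length / n,
      avgRepairLoops: runs.reduce((sum, r) => sum + r.repairLoops, 0) / n,
      avgTokens: runs.reduce((sum, r) => sum + r.tokensUsed, 0) / n,
    });
  });
  return summary;
}

// Made-up receipts: v2 is imagined to add a stricter verify gate.
const sampleReceipts: RunReceipt[] = [
  { workflowVersion: "v1", passed: true, repairLoops: 0, tokensUsed: 1000 },
  { workflowVersion: "v1", passed: false, repairLoops: 2, tokensUsed: 3000 },
  { workflowVersion: "v2", passed: true, repairLoops: 1, tokensUsed: 2000 },
  { workflowVersion: "v2", passed: true, repairLoops: 1, tokensUsed: 4000 },
];

const versionSummary = summarizeByVersion(sampleReceipts);
// versionSummary.get("v1")?.passRate is 0.5; versionSummary.get("v2")?.passRate is 1
```

A real comparison would need far more runs and the same task set on both sides; the sketch only shows that the question becomes a query once receipts exist.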

The Opposing Take: State Machines Can Overfit the Work

The strongest objection is also correct: not every agent task should be a state machine.

Some work is exploratory. Research, debugging, discovery, architecture search, and incident response often start without a known path. If you force those into a rigid workflow too early, you get process theater: the agent fills boxes instead of thinking.

That is the risk.

The answer is not to wrap everything in a finite state machine. The answer is to encode the parts of the workflow that should not be ambiguous.

Good candidates:

migration checklists;
release note generation;
dependency upgrade review;
security triage;
code review receipts;
frontend QA loops;
eval replay workflows;
deploy closeout checks.

Bad candidates:

open-ended research;
early product exploration;
ambiguous architecture discovery;
first-pass debugging where the failure mode is not known.

Use dynamic agent behavior for discovery. Use state machines for commitments.

A Practical Pattern

If I were turning a prompt checklist into an agent workflow, I would start with four files:

workflows/
  bugfix.fsm.ts
  bugfix.schema.ts
  bugfix.evals.jsonl
  README.md

The finite state machine owns the legal transitions. The schema file owns typed submissions. The eval file owns representative tasks. The README explains when to use the workflow and when not to.
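
For the eval file, one plausible layout is one JSON object per line, validated before any run starts. The `EvalCase` fields below are my guess at a useful minimum, not a format defined by any of the tools mentioned here.

```typescript
// Hypothetical eval case shape for bugfix.evals.jsonl.
interface EvalCase {
  id: string;
  task: string;            // the task text handed to the agent
  expectedFiles: string[]; // files a correct fix is expected to touch
}

// Runtime check, because JSON.parse gives no type guarantees.
function isEvalCase(value: unknown): value is EvalCase {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const files = record["expectedFiles"];
  return (
    typeof record["id"] === "string" &&
    typeof record["task"] === "string" &&
    Array.isArray(files) &&
    files.every((f) => typeof f === "string")
  );
}

// Parses JSONL text, skipping blank lines and failing loudly on bad rows.
function parseEvalCases(jsonl: string): EvalCase[] {
  const cases: EvalCase[] = [];
  jsonl.split("\n").forEach((line, index) => {
    if (line.trim() === "") return;
    const raw: unknown = JSON.parse(line);
    if (!isEvalCase(raw)) {
      throw new Error(`line ${index + 1}: not a valid eval case`);
    }
    cases.push(raw);
  });
  return cases;
}

// Made-up sample content for illustration.
const sampleJsonl = [
  '{"id": "bug-001", "task": "Fix off-by-one in pagination", "expectedFiles": ["src/paginate.ts"]}',
  "",
  '{"id": "bug-002", "task": "Handle empty config file", "expectedFiles": ["src/config.ts", "test/config.test.ts"]}',
].join("\n");

const evalCases = parseEvalCases(sampleJsonl);
// evalCases.length is 2
```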

For teams that already version prompts, this should feel like the next step after Prompt Versioning with Promptlock. Prompt diffs show what instructions changed. Workflow diffs show what the agent is allowed to do with those instructions.

The key gates:

Gate                    What it prevents
----------------------  -----------------------------
accepted task contract  vague work entering the run
scoped plan             broad diffs before agreement
declared file list      silent ownership expansion
verification output     fake "tests passed" summaries
bounded repair loop     endless retry token burn
final receipt           unreviewable closeouts

This is not heavy process. It is the minimum scaffolding that keeps a capable agent from wandering.
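
The declared file list is the easiest gate to show. A minimal sketch, assuming the runtime can list the files a diff touched (how it gets that list is out of scope here) and using hypothetical function names:

```typescript
// Returns every touched file that was not declared in the plan.
function outOfScopeFiles(declared: string[], touched: string[]): string[] {
  const allowed = new Set(declared);
  return touched.filter((file) => !allowed.has(file));
}

// The gate passes only when the diff stays inside the declared scope.
function fileScopeGate(declared: string[], touched: string[]): { ok: boolean; offending: string[] } {
  const offending = outOfScopeFiles(declared, touched);
  return { ok: offending.length === 0, offending };
}

const declaredFiles = ["src/parser.ts", "test/parser.test.ts"];

const inScope = fileScopeGate(declaredFiles, ["src/parser.ts"]);
// inScope.ok is true

const expanded = fileScopeGate(declaredFiles, ["src/parser.ts", "package.json"]);
// expanded.ok is false; expanded.offending is ["package.json"]
```

This compares exact paths only. A real gate would probably need path normalization and a way for the agent to request a scope change through the plan state rather than around it.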

What To Watch Next

The interesting race is not whether Aharness specifically wins. It is whether the pattern spreads.

Watch for:

workflow packages shared like npm modules;
agent harnesses with typed submission schemas;
CI checks that verify agent workflow definitions;
visualizers that render state-machine runs for human review;
eval suites that compare workflow versions, not only model versions;
integrations that let Codex, Claude Code, Cursor, and custom agents consume the same process definitions.

That last point matters. The durable artifact should not be "a prompt that works in one chat app." It should be a workflow definition that survives model and UI churn.
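
If the definition is plain data, a CI check over it is a graph walk. The sketch below assumes a hypothetical `WorkflowDefinition` shape (a map from each state to the states it may move to) and flags two problems: states that cannot be reached from the initial state, and a direct edge from the initial state to the terminal one.

```typescript
// Hypothetical, JSON-serializable definition shape.
interface WorkflowDefinition {
  initial: string;
  states: Record<string, string[]>; // state -> states it may transition to
}

function targetsOf(def: WorkflowDefinition, state: string): string[] {
  return Object.prototype.hasOwnProperty.call(def.states, state) ? def.states[state] : [];
}

// Breadth-first walk from the initial state.
function unreachableStates(def: WorkflowDefinition): string[] {
  const seen = new Set<string>([def.initial]);
  const queue: string[] = [def.initial];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const next of targetsOf(def, current)) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return Object.keys(def.states).filter((state) => !seen.has(state));
}

// Returns human-readable problems; an empty array means the lint passed.
function lintWorkflow(def: WorkflowDefinition, terminal: string): string[] {
  const problems: string[] = [];
  for (const state of unreachableStates(def)) {
    problems.push(`state "${state}" is unreachable from "${def.initial}"`);
  }
  if (targetsOf(def, def.initial).indexOf(terminal) !== -1) {
    problems.push(`"${def.initial}" can jump directly to "${terminal}"`);
  }
  return problems;
}

const goodWorkflow: WorkflowDefinition = {
  initial: "intake",
  states: {
    intake: ["plan"],
    plan: ["implement"],
    implement: ["verify"],
    verify: ["final", "repair"],
    repair: ["verify"],
    final: [],
  },
};

const badWorkflow: WorkflowDefinition = {
  initial: "intake",
  states: {
    intake: ["plan", "final"], // skips planning and verification
    plan: ["final"],
    audit: ["final"],          // nothing leads here
    final: [],
  },
};

const goodProblems = lintWorkflow(goodWorkflow, "final"); // []
const badProblems = lintWorkflow(badWorkflow, "final");   // two problems
```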

The agent ecosystem is slowly relearning a very old software lesson: if a process matters, put it in code.

FAQ

What does "agent workflows as code" mean?

It means encoding the agent process in versioned software artifacts instead of relying only on natural language prompts. The workflow can define states, allowed transitions, evidence requirements, retry limits, tool policies, and final receipts.

Why use a state machine for coding agents?

State machines make the run inspectable and enforceable. They prevent agents from skipping required stages, require evidence before transitions, and make failures route through defined repair or escalation paths.

Are skills the same as workflows as code?

No. Skills package operating knowledge and reusable instructions. Workflows as code enforce the process around the skill: when it runs, what evidence it must produce, what transitions are allowed, and when the run stops.

When should I avoid state-machine agent workflows?

Avoid rigid workflows for early exploration, open-ended research, and ambiguous debugging. Use them when the process is known and the cost of skipping steps is high: releases, migrations, security triage, code review receipts, eval replay, and deploy checks.

Is Aharness only for Codex?

Aharness is currently framed around Codex workflows, but the broader idea is not Codex-specific. Any coding-agent stack can benefit from typed gates, controlled transitions, repair paths, and inspectable evidence.

Sources

OpenAI Agent Builder and Evals Are Shutting Down: Move the Agent Stack Into Code

Long-Running Agents Need Harnesses, Not Hope

Agent Evals Need Baseline Receipts
