
TL;DR
Boris Cherny's loop-heavy Claude Code workflow points at the next Codex content lane: recurring agents that babysit PRs, CI, deploys, and feedback streams.
Boris Cherny's recent interview is worth watching because it names the thing most AI coding demos still hide: the future of agent work is not one perfect prompt. It is many supervised loops.
In the interview, Boris describes a personal Claude Code setup that has moved far past "agent writes a diff." He talks about running multiple sessions, using sub-agents heavily, and leaning more and more on /loop: recurring agent jobs scheduled with cron. The examples he gives are wonderfully boring.
That is the useful part. The examples are not magical. They are the exact maintenance chores every engineering team already does poorly.
This is also where Codex content should go next. Codex automations, Codex goals, the Codex GitHub Action, and the Codex cloud security playbook all point in the same direction: the winning agent workflow is a loop with boundaries, receipts, and escalation rules.
The first AI coding workflow was a task:
Fix this bug.
The second workflow was a scoped task:
Fix the billing webhook validation.
Only touch app/api/billing and lib/billing.
Run pnpm test billing and pnpm typecheck.
Return changed files, tests run, and risks.
The loop workflow is different:
Every 15 minutes, inspect open PRs labeled codex-watch.
If CI is red for a deterministic reason, attempt one fix.
If main moved, rebase once.
If the same failure appears twice, stop and leave a concise report.
Never push directly to main.
That is not just "task, repeated." It has a trigger, scope, action budget, stop condition, and reporting path. Those are the pieces that turn an agent from a clever assistant into a useful background process.
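Those pieces are easy to make concrete. Here is a minimal sketch of one loop run with an action budget and a stop condition; the class and method names are illustrative, not a real Codex API.

```python
# Hypothetical sketch: two pieces of the loop contract in code.
# The action budget caps how much the loop may do per resource, and the
# stop condition catches the "same failure twice" case from the article.
from dataclasses import dataclass, field

@dataclass
class LoopRun:
    max_attempts: int = 1                          # action budget
    seen_failures: set = field(default_factory=set)

    def step(self, failure_id: str) -> str:
        # stop condition: the same failure appearing twice means
        # the problem needs a human, not another retry
        if failure_id in self.seen_failures:
            return "stop: same failure seen twice"
        self.seen_failures.add(failure_id)
        if self.max_attempts <= 0:
            return "stop: attempt budget exhausted"
        self.max_attempts -= 1
        return f"attempted one fix for {failure_id}"

run = LoopRun()
print(run.step("flaky-test-billing"))   # first attempt is allowed
print(run.step("flaky-test-billing"))   # repeat failure -> stop and report
```

The point of the sketch is that "stop" is a first-class return value, not an exception: the loop's job includes deciding not to act.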
One-shot agents are good at bounded edits. Loops are good at changing state.
A PR changes after review comments land. CI changes after a dependency cache expires. A deployment changes after Coolify finishes building. User feedback changes every hour. A model eval changes after new examples arrive. These are not single-shot problems. They are state-monitoring problems.
That is why Boris's examples land. PR babysitting and CI repair are high-value because they sit in the annoying gap between "the code is basically right" and "the work is actually merged."
Codex is well positioned for this because the surface area is already there: the CLI for local work, the GitHub Action for repo events, automations for recurring checks, goals for longer-running objectives, and browser verification for production checks.
The missing piece is not capability. It is loop design.
Every useful Codex loop should fit on one page.
```yaml
name: pr-babysitter
trigger:
  every: 15m
scope:
  include:
    - pull_requests:
        labels: ["codex-watch"]
  exclude:
    - main
permissions:
  repo: write-branch
  ci: read
  deploys: read
budget:
  max_attempts_per_pr: 1
  max_runtime_minutes: 20
  max_files_changed: 8
stop:
  - same_failure_seen_twice
  - merge_conflict_requires_product_decision
  - tests_fail_after_one_fix
report:
  destination: pr-comment
  fields:
    - summary
    - action_taken
    - tests_run
    - remaining_blocker
```
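A contract like this is only worth writing if something enforces it. Here is a minimal sketch of checking a proposed action against the budget section; the field names mirror the YAML above, and `within_budget` is a hypothetical helper, not part of Codex.

```python
# Hypothetical sketch: gate every action behind the contract's budget.
# Field names match the budget section of the YAML contract.
budget = {"max_attempts_per_pr": 1, "max_runtime_minutes": 20, "max_files_changed": 8}

def within_budget(attempts: int, runtime_min: float, files_changed: int) -> bool:
    """Return True only if every budget line still holds."""
    return (
        attempts < budget["max_attempts_per_pr"]
        and runtime_min < budget["max_runtime_minutes"]
        and files_changed <= budget["max_files_changed"]
    )

print(within_budget(0, 5, 3))    # fresh run: allowed
print(within_budget(1, 5, 3))    # second attempt on the same PR: blocked
```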
The contract matters because loops are powerful in the same way cron jobs are powerful: they keep running after the interesting part is over.
Without a contract, a loop becomes background chaos. With a contract, it becomes a junior operations teammate that handles the boring parts and escalates the judgment calls.
Start with loops that are safe, boring, and obviously reviewable.
Trigger: every 15 minutes on PRs with a label.
Job: inspect CI status, attempt one fix for deterministic failures, and rebase once if main has moved.
Stop if the same failure appears twice. Stop if the branch has merge conflicts that require a human decision. Stop if the fix touches files outside the declared scope.
This is the cleanest Codex loop because it maps to GitHub's natural workflow. The output is a PR comment, a small branch commit, or a status report.
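The reporting half of that loop is worth sketching too. Here is a minimal version of the PR comment the babysitter might leave, using the report fields from the contract; the function and its formatting are illustrative assumptions.

```python
# Hypothetical sketch: the status report a PR babysitter posts as a comment.
# Field names follow the contract's report section; content is invented.
def format_report(summary, action_taken, tests_run, remaining_blocker=None):
    lines = [
        f"**Summary:** {summary}",
        f"**Action taken:** {action_taken}",
        f"**Tests run:** {', '.join(tests_run)}",
    ]
    # only surface a blocker line when the loop actually stopped on one
    if remaining_blocker:
        lines.append(f"**Remaining blocker:** {remaining_blocker}")
    return "\n".join(lines)

print(format_report(
    "CI red due to stale lockfile",
    "regenerated lockfile, one commit pushed to branch",
    ["pnpm test billing", "pnpm typecheck"],
))
```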
Trigger: every 30 minutes on main.
Job: scan recent CI runs on main, cluster the failures, and report the top deterministic ones.
The important thing is not letting the agent quietly mutate production code. The first version should be report-only. Once the reports are useful, let it open a branch for the top deterministic failure.
This pairs well with long-running agent harnesses, because CI health is exactly where retry limits, tool logs, and receipts matter.
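The report-only first version can be very small. Here is a sketch that clusters recent failures by their first error line and surfaces the most frequent one without touching any code; the log lines are made up for illustration.

```python
# Hypothetical sketch: a report-only CI health pass. It groups failures by
# error message and surfaces the top deterministic one, nothing more.
from collections import Counter

failures = [
    "TypeError: cannot read property 'id' of undefined",
    "ETIMEDOUT connecting to registry.npmjs.org",
    "TypeError: cannot read property 'id' of undefined",
]

clusters = Counter(failures)
top_failure, count = clusters.most_common(1)[0]
print(f"top failure ({count}x): {top_failure}")
```

A repeated identical error line is a reasonable (if crude) proxy for "deterministic"; timeouts and one-off network errors tend not to cluster.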
Trigger: after push to main, or every 10 minutes while a deploy is in progress.
Job: confirm the deploy finished, request /api/health, and verify the page renders the expected content.
This is the loop I want for content automation. A blog post is not done when the commit lands. It is done when production returns 200 and the page references the expected hero image.
For Codex, this should be a first-class recurring pattern because it is one of the easiest ways to turn agent work into visible shipped work.
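The verification step is easiest to reason about as a pure check over a fetched response. In this sketch the health semantics and hero-image marker are assumptions, not a fixed Codex convention; a real loop would fetch the status code and HTML first.

```python
# Hypothetical sketch: the "done" check of a deploy-verification loop,
# written as a pure function so it is trivial to test.
def deploy_verified(status_code: int, page_html: str, hero_src: str) -> bool:
    # "done" means production answered 200 AND the page actually
    # references the expected hero image, not just that the commit landed
    return status_code == 200 and hero_src in page_html

print(deploy_verified(200, '<img src="/img/hero-loops.png">', "/img/hero-loops.png"))
print(deploy_verified(200, "<html>placeholder</html>", "/img/hero-loops.png"))
```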
Trigger: every 30 or 60 minutes.
Job: pull new feedback, cluster recurring themes, and post a short digest.
Boris mentioned clustering Twitter feedback. That is the exact pattern content teams should steal. It turns the outside world into a recurring editorial signal.
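The clustering does not need to be fancy to be useful. Here is a sketch that tags each feedback item with a theme by keyword match; the themes, keywords, and feedback text are all invented for illustration.

```python
# Hypothetical sketch: turning raw feedback into a recurring editorial
# signal by tagging each item with a coarse theme.
THEMES = {
    "pricing": ["price", "cost", "bill"],
    "docs": ["docs", "tutorial", "example"],
}

def tag(item: str) -> str:
    lower = item.lower()
    for theme, keywords in THEMES.items():
        if any(k in lower for k in keywords):
            return theme
    return "other"   # untagged items still show up in the digest

feedback = [
    "The docs need a tutorial on loops",
    "Billing page is confusing",
    "love it",
]
print([tag(f) for f in feedback])
```

A real loop would swap the keyword table for embedding-based clustering, but the shape stays the same: ingest, group, digest.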
For Developers Digest, this is how "go hard on Codex" becomes a system.
Loops fail differently from one-shot agents.
A one-shot agent fails and stops. A loop fails and comes back in 15 minutes.
That can be good. It can also create the exact cost pattern from the $400 overnight agent bill: retry, inspect, edit, rerun, repeat.
Every loop needs a hard budget: maximum attempts, maximum runtime, and maximum files changed per run.
A loop can keep acting on yesterday's plan after today's context changes.
Fix: every loop run starts by refreshing the state it depends on. For PRs, fetch latest base and head. For CI, inspect the current run, not the last one cached in context. For deploys, ask production, not local build output.
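In code, the fix is an ordering rule: fetch first, decide second. This sketch makes that explicit; `fetch_pr_state` is a stand-in for a real API call, with hard-coded data for illustration.

```python
# Hypothetical sketch: every run refreshes the state it depends on
# before deciding anything. Nothing from the previous run is trusted.
def fetch_pr_state(pr_number: int) -> dict:
    # stand-in for "fetch latest base and head from the server";
    # values are hard-coded here purely for illustration
    return {"base_sha": "abc123", "head_sha": "def456", "ci": "red"}

def run_once(pr_number: int) -> str:
    state = fetch_pr_state(pr_number)   # refresh FIRST, then decide
    if state["ci"] == "red":
        return "inspect current CI run"
    return "nothing to do"

print(run_once(42))
```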
If five loops can touch the same PR, you do not have automation. You have a race condition.
Assign ownership: one loop gets write access to a given resource, and every other loop only reads it.
Shared read access is fine. Shared write access should be rare.
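One cheap way to encode that rule is a writer registry checked before any write. The registry contents and function below are illustrative assumptions.

```python
# Hypothetical sketch: one writer per resource. The registry maps each PR
# to the single loop allowed to write to it; everyone else gets read-only.
WRITERS = {"pr-101": "pr-babysitter"}   # illustrative assignment

def access(loop_name: str, resource: str) -> str:
    if WRITERS.get(resource) == loop_name:
        return "write"
    return "read"   # shared reads are fine; shared writes are the race

print(access("pr-babysitter", "pr-101"))
print(access("ci-health", "pr-101"))
```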
The best loop is not the one that never asks for help. The best loop is the one that knows when it has hit a judgment boundary.
Escalate when the same failure repeats, when a fix would touch files outside the declared scope, or when a merge conflict requires a product decision.
This is where agents become useful teammates instead of background scripts with model access.
The important insight in the interview is not that Boris runs an absurd number of agents. Most teams should not copy that directly.
The important insight is that he is moving up a level of abstraction. He is not only asking agents to write code. He is asking agents to maintain workflows over time.
That is the same shift Codex needs to own.
Codex should not only answer:
Can you fix this bug?
It should answer:
Can you keep this PR moving until it is either merged or blocked by a human decision?
That second question is much more valuable.
Here is the content and product thesis:
Codex wins when it becomes the loop manager for engineering work.
Not just the model that writes the code. Not just the CLI that edits files. The system that can watch state, act within a budget, verify the result, and escalate the judgment calls.
That is the difference between agent assistance and agent operations.
The next Codex content cluster should cover these loop patterns end to end.
That cluster is more useful than another generic "what is Codex" post because it meets teams where they are: trying to turn agent output into shipped, reviewed, production-safe work.
Boris's loop-heavy workflow is a preview of where agentic coding is going. The headline is not "engineers will manage thousands of agents." The headline is smaller and more practical:
Recurring engineering work is about to become agent-managed.
The winning teams will not be the ones with the most agents. They will be the ones with the clearest loop contracts.
For Codex, that is the content lane to own: how to design, run, verify, and stop the loops that keep software moving.
Agent loops are recurring AI workflows that inspect state, decide whether action is needed, act within a defined scope, and report results. They are useful for PR babysitting, CI repair, deploy verification, feedback clustering, and other changing-state engineering work.
A cron job runs a fixed command on a schedule. An agent loop runs a recurring decision process: inspect the current state, choose an action, apply bounded changes, verify, and escalate if needed.
Codex has the right surfaces for loops: CLI for local work, GitHub Action for repo events, automations for recurring checks, goals for longer-running objectives, and browser verification for production checks. The missing part is a clear loop contract.
Start with a read-only PR review loop. Have Codex inspect pull requests with a label, summarize CI and review status, and post a concise comment. Add write access only after the signal is consistently useful.
Sources: Boris Cherny interview on YouTube, OpenAI Codex CLI docs, OpenAI Codex SDK docs, openai/codex-action README, OpenAI Codex changelog.