Developers Digest

AI Code Review Is the New Bottleneck

Developers Digest — Sat, 16 May 2026 00:00:00 GMT

The AI coding story has moved from "can it write code?" to "can we review the amount of code it writes?" That is the more useful question in 2026. [Claude Code](/blog/what-is-claude-code-complete-guide-2026), [Codex](/blog/openai-codex-guide), Cursor, Copilot, and terminal agents can all produce working diffs quickly. The weak point is no longer generation. The weak point is the review queue behind it. Two recent research signals make the pattern hard to ignore. The arXiv paper [Debt Behind the AI Boom](https://arxiv.org/abs/2603.28592) studied 302.6k verified AI-authored commits across 6,299 GitHub repositories and found 484,366 distinct introduced issues. Code smells made up 89.3 percent of the total, and 22.7 percent of tracked AI-introduced issues still survived at the latest repository revision. Then [Coding Agents Don't Know When to Act](https://arxiv.org/abs/2605.07769) tested whether agents abstain when a reported issue has already been fixed. Even recent models still proposed unnecessary code changes in 35 to 65 percent of no-change tasks. The paper calls this action bias. In normal team language: the agent wants to do something, even when the correct move is to leave the code alone. That connects directly to what developers keep debating on Hacker News, in issue trackers, and in AI tool changelogs: coding agents are impressive, but they create a new kind of review debt. The team gets more code, more diffs, more generated tests, more "looks right" explanations, and more pressure to merge. The take: the winning AI development workflow is not the one that generates the most code. It is the one that makes agent output easiest to reject, verify, and maintain. ## The Review Problem Got Bigger Traditional code review assumed human-paced output. A developer writes a branch. Another developer reviews the diff. CI runs. Maybe a staff engineer looks at the architecture. 
The whole workflow is built around the idea that code creation is slow enough for review to keep up.

Agents break that assumption. You can now ask one agent to write the feature, another to add tests, another to update docs, and another to handle review comments. That is useful. It is also how a small task turns into a 2,000-line pull request before lunch.

The problem is not that the code is always bad. Often it works. The problem is that working code is not the same thing as maintainable code.

AI agents are especially good at producing plausible glue:

- extra adapters that duplicate existing helpers
- tests that assert implementation details
- abstractions that only serve the generated patch
- verbose type guards around impossible states
- "fixed" code for bugs that are no longer reproducible
- documentation that describes the diff instead of the product behavior

Each item is small. Together they become a maintenance tax.

That is why the [agent reliability cliff](/blog/the-agent-reliability-cliff) matters. The first demo works. The tenth workflow depends on whether your system can catch subtle wrongness before it compounds.

## The Opposing View Is Fair

There is a reasonable counterargument: humans also introduce technical debt.

They do. A tired developer can over-abstract, copy-paste, skip tests, or patch symptoms. Code review has never been perfect. AI-generated code is not uniquely dangerous just because a model wrote it.

The difference is throughput. An agent can produce more mediocre code per hour than a person can. It can also produce that code with a confident summary, a passing narrow test, and no intuitive sense that the repo is getting harder to understand.

That changes the control system. If a human introduces one questionable helper, review can catch it. If an automation lane opens five AI pull requests a day, the reviewer needs better evidence than "the agent says it ran tests."
This is why [Microsoft Research's April 2026 paper](https://www.microsoft.com/en-us/research/publication/to-copilot-and-beyond-22-ai-systems-developers-want-built/) is worth reading. The surveyed developers did not simply ask for more code generation. They wanted quality signals earlier in the workflow, clearer authority boundaries, provenance, uncertainty signaling, and least-privilege access.

Microsoft calls the pattern bounded delegation: developers want AI to absorb surrounding assembly work without taking over the craft itself. That is the right frame. AI should not remove review. It should make review sharper.

## The New Review Stack

If your team is adopting coding agents seriously, treat review as infrastructure. Not vibes. Not "one more senior engineer will skim it." Infrastructure.

A practical stack has five gates.

### 1. Reproduction before patching

The agent should prove the bug exists before editing. This is the direct lesson from FixedBench. If the issue is already fixed, the correct output is no diff. That has to be a valid success state in your workflow.

Add a rule to your agent instructions, skills, or issue template:

```text
Before patching, reproduce the reported behavior or explain why it cannot be reproduced.
If the bug no longer reproduces, return a no-change report with the evidence.
Do not modify code just to satisfy the task shape.
```

That rule sounds boring. It prevents a lot of useless churn.

### 2. Diff budgets

Every agent task should have a rough diff budget.

- Small bug fix: 1 to 3 files.
- UI copy change: no new abstraction.
- Test-only improvement: no production code unless reproduction proves a bug.
- Migration: explicit file list and rollback note.

Diff budgets are not bureaucracy. They are a way to make agent output reviewable. If the agent exceeds the budget, it should stop and explain why before continuing.
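
A diff budget can be enforced mechanically before a human ever opens the pull request. The sketch below is one possible shape, not a feature of any particular tool: the task-type names, the `DiffBudget` class, the test-file naming convention, and the five-file limit for test-only work are all assumptions made for illustration. Only the "1 to 3 files" and "no production code" rules come from the budgets described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffBudget:
    """Hypothetical per-task budget; tune the limits to your own repo."""
    max_files: int
    allow_production_code: bool = True


# Assumed task types. The test_only file limit is illustrative, not from the article.
BUDGETS = {
    "small_bug_fix": DiffBudget(max_files=3),
    "test_only": DiffBudget(max_files=5, allow_production_code=False),
}


def is_test_file(path: str) -> bool:
    """Assumed convention: tests live under tests/ or are named test_*."""
    name = path.rsplit("/", 1)[-1]
    return path.startswith("tests/") or name.startswith("test_")


def check_budget(task_type: str, changed_files: list[str]) -> list[str]:
    """Return budget violations; an empty list means the diff is within budget."""
    budget = BUDGETS[task_type]
    violations = []
    if len(changed_files) > budget.max_files:
        violations.append(
            f"{len(changed_files)} files changed, budget is {budget.max_files}"
        )
    if not budget.allow_production_code:
        production = [p for p in changed_files if not is_test_file(p)]
        if production:
            violations.append(f"production code touched: {', '.join(production)}")
    return violations
```

A non-empty result is the signal for the agent to stop and explain, rather than a reason to fail the run silently.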
This pairs well with [Codex's review-oriented workflow](/blog/codex-vs-claude-code-april-2026) and [Claude Code skills](/blog/skills-are-the-new-agent-operating-system). The tool can generate. The skill defines where it should stop.

### 3. Evidence receipts

Every agent-authored change should end with a receipt:

- files changed
- tests run
- tests not run
- screenshots or browser checks for UI work
- source links for factual content
- risks left open
- reviewer focus area

This is not a status update. It is the review surface.

The faster agents get, the more important receipts become. A reviewer should not have to reverse-engineer what the agent believed, which commands it ran, or where it was uncertain.

### 4. Separate reviewer passes

Do not let the same agent that wrote the patch be the only reviewer.

A separate reviewer can be another model, another agent harness, or a deterministic check. For code, the best reviewer is still a mix of tests, static analysis, and a human. But even an agent reviewer is useful if it receives the diff cold and is instructed to look for deletion risk, missed tests, duplicated logic, and scope creep.

This is where tools like [GitHub Copilot coding agent](/blog/github-copilot-coding-agent-cli-2026), Codex cloud tasks, and Claude Code subagents start to matter. The future workflow is not "agent writes code." It is "agent writes, independent reviewer checks, CI gates, human approves."

### 5. Provenance without theater

Teams need to know when a change was AI-assisted, but they do not need performative co-author spam on every commit.

The useful provenance is operational:

- which tool produced the diff
- which prompt or issue created it
- which model or agent mode was used
- which tests and review gates passed
- whether a human materially rewrote the result

That is the point of the [AI co-author attribution debate](/blog/vscode-copilot-ai-coauthor-attribution). The weak argument is credit. The strong argument is reviewability.
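
The receipt and provenance fields above fit naturally into one structured record that can be attached to a pull request. This is a minimal sketch under stated assumptions: the `ChangeReceipt` class, its field names, and the `missing_evidence` check are hypothetical and not a format any of the tools mentioned here is known to emit.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ChangeReceipt:
    """Hypothetical receipt combining evidence and operational provenance."""
    # Provenance: where the diff came from.
    tool: str
    task_ref: str
    model_or_mode: str
    human_rewritten: bool
    # Evidence: what was changed and verified.
    files_changed: list[str]
    tests_run: list[str]
    tests_not_run: list[str] = field(default_factory=list)
    gates_passed: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)
    reviewer_focus: str = ""

    def to_json(self) -> str:
        """Serialize with stable key order so receipts diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def missing_evidence(self) -> list[str]:
        """List gaps a reviewer would otherwise have to reverse-engineer."""
        gaps = []
        # An empty files_changed is allowed: "no diff" is a valid outcome.
        if self.files_changed and not (self.tests_run or self.tests_not_run):
            gaps.append("no test information recorded")
        if not self.reviewer_focus:
            gaps.append("no reviewer focus area")
        return gaps
```

Keeping `tests_not_run` as an explicit field is deliberate: it lets the receipt state what was skipped instead of leaving the reviewer to infer it.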
## What This Means for Tool Choice The best AI coding tool is increasingly the one with the best review loop. For a solo developer, [Claude Code](/tools/claude-code) still wins when you want tight local iteration, strong planning, and project-specific skills. It is excellent when you stay close to the diff and steer the work. [Codex](/blog/openai-codex-guide) is compelling when the task is issue-shaped and you want an async branch or pull request to review later. Its product direction is clearly about delegated work returning reviewable artifacts. GitHub Copilot's advantage is distribution. If the whole team already lives in issues, pull requests, Actions, code owners, and branch protection, Copilot can fit into the system without inventing a new task surface. Cursor remains strong for visual diff control. It is still the easiest place to accept or reject generated edits line by line while your mental model is warm. The mistake is choosing by generation speed alone. Speed without review structure just moves the bottleneck. For budget planning, pair this with the [AI coding tools pricing guide](/blog/ai-coding-tools-pricing-2026). Agent cost is not only token cost. It is also review cost. ## The Practical Rule Give agents permission to do less. That sounds backwards. It is not. An agent that can say "no code change needed" is safer than one that always patches. An agent that stops after a diff budget is safer than one that refactors the neighborhood. An agent that returns a receipt is more useful than one that writes a confident paragraph. The next wave of AI development will reward teams that make inaction, verification, and rejection first-class outcomes. Do not ask "how do we make agents write more code?" Ask "how do we make generated code cheap to review and easy to refuse?" That is where the leverage is now. 
## Sources - [Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild](https://arxiv.org/abs/2603.28592) - [Coding Agents Don't Know When to Act](https://arxiv.org/abs/2605.07769) - [Microsoft Research: To Copilot and Beyond: 22 AI Systems Developers Want Built](https://www.microsoft.com/en-us/research/publication/to-copilot-and-beyond-22-ai-systems-developers-want-built/) - [Ars Technica: Developers say AI coding tools work and that is precisely what worries them](https://arstechnica.com/ai/2026/01/developers-say-ai-coding-tools-work-and-thats-precisely-what-worries-them/) ## Frequently Asked Questions ### Why is AI code review becoming a bottleneck? AI coding agents can produce diffs faster than teams can inspect them. The bottleneck shifts from writing code to verifying whether the generated code is correct, scoped, maintainable, tested, and aligned with the existing codebase. ### Do AI coding agents create more technical debt? They can. The issue is not that every AI-generated change is bad. The risk is volume plus confidence. A large empirical study of AI-authored commits found persistent code smells, correctness issues, and security issues in real repositories, which means teams need stronger review gates around generated code. ### What should an AI coding agent do before editing files? It should reproduce the reported issue, inspect the relevant code path, and confirm that a change is actually needed. If the bug no longer reproduces, the agent should return a no-change report with evidence instead of modifying code. ### How do you make AI-generated pull requests easier to review? Use small task scopes, diff budgets, required tests, independent reviewer passes, and evidence receipts. The reviewer should see what changed, why it changed, what was verified, what was not verified, and where to focus. ### Should AI-generated code be labeled? Yes, but the useful label is operational provenance, not credit theater. 
Track which tool produced the diff, which task or prompt started it, which checks passed, and whether a human materially rewrote it. That helps reviewers and future maintainers understand the change.

Claude Agent SDK Credits End the Subscription Arbitrage

Developers Digest — Fri, 15 May 2026 00:00:00 GMT

Anthropic just drew a line through the middle of Claude Code usage.

Starting June 15, 2026, the [Claude Agent SDK credit](https://support.claude.com/en/articles/15036540-use-the-claude-agent-sdk-with-your-claude-plan) separates programmatic agent usage from normal subscription usage. Agent SDK calls, `claude -p`, Claude Code GitHub Actions, and third-party Agent SDK apps draw from a new monthly credit. Interactive Claude Code in the terminal or IDE keeps using the regular subscription pool.

That sounds like billing housekeeping. It is bigger than that.

The era where every agent workflow could hide inside a flat subscription is ending. Coding teams now need to separate interactive work, scripted agent work, CI agents, and third-party orchestration as different budget lanes.

If you have been following the [Claude Code token-burn observability problem](/blog/claude-code-token-burn-cache-observability), [agent FinOps](/blog/400-dollar-overnight-bill-agent-finops), or the rise of [terminal agents as portable runtime surfaces](/blog/terminal-agents-portable-runtime-surface), this is the same story from the pricing side. The agent runtime is maturing. The meter is catching up.

## What changed

Anthropic's support article says eligible Pro, Max, Team, and Enterprise users can claim a separate monthly Agent SDK credit beginning June 15, 2026.

The credit covers:

- Claude Agent SDK usage in Python or TypeScript projects
- `claude -p` non-interactive mode
- the Claude Code GitHub Actions integration
- third-party apps that authenticate through the Agent SDK

It does not cover:

- interactive Claude Code in the terminal or IDE
- Claude conversations on web, desktop, or mobile
- Claude Cowork
- API-key usage from the Claude Developer Platform

The published individual-plan numbers are simple: Pro gets $20, Max 5x gets $100, and Max 20x gets $200. Team and Enterprise seats have their own eligibility rules.
Credits are per-user, refresh monthly, do not roll over, and do not pool across teammates.

The important operational detail is what happens after the credit runs out. If extra usage is enabled, Agent SDK usage moves to standard API rates. If extra usage is not enabled, Agent SDK requests stop until the credit refreshes.

## The old mental model is wrong now

The old developer mental model was:

> I pay for Claude. Therefore my local agent scripts, terminal usage, CI experiments, and third-party wrappers are all basically part of the same bucket.

That was always a little fuzzy. Now it is explicitly wrong.

There are at least four different usage lanes:

| Lane | Example | Budget posture |
|---|---|---|
| Interactive coding | Claude Code in a terminal or IDE | subscription usage limit |
| Headless local automation | `claude -p` scripts, cron jobs, local loops | Agent SDK credit, then API-style extra usage |
| CI and repository automation | Claude Code GitHub Actions, PR checks | Agent SDK credit or platform API budget |
| Third-party orchestrators | Agent SDK-based apps and harnesses | Agent SDK credit or API-key billing |

That distinction matters because these lanes fail differently.

Interactive coding usually fails with a human present. A headless script can loop while you are away. A CI agent can run for every pull request. A third-party harness can multiply sessions across worktrees. A shared team automation can burn through individual credits in ways nobody sees until the run stops.

That is why the official docs tell teams running shared production automation to use the Claude Developer Platform with an API key for predictable pay-as-you-go billing.

## The community reaction is rational

The Reddit reaction is noisy, but the underlying concern is rational.

Developers built real workflows around `claude -p`, Agent SDK integrations, Zed-style editor agents, OpenClaw-style harnesses, board-based orchestrators, and GitHub Actions.
Many of those workflows were economically attractive because they appeared to sit near a subscription-shaped ceiling.

Anthropic is now saying: interactive native use remains in the subscription lane; programmatic use gets its own credit and then behaves more like API usage.

The fair complaint is predictability. A workflow that was "I have Max, let it run" becomes "I have Max, plus an SDK credit, plus possible extra usage, plus per-user non-pooled limits, plus a cutover date."

The fair counterargument is also real. Autonomous workloads are not the same product as a human driving Claude Code. They can run unattended, batch tasks, power third-party apps, and create support costs that look much more like API infrastructure than chat usage.

The practical take is not "Anthropic is wrong" or "users are entitled." The practical take is that agent pricing is becoming a product architecture constraint.

## What to change before June 15

Do not wait until the cutover to discover which workflows are programmatic.

Start with a usage inventory:

1. Search your repos for `claude -p`, `@anthropic-ai/claude-agent-sdk`, `ClaudeSDK`, and Claude Code GitHub Actions.
2. List every third-party tool that asks you to authenticate with Claude rather than an API key.
3. Separate personal scripts from shared automation.
4. Mark which jobs can stop safely when the credit runs out.
5. Mark which jobs need API-key billing, a hard spend cap, or a different provider route.

Then add receipts. Every programmatic agent run should record:

- agent surface: `claude -p`, Agent SDK, GitHub Action, or third-party app
- account or seat owner
- model
- estimated cost
- input and output tokens when available
- task type
- repository
- success or failure
- stop reason
- whether extra usage was enabled

This is the same argument behind [agent swarms needing receipts](/blog/agent-swarms-need-receipts) and [parallel coding agents needing merge discipline](/blog/parallel-coding-agents-merge-discipline).
Once agents run outside a human typing loop, a final answer is not enough. You need a billable event trail. ## The engineering pattern: separate lanes The cleanest response is to split your agent workflow into lanes. **Interactive lane.** Human-driven Claude Code sessions for exploration, refactors, and debugging. Keep this on the normal subscription path. **Personal automation lane.** Small `claude -p` scripts, local loops, and one-off helpers. Let these use the Agent SDK credit, but add local stop limits and a visible monthly ledger. **Production automation lane.** CI reviewers, nightly issue triage, deploy repair loops, and shared repo agents. Move these to API-key billing with explicit spend caps, account ownership, and logs. **Provider-routing lane.** Workflows that can run on Codex, Claude, local models, or cheaper models depending on task risk. This is where [Codex loops](/blog/codex-loops-boris-cherny-agent-routines), [OpenAI Codex managed workflows](/blog/openai-codex-cloud-security-playbook-2026), and multi-provider agent stacks become practical rather than ideological. That split avoids the worst version of the June 15 surprise: a critical automation depending on an individual user's non-pooled monthly credit. ## The opportunity There is a product opportunity hiding in the backlash. Developers do not only need cheaper usage. They need an agent budget router: - classify each run as interactive, personal automation, CI, or production - choose subscription, Agent SDK credit, API key, or alternate provider - apply a task-level budget before the first token - stop when the marginal value is gone - write a receipt that finance and engineering can both understand That is where agent tooling should go next. Not just prettier chat panes. Not just more wrappers. Budget-aware execution. The companies that win this layer will make the meter feel boring. You will know which account paid, which lane ran, why it stopped, and whether the result justified the spend. 
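
The budget router described above can be sketched in a few lines. Everything named here (`Run`, `Route`, `Receipt`, `route_run`) is hypothetical, and the decision rule is a deliberate simplification: it treats a run as all-or-nothing against the remaining credit, whereas real billing may spend the credit first and spill over. The only behavior taken from the source is the ordering: interactive work stays on the subscription, shared automation goes to API-key billing, and headless personal work uses the credit, then extra usage if enabled, otherwise stops.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SUBSCRIPTION = "subscription"
    SDK_CREDIT = "agent_sdk_credit"
    EXTRA_USAGE = "extra_usage_at_api_rates"
    API_KEY = "api_key"
    STOP = "stop_until_refresh"


@dataclass
class Run:
    headless: bool              # True for scripted or CI runs, False for a human-driven session
    shared: bool                # True when the automation serves a team, not one person
    estimated_cost_usd: float   # Caller-supplied estimate; this sketch does not compute it


@dataclass
class Receipt:
    route: Route
    estimated_cost_usd: float
    reason: str


def route_run(run: Run, credit_remaining_usd: float, extra_usage_enabled: bool) -> Receipt:
    """Pick a billing route before the first token and explain the choice."""
    if not run.headless:
        return Receipt(Route.SUBSCRIPTION, run.estimated_cost_usd, "interactive session")
    if run.shared:
        return Receipt(Route.API_KEY, run.estimated_cost_usd, "shared automation")
    if run.estimated_cost_usd <= credit_remaining_usd:
        return Receipt(Route.SDK_CREDIT, run.estimated_cost_usd, "fits remaining credit")
    if extra_usage_enabled:
        return Receipt(Route.EXTRA_USAGE, run.estimated_cost_usd, "credit exhausted")
    return Receipt(Route.STOP, run.estimated_cost_usd, "credit exhausted, extra usage off")
```

Returning a receipt instead of a bare route keeps the reason for each decision next to the spend, which is the part finance and engineering both need to read.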
## The take Claude Agent SDK credits are the end of subscription arbitrage for unattended coding agents. That is annoying for some workflows. It is also clarifying. Interactive Claude Code can stay a subscription product. Autonomous agent infrastructure needs budgets, ownership, metering, stop conditions, and receipts. The sooner teams model those lanes explicitly, the less painful June 15 will be. ## Sources - Anthropic Help Center: [Use the Claude Agent SDK with your Claude plan](https://support.claude.com/en/articles/15036540-use-the-claude-agent-sdk-with-your-claude-plan) - Claude Code Docs: [Legal and compliance](https://code.claude.com/docs/en/legal-and-compliance) - Anthropic: [Claude pricing](https://claude.com/pricing) - InfoWorld: [Anthropic puts Claude agents on a meter across its subscriptions](https://www.infoworld.com/article/4171274/anthropic-puts-claude-agents-on-a-meter-across-its-subscriptions.html) - Reddit: [ClaudeCode discussion of the June 15 programmatic usage change](https://www.reddit.com/r/ClaudeCode/comments/1tccd7c/its_official_anthropic_pulled_the_plug_on_all/) ## Frequently Asked Questions ### Does the June 15 change affect normal Claude Code usage? Not for interactive Claude Code in the terminal or IDE. Anthropic says interactive Claude Code continues to use normal subscription usage limits. The separate credit applies to Agent SDK usage, `claude -p`, Claude Code GitHub Actions, and third-party Agent SDK apps. ### How much Agent SDK credit do Claude Pro and Max users get? Anthropic lists $20 per month for Pro, $100 per month for Max 5x, and $200 per month for Max 20x. Team and Enterprise eligibility depends on seat type. ### What happens when the Agent SDK credit runs out? If extra usage is enabled, additional Agent SDK usage moves to standard API rates. If extra usage is not enabled, Agent SDK requests stop until the monthly credit refreshes. ### Should teams use personal Claude subscriptions for CI agents? Usually no. 
Anthropic's own guidance says teams running shared production automation should use the Claude Developer Platform with an API key for predictable pay-as-you-go billing. ### Is `claude -p` still useful? Yes. It is still useful for personal scripts, quick audits, and local automation. The difference is that it now belongs in a metered programmatic lane, not the same mental bucket as interactive terminal work.
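
To find out which of your scripts already sit in that programmatic lane, the usage inventory from "What to change before June 15" can start as a plain text search. The sketch below is an assumption-heavy starting point: the function name, the skipped directories, and the choice to scan every readable text file are illustrative, and it does not detect Claude Code GitHub Actions, which would need a separate look at workflow files.

```python
from pathlib import Path

# Search strings taken from the inventory step described in the article.
MARKERS = ("claude -p", "@anthropic-ai/claude-agent-sdk", "ClaudeSDK")

# Assumed set of directories that are not worth scanning.
SKIP_DIRS = {".git", "node_modules", ".venv"}


def find_programmatic_usage(root: str) -> dict[str, list[str]]:
    """Map each file under root (as a relative POSIX path) to the markers it contains."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if SKIP_DIRS.intersection(rel.parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # Skip binary or unreadable files.
        found = [marker for marker in MARKERS if marker in text]
        if found:
            hits[rel.as_posix()] = found
    return hits
```

The result is only a candidate list. Each hit still needs a human decision about whether it is a personal script or shared automation.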

Claude Code Plugin URLs Turn Skills Into a Supply Chain

Developers Digest — Thu, 14 May 2026 00:00:00 GMT

Claude Code's recent releases look like maintenance notes at first glance. Look closer. The [v2.1.129 release](https://github.com/anthropics/claude-code/releases) added `--plugin-url ` so a plugin zip archive can be fetched from a URL for the current session. The same release added `skillOverrides`, made gateway model discovery opt-in, fixed cache TTL behavior, and improved PR metrics. The [v2.1.136 release](https://github.com/anthropics/claude-code/releases) added `settings.autoMode.hard_deny` for classifier rules that block unconditionally, and fixed several plugin, MCP, worktree, and plan-mode issues. That is not a flashy model launch. It is a sign that Claude Code is turning into an agent extension platform. ## The take Plugin URLs make agent workflows more portable. They also make them easier to contaminate. Once a coding agent can fetch plugins, load skills, run hooks, connect MCP servers, and remember permission choices, the extension layer becomes part of the software supply chain. It deserves the same review posture as package installs, CI actions, shell scripts, and browser extensions. This is the security side of the argument in [Claude Code 2.1.128 is an ops release](/blog/claude-code-2-1-128-mcp-ops). The product is no longer only a terminal assistant. It is an operating surface with plugins, policies, telemetry, worktrees, and tools. That is powerful. It is also where teams need rules. ## Why plugin URLs matter A URL-based plugin install is convenient for experiments, internal rollout, and temporary sessions. It also changes the threat model. Before plugins, the risky surface was mostly the model's proposed actions: edit this file, run this command, call this tool. With plugins, the risky surface expands to the instructions and tools the model inherits before it proposes anything. That means a bad plugin can shape the agent's judgment upstream: - It can add misleading skills. - It can add hooks that run at surprising times. 
- It can connect tools that expose too much.
- It can change the agent's default workflow.
- It can make a risky path look normal.

This is why [agent skills need exit criteria](/blog/agent-skills-production-checklist), but also why they need source control. A skill is not just markdown once it changes behavior.

## Hard deny is the right kind of boring

The `settings.autoMode.hard_deny` addition is the important counterweight.

Auto modes need an absolute refusal layer. Allow lists and user-intent classifiers are useful, but production teams also need rules that block a class of action regardless of how convincingly the task is phrased.

Examples:

- Never publish secrets.
- Never run destructive git cleanup outside an approved flow.
- Never send email without approval.
- Never install an unreviewed plugin from an arbitrary URL.
- Never touch production data from a local agent session.

That is not pessimism. It is operational design.

The same pattern appears in [OpenAI Codex cloud security](/blog/openai-codex-cloud-security-playbook-2026), [agent swarms needing receipts](/blog/agent-swarms-need-receipts), and [parallel coding agents needing merge discipline](/blog/parallel-coding-agents-merge-discipline). As agent autonomy rises, policy has to move from "remember to be careful" into executable controls.

## The opposing view

The fair counterargument is that this is overkill for a solo developer.

If you are experimenting locally, plugin URLs are mostly a convenience. You can install a community skill pack, try it for one task, and delete it later. Heavy governance can slow down discovery.

That is true. But the posture changes when the agent can touch customer code, run long sessions, create PRs, use MCP tools, or operate inside a company repo. At that point, the plugin is not a toy. It is part of the execution environment.

The lightweight version of governance is enough for most teams:

1. Pin plugin sources.
2. Keep approved plugin URLs in repo docs.
3. Review plugin manifests and hooks before use.
4. Disable or hide skills that do not apply with `skillOverrides`.
5. Put unconditional blockers in hard-deny policy.
6. Log which plugins were active in the final handoff.

That is not bureaucracy. It is reproducibility.

## What I would standardize

For any agent plugin system, I want four surfaces visible in the final receipt:

**Extension inventory.** Which plugins, skills, hooks, and MCP servers were active?

**Source provenance.** Were they local, marketplace-installed, or fetched from a URL?

**Permission policy.** Which actions were allowed, denied, or hard-denied?

**Runtime evidence.** Which commands, tests, PRs, or deploy checks prove the plugin-assisted run behaved correctly?

That receipt lets a human reviewer answer the only question that matters: did the agent produce this change under an environment we would trust again?

## The practical bottom line

Claude Code plugin URLs are useful. Hard-deny rules are necessary.

The two belong together. One makes agent extensions easier to distribute. The other gives teams a way to say "never, even if the task sounds reasonable."

That is the next maturity layer for coding agents: not better vibes, but governed extension surfaces with auditable receipts.

Sources: [Claude Code releases](https://github.com/anthropics/claude-code/releases), [Claude Code plugins docs](https://docs.anthropic.com/en/docs/claude-code/plugins), [Claude Code settings docs](https://docs.anthropic.com/en/docs/claude-code/settings), [Anthropic MCP docs](https://docs.anthropic.com/en/docs/claude-code/mcp).

## Frequently Asked Questions

### What is Claude Code `--plugin-url`?

It is a Claude Code option that fetches a plugin zip archive from a URL for the current session. It makes plugins easier to try and distribute, but it also means teams should review and pin plugin sources.

### What is `settings.autoMode.hard_deny`?
It is a Claude Code setting for auto mode classifier rules that block actions unconditionally. These rules are useful for non-negotiable policy boundaries such as secret exposure, destructive commands, unapproved sends, or unreviewed plugin installs. ### Are Claude Code plugins dangerous? Plugins are not inherently dangerous, but they are powerful. They can add skills, hooks, MCP servers, and behavior that affects agent execution. Treat them like other developer supply-chain inputs. ### How should teams manage agent plugins? Start with a small approved list, pin sources, review manifests and hooks, use `skillOverrides` to hide irrelevant skills, configure hard-deny rules for sensitive actions, and include active plugins in the final agent receipt.
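
Pinning a plugin source can be as simple as recording the hash of the archive you reviewed and refusing anything else. The sketch below is a team-side check under stated assumptions, not a Claude Code feature: the allowlist format (approved URL mapped to an expected SHA-256 digest) and the function names are hypothetical, and fetching the archive is left to whatever download step you already trust.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the archive bytes."""
    return hashlib.sha256(data).hexdigest()


def is_approved(url: str, archive: bytes, pins: dict[str, str]) -> bool:
    """Accept an archive only if its URL is pinned and its digest matches the pin.

    `pins` is an assumed allowlist: approved plugin URL -> expected SHA-256 hex digest.
    """
    expected = pins.get(url)
    return expected is not None and expected == sha256_hex(archive)
```

Pinning the digest as well as the URL matters because the content behind a URL can change after review. A mismatch means the plugin should go back through review, not that the pin should be quietly updated.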

Codex CLI Vim Mode Is an Ergonomics Signal

Developers Digest — Thu, 14 May 2026 00:00:00 GMT

The most interesting line in [Codex CLI 0.129.0](https://github.com/openai/codex/releases/tag/rust-v0.129.0) is not the biggest one. It is this: the TUI composer now supports modal Vim editing, including `/vim`, default-mode configuration, and Vim-specific keymap contexts. That sounds like a small quality-of-life feature. It is more than that. It is a sign that terminal agents are being designed for people who live inside terminals all day, not just people trying a chat demo. ## The take Agent UX is moving from chat convenience to workbench ergonomics. The old AI coding interface was a prompt box. The newer interface is a terminal runtime with diffs, resumable threads, worktrees, hooks, plugins, permissions, browser tools, and receipts. Once a tool reaches that stage, keyboard behavior is not polish. It is workflow infrastructure. That is why modal editing matters. If a developer edits prompts, plans, file paths, command notes, and review instructions inside an agent composer dozens of times a day, the composer becomes part of the coding surface. It should respect the developer's muscle memory. This fits the broader pattern in [terminal agents becoming portable runtime surfaces](/blog/terminal-agents-portable-runtime-surface), [Codex loops](/blog/codex-loops-boris-cherny-agent-routines), and [Codex `/goal` workflows](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences). The agent is not just answering. It is sitting inside the developer's control loop. ## Why this release is bigger than Vim Codex CLI 0.129.0 added more than modal editing. The release also improved resume and fork flows, raw scrollback mode, `/ide` context injection, workspace-aware `/diff`, status-line summaries, `/keymap debug`, plugin sharing controls, hook browsing, and experimental goal visibility. That cluster tells a clear story. Codex is treating the terminal as the product surface, not just the place where logs appear. 
The difference is practical: - Resume and fork pickers make agent work interruptible. - Workspace-aware diffs make review local and concrete. - `/ide` context injection connects editor state to terminal work. - `/keymap debug` acknowledges that terminal input is messy. - Hook browsing turns lifecycle automation into something a user can inspect. - Plugin sharing controls treat extensions as collaborative infrastructure. Those are not model capabilities. They are operational capabilities. ## The opposing view The fair opposing view is that Vim mode does not make the agent smarter. Correct. A modal composer will not fix a bad plan, hallucinated API, unsafe shell command, or weak test. Teams still need [agent receipts](/blog/agent-swarms-need-receipts), [security boundaries](/blog/openai-codex-cloud-security-playbook-2026), and [merge discipline](/blog/parallel-coding-agents-merge-discipline). But daily tools win through repeated friction reduction. A feature that saves two seconds once is not interesting. A feature that saves cognitive switching every turn becomes meaningful. That is the same reason developers care about tmux, shell history, editor keybindings, fuzzy finders, and clipboard behavior. None of those writes better code by itself. Together, they make the workbench feel native. Agents need that same maturity. ## What agent tools should copy Every terminal agent should treat input ergonomics as a first-class surface. That means: 1. Respect existing editor muscle memory. 2. Make keymaps inspectable. 3. Keep prompt editing recoverable after interrupts. 4. Let users fork and resume work without losing context. 5. Show diffs close to the conversation. 6. Let hooks and plugins be browsed before they run. 7. Expose enough status to know which branch, PR, model, and mode are active. This is especially important for long-running work. If an agent session lasts hours, the interface cannot feel like a disposable chat window. 
It has to feel like a dependable terminal workspace.

## The practical bottom line

Codex CLI Vim mode is a small feature with a large signal. AI coding tools are entering the ergonomics phase. The winners will not only have strong models. They will make agent work feel native to the developer's existing environment: terminal, editor, keyboard, git, browser, and review loop.

That is how coding agents become daily tools instead of impressive demos.

Sources: [Codex CLI 0.129.0 release notes](https://github.com/openai/codex/releases/tag/rust-v0.129.0), [Codex CLI 0.130.0 release notes](https://github.com/openai/codex/releases/tag/rust-v0.130.0), [OpenAI Codex repository](https://github.com/openai/codex), [OpenAI Codex docs](https://developers.openai.com/codex/).

## Frequently Asked Questions

### What changed in Codex CLI 0.129.0?

Codex CLI 0.129.0 added modal Vim editing in the TUI composer, improved resume and fork flows, added raw scrollback mode, improved `/diff`, added `/ide` context injection, expanded plugin management, and improved hooks and goal surfaces.

### Why does Vim mode matter for coding agents?

It makes the agent composer feel native for developers who already use modal editing. For high-frequency agent work, prompt and plan editing are part of the coding workflow, so input ergonomics matter.

### Does modal editing improve model quality?

No. Modal editing does not make the model smarter. It reduces interface friction so developers can supervise, correct, resume, and review agent work more effectively.

### What should teams look for in a terminal agent?

Look for resumable sessions, visible diffs, inspectable keymaps, clear permission modes, plugin and hook visibility, branch and PR status, and receipts that explain what the agent changed and verified.

Skills for Real Engineers Need Governance, Not Fandom

Developers Digest — Thu, 14 May 2026 00:00:00 GMT

[Matt Pocock's `skills` repo](https://github.com/mattpocock/skills) is the latest proof that the agent-skills format has escaped the docs corner. The repo is popular because it does not pitch "vibe coding." It frames skills as engineering process: grilling a vague request before implementation, building shared language, using red-green-refactor loops, diagnosing failures, designing interfaces, writing PRDs, and converting product intent into issues.

That is useful. It also creates a new problem. Once teams install skills from creators, vendors, coworkers, and internal repos, the question stops being "do skills work?" and becomes "who governs the instructions your agents are allowed to inherit?"

## The take

Skills are becoming production controls. That means they need the same boring discipline as any other production control: ownership, versioning, review, tests, deprecation, and rollback.

The existing Developers Digest posts on [agent skills needing exit criteria](/blog/agent-skills-production-checklist), [Google's skills repo](/blog/google-skills-agent-playbook), and [Karpathy-style CLAUDE.md rule sets](/blog/karpathy-claude-md-skills-menu) all point in the same direction. Reusable agent instructions are not prompt lore anymore. They are part of the software supply chain.

The fresh signal from `mattpocock/skills` is cultural. Developers are not just asking agents to write code faster. They are trying to transfer experienced engineering taste into repeatable procedures. That is the right move, but only if the procedures stay inspectable.

## Why this repo hit a nerve

The repo names real failure modes:

- The agent did not understand the work.
- The agent was too verbose.
- The code did not work.
- The architecture drifted into a ball of mud.
- The team lacked a shared language.

Those are not model-selection problems. They are workflow problems.

That is why a skill such as "grill me" matters. The skill is not magic wording.
It forces the agent to stop and extract ambiguity before implementation. That pairs directly with the operating lesson in [long-running agents need harnesses](/blog/long-running-agents-need-harnesses): the model is only one part of the system. The task contract, feedback loop, and stop condition are where the real leverage lives.

The Hacker News counterargument is also worth taking seriously. Some commenters see elaborate skills as overbuilt prompt theater. The fair version of that critique is simple: if a skill is just fancy language without measurable behavior, it should not survive. That is the governance bar.

## What governance looks like

A production skill should answer five questions:

1. Who owns it?
2. Which failure mode does it reduce?
3. Which observable behavior should change when it is active?
4. Which repo, tool, or workflow is it allowed to affect?
5. When should it be deleted or rewritten?

Without those answers, a skill library turns into the agent equivalent of stale wiki pages.

This matters even more when skills spread across tools. The same instruction may be consumed by Claude Code, Codex, Cursor, or a custom agent runner. If the skill says "commit after every meaningful change," that is harmless in one workflow and dangerous in another. If it says "always use TDD," that might improve a backend module and slow down a throwaway spike.

Good skills encode judgment. Bad skills encode superstition.

## The opposing view

The strongest opposing view is that skills are just prompts with file names. There is truth in that. A markdown file does not guarantee better engineering. A popular repo does not prove a method works. And an LLM confidently praising a prompt pattern is not evidence.

The right response is not to reject skills. It is to demand receipts. For every important skill, track whether it changes the work:

- Did it reduce review comments?
- Did it increase passing local checks?
- Did it catch unclear requirements earlier?
- Did it shrink final diffs?
- Did it reduce abandoned agent sessions?
- Did it improve handoff quality?

That is the same move described in [agent replays with TraceTrail](/blog/agent-replays-with-tracetrail) and [Claude Code token-burn observability](/blog/claude-code-token-burn-cache-observability). Once an instruction affects agent behavior, it should be observable.

## What teams should copy

Do not copy the whole repo into every project. Copy the operating shape:

- A skill starts with a narrow trigger.
- It names the failure mode.
- It gives a procedure, not a vibe.
- It includes stop conditions.
- It asks for evidence at the end.
- It stays short enough for an agent to actually use.

For a product team, the first three skills I would write are not framework-specific.

**Ambiguity gate.** Before implementation, force the agent to identify missing requirements, user-visible risk, and files it expects to touch.

**Verification ladder.** Require the agent to choose cheap checks first, then escalate to build, browser QA, or production smoke tests when the change affects users.

**Review receipt.** Require a final report with files changed, commands run, commands skipped, screenshots or URLs where relevant, and residual risk.

Those three are less glamorous than a huge catalog. They also compound faster.

## The practical bottom line

The skills trend is real, but the winning teams will not be the ones with the biggest `~/.claude/skills` folder. They will be the ones that treat skills as governed operating controls: small, reviewed, measured, and deleted when they stop helping.

Matt Pocock's repo is a useful menu. The production lesson is to build your own kitchen.

Sources: [mattpocock/skills](https://github.com/mattpocock/skills), [Hacker News discussion of the grill-me skill](https://news.ycombinator.com/item?id=47550391), [Claude Code skills docs](https://docs.anthropic.com/en/docs/claude-code/skills), [Google skills repo](https://github.com/google/skills).
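
As a companion to the five governance questions earlier in this post, here is a minimal sketch of how a team might check a skill against them. Everything in it is an assumption for illustration: `SkillManifest`, `governance_gaps`, and the field names are hypothetical and are not part of the Claude Code skills format or of `mattpocock/skills`.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class SkillManifest:
    """Hypothetical governance record for one skill (not a real skills-file schema)."""
    name: str
    owner: str = ""                  # 1. Who owns it?
    failure_mode: str = ""           # 2. Which failure mode does it reduce?
    expected_behavior: str = ""      # 3. Which observable behavior should change?
    allowed_scopes: List[str] = field(default_factory=list)  # 4. Where may it apply?
    review_by: Optional[date] = None  # 5. When should it be deleted or rewritten?


def governance_gaps(skill: SkillManifest, today: date) -> List[str]:
    """Return the governance questions this skill cannot answer yet."""
    gaps = []
    if not skill.owner:
        gaps.append("no owner")
    if not skill.failure_mode:
        gaps.append("no named failure mode")
    if not skill.expected_behavior:
        gaps.append("no observable behavior")
    if not skill.allowed_scopes:
        gaps.append("no scope")
    if skill.review_by is None:
        gaps.append("no review date")
    elif skill.review_by < today:
        gaps.append("review overdue")
    return gaps
```

A check like this could run in CI over a skills folder and fail the build when any skill reports gaps. Whether that is worth the overhead depends on how many skills a team actually maintains.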
## Frequently Asked Questions

### What are AI coding skills?

AI coding skills are reusable instruction files that teach an agent how to handle a recurring kind of work. In tools like Claude Code, they can describe when to ask clarifying questions, how to run tests, what evidence to return, and which project constraints matter.

### Why does a skills repo need governance?

Because skills can change agent behavior across many sessions. If they are stale, too broad, or copied without review, they can make agents confidently apply the wrong process. Governance keeps skills owned, versioned, measured, and removable.

### Should teams install community skill packs?

Community skill packs are useful as examples and starting points. Production teams should copy the shape, then adapt each skill to their own repo, commands, review standards, and risk profile.

### How do you know if a skill works?

Measure behavior. Useful signals include fewer review comments, better test coverage, clearer final reports, fewer abandoned sessions, smaller diffs, and more reliable local verification.

Agent Memory Benchmarks Are Not Enough

Developers Digest — Wed, 13 May 2026 00:00:00 GMT

Agent memory is having its GitHub trending moment. Today, `rohitg00/agentmemory` is near the top of [GitHub Trending](https://github.com/trending), pitching persistent memory for Claude Code, Codex CLI, Cursor, Gemini CLI, and other MCP-capable coding agents. The promise is obvious: stop re-explaining the same architecture, bugs, preferences, and workflow rules every session.

That is a real pain. Anyone using [Claude Code](/blog/what-is-claude-code-complete-guide-2026), [Codex](/blog/openai-codex-guide), or terminal agents long enough has hit it. The agent forgets the migration plan. It rediscovers a test command. It misses a convention you corrected yesterday.

But the interesting question is not whether agents need memory. They do. The question is what kind of memory you can trust.

For coding agents, retrieval accuracy is only the first benchmark. The production bar is higher: can the agent remember the right thing, forget the stale thing, show where the memory came from, and roll back a bad learning without poisoning future sessions? That is the difference between useful memory and a second hallucination surface.

## Why This Is Trending Now

The trend makes sense because the agent stack has matured around it. We already have better runtime surfaces for agents, from terminal tools to managed job systems. We already have [context reduction patterns](/blog/agent-context-reduction-pattern) that keep raw logs and tool output outside the model window. We already have [skills](/blog/why-skills-beat-prompts-for-coding-agents-2026), hooks, plugins, worktrees, traces, and MCP servers.

Memory is the next control plane.

The `agentmemory` repo is not just a vector store wrapper. Its README claims cross-agent support, hooks, MCP tools, a local server, replayable sessions, SQLite-backed storage, benchmark reports, and a viewer. It also compares itself against Mem0, Letta, Khoj, claude-mem, and other memory systems.

That broader shape is the signal.
Developer memory is moving from "paste this into `CLAUDE.md`" to a runtime layer with capture, retrieval, replay, deletion, and governance. That is exactly where teams should slow down.

## The Benchmark Trap

Most memory demos optimize for the happy path:

1. Save a fact.
2. Start a new session.
3. Ask a related question.
4. Watch the agent recall the fact.

That proves something. It does not prove enough.

The `agentmemory` README highlights LongMemEval-S retrieval numbers and token savings. Letta's docs frame memory as context-window management across core memory, recall memory, and archival memory. LangChain's memory docs split the problem into semantic, episodic, and procedural memory. Those are useful frames.

But real coding agents fail in messier ways:

- they retrieve a true memory that no longer applies
- they mix two project conventions from different repos
- they overfit to a one-off correction
- they bury the source of a learned rule
- they keep private or sensitive facts longer than they should
- they recall "we tried X and it failed" without the conditions that made it fail
- they inject too much memory and increase token burn

Retrieval benchmarks reward finding stored facts. Coding work also needs contradiction handling, provenance, permissioning, and deletion.

The most important memory test is not "can the agent find a fact?" It is "can the agent decide whether this fact still deserves authority?"

## Four Memory Types Teams Actually Need

For developer workflows, I would separate memory into four buckets.

**Project memory** is stable repo context: build commands, route structure, architecture decisions, service boundaries, design rules, and deployment quirks. This belongs in explicit files like `AGENTS.md`, `CLAUDE.md`, `DESIGN.md`, or repo docs. It should be readable, reviewed, and versioned.

**Episodic memory** is what happened in a session: which bug was investigated, what failed, what test confirmed the fix, what deploy was verified.
This is where replayable sessions and receipts matter. It complements [long-running agent harnesses](/blog/long-running-agents-need-harnesses) because the agent can resume from evidence, not vibes.

**Procedural memory** is how the agent should do work: review checklists, handoff formats, QA routines, branch discipline, and source-quality rules. This is where [self-improving skills](/blog/self-improving-skills-claude-code) are powerful because they turn corrections into auditable workflow artifacts.

**User memory** is preference and personal context: tone, priorities, preferred tools, boundaries, and recurring workflows. This is valuable, but it needs the strictest deletion and visibility controls because it can easily cross from helpful into creepy or wrong.

Lumping all four into "memory" makes the system harder to reason about. A source link should have different authority from a preference. A one-session debugging note should not outrank a repo instruction. A stale deploy workaround should not survive a platform migration.

## The Minimum Viable Memory Contract

If you are adding memory to a coding agent, ask for a contract before you ask for a benchmark. At minimum, the memory layer should expose:

- source provenance for every injected memory
- memory type: project, episodic, procedural, or user
- created and last-verified timestamps
- confidence or authority level
- scope: repo, organization, user, or global
- expiration or stale-after rules
- deletion paths that actually remove the memory from retrieval
- review and rollback for automatically learned rules
- receipts showing which memories affected a run

This sounds like paperwork until it saves you from a bad day.

Imagine an agent recalls "deploys use Vercel" after the project moved to Coolify. If the memory has a timestamp, source file, scope, and stale-after rule, the agent can downgrade it. If it is just an embedding in a memory store, the agent may confidently run the wrong playbook.
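
The contract fields listed above can be sketched as a small record plus a staleness check. This is a minimal sketch under stated assumptions: `MemoryRecord`, `effective_authority`, the field names, the halving policy, and the `docs/deploy.md` path are all hypothetical and do not describe `agentmemory` or any other real memory system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryRecord:
    """Hypothetical memory entry carrying the contract fields from the list above."""
    text: str
    source: str                  # provenance: file, session, or URL it came from
    kind: str                    # "project", "episodic", "procedural", or "user"
    scope: str                   # "repo", "organization", "user", or "global"
    created_at: datetime
    last_verified_at: datetime
    stale_after: Optional[timedelta] = None  # None means no expiry rule
    authority: float = 1.0       # 0.0 (ignore) to 1.0 (treat as a rule)


def effective_authority(record: MemoryRecord, now: datetime) -> float:
    """Downgrade a memory whose last verification is older than its stale-after rule."""
    if record.stale_after is None:
        return record.authority
    age = now - record.last_verified_at
    if age <= record.stale_after:
        return record.authority
    # Assumed policy: halve authority once stale. A real system might instead
    # drop the memory or force re-verification before use.
    return record.authority * 0.5


# The Vercel example from the text: verified long ago, with a 90-day rule.
deploy_note = MemoryRecord(
    text="deploys use Vercel",
    source="docs/deploy.md",     # hypothetical path
    kind="project",
    scope="repo",
    created_at=datetime(2025, 11, 1),
    last_verified_at=datetime(2025, 11, 1),
    stale_after=timedelta(days=90),
    authority=0.9,
)
print(effective_authority(deploy_note, datetime(2026, 5, 13)))
```

The point of the sketch is the shape, not the numbers: once a memory carries a timestamp and a stale-after rule, downgrading it becomes an ordinary comparison instead of a guess.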
That is why transparent memory beats clever memory for engineering teams.

## The Opposing View Is Right About One Thing

The skeptical take is that agents already have too much context and too many hidden influences. Adding another retrieval layer can make them less predictable.

That critique is valid. Bad memory systems create failure modes that are harder to debug than a cold-start agent. The model appears to "know" something, but the user cannot see which memory caused the behavior. A stale preference gets retrieved because it is semantically close. A low-confidence observation becomes a rule. A memory extracted from a failed session becomes future guidance.

This is why I prefer memory that behaves more like Git than magic. For durable workflow knowledge, put the final form in markdown files, skills, repo instructions, or structured manifests. For episodic memory, keep session logs, summaries, and receipts. For semantic search, make retrieval visible and scoped. For automatic learning, require review above a confidence threshold.

Memory should make an agent easier to inspect, not harder.

## Where `agentmemory` Looks Interesting

The interesting part of `agentmemory` is not only that it stores memories. It is that it treats memory as a shared local service for multiple agents.

That matches where developer workflows are going. A real team may use Claude Code for one task, Codex for another, Cursor for IDE edits, Gemini CLI for cheap research, and custom MCP tools for internal systems. If each agent maintains a separate memory silo, you get duplicated context, conflicting facts, and no central deletion story.

A shared memory layer could become the place where agents coordinate:

- previous session summaries
- accepted workflow rules
- failed approaches
- recurring file paths
- deploy receipts
- known flaky tests
- user-approved preferences

But it only works if the memory layer is governed. Cross-agent memory multiplies value and blast radius at the same time.
That is the tradeoff to evaluate, not just the star count.

## How I Would Evaluate It

Before installing any persistent memory layer across a team, I would run a small harness. Create five realistic repo tasks:

1. A bug fix where the agent must remember a prior failed approach.
2. A feature where a repo convention matters.
3. A migration where an old convention becomes false.
4. A security-sensitive task where private details must not be recalled broadly.
5. A cleanup task where a memory should be deleted and stay deleted.

Run each task cold, then run it with memory. Measure:

- fewer repeated explanations
- fewer irrelevant memories injected
- lower token cost per successful run
- higher task completion rate
- fewer stale-memory mistakes
- source receipts for every memory used
- deletion and rollback behavior

If memory improves recall but increases stale mistakes, it is not ready for broad automation. If it reduces repeated context and produces receipts you can audit, it is worth expanding.

This pairs naturally with [Claude Code token observability](/blog/claude-code-token-burn-cache-observability) and [agent receipts](/blog/agent-swarms-need-receipts). Memory without cost and provenance telemetry is just another hidden dependency.

## The Practical Take

Persistent memory is going to become standard in coding agents. Not because it is flashy. Because stateless agents waste human attention. They force developers to repeat architecture, preferences, failures, and operating rules that should compound.

But the winning memory systems will not be the ones that simply retrieve the most facts. They will be the ones that make memory governable:

- explicit enough to inspect
- scoped enough to avoid cross-project leakage
- fresh enough to survive migrations
- reversible enough to undo bad learnings
- measured enough to prove it helps

The agent that remembers everything is not the goal. The agent that remembers what still deserves trust is.

## FAQ

### What is agent memory?
Agent memory is persistent state that helps an AI agent carry useful context across turns, sessions, or tasks. For coding agents, this can include repo conventions, previous debugging attempts, user preferences, session summaries, and reusable procedures.

### Is persistent memory better than a larger context window?

Not by itself. A larger context window lets the model read more at once. Persistent memory decides what should be carried forward across sessions. Good systems use both, plus context reduction so raw logs and tool output do not flood the prompt.

### Should agent memory live in a vector database?

Sometimes. Vector search is useful for semantic recall, but durable coding rules often belong in explicit files, skills, manifests, or structured records with source links. The safest systems combine searchable memory with readable, reviewable artifacts.

### What is the biggest risk with coding-agent memory?

Stale or over-scoped recall. A true memory can become wrong after a migration, or a rule from one repo can leak into another. That is why scope, timestamps, provenance, expiration, deletion, and rollback matter.

### How should teams evaluate memory tools?

Use real repo tasks and measure repeated-context reduction, task completion, token cost, stale-memory failures, source receipts, and deletion behavior. Do not rely only on retrieval benchmarks.
## Sources

- GitHub Trending: [today's trending repositories](https://github.com/trending)
- GitHub: [`rohitg00/agentmemory`](https://github.com/rohitg00/agentmemory)
- Letta Docs: [Agent memory and architecture](https://docs.letta.com/guides/agents/architectures/memgpt)
- Letta Docs: [Memory overview](https://docs.letta.com/guides/agents/memory)
- LangChain Docs: [Memory overview](https://docs.langchain.com/oss/python/concepts/memory)
- LangChain Docs: [Deep agents long-term memory](https://docs.langchain.com/oss/python/deepagents/long-term-memory)
- arXiv: [STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?](https://arxiv.org/abs/2605.06527)
- arXiv: [Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers](https://arxiv.org/abs/2603.07670)

Claude Platform on AWS Is Enterprise Agent Plumbing, Not Just Procurement

Developers Digest — Tue, 12 May 2026 00:00:00 GMT

Anthropic's Claude Platform on AWS announcement looks like a procurement story at first glance: AWS customers can access Claude platform features with AWS authentication, billing, and commitment retirement.

That framing undersells it. For engineering leaders, this is about where agent adoption actually gets unblocked. Most teams do not fail to adopt AI coding agents because nobody can write a prompt. They fail because the platform questions pile up:

- Who owns identity?
- Which budget pays for the runs?
- Can usage retire an existing cloud commitment?
- Where do logs and access controls live?
- Can security review the integration without another vendor path?
- Can developers use the same models across experimentation and production?

That is why this announcement belongs next to [Claude Managed Agents as backend job runtime](/blog/claude-managed-agents-backend-job-runtime), [Claude Code vs Codex App](/blog/claude-code-vs-codex-app-2026), and [OpenAI vs Anthropic developer experience](/blog/openai-vs-anthropic-2026). The battleground is no longer only model quality. It is the operational path from first prototype to approved platform.

## The Real Product Is Approval Surface Area

Every serious AI tool has two products:

1. the thing developers touch;
2. the thing the company can approve.

Developers see Claude Code, API calls, agents, and model quality. Platform teams see authentication, billing, data controls, audit, support paths, vendor risk, and existing cloud contracts.

Claude Platform on AWS is aimed at the second product. It says: use the Claude platform through infrastructure your company may already have approved. That matters because a lot of AI adoption dies in the gap between "this works in a local demo" and "this can run inside our enterprise constraints."

## Why AWS Billing Is a Developer Feature

Billing sounds boring until it changes behavior. If Claude platform usage can flow through AWS billing and commitment retirement, the buying motion changes.
A team that could not get a separate AI vendor budget may be able to route usage through an existing cloud relationship. A platform team that already reports AWS spend can put AI agent usage beside compute, storage, and data costs.

That makes [agent FinOps](/blog/400-dollar-overnight-bill-agent-finops) less theoretical. The useful question becomes:

```txt
Which product team, repo, environment, and workflow burned these tokens?
```

Not:

```txt
Who has the shared API key?
```

Enterprise agent adoption needs that shift. Agents will not stay small. They will run code review, migration tasks, test generation, incident summaries, docs refreshes, and background maintenance loops. The spend has to become attributable.

## Identity Is the Other Half

Authentication is not just login polish. It defines what an agent can touch. When agent platforms integrate with an enterprise cloud identity path, companies can ask sharper questions:

- Which teams can create agent environments?
- Which roles can access production context?
- Which workloads can call which models?
- Which usage belongs to experimentation vs approved production?
- Which logs are visible to security and platform owners?

This is the same reason [Codex cloud internet controls](/blog/openai-codex-cloud-security-playbook-2026) matter. The moment an agent can read code, call tools, or run tasks in a company environment, identity becomes part of the product.

## Opposing View: This Is Just Channel Strategy

There is a cynical read: this is just Anthropic making Claude easier to buy through AWS. That is partly true. Distribution matters. Cloud marketplaces and billing relationships are sales infrastructure.

But for developer platforms, distribution is architecture. A model that can be bought, governed, and monitored through existing enterprise systems is more likely to become part of production workflows.
A model that requires a one-off contract, a separate admin layer, and manual usage reconciliation stays in the experimentation bucket longer.

So yes, this is channel strategy. It is also product strategy.

## What This Means for Agent Builders

If you are building internal agent systems, take the hint. Enterprise buyers will ask for:

- cloud-native identity integration;
- project-level spend attribution;
- environment-level policy;
- model routing controls;
- audit logs;
- data retention settings;
- support for existing procurement and commitment structures;
- clean separation between experimentation and production use.

The agent runtime matters, but the wrapper around the runtime determines whether it can scale inside a company. That is why [terminal agents as runtime surfaces](/blog/terminal-agents-portable-runtime-surface) are only one side of the story. The other side is platform plumbing.

## What Developers Should Watch

For individual developers, the near-term benefit is not "AWS is involved." It is that enterprise AI workflows may get less fragmented. Watch for these practical changes:

- fewer separate vendor approvals for Claude-based tools;
- more company-approved Claude Code and API environments;
- cleaner budget tags for agent runs;
- stronger admin controls around model access;
- more teams standardizing on approved agent workflows instead of shadow tools.

This could make Claude easier to use in serious company contexts, especially where AWS is already the center of gravity.

## The Bigger Pattern

OpenAI, Anthropic, GitHub, AWS, Google, and Microsoft are all converging on the same truth: agent adoption is a platform problem. The winning setup will not be "one model endpoint and a clever prompt." It will look like:

- identity;
- policy;
- runtime isolation;
- spend controls;
- audit trails;
- model choice;
- environment routing;
- human escalation;
- deployment verification.
That is why the [Claude Code token burn observability](/blog/claude-code-token-burn-cache-observability) conversation and the enterprise-platform conversation are connected. You cannot responsibly scale agent usage if you cannot govern it.

## The Takeaway

Claude Platform on AWS is not exciting because it adds another way to buy Claude. It is exciting because it moves AI agent adoption into the systems enterprises already use to approve software. That is the quiet bottleneck.

The teams that win with agents will not only pick the best model. They will build a platform where agents have identity, budgets, boundaries, receipts, and a path to production. Claude on AWS is one more sign that the category is growing up.

## FAQ

### What is Claude Platform on AWS?

Anthropic describes it as a way for AWS customers to access Claude platform features using AWS authentication, billing, and commitment retirement. It is generally available as of the May 2026 announcement.

### Why does this matter for developers?

It can make Claude-based tools easier to approve, budget, and govern inside companies that already operate through AWS. That matters for production agent workflows because identity, spend attribution, and policy controls become part of the adoption path.

### Does this replace Claude Code?

No. Claude Code remains a developer-facing coding agent. Claude Platform on AWS is more about enterprise access and platform integration around Claude capabilities.

Sources: [Anthropic: Introducing the Claude Platform on AWS](https://claude.com/blog/claude-platform-on-aws), [Hacker News discussion](https://news.ycombinator.com/item?id=48103042), [AWS Marketplace documentation](https://docs.aws.amazon.com/marketplace/latest/buyerguide/buyer-iam-users-groups-policies.html), [Anthropic Claude Code documentation](https://docs.anthropic.com/en/docs/claude-code/overview), [Anthropic Claude API documentation](https://docs.anthropic.com/en/api/overview).

Interaction Models Are the Next AI Developer Tool Interface

Developers Digest — Tue, 12 May 2026 00:00:00 GMT

Thinking Machines' post on interaction models is one of the more useful AI interface pieces to land this week because it names a problem every developer-tool team is running into: chat is not the final shape.

Turn-based chat is great for asking a question. It is awkward for shared work. Coding agents already proved that. A serious agent session is not one prompt and one answer. It is a loop of reading files, asking clarifying questions, editing code, running tests, showing diffs, getting corrected, opening browser checks, and leaving a receipt.

That is why [terminal agents are becoming runtime surfaces](/blog/terminal-agents-portable-runtime-surface), why [Codex loops](/blog/codex-loops-boris-cherny-agent-routines) matter, and why [long-running agent harnesses](/blog/long-running-agents-need-harnesses) keep showing up. The next interface layer is not "better chat." It is better coordination.

## What Interaction Models Mean

Thinking Machines describes interaction models as systems that handle multimodal, real-time collaboration across audio, video, and text. The important idea is not merely multimodality. The important idea is that the model participates in an ongoing interaction instead of waiting for a fully packaged prompt.

For developer tools, that maps cleanly to the work we already do:

- watch a test fail;
- inspect a diff;
- hear a spoken constraint;
- see a screenshot;
- follow a cursor;
- notice a console error;
- ask whether to continue;
- remember which file is the current focus;
- hand control back to the human at the right moment.

That is a different product shape from a chat box glued beside an editor.

## Why Chat Feels Wrong for Coding Agents

Chat forces developers to serialize messy work into text. You have to explain:

- which file matters;
- what changed;
- which visual bug you mean;
- which test output is relevant;
- which instruction still applies;
- which previous decision should be ignored.
A good coding agent can infer some of that from the repo, but the interface still makes the human do too much packaging.

This is why tools keep adding richer surfaces: IDE diffs, terminal execution, browser screenshots, task plans, subagents, worktrees, PR comments, and persisted instructions. They are not decorations. They are attempts to escape the limitations of pure chat.

## The Developer Tool Version

In developer tools, an interaction model should treat the repo, terminal, browser, issue tracker, and human as parts of one workspace. Imagine a coding agent interface where:

- the agent can see the current failing test and the diff beside it;
- your spoken correction is attached to the exact UI state;
- the browser screenshot becomes part of the task context;
- the agent knows whether it is in exploration, implementation, review, or deploy verification mode;
- every action lands in a receipt that another agent can resume.

That is not science fiction. Pieces of it already exist across [Claude Code](/blog/what-is-claude-code-complete-guide-2026), Codex, Cursor, Zed, GitHub Copilot, and browser automation workflows. The problem is that the pieces are still fragmented.

## Opposing View: Chat Is Enough

There is a fair counterargument: chat is simple, universal, and composable. A text box can drive anything. Developers already understand it. APIs are easier. Logs are easier. Automation is easier.

I agree with the first half. Chat should not disappear. But chat should become one control among many, not the whole interface. The same way command lines did not disappear when IDEs improved, text prompts will remain useful. They just should not be responsible for carrying every bit of state.

The best developer tools will support text, but they will not force every interaction through text.

## The Missing Primitive Is Shared State

The real prize is shared state.
Developer work has a lot of state: - files; - diffs; - test results; - logs; - browser screenshots; - issue comments; - design constraints; - deploy status; - previous agent attempts; - budget and time limits. Chat transcripts are a poor database for that. They are verbose, ambiguous, and hard to resume. A better interaction model should store task state explicitly. That is why [agent context reduction](/blog/agent-context-reduction-pattern) matters. The goal is not to stuff more transcript into a context window. The goal is to keep the right state in the right structure. ## What To Build Now If you are building AI developer tools, do not wait for a perfect multimodal model to improve the interface. Start with the interaction contract. Add these primitives: - **Mode**: exploration, implementation, review, verification, deploy. - **Current artifact**: file, PR, route, screenshot, test, issue. - **Authority level**: read-only, edit, command execution, merge, deploy. - **Evidence**: tests run, screenshots captured, source links checked. - **Resume state**: what another agent needs to continue without replaying the whole chat. - **Escalation rule**: when the agent must stop and ask. Those primitives make any model better because they reduce ambiguity. ## Why This Matters for Content and SEO Too The same idea applies outside code. A content automation should not only say "write a post." It should know: - the trend source; - the existing posts to avoid duplicating; - the internal links to include; - the image style; - the checks to run; - the deployment verification step; - the next self-improvement note. That is exactly the loop behind [skills as agent operating systems](/blog/skills-are-the-new-agent-operating-system). A skill is a tiny interaction model: state, constraints, tools, and expected output. ## The Takeaway Interaction models are a useful frame because they push AI tools beyond prompt-response thinking. 
For developer tools, the future interface is a shared workspace where the model can coordinate across code, tests, browser state, voice, screenshots, issues, and deployment receipts. Chat will still be there. It just will not be the whole product. The best agent tools will feel less like asking a chatbot to code and more like working inside a system that understands the work in progress. ## FAQ ### What is an interaction model in AI? An interaction model is a system design for how a model collaborates with users across time, modalities, and shared state. Instead of treating every request as a standalone chat turn, it handles ongoing work. ### Why does this matter for AI coding tools? Coding work involves files, diffs, tests, terminals, screenshots, issue trackers, and deployment checks. A chat-only interface makes developers compress all of that state into text, which is inefficient and error-prone. ### Does this mean chat interfaces are going away? No. Text prompts remain useful. The shift is that chat becomes one input inside a richer workspace, not the entire interface. Sources: [Thinking Machines: Interaction Models](https://thinkingmachines.ai/blog/interaction-models/), [Hacker News discussion](https://news.ycombinator.com/item?id=48100524), [Anthropic Claude Code overview](https://docs.anthropic.com/en/docs/claude-code/overview), [OpenAI Codex documentation](https://developers.openai.com/codex/), [W3C Multimodal Interaction Architecture](https://www.w3.org/TR/mmi-arch/).
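As a closing sketch, the interaction contract from "What To Build Now" could be written down as a small typed record. This is a minimal illustration under stated assumptions, not any vendor's API: the class names, field names, and the `ask_before` threshold are hypothetical, and the ordering of authority levels is one possible reading of the list in that section.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    EXPLORATION = "exploration"
    IMPLEMENTATION = "implementation"
    REVIEW = "review"
    VERIFICATION = "verification"
    DEPLOY = "deploy"


class Authority(Enum):
    # Assumed ordering, from least to most privileged.
    READ_ONLY = 0
    EDIT = 1
    COMMAND_EXECUTION = 2
    MERGE = 3
    DEPLOY = 4


@dataclass
class InteractionContract:
    mode: Mode
    current_artifact: str          # file, PR, route, screenshot, test, or issue
    authority: Authority           # highest action the agent may take alone
    ask_before: Authority          # hypothetical escalation rule: always ask at or above this level
    evidence: list[str] = field(default_factory=list)  # tests run, screenshots, links checked
    resume_state: str = ""         # what another agent needs to continue

    def must_ask(self, requested: Authority) -> bool:
        """Return True when the agent has to stop and ask a human first."""
        exceeds_grant = requested.value > self.authority.value
        crosses_threshold = requested.value >= self.ask_before.value
        return exceeds_grant or crosses_threshold


contract = InteractionContract(
    mode=Mode.REVIEW,
    current_artifact="src/auth.py",  # hypothetical path
    authority=Authority.EDIT,
    ask_before=Authority.MERGE,
)
```

The point of the record is that the escalation decision reads from explicit state instead of being reconstructed from a transcript.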

TanStack's npm Compromise Is the CI Lesson Agent Teams Needed

Developers Digest — Tue, 12 May 2026 00:00:00 GMT

TanStack's May 11 npm postmortem is the kind of incident AI-heavy engineering teams should read slowly. The headline was a serious supply-chain compromise: malicious versions were published across dozens of `@tanstack/*` packages after an attacker chained GitHub Actions behavior, cache poisoning, and OIDC token extraction. The durable lesson is broader than TanStack. If you are letting agents open pull requests, edit workflow files, run CI, or prepare releases, your agent program is now coupled to your CI trust model. That is the same operational theme behind [prompt injection in open source](/blog/prompt-injection-open-source), [agent receipts](/blog/agent-receipts-ai-coding), and [long-running agent harnesses](/blog/long-running-agents-need-harnesses). Agent output is not safe because the diff looks small. It is safe when the workflow around the diff has the right boundaries. ## What Happened TanStack says the attacker chained three important primitives: - a `pull_request_target` workflow path that crossed the fork and base-repository trust boundary; - GitHub Actions cache poisoning across that boundary; - OIDC token extraction from runner memory, which enabled npm publishing. The exact details matter, but the pattern matters more: a CI workflow treated untrusted pull request context as if it could safely influence trusted release machinery. That is the part agent teams should underline. Agents do not invent new categories of infrastructure risk every time. They amplify the old ones by increasing the number of PRs, workflow edits, dependency updates, and release-adjacent tasks moving through the system. ## Why This Hits Agent Workflows Differently Classic CI security assumes human developers are the primary authors of risky changes. AI coding agents change the volume and shape of that work. 
A team that runs [Codex loops](/blog/codex-loops-boris-cherny-agent-routines), [Claude Code subagents](/blog/claude-code-agent-teams-subagents-2026), or GitHub-hosted coding agents will naturally delegate chores like: - dependency refreshes; - test fixture updates; - workflow cleanups; - release note generation; - package publishing checks; - flaky CI repair. Those tasks feel boring, which is exactly why they get delegated. But boring does not mean low privilege. A one-line workflow change can matter more than a 2,000-line application diff. The dangerous failure mode is not "the agent wrote bad TypeScript." It is "the agent made a plausible CI change that lets untrusted code reach a trusted credential boundary." ## The Real Boundary Is Not Human vs AI The easy take is to say "do not let AI touch CI." That is too blunt. The better boundary is trusted vs untrusted execution. A human can make the same mistake. An agent can make the same mistake faster. The fix is to design the release system so neither can accidentally turn a fork PR into a credentialed publish path. For agent teams, that means release automation should be split into layers: 1. **Untrusted validation**: test the proposed change without secrets and without publish rights. 2. **Reviewable artifact creation**: build packages, diffs, previews, and SBOMs as artifacts. 3. **Trusted promotion**: publish only from protected branches, protected environments, or manually approved release jobs. 4. **Receipt capture**: record exactly which commit, workflow, token audience, package version, and actor performed the release. That last point is where agent operations and security converge. A good [agent FinOps](/blog/400-dollar-overnight-bill-agent-finops) system tells you what the agent spent. A good agent security system tells you what authority the agent touched. ## `pull_request_target` Needs a Higher Bar `pull_request_target` exists for real reasons. 
It can run with base-repository context, which is useful for labels, comments, and some automation around external contributions. But any workflow that combines `pull_request_target`, untrusted checkout behavior, caches, generated scripts, install steps, or release credentials deserves a hard review. This is not an agent-specific rule. It is a GitHub Actions trust-boundary rule. Agent teams should make it explicit: - agents may comment on external PRs; - agents may summarize CI and review state; - agents may propose workflow changes in a normal PR; - agents may not create or modify credentialed publish paths without human review; - agents may not merge changes that alter release credentials, OIDC audiences, package permissions, or protected environment rules. That sounds bureaucratic until you compare it with the blast radius of a compromised package. ## The Agent Review Checklist Should Include CI Authority Most AI code review checklists focus on code quality: - Does it compile? - Are tests passing? - Is the implementation too broad? - Did the agent delete something important? After this incident, agent review needs an authority section too. Ask these questions for every agent-authored PR that touches CI, dependencies, package publishing, install scripts, or repository settings: - Does this change alter when secrets are available? - Does it run untrusted code before a credentialed step? - Does it restore caches across trust boundaries? - Does it make package publishing easier without adding an approval gate? - Does it change token permissions from read to write? - Does it add dynamic script execution in a privileged job? - Does it rely on labels, branch names, or filenames as a security control? This is the same discipline as [agent bugs moving up the stack](/blog/overnight-agents-workflow). The bug is often not a bad line of code. It is a bad operating assumption. 
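One way to make the authority section routine is to flag it mechanically before a human reads the diff. The sketch below, under stated assumptions, lists which changed paths in an agent-authored PR should trigger the authority questions above. The pattern list and the function name `needs_authority_review` are hypothetical and would need tuning per repository. It only narrows where to look; it does not answer the questions.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files that can change CI authority or publishing.
# In fnmatch, "*" also matches "/" so "*/package.json" covers nested packages.
AUTHORITY_SENSITIVE_PATTERNS = [
    ".github/workflows/*",
    ".github/actions/*",
    "package.json",
    "*/package.json",
    "package-lock.json",
    "pnpm-lock.yaml",
    "yarn.lock",
    ".npmrc",
]


def needs_authority_review(changed_paths):
    """Return the changed paths that should get the CI authority checklist."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in AUTHORITY_SENSITIVE_PATTERNS)
    )


flagged = needs_authority_review(
    [
        "src/app.ts",
        ".github/workflows/release.yml",
        "packages/router/package.json",
    ]
)
```

A path match is deliberately crude. Repository settings and secrets live outside the diff, so they still need a separate check.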
## Opposing View: This Is Just CI Security

The opposing take is reasonable: TanStack's postmortem is about GitHub Actions and npm publishing, not AI agents. You do not need to mention agents to understand the vulnerability class.

That is true. The root cause lives in CI and release engineering.

But AI changes the exposure surface. More teams are now asking agents to maintain the exact files that define CI trust boundaries. More teams are also running background loops that wake up, inspect GitHub state, and push small changes without the same attention a senior engineer would give a release workflow.

So the agent angle is not "AI caused this." The agent angle is "agent adoption makes this category of mistake easier to repeat at scale."

## The Practical Policy

Here is the policy I would put into an agent runbook:

```txt
Agents may propose CI and release changes.
Agents may not merge or execute credential-affecting CI changes.
Any change touching package publishing, OIDC, secrets, environments, workflow permissions, caches, or pull_request_target requires human review.
Trusted publish jobs must run from protected branches or protected environments only.
Every release job must emit a receipt: commit, package, version, workflow, actor, token audience, and artifact hash.
```

That is not anti-agent. It is how you make agents boring enough to use.

## What To Measure Next

If your team is already running coding agents, track these metrics:

- agent-authored PRs that touch `.github/workflows`;
- agent-authored dependency and lockfile PRs;
- workflows that use `pull_request_target`;
- workflows with `id-token: write`;
- publish jobs without protected environment approval;
- release jobs that consume caches built from untrusted PR context;
- mean time from package publish to rollback.

Those numbers will tell you whether your agent system is increasing release risk or just increasing normal application throughput.
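Two of the metrics above, workflows that use `pull_request_target` and workflows with `id-token: write`, can be approximated with a plain text scan. The sketch below is a rough heuristic, not a YAML-aware audit: it assumes workflow files are passed in as strings, strips comments naively, and uses hypothetical function names. A real audit should parse the workflow and look at job-level permissions.

```python
import re

PR_TARGET = re.compile(r"\bpull_request_target\b")
ID_TOKEN_WRITE = re.compile(r"^\s*id-token:\s*write\b", re.MULTILINE)


def strip_comments(text):
    # Naive: drops everything after '#', so it ignores '#' inside quoted strings.
    return "\n".join(line.split("#", 1)[0] for line in text.splitlines())


def workflow_flags(text):
    """Report which of the two risky markers appear in one workflow file."""
    body = strip_comments(text)
    return {
        "pull_request_target": bool(PR_TARGET.search(body)),
        "id_token_write": bool(ID_TOKEN_WRITE.search(body)),
    }


def summarize(workflows):
    """workflows maps a file path to its text; returns counts per marker."""
    counts = {"pull_request_target": 0, "id_token_write": 0, "both": 0}
    for text in workflows.values():
        flags = workflow_flags(text)
        counts["pull_request_target"] += flags["pull_request_target"]
        counts["id_token_write"] += flags["id_token_write"]
        counts["both"] += flags["pull_request_target"] and flags["id_token_write"]
    return counts


# Hypothetical workflow contents for illustration.
sample = {
    "release.yml": "on:\n  push:\n    branches: [main]\npermissions:\n  id-token: write\njobs: {}\n",
    "triage.yml": "on: pull_request_target\njobs: {}\n",
    "ci.yml": "# pull_request_target is not used here\non: pull_request\njobs: {}\n",
}
report = summarize(sample)
```

The `both` count is the one to watch first, since a single file that combines the two markers is where untrusted context and publish credentials sit closest together.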
## The Takeaway TanStack's incident should not make teams stop using agents. It should make teams stop treating CI as background plumbing. AI agents inherit your trust boundaries. If those boundaries are fuzzy, agents will make the fuzziness visible. If the boundaries are explicit, agents can work inside them productively. The next mature agent platform will not only generate code. It will understand workflow authority, ask for escalation before touching release paths, and leave receipts that make supply-chain review boring. That is where this category has to go. ## FAQ ### Was the TanStack incident caused by AI? No. TanStack's public postmortem describes a GitHub Actions and npm supply-chain compromise. The AI lesson is that coding-agent workflows often touch the same CI and release files, so teams need stronger trust-boundary policies before delegating those chores. ### Should agents be banned from editing CI files? Not completely. Agents can propose CI changes, summarize workflows, and open reviewable PRs. They should not merge or execute changes that affect secrets, OIDC, package publishing, protected environments, or trusted release jobs without human approval. ### What is the safest first agent security control? Start by blocking autonomous changes to `.github/workflows`, package publishing configuration, and repository secrets. Then add a review checklist for credential boundaries, cache behavior, OIDC token use, and protected environment rules. 
Sources: [TanStack npm supply-chain compromise postmortem](https://tanstack.com/blog/npm-supply-chain-compromise-postmortem), [Hacker News discussion](https://news.ycombinator.com/item?id=48100706), [GitHub Actions `pull_request_target` documentation](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target), [GitHub Actions OIDC hardening guide](https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect), [npm package provenance documentation](https://docs.npmjs.com/generating-provenance-statements).

Codebase Graphs Are the New Agent Map

Developers Digest — Sun, 10 May 2026 00:00:00 GMT

The most useful GitHub trend this morning is not another chat wrapper. It is a map. [Graphify](https://github.com/safishamsi/graphify) is a fast-growing Claude Code skill that turns a folder of code, markdown, PDFs, screenshots, diagrams, schemas, and other project material into a queryable knowledge graph. The pitch is specific: drop it on a repo or research folder, get an interactive graph, an Obsidian-style vault, a wiki, a JSON graph, a report of high-degree nodes, surprising connections, suggested questions, and provenance labels for what was extracted versus inferred. That is a much more interesting signal than the star count alone. The agent market has spent the last year arguing about which model writes the best patch. The next bottleneck is different: agents need durable maps of the systems they are operating inside. Without that, every long coding run becomes another expensive rediscovery loop. That is the same pressure behind [terminal agents becoming portable runtime surfaces](/blog/terminal-agents-portable-runtime-surface), [Claude Code token-burn observability](/blog/claude-code-token-burn-cache-observability), and [the context reduction pattern](/blog/agent-context-reduction-pattern). The agent does not need every file pasted into context. It needs the right local map, with evidence, boundaries, and a path back to verification. ## The Take Codebase graphs are becoming the new repo map. Aider made the repo-map idea concrete for AI coding: use tree-sitter to build a compact view of symbols and relationships, then spend context on the parts of the codebase that matter. That pattern still works, and it is why [Aider vs Claude Code](/blog/aider-vs-claude-code) is still a useful comparison. Graphify points at the next version of the same idea. Modern agent work is not only source code. 
It includes: - product notes - schemas - migration history - screenshots - architecture diagrams - bug reports - transcripts - research papers - design system rules - deployment runbooks - prior agent decisions Those objects do not fit neatly into a file tree. They fit better as a graph. If the agent can ask "what connects this billing route to this auth policy?" or "which docs contradict the current schema?" or "what changed since the last successful deploy?", it can navigate like an engineer instead of rereading the whole repo like a distracted intern. ## Why This Is Trending Now The timing makes sense. Coding agents have become capable enough that the failure mode moved up a layer. The model can usually make a plausible edit. The hard part is knowing which edit is appropriate inside this specific system. That is why developers keep building surrounding infrastructure: - skills and memory files to preserve local conventions - repo maps to compress code structure - MCP servers to expose tool state - terminal runtimes with approvals and rollback - hooks that run tests after edits - cost monitors that catch runaway context - PR receipts that explain what changed and why Graphify sits in that same category. It is not trying to be the model. It is trying to be part of the agent's working memory. The README claims a 71.5x token reduction on a mixed corpus of Karpathy repos, papers, and images. Treat that as a project-specific benchmark, not a universal law. But the direction is right: structure beats repeated full-context reads when the corpus gets large enough. ## The Real Product Is Provenance The best detail in Graphify is not the visual graph. It is the edge labeling. The project says each edge is tagged as `EXTRACTED`, `INFERRED`, or `AMBIGUOUS`. That matters because agent context is dangerous when it looks more certain than it is. 
A useful codebase map should separate:

- facts found directly in code
- relationships inferred from names or call paths
- claims copied from docs
- stale notes that may no longer match production
- hypotheses that need verification

That distinction is the difference between a map and fan fiction.

This is also where many memory systems fall apart. A persistent note that says "the checkout flow uses Stripe webhooks" is not enough. The agent needs to know where that came from, when it was observed, which files support it, and which tests or logs can prove it still holds.

That is why the next useful agent-memory product will look less like a notebook and more like a graph with receipts.

## The Opposing Take

The skeptical view is fair: knowledge graphs have been oversold before.

Developers have seen enterprise graph demos where everything connects to everything, the visualization looks impressive, and the daily workflow never changes. A codebase graph can become another artifact that ages out of sync, costs tokens to maintain, and gives the agent a false sense of understanding.

There are real failure modes:

- The graph can preserve stale architecture decisions after the code moved on.
- Inferred edges can look factual if the UI does not mark uncertainty clearly.
- Generated wiki pages can compress away the edge case that matters.
- Multimodal extraction can misread screenshots or diagrams.
- A graph can help exploration but still fail to validate behavior.
- Rebuild hooks can add noise if every commit produces large artifact churn.

So the right question is not "does the graph look clever?" The right question is "does this graph reduce real agent mistakes?"

If it does not help the agent choose better files, avoid duplicate work, explain risk, run better tests, or leave better receipts, it is decoration.

## What A Serious Codebase Graph Needs

For agent work, a codebase graph should be scored like infrastructure.

### 1. Incremental Updates

The graph has to stay current without turning every edit into a full re-index. Graphify's cache and `--update` path are the right shape. Code changes should be cheap to refresh. Docs, diagrams, and PDFs can take a slower pass. The important part is that the agent knows whether it is reading a fresh edge or stale context.

### 2. Source Links

Every useful node should route back to evidence. If a graph says a route depends on a policy, click through to the route, policy, migration, test, or doc. If the relationship came from inference, say that. If it came from a generated summary, point to the raw source.

This is the same standard public technical content should meet: claims need sources. Agents should hold themselves to the same rule.

### 3. Agent-Navigable Output

The visual graph is useful for humans, but agents need boring files. Graphify's wiki output is interesting because it gives another agent a markdown entry point. That is the practical surface. A coding agent can read `index.md`, follow links, inspect a community page, and then jump to files. It does not need to parse a dense PNG of nodes.

### 4. Uncertainty Labels

The graph should make uncertainty loud. `EXTRACTED`, `INFERRED`, and `AMBIGUOUS` are good starting labels. Teams may need more: `STALE`, `TESTED`, `PRODUCTION_OBSERVED`, `DOC_ONLY`, `HUMAN_CONFIRMED`, or `BROKEN_BY_RECENT_DIFF`.

This is where graph memory connects to [agent swarms needing receipts](/blog/agent-swarms-need-receipts). More context is not better unless the context explains how much to trust it.

### 5. Verification Paths

A graph should not end at an answer. It should end at a check.

If the agent asks "what owns this checkout failure?", the graph can identify likely files and docs. The next step should be a test, log query, smoke check, or reproduction command. That is how codebase maps become operational, not ornamental.
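To make the last few requirements concrete, here is a minimal sketch of an edge record that carries source links, an uncertainty label, a freshness timestamp, and a verification path. The three label names come from the Graphify description above. Everything else (the `Edge` fields, `trust_label`, the 30-day window, the file paths, and the test command) is a hypothetical illustration, not Graphify's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Provenance(Enum):
    EXTRACTED = "EXTRACTED"
    INFERRED = "INFERRED"
    AMBIGUOUS = "AMBIGUOUS"


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str
    provenance: Provenance
    evidence: tuple[str, ...]          # links back to files, docs, or tests
    observed_at: datetime              # when the relationship was last seen
    verify_with: Optional[str] = None  # command or check that can confirm it


def trust_label(edge, now, max_age=timedelta(days=30)):
    """Classify how an agent should treat an edge before acting on it."""
    if now - edge.observed_at > max_age:
        return "stale"
    if edge.provenance is not Provenance.EXTRACTED or not edge.evidence:
        return "needs_verification"
    return "extracted"


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
inferred = Edge(
    source="routes/billing.ts",
    target="policies/auth.sql",
    relation="depends_on",
    provenance=Provenance.INFERRED,
    evidence=("routes/billing.ts",),
    observed_at=now - timedelta(days=2),
    verify_with="pytest tests/test_billing_auth.py",
)
```

Even the "extracted" label only says the edge was read from a source recently. The `verify_with` field is what turns the edge into a check rather than an answer.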
This is the same lesson behind [long-running agents needing harnesses](/blog/long-running-agents-need-harnesses). A map is useful because it points the harness at the right verification loop. ## Where This Fits In The Stack I would not replace existing tools with a graph layer. I would add it where current agent workflows already leak time. Use a codebase graph when: - the repo is too large for normal context stuffing - architecture knowledge lives across docs, tickets, schemas, and code - multiple agents are editing related modules - onboarding requires repeated "where does this live?" questions - migrations and policies matter as much as application code - historical decisions affect current implementation choices Do not use it as a substitute for: - tests - typechecks - code review - runtime logs - source-level inspection - explicit task acceptance criteria The graph should narrow the search space. It should not become the authority. ## My Take Graphify is interesting because it names a real pain: agents are still bad at carrying system structure across sessions. That does not mean every team needs a knowledge graph tomorrow. Small repos still fit in simple context windows. Many projects need better tests before they need better maps. And any generated graph has to prove that it reduces mistakes, not just tokens. But the direction is right. AI coding is moving from prompt craft to operating systems. Repos need maps. Agents need provenance. Teams need receipts. The winning context layer will not be the one that remembers the most. It will be the one that helps an agent decide what to inspect, what to trust, and what to verify next. 
Sources: [Graphify on GitHub](https://github.com/safishamsi/graphify), [Aider repo map documentation](https://aider.chat/docs/repomap.html), [Sourcegraph Cody docs](https://sourcegraph.com/docs/cody), [Model Context Protocol introduction](https://modelcontextprotocol.io/introduction), [Claude Code memory docs](https://docs.anthropic.com/en/docs/claude-code/memory). ## FAQ ### What is Graphify? Graphify is a Claude Code skill and CLI workflow that turns folders of code, docs, PDFs, images, diagrams, and other project material into a queryable knowledge graph. It can output an interactive graph, markdown wiki, Obsidian-style vault, JSON graph, and report. ### Why do AI coding agents need codebase graphs? Agents need compact structure. A graph can show relationships among files, functions, docs, schemas, decisions, and tests without stuffing the whole repo into context. That helps the agent choose better files and ask better follow-up questions. ### Is a codebase graph better than a repo map? It depends on the job. A repo map is excellent for symbol-level code navigation. A broader graph is more useful when the task crosses code, documentation, diagrams, research, schemas, and prior decisions. The best systems will likely use both. ### What is the risk of using generated knowledge graphs? The main risk is false confidence. If inferred or stale relationships look factual, the agent may make wrong edits faster. A serious graph needs source links, uncertainty labels, freshness metadata, and verification paths. ### Should every repo add a codebase graph? No. Small repos may not need it. Add a graph when repeated context discovery is slowing agents down, when knowledge lives across many artifact types, or when multiple agents need a shared map of the same system.

Claude Managed Agents Are Starting to Look Like Backend Jobs

Developers Digest — Sat, 09 May 2026 00:00:00 GMT

Anthropic's latest Claude Managed Agents update looks like an agent feature launch on the surface: multiagent sessions, outcomes, dreaming, vault refresh, and webhooks. The more useful read is that managed agents are turning into a backend job runtime. That is the angle developers should care about. Once an agent can run for a while, split work across specialized threads, refresh credentials, emit webhooks, ask for permission, and prove an outcome, it stops behaving like a chat tab. It starts behaving like a long-running production process. That puts Claude Managed Agents in the same operational lane as [Codex goals and Claude managed outcomes](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences), [terminal agents as portable runtime surfaces](/blog/terminal-agents-portable-runtime-surface), and [long-running agent harnesses](/blog/long-running-agents-need-harnesses). The winning teams will not just prompt these systems better. They will wrap them like jobs: queued, idempotent, observable, interruptible, budgeted, and auditable. ## What Changed Anthropic's announcement says managed agents now include multiagent orchestration, outcomes, dreaming, vault refresh, and webhooks ([Anthropic announcement](https://claude.com/blog/new-in-claude-managed-agents)). The docs make the shift clearer. [Multiagent sessions](https://platform.claude.com/docs/en/managed-agents/multi-agent) let a coordinator agent delegate to other agents inside a single session. Those agents share a container and filesystem, but each runs in its own context-isolated session thread with its own conversation history. The coordinator sees condensed activity on the primary event stream, while operators can inspect individual session threads when needed. [Outcomes](https://platform.claude.com/docs/en/managed-agents/define-outcomes) turn "done" into a rubric-driven evaluation loop. 
Instead of trusting that an agent stopped at the right time, you define success criteria and inspect whether the outcome was satisfied, needs revision, hit max iterations, or failed. [Webhooks](https://platform.claude.com/docs/en/managed-agents/webhooks) notify your system about state changes such as sessions starting, idling, rescheduling, terminating, creating threads, or finishing outcome evaluation. The webhook docs also say payloads include the event type and resource ID, then your app fetches the fresh object by ID. That last detail matters. It is exactly how serious backend systems avoid stale event payloads, duplicate delivery bugs, and polling loops. ## The Take The agent platform race is moving from "can the model use tools?" to "can the run be operated like infrastructure?" A production agent run needs the same boring properties as a background job: - a durable job identifier - explicit status transitions - retry semantics - duplicate delivery handling - permission checkpoints - logs and event streams - typed completion states - budget limits - a way to wake humans up only when needed Claude Managed Agents is not the only path there. You can build this around Codex, Claude Code, GitHub Actions, a queue, or your own harness. But Anthropic's managed-agent surface is a strong signal about where the category is going. Agent execution is becoming backend execution. ## Webhooks Change the Integration Shape Without webhooks, a managed agent is something your app starts and then checks later. With webhooks, it becomes something your app can subscribe to. That difference changes the architecture. Your application can now react when an agent idles for a permission approval, when a multiagent thread is created, when a transient error triggers a reschedule, or when an outcome evaluation finishes. That is the same reason [agent-native backends](/blog/agent-native-backends-insforge) are interesting. The valuable surface is not just the model. 
It is the control plane around the run. The webhook docs also include the important production caveats: - event payloads are small and require a follow-up fetch - duplicate deliveries can happen - ordering is not guaranteed - non-2xx responses trigger retry behavior - endpoints can be disabled after repeated delivery failures Those are normal webhook rules, but they are easy to forget when the product category is called "agents." If you wire this like a toy chat callback, it will break like a toy chat callback. The right shape is boring: 1. Verify the signature. 2. Deduplicate by event ID. 3. Fetch the current session, thread, or outcome object by ID. 4. Update your own run record transactionally. 5. Trigger the next action only from your stored state. 6. Treat ordering as a hint, not a guarantee. That is not glamorous. It is what keeps an overnight agent from waking up three people for the same stuck approval. ## Multiagent Sessions Need Handoff Discipline The multiagent docs are also more operational than they first look. The coordinator can delegate to a roster of agents. Anthropic frames the best use cases as parallelization, specialization, and escalation. That maps directly to how engineering teams already split work: researcher, implementer, reviewer, test writer, security reviewer, docs writer. But the docs include constraints that should shape your design: - all agents share the same container and filesystem - each agent has isolated thread context - tools and context are not shared - the coordinator can delegate only one level deep - the roster can include up to 20 unique agents - session status aggregates thread activity - permission requests from worker threads are cross-posted to the primary thread Those details create a useful boundary. Do not treat multiagent sessions as a magic swarm. Treat them as a supervised job with worker threads. Each worker needs a narrow assignment, a completion artifact, and a reason to exist. 
If your coordinator delegates "improve the codebase" to five agents, you just made five vague agents. If it delegates "review auth policy changes," "write regression tests," and "summarize docs changes," you have an actual workflow. This is the same practical lesson behind [parallel coding agents needing merge discipline](/blog/parallel-coding-agents-merge-discipline). Parallelism is only useful when the handoffs are crisp enough to merge. ## Outcomes Are the Stop Condition The most important primitive is still outcomes. Tools let the agent act. Multiagent sessions let it split work. Webhooks let your app react. But outcomes define when the run is allowed to stop. That is why the existing [Codex `/goal` vs Claude outcomes comparison](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences) still matters. A durable loop is not the same thing as a good stopping rule. "Keep going" and "prove it is done" are different product primitives. For production workflows, outcomes should be written like acceptance criteria: - what files or artifacts must exist - what tests or checks must pass - what source evidence must be cited - what risk review must be completed - what business constraint must remain true - what human handoff note must be left behind The anti-pattern is using an outcome as a vibe check. Bad outcome: "Make the report good." Better outcome: "The report cites three primary sources, lists assumptions, includes a recommendation table, flags unknowns, and has no unsupported pricing claims." This matters even more as agents start coordinating with other agents. The coordinator can produce a polished summary while a worker missed the actual requirement. Outcomes force the final handoff to be judged against a rubric instead of the coordinator's confidence. ## The Opposing Take There is a fair skeptical response: isn't this just queue infrastructure with a model attached? In many ways, yes. That is the point. 
Teams already know how to run jobs, retries, event handlers, dashboards, queues, alerts, and approval workflows. The mistake would be treating agents as a brand-new metaphysical category that needs brand-new operational instincts.

The harder skeptical question is whether managed-agent platforms hide too much. If the provider owns the session runtime, filesystem, thread orchestration, credential vault, and outcome evaluation loop, you get speed but lose some control. You need to understand what can be exported, logged, replayed, interrupted, and governed from your side.

For some teams, a self-hosted harness around Claude Code, Codex, or an open-source agent runtime will be the better answer. For others, a managed runtime is exactly the right tradeoff because the provider handles the painful execution substrate.

The decision should not be ideological. Ask what failure evidence you get back.

## The Production Checklist

Before treating managed agents as production infrastructure, I would require:

- a local run record for every agent session
- webhook signature verification
- idempotent event handling
- duplicate event detection
- explicit state machine transitions
- max runtime and max spend caps
- per-tool permission policy
- outcome rubrics stored in version control
- thread-level logs or summaries for worker agents
- human escalation rules for idled sessions
- a receipt artifact after completion
- a rollback or replay plan for failed runs

This is also where [managed-agent FinOps](/blog/400-dollar-overnight-bill-agent-finops) becomes unavoidable. A long-running agent that can reschedule, fan out, call tools, and revise toward an outcome can produce serious value. It can also burn money in a loop if you do not cap it.

## A Concrete Architecture

If I were adding Claude Managed Agents to a developer platform today, I would not start with a chat UI.
I would start with a job table:

```txt
agent_runs
  id
  provider_session_id
  status
  objective
  outcome_rubric_version
  max_runtime_minutes
  max_budget_usd
  created_by
  created_at
  updated_at
  completed_at

agent_events
  id
  provider_event_id
  run_id
  event_type
  provider_resource_id
  received_at
  processed_at
```

Then I would wire webhooks into that table, not directly into business actions.

The webhook handler should only authenticate, dedupe, fetch current state, and store the event. A separate worker should decide whether to notify a human, resume a session, fetch a thread transcript, or mark the run complete.

That extra hop is what lets you debug the system later.

It also makes it easier to swap providers. The same run model can hold Codex automation receipts, Claude Managed Agent sessions, or GitHub Copilot agent tasks.

## What To Watch Next

The next useful features will probably sound boring:

- first-class run budgets
- better thread export
- outcome history diffs
- webhook replay tooling
- built-in dead-letter queues
- per-agent cost attribution
- approval policies as code
- portable receipts across providers

Those are not flashy agent demos. They are the things that make agents safe to use every day.

That is why this Anthropic update matters. It is not just another layer of agent capability. It is another step toward agents being operated like backend systems.

The teams that win will not be the teams with the most dramatic autonomous demo. They will be the teams whose agents can fail quietly, resume cleanly, explain what happened, and hand off a receipt a human can trust.
Sources: [Anthropic announcement](https://claude.com/blog/new-in-claude-managed-agents), [Claude Managed Agents multiagent sessions](https://platform.claude.com/docs/en/managed-agents/multi-agent), [Claude Managed Agents webhooks](https://platform.claude.com/docs/en/managed-agents/webhooks), [Claude Managed Agents outcomes](https://platform.claude.com/docs/en/managed-agents/define-outcomes), [Claude Managed Agents launch post](https://claude.com/blog/claude-managed-agents).

## FAQ

### What are Claude Managed Agents?

Claude Managed Agents are Anthropic's hosted infrastructure for running longer-lived Claude agents with managed environments, sessions, tools, files, credentials, tracing, and orchestration features.

### Why compare managed agents to backend jobs?

Because production agent runs need the same mechanics as backend jobs: IDs, states, retries, webhooks, logs, budgets, approvals, and completion criteria. The model is only one part of the runtime.

### What are multiagent sessions in Claude Managed Agents?

Multiagent sessions let a coordinator agent delegate work to other configured agents inside one managed session. Worker agents have isolated context threads while sharing the same container and filesystem.

### What are outcomes in Claude Managed Agents?

Outcomes define what "done" means for an agent run. They use rubric-style criteria so the system can evaluate whether the output is satisfied, needs revision, reached max iterations, or failed.

### How should developers handle Claude Managed Agents webhooks?

Treat them like normal production webhooks. Verify signatures, deduplicate by event ID, fetch current resource state by ID, handle retries, and never assume delivery ordering.
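
As a rough illustration of the webhook advice in this post (verify, dedupe, store, and leave decisions to a separate worker), here is a minimal Python sketch. It is not Anthropic's documented format: the HMAC-SHA256 hex signature scheme, the payload fields `id`, `type`, and `resource_id`, and the `session.idle` event name used in the usage note are all assumptions made for the example. The table loosely mirrors the `agent_events` shape from the job table above.

```python
import hashlib
import hmac
import json
import sqlite3


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    # Assumed scheme: HMAC-SHA256 over the raw request body, hex-encoded.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_hex)


def open_store() -> sqlite3.Connection:
    # In-memory store for the sketch; a real handler would use a durable database.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE agent_events ("
        " provider_event_id TEXT PRIMARY KEY,"
        " event_type TEXT NOT NULL,"
        " provider_resource_id TEXT NOT NULL,"
        " processed_at TEXT)"
    )
    return db


def handle_webhook(
    db: sqlite3.Connection, secret: bytes, body: bytes, signature_hex: str
) -> str:
    # Step 1: authenticate before parsing anything.
    if not verify_signature(secret, body, signature_hex):
        return "rejected"
    event = json.loads(body)  # hypothetical payload shape
    # Step 2: dedupe via the primary key; a repeated delivery inserts nothing.
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO agent_events"
            " (provider_event_id, event_type, provider_resource_id)"
            " VALUES (?, ?, ?)",
            (event["id"], event["type"], event["resource_id"]),
        )
    # The handler only stores the event. A separate worker would fetch the
    # current resource state by ID and decide what to do next.
    return "stored" if cur.rowcount == 1 else "duplicate"
```

Calling `handle_webhook` twice with the same signed body (for example, a hypothetical `session.idle` event) stores one row and reports the second delivery as a duplicate, which is the property that keeps retries from triggering the same action twice.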

Agent-Native Backends Are the Next AI Coding Bottleneck

Developers Digest — Fri, 08 May 2026 00:00:00 GMT

The most interesting backend trend on GitHub this morning is not "another Supabase alternative." It is the shape of the interface.

[InsForge](https://github.com/InsForge/InsForge) describes itself as an open-source backend platform for agentic coding. The pitch is direct: give coding agents database, auth, storage, compute, hosting, and an AI gateway so they can ship full-stack apps end to end. The project exposes those backend primitives through an MCP server, plus a CLI and skills path for cloud users.

That matters because AI coding agents are getting weirdly good at the frontend half of software and still fragile around the backend half.

A model can generate a Next.js page, wire a form, and make the UI look decent. The failure mode usually shows up one layer deeper: wrong schema assumptions, missing migrations, auth rules that look plausible but are unsafe, storage buckets with unclear policies, functions deployed without logs, or a production deploy that the agent never actually verified.

That is the same operating lesson behind [terminal agents becoming portable runtime surfaces](/blog/terminal-agents-portable-runtime-surface) and [long-running agents needing harnesses](/blog/long-running-agents-need-harnesses). Once the agent can change real infrastructure, the runtime around the model matters more than the prompt.

## The Take

The next backend platform category is not just backend-as-a-service.

It is **backend-as-an-agent-control-plane**.

That sounds like vendor language, but the distinction is practical. A normal backend platform is optimized for a human developer reading docs, clicking dashboards, writing migrations, and checking logs. An agent-native backend needs to expose the same primitives as structured operations the agent can inspect, change, verify, and report back on.
InsForge is interesting because its README names those verbs:

- read backend context and state
- pull documentation, schemas, metadata, deployed functions, bucket contents, auth config, and runtime logs
- deploy edge functions
- run database migrations
- create storage buckets
- set up auth providers
- configure backend resources directly

That list is not just a feature list. It is a definition of what an agent needs to safely touch a backend.

For a broader stack decision, pair this with [Convex vs Supabase for AI apps](/blog/convex-vs-supabase-ai-apps) and the [Next.js AI app stack guide](/blog/nextjs-ai-app-stack-2026). Those posts answer which backend feels good to humans. This post is about what changes when an agent is the operator.

## Why Backends Break Agents

Backends punish uncertainty.

Frontend code can be visually inspected. If the padding is wrong, the page looks wrong. If a component imports the wrong icon, the build usually catches it. If the agent makes a bad layout choice, you can screenshot it and iterate.

Backend mistakes hide longer.

A generated migration can pass locally and still fail against production data. An auth rule can satisfy the happy path while leaking a tenant boundary. A storage upload can work for the owner and fail for a collaborator. A serverless function can deploy but time out under real input. A model gateway can be wired correctly but blow through cost because nobody set a session cap.

That is why [agent skills need exit criteria](/blog/agent-skills-production-checklist). "Build the backend" is too vague. The useful instruction is closer to:

> Change the schema, apply the migration, update the SDK usage, verify auth behavior, inspect logs, run the route smoke test, and leave a receipt.

The agent cannot do that reliably if every backend operation lives behind a dashboard built for humans.

## What Agent-Native Actually Means

Agent-native does not mean "the backend has AI features."
It means the backend gives the agent a constrained operating surface:

### 1. Discoverable State

The agent needs to ask what exists before it edits anything.

That includes schemas, tables, policies, functions, storage buckets, secrets that are present but not exposed, deploy history, logs, and environment shape. The goal is not to dump the whole system into context. The goal is to return compact, structured facts the agent can reason over.

This is the backend version of [the context reduction pattern](/blog/agent-context-reduction-pattern). Keep the large state in the system. Return the summary, evidence, and next safe action.

### 2. Safe Mutations

"Run arbitrary SQL" is powerful, but it is not enough.

An agent-native backend should separate read-only inspection, proposed migrations, applied migrations, function deploys, auth config changes, and destructive operations. Each category should be visible in the transcript. Risky operations should be gated. The platform should make it easy to preview and roll back where possible.

That is the same permission-boundary problem terminal agents are solving with approvals and sandboxing. Backends need the equivalent.

### 3. Verification Hooks

Agents need a short path from "I changed it" to "I proved it works."

For backend work, that means logs, health checks, migration status, endpoint tests, auth policy checks, and deployed function output need to be callable from the same surface the agent used to make the change.

This is where normal BaaS dashboards fall short for automation. They are excellent for humans. They are not always excellent as machine-verifiable receipts.

### 4. Portable Primitives

InsForge's primitive list is familiar: Postgres, auth, S3-compatible storage, edge functions, model gateway, compute, deployment.

That familiarity is a feature. The agent should not have to learn a new database concept for every project. It should learn the team's conventions around boring primitives.
The better the platform maps to known infrastructure, the easier it is to review the agent's work.

## The Opposing Take

There is a fair skeptical read here: do we really need another backend platform because coding agents exist?

Maybe not.

Supabase, Convex, Neon, Clerk, Railway, Fly.io, Cloudflare, Vercel, and plain Docker already cover most backend needs. The best developer teams can build an agent-readable layer around those tools with CLIs, APIs, docs, migrations, and smoke tests. In many cases, that is the right answer.

The risk with a new agent-native platform is abstraction drift. If the agent learns a simplified control plane but production behavior lives in the underlying database, storage system, auth provider, and deployment target, the abstraction can hide the exact details that matter during an incident.

There is also a security angle. Giving an agent backend tools is not automatically safer than giving it shell access. It is only safer if permissions, logs, previews, approvals, and rollback boundaries are better than the raw tools they replace.

So the bar should be high.

Do not evaluate InsForge or any agent-native backend by whether the demo scaffolds an app. Evaluate whether it makes backend changes more inspectable than the tools you already use.

## The Evaluation Checklist

If a backend claims to be built for agents, I would score it on these questions:

- Can the agent list the current schema, functions, auth config, buckets, and deployment state without overloading context?
- Can it propose a migration before applying it?
- Can destructive actions require explicit approval?
- Can every mutation produce a receipt with who changed what, when, and why?
- Can the agent read runtime logs after a failed deploy?
- Can it run a route-level smoke test after creating an endpoint?
- Can it verify auth and storage policies from multiple user roles?
- Can it export enough state for human review in a pull request?
- Can it work locally and in production without hiding environmental differences?
- Can the team bypass the agent layer and use standard Postgres, S3, functions, and deploy tooling when needed?

That last question matters. The agent layer should make common work safer. It should not become the only way to understand the system.

## My Take

InsForge is worth watching because it names a real bottleneck.

AI coding agents are no longer blocked by generating files. They are blocked by the need to operate real systems safely: repos, browsers, CI, deployments, databases, auth, storage, logs, and cost controls.

The frontend agent story is already crowded. The backend operator story is earlier and more important. Whoever makes backend state inspectable, mutations gated, and verification receipts automatic will have a real wedge.

That does not mean every team should migrate to a new backend. It means every team using coding agents should ask whether their backend is legible to the agent.

If the answer is no, the agent will keep guessing.

And backend guesses are expensive.

Sources: [InsForge GitHub repository](https://github.com/InsForge/InsForge), [InsForge docs](https://docs.insforge.dev/introduction), [Supabase docs](https://supabase.com/docs), [Convex docs](https://docs.convex.dev/), [Model Context Protocol introduction](https://modelcontextprotocol.io/introduction).

## FAQ

### What is InsForge?

InsForge is an open-source backend platform for agentic coding. It combines backend primitives such as Postgres, auth, storage, edge functions, a model gateway, compute, and deployment with agent-facing interfaces such as MCP, CLI commands, and skills.

### Is InsForge a Supabase alternative?

Partly, but the more interesting framing is agent-native backend control plane. Supabase is a mature backend platform for human developers. InsForge is trying to make backend operations directly inspectable and operable by coding agents.

### Do coding agents need backend-specific tools?
Yes, if they are expected to do more than edit frontend files. Backend work requires schema awareness, migration control, policy checks, logs, deployment state, and verification receipts. A general shell can do some of that, but a constrained backend surface can make the work safer and easier to review.

### Should teams migrate their backend for AI coding agents?

Not by default. Start by making the existing backend legible: document schemas, expose safe CLI commands, add smoke tests, preserve migration receipts, and make logs easy to inspect. Consider an agent-native platform only if it improves control and verification over your current stack.
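
One way to start making an existing backend legible, without adopting a new platform, is to route every agent-requested operation through a small gate that classifies it and emits a receipt. The Python sketch below is illustrative only: the operation categories follow the "Safe Mutations" list in this post, but the names `OpKind`, `Receipt`, and `gate`, and the choice of which kinds require approval, are hypothetical and would be a team policy decision rather than anything InsForge defines.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class OpKind(Enum):
    READ = "read"
    PROPOSE_MIGRATION = "propose_migration"
    APPLY_MIGRATION = "apply_migration"
    DEPLOY_FUNCTION = "deploy_function"
    AUTH_CONFIG = "auth_config"
    DESTRUCTIVE = "destructive"


# Assumption: this approval policy is an example, not a platform rule.
NEEDS_APPROVAL = {OpKind.APPLY_MIGRATION, OpKind.AUTH_CONFIG, OpKind.DESTRUCTIVE}


@dataclass
class Receipt:
    # Records who asked for what, on which target, why, and whether it was allowed.
    actor: str
    kind: OpKind
    target: str
    reason: str
    approved_by: Optional[str]
    allowed: bool
    at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def gate(
    actor: str,
    kind: OpKind,
    target: str,
    reason: str,
    approved_by: Optional[str] = None,
) -> Receipt:
    # Gated kinds are allowed only when a human approver is named.
    allowed = kind not in NEEDS_APPROVAL or approved_by is not None
    return Receipt(actor, kind, target, reason, approved_by, allowed)
```

A read such as `gate("agent-1", OpKind.READ, "schema:public", "inspect tables")` is allowed immediately, while a destructive operation is refused until `approved_by` is supplied. Either way the receipt exists, so a reviewer can see what the agent attempted, not only what succeeded.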

6 Launches in One Day: The DD Empire Expansion

Developers Digest — Thu, 07 May 2026 00:00:00 GMT

## 6 New Launches In One Day

Today the empire grew by five apps and one Chrome extension. All shipped on the same day, all under [developersdigest.tech](https://developersdigest.tech), all wired into the same auth, deploy, and monitoring spine that runs the rest of the portfolio.

Here is what each one is, why it exists, and where to follow along.

## ssl-watch - Free SSL + DNS Monitor

[ssl-watch](/apps/ssl-watch) is a free SSL, DNS, and domain expiry monitor. Paste a domain once, get email or Slack alerts before a certificate, nameserver, or registration silently breaks production.

Every dev I know has been bitten by this at least once. Most paid options are bundled into uptime suites you do not need. ssl-watch does the one thing.

Coming soon: [/apps/ssl-watch](/apps/ssl-watch).

## ctx-peek - See Inside Your Claude Code Context

[ctx-peek](/apps/ctx-peek) takes a Claude Code transcript and shows exactly what is in the context window - token by token, file by file, with bloat hotspots highlighted.

If your agent suddenly gets dumber after 30 minutes, this is usually why. ctx-peek tells you which files are eating the budget and what to prune.

Coming soon: [/apps/ctx-peek](/apps/ctx-peek).

## modelpick - Pick The Right Model In 4 Questions

[modelpick](/apps/modelpick) is a decision-tree wrapper over the AI Models directory. Answer four questions about your task - latency tolerance, context size, modality, budget - and get back the optimal model, provider, and a price estimate per million tokens.

It exists because nobody should have to memorize the difference between Sonnet 4.5, 4.6, and 4.7 to ship a feature.

Coming soon: [/apps/modelpick](/apps/modelpick).

## dd-pulse - Live Status For Every DD App

[dd-pulse](/apps/dd-pulse) is the live status and metrics dashboard for the entire DD portfolio. Uptime, deploy state, weekly active users, all in one page.
We built it for ourselves first - running 25+ Coolify apps without a unified pulse view was getting silly - and then realized other multi-app builders need the same thing.

Coming soon: [/apps/dd-pulse](/apps/dd-pulse).

## og-forge - Branded OG Images In 200ms

[og-forge](/apps/og-forge) is a hosted OG-image API. Pass a URL or params, get back a branded preview card in roughly 200ms. Templates ship for blog posts, repos, products, and changelog entries.

Every DD app already burns hours on per-product OG generators. og-forge collapses that into one endpoint with caching and a decent default look.

Coming soon: [/apps/og-forge](/apps/og-forge).

## dd-extension - The Empire In Your Omnibar

The Chrome extension is the connective tissue. Type `dd` in the omnibar, hit space, then a slug - `dd ssl-watch`, `dd modelpick`, `dd traces` - and you are in the right app. It also surfaces live status from dd-pulse and lets you save snippets straight into the content engine.

If you use more than two DD apps a day, this is the launcher you want pinned.

Install link drops with the public release.

## Empire Stats After Today

The portfolio now spans **17 products** across **6 categories** - observability, content, agents, education, marketplaces, and developer utilities.

Roughly **70% of active surface area is AI-coding focused**: agent tooling, model selection, context inspection, traces, skills, MCP servers. The rest is the infra that makes the AI-coding work pay rent - auth, payments, status, OG images, SSL.

Same Coolify cluster. Same Convex + Clerk + Stripe stack. Same push-to-deploy pipeline. The cost of adding the sixth thing today was lower than the cost of adding the second thing six months ago, which is the entire point of building an empire on one spine instead of six.

## What Comes Next

Each of the five apps is in `coming soon` state today. Public betas roll out across the next two weeks, in roughly the order listed above.
The Chrome extension goes to the Web Store once we finish the review prep.

If you want to be in the first wave, the [/apps](/apps) directory is the source of truth - every product gets a status pill the moment it goes live. No newsletter blast, no countdown, just the page updating.

Six things shipped today. We will keep going.

DevDigest OS: The Thesis Behind Treating an Empire as One Operating System

Developers Digest — Thu, 07 May 2026 00:00:00 GMT

## One Question

What if your dev tools weren't separate apps but one operating system?

Not a suite. Not a platform. An OS - a shared substrate where every tool knows about every other tool, every output is an input somewhere else, and the catalog itself is a protocol other agents can read.

That is the thesis behind [DevDigest OS](/os). It is also why we shipped [/suites](/suites). The marketing pages are the surface. This post is the argument.

## The Thesis in One Paragraph

Each DevDigest app earns its place by solving exactly one thing well. None of them are platforms. None of them try to swallow your stack. But they share conventions - design language, auth, embeds, the apps catalog - and that shared layer is what turns a portfolio of single-purpose tools into something that behaves like an operating system for shipping.

A platform asks you to migrate. An OS asks you to plug in.

## Each App Earns Its Place

The rule is simple: if you can describe what an app does in one sentence and a developer nods, it ships. If the sentence needs an "and" or a "plus," it is two apps and we split it.

- [ShipBadge](https://shipbadge.dev) - embed a "shipping today" badge on any project.
- [DD Pulse](https://pulse.developersdigest.tech) - uptime + status pages for indie products.
- [OG Forge](https://ogforge.dev) - generate social cards from a URL.
- [ctx-peek](https://ctxpeek.dev) - peek at any AI agent's context window.
- [TraceTrail](https://tracetrail.dev) - replay agent runs step by step.
- [SponsorKit](https://sponsorkit.dev) - sponsor pages with one config file.

Each of these is a complete product on its own. None require any of the others. That is intentional. The OS only works if every component survives being used in isolation.

## But Together They Loop

The interesting work happens at the seams.

**ShipBadge → DD Pulse → status pages on every app you ship.** You wire ShipBadge into a new repo.
ShipBadge sees you also use DD Pulse and offers a one-click upgrade: the same badge now renders live uptime data. The status page DD Pulse generates embeds the badge back. Two apps, one feedback loop, no integration code.

**OG Forge → ctx-peek → public profiles for your AI work.** ctx-peek captures an agent run. OG Forge auto-generates a social card from the trace. The profile page on ctx-peek embeds the OG Forge image and links back. You posted a tweet about an agent run; the tweet card was built by another DD app you forgot you owned.

**TraceTrail → DD Pulse → reliability dashboards for agents.** TraceTrail records agent runs. DD Pulse turns the failure rate into an uptime metric. A status page now answers "is my agent reliable today" alongside "is my API up."

These loops are not features we built. They emerged the moment two apps shared the same conventions. That is the OS dividend.

## Cross-App Conventions: The Real Product

The apps are the demos. The conventions are the product.

- **Every output is shareable.** Every artifact in every DD app has a public URL. No login walls on outputs.
- **Every output is embeddable.** Every public URL has an embed variant - iframe, oEmbed, or Markdown shortcode. ShipBadge in your README, OG Forge in your blog, ctx-peek in your tweet.
- **Every output links back.** Embeds carry attribution. The attribution is a link to the source app. The source app is the catalog entry. The catalog entry surfaces the next adjacent tool.

Read those three rules in sequence and you have described how the empire compounds without us writing a single integration.

## The Chrome Extension as Desktop Shell

If the apps are programs, the [DevDigest Chrome extension](/extension) is the desktop. It overlays the browser with a launcher, a clipboard that knows about every DD app, and context-aware actions on any page you visit.

You are reading a Vercel dashboard? The extension offers "monitor with DD Pulse." You are looking at a GitHub repo?
It offers "embed ShipBadge." You are debugging an agent in the Claude Code sidebar? It offers "open in TraceTrail."

The extension is the only place a user sees the OS as a single thing. Everywhere else, the apps stay sharp and singular. That separation is on purpose. The shell is opinionated; the apps are not.

## /api/apps - The Catalog as Protocol

The piece most people miss: [/api/apps](/api/apps) is a public JSON endpoint. It returns the entire DevDigest catalog - every app, its tagline, its embed schema, its OG-card endpoint, its status page.

That endpoint is consumed by:

1. The Chrome extension launcher.
2. The [/suites](/suites) page.
3. Our own internal cross-promotion banners.
4. Third-party agents that want to introspect the empire.

That last one is the lever. When an LLM agent asks "what tool can generate a social card from a URL," `/api/apps` is a single fetch away from a structured answer. The catalog is not marketing copy. It is a discovery protocol other software can consume.

If you want your own indie portfolio to compound like this, expose your catalog. Make it boring JSON. Make it fetchable without auth. The agents are coming for the rest.

## What This Is Not

DevDigest OS is not:

- **A platform.** You do not host on it. You do not deploy to it. There is no SDK lock-in.
- **A bundle.** You do not buy "the suite." Every app prices independently.
- **A monolith.** No app shares a database with another. The shared layer is conventions, not infrastructure.
- **Finished.** The catalog grows whenever a tool earns its sentence.

If any of those become true, we have lost the plot. Drift toward platform is the failure mode.

## The Compounding Argument

Here is the only number that matters: the marginal utility of the *next* DD app is higher than the last.

When we shipped ShipBadge alone, it was a badge service. When DD Pulse landed, ShipBadge became a status indicator. When OG Forge landed, both got social cards for free.
When ctx-peek landed, all three got agent-run trace embeds. Every new app makes the previous apps more useful - not because we rewrite them, but because the conventions hold and the catalog updates.

That is the definition of an operating system: the shared substrate is what creates leverage. A monolith compounds linearly. A pile of apps does not compound at all. An OS - small, sharp tools plus shared conventions plus a public catalog - compounds.

## What To Do Next

If you build indie products, steal the pattern:

1. One app, one sentence. If you cannot explain it without "and," split it.
2. Every output gets a public URL, an embed, and a link back.
3. Publish a `/api/apps`-style catalog. JSON, no auth, stable schema.
4. Build a shell only after you have three apps. Not before.

If you want to see it in motion, the [/os](/os) page is the live tour and [/suites](/suites) is the catalog grouped by job-to-be-done. Everything on both pages is pulled from the same `/api/apps` endpoint that the agents read.

The empire is not the apps. The empire is the layer underneath that makes the apps stop being separate.
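
For anyone taking up step 3 of that list, a catalog can be very small. The Python sketch below shows one possible shape and how an agent-side lookup might use it. It is not the actual `/api/apps` schema: the field names (`slug`, `tagline`, `embed_url`, `og_card_url`, `status_url`), the top-level `apps` key, and the `example.com` URLs are all assumptions made for illustration. Only the taglines are taken from this post.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical fields; a real catalog should pick a schema and keep it stable.
    slug: str
    tagline: str
    embed_url: str
    og_card_url: str
    status_url: str


def to_catalog_json(entries: List[CatalogEntry]) -> str:
    # Boring JSON: one top-level key holding a list of flat objects.
    return json.dumps({"apps": [asdict(e) for e in entries]}, indent=2)


def find_by_keyword(catalog_json: str, keyword: str) -> List[str]:
    # What a consuming agent might do: fetch once, then filter on taglines.
    apps = json.loads(catalog_json)["apps"]
    needle = keyword.lower()
    return [app["slug"] for app in apps if needle in app["tagline"].lower()]


CATALOG = to_catalog_json(
    [
        CatalogEntry(
            slug="og-forge",
            tagline="generate social cards from a URL",
            embed_url="https://example.com/og-forge/embed",
            og_card_url="https://example.com/og-forge/card",
            status_url="https://example.com/og-forge/status",
        ),
        CatalogEntry(
            slug="dd-pulse",
            tagline="uptime + status pages for indie products",
            embed_url="https://example.com/dd-pulse/embed",
            og_card_url="https://example.com/dd-pulse/card",
            status_url="https://example.com/dd-pulse/status",
        ),
    ]
)
```

With this shape, `find_by_keyword(CATALOG, "social card")` narrows the catalog to the `og-forge` entry. A plain substring match is deliberately naive; the point is that a flat, unauthenticated JSON document is enough for other software to build on.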

Terminal Agents Are Becoming Portable Runtime Surfaces

Developers Digest — Thu, 07 May 2026 00:00:00 GMT

DeepSeek-TUI hit the front page of GitHub trending because it is easy to describe: Claude Code, but wired around DeepSeek models.

That framing is useful, but it undersells the bigger shift. The interesting part is not the clone label. The interesting part is that the agent runtime is becoming portable.

The [DeepSeek-TUI repo](https://github.com/Hmbown/DeepSeek-TUI) describes a terminal coding agent with local file editing, shell execution, git operations, subagents, MCP servers, approval modes, rollback snapshots, durable background tasks, an HTTP/SSE runtime API, LSP diagnostics, skills, and live cost tracking. Whether that particular project becomes a daily driver is less important than what it proves: developers now expect the terminal agent surface to be separable from one model vendor.

## Quick verdict

- If you are choosing a coding agent today, start with [/compare](/compare) and [/pricing](/pricing).
- If you want a deeper three-way decision, start with [Claude Code vs Cursor vs Codex](/blog/claude-code-vs-cursor-vs-codex-2026).
- If you are evaluating DeepSeek-TUI specifically, start with the tool card: [/tools/deepseek-tui](/tools/deepseek-tui).

That is the same market pressure behind [free Claude Code model gateways](/blog/free-claude-code-model-gateway-tradeoffs), [Codex goals](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences), and the newer [Claude Code token-burn observability debate](/blog/claude-code-token-burn-cache-observability). The work is no longer "can the model edit code?" The work is "can the runtime supervise edits safely, cheaply, and repeatably?"

## The Runtime Is The Product

AI coding agents started as model demos. Ask for a function, get a diff. Ask for a test, get a test. The model was the product.

That era is over for serious work.
The model still matters, but the product surface has moved to the runtime around the model:

- how the agent asks for permission before risky commands
- how it snapshots state before a turn
- how it restores a bad edit
- how it shows diagnostics after changing files
- how it compacts context before cost explodes
- how it reports token spend and cache behavior
- how it lets subagents split work without losing receipts
- how it resumes after a restart
- how it exposes a headless API for loops and CI

DeepSeek-TUI's feature list reads like a checklist for that runtime layer. Plan, Agent, and YOLO modes are not model features. Rollback snapshots are not model features. LSP diagnostics are not model features. Durable task queues are not model features. They are harness features.

That is why this belongs next to [long-running agents need harnesses, not hope](/blog/long-running-agents-need-harnesses). Once an agent can touch a real repo, the harness becomes the difference between "neat demo" and "tool I can leave alone for 20 minutes."

## The Portability Pressure Is Real

Developers do not want one perfect agent. They want a stable operating model that can survive model churn.

Today that might mean Claude Code for planning-heavy repo work, Codex for background tasks and review loops, Cursor for inline IDE edits, and a DeepSeek or Qwen-backed tool for cheaper exploratory passes. Tomorrow it will be a different mix. The platform that wins is the one that makes those swaps boring.

The DeepSeek-TUI README is explicit that `auto` is a local routing mode: the runtime decides whether a turn should use Flash or Pro and what thinking level it needs before sending a concrete model request upstream. That is the right shape. Model routing should be visible, local, and accountable. If a cheap model handled the job, show that. If a harder turn moved up to the stronger model, show that too.

This is also where [Codex vs Claude Code](/blog/codex-vs-claude-code-april-2026) comparisons need to mature. "Which model is smarter?" is too shallow. The real questions are:

- Can I pin model choice for repeatable benchmarking?
- Can I set a cost ceiling before the run starts?
- Can I inspect why a router escalated?
- Can I keep the same approval policy across providers?
- Can I replay the run after a bad edit?
- Can I export the session as evidence?

That is what portable agent infrastructure looks like.

## Why The Clone Critique Is Too Easy

The obvious opposing take is fair: a lot of AI developer tools are derivative. A fast GitHub trend can be novelty, not staying power. A Claude Code-shaped terminal app with another backend does not automatically become production infrastructure.

There are real risks:

- Approval modes can look safe while still allowing dangerous shell paths.
- Rollback snapshots can give false confidence if generated files, databases, or external services changed outside git.
- Cost telemetry can be approximate if provider accounting is opaque.
- Subagents can multiply confusion if they do not leave clean receipts.
- Skills can rot into another prompt pile if they are not short and tested.
- A fast-moving repo can have gaps in security review, package provenance, or dependency discipline.

That critique matters. The right answer is not to install every trending agent. The right answer is to evaluate the runtime primitives one by one.

This is the same point behind [agent swarms need receipts](/blog/agent-swarms-need-receipts). More agents are not automatically better. More visible state is better. More rollback control is better. More deterministic verification is better.

## What A Serious Terminal Agent Needs

If a team is evaluating DeepSeek-TUI, Codex, Claude Code, Cursor CLI, Kimi, Droid, or any other terminal agent, I would score the runtime before the model.

### 1. Permission Boundaries

The minimum viable control plane is not "ask before shell." It is a permission system that separates read-only exploration, interactive editing, and auto-approved execution.

Claude Code has permissions, hooks, and settings. Codex has permission profiles and sandboxing. DeepSeek-TUI advertises Plan, Agent, and YOLO modes. Different names, same requirement: the agent should know when it is allowed to observe, edit, execute, and escalate.

The best runtimes make the policy visible in the UI and hard to bypass accidentally.

### 2. Rollback And Repro

Rollback has to be more than "git checkout." A useful runtime should know what changed during a turn, what commands ran, what diagnostics appeared afterward, and what state can be restored without touching the repo's main `.git` history.

DeepSeek-TUI's side-git snapshot idea is interesting because it treats rollback as an agent-runtime concern rather than a human cleanup chore.

For production teams, rollback should pair with replay. If an agent made a risky edit, you need to know the exact instruction, tool calls, diff, and verification output that led there. That is why [agent replays](/blog/agent-replays-with-tracetrail) and local transcripts matter.

### 3. Diagnostics In The Loop

The model should not wait for a human to paste TypeScript errors back into chat.

DeepSeek-TUI advertises LSP diagnostics after edits through tools like rust-analyzer, pyright, typescript-language-server, gopls, and clangd. That is the right direction. The runtime should feed compiler and language-server feedback into the next turn automatically, because that is how real coding works.

Codex and Claude Code users already do this manually by running `pnpm typecheck`, `cargo test`, `go test`, or focused linters. A stronger runtime makes the common loop automatic while still leaving the final verification command explicit.

### 4. Cost And Cache Telemetry

The latest [Claude Code token burn post](/blog/claude-code-token-burn-cache-observability) makes the same point from the other side: coding agents need a usage dashboard that developers can debug.

DeepSeek-TUI claims live cost tracking plus cache hit/miss breakdowns. That is exactly the category to watch. A terminal agent should show:

- model selected
- thinking level selected
- input and output tokens
- cached versus uncached input
- per-turn estimated cost
- session total
- router decisions
- context compaction events

Without that, "cheap model" can become expensive by accident. With it, a team can choose when to route cheap, when to route smart, and when to stop.

### 5. Background Work With Stop Conditions

Durable task queues and HTTP/SSE runtime APIs sound like implementation details, but they are the bridge from chat to operations.

A terminal agent that can survive restarts and expose headless control can become a loop: watch a PR, fix deterministic CI failures, re-run tests, report when blocked, and stop when the same failure repeats.

That is the [Codex loops](/blog/codex-loops-boris-cherny-agent-routines) lane. The hard part is not starting background work. The hard part is making it stop clearly.

## The Buying Criteria Changed

The old buyer question was:

> Which AI coding model writes the best code?

The new buyer question is:

> Which agent runtime lets my team supervise model work without losing control?

That changes the shortlist.

A great model with weak approvals is risky. A cheap model with no telemetry is not really cheap. A fast agent with no rollback is a liability. A beautiful UI with no headless API is limited to interactive work. A swarm system with no receipts is just parallel uncertainty.

This is why DeepSeek-TUI is a useful signal even if you never install it.
It shows what developers now expect from an open terminal agent:

- multiple model routes
- local workspace control
- approval modes
- rollback
- diagnostics
- skills
- subagents
- MCP
- cost telemetry
- resumable sessions
- background execution

That list is becoming table stakes.

## My Take

Do not treat DeepSeek-TUI as "the Claude Code clone of the week." Treat it as evidence that the terminal-agent runtime is becoming a commodity surface.

That is good for developers. It means the useful parts of agent systems are being named, copied, tested, and recombined. It also means the bar should go up. If a new coding agent launches without approvals, rollback, diagnostics, cost telemetry, session export, and clear provider routing, it is not competing with Claude Code or Codex. It is competing with last year's demo.

The next durable layer is not one more chat window. It is the portable agent runtime: a control plane where models can change, but the team's operating rules stay intact.

Sources: [DeepSeek-TUI on GitHub](https://github.com/Hmbown/DeepSeek-TUI), [OpenAI Codex app announcement](https://openai.com/index/introducing-the-codex-app/), [Claude Code features overview](https://code.claude.com/docs/en/features-overview), [Claude Code hooks reference](https://docs.anthropic.com/en/docs/claude-code/hooks), [Claude Code subagents docs](https://code.claude.com/docs/en/sub-agents).

## FAQ

### What is DeepSeek-TUI?

DeepSeek-TUI is an open-source terminal coding agent built around DeepSeek models. It can read and edit local files, run shell commands, manage git workflows, use subagents, connect to MCP servers, report cost telemetry, and expose a terminal UI for supervised agent work.

### Is DeepSeek-TUI just a Claude Code clone?

It is clearly inspired by Claude Code-style terminal agent workflows, but the more useful way to read it is as a portable runtime experiment. The important question is not whether it resembles another tool.
The important question is whether its approvals, rollback, diagnostics, cost tracking, and model routing are strong enough for real work.

### Why do terminal agents need rollback?

Terminal agents can edit files, run commands, and change local state. Rollback gives the user a way to inspect and recover from a bad turn without manually reconstructing every change. For serious use, rollback should be paired with transcripts, diffs, command logs, and verification output.

### Should teams use multiple coding agents?

Yes, but only with clear boundaries. One agent might be better for planning, another for background review, another for cheap exploratory work, and another for IDE edits. The key is to keep the runtime rules consistent: permissions, tests, receipts, cost limits, and escalation paths.

### What should I look for before adopting a new terminal agent?

Start with the runtime, not the model. Check permission modes, sandbox behavior, rollback, transcript export, diagnostics, context compaction, cost telemetry, model routing, subagent isolation, and whether the tool can run headless for CI or recurring workflows. Then benchmark model quality inside your own repo.
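
The checklist in that last answer can be kept as a small scorecard so different tools are judged against the same list. A minimal sketch, assuming an unweighted pass/fail per check: the `RUNTIME_CHECKS` names and the `score_runtime` helper are hypothetical, and a real evaluation would probably weight permission and rollback behavior more heavily.

```python
from typing import Dict, List, Tuple

# Check names taken from the adoption checklist; the flat, unweighted scoring is an assumption.
RUNTIME_CHECKS: List[str] = [
    "permission modes",
    "sandbox behavior",
    "rollback",
    "transcript export",
    "diagnostics",
    "context compaction",
    "cost telemetry",
    "model routing",
    "subagent isolation",
    "headless mode",
]


def score_runtime(observed: Dict[str, bool]) -> Tuple[int, List[str]]:
    """Return (checks passed, checks still missing) for one tool.

    Any check absent from `observed` counts as missing, so unverified
    features never score as present.
    """
    missing = [check for check in RUNTIME_CHECKS if not observed.get(check, False)]
    return len(RUNTIME_CHECKS) - len(missing), missing


if __name__ == "__main__":
    passed, missing = score_runtime({"rollback": True, "diagnostics": True})
    print(f"{passed}/{len(RUNTIME_CHECKS)} checks passed; missing: {', '.join(missing)}")
```

Treating unknown as missing is deliberate: a feature that is only advertised in a README should not count until someone on the team has verified it.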

What Is Cline? The Open-Source AI Coding Tool That Runs in VS Code

Developers Digest — Thu, 07 May 2026 00:00:00 GMT

Cline is an open-source VS Code extension that turns your editor into an autonomous AI coding environment. Unlike autocomplete tools that suggest the next line, Cline operates as an agent - it reads files, writes code, runs terminal commands, and iterates on errors without constant hand-holding.

The tool is free to install. The code is open source under the Apache 2.0 license. You bring your own API key for cloud models like Claude or OpenAI models, or you run local models through Ollama and pay nothing at all.

This guide covers what Cline is, how it compares to paid alternatives, and whether it fits your workflow.

## Quick verdict

Cline is the best open-source VS Code agent if you want model choice and control. It is a great fit when you want agentic workflows (multi-file edits, command runs, error recovery) without switching editors or locking into one vendor.

- Want the short tool-card summary? Start at [Cline in the tools directory](/tools/cline).
- If you want a more polished integrated UX, start with [Cursor](/blog/what-is-cursor-ai-code-editor-2026) or [Windsurf](/blog/windsurf-vs-cursor).
- If you want a terminal-native agent workflow, start with [Claude Code](/blog/what-is-claude-code-complete-guide-2026) or [Codex](/blog/openai-codex-guide).
- If cost is the deciding factor, start with the [pricing hub](/pricing) and the [AI coding tools pricing table](/blog/ai-coding-tools-pricing-2026).

## Why Cline Exists

The AI coding market split into two camps. On one side: commercial products like [Cursor](/blog/what-is-cursor-ai-code-editor-2026), [Windsurf](/blog/windsurf-vs-cursor), and [GitHub Copilot](/blog/github-copilot-guide) that bundle models, UX, and subscriptions together. On the other side: open-source tools that prioritize flexibility and user control.

Cline sits in the second camp. It does not try to replace your editor - it adds AI capabilities to the VS Code you already use.
It does not lock you into a single model provider - it connects to whatever backend you configure. And it does not charge a subscription - you own the tool.

For developers who want agentic AI coding without vendor dependencies, Cline is the most capable open-source option available in VS Code.

## What Cline Can Do

Cline is an agent, not an autocomplete engine. The difference matters.

Autocomplete tools (like basic Copilot) predict the next tokens based on your cursor position. They are reactive. You write, they suggest.

Agentic tools (like Cline, [Claude Code](/blog/what-is-claude-code-complete-guide-2026), and [Codex](/blog/openai-codex-guide)) make decisions. You describe a task, and the agent figures out which files to read, what code to write, which commands to run, and how to fix errors when things break.

Cline's core capabilities include:

**Multi-file code generation.** Cline reads your project structure and writes code across multiple files in a single task. If you ask it to add a feature, it might create a new component, update imports, modify tests, and adjust configuration - all without you specifying each file.

**Terminal command execution.** Cline runs shell commands directly. It can install dependencies, run builds, execute tests, and read output. When a command fails, it sees the error and attempts to fix the underlying code.

**File system access.** Cline reads files and directories (respecting `.gitignore`), writes new files, and edits existing ones. It understands project context because it can actually see your code.

**MCP (Model Context Protocol) support.** Cline integrates with [MCP servers](/blog/what-is-mcp) for extended capabilities - database access, API connections, browser automation, and custom tools. This makes Cline extensible beyond its built-in features.

**Multi-model flexibility.** Cline works with local models through Ollama, or cloud models through API keys for Claude, OpenAI models, Gemini, Azure OpenAI, and others.
You choose the model based on task, cost, and privacy requirements.

**Iterative error correction.** When something fails - a test, a build, a command - Cline reads the output and tries again. This loop continues until the task succeeds or you intervene.

The combination makes Cline a genuine coding agent rather than a fancy autocomplete.

## How Cline Works

Cline runs as a VS Code extension with a sidebar panel. You open the panel, describe what you want, and Cline executes.

The interaction model is chat-based, similar to ChatGPT or Claude. But unlike web chat interfaces, Cline has direct access to your workspace. It does not need you to paste code snippets or describe file contents - it reads them directly.

A typical workflow looks like:

1. You describe a task: "Add error handling to the API routes in `src/api/`"
2. Cline reads the relevant files to understand the current code
3. Cline proposes changes and explains its approach
4. You approve (or Cline auto-executes if you have enabled that mode)
5. Cline writes the changes across all affected files
6. Cline runs tests or builds to verify
7. If errors appear, Cline reads the output and adjusts

The agent loop continues until the task is complete or you stop it.

## Model Options

Cline is model-agnostic. You pick the backend.

### Cloud Models

For cloud models, you paste an API key and Cline calls the provider directly:

- **Anthropic Claude** - Claude Sonnet and Opus through the Anthropic API
- **OpenAI** - OpenAI models (GPT and more)
- **Google Gemini** - Gemini Pro and Ultra through the Google AI API
- **Azure OpenAI** - Enterprise deployments with Azure endpoints
- **OpenRouter** - A proxy that routes to multiple providers

Cloud models offer the strongest reasoning quality, especially Claude Opus and higher-tier OpenAI models. The tradeoff is cost (you pay per token) and data leaving your machine.
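
Because cloud usage is billed per token, it helps to estimate a task's cost before delegating it. A rough sketch of the arithmetic: the `estimate_cost_usd` helper is hypothetical, and the prices in the example are placeholders rather than any provider's real rates, so substitute the numbers from your provider's pricing page.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the cost of one task from token counts and per-million-token prices."""
    total = input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder prices for illustration only, not real provider rates.
    cost = estimate_cost_usd(
        input_tokens=200_000,
        output_tokens=20_000,
        usd_per_million_input=3.0,
        usd_per_million_output=15.0,
    )
    print(f"estimated cost: ${cost:.2f}")  # prints "estimated cost: $0.90"
```

Agent tasks reread files and resend context on every turn, so input tokens usually dominate the bill. Count the whole loop, not only the final answer.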

### Local Models

For local models, Cline connects to Ollama running on your machine:

```bash
# Install Ollama from https://ollama.ai
ollama pull deepseek-coder-v2  # A strong coding model
ollama serve                   # Start the local server
```

Then configure Cline to use Ollama as the provider.

Local models keep everything on your hardware. No API costs, no data transmitted. The tradeoff is model quality - even the best local models lag behind Claude Opus or higher-tier OpenAI models on complex reasoning tasks.

Popular local options for coding:

- **DeepSeek Coder V2** - Strong code generation, relatively fast
- **Mistral** - Good general-purpose model
- **CodeLlama** - Meta's code-focused model
- **Qwen2.5-Coder** - Alibaba's coding model with good performance

For most developers, a hybrid approach works best: use local models for routine tasks and cloud models for complex work that needs stronger reasoning.

## Installation

Setup takes about five minutes.

### Step 1: Install the Extension

Open VS Code, go to Extensions (Cmd+Shift+X / Ctrl+Shift+X), search for "Cline", and install the extension by Saoudrizwan.

### Step 2: Configure a Model

Click the Cline icon in the sidebar to open the panel. Choose your model provider:

**For cloud models:** Select the provider (Anthropic, OpenAI, etc.) and paste your API key.

**For local models:** Install Ollama, pull a model, run `ollama serve`, then select Ollama in Cline's settings.

### Step 3: Start Coding

Type a task in the chat panel. Cline will ask for permission before reading files or running commands (unless you enable auto-approve).

That is the basic setup. For the current recommended install and onboarding paths, follow the official docs.

## Cline vs. Paid Alternatives

The natural question: why use Cline instead of Cursor, Windsurf, or Copilot?

### Cline vs. Cursor

[Cursor](/blog/cursor-ai-code-editor-guide) is a proprietary VS Code fork with integrated AI. It costs $20/month for Pro or $200/month for unlimited usage.
Cursor's UX is polished - inline diffs, composer mode, and tight model integration.

Cline is free and works inside standard VS Code. You keep your existing extensions, settings, and keybindings. But Cline's UI is simpler (a sidebar panel rather than Cursor's multi-mode interface), and you manage model configuration yourself.

**Choose Cline if:** You want open source, existing VS Code setup, or local model support.

**Choose Cursor if:** You want a polished all-in-one product and do not mind vendor lock-in.

### Cline vs. Windsurf

[Windsurf](/blog/windsurf-vs-cursor) (formerly Codeium) is another proprietary AI editor. It has a generous free tier and costs $15/month for Pro. Windsurf's Cascade agent handles multi-step tasks well.

Cline is comparable in agentic capabilities but trades commercial polish for open-source flexibility. Windsurf has better out-of-box model optimization; Cline has better extensibility through MCP.

**Choose Cline if:** Open source and model flexibility matter more than integrated UX.

**Choose Windsurf if:** You want a free or low-cost commercial product with less setup.

### Cline vs. GitHub Copilot

[Copilot](/blog/github-copilot-guide) excels at autocomplete. It suggests code as you type and integrates deeply with GitHub. Copilot's agentic features (Copilot Chat, Copilot Agent) are improving but still behind dedicated agent tools.

Cline is more autonomous. It writes across files, runs commands, and iterates on errors. Copilot's strength is inline suggestions during manual coding; Cline's strength is task delegation.

**Choose Cline if:** You want an autonomous agent rather than autocomplete.

**Choose Copilot if:** You want tight GitHub integration and inline suggestions while you code.

### Cline vs. Claude Code

[Claude Code](/blog/what-is-claude-code-complete-guide-2026) is Anthropic's terminal-based agent. It is not open source, requires an Anthropic subscription ($20-$200/month), and runs in the terminal rather than VS Code.
Claude Code has stronger reasoning (Opus access) and a more mature sub-agent architecture. Cline has VS Code integration and model flexibility.

**Choose Cline if:** You want to stay in VS Code and use multiple model providers.

**Choose Claude Code if:** You want the strongest reasoning quality and prefer terminal workflows.

### Cline vs. Aider

[Aider](/blog/aider-vs-claude-code-2026-update) is an open-source CLI tool for AI coding. It runs in the terminal, supports multiple models, and focuses on git-aware editing.

Cline has VS Code integration; Aider is terminal-only. Both are open source and model-agnostic. Aider has more mature git integration; Cline has MCP extensibility.

**Choose Cline if:** You prefer working inside VS Code.

**Choose Aider if:** You prefer terminal workflows and value git integration.

## When Cline Makes Sense

Cline fits specific developer profiles:

**Privacy-conscious developers.** With local models, code stays on your machine. With cloud models, code goes to the provider you configure.

**Open-source advocates.** Apache 2.0 license means you can fork, modify, and audit the code.

**Multi-model testers.** If you evaluate different models for different tasks, Cline's provider flexibility helps.

**VS Code loyalists.** If your workflow depends on VS Code extensions and settings, Cline adds AI without requiring a new editor.

**Budget-constrained developers.** Free tool plus cheap API calls (or free local models) beats $20-$200/month subscriptions.

**Enterprise teams with data restrictions.** Local-first operation can help meet strict data governance requirements.

## When Cline Does Not Make Sense

Cline has tradeoffs:

**No commercial support.** If something breaks, you file a GitHub issue and wait for community response. No SLA, no phone support, no enterprise contracts.

**Setup required.** Getting optimal performance requires configuring providers, tuning prompts, and sometimes debugging MCP integrations. Cursor and Windsurf work out of the box.

**Weaker models locally.** Local models through Ollama are capable but not Claude-Opus-tier. For complex architectural work, you need cloud APIs (and their costs).

**Less polished UX.** Cline's sidebar interface is functional but lacks Cursor's inline diffs and composer mode. The interaction is more chat-like than integrated.

If you want zero-setup, polished UX, and commercial accountability, paid tools like Cursor or Claude Code are better choices.

## Practical Tips

A few patterns that work well with Cline:

**Start with a plan.** Before asking Cline to code, describe what you want at a high level. "Add authentication to the API" is better than "fix login."

**Let it read first.** Point Cline at the relevant files before asking for changes. Context improves output quality.

**Use cloud models for complex tasks.** Save local models for routine work. Switch to Claude or higher-tier OpenAI models when reasoning quality matters.

**Enable MCP for extended workflows.** If you need database access, browser testing, or API integrations, configure MCP servers to expand Cline's capabilities.

**Review before committing.** Cline edits files directly. Review diffs in VS Code's source control panel before committing changes.

## The Bottom Line

Cline is the best open-source AI coding agent for VS Code. It brings autonomous capabilities - multi-file editing, terminal execution, iterative error correction - without subscriptions or vendor lock-in.

The tradeoff is setup effort and polish. Cursor and Windsurf are easier to start with. Claude Code has stronger reasoning. But if open source, model flexibility, and VS Code integration matter to you, Cline is the right choice.

For developers already paying for Claude or OpenAI API access, Cline is effectively free. For developers willing to run local models, it costs nothing at all.
Install it from the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=saoudrizwan.cline), configure a model, and try delegating a real task. That is the only way to know if agentic AI coding fits your workflow.

## Sources

- Official site: https://cline.bot/
- Official docs: https://docs.cline.bot/
- GitHub repo: https://github.com/cline/cline
- License (Apache 2.0): https://github.com/cline/cline/blob/main/LICENSE
- VS Code Marketplace listing: https://marketplace.visualstudio.com/items?itemName=saoudrizwan.cline

## Frequently Asked Questions

### Is Cline free?

Yes. Cline is open source under the Apache 2.0 license with no licensing fees. You only pay for cloud model API calls if you use Claude, OpenAI models, or similar providers. Using local models through Ollama is completely free.

### What models does Cline support?

Cline works with cloud providers (Anthropic Claude, OpenAI, Google Gemini, Azure OpenAI, OpenRouter) and local models through Ollama. You configure the provider and paste your API key. For local models, you run Ollama on your machine and select Ollama as the provider in Cline's settings.

### How does Cline compare to Cursor?

Cursor is a proprietary VS Code fork with integrated AI at $20-$200/month. Cline is a free VS Code extension. Cursor has a more polished UI with inline diffs and composer mode. Cline keeps you in standard VS Code with your existing setup. Choose Cursor for polish; choose Cline for open source and flexibility.

### Can Cline run terminal commands?

Yes. Cline executes shell commands directly, including builds, tests, package installations, and git operations. It reads command output and uses errors to guide subsequent fixes. You can configure approval requirements for command execution.

### What is MCP and why does Cline support it?

MCP (Model Context Protocol) is a standard for extending AI agent capabilities. Cline uses MCP to connect to databases, APIs, browsers, and custom tools beyond its built-in features.
This makes Cline extensible - you add capabilities without modifying the core tool.

### Is Cline good for large codebases?

Cline handles project-wide context reasonably well, but performance depends on your model choice. Cloud models like Claude handle large context windows better than most local models. For very large monorepos, you may need to scope tasks to specific directories.

### How does Cline handle errors?

When a command or build fails, Cline reads the error output and attempts to fix the underlying code. This loop continues iteratively until the task succeeds or you stop it. The error recovery is one of Cline's strengths compared to simpler autocomplete tools.

### Should I use local models or cloud models?

Use local models (Ollama) for routine tasks, privacy-sensitive work, and cost savings. Use cloud models (Claude, OpenAI models) for complex reasoning, architectural decisions, and tasks where quality matters more than cost. Many developers use both, switching based on the task.

## Related Guides

- [Best AI Coding Tools in 2026](/blog/best-ai-coding-tools-2026) - Full comparison of the AI coding landscape
- [AI Coding Tools Pricing Comparison](/blog/ai-coding-tools-pricing-2026) - Cost breakdown for every major tool
- [What Is Claude Code?](/blog/what-is-claude-code-complete-guide-2026) - Anthropic's terminal-based AI agent
- [Cursor AI Guide](/blog/cursor-ai-code-editor-guide) - Deep dive on the leading proprietary AI editor
- [Aider vs Claude Code](/blog/aider-vs-claude-code-2026-update) - Open-source CLI tool comparison

Claude Code Token Burn Is an Observability Problem

Developers Digest — Wed, 06 May 2026 00:00:00 GMT

Claude Code token burn is back in the feed.

The current viral thread started with Alexander Zanfir's writeup, [Claude Diagnosed Its Own Cache Bug](https://medium.com/@alexzanfir/claude-diagnosed-its-own-cache-bug-a-six-month-timeline-332f577e1fe9). The useful part is not whether every claim in the timeline is proven from the outside. The useful part is that a coding agent was asked to audit its own usage, found suspicious cache-flush behavior, and produced a trail that other users could argue with.

That is where the AI coding market is headed. Not "trust the quota bar." Not "trust a Reddit screenshot." Agent usage needs repro-grade observability.

If you are already running [Claude Code](/blog/what-is-claude-code-complete-guide-2026), this belongs next to the [Claude Code usage limits playbook](/blog/claude-code-usage-limits-playbook-2026), [agent FinOps](/blog/400-dollar-overnight-bill-agent-finops), and the recent [Claude Code ops release](/blog/claude-code-2-1-128-mcp-ops). The product keeps getting more capable. The accounting layer has to catch up.

## What actually changed

Anthropic did publish an official postmortem on April 23: [An update on recent Claude Code quality reports](https://www.anthropic.com/engineering/april-23-postmortem). It traced the recent quality issues to three separate changes:

- a reasoning-effort default change that was later reverted
- a stale-session thinking-cache bug that caused repeated cache misses
- a system prompt change that hurt coding quality

The cache section matters most for token burn. Anthropic says the bug caused old thinking to be cleared every turn after a stale session crossed an idle threshold. That made Claude seem forgetful and repetitive, and Anthropic wrote that it likely drove reports of usage limits draining faster than expected.

So the simplified take, "Anthropic never acknowledged anything," is wrong. Zanfir's article now includes a correction on that point.

But the opposing simplified take, "the postmortem means this is over," is also too neat. Users are still reporting confusing usage behavior, the community is still building monitors and workarounds, and Anthropic's own support docs still explain usage in broad plan-level terms rather than session-level cache health.

The lesson is not that every complaint is a confirmed bug. The lesson is that coding-agent usage needs better local evidence.

## Cache misses are a product issue

Prompt caching is usually explained as infrastructure. It should be treated as product behavior.

When a coding agent is working in a large repo, the difference between a healthy cache and a broken cache can be the difference between a useful Max session and a five-hour reset that arrives before the patch is done.

Anthropic's [usage-limit docs](https://support.anthropic.com/en/articles/11647753-understanding-usage-and-length-limits) say usage depends on conversation length, model, features, and product surface. Their [cost-management docs](https://docs.anthropic.com/en/docs/claude-code/costs) also point API users toward historical usage and workspace spend limits. That is useful, but it is not enough for serious agent work.

A developer running a long Claude Code session needs to know:

- how many input tokens were cached versus uncached
- whether cache reads collapsed after resume
- whether thinking blocks are being retained or pruned
- whether MCP calls, subagents, or skills changed the prompt prefix
- which turn caused a quota cliff
- whether the next request is likely to rebuild the whole context

That is not billing trivia. It changes whether you continue the session, compact, restart, split the task, switch models, or stop and file a bug.

## The community is building the missing gauges

This is why the most interesting GitHub signal is not another wrapper promising free usage.
It is tooling like [cc-cache-monitor](https://github.com/AlexZan/cc-cache-monitor), which tries to inspect Claude Code logs and surface cache behavior.

Whether that specific project becomes the standard is less important than the pattern. Developers want the agent equivalent of a network waterfall:

- turn number
- model
- input tokens
- output tokens
- cache reads
- cache writes
- cache misses
- tool calls
- estimated cost
- session reset events

That is the same argument behind [agent receipts](/blog/agent-swarms-need-receipts). Once agents run for hours, "it felt expensive" is not acceptable debugging data.

## The fair critique

There is a fair critique of the community reaction: local reverse engineering can overfit.

Claude Code is a hosted product, a local CLI, an API client, a model harness, a prompt layer, and a quota system at the same time. A user can observe symptoms, logs, and billing effects, but not every server-side decision. Cache behavior can change because of TTLs, model routing, product experiments, stale sessions, prompt changes, or user configuration.

That means public bug claims should be written with care.

But that is exactly why first-party observability matters. When the official product does not expose enough session-level telemetry, the community fills the gap with scripts, screenshots, Reddit threads, and partial reconstructions. Some will be right. Some will be wrong. All of them become louder than they need to be because the product does not provide the obvious facts.

## What Claude Code should expose

Claude Code does not need to expose private chain-of-thought or internal prompts to fix this class of problem. It needs operational counters.

Minimum viable usage telemetry:

| Counter | Why it matters |
|---|---|
| `cache_read_tokens` | Shows whether reused context is actually cheap |
| `cache_write_tokens` | Shows when the session is rebuilding expensive prefixes |
| `uncached_input_tokens` | Separates real new work from repeated context cost |
| `output_tokens` | Identifies verbosity and overthinking failures |
| `thinking_budget` | Shows whether effort settings are driving cost |
| `tool_call_count` | Catches runaway searches, MCP loops, and file rereads |
| `session_age` | Makes idle-resume behavior visible |
| `estimated_plan_usage` | Translates technical counters into quota impact |

Expose it in `/usage`, export it as JSON, and let hooks read it. That would make Claude Code easier to trust without weakening the product.

For teams, the same shape should become an OpenTelemetry stream. We covered the broader [managed-agent FinOps problem](/blog/400-dollar-overnight-bill-agent-finops), but Claude Code is the cleanest consumer example: the user needs one trace per agent run, with model calls and tool calls under it, tagged with usage counters and cost estimates.

## What to do this week

Do not wait for the perfect official dashboard.

1. Upgrade Claude Code and read the release notes before assuming old workarounds still apply.
2. Start long tasks in fresh sessions when cache behavior feels suspicious.
3. Use `/compact` or split tasks before the context gets huge.
4. Track session-level cost or quota burn outside the chat transcript.
5. Add stop hooks that halt repeated failing loops before they become quota loops.
6. Keep a short repro log: version, model, effort setting, session age, resume behavior, and whether MCP/subagents/skills were active.

The goal is not paranoia. The goal is to make usage complaints debuggable.

## The take

The cache-burn controversy is not a reason to abandon Claude Code. It is a reason to operate it like infrastructure.

Claude Code is becoming a serious agent runtime: subagents, hooks, MCP, worktrees, skills, plugins, and long-running loops. Serious runtimes need serious counters.

If prompt caching saves quota, developers should be able to see it. If a stale session starts rebuilding context, developers should be able to catch it before the five-hour reset.

The next differentiator in AI coding tools will not just be model quality. It will be whether the tool can explain what it spent.

## Sources

- Anthropic: [An update on recent Claude Code quality reports](https://www.anthropic.com/engineering/april-23-postmortem)
- Anthropic Help Center: [Understanding usage and length limits](https://support.anthropic.com/en/articles/11647753-understanding-usage-and-length-limits)
- Anthropic Docs: [Manage costs effectively](https://docs.anthropic.com/en/docs/claude-code/costs)
- Alexander Zanfir: [Claude Diagnosed Its Own Cache Bug](https://medium.com/@alexzanfir/claude-diagnosed-its-own-cache-bug-a-six-month-timeline-332f577e1fe9)
- GitHub: [cc-cache-monitor](https://github.com/AlexZan/cc-cache-monitor)

## Frequently Asked Questions

### Why is Claude Code using so much quota?

Claude Code usage depends on model choice, effort setting, conversation length, tool use, attached context, and cache behavior. If a long session repeatedly rebuilds context instead of reading from cache, quota can drain much faster than the visible response length suggests.

### Did Anthropic confirm a Claude Code cache bug?

Yes. Anthropic's April 23 postmortem says a stale-session thinking-cache bug caused prior reasoning to be dropped every turn after an idle threshold and likely contributed to reports of usage limits draining faster than expected. Anthropic says that specific issue was fixed on April 10 in v2.1.101.

### Does that mean every current token-burn complaint is the same bug?

No.
Current reports can come from old client versions, long context, effort settings, MCP behavior, subagents, server-side cache eviction, or unrelated product issues. That is why session-level telemetry matters. ### How do I monitor Claude Code cache behavior? Start by checking Claude Code's built-in usage view and keeping session metadata for suspicious runs. Community tools like `cc-cache-monitor` are emerging to inspect local logs, but treat them as diagnostic aids rather than official billing truth. ### What should Claude Code expose in `/usage`? At minimum: cached input tokens, uncached input tokens, cache writes, output tokens, thinking budget, tool-call count, session age, model, effort setting, and estimated quota impact per turn. ### Should teams stop using long-running Claude Code sessions? No. Long sessions are still useful for deep coding work. Teams should add iteration caps, stop hooks, fresh-session checkpoints, and usage telemetry so long runs fail visibly instead of quietly burning quota.
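
The counters this post asks for can be sketched as a per-turn record. This is a minimal illustration, not an existing Claude Code API: the `TurnUsage` class, its field names (borrowed from the counter table above), and the 0.5 threshold are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TurnUsage:
    """Hypothetical per-turn usage record; fields mirror the counter table."""

    cache_read_tokens: int
    cache_write_tokens: int
    uncached_input_tokens: int
    output_tokens: int
    tool_call_count: int
    session_age_seconds: float

    def cache_read_ratio(self) -> float:
        """Share of input tokens served from cache (0.0 when there is no input)."""
        total_input = (
            self.cache_read_tokens
            + self.cache_write_tokens
            + self.uncached_input_tokens
        )
        return self.cache_read_tokens / total_input if total_input else 0.0


def looks_like_cache_rebuild(turn: TurnUsage, min_ratio: float = 0.5) -> bool:
    """Flag a turn whose cache-read share falls below an assumed threshold."""
    return turn.cache_read_ratio() < min_ratio


def to_json(turn: TurnUsage) -> str:
    """Serialize the record, plus the derived ratio, for a log or a hook."""
    record = asdict(turn)
    record["cache_read_ratio"] = round(turn.cache_read_ratio(), 3)
    return json.dumps(record, sort_keys=True)
```

A first turn in a fresh session is legitimately cold, so a check like `looks_like_cache_rebuild` is only meaningful from the second turn onward or after an idle resume.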

How We Patched 100+ PRs Across Our App Empire in One Day

Developers Digest — Wed, 06 May 2026 00:00:00 GMT

## The Audit

The Developers Digest empire is now 31 apps deployed across `*.developersdigest.tech`. That number snuck up on me. Each app started life from the same starter template, but templates drift the moment you fork them. Favicons go missing. Someone forgets to wire Google Analytics. The OG card pattern that worked last quarter quietly stops getting copied forward. By the time you have 24 production apps, the variance between them is louder than the consistency.

So I ran a parallel `curl` audit across all 31 hosts. The matrix that came back was not pretty.

**Reachability:** 24 of 31 apps responded with a 200. Seven were down - three returning 5xx (`agentfs`, `hookyard`, `tracetrail`) and four totally unreachable (`agent-eval-bench`, `cost-tape`, `hooks-directory`, `migrate`, `skill-builder`). Two of the dead hosts were still being linked from the public `/apps` page. That alone was an emergency.

**Drift across the 24 reachable apps:**

| Check | Coverage | Missing |
|---|---|---|
| favicon.ico | 17% | 20 / 24 |
| llms.txt | 29% | 17 / 24 |
| OG (full 3/3) | 46% | 13 / 24 |
| sitemap.xml | 75% | 6 / 24 |
| robots.txt | 75% | 6 / 24 |
| GA tag | 75% | 6 / 24 |
| Sentry init | 0% | 24 / 24 |

The Sentry zero stung. The favicon number was the embarrassing one - empty browser tabs across most of the empire.

## The Fanout

The interesting part is what happened next. Instead of opening one big "fix everything" PR, I treated each missing piece as a fanout job. One audit, one fix template, dozens of agents, one PR per repo.
Here is the day's PR ledger:

- **9 `chore: add llms.txt` PRs**
- **17 `chore: add favicon.ico` PRs**
- **4 `chore: add Google Analytics tracking` PRs**
- **16 `chore: add Sentry` PRs** (queued; pending source-tree confirmation)
- **4 robots.txt + sitemap.xml route handler PRs**
- **8 OG image / metadata PRs**
- **35 `migrate: replit -> coolify + neon + clerk` PRs** (a separate but parallel migration sweep)
- **2 `developers-digest-site` apps-page PRs** (one to add Neon Data Lite, one to mark unreachable apps as Coming Soon so the public page stops linking to dead hosts)

**Total open PRs by end of day: 58**, with a separate ledger of in-progress Sentry/OG batches still being prepped. Counted with the not-yet-opened batches, the day's pipeline was over 100 PRs.

## Status by Merge State

Of the 58 PRs that landed in GitHub today:

- **40 are CLEAN** - no failing checks, ready for `@devin-ai-integration` review and merge.
- **18 are blocked by a single failing build check** - almost always pnpm-lock sync drift between the agent's working tree and CI. The fix is mechanical; the cost is that they cannot auto-merge.
- **0 changes-requested** (none of these repos have formal review gates configured).
- **51 awaiting first-pass Devin review.**

The two PRs against `developers-digest-site` itself are the worst stuck - they fail four checks each (`analyze`, `check`, `lighthouse`, `typecheck`) because the marketing site has the strictest CI in the empire. That's by design and I am not going to soften it.

## The Pattern: Audit Once, Fix in Fanout, Document in Skills

The thing worth extracting from this day is not any individual fix. It is the loop:

1. **Audit once.** A single 30-second `xargs -P 10 curl` sweep across all hosts produced a complete drift matrix. No app-by-app investigation, no spreadsheet maintenance.
2. **Fix in fanout.** Each row of the matrix becomes a templated PR job.
Agents clone to `/tmp//` (in-place agents collide on branch switches), apply the same patch, push, open a private PR, tag Devin. One per repo. 3. **Document in skills.** Every recurring pattern from the day gets promoted into `~/.claude/skills/` so the next audit is faster and the next fanout has a tighter template. Today's session added entries for `dd-pr` (the branch → PR → tag-Devin convention) and the parallel-clone strategy. The key insight is that consistency across an app empire is not a one-time job. It is a *recurring drift problem*. The only durable answer is to run the audit weekly via cron and keep the fanout templates warm. ## What's Outstanding Two things did not get fixed today: - **GitHub Actions billing.** A handful of CI checks are queued behind an Actions usage cap on the org. Until that's resolved, even the CLEAN PRs can't auto-run their final checks. Migration to a higher tier is on tomorrow's list. - **Coolify dashboard work.** The seven down hosts all need triage in Coolify - some are 5xx (deploy broken, fixable via lockfile sync), some are 000 (DNS / TLS / image build). Each requires hands-on dashboard time. I will not be batching this; the failure modes are too varied. ## What This Cost The cost of this kind of day is mostly agent time, not human time. I spent about 90 minutes actively driving - writing the audit script, reviewing the drift matrix, queuing the fanouts, spot-checking Devin reviews. The agents did the rest in parallel. Three things made it tractable: - **One source of truth.** `apps-data.ts` on this site is the canonical list of every deployed app. Every audit script reads from it. - **Tight per-PR scope.** Each fanout PR touches one or two files. No PR ever combined "add favicon" with "fix Sentry" - that's how you get rejections. - **Honest skip allowed.** Agents that hit a repo with non-standard structure are allowed to skip with a written reason instead of forcing a broken PR. 
About 12% of the queued jobs ended up in the skip pile, which is fine. If you are running more than five deployed apps from the same starter, you already have this drift problem. The longer you wait to audit, the worse the matrix gets. Run the curl sweep this week.
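
The sweep itself was done with `curl`; the same audit can be sketched in Python if you want the drift matrix as a data structure. Everything here is an assumption for illustration: the `audit` and `missing_counts` helpers, the list of checked paths, and the convention of returning `0` for an unreachable host (mirroring curl's `000`). The host list is passed in rather than read from `apps-data.ts`.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Paths to probe on every host; adjust to your own checklist.
CHECK_PATHS = ["/favicon.ico", "/llms.txt", "/sitemap.xml", "/robots.txt"]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 when the host is unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # 4xx / 5xx still carry a status code
        return err.code
    except (URLError, OSError):  # DNS, TLS, refused connection, timeout
        return 0


def audit(hosts, fetch=http_status, workers=10):
    """Probe "/" plus each check path on every host, `workers` at a time."""
    jobs = [(host, path) for host in hosts for path in ["/"] + CHECK_PATHS]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(lambda job: fetch(f"https://{job[0]}{job[1]}"), jobs))
    matrix = {host: {} for host in hosts}
    for (host, path), status in zip(jobs, statuses):
        matrix[host][path] = status
    return matrix


def missing_counts(matrix):
    """Count, per check path, how many reachable hosts did not return 200."""
    reachable = [host for host, row in matrix.items() if row["/"] == 200]
    return {
        path: sum(1 for host in reachable if matrix[host][path] != 200)
        for path in CHECK_PATHS
    }
```

Some servers answer `HEAD` differently from `GET`, so treat a surprising status as a prompt to re-check by hand rather than as a verdict.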

219 PRs in One Day: A Parallel Agent Fan-Out Postmortem

Developers Digest — Wed, 06 May 2026 00:00:00 GMT

## The Setup I run a small empire: 35 apps under the developersdigest org, each one a separate repo, most of them deployed on Coolify, a few stragglers still on Replit and Vercel. Migrations across that many repos used to mean a week of context-switching. This week I tried something different: spawn one subagent per repo, fan out, let them work in parallel, then come back and review. The session shipped 219 pull requests in one day. Here is the honest breakdown - the patterns that survived contact with reality, the ones that exploded, and the fixes that turned chaos into a repeatable workflow. ## Why Parallel The work was embarrassingly parallel by nature. Same migration, 35 different codebases, no shared state. A sequential loop would have taken eight hours of agent time and probably twelve hours of me babysitting tool calls. A parallel fan-out is bounded by the slowest agent, not the sum of them. The pitch is simple: if your task decomposes into N independent units of work, the wall-clock time should be dominated by the longest unit, not N times the average. That is the whole shape of the speedup. Three sequential searches are slower than three parallel agents. Three hundred sequential migrations are catastrophically slower than three hundred parallel ones. ## Patterns That Worked **Tight scope per agent.** Every agent got one repo, one branch, one PR target. No agent was allowed to touch shared infra. No agent could decide its own scope. The prompt was a checklist, not a goal. When I gave agents room to interpret, they invented work - extra refactors, README rewrites, dependency bumps nobody asked for. When I gave them a checklist, they finished and stopped. **The honest-skip rule.** I baked into every prompt: *if this repo does not match the migration profile, return SKIPPED with a one-line reason and exit cleanly.* This was the single most useful pattern. Without it, agents will hallucinate work to look productive. 
With it, ~40 of the 200+ runs returned honest skips - repos already migrated, repos that were docs-only, repos with no deploy target. Those skips saved hours of cleanup. **`/tmp//` isolation.** The first thing I tried was running multiple agents against the same local checkout. Catastrophic. Branch switches collided, working trees got tangled, two agents committed to each other's branches. The fix: every agent clones fresh into `/tmp//`, works there, pushes, opens its PR, and never touches the canonical local copy. Disposable working directories are non-negotiable for parallel work. ## Patterns That Broke **Rogue `pkill` collisions.** A few agents had build steps that ran `pkill -f next` to clean up dev servers. With twenty agents running simultaneously, one agent's cleanup killed another agent's build mid-compile. Builds failed for reasons that had nothing to do with their code. I lost an hour chasing ghost failures before I traced it. **Disk fill.** Two hundred clones of medium-sized Next.js repos plus 200 `node_modules` installs blew through 60GB. Coolify started returning 500s on unrelated apps because the host disk was full. `docker builder prune -f` fixes this after the fact, but the better answer is to never let it happen. **False-empty remotes.** Several agents reported "nothing to commit, branch is clean" when in fact they had simply failed to detect modified files because they had `cd`'d into the wrong directory after a clone. The PR opened but contained zero diff. From the dispatch log it looked like a successful run. I caught these only by spot-checking PR diffs by hand. ## Fixes **Build-lock script.** A simple flock-based wrapper around any command that touches a shared resource. Builds serialize through the lock, everything else stays parallel. Crude but it works. **Fallback to local copies.** When a `/tmp//` clone failed for billing or network reasons, fall back to copying from a local cache directory rather than failing the run. 
Saved a dozen agents during a brief GitHub API blip. **Narrow filters.** Instead of "run this on every repo," I now generate the target list explicitly with a query - "repos with `nixpacks.toml` and no `coolify.yml`, modified in the last 90 days." Smaller, sharper target list, fewer wasted runs, fewer false-empty PRs. ## Outcome 219 pull requests opened. Maybe 70% of them are mergeable as-is. The rest need small edits - a wrong env var name, a stale port number, a missing health check. The bottleneck now is not agent capacity. It is human review bandwidth and, embarrassingly, a GitHub Actions billing cap I hit around PR 180. Two non-code lessons came out of this: 1. **Devin review is the new rate limit.** I tag @devin-ai-integration on every DD PR for a second-pass review before merge. With 200+ open PRs that queue is now the choke point. Parallelizing the agent does nothing if the reviewer is serial. 2. **GitHub billing scales with your agents.** I tripped a private-repo Actions minute cap I had never come close to before. Worth budgeting for if you plan to run anything like this regularly. ## The Skill Codification The whole recipe - the clone pattern, the honest-skip rule, the build lock, the PR-and-tag-Devin flow - is now a single skill called `replit-to-coolify`. I trigger it with one phrase and a target repo, and the same well-debugged prompt runs every time. That is the actual outcome of a session like this. Not the 219 PRs. The 219 PRs are the artifact. The skill is the asset. Next time I have a many-repos-one-change job, I do not have to re-derive the patterns. I run the skill, fan out, and review. The whole cycle from "I should migrate these" to "PRs are open" collapses from a week to an afternoon. If you are sitting on a portfolio of repos that need the same change, the leverage is real. Just budget for the disk, the billing, and the reviewer queue before you press go.
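
The build-lock fix described above can be sketched as follows. This is a minimal version, assuming a Unix host (it relies on `fcntl.flock`) and a hypothetical lock path; the original wrapper is described only as "flock-based", so treat the details as illustrative rather than as the script that was actually used.

```python
import fcntl
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def build_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_serialized(command: list[str], lock_path: str = "/tmp/dd-build.lock") -> int:
    """Run command while holding the lock and return its exit code.

    The default lock path is a made-up example; every agent must agree on it.
    """
    with build_lock(lock_path):
        return subprocess.run(command, check=False).returncode
```

Only the steps that touch a shared resource need to go through `run_serialized`; clones, edits, and pushes can stay fully parallel.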

38 Apps in One Day: Migrating an Empire from Replit to Coolify

Developers Digest — Wed, 06 May 2026 00:00:00 GMT

## The Hook 38 apps. One day. Roughly 120 pull requests across the empire. By the time the dust settled, every Replit-hosted project under our org had a `migrate/coolify-clerk` PR sitting in review, pre-staged for one-click Coolify deploy. This is the candid version. What worked, what was a stub that did not need migrating, and what the recipe actually looks like when you repeat it 38 times in a day. ## Why We Moved Off Replit Replit was a great place to scaffold something at 1am. It is not where you want production infra to sit long-term. The reasons stacked up: - **Vendor lock-in on the runtime.** Replit's Nix layer, deployment targets, and proprietary databases meant every app carried a small but real "this only runs here" tax. Moving them was easier than continuing to pay it. - **Runtime quirks at scale.** Cold starts, opaque crash loops, and a control plane we did not own. When 38 apps are all using the same hosting layer, every quirk multiplies. - **No infra parity with the rest of our stack.** The serious DD apps already lived on Coolify (Hetzner) with Neon Postgres, Clerk auth, and Cloudflare DNS. Splitting hosting providers meant two debugging playbooks, two billing surfaces, two sets of secrets. - **Cost.** We were paying for always-on Replit deployments for apps that get a few hundred hits a week. A Hetzner box running 30+ containers is cheaper than the Replit equivalent by an order of magnitude. The decision was not "Replit bad." It was "one stack, one playbook, one bill." ## The Recipe Every migration followed the same shape, regardless of whether the app was Express or Next.js: **For Express + Vite + Drizzle apps:** 1. Strip Replit-specific files (`.replit`, `replit.nix`, runtime polyfills). 2. Add Clerk for auth, replacing whatever Replit Auth shim was in place. 3. Move the database to Neon Postgres, point Drizzle at the new connection string. 4. Wrap it in a single-container Dockerfile that builds the Vite frontend and serves it from Express. 5. 
Add `coolify.json` plus health check endpoint. **For Next.js + Prisma apps:** 1. Same Replit cleanup. 2. Clerk for auth. 3. Neon for the Postgres, Prisma migrations rerun against the new DB. 4. Single-stage Dockerfile using the standalone Next.js output. 5. Coolify config plus `/api/health`. The single-container constraint was deliberate. Coolify is happiest when an app is one container with one port. No sidecars, no multi-service compose files unless the app genuinely needs them. Most did not. ## The Honest Stats The 38-app number sounds impressive until you break it down: - **38 repos targeted.** Pulled from the org-wide list of anything tagged or known to have Replit deployment history. - **~16 were genuine app migrations.** Real code, real users, real database, real port to do. - **~9 were empty stubs.** Repos scaffolded during a brainstorm, never actually built. The migration agent correctly skipped these and filed an "empty stub, no action" report. - **~5 were monorepos.** A single repo containing 2 to 4 deployable apps. Each got its own `migrate/` branch with the apps split into separate Coolify services. - **~4 were false-empties.** Looked empty at first pass because the actual app lived in a subdirectory or behind a non-default branch. The agent flagged these for human review rather than guessing. - **~4 were already migrated.** Drift from a previous half-finished migration attempt. We closed those out and noted the existing deploy. Total PRs opened across all categories: roughly 120. That includes the migration PRs, follow-up cleanup PRs (lockfile sync, env var fixes, health check tweaks), and a handful of `chore: archive` PRs for repos that should not have existed in the first place. ## The Tooling The fan-out was the interesting part. The pipeline was three CLIs and one orchestrator: - **`gh` CLI** for everything GitHub. Listing org repos, cloning, branch creation, PR open, PR comment, tagging reviewers. Every agent used `gh` and only `gh`. 
- **`neonctl`** for spinning up Neon Postgres branches per app. New project, new connection string, dump it into the env file, done. - **Claude Code subagent fan-out** as the orchestrator. The parent session held a queue of 38 repos. It dispatched one subagent per repo, each cloning to its own `/tmp//` directory to avoid the in-place collision problem we have hit before with 5+ parallel agents on the same checkout. At peak, we had 8 to 10 subagents running concurrently. Each one followed the same `replit-to-coolify` skill: clone, audit, decide if migration is needed, apply the recipe or honest-skip, open a PR on a `migrate/coolify-clerk` branch, tag the reviewer, exit. The honest-skip rule was load-bearing. Without permission to skip, an agent will hallucinate work to fill the silence. With it, the empty stubs and false-empties got flagged correctly instead of receiving fake migration PRs. ## Ship Status Every PR is sitting in review right now, tagged `@devin-ai-integration` for the automated review pass. The standing rule held across all 120 PRs: branch, PR, tag Devin, never direct-push to main. Each migration PR is pre-staged for one-click Coolify deploy. The Dockerfile builds, the health check responds, the env vars are documented in the PR body. When Devin signs off and we merge, Coolify picks up the push and deploys. We are merging in batches rather than all at once. Five to ten apps per evening, watch the Coolify queue, fix anything that breaks the build (usually a `pnpm-lock.yaml` sync issue, the recurring failure mode), move to the next batch. ## What's Next Once everything is on Coolify: - **Decommission Replit deployments** after a 7-day grace period of dual-running. - **Standardize the observability layer.** Every app gets the same Sentry config and the same `/api/health` shape, so the empire dashboard can poll one endpoint per app and get a real signal. - **Consolidate Neon projects.** 38 separate Neon projects is too many. 
Group by tier and traffic so the free tier covers what it should and the paid tier covers what actually needs it. - **Write the `replit-to-coolify` skill into the standard scaffold.** New apps should never touch Replit again. The skill is now part of the default scaffold path. The interesting part of the day was not the migration itself. The recipe is boring once you have it. The interesting part was that 38 apps moved in a day because the orchestration was tight, the skip rule was honored, and every agent had the same playbook. That is the leverage. Not the agents. The playbook the agents share.
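
The orchestration shape (bounded concurrency, one disposable directory per repo, and an explicit skip result) can be sketched like this. The `Result` type, the `fan_out` helper, and the `migrate` callback are hypothetical names for illustration; the real clone, agent call, and PR steps are deliberately left to the callback.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Result:
    repo: str
    status: str  # "MIGRATED" or "SKIPPED"
    reason: str = ""  # one-line explanation, required in spirit for skips


def fan_out(
    repos: Iterable[str],
    migrate: Callable[[str, str], Result],
    max_workers: int = 8,
) -> list[Result]:
    """Run migrate(repo, workdir) for each repo, at most max_workers at once."""

    def run_one(repo: str) -> Result:
        # Each job gets its own throwaway directory, so jobs never share a checkout.
        prefix = repo.replace("/", "_") + "-"
        with tempfile.TemporaryDirectory(prefix=prefix) as workdir:
            return migrate(repo, workdir)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, repos))  # results keep the input order


def summarize(results: Iterable[Result]) -> dict[str, int]:
    """Count results per status so skips are visible in the final report."""
    counts: dict[str, int] = {}
    for result in results:
        counts[result.status] = counts.get(result.status, 0) + 1
    return counts
```

Making `SKIPPED` a first-class return value is the point of the design: a job that finds nothing to do has a legitimate way to say so.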

Claude Code Complete Course

Developers Digest — Wed, 06 May 2026 00:00:00 GMT

# Claude Code Complete Course This course is a full practical path from first install to team rollout. Every module uses official documentation and release sources, with direct links for verification. ## Official Sources Used Throughout - Claude Code overview: https://docs.anthropic.com/en/docs/claude-code/overview - Claude Code quickstart: https://docs.anthropic.com/en/docs/claude-code/quickstart - Claude Code tutorials: https://docs.anthropic.com/en/docs/claude-code/tutorials - Claude Code CLI reference: https://docs.anthropic.com/en/docs/claude-code/cli-reference - Claude Code settings: https://docs.anthropic.com/en/docs/claude-code/settings - Claude Code output styles: https://docs.anthropic.com/en/docs/claude-code/output-styles - Claude Code memory: https://docs.anthropic.com/en/docs/claude-code/memory - Claude Code MCP: https://docs.anthropic.com/en/docs/claude-code/mcp - Claude Code SDK MCP: https://docs.anthropic.com/en/docs/claude-code/sdk/sdk-mcp - Claude Code GitHub Actions: https://docs.anthropic.com/en/docs/claude-code/github-actions - Claude Code costs: https://docs.anthropic.com/en/docs/claude-code/costs - Claude Code security: https://docs.anthropic.com/en/docs/claude-code/security - Anthropic news and release updates: https://www.anthropic.com/news - Claude Code Action repository: https://github.com/anthropics/claude-code-action - GitHub Actions docs: https://docs.github.com/en/actions - GitHub Actions security hardening: https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions ## Course Outcomes By the end of this course, you will be able to: 1. Install and configure Claude Code for safe daily use. 2. Write deterministic prompts that reduce rework. 3. Run code changes with explicit review gates. 4. Integrate MCP tools with least privilege. 5. Automate PR workflows with GitHub Actions. 6. Track and optimize token costs. 7. Implement team governance for AI-assisted coding. 
## Module 1 - Setup and First Run

### What You Learn

- Installation flow and environment checks.
- Authentication and first interactive session.
- Basic command lifecycle and safe editing posture.

### Exercises

1. Install Claude Code and verify command availability.
2. Run your first session in a sandbox repository.
3. Perform one small refactor and inspect the diff.

### Screenshot Checklist

- Terminal showing successful install.
- First `claude` launch.
- Login complete state.
- First proposed diff with approval prompt.

### Primary Reading

- Quickstart: https://docs.anthropic.com/en/docs/claude-code/quickstart
- Overview: https://docs.anthropic.com/en/docs/claude-code/overview

## Module 2 - Prompt Engineering for Code Tasks

### What You Learn

- Constraint-first prompting.
- File scope limits and acceptance criteria.
- Plan then patch then test pattern.

### Prompt Template

```text
Objective: [exact outcome]
Constraints: [files allowed, style rules, non-goals]
Process: propose a plan first, then patch, then run tests
Validation: list tests run and summarize risk
```

### Exercises

1. Convert a vague prompt into a constrained prompt.
2. Compare results across three prompt variants.
3. Produce a reusable prompt template library for your team.

### Primary Reading

- Tutorials: https://docs.anthropic.com/en/docs/claude-code/tutorials
- Prompt engineering overview: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview

## Module 3 - Diff Quality and Review Discipline

### What You Learn

- Breaking large changes into staged commits.
- Review-first behavior before applying broad edits.
- Human review checklist for correctness and maintainability.

### Review Checklist

- Does every changed file map to the requested scope?
- Are tests added or updated where behavior changed?
- Is error handling preserved or improved?
- Is rollback straightforward if production issues appear?
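
The first checklist item can be partly automated. The sketch below is one possible approach, not part of Claude Code: the `out_of_scope` helper is a hypothetical name, and it assumes the requested scope can be expressed as glob patterns.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed glob patterns.

    Note: fnmatch's `*` also matches `/`, so "src/billing/*" covers nested files.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]
```

The list of changed files could come from `git diff --name-only` against the base branch; an empty result means every changed file maps to the stated scope.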
### Primary Reading - CLI reference: https://docs.anthropic.com/en/docs/claude-code/cli-reference - Security docs: https://docs.anthropic.com/en/docs/claude-code/security ## Module 4 - Settings, Memory, and Output Control ### What You Learn - Configure output style by task type. - Use memory features for long-running workflows. - Reduce context noise during focused implementation. ### Exercises 1. Create two settings profiles: concise and teaching. 2. Run the same task with each profile and compare outcomes. 3. Document when each profile should be used. ### Primary Reading - Settings: https://docs.anthropic.com/en/docs/claude-code/settings - Output styles: https://docs.anthropic.com/en/docs/claude-code/output-styles - Memory: https://docs.anthropic.com/en/docs/claude-code/memory ## Module 5 - MCP Integration Basics ### What You Learn - MCP architecture and trust boundaries. - Connecting tools safely. - Diagnosing tool timeout and data-shape failures. ### Exercises 1. Configure one MCP server in a test project. 2. Execute one tool-assisted coding task. 3. Validate fallback behavior for tool failures. ### Primary Reading - MCP docs: https://docs.anthropic.com/en/docs/claude-code/mcp - SDK MCP: https://docs.anthropic.com/en/docs/claude-code/sdk/sdk-mcp - MCP GitHub org: https://github.com/modelcontextprotocol ## Module 6 - MCP Advanced Workflows ### What You Learn - Multi-tool sequencing patterns. - Stable intermediate outputs. - Failure handling and retries. ### Exercises 1. Implement two-step tool workflow with validation between steps. 2. Add bounded retries and fallback handling. 3. Write an operational runbook for the workflow. ### Primary Reading - MCP TypeScript SDK: https://github.com/modelcontextprotocol/typescript-sdk - MCP Python SDK: https://github.com/modelcontextprotocol/python-sdk ## Module 7 - GitHub Actions Integration ### What You Learn - Action workflow design for pull requests. - Permissions minimization. - Secret handling and protected branches. 
### Exercises 1. Configure `anthropics/claude-code-action@v1` in a repo. 2. Trigger review workflow from PR comments. 3. Add timeout, concurrency, and permission limits. ### Primary Reading - Claude Code Actions docs: https://docs.anthropic.com/en/docs/claude-code/github-actions - Action repository: https://github.com/anthropics/claude-code-action - GitHub Actions docs: https://docs.github.com/en/actions - Security hardening: https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions ## Module 8 - Cost Engineering ### What You Learn - Cost drivers in coding sessions. - Task decomposition for lower token usage. - Repeatable cost benchmarking. ### Exercises 1. Run baseline task and record cost. 2. Apply scope and prompt optimizations. 3. Compare cost and quality before and after. ### Primary Reading - Costs docs: https://docs.anthropic.com/en/docs/claude-code/costs ## Module 9 - Security and Governance ### What You Learn - Risk tiers for AI-assisted changes. - Human review requirements by tier. - Sensitive data handling boundaries. ### Governance Policy Starter - Tier 1 low risk: docs and non-critical refactors. - Tier 2 medium risk: feature edits requiring full tests. - Tier 3 high risk: auth, billing, infra changes with mandatory senior review. ### Primary Reading - Security docs: https://docs.anthropic.com/en/docs/claude-code/security - Anthropic news for updates: https://www.anthropic.com/news ## Module 10 - Team Rollout Plan ### What You Learn - Pilot design and success metrics. - Change management for engineering teams. - Standard operating procedures for daily use. ### Rollout Framework 1. Week 1: two-engineer pilot. 2. Week 2: evaluate quality and cycle-time. 3. Week 3: expand to one full squad. 4. Week 4: publish org standards and templates. ## Module 11 - Production Incident Scenarios ### What You Learn - Detecting incorrect automated edits. - Rollback and remediation paths. - Communication templates for incident response. 
### Exercises 1. Simulate flawed patch in staging. 2. Run rollback with audit notes. 3. Document root cause and prevention controls. ## Module 12 - Capstone ### Capstone Brief Build a full feature with this flow: 1. Define acceptance criteria. 2. Generate plan. 3. Apply staged changes. 4. Run tests and lint. 5. Submit PR with risk and rollback summary. 6. Run CI assistant checks and finalize review. ### Capstone Scoring - Correctness: 30 percent - Code quality: 20 percent - Test quality: 20 percent - Security and governance: 15 percent - Cost discipline: 15 percent ## Required Screenshots for Publication Capture these and add to your course assets folder: 1. Install command and success output. 2. First authentication flow complete state. 3. First plan response. 4. Approval prompt before patch. 5. Diff preview. 6. Test run output. 7. MCP configuration example. 8. MCP tool call result. 9. GitHub Actions YAML excerpt. 10. PR comment trigger example. 11. Action run summary. 12. Cost output comparison. 13. Security checklist file. 14. Capstone final PR summary. ## Author QA Checklist - Every claim includes at least one official link. - Every lesson includes a hands-on exercise. - Every module includes at least one screenshot requirement. - Every advanced module includes cost and risk notes. - Every workflow can be run in a clean repository from scratch. ## Suggested Publishing Plan for Developers Digest 1. Publish this complete guide first. 2. Split each module into individual course lessons in `/courses`. 3. Add one hero image for the course page at `/public/images/courses/`. 4. Add a companion blog post for each advanced module. 5. Link all assets from tutorials and guides index pages. ## Release Maintenance Cadence Before each cohort or major promotion: - Re-check all official docs and release pages. - Re-run every command shown in lessons. - Re-capture screenshots if UI or workflow changed. - Update lesson notes with dated verification.

Claude Code 2.1.128 Is an Ops Release, Not a Feature Drop

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

Claude Code 2.1.128 does not look like a launch. That is the point. The interesting part of the [2.1.128 release notes](https://github.com/anthropics/claude-code/releases/tag/v2.1.128) is how much of the work is about agent operations: MCP visibility, worktree correctness, telemetry isolation, plugin packaging, permission persistence, and noisy reconnect behavior. For people treating [Claude Code](/blog/what-is-claude-code-complete-guide-2026) as a daily coding agent instead of a demo, this is the kind of release that matters. ## Quick verdict If you use MCP, worktrees, hooks, plugins, or OTEL-instrumented local commands, upgrade. This is the kind of maintenance release that prevents expensive agent sessions later. If you are still choosing between coding agents, start with [/compare](/compare) and the cost side of the decision at [/pricing](/pricing). ## The take Claude Code is moving from "agent that edits files" toward "agent runtime you can operate." The new release says `/mcp` now shows tool counts for connected servers and flags servers that connect with 0 tools. That sounds tiny until you debug a broken [MCP server](/blog/what-is-an-mcp-server-beginner-guide-2026) in a real project. A server that connects but exposes no useful tools is one of the worst failure modes because the agent appears integrated while silently losing capability. The release also reserves `workspace` as an MCP server name, summarizes reconnecting MCP tools by server prefix, and fixes MCP image results when structured content and content blocks are returned together. This is plumbing. It is also the difference between "MCP is cool" and "MCP is supportable." That pairs with the direction in [Claude Code hooks](/blog/claude-code-hooks-explained), [Claude Code subagents](/blog/claude-code-sub-agents), and [parallel agent merge discipline](/blog/parallel-coding-agents-merge-discipline): once agents touch real repos, observability becomes product functionality. 
## The worktree fix is the sleeper The release note that jumped out: `EnterWorktree` now creates the new branch from local HEAD as documented, instead of `origin/`. That means unpushed local commits are no longer dropped when entering a new worktree session. If you use [Claude Code agent teams](/blog/claude-code-agent-teams-subagents-2026), this matters immediately. Parallel agents often start from the current local state, not from pristine remote main. If a worktree is created from the wrong base, the agent can produce a valid-looking patch that is missing the exact context it needed. This is the practical version of the argument in [long-running agents need harnesses](/blog/long-running-agents-need-harnesses). The agent is not just the model. It is the git base, working directory, permission layer, tool registry, and handoff log around the model. ## OTEL isolation is a real production concern Another small but important change: subprocesses such as Bash, hooks, MCP, and LSP no longer inherit `OTEL_*` environment variables from Claude Code. That prevents OTEL-instrumented apps run through the Bash tool from accidentally using the CLI's own OTLP endpoint. If you have ever run local traces while an agent is executing tests, this is not cosmetic. It prevents telemetry from becoming polluted or misrouted. The same theme shows up in [local OTEL traces for agents](/blog/dd-traces-local-otel) and [agent finops](/blog/400-dollar-overnight-bill-agent-finops): measurement is only useful when you know which process produced the span. ## The opposing view The fair critique is that these are not headline features. No new model capability. No giant context-window claim. No magic "agent does everything" demo. Some users will skip the changelog because the bullet list feels like maintenance. But maintenance is exactly what agent tools need now. The AI coding market has enough demos. 
The scarce thing is operational discipline: reliable worktrees, visible tool counts, quieter reconnects, clean telemetry, persistent permission choices, and predictable plugin loading.

That is also why [skills need exit criteria](/blog/agent-skills-production-checklist). Teams are not blocked by a lack of agent ambition. They are blocked by missing control surfaces.

## What to do after upgrading

If Claude Code is part of your daily workflow, this release suggests a short checklist:

1. Run `/mcp` and check every connected server has the expected tool count.
2. Rename any MCP server called `workspace`.
3. Test one worktree-based agent flow from a branch with unpushed local commits.
4. Confirm local test commands still emit OTEL traces to the endpoint you expect.
5. Review which Bash permission prompts should persist into `.claude/settings.local.json`.

That is less exciting than installing a new model. It is also more likely to prevent a bad agent session.

## Frequently Asked Questions

### What changed in Claude Code 2.1.128?

The release includes MCP tool-count visibility, `workspace` reserved as an MCP server name, cleaner MCP reconnect summaries, a worktree base fix for `EnterWorktree`, OTEL environment isolation for subprocesses, plugin archive support, and multiple terminal and permission fixes.

### Why does MCP tool count matter?

It makes broken integrations easier to spot. If an MCP server connects but exposes 0 tools, the agent may appear connected while missing the capabilities you expected.

### Should teams upgrade immediately?

If your workflow uses MCP, hooks, worktrees, plugins, or OTEL-instrumented local commands, yes. This is an operational reliability release more than a feature release.

### How does this relate to parallel agents?

Parallel agents depend on correct worktree state and clean tool visibility. A wrong branch base or silent MCP failure can make a parallel agent produce a patch that looks valid but was built from the wrong context.

Codex Automations: Where Scheduled AI Agents Actually Help

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

Codex automations are easy to misunderstand.

The weak version is "schedule a prompt." That is useful, but not that interesting. The strong version is different:

> Give an agent a repeatable workspace job, clear evidence sources, a reviewable output, and a safe schedule.

That is where Codex becomes practical for engineering teams.

OpenAI's [Codex Automations](https://openai.com/academy/codex-automations) guide says Codex can return on a schedule, do recurring work, and surface results for review. The examples are deliberately mundane: morning briefs, weekly reviews, checking missing information, summarizing recent activity, and recurring status updates.

That mundanity is the point. The best automations do not replace judgment. They remove repeated context gathering.

## What Codex Automations Are Good For

The sweet spot is recurring work with the same shape every time.

Good examples:

- daily repo brief from git history, issues, and open PRs
- weekly QA sweep over known pages
- stale docs check against recent code changes
- dependency update summary
- changelog draft from merged commits
- SEO report from analytics and recent content
- recurring "what changed while I was away" handoff
- review-comment triage before a sprint planning block

OpenAI's [Codex app announcement](https://openai.com/index/introducing-the-codex-app/) gives similar internal examples: daily issue triage, CI failure summaries, release briefs, and bug checks. That is a strong signal about intended use. Automations are not just for novelty reminders. They are for operational work that is annoying because it is repeated, not because it is intellectually hard.

## The Automation Test

Before scheduling a Codex automation, ask five questions.

### 1. Does it have stable inputs?

Bad:

```txt
Tell me what matters.
```

Good:

```txt
Inspect the last 24 hours of git commits, open GitHub PRs, QA.md, and SEO-DAILY.md.
```

Stable inputs make the task reproducible.
If the input set changes every run, the output will drift.

### 2. Is the output reviewable in under two minutes?

An automation should produce something you can scan quickly:

- changed files
- priority list
- short report
- draft PR description
- markdown note
- table of gaps
- yes/no status with evidence

If the output requires a long investigation to trust, the automation did not save much time.

### 3. Can the agent act safely?

Some jobs should report only. Some can edit files. A few can open PRs. Almost none should push, merge, email, delete data, or spend money without explicit approval.

The default should be:

```txt
Report first.
Draft changes only when low risk.
Do not publish, send, push, merge, or delete.
```

That rule is boring. It is also what keeps scheduled agents from becoming scheduled incidents.

### 4. Is there a verification command?

The best automations end with checks:

- `pnpm lint`
- `pnpm typecheck`
- `pnpm build`
- route smoke test
- broken-link scan
- screenshot check
- data freshness check

No verification means the automation is mostly a writer. Verification turns it into a worker.

### 5. Does it improve with memory?

OpenAI notes that some automations can return to the same conversation and continue from existing context. That is valuable when the work has a running state:

- a recurring SEO plan
- an open migration
- an issue queue
- a content backlog
- a weekly release rhythm

If every run starts cold, it can still help. But the compounding value comes when Codex remembers what happened last time and avoids repeating the same shallow recommendation.

## The Best Engineering Automations

### Daily Repo Brief

This is the first automation I would set up on almost any project.

```txt
Every weekday morning, review the last 24 hours of git history, open PRs, failing checks, and QA.md.

Produce a short repo brief with:
1. What changed
2. What is risky
3. What needs review
4. The next 3 actions

Do not edit files unless I explicitly ask in this thread.
```

Why it works:

- stable inputs
- low risk
- high context value
- easy to review

This is not glamorous, but it reduces the cost of re-entering a project.

### CI Failure Triage

The automation:

```txt
When scheduled, inspect recent failing checks, summarize the likely cause, link to the relevant logs, and propose the smallest fix.
Do not modify code unless the fix is isolated and the failing test is clear.
```

Why it works:

- CI has concrete evidence
- logs are reviewable
- the agent can compare failure text to recent diffs
- the output saves immediate debugging time

The trap is letting it guess. The prompt should require log links, command names, and the exact failing step.

### Stale Docs Sweep

The automation:

```txt
Every Friday, compare recent code changes against README.md, AGENTS.md, CLAUDE.md, docs, and content guides.
Report docs that appear stale.
Only edit docs when the code evidence is direct.
```

Why it works:

- docs drift slowly
- recent commits are a good signal
- the task is narrow
- the output is easy to review

This is especially valuable in agent-heavy repos, where instructions are part of the product.

### SEO Compounding Pass

The automation:

```txt
Every morning, inspect analytics, recent content, SEO-DAILY.md, and QA.md.
Pick the five highest-impact SEO improvements that are safe to complete today.
Prefer internal links, metadata fixes, source freshness, comparison routing, and stale high-traffic pages.
```

Why it works:

- analytics create a priority signal
- content files are editable
- verification is straightforward
- improvements compound

The key is avoiding volume theater. Five meaningful actions beat twenty generic internal links.

### Release Brief Draft

The automation:

```txt
Every Thursday, inspect merged commits since last release and draft a release brief.
Group changes by user impact, include known risks, and list verification evidence.
Do not publish.
```

Why it works:

- merged commits are stable
- release notes are repetitive
- humans should still approve tone and priority

This is a good example of Codex as an operator, not a decision maker.

## Where Automations Fail

### Vague ownership

If nobody owns the output, it becomes noise.

Bad:

```txt
Check the project every day.
```

Better:

```txt
Every day, update HANDOFF.md with missing video-to-blog coverage and list the top 3 gaps for review.
```

### Too much autonomy

Scheduled agents should not surprise you.

Avoid:

- auto-publishing public content
- sending emails
- changing billing settings
- merging PRs
- deleting data
- making large refactors

There are exceptions, but they need explicit trust, clear rollback, and narrow scope.

### No evidence trail

Every automation should show what it inspected.

Good output includes:

- files read
- commands run
- external sources checked
- analytics windows used
- assumptions made
- skipped actions and why

Without that trail, you are reviewing vibes.

### Weak schedules

Not every recurring job should run daily.

Daily:

- repo brief
- analytics pulse
- priority triage

Weekly:

- docs drift
- release notes
- dependency sweep
- content backlog review

Monthly:

- pricing refresh
- full SEO audit
- architecture docs review
- stale screenshot cleanup

Wrong frequency turns useful automation into background clutter.

## A Good Codex Automation Prompt Template

Use this:

```txt
Purpose:
Explain why this automation exists.

Inputs:
List exact files, dashboards, repos, issue filters, or docs to inspect.

Actions:
Describe what Codex should do every run.

Boundaries:
Say what it must not do without approval.

Output:
Specify the report, file edit, summary, PR draft, or checklist format.

Verification:
List commands, screenshots, links, or evidence required before it reports done.

Memory:
Tell it what to remember or compare against from prior runs.
```

That looks heavier than a casual prompt because scheduled work needs more discipline.
A bad one-off prompt wastes a turn. A bad automation wastes attention every time it runs.

## How This Connects To `/goal`

Codex automations and Codex `/goal` are related, but not identical.

- Automations answer: **when should the agent run?**
- Goals answer: **what persistent target should the agent keep working toward?**

The strongest pattern is both:

```txt
Every weekday, return to this SEO improvement goal.
Review analytics, choose the highest-impact safe action, make the edit, run checks, update SEO-DAILY.md, and report what changed.
```

The automation provides cadence. The goal provides continuity.

That is the move from "scheduled prompt" to "recurring agent workflow."

## Practical Takeaway

Codex automations are most useful when they are:

- specific
- repeatable
- evidence-driven
- reviewable
- bounded
- verified

Do not automate taste. Do not automate judgment. Automate context gathering, routine checks, safe edits, and report generation.

That is where scheduled AI agents are already useful: not as autonomous executives, but as reliable operators for the boring work that makes engineering teams faster.

## Sources

- OpenAI Academy: [Codex Automations](https://openai.com/academy/codex-automations)
- OpenAI: [Introducing the Codex app](https://openai.com/index/introducing-the-codex-app/)
- OpenAI: [Codex for almost everything](https://openai.com/index/codex-for-almost-everything/)
- OpenAI Developers: [Codex changelog](https://developers.openai.com/codex/changelog)
- OpenAI Developers: [Codex docs](https://developers.openai.com/codex/)

Codex Is Becoming a General-Purpose AI Agent, Not Just a Coding Tool

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

Codex is still described as a coding agent, but that label is starting to undersell what the product is becoming.

The old mental model was simple:

> Codex edits code, runs tests, and opens pull requests.

That is still true. But OpenAI's recent product direction points at something broader: Codex as a **general-purpose work agent** that happens to be strongest when the work has files, tools, verification steps, and repeatable outputs.

That distinction matters. A chatbot answers. A coding assistant edits code. A general-purpose agent can move across apps, gather context, update artifacts, check its work, and come back later.

That is the interesting version of Codex.

## The Official Signal

OpenAI's [Codex for almost everything](https://openai.com/index/codex-for-almost-everything/) announcement is the clearest product signal so far. OpenAI says Codex can now operate your computer, use more tools and apps, generate images, remember preferences, learn from previous actions, and take on ongoing repeatable work.

That is not just "better autocomplete." It is the shape of an agent workspace.

The newer [OpenAI Academy overview of Codex](https://openai.com/academy/what-is-codex/) says the quiet part directly: Codex can be useful beyond software for tasks that require more than a single answer, including gathering information from multiple sources, creating and updating files, and producing documents, slides, and spreadsheets.

So yes, code is still the home base. But the product boundary is expanding.

## What Makes Codex General-Purpose

The important part is not that Codex can "do anything." It cannot.

The useful framing is narrower: Codex is good for work that has **state, tools, artifacts, and review**.
That includes:

- reading a repo, notes, docs, emails, or dashboards
- making changes across many files
- using a browser to inspect a local app
- generating product images or mockups from context
- opening documents, spreadsheets, slides, and PDFs in a workspace
- running repeatable tasks through automations
- carrying context forward with memory and previous-thread continuation
- coordinating work across plugins and app integrations

Those are not all "coding" tasks. They are operational tasks. The reason Codex is good at them is the same reason it is good at code: it can interact with a workspace, not just produce a paragraph.

## The Best Non-Code Use Cases

### 1. Research To Artifact

Codex is useful when the output is not an answer, but a file.

Examples:

- turn a pile of source links into a brief
- convert notes into a product spec
- make a slide outline from raw research
- summarize a folder of PDFs into an internal memo
- update a roadmap document from Linear, Slack, and repo state

ChatGPT can help think through those tasks. Codex is better when you want the final result saved, structured, and checked against source material.

### 2. Browser-Based QA

OpenAI's Codex update added an in-app browser and browser-oriented workflows for frontend design, apps, and games. That matters because a lot of product work fails at the visual or interactive layer.

The useful prompt is not:

```txt
Make this page better.
```

The useful prompt is:

```txt
Open the local app, test the onboarding flow on desktop and mobile, capture what breaks, fix the highest-impact issues, and verify the flow works after the change.
```

That is not just coding. It is product QA with code edits as one possible action.

### 3. Repeatable Operator Work

Automations are the most underrated part of the broader Codex direction.
If Codex can wake up later with context, it becomes useful for work like:

- checking stale docs
- reviewing open PR comments
- auditing broken links
- refreshing SEO notes
- checking dashboards and producing a priority list
- following up on recurring operational tasks

This is where Codex starts to look less like an IDE feature and more like a junior operator for recurring workflows. For the deeper setup pattern, read the [Codex automations playbook](/blog/codex-automations-recurring-engineering-work).

The catch: the task needs a clear review loop. "Improve the business" is too vague. "Every weekday, inspect these five pages, fix broken internal links, run build, and report changed files" is usable.

### 4. File And Document Work

The Codex app can preview more file types, including docs, spreadsheets, slides, PDFs, and richer artifacts. That unlocks a category of work that coding agents usually ignore:

- clean up a spreadsheet
- turn a technical memo into slides
- inspect a PDF and extract action items
- compare a document against a checklist
- update a launch plan after a repo change

This does not mean Codex replaces dedicated document tools. It means the agent can participate in the work where engineering, content, and operations overlap.

### 5. Image And Product Mockup Iteration

OpenAI also added image generation into the Codex workflow. For developers, the interesting use case is not generic art. It is context-aware product imagery:

- app mockups
- visual concepts for features
- blog hero images
- game assets
- lightweight design explorations tied to real code

The best version of this is a loop: screenshot the current state, generate a visual direction, implement the UI, inspect it in browser, then iterate.

That is a general-purpose creative workflow wrapped around a development environment.

## Where Codex Still Should Not Be Used

Do not turn this into blind autopilot.
Codex is still strongest when the task has:

- clear inputs
- a known workspace
- explicit acceptance criteria
- files or artifacts to update
- commands or checks to run
- a human review step

It is weaker when the task depends on private judgment, ambiguous taste, unclear authority, or irreversible action.

Bad Codex task:

```txt
Handle my sponsorship pipeline.
```

Better Codex task:

```txt
Read the last seven days of sponsorship emails, draft a priority list, identify replies that need review, and do not send anything.
```

The difference is control. General-purpose does not mean permissionless.

## How To Prompt Codex Like A General Agent

The prompt format changes once you stop thinking of Codex as only a coding tool.

Use this structure:

```txt
Goal:
Create a concise weekly content operations report.

Context:
Use the repo's recent git history, SEO-DAILY.md, QA.md, and current analytics report.

Actions:
Find the top 5 signals, update SEO-DAILY.md, and create a short next-actions section.

Constraints:
Do not publish new content. Do not touch unrelated files. No private sponsor details.

Verification:
Run lint or explain why no code checks apply. Report files changed.
```

That prompt gives Codex a job, boundaries, and evidence requirements. It is not asking for a vibe. It is delegating a workflow.

## The Real Category Shift

The category is moving from "AI coding tool" to "agentic workspace."

That does not make the coding angle less important. It makes code one artifact among many. A real software project includes PRs, docs, screenshots, QA notes, dashboards, deployment logs, customer feedback, specs, spreadsheets, and follow-up tasks. Codex is starting to sit across that whole surface.

That is why the comparison with [Claude Code](/blog/codex-vs-claude-code-april-2026), [Cursor](/blog/cursor-vs-codex), and [GitHub Copilot](/blog/github-copilot-coding-agent-cli-2026) needs to widen. The question is not only "which model writes better code?"
The better question is:

> Which agent can safely move work forward across the tools where the work actually lives?

For Codex, the answer is increasingly: more than code, but still with engineering-style constraints.

## Practical Takeaway

Use Codex for non-code work when the task looks like a workflow:

- gather context
- update files
- inspect outputs
- run checks
- leave a report
- continue later if needed

Do not use it as a magical executive assistant. Use it as a workspace agent with explicit scope.

That is the useful version of "general purpose." Not a model that does everything. An agent that can keep moving through a real workspace until a reviewable artifact exists.

## Sources

- OpenAI: [Codex for almost everything](https://openai.com/index/codex-for-almost-everything/)
- OpenAI Academy: [What is Codex?](https://openai.com/academy/what-is-codex/)
- OpenAI: [Codex product page](https://openai.com/codex)
- OpenAI Developers: [Codex docs](https://developers.openai.com/codex/)
- OpenAI Developers: [Codex changelog](https://developers.openai.com/codex/changelog)

Codex Loops: What Boris Cherny Gets Right About Managing Agent Work

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

Boris Cherny's recent interview is worth watching because it names the thing most AI coding demos still hide: the future of agent work is not one perfect prompt. It is many supervised loops.

In the interview, Boris describes a personal Claude Code setup that has moved far past "agent writes a diff." He talks about running multiple sessions, using sub-agents heavily, and leaning more and more on `/loop`: recurring agent jobs scheduled with cron. The examples are wonderfully boring:

- babysit pull requests;
- fix CI;
- auto-rebase branches;
- keep CI healthy;
- cluster Twitter feedback every 30 minutes;
- report back when a changing data stream needs attention.

That is the useful part. The examples are not magical. They are the exact maintenance chores every engineering team already does poorly.

This is also where Codex content should go next. [Codex automations](/blog/codex-automations-recurring-engineering-work), [Codex goals](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences), the [Codex GitHub Action](/blog/codex-sdk-vs-cli-github-action), and the [Codex cloud security playbook](/blog/openai-codex-cloud-security-playbook-2026) all point in the same direction: the winning agent workflow is a loop with boundaries, receipts, and escalation rules.

## The Big Shift: From Tasks to Loops

The first AI coding workflow was a task:

```text
Fix this bug.
```

The second workflow was a scoped task:

```text
Fix the billing webhook validation.
Only touch app/api/billing and lib/billing.
Run pnpm test billing and pnpm typecheck.
Return changed files, tests run, and risks.
```

The loop workflow is different:

```text
Every 15 minutes, inspect open PRs labeled codex-watch.
If CI is red for a deterministic reason, attempt one fix.
If main moved, rebase once.
If the same failure appears twice, stop and leave a concise report.
Never push directly to main.
```

That is not just "task, repeated." It has a trigger, scope, action budget, stop condition, and reporting path.
Those are the pieces that turn an agent from a clever assistant into a useful background process.

## Why Loops Beat One-Shot Agents

One-shot agents are good at bounded edits. Loops are good at changing state.

A PR changes after review comments land. CI changes after a dependency cache expires. A deployment changes after Coolify finishes building. User feedback changes every hour. A model eval changes after new examples arrive.

These are not single-shot problems. They are state-monitoring problems.

That is why Boris's examples land. PR babysitting and CI repair are high-value because they sit in the annoying gap between "the code is basically right" and "the work is actually merged."

Codex is well positioned for this because the surface area is already there:

- [Codex CLI](/blog/openai-codex-guide) for local scoped work;
- [Codex GitHub Action](/blog/codex-sdk-vs-cli-github-action) for repo-triggered review and automation;
- [Codex automations](/blog/codex-automations-recurring-engineering-work) for recurring checks and reports;
- [Codex goals](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences) for longer-lived objectives;
- browser verification for UI and deploy checks.

The missing piece is not capability. It is loop design.

## The Loop Contract

Every useful Codex loop should fit on one page.

```yaml
name: pr-babysitter
trigger:
  every: 15m
scope:
  include:
    - pull_requests:
        labels: ["codex-watch"]
  exclude:
    - main
permissions:
  repo: write-branch
  ci: read
  deploys: read
budget:
  max_attempts_per_pr: 1
  max_runtime_minutes: 20
  max_files_changed: 8
stop:
  - same_failure_seen_twice
  - merge_conflict_requires_product_decision
  - tests_fail_after_one_fix
report:
  destination: pr-comment
  fields:
    - summary
    - action_taken
    - tests_run
    - remaining_blocker
```

The contract matters because loops are powerful in the same way cron jobs are powerful: they keep running after the interesting part is over.

Without a contract, a loop becomes background chaos.
With a contract, it becomes a junior operations teammate that handles the boring parts and escalates the judgment calls.

## Four Codex Loops I Would Actually Run

Start with loops that are safe, boring, and obviously reviewable.

### 1. PR Babysitter

Trigger: every 15 minutes on PRs with a label.

Job:

- check CI;
- rebase if main moved;
- fix one deterministic failure;
- summarize review comments;
- report blockers.

Stop if the same failure appears twice. Stop if the branch has merge conflicts that require a human decision. Stop if the fix touches files outside the declared scope.

This is the cleanest Codex loop because it maps to GitHub's natural workflow. The output is a PR comment, a small branch commit, or a status report.

### 2. CI Health Loop

Trigger: every 30 minutes on `main`.

Job:

- inspect the latest CI failures;
- cluster failures by signature;
- identify flakes vs deterministic failures;
- open one issue or draft one fix branch.

The important thing is not letting the agent quietly mutate production code. The first version should be report-only. Once the reports are useful, let it open a branch for the top deterministic failure.

This pairs well with [long-running agent harnesses](/blog/long-running-agents-need-harnesses), because CI health is exactly where retry limits, tool logs, and receipts matter.

### 3. Deploy Verification Loop

Trigger: after push to `main`, or every 10 minutes while a deploy is in progress.

Job:

- check deployment queue;
- wait for active deploy to finish;
- hit `/api/health`;
- verify changed routes return 200;
- confirm expected image paths or page text are present;
- report live links.

This is the loop I want for content automation. A blog post is not done when the commit lands. It is done when production returns 200 and the page references the expected hero image.

For Codex, this should be a first-class recurring pattern because it is one of the easiest ways to turn agent work into visible shipped work.

### 4. Feedback Clustering Loop

Trigger: every 30 or 60 minutes.

Job:

- pull feedback from GitHub issues, X, YouTube comments, Discord, Linear, or support channels;
- cluster it by product area;
- identify repeated complaints;
- map each cluster to an existing post, guide, tool, or product gap.

Boris mentioned clustering Twitter feedback. That is the exact pattern content teams should steal. It turns the outside world into a recurring editorial signal.

For Developers Digest, this is how "go hard on Codex" becomes a system:

- Codex question appears repeatedly;
- loop clusters it;
- agent checks whether a post already exists;
- if not, a scoped draft gets proposed;
- human picks the angle;
- Codex ships the article and verifies production.

## The Failure Modes

Loops fail differently from one-shot agents.

### They Keep Spending

A one-shot agent fails and stops. A loop fails and comes back in 15 minutes.

That can be good. It can also create the exact cost pattern from the [$400 overnight agent bill](/blog/400-dollar-overnight-bill-agent-finops): retry, inspect, edit, rerun, repeat.

Every loop needs a hard budget:

- max attempts per target;
- max runtime;
- max files changed;
- max tool calls;
- max spend;
- max consecutive failures.

### They Hide Stale Assumptions

A loop can keep acting on yesterday's plan after today's context changes.

Fix: every loop run starts by refreshing the state it depends on. For PRs, fetch latest base and head. For CI, inspect the current run, not the last one cached in context. For deploys, ask production, not local build output.

### They Need Ownership

If five loops can touch the same PR, you do not have automation. You have a race condition.

Assign ownership:

- one loop owns PR rebase;
- one loop owns CI failure triage;
- one loop owns content production verification;
- one loop owns feedback clustering.

Shared read access is fine. Shared write access should be rare.

### They Need Escalation

The best loop is not the one that never asks for help.
The best loop is the one that knows when it has hit a judgment boundary.

Escalate when:

- product behavior is ambiguous;
- security permissions need widening;
- the same failure repeats;
- tests contradict each other;
- a deploy is healthy but the page is wrong;
- the loop would need to touch files outside scope.

This is where agents become useful teammates instead of background scripts with model access.

## What Boris Gets Right

The important insight in the interview is not that Boris runs an absurd number of agents. Most teams should not copy that directly.

The important insight is that he is moving up a level of abstraction. He is not only asking agents to write code. He is asking agents to maintain workflows over time.

That is the same shift Codex needs to own.

Codex should not only answer:

```text
Can you fix this bug?
```

It should answer:

```text
Can you keep this PR moving until it is either merged or blocked by a human decision?
```

That second question is much more valuable.

## The Codex Version

Here is the content and product thesis:

Codex wins when it becomes the loop manager for engineering work.

Not just the model that writes the code. Not just the CLI that edits files. The system that can:

- start from a goal;
- run scoped work;
- verify with browser, tests, and production checks;
- return on a schedule;
- report what changed;
- stop when judgment is required.

That is the difference between agent assistance and agent operations.

The next Codex content cluster should cover:

- PR babysitting loops;
- CI repair loops;
- deploy verification loops;
- feedback clustering loops;
- cost caps for loops;
- loop prompts and YAML contracts;
- GitHub Action implementations;
- when to use Codex automations vs CLI vs SDK.

That cluster is more useful than another generic "what is Codex" post because it meets teams where they are: trying to turn agent output into shipped, reviewed, production-safe work.
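The hard budget and the "same failure seen twice" stop condition from the loop contract can be sketched as a small guard that runs before the agent is allowed to act. This is a minimal illustration under stated assumptions: `LoopBudget`, `LoopState`, and `next_action` are hypothetical names, not part of Codex or any SDK, and a real loop would persist the state between runs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LoopBudget:
    # Mirrors `max_attempts_per_pr` in the contract; the name is illustrative.
    max_attempts_per_target: int = 1


@dataclass
class LoopState:
    # Fix attempts already spent per target (for example, per PR).
    attempts: Dict[str, int] = field(default_factory=dict)
    # Last failure signature seen per target, used to detect repeats.
    last_failure: Dict[str, str] = field(default_factory=dict)


def next_action(state: LoopState, budget: LoopBudget,
                target: str, failure: Optional[str]) -> str:
    """Decide what one loop run may do for one target."""
    if failure is None:
        # Nothing is broken: forget the old failure and just report.
        state.last_failure.pop(target, None)
        return "report_healthy"
    if state.last_failure.get(target) == failure:
        # Same failure seen twice: stop and hand over to a human.
        return "escalate"
    if state.attempts.get(target, 0) >= budget.max_attempts_per_target:
        # Attempt budget is spent even though the failure is new.
        return "escalate"
    state.attempts[target] = state.attempts.get(target, 0) + 1
    state.last_failure[target] = failure
    return "attempt_fix"
```

The design choice worth noting is that escalation is the default whenever a limit is hit: the guard only returns `attempt_fix` when the failure is new and budget remains.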
## The Bigger Take

Boris's loop-heavy workflow is a preview of where agentic coding is going.

The headline is not "engineers will manage thousands of agents." The headline is smaller and more practical:

Recurring engineering work is about to become agent-managed.

The winning teams will not be the ones with the most agents. They will be the ones with the clearest loop contracts.

For Codex, that is the content lane to own: how to design, run, verify, and stop the loops that keep software moving.

## FAQ

### What are agent loops?

Agent loops are recurring AI workflows that inspect state, decide whether action is needed, act within a defined scope, and report results. They are useful for PR babysitting, CI repair, deploy verification, feedback clustering, and other changing-state engineering work.

### How is a loop different from a cron job?

A cron job runs a fixed command on a schedule. An agent loop runs a recurring decision process: inspect the current state, choose an action, apply bounded changes, verify, and escalate if needed.

### How does this apply to Codex?

Codex has the right surfaces for loops: CLI for local work, GitHub Action for repo events, automations for recurring checks, goals for longer-running objectives, and browser verification for production checks. The missing part is a clear loop contract.

### What is the safest Codex loop to start with?

Start with a read-only PR review loop. Have Codex inspect pull requests with a label, summarize CI and review status, and post a concise comment. Add write access only after the signal is consistently useful.

Sources: [Boris Cherny interview on YouTube](https://www.youtube.com/watch?v=SlGRN8jh2RI), [OpenAI Codex CLI docs](https://developers.openai.com/codex/cli), [OpenAI Codex SDK docs](https://developers.openai.com/codex/sdk), [openai/codex-action README](https://github.com/openai/codex-action), [OpenAI Codex changelog](https://developers.openai.com/codex/changelog).

Codex SDK vs CLI vs GitHub Action: Which Surface Should You Build On?

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

Codex used to be easy to place in your head: install the CLI, run it in a repo, review the diff.

That mental model is now too small. OpenAI has split Codex across several surfaces: app, web, IDE extension, CLI, GitHub integration, Slack, automations, and an SDK. The practical question for builders is not "should I use Codex?" It is **where should Codex live in my workflow?**

This is the decision tree I would use:

| Surface | Best job | Main risk |
|---|---|---|
| Codex CLI | Local, scoped engineering tasks | Human prompts stay informal |
| Codex GitHub Action | CI-adjacent review, comments, generated artifacts | Over-permissioned runners |
| Codex SDK | Productized agent features inside your own app | You now own the full UX and control plane |

If you are new to the product, start with the [OpenAI Codex guide](/blog/openai-codex-guide). If you already understand Codex and want the current product direction, read the [April Codex changelog breakdown](/blog/codex-changelog-april-2026). This post is narrower: it is about choosing the right integration surface before you wire Codex into a team workflow.

## The Short Answer

Use the **Codex CLI** when the human is still in the loop and the job starts from a terminal.

Use the **Codex GitHub Action** when the job is triggered by repository events and the output belongs in GitHub: PR comments, review summaries, generated migration notes, failing-test explanations, release checks, or structured artifacts.

Use the **Codex SDK** when Codex is not the product surface but the engine behind your own product: an internal code-mod assistant, a migration dashboard, an app-builder workflow, a customer-facing repo assistant, or a specialized review system with its own UI.

The mistake is trying to make one surface do all three jobs. That is how teams end up with a brittle shell script that should have been an app, or a full SDK integration that should have been a 20-line GitHub Action.
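The short answer above reduces to a small decision function. The sketch below is only an illustration of that rule of thumb: `pick_surface` and its `trigger` labels are hypothetical names, not part of any Codex API.

```python
def pick_surface(trigger: str, app_owns_ux: bool = False) -> str:
    """Map how a task starts to the surface this post recommends.

    `trigger` is a hypothetical label: "terminal" or "github_event".
    """
    if app_owns_ux:
        # Your product owns the UI, state, permissions, or reporting.
        return "sdk"
    if trigger == "github_event":
        # PR opened, label added, CI failed, scheduled workflow, and so on.
        return "github_action"
    if trigger == "terminal":
        # A developer is in the repo and steering the task.
        return "cli"
    raise ValueError(f"unknown trigger: {trigger!r}")
```

Checking `app_owns_ux` first reflects the idea that the SDK is chosen for product requirements, regardless of where an individual task happens to start.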
## Codex CLI: Best for Human-Steered Work

The CLI is still the most direct Codex surface. OpenAI's docs position it as the terminal pairing experience, and the command shape is exactly what you want for local repo work:

```bash
codex exec "Add input validation to the billing webhook and update the tests."
```

The CLI is the right default when:

- the developer is already in the repo;
- local services matter;
- the task needs quick back-and-forth;
- you want to inspect files before approving changes;
- the output should become a normal local diff.

This is where Codex competes directly with [Claude Code](/blog/what-is-claude-code-complete-guide-2026), Cursor agents, and other terminal-native coding tools. Codex's advantage is the OpenAI model stack, sandboxing defaults, and the growing app/CLI ecosystem around approvals, goals, browser verification, and worktrees.

The CLI's weakness is that it inherits human prompt quality. If every task starts as "fix the thing," Codex will produce fuzzy work. The better pattern is to keep a tiny prompt template near your repo:

```text
Goal:
Constraints:
- files/modules in scope
- files/modules out of scope
- command to verify
- expected user-visible behavior
Return:
- summary
- changed files
- tests run
- risks
```

That template is simple, but it converts Codex from "smart terminal" into a repeatable engineering loop. It also sets you up for the other surfaces later.

## Codex GitHub Action: Best for Repo Events

The `openai/codex-action` repo gives teams a way to run Codex inside GitHub Actions while controlling privileges. The README is explicit about the architecture: the action installs the Codex CLI and configures a secure proxy to the Responses API. It also gives you knobs for sandbox mode, model, effort, output schema, output files, working directory, and safety strategy.
This is the right surface when the trigger is already a GitHub event:

- PR opened;
- label added;
- issue assigned;
- nightly scheduled workflow;
- release branch cut;
- dependency update opened;
- failing CI run needs explanation.

The most useful first workflow is not "let Codex rewrite code automatically." Start with review output:

1. Check out the PR.
2. Fetch base and head refs.
3. Run Codex with a prompt constrained to the PR diff.
4. Post the final message as a PR comment.
5. Keep permissions read-only until the workflow earns trust.

This is a better first step because review comments are easy to ignore, easy to compare, and easy to audit. Once the signal is good, you can graduate to generated artifacts or narrow autofix branches.

## The Safety Knob That Matters

The GitHub Action docs include an unusually important input: `safety-strategy`.

The default is `drop-sudo`, which removes sudo access before Codex runs on Linux and macOS runners. There are also `unprivileged-user`, `read-only`, and `unsafe` modes.

That is not a small implementation detail. It is the difference between "agent can inspect this checkout" and "agent is running with broad runner privileges."

For most teams, the starting point should be:

```yaml
permissions:
  contents: read

with:
  sandbox: read-only
  safety-strategy: drop-sudo
```

Then loosen only what the workflow proves it needs.

This is the same security lesson from the [Codex cloud security playbook](/blog/openai-codex-cloud-security-playbook-2026): the agent's usefulness comes from access, and the risk comes from access. Good workflows make that access explicit.

## Codex SDK: Best for Productizing the Loop

The SDK matters when Codex becomes part of your product rather than a tool your developers run.
Examples:

- a migration assistant that opens scoped modernization tasks;
- a customer repo analyzer that produces implementation plans;
- an internal platform that assigns small tasks to agents;
- a code-review product with its own dashboard;
- a teaching app that lets users run Codex against sandbox repos;
- a maintenance workflow that turns errors into proposed fixes.

If the UI, state model, permissions, billing, or reporting belong to your app, the SDK is the right surface. You get to design the control plane. You also have to design the control plane.

That tradeoff is the whole point. With the CLI, OpenAI owns most of the product surface. With the GitHub Action, GitHub owns the event surface. With the SDK, you own the user experience, state transitions, permissions, observability, and failure handling.

Do not pick the SDK because it sounds more serious. Pick it when your workflow has product requirements that the CLI and GitHub Action cannot express.

## A Practical Decision Matrix

Here is the simplest way to decide.

| Question | Pick |
|---|---|
| Does a human start the task from a terminal? | CLI |
| Does a GitHub event start the task? | GitHub Action |
| Does your app need to own the UX? | SDK |
| Is the output a local diff? | CLI |
| Is the output a PR comment or CI artifact? | GitHub Action |
| Is the output a product workflow with users and state? | SDK |
| Do you need a quick proof of concept? | CLI |
| Do you need repeatable repo automation? | GitHub Action |
| Do you need a differentiated product? | SDK |

Most teams should move in this order:

1. CLI for manual proof.
2. GitHub Action for repeatable repo events.
3. SDK only after the workflow has proven value.

That order keeps you from overbuilding.

## The Architecture Pattern

The winning pattern is to keep the **task contract** portable across all three surfaces.

Do not write one prompt for CLI, a different prompt for GitHub Actions, and a third prompt inside your SDK app.
Write one task spec format:

```yaml
goal: "Refactor the billing webhook validation"
scope:
  include:
    - app/api/billing/**
    - lib/billing/**
  exclude:
    - migrations/**
verification:
  commands:
    - pnpm test billing
    - pnpm typecheck
output:
  format:
    - summary
    - changed_files
    - tests_run
    - risks
```

Then adapt the transport:

- CLI reads the task spec from a local file.
- GitHub Action reads it from `.github/codex/review.yml` or a prompt file.
- SDK stores it as structured state in your app.

This is how Codex content compounds. You are not building random prompts. You are designing a reusable task contract that can move from human use to automation to product.

For the larger version of that idea, read [Codex automations for recurring engineering work](/blog/codex-automations-recurring-engineering-work) and [Codex `/goal` vs Claude Managed Outcomes](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences).

## What I Would Build First

If I were adding Codex to a team today, I would not start with the SDK.

I would ship three small things:

1. A repo-level `AGENTS.md` with exact project rules.
2. A `codex-tasks/` folder with reusable task specs.
3. A GitHub Action that runs Codex in read-only mode on PRs and posts concise review comments.

Then I would watch three numbers:

- how often Codex catches real issues before humans do;
- how often humans ignore the comment;
- how often the workflow needs write access.

If the comments are useful, move from read-only review to generated patch branches. If the task specs become durable and reusable, consider the SDK. If developers keep manually running the same task locally, wrap it in the CLI first.

The SDK should be the reward for a proven workflow, not the starting point.

## The Bigger Take

Codex is turning into a multi-surface agent platform. That is good, but it creates a new design problem: teams have to decide which surface owns which job.

The CLI is for developer-steered work.
The GitHub Action is for repo-triggered automation. The SDK is for productized agent workflows.

Use the smallest surface that preserves the control you need. Then keep the task contract portable so the workflow can grow without a rewrite.

That is how you go hard on Codex without turning your engineering process into a pile of disconnected agent experiments.

## FAQ

### Should I start with the Codex SDK?

Usually no. Start with the CLI or GitHub Action unless your app needs to own the user experience, state model, permissions, or reporting. The SDK is best after the workflow has proven value.

### Is openai/codex-action just the CLI in GitHub Actions?

Broadly, yes. The action handles installing the Codex CLI and configuring a secure proxy to the Responses API, then exposes workflow inputs for prompt, model, effort, sandbox, output schema, output file, and safety strategy.

### What is the safest first GitHub Action workflow?

Run Codex in read-only mode on pull requests and post a concise review comment. Keep repository permissions narrow and use the default `drop-sudo` safety strategy on Linux or macOS runners.

### When does the Codex SDK make sense?

Use the SDK when Codex powers your own product or internal platform: migration dashboards, custom review systems, app-builder workflows, sandbox teaching tools, or maintenance agents with their own UI and state.

Sources: [OpenAI Codex CLI docs](https://developers.openai.com/codex/cli), [OpenAI Codex SDK docs](https://developers.openai.com/codex/sdk), [openai/codex-action README](https://github.com/openai/codex-action), [OpenAI Codex changelog](https://developers.openai.com/codex/changelog).
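As a closing sketch of the portable task contract described in this post, the snippet below renders one spec into the prompt template from the CLI section. It assumes the spec has already been loaded into a plain dict (for example from a YAML file); `TASK` and `render_prompt` are hypothetical names, and nothing here calls Codex itself.

```python
# The task spec from this post, loaded into a plain dict.
TASK = {
    "goal": "Refactor the billing webhook validation",
    "scope": {
        "include": ["app/api/billing/**", "lib/billing/**"],
        "exclude": ["migrations/**"],
    },
    "verification": {"commands": ["pnpm test billing", "pnpm typecheck"]},
    "output": {"format": ["summary", "changed_files", "tests_run", "risks"]},
}


def render_prompt(spec: dict) -> str:
    """Turn a task spec into the Goal / Constraints / Return prompt shape."""
    lines = [f"Goal: {spec['goal']}", "Constraints:"]
    lines += [f"- in scope: {p}" for p in spec["scope"]["include"]]
    lines += [f"- out of scope: {p}" for p in spec["scope"].get("exclude", [])]
    lines += [f"- verify with: {c}" for c in spec["verification"]["commands"]]
    lines.append("Return:")
    lines += [f"- {item}" for item in spec["output"]["format"]]
    return "\n".join(lines)


print(render_prompt(TASK))
```

The same rendered text could be passed to a local CLI run, written to a prompt file for a workflow, or stored by an app, which is the sense in which the contract stays portable.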

Free Claude Code Is Really a Model Gateway Bet

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

The viral headline is "use Claude Code for free."

The more interesting pattern is model gateways for coding agents.

The [Free Claude Code repo](https://github.com/Alishahryar1/free-claude-code) describes itself as a drop-in Anthropic-compatible proxy for Claude Code. Its README lists backends including NVIDIA NIM, OpenRouter, DeepSeek, LM Studio, llama.cpp, and Ollama, with per-model routing for Opus, Sonnet, Haiku, and fallback traffic.

That is a bigger idea than a cost hack. It is a control plane between the coding agent and the model market.

## The take

AI coding agents are becoming frontends. Model gateways are becoming infrastructure.

Claude Code has the strongest workflow surface in many developer teams: terminal UX, project memory, tools, MCP, hooks, subagents, and worktree patterns. But some teams want provider flexibility, local routing, cheaper background work, or experiments with open models.

Free Claude Code is one answer to that tension. Keep the agent UX. Swap the model backend.

That overlaps with the argument in [self-hosting Claude Code on your own infra](/blog/self-hosting-claude-code-on-your-own-infra), [Aider vs Claude Code](/blog/aider-vs-claude-code-2026-update), and [Claude Code vs Codex vs Cursor vs OpenCode](/blog/claude-code-vs-codex-vs-cursor-vs-opencode). The coding-agent layer and the model layer are starting to separate.

## Why developers care

Cost is the obvious reason. Long agent runs can burn through premium-model quota fast. If a proxy can route simple edits to a cheaper or local model and reserve frontier models for planning, debugging, and gnarly refactors, the economics change.

But cost is not the only reason.
Provider routing also gives teams:

- local-model paths for sensitive code
- fallback routes during provider outages
- experiments with new coding models before native tools support them
- separate budgets for planning, editing, and review
- one place to log usage and failures

That is why model gateways keep showing up around agent tools. Developers do not only want "the best model." They want the right model for the subtask.

## The security tradeoff

The opposing view is important: a proxy between your coding agent and the model is now in the trust path.

That proxy sees prompts, code context, tool calls, and sometimes secrets if your workflow is sloppy. It can also reshape requests and responses. That is powerful, but it means you should treat any model gateway like developer infrastructure, not a browser extension you installed on a whim.

Before using a project like this on serious code, review:

- where the proxy runs
- what traffic it logs
- how auth tokens are stored
- whether it forwards secrets to third-party providers
- how tool-use and reasoning blocks are translated
- whether tests cover streaming and tool calls

The Free Claude Code README says the proxy normalizes thinking blocks, tool calls, token usage metadata, and provider errors into the shape Claude Code expects. That is useful. It is also exactly the area where subtle bugs can become bad agent behavior.

For more on the operational side, read [agent receipts](/blog/agent-swarms-need-receipts) and [the agent reliability cliff](/blog/the-agent-reliability-cliff).

## The quality tradeoff

The other risk is capability mismatch. Claude Code's UX can make a weaker model feel more capable than it is.

A local model may handle search-and-replace tasks well, then fail on multi-file architecture work. A cheap hosted model may stream quickly, then break tool-call formatting. A fallback route may save a run during an outage, but produce lower-quality patches.

That does not make model gateways bad.
It means routing policy should be explicit:

| Task | Reasonable route |
|---|---|
| formatting, simple edits, docs cleanup | cheap or local model |
| test repair with clear failure output | mid-tier coding model |
| architecture refactor | frontier model |
| security-sensitive repo exploration | local model when quality is enough |
| final review before merge | strongest model plus human review |

The practical question is not "can this run Claude Code for free?" It is "which parts of Claude Code work are safe to route away from the default model?"

## How I would use it

I would not start by routing everything through a free model.

I would start with a low-risk repo and three explicit lanes:

1. **Local lane:** docs, formatting, small mechanical edits.
2. **Budget lane:** first-pass test fixes and simple implementation tasks.
3. **Frontier lane:** planning, architecture, security-sensitive review, and final verification.

Then I would log every run: prompt, model route, task type, tests run, whether the patch merged, and what human review fixed. Without that feedback loop, model routing becomes vibes.

The real opportunity is not "free Claude Code." It is a team-owned gateway that makes coding-agent work measurable, cheaper where possible, and stricter where quality matters.

## Frequently Asked Questions

### What is Free Claude Code?

Free Claude Code is an open-source Anthropic-compatible proxy that lets Claude Code talk to other backends, including NVIDIA NIM, OpenRouter, DeepSeek, LM Studio, llama.cpp, and Ollama.

### Is Free Claude Code actually free?

The repo can route to free or local providers, but "free" depends on the backend you choose. Some routes still require API keys, local hardware, or third-party quota.

### Is a Claude Code proxy safe for work code?

Only if you trust and operate it like infrastructure. Review logging, auth, provider routing, secret handling, and tool-call translation before sending private code through any proxy.
### Who should use a model gateway for coding agents?

Teams that need provider flexibility, lower costs, local-model experiments, or outage fallback paths. If you just want the simplest reliable Claude Code setup, the official path is still easier.
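As a closing sketch, the three lanes and the per-run log described in this post can be written down as explicit policy. Everything here is illustrative: the task-type labels, `pick_lane`, and `RunRecord` are hypothetical names, the default-to-frontier rule for unknown task types is an assumption, and none of it reflects how the Free Claude Code proxy is actually configured.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical task-type labels mapped to the three lanes from this post.
LANES = {
    "docs": "local",
    "formatting": "local",
    "mechanical_edit": "local",
    "test_fix": "budget",
    "simple_implementation": "budget",
    "planning": "frontier",
    "architecture": "frontier",
    "security_review": "frontier",
    "final_verification": "frontier",
}


def pick_lane(task_type: str) -> str:
    """Return the lane for a task; unknown work goes to the strictest lane."""
    return LANES.get(task_type, "frontier")


@dataclass
class RunRecord:
    """One logged run, with the fields the post suggests tracking."""
    prompt: str
    task_type: str
    lane: str
    tests_run: List[str]
    merged: bool
    human_fixes: str


def log_line(record: RunRecord) -> str:
    """Serialize a run as one JSON line for later review."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the policy in a table like this makes a routing change a reviewable diff rather than a habit.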

GPT Image 2 Prompt Libraries Are Becoming Production Infrastructure

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

The GPT Image 2 prompt-library wave looks like another pile of examples.

It is more useful than that.

The [OpenAI image-generation docs](https://developers.openai.com/api/docs/guides/image-generation) frame GPT Image as a programmable generation and editing system, with the Image API for single prompts and the Responses API for conversational image workflows. The current prompt-library repos are the missing practical layer on top: reusable recipes for layout, lighting, materials, product shots, diagrams, UI screens, and visual consistency.

One current example, [awesome-gpt-image-2](https://github.com/freestylefly/awesome-gpt-image-2), describes itself as a prompt-as-code library with hundreds of reverse-engineered cases and industrial templates. The README says its goal is to turn scattered examples into structured protocols that agents and automation workflows can reuse.

That is the right framing.

## The take

Image prompts are becoming build artifacts.

For a blog, product page, app directory, course hero, or social campaign, the prompt is not just creative prose. It is the spec that tells the image model what the asset should do, what it should avoid, what layout constraints matter, and how it fits the rest of the system.

That is why a prompt library can be more valuable than another gallery. A gallery helps you admire outputs. A library helps you reproduce a direction.

This is the same shift we are seeing with [agent skills](/blog/agent-skills-production-checklist), [skills as an agent operating system](/blog/skills-are-the-new-agent-operating-system), and [DESIGN.md for AI agents](/blog/design-md-for-ai-agents). The useful artifact is the reusable instruction layer.

## Why developers should care

Developers are getting pulled into visual production.

Landing pages need hero images. Docs need diagrams. Product launches need social cards. Internal tools need empty states and onboarding graphics.
The image model can generate the pixels, but the team still needs repeatability.

OpenAI's docs call out practical controls such as size, quality, output format, compression, and the distinction between the Image API and Responses API. They also note limitations around text rendering, consistency, and composition control. Those limitations are exactly why structured prompts matter.

A production prompt should capture:

- asset type
- subject
- scene and backdrop
- composition
- lighting
- color constraints
- material details
- exact text rules
- avoid list
- validation criteria

That is not artistic overkill. It is how you keep a site from turning into 30 unrelated stock images.

## The opposing view

The fair criticism is that prompt libraries can become cargo cults.

Copying a viral prompt rarely gives you a production asset. It gives you someone else's taste, aspect ratio, subject, and hidden assumptions. Worse, many prompt repos collect examples without source clarity, commercial-use clarity, or a real test harness.

That matters. If you are shipping public brand assets, you need to know what is original, what was inspired by community content, and what rights or licenses apply.

The awesome-gpt-image-2 README includes a disclaimer that it organizes public prompts and examples for learning and research, and tells users to obtain authorization from original rights holders before commercial use. That is the correct caution. Prompt libraries are reference material, not automatic rights clearance.

## What a useful prompt library looks like

The best libraries will not just store prompts. They will store decisions.

For each asset pattern, I want:

1. A short use case label.
2. A structured prompt schema.
3. Example outputs.
4. Known failure modes.
5. Model and quality settings.
6. Post-processing notes.
7. Brand constraints.
8. A checklist for accepting or rejecting the output.

That is why I like prompt-as-code framing.
It turns "make it look better" into a repeatable workflow an agent can run.

For example, a Developers Digest blog hero prompt should say: cream background, tactile cards, black outlines, no readable generated text, no logos, no gradients, no emojis, restrained accent colors, and a concrete abstraction of the topic. That is a reusable visual contract, not a moodboard.

## How to use GPT Image 2 prompts in a real content workflow

Start with one asset family, not the whole brand.

For a technical blog, I would make four prompt templates:

- article hero
- comparison table visual
- workflow diagram
- social preview

Then I would add a lightweight eval pass:

- Does it explain the topic visually?
- Does it match the brand system?
- Is there any readable fake text?
- Is the composition usable at mobile crop?
- Is the file size acceptable?
- Does the post reference the asset from a permanent repo path?

That last one is boring, but critical. A generated image under a temporary path is not a published asset. Move it into the project, compress it, reference it in frontmatter, and verify the route.

This is where prompt libraries become production infrastructure. They do not replace taste. They make taste easier to repeat.

## Frequently Asked Questions

### What is GPT Image 2?

GPT Image 2 is OpenAI's current image-generation model available through image-generation workflows in the OpenAI API. The docs describe generation, editing, quality, size, format, and cost controls.

### Why are GPT Image 2 prompt libraries trending?

Because strong image outputs are easier to repeat when prompts are structured into reusable schemas instead of one-off prose. Developers want templates for UI, infographics, product shots, brand visuals, and content assets.

### Can I use community prompt-library images commercially?

Do not assume that. Treat community prompt libraries as references, then check the repo license, disclaimers, original sources, and rights for any examples you reuse.
### How should teams store image prompts?

Store them near the content or design system, with the final asset path, model settings, known failure modes, and acceptance checklist. The prompt is part of the production artifact.
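As a closing sketch of the prompt-as-code idea, a structured prompt can live in the repo as a small typed record and be rendered to text on demand. The schema below covers only part of the checklist from this post; `ImagePromptSpec` and its field names are hypothetical, and the snippet does not call any image API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImagePromptSpec:
    """Hypothetical schema covering part of the production-prompt checklist."""
    asset_type: str
    subject: str
    composition: str
    color_constraints: str
    text_rules: str
    avoid: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the spec into a labeled prompt, one field per line."""
        lines = [
            f"Asset type: {self.asset_type}",
            f"Subject: {self.subject}",
            f"Composition: {self.composition}",
            f"Color constraints: {self.color_constraints}",
            f"Text rules: {self.text_rules}",
        ]
        if self.avoid:
            lines.append("Avoid: " + ", ".join(self.avoid))
        return "\n".join(lines)


# The blog hero contract described earlier in this post, as a spec.
hero = ImagePromptSpec(
    asset_type="article hero",
    subject="a concrete abstraction of the topic",
    composition="tactile cards with black outlines",
    color_constraints="cream background, restrained accent colors",
    text_rules="no readable generated text",
    avoid=["logos", "gradients", "emojis"],
)
print(hero.render())
```

Storing the spec rather than the rendered string means a brand change touches one field, and every template that shares it picks up the change.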

Karpathy's Loopy Era Is the Best Way to Understand Codex

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

Andrej Karpathy's "loopy era" interview with No Priors is one of the better explanations of the current AI coding shift because it does not frame the change as better autocomplete.

The useful claim is sharper: the agent is now assumed. The new skill is designing loops that keep useful work moving without a human prompting every next step.

That is exactly the lens I would use for Codex.

If you still think of [OpenAI Codex](/blog/openai-codex-guide) as "a model that writes code," you will underuse it. The more interesting version is Codex as a control surface for agentic engineering: task specs, repo rules, parallel sessions, objective checks, budgets, escalation, and production verification.

This also connects cleanly to Boris Cherny's loop-heavy workflow. Boris's `/loop` framing is about recurring engineering chores. Karpathy's loopy era is the larger principle underneath it: remove yourself from the prompt-next-step loop when the task has enough structure to run.

For the existing Codex cluster, read this alongside [Codex loops and Boris Cherny](/blog/codex-loops-boris-cherny-agent-routines), [Codex `/goal` vs Claude managed outcomes](/blog/codex-goal-vs-claude-managed-outcomes-practical-differences), and [Codex SDK vs CLI vs GitHub Action](/blog/codex-sdk-vs-cli-github-action). They are all pointing at the same workflow shape.

## The Karpathy Takeaway

In the No Priors interview, Karpathy describes a personal workflow that moved from mostly hand-written code to mostly agent delegation. The important part is not the percentage. It is the unit of work.

He is not talking about:

- writing one function faster;
- accepting a completion;
- asking a chatbot for a snippet;
- replacing an engineer with one giant prompt.

He is talking about moving in **macro actions** over a repository. One agent researches. Another writes code. Another plans. Another explores a separate implementation path. The human steers, reviews, and designs the system around the agents.
That is the jump from "vibe coding" to agentic engineering. The developer is less like a typist and more like an operator of parallel technical loops.

This is also why [AI coding tool comparisons](/blog/ai-coding-tools-comparison-matrix-2026) that only score code generation miss the next decision point. The question is not just which model writes the best React component. It is which environment lets you safely run more useful loops.

## AutoResearch Is the Cleanest Example

Karpathy's AutoResearch example is so useful because it has the ingredients that make loops work:

```text
objective + metric + boundary + worker loop + result review
```

He describes setting up a research loop where agents try experiments, evaluate objective metrics, and continue without waiting for him to inspect every intermediate result. The goal is to maximize useful token throughput while removing the human as the bottleneck.

That sounds abstract until you map it to software:

| AutoResearch primitive | Software engineering version |
|---|---|
| Objective | Improve this benchmark, fix this failing path, reduce this latency |
| Metric | Test pass rate, benchmark score, bundle size, route 200, typecheck |
| Boundary | Files in scope, commands allowed, time budget, permission model |
| Worker loop | Codex task, GitHub Action, CLI session, automation |
| Result review | PR diff, logs, eval report, deploy check, human approval |

This is why Codex is interesting right now. It already lives close to the software loop. It can read repo instructions, edit files, run commands, review diffs, and report what changed. With the [Codex GitHub Action](/blog/codex-sdk-vs-cli-github-action), the loop can also be attached to pull request events. With [Codex automations](/blog/codex-automations-recurring-engineering-work), the same pattern can become recurring work instead of one-off delegation.

The point is not that Codex magically solves engineering.
The point is that Codex is one of the more natural places to formalize the loop.

## The Loop Contract Matters More Than the Prompt

The weak version of agentic engineering is:

```text
Make the app better.
```

The stronger version is:

```yaml
goal: "Reduce checkout route cold-start time by 20 percent"
scope:
  include:
    - app/checkout/**
    - lib/payments/**
  exclude:
    - migrations/**
    - auth/**
metric:
  command: "pnpm bench checkout"
  success: "p95 improves by at least 20 percent and tests pass"
budget:
  max_runtime_minutes: 40
  max_files_changed: 8
  max_attempts: 2
stop:
  - metric_cannot_be_reproduced
  - same_failure_twice
  - needs_product_decision
report:
  include:
    - changed_files
    - commands_run
    - before_after_metric
    - remaining_risks
```

That contract is the practical translation of Karpathy's loopy era into Codex work.

It gives the agent enough room to continue. It gives the human enough structure to review. It gives the workflow a stopping point. Most importantly, it makes the loop portable. The same contract can start in the [Codex CLI](/blog/openai-codex-guide), move into GitHub Actions, and eventually become a productized workflow through an SDK.

This is the real content lane for Codex: not "here is a clever prompt," but "here is the smallest reliable loop contract for a real engineering job."

## Where Codex Fits

Codex has three especially useful roles in this loopy model.

### 1. The Local Loop

The local loop is still human-steered. You run Codex from a repo, give it a narrow target, inspect the diff, and decide what happens next.

This is where Codex competes with [Claude Code](/blog/what-is-claude-code-complete-guide-2026), Aider, Cursor agents, and other terminal or IDE coding tools. It is also where the loop contract can stay lightweight:

```text
Fix the failing tests in lib/billing.
Only touch lib/billing and tests/billing.
Run pnpm test billing and pnpm typecheck.
Stop after one implementation path if the failure is ambiguous.
```

The local loop is best for high-context work where the developer is actively supervising. It is not the highest-leverage loop, but it is the safest place to learn how Codex behaves in your repo.

### 2. The GitHub Loop

The GitHub loop is event-driven. A PR opens. A label is added. CI fails. A nightly schedule fires. Codex comments, reviews, drafts a patch, or produces an artifact.

This is where the [Codex GitHub Action](/blog/codex-sdk-vs-cli-github-action) becomes more than a convenience wrapper. GitHub already has the state machine:

- issues;
- pull requests;
- checks;
- labels;
- branches;
- comments;
- required reviews.

Codex can sit inside that state machine if the permissions are narrow and the output is inspectable. Start read-only. Let it summarize failures, review diffs, and propose next actions. Only widen write access after the comments are consistently useful.

That is the difference between agent automation and an overpowered CI job.

### 3. The Recurring Loop

The recurring loop is the closest to Karpathy's point. It does not wait for a human prompt. It wakes up, refreshes state, checks whether useful work exists, acts inside a boundary, and reports.

Examples:

- watch PRs with a `codex-watch` label;
- retry one deterministic CI failure;
- verify deploys after `main` changes;
- cluster repeated product feedback;
- scan docs for drift against the current API;
- create a daily content brief from new Codex changelog items.

This is also where the [long-running agent harness](/blog/long-running-agents-need-harnesses) matters. A recurring loop without receipts is just an expensive cron job with model access. A recurring loop with logs, budgets, stop conditions, and escalation is an engineering system.

## The Opposing View Is Right About One Thing

The skeptical view is not "agents are useless." The better skeptical view is that many loops are fake autonomy.

Karpathy says the caveat clearly: this works best when the objective metric is easy to evaluate.
If you cannot evaluate the result, you cannot safely automate the loop. That is a major limitation. Codex loops are good at: - fixing deterministic tests; - reducing benchmark numbers; - producing structured reports; - rebasing and summarizing; - verifying route health; - checking docs against source files; - comparing before and after outputs. Codex loops are weaker at: - ambiguous product taste; - visual design without screenshots and rubrics; - architecture decisions with hidden business constraints; - security work without narrow permissions; - content judgment without an editorial bar; - anything where "better" is not measurable enough. This is why [debugging agent workflows](/blog/debug-ai-agent-workflows) and [agent architecture](/blog/agent-architecture-multi-step-ai-workflows) are not side topics. They are the infrastructure around the loop. Once the agent can continue without you, failures become harder to see and more expensive to ignore. ## The Better Codex Workflow If I were setting up a Codex-heavy repo after watching the Karpathy interview, I would do five things. ### 1. Write `AGENTS.md` Like a Runtime Contract Do not treat repo instructions as polite preferences. Treat them as the first layer of the loop contract. Include: - commands to verify changes; - files that are off-limits; - deploy verification rules; - content style constraints; - security boundaries; - escalation triggers; - what "done" means. For a deeper version of that, see the [Codex macOS certificate runbook](/blog/openai-codex-macos-certificate-update-runbook). The useful part is not the certificate topic. It is the operational shape: exact commands, exact checks, and exact recovery paths. ### 2. 
Keep a Folder of Task Specs Create a `codex-tasks/` folder with reusable loop contracts: ```text codex-tasks/ fix-ci.yml verify-deploy.yml review-pr.yml update-blog-seo.yml refresh-docs.yml ``` Each file should name the trigger, scope, verification command, budget, stop conditions, and report format. This is how you move from improvisation to repeatability. It also makes Codex easier to compare against Claude Code or Cursor because you are comparing the same task contract, not vibes. ### 3. Split Parallel Work by Ownership Karpathy's macro-action point only works when tasks do not collide. Good split: - agent 1 owns `app/billing/**`; - agent 2 owns `tests/billing/**`; - agent 3 owns documentation; - agent 4 reviews the final diff. Bad split: - four agents all "make billing better." Parallel agents multiply throughput only when ownership is explicit. Otherwise they multiply merge conflicts and review load. ### 4. Make Metrics Boring The best loop metrics are not fancy: - `pnpm typecheck` passes; - `pnpm test billing` passes; - route returns `200`; - benchmark improves by a named threshold; - generated page includes the expected hero image; - no files outside scope changed; - no new lint errors; - production health count increments. This is why Codex is a good fit for engineering loops. Software has many cheap objective checks. Use them before asking the model to judge its own work. ### 5. Escalate Early The loop should stop sooner than your ego wants. Stop when: - the same failure appears twice; - the fix requires a product decision; - the agent wants broader permissions; - the task crosses ownership boundaries; - the metric is noisy; - the diff grows beyond reviewable size; - production behavior disagrees with local output. This is the part many agent demos skip. The future is not an agent that never asks for help. The future is an agent that knows exactly when it has crossed from execution into judgment. 
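The stop conditions above can be sketched as a small check that a harness runs after each attempt. This is a minimal illustration, not a Codex feature: `LoopState`, `escalation_reasons`, the ownership prefixes, and the 400-line diff threshold are all hypothetical names and values chosen for the example. It covers only the mechanically checkable conditions (repeated failure, ownership boundary, diff size); product decisions and noisy metrics still need a human.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Hypothetical per-run record kept by a harness; not part of any Codex API."""
    failures: list[str] = field(default_factory=list)       # one failure signature per attempt
    changed_files: list[str] = field(default_factory=list)  # paths touched so far
    diff_lines: int = 0                                      # total changed lines


def escalation_reasons(state, owned_prefixes, max_diff_lines=400):
    """Return the reasons to stop and hand off; an empty list means keep going."""
    reasons = []

    # "The same failure appears twice."
    repeated = [sig for sig, count in Counter(state.failures).items() if count >= 2]
    if repeated:
        reasons.append(f"repeated failure: {repeated[0]}")

    # "The task crosses ownership boundaries."
    outside = [path for path in state.changed_files
               if not any(path.startswith(prefix) for prefix in owned_prefixes)]
    if outside:
        reasons.append(f"outside ownership: {outside[0]}")

    # "The diff grows beyond reviewable size" (threshold is an assumption).
    if state.diff_lines > max_diff_lines:
        reasons.append(f"diff too large: {state.diff_lines} lines")

    return reasons


state = LoopState(
    failures=["test_invoice_total", "test_invoice_total"],
    changed_files=["app/billing/invoice.py", "app/auth/session.py"],
    diff_lines=120,
)
print(escalation_reasons(state, owned_prefixes=("app/billing/", "tests/billing/")))
```

Returning a list of reasons instead of a boolean keeps the report path useful: the escalation message can say exactly which rule tripped.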
## The Takeaway

Karpathy's loopy era is not a slogan about agents getting smarter. It is a workflow claim:

> The leverage comes from arranging work so agents can continue against metrics and boundaries while humans stop being the next-step bottleneck.

Codex makes that concrete for software teams. The best Codex workflows will not be the longest prompts. They will be the cleanest loops:

- one objective;
- one owner;
- one metric;
- one boundary;
- one budget;
- one report path;
- one escalation rule.

That is how Codex moves from "AI coding tool" to agentic engineering infrastructure.

## Sources

- No Priors, "Skill Issue: Andrej Karpathy on Code Agents, AutoResearch, and the Loopy Era of AI": https://www.youtube.com/watch?v=kwSVtQ7dziU
- Karpathy's AutoResearch repository: https://github.com/karpathy/auto-research
- OpenAI Codex docs: https://developers.openai.com/codex/
- OpenAI Codex CLI slash commands: https://developers.openai.com/codex/cli/slash-commands/
- OpenAI Codex changelog: https://developers.openai.com/codex/changelog/
- `openai/codex-action` repository: https://github.com/openai/codex-action

OpenAI's Codex Mac Certificate Deadline Is a Runbook Test

Developers Digest — Tue, 05 May 2026 00:00:00 GMT

OpenAI's latest macOS security notice looks, at first glance, like a normal "please update your app" banner. It is more useful than that. The May 8, 2026 deadline is a practical runbook test for every team that now treats AI coding tools as part of the developer workstation.

The short version: OpenAI says a GitHub Actions workflow used in its macOS app-signing process downloaded and executed a malicious Axios package during the March 31, 2026 supply-chain incident. The workflow had access to certificate and notarization material used for ChatGPT Desktop, Codex, Codex CLI, and Atlas. OpenAI says it found no evidence that user data, internal systems, intellectual property, published software, or the certificate itself were compromised, but it is rotating the certificate anyway.

That is the right boring move. Treat the material as exposed, rotate it, ship new builds, and force the old line to die on a calendar date.

For Developers Digest readers, the interesting part is not "Axios was compromised." The interesting part is what this says about [Codex](/blog/openai-codex-guide), [Claude Code](/blog/what-is-claude-code-complete-guide-2026), Cursor, Copilot, and every other agent that now sits close to source code, terminals, secrets, browsers, and internal repos. The agent is not just an app. It is a privileged developer surface.

## What Actually Changes on May 8

OpenAI says macOS users need to update by **May 8, 2026**. After that date, older macOS builds signed with the previous certificate will no longer receive updates or support and may stop functioning.

The first versions signed with the updated certificate are:

| Product | Earliest supported version |
|---|---:|
| ChatGPT Desktop | `1.2026.051` |
| Codex App | `26.406.40811` |
| Codex CLI | `0.119.0` |
| Atlas | `1.2026.84.2` |

This does not affect iOS, Android, Linux, Windows, or web versions according to OpenAI. It is specifically about macOS app signing and notarization.
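For a fleet check, the minimum versions in the table can be compared mechanically. The sketch below is an illustration under two assumptions: that each version string is a dot-separated list of integers, and that comparing those integers left to right matches OpenAI's ordering. `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical helper names, and how you collect the installed version from each Mac is deliberately left out.

```python
# Minimum versions copied from the table above.
MINIMUM_VERSIONS = {
    "ChatGPT Desktop": "1.2026.051",
    "Codex App": "26.406.40811",
    "Codex CLI": "0.119.0",
    "Atlas": "1.2026.84.2",
}


def parse_version(text):
    """Turn '0.119.0' into (0, 119, 0). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(product, installed):
    """True if the installed version is at or above the listed minimum."""
    return parse_version(installed) >= parse_version(MINIMUM_VERSIONS[product])


# Example inventory rows: (product, installed version) pairs gathered elsewhere.
inventory = [("Codex CLI", "0.118.2"), ("Codex CLI", "0.119.0"), ("Atlas", "1.2026.84.2")]
for product, installed in inventory:
    status = "ok" if meets_minimum(product, installed) else "UPDATE REQUIRED"
    print(f"{product} {installed}: {status}")
```

Comparing integer tuples instead of raw strings avoids the classic mistake where `"0.99.0"` sorts above `"0.119.0"` alphabetically.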
The right user action is simple: update through the in-app updater or official OpenAI download pages. Do not install OpenAI, ChatGPT, Codex, or Atlas builds from email links, ads, file-sharing links, random mirrors, or third-party download pages. The right team action is slightly broader: treat this as a drill. ## Why This Matters for AI Coding Teams Classic developer-tool updates were annoying but usually narrow. Your editor updated. Your terminal updated. Your package manager updated. You checked that it still launched and moved on. AI coding tools have a larger blast radius. A local agent can read files, edit code, run shell commands, call MCP servers, use browser sessions, and sometimes touch cloud runners. That does not make the tools bad. It means they deserve the same operational treatment you would give any privileged engineering surface. If you already read [the Codex April changelog](/blog/codex-changelog-april-2026), this direction is obvious. Codex is becoming more stateful, more integrated, and more capable. That is useful. It also means update hygiene becomes part of agent governance. The mistake is turning this into panic. OpenAI's notice is careful: it says there is no evidence of user-data compromise, software alteration, or misuse of the signing material. The better take is operational: this is what mature incident response around an AI developer tool should look like, and it gives teams a concrete checklist to copy. ## The Runbook I Would Use For solo developers, update the apps and move on. For teams, write the one-page runbook now. 1. Inventory every OpenAI macOS surface in use: ChatGPT Desktop, Codex App, Codex CLI, Atlas. 2. Confirm every Mac is on or above the minimum versions OpenAI listed. 3. Document the official update paths your team accepts. 4. Block installs from third-party mirrors, email links, shared zip files, and ad-driven download pages. 5. Add AI coding tools to your normal endpoint-management inventory. 6. 
Capture which repos, MCP servers, terminal permissions, and cloud accounts each tool can reach. 7. Keep one "known-good rollback" note, but do not pin to builds that will lose signing support. The key is step 6. Version numbers are table stakes. Permission mapping is the real maturity test. If a developer's Codex app can reach production repos, GitHub tokens, local `.env` files, and browser sessions, you need to know that before the next incident. This is the same lesson behind [the agent reliability cliff](/blog/the-agent-reliability-cliff): serious agent workflows fail at the surrounding control loop before they fail at model intelligence. ## The Opposing View: Is This Just Update Theater? There is a reasonable skeptical take here: OpenAI says it found no evidence that the certificate was exfiltrated or misused. It also says published software was not modified. So why make everyone update? Because signing material is not a normal secret. The whole point of a signing certificate is that the operating system and the user can trust that an app came from the named developer. If there is credible exposure in the signing pipeline, the clean answer is rotation. Waiting for public misuse would be worse. The more interesting critique is that this still depends on users and teams doing the boring part. A company can rotate certificates, publish clean builds, and warn users. If a team has no inventory of AI desktop tools, no version baseline, and no trusted download policy, it still has a gap. That gap is not specific to OpenAI. It applies to every agent tool that ships fast and sits inside the developer loop. ## What Tool Builders Should Copy OpenAI's post is useful because it names concrete remediation steps, not just vague reassurance. 
The good pattern: - explain the affected workflow; - state which products are in scope; - give exact minimum versions; - name the cutoff date; - say what was and was not found; - give safe download paths; - explain why revocation is staged instead of immediate. That is the template AI developer-tool companies should use. The best security post is not the one that sounds most dramatic. It is the one that lets a team close tickets without guessing. This is also where [skills as an agent operating system](/blog/skills-are-the-new-agent-operating-system) becomes more than a productivity pattern. If your organization uses agent skills, MCP configs, hooks, or local runbooks, the security update process should live there too. The next time a certificate rotation, OAuth scope change, or plugin revocation lands, your agent should know the team's exact update checklist. ## A Practical Codex Check For Codex CLI users on macOS, the minimum supported version after the certificate rotation is `0.119.0`. If your team installs Codex through the official docs, the check should be simple: ```bash codex --version ``` Then update through the official route documented by OpenAI. If your team wraps Codex in a dotfiles repo, bootstrap script, MDM profile, or devcontainer setup, update that source of truth too. Otherwise the same outdated version comes back the next time someone rebuilds a laptop. For the Codex desktop app, open the app and use the built-in update path or download from OpenAI's official page. Treat random "fixed" installers as hostile by default. ## The Bigger Take The AI coding stack is crossing a line from "tools developers try" into "infrastructure developers depend on." That changes the maintenance model. The useful response is not to avoid Codex, Claude Code, or local agents. 
The useful response is to operate them like real engineering systems:

- pinned install sources;
- known version baselines;
- permission maps;
- endpoint inventory;
- update deadlines;
- post-incident verification.

That is less exciting than a new model benchmark. It matters more.

The May 8 Codex and ChatGPT macOS deadline is a small event if you update one laptop. It is a larger signal if you run an engineering team: AI developer tools now deserve the same boring operational discipline as package managers, CI credentials, browser profiles, and deploy keys.

## FAQ

### Do I need to update Codex CLI on macOS?

Yes. OpenAI lists `Codex CLI 0.119.0` as the earliest version signed with the updated certificate. After May 8, 2026, older macOS builds signed with the previous certificate will no longer receive updates or support and may stop functioning.

### Was OpenAI user data compromised?

OpenAI says it found no evidence that user data, products, internal systems, intellectual property, published software, or passwords/API keys were compromised. The certificate rotation is a precaution after exposure in the macOS app-signing workflow.

### Does this affect Windows or Linux Codex users?

OpenAI says the issue only affects macOS apps. It does not affect iOS, Android, Linux, Windows, or web versions.

### Where should I download Codex updates?

Use the in-app updater or official OpenAI download/docs links. Avoid installers sent through email, messages, ads, file-sharing links, mirrors, or third-party download sites.

Sources: [OpenAI's Axios developer tool compromise response](https://openai.com/index/axios-developer-tool-compromise/), [Axios coverage of the OpenAI macOS signing incident](https://www.axios.com/2026/04/11/openai-axios-mac-cyberattack), [OpenAI Codex CLI docs](https://developers.openai.com/codex/cli).

Agent Skills Need Exit Criteria, Not More Prompt Lore

Developers Digest — Mon, 04 May 2026 00:00:00 GMT

The interesting part of [Addy Osmani's `agent-skills` repo](https://github.com/addyosmani/agent-skills) is not that it gives AI coding agents more markdown to read. The interesting part is that it treats senior engineering judgment as a reusable artifact. That is why the repo moved fast through the AI developer crowd. It packages production concerns like testing, accessibility, performance, code review, debugging, and migration work into skill files that can be dropped into tools such as Claude Code, Cursor, and Antigravity. The repo description is blunt: "Production-grade engineering skills for AI coding agents." That framing matters because the next phase of AI coding is not "write a better prompt." It is "make the agent inherit the team's definition of done." ## The take Skills are only useful when they contain exit criteria. A weak skill says: > Write better React components. A useful skill says: > Before finishing, run the local checks, verify the responsive states, preserve existing user edits, avoid new dependencies unless justified, and report what was not verified. That second version is closer to a production checklist than a prompt. It gives the agent a way to stop, inspect its own work, and produce a handoff that a human can review. That is the same reason [Claude Code skills are becoming a real workflow layer](/blog/skills-are-how-agents-learn-the-job), and why [skills beat prompts for coding agents](/blog/why-skills-beat-prompts-for-coding-agents-2026). The durable part is not the prose. It is the repeated operating procedure. ## Why developers are paying attention The repo is useful because it meets agents at the exact place they fail: judgment transfer. Most AI coding failures are not syntax failures anymore. They are taste, scope, verification, and integration failures. The agent can write the component, but it may not know the local design system. It can add tests, but it may test the wrong behavior. 
It can refactor the module, but it may erase an edge case the team learned the hard way. A skill can encode those constraints in a way that survives across sessions. That is different from a one-off instruction. A one-off prompt is a sticky note. A skill is closer to a small operating manual. ## The opposing view The fair criticism is that skills can become another pile of stale docs. If every team ships a 4,000-line skill pack, agents will skim, misapply, or ignore the important bits. Worse, bloated skills can make the agent sound more confident without making it more correct. That is the trap. Skills should not become a second codebase of aspirational process. Good skills are short, specific, and tied to observable behavior: - Which files or commands matter - What the agent must check before finishing - What it should never change casually - What evidence it should return - When it should stop and ask That is also why [long-running agents need harnesses, not hope](/blog/long-running-agents-need-harnesses). The skill is the instruction layer. The harness is the runtime layer. You need both if the work matters. ## What to copy from the repo The repo is best treated as a menu, not a template. Do not copy every skill into your project. Start with the recurring failures you already see: 1. Agents change too much. 2. Agents forget verification. 3. Agents ignore design constraints. 4. Agents lose context between sessions. 5. Agents produce vague final reports. Then write one skill per repeated failure. For example, a frontend repo does not need a generic "build nice UI" skill. It needs a design-system skill that says which tokens, components, breakpoints, and visual checks count as done. That pairs well with a project-level design contract like [`DESIGN.md`](https://github.com/google-labs-code/design.md), which gives agents a persistent way to understand a visual identity. For backend work, the useful skill is usually not "write APIs." 
It is "when changing this endpoint, update the schema, migration, tests, docs, and client types in the same change." ## How I would use it I would start with three production skills: **Review receipt skill.** Every agent change must report files changed, commands run, commands not run, and risks left open. This is the human review surface. **Scope discipline skill.** The agent must preserve unrelated local changes, avoid broad refactors, and explain why any new abstraction exists. **Verification ladder skill.** The agent starts with cheap checks, escalates to build or browser QA when the change touches user-facing behavior, and reports the exact result. Those three skills solve more real problems than a giant library of framework-specific tips. They also compose with [Claude Code subagents](/blog/claude-code-sub-agents), [multi-agent coordination](/blog/how-to-coordinate-multiple-ai-agents), and [agent replays](/blog/agent-replays-with-tracetrail). When multiple agents are working at once, the skill is how you make their handoffs consistent. ## The practical bottom line Agent skills are becoming the new team playbook. The best ones do not teach the model to code. The model already knows enough about code. They teach the model how your team decides a change is finished. That is the shift Addy's repo makes visible. The winning teams will not have the longest prompts. They will have the clearest operating rules, the smallest reusable skills, and the strongest verification habits. Sources: [addyosmani/agent-skills](https://github.com/addyosmani/agent-skills), [google-labs-code/design.md](https://github.com/google-labs-code/design.md), [Claude Code skills docs](https://docs.anthropic.com/en/docs/claude-code/skills). ## Frequently Asked Questions ### What are agent skills for AI coding tools? Agent skills are reusable markdown files that teach AI coding assistants like Claude Code and Cursor how to approach specific types of work. 
Unlike one-off prompts, skills persist across sessions and encode team-specific constraints, verification steps, and exit criteria. They turn senior engineering judgment into a repeatable artifact that agents can reference whenever they tackle similar tasks. ### What is the difference between a skill and a prompt? A prompt is a single instruction for one task. A skill is a reusable operating procedure that loads automatically when relevant work arises. Prompts are like sticky notes - used once and discarded. Skills are like a small operating manual that the agent consults every time it handles a specific category of work. Skills survive across sessions and apply consistently. ### What makes Addy Osmani's agent-skills repo useful? The repo packages production engineering concerns - testing, accessibility, performance, code review, debugging, and migration - into skill files ready for Claude Code, Cursor, and Antigravity. The value is not the prose itself but the exit criteria embedded in each skill. They define what "done" means for each task type, which is exactly where agents fail without guidance. ### How many skills should a project have? Start small. One skill per repeated failure pattern is the right ratio. A giant library of framework-specific tips will bloat context and make agents skim or misapply the important bits. Focus on the three to five recurring problems your team actually sees: agents changing too much, skipping verification, ignoring design constraints, losing context, or producing vague reports. ### What should a good agent skill contain? A useful skill is short, specific, and tied to observable behavior. It should include which files or commands matter, what the agent must check before finishing, what it should never change casually, what evidence it should return, and when it should stop and ask. Exit criteria are the core - without them, the skill is just more prose. ### Can I use skills with Claude Code and Cursor? Yes. 
Both tools support skill files in markdown format. Claude Code reads skills from a designated directory and auto-loads them based on trigger conditions. Cursor supports similar files through its rules system. The format is nearly identical, so skills written for one tool often work in the other with minimal changes. ### How do skills differ from CLAUDE.md or Cursor Rules? CLAUDE.md and Cursor Rules are project-level configuration that applies to everything in the repo. Skills are task-specific instructions that load only when relevant. Think of CLAUDE.md as "how we work here" and skills as "how to do this specific type of work." Both are useful, and they compose together. ### Do skills replace human code review? No. Skills make agent output more reviewable by ensuring consistent verification steps and handoff reports. The agent produces evidence - files changed, commands run, checks passed, risks noted - that a human can audit efficiently. Skills shift the review from "did the agent write correct code" to "did the agent follow the team's definition of done."

GitHub Copilot Agent Metrics Are the Real Product Update

Developers Digest — Mon, 04 May 2026 00:00:00 GMT

GitHub Copilot's most important recent agent update is not a better demo. It is measurement. That sounds boring, but it is the thing most teams need before they can trust cloud coding agents with real work. A coding agent that opens a pull request is interesting. A coding agent that shows up in adoption metrics, session logs, validation checks, and review workflows is much more useful. For the broader Copilot platform story, read [GitHub Copilot Coding Agent and CLI: Why GitHub Is Back in the Agent Race](/blog/github-copilot-coding-agent-cli-2026). This piece is about the operational layer underneath it. ## The take Agent adoption will be managed through metrics, not vibes. GitHub has been adding Copilot cloud agent fields to its usage reporting. The [April 23 changelog](https://github.blog/changelog/2026-04-23-copilot-cloud-agent-fields-added-to-usage-metrics) added a `used_copilot_cloud_agent` field to user-level reports. The [April 10 changelog](https://github.blog/changelog/2026-04-10-copilot-usage-metrics-now-aggregate-copilot-cloud-agent-active-user-counts/) added aggregate cloud-agent active user counts. Earlier, GitHub said [Copilot metrics was generally available](https://github.blog/changelog/2026-02-27-copilot-metrics-is-now-generally-available/), including reporting across completions, chat, and agent features. That is the real maturity signal. Autocomplete can be adopted informally. Cloud agents cannot. Once an agent is opening branches, spending compute, running checks, and asking humans to review its work, leadership will ask different questions: - Who is using it? - Which repos are using it? - How many agent-authored changes become accepted changes? - How much review time does it create? - Which workflows save time, and which just move work into PR review? If those questions are not answerable, the agent becomes a novelty tool instead of an engineering system. ## Why this matters now GitHub is also moving Copilot toward usage-based economics. 
The company said [Copilot is moving to usage-based billing](https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/) because the product has changed from simple assistance into longer, multi-step agent workflows.

That is a fair technical point. A quick code completion and a long cloud-agent run do not cost the same to serve.

It is also where developer skepticism is strongest. In Copilot communities, the recurring complaint is not only "this costs more." It is "I do not understand what I am spending, why the metric changed, or whether the agent output was worth it."

That is the pricing problem every AI coding tool is walking into. The unit of value is not the prompt. It is the accepted change.

This is why [AI coding tools pricing](/blog/ai-coding-tools-pricing-q2-2026), [agent receipts](/blog/agent-swarms-need-receipts), and [parallel agent merge discipline](/blog/parallel-coding-agents-merge-discipline) belong in the same conversation. Billing only feels reasonable when the work is measurable.

## What teams should measure

The obvious metric is active users. That is useful, but incomplete.

For coding agents, teams need a stronger scorecard:

**Agent sessions started.** How often do developers delegate work instead of editing manually?

**PRs opened.** How many sessions make it to a reviewable branch or pull request?

**PRs merged.** How many agent-created changes become production code?

**Review cycles.** How many rounds does the agent need before the PR is acceptable?

**Checks passed.** Did tests, type checks, code scanning, and required checks pass before human review?

**Human correction cost.** Did the reviewer accept, request small changes, or rewrite the agent output?

**Task type.** Does the agent work better for docs, tests, dependency upgrades, bug fixes, or feature work?

GitHub's metrics API gives teams a better starting point, but teams still need to connect usage to outcomes.
Agent usage without merge quality is just activity tracking. ## The opposing view The strongest opposing view is that metrics can create the wrong incentives. That is true. If a company celebrates "agent PRs opened," developers may delegate too much vague work. If managers track "AI-generated lines," agents may produce bigger diffs instead of better ones. If cost dashboards punish experimentation too early, developers may stop trying the workflows that would eventually pay off. The answer is not fewer metrics. The answer is better metrics. The useful score is not agent output volume. It is reviewable, merged, low-regret change. That is why an agent dashboard should pair usage with quality. A team should be able to see that Copilot cloud agent was active in a repo, but also whether the resulting work passed required checks, respected branch protection, and survived code review. ## Session visibility is part of trust GitHub's [Copilot coding agent docs](https://docs.github.com/en/copilot/using-github-copilot/coding-agent/about-assigning-tasks-to-copilot) emphasize session logs, branch protections, required checks, and security validation. The details matter because agent work has to be reviewable. If a developer cannot inspect what the agent tried, which files it touched, which checks it ran, and why it made a choice, the PR becomes harder to trust. This is the same pattern behind [Claude Code subagents](/blog/claude-code-sub-agents), [Codex managed agents](/blog/openai-codex-managed-agents-aws-2026), and [long-running agent harnesses](/blog/long-running-agents-need-harnesses). Autonomy is only useful when the system produces enough evidence for humans to evaluate it. For Copilot, GitHub has a natural advantage: the evidence already has a home. Issues define the task. Branches isolate the work. Pull requests expose the diff. Actions run checks. Reviews capture the decision. Metrics report adoption. That is the workflow graph most engineering teams already understand. 
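The scorecard from "What teams should measure" can be sketched as a small aggregation over session records. This is a rough illustration only: `AgentSession` is a hypothetical record you would assemble from your own PR and CI data, not a shape returned by GitHub's metrics API, and the field names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AgentSession:
    """Hypothetical per-session record joined from your own PR and CI data."""
    opened_pr: bool      # did the session produce a pull request?
    merged: bool         # was that pull request merged?
    checks_passed: bool  # did required checks pass before human review?
    review_cycles: int   # rounds of review before the final decision


def scorecard(sessions):
    """Pair usage (sessions, PRs) with quality (merges, checks, review cycles)."""
    total = len(sessions)
    prs = [s for s in sessions if s.opened_pr]
    merged = [s for s in prs if s.merged]
    return {
        "sessions": total,
        "pr_rate": len(prs) / total if total else 0.0,
        "merge_rate": len(merged) / len(prs) if prs else 0.0,
        "checks_pass_rate": sum(s.checks_passed for s in prs) / len(prs) if prs else 0.0,
        "avg_review_cycles": (
            sum(s.review_cycles for s in merged) / len(merged) if merged else 0.0
        ),
    }


sessions = [
    AgentSession(opened_pr=True, merged=True, checks_passed=True, review_cycles=1),
    AgentSession(opened_pr=True, merged=False, checks_passed=False, review_cycles=3),
    AgentSession(opened_pr=False, merged=False, checks_passed=False, review_cycles=0),
    AgentSession(opened_pr=True, merged=True, checks_passed=True, review_cycles=3),
]
print(scorecard(sessions))
```

The rates are computed against the previous stage (PRs per session, merges per PR) so that a high session count cannot hide a low acceptance rate.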
## The practical bottom line GitHub Copilot's cloud agent will not win only by writing more code. It will win if teams can answer a simple question: did this agent produce accepted work at a cost and review burden we can defend? That means metrics matter. Session logs matter. Validation matters. Small PRs matter. Review quality matters. The next phase of AI coding is not just better agents. It is better accounting for what agents actually do. Sources: [GitHub Copilot cloud agent fields in usage metrics](https://github.blog/changelog/2026-04-23-copilot-cloud-agent-fields-added-to-usage-metrics), [cloud agent active user counts](https://github.blog/changelog/2026-04-10-copilot-usage-metrics-now-aggregate-copilot-cloud-agent-active-user-counts/), [Copilot metrics GA](https://github.blog/changelog/2026-02-27-copilot-metrics-is-now-generally-available/), [GitHub Copilot usage metrics docs](https://docs.github.com/en/copilot/reference/copilot-usage-metrics/copilot-usage-metrics), [about Copilot coding agent](https://docs.github.com/en/copilot/using-github-copilot/coding-agent/about-assigning-tasks-to-copilot), [Copilot usage-based billing announcement](https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing/).

Google Skills Shows the Next Agent Playbook

Developers Digest — Mon, 04 May 2026 00:00:00 GMT

[Google's `google/skills` repo](https://github.com/google/skills) is easy to misread as another examples directory. It is more interesting than that. The repo describes itself as "Agent Skills for Google products and technologies." That sounds narrow, but the pattern is broad: product teams are starting to ship instructions for agents, not just docs for humans. That is a meaningful shift for developer tools. ## The take The best docs for AI agents will look less like articles and more like executable playbooks. Traditional docs answer a human question: "How do I use this product?" Agent skills answer a different question: "When you are asked to do this task inside a real repo, what should you inspect, change, verify, and report?" That distinction matters. Agents do not fail only because they lack information. They fail because they lack local procedure. ## Why this is timely The skill trend is bigger than one repo. Developers are experimenting with [Claude Code skills](/blog/what-are-claude-code-skills-beginner-guide), [Karpathy-style CLAUDE.md rule sets](/blog/karpathy-claude-md-skills-menu), and production skill packs like [Addy Osmani's `agent-skills`](https://github.com/addyosmani/agent-skills). Google joining the pattern is a signal that product-specific agent enablement is becoming normal. That is different from the old docs model. Old model: - Human reads docs - Human translates docs into repo changes - Agent helps with the code New model: - Agent reads a task-specific skill - Agent follows the product workflow - Human reviews the result and evidence The second model is much closer to how teams already work with internal runbooks. ## What makes product skills useful Product skills are useful when they reduce ambiguity at the point of action. A generic agent already knows that tests exist. 
A good product skill tells it which setup command matters, which config file is canonical, which migration command is safe, which dashboard is source of truth, and which result proves the change worked. That is the missing bridge between documentation and implementation. It also helps explain why [MCP servers are useful but not enough](/blog/clis-over-mcps). Tools give an agent capabilities. Skills tell it when and how to use them. ## The opposing view There is a real downside: vendor skills can turn into product marketing disguised as implementation guidance. If a skill only says "use our product for everything," it is not a skill. It is a sales page. Developers should be skeptical of any agent instruction that hides tradeoffs, skips verification, or routes every problem to one vendor. The useful version is more disciplined: - Start from the user's existing stack - Prefer official setup steps - Show the minimal integration path - Include known limits - Verify the result locally - Link to the source docs That is also why comparison content should stay fair. If you are choosing between AI coding tools, the practical question is still the one covered in [the AI coding tools comparison matrix](/blog/ai-coding-tools-comparison-matrix-2026): which tool fits the workflow, budget, and risk profile? ## What developer tool companies should do Every developer tool company should ship a small agent playbook. Not a 50-page guide. Not a pile of generic prompts. A repo of focused skills that answer common implementation tasks: 1. Install the SDK. 2. Add auth. 3. Create a database migration. 4. Wire the CI check. 5. Debug the three most common errors. 6. Verify production configuration. Each skill should include the exact files, commands, source links, and stop conditions. That would make docs more useful for both humans and agents. Humans get a concise checklist. Agents get a bounded procedure. ## What teams should copy Teams should copy the shape, not the content. 
Create product-specific skills for your own internal systems:

- How to add a new route in this app
- How to update billing safely
- How to migrate data without breaking analytics
- How to run release checks
- How to debug the deployment platform

That is how skills become a compounding asset. Every painful bug becomes a shorter future runbook.

The important part is to keep the skill small enough that an agent will actually use it. If the skill cannot fit in a quick scan, it probably belongs in docs with a short skill pointing to the relevant section.

## The practical bottom line

Google's skills repo is not just another AI coding artifact. It is a preview of a docs format that treats agents as first-class users.

The docs page explains what is possible. The skill tells the agent how to act.

That is where developer education is heading: fewer vague prompts, more product-aware procedures, and tighter verification loops.

Sources: [google/skills](https://github.com/google/skills), [addyosmani/agent-skills](https://github.com/addyosmani/agent-skills), [Claude Code skills docs](https://docs.anthropic.com/en/docs/claude-code/skills), [google-labs-code/design.md](https://github.com/google-labs-code/design.md).
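To make the "exact files, commands, source links, and stop conditions" shape concrete, here is a minimal Python sketch of a checklist that flags an incomplete skill. The `ProductSkill` class, its field names, and the example path and command are all hypothetical illustrations. This is not the format `google/skills` or any other repo actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class ProductSkill:
    """Hypothetical shape for one focused skill (an assumption, not a real format)."""
    task: str
    files: list[str] = field(default_factory=list)            # exact files to inspect or change
    commands: list[str] = field(default_factory=list)         # commands to run, in order
    sources: list[str] = field(default_factory=list)          # links back to the source docs
    stop_conditions: list[str] = field(default_factory=list)  # when to stop and report


def missing_parts(skill: ProductSkill) -> list[str]:
    """Return the names of required sections that are still empty."""
    required = ("files", "commands", "sources", "stop_conditions")
    return [name for name in required if not getattr(skill, name)]


# Placeholder values for illustration only.
draft = ProductSkill(
    task="Wire the CI check",
    files=[".ci/checks.yml"],
    commands=["make ci-check"],
)
print(missing_parts(draft))  # ['sources', 'stop_conditions']
```

A check like this could run in review before a skill is merged, so a skill without source links or stop conditions never reaches an agent.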

Parallel Coding Agents Need Merge Discipline

Developers Digest — Mon, 04 May 2026 00:00:00 GMT

Parallel coding agents are having their moment because the promise is obvious: split the work, run several agents at once, and get a bigger change done faster.

That promise is real. It is also incomplete.

The hard part is not spawning agents. The hard part is merging their work without creating a review mess.

## The take

Parallel agents need merge discipline before they need more autonomy.

A single coding agent can already create a noisy diff. Three agents can create three noisy diffs that overlap in surprising ways. If each agent touches shared files, changes conventions, or invents a slightly different abstraction, the human reviewer becomes the integration layer.

That is not leverage. That is deferred coordination cost.

This is why [Claude Code subagents](/blog/claude-code-sub-agents), [parallel development workflows](/blog/building-24-apps-with-ai-agents), and [multi-agent orchestration](/blog/how-to-coordinate-multiple-ai-agents) need a boring operational rule: every agent should have a clear write boundary and an expected receipt.

## What good parallel work looks like

Good parallel agent work has three properties.

First, the tasks are independent. One agent updates docs, another writes tests, another implements a clearly bounded module. Their file ownership does not overlap unless the overlap is explicit.

Second, each agent returns evidence. Not "done." Evidence. Files changed, commands run, checks passed, checks skipped, and risks left open.

Third, the final merge has a single owner. Someone or something has to reconcile style, naming, shared assumptions, and test coverage.

Without those three pieces, parallelism just makes uncertainty arrive faster.

## The opposing view

The strongest opposing view is that agents should simply learn to coordinate with each other.

That might happen over time. We already see tools moving toward richer agent teams, background workers, and autonomous task loops.
OpenAI has been pushing managed agent workflows through Codex, while Anthropic has made subagents and skills part of the Claude Code operating model.

But for real repos today, coordination by vibes is not enough.

Agents still miss implicit boundaries. They can both decide to "clean up" the same helper. They can both update the same README. They can both create similar utilities in different folders. The result might compile, but the architecture gets fuzzier.

That is why [agent swarms need receipts](/blog/agent-swarms-need-receipts). Parallelism is only useful when the review surface stays legible.

## A practical task split

Here is a task split that usually works:

**Agent A: implementation.** Owns the feature files only. It should not update broad docs or shared infrastructure unless assigned.

**Agent B: tests and fixtures.** Owns tests, mocks, and focused regression coverage. It should not rewrite the implementation unless blocked.

**Agent C: docs and examples.** Owns docs, examples, changelog notes, or content updates. It should not change runtime code.

**Main agent: integration.** Pulls the pieces together, resolves conflicts, runs checks, and writes the final report.

That structure is slower than pure chaos, but faster than cleanup.

It also maps well to the agent skill trend. A test agent should have a testing skill. A docs agent should have a documentation skill. An integration agent should have a review receipt skill. That is how [agent skills become production checklists](/blog/agent-skills-production-checklist), not just reusable prompts.

## What to avoid

Avoid assigning several agents to "improve the codebase."

That sounds productive, but it creates overlapping intent. Every agent can justify touching any file. The resulting merge has no obvious owner.

Also avoid asking multiple agents to independently solve the same implementation problem unless you are explicitly doing option generation. Option generation is useful, but it is a different workflow.
You compare approaches, pick one, and discard the others. You do not merge all of them.

The best parallel tasks are narrow and named:

- Add route tests for this endpoint
- Update this component to use the existing design token
- Write migration docs for this exact API
- Find dead links in this content folder
- Implement this one adapter behind this interface

Specificity is the cheapest coordination mechanism.

## The practical bottom line

Parallel coding agents are useful when they reduce elapsed time without expanding review cost.

That requires task ownership, receipts, and a final integration pass. It also requires the humility to keep some work single-threaded when the next step depends on one hard decision.

The future is not one agent doing everything. It is small teams of agents working under clear contracts.

The team that wins will not be the one that spawns the most agents. It will be the one that makes each agent's work easiest to trust, review, and merge.

Sources: [Claude Code subagents docs](https://docs.anthropic.com/en/docs/claude-code/sub-agents), [Claude Code skills docs](https://docs.anthropic.com/en/docs/claude-code/skills), [OpenAI Codex docs](https://developers.openai.com/codex/), [addyosmani/agent-skills](https://github.com/addyosmani/agent-skills).
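The receipt and write-boundary ideas above can be sketched in a few lines of Python. The `Receipt` class, its field names, and the agent and file names are hypothetical. No tool named in this post is known to emit this structure, so treat it as one possible shape for the integration owner's first check.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Receipt:
    """Hypothetical receipt an agent returns instead of just saying "done"."""
    agent: str
    files_changed: list[str]
    commands_run: list[str] = field(default_factory=list)
    checks_passed: list[str] = field(default_factory=list)
    checks_skipped: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)


def overlapping_files(receipts: list[Receipt]) -> dict[str, list[str]]:
    """Map each file touched by more than one agent to the agents that touched it."""
    owners: dict[str, list[str]] = defaultdict(list)
    for receipt in receipts:
        for path in set(receipt.files_changed):  # de-duplicate within one receipt
            owners[path].append(receipt.agent)
    return {path: sorted(agents) for path, agents in owners.items() if len(agents) > 1}


# Illustrative receipts; all names are placeholders.
receipts = [
    Receipt("agent-a", ["src/feature.py", "README.md"], checks_passed=["unit"]),
    Receipt("agent-b", ["tests/test_feature.py"], checks_passed=["unit"]),
    Receipt("agent-c", ["docs/feature.md", "README.md"], checks_skipped=["link-check"]),
]
print(overlapping_files(receipts))  # {'README.md': ['agent-a', 'agent-c']}
```

Any non-empty result is a write-boundary violation that the integration owner resolves before merging, rather than something a reviewer discovers in the combined diff.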

Karpathy CLAUDE.md Skills: Use the Viral Rules as a Menu, Not a Template

Developers Digest — Sun, 03 May 2026 00:00:00 GMT

The most interesting developer-tool signal this week is not a new model. It is a plain instruction file.

The GitHub repo [forrestchang/andrej-karpathy-skills](https://github.com/forrestchang/andrej-karpathy-skills) packages a `CLAUDE.md`, Cursor rule, and Claude Code plugin around four coding-agent principles inspired by Andrej Karpathy's public comments on LLM coding failure modes.

That is wild for a repo whose core artifact is basically a behavioral checklist.

It is also the right kind of wild. The repo went viral because teams have discovered the same thing at the same time: coding agents do not only need better models. They need better operating constraints.

If you are new to this layer, start with [how to write a CLAUDE.md file](/blog/how-to-write-claudemd-the-complete-guide) and [why skills beat prompts for coding agents](/blog/why-skills-beat-prompts-for-coding-agents-2026). This post is the next step: how to interpret a viral rules file without letting it become another bloated prompt dump.

## What the repo actually says

The useful part is short. The `CLAUDE.md` file centers on four principles:

- Think before coding.
- Keep the implementation simple.
- Make surgical changes.
- Define success criteria and verify them.

The repo's README maps those principles to common agent failures: hidden assumptions, overbuilt abstractions, unrelated edits, and vague "make it work" loops. The full file is only about 65 lines, which is part of why it spread. Developers can understand it, copy it, and argue with it in one sitting.

That last part matters. Good agent instructions are not sacred text. They are editable work rules.

## Why this hit a nerve

Most agent failures are not dramatic model failures. They are small workflow failures repeated quickly.

The agent silently picks one interpretation of an ambiguous task. It writes a flexible abstraction for a one-off requirement. It "cleans up" adjacent code and creates a regression.
It says something is done because the diff exists, not because the behavior was verified.

That is why a repo like this can become a trending event. It names the boring failure modes that show up in real diffs.

The same issue shows up in [the agent reliability cliff](/blog/the-agent-reliability-cliff): the demo looks fine, then the production loop collapses because assumptions, tests, and ownership were never made explicit.

The opposing view is worth taking seriously too. A [Reddit thread](https://www.reddit.com/r/ClaudeAI/comments/1stfoo7/why_does_this_claudemd_file_have_so_many_stars/) around the repo had a good skeptical read: the star count may say more about copy-pasteability and Karpathy name value than measured capability. Another commenter framed it as a menu rather than a template, which is the right mental model.

Stars prove demand. They do not prove effectiveness in your repo.

## The mistake is copying it unchanged

The fastest way to misuse this repo is to append the whole thing to every project and call it done.

Generic rules are helpful until they conflict with local reality. "Surgical changes" means something different in a package migration, a design-system cleanup, a schema refactor, and a one-line bug fix. "Ask when uncertain" is right for product ambiguity, but it is wasteful when the codebase already has a clear pattern the agent can inspect.

This is where [Claude Code skills](/blog/what-are-claude-code-skills-beginner-guide) and `CLAUDE.md` should work together:

- `CLAUDE.md` should hold the global rules every session needs.
- Skills should hold procedures that only matter for specific tasks.
- Repo docs should point to real files, commands, tests, and failure modes.
- Hooks should enforce what prose instructions cannot reliably enforce.

For the hook layer, see [Claude Code hooks explained](/blog/claude-code-hooks-explained). The short version: if a rule can be checked automatically, do not leave it as vibes in a markdown file.
## Turn viral rules into local rules

Here is the practical translation.

Do not write:

```md
Be simple.
```

Write:

```md
Do not add a new abstraction unless it removes duplication in at least two call sites or matches an existing pattern in this repo.
```

Do not write:

```md
Make surgical changes.
```

Write:

```md
When editing an existing route, only touch the files required for that route unless a failing test proves shared code must change.
```

Do not write:

```md
Verify your work.
```

Write:

```md
For UI changes, run the app locally, capture desktop and mobile screenshots, and mention any viewport you did not verify.
```

That is the difference between a motivational instruction and an operating constraint. The first one sounds correct. The second one changes behavior.

## The best agents need fewer generic words

The lesson from this repo is not that every project needs a bigger `CLAUDE.md`.

It is the opposite. The best instruction files get shorter at the top and more specific at the leaves.

The global file should contain durable judgment:

- how much autonomy the agent has
- when to ask questions
- how to handle unrelated changes
- what must be verified before stopping
- which design, content, or security rules are non-negotiable

Then task-specific skills should take over. A blog-writing skill, migration skill, review skill, release skill, or browser-QA skill can include the exact workflow for that slice without forcing every session to carry every rule.

That is also why [agent teams and subagents](/blog/claude-code-agent-teams-subagents-2026) are becoming more important. The main agent should not need every procedure in its context. It should know when to delegate to a specialist with the right local instructions.

## My take

`andrej-karpathy-skills` is valuable because it is small, legible, and pointed at real failure modes.

It is not valuable because 108k people starred it. It is not valuable because a famous name is adjacent to the idea.
It is valuable because it gives developers a shared vocabulary for the behavior they already wanted from coding agents: think first, stay simple, touch less, verify more. The best move is to steal the shape, not the file. Copy the four categories into your own repo. Delete anything that does not apply. Add concrete commands, file paths, test gates, and design constraints. Split repeated procedures into skills. Put mechanical checks into hooks. Then review the agent's diff and ask the only question that matters: Did these instructions make the work smaller, clearer, and easier to verify? If yes, keep them. If not, rewrite them. Agent instructions are code-adjacent infrastructure now. Treat them like something that has to earn its place in the repo.
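As one example of putting a mechanical check behind a prose rule, here is a minimal Python sketch of the "only touch the files required for that route" constraint. It is not the Claude Code hooks API. The function name, glob patterns, and file paths are hypothetical, and it assumes the list of changed files is gathered separately (for instance from `git diff --name-only`).

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical write boundary for a task that edits one route.
allowed = ["app/routes/billing/*", "tests/routes/billing/*"]
changed = [
    "app/routes/billing/handler.py",
    "app/shared/utils.py",
    "tests/routes/billing/test_handler.py",
]
print(out_of_scope(changed, allowed))  # ['app/shared/utils.py']
```

A wrapper script could exit non-zero whenever the returned list is non-empty, which turns the markdown rule into a gate the agent cannot talk its way past. Note that `fnmatchcase` lets `*` match across `/`, so each pattern here covers nested files under the route directory too.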

The 98% Context Reduction Pattern