
TL;DR
Anthropic's open-source vulnerability harness shows where AI security work is going: reproducible exploit loops, separate verification agents, and patch receipts.
| Source | Description |
|---|---|
| Anthropic - Using LLMs to secure source code | Anthropic's May 27, 2026 guide to threat modeling, sandboxing, discovery, verification, triage, and patching with Claude |
| Anthropic defending-code-reference-harness | Open-source reference implementation with Claude Code skills and an autonomous vulnerability-discovery pipeline |
| Claude vulnerability detection agent cookbook | Claude Agent SDK walkthrough for a lighter recon, scan, triage, report, and patch loop |
| Harness security notes | Project documentation for sandbox assumptions and safety boundaries |
| HN discussion | Hacker News discussion that pushed on false positives, reproducibility, and operational risk |
Anthropic's defending-code-reference-harness hit the Hacker News front page today, and the interesting part is not that Claude can look for bugs. We already crossed that line.
The interesting part is the shape of the workflow.
The repo turns AI security work into a loop: build a threat model, run discovery agents, verify findings in a fresh environment, dedupe them, write exploitability reports, generate patches, and then confirm that the original proof of concept no longer succeeds against the patched code. Anthropic's accompanying post says the bottleneck has moved: discovery is now straightforward to parallelize, while verification, triage, and patching are where teams get stuck.
That is the useful developer takeaway.
If your AI security process is still "ask a model to review the repo for vulnerabilities," you are building a better checklist. The next step is a reproducible harness.
Most AI security demos start with a prompt:
Review this codebase for security issues.
That can work for a first pass. It can also produce confident noise. The model lacks the system's threat model, deployment assumptions, dependency boundaries, reachable entry points, and historical bug shapes. It may flag a scary-looking path that is not attacker controlled. It may miss a boring path that is internet-facing in production.
Anthropic's guide makes a sharper point: the false positive is often not a model reasoning failure. It is a threat-model failure.
That matches what developers see in normal code review too. A reviewer who does not know which inputs are trusted, which services are internal, which queues are adversarial, and which legacy components are intentionally isolated will give shallow advice. A model does the same thing at higher speed.
The better unit is not a prompt. It is a repo-local security harness:
THREAT_MODEL.md that names assets, entry points, trust boundaries, and out-of-scope cases.

That is why this belongs next to agent containment, AI security triage, and permissions, logs, and rollback for coding agents. The core question is not "can the model find something?" It is "can the system prove what happened?"
One subtle design choice in Anthropic's loop is that discovery and verification are separate jobs.
That matters.
If you ask one agent to both find and dismiss issues, it can self-censor. It may drop weird leads too early because they look unlikely. It may overfit to the obvious vulnerability classes in your prompt. It may spend too much of the context budget justifying why something is safe instead of exploring attack paths.
Discovery should optimize for recall. Let it fan out. Let it partition the codebase by attack surface. Let it produce candidate findings with proof attempts, confidence, and missing evidence. Let it be creative within a sandbox.
Verification should optimize for precision. It should take the candidate, rebuild a fresh environment, reproduce the proof, confirm reachability, check whether a compensating control exists, and label the finding accordingly.
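The split between the two jobs can be sketched as two record types and one labeling function. This is a minimal illustration, not the harness's actual data model: `Candidate`, `Evidence`, `Verdict`, and `label` are hypothetical names, and the verdict categories are an assumption drawn from the checks described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"
    MITIGATED = "mitigated"            # reproduces, but a compensating control blocks it
    UNREACHABLE = "unreachable"        # reproduces, but no attacker-controlled path reaches it
    NOT_REPRODUCED = "not_reproduced"  # the proof did not work in a fresh environment


@dataclass
class Candidate:
    """Discovery output (hypothetical shape): tuned for recall, so evidence may be thin."""
    title: str
    entry_point: str
    confidence: float
    missing_evidence: list[str] = field(default_factory=list)


@dataclass
class Evidence:
    """What a separate verifier observed after rebuilding the target (hypothetical shape)."""
    reproduced: bool
    reachable: bool
    compensating_control: bool


def label(evidence: Evidence) -> Verdict:
    """Verification is tuned for precision: every check must pass before confirming."""
    if not evidence.reproduced:
        return Verdict.NOT_REPRODUCED
    if not evidence.reachable:
        return Verdict.UNREACHABLE
    if evidence.compensating_control:
        return Verdict.MITIGATED
    return Verdict.CONFIRMED
```

Note that the discovery agent's `confidence` never feeds into `label`. The verdict depends only on what the verifier observed, which is the point of keeping the roles apart.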
This is the same engineering pattern behind good agent swarms. The fastest agent is not always the one that merges code. The useful system has specialized roles:
Security work makes that separation non-optional. A false positive can waste maintainer time. A false negative can leave a real bug alive. A sloppy patch can make the system worse.
The repo's autonomous pipeline runs target code inside gVisor-isolated containers and restricts egress to the model API. The README is explicit that the autonomous reference pipeline refuses to run outside that sandbox unless overridden.
That is not just a safety footnote. It is the product boundary.
A vulnerability-discovery agent is supposed to do adversarial work. It may craft malformed inputs, run binaries, trigger crashes, write exploit scripts, inspect logs, and generate patches. If you run that in the same shell that has your cloud credentials, SSH keys, package registry token, and browser session, you have built a security tool with an insecure runtime.
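One cheap guard is a preflight check that refuses to start when obvious credentials are visible in the process environment. The sketch below is an assumption about what such a check could look like, not how the reference harness implements its own sandbox check. The `preflight` and `exposed_secrets` names and the `allow_unsafe` flag are hypothetical, and the variable list is illustrative and far from complete.

```python
import os

# Illustrative, non-exhaustive list of environment variables that commonly hold
# credentials. Extend it for your own cloud, registry, and CI setup.
SENSITIVE_ENV_VARS = (
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "GITHUB_TOKEN",
    "NPM_TOKEN",
    "SSH_AUTH_SOCK",
)


def exposed_secrets(env):
    """Return the sorted names of sensitive variables that are set and non-empty."""
    return sorted(name for name in SENSITIVE_ENV_VARS if env.get(name))


def preflight(env, allow_unsafe=False):
    """Raise before any agent runs if credentials are visible, unless overridden."""
    leaked = exposed_secrets(env)
    if leaked and not allow_unsafe:
        raise RuntimeError(
            "refusing to start security agent; credentials visible: " + ", ".join(leaked)
        )
    return leaked


if __name__ == "__main__":
    preflight(dict(os.environ))
    print("environment looks clean; starting agent")
```

A check like this complements isolation and does not replace it. It only catches the careless case where the agent inherits a developer's shell.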
This is where the conversation connects to the lethal trifecta problem: private data, untrusted content, and external communication should not casually share the same agent session.
For security agents, the safer default is boring:
The HN skepticism around the harness is healthy because this is exactly where tools tend to oversell. Sandboxes are not magic. Containers can be misconfigured. Build environments can drift from production. Agents can find bugs in the harness instead of the target. A proof of concept can be real and still low severity in the actual deployment.
That does not weaken the harness argument. It strengthens it. If the environment matters this much, then the environment has to be part of the security artifact.
The strongest part of Anthropic's writeup is the insistence on threat modeling before scanning.
Security teams already know this. AI tooling makes it easier to skip, because the model can produce a long list of plausible issues without asking enough domain questions. That feels productive until the triage meeting starts.
The better pattern is to treat the threat model as executable agent context.
Not executable as in "run this file." Executable as in: the harness actually consumes it. Discovery agents read it before they search. Triage agents use it to calibrate severity. Patch agents use it to avoid fixing non-issues while missing the real trust boundary.
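A minimal sketch of "the harness consumes it" might look like the following. It assumes the threat model is a Markdown file with `## ` section headings named after the items mentioned earlier (assets, entry points, trust boundaries, out of scope). The `ROLE_SECTIONS` mapping and both function names are hypothetical, not part of Anthropic's repo.

```python
def parse_sections(markdown_text):
    """Split a Markdown document into {heading: body} using '## ' headings."""
    sections = {}
    current = None
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


# Hypothetical mapping from agent role to the sections that role must read.
ROLE_SECTIONS = {
    "discovery": ["Assets", "Entry points", "Trust boundaries", "Out of scope"],
    "triage": ["Assets", "Trust boundaries", "Out of scope"],
    "patch": ["Trust boundaries", "Out of scope"],
}


def context_for(role, sections):
    """Build the context block for one role, failing loudly on missing sections."""
    wanted = ROLE_SECTIONS[role]
    missing = [name for name in wanted if not sections.get(name)]
    if missing:
        raise ValueError(f"threat model is missing sections for {role}: {missing}")
    return "\n\n".join(f"## {name}\n{sections[name]}" for name in wanted)
```

Failing on a missing or empty section is deliberate in this sketch: an agent that starts scanning without trust boundaries is the shallow-reviewer case described earlier.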
A good agent-readable threat model should answer:
This is not bureaucracy. It is context engineering for security work.
Teams already write README.md, AGENTS.md, CLAUDE.md, design docs, architecture diagrams, runbooks, and test fixtures so coding agents can operate with less guessing. THREAT_MODEL.md belongs in that family.
The patch step is where AI security demos often get too optimistic.
Generating a fix is not enough. A security patch has to prove four things:
That proof should travel with the pull request.
Call it a patch receipt:
Finding: heap overflow in parser X
Threat model path: untrusted file import
Proof: crash input repros 3/3 before patch
Verification: fresh container reproduced crash
Patch: bounds check before allocation
Regression: crash input no longer crashes
Variant search: fresh agent found no adjacent parser bypass in one run
Human review: owner approved severity and scope
The exact fields will vary. The habit should not.
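To make the habit enforceable, the receipt can be a small typed record that a CI step checks before a security pull request merges. This is a sketch under that assumption: the field names mirror the example above, while `PatchReceipt`, `missing_fields`, and `render` are hypothetical names rather than anything the harness defines.

```python
from dataclasses import dataclass, fields


@dataclass
class PatchReceipt:
    """One receipt per security patch; an empty field means 'not yet proven'."""
    finding: str = ""
    threat_model_path: str = ""
    proof: str = ""
    verification: str = ""
    patch: str = ""
    regression: str = ""
    variant_search: str = ""
    human_review: str = ""


def missing_fields(receipt):
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(receipt) if not getattr(receipt, f.name).strip()]


def render(receipt):
    """Format the receipt as 'Label: value' lines for a pull request description."""
    lines = []
    for f in fields(receipt):
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {getattr(receipt, f.name)}")
    return "\n".join(lines)
```

A merge gate could then fail whenever `missing_fields(receipt)` is non-empty, which keeps "human review" from silently becoming optional.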
This is the same receipt culture needed for parallel coding agents and long-running agent harnesses. When machines can generate more work than humans can inspect line by line, the review packet becomes part of the work product.
You do not need Anthropic's full reference harness to improve your workflow.
Start smaller:
Add a THREAT_MODEL.md to one service.

Then widen the loop.
Add more target components. Add a separate verification pass. Add a dedupe step. Add a regression search after patches. Add periodic scanning when high-risk code changes.
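The dedupe step is the easiest of these to prototype. The sketch below assumes each finding is a dict with `file`, `function`, and `bug_class` keys, which is a hypothetical shape, and it collapses findings that agree on all three. Real duplicates often differ in wording or location, so treat exact-match fingerprinting as a first filter rather than a complete answer.

```python
import hashlib

# Hypothetical keys that identify "the same bug" for a first-pass dedupe.
FINGERPRINT_KEYS = ("file", "function", "bug_class")


def fingerprint(finding):
    """Hash the normalized identifying fields into a short, stable ID."""
    key = "|".join(finding[k].strip().lower() for k in FINGERPRINT_KEYS)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def dedupe(findings):
    """Keep the first finding seen for each fingerprint, preserving input order."""
    unique = {}
    for finding in findings:
        unique.setdefault(fingerprint(finding), finding)
    return list(unique.values())
```

Running this before verification keeps parallel discovery agents from filling the verification queue with the same candidate reported several times.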
The mistake is trying to jump straight from manual security review to autonomous security operation. The useful path is boring and incremental: one harness, one bug class, one proof format, one owner loop.
AI security is entering its CI era.
The winning teams will not be the ones with the longest scan prompt. They will be the ones with the best repro harness: clear threat models, faithful sandboxes, separated discovery and verification, patch receipts, and enough operational discipline to turn findings into shipped fixes.
The model finds candidates. The harness proves them. The team owns the patch.
That is the loop.
Anthropic's defending-code-reference-harness is an open-source reference implementation for AI-assisted vulnerability discovery and remediation with Claude. It includes Claude Code skills for threat modeling, scanning, triage, patching, and customization, plus an autonomous pipeline that runs recon, find, verify, report, and patch stages.
A prompt can produce useful leads, but a harness gives the agent a repeatable target, a threat model, a sandbox, a verification path, and a patch receipt. That makes findings easier to reproduce, dedupe, prioritize, and fix.
Teams should not jump straight to fully autonomous scanning. Start with narrow, supervised workflows. Use one service, one vulnerability class, one sandbox, and one receipt format before scaling. Autonomous scanning without verification and ownership can create a larger triage queue instead of reducing risk.
A patch receipt should include the finding, threat-model path, reproduction steps, verification environment, patch summary, tests run, variant search, and human review point. The receipt should make it clear what was proven and what still depends on judgment.