
TL;DR
LangChain's rubrics for Deep Agents point at a practical agent pattern: self-correction works only when rubrics are versioned, executable, and sampled against human review.
LangChain's new rubrics for Deep Agents are worth paying attention to because they move evals closer to the agent loop itself.
The headline feature is simple: give an agent a rubric, let it evaluate its own work against that rubric, and use the result to correct the output before handing it back.
That sounds like a small product feature. It is bigger than that.
It is a sign that agent evals are becoming runtime infrastructure, not only offline dashboards.
Last updated: June 23, 2026
The useful version is not "the model grades itself and declares victory." That would be too easy to game and too easy to trust blindly.
The useful version is a disciplined loop:
That is where rubrics become practical.
LangChain's post describes rubrics as a way to build agents that evaluate and correct their work. The feature lands in the Deep Agents context, which LangChain positions as a batteries-included agent harness for complex, multi-step tasks.
The surrounding product story matters:
The direction is clear: agents should not only produce output. They should check whether the output meets a declared standard.
That is the right instinct.
Traditional evals happen after the fact.
You run a batch of examples. You score them. You compare model A to model B. You decide whether to ship a prompt, model, retrieval setting, or tool change.
That is still necessary. We covered that in agent evals need baseline receipts: a useful eval keeps the baseline, candidate, task fixture, trajectory, cost, and human review note together.
But agent work has another problem. The failure often happens during the run.
A coding agent may:
If the agent only learns that after the run is complete, the system has already wasted time and tokens. A runtime rubric gives the agent a chance to catch obvious quality failures before the user becomes the evaluator.
That is not a replacement for offline evals. It is a new layer.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Jun 23, 2026 • 8 min read
Jun 23, 2026 • 7 min read
Jun 23, 2026 • 8 min read
Jun 23, 2026 • 8 min read
A rubric is useful only if it changes behavior.
Bad rubric:
Make the answer good, correct, helpful, and safe.
Good rubric:
Before final response, verify:
1. Every factual claim cites a source URL or local file.
2. The answer separates verified facts from inference.
3. The recommendation names one case where it should not be used.
4. Any code change includes the exact command used to verify it.
5. If a source was unreachable, the answer says so.
The difference is not style. The second rubric is inspectable. A human reviewer can tell whether the agent followed it.
For developer workflows, rubrics should be:
| Property | Why It Matters |
|---|---|
| specific | vague criteria become vibes |
| versioned | teams need to know which rubric judged which run |
| task-scoped | a support rubric is not a coding rubric |
| evidence-linked | the judge should inspect the trace, not only the final answer |
| cheap enough | a rubric that doubles cost everywhere will get bypassed |
| sampled by humans | model-judged quality needs calibration |
That last point is the one teams skip.
Self-correction is useful. Self-certification is dangerous.
Rubrics sit between the agent harness and the eval platform.
At the harness level, a rubric can decide whether to loop:
draft -> rubric check -> revise -> rubric check -> final
At the eval-platform level, rubrics become the reusable criteria that compare candidate systems against baselines.
At the product level, rubrics become part of the user promise:
This is why rubrics belong next to long-running agent harnesses. A loop without a rubric tends to optimize for "keep going." A loop with a rubric can optimize for "stop when the output satisfies these criteria, or stop when it clearly does not."
That is a much safer loop.
The obvious critique is that a model grading a model can launder mistakes.
That critique is correct.
A rubric check can fail in several ways:
This is the same reliability cliff we discussed in the agent reliability cliff. Adding another agent step does not automatically improve the system. If that step has weak criteria or no external signal, it can create confidence without quality.
The fix is not to avoid rubrics. The fix is to bind them to evidence.
Use executable checks where possible:
Then use LLM rubric judges for the parts that are genuinely semantic: reasoning quality, user fit, clarity, missing caveats, and tradeoff coverage.
If you are adding rubrics to an agent this week, start small.
| Step | Action |
|---|---|
| 1 | Pick one workflow with repeat failures |
| 2 | Write a five-point rubric that a human reviewer already uses mentally |
| 3 | Save the rubric with a version ID |
| 4 | Run the agent with and without the rubric on the same task set |
| 5 | Compare accepted outcomes, cost, latency, and review time |
| 6 | Sample failures manually and adjust the rubric |
Do not measure only the rubric score. Measure whether the output is accepted faster.
That connects directly to the AI affordability cost model. A rubric that adds tokens but reduces retries and review time can be a net win. A rubric that adds tokens and mostly agrees with bad output is just ceremony.
LangChain rubrics are interesting because they make a quiet but important claim: agent quality criteria should be explicit and reusable.
That is the right direction.
The mature agent stack will not be "prompt plus tools." It will be:
Rubrics are not magic. They are a way to turn taste, policy, and task-specific quality into something an agent can inspect before it stops.
For teams building real agent workflows, that matters. The agent that can revise against a clear rubric is more useful than the agent that simply runs longer. The team that versions and samples those rubrics is safer than the team that trusts a self-grade.
Use rubrics to make the loop better.
Do not use them to avoid owning the loop.
LangChain rubrics let developers define criteria that Deep Agents can use to evaluate and improve their own outputs. They bring quality checks closer to the agent runtime instead of leaving all evaluation for offline dashboards.
No. Rubrics help agents catch failures, but model-judged quality should still be sampled against human review and executable checks. A rubric is a control, not proof of correctness.
A good rubric is specific, task-scoped, versioned, evidence-linked, and reviewable. It should name observable criteria rather than vague goals like "be helpful" or "write good code."
LangSmith evals help teams compare agent behavior across datasets, experiments, and baselines. Rubrics can become the reusable criteria inside those evals and, in Deep Agents, part of the runtime correction loop.
Not automatically. Rubrics add evaluation work. They save money only when they reduce retries, review time, rejected outputs, or failed agent loops enough to offset the extra tokens and latency.
Fetched June 23, 2026.
Read next
Hex's data-agent lab shows the practical eval pattern AI teams should copy: compare candidates against stable baselines, keep receipts, and judge changes by task behavior.
8 min readThe math of agent pipelines is brutal. 85% reliability per step compounds to about 20% at 10 steps. Here is why long chains collapse in production, and the six patterns the field has converged on to fight the decay.
9 min readA long-running coding agent is only useful if the environment around it can queue tasks, capture logs, checkpoint state, verify behavior, limit cost, and recover from failure.
8 min readTechnical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Most popular LLM framework. 100K+ GitHub stars. Chains, RAG, vector stores, tool use. LangGraph adds stateful multi-agen...
View ToolOpen-source terminal agent runtime with approval modes, rollback snapshots, MCP servers, LSP diagnostics, and a headless...
View ToolFrontend stack for agent-native apps. React hooks, prebuilt copilot UI, AG-UI runtime, frontend tools, shared state, and...
View ToolGives AI agents access to 250+ external tools (GitHub, Slack, Gmail, databases) with managed OAuth. Handles the auth and...
View ToolRun hundreds of agent evals in parallel. Find regressions in minutes.
View AppCompare AI coding agents on reproducible tasks with scored, shareable runs.
View AppSpec out AI agents, run them overnight, wake up to a verified GitHub repo.
View AppConfigure Claude Code for maximum productivity -- CLAUDE.md, sub-agents, MCP servers, and autonomous workflows.
AI AgentsWhat MCP servers are, how they work, and how to build your own in 5 minutes.
AI AgentsStep-by-step guide to building an MCP server in TypeScript - from project setup to tool definitions, resource handling, testing, and deployment.
AI Agents
Hex's data-agent lab shows the practical eval pattern AI teams should copy: compare candidates against stable baselines,...

The math of agent pipelines is brutal. 85% reliability per step compounds to about 20% at 10 steps. Here is why long cha...

A long-running coding agent is only useful if the environment around it can queue tasks, capture logs, checkpoint state,...

The 2026 agent decision is not CrewAI vs LangGraph. It is whether your loop lives in vendor infrastructure, a self-hoste...

A viral Hacker News thread about AI affordability points at the right problem, but developer teams need a more useful co...

Armin Ronacher's new essay explores the tension between letting AI agents loop autonomously and maintaining the engineer...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.