
TL;DR
Hex's data-agent lab shows the practical eval pattern AI teams should copy: compare candidates against stable baselines, keep receipts, and judge changes by task behavior.
Hex's writeup on evaluating data agents is the most useful AI developer post on the wire today because it skips the usual benchmark theater.
The interesting part is not that Hex built an eval tool. Everyone is building eval tools. The interesting part is the shape of the tool: a lab bench for pairwise experiments, stable production baselines, locally executed candidate runs, custom rubrics, side-by-side trajectories, and a fake business with realistic data.
That is the right direction for agent products.
Last updated: June 20, 2026
Agents do not fail like autocomplete models. They fail across a run: a bad assumption in step two, a missed join in step four, an overconfident final chart, a useful tool call made too late, or a correct answer that cost 10x too much. A single score at the end loses most of the evidence.
If you are building coding agents, data agents, browser agents, or internal operators, the durable primitive is not "run more benchmarks." It is baseline receipts.
Hex describes analytics as a difficult domain for agents because easy questions can look hard, hard questions can look easy, warehouse context is private and out of distribution, and wrong answers can still sound plausible. That maps almost perfectly to software work.
A coding agent can pass a unit test while choosing the wrong abstraction. A data agent can produce a clean chart from the wrong grain. A support agent can cite the right document but apply it to the wrong customer state. A browser agent can finish the happy path while missing the broken edge state.
This is why context reduction, agent memory contracts, and long-running harnesses keep showing up as the same conversation. The model is only one part of the system. The context stores, retrieval layer, tool choices, planner, permissions, UI state, and final judge all change the result.
Hex makes that explicit. Their post says agent performance is increasingly a function of the rich context stores an agent can access, not just the model or system prompt. That is the sentence to steal for your own roadmap.
An agent eval that only compares model A to model B is under-instrumented. A useful eval compares the whole candidate system to the whole baseline system and keeps enough evidence to explain the delta.
The most practical detail in Hex's setup is the hybrid workflow: local candidate runs compared against shared remote production baselines.
That sounds boring. It is not.
Most agent teams drift into a messy eval loop:
Nobody knows whether the agent improved or whether the measurement surface moved underneath it.
A stable baseline fixes that. It gives every experiment a reference point that does not depend on the developer's laptop, branch, cached data, or current prompt edits. It also changes the conversation from "this run scored 82" to "this candidate beat the production baseline on these cases, lost on these cases, cost this much more, and changed these behaviors."
That is the same operational instinct behind token-budget ledgers. You do not just ask whether the agent finished. You ask what it spent, what it touched, which path it took, and whether the new path is worth shipping.
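The "beat the baseline on these cases, lost on these cases, cost this much more" report can be sketched in a few lines. This is a minimal illustration, not Hex's implementation: `CaseResult`, `compare_to_baseline`, the case ids, and the dollar figures are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One run of one eval case (hypothetical shape for this sketch)."""
    case_id: str
    passed: bool
    cost_usd: float


def compare_to_baseline(baseline, candidate):
    """Pair results by case_id and report wins, losses, and total cost delta."""
    base_by_id = {r.case_id: r for r in baseline}
    wins, losses = [], []
    cost_delta = 0.0
    for cand in candidate:
        base = base_by_id.get(cand.case_id)
        if base is None:
            continue  # no baseline run for this case, so no pairwise claim
        if cand.passed and not base.passed:
            wins.append(cand.case_id)
        elif base.passed and not cand.passed:
            losses.append(cand.case_id)
        cost_delta += cand.cost_usd - base.cost_usd
    return {"wins": wins, "losses": losses, "cost_delta_usd": round(cost_delta, 4)}


baseline = [
    CaseResult("revenue-by-region", True, 0.10),
    CaseResult("churn-cohort", False, 0.20),
]
candidate = [
    CaseResult("revenue-by-region", False, 0.15),
    CaseResult("churn-cohort", True, 0.30),
]
report = compare_to_baseline(baseline, candidate)
print(report)
```

In this made-up example the candidate and baseline have the same pass count, so an aggregate score would show no change. The pairwise report shows that one case was fixed, one regressed, and the run got more expensive.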
For a coding-agent team, a baseline receipt should include:
| Receipt field | Why it matters |
|---|---|
| Baseline version | Prevents comparing against a moving target |
| Candidate version | Ties behavior to a branch, prompt, model, tool config, or memory change |
| Task fixture | Captures the repo, issue, database state, browser state, or document set |
| Trajectory summary | Shows tool calls, files touched, retries, and major decisions |
| Rubric results | Separates correctness, efficiency, safety, style, and user-fit |
| Cost and latency | Makes expensive wins visible before they become defaults |
| Human review note | Preserves the examples that aggregate scores flatten |
The key is not to make every receipt huge. The key is to make it stable enough that a reviewer can replay the important claim.
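One way to keep a receipt small and stable is to pin the fields from the table in a typed record and serialize it deterministically. The sketch below assumes a Python codebase; the class name, field names, version strings, and fixture path are illustrative, not a format Hex has published.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BaselineReceipt:
    """Compact record of one candidate-vs-baseline comparison (hypothetical schema)."""
    baseline_version: str
    candidate_version: str
    task_fixture: str
    trajectory_summary: list[str] = field(default_factory=list)
    rubric_results: dict[str, float] = field(default_factory=dict)
    cost_usd: float = 0.0
    latency_s: float = 0.0
    review_note: str = ""

    def to_json(self) -> str:
        # sort_keys keeps the output stable so receipts diff cleanly in review
        return json.dumps(asdict(self), sort_keys=True, indent=2)


receipt = BaselineReceipt(
    baseline_version="prod-2026-06-18",
    candidate_version="branch/retrieval-tweak",
    task_fixture="fixtures/revenue_by_region",
    trajectory_summary=["list_tables", "run_sql", "render_chart"],
    rubric_results={"correctness": 1.0, "efficiency": 0.5},
    cost_usd=0.42,
    latency_s=38.0,
    review_note="Correct grain, but one redundant warehouse query.",
)
print(receipt.to_json())
```

Storing the summary rather than the full trajectory is a deliberate choice here: the receipt stays reviewable, and the full trace can live elsewhere under the same versions.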
Absolute eval scores are comforting because they look like grades. Pairwise evals are useful because they look like engineering.
Hex's writeup describes candidate and baseline runs as the default mental model. That one choice prevents a lot of bad dashboard behavior. You are not arguing about whether a synthetic benchmark number is impressive. You are asking whether the candidate made the real workflow better than the thing users currently have.
This matters even more as agent systems add model panels and judges. In the OpenRouter Fusion post, the right lesson was that multi-model panels should be escalation lanes, not autopilot defaults. The same applies to eval judges. A judge is useful when it has a concrete comparison, a task-specific rubric, and access to the run evidence. A judge is weaker when it scores a final answer in isolation.
The pairwise question is sharper:
Given the same task fixture, did candidate run B improve on baseline run A?
Evaluate:
- correctness
- evidence use
- unnecessary tool calls
- cost and latency
- policy violations
- final user usefulness
That framing makes it harder for a model to reward fluent nonsense. It also makes the failures legible. Maybe the candidate solved more tasks but stopped using the semantic layer. Maybe it was more accurate but doubled warehouse queries. Maybe it followed the workspace guide better but got slower. Those are product decisions, not benchmark trivia.
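A per-dimension verdict makes those tradeoffs explicit instead of folding them into one number. The following is a rough sketch under assumed inputs: the dimension names, the `LOWER_IS_BETTER` set, and the scores are hypothetical, and a real rubric would likely need tolerances rather than exact comparisons.

```python
# Dimensions where a smaller value is the improvement (assumed for this sketch).
LOWER_IS_BETTER = {"cost_usd", "latency_s", "tool_calls", "policy_violations"}


def pairwise_verdict(baseline: dict, candidate: dict) -> dict:
    """Label each shared dimension as better, worse, or same for the candidate."""
    verdict = {}
    for dim in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[dim] - baseline[dim]
        if dim in LOWER_IS_BETTER:
            delta = -delta  # flip sign so a positive delta always means improvement
        if delta > 0:
            verdict[dim] = "better"
        elif delta < 0:
            verdict[dim] = "worse"
        else:
            verdict[dim] = "same"
    return verdict


baseline_scores = {"correctness": 0.7, "tool_calls": 6, "latency_s": 40}
candidate_scores = {"correctness": 0.9, "tool_calls": 12, "latency_s": 40}
print(pairwise_verdict(baseline_scores, candidate_scores))
```

Here the candidate is more accurate but doubles its tool calls, which is exactly the kind of result that should go to a product decision rather than a leaderboard.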
Hex also built Shorelane Commerce, a fake business with realistic data. That is the part more teams should copy.
Public benchmarks are useful for broad model selection, but they rarely match your operating environment. Internal production data is realistic, but it is private, messy, permissioned, and hard to share across dev, CI, vendors, and external reviewers. A synthetic-but-realistic fixture gives you the missing middle.
For developer tools, the equivalent could be:
The goal is not to trick the model. The goal is to give the agent a world where the right answer depends on context, not trivia.
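A synthetic fixture is only useful as a shared baseline if every environment can regenerate the same data. A seeded generator is one simple way to get that. This sketch is not how Shorelane Commerce was built; the `make_orders` function, the column names, and the value ranges are invented for illustration.

```python
import random


def make_orders(seed: int, n: int) -> list[dict]:
    """Generate n fake orders; the same seed always yields the same rows."""
    rng = random.Random(seed)  # private RNG so global random state is untouched
    regions = ["north", "south", "east", "west"]
    orders = []
    for order_id in range(1, n + 1):
        orders.append(
            {
                "order_id": order_id,
                "region": rng.choice(regions),
                "amount_cents": rng.randint(500, 50_000),
                # a small share of refunds gives the agent a trap: gross vs net revenue
                "refunded": rng.random() < 0.05,
            }
        )
    return orders


orders = make_orders(seed=7, n=50)
print(len(orders), orders[0]["region"])
```

The refund flag is the interesting part. A question like "what was revenue by region?" now has a right answer that depends on whether refunds are excluded, which is context rather than trivia.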
This is also where opposing opinions matter. The Hacker News thread for Hex's post had not developed much discussion when I checked it, but recent HN skepticism around AI testing agents is relevant: developers asked why they should pay token costs for nondeterministic test agents when LLMs can already write deterministic end-to-end tests. That pushback is fair.
The answer is not "replace tests with agents." The answer is "use deterministic tests for known invariants and agent evals for exploratory workflows where the path matters." A browser checkout test should be deterministic. A release-readiness agent that investigates a suspicious analytics drop needs trajectory evaluation.
Do not start with a giant eval platform. Start with one receipt format and one shared baseline.
For a small agent team, the first useful loop is:
This is enough to catch most prompt, retrieval, memory, and tool-routing regressions. It also creates the habit that matters: every agent change has to explain what it improved against the current product.
OpenAI's eval docs frame evals as structured tests for model and application behavior. Anthropic's testing docs emphasize defining success, improving consistency, and using stricter output controls when you need schema conformance. Hex's contribution is the product-engineering layer between those ideas: shared baselines, pairwise runs, rich trajectories, and realistic fixtures.
That is the pattern to copy.
Agent evals should feel less like leaderboard watching and more like code review.
A good code review asks what changed, why it changed, what evidence supports it, what risk remains, and whether the tradeoff is worth merging. A good agent eval should do the same.
The next time a prompt, model, tool, memory layer, or context store makes an agent "feel better," ask for the baseline receipt. If the team cannot show the candidate run next to the production run, with the task fixture, trajectory, rubric, cost, and reviewer note, it does not have an eval yet. It has an anecdote.
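That test can be turned into a small gate. The sketch below treats a receipt as a plain dictionary and reports which pieces are missing; the field names are hypothetical and should be mapped to whatever your own receipt format uses.

```python
# Pieces a reviewer needs before a claim counts as an eval (assumed names).
REQUIRED_FIELDS = (
    "task_fixture",
    "baseline_run",
    "candidate_run",
    "trajectory",
    "rubric",
    "cost",
    "reviewer_note",
)


def missing_receipt_fields(receipt: dict) -> list[str]:
    """Return required fields that are absent or empty, in a fixed order."""
    empty_values = (None, "", [], {})
    # A numeric cost of 0 is a real value, so only truly empty values count as missing.
    return [name for name in REQUIRED_FIELDS if receipt.get(name) in empty_values]


anecdote = {"candidate_run": "run-123", "reviewer_note": "feels better"}
print(missing_receipt_fields(anecdote))
```

An empty result means the claim can be replayed. Anything else is a list of what to ask for before the change ships.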
A baseline receipt is a compact record of how the current production agent handled a task: the fixture, agent version, tool calls, trajectory, costs, rubric results, and reviewer notes. Candidate changes are compared against that receipt so teams can see what actually improved or regressed.
Pairwise evals compare a candidate run against a known baseline on the same task. That makes regressions easier to spot and keeps the discussion tied to product behavior. Benchmark scores are still useful for model selection, but they usually hide the trajectory details that matter in agent systems.
Agent evals do not replace deterministic tests. Deterministic tests should cover known invariants, API contracts, schema rules, browser flows, and safety gates. Agent evals are better for exploratory tasks where the path, evidence use, tool efficiency, and final judgment all matter.
Evaluate evidence use, unnecessary tool calls, cost, latency, policy violations, context-store behavior, memory use, final answer usefulness, and whether the candidate took a different action than the baseline.