Claude Opus 4.6: Anthropic's Smartest Model Gets Agent Teams


Anthropic dropped Claude Opus 4.6 and it's a leap. Not an incremental bump—a leap.

The flagship is now smarter at coding. Thinks more carefully. Plans more deliberately. Sustains agentic tasks for longer. Handles larger codebases without drift. And it has a million tokens of context. That's not a typo.

Let's dig into what matters.

The Numbers

Opus 4.6 wins across most benchmarks, but the story isn't clean. In some categories it's dominant. In others, Opus 4.5 still edges it out. GPT-5.3 (which dropped right after this release) has a few wins too. That's fine. What matters is the pattern.

Benchmark comparison across knowledge work, agentic search, coding, and reasoning

Agentic terminal coding is a massive jump. This is the real story. If you're using Claude to build software at scale, this model substantially outperforms 4.5, Sonnet, and Gemini 3 Pro. Not marginal. Substantial.

Agentic search is a clean win. Across the board, better than everything else. That matters for RAG pipelines and knowledge-heavy workloads.

Long context retrieval and reasoning are a tier above. Pass a million tokens into this thing and it actually uses them. Opus 4.5 and Sonnet fall behind. Context doesn't degrade into noise the way it does with smaller models.

| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.3 | Gemini 3 Pro |
| --- | --- | --- | --- | --- |
| Agentic Coding | 92.1% | 93.2% | 89.7% | 86.5% |
| Agentic Terminal Coding | 87.4% | 71.2% | 68.9% | 65.3% |
| Agentic Search | 94.6% | 81.3% | 79.8% | 77.2% |
| Multidisciplinary Reasoning (with tools) | 53.1% | 48.7% | 51.2% | 46.9% |
| Long Context Retrieval | 96.8% | 84.2% | 82.1% | — |

Performance breakdown showing agentic capabilities

Context Compaction & Adaptive Thinking

Two API features shipped with this.

Context compaction does what you'd expect—prunes tokens intelligently so you can fit more without wasting input cost. It's not magic, but it works.
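To make the idea concrete, here's a toy client-side sketch of what compaction amounts to: keep the system prompt, keep the newest turns, and drop the oldest ones once you blow a token budget. This is an illustration of the concept, not Anthropic's implementation — the word-count "tokenizer" and message shape are stand-ins.

```python
# Toy illustration of context compaction: drop the oldest non-system
# turns until the conversation fits a token budget.

def estimate_tokens(message: dict) -> int:
    # Crude proxy: ~1 token per word. Real pipelines use a tokenizer.
    return len(message["content"].split())

def compact(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt and the most recent turns under `budget`."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    kept: list[dict] = []
    used = sum(estimate_tokens(m) for m in system)
    # Walk newest-to-oldest, keeping turns while they still fit.
    for m in reversed(turns):
        cost = estimate_tokens(m)
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a build agent."},
    {"role": "user", "content": "step one " * 50},
    {"role": "assistant", "content": "done " * 50},
    {"role": "user", "content": "now run the tests"},
]
trimmed = compact(history, budget=60)  # drops the oldest user turn
```

Production compaction would summarize the dropped turns rather than discard them outright, but the budget-driven pruning loop is the core of it.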

Adaptive thinking is more interesting. The model now decides how much thinking effort a task requires. Simple queries get a quick pass. Complex problems get deeper reasoning. You pay for what you use. Smart.

Agent Teams: The Real Innovation

This is the feature that matters for the next 12 months.

Subagents have a constraint: they report back to an orchestrator. Everything threads through the main agent. That's limiting when you're running long-horizon tasks. Token budget gets consumed by state synchronization.

Agent teams flip that. Multiple agents coordinate with each other and with shared resources—todo lists, scratch pads, progress files. No central bottleneck. The orchestrator stays clean. Context stays coherent.
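The shared-resource pattern is simple enough to sketch. Here, teammates claim work from a todo file on disk instead of routing everything through an orchestrator — the file name and schema are made up for illustration, not Anthropic's actual format, and real concurrent agents would need file locking on top of this.

```python
# Minimal sketch of agents coordinating through a shared todo file.
import json
from pathlib import Path

TODO = Path("team_todo.json")

def init_board(tasks: list[str]) -> None:
    TODO.write_text(json.dumps(
        [{"task": t, "owner": None, "done": False} for t in tasks]
    ))

def claim_next(agent: str):
    """Claim the first unowned task; return it, or None if nothing is left."""
    board = json.loads(TODO.read_text())
    for item in board:
        if item["owner"] is None:
            item["owner"] = agent
            TODO.write_text(json.dumps(board))
            return item["task"]
    return None

def finish(agent: str, task: str) -> None:
    board = json.loads(TODO.read_text())
    for item in board:
        if item["task"] == task and item["owner"] == agent:
            item["done"] = True
    TODO.write_text(json.dumps(board))

init_board(["write lexer", "write parser", "write codegen"])
a = claim_next("backend-1")  # claims "write lexer"
b = claim_next("backend-2")  # claims "write parser"
finish("backend-1", a)
```

No teammate waits on the orchestrator to route state; the file is the coordination point.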

Agent team architecture with direct coordination

You can tab through teammates in real time. Inject instructions. Observe progress. Shift between them like separate Claude Code sessions. Because they are, technically.

The cost scales. You're running multiple sessions. But if you're on the Max tier (which anyone serious about agents should be), it's worth it.

Building a C Compiler with a Swarm

Anthropic published a case study. A team of Claude agents built a C compiler. From scratch. 100,000 lines. Compiles Linux 6.9. Can play Doom.

Cost: $20,000. Time: 2,000+ Claude Code sessions.

The approach matters more than the result.

Write extremely high-quality tests. Let Claude validate its own work. This is how you keep quality from degrading across hundreds of sessions.

Offload context to external files. Progress notes. Readme files. Architecture docs. Let the agent reference them instead of keeping everything in the conversation thread.
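A minimal version of that external-memory pattern: append structured notes to a file, then reload only a compact tail to seed the next session's prompt. The file name and note format here are illustrative assumptions.

```python
# Sketch of offloading context: persist progress notes to disk and
# pull back only the newest few instead of the full history.
from pathlib import Path

NOTES = Path("progress.md")

def log_progress(note: str) -> None:
    with NOTES.open("a") as f:
        f.write(f"- {note}\n")

def recent_context(last_n: int = 5) -> str:
    """Return only the newest notes to seed the next session's prompt."""
    lines = NOTES.read_text().splitlines()
    return "\n".join(lines[-last_n:])

NOTES.write_text("")  # fresh run
for i in range(10):
    log_progress(f"session {i}: tests passing")
prompt_chunk = recent_context(last_n=3)  # only the last three notes
```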

Inject time awareness. LLMs are time-blind. A task that takes a week feels instant. Anthropic sampled real time at random intervals so the model understood pacing and deadline pressure.
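A toy version of the time-injection idea: with some probability, append a wall-clock note so the model can reason about elapsed time and deadline pressure. The sampling probability and message format are assumptions, not Anthropic's actual mechanism.

```python
# Sketch of time-awareness injection for a time-blind model.
import random
import time

def maybe_inject_time(messages: list, start_ts: float,
                      p: float = 0.2, rng=random) -> list:
    """With probability p, append a note about real elapsed time."""
    if rng.random() < p:
        elapsed_h = (time.time() - start_ts) / 3600
        messages.append({
            "role": "user",
            "content": f"[clock] {elapsed_h:.1f} hours elapsed since the task started.",
        })
    return messages

start = time.time() - 2 * 3600  # pretend the task began two hours ago
msgs = maybe_inject_time([], start, p=1.0)  # p=1.0 forces an injection for the demo
```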

Parallelize by role. Backend engineer. Frontend engineer. Team lead. Each role tackles a different scope. No stepping on toes.

This is the template. You can apply it to codebases, data pipelines, research tasks, anything long-horizon.

Pricing & Context Tiers

Input: $5 per million tokens. Output: $25 per million tokens.

Those rates hold up to 200k input tokens. Above that, premium long-context pricing kicks in. If you're using the full million-token context and generating high-volume output, you need to budget for it.
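Back-of-envelope math at the base rates quoted above. The long-context premium above 200k tokens isn't specified here, so this sketch covers only the base tier.

```python
# Base-tier cost estimate: $5 per million input tokens, $25 per million output.
INPUT_PER_M = 5.00
OUTPUT_PER_M = 25.00

def base_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the base rates."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# A 150k-token prompt with a 20k-token answer:
cost = base_cost(150_000, 20_000)  # 0.75 + 0.50 = $1.25
```

At agent-team scale, multiply that by every session in the swarm — this is where the $20,000 compiler bill comes from.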

Opus 4.6 is still in beta on the million-token context. Rollout is coming. Costs may shift.

What Still Works Better

Be honest about the gaps.

Opus 4.5 still wins on some pure knowledge tasks. GPT-5.3 outperforms on a few benchmarks that Anthropic didn't lead on. That's expected. There's no single best model anymore. You pick the right tool for the job.

For agentic work at scale, reasoning with massive context, and long-horizon coding tasks, Opus 4.6 is the frontier.

Practical Next Steps

  1. Migrate critical agentic workflows. If you're running multi-step tasks with Opus 4.5, test them on 4.6. The terminal coding gap is significant.
  2. Experiment with agent teams. Enable the experimental feature in your settings.json. Start with a small task. Get the shape of coordination right before scaling up.
  3. Build with long context in mind. Don't just stuff a million tokens in there. Structure your data so the model can actually use it. Progress files. Architecture diagrams. Clear state.
  4. Budget for scale. If you're parallelizing work across teams of agents, costs compound. But the output can justify it.

Watch the Video

<iframe width="100%" height="415" src="https://www.youtube.com/embed/r2zxcB67vwM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>