
# Exploring Claude Opus 4.6: Features and Benchmarks of Anthropic's Latest Frontier Model

In this video, I delve into the details of Claude Opus 4.6, highlighting key features and performance benchmarks. The focus is on its new coding capabilities, a substantial increase in token context, and the innovative 'agent teams' feature in beta. I also touch on its usage, pricing, and an impressive experiment in which multiple agents built a C compiler. This comprehensive overview provides insights into how this model outperforms others in various aspects and introduces advanced API features like context compaction and adaptive thinking. Stay tuned for an in-depth look at the potential use cases and configurations of this powerful model.

- 00:00 Introduction to Claude Opus 4.6
- 00:15 Key Features and Improvements
- 00:39 Benchmarks and Comparisons
- 01:21 Agent Teams and New API Features
- 03:09 Long Context Capabilities
- 03:38 Experimental Features in Claude Code
- 05:50 Use Cases and Practical Applications
- 06:12 Building a C Compiler with Claude
- 09:29 Conclusion and Final Thoughts
---
type: transcript
date: 2026-02-09
youtube_id: r2zxcB67vwM
---

# Transcript: Claude Opus 4.6 in 10 Minutes

In this video, I'm going to be taking a look at Claude Opus 4.6. I'm going to touch on the blog post, and then the one thing I want to focus on a little bit more is what they've been able to build with the model, and specifically a new feature they have in beta within Claude Code that leverages this model. First thing right off the bat, in terms of some of the key details, they mention this is their smartest model to date. It's better in terms of coding skills. It thinks more carefully. It plans more carefully. It sustains agentic tasks for longer and can operate more reliably on larger codebases. Now, one of the big features and jumps with this model is that it has a million tokens of context. That is currently in beta, though, so just be mindful of that. Now, in terms of some of the benchmarks: across a number of different categories like knowledge work, agentic search, coding, and reasoning, just to name a few, we can see that this model outperforms a number of different models. One of the interesting things, at least in terms of timing, was that GPT-5.3 Codex came out right after this model's release. Now, in terms of how you can use this, you're now going to be able to assemble agent teams. The one thing, and where this is different from subagents, is that subagents effectively have to report back to the orchestrator agent, or the main agent thread. Whereas with agent teams, the agents can actually interact with each other as well as with common resources, things like to-do lists or the different scratchpads the agents have access to. Now, in terms of a couple of new features in the API, they now have a context compaction feature. They also have something they're calling adaptive thinking: depending on the complexity of the task, the model itself can determine how much thinking is required.
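To make the adaptive thinking idea concrete, here is a minimal sketch of what a Messages API request might look like. This mirrors the shape of the existing extended-thinking request format, but the model id and the `"adaptive"` thinking type are assumptions for illustration, not confirmed parameter values:

```python
# Hypothetical sketch of a Messages API request opting into adaptive thinking.
# The model id and the "adaptive" thinking type are assumptions; the existing
# API uses a similar "thinking" object with an explicit token budget instead.
request = {
    "model": "claude-opus-4-6",            # hypothetical model id
    "max_tokens": 4096,
    "thinking": {"type": "adaptive"},      # assumed: model picks its own budget
    "messages": [
        {"role": "user", "content": "Refactor this module and explain the plan."}
    ],
}
print(request["thinking"])
```

The appeal of this shape is that the caller no longer has to guess a thinking budget per task; the model scales its reasoning effort to the task's complexity.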
Now, in terms of how you can leverage it: it's within the Claude.ai app, and you can also use it via the API. In terms of pricing, it is $5 per million input tokens and $25 per million output tokens. The one thing I do want to mention is that if you exceed 200,000 tokens of context, this gets substantially more expensive. Just be mindful that if you're looking to push both input and output and really lean on that threshold, you'll be billed at the upper tier. Now, in terms of the evals, it's actually pretty interesting because the results are mixed in some regards. For instance, on agentic coding, we actually see that Opus 4.5 outperforms by just a very slight margin. But when we look at agentic terminal coding, this is a substantial leap over Opus 4.5 as well as some of the other models like Sonnet and Gemini 3 Pro. Now, the big jumps are in some other categories. In terms of agentic search, this is substantially better than all of the models across the board. On multidisciplinary reasoning, or Humanity's Last Exam, it scores 53.1% with tools. And across a number of new benchmarks that I haven't really seen before, such as agentic financial analysis in the finance agent benchmark, as well as office tasks, it also outperforms across the board. What is interesting is that there are some areas where Opus 4.5 is still better. And with the latest release of GPT-5.3, there are a number of benchmarks where GPT-5.3 does outperform Opus. Now, to touch on the long context: the thing that's really interesting with this model is just the substantial leap in its competence across both context retrieval and long-context reasoning. In terms of long-context retrieval, what's interesting is that you're going to be able to pass in a ton of information.
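To put those quoted base rates in perspective, here is a quick back-of-envelope cost calculator. The long-context surcharge above 200K tokens is real but its rate isn't quoted in the video, so this sketch deliberately models only the base tier:

```python
# Rough cost estimate at the quoted base rates: $5 per million input tokens,
# $25 per million output tokens. Requests over 200K tokens of context are
# billed at a higher (unquoted) tier, which this sketch does not model.
INPUT_PER_M = 5.00
OUTPUT_PER_M = 25.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at base rates."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: 150K tokens in, 8K tokens out.
print(round(estimate_cost(150_000, 8_000), 2))
```

At these rates, output tokens dominate the bill quickly: the 8K of output here costs more than a quarter of what the 150K of input does.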
A million tokens of context is an absolutely enormous amount. And in terms of long-context reasoning, we see that Opus 4.6 is substantially better than Sonnet 4.5 as well. Next up, I want to touch on one of the experimental features they released as part of the Opus 4.6 announcement. Within Claude Code, they now have a feature they're calling agent teams. One thing to know is that you are going to have to enable the experimental agent teams feature within the settings JSON. What you're going to be able to do now is have different agents coordinate directly with one another without circling back to the main orchestrator agent. One of the benefits is that you can keep that orchestrator thread, the main parent agent, as clean as possible. And why is that important? Well, first up, you get longer coherence. As I pointed out in the previous slide, say you're working on a long-context or long-horizon task: if you now have up to that million tokens of context, you can keep the orchestrator agent's context clear for much, much longer. Effectively, you can think of it almost like a team lead that can spawn off different resources; depending on what is needed, the teammates don't necessarily need to communicate directly back to the team lead. The teammates themselves can coordinate amongst one another and share all of those tokens between each other. And one of the really neat things is they actually set it up so it can automatically spin up a session where you can tab through and still interact with each teammate as if it were an individual Claude Code session. So you're going to be able to use Shift and the arrow keys to move up and down through the different teammates.
So depending on what they're doing, if you want to interject, just like you would with Claude Code, and have an agent pick up a message, stop what it's doing, take feedback, or just observe what they're all doing, you're going to have all of that as part of the feature. I'm likely going to cover this in another video; it's almost a bigger topic in and of itself. There are also some configuration steps: you're going to have to set up tmux, and iTerm2 does seem to be the preferred setup for this, but I've been playing around with it and it is quite impressive. The other thing to know is that instead of interacting with just one Claude Code session, you're effectively coordinating a ton of different Claude Code sessions, and because of that it can come with an increased cost. But for a lot of people, if you have a Claude Code subscription, especially one of the Max tiers, this might not necessarily be an issue. There are a ton of different use cases for this. I quickly ran it on an old project of mine and it found a number of different issues within the project. Once it found the issues, I had the different agents coordinate how they wanted to break out the problem and actually test the solution. So that's one potential use case, but they mention some others within the docs, and I'm sure there will be emergent use cases that people discover as they play around with this. Last but not least, I do want to touch on the engineering article they put out with the release of Opus 4.5: building a C compiler with a team of parallel Claudes. One of the interesting things is that they were able to build a C compiler completely from scratch. Now, there are some big numbers in terms of how much this cost, as well as how long it took.
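For reference, enabling the feature means adding a flag to Claude Code's settings JSON. The video only says the flag lives there; the exact key name below is an assumption for illustration, so check the current docs for the real one:

```json
{
  "experimental": {
    "agentTeams": true
  }
}
```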
It took over 2,000 Claude Code sessions and $20,000 in API costs, but the agents were able to produce a 100,000-line compiler that could build Linux 6.9. And as they show in the video, it could even play Doom. In terms of how they did this, they had a swarm of agents, and effectively the approach they used was the Ralph looping mechanism, or something very close to it. A couple of interesting pieces I found within the article. First up, they mention: write extremely high-quality tests. This gives Claude a mechanism to actually validate its work. But further, one thing I found particularly useful was to put yourself in Claude's shoes and think about things from how Claude sees them. The way they did this was similar to a lot of mechanisms we see within Claude Code as well as some other systems: basically, they offload context to extensive READMEs and progress files. You can think of these almost like to-do lists, by the sounds of it. Then they try to avoid context pollution: they want to emit only what's needed within the main thread as well as the subagents, by the sounds of it. And one of the interesting things they called out is the time blindness that LLMs have. Oftentimes you might see Opus say, "Okay, this task is going to take you three weeks," then you click enter, and half an hour later you have exactly what you asked for; there isn't that idea of time. It's a similar problem when it's going through an agentic task without any coherence of time: you don't know whether it's a month into a task or a week into a task. LLMs just don't have a sense of time.
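One way to mitigate that time blindness is to occasionally inject a wall-clock note into the conversation so the agent can see elapsed time. Here is a minimal sketch; the injection probability and message format are illustrative choices, since the article only says timestamps were interjected at random points:

```python
import random
import time

def maybe_inject_timestamp(messages: list[dict],
                           probability: float = 0.2) -> list[dict]:
    """Randomly append a wall-clock note so the agent can track elapsed time.

    The probability and message format here are illustrative assumptions,
    not the exact mechanism described in the article.
    """
    if random.random() < probability:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        messages.append({
            "role": "user",
            "content": f"[system note] current time: {stamp}",
        })
    return messages
```

In a long-running agent loop you would call this between turns; over many iterations the agent accumulates enough timestamps to reason about how long it has actually been working.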
And the interesting thing they did is they interjected a time mechanism at random points to give the agents additional awareness of time. Then, in terms of how they parallelized this: the basic idea is to give each agent different tasks it can pick up and work on, similar to a team of developers who can pick up tasks from something like a Linear board and work on them until they're done. Another way you can configure these agents is with different roles, very similar to how there are different roles within an organization. You could have, for instance, a back-end engineer, a front-end engineer, or whatever roles make sense depending on the task. You might have a team lead, you might have a generalist, you might have all of the different roles that fit the particular task. Now, what they did here is really about stressing what the model is capable of, but we've seen some early experiments from Cursor, for instance, where I think it was just this week they mentioned they were able to get an agent to make up to a thousand commits per hour on a task. Whether what they're actually building is a useful codebase is a different question, but they have run similar experiments where they have been able to build increasingly powerful software. So with the ability to really not be constrained by budget, together with these new concepts like subagents, there are some new paradigms that are possible just given how powerful these models are. But overall, that's pretty much it for this video. I know there's a ton to cover. Kudos to the team at Anthropic for the latest release. And if you found this video useful, please like, comment, share, and subscribe. Otherwise, until the next one.