---
type: transcript
date: 2025-07-10
youtube_id: 8nDlgRldmzk
---

# Transcript: Grok 4 in 6 Minutes

Unveiling Grok 4: The World's Most Powerful AI Model by xAI. xAI has launched Grok 4, billed as the most powerful AI model to date. This video covers the announcement, its standout capabilities, pricing, and how to get started. Learn about Grok 4's performance on various benchmarks, its tool integration, and upcoming features. Discover how Grok 4 compares to other models on the market and understand its value proposition. The video also explores new voice capabilities and the future enhancements planned by the xAI team. If you find this video informative, please comment, share, and subscribe.

00:00 Introduction to Grok 4: The World's Most Powerful AI Model
00:15 Benchmark Performance: Humanity's Last Exam
00:54 Tool Integration and Enhanced Capabilities
02:17 Cost and Subscription Tiers
02:51 Voice Capabilities and Comparisons
03:18 Additional Benchmarks and Value Proposition
04:26 Future Developments and Upcoming Features
04:54 Access and Pricing Details
05:54 Conclusion and Final Thoughts

xAI has just launched Grok 4, the world's most powerful AI model. In this video, I'm going to go over what they announced, some of its capabilities, how it stands out, how much it costs, how you can get started with it, and what's to come from the xAI team.

Now, in terms of benchmarks: Humanity's Last Exam is an exam created by the team over at Scale AI. It covers frontier knowledge across a ton of different domains. These are very challenging questions, developed at the frontier of each respective field, whether that's math, chemistry, linguistics, or a handful of others. A good score for a human would be around 5%. First up, on the text-only version of Humanity's Last Exam, we can see that this model achieves 26.9%. But where we really start to see the model outperform is when we equip it with tools. The direction we're going is not just to have these LLMs respond back with an answer, but to combine reasoning with tool-calling capability to ultimately get better results. One quick callout on this chart: as you scale out test-time compute, it's obviously going to be more expensive, but another big part of this is that it's going to take considerably more time to get the improved results we see on screen here.

If we take a look at some of the more common benchmarks like GPQA, AIME, and LiveCodeBench, we can see across the board that Grok 4 has very strong performance. What's interesting about how they laid this out is that they show Grok 4 with no tools, Grok 4 Heavy, and Grok 4 with tools. One distinction with this chart is that the OpenAI, Google, and Anthropic models it compares against are all shown with no tool calls.
What would be interesting with this chart is a comparison against models with access to the same functionality Grok 4 has, because Grok 4 can spawn a number of different agents that search the web and perform different functions to ultimately yield these better results. In terms of the benchmarks, we do actually see Grok 4 completely saturate the AIME score: it hits 100%. These benchmarks are just going to keep getting better, and it's going to get to the point where AI is simply better than most people at most things most of the time.

Now, one thing that is definitely going to be a little controversial: to access this Grok 4 Heavy mode, you will have to pay $300 a month. We've started to see a bit of a trend across the industry, with OpenAI offering a $200-a-month tier, Anthropic with $100 and $200 tiers, and Google with a $250 tier. Accessing this Grok 4 Heavy capability comes at a hefty $300 a month. That said, it could be justified depending on the type of work that you do.

Another thing they demonstrated is the new voice capability, which they compared to OpenAI's voice mode, and it seemed quite good from the demo. You're able to ask it to whisper or sing and all of those fun things, and it's starting to sound more and more like a human. Compared to the OpenAI app, it seemed a little snappier, and from personally using the OpenAI voice mode, it didn't seem as eager to interrupt.

Another really impressive score was on ARC-AGI. We can see that Grok 4 sits just shy of 16%, almost double the score of Claude 4 Opus. We can see all of the other models plotted here by score, but most importantly, also by cost. We do see that some models can be considerably more expensive.
o3 (preview) comes in at over $100, while Claude 4 sits in the $1 to $10 range. What's really impressive here is that the second-best model on this benchmark is also cheaper; it's a really compelling value proposition. Another fun benchmark is Vending-Bench, run by the team at Andon Labs. In terms of actually running a small business like a vending machine, Grok 4 was able to perform longer and reliably increase its net worth over time.

In terms of some of the other specifics, it has a 256,000-token context window, and you'll be able to access frontier multimodal reasoning. One quick aside: they did mention that they are retraining for the multimodal capabilities, so its image and video understanding is expected to be much better over the coming months.

Now, in terms of what's next, they mentioned a coding model coming in the next handful of weeks. After that, they're going to have a multimodal agent, which will be interesting to see in the fall. And finally, they are going to train a video generation model, which I believe was described as leveraging 100,000 GB200s, the state-of-the-art hardware from Nvidia. I don't think any video generation model has leveraged quite that much compute yet.

Now, in terms of accessing the model: within my X account, I pay for Premium (I believe it's eight or ten bucks a month), and I can see that I can try it out there, so that is one option. Another place you can access it is on the Grok site, grok.com, where you'll see it in the model dropdown. As for the pricing of SuperGrok, it's going to be either $30 a month, or you can go for the front-of-the-line capability with the tool calling and agentic reasoning built in.
It's going to cost you $300 a month, or $3,000 if you pay for a year. Now, one thing worth calling out from their API documentation: if you are considering moving from something like Grok 3 or Grok 3 Mini to Grok 4, you do have to realize that Grok 4 is a reasoning model. There is no non-reasoning mode in Grok 4. Depending on the application you're building, that is something to consider, because reasoning models inherently take more time to generate a response. But otherwise, that's pretty much it for this video. If you found this video useful, please comment, share, and subscribe.
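The migration note above (Grok 4 being reasoning-only) mostly affects how you call the API: the request looks the same, but latency budgets change. A minimal sketch of building a chat-completions request, assuming xAI's OpenAI-compatible endpoint at `api.x.ai` and the model name `grok-4` (both are assumptions here; confirm them against the current xAI API documentation):

```python
import json
import os
import urllib.request

# Assumed endpoint and model name; verify against xAI's API docs.
API_URL = "https://api.x.ai/v1/chat/completions"
MODEL = "grok-4"

def build_request(prompt: str) -> urllib.request.Request:
    """Build a chat-completions request; send it later with urlopen."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('XAI_API_KEY', '')}",
        },
    )

# Reasoning models spend time "thinking" before the first token, so
# allow a generous timeout when actually sending the request, e.g.:
#   urllib.request.urlopen(build_request("What is 2+2?"), timeout=300)
```

The practical difference from Grok 3 is time-to-first-token, not payload shape: budget longer timeouts, and prefer streaming responses in user-facing applications so the wait is visible rather than silent.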