TL;DR
xAI has launched Grok 4, claiming the title of the world's most powerful AI model. With a $300/month Super Grok tier, a saturated AIME benchmark, and a coding model on the horizon, this is xAI's biggest release yet.
xAI has launched Grok 4, and the benchmarks back up a bold claim: this is the highest-scoring AI model on several key evaluations. But the headline numbers only tell part of the story. The real picture involves tool-augmented reasoning, a $300/month price tag, and a roadmap that includes a dedicated coding model, multimodal agents, and a video generation model trained on 100,000 NVIDIA GB200s.
Humanity's Last Exam is a benchmark created by Scale AI and the Center for AI Safety that tests frontier knowledge across domains including mathematics, chemistry, and linguistics. A strong human score on this exam sits around 5%. Grok 4 scores 26.9% on the text-only version - already competitive with the best models available.
But the more telling result is what happens when you add tools. With access to web browsing, terminal, and other agentic capabilities, Grok 4's score improves dramatically. This aligns with an industry-wide pattern: models paired with tools consistently outperform models reasoning in isolation. The tradeoff is cost and latency - tool-augmented runs consume more compute and take considerably longer to produce results.
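The tool-augmented pattern described above can be sketched as a simple loop: the model either calls a tool or emits a final answer, and tool results are fed back into its context. This is a minimal illustration with a stubbed model and a hypothetical `web_search` tool - a real agent would call an LLM API and let the model choose tools itself.

```python
# Minimal sketch of a tool-augmented reasoning loop (illustrative only).
# The "model" here is a stub; a real agent would call an LLM API and let
# the model decide when to invoke tools such as web search or a terminal.

def web_search(query: str) -> str:
    """Hypothetical tool: return search results for a query."""
    return f"results for: {query}"

TOOLS = {"web_search": web_search}

def run_agent(model_step, question: str, max_steps: int = 5) -> str:
    """Feed tool outputs back to the model until it produces a final answer."""
    context = [question]
    for _ in range(max_steps):
        action = model_step(context)           # model picks a tool or answers
        if action["type"] == "final":
            return action["text"]
        tool = TOOLS[action["tool"]]
        context.append(tool(action["input"]))  # append tool output and loop
    return "max steps reached"

# Stub model: search once, then answer using the tool output.
def stub_model(context):
    if len(context) == 1:
        return {"type": "tool", "tool": "web_search", "input": context[0]}
    return {"type": "final", "text": f"answer based on {context[-1]}"}

print(run_agent(stub_model, "Grok 4 launch date"))
```

The cost and latency tradeoff is visible even in this sketch: every tool call adds a round trip before the final answer can be produced.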
On the AIME benchmark (the American Invitational Mathematics Examination, a math-focused evaluation), Grok 4 achieves a perfect 100%. This is not a typo. The benchmark is effectively saturated, which tells us less about Grok 4 being uniquely good at math and more about the benchmark reaching its ceiling. Expect new, harder math evaluations to emerge as the field continues advancing.
Across GPQA (graduate-level question answering) and LiveCodeBench (live coding evaluation), Grok 4 shows strong performance against OpenAI, Google, and Anthropic models. One important caveat: xAI's comparison chart includes Grok 4 variants with tool access alongside competitor models running without tools. A more apples-to-apples comparison would show the gap narrowing, though Grok 4 would likely still hold competitive positioning.
On the ARC-AGI benchmark, Grok 4 scores just under 16% - nearly double Claude 4 Opus's result. What makes this benchmark interesting is the cost axis. Some models achieve similar scores at dramatically different price points. o3-preview costs over $100 per run on this benchmark, while Claude 4 sits between $1 and $10. Grok 4 offers the second-best score at a competitive cost, making it an appealing value proposition for researchers and developers running repeated evaluations.
VendingBench, from the team at Andon Labs, simulates running a small business (a vending machine operation). Grok 4 ran for longer and grew its net worth more reliably over time than competitor models. It is a playful benchmark, but it tests something practical: sustained decision-making over extended periods with real economic consequences.
Grok 4 introduces the most expensive consumer AI subscription tier yet at $300/month for Super Grok. This includes access to Grok Heavy mode, which uses extended agentic reasoning, tool calling, and web search to tackle complex problems.
Here is how the premium AI subscription landscape looks now:
| Provider | Tier | Price |
|---|---|---|
| OpenAI | Pro | $200/mo |
| Anthropic | Max | $100-200/mo |
| Google | AI Ultra | $250/mo |
| xAI | Super Grok | $300/mo |
Whether $300/month is justified depends entirely on your use case. For professionals working on complex research, financial modeling, or technical problems where Grok 4's extended reasoning capabilities provide measurable value, the cost could be a rounding error. For casual users, the standard Grok tier (accessible through an X Premium subscription at around $8-10/month) provides a reasonable entry point.
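One way to frame the $300/month question is a simple breakeven calculation. The hourly rate and hours saved below are illustrative assumptions, not figures from xAI:

```python
# Breakeven sketch for a $300/month subscription (illustrative numbers).
SUBSCRIPTION = 300.0  # Super Grok, $/month

def hours_to_break_even(hourly_rate: float) -> float:
    """Hours of work the tool must save per month to pay for itself."""
    return SUBSCRIPTION / hourly_rate

# At a $150/hour billing rate, saving two hours a month covers the cost.
print(hours_to_break_even(150.0))  # → 2.0
```

For a consultant billing $150/hour, two saved hours a month covers the subscription - which is why "rounding error" is plausible for some professionals and absurd for casual users.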
Grok 4 includes updated voice interaction, and the demo compared it directly against OpenAI's voice mode. The results suggest Grok 4's voice is more responsive and less prone to interrupting the user mid-sentence. It handles requests like whispering and singing, pushing closer to natural human speech patterns.
Voice AI is becoming a competitive differentiator. As these capabilities mature, the quality of voice interaction will factor into which assistant people choose for daily use - not just which model scores highest on text benchmarks.
The always-on reasoning aspect is worth noting. Grok 4 is inherently a reasoning model - there is no way to disable the chain-of-thought process. This means every API call involves reasoning overhead. For applications where speed matters more than depth (chatbots, simple completions, high-throughput pipelines), you would still want to use Grok 3 or Grok 3 Mini.
xAI laid out an aggressive roadmap:
Coding model - Arriving within weeks of launch. Given how competitive the AI coding space has become (Claude Code, Cursor, Codex, Gemini CLI), a dedicated Grok coding model enters a crowded but high-value market.
Multimodal agent - Expected in the fall. This combines vision, reasoning, and action capabilities into a single agent that can understand images and video, reason about them, and take actions based on its analysis.
Video generation model - The most ambitious item on the roadmap. xAI plans to train this on 100,000 NVIDIA GB200s, which would represent one of the largest compute allocations for a video generation model to date. The scale suggests xAI is aiming to compete directly with OpenAI's Sora and Google's Veo at the frontier.
Grok 4 is available through multiple entry points: the Grok apps, grok.com, and the xAI API.
Unlike most other frontier models, Grok 4 does not offer a non-reasoning mode. Every request triggers the full chain-of-thought reasoning process. This is a deliberate architectural choice - xAI is betting that the quality improvements from always-on reasoning outweigh the latency and cost tradeoffs.
For developers building applications on the Grok 4 API, this has practical implications. If your use case involves high-throughput, low-latency requests - chatbots, autocomplete, simple classification tasks - Grok 4 is the wrong model. Grok 3 or Grok 3 Mini remain better suited for those workloads. Grok 4 is designed for tasks where thinking time translates directly into better output: complex code generation, multi-step problem solving, research synthesis, and financial modeling.
The migration path from Grok 3 to Grok 4 is straightforward from an API perspective, but developers should audit their applications for latency sensitivity before switching. A response that took 2 seconds with Grok 3 might take 15-30 seconds with Grok 4 as the model reasons through the problem.
Grok 4 arrives in a crowded market. OpenAI has GPT-5 and o3. Anthropic has Claude 4 Opus. Google has Gemini 2.5 Pro. Each model has different strengths, and the "best" model depends entirely on the specific task.
What distinguishes xAI's approach is the aggressive roadmap. Announcing a coding model, multimodal agent, and video generation model in rapid succession signals that xAI is not content to compete on a single axis. They are building across the full stack of AI capabilities simultaneously, backed by what appears to be near-unlimited compute resources.
The $300/month price point is also a strategic signal. By pricing above OpenAI and Google, xAI is positioning Grok 4 as a premium product for power users rather than trying to win on volume. Whether this strategy succeeds depends on whether the tool-augmented reasoning capabilities justify the premium in real-world usage, not just benchmarks.
The pricing escalation across the industry tells us something about where things are heading. When multiple companies independently arrive at $200-300/month tiers for their most capable models, it signals that the compute required for frontier reasoning is genuinely expensive - and that there is demand willing to pay for it.
At the same time, the benchmark saturation on tests like AIME (100%) means the evaluation landscape needs to evolve. The models are outpacing the measurements we use to compare them. Expect new, harder benchmarks to emerge that better differentiate between models that all score perfectly on today's tests.
For developers, the practical question remains: which model is best for your specific use case? Grok 4's strengths lie in extended reasoning, tool-augmented problem solving, and sustained performance over long tasks. If your work involves complex analysis, research synthesis, or multi-step agentic workflows, it is worth evaluating directly.