TL;DR
xAI has launched Grok 4, claiming the title of the world's most powerful AI model. With a $300/month Super Grok tier, saturated AIME benchmarks, and a coding model on the horizon, this is xAI's bigge...
Update (May 2026): Since this article was published, xAI has released Grok 4.1 (November 2025) with 65% fewer hallucinations, emotional intelligence features, and a 2M token context window via API. Grok 5 (6T parameters) was announced in January 2026. Pricing has also changed - SuperGrok is now $30/month, with SuperGrok Heavy at $300/month for the multi-agent tier. The core analysis of Grok 4's architecture and capabilities below remains relevant.
xAI has launched Grok 4, and the benchmarks back up a bold claim: this is the highest-scoring AI model on several key evaluations. But the headline numbers only tell part of the story. The real picture involves tool-augmented reasoning, a $300/month price tag, and a roadmap that includes a dedicated coding model, multimodal agents, and a video generation model trained on 100,000 NVIDIA GB200s.
For model-selection context, compare this with "Claude vs GPT for Coding: Which Model Writes Better TypeScript?" and "OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience". The useful question is not only benchmark quality, but where the model fits in a real developer workflow.
Humanity's Last Exam is a benchmark created by the Center for AI Safety and Scale AI that tests frontier knowledge across domains including mathematics, chemistry, and linguistics. A strong human score on this exam sits around 5%. Grok 4 scores 26.9% on the text-only version - already competitive with the best models available.
But the more telling result is what happens when you add tools. With access to web browsing, terminal, and other agentic capabilities, Grok 4's score improves dramatically. This aligns with an industry-wide pattern: models paired with tools consistently outperform models reasoning in isolation. The tradeoff is cost and latency - tool-augmented runs consume more compute and take considerably longer to produce results.
On the AIME benchmark (a math-focused evaluation), Grok 4 achieves a perfect 100%. This is not a typo. The benchmark is effectively saturated, which tells us less about Grok 4 being uniquely good at math and more about the benchmark reaching its ceiling. Expect new, harder math evaluations to emerge as the field continues advancing.
Across GPQA (graduate-level question answering) and LiveCodeBench (live coding evaluation), Grok 4 shows strong performance against OpenAI, Google, and Anthropic models. One important caveat: xAI's comparison chart includes Grok 4 variants with tool access alongside competitor models running without tools. A more apples-to-apples comparison would show the gap narrowing, though Grok 4 would likely still hold competitive positioning.
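The like-for-like caveat can be made concrete with a small helper that ranks only the runs sharing the same tool setting. This is a minimal sketch: the `Result` type, the function name, and the entries in `runs` are placeholders invented for illustration, not benchmark data.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    uses_tools: bool
    score: float


def rank_like_for_like(results: List[Result], uses_tools: bool) -> List[Result]:
    """Rank only the runs that share the same tool setting, best score first."""
    same_setting = [r for r in results if r.uses_tools == uses_tools]
    return sorted(same_setting, key=lambda r: r.score, reverse=True)


# Placeholder entries: names and scores are made up, not benchmark results.
runs = [
    Result("model_a", uses_tools=True, score=0.9),
    Result("model_a", uses_tools=False, score=0.7),
    Result("model_b", uses_tools=False, score=0.8),
]
print([r.model for r in rank_like_for_like(runs, uses_tools=False)])
```

In this toy data, `model_a` looks strongest only when its tool-assisted run is compared against tool-free runs; restricting the ranking to one setting removes that distortion.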
On the ARC-AGI benchmark, Grok 4 scores just under 16% - nearly double Claude 4 Opus's result. What makes this benchmark interesting is the cost axis. Some models achieve similar scores at dramatically different price points. o3-preview costs over $100 per run on this benchmark, while Claude 4 sits between $1 and $10. Grok 4 offers the second-best score at a competitive cost, making it an appealing value proposition for researchers and developers running repeated evaluations.
VendingBench, from the team at Andon Labs, simulates running a small business (a vending machine operation). Grok 4 kept operating longer and increased its net worth more reliably over time than competitor models. It is a fun benchmark, but it tests something practical: sustained decision-making over extended periods with real economic consequences.
Grok 4 introduces the most expensive consumer AI subscription tier yet at $300/month for Super Grok. This includes access to Grok Heavy mode, which uses extended agentic reasoning, tool calling, and web search to tackle complex problems.
Here is how the premium AI subscription landscape looks now:
| Provider | Tier | Price |
|---|---|---|
| OpenAI | Pro | $200/mo |
| Anthropic | Max | $100-200/mo |
| Google | AI Ultra | $250/mo |
| xAI | Super Grok | $300/mo |
Whether $300/month is justified depends entirely on your use case. For professionals working on complex research, financial modeling, or technical problems where Grok 4's extended reasoning capabilities provide measurable value, the cost could be a rounding error. For casual users, the standard Grok tier (accessible through an X Premium subscription at around $8-10/month) provides a reasonable entry point.
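One rough way to test the "rounding error" claim is a break-even calculation: how many hours of work a subscription has to save each month at a given hourly rate. The sketch below uses prices from the table above; the function name and the $150/hour rate are illustrative assumptions, not figures from any provider.

```python
# Monthly prices (USD) taken from the comparison table in this article.
MONTHLY_PRICE_USD = {
    "OpenAI Pro": 200,
    "AI Ultra": 250,
    "Super Grok": 300,
}


def break_even_hours(monthly_price_usd: float, hourly_rate_usd: float) -> float:
    """Hours of work the subscription must save per month to pay for itself."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly_rate_usd must be positive")
    return monthly_price_usd / hourly_rate_usd


if __name__ == "__main__":
    hourly_rate = 150.0  # hypothetical billing rate, purely illustrative
    for tier, price in MONTHLY_PRICE_USD.items():
        hours = break_even_hours(price, hourly_rate)
        print(f"{tier}: breaks even after {hours:.1f} saved hours/month")
```

At that assumed rate the $300 tier pays for itself after two saved hours a month, which is the sense in which the cost can be negligible for professional use.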
Grok 4 includes updated voice interaction, and the demo compared it directly against OpenAI's voice mode. The results suggest Grok 4's voice is more responsive and less prone to interrupting the user mid-sentence. It handles requests like whispering and singing, pushing closer to natural human speech patterns.
Voice AI is becoming a competitive differentiator. As these capabilities mature, the quality of voice interaction will factor into which assistant people choose for daily use - not just which model scores highest on text benchmarks.
The always-on reasoning aspect is worth noting. Grok 4 is inherently a reasoning model - there is no way to disable the chain-of-thought process. This means every API call involves reasoning overhead. For applications where speed matters more than depth (chatbots, simple completions, high-throughput pipelines), you would still want to use Grok 3 or Grok 3 Mini.
xAI laid out an aggressive roadmap:
Coding model - Arriving within weeks of launch. Given how competitive the AI coding space has become (Claude Code, Cursor, Codex, Gemini CLI), a dedicated Grok coding model enters a crowded but high-value market.
Multimodal agent - Expected in the fall. This combines vision, reasoning, and action capabilities into a single agent that can understand images and video, reason about them, and take actions based on its analysis.
Video generation model - The most ambitious item on the roadmap. xAI plans to train this on 100,000 NVIDIA GB200s, which would represent one of the largest compute allocations for a video generation model to date. The scale suggests xAI is aiming to compete directly with OpenAI's Sora and Google's Veo at the frontier.
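There are multiple entry points: the standard Grok tier through an X Premium subscription, the paid SuperGrok tiers, and the xAI API.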
Unlike most other frontier models, Grok 4 does not offer a non-reasoning mode. Every request triggers the full chain-of-thought reasoning process. This is a deliberate architectural choice - xAI is betting that the quality improvements from always-on reasoning outweigh the latency and cost tradeoffs.
For developers building applications on the Grok 4 API, this has practical implications. If your use case involves high-throughput, low-latency requests - chatbots, autocomplete, simple classification tasks - Grok 4 is the wrong model. Grok 3 or Grok 3 Mini remain better suited for those workloads. Grok 4 is designed for tasks where thinking time translates directly into better output: complex code generation, multi-step problem solving, research synthesis, and financial modeling.
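The workload split described above can be expressed as a small routing function. This is a sketch under stated assumptions: the model identifier strings, the task labels, and the 15-second floor (the low end of the response times quoted in this article) are illustrative choices, so check xAI's documentation for the exact model names available to you.

```python
from dataclasses import dataclass

# Model identifiers are assumptions for illustration only.
REASONING_MODEL = "grok-4"
FAST_MODEL = "grok-3"
FAST_SMALL_MODEL = "grok-3-mini"

# Hypothetical task labels mirroring the workloads named in the article.
LOW_LATENCY_TASKS = {"chatbot", "autocomplete", "classification"}
REASONING_TASKS = {"code_generation", "multi_step", "research", "financial_modeling"}


@dataclass(frozen=True)
class Request:
    task: str
    latency_budget_s: float


def pick_model(req: Request, reasoning_floor_s: float = 15.0) -> str:
    """Pick a model name from the task type and the caller's latency budget."""
    # Only route to the reasoning model when the caller can afford the wait.
    if req.task in REASONING_TASKS and req.latency_budget_s >= reasoning_floor_s:
        return REASONING_MODEL
    if req.task in LOW_LATENCY_TASKS:
        return FAST_SMALL_MODEL
    # Unknown tasks, or reasoning tasks with a tight budget, use the fast model.
    return FAST_MODEL
```

For example, `pick_model(Request("research", 60))` selects the reasoning model, while a chatbot request with a 2-second budget stays on the smaller, faster model.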
The migration path from Grok 3 to Grok 4 is straightforward from an API perspective, but developers should audit their applications for latency sensitivity before switching. A response that took 2 seconds with Grok 3 might take 15-30 seconds with Grok 4 as the model reasons through the problem.
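A latency audit of this kind can start as a simple classification of call sites by how long each one can wait. The sketch below assumes you already know a latency budget per call site; the call-site names in the usage note are hypothetical, and the 15-30 second range is the article's estimate rather than a measured figure.

```python
from typing import Dict, List, Tuple

# Expected Grok 4 response range in seconds, taken from the article's estimate.
GROK4_LATENCY_RANGE_S: Tuple[float, float] = (15.0, 30.0)


def audit_latency(budgets_s: Dict[str, float]) -> Dict[str, List[str]]:
    """Sort call sites into safe / at_risk / unsuitable by latency budget."""
    low, high = GROK4_LATENCY_RANGE_S
    report: Dict[str, List[str]] = {"safe": [], "at_risk": [], "unsuitable": []}
    for name, budget in sorted(budgets_s.items()):
        if budget >= high:
            report["safe"].append(name)        # tolerates the slow end of the range
        elif budget >= low:
            report["at_risk"].append(name)     # fine on fast responses only
        else:
            report["unsuitable"].append(name)  # cannot wait for reasoning at all
    return report
```

Calling `audit_latency({"search_suggest": 0.5, "code_review": 20, "report_builder": 120})` would mark the first as unsuitable, the second as at risk, and the third as safe to migrate.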
Grok 4 arrives in a crowded market. OpenAI has GPT-5 and o3. Anthropic has Claude 4 Opus. Google has Gemini 2.5 Pro. Each model has different strengths, and the "best" model depends entirely on the specific task.
What distinguishes xAI's approach is the aggressive roadmap. Announcing a coding model, multimodal agent, and video generation model in rapid succession signals that xAI is not content to compete on a single axis. They are building across the full stack of AI capabilities simultaneously, backed by what appears to be near-unlimited compute resources.
The $300/month price point is also a strategic signal. By pricing above OpenAI and Google, xAI is positioning Grok 4 as a premium product for power users rather than trying to win on volume. Whether this strategy succeeds depends on whether the tool-augmented reasoning capabilities justify the premium in real-world usage, not just benchmarks.
The pricing escalation across the industry tells us something about where things are heading. When multiple companies independently arrive at $200-300/month tiers for their most capable models, it signals that the compute required for frontier reasoning is genuinely expensive - and that there is demand willing to pay for it.
At the same time, the benchmark saturation on tests like AIME (100%) means the evaluation landscape needs to evolve. The models are outpacing the measurements we use to compare them. Expect new, harder benchmarks to emerge that better differentiate between models that all score perfectly on today's tests.
For developers, the practical question remains: which model is best for your specific use case? Grok 4's strengths lie in extended reasoning, tool-augmented problem solving, and sustained performance over long tasks. If your work involves complex analysis, research synthesis, or multi-step agentic workflows, it is worth evaluating directly.
Grok 4 has multiple pricing tiers: X Premium ($8-10/month) includes basic access within the X platform, SuperGrok ($30/month) provides enhanced capabilities, and SuperGrok Heavy ($300/month or $3,000/year) unlocks full agentic reasoning with tool calling. API pricing follows standard xAI rates.
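The gap between monthly and yearly billing on the top tier is easy to work out from the two prices quoted above. The helper name below is an illustrative assumption; the prices are the ones stated in this article.

```python
from typing import Tuple


def annual_savings(monthly_usd: float, yearly_usd: float) -> Tuple[float, float]:
    """Return (dollars saved, fraction saved) when paying yearly instead of monthly."""
    twelve_months = monthly_usd * 12
    saved = twelve_months - yearly_usd
    return saved, saved / twelve_months


saved, fraction = annual_savings(300, 3000)
print(f"Yearly billing saves ${saved:.0f} ({fraction:.1%})")
# prints: Yearly billing saves $600 (16.7%)
```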
Grok 4 leads on several benchmarks including Humanity's Last Exam (26.9% text-only) and ARC-AGI (nearly 16%). However, "better" depends on your use case. Grok 4 excels at extended reasoning and tool-augmented tasks, while GPT-5 and Claude 4 may be faster for simpler requests. Grok 4 is always-on reasoning with no lightweight mode, which adds latency but improves output quality for complex problems.
Grok Heavy is xAI's premium reasoning mode that combines extended agentic thinking with tool calling, web search, and terminal access. It takes longer to respond (15-30 seconds vs 2-3 seconds) but produces more thorough, well-reasoned outputs. Grok Heavy requires the $300/month SuperGrok Heavy subscription.
Yes, Grok 4 can write code and performs well on LiveCodeBench. However, xAI also announced a dedicated coding model coming soon after Grok 4's launch. For coding-specific workflows, you might prefer Claude Code, Cursor, or OpenAI Codex until the Grok coding model ships.
Grok 4 has a 256,000 token context window. Grok 4.1, released in November 2025 after this article was first published, offers up to 2 million tokens via the API for applications that need extremely long context.
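Before sending a long document, it helps to check whether it plausibly fits in the window. The sketch below uses a rough four-characters-per-token heuristic for English text, which is an assumption and not xAI's tokenizer; the reserved output budget is also an illustrative default.

```python
GROK4_CONTEXT_TOKENS = 256_000  # figure quoted in this article
CHARS_PER_TOKEN = 4  # rough heuristic for English text, not xAI's tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count using a characters-per-token heuristic."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_context(prompt: str, reserved_output_tokens: int = 8_000,
                 limit: int = GROK4_CONTEXT_TOKENS) -> bool:
    """True if the estimated prompt plus reserved output stays within the limit."""
    return estimate_tokens(prompt) + reserved_output_tokens <= limit
```

Because the estimate is approximate, treat a result near the limit as a signal to count tokens with the provider's own tooling before relying on it.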
Grok 4 supports text and image input. Video understanding is coming via retraining. xAI also announced plans for a video generation model trained on 100,000 NVIDIA GB200s, though this is a separate product from the base Grok 4 model.
Use Grok 3 or Grok 3 Mini for high-throughput, low-latency tasks like chatbots, autocomplete, or simple classification. Use Grok 4 for complex reasoning tasks where thinking time improves output quality - code generation, research synthesis, multi-step problem solving, or financial modeling. Grok 4 has no non-reasoning mode, so every request involves reasoning overhead.
xAI released Grok 4.1 in November 2025 with 65% fewer hallucinations and a 2M token context window, and announced Grok 5 (6 trillion parameters) in January 2026. The roadmap also includes a dedicated coding model, multimodal agent, and video generation model.