TL;DR
xAI has launched Grok 4, claiming the title of the world's most powerful AI model. With a $300/month Super Grok tier, a saturated AIME benchmark, and a coding model on the horizon, this is xAI's biggest release yet.
xAI has launched Grok 4, and the benchmarks back up a bold claim: this is the highest-scoring AI model on several key evaluations. But the headline numbers only tell part of the story. The real picture involves tool-augmented reasoning, a $300/month price tag, and a roadmap that includes a dedicated coding model, multimodal agents, and a video generation model trained on 100,000 NVIDIA GB200s.
Humanity's Last Exam is a benchmark created by Scale AI and the Center for AI Safety that tests frontier knowledge across domains including mathematics, chemistry, and linguistics. A strong human score on this exam sits around 5%. Grok 4 scores 26.9% on the text-only version - already competitive with the best models available.
But the more telling result is what happens when you add tools. With access to web browsing, terminal, and other agentic capabilities, Grok 4's score improves dramatically. This aligns with an industry-wide pattern: models paired with tools consistently outperform models reasoning in isolation. The tradeoff is cost and latency - tool-augmented runs consume more compute and take considerably longer to produce results.
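The tool-augmented pattern described above can be sketched as a simple loop: the model either calls a tool or emits a final answer, and tool results are fed back into its context. This is a minimal illustration with a stubbed model and a hypothetical `web_search` tool - a real agent would call an LLM API and let the model choose tools itself.

```python
# Minimal sketch of a tool-augmented reasoning loop (illustrative only).
# The "model" here is a stub; a real agent would call an LLM API and let
# the model decide when to invoke tools such as web search or a terminal.

def web_search(query: str) -> str:
    """Hypothetical tool: return search results for a query."""
    return f"results for: {query}"

TOOLS = {"web_search": web_search}

def run_agent(model_step, question: str, max_steps: int = 5) -> str:
    """Feed tool outputs back to the model until it produces a final answer."""
    context = [question]
    for _ in range(max_steps):
        action = model_step(context)           # model picks a tool or answers
        if action["type"] == "final":
            return action["text"]
        tool = TOOLS[action["tool"]]
        context.append(tool(action["input"]))  # append tool output and loop
    return "max steps reached"

# Stub model: search once, then answer using the tool output.
def stub_model(context):
    if len(context) == 1:
        return {"type": "tool", "tool": "web_search", "input": context[0]}
    return {"type": "final", "text": f"answer based on {context[-1]}"}

print(run_agent(stub_model, "Grok 4 launch date"))
```

The cost and latency tradeoff is visible even in this sketch: every tool call adds a round trip before the final answer can be produced.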
On the AIME benchmark (the American Invitational Mathematics Examination, a math-focused evaluation), Grok 4 achieves a perfect 100%. This is not a typo. The benchmark is effectively saturated, which tells us less about Grok 4 being uniquely good at math and more about the benchmark reaching its ceiling. Expect new, harder math evaluations to emerge as the field continues advancing.
Across GPQA (graduate-level question answering) and LiveCodeBench (live coding evaluation), Grok 4 shows strong performance against OpenAI, Google, and Anthropic models. One important caveat: xAI's comparison chart includes Grok 4 variants with tool access alongside competitor models running without tools. A more apples-to-apples comparison would show the gap narrowing, though Grok 4 would likely still hold competitive positioning.
On the ARC-AGI benchmark, Grok 4 scores just under 16% - nearly double Claude 4 Opus's result. What makes this benchmark interesting is the cost axis. Some models achieve similar scores at dramatically different price points. o3-preview costs over $100 per run on this benchmark, while Claude 4 sits between $1 and $10. Grok 4 offers the second-best score at a competitive cost, making it an appealing value proposition for researchers and developers running repeated evaluations.
VendingBench, from the team at Andon Labs, simulates running a small business (a vending machine operation). Grok 4 ran for longer and grew its net worth more reliably over time than competitor models. It is a playful benchmark, but it tests something practical: sustained decision-making over extended periods with real economic consequences.
Grok 4 introduces the most expensive consumer AI subscription tier yet at $300/month for Super Grok. This includes access to Grok Heavy mode, which uses extended agentic reasoning, tool calling, and web search to tackle complex problems.
Here is how the premium AI subscription landscape looks now:
| Provider | Tier | Price |
|---|---|---|
| OpenAI | Pro | $200/mo |
| Anthropic | Max | $100-200/mo |
| Google | AI Ultra | $250/mo |
| xAI | Super Grok | $300/mo |
Whether $300/month is justified depends entirely on your use case. For professionals working on complex research, financial modeling, or technical problems where Grok 4's extended reasoning capabilities provide measurable value, the cost could be a rounding error. For casual users, the standard Grok tier (accessible through an X Premium subscription at around $8-10/month) provides a reasonable entry point.
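One way to frame the $300/month question is a simple breakeven calculation. The hourly rate and hours saved below are illustrative assumptions, not figures from xAI:

```python
# Breakeven sketch for a $300/month subscription (illustrative numbers).
SUBSCRIPTION = 300.0  # Super Grok, $/month

def hours_to_break_even(hourly_rate: float) -> float:
    """Hours of work the tool must save per month to pay for itself."""
    return SUBSCRIPTION / hourly_rate

# At a $150/hour billing rate, saving two hours a month covers the cost.
print(hours_to_break_even(150.0))  # → 2.0
```

For a consultant billing $150/hour, two saved hours a month covers the subscription - which is why "rounding error" is plausible for some professionals and absurd for casual users.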
Grok 4 includes updated voice interaction, and the demo compared it directly against OpenAI's voice mode. The results suggest Grok 4's voice is more responsive and less prone to interrupting the user mid-sentence. It handles requests like whispering and singing, pushing closer to natural human speech patterns.
Voice AI is becoming a competitive differentiator. As these capabilities mature, the quality of voice interaction will factor into which assistant people choose for daily use - not just which model scores highest on text benchmarks.
The always-on reasoning aspect is worth noting. Grok 4 is inherently a reasoning model - there is no way to disable the chain-of-thought process. This means every API call involves reasoning overhead. For applications where speed matters more than depth (chatbots, simple completions, high-throughput pipelines), you would still want to use Grok 3 or Grok 3 Mini.
xAI laid out an aggressive roadmap:
Coding model - Arriving within weeks of launch. Given how competitive the AI coding space has become (Claude Code, Cursor, Codex, Gemini CLI), a dedicated Grok coding model enters a crowded but high-value market.
Multimodal agent - Expected in the fall. This combines vision, reasoning, and action capabilities into a single agent that can understand images and video, reason about them, and take actions based on its analysis.
Video generation model - The most ambitious item on the roadmap. xAI plans to train this on 100,000 NVIDIA GB200s, which would represent one of the largest compute allocations for a video generation model to date. The scale suggests xAI is aiming to compete directly with OpenAI's Sora and Google's Veo at the frontier.
Grok 4 is available through multiple entry points: the Grok apps, grok.com, and the xAI API.
Unlike most other frontier models, Grok 4 does not offer a non-reasoning mode. Every request triggers the full chain-of-thought reasoning process. This is a deliberate architectural choice - xAI is betting that the quality improvements from always-on reasoning outweigh the latency and cost tradeoffs.
For developers building applications on the Grok 4 API, this has practical implications. If your use case involves high-throughput, low-latency requests - chatbots, autocomplete, simple classification tasks - Grok 4 is the wrong model. Grok 3 or Grok 3 Mini remain better suited for those workloads. Grok 4 is designed for tasks where thinking time translates directly into better output: complex code generation, multi-step problem solving, research synthesis, and financial modeling.
The migration path from Grok 3 to Grok 4 is straightforward from an API perspective, but developers should audit their applications for latency sensitivity before switching. A response that took 2 seconds with Grok 3 might take 15-30 seconds with Grok 4 as the model reasons through the problem.
Grok 4 arrives in a crowded market. OpenAI has GPT-5 and o3. Anthropic has Claude 4 Opus. Google has Gemini 2.5 Pro. Each model has different strengths, and the "best" model depends entirely on the specific task.
What distinguishes xAI's approach is the aggressive roadmap. Announcing a coding model, multimodal agent, and video generation model in rapid succession signals that xAI is not content to compete on a single axis. They are building across the full stack of AI capabilities simultaneously, backed by what appears to be near-unlimited compute resources.
The $300/month price point is also a strategic signal. By pricing above OpenAI and Google, xAI is positioning Grok 4 as a premium product for power users rather than trying to win on volume. Whether this strategy succeeds depends on whether the tool-augmented reasoning capabilities justify the premium in real-world usage, not just benchmarks.
The pricing escalation across the industry tells us something about where things are heading. When multiple companies independently arrive at $200-300/month tiers for their most capable models, it signals that the compute required for frontier reasoning is genuinely expensive - and that there is demand willing to pay for it.
At the same time, the benchmark saturation on tests like AIME (100%) means the evaluation landscape needs to evolve. The models are outpacing the measurements we use to compare them. Expect new, harder benchmarks to emerge that better differentiate between models that all score perfectly on today's tests.
For developers, the practical question remains: which model is best for your specific use case? Grok 4's strengths lie in extended reasoning, tool-augmented problem solving, and sustained performance over long tasks. If your work involves complex analysis, research synthesis, or multi-step agentic workflows, it is worth evaluating directly.