
TL;DR
Meta's Llama 4 family brings mixture-of-experts to open source with Scout and Maverick. Here's how to run them locally, access them through APIs, and decide when they beat the competition.
Meta changed the trajectory of open-source AI when it released the original Llama in 2023. Each generation pushed the boundary of what you could run without paying an API bill. Llama 4 is the biggest leap yet - not because it is the best model on every benchmark, but because it brings mixture-of-experts (MoE) architecture to the open-source mainstream, delivering dramatically better performance per dollar of compute.
The Llama 4 family ships two models: Scout, built for efficiency and long contexts, and Maverick, built for raw capability. Both use MoE to keep inference costs low while packing in far more knowledge than their active parameter counts suggest. And both ship under a permissive license that lets you fine-tune, self-host, and build commercial products, with a single narrow restriction covered in the licensing section below.
For developers, this means frontier-adjacent intelligence that runs on your own hardware, integrates with your own infrastructure, and costs nothing per token once deployed.
Scout is the workhorse. It uses 16 expert networks with 17 billion active parameters per forward pass out of 109 billion total. This gives it the knowledge capacity of a 109B model with the inference cost closer to a 17B dense model.
The standout feature is the context window: 10 million tokens. That is not a typo. Scout handles entire codebases, book-length documents, and massive datasets in a single context. In practice, most providers cap this lower due to infrastructure constraints, but the architecture supports it natively.
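A quick sanity check for whether a corpus fits a given window. The four-characters-per-token heuristic below is a rough rule of thumb for English text, not an exact tokenizer count:

```python
def fits_in_context(text: str, context_tokens: int,
                    chars_per_token: float = 4.0) -> bool:
    """Rough estimate of whether `text` fits in a context window."""
    return len(text) / chars_per_token <= context_tokens

repo_dump = "x" * 2_000_000                    # ~500K estimated tokens
print(fits_in_context(repo_dump, 10_000_000))  # True: fits Scout's 10M window
print(fits_in_context(repo_dump, 128_000))     # False: overflows a 128K window
```

Run your real corpus through an actual tokenizer before committing to a deployment; the heuristic only tells you which order of magnitude you are in.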
Scout targets the sweet spot where developers spend most of their time: code generation, summarization, multi-turn conversation, document analysis, and general-purpose assistance. It is fast, it is cheap to serve, and it handles breadth well.
Maverick is the heavy hitter. It uses 128 expert networks with the same 17 billion active parameters per forward pass, but draws from 400 billion total parameters. The much larger expert pool means Maverick stores more specialized knowledge and handles nuanced tasks with greater precision.
Maverick targets use cases where quality matters more than speed: complex reasoning, creative writing, difficult code generation, and tasks that benefit from deeper world knowledge. It also supports a 1 million token context window, which is generous for most workloads.
The architecture choice is deliberate. By keeping active parameters at 17B for both models, Meta ensures that inference hardware requirements stay manageable. The difference between Scout and Maverick is not compute per token - it is the depth and breadth of knowledge the model can draw from.
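The numbers above make the trade explicit. A back-of-envelope sketch using the parameter counts quoted earlier:

```python
# Per-token compute tracks ACTIVE parameters; memory and knowledge capacity
# track TOTAL parameters. Figures are the billions quoted above.
def active_fraction(active_b: float, total_b: float) -> float:
    return active_b / total_b

scout_frac = active_fraction(17, 109)     # ~0.16: ~16% of weights used per token
maverick_frac = active_fraction(17, 400)  # ~0.04: same compute, larger expert pool
```

Maverick touches the same 17B parameters per token as Scout, but draws each token's experts from a pool nearly four times larger.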
Llama 3 used dense architectures. Every token passed through every parameter. Llama 4 switches to mixture-of-experts, which is the single biggest architectural change in the family's history. Here is what that shift means in practice:
Mixture-of-experts architecture. Instead of one monolithic network, Llama 4 routes each token to a subset of specialized expert layers. This dramatically improves the ratio of knowledge stored to compute required. You get a smarter model without proportionally higher inference costs.
Native multimodality. Llama 4 processes images, video, and text natively. The models were trained from the ground up on multimodal data, not retrofitted with vision adapters. This means image understanding is a first-class capability, not an afterthought.
Massive context windows. Llama 3 topped out at 128K tokens. Scout supports 10M tokens and Maverick supports 1M. For developers working with large codebases or document collections, this removes a major constraint.
Improved multilingual performance. Llama 4 was trained on a broader multilingual corpus, with stronger performance across European and Asian languages compared to Llama 3's English-dominant training.
Better instruction following. Meta invested heavily in post-training alignment. Llama 4 models follow complex, multi-constraint prompts more reliably than their predecessors, narrowing the gap with closed-source models on instruction adherence.
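The expert routing behind this shift can be sketched in a few lines. This is a toy top-k router assuming softmax gating; real routers are learned linear layers inside the network, and the exact gating scheme Llama 4 uses may differ:

```python
import math

def route_token(gate_logits: list[float], top_k: int = 2) -> list[tuple[int, float]]:
    """Pick the top-k experts for one token and renormalize their gate weights."""
    # Softmax over expert logits (numerically stable form)
    m = max(gate_logits)
    exps = [math.exp(x - m) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-k experts, with weights renormalized to sum to 1
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Four experts; this token routes to experts 2 and 0
print(route_token([1.0, -0.5, 2.0, 0.1]))
```

The token's output is then the gate-weighted sum of only the selected experts' outputs, which is why compute per token stays flat while total capacity grows.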
Benchmarks are directional, not definitive. But they help frame where Llama 4 fits relative to the competition.
| Benchmark | Llama 4 Maverick | Claude Sonnet 4.6 | GPT-5 | DeepSeek R1 | Gemini 2.5 Pro |
|---|---|---|---|---|---|
| MMLU-Pro | 80.5 | 84.1 | 85.3 | 81.2 | 83.7 |
| HumanEval+ | 79.1 | 85.7 | 87.2 | 82.4 | 84.9 |
| GPQA Diamond | 69.8 | 72.8 | 75.1 | 71.5 | 73.2 |
| LiveCodeBench | 55.8 | 69.4 | 72.1 | 65.9 | 67.3 |
| MT-Bench | 8.8 | 9.3 | 9.4 | 9.1 | 9.2 |
| Multilingual MGSM | 91.4 | 88.7 | 90.1 | 82.3 | 93.2 |
Maverick holds its own on knowledge benchmarks (MMLU-Pro) and leads on multilingual math (MGSM). It trails Claude and GPT-5 on coding tasks and structured reasoning, which is expected given the gap in active parameter count. For an open-source model you can self-host, the numbers are strong.
| Benchmark | Llama 4 Scout | Llama 3.1 70B | Qwen 2.5 72B | Gemma 2 27B |
|---|---|---|---|---|
| MMLU-Pro | 74.3 | 66.4 | 71.1 | 58.7 |
| HumanEval+ | 72.8 | 64.2 | 68.9 | 55.3 |
| GPQA Diamond | 61.3 | 46.7 | 52.8 | 40.1 |
| MT-Bench | 8.5 | 8.1 | 8.3 | 7.6 |
Scout outperforms Llama 3.1 70B across the board while using fewer active parameters. It also beats Qwen 2.5 72B on most tasks. The MoE architecture lets Scout punch well above its active parameter weight class.
Meta offers hosted inference through its own API. This is the fastest way to start.
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-meta-api-key",
    base_url="https://api.llama.com/v1",
)

response = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[
        {"role": "user", "content": "Explain the CAP theorem with examples"}
    ],
)

print(response.choices[0].message.content)
```
Meta's API follows the OpenAI format, so any compatible client library works without modification. Switch `llama-4-maverick` to `llama-4-scout` for the smaller model.
Running Llama 4 locally eliminates API costs and keeps your data on your machine. Ollama makes it straightforward.
```bash
# Install Ollama (macOS)
brew install ollama

# Pull Llama 4 Scout (quantized variants)
ollama pull llama4:scout      # Default quantization - ~60 GB
ollama pull llama4:scout-q4   # 4-bit quantized - ~35 GB
ollama pull llama4:scout-q8   # 8-bit quantized - ~55 GB

# Pull Llama 4 Maverick (requires serious hardware)
ollama pull llama4:maverick-q4   # 4-bit quantized - ~120 GB

# Run interactively
ollama run llama4:scout-q4
```
For API-style access to your local model:
```bash
# Ollama exposes an OpenAI-compatible API on port 11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout-q4",
    "messages": [{"role": "user", "content": "Write a REST API in Go"}]
  }'
```
Any tool that supports custom OpenAI endpoints works with your local Llama 4 instance. Point your editor, scripts, or agents at `http://localhost:11434/v1` and you are set.
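If you prefer not to depend on an SDK, the same request can be assembled with the standard library alone. A minimal sketch; the endpoint and model tag mirror the Ollama setup above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same OpenAI-style chat request the curl example sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("llama4:scout-q4", "Write a REST API in Go")
# With Ollama running, urllib.request.urlopen(req) returns the completion JSON
```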
Llama 4 is available across every major inference platform: Meta's own API, hosted providers like Together AI and Fireworks, and local runtimes like Ollama. Third-party providers are often the sweet spot: you get managed infrastructure without API lock-in, since you can switch providers or self-host at any time. The model weights are the same everywhere.
MoE models are memory-hungry because the full parameter set needs to be loaded even though only a fraction activates per token. Here is what you need:
| Model | Quantization | RAM / VRAM Required | Recommended Hardware |
|---|---|---|---|
| Scout | Q4_K_M | 35 GB | Mac Studio M2 Ultra 64GB, or 1x A100 80GB |
| Scout | Q8_0 | 55 GB | Mac Studio M2 Ultra 96GB, or 1x A100 80GB |
| Scout | FP16 | 110 GB | 2x A100 80GB |
| Maverick | Q4_K_M | 120 GB | Mac Pro M2 Ultra 192GB, or 2x A100 80GB |
| Maverick | Q8_0 | 200 GB | 3x A100 80GB |
| Maverick | FP16 | 400 GB | 8x A100 80GB |
For most developers, Scout Q4 is the practical local option. It fits on a well-equipped Mac Studio or a single A100 GPU and delivers strong performance across general tasks. Maverick is better accessed through an API unless you have multi-GPU infrastructure.
Apple Silicon users benefit from unified memory architecture. A Mac Studio with 64GB of unified memory can run Scout Q4 with room for the operating system and other applications. The M2 Ultra and M4 chips handle MoE models efficiently because they avoid the PCIe bottleneck that plagues GPU setups when the model does not fit in a single card.
Llama 4 ships under Meta's updated license, which is functionally similar to MIT for most developers. The license allows commercial use, fine-tuning, self-hosted deployment, and redistribution of derivative models.
The only restriction is a user threshold: companies with over 700 million monthly active users need a separate license from Meta. For the vast majority of developers, startups, and enterprises, the license is unrestricted.
This matters for several practical reasons:
Data privacy. Self-hosting means your prompts and completions never leave your network. For healthcare, legal, finance, and government applications, this can be the deciding factor.
Cost at scale. API pricing works at low volume, but the math changes at scale. A team sending millions of tokens per day saves significantly by running their own inference server, even accounting for hardware costs.
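To make that trade-off concrete, here is an illustrative break-even sketch. The $2.00 per million tokens price and $15,000 server cost are hypothetical placeholders, not quotes from any provider:

```python
def breakeven_days(tokens_per_day: float, api_price_per_mtok: float,
                   hardware_cost: float) -> float:
    """Days until a one-time hardware purchase beats pay-per-token API
    pricing (ignores power and ops costs for simplicity)."""
    daily_api_cost = tokens_per_day / 1e6 * api_price_per_mtok
    return hardware_cost / daily_api_cost

# Hypothetical: 100M tokens/day at $2.00/Mtok against a $15,000 server
days = breakeven_days(100e6, 2.00, 15_000)  # 75 days to break even
```

Plug in your own volume and your provider's actual pricing; the point is that the break-even horizon shrinks linearly as token volume grows.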
Customization. Fine-tuning Llama 4 on domain-specific data produces a model that outperforms general-purpose APIs on your particular workload. This is not theoretical - companies routinely get 10-20% quality improvements from targeted fine-tuning on a few thousand examples.
No vendor lock-in. If your provider raises prices, changes terms, or goes down, you still have the weights. You can deploy on any cloud, any hardware, or any framework.
Choose Llama 4 when:

- Data privacy requires self-hosting, or you want full control of your stack
- High token volumes make per-token API pricing expensive
- You plan to fine-tune on domain-specific data

Choose Claude or GPT-5 when:

- Coding accuracy and structured reasoning are the priority
- Complex agentic workflows need maximum reliability

Choose DeepSeek when:

- You want frontier-level reasoning in an open model under an MIT license
The practical answer for most teams is a hybrid approach. Run Llama 4 Scout locally for high-volume tasks, privacy-sensitive workloads, and rapid iteration. Route complex agentic work and precision-critical tasks to Claude or GPT-5. Use the same OpenAI-compatible API format across all providers so switching is a config change, not a code change.
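That switch can literally live in a config table. A sketch of the pattern; the base URLs, model tags, and routing rule are illustrative placeholders, not verified endpoints:

```python
# Illustrative provider table for OpenAI-compatible clients.
PROVIDERS = {
    "local":  {"base_url": "http://localhost:11434/v1",
               "api_key": "ollama", "model": "llama4:scout-q4"},
    "hosted": {"base_url": "https://api.example-provider.com/v1",
               "api_key": "YOUR_API_KEY", "model": "llama-4-maverick"},
}

def resolve(task_kind: str) -> dict:
    """High-volume or privacy-sensitive work stays local; the rest goes hosted."""
    return PROVIDERS["local"] if task_kind in {"bulk", "private"} else PROVIDERS["hosted"]

cfg = resolve("private")
# Any OpenAI-compatible client then takes cfg["base_url"], cfg["api_key"],
# and cfg["model"] - swapping providers never touches call sites.
```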
The fastest path from zero to running Llama 4:
Try it through an API. Sign up with Together AI or Fireworks, grab an API key, and point any OpenAI-compatible client at their Llama 4 endpoint. Working inference in under five minutes.
Run locally with Ollama. Install Ollama, pull llama4:scout-q4, and start experimenting. No API key, no usage limits, no data leaving your machine. You need at least 35 GB of available memory.
Integrate with your tools. Any editor, CLI, or framework that supports custom OpenAI-compatible endpoints works with Llama 4. Set the base URL and model name and your existing workflows adapt instantly.
Fine-tune for your domain. If you have domain-specific data, fine-tuning Scout on even a few thousand examples can meaningfully improve performance on your particular tasks. Tools like Axolotl and Unsloth make this accessible without deep ML expertise.
Benchmark against your workload. Run your actual prompts through Llama 4 and your current model. Compare quality, latency, and cost across your real use cases. Synthetic benchmarks tell part of the story. Your data tells the rest.
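A tiny harness makes that comparison repeatable. The callable-in, text-out shape below is an assumption chosen to keep the harness provider-agnostic:

```python
import time

def time_completion(call, prompt: str) -> tuple[str, float]:
    """Run one prompt through any `prompt -> text` callable and time it.
    Wrap each provider's client in such a callable to compare latency."""
    start = time.perf_counter()
    text = call(prompt)
    return text, time.perf_counter() - start

# Stand-in for a real client call while wiring up the harness
def stub_model(prompt: str) -> str:
    return prompt.upper()

answer, seconds = time_completion(stub_model, "review this diff")
```

Collect the (output, latency) pairs per provider across your real prompt set, then judge quality by hand or with an eval script; cost falls out of token counts and your provider's price sheet.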
Meta's bet on open source continues to pay dividends for the developer community. Llama 4 does not top every leaderboard, but it puts genuinely capable AI into the hands of anyone willing to download the weights. For a growing number of use cases, that is exactly what matters.
Llama 4 Scout and Maverick are available under Meta's Llama 4 Community License. Visit llama.meta.com for model weights, documentation, and research papers.