
TL;DR
JetBrains released Mellum2 on June 2, 2026 - a 12B MoE model with only 2.5B active parameters per token. Here is how to run it locally, when to use it, and where it fits in your AI coding stack.
| Source | Link |
|---|---|
| JetBrains Mellum2 Blog Post | blog.jetbrains.com/ai/2026/06/mellum2-goes-open-source |
| Hugging Face Model Collection | huggingface.co/collections/JetBrains/mellum-2 |
| JetBrains AI Blog | blog.jetbrains.com/ai |
| Mellum2 Hugging Face Blog | huggingface.co/blog/JetBrains/mellum2-launch |
| Ollama Mellum-4b (prior version) | ollama.com/JetBrains/Mellum-4b-base |
JetBrains released Mellum2 on June 2, 2026 under the Apache 2.0 license. It is a 12-billion parameter Mixture-of-Experts model that activates only 2.5B parameters per token - roughly 5x less compute per forward pass than a dense 12B model. The result is sub-second inference on modest hardware while maintaining competitive benchmark scores for code generation, reasoning, and routing tasks.
This is not a replacement for Claude, GPT-5.5, or DeepSeek V4. JetBrains explicitly positions Mellum2 as a "focal model" - a fast, specialized component inside larger AI systems. Think: routing decisions, RAG summarization, sub-agents, local code completion, and any task where latency matters more than peak reasoning quality.
Last updated: June 18, 2026
Mellum2 is trained on natural language and code data. It deliberately avoids multimodal capabilities - no images, no audio - in favor of specialization for software engineering workflows.
The core use cases JetBrains highlights:
Routing and orchestration. Analyze incoming prompts and decide which model or tool handles them. At 2.5B active parameters, Mellum2 can make routing decisions in milliseconds rather than seconds.
RAG pipeline acceleration. Summarize retrieved context before passing it to a frontier model. Keeping the context window smaller for the expensive model saves tokens and improves coherence.
Fast sub-agents. In agentic workflows where you need a quick classification, extraction, or decision before the main agent continues, Mellum2 handles the intermediate step without blocking.
Private local deployment. Teams with data residency requirements or air-gapped infrastructure can run Mellum2 entirely on-premise. No API calls, no token billing, no third-party data exposure.
High-throughput code features. IDE completions, inline suggestions, and background analysis where sub-second latency is required.
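The routing use case above can be sketched as a small dispatcher. This is a minimal illustration, not JetBrains' design: the route labels, the prompt template, and the `classify` callable are all hypothetical. In a real setup `classify` would call a locally served Mellum2; here it is injected so the control flow can be read without a running model.

```python
from typing import Callable

# Hypothetical route labels; a real system would define its own.
ROUTES = ("local", "frontier")

# Hypothetical prompt template asking the small model for a one-word label.
ROUTER_TEMPLATE = (
    "Classify the request as 'local' (simple, latency-sensitive) or "
    "'frontier' (needs deep reasoning). Answer with one word.\n"
    "Request: {request}\nLabel:"
)


def route(request: str, classify: Callable[[str], str], default: str = "frontier") -> str:
    """Return a route label for `request`.

    `classify` receives the filled-in prompt and returns the model's raw text.
    If the first word is not a known label, fall back to `default` so a
    malformed completion never blocks the pipeline.
    """
    raw = classify(ROUTER_TEMPLATE.format(request=request))
    words = raw.strip().lower().split()
    label = words[0].strip(".,'\"") if words else ""
    return label if label in ROUTES else default


if __name__ == "__main__":
    # Stand-in for a Mellum2 call: a keyword heuristic, for demonstration only.
    def fake_model(prompt: str) -> str:
        return "local" if "rename" in prompt else "frontier"

    print(route("rename this variable", fake_model))
    print(route("design a distributed lock service", fake_model))
```

Defaulting to the more capable route on unparseable output is a deliberate choice in this sketch: a wrong "frontier" costs tokens, while a wrong "local" costs answer quality.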
Mellum2 uses a Mixture-of-Experts architecture with 64 total experts, activating 8 per token. The full model has 12B parameters, but inference only touches 2.5B per forward pass - a roughly 5x reduction in compute compared to running all 12B.
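To make the 8-of-64 figure concrete, here is a toy sketch of generic top-k expert routing. It is not Mellum2's actual router, which the source does not describe; the random scores and the softmax over the selected experts are assumptions chosen to illustrate the general technique.

```python
import math
import random

NUM_EXPERTS = 64   # total experts, per the article
TOP_K = 8          # experts activated per token, per the article


def top_k_route(scores: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their scores.

    Returns {expert_index: mixing_weight}; the weights sum to 1.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in router scores for one token; a real router derives these from the hidden state.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    weights = top_k_route(logits)
    print(f"experts used: {len(weights)} of {NUM_EXPERTS}")
    print(f"fraction of experts active: {TOP_K / NUM_EXPERTS:.3f}")
    print(f"fraction of parameters active: {2.5 / 12:.3f}")
    print(f"compute ratio vs. all 12B: {12 / 2.5:.1f}x")
```

The last line is where "roughly 5x" comes from: 12 / 2.5 is 4.8.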
JetBrains reports inference times "less than half" those of similar-sized models.
The 8192-token context window is sufficient for most code completion and summarization tasks, though shorter than frontier models. For tasks requiring longer context, you would still reach for a model with 128K+ context.
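One practical consequence of the 8192-token window is that the prompt and the generated tokens have to fit inside it together. Below is a minimal sketch of budgeting for that, assuming you already have token IDs from the tokenizer. The function name and the keep-the-tail policy are illustrative choices, not something the source prescribes.

```python
CONTEXT_WINDOW = 8192  # tokens, per the article


def fit_prompt(token_ids: list[int], max_new_tokens: int,
               window: int = CONTEXT_WINDOW) -> list[int]:
    """Trim a prompt so that prompt + generation fits in the context window.

    Keeps the most recent tokens (the tail), which suits code completion,
    where the text nearest the cursor usually matters most.
    """
    if not 0 < max_new_tokens < window:
        raise ValueError("max_new_tokens must be between 1 and window - 1")
    budget = window - max_new_tokens
    return token_ids[-budget:]


if __name__ == "__main__":
    prompt_ids = list(range(10_000))  # pretend prompt of 10,000 tokens
    trimmed = fit_prompt(prompt_ids, max_new_tokens=256)
    print(len(trimmed))  # 8192 - 256 = 7936
```

For summarization, dropping the head of the input may lose important context, so chunking the input is likely a better policy there than tail truncation.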
JetBrains shared benchmark results for Mellum2 across several evaluation suites. The model competes with similar-sized open-weights models while being faster to run:
| Benchmark | Category |
|---|---|
| LiveCodeBench v6 | Code generation |
| AIME 2025/26 | Mathematical reasoning |
| GSMPlus | Grade-school math |
| GPQA Diamond | Graduate-level science |
| MMLU-Redux | General knowledge |
The raw numbers position Mellum2 as competitive with other 12B-class models on code tasks, with the efficiency advantage making it practical for higher-volume deployments. It does not match frontier models like Fable 5, GPT-5.5, or DeepSeek V4-Pro on complex reasoning - but that is not the intended use case.
For SWE-bench and similar agentic evaluations, Mellum2 is better suited as a helper model (routing, summarization, tool selection) than as the primary agent.
Mellum2 weights are available through Hugging Face under the Apache 2.0 license. You can run it locally in several ways:
vLLM supports Mellum2 natively and is the recommended choice for production deployments. The MoE architecture works well with vLLM's optimized inference engine.
pip install vllm
# Download and serve
python -m vllm.entrypoints.openai.api_server \
    --model JetBrains/Mellum2-12B \
    --tensor-parallel-size 1 \
    --max-model-len 8192
This exposes an OpenAI-compatible API endpoint at http://localhost:8000/v1.
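Once the server is up, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library and targets the `/v1/completions` route. The prompt, `max_tokens`, and `temperature` values are placeholders, and whether the completions or the chat route is the better fit depends on the checkpoint you serve, which the source does not specify.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint exposed by the vLLM command above
MODEL = "JetBrains/Mellum2-12B"        # must match the --model value passed to vLLM


def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # placeholder value: greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_request(prompt, max_tokens), timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("def fibonacci(n):")` returns the model's continuation of the prompt as a string.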
For development and experimentation:
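from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "JetBrains/Mellum2-12B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights on available devices (requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",  # use the dtype stored in the checkpoint
)
# Complete a code prompt and print the result without special tokens
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))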
Early community reports suggest Ollama has compatibility challenges with Mellum2's MoE architecture. The earlier Mellum-4b (dense model) works with Ollama, but the new MoE version may require workarounds. Check the Ollama community for current status.
If you need Ollama specifically, consider using the 4B dense variant while waiting for official MoE support:
ollama pull JetBrains/Mellum-4b-base
For the full 12B model with 2.5B active parameters, the memory footprint is larger than that of a 2.5B dense model, because all experts still have to be loaded. Inference is nonetheless faster than with a dense 12B model, since only 2.5B parameters are active per token.
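A back-of-envelope estimate shows why the footprint tracks total rather than active parameters. The sketch counts weight storage only (no KV cache, activations, or framework overhead), and the precisions listed are common choices assumed for illustration, not formats the source confirms for Mellum2.

```python
TOTAL_PARAMS = 12e9    # all experts must be resident in memory
ACTIVE_PARAMS = 2.5e9  # parameters actually used per token

# Bytes per parameter for some common storage precisions (assumed, not from the source).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        resident = weight_gb(TOTAL_PARAMS, precision)
        used = weight_gb(ACTIVE_PARAMS, precision)
        print(f"{precision}: ~{resident:.1f} GB resident, ~{used:.2f} GB of weights used per token")
```

Under these assumptions the weights alone take about 24 GB at 16-bit precision and about 12 GB at 8-bit, regardless of how few parameters a single token activates.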
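Good fit: routing and orchestration, RAG summarization, fast sub-agents, private local deployment, and high-throughput code features such as IDE completions.
Not a fit: complex reasoning that calls for a frontier model, serving as the primary agent in agentic workflows, tasks that need more than 8192 tokens of context, and multimodal input such as images or audio.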
Mellum2 can serve as a fast local model for tasks that do not need frontier reasoning. One pattern is using Mellum2 for initial triage and routing, then handing off to Claude Code for complex work.
Example setup with a custom MCP server that routes to Mellum2 locally:
{
  "mcpServers": {
    "mellum-router": {
      "command": "python",
      "args": ["-m", "mellum_mcp_server"],
      "env": {
        "MELLUM_MODEL_PATH": "/path/to/mellum2",
        "MELLUM_PORT": "8000"
      }
    }
  }
}
This lets Claude Code offload fast classification, summarization, or extraction tasks to Mellum2 while keeping complex reasoning on the frontier model.
| Model | Size | Active Params | License | Best For |
|---|---|---|---|---|
| Mellum2 | 12B | 2.5B | Apache 2.0 | Fast routing, code completion, sub-agents |
| DeepSeek V4-Flash | 17B | 4B | MIT | High-throughput coding, API + self-host |
| Qwen3-8B | 8B | 8B | Apache 2.0 | Balanced local coding |
| Llama 4.1-8B | 8B | 8B | Llama License | General-purpose local model |
Mellum2's advantage is the combination of code specialization and MoE efficiency. DeepSeek V4-Flash is stronger on absolute capability but has higher active parameters. Qwen3 and Llama 4.1 are competitive but dense, meaning slower inference at similar parameter counts.
Mellum2 is a 12-billion parameter Mixture-of-Experts model from JetBrains, released June 2, 2026 under Apache 2.0. It activates only 2.5B parameters per token, making it efficient for high-throughput inference while maintaining competitive benchmark scores for code generation and reasoning tasks.
The recommended approach is vLLM for production or Hugging Face Transformers for development. Download the model from Hugging Face (JetBrains/Mellum2-12B) and serve it with vLLM's OpenAI-compatible API server. Ollama support for the MoE architecture is still being worked on.
Mellum2 can be used commercially: it is released under the Apache 2.0 license, which permits commercial use, modification, and distribution. There are no API costs, since you run inference on your own hardware.
DeepSeek V4-Pro is stronger on absolute capability and benchmarks like SWE-bench. Mellum2 is faster and more efficient for high-volume, lower-complexity tasks. Many teams use both: Mellum2 for routing, summarization, and fast sub-agents; DeepSeek V4 or a frontier model for complex reasoning.
Mellum2 does not replace a frontier coding agent. It is designed as a component model, not a standalone IDE agent, and works best alongside frontier models - handling routing, summarization, and fast tasks while the frontier model handles complex reasoning.
Running Mellum2 requires a minimum of 16GB of VRAM (for example, an RTX 4080 or A100 40GB). The full 12B model loads all experts into memory, but inference only activates 2.5B parameters per token. INT8 quantization can reduce memory requirements if needed.
Mellum2 is a model, not a tool. You can build MCP servers that use Mellum2 for inference, and several community implementations exist. The model itself does not have MCP built in - you provide the tool infrastructure.
The context window is 8192 tokens. This is sufficient for most code completion and summarization tasks but shorter than frontier models with 128K+ context. For long-context tasks, use a model with a larger window.