Mellum2 Developer Guide: JetBrains' Open-Source Coding Model

Official Sources#

Source	Link
JetBrains Mellum2 Blog Post	blog.jetbrains.com/ai/2026/06/mellum2-goes-open-source
Hugging Face Model Collection	huggingface.co/collections/JetBrains/mellum-2
JetBrains AI Blog	blog.jetbrains.com/ai
Mellum2 Hugging Face Blog	huggingface.co/blog/JetBrains/mellum2-launch
Ollama Mellum-4b (prior version)	ollama.com/JetBrains/Mellum-4b-base

JetBrains released Mellum2 on June 2, 2026 under the Apache 2.0 license. It is a 12-billion parameter Mixture-of-Experts model that activates only 2.5B parameters per token - roughly 5x less compute per forward pass than a dense 12B model. The result is sub-second inference on modest hardware while maintaining competitive benchmark scores for code generation, reasoning, and routing tasks.

This is not a replacement for Claude, GPT-5.5, or DeepSeek V4. JetBrains explicitly positions Mellum2 as a "focal model" - a fast, specialized component inside larger AI systems. Think: routing decisions, RAG summarization, sub-agents, local code completion, and any task where latency matters more than peak reasoning quality.

Last updated: June 18, 2026

What Mellum2 Is Built For#

Mellum2 is trained on natural language and code data. It deliberately avoids multimodal capabilities - no images, no audio - in favor of specialization for software engineering workflows.

The core use cases JetBrains highlights:

Routing and orchestration. Analyze incoming prompts and decide which model or tool handles them. At 2.5B active parameters, Mellum2 can make routing decisions in milliseconds rather than seconds.

RAG pipeline acceleration. Summarize retrieved context before passing it to a frontier model. Keeping the context window smaller for the expensive model saves tokens and improves coherence.

Fast sub-agents. In agentic workflows where you need a quick classification, extraction, or decision before the main agent continues, Mellum2 handles the intermediate step without blocking.

Private local deployment. Teams with data residency requirements or air-gapped infrastructure can run Mellum2 entirely on-premise. No API calls, no token billing, no third-party data exposure.

High-throughput code features. IDE completions, inline suggestions, and background analysis where sub-second latency is required.
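The routing use case above can be sketched in a few lines. This is a minimal illustration rather than JetBrains' method: the route labels, the prompt wording, and the `complete` callable (a stand-in for whatever sends text to a locally served Mellum2) are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical route labels; replace with the backends in your own system.
ROUTES = ("local", "frontier")


def build_routing_prompt(user_prompt: str) -> str:
    """Ask the model for a one-word routing label."""
    return (
        "Classify the request below. Answer with exactly one word: "
        "'local' for simple edits, lookups, or summaries, "
        "'frontier' for multi-step reasoning.\n\n"
        f"Request: {user_prompt}\nAnswer:"
    )


def route(user_prompt: str, complete: Callable[[str], str], default: str = "frontier") -> str:
    """Return a route label, falling back to `default` on unexpected output."""
    raw = complete(build_routing_prompt(user_prompt))
    words = raw.strip().lower().split()
    label = words[0].strip(".,:;'\"") if words else ""
    return label if label in ROUTES else default


# Stand-in for a real call to a locally served model (assumption for the demo).
def fake_complete(prompt: str) -> str:
    return " Local\n"


print(route("Rename this variable", fake_complete))
```

Defaulting to the more capable route on unparseable output is a deliberate choice here: a misrouted hard task costs more than an unnecessary frontier call.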

Architecture and Efficiency#

Mellum2 uses a Mixture-of-Experts architecture with 64 total experts, activating 8 per token. The full model has 12B parameters, but inference only touches 2.5B per forward pass - a roughly 5x reduction in compute compared to running all 12B.
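The ratios in this paragraph can be checked with simple arithmetic. The sketch below uses only the figures quoted in this guide. One general property of MoE models, assumed here rather than confirmed for Mellum2, explains why the active-parameter share is larger than the active-expert share: attention and embedding weights run for every token regardless of which experts are selected.

```python
# Figures quoted in this guide.
TOTAL_PARAMS = 12e9
ACTIVE_PARAMS = 2.5e9
TOTAL_EXPERTS = 64
ACTIVE_EXPERTS = 8

compute_reduction = TOTAL_PARAMS / ACTIVE_PARAMS
active_param_share = ACTIVE_PARAMS / TOTAL_PARAMS
active_expert_share = ACTIVE_EXPERTS / TOTAL_EXPERTS

print(f"compute reduction: {compute_reduction:.1f}x")        # 4.8x
print(f"active parameter share: {active_param_share:.1%}")   # 20.8%
print(f"active expert share: {active_expert_share:.1%}")     # 12.5%
```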

JetBrains reports inference times of "less than half" those of similar-sized models. In practice, this means:

Faster time-to-first-token for interactive use
Higher throughput for batch processing
Less compute per request, which leaves more headroom for concurrent requests (weight memory is not reduced, since all experts stay loaded)

The 8192-token context window is sufficient for most code completion and summarization tasks, though shorter than frontier models. For tasks requiring longer context, you would still reach for a model with 128K+ context.
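When the input is longer than the window, one common workaround is to summarize it in chunks. The sketch below is an illustration under stated assumptions: it counts whitespace-separated words as a rough proxy for tokens (words usually undercount tokens, especially for code, hence the conservative default), and `summarize` is a hypothetical callable standing in for a call to the model.

```python
from typing import Callable, List

CONTEXT_WINDOW = 8192      # context length quoted in this guide
RESERVED_FOR_OUTPUT = 512  # assumed headroom for the instruction and the summary
# Assumed safety factor of 2 because words undercount tokens.
DEFAULT_BUDGET = (CONTEXT_WINDOW - RESERVED_FOR_OUTPUT) // 2


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Pack whitespace-separated words into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text: str, summarize: Callable[[str], str],
                        max_words: int = DEFAULT_BUDGET) -> str:
    """Summarize each chunk separately, then join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return "\n".join(summarize(chunk) for chunk in chunks)


# Stand-in summarizer for the demo: keeps the first three words of each chunk.
def fake_summarize(chunk: str) -> str:
    return " ".join(chunk.split()[:3])


document = " ".join(f"w{i}" for i in range(10))
print(split_into_chunks(document, 4))
print(summarize_long_text(document, fake_summarize, max_words=4))
```

For an exact budget, count tokens with the model's own tokenizer instead of splitting on whitespace.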

Benchmark Performance#

JetBrains shared benchmark results for Mellum2 across several evaluation suites. The model competes with similar-sized open-weights models while being faster to run:

Benchmark	Category
LiveCodeBench v6	Code generation
AIME 2025/26	Mathematical reasoning
GSMPlus	Grade-school math
GPQA Diamond	Graduate-level science
MMLU-Redux	General knowledge

The results JetBrains reports position Mellum2 as competitive with other 12B-class models on code tasks, and its efficiency advantage makes it practical for higher-volume deployments. It does not match frontier models like Fable 5, GPT-5.5, or DeepSeek V4-Pro on complex reasoning, but that is not the intended use case.

For SWE-bench and similar agentic evaluations, Mellum2 is better suited as a helper model (routing, summarization, tool selection) than as the primary agent.

Local Deployment Options#

Mellum2 weights are available through Hugging Face under the Apache 2.0 license. You can run it locally in several ways:

vLLM (Recommended for Production)#

vLLM supports Mellum2 natively and is the recommended choice for production deployments. The MoE architecture works well with vLLM's optimized inference engine.

Terminal

pip install vllm

# Download and serve
python -m vllm.entrypoints.openai.api_server \
  --model JetBrains/Mellum2-12B \
  --tensor-parallel-size 1 \
  --max-model-len 8192

This exposes an OpenAI-compatible API endpoint at http://localhost:8000/v1.
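A client can talk to that endpoint with nothing but the standard library. The sketch below assumes the server from the command above is running on its default port and that the plain-text `/completions` route suits the model (the raw code prompt in this guide suggests so, but check the model card). The helper names are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # endpoint exposed by the vLLM server
MODEL = "JetBrains/Mellum2-12B"        # must match the --model value


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 30.0) -> str:
    """Send the request and return the first completion (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Building the request does not touch the network, so this line runs anywhere.
request = build_completion_request("def fibonacci(n):")
print(request.full_url)
```

Calling `complete("def fibonacci(n):")` sends the request and returns the generated continuation once the server is up.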

Hugging Face Transformers#

For development and experimentation:

Python

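from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "JetBrains/Mellum2-12B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places weights on the available GPU(s); it requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto"  # use the dtype stored in the checkpoint
)

# Plain code prefix: the model continues the function body
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# outputs[0] holds the prompt tokens followed by the generated tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))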

Ollama Compatibility#

Early community reports suggest Ollama has compatibility challenges with Mellum2's MoE architecture. The earlier Mellum-4b (dense model) works with Ollama, but the new MoE version may require workarounds. Check the Ollama community for current status.

If you need Ollama specifically, consider using the 4B dense variant while waiting for official MoE support:

Terminal

ollama pull JetBrains/Mellum-4b-base

Hardware Requirements#

For the full 12B model with 2.5B active parameters:

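Minimum: 16GB VRAM (for example, an RTX 4080). At this size expect to run INT8 or INT4 weights, since 12B parameters at 16-bit precision occupy roughly 24GB before any KV cache; a card such as an A100 40GB has room for the unquantized model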
Recommended: 24GB VRAM for comfortable batch processing
Quantization: INT8/INT4 quantization reduces memory further if needed

The MoE architecture means memory footprint is larger than a 2.5B dense model (you still load all experts), but inference is faster.
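A back-of-the-envelope estimate makes the memory point concrete. The sketch below counts weights only, using the 12B total from this guide and standard bytes-per-parameter figures for each precision. KV cache, activations, and framework overhead come on top, so treat the results as lower bounds.

```python
TOTAL_PARAMS = 12e9  # all experts are loaded, so memory scales with total parameters

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, size):.0f} GB")
# bf16: 24 GB
# int8: 12 GB
# int4: 6 GB
```

By this estimate, 16-bit weights alone exceed a 16GB card, so that tier implies quantized weights.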

When to Use Mellum2#

Good fit:

Local code completion where latency matters more than peak quality
Routing decisions in multi-model pipelines
RAG context summarization before calling a frontier model
Sub-agent tasks in agentic workflows
Private deployments with no API access
High-volume classification and extraction tasks
Development and testing before committing to frontier API costs

Not a fit:

Primary agent for complex multi-step reasoning
Tasks requiring 128K+ context windows
Replacing Fable 5, GPT-5.5, or DeepSeek V4 as your main coding model
SWE-bench-style issue resolution as the sole agent
Multimodal tasks (images, diagrams, screenshots)

Integrating With Claude Code#

Mellum2 can serve as a fast local model for tasks that do not need frontier reasoning. One pattern is using Mellum2 for initial triage and routing, then handing off to Claude Code for complex work.

Example setup with a custom MCP server that routes to Mellum2 locally. The mellum_mcp_server module and the MELLUM_* variables stand for a server you write yourself:

JSON

{
  "mcpServers": {
    "mellum-router": {
      "command": "python",
      "args": ["-m", "mellum_mcp_server"],
      "env": {
        "MELLUM_MODEL_PATH": "/path/to/mellum2",
        "MELLUM_PORT": "8000"
      }
    }
  }
}

This lets Claude Code offload fast classification, summarization, or extraction tasks to Mellum2 while keeping complex reasoning on the frontier model.
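Independent of MCP, the offloading idea can be expressed as "try the local model under a deadline, escalate otherwise". The sketch below is one way to do that with asyncio; it is not how Claude Code itself works, and the `local` and `frontier` callables, the one-second deadline, and the demo stand-ins are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

ModelCall = Callable[[str], Awaitable[str]]


async def local_first(task: str, local: ModelCall, frontier: ModelCall,
                      deadline_s: float = 1.0) -> Tuple[str, str]:
    """Try the local model within deadline_s; escalate on timeout or empty output."""
    try:
        answer = await asyncio.wait_for(local(task), timeout=deadline_s)
        if answer.strip():
            return "local", answer
    except asyncio.TimeoutError:
        pass  # the local model was too slow; fall through to the frontier model
    return "frontier", await frontier(task)


# Stand-ins for real model calls (assumptions for the demo).
async def fast_local(task: str) -> str:
    return f"label for: {task}"


async def slow_local(task: str) -> str:
    await asyncio.sleep(0.2)
    return "too late"


async def fake_frontier(task: str) -> str:
    return f"frontier answer for: {task}"


print(asyncio.run(local_first("classify this diff", fast_local, fake_frontier)))
print(asyncio.run(local_first("classify this diff", slow_local, fake_frontier, deadline_s=0.05)))
```

The returned tuple records which model answered, which is useful for logging how often the local path actually suffices.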

Mellum2 vs Other Local Models#

Model	Size	Active Params	License	Best For
Mellum2	12B	2.5B	Apache 2.0	Fast routing, code completion, sub-agents
DeepSeek V4-Flash	17B	4B	MIT	High-throughput coding, API + self-host
Qwen3-8B	8B	8B	Apache 2.0	Balanced local coding
Llama 4.1-8B	8B	8B	Llama License	General-purpose local model

Mellum2's advantage is the combination of code specialization and MoE efficiency. DeepSeek V4-Flash is stronger on absolute capability but has higher active parameters. Qwen3 and Llama 4.1 are competitive but dense: every parameter is active for every token, so inference is slower at a similar total parameter count.

FAQ#

What is Mellum2?#

Mellum2 is a 12-billion parameter Mixture-of-Experts model from JetBrains, released June 2, 2026 under Apache 2.0. It activates only 2.5B parameters per token, making it efficient for high-throughput inference while maintaining competitive benchmark scores for code generation and reasoning tasks.

How do I run Mellum2 locally?#

The recommended approach is vLLM for production or Hugging Face Transformers for development. Download the model from Hugging Face (JetBrains/Mellum2-12B) and serve it with vLLM's OpenAI-compatible API server. Ollama compatibility with the MoE architecture is reportedly still unsettled; see Ollama Compatibility above.

Is Mellum2 free to use?#

Yes. Mellum2 is released under the Apache 2.0 license, which permits commercial use, modification, and distribution. There are no API costs - you run inference on your own hardware.

How does Mellum2 compare to DeepSeek V4?#

DeepSeek V4-Pro is stronger on absolute capability and benchmarks like SWE-bench. Mellum2 is faster and more efficient for high-volume, lower-complexity tasks. The two can be combined: Mellum2 for routing, summarization, and fast sub-agents; DeepSeek V4 or a frontier model for complex reasoning.

Can Mellum2 replace Claude Code or Cursor?#

No. Mellum2 is designed as a component model, not a standalone IDE agent. It works best alongside frontier models - handling routing, summarization, and fast tasks while the frontier model handles complex reasoning.

What hardware do I need to run Mellum2?#

Minimum 16GB VRAM (for example, an RTX 4080), typically with INT8 or INT4 quantization; unquantized 16-bit weights need roughly 24GB, which fits on a card such as an A100 40GB. The full 12B model loads all experts into memory, but inference only activates 2.5B per token.

Does Mellum2 support MCP?#

Mellum2 is a model, not a tool. You can build MCP servers that use Mellum2 for inference, and several community implementations exist. The model itself does not have MCP built in - you provide the tool infrastructure.

What context length does Mellum2 support?#

8192 tokens. This is sufficient for most code completion and summarization tasks but shorter than frontier models with 128K+ context. For long-context tasks, use a model with a larger window.

