TL;DR
A practical walkthrough of Nemotron 3 Super: latent mixture of experts, hybrid Mamba transformer architecture, 1M context, reasoning modes, and the code you actually need to run it on NVIDIA hardware.
NVIDIA shipped Nemotron 3 Super on March 11. The headline number is 120 billion total parameters with 12 billion active per token, but the parameter count is the least interesting thing about this release. What matters for anyone building on top of it is the architecture, the reasoning controls, and the fact that NVIDIA published the full training pipeline alongside the weights.
The model is a Latent Mixture of Experts Hybrid Mamba Transformer. Four words doing a lot of work. Each one corresponds to a real architectural decision that affects how you serve the model, how much it costs to run, and which workloads it handles well. This post walks through the pieces, then shows the code you need to actually use it.
For the visual breakdown of the announcement and a live demo, watch the companion video on the channel. Everything below is the practitioner's version.
Standard mixture of experts routes raw token embeddings to a small subset of expert feedforward networks. You save compute because only a fraction of the experts fire per token. The cost is memory bandwidth: every active expert still needs its weights resident on the device.
Latent MoE compresses tokens into a lower-dimensional latent representation before routing. Experts then operate on the compressed view. The compression frees enough budget that NVIDIA can fit roughly 4x more experts at the same compute cost. More experts means more specialization, which translates to better performance on niche tasks (code in unusual languages, math at the edges of training data, multi-step planning) without the inference bill that a dense 120B model would carry.
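To make the routing concrete, here is a toy latent MoE layer in PyTorch. This is a minimal sketch of the mechanism described above, not Nemotron's implementation; every dimension, expert count, and expert shape below is made up for illustration.

import torch
import torch.nn as nn

class LatentMoE(nn.Module):
    def __init__(self, d_model=4096, d_latent=1024, n_experts=64, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)    # compress before routing
        self.up = nn.Linear(d_latent, d_model)      # decompress after experts
        self.router = nn.Linear(d_latent, n_experts)
        # Experts are FFNs over the latent width, so far more of them fit
        # in the same parameter and bandwidth budget than full-width experts.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_latent, 4 * d_latent),
                nn.GELU(),
                nn.Linear(4 * d_latent, d_latent),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                            # x: (n_tokens, d_model)
        z = self.down(x)                             # (n_tokens, d_latent)
        gates = self.router(z).softmax(dim=-1)
        weights, chosen = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for t in range(z.shape[0]):                  # naive loop, for clarity
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[int(e)](z[t])
        return self.up(out)                          # back to model width

Shrinking the width each expert operates on is the whole trick: the savings per expert become the headroom NVIDIA spends on having more of them.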
The hybrid part is Mamba. Transformer attention layers handle reasoning steps that benefit from arbitrary token-to-token routing. Mamba state space layers handle the long stretches of sequence where attention's quadratic cost would dominate. The result is a model that uses the full 1 million token context window without falling over. NVIDIA reports 4x higher KV cache plus SSM cache utilization compared to a pure transformer at the same sequence length.
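The cache claim is easy to sanity check with napkin math. Every number below is a placeholder (hypothetical model width, layer count, and attention share, not Nemotron's real configuration); the point is the asymptotics: KV cache grows with sequence length, SSM state does not.

# Back-of-envelope cache arithmetic for the hybrid win, with made-up shapes.
seq_len = 1_000_000
d_model, n_layers = 4096, 48
attn_frac = 1 / 6                           # hypothetical share of attention layers

kv_per_attn_layer = 2 * seq_len * d_model   # K and V entries, grows with length
state_per_ssm_layer = 16 * d_model          # fixed recurrent state, placeholder size

hybrid = n_layers * (attn_frac * kv_per_attn_layer
                     + (1 - attn_frac) * state_per_ssm_layer)
pure_attention = n_layers * kv_per_attn_layer
print(f"hybrid cache ~{pure_attention / hybrid:.1f}x smaller at {seq_len:,} tokens")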
Multi-token prediction is the third lever. The model predicts multiple future tokens per forward pass instead of one. In practice this gives roughly 3x tokens per step at inference time when the speculative predictions hit. Combined with NVFP4 pretraining (4-bit floating point during training, which roughly doubles training throughput vs FP8), you get a model that is both cheaper to train and cheaper to serve than its parameter count would suggest.
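To see where the ~3x comes from, here is a toy draft-and-verify decode loop in the shape multi-token prediction enables. propose and verify are hypothetical stand-ins, not a real Nemotron API.

def mtp_decode(propose, verify, tokens, max_new, k=4):
    # propose(tokens, k): one forward pass, k speculative next tokens (MTP heads)
    # verify(tokens, draft): how many of the drafted tokens the model accepts
    produced = 0
    while produced < max_new:
        draft = propose(tokens, k)
        n_accepted = verify(tokens, draft)
        # Always advance at least one token; a real verifier substitutes its
        # own corrected token on a miss rather than reusing the draft.
        accepted = draft[: max(n_accepted, 1)]
        tokens = tokens + accepted
        produced += len(accepted)
    return tokens

When all k drafts check out, one forward pass yields k tokens instead of one; the realized speedup depends on the acceptance rate for your workload.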
The model is sized to fit comfortably on a single H100 node, or on a workstation with two or three high-memory consumer cards once quantized.
If you are evaluating before committing to infrastructure, NVIDIA NIM exposes the model via an OpenAI compatible API at no charge for moderate volumes. Use that for the first round of validation.
The simplest path is the hosted endpoint. Nemotron 3 Super on NVIDIA NIM speaks the OpenAI chat completions protocol, so any existing client works.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-your-key-here",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user", "content": "Refactor this Python function for clarity: ..."},
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=2048,
    extra_body={
        "enable_thinking": True,
    },
)

print(response.choices[0].message.content)
The recommended sampling settings (temperature 1.0, top_p 0.95) come straight from the model card. Both reasoning on and reasoning off use the same sampler. The extra_body dict carries Nemotron specific flags through the OpenAI client.
Nemotron 3 Super exposes three reasoning controls. Pick the one that matches the workload:
| Flag | Behavior | When to use |
|---|---|---|
| enable_thinking: True | Full chain of thought before answering | Multi-step reasoning, agentic tool use, hard math, tricky code refactors |
| enable_thinking: False | Direct answer, no visible thinking trace | Chat, summarization, classification, anything latency sensitive |
| low_effort: True | Reduced reasoning tokens, faster | Light reasoning where you still want some deliberation |
| reasoning_budget: <int> | Hard cap on reasoning tokens | Cost control in production, prevent runaway thinking |
The reasoning budget is the most useful flag once you ship to production. Without it, a hard prompt can spend 8000+ tokens deliberating before producing the answer. With reasoning_budget: 1024 you cap the thinking phase and force the model to commit. Tune the cap per workload.
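Applied to the NIM call from earlier, the cap is one extra key in extra_body:

# Same client as the snippet above; only extra_body changes.
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[{"role": "user", "content": "Plan a zero-downtime database migration."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=2048,
    extra_body={
        "enable_thinking": True,
        "reasoning_budget": 1024,   # hard cap on thinking tokens
    },
)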
For teams that want the weights local, Hugging Face hosts the model under nvidia/nemotron-3-super-120b-a12b. The custom Mamba and latent MoE layers ship in the model repo, so you need trust_remote_code=True on load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/nemotron-3-super-120b-a12b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # custom Mamba and latent MoE layers ship in the repo
)

messages = [
    {"role": "system", "content": "You are a precise SQL assistant."},
    {"role": "user", "content": "Write a query that returns daily active users for the last 30 days from a table called events(user_id, ts)."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        inputs,
        max_new_tokens=1024,
        temperature=1.0,
        top_p=0.95,
        do_sample=True,
    )

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
apply_chat_template is the right entry point. It wires the reasoning flag into the prompt template so the model sees the correct control tokens. Skipping the template and concatenating strings by hand is a common source of degraded outputs.
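The reasoning-off path is the same entry point with the flag flipped, not a hand-edited prompt string:

# Reasoning off for latency-sensitive calls: same template, flipped flag.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,   # direct answer, no thinking trace
).to(model.device)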
For real deployments, the Hugging Face path is a starting point, not a destination. NVIDIA's serving stack for Nemotron 3 Super is Triton Inference Server with the TensorRT LLM backend. The conversion path looks like this:
# 1. Pull weights
huggingface-cli download nvidia/nemotron-3-super-120b-a12b \
    --local-dir ./nemotron-3-super

# 2. Build a TensorRT LLM engine
python -m tensorrt_llm.commands.build \
    --checkpoint_dir ./nemotron-3-super \
    --output_dir ./engines/nemotron-3-super-bf16 \
    --gemm_plugin bfloat16 \
    --max_input_len 131072 \
    --max_seq_len 1048576 \
    --max_batch_size 8 \
    --tp_size 8

# 3. Stage in the Triton model repository
mkdir -p model_repo/nemotron-3-super/1
cp -r engines/nemotron-3-super-bf16/* model_repo/nemotron-3-super/1/

# 4. Launch Triton
tritonserver --model-repository=model_repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002
Two flags matter most. --max_seq_len 1048576 tells the engine to allocate KV plus SSM cache for the full 1M context. If you only ever need 128k, drop this and you reclaim significant memory. --tp_size 8 matches an 8-GPU H100 node. For 4x H200, use --tp_size 4 with FP8 quantization to fit.
Once Triton is up, the Python client looks like any other Triton call:
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = "Explain the difference between Mamba and standard attention in three sentences."

inputs = [
    httpclient.InferInput("text_input", [1, 1], "BYTES"),
    httpclient.InferInput("max_tokens", [1, 1], "INT32"),
    httpclient.InferInput("temperature", [1, 1], "FP32"),
]
inputs[0].set_data_from_numpy(np.array([[prompt]], dtype=object))
inputs[1].set_data_from_numpy(np.array([[512]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[1.0]], dtype=np.float32))

result = client.infer(model_name="nemotron-3-super", inputs=inputs)
print(result.as_numpy("text_output")[0][0].decode("utf-8"))
For continuous batching, dynamic reasoning budgets, and streaming, use the TensorRT LLM in flight batching backend rather than the simple ensemble shown above. The model card includes a reference Triton config that handles all of this.
NVIDIA published numbers across the standard reasoning and coding suites; the model card carries the full table.
Benchmarks are vibes. Run your own evals on your own workload before committing.
Nemotron 3 Super is the right call when you need long-context reasoning over the full 1M window, when agentic workloads benefit from controllable thinking, and when you want 120B-class quality at the serving cost of 12B active parameters.
It is the wrong call when you need a tiny model for edge inference (look at Nemotron Nano 9B V2 instead), when you need vision (try Nemotron Nano 2 VL), or when you are vendor agnostic and your stack lives on AMD or Apple Silicon.
Nemotron 3 ships in three sizes. Nano (30B with 3B active) is shipping. Super (120B with 12B active) is the focus of this post. Ultra (~500B with ~50B active) lands in the first half of 2026. The architecture story is consistent across the three: same latent MoE, same hybrid Mamba, same reasoning controls, scaled across the size envelope. If Super clears the bar for your use case, plan capacity for Ultra now.
The fastest path to a working call is NIM. Grab a key, paste the OpenAI client snippet from above, and you have a reasoning capable model behind a familiar API in five minutes. From there, the Hugging Face checkpoint and the Triton path are both well worn.
For the visual walkthrough, the architecture diagrams, and a side-by-side latency demo with reasoning on and off, watch the Nemotron 3 Super video. For more on the family, see the Nemotron Nano 9B V2 deep dive and the Nano 2 VL post.