TL;DR
A practical walkthrough of Nemotron 3 Super: latent mixture of experts, hybrid Mamba transformer architecture, 1M context, reasoning modes, and the code you actually need to run it on NVIDIA hardware.
NVIDIA shipped Nemotron 3 Super on March 11. The headline number is 120 billion total parameters with 12 billion active per token, but the parameter count is the least interesting thing about this release. What matters for anyone building on top of it is the architecture, the reasoning controls, and the fact that NVIDIA published the full training pipeline alongside the weights.
The model is a Latent Mixture of Experts Hybrid Mamba Transformer. Four words doing a lot of work. Each one corresponds to a real architectural decision that affects how you serve the model, how much it costs to run, and which workloads it handles well. This post walks through the pieces, then shows the code you need to actually use it.
For the visual breakdown of the announcement and a live demo, watch the companion video on the channel. Everything below is the practitioner's version.
Standard mixture of experts routes raw token embeddings to a small subset of expert feedforward networks. You save compute because only a fraction of the experts fire per token. The cost is memory bandwidth: every active expert still needs its weights resident on the device.
Latent MoE compresses tokens into a lower-dimensional latent representation before routing. Experts then operate on the compressed view. The compression frees enough budget that NVIDIA can fit roughly 4x more experts at the same compute cost. More experts means more specialization, which translates to better performance on niche tasks (code in unusual languages, math at the edges of training data, multi-step planning) without the inference bill that a dense 120B model would carry.
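To make the routing concrete, here is a toy latent MoE layer in PyTorch. This is a minimal sketch of the mechanism described above, not Nemotron's implementation; every dimension, expert count, and expert shape below is made up for illustration.

import torch
import torch.nn as nn

class LatentMoE(nn.Module):
    def __init__(self, d_model=4096, d_latent=1024, n_experts=64, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)    # compress before routing
        self.up = nn.Linear(d_latent, d_model)      # decompress after experts
        self.router = nn.Linear(d_latent, n_experts)
        # Experts are FFNs over the latent width, so far more of them fit
        # in the same parameter and bandwidth budget than full-width experts.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_latent, 4 * d_latent),
                nn.GELU(),
                nn.Linear(4 * d_latent, d_latent),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                            # x: (n_tokens, d_model)
        z = self.down(x)                             # (n_tokens, d_latent)
        gates = self.router(z).softmax(dim=-1)
        weights, chosen = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for t in range(z.shape[0]):                  # naive loop, for clarity
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[int(e)](z[t])
        return self.up(out)                          # back to model width

Shrinking the width each expert operates on is the whole trick: the savings per expert become the headroom NVIDIA spends on having more of them.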
The hybrid part is Mamba. Transformer attention layers handle reasoning steps that benefit from arbitrary token-to-token routing. Mamba state space layers handle the long stretches of sequence where attention's quadratic cost would dominate. The result is a model that uses the full 1 million token context window without falling over. NVIDIA reports 4x higher KV cache plus SSM cache utilization compared to a pure transformer at the same sequence length.
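The cache claim is easy to sanity check with napkin math. Every number below is a placeholder (hypothetical model width, layer count, and attention share, not Nemotron's real configuration); the point is the asymptotics: KV cache grows with sequence length, SSM state does not.

# Back-of-envelope cache arithmetic for the hybrid win, with made-up shapes.
seq_len = 1_000_000
d_model, n_layers = 4096, 48
attn_frac = 1 / 6                           # hypothetical share of attention layers

kv_per_attn_layer = 2 * seq_len * d_model   # K and V entries, grows with length
state_per_ssm_layer = 16 * d_model          # fixed recurrent state, placeholder size

hybrid = n_layers * (attn_frac * kv_per_attn_layer
                     + (1 - attn_frac) * state_per_ssm_layer)
pure_attention = n_layers * kv_per_attn_layer
print(f"hybrid cache ~{pure_attention / hybrid:.1f}x smaller at {seq_len:,} tokens")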
Multi-token prediction is the third lever. The model predicts multiple future tokens per forward pass instead of one. In practice this gives roughly 3x tokens per step at inference time when the speculative predictions hit. Combined with NVFP4 pretraining (4-bit floating point during training, which roughly doubles training throughput vs FP8), you get a model that is both cheaper to train and cheaper to serve than its parameter count would suggest.
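To see where the ~3x comes from, here is a toy draft-and-verify decode loop in the shape multi-token prediction enables. propose and verify are hypothetical stand-ins, not a real Nemotron API.

def mtp_decode(propose, verify, tokens, max_new, k=4):
    # propose(tokens, k): one forward pass, k speculative next tokens (MTP heads)
    # verify(tokens, draft): how many of the drafted tokens the model accepts
    produced = 0
    while produced < max_new:
        draft = propose(tokens, k)
        n_accepted = verify(tokens, draft)
        # Always advance at least one token; a real verifier substitutes its
        # own corrected token on a miss rather than reusing the draft.
        accepted = draft[: max(n_accepted, 1)]
        tokens = tokens + accepted
        produced += len(accepted)
    return tokens

When all k drafts check out, one forward pass yields k tokens instead of one; the realized speedup depends on the acceptance rate for your workload.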
The model is sized to fit comfortably on a single H100 node, or on a workstation with two or three high-memory consumer cards once quantized.
If you are evaluating before committing to infrastructure, NVIDIA NIM exposes the model via an OpenAI compatible API at no charge for moderate volumes. Use that for the first round of validation.
The simplest path is the hosted endpoint. Nemotron 3 Super on NVIDIA NIM speaks the OpenAI chat completions protocol, so any existing client works.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-your-key-here",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user", "content": "Refactor this Python function for clarity: ..."},
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=2048,
    extra_body={
        "enable_thinking": True,
    },
)

print(response.choices[0].message.content)
The recommended sampling settings (temperature 1.0, top_p 0.95) come straight from the model card. Both reasoning on and reasoning off use the same sampler. The extra_body dict carries Nemotron specific flags through the OpenAI client.
Nemotron 3 Super exposes three reasoning controls. Pick the one that matches the workload:
| Flag | Behavior | When to use |
|---|---|---|
| enable_thinking: True | Full chain of thought before answering | Multi-step reasoning, agentic tool use, hard math, tricky code refactors |
| enable_thinking: False | Direct answer, no visible thinking trace | Chat, summarization, classification, anything latency sensitive |
| low_effort: True | Reduced reasoning tokens, faster | Light reasoning where you still want some deliberation |
| reasoning_budget: <int> | Hard cap on reasoning tokens | Cost control in production, prevent runaway thinking |
The reasoning budget is the most useful flag once you ship to production. Without it, a hard prompt can spend 8000+ tokens deliberating before producing the answer. With reasoning_budget: 1024 you cap the thinking phase and force the model to commit. Tune the cap per workload.
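Applied to the NIM call from earlier, the cap is one extra key in extra_body:

# Same client as the snippet above; only extra_body changes.
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b",
    messages=[{"role": "user", "content": "Plan a zero-downtime database migration."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=2048,
    extra_body={
        "enable_thinking": True,
        "reasoning_budget": 1024,   # hard cap on thinking tokens
    },
)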
For teams that want the weights local, Hugging Face hosts the model under nvidia/nemotron-3-super-120b-a12b. The custom Mamba and latent MoE layers ship in the model repo, so you need trust_remote_code=True on load.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/nemotron-3-super-120b-a12b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # custom Mamba and latent MoE layers ship in the repo
)

messages = [
    {"role": "system", "content": "You are a precise SQL assistant."},
    {"role": "user", "content": "Write a query that returns daily active users for the last 30 days from a table called events(user_id, ts)."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        inputs,
        max_new_tokens=1024,
        temperature=1.0,
        top_p=0.95,
        do_sample=True,
    )

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
apply_chat_template is the right entry point. It wires the reasoning flag into the prompt template so the model sees the correct control tokens. Skipping the template and concatenating strings by hand is a common source of degraded outputs.
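The reasoning-off path is the same entry point with the flag flipped, not a hand-edited prompt string:

# Reasoning off for latency-sensitive calls: same template, flipped flag.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,   # direct answer, no thinking trace
).to(model.device)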
For real deployments, the Hugging Face path is a starting point, not a destination. NVIDIA's serving stack for Nemotron 3 Super is Triton Inference Server with the TensorRT LLM backend. The conversion path looks like this:
# 1. Pull weights
huggingface-cli download nvidia/nemotron-3-super-120b-a12b \
    --local-dir ./nemotron-3-super

# 2. Build a TensorRT LLM engine
python -m tensorrt_llm.commands.build \
    --checkpoint_dir ./nemotron-3-super \
    --output_dir ./engines/nemotron-3-super-bf16 \
    --gemm_plugin bfloat16 \
    --max_input_len 131072 \
    --max_seq_len 1048576 \
    --max_batch_size 8 \
    --tp_size 8

# 3. Stage in the Triton model repository
mkdir -p model_repo/nemotron-3-super/1
cp -r engines/nemotron-3-super-bf16/* model_repo/nemotron-3-super/1/

# 4. Launch Triton
tritonserver --model-repository=model_repo \
    --http-port=8000 \
    --grpc-port=8001 \
    --metrics-port=8002
Two flags matter most. --max_seq_len 1048576 tells the engine to allocate KV plus SSM cache for the full 1M context. If you only ever need 128k, drop this and you reclaim significant memory. --tp_size 8 matches an 8-GPU H100 node. For 4x H200, use --tp_size 4 with FP8 quantization to fit.
Once Triton is up, the Python client looks like any other Triton call:
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = "Explain the difference between Mamba and standard attention in three sentences."

inputs = [
    httpclient.InferInput("text_input", [1, 1], "BYTES"),
    httpclient.InferInput("max_tokens", [1, 1], "INT32"),
    httpclient.InferInput("temperature", [1, 1], "FP32"),
]
inputs[0].set_data_from_numpy(np.array([[prompt]], dtype=object))
inputs[1].set_data_from_numpy(np.array([[512]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[1.0]], dtype=np.float32))

result = client.infer(model_name="nemotron-3-super", inputs=inputs)
print(result.as_numpy("text_output")[0][0].decode("utf-8"))
For continuous batching, dynamic reasoning budgets, and streaming, use the TensorRT LLM in flight batching backend rather than the simple ensemble shown above. The model card includes a reference Triton config that handles all of this.
NVIDIA published numbers across the standard reasoning and coding suites; the model card carries the full table.
Benchmarks are vibes. Run your own evals on your own workload before committing.
Nemotron 3 Super is the right call when you need long-context reasoning over the full 1M window, when agentic workloads benefit from controllable thinking, and when you want 120B-class quality at the serving cost of 12B active parameters.
It is the wrong call when you need a tiny model for edge inference (look at Nemotron Nano 9B V2 instead), when you need vision (try Nemotron Nano 2 VL), or when you are vendor agnostic and your stack lives on AMD or Apple Silicon.
Nemotron 3 ships in three sizes. Nano (30B with 3B active) is shipping. Super (120B with 12B active) is the focus of this post. Ultra (~500B with ~50B active) lands in the first half of 2026. The architecture story is consistent across the three: same latent MoE, same hybrid Mamba, same reasoning controls, scaled across the size envelope. If Super clears the bar for your use case, plan capacity for Ultra now.
The fastest path to a working call is NIM. Grab a key, paste the OpenAI client snippet from above, and you have a reasoning capable model behind a familiar API in five minutes. From there, the Hugging Face checkpoint and the Triton path are both well worn.
For the visual walkthrough, the architecture diagrams, and a side-by-side latency demo with reasoning on and off, watch the Nemotron 3 Super video. For more on the family, see the Nemotron Nano 9B V2 deep dive and the Nano 2 VL post.