
NVIDIA Nemotron 3 Super: Latent MoE + Hybrid Mamba, 1M Context, Faster Inference

NVIDIA released Nemotron 3 Super, a mixture-of-experts model whose architecture combines a latent mixture of experts with a hybrid Mamba design, blending transformer strengths with Mamba's speed. The model has 120B total parameters with about 12B active per token, which keeps inference fast. Unlike a standard MoE, which routes raw token representations to experts, a latent MoE compresses tokens before routing so experts process smaller inputs, enabling up to four times more experts at the same cost (see the sketch after the chapter list below).

The video also covers third-party benchmarking of openness versus intelligence, noting NVIDIA's permissive access (download weights, self-host, fine-tune, commercialize) and training documentation. It highlights the 1M-token context window, strong long-context multi-user efficiency, availability via Perplexity, developer tools, Hugging Face, and major clouds, and benchmarks showing improved throughput and coding performance versus prior Nemotron models and other sub-250B models.

https://nvda.ws/3Pvzn8o

00:00 Nvidia Model Overview
00:17 Mixture of Experts Basics
00:58 Latent MoE Explained
01:54 Openness vs Intelligence
03:03 Efficiency and Long Context
03:34 How to Use It Today
04:34 Where to Access It
04:57 Benchmarks and Throughput
05:44 Wrap Up and Thanks
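To make the latent-MoE idea concrete, here is a minimal PyTorch sketch of the compress-then-route pattern described above. Every name and dimension here (d_model, d_latent, the expert count, the top-k value) is an illustrative assumption, not Nemotron 3 Super's actual implementation. A standard MoE would route x directly to experts of width d_model; the latent variant first projects tokens down, routes the compressed latents, and projects the expert output back up. Since an expert's compute scales with the width it processes, shrinking the routed representation lets a model pack in more experts for the same per-token cost, which is the trade-off the video describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    """Compress-then-route MoE sketch (illustrative, not NVIDIA's code)."""
    def __init__(self, d_model=512, d_latent=128, n_experts=16, top_k=2):
        super().__init__()
        # Compress tokens before routing: experts operate at d_latent,
        # not d_model, so each expert is cheaper and more experts fit
        # in the same compute budget.
        self.down = nn.Linear(d_model, d_latent)    # token -> latent
        self.up = nn.Linear(d_latent, d_model)      # latent -> token
        self.router = nn.Linear(d_latent, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_latent, 4 * d_latent),
                nn.GELU(),
                nn.Linear(4 * d_latent, d_latent),
            )
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                           # x: (n_tokens, d_model)
        z = self.down(x)                            # compress before routing
        weights, idx = self.router(z).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # normalize over chosen experts
        out = torch.zeros_like(z)
        # Naive dispatch loop for readability; real kernels batch by expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(z[mask])
        return self.up(out)                         # expand back to model width


if __name__ == "__main__":
    layer = LatentMoE()
    tokens = torch.randn(8, 512)                    # 8 tokens, d_model=512
    print(layer(tokens).shape)                      # torch.Size([8, 512])
```

Note that only top_k of n_experts experts actually run for any given token; that sparsity is why a model's active parameters (about 12B here) can be a small fraction of its total parameters (120B), and why compressing the routed representation buys room for more experts without raising per-token compute.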