
Try out the model 👉: https://nvda.ws/4lNtzBU

In this video, we explore the benchmarks and capabilities of NVIDIA's newly released small language model, Nemotron Nano 2. We compare its performance to a comparable model, Qwen3-8B, highlighting its superior speed and accuracy. You'll learn about its hybrid architecture combining Mamba and Transformer elements, the training process it underwent, and its ability to handle both reasoning and non-reasoning tasks efficiently. We also look at its tool usage and the flexibility of controlling the model's thinking process, with practical demonstrations and insights into its open-source dataset. Join us for an in-depth examination of this model and see how you can leverage it on various hardware platforms.

Technical report: https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf
---
type: transcript
date: 2025-08-26
youtube_id: 2j_cA7NcoVE
---

# Transcript: Nemotron Nano 9B V2 in 10 Minutes

NVIDIA has just released a new small language model, Nemotron Nano 2. In this video, I'm going to go over the benchmarks, show you how to get started, and look at how it stacks up and what is particularly interesting about this model.

First up, the benchmarks. When we compare this to Qwen3-8B, a comparable model of its size, we can see that on instruction following, math, science, coding, and tool use, this model does outperform Qwen3-8B. But where it's really impressive is on the right-hand side: the measured throughput, where this model is up to 6.3 times faster. Part of how it achieves this is its hybrid architecture, a Mamba and Transformer combination that gives you a reasoning model with the speed of the Mamba architecture and the accuracy of the Transformer architecture.

Given its size, this is a pretty interesting model because you'll be able to run it at the edge and on local consumer hardware. If you have something like a gaming GPU, you will be able to run this model.

Next up is the dataset. This is something that is often overlooked with other open-weight model releases, but NVIDIA is actually open-sourcing a predominant amount of the data used for pre-training. If you're interested, you can pull down the Nemotron Pre-Training Dataset v1, which is now on Hugging Face. So if you do want to use this as a base for another model, you can absolutely do that. The model itself is also available on Hugging Face, so you can pull it down if you want to run it on your own hardware. Additionally, I'll be showing you how to try it on build.nvidia.com.
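Trying the hosted model from your own code mostly comes down to pointing a standard OpenAI-style chat-completions request at NVIDIA's endpoint. Here's a minimal sketch that assembles such a request without sending it; the base URL and model id are assumptions based on NVIDIA's usual naming, so check the code samples on the model page before relying on them.

```python
import json
import urllib.request

# Assumed endpoint and model id for build.nvidia.com (verify on the model page).
URL = "https://integrate.api.nvidia.com/v1/chat/completions"
MODEL = "nvidia/nvidia-nemotron-nano-9b-v2"

def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Assemble (but don't send) an OpenAI-style chat-completions request."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }).encode()
    return urllib.request.Request(
        URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # key from build.nvidia.com
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("How many Rs are in strawberry?", api_key="nvapi-...")
```

Sending `req` with `urllib.request.urlopen` (or swapping in the `openai` client with a custom `base_url`) would return the usual chat-completion JSON.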
If you are interested in trying this out, you can also do that there.

Now, to dive into some more specifics of the model. It is a unified model that allows for both reasoning and non-reasoning tasks, and what's neat is that you can control the thinking process through the system prompt. For particularly hard questions, you can have the model think things through; for a particularly simple query, you can tell it to respond directly and get the answer faster. So you have the flexibility and capabilities of a reasoning model, all while being at a size you can run at the edge or on consumer hardware, like I mentioned.

Now for some specifics of the hybrid architecture, which is one of the big interesting pieces of this model. It leverages Mamba-2 layers and multilayer perceptron (MLP) layers, combined with just four attention layers. In terms of how the model was trained, it leveraged Megatron-LM, as well as NeMo for reinforcement learning. As for supported languages, you can use it in English, German, Spanish, French, Italian, and Japanese. It was also improved by leveraging Qwen, which is definitely nice to see: open-source models cross-pollinating between the different ecosystems.

Now, on reasoning budget control: what you're going to be able to do is set a thinking budget. During inference, your application, or the user, can determine how many tokens you want the model to think for. If it's a particularly hard problem, you can dial that up, and what you'll see is that more allocated thinking tokens do correlate with better responses.
Now, what's really interesting with this diagram in particular, if I zoom out, is where the thinking budget really pays off: AIME 2025. You do see some increases on GPQA as well as LiveCodeBench, but the increase on the AIME 2025 score is quite dramatic. And on MATH-500, we can see just how accurate this model is with its thinking budget dialed up; it looks to be in the high 90s, maybe 95% or higher, in terms of accuracy.

Okay, so next up: the phases of how the model was trained. Another interesting thing in the technical report they released is the data mixture across the various training phases. For the first phase, we can see it's predominantly code and crawled web content, along with academic material, multilingual data, and some STEM material. And if we scroll all the way down to phase three, we can see just how much this composition ended up changing. By the end, it leveraged more and more STEM data, the share of code was considerably decreased, and the crawled content shrank considerably over each phase of training. I thought laying this data mixture out in the technical report was definitely quite interesting. If you're interested in any of this, I'll link it in the description of the video so you can take a look.

Now, to quickly demonstrate the model: NVIDIA has a platform, build.nvidia.com, and what's great about it is that they have a ton of different models on there. If I go over to Explore, you'll be able to try out a ton of different models, from the OpenAI open-source models through to a number of NVIDIA's own models, as well as a whole host of others if you're interested.
Now, to quickly demonstrate the model, I want to show you both the capability and the flexibility of the reasoning process, and also highlight the speed of inference. Additionally, you will be able to try this out within an application if you're interested: they have code samples here for Python and Node, as well as a shell script if you want to try it in your terminal.

They have some preset questions, like "how many Rs are in strawberry". This one went viral a number of months ago because a lot of different models got hung up on something seemingly so simple. Here we see: the word strawberry contains three Rs. We have the breakdown, which spells out the word, and we see the R appears as the third, eighth, and ninth letters.

And here we have the reasoning process, which is folded. You can see that in just a fraction of a second it went through this whole reasoning process, and that's how it ultimately derived the answer it gave us. Unfolding it, we see: the user is asking how many Rs are in the word strawberry. I need to count the number of times the letter R appears in that word. Let me start by writing down the word to visualize it better. It goes through a number of different steps before it actually returns that answer.

And it is incredibly fast. I don't think I've seen speeds quite this fast from a model outside of providers that very specifically focus on inference for transformer-based models. This new paradigm of a hybrid Mamba-Transformer architecture that they're exploring here is particularly interesting.
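The model's write-it-out-and-count procedure is easy to mirror in a few lines of plain Python, which also confirms the positions it reported:

```python
word = "strawberry"

# Count occurrences of "r" and record their 1-based positions,
# mirroring the model's letter-by-letter reasoning.
count = word.lower().count("r")
positions = [i + 1 for i, ch in enumerate(word.lower()) if ch == "r"]

print(count)      # 3
print(positions)  # [3, 8, 9]
```

Models historically stumbled here because they see subword tokens rather than individual letters, which is what made such a trivial check a popular stress test.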
Now, to touch on some other capabilities of the model. It's very flexible in that you can leverage tools within it, and you have the ability to control both whether the model thinks and its thinking budget.

Just to demonstrate the tool process, I'll say: tell me five things about Harry Potter. Walking through what happened here: I asked for five things about Harry Potter, and the first thing the model did was determine from that question that a tool needed to be invoked. Here we have the arguments for that tool, with a name, and it determined that we were asking about Harry James Potter. Once we have that information, we see it reasoning through: the user is asking for five things about Harry Potter. I used the describe-Harry-Potter-character tool with Harry's name. The response gave me some details. Let me check the info here. We have the full name, the nickname, and so on. Then it actually reflected: this is actually six points, but some might be combined. Let me check them a little more carefully. We have the full name, nickname, Hogwarts house, the actor who played him, his children, and his birthdate. And then it determined: okay, this is five. I should present this in a friendly way. In the final answer we can see the full name, the Hogwarts house, and everything listed out, even the birthday and the three children of Harry Potter. As for the specifics, does Harry Potter actually have three children? That's one thing I don't know enough about the Harry Potter series to comment on. If you do know whether this information is accurate, let me know in the comments below.
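The tool round-trip described above can be sketched with a mock: the model emits a structured tool call, the application dispatches it, and the result is fed back for the final answer. The tool name and return payload here are hypothetical stand-ins, not the actual tool registered in the demo.

```python
import json

# Hypothetical tool: the name and returned fields are illustrative,
# not the actual tool from the build.nvidia.com demo.
def describe_harry_potter_character(name: str) -> dict:
    return {
        "full_name": "Harry James Potter",
        "hogwarts_house": "Gryffindor",
        "children": ["James Sirius", "Albus Severus", "Lily Luna"],
    }

TOOLS = {"describe_harry_potter_character": describe_harry_potter_character}

# The model would emit a tool call like this; here we fake it.
tool_call = {
    "name": "describe_harry_potter_character",
    "arguments": json.dumps({"name": "Harry James Potter"}),
}

# Application side: dispatch the call, then hand the result back to the
# model as a "tool" role message so it can compose the final answer.
args = json.loads(tool_call["arguments"])
result = TOOLS[tool_call["name"]](**args)
tool_message = {"role": "tool", "content": json.dumps(result)}
```

This is the standard OpenAI-style tool-calling loop; the interesting part in the demo is the model's reflection step, where it double-checks that the tool output actually maps to the five facts the user asked for.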
Now, additionally, you can control, like I mentioned, exactly how many thinking tokens you want the model to use. If I show you the model with the thinking process turned off and ask for something like ten paragraphs about the Mamba architecture, we can see it very quickly goes through and gives me exactly what I asked for: "The Mamba architecture represents a significant advancement in the field of artificial intelligence," and so on. I went through and counted: we do in fact have ten different paragraphs of what looks to be coherent information on the Mamba architecture.

Additionally, the model allows you to control the thinking budget. If you turn reasoning on, you can specify a minimum number of thinking tokens. If it's a very hard problem, you can dial this up and say: I want this to think for at least this many tokens. It will leverage that many tokens before it gives you the final response.

Okay, so now to pull this all together. We have a 9-billion-parameter model with better performance than Qwen3-8B at considerably faster inference speeds. We have the flexibility of simply specifying `/think` or `/no_think` in the system prompt, and in addition to toggling the thinking process on or off, we can also specify a thinking budget. And finally, it can leverage tools, all in combination with one another. Overall, it's a very dynamic, performant, accurate, and capable model. But that's pretty much it for this video.
Kudos to the team at NVIDIA for releasing yet another great open-source model, as well as all of their contributions, whether it's the research, the technical papers, or being able to host this on your own hardware and use it under a permissive license. We all know NVIDIA's hardware story, but the models they're developing in-house are increasingly impressive too, not to mention that they open-source them so everyone can learn something. So kudos to the team at NVIDIA for building an impressive company across the board. If you found this video useful, please comment, share, and subscribe. Otherwise, until the next one!