
Unleashing the Power of Nvidia's 253 Billion Parameter Model: Llama Nemotron Ultra

Try out NVIDIA models hosted on the NVIDIA platform for free here: https://nvda.ws/3ROOOGW

In this video, we explore Nvidia's groundbreaking AI model, the Llama 3.1 Nemotron Ultra, a 253 billion parameter model making waves in scientific reasoning, coding, math, and agentic AI. We dive into its impressive benchmarks, compare it to competitors, and demonstrate its capabilities through various tests. The video also guides you on how to try out Nemotron via Nvidia's platform and showcases examples of the model's ability to follow complex instructions, render visual artifacts, and demonstrate intricate reasoning. Nvidia's open-source approach and post-training techniques are highlighted, alongside practical applications in coding and scientific problem-solving.

00:00 Introduction to Nvidia's AI Revolution
00:27 Overview of Nemotron Model
01:27 Getting Started with Nemotron
02:14 Demonstrating Nemotron Ultra's Capabilities
04:04 Advanced Instruction Following
05:13 Reasoning and Thinking Abilities
07:26 Chain of Thought Puzzle
09:13 Physics and Coding Demonstrations
10:51 Conclusion and Final Thoughts
---
type: transcript
date: 2025-05-16
youtube_id: XPz4oabjOD8
---

# Transcript: Introducing NVIDIA's Open-Source Nemotron Ultra 253B Model

There's one company that's arguably powering the AI revolution more than any other, and that's Nvidia. We all know their hardware story, but their new 253 billion parameter model, Llama Nemotron Ultra, is quietly becoming one of the world's best open reasoning models for science, coding, math, and agentic AI. Today, I'm going to go through some of the benchmarks, show you the model in action, and then run some tests to demonstrate its capabilities.

Right off the bat, there are three different models within the Nemotron family: the Nano, the Super, as well as the Ultra model. So, in this video, I'm going to be specifically focused on the Ultra model. One thing that's really impressive with the model is its capability across a number of different benchmarks, whether it's scientific reasoning, complex math, tool calling, coding, or instruction following. When we compare it to some other models, we see better performance from this Nemotron Ultra model essentially across the board. The fact that they were able to get this performance out of Llama 3.1 just goes to show how important that post-training piece still is. And my assumption is that once Llama 4 Behemoth comes out, we'll see even better results from the team over at Nvidia, whether it's with a Nemotron update or a different series of models post-trained to improve capability. This is a great example of open source: by Meta releasing the Llama series of models into the wild, a ton of different companies can take advantage of those models, refine them, and make them that much better.

Now, to try out Nemotron Ultra, you can head to build.nvidia.com. Within here, you have all of these different models that you can try out on their inference platform.
All that you need to get started is to select the model you want to leverage. Within here you have both a playground as well as an area where you can grab different scripts, whether you want to try this out in Python, Node, or a shell script. One nice thing is that it's set up so you can still use the OpenAI SDK: even though the SDK is named after OpenAI, you can point it at different providers. All you need to get set up with Nvidia is your API key, which you can get right here within the platform. Paste in the base URL, and then once you have that, you can specify the model, which in this case is Nvidia's Llama 3.1 Nemotron Ultra. Otherwise, all of the parameters are going to be very familiar from other inference APIs. You can try this out directly within the platform.

Now, one thing I want to demonstrate is that you have the ability to turn reasoning on and off. For instance, if you leave reasoning on and you want detailed thinking on a particular topic, you can go ahead and specify that. Alternatively, you can turn off the detailed thinking and it will skip that reasoning step.

What I've done is plug Nemotron Ultra into an application I've been working on, on and off, which is similar to the Claude Artifacts feature. Within here I can say: create a React component that reads Developers Digest. I'll go ahead and send that in. What we see here is it loading, and as soon as we start to get the response back, we see it streaming into the interface. So we have this basic React component; we can see it importing React and doing all of the necessary steps to get that set up. The catch with this application is that it takes about 4,000 tokens of system prompt to get all of this working: behind the scenes, there's a very detailed system prompt specifying exactly how I want all of the different tokens to stream in.
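The setup described above can be sketched as a chat payload. The base URL and model ID here are assumptions taken from build.nvidia.com at the time of writing, so check the platform for the current values; the "detailed thinking on/off" system message is how Nemotron's reasoning toggle is exposed.

```python
import json

# Assumptions: these values come from build.nvidia.com and may change.
NVIDIA_BASE_URL = "https://integrate.api.nvidia.com/v1"
MODEL_ID = "nvidia/llama-3.1-nemotron-ultra-253b-v1"

def build_request(prompt: str, detailed_thinking: bool) -> dict:
    """Nemotron toggles its reasoning step with a system message, so the
    same chat payload works through the OpenAI SDK or raw HTTP."""
    mode = "on" if detailed_thinking else "off"
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": f"detailed thinking {mode}"},
            {"role": "user", "content": prompt},
        ],
        "stream": True,
    }

# With the OpenAI SDK you would point the client at NVIDIA's endpoint:
#   client = OpenAI(base_url=NVIDIA_BASE_URL, api_key="nvapi-...")
#   client.chat.completions.create(**build_request("Hi", detailed_thinking=True))
print(json.dumps(build_request("Explain attention in one sentence.", False), indent=2))
```

Because the endpoint is OpenAI-compatible, everything else (temperature, streaming, max tokens) behaves as you'd expect from other inference APIs.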
Next, what I'm going to demonstrate is creating five visual artifacts: things like HTML pages, React components, SVGs, as well as Mermaid diagrams, and whatever you find creative. What I really want to demonstrate with this is its ability to follow instructions. Now, if I look at all of these various components, I have the HTML page, the React component, the SVG, the flowchart diagram, as well as this ASCII art animation. I can look at the code for all of these, and there isn't even a single syntax error; it was able to follow all of the instructions. The trick is that it's parsing all of these different artifacts within their own particular XML tags, and that's how the front end of the application is able to render them.

Now, just to push the instruction-following capability a little further, I'm going to say: make these components a fair bit more involved; make them colorful if that's appropriate, or otherwise create interesting aspects within each of them that make them a little more complicated. Again, this is basically an extension of what we already have, but the one thing I'm watching for is whether it causes any syntax errors that break the rendering of any of these artifacts. Now, if I look at some of our components, I can see we have this enhanced colorful HTML page. I can go within our interactive React component, where it's changing out the state of the color. Within here, I have this creative SVG that it's made for us. From there, we have a more involved flowchart. And then finally, we have this dynamic, colorful ASCII art animation. This is one that it didn't quite render; it does seem like there's potentially a special character involved.
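The per-artifact XML wrapping described above can be sketched with a small extractor. The `<artifact type="...">` tag format here is a hypothetical stand-in; the video's application defines its own tags in its system prompt.

```python
import re

# Hypothetical wrapper format: each artifact lives in its own XML tag so
# the front end knows which renderer (iframe, React, SVG, Mermaid...) to use.
ARTIFACT_RE = re.compile(
    r'<artifact type="(?P<type>[^"]+)">(?P<body>.*?)</artifact>', re.DOTALL
)

def extract_artifacts(response: str) -> list[tuple[str, str]]:
    """Return (type, body) pairs pulled out of a model response."""
    return [
        (m.group("type"), m.group("body").strip())
        for m in ARTIFACT_RE.finditer(response)
    ]

sample = (
    '<artifact type="svg"><circle r="5"/></artifact>'
    '<artifact type="mermaid">graph TD; A-->B</artifact>'
)
print(extract_artifacts(sample))
```

This is also why a single stray special character can break rendering: if the body corrupts the wrapper tags, the front end has nothing well-formed to parse.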
That doesn't necessarily mean there's an issue with the model per se; it could just be a rendering issue within my application. One thing I've found with this application is that it gives a pretty good indication of instruction following; especially with less powerful models, the issues become very clear.

Now, one thing I want to demonstrate is the thinking ability. I'm going to say: I want to create the game of Pong within HTML, but let's really think through each of the different steps required to create this. I'll go ahead and send that in. The way I've set this up within the application is that if there's a thinking tag in the LLM's response, it renders within this indentation here. Basically, the model went through all of these thinking tokens before actually beginning the implementation. Here we go: I have this working Pong game. I can use the W and S keys and my arrow keys and play Pong; it's just me controlling both sides of the screen here. Based on the reasoning tokens, we can see: okay, the user wants to create a game of Pong in HTML and is asking about the steps involved. Let's start by recalling what Pong is: it's a simple tennis-like game where players hit the ball back and forth on the screen. The classic version has two paddles and a ball. Basically, it went through and made the decisions; we can even see the thinking process deciding that player one is going to use the W and S keys and player two the arrow keys. You can read through all of the different steps. The thing to note with this thinking process, and where it matters for the capability of these models, is that the longer a model spends thinking generally correlates with a better overall response.
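The "render the thinking tag separately" behavior above comes down to splitting the reasoning trace from the final answer. A minimal sketch, assuming the model wraps its reasoning in `<think>...</think>` tags (the exact tag may differ across Nemotron releases):

```python
def split_thinking(response: str) -> tuple[str, str]:
    """Separate the reasoning trace from the final answer so the UI can
    render the trace indented and the answer as the artifact."""
    start, end = "<think>", "</think>"
    if start in response and end in response:
        s = response.index(start) + len(start)
        e = response.index(end, s)
        return response[s:e].strip(), response[e + len(end):].strip()
    return "", response.strip()  # reasoning was toggled off

thinking, answer = split_thinking(
    "<think>Pong needs two paddles, a ball, and a score.</think>Here is the HTML..."
)
print(thinking)
print(answer)
```

With reasoning toggled off, the same function just returns an empty trace and the full response, so one code path handles both modes.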
The fact that you can dynamically turn reasoning on and off is a really great and powerful option for whatever application you might be building. That reasoning toggle gives you flexibility: depending on the use case, you can let the model think through the process if the task requires it; otherwise, if you want a fast response, you can specify that it skip the thinking step.

Now, in terms of some of the important specifics: the model supports a context length of up to 128,000 tokens. These models have a very permissive license, so you're going to be able to monetize whatever application you use them in. You can also take this model and further post-train it if you'd like. One thing to note: the pre-training data has a cutoff of 2023, which is when Llama 3.1 had its pre-training cutoff, and this model was trained between November and April 2025.

Now, I'm going to demonstrate a chain-of-thought puzzle. This is one that Jensen Huang, the CEO of Nvidia, demonstrated at the GTC conference. Basically, I'm going to say: I need to seat seven people at a round wedding table. The constraints are that my parents and in-laws must not sit together, and my spouse wants to be on the left in the photos. Then resolve it if we add a pastor, who can sit anywhere. I'm going to take this a step further and say: render this within a Mermaid diagram. Here we see it going through the thinking process, and this really demonstrates just how much thinking the model can do before it gives an answer. We can see it's going through a very involved thinking process. Once it's done, I'll scroll up and show you just how much thinking went into this type of question before it ultimately gave us the response.
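The trial-and-error search the model performs can also be checked mechanically. This is a brute-force sketch with a simplified cast; the exact constraints and guest list are assumptions (no parent next to an in-law, spouse immediately to my left), not the wording Jensen Huang used.

```python
from itertools import permutations

def neighbors(seating: tuple, i: int) -> set:
    """Both seats adjacent to position i at a round table."""
    n = len(seating)
    return {seating[(i - 1) % n], seating[(i + 1) % n]}

def valid(seating: tuple, parents: set, in_laws: set) -> bool:
    # Constraint 1 (assumed): no parent may sit next to an in-law.
    for i, person in enumerate(seating):
        if person in parents and neighbors(seating, i) & in_laws:
            return False
    # Constraint 2 (assumed): spouse sits immediately to my left.
    me = seating.index("me")
    return seating[(me + 1) % len(seating)] == "spouse"

def any_valid(people: list, parents: set, in_laws: set) -> bool:
    """Fix 'me' in seat 0 to remove rotational symmetry, then try the rest."""
    rest = [p for p in people if p != "me"]
    return any(
        valid(("me",) + perm, parents, in_laws) for perm in permutations(rest)
    )
```

On this simplified cast the buffer effect shows up the same way as in the demo: with only the couple, parents, and in-laws there is no valid seating, but adding a pastor as a buffer between the families makes one possible.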
Within here, it's trying different arrangements and figuring out by trial and error which arrangements don't work. I really encourage you to read through the reasoning traces, because it's very interesting to see how these LLMs ultimately arrive at their conclusions. Just to demonstrate how much thinking went into this answer, I'll scroll down here, and you can see exactly how much thinking went into it. There were a ton of different iterations where it was trying different things and realizing that certain seating arrangements weren't going to work. You can see all of the trial and error, as well as the thought process that went into each iteration, before it ultimately arrived at this explanation as well as this Mermaid diagram. For seven people, no valid arrangement exists that satisfies all the constraints: the parents and in-laws cannot be seated without adjacency. But with eight people, the pastor acts as a buffer, allowing the parents and in-laws to be separated. And from there, we can basically see the arrangements.

Now I'm going to send in a physics question: I have a mass-spring system on the moon; this is the gravity as well as the mass. Derive the period of oscillation, compare it to Earth's gravity, and explain the math in LaTeX. Within here, we see it going through. I don't have LaTeX rendering in the app, and it wasn't as involved as that wedding seating arrangement, but we can see that it went through and provided the step-by-step explanation as well as the calculation. Unfortunately, there's no LaTeX rendering, but this is just to give you yet another idea.

Next, I'm going to ask it to generate a Next.js 15 route handler that fetches the top five stories from the Hacker News API every 5 minutes, caches them in Redis, and returns JSON with the title, URL, as well as the score.
Include environment-safe TypeScript code and a simple test using Jest. Within here, I can see the route that it set up. I can see the Redis configuration. I also see that it's using Zod for our schema. Within here, we're initializing our Redis client, setting up a cache for our stories, checking whether those stories exist within our cache, and making the request to the Hacker News API if they don't. And then finally, here is a test to actually test the endpoint. By the looks of it, it does look like a capable coding model.

But again, the flexibility of this model, being able to dynamically turn thinking on and off, can be super helpful. For instance, if I just want a really quick response and I say, in one sentence, explain how transformers differ from RNNs, we get that inference response very quickly; it doesn't go through the thinking process, but it still gives us a coherent and accurate answer.

Overall, I encourage you to check out Nemotron Ultra as well as all of the other models in the Nemotron family. But that's pretty much it for this video. If you found this video useful, please comment, share, and subscribe. Otherwise, until the next one!