
Explore the newly released GPT-OSS 120B and GPT-OSS 20B models from OpenAI, designed for in-depth reasoning and coding tasks. Learn how these models stack up against existing solutions, how you can run them locally or through cloud providers, and their performance benchmarks. Discover how these models use tool calls in their reasoning process and how they can be easily integrated into your applications. Get insights into pricing and hosting options to get started with these powerful open-weight AI models.

https://openai.com/index/introducing-gpt-oss/

"gpt-oss-120b and gpt-oss-20b push the frontier of open-weight reasoning models"

Chapters:
- 00:00 Introduction to OpenAI's New Models
- 00:17 Overview of the New Models
- 01:17 Training and Capabilities
- 01:37 Tool Use and Reasoning
- 02:18 Benchmark Performance
- 03:14 Practical Examples
- 04:07 Pricing and Hosting Options
- 05:33 Getting Started with the Models
- 06:37 Conclusion and Final Thoughts
---
type: transcript
date: 2025-08-06
youtube_id: nRQEQaPehjc
---

# Transcript: OpenAI GPT-OSS in 7 Minutes

The much-anticipated open-weight reasoning models from OpenAI are now here. In this video, I'm going to go over the blog post, show you how these models stack up to some of the competition out there, show you how you can get started locally, and cover some options for plugging them into your application through a cloud provider.

First things first, there are two new models: GPT-OSS 120B and GPT-OSS 20B. These models come under an Apache 2.0 license, and both are reasoning models. Both are mixture-of-experts models. The larger model you'll be able to run on a single 80 GB GPU, something like an Nvidia A100, and the smaller 20B model you'll be able to run on something like a laptop, assuming you have 16 GB of memory. If you're looking for a private solution or something you can run offline, this is going to be the model for that.

Now, one of the reasons these models are particularly interesting is that they are OpenAI's first open-weight models since GPT-2 was released back in 2019. In other words, they're the first in over five years. In terms of activated parameters at inference time, GPT-OSS 120B activates 5.1 billion parameters per token, while the 20 billion parameter model activates 3.6 billion parameters per token. Both models support a context length of up to 128,000 tokens.

In terms of how the models were trained: they were trained mostly on an English, text-only dataset with a focus on STEM, coding, and general knowledge. Additionally, OpenAI is open-sourcing the tokenizer used for o4-mini and GPT-4o, which they're calling o200k_harmony.

One aspect I was really excited to see is that during post-training, the models were trained to apply chain-of-thought reasoning and tool use before producing their answers. If you've used a reasoning model before and seen the thinking traces, one of the really neat aspects here is that the model can invoke different tool calls before producing the answer. Where this is powerful is that you can run things like web search or code execution and get those responses back before the model gets to the point of responding to your query, all within the thinking process. Where I've leveraged this before, it can be really helpful not just in determining which tools to use, but also in reflecting on results, without having to set up an agent architecture to make all of this work.
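To make the tool-calls-during-reasoning idea concrete, here's a minimal sketch of what that loop looks like from the client side. It assumes Ollama is serving gpt-oss locally on its OpenAI-compatible endpoint (`http://localhost:11434/v1`); the `web_search` tool is a hypothetical placeholder, and the exact model ID and tool-call behavior may differ by provider:

```python
# A minimal sketch of tool calling with gpt-oss, assuming Ollama is running
# locally and exposing its OpenAI-compatible endpoint. The `web_search` tool
# and its implementation are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool the model may call mid-reasoning
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "How many experts per layer does gpt-oss-120b use?"}]
response = client.chat.completions.create(model="gpt-oss:20b", messages=messages, tools=tools)

# If the model decided it needs a search before answering, it returns a tool
# call instead of a final answer; we execute it and feed the result back.
msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    result = "..."  # placeholder for a real web_search implementation
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    response = client.chat.completions.create(model="gpt-oss:20b", messages=messages, tools=tools)

print(response.choices[0].message.content)
```

The point of the video's observation is that the model handles the "should I search, and for what?" decision inside its own reasoning; your code only needs this simple request/tool-result loop rather than a full agent framework.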
In terms of benchmarks, I'll quickly go over what's in the blog post, then pivot over to the Artificial Analysis page and show you how these stack up against some of the other models out there. First up, for the larger model, what's really impressive is that it basically outperforms o3-mini across the board, even without tools. Even when stacked up against o3, we can see these models are very competitive with some of the latest frontier models. On Humanity's Last Exam, we don't quite see outperformance across the board when compared to all of the models, but we do see very strong scores, especially for open-weight models. Next up, on competition math, these models are very close across the board to o4-mini, as well as outperforming o3.

On GPQA Diamond, we have 80.1% and 71.5% respectively for the two models. For MMLU, the 120B model scores 90%, and the 20 billion parameter model scores 85.3%.

Now for a quick example. Let's say you ask a question like: "You're OpenAI's latest open-weight model. Some of the details have been leaked on the internet in the last couple of days. Can you figure out how many experts there are per layer?" What you can see is that the user asks this question, the model interprets the request, and then we see it issue a search query. If that doesn't yield the results it was looking for, it can decide to search for something else and continue the process, without the need to set up an agent orchestration process to make all of this work. If you're building any sort of agentic application, I definitely encourage you to look at these models, because having that tool-call functionality directly within the thinking process is very powerful.

Now, I want to touch on why this model is a big deal. I'm going to take a look at Artificial Analysis, which is an independent analysis of different AI models as well as hosting providers in terms of price. Because this is an open-weight model, there is already a ton of competition for hosting it. We already see it on platforms like Groq and Cerebras. We even see the 120 billion parameter model as low as 10 cents per million input tokens and 50 cents per million output tokens, and on Groq at 15 cents per million input tokens and 75 cents per million output tokens. For the smaller 20 billion parameter model, we have prices as cheap as 5 cents per million input tokens and 20 cents per million output tokens on Fireworks, and on Groq it's 10 cents per million input tokens and 50 cents per million output tokens.

Now, just to give another visual of how these models stack up: the one thing I do want to call out is that these models are considerably smaller than most of the other models we see on screen here. While we don't necessarily know the exact size of things like Grok 2 or Gemini 2.5 or o3 for that matter, we can safely bet those models are likely in the hundreds of billions of parameters. But given their size, across things like MMLU-Pro, GPQA Diamond, Humanity's Last Exam, and LiveCodeBench, we can see these are very respectable models in terms of their performance relative to some of the best models in the world. That said, if you're expecting these open-weight models to be leading edge, able to generate full web apps like something such as Claude Opus, these are not those types of models. With that being said, they will perform quite well depending on the task you're looking at.

Now, in terms of getting started, you'll be able to pull the models down from Hugging Face if you want to do that. Alternatively, I do encourage you to try out Ollama if you haven't already; it's a great option for running models locally. All you have to do to get set up is install Ollama, and once it's installed, within your terminal you can run `ollama run gpt-oss` and it will default to the 20 billion parameter model.
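If you'd rather script against the local model than chat in the terminal, here's a minimal sketch using the official `ollama` Python package, assuming the Ollama server is running and `ollama run gpt-oss` (or `ollama pull gpt-oss:20b`) has already fetched the model:

```python
# A minimal sketch of querying a locally running gpt-oss model through the
# official `ollama` Python package (pip install ollama). Assumes the Ollama
# server is running and the gpt-oss:20b model has already been pulled.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",  # swap in "gpt-oss:120b" if your hardware can handle it
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
)
print(response["message"]["content"])
```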
Alternatively, if you have something like an A100, or something like an M3 Max with a ton of RAM, you can go ahead and try out the larger model by just pulling it down. Now, in terms of accessing the models hosted, I did try them out on Groq today, which is a really great option for running this. I think you get over 1,000 tokens per second for the small model and somewhere in the order of 500 tokens per second on the 120 billion parameter model. Another great option is OpenRouter, where you'll be able to see all of the different providers, their pricing, tokens per second, and context windows, all built into the platform, and you can have unified billing through OpenRouter if you're interested.
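For routing requests through OpenRouter, the standard OpenAI client pointed at OpenRouter's endpoint is enough. A minimal sketch, assuming the model is listed under the slug `openai/gpt-oss-120b`; check OpenRouter's model catalog for the exact ID and current pricing:

```python
# A minimal sketch of calling gpt-oss through OpenRouter's OpenAI-compatible
# API. The model slug "openai/gpt-oss-120b" is an assumption; verify it in
# OpenRouter's model catalog before relying on it.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # your OpenRouter key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the gpt-oss release in one paragraph."}],
)
print(response.choices[0].message.content)
```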
If you found this video useful, please comment, share, and subscribe. Otherwise, until the next one.