
OpenAI’s Best Models Ever: O3 & O4 Mini

OpenAI has just launched its most advanced models, o3 and o4-mini, with significant improvements in reasoning and tool integration within ChatGPT. These new models can perform web searches, execute code, analyze images, and handle complex queries more effectively. The video showcases various benchmarks, reveals cost and performance details, and introduces the new Codex CLI for coding from the terminal. Access information for ChatGPT Plus, Pro, and free users, along with the new API pricing, is also discussed. The advancements represent a step towards unified reasoning and natural conversation capabilities in AI.

- 00:00 Introduction to OpenAI's Latest Models
- 00:13 Demonstration of New Model Capabilities
- 00:29 Enhanced Reasoning and Tool Integration
- 01:27 Benchmark Performance and Comparisons
- 02:38 Multimodal Capabilities and Image Reasoning
- 05:26 Cost and Performance Insights
- 05:41 Codex CLI and Open Source Initiatives
- 06:25 Access and Pricing Details
- 07:32 Future Directions and Conclusion
---
type: transcript
date: 2025-04-16
youtube_id: JtQOanSpxf4
---

# Transcript: OpenAI's o3 and o4-mini in 8 Minutes

OpenAI has just released their most intelligent models to date, o3 as well as o4-mini. And what was really interesting with the announcement today is the integration within ChatGPT and how these models can leverage different tools. Here is a demonstration of a query: the model had to go and search the web for some information, and once it had that information, it used code execution to create a graph and ultimately plot out the information we see on the screen here.

Now, just to quickly go over some aspects from the blog post. These models are trained to think longer before responding; they're the smartest reasoning models OpenAI has released to date. And what's interesting is that, for the first time, their reasoning models can agentically use and combine every tool within ChatGPT. This includes searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images. One interesting part is that these models are trained to reason about how to use the particular tools they have available to them, which allows for detailed and thoughtful answers in the right output formats. All in all, this lets you pose more complicated, multifaceted queries where the model needs to do a web search and some Python execution before ultimately giving the final answer. They mention that combining these models with tool use significantly improves performance: being able to execute code, perform mathematical functions, or search the web ultimately provides better results.
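To make the tool-use idea concrete, here is a minimal sketch of the same pattern from the API side, assuming the official `openai` Python SDK and standard Chat Completions function calling; `search_web` is a hypothetical tool of our own, not one of ChatGPT's built-in tools:

```python
# A minimal sketch of tool use via Chat Completions function calling.
# Assumes the official `openai` Python SDK; `search_web` is a hypothetical
# tool of our own, not a ChatGPT built-in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query."}
                },
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "How did NVIDIA's share price move this week?"}],
    tools=tools,
)

# The model reasons about whether the tool is needed. If it decides to use
# it, the response carries a tool call for our code to execute; otherwise
# it answers directly.
msg = response.choices[0].message
if msg.tool_calls:
    for call in msg.tool_calls:
        print("tool requested:", call.function.name, call.function.arguments)
else:
    print(msg.content)
```

The part the announcement emphasizes is exactly that branch point: the model deciding whether a tool call is warranted at all, rather than invoking tools indiscriminately.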
In terms of the benchmarks, these models are considerably better than o1 as well as o3-mini. If we take a look at the results for the AIME competition math benchmarks for 2024 and 2025, both new models outperform o1 as well as o3-mini. For competition code we also see a considerable jump: o3-mini had a score of 2073, while o3 and o4-mini respectively score over 2700. A similar thing holds for GPQA Diamond, where all of these models, even without tools, outperform their predecessors. For Humanity's Last Exam, we can see how the scores compare against o1-pro as well as o1-mini. Interestingly, the benchmarks also include how these models compare when paired with different tool-use capabilities, such as Python execution or browser usage; just by combining tool usage, you can see how much of a performance bump they get. Similarly, o4-mini with no tools already outperforms o3-mini, and with tools it improves further. Basically, across the board these models outperform just about everything, with the exception of OpenAI's Deep Research. A similar story holds for multimodality, where these models are very strong. One thing that really stands out is the coding benchmarks: on SWE-bench Verified we see scores of 69.1 and 68.1, which is almost a 20-point leap from o3-mini's. And similarly for the Aider Polyglot benchmark, these models outperform by quite a large margin. In terms of instruction following as well as agentic code use, o3 does outperform.

o4-mini, however, doesn't quite outperform o1. Even though these models do appear to be broadly better at most tasks, there are obviously going to be some exceptions. A similar thing shows up in the Tau-bench function-calling eval: o4-mini (high) doesn't quite outperform o1 (high), and o3 (high) posts results very similar to o1 (high). They also call out within the blog post that they're continuing to scale reinforcement learning; in other words, more compute equals better performance. And one thing they call out many times is that the model is trained to use tools through reinforcement learning. These models have learned not just how to use tools, but to reason about when to use them. Say you have a dozen or more different tools available: previously, the more tools you added, the more often you would get tool invocations that weren't the right tool at the right time. This model really looks like an attempt to solve that problem, reasoning across the tools it has access to and deciding when to actually use them.

Another really neat use case of the models is the ability to reason through images. Instead of just taking the image and thinking through it once, the model can analyze different portions of the image. This was demonstrated today with a poster, where it went through different aspects of the poster and was ultimately able to derive an equation and a solution from them. As they say, for the first time these models can integrate images directly into their chain of thought: they don't just see images, they think through them. This unlocks a new class of problem solving that blends visual and textual reasoning, reflected in state-of-the-art performance across multimodal benchmarks. People can upload a photo of a whiteboard, a textbook diagram, or a hand-drawn sketch, and the model can interpret it even if the image is blurry, reversed, or low quality. Additionally, with tool use, the model can manipulate images on the fly, rotating, zooming, and transforming them as part of the reasoning process. The blog post calls out some examples you can take a look at showing how it reasons about an image: it extracts different pieces, reasons about the particular task, and, as you can see, the chain of thought gets quite long. In one example, the model took about 3 minutes of reasoning before it finally gave the answer as a table of results.
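For reference, feeding an image into one of these models from the API looks roughly like the following; a minimal sketch assuming the `openai` Python SDK's image input on Chat Completions, with a placeholder URL (the rotating and zooming described above happens inside ChatGPT's tool use, not through this call):

```python
# A minimal sketch of image input via Chat Completions, assuming the
# official `openai` Python SDK. The image URL is a placeholder; a base64
# data URL works the same way.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What equation does this whiteboard photo derive? Walk through it."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/whiteboard.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```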
Now, in terms of cost and performance, we can see basically across the board that while these models might be slightly more expensive than o3-mini, their intelligence is considerably improved. And it's a similar story for o1 and o3: by and large, the new models are cheaper and more performant to run.

Another big announcement from today was the Codex CLI. This is something similar to Claude Code: what it allows you to do is code from your terminal. You can specify the model, and it will go and reason through your particular codebase; if you're asking for edits, changes, or new features, it will do all of that from the terminal. The nice thing is that they open-sourced it, so you can go ahead and npm install it. You can also take a look at the GitHub repository; I'll link it in the description of the video if you're interested. They also mentioned that today they're launching a $1 million initiative to support projects using Codex CLI and the OpenAI models. If you're interested in trying to get a hold of $25,000 in API credits, you can go ahead and try that out.

In terms of access, ChatGPT Plus, Pro, and Team members are going to be able to access o3, o4-mini, and o4-mini-high within the model selector starting today, replacing o1, o3-mini, and o3-mini-high. Enterprise and Education users will have to wait a week, and free users can try o4-mini by clicking the "Think" button in the composer before submitting their query, as they describe. Rate limits do apply across the board, and they expect to release OpenAI's o3-pro in a few weeks with full tool support; until then, Pro users can still access o1-pro. The other great thing is that both o3 and o4-mini are accessible from the API today. In terms of pricing, o3 is $10 per million input tokens, $2.50 per million cached input tokens, and $40 per million output tokens. o4-mini comes in at $1.10 per million input tokens, $0.275 per million cached input tokens, and $4.40 per million output tokens (there's a quick cost calculation after the transcript).

Finally, what's interesting is what they called out about where they're heading: specialized reasoning capabilities from the o-series combined with the more natural conversation abilities and tool use of the GPT series. By unifying these strengths, their future models will support seamless, natural conversations along with proactive tool use and advanced problem solving. It does look like they're moving in a direction where models can reason or just provide answers, perhaps using a model like GPT-4.1 under the hood, ultimately giving you a seamless experience without having to worry about which model to use for which particular task. Overall, that's pretty much it for this video. If you found this video useful, please comment, share, and subscribe. Otherwise, until the next one.
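As a quick sanity check on the per-request economics quoted above, here is a small sketch that prices a hypothetical request at those launch rates; the token counts are made up purely for illustration:

```python
# Quick cost check at the quoted launch rates (USD per million tokens).
# The example token counts below are made up for illustration.
PRICES = {
    "o3":      {"input": 10.00, "cached_input": 2.50,  "output": 40.00},
    "o4-mini": {"input": 1.10,  "cached_input": 0.275, "output": 4.40},
}

def request_cost(model: str, input_toks: int, cached_toks: int, output_toks: int) -> float:
    """Dollar cost of one request, given token counts per pricing category."""
    p = PRICES[model]
    return (input_toks * p["input"]
            + cached_toks * p["cached_input"]
            + output_toks * p["output"]) / 1_000_000

# A hypothetical request: 5k fresh input tokens, 20k cached, 2k output.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 5_000, 20_000, 2_000):.4f}")
# o3:      $0.1800
# o4-mini: $0.0198
```

At those rates, this particular mix comes out roughly nine times cheaper on o4-mini than on o3.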