
# OpenAI GPT-5: Unveiling the Future of AI - Features, Pricing, and Benchmarks

OpenAI has launched GPT-5, its most advanced model to date. This video covers the key aspects of GPT-5, including its unified system for efficient query processing, new features, pricing details, and performance benchmarks. Highlighted improvements include better handling of real-world queries, front-end coding capabilities, and health-related responses. The video also compares GPT-5 against other models across various benchmarks, showing its superiority in many areas. Additionally, the four different versions of GPT-5 available via the API and their respective costs are discussed.

Chapters:
- 00:00 Introduction to GPT-5
- 00:25 Unified System and Real-Time Routing
- 01:05 Performance and Real-World Applications
- 01:24 Coding Improvements and Demonstrations
- 02:04 Health Applications and Comparisons
- 02:47 Benchmark Analysis and Comparisons
- 05:34 API Models and Pricing
- 06:40 Additional Features and Access
- 07:57 Conclusion and Final Thoughts
---
type: transcript
date: 2025-08-08
youtube_id: 7w38FqMYA1E
---

# Transcript: GPT-5 in 8 Minutes

OpenAI has just released GPT-5, their smartest, fastest, and most useful model yet. In this video, I'm going to go over the blog post and touch on the key aspects of what you need to know: pricing, availability within the API, some of the new features, and how it performs on benchmarks relative to other models out there. Let's dive in.

First things first, one of the big changes with GPT-5 is that OpenAI describes it as a unified system that does efficient, real-time model routing depending on the complexity of your query. What's really interesting here is that instead of having to set a parameter for thinking mode or a specific level of thinking effort, with GPT-5 you can say things like "think hard about this," or, if it's an inherently hard problem, the model will determine on its own that it should spend more time thinking through what it needs to do. Whereas if it's a basic query or a conversational query, it won't spend additional test-time compute on tasks that don't actually require any thinking.

Next up, they describe GPT-5 as not only outperforming previous models on benchmarks, but, most importantly, performing well on real-world queries. The areas they focused on were reducing hallucinations and instruction following, but above all writing, coding, and health, which they describe as ChatGPT's most common use cases.

Now, in terms of coding, they described particular improvements to front-end coding, and they had a number of examples. Here's an example of a ball-rolling game where the score increments as we play.
We also have this pixel game where we can draw on a canvas, a typing game, a drum simulator game, as well as a lo-fi game. During the demonstration, they also showed GPT-5 within Cursor, where it generated a dashboard. But one of the more impressive generations was the castle demonstration, which looks like something built with Three.js: an interactive game where a castle is being defended and you can shoot different balloons. That was definitely a standout example of front-end generation.

Now, another interesting area they highlighted was health and how people are using something like ChatGPT for health tasks. Here's a demonstration with o3, where you ask something like, "What does it mean if my mother had cancer? Does that put me at risk?" o3 gives a somewhat dry response: a table, links out to different websites, and some useful information. When we compare that to GPT-5, it's a lot less robotic. It starts with, "I'm sorry you're dealing with this worry. Many people have the same question," and then goes through what looks to be the same relevant information o3 gave. It's the response I personally would have preferred for this type of question.

Now, to quickly touch on some of the evals. Instead of going through them in OpenAI's own blog post, I wanted to go over Artificial Analysis and some of the benchmarks they plot, because unlike the numbers on OpenAI's page, these are plotted across all of the different vendors in the industry. First up is the Artificial Analysis Intelligence Index, which is an aggregate score of eight different evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, and so on.
Right off the bat, GPT-5 on high mode is state-of-the-art compared to all other models. Even GPT-5 on medium outperforms the best model out there. One interesting aspect: GPT-5 on low mode ranks higher than Claude 4 Sonnet (thinking), but just shy of Qwen3, the 235-billion-parameter reasoning model. And what's really interesting is that GPT-5 with minimal thinking plots below GPT-4.1 and just above Llama 4 Maverick. Where this matters is both the cost and the efficiency of the model. We'll see Grok 4 outperform on a number of different benchmarks, but one thing to keep in mind is that this comes at added token usage to think through the process and get the results we actually see. That's where the intelligence-to-output-tokens numbers are helpful, and they matter for speed, cost, and the overall efficiency of the models. On the intelligence index versus number of tokens used, GPT-5 basically outranks all of the other models on that curve.

Now, to dive into some of the specific benchmarks: for MMLU-Pro, Humanity's Last Exam, AIME, long context, as well as instruction following, this model is best-in-class. Where the model doesn't quite outperform is on things like GPQA Diamond, where Grok 4 still has the lead. On this chart they don't have GPT-5 plotted for LiveCodeBench quite yet, and for SciCode we see it's still shy of o4-mini on high mode as well as Grok 4. So while GPT-5 doesn't quite outperform on every single benchmark, in aggregate it does outperform basically all other models, with really good general capabilities across the board.
In terms of LMArena, which is a place where you can choose preferred responses across a number of different tasks: for text responses, GPT-5 outperforms Gemini 2.5 Pro on preferred responses. For the WebDev Arena, GPT-5 outperforms Gemini 2.5 Pro, DeepSeek R1, as well as Claude 4 Opus. Now, in terms of ARC-AGI, GPT-5 doesn't quite outperform Grok 4: Grok 4 scores 66.7 whereas GPT-5 on high mode scores 65.7. But one interesting consideration is the cost per task: Grok 4 comes in at a dollar per task, whereas GPT-5 comes in at basically half that.

Now, in terms of the API, we have four different models: GPT-5, GPT-5 mini, GPT-5 nano, as well as GPT-5 chat. Quickly comparing them, GPT-5 is the flagship model, with GPT-5 mini and GPT-5 nano as the cheaper, faster options. All of these models are reasoning models, and they're all multimodal: we can pass in both text and images. The flagship model is $1.25 per million input tokens and $10 per million output tokens. The mini model is 25 cents per million input tokens and $2 per million output tokens. To put this into perspective, both Grok 4 and Claude 4 Sonnet (thinking) are $3 per million input tokens and $15 per million output tokens. And one thing that's interesting with GPT-5 is that it has the same pricing as Gemini 2.5 Pro with much better performance.

Now, in terms of the context window, it has a total context window of 400,000 tokens, so a much, much bigger context window across the board. For max output tokens, we have 128,000 across all of the different model sizes. As for some of the other features of the model, we have streaming, function calling, and structured outputs, as you might expect.
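As a quick sanity check on those per-token rates, here's a minimal Python sketch using the prices quoted above. The model names are just labels for this comparison, not necessarily the exact API identifiers:

```python
# Rough cost estimator using the per-million-token prices quoted above.
# Prices are USD per 1M tokens as (input, output).
PRICES = {
    "gpt-5":           (1.25, 10.00),
    "gpt-5-mini":      (0.25, 2.00),
    "grok-4":          (3.00, 15.00),
    "claude-4-sonnet": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 10k-token prompt with a 2k-token response.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000, 2_000):.4f}")
```

At these rates, that example call works out to about $0.0325 on GPT-5 versus $0.06 on Grok 4 or Claude 4 Sonnet, so GPT-5 is roughly half the price per call, before accounting for any differences in how many thinking tokens each model emits.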
Now, additionally, the flagship model supports predicted outputs, which can be particularly useful for things like code refactoring as well as editing text, so depending on the application, that can definitely be helpful.

Next up, the team over at Cognition, who build the popular coding agent Devin, have a benchmark they call the junior dev evals, and they show GPT-5 outperforming across the board on exploration, planning, as well as code execution when compared to Sonnet and GPT-4.1. Now, in terms of reaction, the CEO of Cursor gave it quite high praise, saying this is the best coding model they've used to date. There was also a quick demonstration on the livestream of it resolving a GitHub issue.

Now, in terms of accessing the models, both Windsurf and Cursor are making it available for free, so if you want to try it out in a coding context, you can go give both platforms a shot. GPT-5 is rolling out to all users; Plus subscribers are going to get more usage, and Pro subscribers are going to be able to access GPT-5 Pro, a version with extended reasoning for more comprehensive as well as more accurate answers. You can think of that as effectively akin to the high mode from the API.

But otherwise, that's pretty much it for this video. Kudos to the team over at OpenAI. And if you found this video useful, please comment, share, and subscribe. Otherwise, until the next one!