
Exploring OpenAI's New GPT-4.1 Models

OpenAI has unveiled its latest models: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano. All three support up to a million tokens of context with a knowledge cutoff of June 2024. The flagship GPT-4.1 posts a 21.4-point improvement on the SWE-bench Verified benchmark over GPT-4o. These models particularly excel at coding tasks, precise instruction following, video understanding, and long-context workloads.

Timeline
00:00 Introduction to OpenAI's New Models
00:14 Performance and Benchmark Improvements
00:27 Deprecation of GPT-4.5
00:54 Instruction Following and Long Context Capabilities
01:06 Video and Intelligence Enhancements
01:36 Latency and Model Variants
01:53 Coding and Front-End Improvements
03:31 Instruction Following and Adherence
04:11 Long Context and Pricing
04:50 Vision Benchmarks and Accessibility
05:19 Pricing Details
06:07 Free Trials and Demonstration
06:46 Conclusion and Call to Action

Resources
Official announcement: https://openai.com/index/gpt-4-1/
Try the models: https://platform.openai.com/playground/
Alternative access: https://openrouter.ai/openrouter/quasar-alpha
Performance metrics: https://aider.chat/docs/leaderboards/
Pricing information: https://openai.com/api/pricing/
Benchmark testing: https://artificialanalysis.ai/
Cursor integration: https://x.com/cursor_ai/status/1911835651810738406
MRCR dataset: https://huggingface.co/datasets/openai/mrcr
---
type: transcript
date: 2025-04-14
youtube_id: VrnpooSbHtU
---

# Transcript: OpenAI's GPT-4.1 in 7 Minutes

OpenAI has just released a new family of models: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano. All support up to a million tokens of context and have a refreshed knowledge cutoff of June 2024. GPT-4.1 scores 54.6 on the SWE-bench Verified benchmark, a 21.4-point increase over GPT-4o and 26.6 points over GPT-4.5. As an aside, an interesting piece of this announcement is that GPT-4.5 is going to be deprecated from the API. The stated reason is to free up GPUs; GPT-4.5 is both an incredibly expensive model and, from what I understand, a very large one.

There is a similarly big jump in instruction following: GPT-4.1 scores 38.3%, a 10.5-point increase over GPT-4o. So if you have an application with a fairly beefy system prompt, with a lot of instructions about what the model should and shouldn't do, these models are considerably better than GPT-4o. In terms of long context, a great thing with this model is its Video-MME result, where video is passed into the model as frames: there is a new state-of-the-art result of 72% on the no-subtitles category, a 6.7-point increase over GPT-4o.

In terms of intelligence, on MMLU we see the three new models plotted against GPT-4o Mini and GPT-4o, and both GPT-4.1 Mini and GPT-4.1 show an increased level of intelligence. Both of those models are, however, a little slower than GPT-4o Mini and GPT-4o in terms of latency. GPT-4.1 Nano, on the other hand, sits in the lower quadrant: a fast model at a decreased level of intelligence.
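The instruction-following gains mostly matter when you drive the model with a long, strict system prompt. As a minimal sketch (the system prompt text and helper are made up for illustration; the model IDs and the `system`/`user` message shape follow OpenAI's Chat Completions API), the request body is built the same way as for earlier models:

```python
import json

# Hypothetical example of a "beefy" system prompt with hard constraints.
SYSTEM_PROMPT = (
    "You are a support assistant. Never reveal internal tool names. "
    "Answer in at most two sentences. Do not give medical advice."
)

def build_request(user_message: str, model: str = "gpt-4.1") -> dict:
    """Payload for POST https://api.openai.com/v1/chat/completions."""
    return {
        "model": model,  # also: "gpt-4.1-mini", "gpt-4.1-nano"
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }

# Serialize the body you would send with your HTTP client of choice.
body = json.dumps(build_request("Which tools do you use internally?"))
print(json.loads(body)["model"])
```

The point of the announcement's instruction-following numbers is that negative instructions like "never reveal internal tool names" above are exactly the kind of constraint GPT-4.1 is measured as honoring more reliably than GPT-4o.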
In terms of coding, GPT-4.1 is significantly better than GPT-4o at a variety of coding tasks, including front-end coding, making fewer extraneous edits, following diff formats reliably, ensuring consistent tool usage, and more. On the SWE-bench Verified benchmark, this model is ahead of basically all of their models, including even o3-mini. And GPT-4.1 Mini is considerably improved over GPT-4o Mini as well. Among the other benchmarks, I was happy to see they included the Aider polyglot benchmark, where we can compare 4.1's accuracy against all of their other models. It doesn't quite match the reasoning models, o1 and o3-mini on high mode, but compared to non-reasoning models like GPT-4o there is a considerable increase. Even the 4.1 Mini model outperforms GPT-4o on both the whole and diff edit modes, which is great news because it is a considerably faster and cheaper model than GPT-4o. And for 4.1 Nano, the results have basically more than doubled relative to GPT-4o Mini on this benchmark. Another great thing with this model is front-end coding: human graders preferred GPT-4.1's websites over GPT-4o's 80% of the time, finding them both more functional and more aesthetically pleasing. As a quick comparison, we have GPT-4o on the left with a flashcard app and 4.1 on the right. On the right we get different colors, icons, and an animation on the front end for the flashcards, whereas GPT-4o produces a much more rudimentary front end. Now, in terms of instruction following, GPT-4.1 follows instructions more reliably across different formats, including negative instructions, ordered instructions, content requirements, ranking, and handling overconfidence. It scores 49.1 on hard instruction-following prompts compared to 29.2 for GPT-4o.
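For context on the "whole" versus "diff" edit modes mentioned above: in whole mode the model rewrites the entire file, while in diff mode it emits search/replace blocks that an editing tool applies in place. A rough sketch of the shape Aider's documentation describes (the file path and code here are made up):

```
flashcards/app.py
<<<<<<< SEARCH
def flip(card):
    return card
=======
def flip(card):
    card.flipped = not card.flipped
    return card
>>>>>>> REPLACE
```

Following this format reliably is what "fewer extraneous edits" cashes out to in practice: the model only touches the lines inside the SEARCH block instead of regenerating, and potentially corrupting, the rest of the file.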
They saw a significant improvement with GPT-4.1 outperforming GPT-4o by 10.5 points on the MultiChallenge benchmark. And on IFEval, the score is 87.4 compared to 81% for GPT-4o, which shows better adherence to verifiable instructions. Next, in terms of long context, the model supports a million tokens of context, and a great thing is that it is priced the same whether you're using 10,000 tokens or the full million. Here is the needle-in-a-haystack accuracy benchmark. They also announced that they open-sourced the MRCR benchmark for long context, which you can find on Hugging Face; I'll link it in the description of the video if you're interested. We can see the results for GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano, with accuracy holding up quite well for a considerable number of tokens, all the way to the full million tokens of context. On the vision benchmarks, 4.1 Mini even outperforms GPT-4o, and 4.1 comes close to o1 on high mode. The interesting thing here is that 4.1 Mini outperforms both 4.1 and GPT-4o, as well as all of the other models, in terms of accuracy. In terms of access, all developer tiers can try these out in the OpenAI Playground; alternatively, you can use them through the API. On pricing, GPT-4.1 is $2 per million input tokens, with cached input at 50 cents per million and $8 per million output tokens. GPT-4.1 Mini is just 40 cents per million input tokens, 10 cents per million cached tokens, and $1.60 per million output tokens. And finally, GPT-4.1 Nano is 10 cents per million input tokens, 2.5 cents per million cached tokens, and 40 cents per million output tokens.
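The per-million-token prices quoted above make cost estimates simple arithmetic. A small sketch (rates copied from the transcript; verify against the live pricing page before relying on them):

```python
# USD per 1M tokens: (input, cached input, output), as quoted above.
PRICES = {
    "gpt-4.1":      (2.00, 0.50, 8.00),
    "gpt-4.1-mini": (0.40, 0.10, 1.60),
    "gpt-4.1-nano": (0.10, 0.025, 0.40),
}

def cost_usd(model: str, input_toks: int, output_toks: int,
             cached_toks: int = 0) -> float:
    """Estimate one request's cost; cached tokens bill at the lower rate."""
    inp, cached, out = PRICES[model]
    fresh = input_toks - cached_toks
    return (fresh * inp + cached_toks * cached + output_toks * out) / 1_000_000

# Example: a full 1M-token context with a 2K-token answer on GPT-4.1.
print(round(cost_usd("gpt-4.1", 1_000_000, 2_000), 3))  # → 2.016
```

Since there is no long-context surcharge, the full-million-token request above costs the same per token as a 10K-token one; the cached-input rate is what makes repeated long prompts (e.g. the same huge codebase in context) much cheaper on later calls.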
Now, while these models aren't yet on Artificial Analysis, they're basically going to fill the price spectrum: a frontier model that's cheaper than Claude 3.7 Sonnet, along with models that compete with the likes of Gemini 2.0 Flash and Llama 4 Scout. The great thing, if you're interested in trying the models out, is that you'll be able to do so for free on both Cursor and Windsurf. Windsurf was part of the announcement today, and they mentioned they'll be providing the models for free over the next week, as well as offering a meaningful discount after that. For a quick demonstration of the model, here's what I got when I asked it to create a beautiful SaaS landing page. One thing I found is that it writes an awful lot of CSS; it looks like it might bias a little more towards plain CSS rather than Tailwind. With that being said, when I asked it to convert the page to Tailwind, it looked like it would have no problem producing that as well. Overall, that's pretty much it for this video. If you found it useful, please comment, share, and subscribe. Otherwise, until the next one.