
https://scrimba.com/the-ai-engineer-path-c02v?via=developersdigest

OpenAI Unveils O3 and O3 Mini: Next-Gen Reasoning Models! In this video, we dive into OpenAI's major announcement of their next-generation reasoning models, o3 and o3-mini, made during their "12 Days of OpenAI" event. 🎉 These models are set for public safety testing today, with a full release expected by the end of January. Notable highlights include o3's impressive 71.7% accuracy on the SWE-bench Verified coding benchmark and state-of-the-art performance on mathematical and scientific benchmarks. o3-mini offers cost-effective, customizable reasoning with options for low, medium, and high reasoning effort. Watch the video for live demos and evaluations, and learn how these advancements could shape the future of coding and software development in 2025. 🚀 If you enjoyed this deep dive, don't forget to like, comment, share, and subscribe!

00:00 OpenAI Unveils O3 and O3 Mini
00:21 Introduction to O3 and O3 Mini
00:42 O3's Benchmark Performance
02:25 Epoch AI's FrontierMath Benchmark
03:06 ARC Prize Foundation Announcement
04:36 O3 Mini: Cost-Efficient Reasoning
05:29 Live Demo of O3 Mini
09:26 O3 Mini's Math and Latency Performance
10:26 API Features and Future Plans
11:28 Conclusion and Call to Action
---
type: transcript
date: 2024-12-20
youtube_id: duQukAv_lPY
---

# Transcript: OpenAI's o3 and o3-mini in 12 Minutes

OpenAI has just unveiled o3 as well as o3-mini. On the twelfth day of their holiday event, the "12 Days of OpenAI," they dropped the announcement for their next-generation reasoning models. The one thing to know about these models is that they're not available today; they're expected to arrive at the end of January. In this video I'll highlight a number of clips from the announcement. Let's get into it.

We're going to announce two models today: o3 and o3-mini. o3 is a very, very smart model. o3-mini is an incredible model too, with really good performance for the cost. To get the bad news out of the way first, we're not going to publicly launch these today. The good news is that we're going to make them available for public safety testing starting today.

o3 is a really strong model on very hard technical benchmarks, and I want to start with coding benchmarks. On software-style benchmarks we have SWE-bench Verified, a benchmark consisting of real-world software tasks. We're seeing that o3 performs at about 71.7% accuracy, which is over 20% better than our o1 models. This really signifies that we're climbing the frontier of utility. On competition code, o1 achieves an Elo of about 1891 on the contest coding site Codeforces; at our most aggressive high test-time compute settings, o3 is able to achieve almost a 2727 Elo.

It's not just programming but also mathematics. On competition math benchmarks, just like competitive programming, we achieve very strong scores: o3 gets about 96.7% accuracy on AIME, versus an o1 performance of 83.3%. There's another very tough benchmark called GPQA Diamond, which measures the model's performance on PhD-level science questions. Here we get another state-of-the-art number, 87.7%, which is about 10% better than our o1 performance of 78%. To put this in perspective, an expert PhD typically gets about 70% in their field of strength.

One thing you might notice from some of these benchmarks is that we're reaching or nearing saturation on a lot of them. The last year has really highlighted the need for harder benchmarks to accurately assess where our frontier models lie, and a couple have emerged as fairly promising over the last months. One in particular I want to call out is Epoch AI's FrontierMath benchmark. You can see the scores look a lot lower than on the previous benchmarks we showed, because this is considered the toughest mathematical benchmark out there today. It's a dataset of novel, unpublished, and very hard to extremely hard problems; it would take professional mathematicians hours or even days to solve one of them. Today, all offerings out there have less than 2% accuracy on this benchmark, and with o3 at aggressive test-time settings we're able to get over 25%.

Hello everybody, my name is Greg Kamradt, and I'm the president of the ARC Prize Foundation. ARC Prize is a nonprofit with the mission of being a North Star towards AGI through enduring benchmarks. Our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper "On the Measure of Intelligence."
It has been unbeaten for five years now, and in the AI world that feels like centuries. The system that beats ARC-AGI is going to be an important milestone towards general intelligence, so I'm excited to say today that we have a new state-of-the-art score to announce: o3 has scored 75.7% on ARC-AGI's semi-private holdout set. This is extremely impressive because it is within the compute requirements we have for our public leaderboard, and it is the new number-one entry on ARC-AGI-Pub, so congratulations on that. As a capabilities demonstration, when we asked o3 to think longer and ramped it up to high compute, it was able to score 87.5% on the same hidden holdout set. This is especially important because human performance is comparable, at an 85% threshold, so being above this is a major milestone; we have never tested a system or a model that has done this before.

I'm very happy to tell you more about o3-mini, a brand-new model in the o3 family that truly defines a new cost-efficient reasoning frontier. o3-mini will support three options, low, medium, and high reasoning effort, so users can freely adjust the thinking time based on their use cases. I'm happy to show the first set of evals for o3-mini. On the left-hand side we show the coding eval, Codeforces Elo, which measures how good a programmer is; higher is better. As you can see on the plot, with more thinking time o3-mini achieves increasing Elo, outperforming o1-mini across the board, and with medium thinking time it performs even better than o1.

Now I'd like to do a live demo, and hopefully test out the low, medium, and high thinking levels of the model. I'm testing o3-mini high first, and the task is asking the model to use Python to implement a code generator and executor. When I run the resulting Python script, it launches a server locally with a UI containing a text box. We can make a coding request in the text box; it sends the request to the o3-mini API, which solves the task and returns a piece of code; the script then saves the code locally on my desktop and opens a terminal to execute it automatically. It's a fairly complicated request, and the model returns a big chunk of code. We copy the code, paste it into our server file, and launch the server, and we should get a text box. It seems to be launching... great, we have a UI where we can enter coding prompts. Let's try a simple one: print "OpenAI" and a random number. It sends the request to o3-mini medium, so it should be pretty fast. There it is in the terminal: it saves the generated code to a local script on the desktop and prints out "OpenAI 41." Are there any other tasks you want to test?
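The script in this demo was generated live by the model and wasn't published, so what follows is only a minimal sketch of the architecture described above: a small local web app whose text box forwards each prompt to the o3-mini API, saves the returned code to disk, and executes it. Everything here is an assumption from the video, including the Flask framework, the `o3-mini` model name, the `reasoning_effort` parameter, and the file path.

```python
# Hypothetical reconstruction of the "code generator and executor" demo.
# A local page with a text box sends each prompt to the o3-mini API,
# saves the returned code to the desktop, and executes it automatically.
import re
import subprocess
from pathlib import Path

from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
SCRIPT_PATH = Path.home() / "Desktop" / "generated_script.py"  # assumed path

PAGE = """
<form method="post">
  <input name="prompt" size="80" placeholder="Describe the code you want">
  <button type="submit">Generate and run</button>
</form>
<pre>{output}</pre>
"""

def extract_code(reply: str) -> str:
    """Pull the first fenced code block out of the model's reply."""
    match = re.search(r"`{3}(?:python)?\n(.*?)`{3}", reply, re.DOTALL)
    return match.group(1) if match else reply

@app.route("/", methods=["GET", "POST"])
def index():
    output = ""
    if request.method == "POST":
        response = client.chat.completions.create(
            model="o3-mini",            # assumed model name from the video
            reasoning_effort="medium",  # the demo's requests used medium
            messages=[{"role": "user", "content": request.form["prompt"]}],
        )
        code = extract_code(response.choices[0].message.content)
        SCRIPT_PATH.write_text(code)  # save the generated code locally...
        result = subprocess.run(      # ...then execute it, as in the demo
            ["python", str(SCRIPT_PATH)],
            capture_output=True, text=True, timeout=60,
        )
        output = result.stdout + result.stderr
    return PAGE.format(output=output)

if __name__ == "__main__":
    app.run(port=8000)  # then open http://localhost:8000 in a browser
```

A real version of this would want sandboxing (a container, resource limits) before executing anything the model returns.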
"I wonder if you can get it to evaluate its own GPQA numbers." That's a great ask, and just what I expected. So let me copy the code and send it through the UI. In this task we ask the model to evaluate o3-mini with low reasoning effort on the hard GPQA dataset. The model needs to first download the raw file from a URL, then figure out which part is the question, which parts are the options, and which part is the answer; then formulate all the questions, ask the model to answer them, parse the results, and grade them. It's actually blazingly fast, because it's calling o3-mini with low reasoning effort. Let's see how it goes; a couple of the tests are really hard, and GPQA is a hard dataset, maybe 196 problems ranging from easy to hard. And it returns the results: 61.62% with the low-reasoning-effort model, and pretty fast, a full evaluation in about a minute. It's somehow very cool to just ask a model to evaluate itself. To summarize what we just did: we asked the model to write a script to evaluate itself on the hard GPQA set, from a UI, from a code generator and executor created by the model itself in the first place. Next year we'll bring you back on and ask the model to improve it.

Besides Codeforces and GPQA, the model is also pretty good at math. On this plot, on the AIME 2024 dataset, o3-mini low achieves comparable performance to o1-mini, and o3-mini medium achieves comparable or better performance than o1; the solid bars are pass@1, and we can push the performance even further with o3-mini high. On the right-hand plot, measuring latency on anonymized o1-preview traffic, o3-mini low drastically reduces the latency of o1-mini, almost achieving comparable latency to GPT-4o, under a second, so an almost instant response, and o3-mini medium has about half the latency of o1.

Here's another set of evals I'm even more excited to show you: API features. We get a lot of requests from our developer community to support function calling, structured outputs, and developer messages in our mini-series models, and o3-mini will support all of these features, same as o1. Notably, it achieves comparable or better performance than o1 on most of the evals, providing a more cost-effective solution for our developers. And if we unveil the true GPQA Diamond performance that I ran a couple of days ago, it also comes in around 62%, so the model basically evaluated itself accurately; next time we should just ask the model to run the whole evaluation automatically. With that, that's it for o3-mini, and I hope our users have a much better experience than with o1 next year. Fantastic work, thank you.
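The self-evaluation script was likewise written on the spot by the model, so here is only a rough sketch of the shape of that task, assuming the Chat Completions API, the same assumed `o3-mini` model name, and a made-up CSV layout and placeholder URL standing in for the raw GPQA file from the demo.

```python
# Rough sketch of the self-evaluation demo: download a raw question
# file, ask o3-mini (low reasoning effort) each question, grade it.
# The URL and column names below are placeholders; the real file and
# its format were not shown in the video.
import csv
import io
import urllib.request

from openai import OpenAI

client = OpenAI()
DATASET_URL = "https://example.com/gpqa_raw.csv"  # placeholder URL

def load_questions(url: str) -> list[dict]:
    """Fetch the raw CSV and parse it into one dict per question."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

def ask(question: str, options: list[str]) -> str:
    """Pose one multiple-choice question and return the model's letter."""
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    response = client.chat.completions.create(
        model="o3-mini",         # assumed model name
        reasoning_effort="low",  # the demo used low effort for speed
        messages=[{
            "role": "user",
            "content": f"{question}\n{labeled}\nAnswer with a single letter.",
        }],
    )
    return response.choices[0].message.content.strip()[:1].upper()

rows = load_questions(DATASET_URL)
correct = sum(
    ask(row["question"], [row["A"], row["B"], row["C"], row["D"]]) == row["answer"]
    for row in rows
)
print(f"Accuracy: {correct / len(rows):.2%}")  # the demo reported 61.62%
```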
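The announcement also promises function calling, structured outputs, and developer messages for o3-mini with the same API surface as o1. As a hedged illustration of what a structured-outputs request with a developer message and an adjustable reasoning effort could look like in the OpenAI Python SDK (the schema and prompt are invented for the example, and the model name again assumes the announced `o3-mini`):

```python
# Illustration of the API features mentioned above: a developer
# message, structured outputs, and an adjustable reasoning effort.
# The JSON schema is invented for this example.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",  # "low", "medium", or "high"
    messages=[
        # "developer" is the o-series replacement for the "system" role
        {"role": "developer", "content": "You are a concise math tutor."},
        {"role": "user", "content": "What is 12 * 34?"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "result": {"type": "integer"},
                    "explanation": {"type": "string"},
                },
                "required": ["result", "explanation"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # e.g. {"result": 408, ...}
```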
To wrap this up: for o3-mini and o3, please apply for safety testing if you'd like to help us test these models as an additional step. We plan to launch o3-mini around the end of January and full o3 shortly after that, and the more people who help us safety test, the more we can make sure we hit that date. So please check it out, and thanks for following along with us; it's been a lot of fun for us, and we hope you've enjoyed it too. Merry Christmas!

Let me know what your thoughts are on this announcement. How do you think this will affect coding and software development in 2025? Otherwise, if you found this video useful, please like, comment, share, and subscribe, and until the next one!