The first day of the 12 Days of OpenAI is here, and we got our first announcements, and it is huge: o1 is ready. Let's watch the video, I'll react to it, let's get into it. All right, here we go. We have some members of the OpenAI team; really, I think the only one I recognize is obviously Sam Altman. Let's start watching. Hello, welcome to the 12 Days of OpenAI. We're going to try something that, as far as we know, no tech company has done before, which is every day for the next 12
every weekday we are going to launch or demo some new thing that we built, and we think we've got some great stuff for you starting today. We hope you'll really love it. Let's pause for a second right away: look at that sticker, the strawberry. I love it. What a good sign for things to come for these next 12 days. You know, we'll try to make this fun and fast and not take too long, but it'll be a way to show you what we've been working on and a little holiday present from us. So we'll jump
right into this first day. Today we actually have two things to launch. The first one is the full version of o1. We have been very hard at work, we've listened to your feedback: you like o1-preview, but you want it to be smarter and faster and be multimodal and be better at instruction following, a bunch of other things. So we've put a lot of work into this, and for scientists, engineers, coders, we think they will really love this new model. I'd like to show you quickly how it performs. All
right, so there it is, there's the big news: we have the full version of o1 ready to go. And of course, yeah, you all want it faster and better and smarter, and of course we all want these things, but it's funny how he kind of lists that off like it's no big deal. But no, they're delivering, so let's take a look. They're about to show us how o1, o1-preview, and GPT-4o compare on three major high-end benchmarks: math, coding, and PhD-level science.
So let's hear what he has to say. So you can see the jump from GPT-4o to o1-preview across math, competition coding, GPQA Diamond, and you can see that o1 is a pretty big step forward. It's also much better in a lot of other ways, but raw intelligence is something that we care about. Coding performance in particular is an area where people are using the model a lot. So in just a minute, these guys will demo some things about o1. They'll show you how it does at speed, how it
does at really hard problems, how it does with multimodality. But first I want to talk just for a minute about the second thing we're launching. All right, before I let him get to the second thing that they're launching, yeah, I mean, look at this massive jump in math. On the AIME 2024 competition math benchmark, we have 13.4 for GPT-4o, all the way up to 56.7 for o1-preview, and another massive jump up to 83.3 for o1. So I honestly was not expecting o1 to be such a big leap from o1-preview, but it
really is. Now, the only benchmark where it didn't improve, and actually went down, is from o1-preview to o1 on the PhD-level science questions. It actually went down a little bit, and in fact this is the smallest jump from GPT-4o to the o1 series of models, so it's pretty interesting. However, an expert human still does not score as well as these o1 models, so pretty interesting stuff. It looks like we're getting just another massive leap in the thinking models. Let's keep watching. A lot of people, power users of ChatGPT at this point,
they really use it a lot, and they want more compute than $20 a month can buy. So we're launching a new tier, ChatGPT Pro, and Pro has unlimited access to our models and also things like Advanced Voice Mode. It also has a new thing called o1 pro mode. So o1 is the smartest model in the world now, except for o1 being used in pro mode, and for the hardest problems that people have, o1 pro mode lets you do even a little bit better. So you can see at competition math,
you can see at GPQA Diamond, and these boosts may look small, but in complex workflows where you're really pushing the limits of these models, it's pretty significant. All right, so let me pause for a second, and I think I understand what's going on. So first, they're launching a new subscription tier, and we've heard rumors of this from a few months back. I reported on it, and it looks like it has come to fruition. They have a $20-a-month plan and now they have a $200-a-month plan. That is a 10 times increase
in the price, but what you get for that is unlimited everything. And not only that, they are essentially giving you more compute at test time to get these increased scores in o1 pro mode, which is what we're seeing here. So that's basically, from my understanding, how it's working: they're essentially allowing the model to think more. They're giving you more compute, and of course they're charging you a lot more, so they should give you more compute. But it's super impressive and actually speaks volumes about how scalable test-time compute really is and how we
now have this new lever of scalability on the post-training side of a model, and I love it. That's super exciting for me. So let's keep watching. One thing that people really have said they want is reliability, and here you can see how the reliability of an answer from pro mode compares to o1, and this is an even stronger delta. And again, for our Pro users, we've heard a lot about how much people want this. So I find this really interesting, and this is probably something I need to do a little bit more research on myself, which
is, when you're running one of these models against a benchmark and you get a score, I feel like the reliability should be baked into the benchmark itself, not having to run the same question over and over again to see what the reliability is. I always thought the reliability was built into the benchmark, which is odd, and so I need to do a little bit more digging into how these benchmarks actually operate. But what we are seeing here is that when you run it more and more times, the reliability is much higher with o1 pro mode.
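A rough sketch of why repeated runs can tell a different story from a single benchmark score. It compares plain average accuracy with a stricter "correct on every attempt" measure. The data, the function names, and the all-attempts-correct criterion are illustrative assumptions, not details stated in the video.

```python
from typing import Dict, List

def average_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all individual attempts that were correct."""
    attempts = [ok for runs in results.values() for ok in runs]
    return sum(attempts) / len(attempts)

def strict_reliability(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions answered correctly on every attempt."""
    return sum(all(runs) for runs in results.values()) / len(results)

# Hypothetical outcomes: four questions, four attempts each.
results = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],
    "q3": [True, True, True, True],
    "q4": [False, False, True, False],
}

print(average_accuracy(results))    # 0.75
print(strict_reliability(results))  # 0.5
```

In this made-up data the model is right on 75% of attempts but gets only half of the questions right every single time, which is why a reliability chart can show a larger gap between two models than an accuracy chart does.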
So that extra compute during test time actually does make a difference for reliability. ChatGPT Pro is $200 a month, launches today. Over the course of these 12 days, we have some other things to add to it that we think you'll also really love, but unlimited model use and this new o1 pro mode. So should I spend $200 a month for o1 pro mode? Are you going to? Let me know in the comments. I want to jump right in and we'll show some of those demos that we talked about,
and these are some of the guys that helped build o1, with many other people behind them on the team. Thanks, Sam. Hi, I'm Hyung Won, I'm Jason, and I'm Max. We're research scientists who worked on building o1. o1 is being rolled out today to all Plus and soon-to-be Pro subscribers on ChatGPT, replacing o1-preview. The o1 model is faster and smarter than the o1-preview model, which we launched in September. After the launch, many people asked about multimodal input, so we added that. So now the o1 model
live today is able to reason through both images and text jointly. Okay, they kind of just slipped that in under the radar, almost like it was not a big deal. Now o1 is multimodal. That is one of the biggest upgrades that everybody has been waiting for, so really cool to see. As Sam mentioned, today we're also going to launch a new tier of ChatGPT called ChatGPT Pro. ChatGPT Pro offers unlimited access to our best models like o1, 4o, and Advanced Voice. ChatGPT Pro also has a special way of using o1 called o1
pro mode. With o1 pro mode, you can ask the model to use even more compute to think even harder on some of the most difficult problems. Yeah, so that's exactly what I thought it was going to be. I'm glad he confirmed it. Essentially, you're just saying, give me more compute. And I wonder what the limits of that are. Is it limited? I wonder how much they've tested. What if you let it think for a day, a week, a month? Very interesting to think about. We think the audience for ChatGPT Pro will be the power
users of ChatGPT, those who are already pushing the models to the limits of their capabilities on tasks like math, programming, and writing. It's been amazing to see how much people are pushing o1-preview, how much people who do technical work all day get out of this, and we're really excited to let them push it further. Yeah, sure. We also really think that o1 will be much better for everyday use cases, not necessarily just really hard math and programming problems. In particular, one piece of feedback we received about o1-preview constantly was that
it was way too slow. It would think for 10 seconds if you said hi to it, and we fixed that. That was really annoying. It was kind of funny, honestly. It really cared, really thought hard about saying hi back. Yeah, and so we fixed that. o1 will now think much more intelligently: if you ask it a simple question, it'll respond really quickly, and if you ask it a really hard question, it'll think for a really long time. We ran a pretty detailed suite of human evaluations for this model, and what we found
was that it made major mistakes about 34% less often than o1-preview while thinking fully about 50% faster, and we think this will be a really, really noticeable difference for all of you. So I really enjoy just talking to these models. I'm a big history buff, and I'll show you a really quick demo of, for example, the sort of question that I might ask one of these models. All right, so that's a really cool update. And again, I don't use o1 for many use cases; only for the most complex use cases, where I'm really trying to
think through hard problems, math problems, logic problems, do I go to o1. But if they really make it faster and more efficient to use, and smarter in terms of how much thinking it puts towards some of these easier problems, then of course I'm going to go to o1 more often. And now, thinking about it, that $200-a-month price point, maybe it makes sense. So let's keep watching. So right here, on the left I have o1, on the right I have o1-preview, and I'm just asking it a really simple history
question: list the Roman emperors of the second century, tell me about their dates, what they did. Not hard, but, you know, GPT-4o actually gets this wrong a reasonable fraction of the time. That's surprising to hear. This seems like something GPT-4o would get right instantly and often. And so I've asked o1 this, I've asked o1-preview this. I tested this offline a few times and I found that o1 on average responded about 60% faster than o1-preview. This could be a little bit variable because right now we're in the process of swapping all
our GPUs from o1-preview to o1. So actually o1 thought for about 14 seconds; o1-preview is still going. There are a lot of Roman emperors. Yeah, 4o actually gets this wrong a lot of the time. There are a lot of folks who ruled for like 6 days, 12 days, a month, and it sometimes forgets those. Can you do them all from memory, including the six-day people? No. Yep, so here we go: o1 thought for about 14 seconds, o1-preview thought for about 33 seconds. These should
both be faster once we finish deploying, but we wanted this to go live right now. Exactly. So yeah, we think you'll really enjoy talking to this model. We found that it gave great responses, it thought much faster, it should just be a much better user experience for everyone. So one other feature we know that people really wanted for everyday use cases, that we've had requested a lot, is multimodal inputs and image understanding, and Hyung Won is going to talk about that now. This is one that I've been really looking forward to. Without having
multimodal understanding, the o1 model can only go so far for my personal use cases. So let's keep watching. I created this toy problem with some hand-drawn diagrams and so on. So here it is. It's hard to see, so I already took a photo of this, and so let's look at this photo on a laptop. So once you upload the image into ChatGPT, you can click on it to see the zoomed-in version. So this is a system of a data center in space. So maybe in the future we
might want to train AI models in space. I think we should do that, but the power number looks a little low. One gigawatt? Okay, but the general idea... Rookie numbers, these are rookie numbers. Okay, yeah. So we have a sun right here, taking in power on this solar panel, and then there's a small data center here. It's exactly what they look like. Yeah, a GPU rack, and then a pump, nice pump here. And one interesting thing about operation in space is that on Earth we can do air cooling, water cooling
to cool down the GPUs, but in space there's nothing there, so we have to radiate this heat into deep space, and that's why we need this giant radiator cooling panel. And this problem is about finding the lower-bound estimate of the cooling panel area required to operate this one-gigawatt data center. Probably going to be very big. Yep. Let's see how big it is. Let's see. So that's the problem. I'm going to this prompt, and yeah, this is essentially asking for that. So this is awesome, this is so cool,
just a hand-drawn, very complex problem, not much information in the actual schematic, but then they ask very specific questions over text, and let's see how it does. Let me hit go, and the model will think for seconds. By the way, most people don't know, I've been working with Hyung Won for a long time. Hyung Won actually has a PhD in thermodynamics, which is totally unrelated to AI, and you always joke that you haven't been able to use your PhD work in your job until today. So you can trust Hyung Won on this analysis. Finally, finally.
Thanks for hyping it up. Now I really have to get this right. Okay, so the model finished thinking, only 10 seconds. It's a simple problem. So let's see how the model did. So power input: first of all, this one gigawatt, that was only drawn on the paper, so the model was able to pick that up nicely. And then radiative heat transfer only, that's the thing I mentioned, so in space, nothing else. And then some simplifying choices. And one critical thing is that I intentionally made this problem underspecified,
meaning that the critical parameter is the temperature of the cooling panel. I left it out so that we can test the model's ability to handle ambiguity and so on. So the model was able to recognize that this is actually an unspecified but important parameter, and it actually picked the right range of temperature, which is about room temperature, and with that it continues to the analysis and does a whole bunch of things, and then... Wow, yeah, look at this simple problem. Oh my goodness, this is amazing.
It found out the area, which is 2.42 million square meters. Just to get a sense of how big this is, this is about 2% of the land area of San Francisco. This is huge. Not that bad, not that bad. Yeah. Oh, okay. Yeah, so I guess this is reasonable. I'll skip through the rest of the details, but I think the model did a great job making nice, consistent assumptions that, you know, make the required area as little as possible. And so, yeah, this is the demonstration of the multimodal reasoning.
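A back-of-the-envelope check of that figure, as a minimal sketch using the Stefan-Boltzmann law for radiative heat rejection, P = emissivity * sigma * A * T^4. The video only says the model assumed radiative transfer and a panel at about room temperature. The specific values here (300 K, emissivity 0.9, a panel radiating from one side, all 1 GW rejected as heat, no absorbed sunlight or background radiation) are assumptions chosen for illustration, and the San Francisco land area is approximate.

```python
SIGMA = 5.670374419e-8        # Stefan-Boltzmann constant, W / (m^2 K^4)
SF_LAND_AREA_M2 = 121.4e6     # approximate land area of San Francisco, m^2

def radiator_area(power_w: float, temp_k: float,
                  emissivity: float = 0.9, sides: int = 1) -> float:
    """Panel area needed to radiate power_w at temperature temp_k.

    Ignores absorbed sunlight and the cold-sky background, so the
    result is a lower-bound style estimate under the stated assumptions.
    """
    flux = emissivity * SIGMA * temp_k ** 4   # W radiated per m^2, per side
    return power_w / (sides * flux)

area = radiator_area(power_w=1e9, temp_k=300.0)   # assumed: 1 GW, 300 K
share = area / SF_LAND_AREA_M2

print(f"{area / 1e6:.2f} million m^2")            # 2.42 million m^2
print(f"{share:.1%} of San Francisco's land area")
```

With these assumed inputs the estimate lands close to the 2.42 million square meters quoted in the demo. Because the area scales with 1 / T^4, a slightly warmer or cooler panel changes the answer a lot, which is why the unstated temperature is the critical parameter.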
And this is a simple problem, but o1 is actually very strong, and on standard benchmarks like MMMU and MathVista, o1 actually has state-of-the-art performance. Now Jason will showcase the pro mode. Great. So I want to give a short demo of ChatGPT o1 pro mode. People will find o1 pro mode the most useful for, say, hard math, science, or programming problems. So here I have a pretty challenging chemistry problem that o1-preview usually gets incorrect, and so I will let the model start thinking. One thing we've learned with these
models is that for these very challenging problems, the model can think for up to a few minutes. I think for this problem the model usually thinks anywhere from one minute to up to three minutes, and so we have to provide some entertainment for people while the model is thinking. So I'll describe the problem a little bit, and then if the model's still thinking when I'm done, I've prepared a dad joke for us to fill the rest of the time. So I hope it thinks for a long time. You can see
the problem asks for a protein that fits a very specific set of criteria. So there are six criteria, and the challenge is... So one cool thing, while he's explaining, I'll just pause for a second: I haven't seen this UI element yet, where it's kind of a progress bar, and so it has predetermined, I guess, how long it's going to take, and it's going through it. Maybe I've missed that, but I haven't personally seen that progress bar, this UI element, yet. Let me know if you have. Each of them asks for pretty chemistry domain
specific knowledge that the model would have to recall. And the other thing to know about this problem is that none of these criteria actually give away what the correct answer is. So for any given criterion, there could be dozens of proteins that might fit that criterion, and so the model has to think through all the candidates and then check if they fit all the criteria. So is this a dig at Google and their AlphaFold project? Is this saying, like, hey, we can do that now with the o1 model? I'm not sure. This
is kind of out of my area of expertise, but yeah, let me know. Okay, so you could see the model actually was faster this time, so it finished in 53 seconds. You can click and see some of the thought process that the model went through to get the answer. You could see it's thinking about different candidates, like neuroligin initially, and then it arrives at the correct answer, which is retinoschisin, which is great. Okay, so to summarize: we saw from Max that o1 is smarter and
faster than o1-preview. We saw from Hyung Won that o1 can now reason over both text and images. And then finally, we saw that with ChatGPT pro mode, you can use o1 to reason about the hardest science and math problems. Yep, there's more to come. For the ChatGPT Pro tier, we're working on even more compute-intensive tasks to power longer and bigger tasks for those who want to push the model even further. And we're still working on adding tools to the o1
model, such as web browsing, file uploads, and things like that. That's going to be huge. That is the other piece I needed: multimodal, and I also needed to have tools. Just web browsing alone would be a huge, huge gain, so very excited about that. We're also hard at work to bring o1 to the API. We're going to be adding some new features for developers: structured outputs, function calling, developer messages, and API image understanding, which we think you'll really enjoy. We expect this to be a great model for developers and really unlock a
whole new frontier of agentic things you guys can build. We hope you love it as much as we do. That was great. Thank you guys so much. Congratulations to you and the team on getting this done. We really hope that you'll enjoy o1 and pro mode, or Pro tier. We have a lot more stuff to come. Tomorrow we'll be back with something great for developers, and we'll keep going from there. Before we wrap up, can we hear your joke? Yes. So I made this joke
this morning. The joke is this: Santa was trying to get his large language model to do a math problem, and he was prompting it really hard, but it wasn't working. How did he eventually fix it? No idea. He used reindeer-forcement learning. [Laughter] Thank you very much. Thank you. All right, yeah, that was a dad joke, to say the least. So what do you think is going to come over the next 11 days? A few things that I personally would be very excited about: any kind of preview of their browser, or kind
of more of an agentic system that can actually browse the web, I would love to see. I know Operator has been rumored to be coming soon, so I really want to see something from OpenAI where agents can actually take action on your behalf. I'd love to see an update to DALL-E, that would be amazing. Obviously a Sora release would be amazing, but I don't really think that's going to be coming. Maybe it will, given it was leaked like a week ago. But those are the things I'm most excited about. Let me know in the
comments: what are you most excited about? What do you want to see over the next 11 days? Should I keep covering all of these videos? Let me know. If you enjoyed this video, please consider giving a like and subscribe, and I'll see you in the next one.