Alright, GPT-4.1. Three new models just appeared: GPT-4.1, 4.1 mini, and 4.1 nano. These are mainly coding-focused AI assistants. Previously, if you wanted to create a flash card app from just one text prompt, well, this is okay, it kinda works: you can create and review your flash cards. However, if you look at the new one, the bones are very similar, but the usability is on a completely different level.
This went from good to great in just one release. Loving it! Note that this video comes in two parts, this is part one, and part two will be where the real fun begins.
So, these new models form a new Pareto frontier, which roughly means that you can choose how much speed you are willing to sacrifice to get more intelligence. If you are typing something and you want the AI to autocomplete your text, it does not need to be as smart as Einstein to do that, but it needs to be super fast, so you probably want nano for that. But for most things, like the flash card app, you will probably want to invoke the regular 4.1. And it gets better. On coding tasks, 4.1 can outperform the previous 4.5, I know, and surprisingly, 4.1 even outperforms those much slower, thinking AIs on coding benchmarks.
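The Pareto-frontier idea can be sketched in a few lines. This is a toy illustration, not real data: the model names, speed numbers, and capability scores below are made-up assumptions, and the `pick_model` helper is hypothetical. It just selects the fastest model that clears a quality bar, which is exactly the trade-off a point on the frontier represents.

```python
# Hypothetical speed/capability figures, purely for illustration.
MODELS = [
    # (name, relative speed, relative capability)
    ("nano", 10.0, 0.6),
    ("mini",  4.0, 0.8),
    ("full",  1.0, 1.0),
]

def pick_model(min_capability: float) -> str:
    """Among models that clear the capability bar, return the fastest one."""
    candidates = [m for m in MODELS if m[2] >= min_capability]
    return max(candidates, key=lambda m: m[1])[0]
```

So autocomplete, which tolerates a lower capability bar, gets the fast nano model, while the flash card app, which needs more intelligence, gets the full model.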
Also, wow. The context window is now 1 million tokens long, you can put heaps and heaps of textbooks in there, thousands of pages in total and ask questions about them, and then, what happens? This is called the needle in a haystack test, and OpenAI says that it recalls all of it very well.
However, when looking for 8 needles in the haystack, the accuracy decreases considerably. Respect to OpenAI for showing a little weakness there too, in the name of scholarly integrity. Independent tests also seem to verify that, and here, Google DeepMind's Gemini 2.5 Pro reigns supreme. I'd love to see some more rigorous testing on this to get to the bottom of it. This is important, because even if you don't put in many textbooks, remembering your past conversations and understanding you will be super important going forward.
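The needle-in-a-haystack test is simple to sketch: bury a "needle" sentence at some depth inside a long run of filler text, ask the model about it, and check whether the answer recalls it. The harness below is a minimal illustration; `ask_model` is a hypothetical stand-in for a real LLM call, not an actual API.

```python
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Bury the needle sentence at a relative depth (0.0 = start, 1.0 = end)
    inside roughly total_chars of repeated filler text."""
    pad = (filler + " ") * (total_chars // (len(filler) + 1) + 1)
    cut = int(total_chars * depth)
    return pad[:cut] + " " + needle + " " + pad[cut:total_chars]

def needle_recall(ask_model, needle: str, question: str, depths) -> float:
    """Fraction of depths at which the model's answer contains the key word.
    ask_model(context, question) is a hypothetical LLM call (an assumption)."""
    key = needle.split()[-1].rstrip(".")
    hits = 0
    for d in depths:
        context = build_haystack(needle, "The sky was a uniform grey.", 20_000, d)
        if key in ask_model(context, question):
            hits += 1
    return hits / len(depths)
```

A real evaluation would sweep both the context length and the depth, and the harder multi-needle variant mentioned above places several needles at once and scores per-needle recall, which is where accuracy drops off.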
It better not miss that marriage anniversary date! And yes, I hear you asking, Károly, GPT-4.5 is already out. This is GPT-4.1? Yes, and look, it gets better, I mean funnier, look at this wall of other models.
You could say that the marketing could be a bit better here and I would agree. However, this also shows the breakneck pace of innovation and competition here. More on that in a moment.
Now, benchmarks. I enjoy seeing numbers going up over time as much as the next Scholar. I mean, having these systems answer hundreds of PhD-level questions, and mathematical and biological olympiad-level questions, is incredible.
These systems do unbelievably well on them. But here is the problem with these benchmarks. Almost all of these AI assistants are trained on nearly the whole internet.
What does that mean? That means almost whatever your question is, they had already seen something that is similar. And what does that mean?
It means these benchmarks will mean less and less over time. Thus, I would not take them too seriously. Perhaps as one little data point.
One little pointer. So then, is testing AI assistants reliably impossible? Well, not quite.
There is a solution. Sort of. Okay, this was part one.
Normally here is where most other videos end. Elsewhere. But not here.
Here we talk about research papers, and the wider context around these papers. For me, that is the coolest part, that’s what makes Two Minute Papers different. So here comes the coolest part.
Part two. Here, we are going to answer questions like: Can we really know how smart these AI systems are? Why do they say that it is devilishly difficult to train them?
How do you make a small textbook bigger? Yes, you heard it right. Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.
So, the problem with benchmarks is that they ask questions that these AIs already know. How about a test with things they don’t know? Enter Humanity’s Last Exam, a paper that just appeared where the authors ask the smartest people in the world to create questions that none of the current AI systems can answer.
These are really, really tough. Questions from all kinds of disciplines, classics, ecology, mathematics, computer science, linguistics, chemistry, you name it. And the result?
Now hold on to your papers Fellow Scholars, and look at that. Other tests? Easy-peasy.
Humanity’s Last Exam? Wow. They fail spectacularly on that.
A stunning result. And even if you plug in the newer ones, Gemini 2.5 Pro pops up again.
Google is getting their mojo back in AI. So, yes, I would very much like to have more of these systems tested on Humanity’s Last Exam. I think we should keep asking for it.
Doing my part here. But…wait a second. I hear you asking, Károly, okay, but this one can be gamed too, right?
You just put these questions, or similar ones in the training dataset, yum-yum-yum, it eats it up, and suddenly, the next version of the AI is incredibly good at that, because it has already seen it. Like with other benchmarks. Well, not quite.
Because, get this, many of the questions are reserved in a hidden dataset that is not published anywhere, making this one a bit more truthful than most other tests. In short: private datasets might still be a good yardstick to measure AI progress in the future. Please test new systems like GPT-4.1 on those too. And as you see, luckily for us, there is breakneck competition between these AI labs to create the best models out there, often for free. So we are completely spoiled here.
OpenAI came up with ChatGPT and took the world by storm. But now, Google DeepMind has Gemini 2.5 Pro, which is an absolute powerhouse at a very low price point, so much so that it might even steal the show for now. And DeepSeek is also on their tail, just a tiny bit behind, and it is all free to use for all of us. Amazing. So all of these, including GPT-4.1, are amazing gifts for all of us, especially since these systems are devilishly difficult to train. Why is that? Now, I don't have a lot of eye candy for this part, so I really appreciate that you are all Fellow Scholars, and you know that most of the value is in the narrative, not in the visuals.
Huge thank you and huge respect to all of you for that. Now, get this. In an interview, scientists at OpenAI were asked: if they had to redo GPT-4, a state-of-the-art flagship system from just two years ago, what would it take? The answer was something absolutely incredible, I couldn't believe my ears: they said 5-10 people would be enough to do that. And now, for their more recent models, they need hundreds of people and all the might and compute of OpenAI. But here comes the crazy part: both compute and training data are growing like crazy, but compute grows even faster than data. What does that mean? It means that data has become the bottleneck. So the goal now is to use the tons of compute to squeeze every drop of information out of this training data.
That means data efficiency. Do you know what system out there is really data efficient? The human brain.
Oh yes! That is huge. I mean, not the brain, but the realization.
So let me say again: we are not compute constrained. It’s nice to have more graphics cards, of course, but now what we need most is more human ingenuity to make better use of the data we already have. Imagine that you need to take an exam and you have access to a textbook.
And you have all the time in the world, fantastic. You think you are lucky, until you find out that there is one big problem: the textbook is really tiny. Not even close to enough for the exam.
Why? Because it only has two problems in it. But the test will have a hundred problems.
So what do we do? Well, first, what you don’t do is just memorize the two problems it contains. This won’t help with the other 98.
So, instead you dissect these two problems. You try to understand the fundamental principles, the methods, and the reasoning behind the solutions. And that, Fellow Scholars is going to be the next chapter for AI too.
Lots of compute, and comparatively very little data. But the training gets even more difficult. Why? Well, there are small bugs, tiny little problems, during the training of an AI system. Is that a problem? No and yes. You see, imagine a dripping faucet in your house. Nobody cares about it. But note that the new system is more than 100 times more demanding, 100 times more complex. So remember that dripping faucet that everyone ignored? Well, multiply it by a hundred, and that is now a broken pipe that pours water into the foundation of the whole house, and then it slowly starts sinking. Oh yes, that's a classic. A small problem magnified by 100 is suddenly not a small problem anymore.
And don’t forget, this is just how things are at the moment. But the landscape is changing really, really fast. Once again, there is breakneck competition.
Remember Sora, OpenAI's text-to-video AI? When they first showcased it, the news took the world by storm. We couldn't believe how good it was. And by the time they released it? I would go so far as to say that now, it might not be able to compete with DeepMind's Veo 2, which has appeared since. And there are so many more models being published for free, you can't even keep track of them. Some of them have just 7 billion parameters, so they run almost anywhere. So, these are all amazing gifts to humanity from OpenAI, I am really thankful for them, and I am also thankful for the fact that there is huge competition between many other labs too. The result?
We, the users, the Fellow Scholars get spoiled. Often for free. Thank you!
So whatever you hear today on AI this, AI that, this is still just the beginning of humanity’s AI journey. And it is already so incredibly capable. Loving it.
What a time to be alive! This was super fun, hope you enjoyed the journey, and consider subscribing and hitting the bell icon if you wish to see more like this. And check out our new sponsor, because it helps you try a bunch of amazing AI systems for free or for very cheap.