The truth is, it's probably not possible to satisfactorily condense 12 months' worth of weird progress in AI, as well as predictions for the year to come, into just one video. To be honest, I'm going to try anyway, because it has been a very strange time. We are mid-singularity for some people and pre-bubble burst for others, but wherever you are on that spectrum, here are my 10 takeaways from 2025, as someone who does little other than follow AI, plus five things we can confidently anticipate in 2026.

2025 was always going to be the year of reasoning models, models that take longer to think and spend more tokens doing so. That led, of course, most famously with Gemini 3 Pro, to benchmark after benchmark being beaten. But inevitably also skepticism about the inherent value of benchmarks being beaten.

But frankly, just the fact that whatever test you or I or the industry can create, AI models can soon surpass, is itself a fascinating phenomenon. Yes, model aptitude is jagged, but man, those spikes are getting pretty impressive, whether it comes to video understanding, chart and table analysis, coding, or general knowledge and reasoning. This is the same year though that we caught glimpses of a flaw in that paradigm, that thinking longer may boost accuracy, but reduce diversity of outputs.

By browbeating base models until they can beat benchmarks, we are ensuring that the first answer a model gives, as shown in yellow here, is much more likely to be smart. But this 2025 paradigm does not seem to be producing reasoning paths that weren't already present in that base model and couldn't have been found if you sample that base model enough times. But the thinking longer approach isn't everything.
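To make the "sample the base model enough times" point concrete, here is a minimal sketch using pass@k, one common way to measure whether at least one of k sampled answers is correct. The two sets of counts are invented purely for illustration and are not taken from any paper or model; the names `base_counts` and `tuned_counts` are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n recorded samples of which c were correct (needs k <= n)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Invented counts of correct answers out of N = 100 samples on four problems.
N = 100
base_counts = [5, 5, 5, 5]       # rarely right, but occasionally right on every problem
tuned_counts = [90, 90, 90, 0]   # usually right, but never right on the fourth problem

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

With these made-up numbers the tuned model wins easily on the first answer, while the base model comes out ahead once you allow 64 samples, which is the shape of the argument above.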

There's also scaling up the parameters and data that go into that base model, and we have seen rich rewards from that approach. Here's Demis Hassabis speaking just the other week.

>> We're recording now. Gemini 3 has just been released and it's leading on this whole range of different benchmarks. Um, how [clears throat] how has that been possible? Like, wasn't there supposed to be a problem with scaling hitting a wall?

>> I think a lot of people thought that, especially as other companies have sort of had slower progress, shall we say, but I think we've never really seen any wall as such. Like what I would say is, um, maybe there's like diminishing returns. And when I say that, people only think like, oh, so there's no returns.

Like it's zero or one. It's either exponential or it's asymptotic. No, actually there's a lot of room between those two regimes, and I think we're in between those.

So it's not like you're going to double the performance on all the benchmarks every time you release a new iteration. Maybe that's what was happening in the very early days, you know, three, four years ago. But you are getting significant improvements, like we've seen with Gemini 3, that are well worth the investment and the return on that investment in doing so. So that, we haven't seen any slowdown on.

>> My second major takeaway from 2025, of course, relates to Genie 3 and how the world will soon be playable. Announced in August by Google DeepMind, Genie 3 is a model that can generate dynamic worlds from just a text prompt or perhaps an image you feed to it.

And that world isn't completely ephemeral. It retains consistency for a few minutes at a time at 720p resolution. In other words, you could take a photo, let Genie 3 turn that into a playable world, carve your initials into a tree inside that world, and return a few minutes later to see your initials still there.

Of course, whether you think this is going to lead to the most epic games ever or a whole new swathe of people retreating into their own virtual worlds is up to you. Whatever you believe, my third takeaway from 2025 is that inevitably those worlds are going to get more and more realistic. Just this year, we got Veo 3.1, Sora 2, Nano Banana Pro, as well as incredible text-to-speech and text-to-music models.

These are all incredibly fun, of course, but my fourth takeaway is that AI slop has officially gone mainstream and isn't going anywhere. Two quick examples, cuz I'm sure you guys have hundreds of your own, but I got recommended in my feed this video, which as of now has 2.4 million views, and it's this, you know, sad tale of a 73-year-old guy giving his life lessons.

Thing is, it's all AI generated. That hasn't stopped hundreds of thousands of people being fooled though and giving comments as if this is a real video.

Fine, he or it might be giving good life lessons, but what happens to a world where no one can trust what they're watching, what they're hearing? Another way of putting it is that in 2024, the top comment on a video like this with the technology of the time would have been, this is AI rubbish. Whereas in 2025, it's just people pouring out their heart in response, not realizing, or for some people not caring, that it is all AI generated, even the script.

The second anecdote being this video sent to me by a close family member, and again, of course, it's all AI generated about Trump ending NATO. This family member thought that the video was real and what's more, I talk to him all the time about AI and deepfakes. So it's hard to make someone immune.

My fifth takeaway was that there was so much great and encouraging AI news that wasn't necessarily related to the latest frontier model. I could have picked any of a hundred examples, but take DolphinGemma, a large language model developed by Google to decode dolphin language essentially. It's still being refined, of course, as they feed it more and more data, but this is the kind of project I think we could all get behind.

A model that can recognize the signature whistles or unique names that are used by mothers and calves to be reunited is a model that could emit, in token form at least, those same whistles and potentially summon such dolphins. My sixth takeaway is that people's desire for such progress is finely balanced with their kind of hatred for AI slop overall. This is perhaps why a survey done in the summer on Americans showed that the net rating for AI overall is positive, just about.

2,300 Americans were asked, please say whether you believe the overall impact of AI on society is positive or negative, and we had 8% more people saying positive. Mind you, being only one percentage point higher than social media is somewhat worrying. Now, that survey factors in the overall impression, but specifically on AI art, the picture is far less positive.

Here in the UK, the government has a plan to make it opt out for artists. In other words, they have to actively say that they don't want their work to be used for training AI models. Only 3% of the UK public back that approach.

On a deeper level though, even at the very top of these AGI labs, questions are being asked about the meaning of solving creativity.

>> Parts of it that have hit you harder than you expected though?

>> Uh, yes, for sure. On the way, I mean, even the AlphaGo match, right? Just seeing, you know, how we managed to crack Go, but Go was this beautiful mystery and it changed it. And so that was interesting and kind of bittersweet.

I think even the more recent things of like language and then imaging and, you know, what does it mean for creativity? Uh, I, you know, have huge respect and passion for the creative arts, having done game design myself, and you know, I talk to film directors and it's an interesting dual moment for them too. On one hand they've got these amazing tools that speed up prototyping ideas by 10x, but on the other hand, um, is it replacing certain creative skills?

So I think there's sort of these tradeoffs going on, um, all over the place, which, um, I think is inevitable with a technology as powerful and as transformative as AI is.

>> Next, and I have done entire documentaries on this, so I'm going to keep it short, but AI has basically been enlisted this year in governments the world over. From outrage in Sweden because their prime minister uses ChatGPT to help him in his role to US senators admitting that they use Grok to analyze aspects of the big beautiful bill, GenAI in the military, of course, which is its own video, and of course, government entities using generative AI models to find efficiencies to very mixed effect.

A lot of this, being honest, relates to how smart many thought models would be by now, but that's for the second half of this video. Instead of just a bunch of confusing news, I hope you end this video with at least one framework of how to understand AI and how it's progressing. Because, of course, if you just look at the headlines, it can be incredibly misleading.

You're like, it's going to lead to the elimination of all jobs, but wait, it makes horrific mistakes. What on earth is going on? My eighth takeaway concerns GPT-5, which, if we're being honest, is the model that was probably the most looked forward to in 2025.

Sam Altman, I believe, misunderstood, and I've got the reason why in a moment, but I think he misunderstood that model before he released it. He said GPT-5 is the first time it really feels like talking to an expert in any topic, like a PhD-level expert. And in the live stream launch of that model, he said again, it's a legitimate PhD-level expert in anything, any area you need.

But the mistake there is thinking that there's just a single axis of intelligence, and being PhD-level on certain exams in one area means it's not going to make trivial mistakes elsewhere. As people have, of course, found with GPT-5, 5.1, 5.2, and indeed all other language models, those basic hallucinations remain.

As I said at the time with my GPT-5 video, that doesn't mean that hundreds of millions of people won't experience an overall smarter model. Back in February, it was 400 million people using ChatGPT every week.

Now, it's closer to 900 million people. But one of the biggest stories of the year was just how far certain model providers are willing to go to make their models appealing to users. We had OpenAI briefly making GPT-4o incredibly sycophantic.

Even someone saying, I've stopped taking all my medications and I left my family because I know they were responsible for the radio signals coming in through the walls, to which GPT-4o said, seriously, good for you for standing up for yourself and taking control of your life. We had Meta accused of almost purely optimizing for user preference to get crazy high benchmark preference scores, but then releasing a different model as Llama 4. It seems to most people that this approach went so badly that Meta scrapped that entire approach and had to rebuild from scratch their super intelligence unit.

Of course, we'll see in 2026 if that produces results. Even though GPT-5 didn't go down as well as Sam Altman might have hoped, there were some quiet successes along the way for OpenAI. Like GPT-4.5 passing the Turing test.

This happened in April to quite little fanfare actually, where humans couldn't tell that they were speaking to GPT-4.5. On balance, they couldn't tell it apart from just another human typing out a response.

One thing that did give some strange vibes about OpenAI's approach was them having to almost justify how they're going to get future revenue in a post from just a week ago. It does seem like a mixed sign when a company publicly leans on the link between the compute that feeds their models and the revenue that comes out.

Yes, that has been the correlation and probable causation so far, but that doesn't mean that will continue indefinitely. Why? Because, well, ninth, we have seen the dogged increase in performance of Chinese and other open weight models.

Even on my own private benchmark, Simple Bench, testing trick questions and common sense reasoning, a Chinese model released in the last 24 hours, GLM-4.7, got a score that would have been state-of-the-art around 9 months ago. Yes, OpenAI and Google DeepMind and Anthropic keep innovating and still hold top spots, but they do seem to be on that hamster wheel where they have to keep innovating.

Even if we just have a pause in innovation for 6 or 12 months, Chinese models could catch up and a lot of that API and consumer spend could switch to cheaper models from China. Or maybe Google and OpenAI have to reduce their prices to stop people switching and reduce their profit margins. For coding and question answering, no Chinese model has quite made it into my top four as judged by the council I use for lmcouncil.ai, but for image generation, they certainly have, with Seedream in particular, and their 4.5 model really getting quite close.

Seedream 4.5 for me is still in third place, but it's not too far behind Nano Banana Pro or GPT Image 1.5, released just the other week. One thing I would say is that even if you don't care about Chinese models being significantly cheaper, you can never fully write off the open-source community because it's not just the Chinese model providers that we're talking about.

Nvidia, that giant, have released fully open-source Nemotron 3. Now, for sure, it's not the smartest model out there, but this was just December 15th, and Nemotron Ultra, 16 times larger, is being released soon, according to them. Not just that, it's fully open-source, including the training data that goes into the model.

Again, this ninth takeaway isn't about Chinese models or Nvidia catching up, it's them staying in the race. And what them staying in the race means is that one slip-up from the frontier, just 6 or 9 months of little progress, could mean profit margins shrinking rapidly. I don't think that's going to happen, but it must keep lab leaders up at night.

The 10th takeaway is the breakout performance of the METR time horizons benchmark. You may have thought I was going to say the breakout performance of Claude Opus 4.5 on that benchmark.

No, I mean the benchmark as a whole. Grading models on how long it takes humans to complete the tasks that models can complete successfully half the time. That may sound confusing, but to give an example, albeit with colossal error bars, Claude Opus 4.5 can half the time successfully complete tasks that it takes humans almost 5 hours to complete.

I can barely get models to spend more than 5 minutes on my problems. Maybe they're too easy.
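If the "time horizon" idea still sounds confusing, here is a rough sketch of the arithmetic, assuming (as I understand the approach) that success probability is modeled as a logistic curve over the log of how long a task takes humans. The slope and intercept below are invented so that the 50% horizon lands at 300 minutes; they are not the benchmark's fitted values for any model, and the function names are hypothetical.

```python
import math

def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: logistic curve in log2 of human task length."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))

def time_horizon(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Human task length (minutes) at which the model succeeds with the given reliability."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)

# Invented parameters: a negative slope means longer tasks are harder.
SLOPE = -1.0
INTERCEPT = math.log2(300.0)

h50 = time_horizon(INTERCEPT, SLOPE, 0.5)   # horizon at 50% reliability
h80 = time_horizon(INTERCEPT, SLOPE, 0.8)   # horizon at a stricter 80% reliability
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

The design point is that one fitted curve gives a different horizon for every reliability level you ask for, so demanding a higher success rate always shortens the horizon.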

This chart has been cited in all sorts of governmental analyses and AI 2027 projections and debates about the future of AI. I have had more than one lengthy discussion with more than one of the authors of this benchmark, so I want to give it some context. First, it's drawn from three benchmarks focused on coding and machine learning engineering tasks.

This is not a generalized measure of AI intelligence. Second, as A One Mr Goel points out on Substack, the further along you get in the METR plot, the more you're relying on a weaker and weaker signal. For example, when you're looking at the 1 to 4-hour range on the METR plot, it is drawn from only 14 samples.

That also leads to the massive error bars, with a 95% confidence interval of between 1 hour 49 minutes and 20 hours 25 minutes. There's also actually a slightly bigger sample size at the higher end towards 16-hour long tasks, which led to the weird phenomenon of Claude actually getting some of those right, but not the 2 to 4-hour long tasks. Also, as I noted back in March, that is again based on the average human duration for completing those tasks, but that varies wildly.

METR found that contractors took 5 to 18 times longer to fix issues than repo maintainers. I could do many more, but there is a third bit of context I'd want to give you before you extrapolate too much from this graph. This period of time between 2020 and end of 2025 has coincided with an exponential increase in effective compute power.

There are plenty of reasons to believe we may only have 1 to 3 or 4 years of such scaling left. The final part of this video is about why that doesn't mean that the incredible progress we've seen in AI will peter out in that time period, but it is a cautionary note on extrapolating lines from a graph. And actually, before I forget, there was one more thing, which is if you raise the bar up to 80% success, remember this chart is based on models having 50% reliability on the task.

If you raise that to 80% success, Claude's performance drops off quite significantly. I keep forgetting stuff cuz there's actually one more thing I wanted to raise about METR, courtesy of Mr Goel. Actually, I don't know how to pronounce the name.

Mr Goel? Goel? Sorry.

But he made the point that the more popular a benchmark becomes, the more famous it becomes, the more incentives there are for companies to game that benchmark. Essentially, train specifically for the tasks found in the METR time horizons benchmark, like cybersecurity capture the flag, and thereby appear to have a model that follows the exponential. You might wish then that there was a benchmark that measured pure raw intelligence, no gaming allowed, but that is really hard to craft.

Some people, like Yann LeCun, don't even believe there's such a thing as general intelligence. He said that's just an illusion, even for humans, and we're just specialized at certain tasks. Just yesterday, Demis Hassabis, CEO of Google DeepMind, shot back saying that Yann is just plain incorrect.

The human brain and AI foundation models are approximate Turing machines and are in fact extremely general. That key point about generality is at the heart of disagreements about 2026, which brings me to the final part of this video. My five predictions, if you will, about the year to come in AI.

But before I get to those, there's just two points I want to make. The first concerns the sponsors of today's video, MATS, and I was very pleased to hear that multiple people applied to their summer 2026 program via the link in my description of my last video. If you didn't see that spot, MATS basically finds and trains technical researchers working on essentially one of the most talent-constrained problems in the world, reducing risk from unaligned AI.

Previous alumni have gone on to work at places like Anthropic and DeepMind, and as I said in the last video, it would be pretty meta if the researchers who apply this year via the link in the description end up doing the kind of insane AI security and alignment work that gets featured on this channel. In case you missed it, the program also comes with world-class mentorship, a stipend, compute budget, and full cost coverage. And the second point is actually an admission about a prediction I got wrong.

I think it was around this time last year that I predicted that video avatars would be a thing by now. My evidence was based on this paper, VASA-1 from Microsoft, where the lip-syncing from speech to the avatar moving their head was incredible. Check this out.

>> So, you know, sometimes nothing happens and sometimes everything happens all at once and you just kind of deal with it. And it's also just strange to both be extremely worried about different things and have your anxiety levels like peak to the highest they've ever been.

>> And you're telling me, if we plugged Gemini 3 Flash in as the model producing those answers, we wouldn't get a pretty incredible Skype or Zoom call out of that?

Well, that's what I thought, but here we are almost at Christmas and no such avatar or Zoom calling exists. Okay, that's not quite true, but no frontier-level AI avatars are available. A lot of the predictions I made did come true, but not all of them.

My first framework, if you will, for 2026 concerns what I'm calling lateral productivity. Everyone focuses on whether models are better than the best experts in their domains, but not nearly as much focus goes on the fact that even if models are only at the 90th percentile in a domain, that means someone outside of that domain can be helped to upskill very rapidly. Here's just one study from the autumn from the AI Security Institute, and they found that non-experts who used frontier models to write experimental protocols for viral recovery had significantly higher odds of writing a feasible protocol, almost five times as much, than a group using the internet alone.

This belies that myth that, "Oh, you could just Google it before. Nothing's changed." Obviously, this particular study has safety implications because we don't necessarily want everyone to be able to create a virus, for example.

But again, just for me, the fact you have access to a very imperfect model in any domain is itself remarkable. Just the other week, the back doors of my car wouldn't open after an MOT, and I was like, "What the hell is going on?" And I used Gemini 3 to figure out that they'd put on the child lock for the back doors.

It also told me how to unlock those doors by looking at a latch inside the door, which I would have never found. Obviously, the model is not as good as the best mechanic, but I didn't have access to the best mechanic that evening. Well, likewise, across any domain.

The other day, I interviewed the founder of Sunday Robotics for Patreon, Tony Zhao, because in November, they produced one of the robotics demos of the year, involving wine glasses. Their Memo robot is scheduled for deployment in 2026. And you might say, "Well, I wouldn't trust them with my wine glasses." But they may also soon be able to make a bed.

Obviously, not as well as the best bed maker, but you might not care about having the best. Even a decently made bed might be good enough for you.

Okay, so my next framework for 2026 is a bit harder to convey, but I'm just going to try my best. A good chunk actually of this video has been building up to this example, because to see where AI is going in 2026, you have to have an opinion about the generality of our current methods. Let's imagine for a moment that you trained a robot on all of the internet's data, literally everything, videos, audio, everything.

Let's say the model inside of it has a quadrillion parameters, we've scaled up to the absolute max. Well, for the true believers, scale is all you need. It's just a central axis of general intelligence.

So you could give the robot a task like, "Can you pick up the cup?" And the model, having understood all the patterns latent in all of that data, would proceed to do so elegantly and excellently. You can almost think of this as the single axis camp.

Generality of intelligence as a single knob you can dial. Dario Amodei of Anthropic is, I believe, in this camp, or at least was, and Ilya Sutskever definitely was, but no longer is. For Sutskever, formerly chief scientist of OpenAI, training a model to predict the next word would force it to encapsulate so many patterns latent in the data.

For example, if the final sentence of a book was, "And the murderer is," to successfully predict the next word, the model would have to have incredible general intelligence, understand human emotions and intentions, and decipher socio-economic statistics, and everything. Now, by the way, he doesn't believe that, saying that the generalization of models is actually inadequate. There's a disconnect between eval performance and actual real-world performance.

If, like Amodei though, you're more drawn to that singular axis approach, then it makes sense to just extrapolate the curves forward.

>> But now getting to kind of the job side of this, um I I I do have a fair amount of concern about this. Um on one hand, I think comparative advantage is a very powerful tool. If I look at coding, programming, which is one area where AI is making the most progress, um what we are finding is we are not far from a world, I think we'll be there in 3 to 6 months, where AI is writing 90% of the code. And then in 12 months, we may be in a world where AI is writing essentially all of the code.

>> This intelligence ratchet, if you will, is what's modeled in AI 2027, the report studied by, among others, J.D. Vance. But let's go back to my robot analogy, because what would happen if the robot doesn't pick up the cup?

Or at least doesn't do so well. Maybe it picks it up, but it does so slowly and very jerkily. Maybe it breaks the cup or damages the gripper or knocks something else over.

Maybe it doesn't lift it high enough or accurately enough or with good enough energy efficiency. Maybe it can pick up a cup, but not any other object. Suddenly, you have to have a dozen benchmarks measuring each of these things, and then optimize on those benchmarks to get good performance.

Or maybe the world is just full of chaos, and there are a thousand benchmarks you need to optimize for to get a smooth and happy lifting of a cup and any object. Maybe it's as granular as you have to train on differently colored cups, different noise levels, and on and on and on. Just personally, I don't think we're in either extreme.

I think those like the lead author of AI 2027 are more drawn to the singular axis approach. Scale up, and of course, with a bit more unhobbling or a few tweaks along the way, you just get more and more intelligence. That led to a median estimate from Daniel Kokotajlo of AI systems being able to replace 99% of current fully remote jobs by around 2027.

The other extreme, where every tiny little niggle and variation has to be benchmarked for and optimized, led to one of the former members of Epoch AI, a great independent research organization, predicting it would be 40 years before this occurred. Endless incremental progress across innumerable benchmarks. As you can probably tell from my voice, I'm more in the middle.

Why so? Well, partly because of my own benchmark, Simple Bench. If we were in the single axis world, well then it should have been saturated long ago.

Almost straight after the first model to get any kind of performance, say 40% like this time last year, would have been succeeded by a model getting 80% or even 100% performance. Okay, you might say there's noise in the benchmark, but let's say 90% performance. Just a truly generally smarter model would have emerged, and the silly mistakes would have all gone away.

If we were in the thousand benchmarks to pick up a cup world, well then there'd be no increase in performance on Simple Bench. No one is creating a benchmark for what a certain person would do in a CPR situation if, as children, they argued about Pokémon. There just would never be a benchmark for that, and so performance on Simple Bench would never improve.

The models therefore must be picking up some general patterns latent in internet-scale data. I think therefore we're in that middle world of steady improvement. Hence why it's very hard to pin down the "IQ" of models, and even Sam Altman seems to find it difficult.

I don't think he thinks they're PhDs anymore, but he does propose a decent definition of superintelligence.

>> There's all this stuff the last couple of days about GPT-5.2 has an IQ of 147 or 144 or 151 or whatever it is.

It's like, you know, depending on whose test, it's like it's some high number, and you have like a lot of experts in their field saying it can do these amazing things, and it's like contributing, it's making me more effective. You have the PII things we talked about. One thing you don't have is the ability for the model to not be able to do something today, realize it can't, go off and figure out how to learn to get good at that thing, learn to understand it.

And when you come back the next day, it gets it right. And that kind of continuous learning, like toddlers can do it. It does seem to me like an important part of what we need to build.

Now, can you have something that most people consider an AGI without that? I would say clear I mean, there's a lot of people that would say we're at AGI with our current models. Um I think almost everyone would agree that if we were at the current level of intelligence and have that other thing, it would clearly be very AGI-like.

Um but maybe most of the world will say, "Okay, fine. Even without that, like it's doing most knowledge tasks that matter. Um smarter than most of us in most ways, we're at AGI. You know, it's discovering small pieces of new science, we're at AGI." What I think this means is that the term, although it's been very hard for all of us to stop using, is very underdefined. Right.

I I have a I have a a candidate like one thing I would love is we got it wrong with AGI. We never defined that. The, you know, the new term everyone's focused on is when we get to superintelligence.

Um so, my proposal is that we agree that, you know, AGI kind of went wooshing by. It was didn't change the world that much. Or it will in the long term, but okay, fine.

We've built AGIs at some point, you know, we're in this like fuzzy period where some people think we have, and more people will think we have, and then we'll say, "Okay, what's next?" Um a candidate definition for superintelligence is when a system can do a better job being president of the United States, CEO of a major company, you know, running a very large scientific lab than any person can even with the assistance of AI.

>> If I were to convert my remarks from an observation to a prediction, I would say that Amodei is going to be wrong.

It definitely won't be 100% of coding done by models even at the end of 2026, nor will mainstream scientists agree that models are, say, 150 IQ. But I also would predict that by the end of next year, there won't be any benchmark, in text at least, on which the average human, untrained in that domain, will outperform the frontier model. If I'm right, we won't see unemployment spike to 10 to 20% as Dario Amodei predicted, at least not in the next 1 to 5 years, which by now is 2026 to 2030.

Right, I have two or three final papers and one final analogy in this, my last video of the year. It's why I'm still really quite optimistic about AI in the short term as well as long term. Because, and you may have to bear with me for just a minute, here's what I see LLMs as being.

We beat out the Neanderthals because we could communicate better. We had more nuanced language. We could pass on tips and stories from one generation to another.

When writing came along, we could store that information. With the printing press, we could spread that information across continents quickly. With the internet and then the World Wide Web, we could access all of that information almost instantly.

Each of these, somewhat obviously, were new information paradigms. Then, somewhat appropriately for what I'm about to say, it was Google who revolutionized search. It almost compressed the internet, making it one search query away.

LLMs, for me, very imperfectly, are that next stage of compression. Suddenly, it's not just results you get, it's answers, albeit imperfect ones. Don't worry, this isn't why I'm optimistic.

I'm going to get to the next paradigm in a second. But it is why I think LLMs were and still are so revolutionary. We got answers, not just a list of results.

Of course, one of the first results of the printing press was witch burnings. Early Google searches, and some would say current Google searches, were a bit of a mess. LLMs hallucinate all the time.

So what we need is to move toward that next paradigm, automated information discovery. LLMs will play their part, but they're not the be-all and end-all. Take AlphaEvolve from Google DeepMind, which is basically an LLM plus automated tests and evolution.

You give it a starter code base plus an evaluation function. And yes, by the way, I did a separate video on AlphaEvolve, but just briefly. Then AlphaEvolve will run in a loop where it picks previously good programs from a database, builds a prompt that includes that program plus other high-scoring inspiration programs, then asks an LLM, which today could be Gemini 3, to propose a patch.

It then applies the patch, runs the evaluation, and then saves the new program. It can build on what works and discard what fails, but does this have any practical implication? Um yes, it developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself.
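For anyone who thinks better in code, the evolutionary loop described above can be sketched in a few lines. This is a toy under heavy assumptions, not Google DeepMind's implementation: the "program" is just a list of numbers, `evaluate` is a made-up scoring function, and `propose_patch` is a random stand-in for the LLM call. Every name here is hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    program: list   # stand-in for source code: here just a list of numbers
    score: float

def evaluate(program):
    """Hypothetical evaluation function: higher is better, best when every value is 3.0."""
    return -sum((x - 3.0) ** 2 for x in program)

def build_prompt(parent, inspirations):
    """Bundle the parent program with other high-scoring programs."""
    return {"parent": parent.program,
            "inspirations": [c.program for c in inspirations]}

def propose_patch(prompt, rng):
    """Stand-in for the LLM call. A real system would send the prompt to a model;
    this toy ignores the inspirations and returns a small random edit."""
    return [rng.gauss(0.0, 0.5) for _ in prompt["parent"]]

def apply_patch(program, patch):
    return [x + delta for x, delta in zip(program, patch)]

def evolve(seed_program, steps=200, keep=20, seed=0):
    rng = random.Random(seed)
    database = [Candidate(seed_program, evaluate(seed_program))]
    for _ in range(steps):
        database.sort(key=lambda c: c.score, reverse=True)
        parent = rng.choice(database[:5])              # pick a previously good program
        prompt = build_prompt(parent, database[:3])    # add high-scoring inspirations
        child = apply_patch(parent.program, propose_patch(prompt, rng))
        database.append(Candidate(child, evaluate(child)))   # run the evaluation and save
        # Keep only the best few: build on what works, discard what fails.
        database = sorted(database, key=lambda c: c.score, reverse=True)[:keep]
    return database[0], database

best, database = evolve([0.0, 0.0, 0.0])
print(best.score, best.program)
```

The only thing the loop needs from the outside world is a starting program and an automatic way to score it, which is why it suits problems like scheduling or kernel tuning where a score is cheap to compute.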

Basically, it even made certain model training runs on Google's TPUs run about a percent or so faster. Bear in mind the scale of this. One of the solutions that AlphaEvolve came to, which has been in production for 18 months now, recovers on average 0.7% of Google's worldwide compute resources. Automated information discovery.

How about the first improvement in 56 years of a particular algorithm used for matrix multiplication? Then there's what I'm calling Alpha software, which was a paper released in September of this year.

Because after all, as Google says, the cycle of scientific discovery is frequently bottlenecked by the slow manual creation of software to support computational experiments. For this breakthrough, think of adding web search or deep research to the previous mix. Obviously, I'm simplifying here, but the effectiveness of the tree search they did, for example, in bioinformatics helped discover 40 novel methods for single-cell data analysis, outperforming the top human-developed methods on a public leaderboard.

And even if a system needs to learn on the job and become specialized in each domain, well, there are already working prototypes for continual learning. For example, nested learning, which I covered in a previous video from Google. The architecture helps the model choose what to learn and to memorize.

Of course, choosing what to learn will be incredibly important from both a safety and intelligence point of view, because did you know that LLMs can get brain rot by reading Twitter influencer material? Yes, there's literally a paper on that. The results provide significant multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pre-training as a training time safety problem.

Twitter influencers may be what genuinely slows down the singularity. Now, of course, there is a lot more than automated information discovery to look forward to in 2026. How about enhanced EQ for models?

One paper uncovered what you could call a geometry of conversations, where they could pinpoint the moments where models start to piss you off through semantic shift and repetition and misunderstanding your original goal, not reciprocating user effort in the conversation, for example. Plus, just general latency as being a key frustrator of users. Either way, the paper showed that all of this could be modeled and improved.

I for one really look forward to the day where models just get me a lot better. Then there are the coding improvements that I find really hard to put into words. Yes, as Andrej Karpathy says, we are still in that world of jagged capabilities, but as someone who uses these models daily, the sheer improvement in reliability and quality from January to December is incredible.

So, that's my take, a much more fulsome video than I was planning on making, but I'd love to know what you guys think. You may completely disagree or may just want to chuck the transcript into self-chat and get models to debate it on lmcouncil.ai.

Either way, honestly, thank you for joining me this year, and thank you for watching to the end of this quite substantial video. You guys are an amazing bunch. Thank you so much for all the support you've given this year.

Have a wonderful Christmas and an even better 2026.

What the Freakiness of 2025 in AI Tells Us About 2026