A new research paper just dropped showing how large language models can think internally before ever outputting a single token. You're probably used to hearing about chain of thought with test-time compute; those are the current thinking models, and the thinking happens where you can actually read the internal monologue. With this new approach, however, the thinking happens in latent space, inside the model, before it ever outputs a single token. This is very different from chain of thought, and it has the potential to tackle problems that can't be described with words alone.
Before I get into the paper, I want to show you what Yann LeCun, Chief AI Scientist at Meta, has said about the limitations of large language models. He's one of the leading minds in AI and one of the loudest voices saying that large language models cannot reason and cannot plan like humans do, because of the limitations of describing things with language alone. Let me show you a first clip from an interview on the Lex Fridman podcast where he describes exactly this:

"If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is planning what you're going to do so as to arrive at a particular outcome or satisfy a particular objective. I can predict that if I have an object like this and I open my hand, it's going to fall. And if I push it with a particular force on the table, it's going to move, whereas if I push the table itself, it's probably not going to move with the same force. So we have this internal model of the world in our mind which allows us to plan sequences of actions to arrive at a particular goal."

What he's essentially saying is that to get true reasoning models, models that actually understand the world around us, we need more than just what language can describe. He has his own thoughts on what it'll take, but that's not the purpose of today's video. His argument is that language models alone are not enough to reach true planning and true reasoning.
Now, keep in mind that his interview was recorded before the thinking-model phenomenon really went mainstream. He probably already knew about thinking models, test-time compute, and chain of thought, and he still believes that's not enough. For me, when I look at a chain of thought, when I see the actual reasoning steps a model goes through before it gives you the final output, that's pretty convincing. But he is not convinced:

"We're fooled by their fluency. We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false."

So he says it right there: we are being fooled by the fact that these large language models are incredibly good at manipulating language, at telling us what we want to know, but that in itself is not enough for true reasoning and true logic. Again, I keep going back to being able to read the chain of thought and see the reasoning patterns. Of course, there's a chance he's still right and the chain of thought is really just language manipulation.
So who knows. Let me show you one more clip from that interview, and then we'll get into the paper:

"There's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people: everybody is talking about generative AI. If you're really interested in human-level AI, abandon the idea of generative AI."

There he said it: abandon the idea of generative AI, which is very surprising given all of the recent advancements. He is in the minority in this thinking. All of the major AI companies obviously believe we're going to be able to hit AGI and ASI, true reasoning, true logic, true representations of the real world, using language models alone, and chain of thought with test-time compute seems to be the last lever necessary to get there. But again, he doesn't believe that. So with that, let me show you this paper, because it might actually be the thing he's been saying is missing from large language models.
Here's the paper: "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." What the authors have figured out is that you can have a model that does its thinking in latent space, inside the model, before ever outputting a token, and this might be the missing piece that Yann LeCun thinks is necessary for a true reasoning and planning model. This type of architecture allows the model to think, and to scale that thinking up, internally. I'm going to explain how it works and what they discovered.

From the abstract: this novel language model architecture is capable of scaling test-time computation by implicitly reasoning in latent space; the model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test time. So they have a recurrent block inside the model that can keep iterating, going deeper and deeper until it arrives at the final answer, and all of this happens at test time. That's the important part, and it's in contrast to mainstream reasoning models that scale up compute by producing more tokens. Remember, the current wave of thinking models uses chain of thought at test time.
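To make "iterating a recurrent block" concrete, here's a minimal toy sketch. None of this is the paper's actual architecture: the matrices, sizes, and tanh update are invented for illustration, whereas the real model uses full transformer blocks for its prelude, recurrent core, and coda.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy latent width

# Stand-ins for the three stages: a prelude that embeds the input,
# a recurrent block that is iterated in latent space, and a coda
# that decodes the next token. (Hypothetical toy weights.)
W_prelude = rng.normal(size=(d, d)) / np.sqrt(d)
W_recur   = rng.normal(size=(d, d)) / np.sqrt(d)
W_coda    = rng.normal(size=(d, d)) / np.sqrt(d)

def latent_forward(x, num_iterations):
    """Unroll the same recurrent block num_iterations times before decoding."""
    e = np.tanh(W_prelude @ x)        # prelude: embed the input once
    s = np.zeros(d)                   # initial latent state
    for _ in range(num_iterations):   # the "thinking": no tokens emitted here
        s = np.tanh(W_recur @ s + e)  # block re-reads the embedding every step
    return W_coda @ s                 # coda: logits for the next token

x = rng.normal(size=d)
shallow = latent_forward(x, 4)
deep = latent_forward(x, 64)  # same weights, just more test-time compute
```

The point of the sketch is that `num_iterations` is chosen at test time, so the very same parameters can spend more or less compute on a token.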
A chain of thought is literally just outputting tokens, correcting itself, and reflecting back on those tokens, and the number of tokens these thinking models produce is far higher than with traditional non-thinking models. The new approach has a lot of benefits. It doesn't require any specialized training data (I'll get to what that means in a minute). You don't need the massive context windows that chain of thought and the current thinking models require. And, most interestingly, it can capture types of reasoning that are not easily represented with words alone. Again, Yann LeCun's main argument against large language models is that you can't represent the real world completely using words, so maybe this is the solution. The authors created a 3.5-billion-parameter proof-of-concept model that shows these techniques in action.

So let's take a step back and talk about how this all works, starting with how human thinking works. A substantial amount of thought happens through complex recurrent firing patterns in the brain before the first word of an answer is uttered. Current chain of thought, by contrast, starts outputting immediately: the thinking happens through output. You're thinking out loud, saying the words aloud or in your mind, but language is still the first thing that happens in the thinking process. That's not how it always works for humans. A lot of thinking happens before you use a single word; you can conceptualize different topics and situations without saying a single word in your head.
In fact, it reminds me of people who have no internal monologue. How do they think? Well, they do, and they don't use any language to do it. The paper continues: more recently, researchers have explored ways to enhance the reasoning capability of models by scaling test-time computation. Those are the thinking models: DeepSeek R1, o1, o3. The mainstream approach involves post-training on long chain-of-thought examples to develop the model's ability to verbalize intermediate calculations in its context window and thereby externalize its thoughts. In other words, you need a lot of examples of how to think in order to train models to think. One thing I'm a little confused about here: the Berkeley PhD student who was able to elicit thinking behavior in a model for just $30 didn't have a bunch of examples of how to think; he used reinforcement learning with verifiable rewards, which doesn't require them. Maybe this paper was written before that example. In any case, as the authors argue, the constraint that expensive internal reasoning must always be projected down to a single verbalized next token appears wasteful; it is plausible that models could be more competent if they were able to natively think in their continuous latent space. So again: internal thinking, no language, just working on a problem before the first token comes out. And this is not a new discovery; the idea is foundational to machine learning and has been rediscovered in every decade, for example as recurrent neural networks, diffusion models, and universal or looped transformers. It's not new, but as the authors note, it gets rediscovered with every generation of AI.
All right, so what's the gist? At test time, the model can improve its performance through recurrent reasoning in latent space. "Recurrent reasoning" means it can think again and again about the same problem, and "in latent space" means internally, without having to output tokens, enabling it to compete with other open-source models that benefit from more parameters and training data. In other words, this new technique is more efficient.

Let's talk about the benefits. Recurrent layers enable a transformer model to perform arbitrarily many computations before emitting a token. First, latent reasoning does not require the construction of bespoke training data. I just talked about this: traditionally, to turn a model into a thinking model, you need a bunch of examples of how to think. We've already seen smaller-scale examples of reinforcement learning with verifiable rewards getting a model to start thinking without explicitly teaching it how, which is different from what they're saying here; we don't know if that will scale up, but we've seen multiple examples. Next, latent reasoning models require less memory for training and inference than chain-of-thought reasoning models. That's because chain of thought in the traditional thinking models takes a lot of tokens, which means the context window has to be really big, and that's computationally very expensive. Finally, recurrent-depth networks perform more FLOPs (think of it as compute per parameter) than standard transformers, significantly reducing communication costs between accelerators (meaning GPUs) at scale. They're basically saying they can utilize a single GPU much more fully, rather than needing multiple GPUs connected together.
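Here's a hypothetical back-of-the-envelope sketch of that memory difference: chain of thought caches keys and values for every reasoning token it emits, while latent recurrence overwrites a single hidden state in place. The formulas and numbers below are illustrative only, not taken from the paper.

```python
def cot_cache_floats(reasoning_tokens, layers, d_model):
    # The KV cache grows with every verbalized "thinking" token:
    # one key vector and one value vector per token, per layer.
    return reasoning_tokens * layers * 2 * d_model

def latent_state_floats(d_model):
    # One latent state, overwritten in place, no matter how many
    # times the recurrent block is iterated.
    return d_model

print(cot_cache_floats(8_000, 32, 4096))  # ~2.1 billion floats for a long trace
print(latent_state_floats(4096))          # 4096 floats, regardless of depth
```

Real accounting is more involved (the latent model still attends over its regular context), but the scaling difference is the point: thinking longer in latent space costs compute, not memory.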
Next: by constructing an architecture that is compute-heavy and small in parameter count, they hope to set a strong prior towards models that solve problems by thinking, by learning meta-strategies, logic, and abstraction, instead of memorizing. This is an important factor as well. Traditional large language models memorize information, and their ability to generalize outside their training data is still under debate; a lot of people argue that models cannot generalize outside of their training data, which would mean it's not truly artificial general intelligence. With this new technique, they might be able to.

All right, let's look at how it works at test time. You have the input, "hello." Then you have the recurrent block, the iterative stage, which can go on indefinitely, thinking again and again about what to put next. Finally you get the actual output, "world," because it's "hello world." The green blocks in the figure are the thinking part of the model, and that thinking happens before a single token is output: "world," the first token, comes out only after all of the thinking.

Now let me show you the proof that it actually works. In this graph, the y-axis is performance (think of it as how good the output is) and the x-axis is the recurrence at test time, basically the amount of thinking the model does before a token is actually output, across different evals: HellaSwag, GSM8K, and HumanEval.
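The hello-to-world walkthrough above can be sketched as a generation loop where every emitted token is preceded by a burst of latent iterations. Again, everything here (the vocabulary, the pooling, the weights) is a made-up toy, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["hello", "world", "!"]           # toy vocabulary
d = 8
E = rng.normal(size=(len(vocab), d))      # toy token embeddings
W = rng.normal(size=(d, d)) / np.sqrt(d)  # shared recurrent-block weights
U = rng.normal(size=(len(vocab), d))      # toy output head

def next_token(context_ids, recurrence):
    e = E[context_ids].mean(axis=0)  # "prelude": embed the context (crude pooling)
    s = np.zeros(d)
    for _ in range(recurrence):      # the green blocks: thinking, nothing emitted yet
        s = np.tanh(W @ s + e)
    return int(np.argmax(U @ s))     # only now does a token come out

generated = [vocab.index("hello")]
for _ in range(2):
    generated.append(next_token(generated, recurrence=32))
print([vocab[i] for i in generated])  # starts with "hello"; the rest depends on the toy weights
```

The structure to notice is the inner `for` loop: all of its work happens between tokens, invisible in the output, and `recurrence` can be dialed up or down per token at test time.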
As the model thinks 4, 8, 16, 32, and 64 times, performance keeps improving. So there's the proof: more thinking inside the model, without generating a token, improves the quality of the next generated token, or really of the total answer, since these are scores on whole benchmarks. We also see that the more training tokens that go into these models, from 100, 200, and 300 billion all the way up to 800 billion, the better they perform, but that's how all transformer models work; that's the scaling law, and we're seeing the same thing here.

And here's another interesting finding: the model can actually decide how much compute to use depending on the task at hand. Just like chain of thought, it can think longer or shorter, which is a really good way to optimize for efficiency. That's what we're seeing in figure 10, which covers, from left to right, high-school math, philosophy, logical fallacies, and moral scenarios.
For each task, the x-axis shows the amount of compute used and the y-axis shows how often the model needs that many steps to arrive at the correct answer. For high-school math, the model doesn't need many steps; it gets to a good answer pretty quickly. Philosophy starts to need more, logical fallacies more still, and moral scenarios the most. So for simpler tasks it doesn't need as much compute, and for more complex tasks it needs more, which is exactly what you'd do as a human: if it's a simple problem you don't think as long, and if it's a more complicated problem you probably think longer.

Now here's another interesting point the paper makes: this technique of latent-space thinking does not rule out also using chain of thought, with actual tokens, at test time. You can combine the two: some initial thinking without generating any tokens, and then, once you finally do generate, those tokens can be used for additional thinking.
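That adaptive behavior can be sketched as an early-exit loop: keep iterating the recurrent block until the latent state stops changing. The paper measures convergence between successive steps; the toy below uses a simple L2 distance as a stand-in for that exit test, with invented weights scaled down so the iteration actually converges.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
# toy weights, scaled so the update is a contraction and settles down
W = rng.normal(size=(d, d)) / (4 * np.sqrt(d))
e = rng.normal(size=d)  # toy stand-in for an embedded input

def think_until_converged(e, tol=1e-6, max_steps=256):
    """Iterate the block until successive latent states agree to within tol."""
    s = np.zeros(d)
    for step in range(1, max_steps + 1):
        s_next = np.tanh(W @ s + e)
        if np.linalg.norm(s_next - s) < tol:  # stand-in for the paper's exit test
            return s_next, step
        s = s_next
    return s, max_steps

state, steps = think_until_converged(e)
print(steps)  # how many iterations this particular input needed
```

In the combined setup described above, a loop like this would run before each emitted chain-of-thought token, so easy tokens exit early and hard ones get more latent iterations.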
Take a second to think about how you solve really hard problems. You probably think about them in your head, write some things down, think some more, iterate on what you wrote, and continue that process until you solve the problem. With these two techniques together, latent-space thinking and actual token-based thinking, you get something really powerful that mirrors how humans think.

So I found this fascinating. It's a proof of concept; you can download the model and try it out yourself. I hope you enjoyed this, because I just find all of these papers so very fascinating. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.