MIT might have just solved the context window problem. Just a few days ago, MIT released a technical paper claiming that, through their system of recursive language models, they've made giant steps forward on one of the biggest issues in the AI space today: context rot. Now, context rot is something we have known about for a while.
Just this past summer, Chroma did a deep dive on this topic, and the result was that context window length doesn't actually matter: whether it's 200,000 tokens or a million tokens, the effectiveness of your large language model is going to drop after about 100,000 tokens. However, in this paper from MIT, their recursive language model setup is showing high performance not at 200,000 tokens, not at a million tokens, but when dealing with data sets of over 10 million tokens.
Those numbers are absolutely insane. So, how did they actually achieve this? And what does this mean for you?
Well, that's exactly what we're going to cover today. So, let's hop into this paper. So, this is the paper from MIT.
It's titled Recursive Language Models, and recursive language models, RLMs, are what this whole thing is all about: how they use RLMs to deal with very large prompts that these models otherwise couldn't handle just due to the size. So imagine we're using GPT-5, which is what they use in this study.
It has a context window length of 272,000 tokens, yet using RLMs it's able to handle data sets that are 10-million-plus tokens long. How does it do that? Well, that's what this paper is all about.
And let's start with the actual results. So, let's start by just looking at these two graphs because it kind of illustrates this entire study pretty succinctly. So, on the left we have GPT5 and that's just the base model.
So, imagine you're just inside of ChatGPT's interface. And then on the right, we have the RLM version. Now, it shows two different tests.
The first test is a very simplistic test, and they both absolutely crush it. It's called a needle-in-the-haystack test. It's essentially giving the large language model a large document, and somewhere buried in that document is the answer to your question.
So in the document it says my favorite color is yellow. I ask the LLM what's my favorite color? It comes back with yellow.
These frontier models are very good at this no matter the context length because it's such a simple ask. Yet if we give it a more complicated task, like the OOLONG tasks, you start to see this idea of context rot, and how RLMs are able to sort of ignore that problem. The OOLONG test, to put it simply, gives the AI system a data set and then asks the AI to find all the combinations of the entries inside that data set that meet certain criteria.
The OOLONG-Pairs test then essentially ramps up the complexity of that task. And with this sort of test you can see context rot at work: as we increase the input context length here on the bottom axis, the effectiveness, the score, drops dramatically. Right?
And it stops exactly at the context window length; here it shows 262k. Now, if we look at the RLM, first of all, notice we don't stop at 262k.
This pushes all the way to 1 million. So, we can handle a much larger data set. Secondly, notice the score.
The score doesn't drop nearly as much. And at a certain point, it almost sort of levels off. As we go from 262k to 1 million, it's virtually the same.
And if you look at OOLONG-Pairs, right, we're sitting at about 50 here on the RLM. We're at zero over here on the left. That is a wild difference.
And let's break down these numbers a little bit more. Here are some more results from four different tests. So it shows Qwen3 Coder and GPT-5.
They ran the study on both models, but for today's discussion, we're just going to focus on GPT-5. So I have the base model highlighted and the RLM highlighted. Same two models you saw before.
And the two tests I want to bring to your attention are BrowseComp and OOLONG-Pairs. We already saw OOLONG-Pairs before. And the interesting thing about OOLONG-Pairs is that the task length, aka the data set, is pretty small: 32k.
Yet the RLM crushes the base model, 58 versus 0.4, which is very interesting because, like I talked about in the beginning, the RLM was supposed to deal with huge context window length issues, right? Yet when we look at a smaller task, it's still way more effective. Now, if we look at BrowseComp, we're on the other end of the spectrum and we're dealing with a massive task length, right? 11 million tokens.
That's about 40 times the context length, essentially. Quick math there. The base model can't even deal with it.
And of note, what this test is: it's kind of like needle in the haystack, except instead of one document, it's 100 documents, and it not only needs to find things inside those documents but also synthesize them to get the answer. So, the base model can't handle it, right? Can't handle it at all.
Yet the RLM scored 91. So what do we see here from the results? Right?
We see a system that can not only handle huge tasks but also perform more effectively on very complicated tasks at small token lengths, right? Crazy stuff. So this then begs the question: what the heck are RLMs and how do they work?
So this is the RLM. It's pretty self-explanatory. So I'll just leave this on the screen for like 15 seconds and then we'll just continue on.
I'm just kidding. This is actually very complicated and convoluted at first, right? The way they describe it is right here.
Right? Again, very, very self-explanatory. A recursive language model treats prompts as part of the environment.
It loads the input prompt as a variable inside a Python REPL environment and writes code to peek into, decompose, and invoke itself recursively over programmatic snippets of the variable. Duh. Easy, right?
Light work. But it's actually simpler than it sounds. And let me explain it.
And as I walk through this explanation, let me caution you: you're going to be very confused at first. Yet by the end, you're just going to be like, "Oh, duh, obviously."
And I guarantee you there'll be like 20 people in the comments like, "This is how I already do it." So you'll understand by the end. So first things first, they're doing this inside of a Python environment.
What does that mean? For all intents and purposes, they're doing the study on a computer that's running Python, and GPT-5 can interact with Python and have Python run code for it. Okay, so in this example, we have GPT-5 over here, 272k context.
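Now, to make that concrete, here's a minimal sketch of what "the prompt lives in a Python environment" could look like. This is purely my own illustration, not code from the paper: the variable names, the `exec`-based REPL, and the stand-in document are all assumptions.

```python
import io
import contextlib

# Stand-in for the huge prompt; in the real setup this would be the
# full million-token document loaded from disk.
doc = "CHAPTER 1\nNapoleon entered.\n" * 1000

# Namespace the model's generated code executes against.
env = {"doc": doc}

def run_model_code(code: str) -> str:
    """Execute model-written Python and return only what it prints.
    The model's context only ever receives this small string, never doc."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue()

# The model's first "peek" at the environment might be:
print(run_model_code("print(len(doc))"))
```

The key design point is that the model's context window only ever receives the small strings its code chooses to print back, never the document itself.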
It's our primary large language model. We want to have GPT-5 answer some questions about our very large document. In this case, we're going to say we want to answer questions about War and Peace, a huge novel that we are going to say is 1 million tokens.
Obviously, it's probably in the training data, but assume it's not. And I can't just take this million-token document and shove it into ChatGPT. I can't just drop it into the chat window. That won't work.
So, how do I actually get accurate answers from this document? This is where the RLM system comes in. So, as they stated, we are going to programmatically figure out what's in that document first.
What does that mean? That means our system, our GPT-5, is going to have Python run some code to give us some information about this document.
So what is that going to look like? Essentially, what the model is going to do in this RLM system is get sort of a reconnaissance of the huge document. It's going to see what the lay of the land is and how it can sort of piecemeal this.
So in our case, our main prompt is, "Hey, I want you to break down every interaction involving Napoleon in War and Peace." So our system's just going to run some Python and get some information. In this case, what it wants to know is: what's the length of this book, and can we break it down into component parts?
That is literally what you see in blue. That's one line of code, so seven tokens have been used.
And what we get back from Python isn't the whole document. We just get the information we need. So, it's going to tell us, hey, you're dealing with 2.4 million characters. And by the way, you wanted the first thousand words in the document? Well, here they are.
So, it realizes: here's how big it is, here's the first couple pages. Okay, it has chapters.
Well, let's break it down into chapters and then figure out which chapters include Napoleon in there. Okay, it's doing this just through Python. That's just a couple lines of code.
And Python is going to bring back that information. And that information is what you see here, right? It's going to say, "Hey, in that book I found Napoleon in chapters 3, 7, 12, 15, 18, 24, blah blah blah."
So when we say these RLMs interact programmatically with the document, that's what we're saying. GPT-5 created the code, Python ran the code, and now it's gotten back the information that Napoleon is in these chapters.
Now what happens, right? Imagine you're GPT-5: you know what's in that book, but you can't take all that information in. You can't even take it in chapter by chapter.
That's still too much. So what do we do? Well, this is where the recursive part of the recursive language model comes in.
And that's just a fancy way of saying our large language model is going to spawn little sub-agents, right? You see GPT-5 mini, that's the recursive section, right? It's just going to do tool calls to smaller versions of itself to handle all this chapter information.
So I need to know about Napoleon in chapter 3, yet I can't put all of chapters 3 and 7 and 12 in here. So instead, we're going to take chapter 3 and we're going to give it to GPT-5 mini down here, right, as a tool call.
So GPT-5 mini is now looking at all of chapter 3. It gets the answer: Napoleon did this in chapter 3. And it sends that answer back.
We then repeat this process over and over and over. So I need chapter 7. Okay, GPT-5 mini number two, you're up.
Here's chapter 7. Figure out what the answer is. Send it back to me.
And it does this over and over again, and you eventually get your answer. And so what we've done, right, and I said this actually isn't that complicated: all we've done is we've taken this giant document and we're just chunking it up, essentially, right?
We're smartly chunking it up because of that code we ran at the beginning, but we're just chunking it up and giving each chunk to, essentially, a sub large language model that we've used as a tool call. That's it. That's recursive language models.
We're just offloading the context to some little mini agent that we're using as a tool call. That's it. That's RLMs in a nutshell.
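A minimal sketch of that delegation step, assuming a hypothetical `call_mini` function standing in for a real GPT-5 mini tool call (faked here so the sketch actually runs):

```python
# Each relevant chapter is handed to a smaller model as a tool call;
# only the short answers come back to the root model.

def call_mini(question: str, chunk: str) -> str:
    # Stand-in for an LLM call. The point is the shape:
    # small prompt in, small answer out.
    lines = [ln for ln in chunk.splitlines() if "Napoleon" in ln]
    return " ".join(lines) if lines else "No mention."

# Chapters the reconnaissance step flagged (illustrative contents).
chapters = {
    3: "Napoleon reviewed his troops.",
    7: "Napoleon wrote to the Tsar.",
    12: "Pierre watched the battle.",
}

question = "What does Napoleon do in this chapter?"

# The root model never ingests the chapters; it only collects answers.
answers = {n: call_mini(question, text) for n, text in chapters.items()}
for n, a in sorted(answers.items()):
    print(f"Chapter {n}: {a}")
```

Each sub-call gets a fresh, small context, so no single model ever sees the full million tokens.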
Okay. So if you've ever used a sub-agent in Claude Code to do something on your behalf because you didn't want all that context to go into the main context window, you've kind of been using some of the fundamentals of these RLM systems, right? That's all it is.
And this is just one recursion layer deep. Okay? And they only did one recursion layer deep in the study.
And they talked about in theory they would like to see people go deeper. And what do I mean by that? Well, imagine, hey, remember I said we have chapter 3.
Well, let's say I gave chapter 3 to this guy. What if chapter 3 is too big? Well, if chapter 3 is too big, we could just repeat this process with this guy, where it does, like, act one of chapter 3, act two, act three, and each of those becomes another mini, right?
And it just repeats the same process. It's just LLMs all the way down. So, all that to say, that is how the recursive language model works.
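Going deeper than one layer would look something like this sketch: recursively split whenever a chunk is still over budget. Again, `call_mini`, the character budget, and the midpoint split are all my own illustrative assumptions, not the paper's implementation:

```python
BUDGET = 50  # pretend context budget, in characters

def call_mini(chunk: str) -> str:
    # Stand-in for an LLM call, faked so the sketch runs.
    return f"summary of {len(chunk)} chars"

def rlm_answer(chunk: str) -> str:
    if len(chunk) <= BUDGET:
        return call_mini(chunk)              # fits: one sub-call handles it
    mid = len(chunk) // 2                    # too big: split and recurse
    left = rlm_answer(chunk[:mid])
    right = rlm_answer(chunk[mid:])
    return call_mini(left + " | " + right)   # aggregate the sub-answers

print(rlm_answer("x" * 200))
```

A 200-character input here recurses two layers deep before the pieces fit the budget, which is exactly the "LLMs all the way down" idea.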
We have a big document. We have our large language model. We use code to figure out how we can break down this huge document, and then we spawn miniature versions of our large language model, GPT-5 mini in the case of the study, to handle those individual sections and give us the answers.
It gets aggregated, and that way our guy over here is able to handle these huge prompts without ever actually ingesting them, right? Because he's kind of a liar: he just has everyone else do his work for him and then claims he did it at the end. That's RLMs.
And that is essentially what this picture and these next few paragraphs are explaining. That's all it is. Which is why I guarantee you someone in the comments will be like, "I already do that."
Yeah. Okay. But for the rest of you, hopefully that kind of made sense.
And hopefully that sort of explained why it's able to do these things so well: it never has to deal with context rot, because we just keep splitting the context up so many times amongst so many, call them sub-agents, call them whatever you want, right? We've essentially spread this giant 1-million-token document so thin that everyone can work on it effectively. And the way they describe that concept in here is they say the long prompts, aka the large documents, should not be fed into the neural network, aka don't just dump them into the large language model, but should instead be treated as part of the environment that the large language model can symbolically interact with. Right?
When we say it's part of the environment, it's just something the model knows exists, right? It can interact with this thing. It just doesn't pull it all in.
That's what this paper is saying. And the paper goes on to make these observations that essentially reinforce everything I just told you, namely that they can scale to 10-million-token regimes and outperform these base LLMs. They also talk about the recursive sub-calling of RLMs providing strong benefits on information-dense inputs.
And I want to take a second to talk about this because I think at this point you're like, okay, this is awesome. Like how do I actually recreate this? How can I actually take these fundamentals?
I think the idea of sub-calling, right, really in the context of something like Claude Code with sub-agents, is really where you can see something like this work. Again, observation three: the RLM not only scales better on large-context stuff, but it also performs better on small-context stuff if it's very complicated, right? This OOLONG stuff. And again, this paper just goes on and on about this. There's also some interesting stuff showing it tends to actually be cheaper, too. And really, I think this paper and this whole context window stuff is just where this whole space is going, and you're seeing it all over the place. Things like the Ralph loop, things like GSD, it's all about context window management and how that changes your outputs. And right here, what we're seeing is just sort of some of the code stuff that I talked about earlier.
So, I will put a link to this paper inside the description. Definitely check it out. I think the stuff is just really interesting.
Obviously, this field is changing by the day, but definitely read it, take a look at it. Wild stuff. And really start to think about how you can integrate these sorts of ideas into your own workflows.