Good morning, everyone. I'm Mark Gorenberg. I'm Chair of the MIT Corporation and Founder of Zetta Venture Partners.
It's a pleasure to be with all of you here today. I'm especially delighted to have the honor to welcome Yann LeCun to MIT. Yann is the Chief AI Scientist at Meta, professor at NYU, and one of the true pioneers of modern artificial intelligence.
His work on convolutional networks and deep learning has transformed how machines see and learn, and how they listen and understand the world. And that earned him the Turing Award, which is essentially the Nobel Prize for Computing, in 2018 alongside Geoffrey Hinton and Yoshua Bengio. By the way, if you go to Wikipedia, you'll see Yann has received so many honors and awards that it would take our entire session if I walked through them.
So I'll send you there instead. Over the next half hour, we'll explore some of the foundations of this field and its future directions in what promises to be a fascinating discussion. So Yann, thank you for being here to open the MIT Generative AI Impact Consortium Symposium.
A real pleasure. So in your life, frankly, you've always been ahead of the conventional wisdom. As far as I can tell, you've always been right.
And let's talk about that a few times in this half hour. So, if we can go back and start at the beginning. So you did your PhD back in 1987 at what is now the Sorbonne.
I don't speak French, but in English, it translates to "connectionist learning models." It essentially established the backpropagation learning algorithm for neural nets. And frankly, at the time, most people were chasing things like expert systems.
So what inspired you to have the idea, and how did this propel your career? So maybe it was naiveté or ignorance. When I was an undergrad, I discovered, a little by accident, that people in the '50s and '60s, including at MIT, had actually thought about the problem of self-organization, which eventually gave rise to early ideas that machines can learn. And I found this idea fascinating because I think biology is really an inspiration for a lot of the engineering feats that we accomplish.
And certainly in nature, everything that lives is capable of adaptation, and everything with a nervous system can learn. And so maybe I thought I was not smart enough to conceive-- or that humans in general were not smart enough to conceive-- an intelligent system, that an intelligent system would have to build itself, and that's kind of what directed me to machine learning. And being naive and ignorant, I didn't know that at the time, the main approaches to AI were not at all interested in machine learning.
It was, as you said, expert systems. People basically transcribing the knowledge of experts into rules and facts and then hoping this would be useful. We're facing a little bit of the same problem today with LLMs where the transcription of human knowledge into a machine that we can interact with or talk to now is done through learning, but it's still very much a bottleneck of transferring knowledge from humans to machines.
So I discovered that this was a very unpopular idea, but I felt very strongly that this was the right thing to do. I had a very hard time finding a PhD advisor because nobody was working on this. So I found a very nice gentleman, Maurice Milgram, who said, well, you look smart enough.
You don't need any funding because you already have some from your engineering school. And he said, I can't help you technically, but I'll sign the papers. Wow. That's awesome.
So you went on to Toronto to work with Geoffrey Hinton. You went on to Bell Labs, you went on to NYU. The industry went through the AI winter in the '90s and into the 2000s, and then in 2013, you joined Facebook, created FAIR-- basically Facebook AI Research.
And the field changed its name from neural nets; the industry started to recognize it as deep learning. So I went back and I watched this seminal talk that you gave at NeurIPS back in 2016 on predictive learning, and you moved the industry again, because the industry was really very focused on reinforcement learning. And you used the analogy of the cake, and you moved the industry to what you termed self-supervised learning.
Can you tell us a little bit about that? Yeah. Well, I think at the time, we're talking 2016, '15, this was just when it was pretty clear that deep learning had revolutionized computer vision, speech recognition, and it was just about to revolutionize natural language processing, but it was still pretty early.
And what the industry was using was supervised learning. At the research level, a lot of people believed-- for example, DeepMind was really completely engaged in this. They believed that the path towards a more powerful AI system was through reinforcement learning.
And I never believed in this because reinforcement learning, which is the main component of this, is incredibly inefficient in terms of the number of trials that a machine has to do. So I showed the analogy of the cake, which actually predates this a little. It was at a symposium at NYU that I organized in 2015 that I first showed this thing, but I tried to brainwash the community in 2016 with it-- Everybody loves cake.
Right. So the analogy is that if you think of AI intelligence as a cake, the bulk of the cake would have to be self-supervised, unsupervised learning, predictive learning, as I called it at the time. And then the icing on the cake would be supervised learning, and then the cherry on the cake is reinforcement learning.
You want to use reinforcement learning as little as possible because it's so inefficient. You don't have a choice, though. In the end, you need to have some sort of way of correcting yourself, but really, it should be a last resort.
And the thing I advertised at the time-- so this is 10 years ago-- is that we should train a machine to capture the internal dependencies of the data without training it for any task, so that it can represent the world. And that only requires observation of unlabeled data. Then, on top of this, using the representation learned by the system,
you can train a system to solve a particular task, or any particular task. So that whole idea of self-supervised learning-- some of us started working on it in the 2000s, but it was under the radar, and the techniques we were using were not that great at the time. We tried to apply this-- I tried to apply this in the late 2000s to video prediction.
So just take a video and try to train a system to predict what's going to happen in the video. And it basically didn't work. But where it did work, beyond our wildest expectations, was for natural language understanding.
So take a sequence of symbols and then try to predict the next symbols, and that works really well. Now why does it work for text or symbol sequences and not for video? And the answer is you can never predict exactly which word comes after a sequence of words, but you can predict a distribution over all possible words in a dictionary-- or tokens, if you want to call them that.
Because there's only a finite number of them, so it's easy to represent a distribution. But when it comes to predicting the future of a video, there's so many plausible futures in a video that it's basically impossible to represent all of those possibilities. If I take a video of this room and I pan the camera and I stop here, and I ask the system to complete the video, there's no way you can figure out what everybody looks like here, or how many people are sitting and what the size of the room is.
It certainly cannot predict the texture of the ground and things of that type. So there are things that are just completely unpredictable. And if you train a system to try to predict all those details, you're basically killing it.
You're not-- it's not going anywhere. So it took us a number of years, going back maybe five years, to realize this was never going to work and we had to invent new techniques. Well, so-- self-supervised learning, the seminal work you did there, together with the transformer paper, I mean, that's the basis for almost all LLMs today.
And so it's really changed the whole industry. And in fact, if we fast forward to today-- so the world changed with ChatGPT at the end of 2022, and then you came out with Llama in early 2023. Now you have something like a billion monthly users for Meta AI based on Llama technology, well over a billion downloads of Llama.
So it has really democratized AI. And I have a confession to make, which is that I was not involved in Llama at a technical level. This was-- the first Llama was actually a bit of a pirate project.
Oh, interesting. In parallel with a more official LLM project at Meta in late 2022-- or mid-2022. And this was a small group of about a dozen people in Paris who just decided that they wanted to have a lightweight, efficient LLM, and they built it.
And that became the workhorse eventually early 2023, and led Mark Zuckerberg to create the GenAI organization, which is now called Meta Superintelligence Lab, to basically productize it. But at a technical level, I personally had very little to do with it. But it is always the skunk works that emerges anyway.
Oh, absolutely. And the CapEx going in from these big companies now is $323 billion this year, I think, across the top four companies, including Meta. Yet even with all that success, you've said that LLMs are effectively-- I'll paraphrase-- a dead end for human-level intelligence.
Can you clarify why all this scaling can't solve the problem? Well, so that connects with what I just talked about. Here is an interesting little calculation you can make.
A typical LLM-- so something like Llama 3-- is trained on the order of 30 trillion tokens, 3 times 10 to the 13. A token is typically 3 bytes, so that's about 10 to the 14 bytes to train a typical LLM. It's more now because people use synthetic data and everything.
That's for the pre-training phase. It would take any of us on the order of 400,000 years, or half a million years, to read through that material. It's all the publicly available text on the internet.
Now, compare this with the amount of information that gets to the visual cortex in the first four years of life. A four-year-old has been awake a total of 16,000 hours, roughly. And there's about 1 byte per second going through our visual cortex through each fiber of our optic nerves, and we have 2 million fibers.
So that's about 2 megabytes per second, times 16,000 hours-- that's about 10 to the 14 bytes. A four-year-old has seen as much data through vision as the biggest LLMs trained on all the publicly available text. And that tells you, first of all, that we're missing something big, and that we need AI systems to learn from natural, high-bandwidth sensory data like video.
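The back-of-envelope arithmetic here is easy to reproduce. Below is a sketch using the round numbers quoted above; the reading speed is an added assumption, so the reading-time estimate is order-of-magnitude only.

```python
# Back-of-envelope comparison: LLM pre-training data vs. a child's visual input.
# All constants are the round numbers quoted in the conversation.

# Text side: a Llama-3-scale pre-training corpus.
tokens = 30e12                 # ~30 trillion tokens
bytes_per_token = 3            # rough average for common tokenizers
text_bytes = tokens * bytes_per_token           # ~9e13, i.e. about 10^14 bytes

# Assumed reading speed (not from the conversation): ~250 tokens per minute,
# around the clock. This lands in the ballpark of the figure quoted above.
years_to_read = tokens / (250 * 60 * 24 * 365)  # ~230,000 years

# Vision side: a four-year-old's optic-nerve input.
awake_seconds = 16_000 * 3600  # total waking time by age four
fibers = 2e6                   # ~1 million fibers per optic nerve, two eyes
vision_bytes = awake_seconds * fibers * 1       # ~1 byte/s per fiber: ~1.15e14

print(f"text:    {text_bytes:.1e} bytes")
print(f"vision:  {vision_bytes:.1e} bytes")
print(f"reading: ~{years_to_read:,.0f} years, non-stop")
```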
We're never going to get to human-level intelligence by just training on text. This is not happening. Despite what you might hear from some people who are in the cult in Silicon Valley, who will tell you that by next year we're going to have a data center-- "a country of geniuses in a data center"-- I'm quoting, OK?
I won't say who. I mean, this is just not happening. Yeah, you will have useful artifacts that can help people in their daily lives and maybe feel like they have the intelligence of a PhD, but that's because they are regurgitating things they have been trained on, and those systems won't have actual intelligence of the type that we expect.
Not just from humans, but even from your house cat. So, I mean, house cats have an understanding of the physical world that is completely amazing, and they only have 2 billion-- 800 million neurons in their brain, it's not that big. And they certainly have a very good understanding of the physical world.
They can do complex planning of complex actions. And we are nowhere near matching this. So that's what I'm interested in.
How do we bridge that gap? How do we get systems to learn models of the physical world? And that will require new architectures that are not generative.
So I'm telling people: don't work on generative models. They will think I'm crazy, but I really believe this. And as you say, I'm trying to be ahead of the game. Right.
So you're ahead of the conventional wisdom again. You've already started working on it. You call it JEPA.
JEPA, yeah. So tell us, how is JEPA fundamentally different from LLMs? OK, so JEPA stands for Joint Embedding Predictive Architecture, and we've been working on this kind of stuff for five years or so.
I published a long paper, which is kind of a vision paper for where I think AI research should go over the next 10 years. I published this in 2022, it's on OpenReview. And it's called "A Path Towards Autonomous Machine Intelligence" where I basically lay the groundwork for all of this.
And since then, we've been making progress towards that plan with many of my colleagues at Meta and at NYU. And if you type "joint embedding architectures" between quotes on Google Scholar, you'll get on the order of 750 hits. So there's a lot of people working on this, mostly in academia.
People have been very quick to dismiss the contributions of academia because all of AI research is in the hands of industry now. That's false. We don't do that here.
No. But academia basically tends to work on the next-generation thing that industry does not yet realize is going to have a big impact on what they do in five to 10 years. So, OK, what is the difference between JEPA and LLMs, or generative architectures?
Generative architecture: you take a piece of data, let's say a sequence of words of text, and you corrupt it in some way, by removing some words, for example. And then you train a big neural net to predict the words that are missing. In the case of LLMs-- or the GPT architecture in particular-- there's a trick, which is that you don't actually need to corrupt the text.
The architecture is such that it is causal, so to predict a particular word, because of the architecture, the system can only look at the words that are to the left of it. And so implicitly, when you train the system to just reconstruct the input sequence on its output, you implicitly train it to predict the next token. And it's very efficient, it's parallelizable, and everything.
OK, so that's generative architecture. It works because the tokens are discrete, there's only a finite number of them, and you can train the system to produce a distribution over all possible tokens. That's how LLMs are done.
And then you can use it to do autoregressive prediction. You have it predict the next token, shift that into the input, and now you can predict the second token, shift that in, et cetera. That's autoregressive prediction.
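To make that concrete, here is a minimal sketch of both ideas-- causal next-token training and autoregressive sampling-- in PyTorch. The single attention layer and all of the sizes are toy stand-ins; a real LLM stacks many such layers with learned projections.

```python
import torch
import torch.nn.functional as F

# Toy causal language model: a single attention layer, just to show the
# mechanics. All sizes are hypothetical stand-ins for a real LLM.
V, D, T = 100, 32, 8                  # vocab size, embedding dim, sequence length
emb = torch.nn.Embedding(V, D)
head = torch.nn.Linear(D, V)          # hidden states -> logits over the vocab

def causal_lm(tokens):
    x = emb(tokens)                                    # (T, D)
    att = x @ x.T / D**0.5                             # (T, T) attention scores
    mask = torch.triu(torch.ones(len(tokens), len(tokens)), diagonal=1).bool()
    att = att.masked_fill(mask, float("-inf"))         # position t sees only <= t
    x = F.softmax(att, dim=-1) @ x                     # causal self-attention
    return head(x)                                     # (T, V) next-token logits

# Training: reconstruct the input sequence shifted by one. Because of the
# causal mask, this implicitly trains next-token prediction, in parallel.
tokens = torch.randint(0, V, (T,))
loss = F.cross_entropy(causal_lm(tokens)[:-1], tokens[1:])
loss.backward()

# Inference: autoregressive prediction. Sample from the distribution over a
# finite vocabulary, shift the token into the input, and repeat.
seq = tokens[:2].tolist()
with torch.no_grad():
    for _ in range(5):
        probs = F.softmax(causal_lm(torch.tensor(seq))[-1], dim=-1)
        seq.append(torch.multinomial(probs, 1).item())
print(seq)
```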
OK, now what I'm arguing is that you can't do this for video, because if you tokenize video, there are still going to be a lot of things that are just not predictable-- details like what everyone looks like here, that you just cannot predict. So the idea of JEPA is you take your video, but you encode it into a representation space where a lot of the details are eliminated, and then this autoregressive prediction that we were previously doing in input space, now you're doing it in this representation space. Now the trick is-- and the reason why it didn't pop up earlier-- is that training the encoder and the predictor simultaneously is very tricky.
The reason being that it's very easy for the predictor to basically force the encoder to not do anything-- to ignore the input and produce a constant output representation. Then the prediction problem becomes trivial, but it's not a good solution. And so you have to find ways to trick the system into carrying as much information about the input as possible in the representation while, at the same time, eliminating the details that are not predictable. So the system finds a trade-off between carrying as much information as possible about the input, but only the stuff that it can predict.
And that's the basic concept of JEPA. In terms of architecture, there is the encoder, which is different from what you see in LLMs. And the trick is in finding good training algorithms, basically-- procedures or regularizers-- to get the thing to learn interesting representations.
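As a sketch only, the following shows the shape of that trade-off, using a variance-style anti-collapse term in the spirit of FAIR's published VICReg recipe. The linear encoder, the stop-gradient on the target, and the loss weighting are illustrative choices, not the actual JEPA training procedure.

```python
import torch
import torch.nn.functional as F

# Minimal JEPA-style objective (a sketch, not Meta's actual recipe).
# Collapse failure mode: the encoder maps every input to a constant, which
# makes the prediction problem trivial but useless. A variance-style
# regularizer, in the spirit of VICReg, is one published way out.
D = 64
enc = torch.nn.Linear(128, D)    # stand-in encoder (a deep network in practice)
pred = torch.nn.Linear(D, D)     # predictor operating in representation space

def jepa_loss(x, y):
    sx, sy = enc(x), enc(y)                         # representations of x and y
    pred_loss = F.mse_loss(pred(sx), sy.detach())   # predict y's representation
    std = sy.std(dim=0)                             # per-dimension batch std
    var_loss = F.relu(1.0 - std).mean()             # keep embeddings informative
    return pred_loss + 10.0 * var_loss              # weighting is a free choice

x = torch.randn(32, 128)                 # batch of "context" views
y = x + 0.1 * torch.randn(32, 128)       # batch of "target" views
jepa_loss(x, y).backward()
```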
Now, it wasn't clear until fairly recently whether this type of joint embedding method to learn representations of natural data, like images and video, would ultimately be better than techniques that are trained to reconstruct at the pixel level. But at FAIR, we had basically an A-B comparison where a big group was working on a project called MAE, Masked Autoencoder, and a video version of it, which was basically take an image, corrupt it, and then train some gigantic neural net to reconstruct the full image, or the full video. And it didn't quite work.
And in fact, at MIT, you can ask someone about this. Kaiming He. He was one of the principals in this project.
And he was a little disappointed by the results, and in the end, he reoriented his research, left FAIR, and joined the MIT faculty as an associate professor; now he's at CSAIL. In parallel, there were other projects that tried to train joint embedding architectures without attempting to reconstruct-- so non-generative architectures. And they turned out to work much better.
OK. And so it was clear empirical evidence that, for natural sensory data, you just don't want to use generative architectures. And we now also have data showing that those systems surpass even supervised models in performance on images, which wasn't the case until about a year ago.
Wow. Which applications are starting to show that early promise? Well, there is an open-source system put out by some of my colleagues at FAIR in Paris called DINO.
I mean, they pronounce it "deano." It's DINO, but they're French, so they say "dino." And this is DINOv3; the third version just came out a month or two ago.
And so this is basically a generic self-supervised vision encoder-- an image encoder-- which you can use for all kinds of downstream applications, and there are basically hundreds of papers using the DINO system, previous versions and the current one, for all kinds of stuff: medical image analysis, biological image analysis, astronomy, and just everyday computer vision. So I think this is really the self-supervised learning model.
It took a long time, but it finally won the battle, if you want, for image and video representation. Another project that I was more directly involved in is called V-JEPA, which stands for Video JEPA. And it's done by a group of people in Montreal, Paris, and New York that I work with.
And this system is trained from video. So you take a video, corrupt it by masking a big chunk of it, and then you run the full video and the partially masked one through two encoders that are essentially identical. And then you simultaneously train a predictor to predict the representation of the full video from the partially masked, corrupted one.
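Schematically, one training step might look like the sketch below, where the target encoder tracks the other as an exponential moving average-- a common recipe for stabilizing this kind of joint-embedding training. The linear encoders and random features are toy stand-ins for the vision transformers and video patches in the real V-JEPA.

```python
import copy
import torch
import torch.nn.functional as F

# One V-JEPA-style training step, schematically (toy stand-ins throughout).
D = 64
student = torch.nn.Linear(256, D)       # encodes the masked video
target = copy.deepcopy(student)         # the "essentially identical" encoder
for p in target.parameters():
    p.requires_grad_(False)             # no gradients flow into the target...
predictor = torch.nn.Linear(D, D)

video = torch.randn(16, 256)                      # 16 toy clips
mask = (torch.rand_like(video) > 0.5).float()
masked = video * mask                             # corrupt by masking a big chunk

pred = predictor(student(masked))                 # predict from the corrupted view
with torch.no_grad():
    full = target(video)                          # representation of full video
F.mse_loss(pred, full).backward()

# ...instead, the target tracks the student as an exponential moving average.
with torch.no_grad():
    for p_t, p_s in zip(target.parameters(), student.parameters()):
        p_t.mul_(0.99).add_(0.01 * p_s)
```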
OK, that's the first phase of training. We've trained this on an amount of video that corresponds to about a century of video. This is an insane amount-- Century of video.
Wow. Yeah. So it's not as efficient as a four-year-old, clearly, but those systems can basically show that they've learned a little bit of common sense. If you show them a video where something impossible occurs-- like an object spontaneously disappearing or changing shape-- the prediction error goes through the roof.
And so they can tell you that something really unusual occurred, something they don't understand. And that's the first sign of a self-supervised learning system having acquired a bit of common sense. And you're already seeing some early success in robotics?
Yeah, that's right. And so you can have a second phase of training where you fine-tune a predictor which is action-conditioned. And now what you have out of this is a world model.
So what is a world model? Given a representation of the state of the world at time T, and given an action that an agent would imagine taking, can you predict the state of the world resulting from taking this action? That's a world model.
If you have a system with such a world model, you can use it for planning. You can imagine taking a sequence of actions, and then using the world model, predict what the outcome of this sequence of actions is going to be. And then you can have a cost function that measures to what extent a particular task has been achieved.
Have you made coffee, or something like that? And then, using basically what amounts to optimization methods, you search for a sequence of actions that minimizes this objective. This is classical planning and optimal control, except that the model-- the dynamical model of the environment that we use-- is learned through self-supervised learning, as opposed to written down as a bunch of equations, as is done in robotics or classical optimal control.
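As a sketch, that planning loop looks like the code below. The linear world model and quadratic cost are toy stand-ins for the learned dynamics and the task objective, and gradient descent through the model is just one of the optimization methods one could use to search for the action sequence.

```python
import torch

# Planning with a learned world model, schematically. Assumes the world model
# f(state, action) -> next state and the task cost already exist; both are
# toy stand-ins here so the sketch runs.
D, A, H = 16, 4, 10                        # state dim, action dim, horizon

world_model = torch.nn.Linear(D + A, D)    # s_{t+1} = f(s_t, a_t); learned in practice
for p in world_model.parameters():
    p.requires_grad_(False)                # model frozen; we optimize actions only

goal = torch.randn(D)
def cost(state):                           # e.g., "has the coffee been made?"
    return ((state - goal) ** 2).sum()

actions = torch.zeros(H, A, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.1)
s0 = torch.randn(D)

for step in range(200):
    opt.zero_grad()
    s = s0
    for t in range(H):                     # imagine rolling out the action sequence
        s = world_model(torch.cat([s, actions[t]]))
    c = cost(s)                            # evaluate the imagined outcome
    c.backward()                           # gradient descent over the actions
    opt.step()
print("final imagined cost:", c.item())
```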
So this is really what we're after, and we've shown that we can do this with representations of the state of the world either derived from things like DINO, or learned from scratch, or built on top of V-JEPA and V-JEPA 2. And you can show that you can use this to get a robot to accomplish a task zero-shot. You don't have to train it to accomplish this task.
There's no task-specific training whatsoever. No RL. The training is completely self-supervised.
And in the end, the system has a good enough world model that it can imagine how to accomplish a task without ever being trained to accomplish this task. Wow. So I saw that you had one robot that, I think, trained on 62 hours of data on its own, and then it did tasks in a self-supervised way.
So that training of 62 hours was not for a particular task. It was basically, here is the state of the world at time T, here is an action, and here is what the world is going to look like as a result of this action. You can do this with simulated data, with a robot simulator, or with real data where you have a robot arm moving around and you know what action was taken.
So this notion of a world model, which I already talked about in my 2016 keynote at NeurIPS, I think is going to be a key component of future AI systems. And my prediction has been-- and I've not been making friends in various corners of Silicon Valley, including at Meta, by saying this-- that within three to five years, this will be the dominant model for AI architectures, and nobody in their right mind would use LLMs of the type that we have today. Right.
So that'll propel this to be the decade of robotics basically coming up. Yeah. But on that point, there is a large number of robotics companies that have been created over the last few years building humanoid robots.
The big secret of the industry is that none of those companies has any idea how to make those robots smart enough to be useful-- or I should say, smart enough to be generally useful. We can train those robots for particular tasks, maybe in manufacturing and things like this, but for your domestic robot, there are a bunch of breakthroughs that need to arrive in AI before that's possible. So the future of a lot of those companies essentially depends on whether we're going to make significant progress towards those kinds of world model, planning-type architectures.
It's OK, we're not worried about those companies. We're huge believers in entrepreneurship here, by the way. We're going to have a whole new set of companies that come out of these new ideas.
The other thing is that you've been a huge optimist. There's a lot of fearful people about AI today, the ethics of AI, et cetera, and you've been a huge optimist. Why do you think these systems are not going to escape our control?
What's your gestalt on all of this? OK, so the overall architecture of AI systems I've been advocating is one that I call objective-driven. So this idea that you're going to have a system that has some mental model of the world, it's going to plan a sequence of actions to arrive-- to satisfy an objective, to accomplish a task.
And by construction, that system can do nothing else but produce a sequence of actions that optimizes that objective. Now in that objective, you can hardwire guardrails. So let's say you have a new domestic robot and you ask the domestic robot, get me a coffee.
And it gets to the coffee machine, and there is someone standing in front of it; you don't want that robot to slash and kill the person in front of the coffee machine, because its only objective is to fetch you coffee. And by the way, this is an example that people like Stuart Russell use as an example of how you could build dangerous machines. And I've always dismissed this, and he always thought I was stupid.
He actually called me stupid publicly in some interviews, so I'm used to it. A lot of people call me stupid, that's fine. We'll have to have the two of you back together.
That's OK, but go ahead. At some point, perhaps. But the point is, if you put guardrails in the objective functions of the system-- hardwired guardrails-- they can be very low-level.
Like you can have-- I don't know, a domestic robot that can cook, and you can have a very low-level guardrail that says don't flail your arms if there are people around and you have a knife in your hand, things like that. So, we're going to have to design those guardrails, but by construction, the system will not be able to escape those guardrails; it will have to satisfy them. And so I'm not saying that designing those guardrails is going to be an easy task, but we are used to this with humans.
We design laws. Laws are basically objective functions that change the landscape of what actions you can take-- the cost of each action. We make laws to basically align human behavior with the common good.
We even do this with superhuman entities called corporations with limited success, I admit. So, I mean, we're used to this. We've been used to this problem for millennia.
And I would argue, it's not a more complex and challenging problem than, I don't know, designing turbojets that can fly you halfway around the world in complete safety. I mean, we can do amazing feats of this type, so I'm really not worried. I'm not saying it's an easy problem.
I'm just not-- I don't think-- But you're not worried it'll get out and be out of our control? Right. So I have like a hundred more questions to ask you, but we're almost out of time.
I do want to say, you should definitely come back to MIT and spend a lot more time here. I know that you love sailing. We're here, and we have a great crew team.
I'm sure there are some here in the audience, and we're here on the Charles. You and your brother build airplanes. We have the number one Aero & Astro Department, so please come back.
And I know you're a huge fan of jazz. Yes. Right?
And we just built a whole new Music and Theater Arts Building. It's one of the best-- It's hard to beat New York on that dimension. Ah, we're-- we're willing to compete.
But let me ask you the last question and just come full circle. So we started with the PhD you were working on roughly 40 years ago. If you were an MIT PhD student today, what would you work on?
This is a question I get a lot. What would you study if you were an undergrad? What would you work on?
I mean, I think for the last 40, 50 years, the big question for me has been discovering the mysteries of human intelligence. And MIT is very engineering-focused. As an engineer myself, I think the best way to understand something is to build it. Richard Feynman said that, actually, as well.
Well, he didn't mean build a physical artifact; he meant deriving the ideas yourself and appropriating them. And so I think that if you are an ambitious young scientist or engineer, there are three big questions to work on. One is, what's the universe made of?
What's life all about? And how does the brain work? So three scientific questions.
And the engineering side of it, at least for the last one, is: how do we build intelligent machines? What are the essential components that constitute intelligence, and the minimal set of things you need? People are also working on similar questions in biology and synthetic biology and things like that. So certainly I would probably work in this domain, as I decided to do 45 years ago.
That's right. But if I were an undergrad-- people are asking this question-- AI is going to come up, and it's going to do all kinds of stuff at a low level that we may not need to learn anymore.
I think there are things that we shouldn't learn as engineering or science students anymore-- things that have a short shelf life. So the joke I tell is that if you're studying computer science or engineering of some kind, and you have a choice between a course that teaches you about a piece of technology that is currently fashionable-- like, I don't know, mobile app programming or LLM prompting or whatever it is, and I'm sure there are equivalents in various engineering disciplines-- don't take those classes.
If you have a choice between-- There you go. --between mobile programming and quantum mechanics, take quantum mechanics, even if you are a computer scientist, because you will learn about things like, I don't know, path integrals? I mean, it's a general method that is applicable to all kinds of situations.
It's a concept that you can connect with other things. So it turns out that how to best decode the most likely sequence of words in a speech recognition system, for example, is actually a path integral. It's discrete, but it's basically the same concept.
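For the curious, here is the discrete version of that idea on a toy hidden Markov model: the forward recurrence sums over all state paths, which is exactly the structure of a path integral, and Viterbi decoding swaps the sum for a max to pick the single best path. The random parameters are purely illustrative, not a real recognizer.

```python
import numpy as np

# Discrete "path integral" in decoding, on a toy hidden Markov model.
rng = np.random.default_rng(0)
S, T = 3, 6                                  # states (e.g., words), time steps
trans = rng.dirichlet(np.ones(S), size=S)    # trans[i, j] = P(j | i)
emit = rng.dirichlet(np.ones(5), size=S)     # emit[i, o]  = P(o | i)
obs = rng.integers(0, 5, size=T)             # observed acoustic symbols

alpha = np.ones(S) / S * emit[:, obs[0]]     # forward: sum over all paths
delta = alpha.copy()                         # Viterbi: max over paths
for t in range(1, T):
    alpha = emit[:, obs[t]] * (trans.T @ alpha)              # sum over predecessors
    delta = emit[:, obs[t]] * (trans.T * delta).max(axis=1)  # best predecessor only
print("P(observations) =", alpha.sum())
print("best-path score =", delta.max())
```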
I mean, there's basic theoretical concepts like this that turn out to be abstractions that are applicable to a wide variety of things. So take those challenging courses, and that will put you on a good path. And maybe future AI assistants will take care of the low level.
And so think of yourself. Think of the situation currently of a PhD advisor with a pool of PhD students, and the big secret is that the students teach the advisors, not the other way around. You will be in the same situation as a student, where during your PhD, you will have a staff of virtual people working for you-- AI assistants working for you.
It will move your own level of abstraction a couple levels up so that there are a lot of things you won't have to take care of. It used to be that you could do a PhD sequencing DNA. Not necessary anymore, we have machines to do this.
It used to be that you could have a career as a mathematician calculating logarithm tables and trigonometry tables. Not necessary anymore, we have calculators and computers. Or solving differential equations symbolically by hand, we solve them numerically.
So there's a lot of-- I mean, it's just a natural continuation of technological progress that humanity moves up the hierarchical ladder and leaves the low-level stuff to machines. Well, the future is going to be very exciting. Yann, thank you for such an inspiring set of comments today.
Thank you so much. Thank you.