All right. I'm going to talk a bit about where I think AI is going over the next few years, covering some of the research I've personally been involved in over the last several years. First, I want to say that regardless of what your interest in AI is, at some point we're going to need human-level AI systems, because they're going to assist us in our daily lives, maybe with us at all times. And the best way to interact with such a system is if it has an intelligence that's similar to human intelligence, because we're familiar with interacting with other humans. That was depicted in the movie Her, the Spike Jonze film from 2013. Eventually, at least if Meta's vision is fulfilled, we'll be interacting with those agents through smart devices like smart glasses, augmented-reality or mixed-reality glasses.
But for this, we need technology that doesn't exist yet. We need systems that understand how the world works, that can remember, reason, and plan, and we're nowhere near this at the moment. On the hardware side, we're making progress toward devices that give us assistants at all times. Currently they live in our smartphones, which is somewhat inconvenient. If they could display information in smart glasses, and we could talk to them or interact with them through an electromyography (EMG) bracelet that lets us point, click, and type with our hands in our pockets, that would be much more convenient. That is coming. Also coming over the next decade or so are practical hardware platforms for things like humanoid robots and domestic robots. But we don't yet have the technology that would make those assistants and domestic robots sufficiently smart, so that needs to make significant progress over the next five years, roughly.

And the problem is that machine learning sucks. In terms of sample efficiency, in terms of acquiring new skills quickly, machine learning is nowhere near the capabilities we observe not just in humans but in most animals. Animals can learn new tasks extremely quickly and understand how the world works. Many of them can certainly reason and plan. They have what we would qualify as common sense. And their behavior is not driven by the statistics of their training data; it's driven by objectives that were hardwired into them by evolution. So we have some progress to make.

Now let's talk about what current AI technology, the kind everybody talks about, actually does: autoregressive large language models. We call them LLMs, or chatbots, but they're really autoregressive LLMs. They are trained to predict the next word that follows a text; many of you already know this. You take a text and train a system to predict the word that follows it.
Because of the architecture of the network trained to do this, you can train it in parallel. In effect, you train these systems like an autoencoder: the system reproduces its input on its output, but it cannot see a particular input token when producing the corresponding output token; it can only look at the tokens to its left. This is called a causal architecture. In effect, you're training it to produce the word that follows a text, and you're doing this for every word in the text in parallel, which is very efficient. So it scales very well. You can scale these systems to hundreds of billions of parameters, and some sort of emergent property comes out: those systems seem to have some level of understanding of the underlying reality, but it's pretty shallow.

Once you've trained a system of this type, you can have it produce text autoregressively: it predicts the next word that follows a prompt you give it, then you shift that word into its input, shift everything by one, and it can predict the second word; shift that in, the third word, and so on. That's autoregressive prediction. It's not a new concept; it's been around since the 1950s, if not the 1940s.
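To make the mechanics concrete, here is a minimal sketch of that shift-and-predict loop. Everything here is a toy stand-in (the vocabulary, the "model"); it is not any real LLM's API, just the loop structure:

```python
# Toy sketch of autoregressive decoding: one forward pass per token, then shift.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_logits(context: list[int]) -> np.ndarray:
    """Stand-in for one forward pass of a causal model: fixed compute per call."""
    logits = rng.normal(size=len(VOCAB))   # toy scores; a real model would
    logits[context[-1]] -= 10.0            # actually condition on the full context
    return logits

def generate(prompt: list[int], n_tokens: int) -> list[int]:
    seq = list(prompt)
    for _ in range(n_tokens):              # same amount of computation per token,
        logits = next_token_logits(seq)    # regardless of question difficulty
        probs = np.exp(logits) / np.exp(logits).sum()
        seq.append(int(rng.choice(len(VOCAB), p=probs)))  # sample, shift in, repeat
    return seq

print(" ".join(VOCAB[i] for i in generate([0], 8)))
```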
But there is a major limitation: there is no way for the system to really reflect on something. It spends exactly the same amount of computation for every word, every token, it produces. Ask it a yes/no question like "does 2 + 2 equal 4?" and it spends a certain quantity of computation to answer. Now ask "does P equal NP?" and it spends exactly the same amount of computation before producing a yes or no, which will most likely be wrong, or at least unjustified if it happens to be right. That's not right: we tend to spend a lot more time on complex questions than on simple ones.

There is another problem. Consider the space of all possible sequences of tokens; think of it as a tree. For a given question, there is a subtree of acceptable answers, and every token the system produces carries some probability of taking you out of that subtree. Now make the incredibly strong assumption, which is of course false, that this probability is constant and independent across all the tokens of the answer, as a first approximation. Then the probability of staying within the subtree decreases exponentially with length, and drift becomes almost inevitable: the longer the output these systems produce, the more likely it is to become wrong or irrelevant. This is a very informal argument, but there are papers that study it more formally.
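A back-of-the-envelope version of the argument, under that same (admittedly false) assumption of a constant, independent per-token error probability e:

```python
# P(whole n-token answer stays in the acceptable subtree) = (1 - e)^n
for e in (0.01, 0.05):
    for n in (10, 100, 1000):
        p_ok = (1 - e) ** n
        print(f"e={e:.2f}  n={n:4d}  P(stay acceptable) = {p_ok:.5f}")
# Even at e=0.01, a 1000-token answer stays acceptable with probability ~4e-5.
```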
So we're missing something really big, even bigger than the small issues I just mentioned. Never mind humans: cats and dogs can do amazing things that we cannot come close to reproducing with our AI systems. Any house cat can plan highly complex actions. Just observe a cat standing at the bottom of a pile of furniture, thinking about how it's going to jump to the top: they clearly plan. In humans, any 10-year-old can clear the dinner table and fill the dishwasher without being trained to do it; the first time you ask, he or she can do it. That's called zero-shot: solving a new task without training ourselves to do it. Any 17-year-old can learn to drive a car in about 20 hours of practice. Yet we don't have domestic robots, and we don't even have self-driving cars, despite having millions of hours of training data of cars driven by expert humans. We could train a system to simply emulate the human driver, but when we do, we don't get a system that is nearly as reliable as a human, and it cannot handle very unusual situations.

So we have systems that can pass the bar exam, solve complex math problems, prove theorems, and beat us at chess, Go, poker, diplomacy, and whatever else, but we don't have level-five self-driving cars and we don't have domestic robots. That's another example of Moravec's paradox. Moravec was a roboticist, and he asked: how come the things we think of as sophisticated intellectual tasks, like playing chess or planning a path from one city to another, can be solved by computers, often better than humans can solve them, while the things we take for granted, just dealing with the real world and acting in it, we don't know how to do with machines?

That points to an interesting observation. A typical LLM today, like Llama, is trained on the order of 30 trillion tokens, that is, 3 × 10¹³ tokens.
A token is a kind of subword unit, worth roughly three quarters of a word, and about three bytes in Llama. So the complete data volume is 0.9 × 10¹⁴, essentially 10¹⁴ bytes. That's essentially the entirety of the publicly available text on the internet. It would take any of us on the order of 400,000 years to read, at 12 hours of reading a day, which obviously we would not survive. Now consider a human child. A four-year-old has been awake a total of about 16,000 hours; that's what psychologists tell us. (Which, by the way, is not a big amount of video: it's about 30 minutes' worth of YouTube uploads.) We have two million optic nerve fibers, one million per optic nerve, going to the visual cortex, and each fiber carries about one byte per second, give or take. Do the arithmetic, and the result is about 10¹⁴ bytes.
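As a quick check of that arithmetic, with the figures as quoted in the talk (all approximate; the reading-speed assumption of ~250 words per minute is mine):

```python
# Text side: Llama-scale training corpus.
tokens = 3e13                                   # ~30 trillion tokens
text_bytes = tokens * 3                         # ~3 bytes/token -> ~0.9e14 bytes
words = tokens * 0.75                           # a token ~ 3/4 of a word
reading_years = words / (250 * 60 * 12 * 365)   # 250 wpm, 12 h/day (assumption)
print(f"text: {text_bytes:.1e} bytes, ~{reading_years:,.0f} years to read")

# Vision side: a four-year-old child.
seconds_awake = 16_000 * 3600                   # 16,000 waking hours
fibers = 2e6                                    # optic nerve fibers, both eyes
visual_bytes = seconds_awake * fibers * 1       # ~1 byte/s per fiber, give or take
print(f"child vision: {visual_bytes:.1e} bytes")  # ~1.15e14 -- comparable to the text
```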
So in four years, a child has seen as much raw data as the biggest LLMs get from all the publicly available text on the internet, in 100,000 times less time than it would take to read that material. What that tells you is that we're not going to get to human-level AI by training on text. It's just not happening. Despite what Dario Amodei says about having super-intelligent, PhD-level assistants by next year or the year after, it's not going to happen. It's absolutely not going to happen, unless you redefine what it means to be intelligent. If you reduce or constrain the definition of intelligence to being able to solve math problems, certain math problems, and other things of that type, then yes, maybe. But that's not full intelligence. There's one immediate consequence: we're not going to have super-intelligent systems anytime soon, and we're not all going to die next year.

So how do babies learn how the world works? Mostly by observation in the first three months of life, because they can't really interact with the world beyond their own limbs. They certainly build an active model of their own body, because they can move their limbs, but the model of the external world comes mostly through passive perception. They can move their eyes, obviously, but they can't really grab objects; that comes later. Things like gravity and inertia, what we call intuitive physics, are not acquired by infants until the age of about nine months. Before nine months, show a six-month-old the scenario at the bottom here, where an object sits on a platform, you push it off, and it appears to float in the air: the six-month-old will barely pay attention. A ten-month-old will react like the little girl here, because in the meantime the baby has learned that objects that are not supported are supposed to fall. And whenever our mental model of the world is violated, we pay attention, because of course it could be dangerous.
So that gives us a list of what we want advanced machine intelligence systems to have. We want systems that learn world models from sensory input. Why sensory input? Because it's much higher bandwidth than text, and you don't need humans to produce it; you can have as much data as you want. A task like learning intuitive physics from video would be a very good thing to get machines to do, and I'll show some examples of that in a few minutes. We want systems with persistent memory. We want systems that can plan complex actions so as to fulfill an objective, and for that you need a mental model of the world, because you need to be able to predict the consequences of your actions. We need systems that can reason, obviously, but that's really much the same thing as manipulating a mental model. And we need systems that are controllable and safe, which is not the case for LLMs. Some of my colleagues don't like me saying this, but LLMs are intrinsically unsafe in a way, because they just produce answers that satisfy the statistics of their training data; there is no direct way of controlling what they say. They're still useful, there are still a lot of interesting applications you can build with them, and we should absolutely work on them in a big way, but they're not going to take us to human-level AI.
So first of all, what type of inference should an AI system be able to do? As I explained earlier, an LLM produces an answer by simply propagating through a stack of layers and emitting a token, and that's not computationally sufficient for intelligent behavior, reasoning, or planning. What you want is to produce an output through an inference process. Imagine you have an observation and a system to which you propose a potential output, and the system gives you a scalar that measures to what extent that output is compatible with the input, or rather, to what extent it is incompatible. A system of this type can perform inference by optimization: it searches for an output that minimizes a particular objective function. The square marked "objective" on the right here is a function with scalar output (not represented) that tells you to what extent the output is incompatible with the input: zero if it's compatible, larger if it's not.

This type of inference is very classical. It's what classical AI has used for a long time: reasoning in classical AI, and in optimal control and robotics, is performed by optimizing a function with respect to the output. You search for an output that minimizes some function. That's what you do when you compute a shortest path, when you plan a trajectory from one city to another, when you solve a SAT problem; basically, most computational tasks can be reduced to an optimization problem. And it's exactly what probabilistic graphical models and Bayesian networks were doing: when you try to figure out the optimal value of a latent variable whose value you don't know, you're minimizing a negative log-likelihood, or a free energy if you're in the log domain. So this is very classical, and it's the type of process psychologists would call System 2. In psychology, there is System 1 and System 2.
System 1 covers the tasks you can accomplish without thinking about them; you become so used to them that they become automatic and subconscious. If you're an experienced driver, you don't need to think about driving, you just drive. System 2 is the mode where you recruit the full power of your mind and your mental model of the world to plan an action, potentially imagining all kinds of catastrophe scenarios and avoiding them. Compare a person driving a car for the first time with an experienced driver: the first uses System 2, the second System 1.

So how do we formalize this a little? It's not really complicated theory. The way you build such a function, one that measures the degree of incompatibility between an input and an output, can be captured by the notion of energy-based models. An energy-based model is a weaker form of probabilistic modeling, if you want. You have a function F, called an energy (or free energy), that depends on two variables.
In this little diagram, they are scalar variables: x is the one you observe, and y is the one you're supposed to infer. The dependency between x and y, the relation that produces y from x, is not a function, because you may have multiple values of y that are compatible with a single value of x. So you have to represent it as an implicit function, and that's what this energy function does: F(x, y) is zero when y is compatible with x, and takes a larger positive value when y is not compatible with x. If you build the thing properly, the energy landscape is smooth, so that given a value of x, you can easily find a value of y that minimizes the energy through optimization, perhaps using gradient information. To train a system like this, you need to learn the parameters of the energy function, which could be some big neural net, in such a way that it gives low energy to compatible examples of x and y, and high energy to everything else. That's the hard part; I'll come back to it later.
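Setting training aside for the moment, the inference half is easy to sketch. Here is a minimal toy version of inference-by-optimization: given x, descend the energy landscape over y. The quadratic energy is a hypothetical stand-in for a trained network:

```python
# Inference by optimization over y, for a fixed observation x.
import numpy as np

def energy(x: np.ndarray, y: np.ndarray) -> float:
    """Low when y is compatible with x; toy case where y should equal 2x."""
    return float(np.sum((y - 2.0 * x) ** 2))

def grad_y(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return 2.0 * (y - 2.0 * x)          # analytic gradient of the toy energy w.r.t. y

def infer(x: np.ndarray, steps: int = 100, lr: float = 0.1) -> np.ndarray:
    y = np.zeros_like(x)                # start from an initial guess
    for _ in range(steps):              # inference = descending the energy landscape
        y -= lr * grad_y(x, y)
    return y

x = np.array([1.0, -3.0])
y_star = infer(x)
print(y_star, energy(x, y_star))        # -> approximately [2, -6], energy ~ 0
```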
Now, assuming you have such an energy function, how would you structure it internally, and how would you use it for reasoning and planning? Here's an example block diagram of the internal structure. An observation on the left goes through a perception module, some big neural net that produces a representation of the state of the world as you perceive it. You might want to combine this with the contents of a memory, which represents everything else you know about the world but are not currently perceiving. Together, that gives you an estimate of the state of the world. You feed this to your world model, which is the centerpiece of the architecture. The world model also takes a proposal for an action sequence to accomplish, and, given the state of the world and the action sequence, it predicts the state of the world, or rather the representation of that state, after the action sequence has been carried out. You can feed this predicted state to a few objective functions. One is a task objective, which measures to what extent you're accomplishing a task; the task could be specified through another input to that objective function. Then there is perhaps a set of other objectives acting as guardrails, which guarantee that the sequence of actions to be taken is not dangerous, or respects boundaries of some kind. The way the system operates is that, given an input percept, it searches, through optimization, for an action sequence that minimizes the objectives and the guardrails. You can think of the guardrails as constraints that need to be satisfied, as opposed to costs, or you can view them as cost penalty functions if you prefer. So that's an example of inference by optimization.

Now, in classical control theory, a world model is something that, given the state of the world at time t and an action you might take at time t, gives you the state of the world at time t + Δt.
Say your world model is some differential equation that governs the dynamics of a robot arm, or a rocket, or something like that. You might want to run your world model multiple times, unfolded in time as represented here. Here you have a sequence of two actions: you feed it to your world model, but at the second step the model takes as input the predicted output from the previous time step. The guardrail cost can be applied to the entire trajectory, not just the final state, and the same goes for the task cost. This type of architecture is very classical in optimal control; this kind of operation is called model predictive control. You plan a sequence of actions such that, according to your world model, a particular objective will be fulfilled, and you do this by optimization.

In reality, the world is not completely deterministic, so you may have to do this in the presence of uncertainty. The extra variables here, the latent variables, can be thought of as representing everything you don't know about the environment, everything that makes it nondeterministic, if you want. You draw those variables from some distribution, or optimize over them in some way, to be able to do the planning. That makes everything much more complicated, so we're going to ignore it for the time being.
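Here is a minimal sketch of that model-predictive-control loop: unroll the world model over a candidate action sequence, score it with a task cost plus a guardrail penalty, and optimize. The point-mass dynamics, the costs, and the random-shooting optimizer are all toy stand-ins, not the talk's actual setup:

```python
# Toy MPC: plan by scoring rollouts of a world model under task + guardrail costs.
import numpy as np

rng = np.random.default_rng(0)

def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    return state + 0.1 * action                     # s(t+dt) = f(s(t), a(t)), toy version

def trajectory_cost(state, actions, goal, obstacle):
    cost, s = 0.0, state
    for a in actions:                               # unroll the model in time
        s = world_model(s, a)
        cost += 10.0 * max(0.0, 0.5 - float(np.linalg.norm(s - obstacle)))  # guardrail
    return cost + float(np.linalg.norm(s - goal))   # task cost on the final state

def plan(state, goal, obstacle, n_candidates=500, horizon=10):
    best, best_cost = None, np.inf
    for _ in range(n_candidates):                   # crude optimization: random shooting
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        c = trajectory_cost(state, actions, goal, obstacle)
        if c < best_cost:
            best, best_cost = actions, c
    return best, best_cost

_, cost = plan(np.zeros(2), goal=np.array([1.0, 1.0]), obstacle=np.array([0.5, 0.5]))
print(f"best trajectory cost: {cost:.3f}")
```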
Ultimately, though, what you want is not planning at a single level, because that could be extremely hard, or outright impossible, because you just don't have the information. It turns out I have to be in Paris tomorrow; it's a true story. To be in Paris tomorrow morning, I cannot plan my entire trip in terms of millisecond-by-millisecond muscle control, which is really the lowest-level action a human can take. Instead, I have to plan at a very high, abstract level, where I don't need all the information to do the planning. I don't need to know much to know that to go to Paris, I first need to get to the airport and catch a plane. Getting to the airport now becomes my subgoal. How do I get to the airport? Well, the example I have is from my office at NYU: if I'm in New York, I just go down to the street and hail a taxi; you can do that in New York. Now, I'm sitting in my office at NYU. How do I get to the street? I walk to the elevator, press the button, walk out of the building. How do I get to the elevator? I stand up from my chair, pick up my bag, open the door, walk to the elevator avoiding potential obstacles on the way, and so on. And at some point in the hierarchy, I can just act. I don't need to know more than what I currently perceive, and I don't need to plan, because I'm used to standing up from my chair; I can just accomplish the task. Almost everything we do, we do with this kind of hierarchical planning, and it's true of animals as well.

Here's the thing, though: we have no idea how to do this with machines. We have no idea how to train a machine so that it has a world model that supports this type of hierarchical planning. Robots do hierarchical planning, but the levels are hardwired by hand, something like the toy sketch below. How do we train a system to do this from examples?
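To make the "levels hardwired by hand" point concrete, here is a toy hand-coded hierarchy for the Paris story. The goal names and the decomposition table are made up for illustration; the point is precisely that the table is written by a human, not learned:

```python
# A hand-coded goal hierarchy: each goal decomposes into hand-picked subgoals.
SUBGOALS = {
    "be in Paris":           ["go to the airport", "catch a plane"],
    "go to the airport":     ["get down to the street", "hail a taxi"],
    "get down to the street": ["walk to the elevator", "exit the building"],
    "walk to the elevator":  ["stand up", "pick up bag", "open door", "walk"],
}

def plan(goal: str, depth: int = 0) -> None:
    print("  " * depth + goal)
    for sub in SUBGOALS.get(goal, []):   # leaves = actions we can "just do"
        plan(sub, depth + 1)             # (System 1: no further planning needed)

plan("be in Paris")
```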
So that led me to assembling all the components you see here into an architecture of the kind people have called a cognitive architecture, whose centerpiece is a world model. This is, by the way, roughly where those pieces might sit in the human brain: the world model is basically your prefrontal cortex, perception is in the back, motor control is roughly in the middle, the short-term memory is the hippocampus inside, and in the basal ganglia at the bottom of the brain you have a bunch of those objective functions, at least if you squint. That leads to the architecture called objective-driven AI, which is what I just described. I wrote a vision paper about this about three years ago and put it on OpenReview; it's not on arXiv, only OpenReview, so that people can comment. It's called "A Path Towards Autonomous Machine Intelligence", and it explains a lot of what I just went through.
So how far have we gone in that direction? Can we train a system to learn how the world works from video, the same way we train a system to produce language by training it to predict the next word in a text? Can we do this with video: show a video to the system and train it to predict what happens next? If the system does a good job at this prediction, it means it has understood the underlying nature of the world. It probably knows that the world is three-dimensional, that some objects, animate ones, can move spontaneously, that most other objects obey simple rules of physics, and so on. So can we use the same techniques that worked so successfully for text to train a video system? The answer is absolutely not, despite what you might read. A lot of people in the field today strongly believe that the way to get a system to understand the real world is to train it to predict, at the pixel level, what happens in a video. I used to believe this myself, until about five years ago. I've completely changed my mind, and I've become philosophically opposed to the very idea. It just doesn't work. And we've tried: I worked on this for the better part of the last 20 years, trying to predict what's going to happen in video, driven by the idea of self-supervised learning: you don't want to train a system to accomplish a particular task, you just want to train it to understand the world, and then learning any particular task becomes simple and fast.
That's also why I never believed in reinforcement learning, for example. So why does this work so well for text and not for video? The answer is very simple: text is simple. Language is simple. We think of language as the epitome of human intelligence, but no, language is simple. It's like chess: chess is simple. It's like computing integrals: hard for humans, but algorithmically not that hard. So why is it easy to train a machine to predict the next word? You can never predict the actual word that follows a text, but what you can do is produce a probability distribution over all possible words in your dictionary, and you can do that because there is only a finite number of words or tokens. So you can handle uncertainty pretty easily. For video, though, we don't have a way of representing a normalized distribution over all possible video frames. Not only do we not have it; it's actually an intractable mathematical problem. We don't even know how to represent proper distributions in such high-dimensional continuous spaces. We can represent energy functions, which would be the unnormalized logarithm of a distribution, but even those are not very good, and we certainly don't know how to normalize them. So the whole idea of probabilistic modeling basically has to be thrown out the window if we want to predict what goes on in a video.
And the main issue is that if we train a system to predict what goes on in a video at the pixel level, we get blurry predictions. You see the kind of prediction here in old work of ours from 2016, almost ten years old now: this is what you get when you train a pretty big neural net to predict what's going to happen in short videos. The second column at the bottom is what you get when you try to predict the trajectories of cars on a highway. Blurry.
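Why blur? If the future is multimodal and you train with squared error, the optimal prediction is the conditional mean, a value that may match none of the actual outcomes. A tiny numeric demo of that point (my illustration, not from the talk):

```python
# With an L2 loss and two equally likely sharp futures, the best constant
# prediction is their average: a gray, "blurry" in-between.
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([0.0, 1.0], size=100_000)  # a "pixel" that goes to 0 or to 1

p = y.mean()                              # the minimizer of E[(y - p)^2]
print(p)                                  # ~0.5: the average of two sharp outcomes
```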
So my solution to this is something I call a joint embedding predictive architecture, and this is what it looks like. What's the difference from what I showed you before? The difference is that y now goes through an encoder. You take a video, call it y; you corrupt it or transform it, or you encode an action that may have taken place between the first video and the second, and call that x; you run x through an encoder as well; and then you try to predict, not the entire video, but the representation of that video. So instead of making predictions at the pixel level, you make predictions in representation space. If the variable here is an action being taken in the world, then this is a world model: given the representation of the state of the world at time t, sx, and an action, it predicts the representation of the state of the world that results from taking that action.

So that's the difference between the two architectures. On the left you have generative architectures: they predict y. On the right you have joint embedding predictive architectures, JEPAs: they predict an abstract representation of y, in which all the details of y that are essentially unpredictable have been eliminated. If I take a video of this room, starting from the left and slowly panning toward the right, and I stop right here and ask the system to predict the rest of the video, it will predict that the camera is going to continue panning. It will probably predict that this is a room. It might even predict that there are blinds on this side, because it can figure out the lighting pattern on everyone's faces. But there is no way it can predict what each of you looks like, absolutely no way: that information is simply not in the initial segment. It can't predict which chairs are occupied, or the detailed texture of the carpet. So if you try to train a system to predict at the pixel level, it's going to spend all of its resources predicting things it cannot predict, and as a result it produces a blurry mess, basically the average of all the things that could plausibly happen.
Doing this in representation space simplifies the problem a lot. But one complicated problem remains: how do you train these architectures? There are several types of joint embedding architectures. The one on the left is the pure joint embedding architecture; if the two encoders are identical and share the same weights, it's called a Siamese neural network, a concept from one of my papers in the early '90s. The JEPA with a predictor is the one in the middle. The action-conditioned JEPA is the one on the right, where the variable z is either a latent variable or an action being taken; it can be seen as a kind of causal model of what happens in the world given an action or transformation that may occur. That's the kind of architecture we need to train. How do we train it? This is where the energy-based-model story becomes useful.
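Here is a minimal sketch of what the action-conditioned JEPA energy looks like: encode x and y, predict Enc(y) from Enc(x) and the action, and take the prediction error in representation space as the energy. The random linear maps are stand-ins for trained networks:

```python
# Toy JEPA energy: prediction error in representation space.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_REP, D_ACT = 64, 16, 4
enc_x = rng.normal(size=(D_REP, D_IN)) / np.sqrt(D_IN)      # encoder for x (stand-in)
enc_y = rng.normal(size=(D_REP, D_IN)) / np.sqrt(D_IN)      # encoder for y (stand-in)
pred = rng.normal(size=(D_REP, D_REP + D_ACT)) / np.sqrt(D_REP + D_ACT)

def energy(x: np.ndarray, y: np.ndarray, action: np.ndarray) -> float:
    sx = enc_x @ x                                 # representation of x
    sy = enc_y @ y                                 # representation of y
    sy_hat = pred @ np.concatenate([sx, action])   # predicted representation of y
    return float(np.sum((sy_hat - sy) ** 2))       # energy = prediction error,
                                                   # unpredictable details ignored
x, y, a = rng.normal(size=D_IN), rng.normal(size=D_IN), rng.normal(size=D_ACT)
print(energy(x, y, a))
```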
As I said before, the way you train an energy-based model of this type is to make sure that the energy the system produces, the value of those objective functions, is low for training samples of x and y and high everywhere else. Making the energy low on training samples is easy enough: show an example of x and y from your training set and tweak the parameters so the scalar energy goes down. Super easy. The problem is how you make sure the energy is higher everywhere else, and for that there are two major categories of methods (there are more, but two major ones). If you don't explicitly make sure the energy is higher outside the manifold of the data, you run the risk of a collapsed system, one that gives zero energy to every pair of x and y, and that's not a good model. You need a contrast between the good pairs (x, y) and the bad ones.

The first class is contrastive methods. A contrastive method generates other pairs (x, y) that are not on the data manifold and pushes their energy up. There's an issue with contrastive methods, though. I used to be a fan of them, I more or less invented them with the Siamese network work, but they don't work very well in high dimension. With a high-dimensional representation space, the number of contrastive samples you need grows, in the limit, exponentially with the dimension, so it doesn't scale well, and experimentally it doesn't work very well either.

The alternative is regularized methods: methods where, either by construction or through a regularizer, the volume of space that can take low energy is limited, or minimized, somehow. That sounds a little mysterious, but I'll give you an example of how it can be done. So again: contrastive methods versus regularized methods, where regularized methods try to minimize the volume of space that can take low energy.
The way we test this experimentally is to train one of these joint embedding (JEPA) architectures on unlabeled data, video or images or whatever, and then take the encoder and use the representation it has learned as input to a supervised classifier or predictor, which we train in supervised mode to solve a particular task: object recognition, segmentation, action recognition in video, for example. That's the standard way you evaluate self-supervised learning.

On contrastive methods, which I said I didn't want to use but mention for completeness: there's an old paper of mine from 1993 where we proposed this; it has come to be known as metric learning. You try to train a neural net to project inputs into a space where Euclidean distance makes semantic sense, if you want. There were two other papers from my lab in the mid-2000s, and then a 2020 paper from Google called SimCLR, which showed that this type of method can give decent performance for self-supervised training of image recognition systems. But again, the representations produced by these methods tend to be low-dimensional, not more than 200 or so.

There's another set of methods I haven't mentioned yet. They're sort of regularized methods, but not completely well understood: methods based on distillation, and they're very popular at the moment. They again consist of two encoders, but the weights of the encoder on the right are an exponential moving average of the weights of the encoder on the left.
Think of it this way: the encoder on the left is the one being trained, and its weights can change quickly; it is trained against the output coming out of the encoder on the right. You do not backpropagate gradients into the encoder on the right (the red cross on the arrow here says the gradients are not backpropagated); you just set its weights to an exponential moving average of the weights on the left. Somehow this works, and if you do it right, it prevents collapse, the failure mode in which the encoder simply ignores its input and produces a constant output, which would of course minimize the prediction error. There are some theoretical papers on this, but it remains a little mysterious, I should say. A whole family of methods uses this idea, and they work really well: BYOL from DeepMind, SimSiam from my colleagues at Meta in 2020, and DINOv2, I-JEPA, and V-JEPA, which I'll say more about.
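Here is a minimal sketch of that distillation trick: the target encoder's weights are an exponential moving average (EMA) of the online encoder's, with no gradient flowing through it. Plain arrays stand in for the two networks' parameters:

```python
# EMA target-encoder update: target <- m * target + (1 - m) * online, no backprop.
import numpy as np

def ema_update(target_params: dict, online_params: dict, momentum: float = 0.99):
    """The target is never trained directly; it just slowly tracks the online net."""
    for k in target_params:
        target_params[k] = momentum * target_params[k] + (1 - momentum) * online_params[k]

online = {"w": np.ones((8, 8))}
target = {"w": np.zeros((8, 8))}
for step in range(100):
    online["w"] += 0.01           # stand-in for a gradient step on the online encoder
    ema_update(target, online)    # empirically, this is what prevents collapse
print(target["w"][0, 0], online["w"][0, 0])
```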
There's a particular self-supervised image feature extraction system built this way, called DINOv2. It was produced by my colleagues at FAIR Paris, and it works really well. People use DINOv2 features for all kinds of applications in computer vision: just use it as a generic feature extractor, feed it to a supervised classifier, and with just a few samples you can train your classifier to accomplish just about any vision task. One of our colleagues, for example, took satellite images together with a small amount of labeled data in which canopy height was annotated for some areas. She trained a head on this small amount of data, applied it to the entire Earth, and was able to estimate the total amount of carbon captured in vegetation worldwide, a quantity that is interesting to know for climate-change prediction. Literally hundreds of people use these features.
Okay, so here's a version of this that we can use to build a world model. It's called DINO-WM, the DINO World Model, a recent paper on arXiv by Gaoyue Zhou, a student co-advised by Lerrel Pinto, a roboticist at NYU, and myself. The basic idea: you take the DINOv2 representation of an image of the environment, and the representation of an image of the environment after a particular action has been taken, and you train a predictor. You keep the encoder fixed, but you train the predictor to predict the representation of the world after the action has been accomplished. Once you have that action-conditioned predictor, you can use it for planning. You show the system an initial state, extract the features with DINOv2, and then run your world model for multiple time steps, maybe ten or so. You feed a target state, the configuration you want a robot to reach after a certain number of actions, to the encoder as well, and you measure the distance, in representation space, between the predicted state and the target state. Then, by optimization, you figure out the sequence of actions that minimizes this task objective. And this works really well.
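A minimal sketch of that planning loop, in the spirit of DINO-WM but with all networks replaced by random stand-ins: unroll a latent predictor over candidate action sequences and pick the one whose predicted final representation lands closest to the encoded goal:

```python
# Latent-space planning with a frozen encoder and a learned predictor (toy version).
import numpy as np

rng = np.random.default_rng(0)
D_REP, D_ACT, HORIZON = 16, 2, 5
W = rng.normal(size=(D_REP, D_REP + D_ACT)) / np.sqrt(D_REP + D_ACT)

def predictor(s: np.ndarray, a: np.ndarray) -> np.ndarray:
    return np.tanh(W @ np.concatenate([s, a]))   # latent dynamics (stand-in)

def plan(s0: np.ndarray, s_goal: np.ndarray, n_candidates: int = 1000):
    best, best_d = None, np.inf
    for _ in range(n_candidates):                # optimization by random shooting
        actions = rng.uniform(-1, 1, size=(HORIZON, D_ACT))
        s = s0
        for a in actions:                        # unroll the world model in latent space
            s = predictor(s, a)
        d = float(np.linalg.norm(s - s_goal))    # task cost: distance to the goal,
        if d < best_d:                           # measured in representation space
            best, best_d = actions, d
    return best, best_d

s0, s_goal = rng.normal(size=D_REP), rng.normal(size=D_REP)
print(plan(s0, s_goal)[1])
```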
The images here are reconstructions produced by a separately trained decoder; we do not use a decoder for training. There is no decoder in training, and that's very important: whenever you train an image encoder through reconstruction, it doesn't work as well as training with these joint embedding criteria. What you see at the top is a problem where a little blue dot can be moved around and can push a T-shaped object. The top row shows a sequence of actions being taken, and the result of applying that sequence in the actual environment, which is a simulated one. The bottom row is the prediction the DINO world model produces: we take the representation resulting from predicting through the action sequence, and at each time step we run it through the decoder, so we can render an image that corresponds to it. But again, that decoder is trained separately, afterwards. You can see the bottom row is pretty similar to what happens in the first row, almost identical. Everything in between is other methods people have proposed to solve this problem using various techniques; none of them does nearly as good a job.

The dynamics here is super simple, but there are other situations where the dynamics is far more complicated. Here, the robot comes down onto a platter and moves by some Δx, Δy, pushing a bunch of blue chips that push on each other. It's a fairly complex dynamical system: you can simulate it, but you can't really write down equations that would let you plan. Again, the top row is the ground truth, what actually occurs after executing a certain number of actions, and the bottom row is a decoded version of the system's internal state after imagining the same actions. The results are pretty similar, and again the other methods are not nearly as good. I won't bore you with more results; it just works better.

Let me show you some videos instead; it's more fun. You can apply this to all kinds of situations: training world models or predictors for pushing a T around, navigating through a maze, or doing those tasks of moving a little string or a pile of chips. Let me show you the last video; I think it's the most fun one. You set the system up with an arbitrary goal and an initial condition, and what you see at the bottom is the robot using planned actions to try to reproduce the target configuration it is observing, limited to a relatively small number of actions here. Some of this is done open-loop, and some of it uses what's called receding-horizon planning. We can apply this to all kinds of situations that require planning, and it works pretty well, but there's still a lot more work to do.
Those architectures involve a bit of a cheat that I should confess: the encoder is pre-trained, using self-supervised learning with a distillation method (DINOv2 uses distillation to train the visual encoder), and the predictor, the world model, is trained separately; it assumes the encoder is already there. In I-JEPA and V-JEPA, everything is trained at once, also using a distillation method, but we train the predictor simultaneously. There are two papers on this, one very recent, on V-JEPA. I'll go fast here, but this technique of using distillation to train one of these JEPA architectures works really well for learning visual features for recognizing objects, segmenting images, and other tasks. It's much faster than alternative self-supervised methods, and it gives better results. DINOv2 gives better results still, because it's trained on much bigger datasets; we haven't compared exactly yet. What we compare against here is another project at FAIR called MAE, the masked autoencoder. That method is basically a denoising autoencoder: you take an image, mask parts of it, and train a gigantic neural net to reconstruct the full image from the partially masked one. This doesn't work as well, it's much more expensive, and in fact the project was cancelled, abandoned. The main lesson: if you're going to train a system to produce representations of images or video, do not train it to reconstruct. It's not going to work. Which is why I don't think any of the efforts to produce world models using models like Sora, video prediction systems, are ever going to work. It's a complete waste of time.

V-JEPA is just the video version of I-JEPA: the input to the system is a sequence of 16 frames, partially masked, and you train the system to predict the representation of the full video from the representation of the partially masked one. This works really well. It produces features that are very useful if you want to classify, for example, an action taking place in a video.
Again, I won't bore you with the details. The model can basically hallucinate what goes on in the video: here is a partially masked video, and what you see on the right is an output produced by a separately trained decoder, a diffusion model in this case. It's important that the decoder is not used to train the internal model; if you do that, it actually works less well.

Now here's a surprising result, from a paper we posted on arXiv a few weeks ago. If you take this V-JEPA model and show it videos where something really weird occurs, something that's not physically possible, the system can tell you it's not physically possible. What you do is take those 16-frame inputs, slide the window over a longer video, and measure the system's prediction error: to what extent it can predict the representation of the full video from a partial one, or future frames from current ones, even though it was not trained to do exactly that. When something really unusual appears in the video, say an object spontaneously disappears, the prediction error goes through the roof: something very strange is happening here. In fact, if you construct a dataset of paired videos, one perfectly normal, in which physics is not violated, and one in which some aspect of physics is violated (an object disappears, changes shape, changes color, that sort of thing), the system can tell you which one is less plausible with practically 100% accuracy.
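Here is a minimal sketch of that probe. The error signal below is synthetic, purely for illustration; in the actual work, it comes from V-JEPA's predictor evaluated over a sliding window:

```python
# Surprise detection: flag windows where prediction error spikes.
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(1.0, 0.1, size=200)   # nominal prediction error per window (synthetic)
errors[120] = 5.0                         # an object vanishes: error goes through the roof

mu, sigma = errors.mean(), errors.std()
surprising = np.where(errors > mu + 4 * sigma)[0]
print(surprising)                         # -> [120]: "something strange happened here"
```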
Very interesting, and performance is better or worse depending on the type of common sense you test the system for. But here is what we're really moving toward: techniques for training those JEPA architectures with regularized methods based on information maximization. A bunch of these have been proposed over the last five years or so: MCR² from Yi Ma's group at Berkeley; Barlow Twins from my group at Meta; VICReg, also from my group at Meta and NYU; and MMCR from my colleagues in computational neuroscience at NYU.

How do they work? These methods basically say: I want to prevent the system from collapsing, from minimizing the prediction error by ignoring the input and producing a constant representation at the output of the encoder. One way to do this is to have some measure of the information content coming out of the encoder, and try to maximize it. Now, we have an information theorist in the audience, and if he had some hair left, he would be pulling it out; my apologies, we're old friends and colleagues. Here is the problem: we do not have any good way of maximizing information content, because that would require a lower bound on information content, and we don't have lower bounds, only upper bounds. So what we're going to do is take some measure of information content that we know is only an upper bound, push up on it anyway, and cross our fingers that the actual information content follows. That's as justified as I can make it. The other authors of those papers have other justifications: producing efficient codings, making sure the representation vectors coming out of the encoder fill up the representation space, or are uniform on a sphere, things like that. But basically they're all surrogates for some measure of information content under some assumption, and because of the assumptions, they're all upper bounds; the assumptions essentially ignore dependencies between variables.
So how can we do this? Basically, by making sure that the variables coming out of the encoder, the components of the output vector, are somewhat independent, or at least uncorrelated. Feed a batch of samples to the encoder and you get a matrix, with one row per sample and one column per variable, per component of the vector. There are two types of methods. Sample-contrastive methods, the contrastive methods I was talking about earlier, try to make the rows of that matrix all different from each other, possibly orthogonal to each other, maximally different if you can. That's what SimCLR and various Siamese-net methods do. The alternative I'm proposing is to make the columns of that matrix independent, or at least mutually orthogonal, pairwise, which is another way of saying that I want the variables to be uncorrelated.

There are various specific ways of doing this, and those papers each do it differently. The variance-covariance regularization method I mentioned does it by making sure the variables have standard deviation one (or at least one), and then making the off-diagonal terms of the covariance matrix, the matrix transposed multiplied by itself, as close to zero as possible, which guarantees the variables are uncorrelated; and simultaneously, it minimizes the prediction error. What you get in the end is a system that finds a trade-off: extract as much information as possible from the input, but only the information that is actually predictable by the predictor, essentially eliminating everything in the observations, everything in y, that is not predictable from x.
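Here is a minimal sketch of a VICReg-style loss on a batch matrix Z (one row per sample, one column per component): an invariance (prediction error) term, a variance term pushing each component's standard deviation to at least one, and a covariance term driving the off-diagonals to zero. The coefficient values are illustrative:

```python
# Variance-covariance regularization, VICReg-style (toy sketch).
import numpy as np

def vicreg_loss(z_pred: np.ndarray, z_target: np.ndarray,
                lam: float = 25.0, mu: float = 25.0, nu: float = 1.0) -> float:
    invariance = np.mean((z_pred - z_target) ** 2)      # prediction error term

    def var_cov(z):
        z = z - z.mean(axis=0)                          # center each column
        std = np.sqrt(z.var(axis=0) + 1e-4)
        variance = np.mean(np.maximum(0.0, 1.0 - std))  # hinge: push each std to >= 1
        cov = (z.T @ z) / (len(z) - 1)                  # covariance matrix
        off_diag = cov - np.diag(np.diag(cov))
        covariance = np.sum(off_diag ** 2) / z.shape[1] # drive off-diagonals to zero
        return variance, covariance

    v1, c1 = var_cov(z_pred)
    v2, c2 = var_cov(z_target)
    return float(lam * invariance + mu * (v1 + v2) + nu * (c1 + c2))

rng = np.random.default_rng(0)
print(vicreg_loss(rng.normal(size=(256, 32)), rng.normal(size=(256, 32))))
```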
If you train a system like this, completely self-supervised on unlabeled data, and feed the representation to a supervised classifier, it performs well; I won't bore you with the numbers. But what we can actually do with this method is train a world model. This is a very recent paper, out just last week on arXiv, by two of my students, Vlad Sobal and Wancong Zhang, together with my colleagues Kyunghyun Cho, Randall Balestriero, a former postdoc at FAIR who is now a professor at Brown, and Tim Rudner, a postdoc at NYU who is on the job market. Let me jump directly to the idea. It's a world model again, but the encoder and the predictor are trained simultaneously, and collapse is prevented using the variance-covariance regularization, the information maximization I just talked about, together with minimizing the prediction error. It's trained on sequences of video frames, and once this world model is trained end to end, you can use it for planning. The system can actually plan pretty well in simple cases; there's still a lot of work to scale this up and make it work in more complex situations.

Here's another very recent project, by Amir Bar, a postdoc at Meta, which uses world models, trained in a different fashion, to plan in the real world. You train from videos taken from a robot: the robot sits at a particular position, it moves by a known quantity, and then you have another video frame, and you can train the system to predict what the world is going to look like if you accomplish a particular motion. Then you can use it for planning trajectories, and it works, which is pretty cool. These systems can predict what the world will look like as you follow a trajectory, and therefore plan a sequence of actions so that the world ends up looking like a particular target. Lots of fun videos there. Okay, let me conclude, because I'm badly out of time.
I have a number of recommendations. Abandon generative models in favor of these joint embedding architectures. Everybody's talking about generative AI, everybody's working on generative AI, everybody assumes AI just is generative AI, and I'm telling you: forget generative AI. Abandon probabilistic models in favor of energy-based models. Probabilistic modeling is the main theoretical framework on which all of machine learning is based, but it really doesn't work for making nondeterministic predictions in high-dimensional continuous spaces. Use energy-based methods: you don't need to normalize anything; think of them as unnormalized logarithms of probabilities if you like, but you have a much bigger space at your disposal. Abandon contrastive methods, again something very popular at the moment, in favor of regularized methods. And of course, abandon reinforcement learning; I've been saying this for ten years. If you are interested in human-level AI, do not work on LLMs. Work on LLMs if you want an engineering job next year; but if you want a research job on a topic that will not have fallen out of favor three or five years from now, don't work on LLMs. LLMs will be useful, just not as the centerpiece of AI systems.

So we have a lot of problems to solve. The research program built around this is probably a ten-year program for a lot of people, so if you're looking for a good topic for a PhD in AI, there are plenty of problems within this framework: training large-scale world models on all kinds of inputs, and figuring out good planning algorithms, good optimization algorithms for planning.
It turns out gradient-based methods tend to get stuck in local minima, so we may have to use more sophisticated optimization, such as ADMM, or some degree of gradient-free optimization, which I would like to stay away from. Then there's planning under uncertainty with latent variables; hierarchical planning, completely unsolved, completely open; associative memory, which I didn't talk about; and then slightly more theoretical issues: mathematical foundations for energy-based learning and inference, and learning the cost modules. Here we only used very simple situations where the cost module could be built by hand: it's just Euclidean distance in representation space. In most cases, you'll probably have to learn the cost function. Then there's planning with an inaccurate world model, adjusting the world model as you go, and so on.

What this model tells you is that there are three ways to be stupid. First, your world model can be wrong, so the effects of your actions may not be the ones you think they would be. Second, your cost functions might be inappropriate: they might lead to outcomes that are not the ones you expect, or you may have no guardrails, which will lead you to do really bad things to good people without realizing it. Third, even if your world model and your cost functions are good, you might be unable to find a sequence of actions that actually fulfills your objective. Some AI systems, and certainly some people, fail on all three: they don't have the right world model, they don't have the right cost function because they have no morals, and whatever action they decide to take is completely ineffective. That's a good description of certain people in government who shall remain unnamed.

Okay. So, in the future, we're going to have virtual assistants with us at all times, helping us in our daily lives, and those systems will eventually constitute a kind of repository of all human knowledge. We won't go to a search engine, or even a library, unless we really like going to libraries, which I do.
We'll just ask a question of our AI assistant; we may even pose it a problem, and it may be able to solve it. All of our digital diet is going to be mediated by those AI assistants, and it would be extremely dangerous for everything about humanity if those assistants came from a handful of companies on the west coast of the US, or from China: dangerous for linguistic and cultural diversity, for different value systems, political leanings, whatever it is. We cannot possibly get our entire information diet from just a few systems of this type. We need high diversity in AI assistants for the same reason that we need high diversity in the media and the press. And because these systems are so expensive to train, the only way to get there is for the people who have the means to train foundation models to release them as open source, so that many other people can fine-tune them for whatever language, culture, value system, or centers of interest they have. I don't see any alternative to this. So, in terms of ethics, and this is after all a symposium about ethics, I think the most important aspect of AI ethics today is not bias, and it's not whether AI is going to kill us all. It's not any of that. It's whether we will have the tools to build highly diverse AI systems, so that we don't get all of our information from just a handful of them, and that means open-source foundation models. If you have any level of influence on anyone, particularly in government, make sure that governments don't pass laws that would make open source risky, complicated, or illegal; there are proposals in that direction, and that would be a very bad outcome for the future.
But if we manage to preserve diversity, then perhaps humanity will go through a new Renaissance, because if we have access to all the world's knowledge in an even more efficient fashion than we currently do, and we have systems that assist us in all the decisions we make every day, that will amplify human intelligence. It will be as if each of us were walking around with a staff of super-smart people working for us. We should not be scared that they would be smarter than us, because we set the objectives for them, which is why I think this idea of objective-driven AI is so important. We'll be like politicians, who may not know everything themselves but have a staff of experts in various topics advising them. We'll all be pointy-haired managers of virtual people. Thank you very much.