Transcript: Yann LeCun: World Models: Enabling the Next AI Revolution


Computer Vision and Geometry Group, ETH Zurich (58:54)

Source video ID: 72Xj8k5WQX4

Transcript

0:00 — Yeah, I’ll talk about world models, possibly the enabler for the next AI revolution. So, there’s a lot of machine learning people in the room, I think. I have bad news for you: machine learning sucks. You know, basically, when we compare the learning abilities of machines with those of humans and animals, clearly there is a big gap.
0:31 — You know, people and animals can learn new tasks extremely quickly, with very few trials, very few samples. People have common sense, animals too, physical common sense. There’s a lot of tasks that we can accomplish zero-shot even if we’ve never faced them before. And how do we do this with machines? You know, we have very powerful AI techniques that everybody is using, but they don’t really handle the real world. They don’t handle
1:02 — continuous, high-dimensional, noisy data. Language is easy by comparison. And so, the real world is messy; language is simple. And, you know, this connects with what Vladlen said earlier, and Jitendra as well: the Moravec paradox, that things that are simple for humans are difficult for computers, and things that are complicated for humans turn out to not be that difficult for computers. Like
1:33 — playing chess, computing integrals symbolically, solving equations, proving math theorems, etc. So, you know, how is it that a 10-year-old can basically do what you would like a domestic robot to do, and do most of those tasks without actually being trained to do them? The first time you ask them, they can do it. They may not want to do it, but they can. How come any teenager can learn to drive a car in a few hours of practice, yet
2:04 — the self-driving car companies have literally millions of hours of training data, and despite that, they can’t use those millions of hours to get a machine, just by imitating humans, to drive at the same level of reliability? Otherwise, we’d have level five self-driving cars, and we don’t have them. At best, in the consumer car business, we have level two or three, and the robo-taxis are engineered,
2:34 — you know, very heavily engineered with various sensors and other things. So, you know, we keep bumping into this Moravec paradox, and we really have to go beyond this if you believe that intelligence requires grounding. Of course, some philosophers, and certainly some language people, don’t believe that’s necessary, but I think it is. So, like Vladlen, since we’re in Switzerland, I’ll cite Jean Piaget. He was a big influence on me. He had a debate with Noam Chomsky
3:04 — in France in the late 1970s, where, you know, they were debating whether language was innate or learned, and there were transcriptions of that debate with, you know, people participating in it, and one of them was a guy who had worked with Jean Piaget, who was a professor at MIT, and was talking about the perceptron,
3:36 — kind of saying, you know, there’s those simple machine learning models that are capable of learning surprisingly complex tasks, and that may be kind of evidence for the fact that learning is possible, contrary to what Chomsky was saying. This guy was Seymour Papert. He was a professor at MIT, and 10 years before that, he had written a book that basically killed the entire field of neural nets, basically by pointing out the limitations of the
4:06 — perceptron. But here he was, 10 years later, arguing for the fact that those things were actually interesting to study. But anyway, so Jean Piaget says, “Intelligence is not what you know, it’s what you do when you don’t know.” In fact, he never actually said this. This is apocryphal. Uh but there are other psychologists who basically kind of distilled his thinking into this sentence, which he never said. Um So he’s kind of quoted as as saying that. Uh so intelligence is not an accumulation
4:36 — of declarative knowledge. LLMs are an accumulation of declarative knowledge, not just, but the main reason they’re useful is because they can accumulate a lot of declarative knowledge. Uh intelligence is not a collection of skills. Uh You can probably build a machine to accomplish sort of any task if you spend enough resources on it, including things like self-driving. But that’s not really what intelligence is. Intelligence is the ability to learn to drive in about 20 hours. Or to learn, you know, learn any new
5:08 — task with very little training, or accomplish new tasks zero-shot. That’s really what intelligence is, and that’s really what Jean Piaget means. What that means is that we’re not going to have any simple measures of intelligence, because again, any particular task, if you spend enough effort and time, you can always kind of crack it. So it’s more, you know, how adaptive you are. And, you know, this connects to something that Vladlen said, like the
5:39 — notion of AGI is complete nonsense. Human intelligence is specialized. The characterization of human intelligence is that it’s very quickly adaptive and we can learn new tasks. All of us know different sets of knowledge and have different skills, and it’s because we’ve been exposed to, you know, different environments and we’ve had to solve different problems. We’re adaptive. That’s really what
6:10 — intelligence is. Okay, so how do humans learn, and animals for that matter? There’s a lot of learning that takes place in the early months of life, mostly by observation. So, you know, a two-month-old baby can just [unclear], so can develop a dynamical model of its own limbs, but basically cannot affect the world. You know, it can’t move an object or anything. But, you know, it can learn a lot of things
6:41 — about the world. One thing a baby can learn really quickly is that the world is three-dimensional. Why? Because the fact that every point in the world has a distance from us is the best way to explain how our view of the world changes when we move our head. And of course babies don’t necessarily move their head, but they are being moved, so they see parallax, and they derive from this the fact that the world is
7:13 — three-dimensional. We can do this with learning machines today. They learn that that the world is three-dimensional only by being exposed passively to uh videos. Uh so that’s that’s uh an interesting uh thing. And I’m going to silence my phone because it’s actually Okay. So um
7:43 — So, basic concepts like object permanence are learned really quickly. Notions of stability, rigidity, and things like that. But then what we would consider intuitive physics, things like inertia and gravity, that actually takes 9 months for human infants. Shorter for most animals. And if you put an 8-month-old on a high chair and you put out a bunch of toys, 8 or 9 months old,
8:16 — the child will most likely systematically take all the toys and throw them on the floor and watch the result. They’re doing the experiment that gravity actually applies to everything. Uh So, that takes a long time. How does that happen? What type of learning is taking place there? They’re doing the experiment, but they can learn about gravity just by observation as well. Uh so, if you if you show the the scenario here at the bottom, where a car is on a platform, you push it off the platform, it appears to float in the air,
8:46 — a 6-month-old will barely pay attention, hasn’t learned about gravity yet. A 10-month-old will look very surprised, like the little girl. And that’s actually how psychologists measure whether a baby has learned a particular concept about the world, which is the violation of expectation. We can actually use those techniques to test whether machine learning systems have acquired some, you know, some notion of common sense. So, there’s a lot that can be said about this.
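[Editorial sketch, not from the talk.] One way to read violation of expectation computationally is as prediction error: a model that has absorbed gravity predicts a falling object, so a floating one produces a large error. The Python below is a minimal illustration under that assumption. The gravity-only predictor, the frame interval, and all function names are hypothetical stand-ins for a learned model.

```python
# Violation of expectation as prediction error ("surprise") under a world model.
# The gravity-only predictor below is a hypothetical stand-in for a learned model.

G = 9.81   # gravitational acceleration, m/s^2
DT = 0.1   # assumed time between frames, seconds


def predict_next_height(h_prev, h_curr):
    """Predict the next height of an unsupported object from its last two heights."""
    velocity = (h_curr - h_prev) / DT
    return max(0.0, h_curr + velocity * DT - G * DT * DT)


def surprise(heights):
    """Mean squared error between predicted and observed heights."""
    errors = [
        (predict_next_height(heights[t - 2], heights[t - 1]) - heights[t]) ** 2
        for t in range(2, len(heights))
    ]
    return sum(errors) / len(errors)


def falling(h0, steps):
    """Heights generated by the same constant-gravity update the predictor assumes."""
    heights = [h0, h0]
    while len(heights) < steps:
        heights.append(predict_next_height(heights[-2], heights[-1]))
    return heights


plausible = falling(1.0, 6)   # car pushed off the platform and falling
floating = [1.0] * 6          # car that stays in mid-air

print("surprise, falling car: ", surprise(plausible))
print("surprise, floating car:", surprise(floating))
```

Under these assumptions the falling trajectory produces no surprise and the floating one does. A predictor with no notion of gravity would show no such gap, which is the kind of difference the test is meant to expose.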
9:18 — Jitendra and I collaborated on a paper here, mostly written by Emmanuel Dupoux, actually. Jitendra and I had very little contribution to it, on this whole kind of set of questions. Okay, well, what intelligence really is, if it’s not an accumulation of skills and not an accumulation of declarative knowledge, is the ability to accomplish new tasks, as I said, solve new problems without prior training.
9:48 — And again, AGI makes no sense as a phrase. Human intelligence is specialized, and the question is not, do you know how to do everything? It is, can you learn quickly how to do anything? Or a wide spectrum of things. There’s a little, somewhat philosophical paper here at the bottom written by some of my young colleagues. So, here is a simple calculation.
10:18 — There’s still a lot of people, particularly on the west coast of the US, who believe that we’re going to reach what they call AGI, okay, by scaling up LLMs, maybe training them on synthetic data, maybe, you know, using a few tricks in post-training and reinforcement learning. And I think that’s impossible. I’m a believer in sort of grounded intelligence, if you want, but you can do the simple calculation.
10:48 — A typical LLM of today is trained on something like 20 trillion words, which corresponds to about 30 trillion tokens. And each token is 3 bytes, something like that. So, the data volume is about 10 to the 14 bytes. This would take about 400,000 years for any human to read. Then compare this with what a 4-year-old has seen during his or her life. That’s about 16,000 hours of wake
11:18 — time, which by the way is a small amount of video; it’s about 30 minutes of YouTube uploads. And we have 2 million optic nerve fibers carrying about 1 byte per second each. So, the data volume that a four-year-old has seen through vision, and probably through touch as well, is about 10 to the 14 bytes. So, a four-year-old through vision: same amount of data as 400,000 years through text. With all the human-produced text
11:49 — available publicly on the internet. We’re not going to get to anything like human-like intelligence by just training on text. It’s just not going to happen. So of course you’re going to say, well, video is much more redundant than text. But in fact, that’s a feature, not a bug. If you want to train a system, particularly using self-supervised learning, you need redundancy in the data.
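[Editorial sketch, not from the talk.] The text-versus-vision comparison above can be checked with a few lines of arithmetic. The token count, bytes per token, word count, fiber count, and per-fiber rate are the figures quoted in the talk. The waking time is taken as roughly 16,000 hours by age four (about 11 hours a day), which is the value that makes the vision figure land near 10^14 bytes. The reading speed and hours of reading per day are assumptions added here to reproduce the order of magnitude of the reading time.

```python
# Back-of-the-envelope check of the text-vs-vision data volumes.
# Reading speed and reading hours per day are assumptions, not figures from the talk.

TOKENS = 30e12               # ~30 trillion training tokens (from the talk)
BYTES_PER_TOKEN = 3          # ~3 bytes per token (from the talk)
WORDS = 20e12                # ~20 trillion words (from the talk)

WAKING_HOURS = 16_000        # assumed waking hours by age four (~11 h/day)
OPTIC_FIBERS = 2e6           # ~2 million optic nerve fibers (from the talk)
BYTES_PER_FIBER_PER_S = 1    # ~1 byte per second per fiber (from the talk)

WORDS_PER_MINUTE = 250       # assumed reading speed
READING_HOURS_PER_DAY = 8    # assumed reading time per day

text_bytes = TOKENS * BYTES_PER_TOKEN
vision_bytes = WAKING_HOURS * 3600 * OPTIC_FIBERS * BYTES_PER_FIBER_PER_S
reading_years = WORDS / WORDS_PER_MINUTE / 60 / READING_HOURS_PER_DAY / 365

print(f"text:   {text_bytes:.2e} bytes")
print(f"vision: {vision_bytes:.2e} bytes")
print(f"reading time: {reading_years:,.0f} years")
```

Both volumes come out close to 10^14 bytes, and under the assumed reading pace the reading time is a few hundred thousand years, the same order as the figure quoted in the talk.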
12:20 — If you don’t have redundancy, you can’t learn anything. So, redundancy is a good thing. You don’t want too much of it though. Okay, so then there is another question about what are the right properties of intelligent systems. And in my opinion, an important property of an intelligent system is the mode of inference. Does it compute its output by propagating through a fixed number of layers of some neural net? Or consider the alternative. The alternative is computing the output of a
12:51 — system by searching for an output that is most compatible, if you want, with the input. Okay? So, you observe a situation; that runs through some perception module that produces some sort of representation of the current state of the world as you observe it. You can directly produce an action. Okay? That’s a reactive system, if you want. Or you could imagine an action and then have the
13:22 — intelligent system, figure out is this a good action for this observation? Is this something that, you know, will accomplish the task I want? So, the objective here characterizes whether the task the system wants to accomplish has been accomplished or not. Think of it as a cost function. It’s not used for learning, it’s used for inference. Think of it as, you know, negative log likelihood in a probabilistic inference model. Or, as I prefer to think of it, an energy function. So, basically there, the inference
13:53 — is a process by which you search for an output that minimizes some energy function at inference time. Okay? That’s intrinsically more powerful computationally than just propagation through a fixed number of layers. And then, in contrast, the model on the left is sort of LLM-like, right? Take a window of inputs, run this through a fixed number of
14:23 — layers of some big neural net with a few hundred billion parameters, produce one token. Okay, then shift that token into the input and then produce the second token, etc., etc. That’s auto-regressive prediction, and every token involves a fixed amount of computation, running through a fixed number of layers of some neural net. This is not a good model. It’s not a good model of reasoning. The way you coerce an LLM to do reasoning is that you trick it into generating more tokens.
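[Editorial sketch, not from the talk.] The auto-regressive loop described above can be written down in a few lines. The lookup table below is a hypothetical stand-in for the neural net, and the function names are illustrative. The point is only the control flow: one fixed-cost step per token, with each new token shifted back into the input.

```python
# Toy autoregressive decoder: a fixed amount of computation per token.
# The "model" is a hypothetical lookup table standing in for a neural net.

NEXT = {  # hypothetical table: last token -> next token
    "the": "world",
    "world": "is",
    "is": "messy",
    "messy": "<eos>",
}


def predict_next(window):
    """One fixed-cost step: look only at the context window, emit one token."""
    return NEXT.get(window[-1], "<eos>")


def generate(prompt, max_new_tokens=10, window_size=4):
    """Generate tokens one at a time; return them with the number of model calls."""
    tokens = list(prompt)
    steps = 0
    for _ in range(max_new_tokens):
        window = tokens[-window_size:]   # take a window of inputs
        token = predict_next(window)     # one pass through the "network"
        steps += 1
        if token == "<eos>":
            break
        tokens.append(token)             # shift the new token into the input
    return tokens, steps


tokens, steps = generate(["the"])
print(tokens, steps)
```

In this setup the only way to spend more computation on a problem is to emit more tokens, which is the limitation the talk points at.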
14:53 — But that’s not the way we reason. We reason internally; we don’t reason in token space, or even in language. Okay, compare this with the model on the right, which is a slight specialization of the previous one. You perceive the world or your environment, you get some idea of the current state of the world, and then you imagine a sequence of actions, a proposal for an action, you feed it to
15:23 — an internal world model for the system, and the world model predicts the outcome, and then feeds this outcome to an objective that measures to what extent a task has been accomplished or not. Okay, then by optimization, you search for an action sequence that optimizes this objective, minimizes this energy. At inference time; I haven’t talked about learning yet, okay? In my opinion, that’s a much more powerful model. But you need a world model.
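[Editorial sketch, not from the talk.] The inference procedure just described (propose actions, roll them through a world model, score the predicted outcome, and search for the action sequence with the lowest energy) can be illustrated on a toy problem. Everything below is an assumption made for illustration: the one-dimensional dynamics, the squared-distance energy, and the finite-difference gradient descent used as the search method. The talk does not specify an optimizer.

```python
# Inference by optimization: search for an action sequence that minimizes an
# energy computed through a world model. Everything here is a toy stand-in.


def world_model(state, action):
    """Hypothetical dynamics: the action nudges a 1-D state."""
    return state + 0.5 * action


def rollout(state, actions):
    """Apply the world model once per action; return the predicted final state."""
    for a in actions:
        state = world_model(state, a)
    return state


def energy(state, actions, goal):
    """Zero when the predicted outcome reaches the goal, positive otherwise."""
    return (rollout(state, actions) - goal) ** 2


def plan(state, goal, horizon=5, steps=200, lr=0.1, eps=1e-4):
    """Gradient descent on the action sequence, using central finite differences."""
    actions = [0.0] * horizon
    for _ in range(steps):
        grads = []
        for i in range(horizon):
            up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
            down = actions[:i] + [actions[i] - eps] + actions[i + 1:]
            diff = energy(state, up, goal) - energy(state, down, goal)
            grads.append(diff / (2 * eps))
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions


actions = plan(state=0.0, goal=3.0)
print("planned actions:", [round(a, 3) for a in actions])
print("predicted final state:", rollout(0.0, actions))
print("energy:", energy(0.0, actions, 3.0))
```

Note that no learning happens here: the world model and the energy are fixed, and only the action sequence is optimized, which is what distinguishes this mode of inference from a single forward pass.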
15:53 — So I’ve sort of settled on this kind of idea or architecture about 5 years ago. I wrote a long paper about it that I put online in 2022, with some general architecture, etc. If you want to take pictures, here are QR codes, you can get to it. And it’s, you know, relatively easy to read, but kind of long. And
16:24 — it’s really based on this idea that reasoning and planning are essential, and they basically proceed by energy minimization rather than forward propagation. And that for this to work, you need some world model, okay? So, same process that I described before; there are a few additional tricks. You observe the environment, the perception module produces a representation of the initial state of the world, but only a representation of what you currently perceive. So, you may have to combine this
16:55 — with the content of the memory to get a complete idea of the state of the world, what you know about it at least. Then you feed this to your world model together with a proposal for an action sequence, and your world model predicts the outcome of that action sequence. You feed this to an objective, an energy function that measures to what extent a particular task has been accomplished. So, this function outputs zero if the task is accomplished and some,
17:25 — you know, positive number if the task is not accomplished, and perhaps measures some distance to the task being accomplished. So, intuitively, you can have another set of objectives that are guardrails, that would ensure that whatever state sequence the system is going to take the world through is not going to kill anyone or hurt anyone or, you know, have any kind of deleterious effect. And so, a system constructed this way can be made intrinsically safe because it has
17:56 — to obey and, you know, optimize the guardrail objective with every output it produces. This is not the case for an LLM. The only way an LLM can be made, you know, safe or non-toxic or whatever you want to call it, is by fine-tuning it. And there is always a way to break the conditioning, if you want, to jailbreak the system. Here, you can’t jailbreak a system like this. It can do nothing but optimize
18:26 — the guardrail objectives and the task objective. Of course, if you have a world model, and there are certainly a lot of robotics and optimal control people in the room, you can apply this world model over multiple time steps, and, you know, an action sequence can be decomposed into a sequence of steps. The guardrails can be applied to all the steps in the sequence. Okay, that’s the way you would use a world model. And the way you plan by optimization there is akin to model predictive control, MPC, very classical
18:56 — stuff in optimal control going back to the 1960s. Ultimately, what you want though is something that can do hierarchical planning. All of us do hierarchical planning. Animals do hierarchical planning. What is hierarchical planning? Let’s say that I’m sitting in my office at NYU and I want to be in Paris tomorrow. There’s no way I can plan my entire trip to Paris in terms of muscle actions, 10 ms by 10 ms, which are the elementary actions
19:27 — that humans can do. I can’t do that because, first of all, it’s too long. But second of all, I don’t have the information. I don’t know, when I’m going down on the street, how long I’m going to have to wait before a taxi stops. Right? So, there’s no way I can plan the entire thing. I have to do hierarchical planning. So, what I have to do is, at a high level, I have to say, “Well, I don’t know how long it’s going to take me to go to the airport, but, you know, maybe roughly an hour or an hour and a half. So, I need to get to the airport and catch a
19:57 — plane.” Okay, that’s a two-step high-level plan. I don’t need to know many details to make that plan. And now I have a sub-goal, which is being at the airport. I’m in New York, so going to the airport involves going down on the street, hailing a taxi, and going to the airport. Now I need to go down on the street. I’m in an NYU building; that involves walking to the elevator, pushing the button, getting down, and walking out the door. Now I have a sub-goal of getting to the
20:27 — elevator, right? Et cetera. So, you can sort of go down this entire hierarchy, and at some point you get to a point where the action you need to take is very simple. It’s something that you are familiar with. You may not have to use your full mental power to plan the action. You can probably stand up from your chair without having to think about it. That could be just a policy. But ultimately, we want systems to do hierarchical planning. How do we solve that? This is an unsolved problem. If you’re a roboticist or an AI for
20:57 — robotics kind of person or agentic AI kind of person, if you’re doing a PhD on this topic, this is a great topic. It’s completely open. Nobody knows how to do this. Nobody has proved that they know how to do this. Okay, so now the big question is how are we going to train those world models? Okay, hierarchical or not? Let’s say non-hierarchical to start. So first of all, we have to figure out what architecture to give them. And a natural instinct in these days
21:30 — is to train a generative model. And in fact, I’ve been working on sort of trying to train world-model-like things for about 15 years, mostly failing for the first 10, because I was trying to train generative models. Okay, what’s a generative model? So, self-supervised learning has been incredibly successful, astonishingly successful, in the context of language, right? You take a string of words,
22:00 — you remove some of the words, you corrupt the input, and then you run the corrupted input through some big neural net, and you train it to recover the missing parts. Okay? That works amazingly well for text. So there are original models like BERT that used to do this. An LLM is a special case of this where the only word you remove is the last one, so that the entire system is trying to just produce the next word in a sequence. Okay? But it works amazingly well, and it scales if you do it right.
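[Editorial sketch, not from the talk.] The two ways of building training examples described above (mask some words and recover them, or hide only the last word) can be sketched as follows. The function names, the mask symbol, and the masking probability are illustrative assumptions, not any library’s API.

```python
# Building self-supervised training pairs from raw text: masked (BERT-style)
# and next-token (LLM-style). Function names are illustrative, not a library API.
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Corrupt the input by masking some tokens; targets hold the removed words."""
    rng = rng or random.Random(0)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok      # position -> word the model must recover
        else:
            corrupted.append(tok)
    return corrupted, targets


def next_token_pairs(tokens):
    """Special case: for each prefix, the only word removed is the one that follows."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3, rng=random.Random(1))
print(corrupted, targets)
print(next_token_pairs(sentence)[-1])
```

In both cases the prediction target is drawn from a finite vocabulary, which is what lets a model output a probability distribution over all possible answers.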
22:31 — It doesn’t work if you apply it to video. So if you take a a video and then you show the initial segment of the video to the system, and you ask it to predict what’s going to happen next at a pixel level, it doesn’t really work. Like the representations you get out of the system for your video are not particularly good. Um And the reason is you simply cannot predict everything that takes place in a video. There’s an infinite number of plausible things. In text, it’s easy because there is only a finite number of words, and so you can
23:01 — get the system to produce a probability distribution over all possible words or tokens in your dictionary. But you can’t do this with video, right? There’s just an incredibly large number of possible video frames. Let me take an example. If I take a video of this room, right? I start here, and I kind of slowly rotate the camera, I stop here, and I ask the system to continue the video. You know, it’s probably going to predict, you know, we are in some sort of empty classroom, auditorium, and you know,
23:31 — the room has a finite size, there might be windows on this side, and things like that. There’s absolutely no way the system can predict what all of you look like. Or which, you know, chairs are unoccupied. Um it’s just impossible. You just don’t have the information. So, when you train a system to make this kind of prediction, you kill it. Now, of course, you’re going to tell me, “Oh, but we can train system to produce cute videos, right? Video generation.” Yes, but this prediction usually is done in
24:01 — representation space, not in pixel space. It’s only a second stage that actually turns the predictions into high-resolution uh high-frame-rate uh uh videos. And the system only needs to produce one cute-looking video. It doesn’t need to actually represent all plausible videos. Uh so, which is a much simpler problem. Okay, and as I said, I’ve been kind of attempting to work on this for the better part of the last 15 years. So,
24:31 — this is a 10-year-old paper where we tried to train some neural net to predict, you know, short video clips, two frames from four frames of context. You get blurry predictions. Why? Because the system predicts the average of everything that can happen. Of course, you can correct that with latent variable models, like diffusion models, which we didn’t know at the time. We tried to use GANs and stuff like that. Wasn’t too successful. But, you know, perhaps using latent variable models would help,
25:01 — diffusion models in particular, which of course produce cute videos. Do they actually understand the world? The evidence is no. So here’s my solution. My solution is an architecture I called joint embedding, or more precisely joint embedding predictive architecture, JEPA, which is shown on the right. Okay, on the left you have the generative architecture. You observe X, maybe you observe A, an action that is taking place, and you observe the result Y, and the
25:31 — system is trying to reconstruct Y in its most minute details. With JEPA, you observe X and Y and A, but you encode both X and Y, and the prediction takes place in that representation space. Okay? Major difference. What the system can do, by constructing a representation of Y, is essentially eliminate all the information about Y that is simply not
26:03 — predictable. Right? And that makes the prediction more abstract, with fewer details, but more accurate in a way. Um, so you know, how do you train a generative model? It’s easy to train a generative model because the cost is just a reconstruction cost. It’s just going to, you know, you’re just training it to reconstruct. You can train it as an auto encoder, but then you need to restrict the information content in the code, or as denoising auto encoder, which is what a lot of uh,
26:34 — techniques have attempted to do, like masked auto encoders and things like that. So that means taking an input, corrupting it in some ways, and then training an auto encoder to recover the initial one. And by the way, diffusion models are a bit of a special case of this sort of general idea of denoising. So, the bad news is, when you train systems of this type to learn representations of images, you don’t get good representations.
27:05 — If you use the representation of images obtained this way and feed it to a downstream task that you train supervised, okay, you train a head supervised, the results you get are not great. To get good results, you have to use joint-embedding architectures. All the best systems that use self-supervised learning to train image or video representation systems use joint embedding. None of them uses reconstruction. Okay? All the best ones.
27:35 — Let’s say you apply this to images. Either you have two views of the same scene, and you train a neural net to produce representations, and you tell the system, “I want those two representations to be identical,” or you use this corruption technique: you take an input, you corrupt it or transform it in some ways, and then, you know, train this JEPA architecture to predict the
28:05 — representation of the original image from the representation of the corrupted version. Okay, there’s a big issue with this, which is that the system can collapse. Now, generative models can actually collapse to some extent. Like, if you try to train an autoencoder without a restriction on the information content of the code, your autoencoder is just going to learn the identity function, and that’s a collapse. It’s not going to learn anything useful.
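[Editorial sketch, not from the talk.] The difference between a reconstruction loss and a joint-embedding prediction loss, and the trivial solution that the latter admits, can be shown with tiny linear maps. The matrices and vectors below are arbitrary illustrative values, and `generative_loss` and `jepa_loss` are hypothetical names, not the author’s code.

```python
# Generative loss (input space) versus JEPA loss (representation space),
# with tiny linear maps standing in for the encoder, predictor and decoder.


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def generative_loss(x, y, encoder, decoder):
    """Reconstruct y itself, in every detail, from x."""
    return sq_dist(linear(decoder, linear(encoder, x)), y)


def jepa_loss(x, y, encoder, predictor):
    """Predict the representation of y from the representation of x."""
    return sq_dist(linear(predictor, linear(encoder, x)), linear(encoder, y))


x = [1.0, 2.0, 3.0]   # e.g. a corrupted view
y = [1.5, 2.5, 2.0]   # e.g. the original view

identity2 = [[1.0, 0.0], [0.0, 1.0]]
encoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # keeps 2 of 3 input dimensions
decoder = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # maps the code back to input space
collapsed = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # ignores its input entirely

print(generative_loss(x, y, encoder, decoder))    # penalized for the dropped dimension
print(jepa_loss(x, y, encoder, identity2))        # nonzero: a real prediction error
print(jepa_loss(x, y, collapsed, identity2))      # zero: the trivial collapsed solution
```

The last line is the problem in miniature: an encoder that outputs a constant makes the prediction error vanish, so minimizing that error alone cannot be the whole training criterion.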
28:35 — Similarly, a system like this can collapse, and how can it collapse? It can essentially completely ignore the inputs and produce constant representations, and then the prediction problem is trivial. So, if you’re just training a system of this type to minimize the prediction error, it’s going to collapse. It’s not going to do anything useful for you. So, the whole trick of how you do self-supervised learning for joint embedding systems is how you
29:05 — prevent collapse. And there is my favorite concept for this. I’ll talk about other ways to do this, but my favorite concept to prevent collapse is information maximization. Okay? So, you basically come up with some objective function that measures some sort of information content of the representation that comes out of your encoders, and you try to maximize that information
29:36 — content. Okay? So, your cost function is minus the information, or whatever. So, there’s a bunch of techniques for this, you know, from the last 6 or 7 years, with names like MMCR, MCR squared, W-MSE, SIGReg, VICReg, and Barlow Twins. Barlow Twins, VICReg, and SIGReg come from people working with me, the other ones from other groups. MCR squared comes from Berkeley and MMCR
30:09 — from a colleague at NYU Neuroscience, Eero Simoncelli. So, this idea of JEPA is gaining popularity. There’s about 1,700 papers that mention joint embedding predictive architecture, spelled out, on Google Scholar. Okay, so there’s an issue with this type of method, which is: how do you measure information content? We need to have a cost function that is a differentiable measure of information
30:39 — content, so we can back-propagate gradients and maximize it. And the bad news is, first of all, we don’t actually have objective measures of information content, because all the proper definitions are based on knowing the distribution of the vectors, or whatever, that you want to measure the information content of. And we don’t know the distribution. We only have samples coming out of an encoder. So, how do you compute information content from a finite number of samples? Okay, that’s the first problem. Second
31:10 — problem is, to maximize something you would need a lower bound on information content, so that when you maximize, you push the actual information content up. Problem is, the empirical measures that we have are all upper bounds. So, what do we do? We come up with a good upper bound and we cross our fingers. And we show some theorems and whatever. Okay. So, this technique
31:40 — and many others, like, the way to properly explain how you can train self-supervised learning systems, and every learning system really, is a framework I call energy-based models, that I’ve been advocating for 20 years or so. The basic idea is like this. If you want to capture the dependency between two variables, X and Y, there is no real functional relationship between X and Y. There’s no single Y for a given X. Right? It’s just a dependency, but
32:11 — it’s not a function. Like, it’s a relation or some kind of mapping, but not a function. So, indicated by the diagram on the right here, you have a bunch of data points, so those are the black dots, and they indicate some sort of dependency between X and Y. How do you capture this dependency given that you cannot run a function that computes Y from X? So, one way to do this is to learn or build a contrast function, or
32:42 — energy function that tells you whether a point in this XY space is near the training data or not. Okay? So, we think of it as some sort of landscape where the black dots are in the valley. Okay? In Switzerland, it would be a lake. And then, you know, you get level curves, right? As you move outside of those regions, the altitude goes up. Okay? The
33:13 — energy goes up, right? Now, if I give you a value for X, you can infer you can give me a bunch of values for Y that are compatible with X. They are values of Y that minimize the energy, right? So, it’s the kind of inference I was talking about earlier, inference by optimization, not by forward propagation. Uh but you can also possibly do it the other way around. If I give you a Y, you can infer X uh from Y. And you can give me multiple answers. So, in situations like like video prediction, where there is a basically an infinite number of possible answers,
33:44 — the proper way to train a system of this type is to think of it in terms of energy-based models. And by the way, probabilistic models are a special case, where your energy has a particular form and the way you train it uses a particular loss function. So, it’s a slightly more general framework, if you want, than probabilistic inference and learning. [clears throat] Okay, so to train an energy-based model, you have to prevent collapse. The collapse problem I was telling you about before will be manifested by the energy
34:16 — function being flat everywhere. You train the system to minimize the energy for a bunch of training samples, and what the system gives you is an energy function that is zero everywhere. That’s what an autoencoder that learns the identity function does to you. A JEPA that ignores the input and produces constant representations has zero prediction error for everything. So, it’s a collapse. To prevent collapse, you need to do one of two things. One is contrastive methods. You generate points outside the region of data, and you push the energy
34:46 — up. Okay? You come up with some cost function that makes sure the energy of the data points comes down, and the energy of other points is higher. And there’s a whole bunch of them. And there is another set of methods which I’ve come to prefer, uh regularized methods, which work by minimizing the volume of space that can take low energy. Okay? So, if you push down the energy of certain regions, the rest has to go up because there is only a small volume of low energy to go around.
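The landscape picture above can be made concrete with a toy example. The following is a minimal sketch, not the speaker’s code: the energy function, the step size, and the hinge-style contrastive cost are all illustrative assumptions. It shows inference by minimizing the energy over Y for a fixed X (with two compatible answers), and a contrastive cost that penalizes a flat, collapsed energy.

```python
# Toy energy over (x, y): low when y*y is close to x, so for x > 0 there are
# two compatible answers, y = +sqrt(x) and y = -sqrt(x). Purely illustrative.
def energy(x, y):
    return (y * y - x) ** 2


def infer_y(x, y_init, lr=0.01, steps=200):
    """Inference by optimization: gradient descent on y with x held fixed."""
    y = y_init
    for _ in range(steps):
        grad = 4.0 * y * (y * y - x)  # dE/dy
        y -= lr * grad
    return y


def contrastive_hinge(e_data, e_negative, margin=1.0):
    """Zero only when the negative point sits at least `margin` above the data."""
    return max(0.0, margin + e_data - e_negative)


# Different starting points land in different valleys of the same landscape.
answers = [infer_y(4.0, y0) for y0 in (-3.0, 3.0)]

# A collapsed (flat) energy gives every point the same value and is penalized;
# a landscape that is higher away from the data is not.
flat_penalty = contrastive_hinge(0.0, 0.0)
shaped_penalty = contrastive_hinge(energy(4.0, 2.0), energy(4.0, 0.0))
```

Starting the descent from several initial values is one simple way to surface the multiple answers the talk mentions; a regularized method would instead constrain how much of the space can have low energy.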
35:16 — Um So, how do you sort of reduce this to practice? Okay? One of those two methods. Okay. So, um let’s go back to this idea of information maximization. Uh so, I want to train this uh JEPA using uh some measure of information. Uh let’s say I run a batch of samples through uh one of the encoders. I get a matrix where each row is the representation for one sample. Each
35:47 — column is the value of one variable in the representation for all samples. Okay? There are two ways to make that matrix informative. One way is to make sure all the rows are different. Another way is to make sure all the columns are different. You want to make sure the columns are different because if all the columns are the same, that means every variable in the representation carries the same information. And of course, that’s not very informative. So, you want, you know,
36:17 — each variable in the representation to be maximally disentangled from the other ones, to give you information independent from the other variables. Okay? Um so, that would be an example of what we can call dimension contrastive methods, which is a form of regularized method. And then at the bottom, the type of criterion that makes the rows all different, those are contrastive methods or sample contrastive methods.
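One way to picture the "make the columns different" criterion is a penalty on the covariance between representation variables, computed over a batch. This is a minimal sketch under that assumption; the function names and the tiny batches are made up for illustration, and decorrelation is a weaker property than the independence the talk is aiming for.

```python
def column_covariance(batch):
    """Covariance between representation variables (columns) over a batch (rows)."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in batch:
        for j in range(d):
            for k in range(d):
                cov[j][k] += (row[j] - means[j]) * (row[k] - means[k]) / (n - 1)
    return cov


def redundancy_penalty(batch):
    """Sum of squared off-diagonal covariances, averaged over dimensions."""
    cov = column_covariance(batch)
    d = len(cov)
    return sum(cov[j][k] ** 2 for j in range(d) for k in range(d) if j != k) / d


# Both variables carry the same information: high penalty.
redundant = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
# The two variables vary independently across the batch: zero penalty.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

redundant_penalty = redundancy_penalty(redundant)
decorrelated_penalty = redundancy_penalty(decorrelated)
```

Minimizing such a penalty acts on the columns of the batch matrix, whereas a sample contrastive loss acts on its rows.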
36:48 — Okay, sample contrastive methods are very popular for certain applications. A lot of the perceptual pipelines in LLMs are trained with a technique called CLIP, which basically is a contrastive method that does joint embedding between images and text. But I prefer the other one. So this idea that you need to find an abstract representation of an input to be able to make predictions is actually very natural. We do this all the time as humans.
37:18 — We do this all the time as scientists and engineers. Animals do it too. Let me explain why. In principle, I could explain or simulate everything that takes place in this room at the moment at the level of quantum field theory or particle physics, right? Could simulate the trajectory of every particle in this room. And that would go down to actually simulating all of our brain processes and everything. So in principle, running the simulation I could figure out if you know any of you actually
37:48 — understands the word I’m saying or not. Okay? Or if you are sleeping right now. Or if you are actually bored, right? Okay, but of course that’s completely impractical. And then you know what we do in science is that we invent abstractions to allow us to make predictions and those abstractions ignore a lot of details about the state the underlying state of the system. So we invent those abstractions you know from quantum field to particles, atoms, molecules, proteins, organelles, cells, organisms,
38:19 — individuals, societies, ecosystems. Every level in this hierarchy is a particular level of abstraction with which we describe uh the world. Which allows us to make longer range predictions, if you want, than the levels below by ignoring a lot of details about the level below. Which is why the way to understand what goes on in this room at the moment is more at the level of psychology than at the
38:50 — level of particle physics, right? Now, of course, physicists always make fun of everyone saying like, you know, you just apply physics, right? Even psychology is applied physics to some extent. Uh but in fact, you know, there is uh you know, specific knowledge about chemistry that does not derive directly from physics, right? So so this abstraction actually kind of contains uh new knowledge or information or structure, if you want, that was not apparent at the level below.
39:21 — So this idea of JEPA really kind of builds on this concept that you need to find an abstraction to be able to make predictions. Uh let’s say you want to design an airplane. You need to design the air flow for the airplane. You do computational fluid dynamics, right? You simulate the flow of air around the wing. Uh you model the state of the air in every little cube around the wing by basically the velocity and the density and things
39:51 — like that. And and then you solve Navier-Stokes partial differential equations. And that simulates the flow of air. But in fact, it’s ignoring a huge amount of details in the underlying mechanism. The underlying mechanism is molecules of air bumping into each other and bumping on the plane. But you never simulate fluids at that level. It’s just too complicated. And also it would diverge from reality really quickly because it has too many details. So, you have to ignore details to be able to make accurate long-term
40:22 — predictions. Um and so, we do this in science all the time. And so, world models should not be simulators. Right? They should work in abstract space. They should not be digital twins, you know, that’s a buzzword. They should definitely not be generative models, as I just explained. Uh and they should not be video generation. So, a lot of people are working on video generation and they call this world models. They They’re not world models. They’re video generation systems. Uh so, one one big message from my talk
40:54 — is that if if you want to use world models, do not work on video generation. This is a different problem, okay? If you want to produce cute videos, work on video generation. But if you want to like control robots or industrial processes or understand the world, do not work on generation. Um you want world models to control complex systems where you cannot model the dynamics of the system by
41:24 — writing a bunch of equations. Okay, if you have a humanoid robot or, you know, any kind of robot, you can just write down the dynamical equations and then simulate the dynamics of the robot and you can get your humanoid robot to do somersaults and kung fu and whatever, right? That’s simple. Uh as soon as the robot starts to interact with the real world, that’s a lot more complicated. Um and that is actually more difficult to reduce to simple equations. But then, you know, think about a
41:54 — complex system, like I said, a turbojet or uh I don’t know, a chemical plant or a patient uh or a robot, but a robot that interacts with the real world in complex ways. You cannot reduce this to a small number of equations. What you have to do is basically learn a energy model of the whole system, the system you control and its interaction with the environment, uh so that you can make predictions and you can plan a sequence of actions to arrive at a particular uh outcome.
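Planning a sequence of actions with a learned model can be sketched very simply. In the example below, `world_model` is a hypothetical stand-in for a trained predictor (here just `state + action`), and random shooting is swapped in as the simplest possible planner; neither is claimed to be what the speaker’s systems use.

```python
import random


def world_model(state, action):
    """Hypothetical stand-in for a learned predictor: next state from state and action."""
    return state + action


def plan(start, goal, horizon=5, candidates=500, seed=0):
    """Random-shooting planner: sample action sequences, roll each out through
    the model, and keep the one whose predicted final state is closest to the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        state = start
        for a in actions:
            state = world_model(state, a)
        cost = abs(state - goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


actions, cost = plan(start=0.0, goal=3.0)
```

The key design point is that the search happens entirely inside the model’s predictions; the real system is only touched when the chosen actions are executed.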
42:26 — So that’s a world model. I mean, the concept is very old. It goes back to the 1960s. Uh it’s the root of optimal control. Um and uh okay, so now I come down to a particular technique that I’m very fond of, which I think um we’re going to expand over the next uh few months and years to do this information maximization uh that I was telling you about earlier. And it’s called SIGReg. That means sketched isotropic Gaussian regularization.
42:56 — Okay, the trick here is the following. You run a batch of samples through your encoders, and what you get is a bunch of points in a vector space whose dimension is whatever the dimension of your representation space is. We’re going to try to make the distribution of those points an isotropic Gaussian with the same uh variance in all dimensions. Why? Because an isotropic Gaussian is a distribution where all the variables are independent. Okay? So they’re maximally informative
43:28 — individually. Uh and it’s also the distribution that has maximum entropy for a given variance, but we don’t really care about that. Um what’s interesting is that it makes the variables independent of each other. Okay, so how do we do this? Now, of course, we don’t have the distribution. We just have a bunch of points uh in that space. And it may be a high-dimensional space like 2,000 dimensions. And we may have, you know, a few hundred or a few thousand points. Like, how can we make sure this is uh a Gaussian? So here’s the trick.
43:58 — Uh the trick is you project the individual points along a single direction. And what you get is a marginal distribution. Okay? Now, of course, you still have discrete points. You don’t have a continuous uh density. You have discrete points. Okay, so one trick you can do is compute the cumulative distribution that those points give you, right? So, it’s a staircase, right? Because you have discrete points in one dimension.
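The staircase described here is the empirical cumulative distribution of the projected points. As a minimal sketch, one can measure how far that staircase is from the cumulative distribution of a standard Gaussian; using the largest vertical gap is an assumption made for illustration, and other distances between the two curves would work as well.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)


def cdf_gap(projected):
    """Largest vertical gap between the staircase (empirical CDF) of 1-D points
    and the standard Gaussian CDF."""
    xs = sorted(projected)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        target = STANDARD_NORMAL.cdf(x)
        # The staircase jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs((i + 1) / n - target), abs(target - i / n))
    return gap


n = 100
# Points placed exactly at Gaussian quantiles: the staircase hugs the curve.
gaussian_like = [STANDARD_NORMAL.inv_cdf((i + 0.5) / n) for i in range(n)]
# A collapsed representation: every point projects to the same value.
collapsed = [0.0] * n

small_gap = cdf_gap(gaussian_like)
large_gap = cdf_gap(collapsed)
```

A collapsed set of points produces a single giant step, which is as far from the smooth Gaussian curve as a centered staircase can get.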
44:29 — And then what you can do is you can ask, “What is the distance between the staircase, the empirical cumulative distribution of my points, and the cumulative distribution of, let’s say, a Gaussian?” You can do that, because you know what the Gaussian looks like. And for every point on the staircase, you can tell if it’s to the left or to the right of the ideal Gaussian. And so, that gives you a gradient. Like, do I move the point this way or that way? In that projection. Okay? It gives you a gradient. Now, for
45:00 — every training sample uh in your batch. Okay. Now, if you optimize this cost function by gradient descent, it’s going to make the marginal uh of that distribution along this projection Gaussian. But now, there’s a theorem that says, if you do this along lots and lots and lots of directions, in the limit, your joint distribution is actually an isotropic Gaussian.
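The procedure as described in the talk (project, compare with the Gaussian, move each point left or right, repeat over many directions) can be sketched on raw points in two dimensions. This is an illustration of the spoken description only: the released SIGReg code may use a different one-dimensional statistic, and here each point is simply moved toward the Gaussian quantile of its rank. The theorem alluded to is presumably the Cramér-Wold result that a distribution is determined by its one-dimensional projections.

```python
import math
import random
from statistics import NormalDist

NORMAL = NormalDist(0.0, 1.0)


def cdf_gap(values):
    """Max gap between the empirical CDF of `values` and the Gaussian CDF."""
    xs = sorted(values)
    n = len(xs)
    return max(
        max(abs((i + 1) / n - NORMAL.cdf(x)), abs(NORMAL.cdf(x) - i / n))
        for i, x in enumerate(xs)
    )


def gaussianize(points, iterations=400, step=1.0, seed=0):
    """Repeatedly pick a random direction, project the points onto it, and move
    each point along that direction toward the Gaussian quantile of its rank."""
    rng = random.Random(seed)
    pts = [list(p) for p in points]
    n = len(pts)
    targets = [NORMAL.inv_cdf((r + 0.5) / n) for r in range(n)]
    for _ in range(iterations):
        angle = rng.uniform(0.0, math.pi)
        u = (math.cos(angle), math.sin(angle))
        proj = [p[0] * u[0] + p[1] * u[1] for p in pts]
        order = sorted(range(n), key=proj.__getitem__)
        for rank, i in enumerate(order):
            shift = step * (proj[i] - targets[rank])  # > 0: point is too far right
            pts[i][0] -= shift * u[0]
            pts[i][1] -= shift * u[1]
    return pts


# X-shaped cloud in two dimensions, loosely following the example in the talk.
ts = [-2.0 + 4.0 * k / 99 for k in range(100)]
cloud = [(t, t) for t in ts] + [(t, -t) for t in ts]
diag = (1 / math.sqrt(2), 1 / math.sqrt(2))
before = cdf_gap([x * diag[0] + y * diag[1] for x, y in cloud])
after = cdf_gap([x * diag[0] + y * diag[1] for x, y in gaussianize(cloud)])
```

In a training setting the same per-point shifts would be backpropagated into the encoder weights instead of being applied to the points directly.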
45:31 — Okay? So, what we need to do now is do many many projections. For all of those projections, compute those gradients. Uh you know, move the points or back propagate through the network, change the weights so that the points move so that the overall distribution gets more Gaussian. And uh if you apply this to a distribution like the one on the top left here, like an X, these are actually two dimensions among 1,024. And then you do a gradient descent. You just move the points here. You don’t train the neural net. You uh So, the
46:02 — technique I’m advocating for is on the left. You get something that’s sort of Gaussian-ish. Uh and this really works in practice. We actually applied it to um training world models that are action conditioned and we’ve used them for planning and it works decently. The source code is available. It’s very simple. You can train it on one GPU. Um and uh what we need to do with this technique is scale it up basically.
46:34 — There’s a few other things that we need to do, but that’s the main one. Uh and so in simple cases you can train this world model and you can use it to plan simple actions, like in Push-T or simple robotic situations in simulated environments. Um so that needs to be scaled up, but it’s sort of good work. There is a theoretical paper that uh we put out just a few days ago where if you make the hypothesis that the underlying distribution of your data is actually an isotropic
47:04 — Gaussian, if you assume that the observations you get from the world are some sort of complicated non-linear transformation of those points, like in this case some sort of spiral uh transformation, and you train a neural net with SIGReg on it, it will recover the original Gaussian in the representation space. Okay? So it’s not a proof that it works in every case, but it’s a proof that if your original
47:36 — explanatory variables are Gaussian, the system will recover those variables up to a rotation. Okay. So we can use those techniques in the context of self-supervised learning to train an image recognition system uh and there is another set of techniques which I should mention because they work really well and they are the ones that have been scaled up so far. SIGReg is conceptually my favorite method but it’s
48:07 — very recent and we haven’t scaled it up. Whereas those other methods that are based on distillation, we scaled them up and we got really good results both for images and video with techniques like I-JEPA and V-JEPA. So what’s the basic idea of those distillation methods? You still have those two encoders, so this is a JEPA architecture. You take an input, you transform it or corrupt it or mask it or something, and then you train the system, you know, to predict in representation space, but you don’t propagate gradient through the encoder on the right.
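In distillation methods of this kind, the encoder that receives no gradient is typically updated as an exponential moving average of the trained one, which is what the speaker goes on to describe. A minimal sketch of that update, with weights represented as plain lists and a decay value chosen only for illustration:

```python
def ema_update(target_weights, online_weights, decay=0.99):
    """Move each target weight a small step toward the matching online weight."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target_weights, online_weights)]


online = [1.0, -2.0]   # updated by gradient descent elsewhere
target = [0.0, 0.0]    # never receives gradients; only tracks `online`
for _ in range(3):
    target = ema_update(target, online, decay=0.9)
```

A decay close to 1 makes the target encoder change slowly, so the prediction targets stay stable while the online encoder is being trained.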
48:39 — Okay, those are two encoders with identical architectures and they kind of share the weights, but the funny thing is that the encoder on the right uses an exponential moving average over time of the weights of the encoder on the left. The encoder on the left gets gradients and gets updated all the time. The encoder on the right gets updated more slowly, essentially, and they share the weights. This is derived from some intuitive ideas
49:09 — from some people at Google DeepMind who were using techniques like this to stabilize the variance in reinforcement learning, and they realized you could apply this to self-supervised learning from images. They call this BYOL, bootstrap your own latent, and there is like a whole bunch of methods coming from Meta in particular, SimSiam, MoCo, etc., that use this exponential moving average idea, and a particular method called I-JEPA, which I show here. It produced really really good results
49:40 — and what we were able to do with I-JEPA is compare the results of I-JEPA with a generative approach called MAE, masked autoencoder. And it’s not only better but it is much faster to train. Um and another technique is called DINO. Many of you I’m sure have heard of it. I know some of you have used it because there were projects in the robot demos that actually used DINO. Uh So, this is done by some of my former colleagues at Meta in Paris and it’s
50:12 — completely self-supervised. It’s a joint embedding architecture. It’s using distillation, but with various tricks, which I’m not going to explain. Uh there’s a lot of engineering that goes behind it. And those systems basically at this time produce the best generic representations of images. If you have any type of vision task that you want to do, that’s probably the best technique, I mean the best encoder for images. Okay, but what we’ve done is among other things use DINO as an encoder and then train a world model and
50:42 — do planning. Let me show you just a cute video on this. If I can. Okay, so you have an initial state here of a kind of simulated environment that has pretty complex dynamics. And you have goals at the top and at the bottom what you see is the sequence of actions of a planner that uses this trained world model to get the world to a configuration as close as possible to the original one in less than 25 steps. Um and this uh
51:12 — has been applied to a number of different uh scenarios uh like double pendulum and Push-T and whatever. Um Now, that works really well. So, we uh more recently applied it to video. So, there you take a video, you mask a big chunk of it, and you train the JEPA to again produce good representations so that you can predict the representation of the full video from the representation of the partially masked one. Once the system is trained, you use
51:42 — the encoder as a way to extract features from the video and you train a head on top of it to uh accomplish some task and it works like really well. It’s state of the art for a lot of traditional vision tasks, particularly from video, like action recognition, action prediction and stuff like that. The one interesting thing that I want to mention instead of boring you with the table of results is that those systems, V-JEPA in particular, have learned some level of common sense. So one thing we can do with V-JEPA, because we train it to
52:13 — predict what’s going to happen next in the video. We can try to get it to do that. We can measure its internal prediction error. We can show it a video and monitor the internal prediction error at every time step. The system takes a window of 16 frames. So we just slide that window along the video and measure the prediction error for the next 16 frames. And the cool thing is that if you show it a video where something impossible occurs, something unphysical
52:44 — the prediction error shoots through the roof. So it’s like the little girl in one of the early slides who, you know, looks at the scene of the car not falling. Same thing. You have a video of a ball being thrown and the ball disappears. Prediction error will shoot through the roof. So that’s interesting because it’s the first time, at least from my point of view, that I’ve seen a completely self-supervised system acquire some level of common sense.
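The sliding-window surprise measurement can be sketched on a one-dimensional toy signal. Everything here is an assumption for illustration: `extrapolate` is a hypothetical constant-velocity stand-in for the trained predictor, the window is 4 steps rather than 16 frames, and the error is measured on raw positions rather than in a learned representation space.

```python
def extrapolate(context):
    """Hypothetical stand-in for a trained predictor: constant-velocity guess."""
    return context[-1] + (context[-1] - context[-2])


def surprise_scores(frames, window=4, predictor=extrapolate):
    """Slide a window over the sequence and record the prediction error
    for the frame that follows each window."""
    scores = []
    for t in range(window, len(frames)):
        predicted = predictor(frames[t - window:t])
        scores.append((t, abs(predicted - frames[t])))
    return scores


# A ball moving at constant speed that vanishes (position reads 0) at step 10.
positions = [float(t) for t in range(10)] + [0.0] * 6
scores = surprise_scores(positions)
most_surprising_step = max(scores, key=lambda s: s[1])[0]
```

The error stays at zero while the motion is predictable and peaks at the step where the ball disappears, which is the behavior the talk describes for unphysical videos.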
53:16 — It’ll tell you what’s possible, what’s not possible. Um Let me skip this. It’s cute but it just says V-JEPA can be used for planning and, you know, there are new versions of this that do a better job at planning and everything. But here is an interesting thing. Remember I told you the way babies learn that the world is three-dimensional is because it’s the best way to explain how your view of the world changes when you move your head. Okay, so we took the representation learned by uh some version of V-JEPA, V-JEPA 2.1.
53:47 — And then we trained a head on top of it to predict depth from a single image. And it does a really good job. Um it produces really good results. In fact, better than, you know, V3. Uh and uh what that shows is that this system, by just being trained to predict what’s missing, you know, fill in the blanks in videos at a representation level, basically understands that the world is three-dimensional. I mean, understands with double quotes.
54:17 — Understands the notion of object. If you use the representation as input to a segmentation system, it works decently well. Uh and for, you know, various other things. Okay, let me conclude. So, it’s funny, huh? So, abandon generative models. I mean, if you work on LLMs, of course. But you should not work on LLMs. Uh at least if you’re in academia, you
54:47 — should absolutely not work on LLMs. There is nothing you can bring to the table. Uh so, abandon generative models in favor of joint embedding architectures. If you are interested in, you know, uh intelligence, abandon probabilistic models in favor of those energy-based models. I didn’t have time to really kind of explain why. Uh you know, I made an argument in favor of those regularized methods or information maximization through
55:17 — variables instead of samples. So, abandon contrastive methods, which again have a lot of practical applications. I’ve been saying forever to abandon reinforcement learning. I don’t really mean abandon. I mean, minimize its use because it’s so horribly inefficient in terms of sample efficiency. And I know there are people here who work on this, but uh, you know, RL is like what you do when you’re desperate and there is nothing else you can do.
55:47 — Okay. Well, you have to do most of the learning, you know, by observation. Uh you know, learning world models, blah blah blah. And once you have good representations, you know, you can use RL on top of it because you already have the good representations. Uh you won’t require too many samples. Sometimes you can’t avoid it. Uh and certainly, if you’re interested in making real progress in AI, in sort of grounded, you know, AI for the real world, if you want, physical AI, don’t work on LLMs. Don’t work on
56:17 — generative models, either. So, as you can probably guess, this does not make me very popular in Silicon Valley. Yes. Um and so, I left Meta, as many of you probably know, at uh the end of last year and formed a new company called AMI Labs. And the purpose of uh AMI Labs is sort of AI for the real world, like, you know, physical AI. Uh robotics is a use case, but it’s not just that. It’s control of uh industrial processes. Like, anything that is
56:48 — high-dimensional, continuous, and noisy, for which LLMs are completely helpless. Um this is the kind of problems we’re working on. And that’s it. Thank you very much. » [applause] » Okay, so I know there’s many questions. Maybe we’ll take, you know, one or two, but
57:19 — then we have to to wrap up. So, and please quick questions and quick answers. » Thanks for the talk. Uh I wanted to ask about the guardrails that you mentioned on the uh one of the earlier slides where you also talked about MPC. Engineers love MPC cuz they can put in their constraints, describe them in state space, like 3D space. But from what I understand, in your uh system, everything works in representation space. How do I even get a constraint like don’t bump into the wall into this representation space? Do you envision
57:50 — the system learning the constraints by itself, or can engineers really put them in? » No, you would have to learn uh a very small head on top of your representation that maps uh that to, you know, the constraint that you’re interested in. Uh so, that part has to be trained, but you can train it with a very small number of samples because it’s tiny, basically it’s just a projection. » But you need a different encoder for each kind of uh constraint that you might want to put in, no? » Well, you need a different projector
58:21 — for each constraint, right. So, if your task is to like open a door, I’m not talking about constraint, I’m talking about like a task objective. Uh you need some cost function to tell you like is the door open or not, right? And uh so, that might have to be trained when you’re uh trained to accomplish the task, but basically that requires two samples. All right. » Okay, I think we’ll have to leave it here. Thank you, Yann, very much. » All right. Thank you. » [applause]
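The "very small head on top of your representation" from the answer above can be sketched as a logistic projection fitted on a handful of labeled, frozen representations. The two-dimensional features, the labels, and the function names below are all made up for illustration; a real system would use the encoder’s actual outputs and whatever cost the planner expects.

```python
import math


def train_head(features, labels, lr=1.0, steps=2000):
    """Fit a tiny logistic head (weights + bias) on frozen representations."""
    w, b = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, y in zip(features, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            for j, zj in enumerate(z):
                grad_w[j] += (p - y) * zj / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def violation_cost(z, head):
    """Predicted probability that representation z violates the constraint."""
    w, b = head
    return 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))


# Made-up 2-D representations; label 1 means "touching the wall".
frozen_features = [[0.9, 0.1], [0.8, 0.5], [0.2, 0.3], [0.1, 0.9]]
labels = [1, 1, 0, 0]
head = train_head(frozen_features, labels)

near_wall_cost = violation_cost([0.85, 0.3], head)
free_space_cost = violation_cost([0.15, 0.6], head)
```

Because only the head is trained and the encoder stays frozen, a cost like this can be added to a planner’s objective for each new constraint or task without retraining the representation.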