This year has been a wild ride for AI progress, and as we wrap up 2024, here are some notable figures giving their take on what we've been through and what's to come. Let's look at AI leaders like Andrew Ng, Dr. Jim Fan, and other researchers and educators talking about what they believe is coming next and what has excited them this past year. Here's one of my favorite AI and robotics researchers out of NVIDIA, Dr. Jim Fan. A lot of the stuff he and his team did early on in the ChatGPT era was absolutely mind-blowing: Eureka was one of his projects, the Minecraft Voyager agent was another, and now they're working on Project GR00T, which focuses heavily on robotics. NVIDIA is of course well known for training these robots in simulation, where time flows at something like 10,000 times the rate it does in our universe, so to speak, while all the physical laws (friction, gravity, and so on) are replicated very realistically. They throw a bunch of these robots into a simulation, the robots train themselves to pick things up and do various tasks, and then that learning is transferred into real-life physical robots. That sim-to-real transfer seems to be working really, really well. Here Dr. Jim Fan covers some of what happened this year. He writes: "Say AI one more time and we'll end 2024 for good. It's been a wild ride." And I like this line: "The year went on like a diffusion model: we watch our sci-fi visions gradually denoise and materialize. So let's celebrate, one line each, lightning round. Are you ready?" So we have a couple of chapters here. Chapter one, robot hardware: "We are the last generation without advanced robots everywhere. Everything that moves will be autonomous." This is the same thing he was saying earlier, and here's why we should be so excited about this.
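The train-in-simulation idea can be sketched with a toy example: an agent earns a positive reward when it succeeds and a negative one when it fails, and a policy emerges over many simulated episodes. This is a minimal, illustrative sketch (a tabular toy problem, not a physics simulator like the ones NVIDIA actually uses); every name in it is made up.

```python
import random

# Toy sketch of the simulation-training idea: an agent on a 1-D
# "obstacle course" earns +1 for the right move and -1 for falling,
# and tabular Q-learning discovers the policy over many episodes.
# Illustrative only: real locomotion training runs in a physics
# simulator, not a toy MDP like this one.

N_STATES, ACTIONS = 5, ["step", "jump"]
# There is an obstacle at state 2: you must "jump" there, "step" elsewhere.
correct = {s: ("jump" if s == 2 else "step") for s in range(N_STATES)}

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
random.seed(0)

for episode in range(2000):
    s = 0
    while s < N_STATES:
        # epsilon-greedy: mostly exploit, sometimes explore
        if random.random() < 0.1:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        reward = 1.0 if a == correct[s] else -1.0  # +1 success, -1 "fall"
        nxt = s + 1 if reward > 0 else N_STATES    # falling ends the episode
        future = max(Q[(nxt, x)] for x in ACTIONS) if nxt < N_STATES else 0.0
        Q[(s, a)] += 0.1 * (reward + 0.9 * future - Q[(s, a)])
        s = nxt

policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
print(policy)  # -> ['step', 'step', 'jump', 'step', 'step']
```

Nothing here was hand-programmed in the "jump over the rock like this" sense: the agent only ever sees rewards, and the jump-at-the-obstacle behavior falls out of millions of tiny trial-and-error updates, which is the same logic the robot-dog clips illustrate at much larger scale.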
The rise of high-end humanoids: the reason a lot of people are building humanoids is that the world is built for us, for our shape. All the facilities, appliances, and tools are designed for our form factor; if you think about it, a lot of things are built to be used by a hand with five digits. So we have the Tesla Optimus. Very few humanoid companies have the courage to show a live, interactive demo in the wild; Tesla did it at the "We, Robot" event, and the Gen 3 hand with its 22 degrees of freedom is very much ahead of the game. By the way, if any of you know exactly what the breakdown was at that event, whether the bartenders were teleoperated or whether that was completely autonomous behavior by the robots, please let me know in the comments. The dances were probably pre-scripted, but I know people are saying different things about the bartenders, so I'm not 100% sure. Then we have the 1X Neo, a friendly neighborhood humanoid aiming to deploy massively in homes; Boston Dynamics' Atlas, the heavy-duty champion, whose 360-degree joints unlock some insane gymnastics; and Figure, with fast iteration speed from prototype to car-factory deployment. They're working with BMW, probably building cars either now or very soon. Fourier Intelligence's GR-1 robots are among the few in mass production, with thousands being shipped worldwide. Then there's Clone, with a Westworld-style design and biomimetic muscles and tendons, a fresh new perspective on how humanoids can materialize. Have you seen this thing? Their little commercials for the robot are, I've got to say, kind of wild, because they seem more like a movie trailer for some horror flick than for a tech company. So this is Clone Robotics (I should probably give them a follow), and here's what that looks like. I mean, is it just me, or is that a little bit disturbing? You know what I mean?
I mean, why is it twitching like that? And where are the legs? They're coming soon. It's an absolutely incredible feat of technology, but something about it just creeps you out a little bit, doesn't it? It's very cool. Oh God, yeah, see, this is a horror flick, but still very impressive nevertheless. All right, I'm going to keep going here. So that one was Clone, and I'm sure no one will have nightmares for life because of that. Continuing: the rise of inexpensive robot hardware, cheaper than cars, highly scalable, and affordable to most of the middle class in the near future. This is what's really interesting to me, because this won't just be for ultra-wealthy, upper-class individuals. As with every new technology rollout, people with a lot of disposable income in a way finance these things: you start by selling to people with extra cash on hand, which helps sustain the manufacturing cycle; eventually the costs come down, and you're able to get the product to more and more people. Because of the scale involved and the incentives to get this out to as many people as possible, we might see that process happen a lot faster here, especially since companies like Tesla and NVIDIA have a lot of money to throw at it. So we might see some of these robots become very affordable to middle-class families very soon. Next we have the Unitree G1 humanoid: around $40,000, weighing 77 lb and standing about 50 in tall. Small but mighty and agile. Also, if it goes rogue, I feel like that's manageable, right? If it were 200 lb of metal, I'd be a lot more concerned, but this feels manageable. Then we have the
Unitree B2-W, a robot dog on four wheels that out-maneuvers most animals on Earth. You might have seen this thing doing all sorts of interesting tricks; here's an example of it navigating terrain incredibly well. Look at that agility. It's probably close to exceeding, or already exceeds, most animals; there may be some mountain goats that are still as agile, but we're rapidly approaching a time when it will be more agile, faster, more accurate, and able to scale mountains and all sorts of terrain. Those are pretty impressive abilities, and keep in mind that a lot of this is trained in simulation. It's not as if some computer scientist sits there saying, "this is how you jump over a rock" or "this is how you go down this hill." None of this is "programmed," quote unquote. We throw the robots into a simulation and they try a million different ways to navigate this stuff. Every time one falls on its face, it gets minus one point, some sort of negative reinforcement, and every time it successfully navigates an obstacle, it gets plus one. Over millions, perhaps billions, of these simulated iterations, it slowly figures things out in that video game, that simulated reality; its neural nets, its brain, learn the skill, and then we can take that and put it into a real-life physical robot, and it just works. This idea of going from simulation to the real world is working really well so far. (I now really, really want one of these; wouldn't you?) Anyway, next is ALOHA. This is the thing I was talking about at the beginning, where we might be able to start building some of this stuff inside our
homes with these little kits, and ALOHA was one of the things that really signaled to me that this would be possible. First of all, this is what it looks like, and there's a full-blown mobile version that can roll around. You train it at home with teleoperation, but it learns to generalize a little. Here's an article on it: it learns to cook, clean, and do laundry. You start with teleoperation, doing the tasks for it, and then it picks up how to do them and actually completes those tasks by itself. But here's the thing that jumped out at me: the entire setup, which includes webcams and a laptop with a consumer-grade GPU, costs around $32,000, much cheaper than off-the-shelf robots, which can cost up to $200,000. A lot of the code is on GitHub, including the parts list, how to run the simulation, and all the code you need. So we're getting very close to the point where we can start building our own, and it's only going to get cheaper and easier, with more and more tutorials. At some point, young kids might be doing this as a hobby: building robots to take care of their chores, picking up the room, picking up after the dog, weeding the garden. That might be something we see emerge over this next decade. We also have the Apple Vision Pro, which plays an interesting role in robotics as a data-collection device: it parses your head and hand poses in real time and controls the robot to mimic your actions. Then we have various embodied AI efforts, like Tesla Full Self-Driving, the largest physical-AI data flywheel in history, because all the Teslas on the road are constantly scanning and gathering data, which improves their ability to drive.
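The teleoperate-then-generalize recipe can be sketched as behavior cloning: record (observation, action) pairs while a human drives the system, then fit a policy to the demonstrations. The "policy" below is a nearest-neighbor lookup standing in for the neural network a real system trains, and every name in it is made up for illustration.

```python
# Sketch of the teleoperate-then-generalize recipe: record
# (observation, action) pairs while a human drives the robot, then
# fit a policy to those demonstrations (behavior cloning). The
# "policy" here is a nearest-neighbor lookup standing in for the
# neural network a real system would train; all names are made up.

def teleoperator(obs):
    """The human demonstrator: grip when the object is close."""
    return "grip" if obs < 0.3 else "reach"

# 1. Data collection: log what the human does across many states.
demos = [(i / 100.0, teleoperator(i / 100.0)) for i in range(100)]

# 2. "Training": the cloned policy imitates the nearest demonstration.
def cloned_policy(obs):
    nearest_obs, action = min(demos, key=lambda d: abs(d[0] - obs))
    return action

# 3. Autonomous execution on observations never seen during training.
print(cloned_policy(0.123), cloned_policy(0.876))  # grip reach
```

The point of the sketch is the workflow, not the lookup table: you do the task for the robot a number of times, and afterwards it acts on states it never saw during the demonstrations.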
The more data it collects and the more cars there are on the road, the stronger that flywheel effect becomes, improving and compounding its abilities as more cars drive more miles and get better. Interestingly, he refers to it as a powerful "photon-to-action" neural net. Then of course we have Project GR00T; I believe Dr. Jim Fan is the lead, or one of the leads, on that. Project GR00T is a moonshot initiative to build the AI brain for general-purpose robots; Jensen walked on stage at the SAP Center with ten humanoid robots in the background. And then we have HOVER: their team trained a 1.5-million-parameter foundation model that learns to coordinate a humanoid's motors, capturing the subconscious processing our cerebellum does every millisecond. So 1.5 million; I'm guessing that means a 1.
5-million-parameter foundation model, which is very, very tiny. Interestingly, something like GPT, a large language model, is slower but capable of more complex tasks, whereas our brains' subconscious processing happens quickly without us realizing it: it's not very complicated, but it needs to happen constantly and very fast. It sounds like HOVER is something similar, but for robots. Then DrEureka; we've covered this, and I don't understand why it didn't get as much attention as I feel it should have. Their team trained a robot dog to balance and walk on top of a yoga ball purely in simulation and then transferred it zero-shot to the real hardware. A large language model automatically writes the reward function and tweaks the parameters, "so we can watch Netflix and still get work done." In the Eureka and DrEureka papers, they actually use GPT-4 to write these reward functions; the rewards are then used to train the robots in simulation, the results are given back to the GPT-4 model, and GPT-4 is told to see if it can improve them somehow. It's an iterative process: an AI, a large language model, writing reward code to train robots in simulation and then iteratively improving it, which is kind of wild to think about. The other thing they noticed is that as tasks get more complicated, there's a divergence between the reward functions that humans write (and these humans are best-in-class robotics and machine-learning engineers) and the ones GPT-4 writes, and GPT-4 is actually better at some of the more difficult tasks than the humans are. That suggests that once these things get a little more complicated, AI will not only be able to come up with novel solutions that humans can't, but those solutions will also be better than what we can come up with.
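The iterative loop described here (an LLM proposes reward code, the code is scored in simulation, and the results feed back into the next proposal) can be sketched in a few lines. Everything below is a mock: `ask_llm` is a stand-in for a GPT-4 call, `simulate` is a toy evaluator rather than a real physics rollout, and all the candidate rewards are invented for illustration.

```python
# Mock sketch of the Eureka-style loop: an LLM writes a reward
# function as code, the function is scored in "simulation", and the
# result is fed back so the next proposal can improve. `ask_llm` is a
# stand-in for a GPT-4 call; every name here is made up.

def ask_llm(feedback):
    """Mock LLM: proposes successively refined reward-function code."""
    candidates = [
        "def reward(tilt, speed): return speed",              # ignores balance
        "def reward(tilt, speed): return -abs(tilt)",         # ignores progress
        "def reward(tilt, speed): return speed - abs(tilt)",  # balances both
    ]
    return candidates[min(len(feedback), len(candidates) - 1)]

def simulate(reward_fn):
    """Toy stand-in for training-in-sim: task success means the reward
    ranks a fast, balanced gait above both falling and standing still."""
    good = (0.1, 1.0)   # small tilt, moving forward
    fall = (0.9, 1.2)   # fast but tipping over
    idle = (0.0, 0.0)   # balanced but going nowhere
    r = lambda state: reward_fn(*state)
    return (r(good) > r(fall)) + (r(good) > r(idle))  # score out of 2

feedback, best = [], None
for _ in range(3):
    source = ask_llm(feedback)
    namespace = {}
    exec(source, namespace)            # turn the LLM's code into a function
    score = simulate(namespace["reward"])
    feedback.append((source, score))
    if best is None or score > best[1]:
        best = (source, score)

print(best)  # the combined reward (speed - abs(tilt)) wins with score 2
```

The `exec` step is the essential trick: the LLM's output is treated as executable reward code, evaluated empirically, and refined from the evaluation results rather than from human intuition.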
come up with but those Solutions will also be better than what we can come up with so at some point AI research and robotic research will largely be driven by AI as opposed to kind of like human intelligence and then this Pio Pio from the startup physical intelligence a robot Vision language action model VA that performs impressive multi-step tasks like laundry folding laundry folding love it uses Aloha setup for cheap data scaling we have open vaa Stanford work on open source VA model trained on the open X embodiment data set that Aggregates robot motion trajectories from labs around the world we've briefly covered this in a previous video and of course we have the computer hardware right Nvidia so DrJim fan works for NVIDIA Nvidia of course the biggest uh AI hardware company on the planet scaling up Nvidia introduces Blackwell architecture and new beast in town dgx GB 200 crosses one xof flop compute in a single rack but also scaling down this is kind of one of the somewhat newer Trends we've seen right so as we're scaling things up and making the models bigger and the chips bigger we're also seeing a lot of progress through scaling stuff down through making stuff more efficient more compact sometimes that means maybe making it just a little bit more narrow in what it can do but sometimes just certain approaches tend to be more efficient like for example with the Deep seek model they were able to out of China they were able to train it uh for something like 11 times less the compute cost and for example meta used so we're seeing a lot of sort of uh possibilities and potential in scaling down and here the Jetson Nano super 67 tops of AI Computing in a $249 mini box designed for running small large language models on edge devices like robots it's nvidia's Raspberry Pi moment so if you haven't seen this thing so this is kind of what that's looking like it's a fairly small device it large it runs uh large language models and it can be used on what they call Edge 
devices, such as robots, but there are tons of other applications. You could run something like this in your thermostat, in your car, in whatever you're doing; if you need something small and efficient that isn't necessarily online all the time, something like this serves that use case. Then of course we have the Google Willow chip, the quantum computing chip that Google announced recently. They used one of their AI systems, AlphaQubit I believe it's called, to figure out how to correct the errors that quantum chips make; it's a neural net that corrects those errors, which allowed them to create the Willow chip. Something like that is incredibly powerful if we can find the right use cases for it, because the stat they cited is that in five minutes it can solve a problem that would take one of our current supercomputers 10 septillion years; if you take the entire lifespan of our universe so far, that's billions of times longer. That's how long it would take a supercomputer to solve a problem this quantum computer solves in five minutes. Now, the big problem, the other side of that, is that we don't know exactly where to use it. It's really cool for benchmark purposes, but we still haven't found a real-world application that supercomputers can't handle and that this is uniquely capable of doing. So we've found a solution, and now we need the problem for it to solve. That's where we are, but it's still incredibly impressive, and as we do more in that space it's probably going to give us a lot of breakthroughs down the road. And then of
course we have video generation and world modeling. Sora, from OpenAI, as he says here, "lost some charm due to the long wait," which is very true, but as we've covered in previous live streams and videos, it's very impressive in a lot of ways, especially when you start looking at details like reflections. If there are reflective surfaces, or liquid that reflects the sun or some other light source, it captures that incredibly well. There's one specific shot I was talking about where there's a polished, very reflective wooden table with a TV screen above it, and the table reflects a part of the TV screen that you can't see because it's underneath. For Sora to produce something like that, even though it's producing just a 2D image, it has to have some representation in its latent space of what that 3D object looks like; to understand how shadows and light reflect off various surfaces, it has to understand how light and shadow work. So thinking of these systems as mere video generation misses the point: they are modeling the world, modeling physics and light in 3D space. And we've seen studies, one out of Harvard called "Beyond Surface Statistics," that have, I don't want to say proved, but shown very strongly, that these models form some internal model of the 3D space they're generating. We feed them 2D images; after all, if you think about it, a computer screen is a 2D surface. If I show you this image and ask, what is the main object in this
image? What is the object in this image? You would say, well, it's the car. And if I ask what's closer to you, you'd say the grass and the front of the car; what's farther away? The hill and the sky. But the thing is, this is not a 3D image; it's a 2D image, flat. The only reason we can make 3D sense of it is that we've seen such scenes in the real world. If you're feeding nothing but 2D images to a neural net, to an AI model, it's not obvious that it would at some point just understand 3D space, but this Harvard paper, "Beyond Surface Statistics," shows that it figures it out. They take a model and feed it a bunch of 2D images with no depth data, so nothing tells it what's near or far from the "camera," and eventually the model is asked to produce new versions of those images. If it's shown a million images of cars and later asked to produce a car, it doesn't just spit out one of the images in its data; it produces a new rendition of what it thinks a car looks like. By looking at how the model produces those images, they found that very early in the process a couple of interesting things happen. First, it has an internal representation of the salient object, which in this picture would be the car, the main object in the frame; it knows, okay, the main thing goes here, and that's one of the first things it works out. Second, it understands depth: red marks what's closer to you and blue what's farther away. Very early in the denoising process, as it creates the image, it understands that this wheel is closer to the camera and those background trees are farther away.
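The paper's approach can be sketched as a linear probe: take a model's internal activations and test whether a simple linear readout recovers depth. The "activations" below are synthetic vectors that mix a depth signal with noise, a stand-in for real diffusion latents; every coefficient here is made up for illustration.

```python
import random

# Sketch of the probing idea from "Beyond Surface Statistics": take a
# model's internal activations and test whether a *linear* readout can
# recover depth. The "activations" below are synthetic vectors mixing
# a depth signal with noise -- a stand-in for real diffusion latents;
# all coefficients here are made up for illustration.

random.seed(1)

def fake_latent(depth):
    """Stand-in activation: depth is linearly embedded among noise dims."""
    g = [random.gauss(0, 1) for _ in range(3)]
    return [0.7 * depth + 0.05 * g[0], g[1], g[2] - 0.2 * depth]

data = [(fake_latent(d / 200), d / 200) for d in range(200)]

# Fit a linear probe  depth_hat = w . x + b  by stochastic gradient descent.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(500):
    for x, depth in data:
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - depth
        w = [wi - 0.01 * err * xi for wi, xi in zip(w, x)]
        b -= 0.01 * err

errors = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b - depth)
          for x, depth in data]
print(round(sum(errors) / len(errors), 3))  # small => depth is linearly decodable
```

A useful baseline: always predicting the mean depth gives a mean absolute error of about 0.25 on this data, so an error several times smaller indicates the probe really did find a depth direction in the vectors, which is the sense in which the paper says depth is represented "implicitly."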
So when they say it's learning implicitly, that's what they mean: we never taught it about depth, we never taught it about the 3D nature of objects. We just said, "here's a million 2D images," and implicitly it somehow figures out that there's some 3D space it needs to be aware of; there's some implicit learning going on. That's why Dr. Jim Fan puts video generation and world modeling in the same section. Yes, Sora is a video generator, but it's also a world simulator. As he says here, Sora is a text-conditioned soft simulation of the visual world; the model learns intricate rendering and intuitive physics, all by "some denoising and gradient maths." It's like a puppy: if you start throwing it a ball, it eventually figures out how to catch it and how the ball moves through the air, even though it has no idea about physics, gravity, or momentum. It builds a mental model of how the ball will travel and where it will fall. In a similar way, as these models observe the massive amounts of video we feed them, they develop some intuitive understanding of physics, and in this case of depth, of 3D objects, and of which object in a frame is salient. I think this is the thing that people who don't get AI are missing; if you can grok that aspect of it, that's the biggest thing to understand. To the people who say "it's just data in, data out": no, we're past that. Next we have Veo: as OpenAI delayed its release, Google staged a great comeback with more accurate physics and fine-grained object dynamics, and certainly better camera controls. Google just took a massive leap forward across so many different AI domains, and Veo is certainly one of them. Action-driven world models? Yes. Game engines? Yes: you can run Doom
literally anywhere, even inside a diffusion model. This is where they ran Doom, the old 1993-or-so game, but instead of running it on code, it ran inside a diffusion model. As the player pushes buttons on the controller, to move forward, shoot, or open a door, the model generates, almost like a video, a prediction of what would happen in the game. The game isn't running on code; it's running inside a neural net that shows you, in real time, frames of what would happen as you push the buttons. The same goes for Oasis with Minecraft, and Genie 2 (I believe that's Google as well), where you can run more games with joystick controls inside a diffusion model. So we're seeing more and more of this idea that we can create simulations using neural nets. Then there's the World Labs startup led by Fei-Fei Li, a well-known figure in the AI space; I believe she was an advisor to Dr. Jim Fan and many other prominent AI researchers. They have a stunning demo of a generative 3D foundation model with strong geometric consistency. We've covered it: basically, you can take any image, throw it into this thing, and the model creates a kind of 3D space that you can walk around in using WASD and the mouse, whether it's some fantasy landscape or just a regular photo; it builds a 3D world around the image. And of course, large language models: Claude 3.5 Sonnet shocked a lot of people with not only its coding abilities but a lot of other frontier abilities. Gemini 1.