So Mr. Mark Zuckerberg and the AI team at Meta, they're on quite a tear this week. They're everywhere: they're dropping Llama 3.1, Mark is up on stage with Jensen Huang chatting up a storm, kind of semi-aggressively going after Apple, and doubling down on his commitment to open source AI models and an open ecosystem for everybody to participate in. He mentioned this in his interview with Jensen, and today we are seeing the confirmation of the new Segment Anything Model 2. As AI takes over, make this your mantra: let the robots do the
work. Subscribe to stay on top of AI news. SAM 2 is the first unified model for real-time, promptable object segmentation in images and videos, and it's available under the Apache 2.0 license, so anyone can use it to build their own experiences. In addition to that, they're also releasing SA-V, a data set that's four and a half times larger and has 50-plus times more annotations than the largest existing video segmentation data set. This has been one of the complaints about how AI is "open sourced" by some of these companies: some people
are pushing back, saying, well, technically you're supposed to also open source the data. Where are we getting the data? A lot of companies aren't really telling us where they're getting the data, probably because they have acquired some of it, at least in some fashion, that they do not want to talk about. But let's take a look. So this is SAM 2, the next generation of Meta's Segment Anything Model for videos and images. The image one existed previously and was very good, and now the ability to segment in video is one of the cool
new features and kind of the next frontier for these models. Jensen Huang from Nvidia was talking about how they're working on similar stuff; for them it's mainly through various robotic applications: training robots in simulations and having them easily be able to recognize objects in the real world, to manipulate them, to navigate through the environment, etc. I'll play that clip towards the end, but the big takeaways here are, number one, they're releasing it, they're open sourcing it, the model, the data, the code, and you're allowed to use it under the permissive Apache 2.0 license. And as
I say, it has many potential real-world applications, for example to create new video effects and unlock new creative applications. Mainly, I think this is going to be used as sort of a foundation for building better AI vision, as they say, to aid in faster annotation tools for visual data and to build better computer vision systems for AI, for robots, for self-driving cars, etc. And really, the big deal here is that, number one, SAM 2 exceeds previous capabilities in terms of accuracy: better segmentation, better performance. So that makes sense, it's better than the previous models. But
also, and this is where things get kind of exciting, it's able to segment any object in any video or image sort of from scratch. So it's not a model that's trained to recognize cats or dogs or cars specifically. A lot of the previous models would be good at recognizing faces, for example; if you've ever taken pictures with a digital camera, I'm sure you have, you might see that little bounding box around somebody's face. Here the model is doing something that would be described as zero-shot generalization. Zero-shot meaning you don't really
need to give it examples, right? You don't need to say, here's 10 pictures of a dog, now find the dog in this image. As you'll see in a second, you can click on a fast-moving bicycle that's racing down the hill and it will keep track of the bicycle; it will segment out the bike without segmenting out the person or the background or anything else. They're saying that before, this took highly specialized technical experts with a lot of resources and a lot of infrastructure. I'll show you a demo of this model in a second.
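If you want to poke at that single-click, zero-shot prompting outside the web demo, here's a minimal sketch using the sam2 package from Meta's GitHub release. The checkpoint path, config name, image file, and click coordinates are my own assumptions based on the repo's published example code, so double-check them against the version you actually download.

```python
# Minimal sketch of point-prompted, zero-shot image segmentation with SAM 2.
# Assumes the sam2 package and a downloaded checkpoint; exact file and config
# names may differ in the released version.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")
)

image = np.array(Image.open("bicycle.jpg").convert("RGB"))  # any RGB image

with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (label 1) on the bike frame; no class labels needed.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[420, 310]]),  # (x, y) pixel you clicked
        point_labels=np.array([1]),
    )

best_mask = masks[np.argmax(scores)]  # boolean mask for the clicked object
```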
As you'll see, these advanced capabilities that took a lot of expertise not that long ago are now child's play. Now, SAM, the first model, has already been used in marine science to segment sonar images and analyze coral reefs, in satellite imagery analysis, in cellular imaging, and in detecting skin cancer. So as you can see here, it keeps track of balls one, two, and three, and it's able to segment specific cells that are swimming around. And as we read from Mark Zuckerberg's open letter about open source AI, he's saying that open source AI has more potential than any other modern technology
to increase human productivity, creativity, and quality of life, all while accelerating economic growth and advancing groundbreaking medical and scientific research. And this is important to understand: let's imagine a hypothetical scenario where a company creates a tool like this that is very effective at all these things, but they can make a lot of money just selling it in the medical field, treating cancer and other such diseases. The company that created it first could have probably made millions, perhaps even billions, selling it to those highly profitable segments. Since Meta released it open source,
everybody can use it and build upon it, and they no longer need to pay huge sums of money; all those potential profit opportunities are gone. But as Mark is saying here, open source AI will accelerate economic growth and advance groundbreaking medical and scientific research. It helps make this technology more accessible so we can not only use it but also innovate upon it. I know some people in the comments have mentioned that they don't really like Mr. Zuckerberg, but I've got to say, got to give credit where credit is due: the stuff that he's releasing is impressive,
it's going to be very useful and have a real-world impact, and he's open sourcing the code, the weights, and the data set. But let's check out the web demo and see what it can do. All right, let's try it out; I'll leave a link down below. Interestingly, I guess if you're from Texas or Illinois you are not allowed to use it. All right, so here's a video, and it looks like it's a kid kicking a soccer ball up and down. What's kind of weird to me is that nothing in the background moves; it's perfectly
still, there's no wind. Now, the grass underneath his feet moves a little bit, but nothing else does, so I'm kind of curious to see if they used a still for the background or what. But, you know, obviously they want you to click on the ball. We're going to get extra tricky and select his left shoe. As soon as I click on it, it perfectly selects his left shoe; so it's on the right side of the screen, but it's his left. And I think I'm also able to select other objects, but let's
start with just one. All right, and once we have that selected, we click track objects, and now, as you can see, the shoe is segmented out; it remains blue throughout the whole thing. Here's what looks like a slow-motion kind of replay. The reason I chose that shoe is because at some point I think we lose it completely, yeah, more or less fully behind the other shoe, yet it still remains segmented. All right, let's do something a little bit more difficult: we're going to select the watch here. All right, so let me reset it and start over,
and I'll find a shot where we can see his watch. It selected the watch, you can see it here, and then we're going to track objects and see what it can do. So that's looking really good: even though the watch is not visible in every single frame, it still maintains the segment, the bounding region around it, and it does it really, really well.
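Under the hood, that track objects button corresponds roughly to the video predictor workflow in the sam2 repo: initialize state on a video, add a point prompt on one frame, then propagate it forward. Here's a hedged sketch; names like add_new_points and propagate_in_video come from the published example code, and the frame folder, coordinates, and threshold are made up for illustration.

```python
# Rough sketch of single-click video tracking with the SAM 2 video predictor.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml",
                                       "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="soccer_clip_frames/")  # folder of frames

    # One positive click on the watch in frame 0 (a single point prompt).
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[512, 384]], dtype=np.float32),  # (x, y) click
        labels=np.array([1], dtype=np.int32),              # 1 = positive click
    )

    # Propagate that one click through the whole video, even across occlusions.
    segments = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        segments[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```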
But what good is any of this technology if it can't track the ball in a game of cups? So this is interesting. I'm going to select the ball, I think that's pretty obvious, and then we're going to click track objects and let's see what happens. So the ball is completely invisible, it's not on screen... well, it's apparent to me that I did not get enough sleep last night; obviously we want to be tracking the cup and not the ball. Obviously it'll track the ball when it's visible, but ideally it would track the ball in case they flip it from one cup to the other and maintain awareness of which cup it is in, but maybe that's asking a little
bit too much. Let's see, track objects. So the blue cup is where the ball is; they're spinning it around, and yep, it maintained the location of the ball. But enough games, let's give it something really difficult to do. I'm uploading my very own video, and let's see if it's able to keep track of this one. Just a couple of alley cats hanging out, no big deal. I want to keep track of this one. In the first frame, where it starts, it's very small, so this is kind of what it's seeing as the object. So let's
see how well it's going to be able to do that. It's tracking the cat pretty well so far. Wow, that, I've got to say, is pretty impressive. I think it did a very good job of following the cat as it dodges and Matrixes and parkours its way through the alley. Now, they do give you some suggestions about how to make this better. So for example, here we can adjust the cat: it kind of blends into the background, but we can add a few pluses here. But I'd say for the most
part it got it right; I mean, there should be a little bit more there. All right, and here I guess you can say it's not perfect, but still very, very good. Maybe adjust it here and let's see if that will make it better. I'm going to hit track objects, and here, replaying it again, I mean, absolutely stellar. I'm very impressed with this, plus I just love watching this clip, it's incredible, especially that little cat chasing after him, it's like, wow, really?
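Those pluses, and the exclusion clicks that go with them, map to additional point prompts on the same object ID: label 1 means include this pixel, label 0 means exclude it. Continuing the earlier video-predictor sketch (reusing the same predictor and state), a refinement pass might look roughly like this; the frame index and coordinates are invented for illustration.

```python
# Hypothetical refinement of object 1 on frame 40: two extra positive clicks
# ("pluses") and one negative click to push the mask off the background.
predictor.add_new_points(
    inference_state=state,
    frame_idx=40,
    obj_id=1,
    points=np.array([[300, 220], [310, 260], [180, 400]], dtype=np.float32),
    labels=np.array([1, 1, 0], dtype=np.int32),  # 1 = include, 0 = exclude
)

# Re-run propagation so the corrected mask is carried through the video again.
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    pass  # collect or render the updated masks as before
```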
You're also able to download the model here, get the data set, read the paper, and try the demo; that's the one we just saw. And of course SAM 2 is available online on GitHub to download and use for your very own favorite purposes. I'll leave you off with a brief clip of Mark Zuckerberg and Jensen Huang talking about these models, their applications, and why they're so exciting. With that said, my name is Wes Roth, and thank you for watching. Let's talk about the next wave. You know, one of the things that I really love about the work that you
guys do: computer vision. One of the models that we use a lot internally is segment everything, and, you know, we're now training AI models on video so that we can understand the world model. Our use cases are for robotics and industrial digitalization, and connecting these AI models into Omniverse so that we can model and represent the physical world better, and have robots that operate in these Omniverse worlds better. Your application, the Ray-Ban
Meta glasses, your vision for bringing AI into the virtual world, is really interesting. Tell us about that. Yeah, well, okay, a lot to unpack in there. The Segment Anything model that you're talking about, we're actually presenting, I think, the next version of that here at SIGGRAPH, Segment Anything 2. And it now works, it's faster, it works with, oh, here we go, it works in video now as well. I think these are actually cattle from my ranch in Kauai. This is super cool. Okay, so it's
recognizing the cows, tracking... it's recognizing and tracking the cows, yeah. So a lot of fun effects will be able to be made with this, and because it'll be open, a lot of more serious applications across the industry too. I mean, scientists use this stuff to, you know, study coral reefs and natural habitats and kind of the evolution of landscapes and things like that. But being able to do this in video, and having it be zero-shot, and being able to kind of interact with it and
tell it what you want to track, is pretty cool research. So, for example, the reason why we use it: you have a warehouse and it's got a whole bunch of cameras, and the warehouse AI is watching everything that's going on. Let's say a stack of boxes fell, or somebody spilled water on the ground, or whatever accident is about to happen: the AI recognizes it, generates the text, sends it to somebody, and, you know, help
will come along the way. And so that's one way of using it: instead of recording everything, if there's an accident, instead of recording every nanosecond of video and then going back and retrieving that moment, it just records the important stuff, because it knows what it's looking at. And so having a video understanding model, a video language model, is really, really powerful for all these interesting applications.
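To make that pattern concrete, here's a very loose sketch of the record-only-the-important-stuff loop Jensen is describing. Every function in it (camera_stream, describe_clip, save_clip, send_alert) is a hypothetical stand-in rather than any real library, since he's describing an architecture, not a specific product.

```python
# Illustrative sketch of event-driven warehouse monitoring: a video
# understanding model watches the feed, and footage is only kept (and an
# alert sent) when the model flags an incident.
import collections

WINDOW = 30  # rolling window of the last 30 sampled frames
buffer = collections.deque(maxlen=WINDOW)

for timestamp, frame in camera_stream():        # hypothetical frame source
    buffer.append((timestamp, frame))

    # Ask the video-language model what is happening in the recent window.
    caption = describe_clip(list(buffer))       # hypothetical model call
    if "spill" in caption or "boxes fell" in caption:
        # Only this moment gets written to disk, not every frame of video.
        save_clip(list(buffer), f"incident_{timestamp}.mp4")
        send_alert(f"Possible incident at {timestamp}: {caption}")
```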