All right, so we're diving into long-form video understanding, and you know, AI models usually struggle with those hour-long videos. Yeah, it's tough. It's like fitting a whole movie into a tiny tweet, way too much data. But Meta AI has this new model, LongVU, and it seems to be changing how we think about this problem. Exactly, LongVU doesn't try to process everything. Okay, so how does it work? Well, it uses something called spatiotemporal adaptive compression. Huh, that sounds complex. Think of it like an AI editor, you know, it cuts out the boring parts and just keeps the good stuff. So how does it know what to cut? Video is pretty complex, right? So it uses a system called DINOv2. DINOv2? Yeah, it can spot even the tiniest differences between frames, so it can tell if a frame is basically a repeat. Exactly, and it just gets rid of those. Like when we watch a movie, our brains naturally filter out stuff we don't need; it's kind of like that. Makes sense, yeah, but how does it actually answer our questions about the video? Well, that's where your question comes in.
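As a quick aside, that frame-deduplication idea is simple to sketch. This is a minimal illustration assuming the frames have already been embedded with DINOv2; the similarity threshold and the toy vectors below are our own placeholders, not anything from LongVU itself.

```python
import numpy as np

def drop_redundant_frames(features: np.ndarray, threshold: float = 0.95) -> list:
    """Keep a frame only if its feature vector differs enough from the
    last frame we kept (cosine similarity below the threshold)."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        a, b = features[kept[-1]], features[i]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos < threshold:  # different enough from the last kept frame
            kept.append(i)
    return kept

# Toy stand-ins for real DINOv2 embeddings: three near-identical frames,
# then one visually different frame.
f = np.array([[1.0, 0.0], [0.99, 0.01], [1.0, 0.02], [0.0, 1.0]])
print(drop_redundant_frames(f))  # -> [0, 3]: the near-duplicates are dropped
```

The greedy compare-against-last-kept loop is just one way to do it; the point is that near-duplicate frames never reach the expensive language-model stage.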
It uses that text query, like keywords, to zero in on just the most relevant parts. Oh, I see, so it's not wasting time processing stuff I'm not even interested in. Exactly, it only processes a fraction of the data, which makes it super efficient. How efficient are we talking? It averages just two tokens per frame. Whoa, two tokens per frame for an hour-long video? That's insane. But does that actually make it better? Oh yeah, absolutely. It actually outperformed another model, LLaVA-OneVision, by 5% on the Video-MME benchmark. By 5%? That's significant. And even a smaller version built
on Llama 3.2 3B did amazingly well. It was 3.4% better than its predecessors on the super-long videos. So it's not just efficient, it's actually more accurate. You got it. And get this, it even rivals GPT-4V in some cases, especially when there's a ton of visual information. Wow, that's really impressive. So we have a model that's efficient, accurate, and can even challenge the big players. What kind of impact could this have? Well, imagine analyzing hours of security footage and instantly finding what you're looking for. Or what about in sports, like breaking down an entire game to find those key plays and really assess player performance? Exactly, and think about education. Oh yeah, imagine asking a question about a lecture and getting an instant answer right from the video. That would change how we learn. Absolutely. But the real game changer here is the whole new way of thinking about video understanding. It's not just about processing everything, it's about being more selective, more focused, more like how we actually process things. That's it, and that opens up a whole world of possibilities. So we've covered how it works and why it's a big deal, but how does it actually achieve this amazing compression without losing vital information? It's a three-step process. Okay, break it down for me. First, it gets rid of those redundant frames with DINOv2. Right, the temporal reduction. Yeah, then it uses your question to focus on the most relevant frames and applies spatial pooling to the rest. That's selective feature reduction. Got it. And finally, it uses spatial token compression to squeeze even more efficiency out of those remaining frames. So it's like a super-smart filter, removing the noise and amplifying the signal. You got it. The result is
a model that can handle those really complex, long videos with incredible efficiency and accuracy. It really is an elegant solution to a problem that's been around for a long time. But while we're all excited about these advancements, we need to remember there could be consequences. Any tool can be misused, right? That's absolutely right, ethical development is crucial with any tech like this. It's wild to think how this could change how we use video in the future, you know? Totally, it's not just convenience, it could lead to things we haven't even thought of. Imagine AI analyzing surgery footage, okay, not just to see the steps but to actually judge the surgeon's technique. Oh, that could be huge for patient safety and training new surgeons. Exactly. Or think about education. We talked about getting instant answers from lectures. But yeah, what if it analyzes the student's face too, like their expressions? So it could adapt to how they're doing in real time. Yeah, like a personal tutor. That's amazing. Yeah, what about movies, though? Could it help make films better? Absolutely, imagine it going through hours of footage to suggest the best edit, like a virtual co-director. Exactly, we're moving past just identifying objects. It's about AI understanding how stories are told in video, the emotions, the little things that make a video really grab you. You got it, that's what makes it so exciting. We're right at the edge of something huge. Huge. It's mind-blowing. But we can't forget the human side of all this, right? AI should be a tool to help us, not replace us, give us new ways to understand stuff, to tell stories better, to make smarter choices, to learn and grow like never before. You said it. LongVU and
these new models are just the start. It's a journey to really unlock what video can do for us, and that's a future worth exploring. Thanks for joining us for this deep dive into LongVU. We've gone through how it works, what it can do, and how important it is to develop it ethically. So we've talked about how LongVU could change things like health care, education, even filmmaking, but with that much power comes a lot of responsibility, right? Absolutely, we need to be talking about the ethics of all this, especially as AI gets more advanced and more a part of our lives. It has to benefit everyone, not just a few people. Exactly, we need to think about bias, you know, and privacy, and people misusing this technology. We have to make sure it aligns with our values. For sure. Like we were saying, LongVU analyzes videos, right? And a lot of times those videos have people in them. We've got to make sure it doesn't just pick up on existing biases, like judging someone based on how they look or act. Right, right. That means being super careful about the data used to train these models. We need to be on the lookout for those biases sneaking into the algorithms. It's like we need to build safeguards right into the tech, you know, making sure it promotes fairness and not just repeating the inequalities we already have. Exactly, and it's not a one-time thing, it's an ongoing conversation between the people developing the tech, the ethicists, the policymakers, everyone really. We have to make sure AI is used for good. And we can't forget about privacy. If AI can analyze hours of footage, potentially even identifying people and tracking them, how
do we protect people's privacy? Really good point. Transparency and control are so important. People should know how their data is being used, and they should have a say in how this technology is used. So it's about finding that balance between innovating and protecting individual rights. You got it. It's definitely a complex issue, but it's one we have to tackle head-on. Video understanding is changing so fast, and we need to be thoughtful and ethical about how we approach it. It's a crucial time. We have a chance to really shape how this technology develops and how it's used so that everyone benefits. That's a future worth fighting for. So as we wrap up this deep dive into LongVU, let's remember to celebrate the technology, but also keep in mind that it's people who decide what kind of impact it has on the world. I agree. LongVU and these new models are incredibly powerful tools, but it's up to us to use them the right way, ethically, with a vision for a future where AI empowers us all. It's been amazing exploring this with you. Thanks for joining us for this deep dive into LongVU. Keep exploring, keep asking questions, and let's all keep shaping the future of video understanding together. Until next time, see you.
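For readers who want to make the three-step compression from this deep dive concrete, here is a rough sketch of the pipeline's shape. Every name, array size, pooling factor, and scoring rule below is an illustrative assumption of ours, not LongVU's actual operators: query relevance is faked with a dot product against a stand-in text embedding, and "spatial token compression" is crudely approximated by keeping the largest-norm tokens.

```python
import numpy as np

def compress(frame_tokens: np.ndarray, query_vec: np.ndarray,
             keep_top: int = 2, pool: int = 4) -> list:
    """Toy version of the three steps: frame_tokens has shape
    (num_frames, tokens_per_frame, dim) and is assumed to arrive AFTER
    step 1 (temporal reduction: redundant frames already removed)."""
    # Step 2 (selective feature reduction): score each frame against the
    # text query; only the most relevant frames keep full resolution.
    frame_means = frame_tokens.mean(axis=1)          # one vector per frame
    scores = frame_means @ query_vec                 # stand-in relevance score
    top = set(np.argsort(scores)[-keep_top:].tolist())

    out = []
    for i, toks in enumerate(frame_tokens):
        if i not in top:
            # Less relevant frame: spatial pooling (average token groups).
            toks = toks.reshape(-1, pool, toks.shape[-1]).mean(axis=1)
        # Step 3 (spatial token compression): keep only the two
        # largest-norm tokens, a crude placeholder for the real operator.
        norms = np.linalg.norm(toks, axis=1)
        toks = toks[np.argsort(norms)[-2:]]
        out.append(toks)
    return out

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 8, 16))   # 6 deduplicated frames, 8 tokens each
query = rng.normal(size=16)            # stand-in for an embedded question
compressed = compress(frames, query)
print(sum(len(t) for t in compressed))  # far fewer tokens than the 48 we started with
```

However it is implemented in the real model, the structure is the same: spend the token budget where the query says the information is, and pool everywhere else.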