The AI RAG landscape completely changed a few days ago with the release of Google's brand new Embedding 2 model. This is huge because it finally allows us to do what we've all been trying to do for a few years now: directly embed videos and images into our vector databases. But here's what nobody is telling you.
Being able to embed a video into a vector database is not the same as being able to analyze a video inside a vector database. Confused? Well, that's not surprising.
Most people are, which is why none of the videos talking about Embedding 2 get this right. It is not as simple as hooking Embedding 2 up to an existing RAG structure and saying, "Oh, now I can ask questions about this video." That is not the case.
There are actually additional steps required to create a RAG architecture that can get the most out of these embedded videos, and today, I'm going to show you exactly how to do that. On top of that, we're going to have a nuanced but thorough discussion about RAG and embeddings, because this is a confusing topic, and you do need to understand what's actually happening under the hood to figure out this whole architecture mess.
Lastly, I'll be giving you a GitHub repo with the proper RAG architecture so you can clone it, copy it, do whatever you want, and essentially have a 90% solution ready to go, so you can navigate this minefield right off the bat. All right, so let's open with the good stuff. This is the GitHub repo I was talking about. Inside, you will find the basic RAG architecture that I'm going to be talking about in depth from now on. You have two options.
You can clone this thing and then point Claude Code at it, or I have a markdown file here called the Claude Code blueprint, which you can copy and paste into Claude Code, and it will tell it exactly how to set itself up. So, you can just take this and go forth and conquer, but I highly, highly suggest you watch the rest of this so you actually understand the thought process behind it. Now, speaking of Claude Code, quick plug.
I just dropped my Claude Code masterclass inside of Chase AI Plus this last week. There's a link to that in the pinned comments, and I'm dropping a ton of updates tomorrow. So, if you're trying to go from zero to AI dev, this is the place to do it.
So, make sure to check it out. Now, let's quickly talk about Gemini Embedding 2, Google's first natively multimodal embedding model. The big thing here is that we can now ingest things beyond text, right?
We can take images, videos, audio, and documents and embed them inside of a vector database. This is huge, because before, we were pretty much limited to text, and if you wanted to do videos, it took hacky workarounds where you'd write descriptions of the videos and put those in as vectors. This has huge business and organizational implications. Imagine you have a ton of proprietary data that's all video-based. How were you supposed to analyze it from a RAG point of view? It was really tough to do.
Now, this is much easier, because it embeds the video itself. This obviously comes with some limitations, right? Videos can only be 120 seconds at a time, text is capped at 8,192 tokens, etc., etc. But there are workarounds for almost all of these. And yet this is where we start to have issues.
So, the idea being promoted by most people discussing this topic, and, I think, the intuitive idea, is: okay, I can now embed videos. So, let's imagine I found a video I loved, right? "Claude Code plus Playwright equals insane browser automations," right?
You love this video so much you want to download it and put it inside a RAG system so you can ask questions about it, right? What if you wanted to know some things and didn't want to watch the video every single time?
Well, I have Embedding 2 now, so I should be able to embed it, then ask questions. Simple enough, right?
In theory, it should work like this, right? Inside of Google, I can ask Gemini about this video. I can say, "Hey, how does Playwright test without opening browsers?" and it answers. That is the expectation, and that's what's being sold. But this is not how it works. You would expect to ask questions just like we were doing here on the right and get responses in text, maybe with references to parts of the video.
Instead, here's what would happen if you took a standard naive RAG system and hooked Embedding 2 up to it: it would not give you a detailed text response. It would just give you a clip of the video.
So, it would be like me asking, "Hey, talk to me about the Playwright stuff that happened in this video," and instead of saying, "Oh yeah, here's A, B, and C. Here's what happened," it would just give me a two-minute clip and be like, "Answer's in there, brother." That's how it works, and that's how most people are showing you how to set this up.
There's some value in that, but it's not really what we want, right? That's not how we really derive value. And to understand that dichotomy, you need to understand how embeddings work and how the whole RAG system is even set up, because we can solve this.
It is a solvable problem. So here's my beautiful illustration for a basic naive RAG system. I know you're all here for the top-tier graphics, and this is some of my best work yet. So, RAG: retrieval-augmented generation. You understand how this works, right? You ask the large language model a question. It retrieves additional information from the vector database to augment the answer it generates for you. Hence, RAG. This side of the equation is pretty easy to get.
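That retrieve-then-augment loop fits in a few lines of Python. Everything here is a stand-in (`embed`, `vector_db.nearest`, the `llm` callable are hypothetical names for whatever embedding model, vector store, and LLM you wire in, not a real SDK):

```python
def rag_answer(question, embed, vector_db, llm):
    """Naive RAG: embed the question, fetch the closest stored
    documents, and hand them to the LLM as extra context."""
    q_vec = embed(question)               # question -> vector
    hits = vector_db.nearest(q_vec, k=3)  # closest stored documents
    context = "\n\n".join(doc.text for doc in hits)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                    # augmented generation
```

That's the entire naive architecture: one embed call, one nearest-neighbor lookup, one prompt.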
Now, let's talk about the other side of the equation: the document ingestion side. The idea is that I have some sort of data source and I want to put it into the vector database to help augment the answers. Well, let's go through the standard user journey of a document, right?
Traditionally, that is some sort of text document. Correct? So, the text document, let's say this text document is about World War II battleships.
I send the text document to the brand new Google embedding model, right? This is the Google embedding model. This embedding model is going to turn this document into a vector.
Now, what is a vector? Well, a vector is just a point in space on a graph. You understand a graph, right?
Like back in geometry, right? An x/y axis is two dimensions. A vector database has hundreds and hundreds of dimensions, thousands of dimensions, but you can think of it as 3D in this case. Now, these blue dots are the vectors, and again, a vector is just numbers that represent a point in space. So you had your graph from school, right? What would this vector be right here? This would be (1, 1), right? That's what that point is in terms of math. Same thing here, but instead of two numbers, you see it's three numbers. In reality, for us, it'd be 1,536 numbers, right?
That's what's in space. That's a vector. And so, when our document goes through the embedding model, what does it do?
It's turning our document into 1,536 different numbers, which are represented by one of these little dots on the graph. Now, how does it know what numbers to give this PDF document? And how does it know where to put it in space?
Well, it's placed inside the vector database based on semantic meaning, a.k.a. what the document is even talking about. Remember what it is?
This is World War II battleships. So, where is it going to go? Well, it's going to go next to ships and boats, right?
Makes sense. It's not going to be by bananas and green apples, right? The numbers are all about being able to compare it to other vectors.
So our PDF document is now a vector with 1,536 numbers associated with it; it's just a point in space. The PDF is actually paired with its vector, but for your mental model, you can imagine the document living inside the vector. We'll make this guy a little bigger now. You then ask a question: "Hey, I've got a question about World War II battleships." Your question, "tell me about World War II battleships," also becomes a vector, a string of numbers, and our LLM is going to go into the vector database and say, "Hey, I have this question. It's a string of 1,536 numbers.
What vectors are closest to those numbers?" In this case, your PDF, because it's about battleships, has numbers close to your question's. And so, it grabs that vector.
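"Closest" here usually means cosine similarity between the question's vector and every stored vector. A toy version with made-up 3-dimensional vectors instead of 1,536-dimensional ones:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-D "embeddings"; real ones would have ~1,536 dimensions.
vectors = {
    "battleship_doc": [0.9, 0.1, 0.0],
    "banana_doc":     [0.0, 0.2, 0.9],
}
question_vec = [0.8, 0.2, 0.1]  # "tell me about WWII battleships"

best = max(vectors, key=lambda name: cosine(question_vec, vectors[name]))
print(best)  # -> battleship_doc
```

The numbers are invented for illustration; the point is only that semantically related things end up pointing in similar directions.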
Now, it grabs the vector and everything that's paired with it. And what was paired with it? The document itself.
Now, this is the important part, and this is where the video stuff is going to come in, because it grabbed a vector that's paired with a text document. It is text, right?
The LLM can easily ingest that text to augment its answer, right? It isn't just grabbing a random attachment in space and saying, "Here's the attachment, brother. Figure it out." It will link the document for sure, but it's also going to use it as part of its answer. It can actually ingest it. Do you understand where I'm going with this?
And if you don't, I'll explain even more in a second. So, it grabs it, adds it to the answer. Cool.
Everybody's happy. Hopefully, you know, clear as mud. All right.
So, now let's think about this. Bye-bye, graph. Let's think about this from the video perspective.
It's the same journey. You have a video you want to put inside the vector database. It is a video about World War II battleships.
So, it goes through the embedder. Boom, boom, boom, boom, boom. What happens?
Well, you know what happens: it gets turned into 1,536 different numbers, or rather, it's represented by them. So this video, not a description of the video, not a transcript of the video.
The video itself is embedded. Okay. So now it's a vector, right?
It's a bunch of numbers. You again say, "Hey, I want to know more about World War II battleships. " Okay, we turn that into a vector.
Turn that into a bunch of numbers. Which vectors are closest? Oh, it's this guy.
All right, bring him up. What's he paired with? What do we got?
Well, we got a video. Do you see the problem that's about to occur? This is a video.
This is an MP4 file. And our large language model, unless it's something specific like Gemini, cannot ingest it and answer your question based on what's in the video. In fact, we can't even ask real questions about this video, because it is just a video.
So, when you ask your question about World War II battleships, it's going to hand you a two-minute clip out of however long that video was, hopefully the right one, and say, "Hey, there you go. Good luck. Here's your answer." Now, obviously, our large language model in this case does have training data about World War II battleships, so it would be able to accentuate the video. But imagine this was proprietary video data.
There's no additional information coming. Is this the system that you have in mind when you hear that embedded videos are now a thing in RAG? No, it's not.
The expectation is that the LLM can look at what's inside the video, do analysis, look at the transcript, all that, and sure, give me the video clip, but also accentuate it with analysis of the video. We can do that, and you can probably start to figure out how and where we could. But understand: what you see everywhere else, what people talk about when it comes to Embedding 2, is this system, and this system is crippled. Even if you use something like Gemini here rather than Sonnet or Opus, do you really want Gemini, every time a question pulls a video from the vector database, to take all that time to digest it? No. Absolutely not.
In which case the answer becomes kind of obvious, right? Enter Gemini. What we want to do is we don't want to just embed this video as is as a vector.
Although we still want to embed the video, we want to include other things with that pairing. We want to include a description of the video, a transcript of the video, some sort of text. So when this vector gets pulled, we don't just get the video alongside it.
We get the accompanying document or written transcription as well, something the AI system can actually ingest. And like I said before, we don't want to do this on the back end. We want to do this on the front end, when we ingest in the first place.
So when we first give it that video, it goes in along with its written description, and we only have to figure that out once. In terms of architecting this, it just means we create a pipeline where, when we ingest a document, specifically something that is non-text, it also gets run through Gemini, right?
In this case, it's Gemini Flash-Lite, giving us some sort of text explanation that can accompany the video. So when you ask the question, we can actually augment it with real answers.
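Sketched as code, the ingestion-side fix is just "store more than the vector." The function names `embed_video` and `describe_video` here are hypothetical placeholders for whatever multimodal-embedding and Gemini calls you plug in; only the shape of the stored record matters:

```python
def ingest_video(path, embed_video, describe_video):
    """Front-load the work: embed the video AND generate a text
    description once, then store them together as one record."""
    record = {
        "source": path,
        "vector": embed_video(path),         # e.g. the multimodal embedding
        "description": describe_video(path), # e.g. a Gemini-generated summary
    }
    # At query time, a vector match now returns text the LLM can
    # actually read, plus the clip itself for the UI.
    return record
```

The describe step runs exactly once per video at ingest time, which is the whole point: no re-digesting the video on every question.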
And again, I know that was a long and potentially convoluted explanation, but I hope it got across how the data flows through the RAG system and where vectors play into all this. And your question at this point is probably: okay, so it can embed the video, which means it understands the video. Why doesn't it include an explanation?
And the answer is: because that's not what embedding models do, right? Embedding models take data and turn it into a vector. They embed it, right? It's all about differentiation, or similarity; it's all about the search side of things, not the explanation part. It's like the difference between being able to recognize a face in a crowd and the other part of your brain being able to describe who it is, what they do, and all those other things. That's just not part of its job.
That's why we can't do it, and that's why we have to augment it with Gemini. And so, if you're still here watching this, that is what our RAG infrastructure needs to look like.
We have to add Gemini on this side. So there you have it. And because I cannot stop belaboring the point, here's sort of an example of this entire system in action.
What you're looking at right here is the multimodal RAG system, the same thing that's in the GitHub repo, right? And I gave it that same video we saw on YouTube, the one about the Playwright CLI.
This is the sort of answer you get when we have the proper architecture, when the video is augmented with some sort of explanation, right? We ran it through Gemini and you can see, hey, it gives me a full text description. You can see all the chunks it grabbed.
And we'll talk about chunks in a second. And here at the bottom, hey, I have my full media preview. I can actually watch the video.
Compare that to the exact same RAG system minus the Gemini side, the basic kind that doesn't take any of this into account, the kind everyone's showing you. Same question.
And the answer is, basically, there's insufficient information to explain how to manage multiple browser automations. Here are some source files. It gives me the source files.
Right? Big, big difference in what we can do with that. And the last thing I want to touch on before implementation is a part of the architecture that still definitely needs work and is not a solved thing, even in the GitHub repo I'm giving you.
And that is: how do we chunk up the video? Right? Chunking is something we've been dealing with in text for a long time.
Right? If you don't understand chunking, it's the idea that we have this document that's going to go into our RAG system, but we can't put the whole thing in there at once. There's a limit.
So, what we do instead is cut it off at certain points, right? And there are ways to intelligently do that. Now, video is kind of the same thing, because if we have a video, say an hour long, well, where do we cut the video off?
That's not a great color. That also is not a great color.
Where do we cut this video off and ingest it? How can we intelligently do that? Tough, tough question, because if it's an hour-long video, that's a lot to analyze in the first place. Do we do it with the transcript? What are the options here? Not solved. The solution I came up with, just due to its simplicity, is to have Claude Code automatically chop it up at the two-minute mark with a 30-second overlap. It's a very simplistic way of handling chunking for video, but it's the way people have dealt with text documents for a long, long time. Not to say it's the best way to do it, but it's a somewhat proven formula. So, implementing this thing: like I said, two ways to do it.
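That fixed-window scheme is easy to sketch: 120-second windows (the embedding limit) advancing 90 seconds at a time, so each chunk re-covers the last 30 seconds of the previous one. The function is purely illustrative; in the repo, the actual splitting is handled by Claude Code, presumably via FFmpeg:

```python
def video_chunks(duration_s, window_s=120, overlap_s=30):
    """Fixed-size chunking with overlap, the same trick long used for
    text: each chunk re-covers the last 30s of the previous one, so a
    sentence (or scene) cut in half still appears whole somewhere."""
    chunks, start = [], 0
    step = window_s - overlap_s  # 90s of new material per chunk
    while start < duration_s:
        chunks.append((start, min(start + window_s, duration_s)))
        start += step
    return chunks

# A 5-minute video:
print(video_chunks(300))  # [(0, 120), (90, 210), (180, 300), (270, 300)]
```

Note the trailing runt chunk: naive fixed windows always produce one, which is exactly the kind of artifact smarter (scene- or transcript-aware) chunking would avoid.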
One, clone this repo, point Claude Code at it, and just say, "Hey, I essentially want to recreate this on my own." Claude Code will take care of the rest. Second option: click on the Claude Code blueprint, copy the whole thing, paste it into Claude Code, and it will also do all the work.
I also lay out how to do that in the README, as well as some of the prerequisites, right? You're going to need up-to-date Python, FFmpeg, and the Supabase CLI, because I use Supabase for the vector database. Understand you could replace that with something like Pinecone or whatever else you wanted.
And then we do need Gemini, a Gemini API key, and a Supabase project. If you do it this way, Claude Code will quite literally hold your hand. So, I'm not going to spend the next 30 minutes having you watch me hit accept over and over and over.
I'm just going to walk you through some of the Supabase stuff real quick so you know where to find some API keys, because that will probably trip you up. After you get the API keys, like I said, Claude Code is going to install the Supabase CLI for you, or at least prompt you to get that done. That way, Claude Code handles all the database creation; you don't have to go in there and put in any SQL code or anything like that.
So, you'll go to supabase.com and create an account. It can be a free account; the free tier is more than enough for this. And you're going to need a few different keys: your public API key and your Supabase URL.
So, you're going to click into the project you created, then come over to the left, go to Project Settings, then API Keys, click on legacy anon role API keys, and your anon public API key is right there. To get your URL, go to Connect, scroll down, and it's right there next to the public Supabase URL; copy the whole thing. And if it's your first time hooking up the Supabase CLI, it will ask you for your access token.
But again, Claude Code will walk you through it and will even give you the link for that authentication piece. Beyond that, the only other key you'll need is a Gemini API key. I have a link here, and Claude Code will also give you that link.
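For reference, those three credentials typically end up in an environment file the app reads. The variable names below are illustrative, not necessarily what the repo expects; check its README for the exact names:

```shell
# .env — example variable names; values are placeholders
SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your-anon-public-key
GEMINI_API_KEY=your-gemini-api-key
```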
Now, if you want to run some of the tests I'm about to show, I included most of them. I didn't include the actual video; you can download your own and drop it inside of assets. And if you go to docs, I have a few markdown docs as well as images, so you can test it out, and we'll see it work right here.
And Claude Code came up with this super basic UI. So, when you go to upload, you can just drop your file in here. You hit upload.
Once you upload it the first time, again, I didn't spend any time on the actual UI. There'll be a little X button right here. Just click X and then do another one.
It only lets you do one at a time, but feel free to make a better UI; it shouldn't take long. So, here's what it kind of looks like.
So, I select All, ramp up the number of results we get, and say, "Hey, how can I run multiple browser automations at the same time?" and click. Now, you'll see the power of having this multimodal embedding, because again, remember: usually, what do we get back? Just text.
Obviously, we want text to be able to have a back and forth. But you'll see here I not only get text and obviously it calls out the actual links. It gives me this cool installation thing, but look at that.
I have the video, right? The video is embedded, so I can see it down here. But I also have some matched images, right?
This is something you really couldn't do before, prior to Embedding 2. This is a massive, massive leap forward in terms of RAG functionality, provided you set up the architecture correctly. Just having this, being able to not only get the text answer but have it accompanied by this sort of media, whether images or video (and I don't even have standalone audio in here, although by virtue of being a video, it includes audio).
Massive. Like this is great. This is really cool stuff.
I just think the architecture obviously needs a little work. And then I think the biggest thing that this needs work on is again the video chunking. All right, the video chunking is going to be a big deal.
Just like text chunking is a huge deal even now, and there are more and more sophisticated ways of dealing with text chunking, I think we're going to have to apply that same logic to video, and maybe also add things like re-rankers.
So, that's where I'm going to leave you for today. I know we kind of sped through the actual setup process, but between the clone and the prompt that's included here, Claude Code pretty much does it for you. So, you're not missing out on too much.
And this is just a skeleton. All right. This will give you a 90% solution that's going to work, but it isn't the ultimate production-ready implementation of this new multimodal RAG system. Definitely add your own bells and whistles.
Definitely play with the UI. Definitely think about chunking. And then also start to think about, you know, like data cleanup.
What would you do if you added or edited some sort of document that's already in the database? What do you do? Those are the type of questions that I would be asking as I move this to production.
So, as always, let me know what you thought. Definitely check out Chase AI Plus in the pinned comment if you want to get your hands on the Claude Code Masterclass. If you're new to AI, I also have a free Chase AI community that's in the description.
Tons of free resources, so there's something for everybody.