One of the common use cases in AI is building your own RAG pipeline. RAG stands for retrieval-augmented generation, and it means that when you need to provide context to your large language model from your own data, you do RAG.

In retrieval-augmented generation there are two key steps: one is retrieval, and the other is generation. But there is another step before those, which is preparing your data for retrieval and generation. What that means is, say you have a text file or a PDF file containing your own data. First you need to split, or chunk, it into smaller pieces; that is called the splitter phase. Then those chunks are converted into embeddings, or numerical representations, and these numerical representations get stored in a vector store. That vector store could be anything: an in-memory vector store, a hosted cloud service, or a database like Postgres on your own system. Once these steps are done, where we have divided the data into chunks, converted them into numerical representations, and stored them in a vector store, you can create an index on your data for faster access. Then, whenever a user asks the LLM a query in the context of our own data, we first retrieve the relevant chunks, and after the retrieval the LLM generates a response which is grounded, or augmented, by our own data. That is the whole concept behind retrieval-augmented generation. The key idea is that these LLMs are pre-trained on a huge, humongous amount of data, but they don't know about our data, so if we want the LLM to answer from our own data, we need to do RAG.

Now, for all of the steps I have mentioned, from chunking or splitting through to generation, there are a few choices. You can build everything from scratch, which is very customizable, if you have a demanding requirement where you want to
extract every bit of performance out of it, where you are very aware of how these things work, and where you want to use your own components at each and every step; in that case, I would recommend you build it on your own. But if you don't really want to get into the intricacies of the RAG pipeline, then I would highly suggest you have a look at a framework which can abstract a lot of those details away from you, so that you can build your RAG pipeline quickly with all of these components and just focus on your business logic and application.

There are numerous frameworks available out there, and Haystack is one of them; it gives you an abstracted way of building a RAG pipeline. What Haystack does is provide a component for each step I mentioned earlier: for chunking or splitting, for retrieval, and for generation. You just pick those components, knit them together into a RAG pipeline, and then simply run that pipeline on your own data. It gives you an end-to-end framework, or toolkit, for building a pipeline. That is what Haystack does. And it is not the only tool you can use; a lot of other tools have appeared on the market, and as the industry evolves and progresses, the good thing is that we see more and more toolkits. I have covered 50 or 60 such tools already this year on the channel, so if you're interested, just search for "RAG" or "RAG pipeline" and you should be able to see a lot of other videos there.

Now, Haystack is by deepset, and they also have a hosted cloud version, which is of course paid, so I'm not going to go there. We are just going to look at the Haystack GitHub repo; my intention is to get it installed locally, and then we will play around with it. You can of course use API-based models with it, like OpenAI or Cohere, and a few others; there are a lot of integrations which you can access from their website. Or you can integrate
it with Ollama-based models. Ollama is one of the fastest ways to run quantized GGUF models locally. What it does is shrink the model size so that there are smaller memory requirements in terms of VRAM on your GPU; the smallest models you can even run on CPU. I have done hundreds of videos on Ollama as of today, so if you're interested, just search the channel and you should be able to find heaps of videos around Ollama. So I'm going to use the Ollama integration with Haystack, and then we will see how it works.

Okay, before I show you the installation, let me give a huge thanks to M Compute, who are sponsoring the VM and GPU for this video. If you're looking to rent a GPU at affordable prices, I will drop the link to their website in the video's description; plus, you are also going to get a coupon code for a 50% discount on a range of GPUs, so do check them out.

With that said and done, let me take you to my terminal and we will get cracking on it. This is my terminal, where I'm running Ubuntu 22.04, and this is my GPU card, an NVIDIA RTX A6000. Let me clear the screen, and let's first create a virtual environment with conda, just to keep everything separate and simple. My conda environment is created and activated, as you can see in the parentheses on the left-hand side. Now let's install all the prerequisites, which include torch and transformers; then I am installing Haystack, plus, at the end, Haystack's integration with Ollama via the ollama-haystack package. Let's run it and wait for it to finish; this is going to take 3 to 4 minutes.

It has installed everything, so let me launch my Jupyter notebook and we will play around with it in the notebook environment in the browser. While that happens, let me quickly take you to another terminal on the same server; I just wanted to show you that Ollama is already installed and that I have two models already present. We'll be using Llama 3.1 from here. So let's wait for Jupyter to launch in the browser. There you go, our Jupyter notebook is launched; let me open a notebook, and our notebook is up.

First, let's import some of the stuff we have installed, which of course includes Haystack. That is imported. Now, if we quickly want to check how the generator works in Haystack, we just use the OllamaGenerator from Haystack; I'm specifying my local model, llama3.1, and the local URL where Ollama is running, with its default port, 11434, plus some of the hyperparameters to control the output. The generator is initialized, as you can see. Let me print out the response of the generator, and there you go: the generator has come up with a reply, plus it has given me a lot of numerical metadata too. One thing I have noticed is that it is quite slow with Ollama, whereas when I was trying it out with OpenAI it was quite fast; anyway, that is just my observation.

Okay, the generator is done, but we are more interested in retrieval-augmented generation, so let's see how we can build a RAG pipeline with Haystack using Ollama. For that, let me import some libraries. Remember that in Haystack, or in any RAG pipeline for that matter, we have different components: a component for splitting or chunking, an in-memory vector store, one for retrieval, and one for generation. That is why you see what we are importing here: the OllamaGenerator, and then our in-memory vector store. You can of course use any other vector store, like Pinecone, Weaviate, or Chroma; there are a lot of them, and I have already done a video on how to select a vector store, which you can find by searching for "vector store" on my channel. So let me quickly import all of this. That is done.

Now let's define a template. All this template is doing is telling the model: this is the question, and this is the context; that's all. It is a very standard format where we define a prompt template for how to talk to the application. Next up, let's define our documents. The good thing about Haystack is that it allows you to have this in-memory store, which we are initializing here, so we are not using any external vector store or database; we are just putting it all in memory. And then these are the documents which we are writing.
We write them with write_documents; the content here is one-liners, but of course you can replace it with full-blown documents of your choice. You can use any standard input/output library to read a file, put its text in the content, and then write those documents into the in-memory store. So let me run it. That is done, and it has returned 4, which means four documents have been written. Okay, that is done.

Next up, as usual, let's get our generator from Ollama. We have already done this, but for the sake of completeness let's run it again; the generator is ready. Next, let's create the pipeline, which is the real beauty of Haystack and frameworks like it: they make it so easy to build a pipeline by knitting these components together. You see, we simply initialize a Pipeline, and the first component we add is the retriever, which simply points at the in-memory store; then the prompt builder, with the help of the template we defined; and then the generator with the LLM. Then we connect all of these together: the retriever to the prompt builder's documents, and the prompt builder to the LLM. You see, it goes one by one, in tandem: retriever, prompt builder, LLM; those are the phases your pipeline goes through. Let me run it, and once you run it, it prints out the whole pipeline: these are the components (retriever, prompt builder, and LLM) and these are the connections between them. How good is that?

In order to run it, all you need to do is call run on the pipeline, passing your query to the prompt builder and the retriever. For example, you can ask a question about your own data, anything like "What is my favorite sport?", and it is going to run the pipeline, give you the result back, and print it. So let me run it. This is grounded in the data, by the way: you can see in the documents we gave it that my favorite sport is soccer, so let's see if it is able to get that from the data. It has finished running; let's print out the result here, and there you go: it is telling us that your favorite sport is soccer, and then it gives a lot of other information, such as Llama 3.1 running locally, plus a few other bits and pieces around eval duration and other stats. And because we are using a local LLM, there is no token cost, and everything is private and local. How good is that?

Similarly, when I asked what my favorite food is, it tells me that there is no information provided about my favorite food; the context only mentions my preference for soccer, the season of summer, and my feelings about sci-fi books and crowded places. How good is that? So you see, we have ensured that the responses are totally grounded in our own information, and Haystack makes it so easy to build your end-to-end RAG pipeline on your own data. Of course, this is just a playground, and if you're building a production-grade one, there are a few more things you would need to do here, maybe; but I'm using Ollama with Llama 3.