So in my last video I showcased a demo I made covering some interesting interactions between graph databases and large language models: how you can generate this kind of knowledge graph in a graph database from unstructured data, by using a language model to identify entities and relationships in that data, and how you can then interact with that graph database through a language model chat interface, asking it questions. It also offers some advantages compared to vector search, which is the more popular approach to these kinds of applications. I was very happy to see that the video got a lot of traction. A lot of people liked it, and a lot of people also asked very good questions and wanted to see more details about the code itself, how exactly I had built it. It took me a while, but I finally got around to making that more detailed video, where I start from scratch and then walk step by step through the code and exactly how it all works. That's what this video is about, so without further ado, let's get straight into the action. I hope you enjoy it.

All right, let's start with a brief overview of the environment we are using here. First of all, we have the data here under the data folder, and we have three different types of documents. We have profiles of people, which contain the names, the skills, and the projects that the people have been working on. We also have the briefs, which contain a description of the customer need, the solution that was delivered and, importantly, the technologies that were used in the project, as well as some free-form debriefing of the outcome and the customer experience. And then finally we have the Slack messages, which map onto the people in our first type of document: it's the same people, and they are discussing these same projects in a free-form way with their colleagues, with some back and forth in the messages. This is the data we are going to be using to generate our knowledge graph. And by the
way, all this data, the JSON files and the markdown files I showed previously, has all been generated by ChatGPT. So if you want to experiment with this a bit more and try using different types of data, that's a very good and simple option. We also have the .env file, which is going to contain our secrets, our credentials, in this case for connecting to the OpenAI API as well as to the Neo4j database. And then finally, in the requirements we have the Python libraries that need to be installed and imported for our solution to work. That's all we need. The code itself we're going to run in a Jupyter notebook, so we can run it step by step with some annotations in between; it's a great way to run code in this more presentation-like demo format.

So then let's get started. For the OpenAI API we're basically going to need four parameters: the API key; the API base, which is the API endpoint; the API version that we want to use; and the deployment name, which we'll get to in a moment. Each of these just gets its value from the environment variables that we're going to store here in a second. Right now the file is empty, but I'll get to that shortly. Then for Neo4j we're also going to need an endpoint, the URL to connect to the database, plus a username and a password for that user. Then we create the database driver object with the GraphDatabase library; it takes these arguments, and that's basically it. So these two cells set up the credentials for all the services we need, and next I'm going to show you where we actually get these credentials. First we're going to go to Azure, because that's where I have my OpenAI models running. You can of course just use the OpenAI API directly, in which case it's going to be simpler in a lot of ways, but at this point I'm just used to always using Azure, because when I work with clients that's typically where the models are going to be running, but
for this one it doesn't matter so much. But if you're also using Azure, I'm going to show you where you can find these values for connecting to the endpoint, which is under Keys and Endpoint. We take the endpoint first of all and put that here, and we can take either of the two keys and put that over here. Then we also need to make sure that we have deployments here in our instance. You have one Azure OpenAI instance in some region that you choose, and within that instance you can have several different deployments running that you can use for different purposes, like I have here. For some workloads you might want to use GPT-4, but because it's up to many times more expensive than 3.5, sometimes 3.5 is perfectly fine to use, so it's good to have both. For our knowledge graph today, GPT 3.
5 is going to be just fine. So what we want to get out of here is the deployment name, and we're also going to save that somewhere here; we'll just put it straight into a variable. All right, that's all we need on the OpenAI side.

Then for Neo4j, we're going to go to console.neo4j.io. Neo4j has this really nice hosted offering called Aura, where you can very quickly sign up, and you get your first instance for free, up to a million nodes, so it's quite sufficient for playing around. I already have an account here, so I'm just going to log in, but the sign-up process shouldn't take you more than a minute. When you sign up for the first time you won't have any instances here, but like I said, you can create your first instance for free, and you can select to host it on Azure, Google Cloud, or AWS, wherever you like. I'm just going to open the instance I already have. Now, what's important here is that when you first create the instance, you're going to get the password and the username, but you're only going to see the password once, when you create the instance, and never again after that, so make sure you write it down. You can take it straight away and put it into your environment variables; it goes under the Neo4j password. For the username we are just using the default username, neo4j, and the connection URL you're going to get from here: the prefix of the URL is going to be neo4j+s:// and then the value specific to your database after that. That's all we're going to need from our Neo4j portal for now. You can leave it open; we're going to use it later when we start to build the graph, and here we can then look at the graph in a visual way and run queries against the database. Of course, right now it's totally empty, so we aren't getting any results, but we can just leave that be. Back in VS Code, we now have all our environment variables set, so now we can just run all of this.
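As a rough sketch, the credential setup described above might look something like this in the notebook. The environment variable names and the `load_settings` helper here are my assumptions, not necessarily what appears in the video:

```python
import os

def load_settings():
    """Read the OpenAI and Neo4j credentials from environment variables.

    The variable names are assumptions; match them to whatever your .env
    file actually contains.
    """
    return {
        "openai_api_key": os.getenv("OPENAI_API_KEY"),
        # the Azure endpoint URL, e.g. https://<resource>.openai.azure.com/
        "openai_api_base": os.getenv("OPENAI_API_BASE"),
        "openai_api_version": os.getenv("OPENAI_API_VERSION"),
        # e.g. neo4j+s://<your-db-id>.databases.neo4j.io
        "neo4j_url": os.getenv("NEO4J_CONNECTION_URL"),
        # Aura's default username is "neo4j"
        "neo4j_user": os.getenv("NEO4J_USER", "neo4j"),
        "neo4j_password": os.getenv("NEO4J_PASSWORD"),
    }

# With the neo4j package installed, the driver would then be created like:
# from neo4j import GraphDatabase
# settings = load_settings()
# gds = GraphDatabase.driver(settings["neo4j_url"],
#                            auth=(settings["neo4j_user"],
#                                  settings["neo4j_password"]))
```

If you use the OpenAI API directly instead of Azure, only the API key entry is really needed, as mentioned below.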
Actually, I've got to change this, because in the .env it's called the connection URL, so this one needs to match for it to get the correct value. Then we can just run all of this. And again, if you're using the OpenAI API directly, you can actually leave out everything else except the OpenAI API key.

Let's start defining some functions. First up, we have a function that helps us call the OpenAI API, given a system message, or prompt, and then some contents. Here we aren't using LangChain or anything else fancy; we're just using the OpenAI library itself. We pass the deployment name we have here as the engine, we define max tokens at 15,000, and we set the temperature at zero. Then we have the messages, which define the actual content of our request: first we pass the system message, with the content being our prompt, and then for the user message we pass the text contents. With the completion defined, we can just look at the results. The completion itself returns a lot of metadata about the request, but if we just want the actual contents of the message, we drill into the contents of the response like this, and this returns the final text response; the function then returns those results.

Cool. Next up, we're actually going to start looking at how to extract the entities and relationships from our data, and this function is going to do that. We pass this function a folder of files and a prompt template, and it returns a JSON object of all the entities and relationships that the language model finds in all the files that we pass. So this function is going to be extract entities relationships; we pass it a folder and prompts. What we're going to pass to this function is, first of all, the folder name, so that's going to be
people profiles, project briefs, etc. The first thing we want to do is, from that given folder, get all the files, which are found under the data folder and then the given folder; we end the path with the wildcard asterisk so it gets all the files, and this is basically going to be a list of all the files. We add a little print statement, and then we initialize an empty list where we'll save all the results we get from our language model in a second. Then we start looping through each of the files, with another print statement (always useful), and for each file we try to get the contents and strip out some whitespace from the end. We'll talk about prompts in a second, but basically what we pass to this function is a prompt template, and for each particular file type we have a dedicated prompt that gives some instructions specific to that file type. In the template we pass to the function there's a placeholder value for the actual contents of each file, so we take the template, replace that placeholder with the actual text contents of one file at a time, and then call the language model API using the function we defined up here first. Actually, I've given these arguments somewhat confusing names: this prompt here is not related to the prompt we define here. Here we actually want something more like a system message, more generic information, and this content is actually going to be the file-specific prompt. So the first argument is going to be the file prompt, and the system message we define here. This is a more generic prompt compared to the file prompt template we're going to define next, and
that's going to contain a lot more instruction about how to extract those entities. But it's still nice to have some more generic context for the model, just to improve the results, something like: you are a helpful IT project and account management expert who extracts information from documents. This we pass with all our requests, regardless of the specific file type. Then we append each individual result for each individual file into our big list of results, and add some exception handling; we just print the exception out like so, so we also see which file the exceptions happen on. Finally, the function just returns that list. Let's also add a timer here so we can see how long it takes to process each folder: we start the timer here, end the timer here, and then print out the duration like so.

Okay, next let's talk about these file prompts I mentioned before. These are very important; like I said, they contain a lot of information specific to our needs here, which is to extract the entities from the raw text and return a certain type of JSON result. I have these prompts, or templates, written down; I'm just going to paste them here and then explain them in more detail. First, for the project briefs, this is our instruction: extract the following entities and relationships described in the mentioned format, and always finish the output. I found that sometimes the model might send partial responses, which we can't use, so this line is important. Then we first specify what entity types to look for and what attributes those entities should have, then we generate relationships between those entities based on the document context, and finally we specify the output. For the entities, each entity is first of all going to have an ID, and the entity types that we are looking
for in this particular type of document are, first of all, projects. Projects have an ID, a name, and a summary, and we describe all of these attributes here as well. So this is the project mentioned in the brief. For the ID we use the name of the project in lower case, without any special characters. Now, this is obviously problematic in any kind of production application: it's not a very reliable ID, because the names might not be consistent, and you might have projects or people with the same name. A big challenge in doing something like this is actually figuring out a reliable way to generate IDs for all these entities automatically, using just the language model. But we have a small sample size of data in this demo, so we don't really run into these issues yet, and the simple ID will work for now. The summary property holds a summary of the original document and its contents. We are also looking for technologies mentioned in the project brief; they just have an ID and a name, and again, same thing, we use the name in lower case as the ID. Additionally, we have the instruction here to identify as many of the technologies used as possible, so instead of getting an answer like Azure, we get Azure Functions, Azure Blob Storage, Azure Data Factory, etc. In this way we add some detail to the identification of technologies. Finally, clients are also entities we want to identify: IDs, names, and also industries. Now, the industry of each client isn't directly mentioned in the project brief, but often you can infer it indirectly from the context; just by looking at the other contents we can say, okay, this project and this client seem to be working in e-commerce, for example. Next, for the relationships, it's basically going to follow the format of
one entity, a relationship type, and then the other entity. From the fact that the projects, technologies, and clients are mentioned in the same document, in the context of one project, the language model can understand that, okay, there should be relationships between all of these. Then for the output, we just want a JSON object that contains two keys: the entities as one list and the relationships as another list. Finally, at the end here, we have the placeholder value I mentioned earlier: the dollar sign in $ctext indicates that it is a placeholder, and $ctext is going to be replaced by the actual text of the file that we are reading. That's basically how this is going to work. We have two more templates like this that I'm also going to copy here, and these are very similar to the first one, just with some small differences. For example, when it comes to the people, the profile documents, we are looking for the person entity instead of the client, but the project and technology entities are exactly the same. The relationships we're looking for are also slightly different: people have skills, and projects have people, these kinds of relationships, as opposed to these ones here, but otherwise it's really quite similar. Then finally, for the Slack messages, same idea: here we are again looking for the person entities, plus a Slack message entity, with just this kind of relationship between the two. And I actually realize I misnamed this; it should be project prompt template. That's it, so now we just run this cell so the variables get stored, and then we can do a little test of our previous functions and see how they work. We're just going to change a few things: we're going to change these arguments, because I renamed them here, so this is going to be the system message and this is going
to be the file prompt. I've also actually made a mistake here: this should be a capital C for the ChatCompletion. And finally, this should be an enumerate, so that we get both the counter and the file in our loop. Okay, great. Now we add a new cell and run the function, just for the project briefs for now, so the folder name is going to be this one, and the prompt for those files is going to be the project prompt template. We'll save the output into results, and we'll also write those results into a file, just for testing, so we can see what it looks like. And actually, in our loop as well: because the output from the language model for each file is going to be this JSON object, we want to make sure we read it into our list as JSON, so json.loads; we get the output from the language model and read it into a JSON object, and that way this list of results becomes a list of JSON objects. Cool, so now let's see what we get. We get an error: when we parse the response from the language model, it should be choices, not choice. So let's try that again. There we go, now it seems to be running okay. However, we do hit an issue with the rate limits of our OpenAI API, so we're going to have to handle that somehow, but still, we got the first six files. We can see the result here; let's just format this. Here we have the extracted entities as one list and the extracted relationships as another, and we get this kind of pair of lists for each file that we process. This looks good. Now, for the rate limits, there are a few things we could do. We could request an increase in our limits for Azure OpenAI, so send a request to Azure; we could also do parallel requests using different instances at the same time; but here the simplest thing
we can do is just add a little delay between requests, say 5 seconds between each one. That adds a bit of delay between the requests, and it's probably enough for us to stay within the rate limits. Other than that, the function works nicely and everything looks good here: we have our node types under the label field for each entity, we have the ID as the lowercase of the name, various types of relationships, summaries for the projects, all good.

Now for our next function. This is actually going to be the last function we need, but it's also the most complicated one so far. Here we take the JSON from the previous step and parse it into the actual Cypher statements, Cypher being the database language of Neo4j: taking the JSON and actually creating those entities and relationships in the database, that's what this function is going to do. There's going to be a lot of stuff here, but we'll go through it slowly, and I'll try to make it as clear as possible. This function takes objects of entities and relationships and generates Cypher for creating them. All right, so generate Cypher takes a JSON object as its only argument. We'll have an empty list for the entities and the same for the relationships, because the statements for creating the two are going to be different. We'll have a loop here, and inside this loop another loop, because each object has this list of entities, so we loop through each of those. We get the node type first of all, which is under the label key, and the ID from the id key. We also clean up that ID a little, because some of the names, the project names for example, might have dashes in them that we don't want and can't have in our
Neo4j node IDs. So we take any of those dashes and get rid of them, and I don't think we have any underscores, but just in case, we replace those as well, and this is our final ID. Then, aside from the label and the ID that every node has regardless of type, the nodes will also have some other properties, like the summary in the project entity type. These we have to create dynamically; I'm going to write this out first and then explain it a bit more. This here gives us all the key-value pairs in each entity, keys and values, and for all the keys except the label and the ID, which we have already handled here, we build a dictionary of each of those keys followed by its value. So this is going to be a dictionary of keys and values, and you'll see in a second how that fits into our Cypher statement. The first part of our Cypher is going to look like this; again, I'll write it out first and then describe it, just to make sure I don't make any mistakes. The MERGE keyword here creates a node of this type with this ID if it doesn't exist yet, and if it does exist, it updates it with any new values. This is the reason we use MERGE instead of the CREATE keyword, which also exists in Cypher: we don't want to be adding duplicates if we happen to run these statements several times. That's the base of our Cypher, and then, if we have other properties as defined here, we add those on top of the base. For each key-value pair in our properties dictionary, we format it into a string that we can use in the next part of the Cypher, which looks like this: ON CREATE SET. Then, when the node is created,
it also sets all these attributes with the values, like so. You'll see in a second: I'm also going to write the results of these Cypher statements out to a file so we can examine them in more detail, but that's basically it for our entity statements. Each of those statements we add to our list of statements, and then we're done with this part. Now, secondly, as well as the list of entities, inside each JSON object there's also a list of relationships, so we make a loop for that too: for relationship in the list here. We can see in the outputs that for the relationships we have the ID of the first node, the relationship type in the middle, and the ID of the second node at the end, and this is one item that we loop through, with the three parts separated by the pipe character. So we get the source node ID, the relationship type, and the target node ID just by splitting the relationship string on the pipe character, which gives us a list with these three objects. Just like before, we make sure we don't have any special characters in either of these IDs. What we also need, for both the source node and the target node, is the node type of each, but we don't actually have that here in this object, so we have to get it from our first part, the entity definition. We're going to create another object here, an entity label map: whenever we define an entity, we save it into a dictionary, so we have a dictionary of all our entities where the key is the ID and the value is the label, or node type, of that entity. Then we can access this in our relationship Cypher generation, like so: we can get the label for the
source node like this, and the same for our target node; they're both in the same dictionary. Now we can define the Cypher itself for creating the relationship between the two nodes, and that looks like this. The first part basically finds us the nodes we're looking for: node a, with the label we expect our source to have and the ID our source should have, and then the second node, the b node or target node, with this node type and this ID. So we're creating references to those nodes under the aliases a and b, and then in the second part we create the actual relationship, which is formatted like this: we have the a node, a dash, the relationship type in brackets, and then an arrow leading to the second node. This is the notation Neo4j uses to describe relationships between nodes. That's our entire relationship generation statement right there. Then all we need to do is append that Cypher to our list of relationship statements, and we're basically done. For this function too, we write the results into a text file so we can examine them later, separated by new lines; we have two lists here, so we just join them into one list that we write into the file. Similarly, the return value of this function is the combination of the two lists. We're missing an s here, actually. And that's it. There's definitely a lot to unpack in this one function, especially if you aren't familiar with Cypher syntax and statements, but now we're going to run it and see what the statements actually look like in the text file, which I think will help you make sense of it. And of course, I'll make all this code available alongside the video, so later on, if something isn't clear from the video, you can take your time and go through it yourself.
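Since the walkthrough above covers a lot of ground, here is a condensed sketch of what such a Cypher-generation function could look like. The structure follows the description in the video, but the exact names are my assumptions, and the property quoting is simplified (real property values would need proper escaping):

```python
def generate_cypher(json_objects):
    """Turn extracted entity/relationship JSON into Cypher MERGE statements."""
    entity_statements = []
    relationship_statements = []
    label_map = {}  # entity id -> node label, needed later for relationships

    for obj in json_objects:
        for entity in obj["entities"]:
            label = entity["label"]
            # strip characters we don't want in a node id
            node_id = entity["id"].replace("-", "").replace("_", "")
            label_map[node_id] = label
            # every remaining key becomes a node property
            properties = {k: v for k, v in entity.items()
                          if k not in ("label", "id")}
            # MERGE creates the node if missing, updates it if it exists
            cypher = f"MERGE (n:{label} {{id: '{node_id}'}})"
            if properties:
                props = ", ".join(f"n.{k} = '{v}'"
                                  for k, v in properties.items())
                cypher += f" ON CREATE SET {props}"
            entity_statements.append(cypher)

        for rel in obj["relationships"]:
            # relationships come as "source_id|REL_TYPE|target_id"
            src, rel_type, tgt = rel.split("|")
            src = src.replace("-", "").replace("_", "")
            tgt = tgt.replace("-", "").replace("_", "")
            cypher = (
                f"MATCH (a:{label_map[src]} {{id: '{src}'}}), "
                f"(b:{label_map[tgt]} {{id: '{tgt}'}}) "
                f"MERGE (a)-[:{rel_type}]->(b)"
            )
            relationship_statements.append(cypher)

    return entity_statements + relationship_statements
```

This assumes every relationship endpoint was also defined as an entity somewhere in the input, since the label map is built from the entity lists.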
But let's give it a try. We're just going to use the JSON file we already stored, so we don't have to call the language model again; we can use the existing results and read that file. Let's see. We've gone through the five files that we managed to process earlier in our JSON here (there are actually 11 files, but the rest failed with the rate limiting), and for the contents we do have, we've now generated the Cypher statements, and everything looks quite all right. Here you can see the basic syntax for creating nodes: MERGE, with n as a typical alias for a node, so a node of type technology with an ID here, and then with the ON CREATE SET statement we set any additional attributes we want that node to have, again with n being the alias for the node: the name of the node we set to this, the summary of the node we set to this. That's basically how it works. For the relationships, we create the aliases to refer to our nodes, a and b, so a project node with this ID and a technology node with this ID, and then we generate the relationship between the two. Everything looks good here.

Now, the final thing left to do is to actually run these statements against the Neo4j database, so we're going to define one more function, which is just a pipeline function that runs all the previously defined functions in sequence and brings it all together. We're going to give this function a list of all the folders we want to process; actually, not just a list of folders, but a dictionary that contains the folder name as the key and, as the value, the prompt template we want to use for those files. This is what we pass to our pipeline function as the argument. So, for key and value in this kind of dictionary, we run the extract entities and relationships function, which takes the two arguments: the folder
name and the prompt template. For each of these items we give it the key, which is the folder name, and the value, which is the prompt template. This function returns a list of the entities and relationships, so every time we get a list back for one of these folders, we extend this list, which starts out empty, and keep extending it with each result, so we end up with one list containing all the entities and all the relationships. This first step is just about extracting; the second step is to generate and execute the Cypher statements. We get the Cypher by passing this complete list into the generate Cypher function, which takes a JSON object, and then for each statement in the resulting Cypher statements we use the database driver that we defined all the way up here, gds for short; this is where our database driver lives, and the method is simply gds.execute_query. We pass the statement and try to run it, catching any exceptions. Actually, we're going to write these into a file as well, in case we have any errors, instead of printing them out, so we can store them for later: the statement and the exception that occurred for that statement, separated by new lines. This one we don't need, and that should do it; we'll get rid of this, and now we can run the whole thing, for all our files in all our folders. Then we can finally examine the outcomes in our Neo4j workspace, where previously, and currently, we still have nothing in the database, but that's hopefully going to change in a second when we run this pipeline. Let's see how it goes. Some more errors? Right, yeah, this isn't person template, it's people, I think. Yeah, that's right; autocomplete is useful a lot of the time, but sometimes it can be confusing. Now, this is going to take a while, several minutes at least.
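The pipeline step described above can be sketched roughly like this. As an assumption on my part, for illustration, I've passed the extraction, Cypher-generation, and query-execution steps in as arguments; the version in the video calls its functions and the gds driver directly:

```python
def ingestion_pipeline(folders, extract, generate_cypher, execute_query,
                       log_path="failed_statements.txt"):
    """folders maps a folder name to its prompt template.

    extract(folder, template) -> list of entity/relationship JSON objects
    generate_cypher(results)  -> list of Cypher statement strings
    execute_query(statement)  -> runs one statement (e.g. via the Neo4j driver)
    """
    results = []
    for folder, template in folders.items():
        # step 1: extract entities and relationships from every folder
        results.extend(extract(folder, template))

    failed = []
    for statement in generate_cypher(results):
        # step 2: run each generated Cypher statement, collecting failures
        try:
            execute_query(statement)
        except Exception as exc:
            failed.append(f"{statement}\n{exc}")

    if failed:
        # store failing statements for later inspection instead of printing
        with open(log_path, "w") as f:
            f.write("\n\n".join(failed))
    return results, failed
```

With the real functions in place, this would be invoked with something like `ingestion_pipeline({"people_profiles": people_prompt, ...}, extract_entities_relationships, generate_cypher, gds.execute_query)`, though those identifiers are my guesses at the notebook's actual names.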
So I'm going to speed it up, and then I'll see you on the other side. And there we go: it took us about 6 minutes, but no errors, no exceptions, nothing written to our failed statements text file. Now let's take a look at the Neo4j results. At this point, after looking at the statements we created previously, you maybe have some idea of how Cypher works. Here in this portal we can also query the database, and we can even create nodes manually, but the kind of query we're using here just matches any node that has a relationship to any other node, and the RETURN gives back all the nodes and all the relationships that match that pattern; in other words, this statement returns all the nodes and all the relationships. Let's see what we get. And there we go: that's how you create a knowledge graph from unstructured data.

Now, the last thing I want to show you is the consumption of the language model, because you might be interested, and you should be interested, in how much it actually costs to use GPT models for this kind of purpose, processing a decent amount of data. We can monitor that very nicely in the Azure portal. I'm again in my OpenAI instance on Azure, and we have a section here called Metrics. We can see a bunch of metrics, but the one we're most interested in is the processed inference tokens, which is the sum of the prompt tokens we send to the model and the completion tokens the model returns; inference tokens is the sum of those two. And this is our spike from the previous 6-minute run, processing all the folders and all the files, for a total of about 25,000 tokens. So if we again go to the pricing of the Azure OpenAI service, at the time of recording this, for the GPT 3.