Name: Bioinformatics Project from Scratch - Drug Discovery Part 2 (Exploratory Data Analysis)
Duration: 22 min 52 s
Channel: Data Professor
Description: okay so today is part two of the bioinformatics project series where we will show you how to apply data science for drug discovery and in the previous video I've covered about how you can collect d...

Okay, so today is part two of the Bioinformatics Project series, where we show you how to apply data science to drug discovery. In the previous video, I covered how you can collect data directly from the ChEMBL bioactivity database. In this video we're going to take it a step further by computing molecular descriptors, and then we're going to perform exploratory data analysis on the computed descriptors. So without further ado, let's get started.

Okay, so the first thing you want to do is head over to the GitHub of the Data Professor, click on the code repository, then scroll down and find Python. Before we proceed with part two, exploratory data analysis, we're going to do a recap of part one. It's not going to be an ordinary recap but a concise version: we've trimmed down the code so it's a bit more lightweight. I'll show you that in a moment.

What you want to do now is scroll down, right-click on the Raw link, choose Save Link As, and save it to your computer. You can follow along on your local computer using Jupyter Notebook. However, if you want to use Google Colab, we can do that as well, so let's go to Colab right now. In Colab, click on File, Open notebook, then click on the GitHub tab, type in dataprofessor, and go to the code repository; make sure that it is dataprofessor/code. Then click on the CDD-ML Part 1 bioactivity data concise notebook. I already have that one in my Colab.

Besides part 1, we're also going to do part 2, which is also in the Python folder under code. Click on the CDD-ML Part 2 exploratory data analysis notebook, click on the Raw link, choose Save As, and save it to your computer as well. We could do the same thing right inside Colab via File, Open notebook: click on GitHub, type in dataprofessor/code, press Enter, then find CDD-ML Part 2 exploratory data analysis and click on it. But since I already have it locally, I'm going to use that.

Okay, so let's start with part one. Click on Connect and give it some time to load up. Let's go over what I have done in this concise version. Essentially there are two things. The first is that redundant code cells were deleted.

The second is that the code cells for saving files to Google Drive have also been deleted. So at the end of the notebook we can simply download the results by clicking on the Files button on the left-hand panel and downloading a copy of the zip file of the curated data. I'm going to show you that in just a moment.

All right, so the notebook is loaded, and you want to install the ChEMBL web resource client, so go ahead and run the cell. All right, it's installed, so let's import the libraries and run the code cell that searches for coronavirus. This is the result, and we're going to select the fifth entry, which has the index number 4 right here, so let's run that code cell. A detailed explanation of all of this has already been given in the previous video, so I'm just going to run the code cells one by one. If you want further information, please check out the previous video, part 1. Okay, so this is the bioactivity class labeling, combining the data frames, and writing out the output file. So let's download the pre-processed file, and there you go.

To recap, in order to download this file, hover your mouse over the three dots on the far right of the file name, click on them, and choose Download. All right, so we're done with part one, and now let's continue with part two. It should be noted that the explanation for all of the code cells in part one has already been given in the previous video, part one of the Bioinformatics Project series.

Okay, so let's proceed to the content intended for part two of the Bioinformatics Project series. Let's close this notebook and go to part two. Make sure to click on Connect, and then run the code cell for installing conda and RDKit. What RDKit essentially allows you to do is compute the molecular descriptors for the compounds in the data set that we compiled in part one.

Okay, so let me explain again. In part one we downloaded the data set of biological activity from the ChEMBL database. The data set comprises the molecule names and the corresponding SMILES notation, which is the information about the chemical structure that we will use in this part two to compute the molecular descriptors. The data from part one also contains the IC50, which in part one we already binned into the bioactivity classes active, inactive and intermediate. In this part two we're going to select only two bioactivity classes, the active and the inactive, so that we can easily compare the active compounds with the inactive compounds.

Okay, so without further ado, let's have a look at the code. Conda and RDKit have now been installed, so let's load up the pandas library. Make sure to click on the Files button here on the left-hand panel, then click Upload and choose the bioactivity data that we prepared in part one. Okay, it's right here, it has been uploaded, and we can close the panel. Now let's load up the CSV file.

In the following block of code we're going to compute the Lipinski rule-of-five descriptors, or simply the Lipinski descriptors. You might be wondering what Lipinski descriptors are. Well, the name originates from Christopher Lipinski, a scientist at Pfizer, who came up with a set of rules called the Rule of Five for evaluating the drug-likeness of compounds. Drug-likeness is based on the key pharmacokinetic properties, comprising absorption, distribution, metabolism and excretion, which has the acronym ADME and is also known as the pharmacokinetic profile. What ADME essentially tells us is the relative drug-likeness of a compound: whether it can be absorbed into the body, distributed to the proper tissues and organs, metabolized, and eventually excreted from the body.

To derive the Rule of Five, Christopher Lipinski collected a set of FDA-approved drugs that are normally administered orally. Based on his analysis, he observed that the four descriptors he used had cutoff values in multiples of five, as follows: the molecular weight should be less than 500 daltons, the octanol-water partition coefficient, or LogP, should be less than 5, the number of hydrogen bond donors should be less than 5, and the number of hydrogen bond acceptors should be less than 10. As you can see, all of the values are multiples of five.

Okay, so let's proceed with computing the descriptors. Let's load up the library and then compute the descriptors. This is a custom function that was inspired by this link here and modified to include the descriptors for this analysis. All right, so we get the Lipinski descriptors in this data frame by applying the custom function called lipinski, which takes the SMILES notation as input. The SMILES notation contains the chemical information, which tells us the exact atomic details of the molecule, and the function uses that as the input to compute the molecular descriptors.
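To make the Rule of Five thresholds described above concrete, here is a minimal sketch in plain Python. It is not the notebook's lipinski function: it assumes the four descriptor values have already been computed (for example with RDKit), and the names LipinskiDescriptors and rule_of_five_violations are hypothetical. The video states the limits as "less than", so treating only values above a limit as a violation is an assumption about boundary handling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LipinskiDescriptors:
    """Hypothetical container for the four descriptors named in the video."""
    mw: float             # molecular weight in daltons
    logp: float           # octanol-water partition coefficient (LogP)
    num_h_donors: int     # number of hydrogen bond donors
    num_h_acceptors: int  # number of hydrogen bond acceptors


def rule_of_five_violations(d: LipinskiDescriptors) -> List[str]:
    """Return the names of the Rule of Five limits that the compound exceeds.

    Assumption: a value exactly on a limit is not counted as a violation.
    """
    violations = []
    if d.mw > 500:
        violations.append("MW")
    if d.logp > 5:
        violations.append("LogP")
    if d.num_h_donors > 5:
        violations.append("HBD")
    if d.num_h_acceptors > 10:
        violations.append("HBA")
    return violations


# Made-up descriptor values, for illustration only.
drug_like = LipinskiDescriptors(mw=350.0, logp=2.5, num_h_donors=2, num_h_acceptors=5)
not_drug_like = LipinskiDescriptors(mw=650.0, logp=6.2, num_h_donors=2, num_h_acceptors=12)

print(rule_of_five_violations(drug_like))
print(rule_of_five_violations(not_drug_like))
```

Returning the list of exceeded limits, rather than a single yes or no, keeps it visible which descriptor is responsible when a compound falls outside the rule.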

Right, so let's continue and run that. Now let's have a look at the data frame. All right, we can see the four descriptors that we covered previously: molecular weight, which tells us the size, LogP, which tells us the solubility, and the numbers of hydrogen bond donors and acceptors. We can see that there are a total of 133 rows and 4 columns. As a recap, the data that we read directly from the curated file from part 1 is in the df data frame. We're going to combine the df data frame and the df_lipinski data frame, because we also want to have the standard value and bioactivity class columns. We use the pd.concat function to combine df and df_lipinski and put the result into the df_combined variable. Let's have a look at the new data frame. All right, you can see that the four descriptor columns have been added as the last four columns, alongside the columns of the df data frame. The dimensions of the data frame are correct: 133 rows, and the number of columns has expanded to 8.

Okay, so now we're going to convert the standard value, which is the IC50, to the pIC50 scale. The IC50-to-pIC50 transformation is essentially a negative logarithmic transformation of the IC50 value. The reason for doing it is that the original IC50 values are unevenly distributed, so to make the distribution more even we apply the negative logarithmic transformation.

Let me give you a challenge: find out what the distribution of the original IC50 looks like compared with the pIC50 after you have performed the transformation. A hint is that you can make a simple scatter plot. Let me know in the comments, after you have tried this, whether you see any difference between the distribution of IC50 and that of pIC50. Okay, so let's do the actual transformation by running this custom function.
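As a rough illustration of the negative logarithmic transformation, the sketch below converts a few made-up IC50 values to pIC50 using only the standard library. It assumes the standard value is reported in nanomolar, which this video does not state explicitly, and the function name pic50_from_nm is hypothetical rather than the notebook's own.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Return pIC50 = -log10(IC50 in molar) for an IC50 given in nanomolar.

    Assumption: the input unit is nM, so it is multiplied by 1e-9 to get molar.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive number")
    return -math.log10(ic50_nm * 1e-9)


# Made-up IC50 values (nM) spanning several orders of magnitude.
ic50_values_nm = [10.0, 1_000.0, 100_000.0]
pic50_values = [pic50_from_nm(v) for v in ic50_values_nm]

for raw, transformed in zip(ic50_values_nm, pic50_values):
    print(f"IC50 = {raw:>9.1f} nM  ->  pIC50 = {transformed:.2f}")
```

Values that differ by a factor of 10,000 on the IC50 scale end up only four units apart on the pIC50 scale (roughly 8, 6 and 4 here), which is why the transformed data are more evenly spread.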

And one point worth noting here is that the IC50 values contained in the standard value column include some very large numbers, and after the negative logarithm such large numbers become negative values. To prevent that, we need to cap the maximum value right here at 100 million, so that the resulting pIC50 will not be less than 1.0. Otherwise we would have negative values, and that would make interpretation a bit more difficult.

So we're going to cap the values at 100 million by creating a custom function, norm_value. What the norm_value function essentially does is read through the individual values in the standard value column, and if a value is greater than 100 million, it caps it at 100 million. Therefore, after the negative logarithmic transformation, the result will not be less than 1.0. So let's apply norm_value here and describe the values again. Notice that the maximum value is now 1 times 10 to the 8th power, that is 100 million, whereas previously the value was much bigger.

Okay, so we're going to apply the pIC50 function to the normalized data frame and call the new data frame df_final. Notice that we have now created a new column called pIC50 and that the original standard value column has been deleted: it has been converted to pIC50, which is the negative logarithmic form of the IC50. Let's describe the data frame. All right, now the maximum value is 7.3 and the minimum value is 1.0.

What we want to do now is allow a simple comparison between the two bioactivity classes. Therefore we're going to delete the intermediate class and call the new data frame df_2class. All right, we now have 119 rows by 8 columns, so let's perform exploratory data analysis using the Lipinski descriptors.
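The capping, transformation and class filtering steps described above can be sketched without pandas as follows. This is an illustration under assumptions, not the notebook's code: the rows are made up, the standard value is assumed to be in nanomolar, and cap_ic50 and to_pic50 are hypothetical stand-ins for the notebook's norm_value and pIC50 functions.

```python
import math

CAP_NM = 100_000_000  # 100 million, the cap mentioned in the video


def cap_ic50(ic50_nm: float, cap: float = CAP_NM) -> float:
    """Limit an IC50 value so that it never exceeds the cap."""
    return min(ic50_nm, cap)


def to_pic50(ic50_nm: float) -> float:
    """Negative log10 of the IC50 in molar (input assumed to be in nM)."""
    return -math.log10(ic50_nm * 1e-9)


# Made-up rows standing in for the combined data frame.
rows = [
    {"molecule": "cmpd_A", "standard_value": 250.0, "bioactivity_class": "active"},
    {"molecule": "cmpd_B", "standard_value": 5_000.0, "bioactivity_class": "intermediate"},
    {"molecule": "cmpd_C", "standard_value": 3.2e9, "bioactivity_class": "inactive"},
]

# Cap first, then transform; the raw standard value is dropped from the result.
final_rows = [
    {
        "molecule": r["molecule"],
        "bioactivity_class": r["bioactivity_class"],
        "pIC50": to_pic50(cap_ic50(r["standard_value"])),
    }
    for r in rows
]

# Keep only the two classes that are compared in the rest of the analysis.
two_class_rows = [r for r in final_rows if r["bioactivity_class"] != "intermediate"]

for r in two_class_rows:
    print(r["molecule"], r["bioactivity_class"], round(r["pIC50"], 2))
```

Without the cap, the uncapped value of cmpd_C would give a negative pIC50; with it, the result bottoms out at about 1.0, matching the minimum reported in the video.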

In cheminformatics or drug discovery, we call exploratory data analysis chemical space analysis, because what it essentially does is allow us to look at the chemical space, and the chemical space is kind of like a chemical universe. As José Medina-Franco said, each chemical compound can be thought of as a star, and the active molecules can be compared to a constellation. He developed an approach that he termed the constellation plot, whereby you perform chemical space analysis and create a plot in which the active molecules have a correspondingly larger size than the less active molecules. We're going to apply a similar concept in our plot, as I will show you in the next few moments.

What we want to do first is import the seaborn library and matplotlib as plt. Now we're going to create a simple frequency plot of the two bioactivity classes. Using this block of code, we create a frequency plot comparing the inactive and the active molecules.

In doing so, we're also going to save it as a PDF file. The x and y labels are set using these two lines of code, and the frequency plot is made using the countplot function, where the x variable is the bioactivity class right here. As you can see, there is no need to define the y variable, because the y variable here is the frequency. The edge color is black, which means that the bars have a black outline. Being able to save the plot as a PDF file allows you to use the resulting files for your report, your publication, or your project. As I already mentioned in part 1 of this Bioinformatics Project series, these two notebooks are crafted based on an actual research protocol that we use in our own research group. Okay, so let's proceed.

All right, so now we're going to make a scatter plot of the molecular weight versus the LogP, or the solubility, of the molecules. We start by defining the figure size to be 5.5 by 5.5, and we use the scatterplot function here, whereby the x variable is MW, or molecular weight, the y variable is LogP, and the data is df_2class. The hue here refers to the color, and the color is defined on the basis of the bioactivity class. Because there are two classes, you will see that the colors comprise blue and orange, where blue refers to the inactive molecules and orange refers to the active molecules. The size of the data points is set according to the pIC50 values. We define the edge color, which is the edge of these circles, to be black, and the alpha transparency to be 0.7. The x label and y label are customized here as MW and LogP, with a font size of 14 and a bold font weight. In this line we specify that we want the figure legend to be outside the plot; otherwise it would be embedded inside, which would make the plot very difficult to read, so we opted to have the figure legend outside. Then finally we save the plot to a PDF file. So let's run this block of code here.

All right, it's finished, so let's do the same thing for the pIC50 values. The same concept applies, just changing the names of the variables. Here we see the distribution of the inactive class and the active class. This is to be expected, because we used the threshold to define active and inactive, and the thresholds that we used were 5 and 6, right? So if the pIC50 value is

greater than 6, the compound is active, and if the pIC50 value is less than 5, it is inactive. You can see that the distribution of the inactives is rather broad in comparison with the distribution of the active molecules, which lies between 6 and 7, whereas the inactives lie between 1 and 5.

Okay, so we're going to perform the Mann-Whitney U test to look at the difference between the two bioactivity classes, active and inactive. We apply the Mann-Whitney U test to test the statistical significance of the difference, that is, whether the two classes are different or not. The code for performing the Mann-Whitney U test was modified from machinelearningmastery.com, and we made it into a function. All right, so let's run it and apply the mannwhitney function to pIC50. What it does is compare the active class and the inactive class to see whether there is a statistically significant difference for the pIC50 variable. Based on this analysis, the p-value is rather low, so we reject the null hypothesis, and therefore we can say that

the two classes, active and inactive, have different distributions. Okay, so we're going to apply the same plots and statistical analysis to the other four Lipinski descriptors as well, so let's breeze through this: box plot, Mann-Whitney; box plot, Mann-Whitney; box plot, Mann-Whitney; and again, box plot, Mann-Whitney. Note that all of the results from the Mann-Whitney tests and the box plots are saved as files: each Mann-Whitney test has its own CSV file and each box plot has its own PDF file. We can download all of this at the end to use in your own project and research.

So let's have a look, but before we do that, let's do some interpretation of the results. Okay, so let's make sense of the results here, starting with the pIC50 values. Taking a look at the pIC50 values, the actives and inactives displayed a statistically significant difference, which is to be expected because the threshold values were already defined at 6 and 5, as I have already explained. Of the four Lipinski descriptors, only LogP exhibited no difference

between the actives and the inactives, while the other three descriptors, comprising MW and the numbers of hydrogen bond donors and acceptors, showed a statistically significant difference between the actives and inactives. Okay, so let's continue.

All right, and finally we're going to zip up all of the files, comprising the CSV files and PDF files that were generated in this notebook. All of the Mann-Whitney U test results and the box plots will be zipped up so that we can conveniently download them to our computer. So let's zip up the files, click on the Files button on the left panel, hover your mouse over the three dots, click on them, and click Download. All right, it will download to your computer, and you will see the plots that we generated in the notebook and the resulting Mann-Whitney U tests. Okay, all of them are downloaded, including the CSV files.

All right, so if you found value in this video, please give it a thumbs up, and if you haven't yet subscribed, please subscribe to the channel. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching, please like, subscribe and share, and I'll see you in the next one. But in the meantime, please check out these videos.
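For viewers curious about what the Mann-Whitney U test used earlier in the walkthrough actually computes, here is a minimal standard-library sketch. It is not the notebook's mannwhitney function, which is not shown in this transcript. The function name mann_whitney_u is hypothetical, the pIC50 values are made up, and the p-value uses a plain normal approximation without tie or continuity correction, so it should be read as a rough illustration only.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, approximate two-sided p-value) for two independent samples.

    Assumption: the p-value comes from a normal approximation with no tie
    or continuity correction, which is only rough for small samples.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign 1-based ranks, giving tied values the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    # Normal approximation to the distribution of U under the null hypothesis.
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Made-up pIC50 values for the two classes, for illustration only.
active = [6.1, 6.5, 7.0, 6.8]
inactive = [1.0, 3.2, 4.5, 4.9, 2.0]

u_stat, p = mann_whitney_u(active, inactive)
alpha = 0.05
verdict = "different distributions (reject H0)" if p < alpha else "no evidence of a difference"
print(f"U = {u_stat}, p = {p:.4f}: {verdict}")
```

Because the classes were defined by pIC50 thresholds, the two made-up samples do not overlap at all, so U comes out as zero and the approximate p-value falls below 0.05, mirroring the interpretation given in the video.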