Name: Bioinformatics Project from Scratch - Drug Discovery Part 1 (Data Collection and Pre-Processing)
Duration: 22 min 37 s
Channel: Data Professor
Description: Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this YouTube channel we cover data...

Welcome back to the Data Professor YouTube channel. If you're new here, my name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. On this YouTube channel we cover data science concepts and practical tutorials, so if you're into this type of content, please consider subscribing.

In the previous video I showed you how to apply machine learning to a computational drug discovery project. In particular, we downloaded a dataset derived from the study by Delaney, which is essentially a collection of compounds along with their molecular solubility values. Solubility is an important physicochemical property: it tells you whether, and to what extent, a compound can be dissolved in water.

Some of you might be wondering: what if you want to collect original data? Let's say you want to create a new data science project for your portfolio, and you want something new and original that has never been done before. Then this video is for you, because in biology there are a lot of unknowns waiting to be researched. In this video I'm going to show you how to retrieve and download biological activity data of compounds from the ChEMBL database, which you can subsequently use to construct machine learning models, an approach technically known as quantitative structure-activity relationship, or QSAR. The development of such QSAR models holds great value for drug discovery efforts. In particular, it allows us to understand the origins of the biological activity, and interpreting the model allows us to understand how we can design a better drug. So the data that you're going to collect and download today by following along with this video will not only allow you to build your data science portfolio, but may also scratch the surface of the development of novel therapeutic agents, or novel drugs. Without further ado, let's get started.

OK, so the first thing you want to do is head over to the GitHub of the Data Professor, click on the code repository, scroll down, click on python, scroll down again, and click on the CDD-ML Part 1 bioactivity data notebook. Then right-click on the Raw link and save it to your computer. Or, if you would like to follow along on Google Colab, you are more than welcome to do so. Go to Colab, click on File, then Open notebook, then click on the GitHub tab, type dataprofessor, and press Enter. It's going to be the first file that you see here, CDD-ML Part 1. Click on that and it should open up a new notebook for you. I already have it open, so I'm going to follow the one that I have right here.

The exciting part of this video is that you're going to collect original data. It's going to be the same data that researchers in the field are collecting and publishing about, so today you have the opportunity to contribute to computational drug discovery. The database that we're going to use is called the ChEMBL database. It comprises more than 2 million compounds, compiled from more than 76,000 documents, and as of March 25, 2020, the current version is ChEMBL version 26.

The first thing you want to do here is install the ChEMBL web resource client, and we're going to use pip install for that. This library will allow you to download the biological activity data directly from the ChEMBL database. But before we do that, let me show you what the ChEMBL database actually looks like. You can search on Google for ChEMBL, spelled C-H-E-M-B-L. Let's say we search for coronavirus, and then we click on the search for coronavirus in all targets. The targets here refer to the target proteins or target organisms that the drug will act on. Biologically, these compounds come into contact with the protein or the organism and induce a modulatory activity towards it, which could either activate the protein or the organism or inhibit it. This gives seven targets, and if we scroll down we see that the target types are organism and single protein. The single proteins are SARS coronavirus 3C-like proteinase and Replicase polyprotein 1ab, and these are for SARS coronavirus 1. As you can see, SARS coronavirus 2 is not yet deposited in this database, so we're going to work with what we have here.

Let's head back to the notebook. The ChEMBL web resource client should have been installed by now, so let's proceed with importing the libraries. Here we import pandas as pd, and from chembl_webresource_client.new_client we import new_client. Next, we're going to search for the target protein. It's essentially the same process as searching right here in the search bar, where we typed in coronavirus, so we're going to do exactly that in code. First we assign new_client.target to the target variable. Then we create a variable called target_query, equal to target.search with the search keyword as its argument. Then we create a targets data frame by passing target_query to the from_dict function, and finally we display the contents of the data frame. Let's run this to see it. Here we see seven results, the same seven targets that we saw on the website. Notice that there are two single proteins and the rest are organisms, again the same as on the website. In this tutorial we're going to use a single protein for our further investigation.

Let's go to the next step. In this section we select and retrieve bioactivity data for SARS coronavirus 3C-like proteinase. It has the index number 4, so it is actually the fifth entry in the table. Let's run this cell. Notice that the ChEMBL ID here is CHEMBL3927. This is the target ID, a unique identifier of the target. Next we define a variable called activity and assign it new_client.activity. Afterward we define a variable called res and assign it the block of code here, which is activity.filter with the argument target_chembl_id equal to the selected target. After the closing parenthesis of that filter function we apply another filter, which selects only the rows whose standard_type column contains IC50.

I'm going to show you that in the data frame in just a moment. Because there are so many columns, let me show only the first three rows; the font is rather big, so we need to use the scroll bar here. Let's find the column I was talking about, standard_type. Here it is: standard_type, IC50. Let me show you the unique values in the standard_type column. We see that there is only IC50 here. So for this particular dataset the filter wouldn't matter, because all of the values are the same. But other datasets might contain a combination of bioactivity types: it might be IC50, it might be EC50, or it could be percent activity. Defining a particular standard type makes our dataset more uniform, so we won't have a mixture of different bioactivity units. That is why we use only the IC50 type.

The standard_value is the potency of the drug. The lower the number, the better the potency of the drug, and likewise the higher the number, the worse the potency. Ideally we want the standard value to be as low as possible, meaning that the inhibitory concentration at 50% is low: in order to elicit 50% inhibition of the target protein, you need a lower concentration of the drug. Think of it this way. The number reflects the concentration of the drug, so the lower the concentration that is required, the better, because a higher number means that you need a larger amount of the drug to produce the same 50% inhibition. As an analogy, let's say you could take either five milliliters of a medication or five liters of it (which is impossible) to produce the same effect. Which one would you choose? Something to think about.

Let's go back to the notebook. Finally, we're going to write the data frame out to a CSV file, and we're going to call it bioactivity_data.csv. We set index to False because we don't want the index numbers to be in the resulting CSV file. Let's write that out, and let me mount my Google Drive into the notebook. I click here, paste in the authorization code, press Enter, and it's mounted. I think I might have already run this, because the data folder has probably been created, but let me check. It's right here, data, so it has already been created, and for the demonstration let's say I create data2. For you, creating a new folder called data in your Colab Notebooks folder for the first time should work.

Let's continue. We copy the bioactivity data into the folder, and when we list it, we see the bioactivity data file. Let me also add the -l option so that I can see the time at which it was created: April 29th, so just now. Let's list the files again in our current working directory and take a glimpse at the contents of the bioactivity data. It looks like this, so it's CSV data.

Then we proceed to the next step, handling missing data if there is any: we drop compounds with a missing standard value. The thing is, apparently there is no missing data in this dataset, so nothing is dropped. However, this code might come in handy for other datasets where there is missing data.
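The retrieval steps described so far can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: it assumes the chembl_webresource_client package is installed and that network access is available, the helper names (search_targets, fetch_ic50_records, drop_missing_standard_value, write_records_csv) are hypothetical, and it works on plain lists of dictionaries with the standard csv module instead of the pandas data frame used in the video.

```python
import csv


def search_targets(keyword):
    """Return ChEMBL target records matching a keyword (needs network access)."""
    # Imported lazily: the client is a third-party package
    # (pip install chembl_webresource_client).
    from chembl_webresource_client.new_client import new_client

    return list(new_client.target.search(keyword))


def fetch_ic50_records(target_chembl_id):
    """Return the IC50 activity records for one target (needs network access)."""
    from chembl_webresource_client.new_client import new_client

    res = new_client.activity.filter(target_chembl_id=target_chembl_id).filter(
        standard_type="IC50"
    )
    return list(res)


def drop_missing_standard_value(records):
    """Keep only the records that actually carry a standard_value."""
    return [r for r in records if r.get("standard_value") not in (None, "")]


def write_records_csv(records, path, columns):
    """Write the chosen columns to a CSV file (no index column is added)."""
    with open(path, "w", newline="") as handle:
        # extrasaction="ignore" skips record fields that are not in `columns`.
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

To mirror the walkthrough, one would call fetch_ic50_records("CHEMBL3927"), pass the result through drop_missing_standard_value, and then hand it to write_records_csv with a file name such as bioactivity_data.csv. Keeping the network calls inside functions means the filtering and CSV helpers can be tried on small hand-made records first.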

OK, so let's proceed to the next step. Here we do some pre-processing of the bioactivity data. So that we can later build machine learning models, we classify the compounds into three categories: active, inactive, or intermediate. Active compounds are defined as drugs that have an IC50 of less than 1 micromolar, and 1 micromolar is equal to 1,000 nanomolar. So a drug with an IC50 value of less than 1,000 nanomolar is classified as active, a drug with an IC50 value greater than 10,000 nanomolar is classified as inactive, and drugs with values between 1,000 and 10,000 are called intermediate. In the research projects that we have done, we normally use either two classes or three classes. Here we use the conditions that I have just told you about, so let's run this block of code.

Then we're going to iterate over the molecule_chembl_id column. Let's go back here and let me show you molecule_chembl_id. This dataset is made up of many compounds, and a compound is a drug, a molecule. A molecule is a chemical structure that produces a modulatory activity; in other words, it exerts some effect on the target protein. It's like when you take medication and the medication exerts some effect on you: you might feel drowsy or thirsty, which are side effects, but the drug acts directly on the target protein to produce the desired biological effect, which ultimately relieves your symptoms. That is why you take drugs, or medication.

You see that this is the molecule ChEMBL ID. Each compound is described by a molecule_chembl_id, and each row represents one compound. It is possible that multiple rows contain the same molecule ChEMBL ID, and if that is the case, for simplicity we keep only one of them, because we don't want any redundancy in the dataset. Before we iterate, let me show you df2.molecule_chembl_id. It contains the ChEMBL IDs that I mentioned, which are the unique identification numbers of each molecule, and we're going to iterate through each of them. But first we create an empty variable called mol_cid, and in each iteration of the for loop we append the molecule ChEMBL ID to it. Let's run that, and we see that mol_cid contains the molecule ChEMBL IDs. Here we do the same thing: we define an empty variable called canonical_smiles, iterate over the canonical_smiles column, and append each value to the empty variable. Then we do the same for the standard value, which is the IC50.

Actually, this is only one way of doing things, and it might be a complicated way. Another way would be simply to define a variable called selection as a list of the columns we want: molecule_chembl_id, canonical_smiles, and standard_value. Then we index df2 with selection and assign the result to df3. This might be an easier way. I have to run it first, and then we get a data frame containing the three columns that we need, so it gives the same thing.

We want the bioactivity class as well. Do we have it? Yes, we have it here: bioactivity_class. So we just append it to df3 using pd.concat with df3 and bioactivity_class, and axis equal to 1. No, that fails, because concat only accepts Series and DataFrame objects and this is a list, so it needs to be made into a DataFrame or a Series first. All right, now it works. Does it look the same? Yes, it looks the same. So this might be an easier way. Let me copy it in here as an alternative method and move it below this cell. You can use either way.
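The pre-processing described above can be sketched as follows. This is a hedged illustration rather than the notebook's code: classify_ic50, preprocess_records, and build_df3 are hypothetical helper names, treating values of exactly 1,000 or 10,000 nM as intermediate is an assumption (the video only states "less than" and "greater than"), and keeping the first row per molecule is likewise an assumption about which duplicate to keep. The pandas variant imports pandas lazily so that the pure-Python helpers can run without it.

```python
def classify_ic50(standard_value_nm):
    """Label one IC50 value (in nM) as active, inactive, or intermediate."""
    value = float(standard_value_nm)  # ChEMBL returns the value as a string
    if value < 1000:
        return "active"
    if value > 10000:
        return "inactive"
    # Assumption: exactly 1,000 or 10,000 nM counts as intermediate.
    return "intermediate"


def preprocess_records(records):
    """Keep one row per molecule, the three needed columns, and the class."""
    seen = set()
    rows = []
    for rec in records:
        mol_id = rec["molecule_chembl_id"]
        if mol_id in seen:
            continue  # assumption: keep only the first row for each molecule
        seen.add(mol_id)
        rows.append(
            {
                "molecule_chembl_id": mol_id,
                "canonical_smiles": rec["canonical_smiles"],
                "standard_value": rec["standard_value"],
                "bioactivity_class": classify_ic50(rec["standard_value"]),
            }
        )
    return rows


def build_df3(df2):
    """pandas variant: select three columns and attach the class column."""
    import pandas as pd  # third-party; imported lazily

    selection = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
    df3 = df2[selection]
    # pd.concat needs Series/DataFrame objects, so wrap the list of labels in a
    # Series that shares df3's index to keep the rows aligned.
    labels = pd.Series(
        [classify_ic50(v) for v in df2["standard_value"]],
        index=df3.index,
        name="bioactivity_class",
    )
    return pd.concat([df3, labels], axis=1)
```

Giving the Series the same index as df3 matters when rows were filtered out earlier: pd.concat with axis=1 aligns on the index, so a freshly numbered Series could otherwise attach labels to the wrong rows.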

So now we're going to create a CSV file for the pre-processed bioactivity data. We call df3.to_csv with the name of the file and index set to False, and that's all. Let's check the file with ls. Here it is, the pre-processed data, so let's copy that into Google Drive and take a look. All right, we have both files here. Let me annotate this a bit.

Congratulations, you have successfully downloaded biological activity data from the ChEMBL database. We can now use it for subsequent machine learning model building, which I'm going to cover in a future video, so please stay tuned for that. In the meantime, you could use the dataset that you have created in a data science project of your own, or you could modify the search query at the beginning. Let me show you: instead of using coronavirus, you could use another keyword, let's say aromatase. Aromatase is an enzyme in the cytochrome P450 family that is implicated in breast cancer, so the goal of the drug discovery effort is to find a compound or molecule that can inhibit the function of the aromatase enzyme. And here is the human aromatase enzyme.

So try out different keywords and see what proteins you get, and then you can use this novel dataset in your own data science project. The possibilities are endless. You now have original data to play around with that possibly no one else in the world has, because each of you might be using different keywords, and that will be a novelty in itself.

If this video was helpful to you, please give it a thumbs up, and if you haven't yet subscribed, please subscribe to the channel for more awesome content on data science. As always, the best way to learn data science is to do data science, so please enjoy the journey. Thank you for watching. Please like, subscribe, and share, and I'll see you in the next one. In the meantime, please check out these videos.
