This machine learning course starts at the beginning and goes all the way to an advanced level, teaching you both the theory and the applications of machine learning concepts. Ayush, who teaches this course, is a data scientist and machine learning engineer. Hi everyone, and welcome to this course on machine learning. This course will teach you machine learning from the very basics to an advanced level. It covers both the theoretical and the practical understanding of machine learning algorithms, along with building real-world AI projects. After this course, you

will be able to build your own machine learning applications and real-world applications in many domains. Okay, but wait, first: who am I? My name is Ayush, and I'm a data scientist at Artifact. I'm in standard nine, from India. I've worked on various applications of artificial intelligence, like machine learning and deep learning, and on various domains of deep learning, such as computer vision, generative adversarial networks, and natural language processing. I've also contributed to the large AI projects and routine by andana. Okay, so these are my basic skills. I also run a

small YouTube channel (well, not so small now; we have around 500 members) where I make content on machine learning, deep learning, and various AI topics, and I put end-to-end courses there. Currently a deep learning course is being launched, so after completing this course you can start that deep learning course. Okay, so that's it from my side. You can connect with me on LinkedIn if you want to know more about me. I've also cleared

some Microsoft exams, so you can get to know more about me there. I'm also a founder of android, an AI tech platform and also a product-based platform. Okay, so that's it from my side, and I hope you will get a lot from this course. Now let's discuss the syllabus of this course. We will start with the very basics, covering the fundamentals of machine learning in section one. Then we will go further into understanding algorithms like linear regression, logistic regression, support vector machines, principal component

analysis, learning theory, and some ensemble learning methods like bagging, boosting, stacking, and cascading. Then we'll talk about unsupervised learning. You may think, 'Ayush, are you sure you want to teach this much?' Absolutely. This is a 10-plus-hour course, so there is a lot of content. The course is divided into sections, and each section has different subsections. I have made a full syllabus that says which topics I'm going to cover in each section, with every section divided into

subsections, and you can access the syllabus by visiting the course website. I made the course website with my friends, friends of andrew, so you can definitely go there and access the full syllabus and all the problem sets of this course. For each and every section you will have one problem set. All of this is in the description box below: the timestamps as well as the course notes are all available on the course website, so you

can go there, see what assignments there are, complete them, and join our Discord server to discuss your assignments. You can also submit them through the forum. Okay, so this is the basic syllabus that we are going to follow. I really worked a lot on preparing the materials for this course, teaching you everything on the blackboard and helping you understand each and every topic of machine learning in a very easy way. This course has an extensive syllabus, a very good syllabus, which

I haven't seen on YouTube. So please be sure to look at the syllabus for what we are going to cover. I have told you about it only in short, because I want to get started with this video. You can see the syllabus on our course website for absolutely free; everything is absolutely free. The link to the course website is in the description box below. So let's start with section one: fundamentals of machine learning. Okay, so now we'll start with machine

learning. In this section you will get an introduction to machine learning: we'll talk about what machine learning is and the types of machine learning, we'll see some cool applications, then we'll look at the problems of overfitting and underfitting, and then we'll wrap up the section. Okay, so let's start. Now we'll answer a question: what is machine learning? You may ask, 'Ayush, can you tell me what machine learning actually is?' Yes, and you will get to know more and more about what the

big picture of machine learning means, in depth, as we go through this course. But in simple terms, machine learning is this: computer programs that use algorithms to analyze data and make intelligent predictions based on that data, without being explicitly programmed. You might think, 'Ayush, what is this? I'm getting a little confused.' No worries; let's get into it, and let me explain what I'm saying here on the blackboard. You are given some

data, and you give it to the algorithm. The algorithm analyzes the data, and based on whatever it has analyzed or learned from the data, it makes predictions. So specifically, what are we doing? We want to make a function f that maps an input variable x to an output variable y. You may think, 'Ayush, you're confusing me: what is x, what is y?' No worries. x is the input feature.

To see what that means, let's take the problem statement of predicting house prices. Here, on the basis of the size of the house, you want to predict the price of the house. So you give an input variable x and you get an output variable y: given the size of the house, you want the price of the house. So you want to make a

function f that takes the input x and maps it to y, and that's the definition of machine learning; that's the beauty of machine learning. So what are we doing? We want to map our input variables x: maybe the size of the house, maybe the number of fans in the house, maybe the number of bedrooms in the house. I'm denoting them with subscript one, subscript two, subscript three. We want to make a function that maps these input variables to the output variable, which we

denote with y, which is the price of the house. That's what we are doing, and we will learn how to make these functions. So that's the beauty of machine learning, which you can see here. Let's see some of the more formal definitions, which you will find all over the internet. 'Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed,' said Arthur Samuel. But there is one definition that I like to use. If someone

asks you, 'Hey, what is machine learning?', you can answer, 'I know a better way to define machine learning: you just make a function that maps your input variables to the output variable.' That's what is called machine learning, and I hope you got what I'm trying to say. Here the x's are the input features. These are the

input features, and on the basis of them you want to predict the price. Another definition of machine learning is: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. This seems a little intimidating. It is a favorite definition of mine, although the one I showed you earlier is my favorite, but let me tell you

what the definition tells you. Before that, let me take one example: playing checkers. No, let's take another one, because many people here do not play checkers. Let's take an email spam detection system, which you will build in this course. So this is the problem statement. Let's fit that definition

onto this problem. Here T is detecting whether an email is spam or not; let's denote spam as 0 and ham as 1. E is the experience of detection, and P is the performance: how much the performance increases with experience. In this way, the more experience the system has, the better it is. I think this definition fits more formally with something called reinforcement learning, which will

not be covered here, as it is more advanced. But again, the definition you should take away is this: you want to make a function f that maps your input variables to the output variable. That's machine learning, or you can say that's the beauty of machine learning. Okay, so some of the applications of machine learning. I've just listed a few, but there are a lot more: self-driving cars, real estate, stock price prediction, and in medicine we have COVID-19 detection, maybe using chest radiographs, disease prediction, cancer detection, etc., and

the field is just booming. I highly encourage people to get into this field and try to contribute to the world in a unique way. So you can see here some of the applications of machine learning. So how does it work? Let me get back to my board. How it works is actually quite easy. First you study a problem. Again, let's take the problem of spam detection, a spam detection

system. To make a spam detection system, first you study the problem and have a look at the data, to see what the data actually looks like. That's the first step of the basic workflow. Second, you train the algorithm. An algorithm is just a function, f of x; how it is defined you'll see later on, but for now you just have a function, which is the algorithm, and you

train it. Then you go further into evaluation: you give it new emails and check whether it predicts correctly or not. If it is good, you launch the system. Otherwise you analyze the errors (you do error analysis), then you go on to tuning or improving your algorithm, then you evaluate again, and so on. That's why machine learning, or ML,

is a very iterative process, which you will see more of in MLOps. If you are a beginner, don't worry about that term. If you want to get into it after this course, you can go to my GANs course, or the deep learning course, or another course, DSA, which I'll talk about and which is very important, and then you can head over to the MLOps videos and learn something new from there.
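The workflow described above (look at the data, train, evaluate on new emails, then either launch or do error analysis and repeat) can be sketched in a few lines. This is only a toy illustration: the emails, the word-counting "model", and the launch threshold are all made up for the sketch and are not the spam detector built later in the course.

```python
from collections import Counter

# Hypothetical, hand-made emails: (text, label).
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch tomorrow with the team", "ham"),
]
NEW_EMAILS = [
    ("claim your free money", "spam"),
    ("team meeting tomorrow", "ham"),
]

def train(examples):
    """Training step: count how often each word appears in spam and in ham."""
    spam, ham = Counter(), Counter()
    for text, label in examples:
        (spam if label == "spam" else ham).update(text.split())
    return spam, ham

def predict(model, text):
    """f(x): 'spam' if the words were seen more often in spam than in ham."""
    spam, ham = model
    score = sum(spam[w] - ham[w] for w in text.split())
    return "spam" if score > 0 else "ham"

def evaluate(model, examples):
    """Evaluation step: accuracy on unseen emails, plus the mistakes made."""
    errors = [(t, y) for t, y in examples if predict(model, t) != y]
    accuracy = 1 - len(errors) / len(examples)
    return accuracy, errors

model = train(TRAIN)
accuracy, errors = evaluate(model, NEW_EMAILS)
if accuracy >= 0.9:   # arbitrary launch threshold, chosen for this sketch
    print("launch the system")
else:                 # error analysis, then improve the model and repeat
    print("analyze errors:", errors)
```

In a real project the `else` branch is where most of the time goes: you look at the misclassified emails, change the features or the algorithm, retrain, and evaluate again.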

Okay, so what are the types of machine learning systems? I think there are three main types: supervised learning, unsupervised learning, and reinforcement learning. There are a few more, such as batch learning, transfer learning, active learning, passive learning, etc., which you can learn as you go further into machine learning; you will learn to grasp each of them in a few minutes. Okay, so what

is supervised learning? Let's get back to my board and take another example, which is, well, not my favorite, but yes: house price prediction. So this is a house price prediction problem. You have a feature: the size of the house. I'm just taking one feature as an example, because in the real world there would be the size of the house,

the number of fans in the house, the number of bedrooms in the house, and so on. But just for now, from the size of the house you want to predict the price of the house, which we denote as y, and whatever we take as a feature we denote as x. There is only one feature, which is x1, and we are getting only one output, which

is y. So we provide x1, the size of the house, and we get the price of the house as the output. You can see here that we have the size and we have a label, meaning we are given the price, and the model can learn from this how the trend goes; it can recognize patterns. Our data is labeled with y, which means we know what our output is. The model learns from the labels and

the inputs; that's called supervised learning. How can you identify that a problem is supervised? Because you can see the target variable. The one we want to predict is called the target variable, and the others are called the features, or variables. You can also see that there is a relationship between the input value x and the output value y. I'm saying there is a relationship because we

are given the size of the house and are going to predict the price of the house, so you can see intuitively that they have some kind of relationship. And we know what our output should look like: in this case, in regression, it's a continuous output; our output will be a continuous value. So we know what our output should look like, and that's called supervised learning. I think it's going well, meaning you're understanding what I'm saying. So in simple

terms, you can see the definition here: we feed in the data, and the data are labeled, meaning the output variable, which we denote as y, is given, and there is some relationship between our input value x and the output value y. Now, unsupervised learning, which I'm not going to dive deep into just now (I will dive into it later): the data are not labeled, and we can say that we don't know what our output should look

like. There is no given relationship between an input variable and an output variable, and we have to recognize patterns based on the data alone. For doing so we have different algorithms, which we'll study later. I'm not diving deep, but in simple terms, let me tell you what I'm saying. Let's say these are t-shirt sizes: you have many different

t-shirts, and we're denoting each one as a t-shirt. This is our data, and you simply have to cluster it. What your model will do is mark this group as L, this one as XL, and this one as M. So it just makes clusters: XL, L, and M. For now I don't recommend brainstorming over these things, because first we have to fully understand supervised

learning; then unsupervised learning will be much clearer. So I don't want to dive deep into unsupervised learning for now, but you can see what the definitions are. Okay, so let's dive a little deeper into supervised learning. In supervised learning we know our output variable and input variable, as I just explained with house price prediction. In supervised learning we have two types of problems. Let me write it: a

supervised learning problem can be classified as a regression problem or as a classification problem. Let's talk about what regression is and what classification is. I'm going to take one example, house price prediction. Because this is supervised learning, we know what our output is, and you can see here that the output will be a continuous value.
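A rough way to see the difference between the two problem types is to look at the kind of value a predictor returns. The two functions below are hypothetical hand-written rules, not trained models; the numbers in `predict_price` and the 0-to-1 `cat_score` input are assumptions made only for this sketch.

```python
def predict_price(size_sqft: float) -> float:
    """Regression: the output can be any real number (a continuous value)."""
    # Made-up rule: a base price plus a price per square foot.
    return 50_000.0 + 120.0 * size_sqft

def classify_image(cat_score: float) -> str:
    """Classification: the output is one of a fixed set of labels."""
    # cat_score is an assumed score between 0 and 1 from some upstream model.
    return "cat" if cat_score >= 0.5 else "non-cat"

print(predict_price(2200))   # a number on a continuous scale
print(classify_image(0.8))   # one of exactly two labels
```

Whatever algorithm ends up producing the prediction, the question to ask is the same: is the target a number on a continuous scale, or a label from a finite set?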

In regression, your output is a continuous value. In classification, let's say you are given a picture x and you want to classify the image as a cat or a non-cat. You can see here that the output is binary: it has only two values. It is called a discrete value. So we can now say that if

the output of the problem is a continuous value, that is a regression problem; if the output of the problem is a discrete value, we can classify it as a classification problem. You can see here that the definition is the same. Okay, so why do we need to divide our data? I've been talking a lot about data, but what is data? That's a very good question, one I want to ask myself: what does data

look like? Let's take an example of a system that is a house price predictor. The data will look like this; I'm just making a data frame. You have the size of the house, then the number of fans, then the number of bedrooms, and then the target variable, which is the price. These columns are your x, and this one is your y. Let's say nine square feet, two,

two, twenty-two (in thousands of dollars, and so on). I'm just taking these as an example; don't think that it's really a nine-square-foot house. So this is the kind of thing we have: a size, a number of fans, a number of bedrooms. Okay, so here we have our data. And what

you do is divide your data into training and testing sets. Let's say you have a hundred percent of the data: you take 80 percent of your data for training your model and 20 percent of your data for testing. The reason is evaluation: where else would you get data for testing? So you keep 20 percent aside to evaluate, to check how good your model is. This is evaluation on held-out data, which we'll see more of later

on. Okay, so there are two problems that I want to highlight: overfitting and underfitting. What is overfitting? Let's take an example again; I just believe in examples. Here is your x-y plane, which I hope everyone remembers from their school days. Oops, what happened? I hope I've drawn it correctly. These are your data points; let me draw them very quickly. And what you do is simply

draw a straight line to make predictions. Say your input x is 2200 square feet: you go up from there to the line and read off the price. That's how it makes a prediction, which we'll see later on. Now, in underfitting, what happens? Your model has not performed well on the training data or on the testing data. In underfitting your model does not perform well

on either the training or the testing data; it performs badly on both. One reason can be that you don't have a large amount of data, in which case you can simply add more data. But what happens in overfitting? Let me highlight this a little. In overfitting, your model has learned too much, which happens when you have too many features. Then it will try

to touch each and every point. If your model has learned too much, so that it does very, very well on the training data but fails to generalize well on the testing data, then you can say that your model is overfitting. You can see the diagram here: in underfitting your

line does not fit the data in a very good way; a good fit is a good model, where you can just fit a straight line; and a bad fit is an overfit. The solutions to this you'll see later on, but before that we'll touch on some algorithms. And again some notation, which I've already taught you: x means the input features, x1, x2, all the way to xn; y means the output feature; m means the number of training examples, here let's say 600 training examples, which you will get to know more

about as we go further into this course. Okay, so now we are done with this introduction to machine learning, and I really hope that you enjoyed this tutorial. In the next section we'll be talking about one of the algorithms, linear regression, and I hope you will really enjoy that. So let's meet in the next section. Okay, so now we will briefly talk about supervised and unsupervised learning, along with some case studies and data sets, to fully understand what happens in supervised and unsupervised learning. Okay,

so let's start. What happens in supervised learning? As the name suggests, someone is supervising here. I will take an example data set to help you understand better. What I told you in the previous session about machine learning is that in supervised learning we make a function f of x that maps your input variable to the output variable. So in supervised learning we have the input data as well as the output data.

As an example, just assume that we have this data set. Here we have the features outlook and temperature, meaning x1 and x2, then humidity as x3 and windy as x4. And here, the red one, is play tennis. So our problem statement is: given these features (outlook, temperature, humidity, and windy), we have to predict whether that boy will play tennis or not. This is your target variable,

or the variable that we want to predict, and it is given in this case. So here we have our x variables as well as the y variable. We're going to use this data to make a function f of x that maps our input values, all these features x1 through x4, to the y variable.
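To make "a function f of x built from labeled data" concrete, here is a minimal sketch. The rows are illustrative values in the style of a play-tennis table (the table shown in the video is not part of the transcript), and the 1-nearest-neighbour rule is a stand-in chosen for brevity, not the algorithm this course teaches.

```python
# Illustrative labeled rows: (outlook, temperature, humidity, windy) -> play tennis
DATA = [
    (("sunny", "hot", "high", False), "no"),
    (("sunny", "hot", "high", True), "no"),
    (("overcast", "hot", "high", False), "yes"),
    (("rainy", "mild", "high", False), "yes"),
    (("rainy", "cool", "normal", True), "no"),
    (("sunny", "cool", "normal", False), "yes"),
]

def distance(a, b):
    """Number of features on which two rows disagree."""
    return sum(ai != bi for ai, bi in zip(a, b))

def f(x):
    """Map features x = (x1, x2, x3, x4) to y by copying the label of the most similar known row."""
    _, nearest_y = min(DATA, key=lambda row: distance(row[0], x))
    return nearest_y

print(f(("sunny", "hot", "high", False)))      # a row that is in the data
print(f(("overcast", "mild", "high", False)))  # a new combination of features
```

The point is only that `f` is built from pairs of inputs and known outputs; the labels in `DATA` are what make this supervised.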

So we'll give the inputs, say sunny, hot, high, and false, and given these features the function will output whether he will play: no or yes. This is a supervised learning problem because we have our labels, which you can see here. You can also see that there is some kind of relationship between our input values and our output value y, just like

in the more realistic example of house price prediction: given an input feature, the size, you want to predict the price of the house. It's the same here: there is some kind of relationship between the input feature x and the output feature y. Another property of supervised learning is that the input features are independent features. What do I mean by independent features? They don't have

to depend on any other feature. But the target variable y is a dependent feature, because y depends on these features in order to be mapped. That's why x1, x2, all the way to x_i for i = 1 to k, here x1, x2, x3, x4, are independent features, usually called independent variables, and y is your dependent variable or feature. So that's

what the features are called in supervised learning. Let me write a good basic definition of supervised learning. In supervised learning we have our input features X, capital X, and just assume that capital X contains all the features in a vector, all the way to x_i. So we

have our input features X and also our output feature y, and there is some kind of relationship between the input value x and the output value y. X is called the independent feature and y is the dependent feature, because it depends on the input features. As the example showed, there is a y that is used

to train, along with x. Using x and y we'll be able to build a predictive model, and we will see in the next section how to make one, as part of linear regression. But before that, I want to show you that there are two parts of supervised learning: the first is regression and the second is classification. What do I mean by that? You know what your output will look like, because

we have already seen our data. Here you can see the output is a discrete value. What do I mean by a discrete value? Your output is finite: it will be either yes or no. If you know that your output is a discrete value, you consider the problem a classification problem. And when your output is a continuous value, meaning it's not

finite, maybe the age of a person or stock prices, that is continuous. So if your output is continuous, it's regression; if your output is discrete, it's classification. Let's see some applications of supervised learning to help you get a better feel for it. In supervised learning we have, say, stock price prediction: you are given the closing price, the high, and so on, maybe some

volume, and you predict the stock price. You can consider the close as the target variable: you want to predict the closing price on the basis of the high and the volume. So high and volume are your independent features, and close will be your dependent feature, which is y. Next is house price prediction: on the basis of the size and similar features, you want to predict the output y. And let's take an example of a

classification problem: you want to identify whether a person has diabetes or not on the basis of, maybe, age, gender, BMI, etc. That is the output variable y. These are some of the applications of supervised learning. Now let's look at unsupervised learning. I'm not going to dive deep into it here; there is a particular section after supervised learning where we will cover unsupervised learning in depth. But here is the core idea. In

supervised learning we are given x_i as well as y_i for every i = 1, ..., m, where m is the number of training examples. In unsupervised learning we have only the x_i's; we don't have the y_i's. We have x1, x2, x3,

all the way to x_m. We don't have the labels y_i; there is no supervisor to guide you. So what do you do? Let's take an example. Here is your data set, with the features channel, region, fresh, milk, grocery, frozen, detergents, and delicatessen. Unsupervised learning is used in market segmentation, that is, segmenting your customers. You have these features, but you don't have a label for each person. What you do is cluster

the people who have a similar nature. For example, these people tend to buy milk and those people tend to buy something else, so you cluster them out. Then you can hand-code it: these people buy milk, so you can send a promotion to them, or a deal on another item to those people. So you can identify your business needs from these clusters. I'm not going to dive deep into the applications now, but later I will go deep into

the applications and everything. As of now, I hope you have understood supervised learning. So that's it for this small video on supervised learning and unsupervised learning. In the next section we'll start with the actual math, and then we'll dive deep into machine learning; you will see the beauty of machine learning. So let's meet in the next section; until then, do the problem sets. So now we have seen an introduction to machine learning, and I hope that you really enjoyed that section.
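Looking back at the customer segmentation example, the idea of grouping unlabeled customers can be sketched with k-means, one common clustering algorithm. The transcript does not say which algorithm the course uses, so treat k-means here as an assumption; the spending figures and the choice of two clusters are also made up for the illustration.

```python
# Hypothetical yearly spending per customer: (milk, frozen). No labels are given.
customers = [(9.0, 1.0), (8.5, 1.5), (9.5, 0.5), (1.0, 8.0), (1.5, 9.0), (0.5, 8.5)]

def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def kmeans(points, centers, steps=10):
    """Plain k-means: assign points to the nearest center, then move each center to the mean of its points."""
    for _ in range(steps):
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        centers = [
            tuple(sum(vals) / len(g) for vals in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, [closest(p, centers) for p in points]

# Start from two of the data points (a simple initialisation chosen for this sketch).
centers, labels = kmeans(customers, [customers[0], customers[-1]])
print(labels)   # cluster index per customer
print(centers)  # average spending profile of each cluster
```

The cluster indices carry no meaning by themselves; it is up to you to inspect each cluster's center and decide, for example, that one group is "milk buyers" and target a promotion accordingly.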

it's time to get your hands dirty with the math. We will see our first learning algorithm, linear regression, and then do one project, Boston house price prediction. I'm very excited for you to have your first learning algorithm in your toolkit, so head over to the next section.

Okay, so now we will look at one learning algorithm, linear regression. In the previous section we saw an introduction to machine learning; now we will see how to build the function f(x) that maps your input variable x to the output variable y. Before that, let me recall something: in supervised learning we had two types of problems, regression and classification. As you can tell from the name, linear regression is a regression algorithm, and that is what we'll study in this section.

So let's recall what regression is. It is a type of supervised learning where the output is a continuous value. Say you want to predict the price of a house. The output of this problem is a continuous value, not a discrete one, so we can identify it as a regression problem.

Let's see how this algorithm works in more detail, so that you get more intuition and can ace an interview on linear regression. I'll also be giving you some interview questions: what someone might ask you and what not.

Let's assume we have scattered data. Let me draw an x-y plane and some scattered data points on it. As the problem statement, let's take predicting the price of a house, in rupees, based on the size of the house. You give x, the size of the house, and you get y, the price of the house. So on the x-axis we have the size, which is our input variable, and on the y-axis we have the price.

What do we do? In linear regression we fit a straight line through the data. In regression terms this line is called the hypothesis, and we'll see how to compute it. Once we have the straight line, we can make predictions: take a size on the x-axis, go up to the line, and read off the price on the y-axis. Say the size of the house is 2200 square feet; then the price will be something like 2,22,000. That is how we make predictions. You can also see that the line is a little bit far away from some of the actual data points.

That's an issue we'll look at later on. For now, you just construct a straight line that passes as closely as possible through the scattered data points. So let's see how we can compute this straight line. In linear regression, "linear" means that it fits a straight line to the data.

As I've already told you, the straight line is called the hypothesis, or the hypothesis function. We compute it like this: we have a weight for every feature, so

h(x) = theta_0 * x_0 + theta_1 * x_1 + theta_2 * x_2 + ... + theta_n * x_n

and you sum it all up. So we have the thetas and we have the features. Since the problem statement is predicting the price of a house, x_1 could be the size of the house, x_2 could be the number of fans in the house, and so on. These are the features. The term with x_0 is the bias term, or the y-intercept. x_0 itself is always 1, so it is theta_0 that sets the intercept: if theta_0 = 0 the line passes through the origin, if theta_0 = 1 it crosses the y-axis at 1, if theta_0 = 2 it crosses at 2. So it determines where on the y-axis the line starts.

So we have theta_0 * x_0 + theta_1 * x_1 + theta_2 * x_2, all the way to theta_n * x_n. Before we take a concrete example, you may think: hey Ayush, what is theta here? Theta is the only thing the machine has to learn. Once the machine has found the best theta, we are able to make predictions.

Let's say we take theta_0 = 2, theta_1 = 3 and theta_2 = 4. So I'm taking just two features and one bias term. x_0 is always equal to 1, which is why we usually never write it: we just write theta_0 + theta_1 * x_1 + .... I only showed x_0 to make things clear. In our example the two features are the size of the house and the number of fans, and on that basis you have to predict the price. For each feature we learn a weight. These

are called the feature weights. If we get good feature weights we will get good predictions; if we get bad feature weights we will get bad predictions.

So let's construct the example. We have theta_0 = 2 times x_0, where x_0 = 1, plus theta_1 = 3 times the size of the house, plus theta_2 = 4 times the number of fans in the house:

y_hat = 2 * 1 + 3 * (size of the house) + 4 * (number of fans)

Using this you can make a prediction: just plug in the size and the number of fans and you get your output y. Now you may think: hey, this isn't machine learning, we are only using the machine for computation. The learning is in how theta is found, because machine learning is all about learning parameters, and we will see how theta is learned. Theta is the only thing we have to learn, and we'll see the techniques for that. If theta is bad, then your hypothesis function is bad.

So this is your function: theta_0 + theta_1 * x_1 + theta_2 * x_2, all the way to theta_n * x_n. Here n denotes the number of features, and the features are the columns of the data. You can use this function to map your input variable x to your output variable y. So now we have seen how to construct the straight line: using this function. And we only have to learn these. These

are called the feature weights, and theta_0 is the bias term, or the y-intercept, where the line crosses the y-axis, as I showed you earlier. I hope you now have an intuition for the hypothesis function.

Now let's talk about the vectorized form. So far we have been computing each term separately: theta_0 times x_0, then theta_1 times x_1, and so on, one value at a time. With vectorization we do it all at once. We put all the thetas into one giant vector,

theta = [theta_0, theta_1, theta_2, ..., theta_n]

This is the weight vector (the parameter vector); it holds all the weights. As we go further you will believe me that this is the only thing we have to learn, whether it's a neural network or classical machine learning. Then you take another giant vector x and store all your x's in it,

x = [x_0, x_1, x_2, ..., x_n]

and you take the dot product of theta and x. So the hypothesis becomes

f(x) = theta . x = sum over i = 0 to n of theta_i * x_i

The dot product multiplies theta_0 by x_0, theta_1 by x_1, theta_2 by x_2, and so on, and sums it all up in one go. That is the vectorized form, and in Python it is just one line of code: np.dot(theta, x), done. So don't worry about how we will code all this; it's quite easy. So now we have seen the hypothesis and its vectorized form for linear regression.
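As a quick sketch of that one-liner, here is the worked example (theta_0 = 2, theta_1 = 3, theta_2 = 4) in NumPy. The house values (size 1000, 5 fans) and the helper name predict are made up for illustration; they are not from the lecture.

```python
import numpy as np

# Weights from the worked example: bias theta_0 = 2, theta_1 = 3, theta_2 = 4.
theta = np.array([2.0, 3.0, 4.0])

def predict(theta, features):
    """Return the dot product theta . x, with x_0 = 1 prepended to the features."""
    x = np.concatenate(([1.0], features))  # x_0 = 1 carries the bias term
    return np.dot(theta, x)

# Hypothetical house: size 1000 (arbitrary units) and 5 fans.
house = np.array([1000.0, 5.0])

# The same sum written out term by term, for comparison.
term_by_term = theta[0] * 1.0 + theta[1] * house[0] + theta[2] * house[1]

y_hat = predict(theta, house)
print(y_hat)  # 3022.0
```

Both ways give the same number; the vectorized call just does the multiply-and-sum in one step.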

You may think: hey Ayush, how do I get this theta? But before that, how do we evaluate whether our theta is good? For evaluating our model we have something called the cost function. Why do we evaluate the model? To check whether our theta is good or not, because we use that theta to make predictions: we get the features from the user (the size, the number of fans), multiply them by the weights, sum it all up and return the result.

So let's see what the cost function does. Let me draw a scatter plot again, the same as before, and then draw a straight line through it. This line is your hypothesis, f(x), and these points are your actual data points. The cost function simply measures the distance between the predicted value, which is on the line, and the actual value, which is the data point: predicted minus actual, for each point, and then it sums them all up. The higher the cost function, the worse your model; the lower the cost function, the better your model. The reason is that if all your points lie exactly on the line, the predicted and actual values are the same, every difference is zero, and the cost function is zero. These differences are called the residuals. So we take the distance between the predicted and actual value for all data points, and that is the cost function. Let's move forward and put this into a formula.

Let me show you how we can formulate this. I'll write J(theta) as the short form for the cost function, because we are checking how good or bad our theta is, and theta alone determines whether the model is good or bad:

J(theta) = (1/m) * sum over i = 1 to m of (y_hat_i - y_i)^2

Here m is the number of data points and y_hat_i is the model's predicted value. How do we get y_hat? It is y_hat = f(x) = theta . x, the dot product of theta and x, which is just the hypothesis. In other words, we can write the cost as

J(theta) = (1/m) * sum over i = 1 to m of (theta^T x_i - y_i)^2

where theta^T x_i is just the hypothesis h(x_i). The transpose turns the column vector theta into a row vector so that it can be multiplied with the column vector x_i; it is the same dot product between theta and x. So for each and every data point we take the difference between the predicted and the actual value, and then we square it. We square because it makes the cost function easy to differentiate, which helps later in something called gradient descent. You will see shortly why I'm talking about derivatives here.

This is the loss function known as the mean squared error, MSE. If you take the square root of it,

RMSE = sqrt( (1/m) * sum over i = 1 to m of (theta^T x_i - y_i)^2 )

it is called the root mean squared error. For both, the lower the value, the better your model. We will stick with MSE, but in the real world, and in Kaggle competitions where the metric is given to you, I have seen RMSE used a lot.
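Here is a minimal sketch of the two formulas in NumPy. The tiny data set (generated from y = 1 + 2x) and the function names mse and rmse are assumptions made for this example only.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error: (1/m) * sum((X @ theta - y) ** 2)."""
    residuals = X @ theta - y          # predicted minus actual, per data point
    return np.mean(residuals ** 2)

def rmse(theta, X, y):
    """Root mean squared error: the square root of the MSE."""
    return np.sqrt(mse(theta, X, y))

# Made-up data following y = 1 + 2 * x; the first column of X is x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

perfect_theta = np.array([1.0, 2.0])   # every point lies on this line
worse_theta = np.array([0.0, 2.0])     # every prediction is off by exactly 1

print(mse(perfect_theta, X, y))   # 0.0
print(mse(worse_theta, X, y))     # 1.0
print(rmse(worse_theta, X, y))    # 1.0
```

The theta whose line passes through every point gets a cost of zero, and the worse theta gets a higher cost, which is exactly how the cost function ranks models.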

Okay, so once we have our cost function, it tells us how good or bad our model is. Now, if your model is bad, how can you optimize it? How can you get the optimal theta, meaning the best theta? This is a great question to ask. For this we have something called the gradient descent algorithm, which is an optimization algorithm that will help us get the best theta.

So let's dive into this algorithm. Let's say the plot of the cost function looks like this; I'm drawing it this way just for the sake of an example. This is your cost function J(theta), and here is your current theta, at a point where the cost function is very high. What gradient descent does is tweak theta. Say your theta is 0; it changes theta a little bit, say to 0.2. If the cost function decreases, it updates theta to the new value. Then it changes it again, say to 0.3, and if the cost function decreases again, it updates theta again.
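The tweak-and-check intuition can be sketched literally in a few lines of plain Python. This is only an illustration of the intuition, not gradient descent itself (which uses the derivative to pick the step); the one-parameter model, the data and the step size 0.1 are assumptions made for this sketch.

```python
# Toy data generated with theta = 2 (an assumption for this sketch): y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def cost(theta):
    """Mean squared error of the one-parameter model y_hat = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

theta = 0.0              # starting guess, where the cost is high
step = 0.1               # size of each tweak
history = [cost(theta)]  # cost after every accepted tweak

for _ in range(1000):
    # Try nudging theta up and down; keep the first nudge that lowers the cost.
    for candidate in (theta + step, theta - step):
        if cost(candidate) < cost(theta):
            theta = candidate
            history.append(cost(theta))
            break
    else:
        break  # neither nudge lowers the cost any more, so stop

print(round(theta, 6), history[0], history[-1])
```

Theta walks from 0 towards 2 in steps of 0.1, and the recorded cost goes down at every accepted step.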

Then again it tweaks theta and checks whether J(theta) is going down; if yes, it updates theta. It keeps doing this until the cost function stops decreasing, ideally when it is approximately equal to zero. That is gradient descent, and as I've just shown you, it's very simple.

Let's now formulate this mathematically, because it is also very simple as math. To tweak theta, we take the partial derivative of the cost function J(theta). Now you will think: hey Ayush, why did you bring up calculus? I'm not a calculus student. Don't worry at all. I just want to give you one working definition of the partial derivative: it tells you how the cost function changes when you tweak theta a little bit. It's just like the slope. You don't need to be good at calculus for this. If you want to go to a research level then you obviously need it, but for now, for machine learning and even for deep learning, you can get by without it. All this equation is doing is tweaking your theta a little bit. Taking the partial derivative of J(theta) with respect to theta_j gives us

dJ/dtheta_j = (2/m) * sum over i = 1 to m of (y_hat_i - y_i) * x_ij

where x_ij is the value of feature j for training example i. The square is gone: differentiating (y_hat_i - y_i)^2 brings down the factor 2 and leaves (y_hat_i - y_i) times x_ij.
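To make the "it's just like the slope" idea concrete, the sketch below compares the partial derivative of the MSE cost, (2/m) * sum((y_hat_i - y_i) * x_ij), with a numerical slope obtained by nudging each theta_j a tiny bit. The data, the starting theta and all function names are made up for this example.

```python
import numpy as np

def cost(theta, X, y):
    """MSE cost J(theta) = (1/m) * sum((X @ theta - y) ** 2)."""
    return np.mean((X @ theta - y) ** 2)

def analytic_gradient(theta, X, y):
    """Partial derivatives: dJ/dtheta_j = (2/m) * sum(error_i * x_ij)."""
    m, n = X.shape
    error = X @ theta - y
    return np.array([(2.0 / m) * np.sum(error * X[:, j]) for j in range(n)])

def numeric_gradient(theta, X, y, eps=1e-6):
    """Slope of the cost measured by nudging each theta_j by +/- eps."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        nudge = np.zeros_like(theta)
        nudge[j] = eps
        grad[j] = (cost(theta + nudge, X, y) - cost(theta - nudge, X, y)) / (2 * eps)
    return grad

# Made-up data; the first column of X is x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.5, 1.0])

g_analytic = analytic_gradient(theta, X, y)
g_numeric = numeric_gradient(theta, X, y)
print(g_analytic)  # approximately [-3.0, -7.3333]
print(g_numeric)   # should agree closely with the line above
```

Both gradients are negative here, which says that increasing theta_0 and theta_1 would lower the cost.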

So that is what we get after taking the partial derivative. We take the partial derivative for every theta, meaning theta_0, theta_1, theta_2, all the way to theta_n, and then we update each theta. Here is the full gradient descent algorithm:

theta_j := theta_j - alpha * dJ/dtheta_j

This is an update, an assignment. On the right is your old theta, the bad theta; alpha is the learning rate; and dJ/dtheta_j is the partial derivative of your cost function. The result is our new theta.

The learning rate alpha is a hyperparameter, and it determines the size of each step. If alpha is too large, the updates overshoot like this and never converge; they diverge. If alpha is very small, it takes tiny steps and takes a very long time to reach the minimum. The values of alpha I have used so far are 0.1 for larger data sets, 0.01 for small to medium data sets, and 0.001 or 0.0001 for even smaller ones. These are the values that have worked for me so far, but you can tune alpha using GridSearchCV or RandomizedSearchCV, which we will see later on.

So as I said, alpha just determines the rate at which you tweak the parameters while going down, the other term is simply the partial derivative of your cost function, and this will be our

new theta. This one equation is the whole gradient descent algorithm.

Now again I'm going to talk about the vectorized form, because I totally believe in vectorization. Previously we took the partial derivative with respect to theta_0, then separately with respect to theta_1, and so on. What we can do instead is put all the partial derivatives into one giant vector, the partial derivative of J(theta) with respect to theta_0 all the way to theta_n, and write the update in vectorized form:

theta := theta - alpha * (2/m) * X^T (X theta - y)

Here X is the matrix whose rows are the training examples and y is the vector of actual values. This is the same update equation in vectorized form. Either form is totally okay, and whether you vectorize is your choice, but the vectorized form is better for computation, as I've told you.

So now we are done: we have developed our linear regression model. First we built the model for making predictions, then we saw how to check how good our theta is, and then we saw how to get the optimal theta. You may ask: hey Ayush, do I have to do all these things myself? To tell you the truth, in scikit-learn you can implement this whole algorithm in about three lines of code. But in the programming assignments you have to implement the algorithm from scratch, so that you can ace any kind of interview.
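A from-scratch version of the vectorized update theta := theta - alpha * (2/m) * X^T (X theta - y) might look like the sketch below. The noise-free toy data (y = 1 + 2x), the iteration counts and the function name gradient_descent are assumptions for this example; this is not the course's assignment solution.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for linear regression with the MSE cost."""
    m, n = X.shape
    theta = np.zeros(n)                 # start with all weights at zero
    costs = []
    for _ in range(n_iters):
        error = X @ theta - y           # predicted minus actual
        costs.append(np.mean(error ** 2))
        theta = theta - alpha * (2.0 / m) * (X.T @ error)  # vectorized update
    return theta, costs

# Toy data following y = 1 + 2 * x exactly; the first column of X is x_0 = 1.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, costs = gradient_descent(X, y, alpha=0.1, n_iters=5000)
print(theta)                  # close to [1.0, 2.0]
print(costs[0], costs[-1])    # the cost falls towards zero

# On this particular data a learning rate of 1.0 is too large: the cost grows.
_, costs_big = gradient_descent(X, y, alpha=1.0, n_iters=50)
print(costs_big[0], costs_big[-1])
```

The second run illustrates the point about alpha: with a step that is too large the updates overshoot and the cost increases instead of decreasing.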

You may also ask about something called the normal equation, which I want to highlight a little. The normal equation gives you the optimal theta in just one step, with a single equation:

theta = (X^T X)^(-1) X^T y

Here X holds the data points, meaning the features, and y holds the target values. You can simply use this formula to get your optimal theta; it gives the same result as gradient descent. But it does not work for every algorithm: the normal equation is only for linear regression. I hope you have a good intuition for all of this, because in interviews they don't usually talk about the normal equation; they usually ask about gradient descent.

There are also many other optimization algorithms. Besides plain gradient descent, we have stochastic gradient descent, which is called SGD, and we have the Adam optimization algorithm, RMSProp, and gradient descent with momentum. I will talk about SGD, Adam, RMSProp and some more optimization algorithms, including more advanced convex optimization, at an advanced level when you get to deep learning. Or you can head over to the New Era channel, where the deep learning course is currently running, and learn from there. Okay, so now we are done with this.
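For the normal equation theta = (X^T X)^(-1) X^T y, a minimal NumPy sketch follows. The toy data (y = 1 + 2x) is made up, and using np.linalg.solve rather than forming the inverse explicitly is a choice made here for numerical stability, not something taken from the lecture.

```python
import numpy as np

# Made-up data following y = 1 + 2 * x; the first column of X is x_0 = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

# Literal form of the equation: theta = (X^T X)^(-1) X^T y.
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same result, obtained by solving (X^T X) theta = X^T y without an explicit inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_inv)    # close to [1.0, 2.0]
print(theta_solve)  # close to [1.0, 2.0]
```

There is no learning rate and no loop here: one linear solve gives the optimal theta directly.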

Let's spend a little time on some assumptions of linear regression, because in an interview they usually ask why you chose this algorithm instead of that one, or what the assumptions of this algorithm are. First, there should be a linear relationship: the relationship between the features and the target should be linear. Second, there should be no or little multicollinearity: the features should have little or no correlation with each other. You can search the internet for more.

Now I also want to explain what independent and dependent features are. The independent features are things like the size of the house, the number of fans and, let's say, the number of bedrooms. They are called independent because they do not depend on any other value. But the target variable y depends on all these features, so the target variable is called the dependent variable and the features are called independent variables. This is just general information to know, because everyone talks about it.

I think we have covered a lot in a small amount of time, and I hope you really enjoyed this tutorial; I'm putting in all my effort. You can go to my YouTube channel, New Era, and subscribe if you want. In the next section I'm going to cover polynomial regression in theory, but let's

spend two more minutes on polynomial regression now. We have seen the assumption that the data should be linear. But let's say our data is not linear; then what do we do? Say your data curves like this. If you fit a straight line to it, it will obviously underfit. So what you do is simply transform your data into quadratic form: you raise the feature from degree one to degree two. So 2 is transformed into 4, 3 is transformed into 9, and so on; I'm just taking an example. You transform your data so that a linear model can fit it.
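The transform-then-fit idea can be sketched like this: add the squared feature as an extra column and fit the same linear model. The curved toy data (y = x^2) is made up, and np.linalg.lstsq is used here simply as a convenient least-squares solver; the lecture does not prescribe it.

```python
import numpy as np

# Made-up curved data: y = x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# Straight-line features [1, x] versus quadratic features [1, x, x^2].
X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])   # 2 -> 4, 3 -> 9, ...

# Least-squares fit of a linear model on each feature set.
theta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
theta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

mse_lin = np.mean((X_lin @ theta_lin - y) ** 2)
mse_quad = np.mean((X_quad @ theta_quad - y) ** 2)

print(mse_lin)     # clearly above zero: the straight line underfits
print(mse_quad)    # essentially zero: the transformed data is fit exactly
print(theta_quad)  # close to [0, 0, 1]
```

The model is still linear in theta; only the input has been transformed.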

Now your algorithm can fit the data like this. I hope you have understood the idea of polynomial regression. Next we will do the Boston house price prediction project, and then we'll move on to regularized linear models.

Okay, so now we have seen linear regression and done one project. Now it's time to get your hands dirty with the programming assignment. You will be able to find the programming assignment in the description box below, on the sign-in page. After doing the assignment, come back and continue with this course.

Now we'll start with logistic regression, and I hope you will really enjoy it. We have seen linear regression, which is a regression algorithm; now we will see a classification algorithm, logistic regression. Don't think that logistic regression is a regression algorithm. No, it's a classification algorithm. It has regression in the name because the underlying working of this

algorithm is similar to linear regression. You will get to know how it differs from linear regression.

Before that, let's make sure we are on the same page about classification. What is classification? Let's take an example: given an image x, you want to classify whether it is a cat. If it is a cat we label it 1; if it is not a cat we label it 0. So the output is a discrete value. Classification is a supervised learning approach, so this algorithm is a supervised learning algorithm: we know what our output should look like. Here the output is a discrete value, so we can identify this as a classification task. When there are only two classes it is called binary classification. We also have multi-class classification, where, say, the person has cancer, or pneumonia, and so on. There too the output is a discrete value from a finite set.

Now it's time to study logistic regression in detail. In logistic regression we classify the data. Let's start with the hypothesis. In linear regression our hypothesis was h(x), and in logistic regression we start with the same thing:

h(x) = theta_0 * x_0 + theta_1 * x_1 + theta_2 * x_2 + ... + theta_n * x_n

In linear regression we used this to draw a straight line, and here we use the same function. This is the first step of the hypothesis. The second step is that the model's predicted value is the sigmoid of h(x). Let's denote h(x) by the short form z, so z here is

simply h(x), and h(x) is theta_0 * x_0 all the way to theta_n * x_n, which we can write as theta^T x. So you take the sigmoid of z and you get your output:

y_hat = sigmoid(z), where z = h(x) = theta^T x

What does the sigmoid do? Whatever h(x) comes out as, let's say 22, the sigmoid squashes it into the range between 0 and 1. Then you set a threshold: if the model's predicted y_hat is greater than 0.5, then yes, this picture is a cat; otherwise, if it is smaller than 0.5, it is a non-cat. In linear regression h(x) alone was our hypothesis, but in logistic regression we apply a sigmoid on top of h(x). The reason is simply that we want our output in the range of 0 to 1 so that we can make predictions like this. The formula for the sigmoid is

sigmoid(z) = 1 / (1 + e^(-z))

where z is h(x). This is your whole hypothesis, your prediction function. You plug in z, and again you have to learn the thetas, which are called the parameters, the same as in linear regression.

So the first step is the same as linear regression, then we apply a sigmoid to h(x), and we get an output in the range of 0 to 1. We set a threshold of 0.5: if the model's predicted y_hat is greater than 0.5 then it is a cat, otherwise it is a non-cat. If you want to be stricter, you can use 0.7, so the probability has to be greater than 0.70. Let's say your model outputs 0.80. That means the model is saying there is an 80 percent probability that this image is a cat, so you round it to 1, and 1 means it is a cat. If instead it is 0.40, that is 40 percent, so you mark it as a non-cat, 0. This is the basic thing you should understand about the prediction function we have built. And again, we only have to learn the thetas: we have to find good thetas to get good output. So that was the hypothesis for logistic regression.

hypothesis for logistic regression.

Now, in linear regression we saw something called the cost function. The cost function simply tells you how well the model is doing: if the cost is very high, your model is very bad; if the cost is low, your model is good. So it helps us evaluate the model. It is also called the loss function, and for a good model J(theta), the cost function, should be approximately equal to zero.

In logistic regression we define the cost function a little differently. Let's write it for one training example: J(theta) = -[ y(i) log(h(x(i))) + (1 - y(i)) log(1 - h(x(i))) ]. So this is your cost function for one training example.
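
As a minimal sketch of the hypothesis, the threshold and the single-example cost described above: the code below uses plain Python, and the values of `theta` and `x` are made up for illustration, they are not from the lecture.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta^T x). x[0] is assumed to be 1 (the x_0 = 1 convention)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def predict_label(prob, threshold=0.5):
    """Return 1 (cat) if prob is above the threshold, else 0 (non-cat)."""
    return 1 if prob > threshold else 0

def example_cost(h, y):
    """Log loss for one training example: -[y log h + (1 - y) log(1 - h)]."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

theta = [-1.0, 2.0]        # hypothetical parameters [theta_0, theta_1]
x = [1.0, 1.5]             # x_0 = 1 plus one hypothetical feature value
h = hypothesis(theta, x)   # z = -1 + 2 * 1.5 = 2, so h = sigmoid(2)

print(h, predict_label(h), predict_label(h, threshold=0.9))
print(example_cost(h, 1))  # small cost: confident and correct
print(example_cost(h, 0))  # large cost: confident and wrong
```

Raising the threshold, as with the stricter 0.7 mentioned earlier, only changes `predict_label`; the probability `h` and the cost stay the same.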

And you can see here that we have minus one times y(i) times the log of h(x(i)), plus (1 - y(i)) times the log of 1 - h(x(i)). So let's break this equation down and understand it step by step. Here y(i) is the ground truth and h(x(i)) is the model's predicted value; we take the log of the prediction and multiply it by y(i).

Let's say your y equals zero and the model's prediction y hat is also close to zero. Then the cost will be approximately zero, because the two agree. If instead the ground truth is one and the model predicts close to zero, that is a mismatch: your model did very badly, so the cost will be very, very high. That is the basic intuition behind this cost function.

We have done it for one training example, so let me write the equation for m training examples: J(theta) = -(1/m) * sum over i = 1 to m of [ y(i) log(h(x(i))) + (1 - y(i)) log(1 - h(x(i))) ]. In machine learning this is sometimes called the log loss. As in the example I gave, when the ground truth and the model's prediction agree the cost is close to zero, and when they differ the cost is very high. So this is the cost function for your logistic regression model.

Let's recapitulate the two things, the hypothesis and the cost function. The hypothesis is h(x) = g(z), the sigmoid of z, where z = theta transpose x, the dot product between theta and x, and g(z) = 1 / (1 + e^(-z)). Theta contains the parameter weights, and x contains x_0, x_1, all the way to x_n, where by convention x_0 = 1. You can rewatch the linear regression section if you are getting a little confused, because I went a little slower there.

For reducing J(theta), that is, for finding the optimal parameters, we use the gradient descent algorithm, and it works the same way as before. Here is the cost function diagram: the cost is very high when theta is up here. When you change theta a little, the cost decreases. Then you change it again, which means you take the gradient of the cost function and check whether the cost is going down; if it is, you update your parameter.
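
The "step downhill and check that the cost went down" idea can be sketched on a toy problem. The bowl-shaped cost below, with its minimum at theta = 3, is a hypothetical stand-in chosen only for illustration; it is not the logistic regression cost.

```python
def cost(theta):
    # Hypothetical bowl-shaped cost with its minimum at theta = 3
    return (theta - 3.0) ** 2

def gradient(theta):
    # Derivative of the cost above with respect to theta
    return 2.0 * (theta - 3.0)

theta = 0.0                  # starting point, far from the minimum
alpha = 0.1                  # learning rate
history = [cost(theta)]      # cost after each update, to check it goes down

for _ in range(50):
    theta = theta - alpha * gradient(theta)   # move against the gradient
    history.append(cost(theta))

print(theta, history[0], history[-1])
```

Each update moves theta a little closer to the bottom of the bowl, so every entry in `history` is no larger than the one before it.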

Let's say theta started here; you update it a little, say to 2.1, and your cost decreases. Then you keep doing the same until you reach the global optimum down here.

Here is the equation for that. The partial derivative of J(theta) with respect to theta_j, which you get by differentiating the cost function, is (1/m) * sum over i = 1 to m of (h(x(i)) - y(i)) * x_j(i). The notation may change a little from source to source, because everyone writes it slightly differently, but it is the same thing. So you take the partial derivative of your cost function J(theta) and then you update theta using that gradient: theta_j := theta_j - alpha * dJ(theta)/d(theta_j). On the right is your previous theta, on the left is the new, updated theta, alpha is the learning rate, and the last factor is the partial derivative of the cost function. It is just the same as before: you tweak your parameters and check whether the cost function is decreasing.

Okay, so we are done with logistic regression, and I really hope you enjoyed it. Let's recap and then go a little

further into vectorizing this.

So what have we seen? The hypothesis is given by the sigmoid of z, which is simply 1 / (1 + e^(-z)), with z = theta transpose x. Then you have a cost function to measure how well the model is doing, J(theta) = -[ y(i) log(h(x(i))) + (1 - y(i)) log(1 - h(x(i))) ], averaged over the m training examples. And in gradient descent you update theta: theta_j := theta_j - alpha * dJ(theta)/d(theta_j), where alpha is simply the learning rate that we saw in linear regression.

Now it's time to go into more detail about vectorization. What does vectorization mean? With the loop form you spend time going through the examples one by one; vectorization means doing all the calculations at once. Here is the vectorized form of the cost function: J(theta) = -(1/m) * ( y^T log(h) + (1 - y)^T log(1 - h) ), where h is the vector of model predictions and y is the vector of ground truths. Gradient descent can be vectorized too: theta := theta - (alpha/m) * X^T (h - y), which is what you get after working out the partial derivatives. Those are the basic things to keep in mind when performing logistic regression, and I really hope you have enjoyed it so far.

Let's summarize a little so that you get a better feel for what we have seen. Logistic regression is a classification algorithm. Once you apply the sigmoid to z you get a probability, a value between zero and one, and then you simply take a threshold: if the output is greater than 0.5 you make it a one, otherwise you make it a zero. By convention we take one as the positive class, meaning the image is a cat, and zero as the negative class, meaning the image is not a cat. You could choose it the other way round, but this is the convention. So we have the hypothesis, we have a cost function for checking how well the model is doing, and we have gradient descent for finding the optimal theta.
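
Putting the summary together, here is a minimal NumPy sketch of vectorized logistic regression: the sigmoid hypothesis, the log-loss cost, the gradient descent update and the 0.5 threshold. The tiny dataset, the learning rate and the number of iterations are assumptions made up for illustration, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J = -(1/m) * (y^T log h + (1 - y)^T log(1 - h))"""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def gradient_step(theta, X, y, alpha):
    """theta := theta - (alpha / m) * X^T (h - y)"""
    m = len(y)
    h = sigmoid(X @ theta)
    return theta - (alpha / m) * (X.T @ (h - y))

# Hypothetical data: the first column is x_0 = 1, the second is a single feature
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = np.zeros(2)
initial_cost = cost(theta, X, y)   # equals log(2) when theta is all zeros

for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)

probs = sigmoid(X @ theta)             # probabilities between 0 and 1
labels = (probs > 0.5).astype(int)     # threshold at 0.5

print(initial_cost, cost(theta, X, y), labels)
```

The whole dataset is processed in one matrix product per step, which is the point of vectorizing: there is no Python loop over the individual training examples.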

And again, we are only learning theta here, using the gradient descent algorithm. I think we are done with logistic regression. In the next section we will go over a project, a breast cancer detection system, and then go a little further into understanding the support vector machine. I really hope you will enjoy that section too, so let's meet in the next section.

Okay, so here I am in my Jupyter notebook. You can install Jupyter Notebook by searching online for how to download it and following one of the tutorials. Here is what we will do. We will start by importing the libraries, then load the data and understand what data we are working with. Then we will do exploratory data analysis, meaning data visualization and data analysis, perform feature engineering, and see how to select features based on their correlation. You don't need any experience with pandas for this, although you can look into it in detail if you want to; you don't have to be an expert in any of these, because this is just a beginner project. You can also modify it, make some changes and put it on your resume, and you can save the model and deploy it on a website. I will talk about deployment later on, but before that let me walk you through this project.

First of all we import the libraries. We import numpy as np, where np is an alias, and pandas as pd, where pd is also an alias, a short form of the name. Plotly is another great visualization library that I just wanted to show you; I am not going to use it for now, but you can use it in future. Seaborn is the one I am going to use here, and matplotlib is also a visualization library that I am going to use. The %matplotlib inline line tells matplotlib to use the Jupyter notebook as its backend, so the plots are drawn inside the notebook.

Next we load the data from the scikit-learn library, which is a famous library for machine learning. We load the dataset with load_boston, the Boston house price dataset, because we are going to build a Boston house price prediction model and we want a dataset for that. We call load_boston and take X, which is the .data attribute, and y, which is load_boston's .target. In the introduction to machine learning we talked about the X and y variables: the X variables are the independent features that the model uses to make predictions, and y is the target variable. Because this is supervised learning, we know what our output label y is. So here X is the features and y is the target variable, which in this case is the sale price.

Then I call pd.DataFrame, which constructs a data frame. We give it X, the features, and the column names from .feature_names, an attribute that gives you the names of the columns automatically. And then we add one more column, the sale price.

We are making a SalePrice column and setting it to y, where y means load_boston's target. You can also write it the other way; it's just the same, y is just the target variable. If you run this now you can see the data frame. Here you have the per capita crime rate and so on; we will go through the columns in a moment. First let me show you what we have done: everything up to LSTAT is the X variables, and this last column is the y variable that you want to predict. You may think that 24.0 means the price of the house is 24 dollars. No, the price is given in thousands of dollars. We will see the whole thing in just a second. So this is the data frame that we constructed.

Now let's take a look at what data we are working with. There are 506 rows and 13 numeric and categorical feature columns, and the median value, attribute 14, is our target variable. Here I have named it SalePrice, but you could call it median value; it is usually the target. Then come the column names: CRIM is the per capita crime rate, then the proportion of residential land, the proportion of non-retail business, NOX, AGE, the average number of rooms per dwelling, and so forth; you can read them here. The last one is the target variable, the median value, which is the sale price in this case, in thousands of dollars. So that is 24 thousand dollars; now your dream is gone, I think. You can read all of this through .DESCR, because this dataset ships with scikit-learn, so you have this kind of description to see information about the dataset. But in the real world we will work on real-world datasets, and you will see how we download them, how we process them, and so on. So let's understand a little

bit more about the data. We look at the shape of the data, which is 506 rows and 14 columns; it is rows first and then columns. Then .info() tells us about the data: the non-null counts, the data types, and whether there are any null values, where null means a missing value in that column. All the data types are float, and it also shows the memory usage and so on. Then .describe() tells you, for each column, the mean, the standard deviation, the minimum, 25 percent, 75 percent, the max, the count, and so on. So we have seen what data we are working with: these are the features, the X variables, and this is the target, y, so it is a supervised learning problem.

Let's take a look at the nulls. You can use data.isnull().sum(), and you can see that we get 0, 0, 0 all the way down. Then we can draw a pair plot with sns.pairplot, which plots every feature against every other feature. It takes a few seconds to show up. But we do not get a lot of information from this pair plot, because each panel is so small that we cannot see what the data is doing, so we definitely need some more focused visualizations.

So let's draw our inferences from the sale price, the target variable, and do some analysis. We plot a distribution plot, and you can see that it is a little skewed; we are seeing a positive skew, and we can apply a transformation to fix that. What are the skewness and the kurtosis? Here we have 1.108098 and 1.495197, and these help us spot outliers. Outliers are values that lie far away from the rest. Let's say you are working with age: 20 years old, 50 years old, and then 150 years old, which is an outlier. Outliers are the exceptions in the data frame. We have to take care of this, but before that let's

see how some of the columns relate to the sale price. I have taken two columns; you can try different ones. Where there is little crime the sale price is very high, and where there is a lot more crime the sale price is very low. You can also look at age: the houses around 100 years old sold at lower prices and the ones around 20 years old at somewhat higher prices. You can see that in the visualization here.

You can also see that I have imported scipy, which is a library like numpy, and I am taking norm and skew from it. You plot a distribution plot to compare against a normal distribution, if you know what a normal distribution is: you have this black line and this blue line, one for the normal distribution and one for your actual data, so you have to transform the data a little. Here mu, the mean, is 22.53 and sigma is 9.19. If you know Python you will recognize the string formatting used to print these. And this is the Q-Q plot, which lets us compare the quantiles against the ordered values; you can search the internet or go to Wikipedia to learn more about it, but our main focus will be the sale price distribution plot.

So let's run this and apply a log to transform our sale price into something closer to normal. Now you can see that the skewness is gone. We have just applied the log1p transformation, and it looks good now; this reduces the effect of outliers, and the deviation in the Q-Q plot is gone too.

Next, data correlation. What is correlation? Correlation is the relation between features. If it is 1 the features are perfectly positively correlated, and if it is -1 they are perfectly negatively correlated. You can search the internet for more, because this is not a statistics class. You can see that the diagonal is 1, because all the features

are perfectly correlated with themselves.

So how do we select the features that are highly correlated with the target? We take the absolute value of each feature's correlation with the sale price and keep the highly correlated features from this statement. That gives us 12 features that are highly correlated. You can choose to use only these, dropping the rest of the columns, but I am not going to.

Let's start with model building. You import train_test_split, which simply divides your data into training and testing sets: out of 100 percent of the data, it takes 80 percent for training and 20 percent for testing. We use drop because we don't want X to contain the sale price; y is the sale price. Then you pass test_size, and random_state makes sure the split does not change every time you run it. If you run this and look at the shapes, there are 404 rows and 13 columns for training and 102 rows for testing, with 404 training labels and 102 test labels.

Let me make this a bit larger so that it is clearly visible. You import LinearRegression from scikit-learn and instantiate it, and then you fit the model on X_train and y_train, the data we set aside for training: X_train is the input and y_train is the target variable. If you run this, the model is now trained.
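
To make the split-and-fit step concrete without depending on the notebook, here is a NumPy-only sketch of what an 80/20 split and a least-squares linear fit do, followed by the mean squared error check used to evaluate it. This is not the scikit-learn code from the project: the data is randomly generated as a hypothetical stand-in with the same shape (506 rows, 13 features), and `np.linalg.lstsq` is swapped in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed, playing the role of random_state

# Hypothetical stand-in data with the project's shape: 506 rows, 13 features
X = rng.normal(size=(506, 13))
true_theta = rng.normal(size=13)
y = 3.0 + X @ true_theta + rng.normal(scale=0.1, size=506)

# 80/20 split: shuffle the row indices once, then cut
indices = rng.permutation(len(X))
n_test = int(np.ceil(0.2 * len(X)))        # 102 test rows, 404 training rows
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# "fit": least-squares solution, with a column of ones for the intercept (x_0 = 1)
A_train = np.column_stack([np.ones(len(X_train)), X_train])
theta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# "predict" on the held-out rows, then measure the error
A_test = np.column_stack([np.ones(len(X_test)), X_test])
y_pred = A_test @ theta
mse = np.mean((y_test - y_pred) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)
```

Because the seed is fixed, the same rows land in the training and test sets on every run, which is the behaviour random_state provides in the project.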

Now we can make predictions. We can compare against the actual label, because we know what y_test[0] is. The prediction from the model is 3.36 and the actual value is about 3.2188, which is a little different, but the model is performing very well for linear regression.

Now, if you want to check how well it does, we have seen the cost function: you can run this to get the MSE, the mean squared error. The MSE is 0.035, which is good. If you want the RMSE, the root mean squared error, you just take np.sqrt of it and print it out. You can see that it is pretty good. You will learn how to improve this using XGBoost, bagging, boosting and so on later on.

Now we have finished our full prediction project. The code will be on my GitHub, which is linked in the description box below, and there are a lot more projects there in computer vision and natural language processing that you can take a look at. I have not typed the code live, because that would take a lot of time; instead I have annotated each and every line. If you run into any kind of problem, search the internet, because you have to master Google. In the next tutorial we will talk about regularized linear models, lasso regression and ridge regression. So let's get into the next section.

Okay, so we have talked about linear regression and built one project. Now it's time to get into regularized linear models.

If you remember, we pointed out a problem earlier called overfitting. How does overfitting happen? It happens when your model has learned too much. Let me draw an x-y plane and some data points. Let's say you have a complex model, a complex function that maps your input variable to your output variable, and it fits a curve that touches each and every point. Now a new example comes in, and the model makes a bad prediction for it. You can see that it learned too much from the training data it was trained on, and it does not generalize well to new examples. That is called overfitting: if your model reaches something like 100 percent accuracy on the training set but does much worse on the validation set, it is likely overfitting.

How is it caused? Let's say you have a lot of features, a thousand features, say. The model will obviously learn a complex function, and then it will perform very badly on new data. We have some solutions for that. One is to reduce the number of features. The other is regularization, which amounts to much the same thing: in simple terms, regularization penalizes, and in effect eliminates, the features that are not useful.

Let's look at the regularization techniques, lasso regression and ridge regression. These are regularized linear models: we simply add a regularization term at the end of the cost function. Let's say we add this term: (lambda/2) * sum over j = 1 to n of theta_j^2. This particular term is for ridge regression; I will talk about lasso right after. You add it at the end of the cost function, so your new cost function becomes J(theta) = (1/m) * sum over i = 1 to m of (theta^T x(i) - y(i))^2 + (lambda/2) * sum over j = 1 to n of theta_j^2, where theta^T x(i) is the model's prediction.

Now let's understand what this is doing. Let's say you are predicting the price of a house and your features are the size of the house, the number of fans, the number of bedrooms and, say, the amount of grass. The grass seems to be a less important feature, so regularization will push its theta, the feature weight of that column, towards zero.
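
The ridge cost function above can be written out directly. This is a plain-Python sketch; the three-point dataset and the value of `lam` are hypothetical, chosen so the arithmetic is easy to follow.

```python
def ridge_cost(theta, X, y, lam):
    """(1/m) * sum_i (theta^T x_i - y_i)^2 + (lam/2) * sum_{j>=1} theta_j^2"""
    m = len(y)
    squared_error = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(t * v for t, v in zip(theta, x_i))
        squared_error += (prediction - y_i) ** 2
    # The penalty sum starts at index 1, as in the formula, so theta_0 is skipped
    penalty = (lam / 2.0) * sum(t ** 2 for t in theta[1:])
    return squared_error / m + penalty

# Hypothetical data: x_0 = 1 in the first column, and y = 2 * x_1 exactly
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

perfect_fit = [0.0, 2.0]   # fits the data exactly
shrunk = [0.0, 1.9]        # slightly smaller weight, slightly worse fit

print(ridge_cost(perfect_fit, X, y, lam=0.0))   # no penalty: the exact fit costs 0
print(ridge_cost(perfect_fit, X, y, lam=0.5))   # with penalty: the weight itself costs something
print(ridge_cost(shrunk, X, y, lam=0.5))        # the smaller weight gives a lower total cost
```

With the penalty switched on, the slightly shrunk weight has a lower total cost than the exact fit, which is the pull towards smaller thetas that the lecture describes.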

So ridge regression penalizes the feature weights, making them closer and closer to zero. Whatever the less important features are, it just penalizes their theta. Theta is what is used to make the prediction: theta times the amount of grass contributes to the prediction, so if theta is very, very small, that feature contributes almost nothing. That is what ridge regression is doing, penalizing or practically eliminating the feature by shrinking its theta towards zero.

In ridge regression theta only gets closer and closer to zero. In lasso regression, whatever the less important features are, it simply makes their theta exactly zero. Let's say the feature value is 7: with theta equal to zero, zero times 7 is equal to zero.
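
The difference between "closer and closer to zero" and "exactly zero" can be seen in a one-weight simplification. This is an assumption made for illustration, not the lecture's setup: suppose a single weight whose best unpenalized value is `theta_ols`, with the squared-error part of the cost written as 0.5 * (theta - theta_ols)^2. Adding a squared (ridge-style) penalty or an absolute-value (lasso-style) penalty then gives the two closed-form minimizers below. The feature names and weights are hypothetical.

```python
def ridge_shrink(theta_ols, lam):
    """Minimizer of 0.5 * (theta - theta_ols)^2 + 0.5 * lam * theta^2."""
    return theta_ols / (1.0 + lam)

def lasso_shrink(theta_ols, lam):
    """Minimizer of 0.5 * (theta - theta_ols)^2 + lam * |theta| (soft-thresholding)."""
    if theta_ols > lam:
        return theta_ols - lam
    if theta_ols < -lam:
        return theta_ols + lam
    return 0.0

# Hypothetical unpenalized weights for three features
weights = {"size": 4.0, "bedrooms": 1.5, "grass": 0.3}
lam = 0.5

for name, w in weights.items():
    print(name, ridge_shrink(w, lam), lasso_shrink(w, lam))
```

The squared penalty scales every weight down but never reaches zero, while the absolute-value penalty subtracts a fixed amount and clips at zero, so the small "grass" weight of 0.3 shrinks to about 0.2 in the first case and becomes exactly 0.0 in the second.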

So that feature is eliminated, which is why lasso is very strict. The two differ only in the norm they use: ridge regression uses the L2 norm, and lasso uses the L1 norm. For lasso you add a regularization term like this at the end of the cost function: lambda * sum over j = 1 to n of |theta_j|, the absolute value of theta_j.

One more thing you can see here is that we do not penalize theta_0, which is the bias term. We don't want to penalize the bias, so the sum starts at j = 1 rather than j = 0, and theta_0 is handled separately.

Let me make that clear. For theta_0 you take the gradient of the cost function without the regularization term and use it to update theta_0. For the other thetas, theta_1 all the way to theta_n, you take the gradient of the cost function with the regularization term added at the end, and update each of them with that. So theta_0 and the remaining thetas are updated separately. You do not penalize theta_0 because it is only the bias term. I think that is clear now.

Again, as a recap: regularization penalizes or eliminates the less important features by pushing their parameter weights to zero. I hope that is pretty clear now, and in the next

tutorial we will be working on logistic regression. We will start with classification, and we will cover logistic regression, which is much the same as linear regression except that it does classification, not regression. I will be happy to see you in the next section.

Okay, so now we will talk about regularization, another favorite topic that I like to talk about. It will take around 10 to 15 minutes, and I think this is one of the most important topics in machine learning, and later in deep learning too. We will learn about L1 and L2 regularization, which are often called lasso and ridge regularization. What is regularization, and why do we need it? I think those should be your first two questions here. But before that I am going to highlight something called overfitting. I think you all know overfitting, but for those who have forgotten, let's revisit it.

Let's assume you have an x-y plane, with x on this axis and y on this one, and here you have some data points. You fit a model and it learns a lot, meaning it performs extremely well on the training set. The residual error, the cost function, will be very close to zero, and the accuracy on the training set will be very high, because the curve tries to touch each and every point. The cost, the difference between your predicted and actual values summed over i = 1 to m, will be approximately zero. If the curve touches every point, it is obvious that the model does very well on the training set it was trained on; it has learned a lot.

But for the sake of example, suppose a new example comes in here and another one here. What happens? Your model fails to generalize to the test set. So you can say that this model is overfitting: it performs extremely well on the training set, and when you evaluate it on new data you see that it performs very badly. That is the sign of overfitting.

We always want to reduce overfitting. How can we do that? We have something called regularization. And how does overfitting come about? It can happen if you have a lot of features, or if your polynomial degree is very high when you are using polynomial regression. So the

So the main cause of the problem is having lots of features. Now let's see how we can make our model less prone to overfitting. We have something called regularization. Regularization helps you eliminate the features which are less important, that is, the features which are less helpful or contain less information.

That's what regularization is doing. It says: if a feature is less important, remove it, or rather make its respective parameter equal to zero. I don't mean a hyperparameter here; I mean make the respective theta zero, or close to zero. We will see how it does that. Just as an

example, let's just say that it helps you eliminate the features which are less helpful. Now let's see how it does that. Let's assume you're working on a house price prediction problem, and you have some features. x1 is the size of the house, x2 is the number of fans in the house, x3 is the number of bedrooms, and x4 is the number of ACs, meaning air conditioners. So here we have four features.

For each feature x1, x2, x3, x4 we have a parameter weight theta, which gives theta 1 times x1 plus theta 2 times x2 plus theta 3 times x3 plus theta 4 times x4:

$h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$

This is called the hypothesis function. The thetas are the weights of these features, and we only have to learn these weights by tweaking them. What does tweaking mean? I hope you have already seen linear regression. Just as an overview: say theta 0 was previously 2.1; you take the partial derivative of your cost function with respect to this theta, and then you update the theta. If you look at the linear regression part, it will all be very clear. If you are confused here, please go back to the linear regression section, because otherwise you will not be able to continue. Theta is the weight for a feature.

Let's take an example where theta 1 = 2, theta 2 = 4, theta 3 = 2 and theta 4 = 2. These thetas are what has to be learned. The user gives the inputs: say the size of the house is 24 square feet, so that's 2 times 24; plus 4 times the number of fans, which is maybe 4; plus 2 times the number of bedrooms, where the user has given 2; plus 2 times the number of ACs, which is 1. That sum is the y hat of your model.

So you have some parameter weights, which you learn by changing them and checking whether your model performs better. We take the partial derivative of our cost function with respect to theta and see whether the cost function is decreasing; if it is, we replace our old theta with the new theta. That's what we are doing here, and I hope this intuitive example conveys the idea. Now let's assume that your model is overfitting.
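The house-price arithmetic and one "tweak" of the weights can be sketched in a few lines of Python. The weights and inputs are the ones from the example; the true price of 80 and the learning rate are assumptions chosen only so that one update step is easy to follow.

```python
def predict(theta, features):
    """Hypothesis: theta_1*x_1 + theta_2*x_2 + theta_3*x_3 + theta_4*x_4."""
    return sum(t * x for t, x in zip(theta, features))


def cost(theta, rows, targets):
    """Mean squared error over the training rows."""
    m = len(rows)
    return sum((predict(theta, row) - y) ** 2 for row, y in zip(rows, targets)) / m


def gradient_step(theta, rows, targets, learning_rate):
    """One gradient descent update: theta_j <- theta_j - lr * dJ/dtheta_j."""
    m = len(rows)
    gradient = [0.0] * len(theta)
    for row, y in zip(rows, targets):
        error = predict(theta, row) - y
        for j, x_j in enumerate(row):
            gradient[j] += (2.0 / m) * error * x_j
    return [t - learning_rate * g for t, g in zip(theta, gradient)]


theta = [2.0, 4.0, 2.0, 2.0]    # weights for size, fans, bedrooms, ACs
house = [24.0, 4.0, 2.0, 1.0]   # 24 sq ft, 4 fans, 2 bedrooms, 1 AC

y_hat = predict(theta, house)
print(y_hat)  # 2*24 + 4*4 + 2*2 + 2*1 = 70.0

rows, targets = [house], [80.0]  # hypothetical true price for this one house
new_theta = gradient_step(theta, rows, targets, learning_rate=0.0005)
print(cost(theta, rows, targets), "->", cost(new_theta, rows, targets))
```

The cost goes down after the update, so we keep the new theta. That is all "tweaking" means here.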

So the model is overfitting. What do you do? In regularization, you can apply something called ridge regression. Let's see what ridge regression does. The idea is simply to add a regularization term at the end of the cost function. Also, here we have theta 0 as well: theta 0 times x0 is just the bias term, because x0 is always equal to 1.

So in ridge regression we add a regularization term. Here's what I mean:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(\theta^T x^{(i)} - y^{(i)}\right)^2 + \alpha \, \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$

The symbol in front of the second sum is alpha. We also used alpha for the learning rate, so don't assume this is the learning rate. It is just a Greek letter, and some people write it as lambda instead. I will tell you what this alpha does. The second sum runs over the squared thetas, from j = 1 up to n, the number of features. So here we have added a new regularization term.
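Here is a small sketch of that ridge cost function in plain Python, assuming each input row starts with x0 = 1 so that theta[0] is the bias. The toy rows, targets and thetas are made up for illustration.

```python
def ridge_cost(theta, rows, targets, alpha):
    """J(theta) = (1/m) * sum_i (theta.x_i - y_i)^2 + alpha * (1/2) * sum_{j>=1} theta_j^2.

    theta[0] is the bias (its feature x_0 is always 1) and is left out of the penalty.
    """
    m = len(rows)
    data_term = sum(
        (sum(t * x for t, x in zip(theta, row)) - y) ** 2
        for row, y in zip(rows, targets)
    ) / m
    penalty = 0.5 * alpha * sum(t ** 2 for t in theta[1:])
    return data_term + penalty


# Hypothetical toy data; every row starts with x_0 = 1 for the bias term.
rows = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 3.0]]
targets = [5.0, 7.0, 10.0]
theta = [1.0, 2.0, 1.0]

plain = ridge_cost(theta, rows, targets, alpha=0.0)  # just the mean squared error
ridge = ridge_cost(theta, rows, targets, alpha=1.0)  # adds 0.5 * (2^2 + 1^2) = 2.5
print(plain, ridge)
```

With alpha = 0 you get the ordinary cost back; with alpha = 1 the cost grows by the penalty on theta 1 and theta 2 only, never on the bias.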

So what does this term do? It takes the features which are less important and pushes their parameter weights closer and closer to zero. The penalty in this equation is the L2 norm of theta (not the L1 norm): in ridge regression we use the L2 norm, and it penalizes the thetas of the less important features.

Let's assume the model finds out that the number of ACs is less important. Theta 4 is the weight for the number of ACs, so ridge will push theta 4 closer and closer to zero, say 0.0001. It penalizes your theta because it found that the number of ACs is less important. When you multiply that penalized theta by the number of ACs, the result is also very low, and that makes the model less prone to overfitting. Notice that it does nothing to your input values: if you multiply 0.0001 by whatever the feature value is, the product is approximately zero. So it simply shrinks your theta value by making it closer and closer to zero. This is the L2 norm.

Now, alpha controls how harsh or how strict to be with the less important features. Let's assume we have set some alpha; alpha is something we choose, not something the model learns. Some people write it as lambda, so don't be confused. If alpha is large, the penalty is very strict and theta 4 gets pushed so close to zero that the feature is effectively eliminated. So what I'd like you to remember is: alpha controls the strictness.
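A quick way to see alpha controlling the strictness is a one-feature model with no bias, where the ridge cost (1/m) * sum (theta*x - y)^2 + (alpha/2) * theta^2 can be minimized by hand: setting the derivative to zero gives theta = b / (a + alpha), with a = (2/m) * sum x^2 and b = (2/m) * sum x*y. The sketch below uses that closed form; the data is made up, and this simplification is my assumption, not the setup used in the course.

```python
def ridge_theta(xs, ys, alpha):
    """Closed-form ridge weight for one feature and no bias term."""
    m = len(xs)
    a = (2.0 / m) * sum(x * x for x in xs)
    b = (2.0 / m) * sum(x * y for x, y in zip(xs, ys))
    return b / (a + alpha)


# Hypothetical data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
thetas = [ridge_theta(xs, ys, alpha) for alpha in alphas]
for alpha, theta in zip(alphas, thetas):
    print(alpha, theta)
```

As alpha grows, theta keeps shrinking toward zero, but with the L2 penalty it never becomes exactly zero.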

Also notice that the penalty does not include theta 0. We handle theta 0 separately, because theta 0 is your bias term and you do not want to penalize it. That's why the sum starts at 1 and not at 0: it covers every theta except theta 0. We take the derivative for theta 0 and update it explicitly, without the penalty, because we don't want to penalize our bias term. So this is ridge regression.

Next we have something called lasso regression. Lasso regression uses the L1 norm instead of the L2 norm. Otherwise it's just the same: it also penalizes the theta values. The cost function is whatever your original cost was, plus the regularization term, again summed from j = 1 so that theta 0 is not penalized:

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(\theta^T x^{(i)} - y^{(i)}\right)^2 + \alpha \sum_{j=1}^{n} |\theta_j|$

Here you are taking the absolute value of each theta, which is the L1 norm. So what does the L1 norm do? Lasso regression is very strict. Whatever feature it finds to be less important, it will make that feature's theta exactly zero. With ridge we saw theta 4 getting closer and closer to zero; lasso takes it directly to zero. So it is very harsh. Both have their specific use cases. The ridge penalty is called the L2 norm and the lasso penalty is called the L1 norm.
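The difference between "closer and closer to zero" and "directly to zero" can be checked on a one-feature model with no bias. For the lasso cost (1/m) * sum (theta*x - y)^2 + alpha * |theta| the minimizer is the soft-thresholding formula theta = sign(b) * max(|b| - alpha, 0) / a, with a = (2/m) * sum x^2 and b = (2/m) * sum x*y; for the ridge penalty (alpha/2) * theta^2 it is b / (a + alpha). The data below is a made-up weak feature (think "number of ACs"), so treat this as a sketch under those assumptions.

```python
def _a_and_b(xs, ys):
    m = len(xs)
    a = (2.0 / m) * sum(x * x for x in xs)
    b = (2.0 / m) * sum(x * y for x, y in zip(xs, ys))
    return a, b


def ridge_theta(xs, ys, alpha):
    """One-feature ridge weight: shrinks, but stays non-zero."""
    a, b = _a_and_b(xs, ys)
    return b / (a + alpha)


def lasso_theta(xs, ys, alpha):
    """One-feature lasso weight via soft-thresholding: can be exactly zero."""
    a, b = _a_and_b(xs, ys)
    shrunk = max(abs(b) - alpha, 0.0)
    return (shrunk if b >= 0 else -shrunk) / a


# Hypothetical weak feature: the target barely depends on it.
xs = [1.0, 2.0, 1.0, 3.0]
ys = [0.2, -0.1, 0.1, 0.0]

print("ridge:", ridge_theta(xs, ys, alpha=0.1))
print("lasso:", lasso_theta(xs, ys, alpha=0.1))
```

With the same alpha, ridge leaves a tiny non-zero weight while lasso sets the weight to exactly 0.0, which is why lasso is described as the harsher of the two.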

So here we have studied regularization in detail, and I hope you understood it clearly. There is one more method, elastic net, which you don't need to worry about now. It is a combination of both of them, but it is not used in industry as much as the L1 and L2 norms. Thank you for watching this video about regularization; you can head over to the next section to learn more about machine learning and continue with this course.

Okay, so now we will talk about the support vector machine in detail, so that you have one more powerful algorithm in your machine learning toolkit. The support vector machine is a supervised learning algorithm which works for both classification and regression. We have seen logistic regression, which is for classification, and linear regression, which is for regression. The support vector machine is for both: it can be used when your output is a discrete value, and it can be used when your output is a continuous value.

Before we go deep into the support vector machine, let me tell you what we are going to study in this section. We start with an introduction to the support vector machine, then go into what an SVM does, then talk about the linear hard margin. Then we go further into non-linear classification, then empirical risk minimization, then the semi-supervised, or transductive, SVM, and then SVR, which is support vector regression. In the next section we will do one project, a stock price prediction project.

So what is an SVM, and what does it actually do? Let me draw an x and y plane. So here is my x and y plane.

Let me put some linearly separable data on this plane. We have one group of data points here and another group over there. Let's assume you're working on a cat versus non-cat recognition system. Say the white points are cat images, meaning their labels are cat, and the blue points are non-cat. What the SVM does is construct a hyperplane like this. Now, whenever a new point lands on this side, the example will be classified as cat, and if it lands on the other side it will be classified as non-cat. That is not the full procedure, though. The SVM constructs a hyperplane plus two parallel hyperplanes, one on each side, and the gap between them is called the margin. It always tries to maximize that margin.

That may sound a little confusing, so let me draw it once more. Here are the white examples and here are the blue examples, cat and non-cat. The SVM constructs a hyperplane and two parallel hyperplanes like this. The gap between the two parallel hyperplanes is called the margin, and the SVM always tries to maximize this margin, keeping the nearest data points as far away from the hyperplane as possible. This is crucial, so listen carefully: the SVM constructs a hyperplane and two parallel hyperplanes that separate the data points with the maximum margin. It maximizes the margin in such a way that the nearest data point, let's say x i, is as far away from the main hyperplane as possible. Then whatever falls on this side is classified as non-cat, and whatever falls on that side is classified as cat. The data points nearest to the hyperplane are called the support vectors: they are the points that support the parallel hyperplanes that separate the classes.

Again, this may still sound a little confusing, so let's revisit it in a more detailed way with another example. Say we are predicting whether a person has cancer or not. You draw an x and y plane, and you plot the blue examples, which indicate cancer, and the white examples, which indicate non-cancer. The SVM simply constructs a hyperplane and two parallel hyperplanes like this. The gap is the margin, and the SVM always tries to maximize it, so that the nearest data points are as far as possible from the main hyperplane. These nearest points are the support vectors. We have now taken three examples, so I think this is much clearer.

It may sound like this is just linear regression: you have linear data and you are just fitting a straight line with two extra hyperplanes. It does look like that, but there is more to it.
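To connect the drawing to numbers, here is a minimal sketch that takes a hyperplane chosen by hand (it is not learned by an SVM here) and finds the nearest points, which play the role of the support vectors. The six points, the labels and the values of w and b are all assumptions made up for illustration.

```python
import math

# Hypothetical 2-D points: label +1 = cat, -1 = non-cat.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Hand-picked separating hyperplane w.x - b = 0, sitting midway between the classes.
w = (1.0, 1.0)
b = 4.5


def distance_to_hyperplane(p):
    """Perpendicular distance from point p to the hyperplane w.x - b = 0."""
    return abs(w[0] * p[0] + w[1] * p[1] - b) / math.hypot(w[0], w[1])


distances = [distance_to_hyperplane(p) for p in points]
nearest = min(distances)

# The points closest to the hyperplane are the support vectors.
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, nearest)]
margin_width = 2.0 * nearest  # gap between the two parallel hyperplanes

print("support vectors:", support_vectors)
print("margin width:", margin_width)
```

Only the three nearest points decide where the margin sits; moving any of the other points a little would not change the hyperplane at all.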

In practice the data is almost never perfectly linear. So there are two kinds of SVM classification: hard margin classification and soft margin classification. Let's look at the hard margin first and then at the soft margin. So far this looks just like linear regression, but wait a few minutes and you will understand why it is not. Although it seems that you simply construct a straight line, the equation is different and it is used for classification, so it is quite different.

What we have seen so far is called a linear SVM, where we construct a simple hyperplane that separates two groups of data points. So what is a hard margin? Hard margin means we do not allow any data point to come inside the margin, that is, no data point may violate the margin. Being that strict can end up overfitting the model. Let's see what I mean. Let me construct an x and y plane again, and draw two groups of data points, the blue ones like this and the white ones like this. As before, we construct a straight line with two parallel hyperplanes.

With a hard margin we do not allow any data point inside the margin, so if points from the two classes lie close together, the margin has to shrink to fit between them. The hard margin is very strict, and the reason I say that is that it does not allow any data point to come into the margin. With a soft margin we allow some data points to violate the margin, in order to avoid overfitting. So that's what hard margin and soft margin mean: hard margin allows no data point inside the margin, and soft margin allows a few data points inside.

The width of the margin is adjusted by C. If C is very large, the width will be very small. Let's say C equals 100: the margin will be very narrow. If C equals 1, the margin will be much wider. It may look a little awkward that a larger C gives a smaller margin, but that is the convention. You may sometimes see a lambda used for this purpose instead. Lambda controls the same trade-off between margin width and violations, but in the opposite direction: a large lambda behaves like a small C. By convention we call this one C.

So now you have an overview of the hard margin, the soft margin and the SVM. We have seen a lot of things, and now it's time to get into a little mathematics.

How do we construct that hyperplane? In linear regression we just fit a straight line given by theta transpose x, our favorite equation, starting with theta 0 times x0. In logistic regression we take the sigmoid of z. Those are our hypotheses, and from those equations we were making our straight line or our sigmoid curve. What happens in the case of the support vector machine? Our hyperplane is defined by

$w^T x - b = 0$

This is the condition that every point on the hyperplane satisfies. Here w is our parameter weight vector, which in linear regression we called theta, and b is our bias term, which in linear and logistic regression was theta 0. We have just given them new names, w and b. The transpose you know from linear algebra.

Now let's define the constraints for the hard margin, and how the hard margin makes predictions. If $w^T x - b \geq 1$, the point is on or beyond this margin and is regarded as class 1, which in this case is cat. Picture the straight line, the data points and the two parallel lines: whatever is on or above this margin is the positive class, cat. Whatever is on or below the other margin, where $w^T x - b \leq -1$, is regarded as 0, the negative class, non-cat. It's quite easy to remember: on or beyond this margin is 1, and on or beyond the other margin is 0. Those are the hard margin constraints, and that is how we make predictions with a support vector machine.
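The prediction rule and the hard margin constraint can be sketched like this. The weights w and b are hand-picked assumptions, and classifying by the sign of w.x - b for points that fall between the two margins is a common convention that I am assuming here; the transcript only covers points on or beyond the margins.

```python
def decision_value(w, b, x):
    """Signed score w.x - b; zero exactly on the main hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def predict(w, b, x):
    """Return 1 (cat) when w.x - b >= 0, otherwise 0 (non-cat)."""
    return 1 if decision_value(w, b, x) >= 0 else 0


def satisfies_hard_margin(w, b, x, label):
    """Hard margin constraint y * (w.x - b) >= 1, with y = +1 for label 1 and -1 for label 0."""
    y = 1 if label == 1 else -1
    return y * decision_value(w, b, x) >= 1


w, b = (1.0, 1.0), 5.0  # hypothetical hyperplane x1 + x2 - 5 = 0

print(predict(w, b, (3.0, 3.0)), satisfies_hard_margin(w, b, (3.0, 3.0), 1))  # on the positive margin
print(predict(w, b, (2.0, 2.0)), satisfies_hard_margin(w, b, (2.0, 2.0), 0))  # on the negative margin
print(predict(w, b, (2.5, 2.8)), satisfies_hard_margin(w, b, (2.5, 2.8), 1))  # inside the margin
```

The last point is still classified as a cat, but it sits inside the margin, so a hard margin SVM would not accept this hyperplane for a training set that contains it.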

Now let's dive a little further into the math of how the margin is constructed. Let's say you have drawn the straight line and the margin, as you can see here. The main hyperplane is given by $w^T x - b = 0$, and the width of the margin is

$\frac{2}{\lVert w \rVert}$

The SVM always tries to maximize this margin, and to maximize 2 over the norm of w we have to minimize the norm of w. So we can write the objective like this: the distance between the two parallel hyperplanes is $2 / \lVert w \rVert$, so to maximize the margin we

minimize $\lVert w \rVert$ subject to $y_i (w^T x_i - b) \geq 1$ for every training example i.

Let's understand this a little further. We minimize the norm of w in order to maximize the margin, subject to the constraint. In the constraint, y i is your ground truth, the actual label, written as +1 for the positive class and -1 for the negative class, and $w^T x_i - b$ is your model's predicted value. If a point is on or beyond the correct margin, the constraint holds and its loss is zero; if it is not, its loss is high. So you want to minimize the norm of w while still getting good predictions. I hope you have understood the concept so far.
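Here is a small sketch of why minimizing the norm of w maximizes the margin. It checks three hand-picked candidate hyperplanes against the hard margin constraint on a made-up data set and prints the margin width 2 / ||w|| for each. The points and candidates are assumptions for illustration; a real SVM solver would search over all w and b.

```python
import math

# Hypothetical training set: (point, label) with labels +1 / -1.
points = [((2.0, 2.0), -1), ((1.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 5.0), 1)]


def feasible(w, b):
    """True when every point satisfies y * (w.x - b) >= 1."""
    return all(y * (w[0] * x[0] + w[1] * x[1] - b) >= 1 for x, y in points)


def margin_width(w):
    """Distance between the two parallel hyperplanes: 2 / ||w||."""
    return 2.0 / math.hypot(w[0], w[1])


candidates = [((1.0, 1.0), 6.0), ((0.5, 0.5), 3.0), ((0.25, 0.25), 1.5)]
for w, b in candidates:
    print(w, b, "feasible:", feasible(w, b), "margin width:", round(margin_width(w), 3))
```

The first two candidates satisfy every constraint, and the one with the smaller norm has the wider margin. The third has an even smaller norm but breaks the constraints, so it is not allowed. That is the whole optimization problem in miniature.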

For the soft margin, we can write our loss function using the hinge loss, like this:

$\frac{1}{n} \sum_{i=1}^{n} \max\left(0,\; 1 - y_i (w^T x_i - b)\right) + \lambda \lVert w \rVert^2$

This may be a little confusing, so let me explain it in more detail. We go over each and every training example and take the max of 0 and $1 - y_i (w^T x_i - b)$. This part is called the hinge loss. I won't go into more detail about the hinge loss here; the Wikipedia page on hinge loss is a good place to learn more. Here y i is your ground truth and $w^T x_i - b$ is your model's predicted value. Then we add the regularization term, lambda times the norm of w squared. You will also see this trade-off written with C (which weights the hinge part instead, so a large C corresponds to a small lambda), because it helps you adjust the width of the margin.
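As a sketch, the soft margin loss above can be computed directly. The weights, the four data points and the value of lambda are made up; labels are +1 and -1.

```python
def soft_margin_loss(w, b, points, lam):
    """(1/n) * sum max(0, 1 - y_i * (w.x_i - b)) + lam * ||w||^2, labels in {+1, -1}."""
    n = len(points)
    hinge = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) - b))
        for x, y in points
    ) / n
    return hinge + lam * sum(wi * wi for wi in w)


w, b = (0.5, 0.5), 3.0
points = [
    ((2.0, 2.0), -1),  # exactly on the negative margin: hinge 0
    ((4.0, 4.0), 1),   # exactly on the positive margin: hinge 0
    ((3.5, 3.5), 1),   # correct side but inside the margin: hinge 0.5
    ((2.5, 2.5), 1),   # wrong side of the hyperplane: hinge 1.5
]

loss = soft_margin_loss(w, b, points, lam=0.1)
print(loss)
```

Points on or beyond their margin cost nothing, points inside the margin cost a little, and misclassified points cost the most. The last term adds lambda times the squared norm of w.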

That is why it is a very important hyperparameter. So we are done with the linear classifier: our linear SVM does linear classification, we have a prediction function and a good loss function, and we want to minimize the norm of w to get good predictions. So far we have seen only the linear SVM, so let me set up another example.

Let me draw an x and y plane again. (I'm drawing a lot of x and y planes, but it's best to have these pictures.) Let me make a non-linear classification problem: you have this kind of data here and this kind of data around it. Say the white examples are the cat images and the blue examples are the non-cat images. Now if you want to draw a straight line to separate them, you will not be able to, which means a linear classifier cannot classify this data. So what do you have to do? You have to do non-linear classification, with a boundary like this. The SVM is also very powerful for non-linear classification, thanks to something called the kernel trick. Let's go through the kernel trick in detail.

What do we do in the kernel trick? First, you write your algorithm in terms of the inner product of x and z, where x and z are two data points. In ordinary non-linear classification we take our data x and transform it into a higher-dimensional space, say from one-dimensional data to two-dimensional data. In the kernel trick we also move from a low-dimensional space to a higher-dimensional one, and here are the steps.

After we have written our algorithm in terms of the inner product of x and z, we map our input x to phi of x. So we find a function phi that transforms the data, whatever it is, into a high-dimensional space. Then we write the kernel function K as

$K(x, z) = \phi(x)^T \phi(z)$

and we replace the inner product of x and z in the algorithm with the kernel of the transformed data, phi of x and phi of z.

Let's look at an actual kernel function. One of the most famous kernels is the RBF kernel, which implicitly maps your data into a high-dimensional space. The RBF kernel function is given by

$K(x, z) = \exp\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)$

where x and z are the data points.
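Here is a minimal sketch of both ideas in plain Python. The first part checks the identity K(x, z) = phi(x)^T phi(z) on a degree-2 homogeneous polynomial kernel, where phi can be written out by hand; that particular phi is my choice for illustration, not something from the course. The second part is the RBF kernel formula from above.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x.z)^2, computed in the original space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map for poly2_kernel: 2-D input -> 3-D feature space."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


x, z = (1.0, 2.0), (3.0, 4.0)

# Kernel trick: same number, without ever building phi(x) and phi(z).
explicit = sum(a * c for a, c in zip(phi(x), phi(z)))
print(poly2_kernel(x, z), explicit)

print(rbf_kernel(x, x))                     # identical points give the maximum similarity
print(rbf_kernel((0.0, 0.0), (3.0, 4.0)))   # distant points give a value near zero
```

The kernel gives the inner product in the higher-dimensional space while only ever touching the original coordinates, which is what makes the trick cheap. For the RBF kernel the corresponding phi is infinite-dimensional, so computing the kernel directly is the only practical option.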

Some more kernels are the homogeneous polynomial and the inhomogeneous polynomial kernels, which you can look up on the internet, but the most famous one is the RBF kernel, which is widely used in industry. So we have seen the kernel trick and how transforming the data lets you do non-linear classification.

Next we will discuss two things: the primal problem, which helps us state our optimization objective, and then the subgradient. Why am I talking about the primal problem? It helps you formalize your objective function. For each i in 1, 2, all the way up to n, we introduce a new variable zeta i:

$\zeta_i = \max\left(0,\; 1 - y_i (w^T x_i - b)\right)$

This is just the hinge loss, stored in a new variable. Then we write our problem like this:

minimize $\frac{1}{n} \sum_{i=1}^{n} \zeta_i + \lambda \lVert w \rVert^2$ subject to $y_i (w^T x_i - b) \geq 1 - \zeta_i$ and $\zeta_i \geq 0$ for all i.

The regularization weight is the lambda from the hinge loss formula; you will also see this written with C. So we have just formulated a problem that is equivalent to the hinge loss objective we saw before, written in a form you can work with.

Now that we have an objective function, which is our cost function, we can use something called subgradient descent. What do we do there? We have a convex function f of w and b, we take the subgradient of that cost function, and then we update our parameters w and b. It is a little different from gradient descent: the hinge loss has a kink where the ordinary gradient does not exist, so we take a subgradient of the cost function instead. This formulation is called the primal problem, and we take the subgradient of it.
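Subgradient descent on this objective can be sketched in plain Python as follows. The data set, the learning rate, lambda and the number of steps are all assumptions for illustration. Because a subgradient step is not guaranteed to lower the objective every time, the sketch keeps track of the best weights seen so far, which is a common practice with this method.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def objective(w, b, points, lam):
    """(1/n) * sum max(0, 1 - y * (w.x - b)) + lam * ||w||^2."""
    n = len(points)
    hinge = sum(max(0.0, 1.0 - y * (dot(w, x) - b)) for x, y in points) / n
    return hinge + lam * dot(w, w)


def subgradient_step(w, b, points, lam, learning_rate):
    """One update of w and b using a subgradient of the objective."""
    n = len(points)
    grad_w = [2.0 * lam * wi for wi in w]  # gradient of the regularization term
    grad_b = 0.0
    for x, y in points:
        if y * (dot(w, x) - b) < 1.0:      # hinge is active for this point
            for j, xj in enumerate(x):
                grad_w[j] -= y * xj / n
            grad_b += y / n
    new_w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
    return new_w, b - learning_rate * grad_b


# Hypothetical linearly separable data with labels +1 / -1.
points = [((2.0, 2.0), -1), ((1.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 5.0), 1)]
lam, learning_rate = 0.01, 0.01

w, b = [0.0, 0.0], 0.0
initial = objective(w, b, points, lam)
best = (initial, w, b)
for _ in range(2000):
    w, b = subgradient_step(w, b, points, lam, learning_rate)
    value = objective(w, b, points, lam)
    if value < best[0]:
        best = (value, w, b)

print("objective:", initial, "->", best[0])
print("best w, b:", best[1], best[2])
```

Only the points whose hinge term is active contribute to the update, which matches the idea that the points near or inside the margin are the ones that shape the hyperplane.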

So that is the primal problem and its subgradient; pretty easy. One more thing that comes up a lot for beginners in learning theory is empirical risk minimization. What do we mean by empirical risk minimization? You are given the inputs x1, x2, all the way up to xn, and the outputs y1 up to yn, and you want to predict the output y n+1 for a new input x n+1. It's the same thing we have been doing: you give an input x and you want the output y, and the error, meaning the risk, measured on the training examples should be minimized. That's empirical risk minimization, which we have already seen in practice; we just need to know the definition.

So we have talked about linear and non-linear SVMs. Now it's time to get into support vector regression. I'm just going to give you the equations, and they are quite easy. You want to

minimize $\frac{1}{2} \lVert w \rVert^2$ subject to $\left| y_i - \langle w, x_i \rangle - b \right| \leq \varepsilon$

where $\langle w, x_i \rangle$ is the inner product of w and x i, and epsilon is a small positive number that sets how much error is tolerated. So just get comfortable with that.

that is quite clear and this this is for regression task okay and we will see the implementation in our next section so i hope that you really enjoyed this tutorial this section and we will be carrying out this more more into detail now we have gone too much into math but if you haven't understood any anyone please feel free to put a comment uh put your comment in the comment box i'll be very very happy to Take your comment and and provide your answers over there i'll i will be uh taking taking a look at

the questions and we will be answering soon at that point just put the timestamp where you're commenting okay so that i could know where you have a problem okay so i think that's how we have gone in too much detail if you haven't understood grammar problems or gradient don't worry it is not required for a Beginner machine learning so if you have if you if you learn too much in machine learning and bonus deploying now you can come back to this primal problem to understand what i'm saying but it's quite easy to understand that the

Thank you for watching this section. I'll catch up with you in the next section; till then you can do the programming assignment. In the next section we will make one project, which is a stock price predictor. Okay, thank you.

So now we will make a stock price predictor, which is our end-to-end machine learning project. Here's a demo, and I will show you where I have taken all those things from. I'm not a web developer, but I have made a good front end for this, and I also made a back end using Flask. So we will code the stock price predictor as I made it, and I will show you how you can build the same website, and you can also beautify it if you're a web developer.

Here is my Jupyter notebook. First of all I'm going to download the data from Yahoo Finance, and for that you can simply pip install yfinance. I've already installed it; you can install it too.

First of all you want to import the basic libraries: import numpy as np, then import pandas as pd, then import matplotlib.pyplot as plt. With that I'm going to import seaborn as sns, because I'm going to use seaborn as a visualization library in this case. NumPy is a scientific computing library, pandas is for working with the data, and seaborn and matplotlib are both visualization libraries. One more thing that I want to import, for my data, is import yfinance as yf, where yf is just an alias given to it. Let me run this. This is optional, but I can also add the line %matplotlib inline.

Okay, now it should work fine. We are done importing all the libraries, and now it's time to get into a little more detail about the data and how we can load it. In this project I will make a stock price predictor for natural gas, but you can pick anything, like gold or silver. Just head over to Yahoo Finance and search for whatever you want. Let's say I want to go for gold: if you look up gold, you will see GC=F, which is its code. So you just copy the code.

Here's what you can do with it. First of all I will take an input with the prompt: enter the code of the stock to download.

Now it will take an input. Next I make a variable with yf.download, which takes the stock code and downloads the data from, let's say, January 1, 2008 until January 18, 2021. Let me run this, and it asks for the code of the stock. Let me give natural gas, NG=F; this is the code. It downloads the stock data and then shows you what the data looks like. You can see here that you have Open, High, Low, Close, Adj Close and Volume.

One more thing that you can do is write auto_adjust=True. It adjusts all the prices in the data frame, so the separate Adj Close column goes away, and then you can see the result here.

So we have loaded the data; now let's take a look at its shape. data.shape tells us that we have a total of 1,256 training examples and five columns (the date is the index): one, two, three, four, five. For information about the data, data.info() shows the non-null counts and the data types. You can also take a look at the mean, standard deviation, minimum and maximum of everything with data.describe(): you can see the count, mean, standard deviation, minimum, the 25th percentile, and so on.

Now we are done with data exploration, so let's go further. One thing I want to mention here: stock price prediction is very, very non-linear, so do not use this for personal purposes; it is only for educational purposes. Because the data is so non-linear, you can't depend on your algorithm to simply predict the output; it may give you the wrong output. So, again, this is just for educational purposes, not for use in a company. You are making a simple project so that you get a sense of how you implement it and what the whole process looks like. Do not use it for yourself, and do not put up a website that people rely on.

Now we can analyze our data. To do that you can write data.Close.plot() and plot it out. Oops, let me set the figure size; I'll make it figsize=(10, 7). Here our target variable is Close. It tells you the direction of the stock: if the close is very high, the stock is heading high.

You can see it is very non-linear: it starts low in 2008, goes up in 2009, then drops drastically in 2010, and so on. So you can see the non-linearity, and you cannot depend on this algorithm or this predictor to predict your output reliably.

Now let me see what I have to do next. Let's look at the distribution plots: what is the distribution of Open, and what is the distribution of Close.

That will give us more of a feel for how to proceed with our data, and it will help us choose the algorithm. I'm going to use seaborn, pass it the data, and put the column here: Open, or let's say Close. We will see whether it is normally distributed, to get a bit more of a feel for it.

You can see it is not normally distributed. You could apply a log transformation, which is a kind of feature engineering, but that does not work well in this case, so we will leave it like this. If you take a look at Open, you will again see that it is not normally distributed. You can do the same for the other columns to get a feel for the data; say we do it for High, and High is also skewed. You can play with the data visualization and draw out inferences about the data.

Now, what have we understood about the data? Let's write the conclusion. First and foremost we have understood the shape of the data. Then we have understood how our data is distributed. Then we have understood that it is very, very non-linear, so you cannot use this for your own purposes. Maybe you can use deep learning architectures like LSTM to get the direction of your stocks, and maybe that is correct 95 percent of the time, maybe 80, maybe less. So do not use linear regression or anything like it for your own trading. In the future, let's see what research comes up with for stock price prediction; it is a competitive project to work on because it is so non-linear.

So how many algorithms have we studied so far? We have studied linear regression, then logistic regression, then some regularized linear models, then the support vector machine. Principal component analysis we have not studied yet; we will study it next, and more after that. We will try the support vector machine, but we will see that linear regression and the regularized linear models work best in this case, and we will save a regularized linear model. So let's start with linear regression, but before that, let's split our data into X and y, and then into the training and testing sets.

I like my convention, so our X variable will be all the columns except Close: X = data.drop("Close", axis=1). For y we will not drop anything, we simply keep the Close column: y = data["Close"]. Now we are done with X and y.

Next we import train_test_split from sklearn.model_selection. Let's say you have 100 percent of your data: you split it into 80 percent for training and 20 percent for testing, to validate your model. You can see here that we have X_train, X_test, y_train and y_test. The labels of X_train are in y_train, and the labels of X_test are in y_test.

You just give it X and y and the size of the test set, let's say 0.2, which means you want to use 20 percent of your whole data for testing. Then there is random_state, and you may think, hey Ayush, what does random_state do? train_test_split also shuffles the data, and fixing random_state means that if you run it one more time, the shuffle, and so the split, will not change.

Let's take a look at the shapes so that we get more of a feel for what we are working with. I'll print the shapes of X_train, X_test, y_train and y_test. When you run this, you can see that X_train has 2,604 rows and four columns, and the fifth column, Close, is held in the labels y. We have taken Close as our target variable, which tells us the direction of our stock. Okay, so now we are done with splitting.
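The split described above can be sketched like this. To keep the snippet self-contained it builds a small synthetic price table instead of downloading data; the synthetic values, the row count and random_state=42 are assumptions for illustration, while the column names follow the ones in the section.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the downloaded data frame (assumption).
rng = np.random.default_rng(0)
n = 500
close = 3.0 + np.cumsum(rng.normal(0, 0.05, n))
data = pd.DataFrame({
    "Open": close + rng.normal(0, 0.02, n),
    "High": close + np.abs(rng.normal(0, 0.05, n)),
    "Low": close - np.abs(rng.normal(0, 0.05, n)),
    "Close": close,
    "Volume": rng.integers(1_000, 10_000, n),
})

X = data.drop("Close", axis=1)  # every column except the target
y = data["Close"]               # the target variable

# 80 percent for training, 20 percent for testing; a fixed random_state
# makes the shuffle reproducible from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```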

Now it's time to go a little deeper into the modeling part. First of all I'm going to import LinearRegression. Everyone thinks linear regression is very bad, but let me tell you, it is a very powerful algorithm when the relationship is linear. Here we don't have linearity, but it can still work well, for example when you add polynomial features. For now let's use plain linear regression. You can see that this is a regression problem, a supervised learning problem, so we cannot use logistic regression; it would be very bad for us.

So I instantiate the model as lr, call lr.fit, and give it X_train and the labels y_train. Then pred1 = lr.predict(X_test) means I use lr, the trained model, to predict on X_test. Now if I run this and look at the predicted output against y_test, you can see that for the first test example the predicted value is 2.90 and the actual value is 2.918.

That is very good. Linear regression works out well here, though it may be overfitting your data, so let's calculate some metrics. We are going to use MSE, the mean squared error, which we talked about as the cost function for linear regression. Then we are going to use RMSE, which is simply the square root of the mean squared error. And then we will calculate the R² score. We'll talk about the math of these metrics later on, but the best possible R² is 1.0: if your model gives an R² of 1.0, you have a very good model.

So we'll see how much our R² is. I will write one helper function, calculate_metrics. It simply takes the actual values, which are the ground truth, and the predicted values of your model. First of all I'm going to calculate the MSE, and for that I have to import it. You can also write your own function, but it's better to use the vectorized implementation from scikit-learn because it is already provided to you. It is easy enough to define MSE yourself: take the residual at every data point, square it, and average over all the points. I think you have already implemented that in the programming assignments.

So from sklearn.metrics I'm going to import mean_squared_error and r2_score. We do have mean_absolute_error as well, which you can also use, but for now I'm going to use these. By the way, I'm not using a ready-made RMSE function; we code RMSE ourselves.

First, mse = mean_squared_error(y_test, y_pred). Then for rmse, you may think, how can you calculate the root mean squared error? Just write np.sqrt(mse); it is just the square root of the MSE. That's easy. Then r2_scores = r2_score(y_test, y_pred). I'm spelling the variable r2_scores differently on purpose, because reusing the function's name could cause an error. Here y_test is your ground truth and y_pred is your model's predicted value.

Then I just print them out: MSE, then RMSE, then the R² score. Now we are done with the helper function.

To calculate the metrics for linear regression, we give it y_test, which is the ground truth that I made while splitting the data, and pred1, the prediction that we made with linear regression. You can see the MSE is approximately equal to zero, which is quite good, because your cost function should be as close to zero as possible. The root mean squared error is also close to zero, and the R² is 0.999, approximately 1.0. So linear regression performs quite well. Now let's go a little further into some regularized linear models like Ridge and Lasso.
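Putting the linear regression step and the helper function together, a minimal sketch could look like this. The names calculate_metrics, lr, pred1 and r2_scores follow the walkthrough; the synthetic data is an assumption so that the snippet runs on its own, and the numbers it prints will not match the ones from the stock data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def calculate_metrics(y_test, y_pred):
    """Print and return MSE, RMSE and R2 for ground truth vs. predictions."""
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)                    # RMSE is the square root of MSE
    r2_scores = r2_score(y_test, y_pred)   # named differently from the function
    print("MSE:", mse)
    print("RMSE:", rmse)
    print("R2 score:", r2_scores)
    return mse, rmse, r2_scores

# Synthetic, nearly linear data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, 0.3, 0.2, 0.0]) + rng.normal(0, 0.01, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)
pred1 = lr.predict(X_test)

mse, rmse, r2 = calculate_metrics(y_test, pred1)
```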

From sklearn.linear_model I'm going to import Lasso and Ridge. You may already know about Lasso and Ridge: Lasso simply eliminates the less important features, and Ridge just penalizes the less important features, shrinking their weights.

So let's make the two of them. Say la is the Lasso: la = Lasso().fit(X_train, y_train), which is a short form for instantiating and fitting in one line. Then I do the same for Ridge: ri = Ridge().fit(X_train, y_train). Then lap is the Lasso predicted value, lap = la.predict(X_test), and likewise rip = ri.predict(X_test). Let's run this.

First let's look at how Lasso performs: we give calculate_metrics the ground truth, y_test, and then lap. It is not so good, because Lasso is very strict and simply eliminates features, although with Lasso your model is less prone to overfitting. If I take a look at Ridge, though, it is quite good. Ridge is also regularization; it is quite similar to linear regression but less prone to overfitting. So we are going to use Ridge regression to save our model and build a website on it.
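Here is a small sketch of the Lasso versus Ridge comparison. It uses synthetic data in which only the first two features matter (an assumption, chosen so the effect is easy to see) and the default settings of both models; the variable names la, ri, lap and rip follow the walkthrough.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: features 0 and 1 drive the target, 2 and 3 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.1, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

la = Lasso().fit(X_train, y_train)  # L1 penalty: can set weights exactly to zero
ri = Ridge().fit(X_train, y_train)  # L2 penalty: shrinks weights, keeps them non-zero

lap = la.predict(X_test)
rip = ri.predict(X_test)

print("Lasso coefficients:", la.coef_)
print("Ridge coefficients:", ri.coef_)
print("Lasso R2:", r2_score(y_test, lap))
print("Ridge R2:", r2_score(y_test, rip))
```

On data like this, Lasso drops the unimportant features entirely, while Ridge keeps every feature with a slightly smaller weight, which is the behaviour described above.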

One more thing that I want to mention is that we can use a support vector machine, I mean support vector regression, for this task, so let's see how to set that up. I think the support vector machine will not work well here, but it should work well if you have a lot of features, like index volatility and other features that carry different importance. We will not end up using it, but here you will learn how to fine-tune a model using GridSearchCV.

So from sklearn.model_selection I import GridSearchCV, and I instantiate SVR. Then I make a params dictionary. In it, C is the regularization parameter, like a lambda, which controls the width of your margin. I'm going to copy the values from what I have already written, to keep the video short. I give C and gamma a list of values each, to check how the model performs with the different values, and the kernel will obviously be the RBF kernel, because the data is very non-linear.

Then I call GridSearchCV: first the estimator, SVR(), then the parameter grid, then refit=True and verbose=3. verbose controls the messages that are printed while it runs; you can see the documentation. Now we fit the grid on our training data, X_train and y_train, to check the different values.

Let's run this. It will take a little time: it checks each and every value, 0.1, 0.2 and so on, computes the score, and gives back whatever works best. In the meantime, the parameters we are going to get are C=10, gamma=0.1 and kernel='rbf'. If you compute the metrics with these, I think you will see that it performs very badly. That is because we don't have a lot of features here and SVR is not able to learn, whereas the regularized linear models are more powerful in this case. Do you see what I'm trying to say?

Let me check whether it is still running. It is just checking each and every combination, and each one takes about 1.8 or 1.9 seconds, so it takes a little time. I'm going to wait a few seconds, and you can see here that it is trying each and every value: gamma, then C, and so on.
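The grid search over SVR can be sketched as below. The grid values and the toy one-feature data are assumptions for illustration, since the exact lists used in the video are not spelled out; refit=True and verbose=3 are the settings mentioned in the walkthrough.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Toy non-linear regression data (assumption).
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(120, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 120)

# Hypothetical grid: every combination of C and gamma is tried with the RBF kernel.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
    "kernel": ["rbf"],
}

# refit=True retrains the best combination on all the training data;
# verbose=3 prints a progress message for every fit.
grid = GridSearchCV(SVR(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated score:", grid.best_score_)
```

After fitting, grid.predict uses the refitted best estimator, so it can be passed to the same metrics helper as the other models.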

While we wait, let's code further. We import joblib to save our model, because we are going to use the regularized linear model. joblib.dump saves the model to model.pkl, and if you want to load your model again you can write model = joblib.load("model.pkl"). That's what we are going to do here.

Now the grid search is done, and if you look at the support vector regression results, they are not particularly good. So we are done with this. In the folder we now have the .pkl file, and we can use this model to make predictions. Let me go to my ML01 projects folder, then the stock price predictor folder, and open it with VS Code.
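Saving and reloading the model with joblib can be sketched like this. The Ridge model is trained on made-up data, and the file is written to a temporary directory rather than the project folder; both are assumptions to keep the snippet self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in model trained on synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X.sum(axis=1)
ri = Ridge().fit(X, y)

# Save the fitted model to disk, then load it back.
model_path = os.path.join(tempfile.gettempdir(), "model.pkl")
joblib.dump(ri, model_path)
model = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
same = np.allclose(model.predict(X), ri.predict(X))
print("Predictions match after reload:", same)
```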

I have already made some files, which I will just copy and paste over to save time, like prediction.html, because it is just simple HTML blocks. Let me cut that for now, because I don't think we need it yet.

Now I can make app.py. I think you are all able to see it. From flask I'm going to import Flask and render_template. If you don't know Flask, don't worry, it is quite easy to understand and I will walk through each and every step. I don't know why the editor is not responding; my VS Code bugs out a lot. Let me open it with VS Code again.

So far we have just imported Flask and render_template. Now I instantiate my app: app = Flask(__name__). Then I make a route with @app.route("/"); this is just a simple home page to validate that our server is working, and it returns render_template("index.html"). You may think, hey Ayush, you haven't made index.html, so let's make it.

Flask looks for HTML files in the templates folder, so make one templates folder here. Then make one more folder for your images, which is the static folder; all your images will go there. Inside templates, make a new file, index.html. After making index.html you can type your markup into it. I will just copy mine, and I will tell you where I took it all from, but before that let's check whether it works, with something like a hello world. Save this.

Now it's time to set up the server: if __name__ == "__main__": app.run(debug=True). If I save this and run it, it takes a little time and then starts, and this is my URL. If I open it, you can see that we have a beautiful website. But the reason it is already showing the full site is that another server is running from the ML01 projects folder, which I have to stop to get this one running. I will debug it later; I think I have to reinstall my Visual Studio Code.

But I will show you where I took all of this from. The first thing to keep in mind is that it comes from Tailblocks. I will annotate the code in my ML01 projects folder with where each piece came from. First of all I imported the CDN links of Bootstrap and Tailwind CSS, and then I went to tailblocks.cc and grabbed my header, then the navigation menu, and then this predictor section. You can see all the code on my GitHub; the link will be in the description.

Now let me show you what the prediction route does. First of all, I have written a form in the page. I use request to get the form values, Open, High and so on, which are the input variables. Then I preprocess them by putting them into a 2D array to make a prediction. Then I load the model, predict on that input, and return the prediction by rendering the template prediction.html with prediction=prediction, which is passed through to the page.

If I go to prediction.html, you can see that I have made one layout, and I use Flask template inheritance to inherit from layout.html; the layout is just a similar chunk of code. The prediction shown on the page is the prediction that comes from the route. Pretty easy.

Let's see if it is still working. It is, so let me put in some values, and it tells us the closing price, which indicates the direction of the stock. The closing price is 4.86588054. That is pretty much everything we need to understand about making a Flask website and an end-to-end machine learning project.
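The flow of the prediction route can be sketched as below. This is not the code from the GitHub repository: the form field names, the /predict URL and the inline page are hypothetical, a small Ridge model trained on made-up numbers stands in for the saved model.pkl, and render_template_string replaces render_template so that the snippet does not need a templates folder.

```python
import numpy as np
from flask import Flask, render_template_string, request
from sklearn.linear_model import Ridge

app = Flask(__name__)

# Stand-in for joblib.load("model.pkl"): a tiny model on synthetic data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(2.0, 5.0, size=(100, 4))
model = Ridge().fit(X_demo, X_demo.mean(axis=1))

FIELDS = ["open", "high", "low", "volume"]  # hypothetical form field names

# Inline stand-in for prediction.html.
PAGE = "<p>Predicted closing price: {{ prediction }}</p>"

@app.route("/")
def home():
    # Simple home page to check that the server is working.
    return "Hello World"

@app.route("/predict", methods=["POST"])
def predict():
    # Read the input variables from the submitted form.
    values = [float(request.form[name]) for name in FIELDS]
    # The model expects a 2D array: one row, one column per feature.
    features = np.array([values])
    prediction = float(model.predict(features)[0])
    return render_template_string(PAGE, prediction=round(prediction, 4))

# To start the development server, call app.run(debug=True) from a
# `if __name__ == "__main__":` block in app.py.
```

In the real project the page would come from a file in the templates folder and the model from the saved .pkl file, as described above.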

I really hope that you enjoyed this section and this project. In the next section we will talk about principal component analysis, and then we'll do another project. Till now we have done two projects, and this one is an end-to-end machine learning project that you can put on your resume. You can also modify it, so let me tell you how.

You can see that we have Open, High, Low, Close and Volume. There is a research paper on stock price prediction using machine learning that lists the features they used, like index volatility, the mean, the price three days earlier and the price nine days earlier. You can build those kinds of features and integrate them here to make a more powerful, more complex model. But do not make it too complex.
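As a rough idea of what such features could look like in pandas, here is a small sketch. The function name add_lag_features and the column names are hypothetical, and the 3-day rolling mean is just one possible reading of the "mean" feature; the 3-day and 9-day lags are the ones mentioned above.

```python
import numpy as np
import pandas as pd

def add_lag_features(data, lags=(3, 9)):
    """Add previous-day prices and a trailing mean as extra columns."""
    out = data.copy()
    for lag in lags:
        # Price `lag` trading days earlier.
        out[f"Close_lag_{lag}"] = out["Close"].shift(lag)
    # Mean of the previous three closes (shifted so today's close is not used).
    out["Close_mean_3"] = out["Close"].shift(1).rolling(3).mean()
    # The first rows have no history, so drop them.
    return out.dropna()

# Tiny made-up series: closes 1.0, 2.0, ..., 20.0.
data = pd.DataFrame({"Close": np.arange(1.0, 21.0)})
features = add_lag_features(data)
print(features.head())
```

Shifting before taking the rolling mean keeps the current day's close out of its own feature, so the model only sees past information.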

to understand and how we understood that this is good for linear regression the reason why we have understood because our data is not multiple linear what do i mean by multiple linear because many of the interviewer asks that how you will perform when your data is multi-co-linear with linear regression Algorithm your answer will be no we cannot perform when a linear regression when your data is multi-core linear it's only because it's only because your because the variables linear regression does not work well when your variables are highly intercourse correlated okay when you're let's say on

it will be all same means correlated very highly into intercollect correlated so if they ask you what's the way to remove this so you can say okay We can use principal component analysis which will study in the next section so you can say hey we can use principal component analysis to remove the uh multi-collinearity from our data and then it's boom you are done with your interview it means kind of just one question as an astronaut interview as i was in one of my interview okay so that's the basic thing that we need to understand
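To make the multicollinearity point concrete, here is a minimal sketch of how one might check for highly correlated features before fitting a linear regression. The data is synthetic and the 0.9 threshold is an arbitrary assumption chosen for illustration; neither comes from the course project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features (made up): x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between columns (rowvar=False: columns are the variables).
corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.9
high_pairs = [
    (i, j)
    for i in range(corr.shape[0])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > threshold
]
print(high_pairs)  # the near-duplicate pair (0, 1) is flagged
```

If a check like this flags many pairs, that is a hint that the features carry overlapping information, which is the situation PCA is meant to help with.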

And I really hope you enjoyed this section. In the next section we head over to PCA. We have now done two projects; it's good to see all of you doing projects, and I hope you are enjoying them. You can leave your questions in the comment box and I will definitely pick them up. Thank you for watching this section; I'll catch you in the next one.

Okay, so now we will talk about principal component analysis, which is a dimensionality reduction algorithm. You will get to know what dimensionality reduction is, but before that we need a toolkit of some concepts from linear algebra: linear combinations, linear transformations, and eigenvectors and eigenvalues. We will review linear transformations, eigenvectors, and eigenvalues so that we are on the same page. If you want more detail, there is a YouTube channel named 3Blue1Brown with a very good playlist on linear algebra. You can watch that series of videos if you want to dive deep, but you don't need it for this section.

So what is a linear transformation? A linear transformation is a function, just as we have seen a function f(x) that transforms x into, say, the square of x. Specifically, it is a function from one vector space to another that respects the underlying linear structure of each vector space. If you have seen the 3Blue1Brown videos, he clearly says that a linear transformation is a function that maps one vector space to another while keeping the linear structure: lines that were parallel stay parallel. If anything nonlinear creeps in, it is no longer a linear transformation. A linear transformation can be written as T: R^n → R^m, meaning we transform a vector from one space into another.

Here is an example of a 1D linear transformation. Let T(x) = a·x, where a is some scalar. This one-dimensional linear transformation maps the interval from 0 to 1 onto the interval from 0 to 3: we take the vector from 0 to 1 and scale it by a factor of 3. That's a linear transformation; head over to the 3Blue1Brown channel for more detail.

Now eigenvectors and eigenvalues. I am not going to dive deep into the math, just what they mean. Say you have some vector v and you apply a transformation. If the new, transformed vector is just a scaled version of the original vector, then the original vector is an eigenvector of the original matrix. The factor by which it is stretched, shown in green here, is the eigenvalue. Vectors with this characteristic are known as eigenvectors, and the factor by which they are scaled or stretched is the eigenvalue, denoted by the symbol lambda. That's the simple idea behind eigenvectors and eigenvalues.

Again, let's revise what we have studied until now. A linear transformation is a function that takes you from one vector space to another while respecting the underlying linear structure of each vector space. In one dimension we can write a linear transformation as T(x) = a·x, where a is some scalar.
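The two ideas reviewed above can be checked numerically. This is only a sketch: the scale factor 3 matches the example in the lecture, but the 2 by 2 matrix `A` is a value picked here for illustration, not the one on the slide.

```python
import numpy as np

# 1D linear transformation T(x) = a * x with a = 3.
def T(x, a=3.0):
    return a * x

xs = np.linspace(0.0, 1.0, 5)   # points on the interval [0, 1]
stretched = T(xs)               # the same points scaled onto [0, 3]
print(stretched)

# Linearity check: T(u + v) == T(u) + T(v) and T(c * u) == c * T(u).
u, v, c = 0.2, 0.5, 4.0
assert np.isclose(T(u + v), T(u) + T(v))
assert np.isclose(T(c * u), c * T(u))

# Eigenvectors and eigenvalues of a small symmetric matrix (illustrative values).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column vec of `eigenvectors` satisfies A @ vec = lambda * vec,
# i.e. the transformed vector is just a scaled copy of the original.
for lam, vec in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ vec, lam * vec)
print(eigenvalues)
```

Note that `np.linalg.eig` returns the eigenvectors as the columns of its second output, which is why the loop iterates over the transpose.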

What we are doing in this example is taking the vector on the interval from 0 to 1 and transforming it to the interval from 0 to 3 by scaling it by a factor of 3. That's the linear transformation. And for eigenvectors and eigenvalues: when a linear transformation is applied and the new, transformed vector is just a scaled version of the original vector, as shown here, then the original vector is an eigenvector of the original matrix; otherwise it is just a vector. The factor by which it is scaled here is 3, so 3 is your eigenvalue. That's all eigenvectors and eigenvalues are. Pretty easy.

If you want more detail, there are the linear algebra videos by Gilbert Strang (I think I am pronouncing that correctly), or his book Introduction to Linear Algebra, or for a quick look the 3Blue1Brown videos on linear algebra on YouTube. Those are the resources to learn more, but if you know this much you are ready to learn principal component analysis.

You may think, I don't want to do so much math. For that there is a library known as NumPy; I have put a notebook on NumPy and pandas in the description below that you can look at. We can implement this with LA, which is an alias for NumPy's linear algebra module: we call eig on the input, and the input here is our matrix. It gives back the eigenvalues and eigenvectors of this matrix. This matrix is just 2, -1, 4; it's a 2 by 2 matrix.

So we have covered the linear algebra required for principal component analysis. Now it's time to understand why we are studying all this and why we need dimensionality reduction. Say you are working on a large dataset with 3,000 dimensions. What do I mean by dimensions? The size of the house is one dimension, the floor of the house is a second dimension, the number of fans is a third dimension, and so on. Likewise, we have three thousand features. First, that costs storage. Second, it costs time: it may even take months to train your model, and you may not have that much compute. And in the real world there is data with millions of dimensions. So we use a dimensionality reduction technique such as PCA to reduce the number of features or variables in the data.

We typically see this in text data or image data. Say you have an image; let me draw one of myself. A single image can be a data point with millions of dimensions. I have worked on data with 75 dimensions. You will typically see this in text data, when working on natural language processing with things like word embeddings, or in image data. That is where you need principal component analysis.

So what is principal component analysis? It is a method for dimensionality reduction that reduces the number of variables, or dimensions, of the data by transforming a large set of features into a smaller one that still contains most of the information. Say you have x1, x2, x3, and so on, all the way to x40 or x50. So what

principal component analysis does is ask you for a number of components. Say you ask for two components. It will try to put most of the information into the first component, call it p1, and most of what remains into the second component, because we have asked it to reduce from 40 dimensions to two. So it compresses as much of the data's variation as it can into the first and second components. If you ask for one component, it will try to put most of the information into that one and drop the rest. Obviously you lose some information from the data, but it is good not to carry more dimensions than you need. I will give you a tip for when you work with principal component analysis.

So what is the basic intuition? The basic intuition behind principal component analysis is that the principal components are new variables, like p1 and p2 here, constructed as linear combinations of the initial variables. Say we have x1, x2, all the way to x40. The number of components is a hyperparameter; say we have chosen two. PCA will try to put most of the information into p1 and p2. These new variables are unit vectors, and each is just a linear combination of the initial variables. You will see what I mean in the visualization. These combinations are done in such a way that the principal components are uncorrelated, and most of the information in the initial variables is compressed into the first component, then the second, and so on. We then project each data point onto only the first few components to obtain lower-dimensional data while preserving as much of the data's variation as possible.

What does projecting mean? You may have seen the concept of orthogonality in linear algebra. Here is a visualization. We have our data, and we take two components: a first dimension like this, and a second dimension like this. These are the principal components, constructed as linear combinations of the initial variables. We project each and every data point onto these principal components, and now our data points lie along them. This visualization shows it: we are projecting each data point onto the two components, the first component here and the second component here. This is the basic intuition behind principal components.

Basically, say we have this axis and this one, and a data point like this. You project the data point onto here and onto here, to reduce the dimension of the data while preserving as much information as you can in the first component. Let's look at the algorithm to understand in more detail what we are doing. But first, let's review the concept from the previous slides. Principal components are unit vectors: the new variables that come out as p1, p2, up to however many components you asked for, constructed from linear combinations of the initial variables. These combinations are done in such a way that the principal components are uncorrelated (you will see how in the algorithm), and most of the information within the initial variables is compressed into the first component, then the second, and so on for the number of components you have given.

We have seen the basic intuition; now it's time to get into the algorithm. The first and foremost step of principal component analysis is data pre-processing. What do I mean by data pre-processing? We have to standardize our data, which simply means that all of the variables should fall in the same range. Say you are working on a system with one variable, age. The first person is 20, the second is 40, and so on. Now say a new person's age is 1,200. That is far away from the others, so it is called an outlier: it is not in the common range of ages. Or say your data is 20, 40, 23, 54, and then 4,000; that last value is an outlier, and principal component analysis is sensitive to outliers. That's why we standardize the data (sometimes we normalize instead) so that our data falls in the same range. The reason it is critical is that PCA is quite sensitive to the variances of the initial variables.

The formula is x_scaled = (x - mean(x)) / std(x), where std is the standard deviation, and that gives you the scaled version of your data. Don't worry about how to implement this; you can do it in a few lines of Python. Say you write a function standardize that takes x and computes a new variable x_scaled as x minus np.mean(x), divided by np.std(x). You can do this yourself, but a more vectorized implementation is available in scikit-learn, where it takes just three lines of code. So don't worry if you are not able to implement it from scratch.

After you standardize your data, the next step is computing the covariance matrix of your data. What do I mean by a covariance matrix? It is a p by p symmetric matrix, where p is the number of features; the diagonal holds the variance of each feature. You can look up the full definition of a covariance matrix online. What it tells us is how the features of the input dataset vary from the mean with respect to each other. It is like a correlation, though not exactly a correlation. We have an input dataset X, and the matrix tells us how x1 varies with x2, and so on. This is important because some variables are highly correlated and so contain redundant information, which you can remove. That's how we compute the covariance matrix. Pretty easy. We denote the covariance matrix by C, and to implement it you can again use NumPy: call np.cov, give it your features, say X, and you get the covariance matrix as output.

Let's recap what we've seen so far. First we pre-process the data, with standardization or normalization. Then we compute the covariance matrix of the data, which tells you how the input variables vary from the mean with respect to each other. That matters because some variables are highly correlated and contain redundant information. The next step is computing the eigenvectors and eigenvalues of the covariance matrix.
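The first two steps, standardizing and computing the covariance matrix, can be sketched as follows. The function name `standardize` follows the lecture; the toy data (an age column and a made-up income column) is an assumption for illustration only.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Toy data (made up): rows are samples, columns are features (age, income).
X = np.array([[20.0, 30000.0],
              [40.0, 52000.0],
              [23.0, 31000.0],
              [54.0, 80000.0]])

X_scaled = standardize(X)

# rowvar=False tells np.cov that columns, not rows, are the variables.
# The result is a p x p symmetric matrix (here p = 2).
C = np.cov(X_scaled, rowvar=False)
print(C.shape)  # (2, 2)

# np.std divides by n while np.cov divides by n - 1 by default, so the
# diagonal of C is n / (n - 1) rather than exactly 1 for standardized data.
print(np.diag(C))
```

One thing to watch: by default `np.cov` treats each row as a variable, so passing a samples-by-features array without `rowvar=False` (or without transposing) gives a matrix of the wrong shape.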

To review what we have seen about eigenvectors and eigenvalues: if the new, transformed vector is just a scaled version of the original vector, the original vector is called an eigenvector, and the factor by which it is stretched is known as the eigenvalue. The reason I made these slides is that writing on a blackboard would not be as useful for you. I have made notes that you can look at in the future, and it's better to have text on the screen while you are listening.

So you compute the eigenvectors and eigenvalues of the covariance matrix. You can easily compute them in Python. Then you sort the columns of the eigenvector matrix V and the eigenvalue matrix D in order of decreasing eigenvalue. What do I mean by this step four? Once you have computed the eigenvectors and eigenvalues, the ones with the larger eigenvalues carry more information, so you sort them: largest first, second largest second, then third, fourth, and so on. After that you compute the cumulative energy content, that is, how much of the total each eigenvector accounts for. Then you select a subset of the eigenvectors as basis vectors. In scikit-learn, or any other library, you choose how many components to use; that's the main thing. So you sort in decreasing order, compute the energy content of each eigenvector, and select from the top the ones with the highest energy. If you choose two principal components, only the two eigenvectors carrying the most information are kept.

Then you take those two and project the data onto the new basis. You start with a large number of dimensions, drop the rest, and project the data onto the new vectors. This is the formula: this term is the adjusted (standardized) data and this is the feature vector, transposed.

Here is a recap of the algorithm's steps. I have not written the last step, projecting the data, because it did not fit on the slide. First you pre-process your data. Then you compute the covariance matrix. Then you compute the eigenvectors and eigenvalues of the covariance matrix. Then you rearrange the eigenvectors and eigenvalues. Then you compute the cumulative energy content of each eigenvector, although that is not strictly necessary, because once you sort in decreasing order the top ones carry the most information. And then you select a subset of the eigenvectors as basis vectors.
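The steps above can be put together in a short sketch. This is one possible reading of the algorithm, not the course's own implementation: the function name `pca` and the synthetic data are assumptions, and `np.linalg.eigh` is used in place of `eig` because the covariance matrix is symmetric (that swap is a choice made here).

```python
import numpy as np

def pca(X, n_components):
    # Step 1: standardize each feature (column).
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features (p x p).
    C = np.cov(X_scaled, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of the symmetric matrix C.
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # Step 4: sort eigenpairs by decreasing eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Step 5: cumulative energy content (fraction of total variance so far).
    energy = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # Step 6: keep the top n_components eigenvectors as the new basis.
    W = eigenvectors[:, :n_components]
    # Step 7: project the standardized data onto the new basis.
    return X_scaled @ W, energy

# Synthetic data (made up): the second feature is nearly a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

Z, energy = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
print(energy)   # cumulative fraction of variance, ending at 1.0
```

Because the new basis vectors are eigenvectors of the covariance matrix, the columns of `Z` come out uncorrelated, which is the property described in the intuition section.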

I think that covers what we have seen so far, and I really hope you have enjoyed this section. Earlier you may have found this quite intimidating; now I think you are able to grasp the concepts of linear regression, logistic regression, support vector machines, and principal component analysis. We have done projects: you have coded Boston health risk prediction, stock price prediction, and one classification project, and you have coded logistic and linear regression from scratch. In the next section we will cover principal component analysis from scratch; we will code it ourselves to get more of a feel for it. I really hope you have enjoyed this, and I'll catch you in the next section.

Okay, so now we start with learning theory, one of the most important concepts to learn in machine learning. Let's see the topics we will cover in this section and why it is important. It might look like a bit of nonsense right now, but if you go on to more advanced areas of machine learning such as deep learning, NLP, or computer vision, it will surely make sense that learning theory actually matters. In this section we will learn about three topics. Our main focus is the bias and variance tradeoff; then we will move on to approximation and estimation error; then to empirical risk minimization. The last two are just definitions, just a way of framing the problem. They are needed because many research papers refer to empirical risk minimization or to approximation and estimation error, perhaps for historical reasons. We will just see the definitions, and it will surely make sense. If it does not, feel free to ignore it for now, continue with the course, and come back later to watch these concepts again, because then it will make sense.

So let's start with bias and variance. Here we will study the bias and variance tradeoff, and here is your warning: it is one of the easiest concepts to learn, but it is very hard to master. You may have heard Andrew Ng say this, and I think it is true. If you are a beginner it may not seem so, because you will understand everything, but when you are actually developing a product it is very important to keep track of the bias and variance tradeoff. Before we start, let's recall two problems: overfitting and underfitting. So here is my favorite

diagram, from Google. Assume we have time on the x-axis and values on the y-axis, for some kind of problem. You have these data points, and you fit a straight line to them. That's simple linear regression, with the hypothesis theta0 times x0 plus theta1 times x1 (plus theta2 times x2, and so on, with more features). So assume we have fitted a straight line using linear regression. You can see that it does not perform well even on the training set. The residual error, the cost function J(theta), will be very high in this case, so we call this underfitting.

The main causes of this problem are a low number of features or a low amount of data. It can be solved by adding more data or more features, and if you don't have more features you can do feature engineering to generate them. We won't focus on feature engineering for now, because this is not a data science course, but simple feature engineering just means generating more features from the ones we have. For example, say you are building a spam detection system where one column is the text and another column is the label saying whether that text is spam or ham. You can generate more features, such as the length of the text, the number of words in the text, and the number of characters. From this one text feature you generate three more features, and that's called feature engineering. If you'd like, I'll be very happy to make a video on feature engineering; it would be part of a full-fledged data science course.

Coming back: we have our simple linear regression, like this, and this is called underfitting. The major cause I have seen in my experience is having too few features. In general, underfitting means that your model does not perform well on the training set, and so obviously it will not perform well on the testing set either. The next picture is the good fit, or robust, model. You can see it is a very

good fit: the residual error is low, and we have a nonlinear hypothesis. Here we have applied polynomial regression, and this is the hypothesis we get. You can see that it is a very good fit, and this is a robust model. We can say this model performs well on the training set as well as on the testing set, because whatever new example comes in, the residual error will be low.

The other picture shows overfitting. What do I mean by overfitting? Overfitting simply means that your model performs very, very well on the training set: your cost function J(theta) is equal to zero, or approximately zero, so there is hardly any residual error. You can assume your model has learned too much; the curve touches each and every example, and that's why the cost function is so low. The cost function is the summation, over i from 1 to m, of (h(x_i) - y_i) squared. It just takes the difference between the predicted and actual values, and here the predicted and actual values lie on the same line.
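As a small illustration of the cost function just described, here is a sketch that computes the sum of squared errors. The function name `cost` and the numbers are made up for this example.

```python
import numpy as np

def cost(predictions, targets):
    """Sum over i of (h(x_i) - y_i)^2."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sum((predictions - targets) ** 2))

y = [1.0, 2.0, 3.0]

# A model whose predictions pass through every training point has zero cost.
perfect = cost([1.0, 2.0, 3.0], y)
print(perfect)  # 0.0

# A model that misses each point by 0.5 has cost 3 * 0.25 = 0.75.
off_by_half = cost([1.5, 2.5, 3.5], y)
print(off_by_half)  # 0.75
```

A zero cost on the training set, as in the first call, is exactly the situation the overfitting picture shows; it says nothing yet about the error on new examples.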

Here your model has learned a very nonlinear hypothesis. This happens when you have a lot of features (whereas in the underfitting case you had too few). Maybe you used too high a degree in polynomial regression, or a very complex architecture, or you built a very complex function f(x) with terms up to x to the fourth power and beyond. That is the problem with overfitting: whatever new example comes in, the residual error will be very high, so your model fails to generalize to new examples. In general, overfitting means that your model performs very well on the training set, where the cost function is low or even zero. You may think that is good, but wait: your testing set error is very high, and that indicates overfitting.

We can prevent overfitting by selecting only the important features, or by regularization, which is like an advanced version of feature selection. I have already told you about regularization in the section on regularized linear models, and I hope you understood it. Now we will look at the bias and variance tradeoff by looking at some scenarios for your model.
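To see underfitting, a good fit, and overfitting side by side, here is a rough sketch that fits polynomials of increasing degree to the same noisy data and compares training and testing error. Everything here is an assumption for illustration: the noisy sine data, the degrees 1, 3, and 12, and the use of NumPy's `Polynomial.fit` rather than any particular regression library.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data (made up): a noisy sine curve.
x_train = np.linspace(0.0, 1.0, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=x_test.size)

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

errors = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit of the given degree on the training set.
    model = Polynomial.fit(x_train, y_train, degree)
    errors[degree] = {
        "train": mse(y_train, model(x_train)),
        "test": mse(y_test, model(x_test)),
    }
    print(degree, errors[degree])
```

The training error can only go down as the degree goes up. The straight line (degree 1) has high error on both sets, which is the underfitting case; a very high degree typically drives the training error toward zero while the testing error grows, which is the overfitting case.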

So let's assume you are building some model, and it gives 1% error on the training set. You have your whole dataset and you divide it into a training set and an evaluation set. You get 1% error on the training set and 15% error on the evaluation set. You can see that your model is overfitting, because the error is very low on the training set but 15% on the evaluation set, which is very high by comparison. It performs very well on the training set but fails to generalize to unseen data. In this case we say the model has high variance. We use bagging to reduce high variance, which you will see in the ensemble learning section, so don't worry; you can come back to this section later. That is how you identify that your model has high variance.

Next, assume your model gives 15% error on the training set and 16% error on the evaluation set. It does not perform well on the training set, so obviously it does not perform well on the evaluation set either. This is underfitting: the model has high bias, and we use boosting to reduce bias.

For the sake of another example, say your model has both high bias and high variance: 15% error on the training set, which is high bias, and 30% error on the evaluation set, where the large gap over the training error is high variance. It is both underfitting and overfitting, so it has high bias and high variance.

The last example of the bias and variance tradeoff, and our favorite: your model gives 0.5% error on the training set and 1% error on the testing set. This looks like a good, robust model that has not memorized the training set. We can say it has low bias, because the training error is very small, and low variance, because the testing error is close to the training error.

In all of these examples we make an assumption. You may not know which one yet, but the assumption is that the Bayes error, or human-level performance, is approximately 0%. So what do I mean by this assumption?

assumption? We take the Bayes error to be approximately zero percent in all of these examples. So let's see what human-level performance, or Bayes error, is. Say you are building a classification model, maybe a face detection or real-time face recognition model, and you are given an image where even you would fail to identify the person. Even I

would fail to identify this person, because the image is very blurred. So even the human error is very high here, because a human is not able to identify who this person is. You cannot expect your model to do great here; you can expect your model to give about the same error as you are

giving, because neither you nor your model is able to identify the person. Here the human-level error is very high, or I can say the Bayes error is very high, so you cannot expect your algorithm to work well. But let's assume that you have a clear image where the human-level error is equal to zero. Now you can expect your algorithm's performance to be

good, because the human-level error is zero. That is called the Bayes error, and I hope you understood it. We have talked about the bias and variance trade-off; you can rewind this video to go through it again. Next, what is approximation error? This is just a definition. The approximation error in some data is the difference between the exact value and the approximation of it. Here the approximation is f(x), the output from the model, which we write as y hat,

and the exact value is the ground truth y. The difference between the two is called the approximation error. I have taken one example from Wikipedia: if the exact value is 50 and the approximation is 49.9, then the error is 0.1. I already told you that in the cost function you take this difference, and that is exactly what you do when computing the error for one

training example. In a regression problem you subtract, y hat minus y, and then you add a summation over every i, that is, over every training example. This is just a definition, but you will see it a lot in research papers. The next one is empirical risk minimization, which we have seen before, so I am just going to make you familiar with the definition.

An algorithm receives as input a training set S, which is a sample from a large distribution D, labeled by some target function. So we have our training set as well as the labels for it, because this is framed as supervised learning: we have a label for each training example.
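
To make the per-example error and the labeled training sample concrete, here is a minimal sketch in Python. The function names, the toy predictor, and the tiny sample S are assumptions made up for illustration; only the 50 versus 49.9 numbers come from the example above. Averaging the per-example error over S gives the kind of quantity that empirical risk minimization tries to make small.

```python
def absolute_error(exact, approximation):
    """Approximation error for one value: |exact - approximation|."""
    return abs(exact - approximation)


def empirical_risk(predictor, sample):
    """Average absolute error of `predictor` over a labeled sample S.

    `sample` is a list of (x, y) pairs: features with their labels.
    """
    errors = [absolute_error(y, predictor(x)) for x, y in sample]
    return sum(errors) / len(errors)


# The example from the text: exact value 50, approximation 49.9.
print(round(absolute_error(50, 49.9), 3))  # 0.1

# Hypothetical training sample S and hypothetical predictor f(x) = 2x.
S = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]


def f(x):
    return 2 * x


print(empirical_risk(f, S))
```

Absolute error is used here only because it matches the 50 versus 49.9 example; a squared error would work the same way in this sketch.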

The algorithm should output a predictor f that maps the input variables x, meaning the features, to the output variable y. The goal of the algorithm is to minimize the error with respect to the unknown distribution D: when we feed in a new example that the model has never seen, the model's error should be very small. That is what the full definition says, and it simply means that

we want to come up with a predictor, written with a subscript S, where the subscript S emphasizes the fact that the output predictor depends on S. Whatever the output is, it depends on S, because we have learned the weights w, or theta 1, theta 2, all the way to theta n, from S. The predictor is chosen to minimize the risk, or error, measured on S, written L_S(h), and this is called empirical

risk minimization (ERM). I hope you understood this concept clearly, and I really hope you enjoyed this section. I have talked a lot about empirical risk minimization, learning theory, and so on, so I hope you will put it to use. We have already talked about the job, so now you can continue further. If you have not understood something, feel free to ask in the comment box below; I will be very happy to take your doubts. And be sure to have a look at the course

website, which is linked in the description box below. See you in the next section.

A very warm welcome to this section. Here we will be talking about decision trees, one of my favorite topics, and I will go in depth so that you understand everything, with intuitive examples and solved examples. I have seen that some instructors on YouTube are doing a great job, but they are not covering decision trees in depth for free, so I

just want to make everyone watching this tutorial familiar with decision trees in depth, and I really hope that you will enjoy this section. Here is what we are going to cover. First we will start with the introduction and a basic geometric intuition of what decision trees actually are. Then we will go further into how we build a decision tree, and for that we will learn some supporting concepts such as entropy, information gain, and Gini impurity. After

that we will build our own decision trees, and then I will show you the implementation of a decision tree. There are more topics that I will discuss later on, but first let's understand the basic intuition. So, first of all, what is a decision tree? A decision tree is a supervised learning algorithm. What I mean by supervised learning is that we have x1 with its label y1, all the

way down: x2 with its label y2, and so on up to xn with its label yn. We have labels, so it is a supervised learning algorithm. Like the support vector machine, it is used for both classification and regression, and you will see how we construct the decision tree for each. Let's start with the basic intuition. The definition of decision trees is that they are nested if

and else statements. If you are a programmer you will relate to this concept of if and else; Python, or any programming language, is a prerequisite here. So what is a decision tree? It is just a nested if and else statement: it asks a question and splits the data. Let me write the

formal definition to give you more intuition: decision trees are nested if and else statements that ask questions and split the data. Let me take the example of the iris data set. But first let me make you familiar with what this data set is, to give you a clearer understanding of the topic.
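
The idea of "ask a question and split the data" can be sketched in a few lines of Python. This is only an illustration: the function name `split_on_question`, the rows, and the threshold are all made up, and a real decision tree would also have to choose which question to ask.

```python
def split_on_question(rows, feature_index, threshold):
    """Ask "is this feature smaller than threshold?" and split the rows.

    Returns (yes_rows, no_rows).
    """
    yes_rows = [row for row in rows if row[feature_index] < threshold]
    no_rows = [row for row in rows if row[feature_index] >= threshold]
    return yes_rows, no_rows


# Hypothetical rows: (feature_0, feature_1, label). The values are made up.
rows = [
    (1.0, 5.0, "a"),
    (4.0, 6.0, "b"),
    (1.5, 4.5, "a"),
    (5.0, 7.0, "b"),
]

yes_rows, no_rows = split_on_question(rows, feature_index=0, threshold=2.5)
print(yes_rows)  # rows where feature_0 < 2.5
print(no_rows)   # rows where feature_0 >= 2.5
```

Each of the two groups can then be asked another question, which is where the nesting of if and else comes from.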

The iris data set is simply this: you have a data set with four features, sepal length, sepal width, petal length, and petal width, and you have the label, which is the species of the flower. So the task is to detect the flower species on the basis of four features. This is a classification data set, and since there are three species it is a multi-class one. The columns look like this: first you have sepal

length, then sepal width, then petal length, then petal width, and then one more column, the label, which indicates the species. As an example, a row might have the values 2.2, 4.30, 3.2, and 4.6 with the label setosa. There are three classes in the label: setosa, which we label as zero, versicolor, which we label as one,

and virginica, which we label as two. So the output will be either zero, meaning setosa, or one, meaning versicolor, or two, meaning virginica. That is the iris data set. Now, we will not make use of any library; we will simply make a decision tree ourselves by writing if and else conditions. We will

make a simple classifier, obviously not a formal one, that classifies your flowers. I hope you now understand what this data set is; if you want more detail you can search online, since iris is a famous data set for flower species detection. So let's start.

Here we have our x's, which are sepal length, sepal width, petal length, and petal width, and we have a label y that is either zero, one, or two. That is the basic intuition of the data set we are given. Now we will make a classifier like this.
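
The hand-written classifier built in this part of the section can be sketched in Python as follows. The thresholds `a` and `b` are placeholders (2.3 is the "let's say" value used in the walkthrough, and the value for `b` is purely hypothetical), and the function name and the sample measurements are assumptions. As the walkthrough itself stresses, this only shows the nested if and else structure; it is not an accurate iris classifier.

```python
# Label encoding used in this section.
SPECIES = {0: "setosa", 1: "versicolor", 2: "virginica"}


def classify(petal_length, sepal_length, a=2.3, b=6.0):
    """Nested if/else "decision tree" on two features.

    a and b are placeholder thresholds; b=6.0 is an arbitrary choice.
    """
    if petal_length < a:
        return 1  # versicolor
    else:
        if sepal_length < b:
            return 2  # virginica
        else:
            return 0  # setosa


# Hypothetical measurements, chosen only to exercise the second branch.
label = classify(petal_length=4.0, sepal_length=5.5)
print(label, SPECIES[label])
```

How good thresholds and features are actually chosen is what the attribute selection measures later in this section are for.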

So if the petal length is smaller than some number a, let's say 2.3, then consider y to be versicolor, so y equals one. If not, that is, if the petal length is greater than or equal to

a, then you write an else and inside it you take one more feature: if the sepal length is smaller than some number b, then you consider y to be virginica, so y equals

two. If both conditions fail, then you say that your output is setosa. So here is our simple decision tree, made using two features. Let me make it more intuitive; I am speaking slowly so that you are able to follow. Here we have taken two features, petal length and

sepal length; we have not used the other two features, although you could. This is not a formal decision tree, it is just an example, and it is obviously not correct. So you make the first if condition: if the petal length is smaller than some a. Here a can be anything, 2.3 or 4.3; I am just picking a value, but later, on one of our data sets, you will see

how it is actually chosen. If the petal length is smaller than a, then we consider y to be equal to one, versicolor. If the petal length is greater than or equal to a, this condition fails and control goes to the else branch. In the else branch there is again a nested condition, not a loop, just if and else control flow, which checks

whether your sepal length is smaller than b. If it is, you say that y equals virginica. Otherwise, if this condition also fails, the else says that y should be setosa. This whole nested if and else is your classifier, and this is the decision tree. Let me now draw it as a diagram. In terms of a decision tree, we can write

this if statement as follows. First we make our root node, with the condition: is the petal length smaller than a? If yes, you say y equals one. If no, you make one more condition: is the sepal length smaller than b? If yes, then

you classify y as two; if not, you take it as zero. All three values of the target variable are covered, and this is the decision tree, the part I have drawn in yellow. The decision tree just asks a question on the data set: if the petal length is smaller than a, consider y equal to one; if it is not, we make another decision, and then we

follow its yes or no branch in the same way. So that is the decision tree; it really is as simple as this. How it is actually constructed we will see later. We started from a nested if and else statement, and this tree is the same thing. Here is some terminology that

we should know. The first node, or head node, is known as the root node; it is also a parent node. The nodes below it are child nodes, and a child node that is split further is itself a parent node. A node that you do not split into further nodes is called a terminal

node or leaf node, so the nodes that hold the predictions here are all terminal, or leaf, nodes. A whole section of the tree like this is called a branch. That is the basic terminology. Dividing your data at a node is what we call

splitting. If you remove some node, then you are pruning it, but we do not need that right now. So that is the decision tree. I hope that you understood this example clearly, and that you got a good intuition of decision trees in a short span of time and in an easy way. I hope you remember this terminology; otherwise I will make you

familiar with the terminology as we go, so it is not required to memorize all of it. Still, it is best to take a paper and pen and write notes along with whatever I am writing; listen carefully first, and then make your notes. Now let's see what the decision boundaries, or hyperplanes, will look like. What do I mean by hyperplane or decision boundary? Let's take the example of linear regression:

in linear regression we fit a straight line, the hypothesis. In the support vector machine we make a hyperplane, and in logistic regression we also make a hyperplane, meaning a decision boundary. In a decision tree we also have decision boundaries, or hyperplanes, so let's see how they are constructed. I don't know whether I will be able to draw the picture well, but I will try. Let me draw an x and y

plane. This is my x-axis and this is my y-axis, and we have two features: petal length on the x-axis and sepal length on the y-axis.

We use only these two axes because we have taken only two features in this example; we are not using petal width and sepal width. Now, what have we done? We have taken this full region to be y equals one. Let's name its boundary the first hyperplane. This full region is our y

equals one, meaning whatever data point falls here is considered y equals one. This is just the if statement that we have seen: if the petal length is smaller than a, then y equals one. It is one full region with no further split, where y equals one whenever that condition passes. Then we can construct another hyperplane, so that this second region is considered y

equals two: if the sepal length is smaller than b and a point falls in this region, it is considered y equals two. With one more boundary, the remaining region gets the last outcome, setosa, which is y equals zero in our label encoding. So here we are constructing hyperplanes, and wherever a point falls, it gets y equals

one, y equals two, or y equals zero. This is with two features; obviously it will be higher dimensional when you plot all four features. You can see that the hyperplanes are what let us make predictions, and that we have made a simple classifier. Something to note here is that all the hyperplanes are axis-parallel: each boundary is parallel to one of the axes. I hope that

you understood decision trees from that; this is the basic intuition that I wanted to give you, in as simple a way as I could. Now let's start with the mathematical part, which is how we construct these kinds of decision trees. Before that: how

do we choose the variable, or feature, to be the root node, and how do we choose where to split? For this we have attribute selection measures. If you select randomly, the way I just chose petal length, you will end up with a very bad model. So we have different attribute selection measures, namely entropy, information gain, and Gini impurity, which we will see in detail to understand how we select the attribute to use as the root node or any other node. Let's take

the example of this data set. If you want to split this data set, which feature will you use as the root node: outlook, temperature, humidity, or wind? Play tennis is our label. If you choose randomly, you may end up with a bad model, so you need a principled way to choose.

Scientists and researchers have done a great job here, and you can do this kind of research too; please keep contributing to the AI community. I am also doing research in machine learning and will definitely come up with something extra. The different attribute selection measures are: first, entropy; second, information gain, which we usually write as IG; and third, Gini impurity. We denote each of them with a short notation like this.

We will talk about all three of these, and I hope that you will understand each of them. Let's start with entropy. First of all, what is entropy? Entropy is an attribute selection measure. It is the measure of randomness, of how pure or impure an attribute is, which tells us whether to use it as some node or as the root node. Let me write the definition, because

the definition is also important. If you want my notes, you can write along with me, or you can comment or join the Discord community and ask me there, and I will be able to share them. Entropy is the measure of randomness: the higher the entropy, the harder it is to draw any information from the data. If the

entropy is high, the node is hard to decide on and needs further splitting, so the entropy should be low for a node to be considered a leaf node. Don't worry, I will dive into the cases to make clear what I mean by these terms. I will show you one equation, then one example, and then the properties of entropy. So first of all,

let's take an example where y can be y1, y2, all the way to yk. These are not training examples; they are the classes, for instance setosa, so k is the number of classes you have. Maybe you have a binary classifier, or maybe you have a multi-class classifier. In the iris data set we have setosa, virginica, and versicolor, so

we have three classes: y1 is setosa, y2 is virginica, and y3 is versicolor. That is what I mean by this. Now let me give you the equation. We denote the entropy by H, and H equals minus, and this minus is very important, the summation from i equals 1 all the

way to k, where k is the number of classes, of p(y_i), the probability of y being y_i, times the log base b of p(y_i). In short: H = - sum over i from 1 to k of p(y_i) * log_b p(y_i). Here b is usually taken as 2 or e, where e is about 2.718, and we will usually take b as 2. If you take b as 2 you can write the log as lg, and if you take b as e you write it as ln, the natural logarithm.

That is the full equation of entropy, and it just measures the randomness. I know this is making no sense to you yet, but I will make sure that it will. So let's take one example: first, we want to measure the randomness for playing golf.
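
The entropy equation above can be written directly in Python. This is a minimal sketch: the function name `entropy` and its `base` parameter are choices made for this illustration, and classes with probability zero are skipped, which uses the common convention that 0 times log 0 counts as 0.

```python
import math


def entropy(probabilities, base=2):
    """H = -sum_i p(y_i) * log_base(p(y_i)).

    Zero-probability classes are skipped (convention: 0 * log 0 = 0).
    """
    return sum(-p * math.log(p, base) for p in probabilities if p > 0)


# Two equally likely classes, with log base 2 (lg) and with base e (ln).
print(entropy([0.5, 0.5]))
print(entropy([0.5, 0.5], base=math.e))
```

Changing the base only rescales the result, which is why either 2 or e can be used as long as the same base is used throughout a comparison.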

Let me write down the data set first. We have this play golf data set, where the number of yes is 9 and the number of no is 5, so there are 14 examples in total, and we want to take out the entropy of

playing golf, over the two outcomes no and yes. The probability of no is 5 out of 14, about 0.36, and the probability of yes is 9 out of 14, about 0.64. So the entropy is minus 0.36 times log base 2 of 0.36, which is the term for no, minus 0.64 times log base 2 of 0.64, which is the term for yes, and the answer is about 0.94. What we are doing

here is writing the term p(y_i) times log of p(y_i) first for no and then for yes, and subtracting both. We get an entropy of 0.94, which is high, so this node can be split further. That was the basic calculation of entropy, but let's look at some cases to get more sense of entropy as an attribute selection measure. We will see some properties. Let's consider that y has two classes, meaning

we have a binary classifier with two classes, one yes and the other no; it can be any kind of yes or no, whether it is playing tennis or anything else. Let's take scenario number one. Here is your data, and the yes labels make up 99 percent of it. These are your y labels, so

there are two unique values in y, yes and no, and in this case yes is about 99 percent and no is about one percent. This is case one, so let's take out the entropy H(y): minus 0.99 times log base 2 of 0.99, which is the term for yes, because the summation has one term for each value of

y, minus 0.01 times log base 2 of 0.01, which is the term for no. The output is about 0.08, so that is your entropy. Now scenario number two: your data has about 50 percent yes and 50 percent no. If you take out the entropy of this, it is minus 0.5 times log base 2 of 0.5, minus 0.5 times log base 2 of 0.5, which is equal to

1. The maximum entropy is 1, and data like this is the hardest to draw information from. This maximum of 1 is for a binary classifier with log base 2; if you have multiple classes the maximum changes, to log base 2 of the number of classes, which you can look up. But in most cases, if you understand the binary case you will be able to understand

the multi-class case too. Scenario number three: your data D has about zero percent yes and a hundred percent no. The entropy here is zero, taking 0 times log 0 as 0, and zero is the minimum entropy. Now, a decision tree algorithm such as ID3 follows this rule: if the entropy equals

zero, you consider the node a leaf node, and a leaf node is your prediction. If the entropy is one, or large, then the node needs further splitting. For a binary label the entropy always lies between zero and one, inclusive. ID3 is one

of the decision tree algorithms, and it follows this: if the entropy is zero, or small, you consider the node a leaf node; if it is one, or large, it needs further splitting. That is entropy, and I really hope you understood it in detail. Let me recap what we have seen so far. We talked about decision trees, which you can rewind to watch again, so I am going to recapitulate entropy.

Entropy is the measure of randomness that tells you whether a node needs further splitting or can be considered a leaf node. We took the example of playing golf, and then three scenarios, where we saw that it is very hard to get information when the entropy is high, so the node needs further splitting.
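
The play golf calculation and the three scenarios can be checked with a short Python sketch. The helper names `entropy_from_counts` and `is_leaf` are made up for this illustration, the scenario counts assume 100 examples so that the percentages become counts, and the leaf rule is only the simplified "entropy is zero" version described above, not a full ID3 implementation.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy of a label column, given the count of each class."""
    total = sum(counts)
    probabilities = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probabilities)


def is_leaf(counts):
    """Simplified rule from the text: zero entropy means stop splitting."""
    return entropy_from_counts(counts) == 0


scenarios = {
    "play golf (9 yes, 5 no)": [9, 5],
    "scenario 1 (99 yes, 1 no)": [99, 1],
    "scenario 2 (50 yes, 50 no)": [50, 50],
    "scenario 3 (0 yes, 100 no)": [0, 100],
}
for name, counts in scenarios.items():
    h = entropy_from_counts(counts)
    print(f"{name}: H = {h:.4f}, leaf = {is_leaf(counts)}")
```

Running it shows the pattern from the scenarios: the closer the split is to half and half, the closer the entropy is to one, and only the pure case is treated as a leaf.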

If you have 50 percent yes and 50 percent no, the entropy will be one, and for the other scenarios you can go back over the calculations to understand them better. So that is what entropy is. Now let's see the diagram of entropy, which is a bit interesting visually. Let me draw

a good one. On the horizontal axis we have zero here and one here, and the highest point of the curve is an entropy of one, which is reached at 50 percent yes. This is the diagram of entropy: the highest the entropy can be is one if you use this equation, as I have already shown you. So we have seen the

entropy which is a measure of a randomness so now let's talk about the another attribute Selection measure which is information gain okay we'll talk about information gain let me write it down in for formation gain and i will take an example of i i will explain you information gain with the help of one data set which is all i have already showed you i thought i will give you a surprise with that but I've already showed you let me show you again so here we have that data set let me show you yeah here is

our dataset. Let me scroll back like this. This is the Play Tennis dataset. Just take a look at it; I have already gone through it. Here you have Play, which is the target variable, and these are the features. We will see which feature to use and which not to.

If you count, you have one, two, three, four, five as no, and one through nine as yes. Keep those counts in mind, and now let's come back to information gain. First I will give you an overview of what information gain is, and then we will dive deep. (My computer is lagging for some reason, so bear with me.) Okay, here is my information gain page. Great. I hope that

everyone is able to see it. So let's say you have a dataset D. What you do is divide this dataset into smaller subsets, which I will call divisions.

Let's picture it with the iris dataset, which has three groups: setosa, versicolor, and virginica. You divide D into three subsets: all the setosa examples go into the first one, all the versicolor examples go into the second, and all the virginica examples go into the third. In an actual decision tree the divisions come from the values of the feature you split on; here I'm only using the iris groups to picture what a division is, and I'm talking about a classification task.

So you divide your dataset into D_v1, then D_v2, then D_v3. Then you take out the entropy of D_v1, the entropy of D_v2, and the entropy of D_v3. You also take out the entropy of your whole data distribution before splitting. So you have the entropy before splitting and the entropy after splitting.

After that you subtract: you take H(D), the previous entropy, which is the entropy before splitting, minus the weighted entropy after splitting. Let's see how to do that with the help of an example.

I'll work through one example to make the weighted entropy more familiar: how you take out the weighted entropy after splitting, and how you subtract it. So far I have only given you the outline of the formula: divide your data into subsets, the divisions, then take out the entropy before splitting, and then take out the entropy after splitting.

So first, let's take out the entropy of division number one, with the help of that Play Tennis dataset: H(D_v1). I'm writing the yes count first. In this division we have a total of five examples, where three are yes and two are no, so the fractions are 3/5 and 2/5, and the entropy comes out to about 0.97. Then you take out the entropy of division number two, which is zero, and in the same way the entropy of division number three, which is about 0.97.

Then you take out the average entropy, the weighted one, which here comes out to about 0.69. The formula for the weighted entropy is like this, let me write it down. You take the size of division one, |D_1|, divided by the size of the full data, |D|, times the entropy of D_1. Then you add |D_2| / |D| times the entropy of D_2, plus |D_3| / |D| times the entropy of D_3. Here D is the full dataset, so |D| is the size of the full data:

weighted entropy = (|D_1| / |D|) H(D_1) + (|D_2| / |D|) H(D_2) + (|D_3| / |D|) H(D_3)

That is the average entropy you get after splitting, and it will be equal to some number. Then you subtract: you take the entropy before the splitting

minus what you get after splitting. So let me write the formal equation for information gain, IG. We write IG(Y, var), where var means a variable: it can be Outlook, or Temperature, or Windy, or Humidity. We check each one, and the higher the information gain, the better: the variable with the highest information gain is the one we select as the root node.

So you take the entropy of your full data, H(D), and from it you subtract the sum, for i equals 1 all the way up to k, of |D_i| / |D| times the entropy of D_i. Here |D_i| is the size of division i of your data and |D| is the size of the full data:

IG(Y, var) = H(D) - sum_{i=1}^{k} (|D_i| / |D|) H(D_i)

The sum is your weighted entropy, and H(D) is your entropy before splitting. Subtracting gives you the information gain for that variable.
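
For reference, the two quantities just described can be typeset as follows. This is only a restatement of the spoken formula; the label $H_{\text{weighted}}$ is a name I am assuming for the weighted entropy, and $k$ is the number of divisions produced by the split.

```latex
H_{\text{weighted}}(D, \text{var}) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\, H(D_i)
\qquad
IG(Y, \text{var}) = H(D) - H_{\text{weighted}}(D, \text{var})
```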

Okay, we will see one more example later on, when we build our own decision tree by hand, mathematically. So again, let's recapitulate what we have seen so far. Let's go back to that dataset. Here is my dataset. What we do first is take this dataset and count it by the target, which has two values, yes and no.

Here we have 9 yes out of 14 examples in total, and 5 no out of 14. So you simply write -(9/14) log2(9/14) - (5/14) log2(5/14), and you get some entropy, which is the entropy of your data before splitting. I hope you understood what I'm trying to say here: the first term is for yes and the second is for no, because the entropy is a summation, for i equals 1 all the way up to k, of the probability of y_i times the log, with base 2, of the probability of y_i, with a minus sign in front:

H(Y) = - sum_{i=1}^{k} p(y_i) log2 p(y_i)

We are doing the same thing here, but this time k is 2. That is the binary case; you can use the same equation for multi-class.

So that's what information gain is. First we take out the entropy of the data, then we divide the data into multiple divisions. Here we have three categories, so we divide into three; with two categories you would divide into two. Then you take out the entropy of each of those categories, and then you take out the weighted entropy after splitting, rather than a plain average, and then you subtract.

And how do you take out the weighted entropy? I have already given you the equation, but more formally: for each division you take |D_i| divided by |D|, that is, the number of examples in that division divided by the total number of examples, times the entropy of that D_i, and you add these up.
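
The weighted entropy and the subtraction can be sketched in a few lines. The (yes, no) counts below are assumed values for three divisions of a 14-row dataset, chosen to be consistent with the 9 yes / 5 no totals mentioned above; the function names are my own.

```python
import math

def entropy(counts):
    """Entropy in bits of a node, given its class counts, e.g. (yes, no)."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:  # empty classes contribute nothing
            p = c / total
            result -= p * math.log2(p)
    return result

def weighted_entropy(divisions):
    """Sum over divisions of |D_i| / |D| * H(D_i)."""
    total = sum(sum(counts) for counts in divisions)
    return sum((sum(counts) / total) * entropy(counts) for counts in divisions)

# Assumed (yes, no) counts: the whole dataset, then its three divisions.
parent = (9, 5)
divisions = [(3, 2), (4, 0), (2, 3)]

h_before = entropy(parent)
h_after = weighted_entropy(divisions)
print(f"H(D) before split : {h_before:.3f}")
print(f"weighted entropy  : {h_after:.3f}")
print(f"information gain  : {h_before - h_after:.3f}")
```

Weighting by division size matters because a large mixed division should count for more than a small one; a plain average of the three entropies would ignore that.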

Okay, so that's how you take out the weighted entropy. After that, you subtract: the previous entropy minus the weighted entropy. That's what we are doing in information gain, and I really hope you understood it. If you are stuck at some point, you can ask me on Discord and I'll be very happy to help. You can find the Discord server link under any New Era video, and you can join from there.

Also, if you want to support this work, please go to the New Era YouTube channel and subscribe, and you can watch the whole tutorial on New Era, because you get all of it for free over there. Okay, so we have seen entropy, and we have seen information gain. Now it's time to talk about Gini impurity. This is also one of the most popular

and most famous measures used today, which is Gini impurity. It is very, very similar to entropy. (On the board I'm just using a squiggle as a "very similar" sign; it's not formal notation.) Let me give you the equation for calculating the Gini impurity. We write it I_G, and here that is not information gain, it is the symbol for Gini impurity.

I_G(Y) equals 1 minus the sum, for i equals 1 all the way up to k, of the probability of y_i squared. Again, Y is your target variable, and k is the number of unique values in Y, for example two for yes or no, or three for setosa, versicolor, and virginica:

I_G(Y) = 1 - sum_{i=1}^{k} p(y_i)^2

Now let's take the same scenarios as we took for entropy. In scenario number one, Y has two classes, with unique values yes and no, and the proportion of yes is 0.5 and the proportion of no is 0.5. What will the Gini impurity be? Because we are squaring, 0.5 squared is 0.25, so it is 1 minus 0.25 minus 0.25, and you get 0.5. For a binary target 0.5 is the maximum Gini impurity, whereas for entropy the maximum is one. If the impurity is 0.5, the node needs further splitting; if it is zero, it does not, and we can consider it a leaf node. That's the same scenario we have already seen: in the case of entropy, with 50 percent yes and 50 percent no, the value was one.

You may ask: what is the advantage of using Gini impurity rather than entropy? It is an alternative to entropy that makes the computation a little faster. In entropy, as you have seen, we take H(Y) = - sum_{i=1}^{k} p(y_i) log2 p(y_i), so we are taking a log, and taking a log takes time. As an alternative, researchers came up with a very easy and understandable form: one minus the sum of the squares. Entropy comes from information theory, but I'm not going to go into information theory here. So this is Gini impurity, and it is mostly used as an alternative to entropy.
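
As a rough side-by-side, the sketch below computes Gini impurity and entropy for the same class proportions. The helper names and the 90/10 case are my own illustrative choices.

```python
import math

def gini_impurity(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; zero-probability classes contribute nothing."""
    result = 0.0
    for p in probs:
        if p > 0:
            result -= p * math.log2(p)
    return result

scenarios = {
    "50% yes / 50% no": (0.5, 0.5),
    "0% yes / 100% no": (0.0, 1.0),
    "90% yes / 10% no": (0.9, 0.1),  # extra illustrative case
}
for name, probs in scenarios.items():
    print(f"{name}: gini = {gini_impurity(probs):.3f}, entropy = {entropy(probs):.3f}")
```

Both measures are zero for a pure node and largest for an even split, which is why they usually lead to similar trees; the Gini version simply avoids the logarithm.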

Okay, so that's the thing, and I hope you understood Gini impurity as well. Let me make you familiar with the diagram. First the curve for entropy: here we have zero, here we have one, and the maximum is one. Then the curve for Gini impurity: the shape differs a little, and the maximum is 0.5. So the yellow one is Gini impurity and the white one is entropy.

So I hope you understood why we use Gini impurity: because taking a log takes time, Gini is a bit faster to compute than entropy. That covers the whole definition of decision trees. I've been recording for 51 minutes now, we have written a lot and learned a lot, and I really hope you understood all of it. Okay, so now we will

make a decision tree classifier, and I will show you the decision tree numerically, and also how we do regression tasks with decision trees. I don't have a lot of time here, so let's do it quickly. Here is the dataset I'm going to use. First of all, we take out the entropy of this whole dataset, H(D).

(Oops, where is my pen?) Here Y has two values, and we are taking the entropy of our data distribution D. That simply equals 0.94. As I told you, for information gain you have to take out the entropy before splitting. We get this value because the dataset has five no and nine yes. Okay, so that's the way you take it out.
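
Written out for the 9 yes / 5 no counts, the arithmetic behind the 0.94 looks like this (intermediate values rounded to four decimals):

```latex
H(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}
     \approx 0.4098 + 0.5305
     \approx 0.94
```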

If I want to write it out, I can simply write it like this: -(5/14) log2(5/14) - (9/14) log2(9/14). After you calculate, you get this 0.94. Feel free to correct me in the comment box if I get anything wrong in the calculation. I'm doing the arithmetic quickly; what we have to remember are the concepts. So after you calculate the entropy of your distribution, now what

you do is calculate the information gain. How? First you calculate the information gain for Y with respect to the Outlook variable, which means we check how much information the Outlook variable contains. You can see the Outlook variable over here, so let me take it out.

If you look at the Y variable within Outlook: in sunny we have two yes and three no; in overcast we have four yes and zero no; and in rainy we have three yes and two no, which together gives the nine yes and five no of the whole dataset.

If you take out the entropy of the overcast division, H(D_2), you will see that it equals zero. So you can take this as a leaf node, because it does not need further splitting. Then you take out the entropy of the sunny division, H(D_1), and of the rainy division, H(D_3).

So you have taken this large dataset, taken this feature, and split on it. We have the entropy before splitting, and now you take out the weighted entropy, which is equal to 0.69. For the information gain you simply subtract: 0.94, the previous entropy, minus the weighted entropy, 0.69, which is about 0.25. The weighted entropy formula I have already shown you previously. And you do the same for

temperature. For the Temperature variable you have three classes: hot, mild, and cool. In hot we have two yes and two no, in mild we have four yes and two no, and in cool we have three yes and one no. You take out the information gain in the same way, then you do it for Humidity, and then for Windy.

It turns out that the information gain for Outlook is the highest. So you take Outlook, and only Outlook, as your root node, and then you divide on it. Let me go to one of my favorite decision tree PDFs to explain this.

So you take Outlook as the root node, and then you divide, as we did, into sunny, overcast, and rainy. For overcast the entropy equals zero, so it does not need further splitting. We have four yes there, so we can consider that branch a yes, because it is completely pure. Sunny needs further splitting, and rainy also needs further splitting, and at some point they also become pure, where the entropy equals zero, or the Gini impurity equals zero, and then you consider that a leaf node.

So this is how we calculate the information gain, and this is how we make the decision tree. I really hope you understood decision trees. I also enjoy it very much when I make this kind of tutorial, because it's just amazing to help students understand these things. Okay, great.
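
The whole root-selection step can be reproduced in plain Python. This sketch assumes the on-screen table is the standard 14-row Play Tennis dataset as it is usually published; its per-value counts agree with the ones read out above. All function names are my own.

```python
import math
from collections import Counter, defaultdict

# Standard 14-row Play Tennis table (assumed to match the one on screen).
COLUMNS = ["Outlook", "Temperature", "Humidity", "Windy", "Play"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_by(feature_index):
    """Group the target labels by the value of one feature."""
    groups = defaultdict(list)
    for row in ROWS:
        groups[row[feature_index]].append(row[-1])
    return groups

def information_gain(feature_index):
    """H(D) minus the size-weighted entropy of the divisions."""
    labels = [row[-1] for row in ROWS]
    groups = split_by(feature_index)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

gains = {name: information_gain(i) for i, name in enumerate(COLUMNS[:-1])}
root = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"IG(Play, {name}) = {gain:.3f}")
print("root node:", root)

# Purity of each branch under the chosen root, by both measures.
for value, labels in split_by(COLUMNS.index(root)).items():
    print(f"{root}={value}: entropy={entropy(labels):.3f}, gini={gini(labels):.3f}")
```

Building the full tree would repeat the same calculation inside each impure branch, using only the rows that reached that branch.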

So far we have only talked about classification, but regression is not much harder, so let me show you what we do in regression. First, let me go to the scikit-learn decision tree page. I hope the page loads... yeah, here it is. It explains things in a very good way. Let me take the example shown here, with petal length and petal width.

It works in the same way. At the root the test is petal length <= 2.45. The Gini impurity there is high, so the node is split. On one side the Gini impurity equals zero, so that node does not need further splitting; the other side still needs splitting, so the tree splits it again. Lower down, the tree seems to be overfitting: if you let the decision tree keep asking as many questions as it wants, it will overfit.
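
Here is a small sketch of that behaviour on the iris data, assuming scikit-learn is installed. The split ratio and `random_state` values are arbitrary choices of mine. On a dataset this small and clean the test scores may barely differ, so treat it as an illustration of how depth grows, not as proof of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {}
for max_depth in (None, 2):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(criterion="gini", max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    models[max_depth] = clf
    print(
        f"max_depth={max_depth}: depth={clf.get_depth()}, leaves={clf.get_n_leaves()}, "
        f"train acc={clf.score(X_train, y_train):.3f}, "
        f"test acc={clf.score(X_test, y_test):.3f}"
    )

# Gini impurity at the root: three equally frequent classes give 1 - 3 * (1/3)^2.
print("root gini:", round(models[None].tree_.impurity[0], 3))
```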

So what you do is one of two things: either you stop your decision tree at a certain depth, to make it more robust, or you prune the tree by removing some of its branches. And you can see, in the same way as before, the kind of decision boundaries the tree makes. Okay, great, let's see some more. (I use Brave as my browser, by the way. Put your questions in the comments below and I will try to see them.)

Decision trees can be applied to regression as well. Let me find some good pictures for decision tree regression; later I will go to the documentation to show it in a more polished way. Here is a good example. Here you have the predictors, and here you have a numeric target variable.

You take Outlook and divide into sunny, overcast, and rainy. Sunny needs further splitting, but overcast does not, so for overcast you take the average of the target, which is 46.3. For the branches that need it, you split again, and each time you take out the average. This page explains it very well, so you can go through it, although it only covers regression.

The overall procedure is the same as for classification: you pick the attribute with an attribute selection measure and split into sunny, overcast, and rainy. Overcast has four examples and is treated as a leaf, and the prediction at a leaf is the average of the target values there. So you get averages like 39.8 and 46.3. Then you split sunny on a feature with values false or true: if it is false, your output will be 47.7, and if it is true, it is 26.5. That is how you do regression.

I hope you understood, and you can go to the page to understand it in more detail. What they do is take the hours played, with its average, standard deviation, and count, and then do some calculation, which is not too hard. Okay, great. So far we have seen decision tree regression, and I really hope you understood this as well. You can go to the website, saedsayad.com, and open the decision tree regression page. But it's more important to understand the attribute selection measures, because people usually get confused by them, and that's why we took lots of examples. Okay, great. So now what I will do is

show you the documentation and the implementation for decision trees: the decision tree regressor and the decision tree classifier, because it's very important to learn from the implementation, and I will explain it in a more intuitive way. Let me open Ink2Go; let me see if it works for me or not. Okay, great, let me choose this pen.

So you can go to that website for more reading, and you can go to scikit-learn.org and open the documentation of DecisionTreeClassifier. First, criterion: which criterion to use to measure the quality of a split. Here we have gini and here we have entropy. Gini is the default, and you can choose entropy as well: the supported criteria are gini for the Gini impurity and entropy for the information gain. Next is splitter: do you want the best split or a random split? best means it picks the best split, and random means it picks one randomly.

max_depth is a very important hyperparameter, which you can tune using GridSearchCV or RandomizedSearchCV. It is the maximum depth of the tree. If you do not limit the depth, your decision tree will overfit. The default is None, and in that case the tree keeps growing, making more and more decision boundaries, until all its leaves are pure. That's why it overfits, and that's why this is a very important hyperparameter.

Then you have min_samples_split, the minimum number of samples required to split an internal node. Again, you have to tune it. The default is

two, but you need to be careful with this one. Then we have min_samples_leaf, the minimum number of samples required to be at a leaf node. The default is one, but it is good to tune it as well using GridSearchCV. Then we have min_weight_fraction_leaf.

Then max_features, the number of features to consider when looking for the best split. The default is None; you can set it when you have a large number of features, but it also has some disadvantages. Then we have random_state, which controls the randomness of the estimator; you can read more details about it in the docs. Then max_leaf_nodes, which grows the tree in best-first fashion up to that many leaf nodes; the default is None. Then min_impurity_decrease: a node will be split if the split induces a decrease of the impurity greater than or equal to this value. To be honest, I have not used it very much; it's just another limit you can set to prevent overfitting.

I think the next one is deprecated. Then you have class_weight, where the default is None, and then ccp_alpha, a non-negative value. So that's the basic intuition for the parameters. After that you can see some examples, and there is a page called "Understanding the decision tree structure".

Let me show you one example from it. You can use the Graphviz tool to plot your decision tree like this, or tree.plot_tree, which plots the tree like this. That is one more advantage of decision trees: they are easy to interpret. So that's the basic intuition behind decision trees. Okay, so let's go to decision tree regression.
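
The tuning workflow described above might look roughly like this, assuming scikit-learn is installed. The parameter grid values and the choice of the iris data are my own picks for illustration, and `export_text` is used instead of a plot so the tree prints in a terminal.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Candidate values for the hyperparameters discussed above (arbitrary choices).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")

# Text rendering of the chosen tree, one line per node.
tree_text = export_text(search.best_estimator_, feature_names=list(iris.feature_names))
print(tree_text)
```

`RandomizedSearchCV` takes the same estimator and a similar parameter specification, and samples a fixed number of combinations instead of trying all of them, which helps when the grid is large.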

Where is regression? Let me look once more... great, here it is: DecisionTreeRegressor. So decision trees can be used for regression tasks as well. Here too you have to choose a criterion, the function that measures the quality of a split, but we are not using entropy here. You can use the mean squared error, which measures how much the predictions differ from the targets, or the mean absolute error, or poisson.

You can also see friedman_mse in the list, but most people use MSE or MAE. Then splitter, where the default is again best, the same as before. Then max_depth, which controls the depth of the tree to prevent overfitting. Again I'm saying it: this is a very, very important hyperparameter to tune. Then you have min_samples_leaf and the rest, which are the same as we have discussed. After that you can see some examples of decision trees and how they are made, and then the page lists the estimator's methods.
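
A minimal sketch of the effect of `max_depth` on a regression tree, assuming scikit-learn and NumPy are installed. The noisy sine data and the seed are my own choices, and the criterion is left at its default (squared error) so the snippet does not depend on version-specific criterion names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)

# Training data: a sine curve where every fifth target is perturbed by noise.
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Evaluation data: clean sine values the tree has not seen.
X_test = np.sort(5 * rng.rand(200, 1), axis=0)
y_test = np.sin(X_test).ravel()

models = {}
for depth in (2, 5):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0)
    reg.fit(X, y)
    models[depth] = reg
    print(
        f"max_depth={depth}: train R^2={reg.score(X, y):.3f}, "
        f"test R^2={reg.score(X_test, y_test):.3f}"
    )
```

A deeper tree can follow the noisy training points more closely, so its training score rises; whether the test score follows is what tells you if the extra depth is helping or overfitting.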

You also have some examples, which you can see from here; one of them uses AdaBoost, which we will talk about later on. You can see some visualizations in the library docs. In this example, a decision tree is used to fit a sine curve with additional noisy observations, and the fit is controlled by max_depth. One line is the tree with max depth 2, and the other is the tree with max depth 5. With a max depth of 5 the tree performs very, very well on the training set, but not on the testing set. That is what happens if you do not control your max depth.

Okay, so we have talked a lot about decision trees, and I really hope you understood them. The reason I have taken so long on this section is that decision trees are a very, very important concept to understand. You can have a look at these notes if you want: ping me on Discord or LinkedIn and I will be very happy to share them, although I would recommend writing these things out yourself.

So that's it for this section. Sorry, my throat is hurting a lot, but yeah, that's it for this section. In the next section we will start with ensemble learning, then we will go to unsupervised learning, then we will talk a little bit about neural networks, and then we will wrap up the course. By then you will have enough understanding of machine learning to get started making projects and getting a job or an internship, but you have to practice a lot.

Okay, great. Thank you for watching this section. Again, if you have any kind of question you can ask me on the New Era channel, and be sure to subscribe if you want to support this content. Let's meet in the next section. Till then, bye bye.

Okay, so now we'll talk about ensemble learning, one of my favorite

topics to teach and to use in my professional work. Why do I think ensemble learning is one of the best? For Kaggle competitions, ensemble learning methods are the most popular: around 99% of Kaggle winners use some kind of ensemble learning technique. That's why I think ensemble learning is a must-know technique. It does not involve a lot of mathematics. In decision trees we involved some mathematics, but here we need only a little; mostly there are techniques, concepts, and approaches that Kaggle winners use.

We will dive deep into ensemble learning, covering three techniques. This is our main section, and in separate subsections we will talk about bagging, then boosting, and then stacking. We will also see the implementation of each one of them, and I will show you some Kaggle competition winners' approaches, or we will build one model and see what changes these techniques bring and how your model's accuracy increases. But before that, let me set ensemble learning aside for a moment and recall a concept, which

is the concept of high variance and high bias, so that what follows makes sense to you. Let's recall: with high variance we have an overfitting model, and with high bias we have an underfitting model. Our models should be low bias and low variance. If you have high variance, your model is overfitting; if you have high bias, your model is underfitting. That's the recap we need for ensemble learning, just to make sure we are on the same page.

Let's start with a small example. Suppose you have played some quiz, maybe on Kahoot, or any kind of quiz in a competition. We have a question with options A, B, and C, and suppose A is the correct answer. It's fairly natural that the majority of students will go with A, and if the majority is on A, then A is likely to be the correct answer; here it actually is. So whatever the majority says, we go with that. If you think about it, the majority is more accurate than one person. In some exceptional conditions that can be different, but in about 99% of cases the majority will win. If you watch a quiz and see that one option has the biggest majority, you can think that is the correct answer. In the same

way, ensemble learning works: ensemble learning is an ensemble of models.

Let's take one more example to get a better feel. Suppose an election is happening. Why don't we take only one person's vote to select a prime minister or president? Why do we take the majority? The party that gets the majority of votes wins, because the majority will make the right decision in most cases. That's what ensemble learning says too, and ensemble learning has an ensemble of models.

So let me tell you what we do in ensemble learning. We have model 1, model 2, and so on, all the way to the last model. You have these models, and you train each of them on your data.
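The "go with the majority" idea from the quiz and election analogies can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote` and the ten quiz answers are made up for the example and are not from the course material.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the share of voters who chose it."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical quiz: ten students answer one question with options A, B, C.
quiz_answers = ["A", "A", "B", "A", "C", "A", "A", "B", "A", "A"]
winner, share = majority_vote(quiz_answers)
print(winner, share)  # A 0.7
```

If two options tie, `Counter.most_common` returns the one that was seen first, so a real system would need an explicit tie-breaking rule.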

You train your models on your data, and then you take predictions from each model. Suppose you are building a diabetes prediction system, so the output is either 0, meaning non-diabetic, or 1. Say model 1 predicts 0, model 2 predicts 0, model 3 predicts 0, model 4 predicts 1, and model 5 predicts 0. Then we have one more model, a big model, which checks the majority of votes. Here the majority is 0 and only one model has given 1, so we give our prediction as 0.

We will see how it is trained later on, but this is what is actually happening: we train our models, each model gives a prediction, and in classification we take the majority vote over those predictions. M1 says 0, M2 says 0, and so on; the frequency of 0 is higher than the frequency of 1, so our final prediction, y hat, is equal to 0. A prediction agreed on by the ensemble of models is more likely to be accurate than the prediction of any one model alone.

That's the basic overview of ensemble learning. You may think: in classification we choose

the majority of votes, but what about regression problems? In regression we take the mean or the median of the outputs from each model. Suppose model M1 gives a regression value of 2.4, model 2 gives 2.6, and model 3 gives 2.5. The final prediction is the mean or the median of these outputs, which here is 2.5 either way. That's how we do it in regression: we do not take the majority, we take the mean or the median of the outputs from the base models.

These individual models are called the base models of the ensemble. Don't be confused about what a base model is: the base models are the models being trained that together make up the ensemble. That's the basic intuition behind ensemble learning, and I really hope you understood it.

Now, what are some of the

techniques used in ensemble learning? We have bagging, then boosting, then stacking. There is one more, cascading, which you can learn but which is not widely used in industry yet; we will see it if we want to, otherwise it's not necessary.

In bagging we will see one algorithm called random forest, which is just an ensemble of decision trees. In boosting we will mostly focus on gradient boosting, and we will also cover adaptive boosting (AdaBoost), an earlier relative of gradient boosting. Then we will see one competition winner, XGBoost. So those are the algorithms we will see in boosting. Then in stacking we will see, with the mlxtend library, how we

do stacking and what the intuitive understanding is. As you can see, there isn't much math here; it's mostly concepts. So that's what we are going to see.

Now the motivation: why do we study ensemble learning? That's a great question to ask whenever you study anything. The reason is that most Kaggle competition winners use some kind of ensemble learning, whether bagging, boosting, stacking, or cascading, when it is a machine learning problem. Beyond competitions, internet companies like Amazon and Google also use these algorithms, such as XGBoost, AdaBoost, gradient boosting, and random forest, in their own products in production. Quora uses random forest for question matching. These are very powerful algorithms that are often used in industry as well as in many Kaggle competitions, and around 99% of Kaggle competition winners use some kind of ensemble learning when it is a machine learning problem.

In a deep learning problem the choice is different: for an image problem you would go with a CNN, and for text you would go with word embeddings and then something like an RNN. So it depends on the problem. But in most competitions in the past we have seen that the winners used ensembles. You can see the case

studies in detail if you want.

So we have talked about ensemble learning in this section, and now we will start a subsection on bagging, where we will deep dive into it. Just to recall: ensemble learning is an ensemble of base models. A base model can be logistic regression, it can be naive Bayes, it can be a support vector machine. For classification, the ensemble takes the majority of votes and gives you a y hat. In regression, we take the mean or median of the values.

Now we start the subsection on bagging. Bagging is again an ensemble learning technique, and it is also called bootstrap aggregation or bootstrap aggregating. The word bagging is derived from Bootstrap AGGregatING. I think my spelling is a little bit wrong, but just bear with me. Bootstrap is a statistics term, which you may recognize from your statistics and probability classes. So let's see the basic overview, the intuition behind

bagging intuition behind bagging so what we do in bagging so let's take an example that you have a data set that we have a data set d okay you have a training data which is d so you have a data so you have a data like this and what you have done you divided your Data 80 percent for training and 20 for testing okay so now what you do that's just to make sure that you are on the same page on the dividation of your data also okay so you take your training data you trade

your training data and your data is just um you take your training data like this and it's for your training it's a supervised learning algorithm so um you have x i and y i so you have your in x i With your label by i for all i one i equals to one all the way down to the m and m r is the length of your training examples okay so this is your data this is your training data this is your training data now what you do you sample some points let's take this sample

k point or i uh let's see an example that you sample k points from this data set you sample you sample K points with the resp replacement with the replacement you sample k points with displacement from this data and you feed this to this model which is uh let's uh now you got d one which is d1 okay so you have sampled the subsystem sub subset of a data from this large data set okay sample k points you sample k points and now what you do you train your model using this subset of a data
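Sampling k points with replacement can be sketched as below. This is a minimal illustration using only Python's standard library; the helper name `bootstrap_sample`, the six toy training pairs, and the fixed seed are assumptions made for the example, not part of the course.

```python
import random

def bootstrap_sample(data, k, rng):
    """Draw k items from data with replacement (the same item may repeat)."""
    return [rng.choice(data) for _ in range(k)]

# Hypothetical training set of (x_i, y_i) pairs with m = 6 examples.
training_data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
d1 = bootstrap_sample(training_data, k=6, rng=rng)
d2 = bootstrap_sample(training_data, k=6, rng=rng)

print(len(d1), len(d2))               # 6 6
print(set(d1) <= set(training_data))  # True: every sampled point comes from D
```

Because each draw is independent of the previous ones, a sample can contain the same point several times, and you can even ask for more points than the dataset holds.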

You train your model using this subset of the data, and now you have your model M1. You may be thinking, "Can you repeat that? It all went over my head." No worries, I'll explain it again.

You take your training data, which is a supervised learning problem where you have labels. Now you take out a sample with replacement; what I mean by sampling with replacement I will explain right after I make another sample. For now, just understand that we take out a random sample of around k points from the data. The number of points to sample is a hyperparameter that we have to choose. We call this sample D1, and then we feed it to the model, meaning we train our model on this data.

Then we again take out a sample of k points with replacement. What do I mean by that? If a data point is in the first sample, it can also be in the second one. It is not necessary that the data be different; some data points can be the same here and here. In fact, the same point can even be drawn more than once within a single sample. That's sampling with replacement. So you again sample k random points, not necessarily different from those in D1, and here you have another sample of data from the large dataset. Then you train your model on this data, which gives

M2. Now you again take out a sample of k points, and you repeat this once for each base model. The number of base models is also a hyperparameter; say there are k of them as well, so you end up with samples D1 through Dk and models M1 through Mk. Now that you have your k models, you aggregate them into one large model M.

In the case of classification, the majority vote decides. Take the diabetes prediction example from before: if M1 gives 0, M2 gives 0, M3 gives 0, and M4 gives 1, the majority is 0, so the majority vote gives the final prediction, y hat = 0. In classification we take the majority; in regression we take the mean or the median.

That's the basic intuition behind bagging, and I really hope you understood it. Let me explain it again; if you have already understood, you can fast-forward, but some students may not have followed this yet, so here is a summary. We have a large training dataset D. We sample k points from the

large training data, any random sample, and then we train our model M1 on this sample; it can be logistic regression. Now we again sample k points, also with replacement, so the data need not be different from D1; some data points can be the same. Then you train again, say a support vector machine, and again, say naive Bayes. So you train each model, and then the majority vote works just as in ensemble learning in general. That's what bagging is, intuitively, and you can see that we are not involving any mathematics; this is just a bunch of concepts.

I want to make sure you are all on the same page about what it helps with, which is a great question: bagging helps you reduce the variance. We will spend about ten minutes on this as well. So bagging

helps in reducing the variance. But before that, there is one thing I want to highlight. Here we have all our base models, M1, M2, all the way to Mk. These base models are usually high-variance, low-bias models. What do I mean by this? Usually we do not do a lot of fine-tuning on them, so each one simply overfits; it fits whatever data it gets very closely, which is why it is not underfitting. For example, in random forest we just use a simple decision tree initialized with no limit on depth, so it can grow as deep as it can, and it overfits. So the base models are high variance and low bias.

So how does bagging help? It reduces your variance and makes your model more robust, and it does that by combining the models. You have low-bias, high-variance models; if you combine them by majority vote, you are much more likely to get a good, correct output. So you combine the base models, and then you get a low-bias, low-variance model.
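Putting the pieces together (bootstrap samples D1, D2, ..., one base model per sample, then a majority vote), a from-scratch sketch might look like the following. Everything here is an illustrative assumption rather than the course's own code: the function names `train_1nn`, `bagging_fit`, and `bagging_predict` are hypothetical, the one-feature dataset is made up, and 1-nearest-neighbour is used as the base model only because it is an easy example of a flexible, high-variance learner.

```python
import random
from collections import Counter

def train_1nn(sample):
    """A deliberately flexible (high-variance) base model: 1-nearest neighbour.
    'Training' just memorises the sample; prediction copies the closest label."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict

def bagging_fit(data, n_models, k, seed=0):
    """Train n_models base models, each on k points drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in range(k)]  # bootstrap: D1, D2, ...
        models.append(train_1nn(sample))               # base models M1, M2, ...
    return models

def bagging_predict(models, x):
    """Aggregate the base models' votes into one prediction (y hat)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: (feature, label), 0 = non-diabetic, 1 = diabetic.
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (8.0, 1), (8.5, 1), (9.0, 1), (9.5, 1)]
models = bagging_fit(data, n_models=5, k=len(data))
print(bagging_predict(models, 1.2), bagging_predict(models, 9.1))
```

For a regression problem, the only change would be in the aggregation step: replace the vote with `statistics.mean` or `statistics.median` over the base models' outputs.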

This is what you get after doing bagging: it helps in reducing your variance, and that is very good. One thing I want to highlight here is that we are doing row sampling; I'll come back to that. For now, just understand the main point: bagging helps in reducing your variance because your base learners, or base models, usually have high variance and low bias. In terms of the bias-variance trade-off, they are high-variance, low-bias models.
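One standard way to see why combining helps, given here as a textbook-style sketch under stated assumptions and not something derived in this course: suppose the ensemble averages the outputs of k base models M_1, ..., M_k, each prediction has the same variance \sigma^2, and any two predictions have correlation \rho (the symbols \sigma^2 and \rho are introduced only for this illustration). Then the variance of the average is:

```latex
\operatorname{Var}\!\left(\frac{1}{k}\sum_{j=1}^{k} M_j(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{k}\,\sigma^{2}
```

If the base models were fully independent (\rho = 0), this would be \sigma^2 / k, so the variance shrinks as more models are added. The less correlated the models are, the larger the reduction, which is why training each model on a different sample matters. The formula applies directly to averaging in regression; for majority voting in classification the same intuition holds, but this exact expression does not.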

So what you do is combine the base models into one large model, and then you get a low-bias, low-variance model M, the large model. Now I think you have understood why we say bagging reduces variance. Next: how do we sample our data? This is a great topic to talk about.

Here we are doing row sampling; we are not doing column sampling. Let me write it down. Say this is my large training data. While sampling from it, we do row sampling. Suppose we have d columns and m rows. In bagging we sample only rows, and those rows go into D1, D2, and so on. That's what we mean by row sampling. We'll see that in random forest we also do column sampling, that is, row sampling plus column sampling. But in bagging we do not do column sampling; we only do row sampling. So I

hope that you understood bagging as well. It is also called bootstrap aggregation: first you bootstrap, and then you aggregate your base models. That's where the name comes from.

Just to make sure you have all understood, I'm recapitulating this subsection; you can rewind the video to review ensemble learning, because I don't want to spend more time on that. Here you have your training data. You take out a sample of k points with replacement, a subset of the data, and you train your model M1 on it, which is your base model; it can be logistic regression, for instance. Then you again sample k points to get D2 and fit a new model on that subset. You do this for k models, or k subsets. Then you combine them: in classification the majority vote gives the prediction, and in a regression problem we take the mean or median of the outputs from the base models.

That's the intuition behind bagging. It helps in reducing your variance, because your base models are usually high-variance, low-bias models, and combining them gives a low-bias, low-variance model, which is what we want. And in bagging we do row sampling: we have d features, the columns, and m rows, and we sample from the m rows, not the d columns. We take all the columns, but we make a subset of the rows for training. That's what we do in bagging, and I really hope you understood this technique.

Now it's time to learn one good, powerful algorithm, a Kaggle-winning, production-level algorithm: random forest. Why do I think random forest is a very, very powerful algorithm to

work with? Let me write it down: random forest. Random forest is a bagging technique, or you can call it a bagging algorithm. Why do I call it very powerful? In my own professional experience I have used random forest a lot, and it is a very powerful algorithm whether you want to win a Kaggle competition on a machine learning problem or build a production-level machine learning system. Random forest is used by Quora, and then we have Google and Amazon; they all use random forest. But it rests on a basic concept that instructors often do not teach, and it's very important.

First of all, what do we do in random forest? Let's recall our decision trees, because now the decision tree will play a role. A decision tree makes a decision and splits your node. It is simply nested if-else statements: if the sepal length is smaller than some threshold, then y equals 1, and so on. It just asks a question at the node and then splits the node. You can go back to the decision tree section for the details. There we have attribute selection measures like entropy, information gain, and Gini impurity, which we saw in detail in the one-hour section on decision trees, and I hope you enjoyed that too. So that's the decision tree.

So what is random forest? Random forest is a combination of decision trees (DT) plus bagging plus feature bagging, which we also call column sampling. What do I mean by this? Let's understand this step

by step, so that it makes sense to you.

In random forest we have our decision trees, and you already know about decision trees. Say there are 500 decision trees. Random forest is an ensemble learning algorithm, so we have a lot of base models, meaning lots of decision trees. Here we have a large dataset. We sample our data to get D1 and train a decision tree on it. Then we sample D2 with replacement and train another decision tree on that subset. So you are not using different algorithms; you are using only decision trees, with row sampling. That is the bagging part: sampling rows with replacement and then aggregating your models by majority vote. But in random forest you do row sampling plus column sampling as well. In plain bagging we only do row sampling: we take all the columns, all the features, and only the rows are a subset. Here we also do column sampling.

So let's understand what I mean by column sampling, or feature bagging. You have your data, and it has d features, that is, d columns, like A, B, C, D,

and you have m rows. For sampling, you take random rows; say you took this row, this row, and this row, and you took A and B as your columns, and then you train on that. You are not taking the full set of columns or features; you have taken only A and B. For the next tree you took C and D, with some other rows, and you train a decision tree on that. So the trees are different, and if they are different, there is a much higher chance that your combined model will be very good. In ensemble learning, the more your models differ from one another, the better.

So that's random forest: we have a large number of decision trees as base learners, trained using the bagging technique, meaning row sampling, plus column sampling, which is also called feature bagging. Then you take the majority vote in classification, or in regression the mean or median of the outputs from the base models. That's what you do in random forest, and I hope you understood.

Now let's go a bit further: there is something called OOB, out of

bag points. Let me explain this concept with an example. You have your large training set D. Say you have taken a sample of the data as D1; the rest of the data, the points that were not drawn, are called the out-of-bag points. So you subtract: the training data minus the sample D_i, where i indexes the sample. The points left over after sampling are the out-of-bag points. They can be used, much like a validation set, for evaluating your model. There is a very good library called scikit-learn, and in scikit-learn, if you set oob_score to True, it will give you the OOB score as well. So that's the basic idea of out-of-bag points, and I hope you understood it.

Now let's recapitulate bagging and random forest. In bagging, we have our data D. You sample with replacement to get D1 and train your model M1 on this subset. Then

you sample D2 and train a model on D2, and likewise D3, giving models M2 and M3. What you are doing is row sampling with replacement. Then the majority vote goes into the large model, meaning you just aggregate your models, and you get your final prediction, y hat. This is the whole pipeline. In bagging you do row sampling only; in random forest you also do column sampling, and you take only decision trees as your base models. In bagging you can use different kinds of models, but in random forest the decision tree is your base model, and the decision trees are trained on different data. That's the basic intuition behind random forest, and I hope you also understood what column sampling, or feature bagging, is. And since random forest is a bagging technique, it also helps in reducing your variance.

Now let's look at the training and run-time complexity, because it is important to talk about that for random forest. For a decision tree (DT), the training complexity is on the order of n log n times d, that is, O(n log n · d).

So for a decision tree we have n log n times d. In random forest we train k decision trees, so the training complexity is about k times that: O(k · n log n · d), where the decision tree alone is O(n log n · d). Here n is the number of training rows, d is the number of features, and k is the number of models. It makes sense if you walk through it: you have a large dataset D, you take out a sample with replacement, using row sampling plus column sampling (feature bagging), and you train your model M1 on that subset. Then you do the same for D2 and get M2, all the way to Mk, so you have k models, each a decision tree.

Also, this training is trivially parallelizable. What do I mean? You can train these models in parallel: you take out the subsets and train on D1, D2, and D3 in parallel. I think I'm pronouncing that correctly: trivially parallelized. So that's the training and run-time complexity of random forest. We have covered a lot in a short span of time, and I hope that you're

I hope that you are enjoying this section a lot. Okay, so there is one more concept that I will talk about before ending this section, which is extremely randomized trees. Extremely randomized trees became popular after scikit-learn released its API for them. So what is this? In random forest (RF) we do column sampling plus row sampling with bagging, which is bootstrap aggregation. In extremely randomized trees we change how we try out the possible values of a feature $f_i$ to determine the threshold.

Take an example: at each node, a decision tree has some threshold (I am calling it epsilon) and checks whether the feature value is greater than it or not. To identify that threshold, the decision tree tries out every value of the feature, and that takes time.

So instead of trying out every value, we sample some subset of the values, check only those, and choose the threshold from them. In random forest we have a lot of decision trees, and identifying the threshold is the key step in each one, so trying all possible values is computationally expensive. Instead, you try out a sample of values from that whole column, and that's called extremely randomized trees.

Just to make sure that we are on the same page: in random forest we have decision trees with column and row sampling, and that reduces the variance. But in random forest we still try out every possible value for that threshold, which is time consuming. So in extremely randomized trees we take out a sample of values from that column and check only that sample to get the threshold.

If you are getting confused, don't worry. The extremely randomized classifier or regressor is not a popular algorithm. The reason is that if you only take a sample of thresholds, there is less chance that your model is as good as a random forest, so we will rarely use it.
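To make the threshold idea above concrete, here is a minimal sketch that compares trying every value of one column against trying only a small random sample of values. The helper names (`gini`, `split_impurity`, `best_threshold`) and the synthetic data are assumptions made for illustration. It follows the description in this talk; scikit-learn's own extra-trees implementation draws its candidate thresholds at random rather than sampling existing values, so treat this as an illustration and not as the library's code.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold):
    """Weighted Gini impurity after splitting on x > threshold."""
    right = x > threshold
    n = y.size
    return (right.sum() * gini(y[right]) + (~right).sum() * gini(y[~right])) / n

def best_threshold(x, y, candidates):
    """Return (threshold, impurity) with the lowest impurity among candidates."""
    scores = [split_impurity(x, y, t) for t in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Hypothetical one-feature dataset, used only for this sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (x + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Decision-tree style: try every distinct value in the column.
all_values = np.unique(x)
exhaustive_t, exhaustive_score = best_threshold(x, y, all_values)

# Sampled style: try only a small random subset of the values.
sampled_values = rng.choice(all_values, size=10, replace=False)
sampled_t, sampled_score = best_threshold(x, y, sampled_values)

print(f"exhaustive: {all_values.size} candidates, impurity {exhaustive_score:.3f}")
print(f"sampled:    {sampled_values.size} candidates, impurity {sampled_score:.3f}")
```

The sampled search does far less work per node, and its best impurity can never be lower than the exhaustive one, which is the speed versus quality tradeoff mentioned above.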

I rarely use the extremely randomized trees algorithm; it is not that popular. Okay, so we have talked a lot about this. I hope that you understood decision trees, then bagging, then random forest, and ensemble learning in general. So far in ensemble learning we have talked about bagging. In the next subsection we will talk about boosting, and in boosting we'll talk about three algorithms: gradient boosting, adaptive boosting, and XGBoost.

Here, under bagging, we have talked about random forest and extremely randomized trees. Gradient boosting is also called GBDT, gradient boosted decision trees, because it also uses decision trees as its base models.

But what are some of the disadvantages of random forest? It is a bad choice to use random forest or decision trees on a large dataset, because training is very time consuming. You can obviously still try it out, and you can read more about this on the internet.

Great, so now we have talked about everything, and it's time to get into the implementation part of random forest. I think that we have already covered the implementation of decision trees, so we'll talk about random forest with the scikit-learn library. So I'm on Brave; I started using Brave just some days ago. Here I will search for "random forest sklearn". We will also see one project that does all of this implementation, with GridSearchCV for fine-tuning the hyperparameters. Okay, so let's dive deep into this and understand this concept.

Let's see the implementation; there is a very fine explanation of random forest right here. Let me take out my Ink2Go again, because I have 212 days left before that license expires. I will take my pen, medium width, in a good red color. So you can go to this API, which is sklearn.ensemble. You can import it by writing from sklearn.ensemble import, and then you can use this API.
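As a small illustration of the import just mentioned, the sketch below fits a `RandomForestClassifier` from `sklearn.ensemble` with its default settings. The synthetic dataset from `make_classification` and the train/test split sizes are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data; replace with your own feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default settings; random_state is fixed only to make the run repeatable.
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"number of trees: {len(clf.estimators_)}")
print(f"test accuracy:   {accuracy:.2f}")
```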

So now you can use RandomForestClassifier. The first parameter is n_estimators, which is the number of decision trees that you want, that is, the number of base models. You have this data, you take samples from it, and you train that many trees. The default is 100 decision trees. This is a very important hyperparameter, so you fine-tune it using GridSearchCV or RandomizedSearchCV. We will see the implementation of these two.

After the number of decision trees you have criterion: what is the attribute selection criterion to use in each decision tree? The default is Gini impurity. You can also use entropy, but it is a little more computationally expensive, meaning time consuming.

Next is max_depth, the maximum depth of the tree at which to stop. If it is None, the nodes will keep expanding, so there is a high chance that the tree will overfit. It is also a hyperparameter. Then min_samples_split is the minimum number of samples required to split an internal node. Samples here means data points.
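The tuning step mentioned above might look like the following sketch, which searches over `n_estimators`, `criterion`, `max_depth`, and `min_samples_split` with `GridSearchCV`. The grid values, the 3-fold setting, and the synthetic data are assumptions picked to keep the run short, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data for the sketch.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid over the hyperparameters discussed above.
param_grid = {
    "n_estimators": [10, 50],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation for every combination
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

`RandomizedSearchCV` has a very similar interface and samples combinations instead of trying all of them, which helps when the grid gets large.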

Then you have min_samples_leaf, which we have already seen. max_features asks what you want: auto, sqrt, or log2; you can read the documentation, along with the impurity settings. Then bootstrap, which obviously equals True: it is the bagging part, the row sampling with replacement that goes along with the column sampling. oob_score is the score on the out-of-bag data points, the ones left out of the bag after sampling.

Then n_jobs: if you set n_jobs equal to -1, it will use all of your cores and run the trees in parallel, so it is fast. People usually set n_jobs equal to -1. verbose, I think, is for printing out some text; the documentation says it controls the verbosity when fitting and predicting. I had forgotten about class_weight, so let me see: class_weight is just the weights associated with the classes. The default is None, but you can see how it is calculated.

Next is max_samples: if bootstrap equals True, it is the number of samples to draw for each tree. If you set it to None, the default is X.shape[0], the number of samples, and you can also set some smaller max_samples.

Then you have some attributes. base_estimator_ is the child estimator template used to create the collection of fitted sub-estimators. Then you have the number of classes, the number of features, and the number of outputs when fit is performed. You can use the feature_importances_ attribute to understand the importance of each feature, meaning each column, in your model. oob_score_ will obviously give you a float value.

So that's random forest. You can read more about these attributes here. This is how the feature importances look, and you can also plot them like this.
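Here is a short sketch of the parameters and attributes just discussed: `bootstrap`, `oob_score`, `n_jobs`, and the fitted `oob_score_` and `feature_importances_` attributes. The synthetic data and the choice of six features are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 6 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,    # row sampling with replacement for every tree
    oob_score=True,    # score each row using only trees that did not see it
    n_jobs=-1,         # use all available cores
    random_state=0,
)
clf.fit(X, y)

print(f"out-of-bag score: {clf.oob_score_:.3f}")

# Rank the columns from most to least important.
order = np.argsort(clf.feature_importances_)[::-1]
for idx in order:
    print(f"feature {idx}: importance {clf.feature_importances_[idx]:.3f}")
```

The importances sum to one, so they can be read as each column's share of the total impurity reduction across the forest.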

You can also plot it to see which features matter, and you can use it for feature selection. Okay, great. The second one is RandomForestRegressor; let me check that I am under the same path. It is very similar to the classifier. Here we have n_estimators, the number of estimators, and criterion, which can be MSE or MAE. Then max_depth, min_samples_split, and bootstrap equals True, which you should obviously keep.

oob_score defaults to False, so you choose yes or no. The rest is quite similar to the classifier. max_samples is None, and then it takes X.shape[0], the number of your samples, as the default. max_samples is also a hyperparameter. So that's the basic intuition behind this. Great, we have talked a lot about bagging, and we have spent one hour on it.

You can see some more attributes over here to understand it much better. Let me also show you one diagram that I found very interesting. You can see some examples using RandomForestRegressor, including one with stacking. You will also see stacking later; it is a very good technique, one that Kaggle winners use. But here you can see that we only have MSE; we don't have RMSE. RMSE is not available as a criterion.

For RMSE you have to make your own function, which is just taking the square root of your MSE. You cannot get RMSE directly from this library, so either you write your own function or you make a pull request on scikit-learn adding RMSE. Okay, great.
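A hand-written RMSE helper of the kind described above could look like this sketch. The function name `rmse` and the synthetic regression data are assumptions. Note that it is used here to evaluate predictions after training; it does not change the split criterion inside the forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical toy regression data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)

print(f"test RMSE: {rmse(y_test, reg.predict(X_test)):.2f}")
```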

So now we have seen these algorithms. There is one more left, which is extremely randomized trees; in scikit-learn it is called ExtraTreesClassifier. It is obviously in scikit-learn, because it is a newer release there. Let me open it. Here again we have n_estimators, criterion, and max_depth, just the same. Then we have bootstrap equals False. Here bootstrap defaults to False; I think it should be True, and you can set it to True.

Then we have some more parameters like auto, sqrt, and so on, just the same as before. The difference is what I have just told you: we choose a random sample of values, and then we take the threshold out of that sample. That's what we do in the extremely randomized classifier. Let's see the regressor; I think I'm not on the right page here. Okay, here is the regressor, and it is just the same.

Here it is the same thing as we have seen, with bootstrap equals False. n_jobs is the obvious one: as you understood, the training is trivially parallelizable, and this option turns that on. Okay, great. So now we have seen random forest, and I hope that you understood everything here. In the next subsection we will talk about boosting, and I hope that you will understand that also. So let's meet in the next section. Till then, bye bye.

Okay, so now we will talk about another ensemble technique, which is boosting. Boosting is, again, one of the most popular ensemble techniques.

It is as popular as bagging, which we have already talked about; you just have to recall a few things about bagging that are necessary here. So we have talked about bagging, and now we will talk about boosting, which is another technique of ensemble learning. I really hope that you will enjoy this section. In boosting we'll talk about gradient boosting, then I will give you the intuition and show you the implementation of adaptive boosting, which is often called AdaBoost, and then I will talk about extreme gradient boosting, XGBoost. XGBoost works best in some cases, and gradient boosting also works very well, so we will see these three.

But before we start, I just want to tell you that you can do the problem set, which is on GitHub, and you can subscribe to the YouTube channel, New Era, because this highly motivates me to make content like this for you all, absolutely free. The channel has about 500 subscribers, and I am putting out a deep learning tutorial there soon; announcements are coming up.

Okay, so let's start. First of all, what is bagging? We have already talked about it, and you can rewatch the bagging section if you want, maybe at a faster speed. In bagging, our base models usually have low bias and high variance, and using bagging we reduce this variance. So after applying bagging, our output is a low bias and low variance model.

What were we doing in bagging? We simply do column sampling as well as row sampling (take random forest as the example), and then we do the aggregation. Just as a recap: we have this large dataset $D_n$, and we draw subsets of the data with replacement. I hope you remember what "with replacement" means: the data points in one subset need not be different from the ones in another; the same point can appear in the second subset also. So you sample n points, then you again sample n points with replacement, and again, so you take three subsets of your data. Then you train a model on each subset, M1 on the first and so on; each could be logistic regression, linear regression, or a support vector machine. So you have three models.

If it is a classification problem, you simply take the majority vote to decide that a particular example is of a particular class. If it is a regression problem, you take the mean or the median of the predictions. These are called our base models, and in random forest the base models are decision trees.
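The bagging recap above can be sketched by hand as follows: three bootstrap samples, one decision tree per sample, and a majority vote at the end. The synthetic binary dataset and the use of exactly three trees are assumptions made to mirror the example in the talk. Column sampling is left out to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical binary classification data with labels 0 and 1.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
n = X.shape[0]

models = []
for _ in range(3):
    # Row sampling with replacement: the same row may be drawn several times.
    rows = rng.integers(0, n, size=n)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows]))

# Aggregation: each model votes, and the majority wins (2 of 3 for class 1).
votes = np.stack([m.predict(X) for m in models])   # shape (3, n)
majority = (votes.sum(axis=0) >= 2).astype(int)

print(f"training accuracy of the vote: {(majority == y).mean():.3f}")
```

For a regression problem the last step would instead average the three predictions.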

we have As a decision trees okay so the this usually helps in reducing your bias and reducing your bias okay so that's that's the bagging so let's see what what we have in uh boosting boosting is just opposite in case of bias and variance tradeoff so just just now i think that you know why we have learned bias and variance a lot okay so in boosting we have low variance in the in in this case we have high variance but we have a low variance High bias model here we our model is under fitting we

have high bias model okay so it is not performing well on the training set okay so here we have a high bias model and then what we do we additively combine additively additively combined what do i mean with this this is a very great uh question this is this is what we do in here in bagging we are doing the randomization we are doing the Bagging means column sampling then row sampling then we have aggregating the model okay but in boosting we have additively combined that here it combines the week with converts the weak learners

speak models into a strong model okay so different different weak model uh tends to be a great model okay just we will see how it tends to be a great model so it is just combined so let's see let's see the core idea so the the Basic idea of using boosting is using boosting is to reduce bias if you have a hard bias model then you likely to use boosting to reduce the bias of that model ah yeah there is uh some other techniques also that that we've already seen but it's just we use boosting

to reduce the bias okay so let's see so let's see the the basic intuition about boosting so let's take an example that you have your data you have your Training data so i'm just writing my training data uh where do i write here you have your training data which is your training data i'm just writing train d okay which is usually which is this this is a supervised learning and a supervised learning technique so you have x i and y i with which is a label and it goes i equals to 1 or we can

write the i equals to 1 All the way down to the maybe m okay here m is the number of training examples into that a training data okay and then you make a model and then you make a model then you make a model at m1 so let's take an example that you have made a moral m1 that simply uh that's simply a function f of that is simply a function f of x which is our hypothesis okay so now what we will do this is the this is the core idea behind boosting so so

we Have a training data you have your training data like which is your training data and you train your m1 model onto that now what you have seen now what you have seen you simply take out the error you simply take out the error means the cost function the loss okay loss for either example loss for eight example so for each example you have y i which is a ground truth which is your ground truth minus your moral predicted value while more predicted Value is f of x which is uh even when you input x

then you get your output so model predicted value is f of x okay this will give you the loss okay the loss for if it is if it is 0 then this is very good if it is large then it is very bad okay so here we are using just a simple regression loss means just a loss so if you add a submission over here if you add a submission i equals to one all the way down to the m y i minus f f of x then this is the like mse but You do

a square in msc but now because of the gradient decent to effectively compute the derivative but here uh you can you can just see just just as a simple conversation what we uh just just we are taking the laws for each training example so here let's take an example that we have take take take in a loss and then what we do and then what we do we train our model onto this loss means we train our model to reduce this a residual error To reduce this residual a residual uh error okay we probably focus

or we we more focus on to the data which is misclassified or mis uh which is which has a very large uh error okay so we focus on that okay so what you do um so for really so you want to reduce this error so how you want to reduce you want to make a model make a model m that that that is given x i means you fit this Residual so let's take an example i'm just i'm just going to take take an example to make you this statement clear that what you do so
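The residual-fitting step described above can be sketched with two very shallow regression trees: the first one fits the targets, the second one fits what the first one got wrong. The sine-shaped toy data and the depth-1 trees are assumptions chosen so that the first model is clearly a weak, high bias model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression data.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# M1: a weak model trained on the original targets.
m1 = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
residual = y - m1.predict(X)          # y_i - F(x_i) for every example

# M2: a second weak model trained on the residual of M1.
m2 = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residual)
combined = m1.predict(X) + m2.predict(X)

mse_m1 = float(np.mean(residual ** 2))
mse_combined = float(np.mean((y - combined) ** 2))
print(f"MSE of M1 alone: {mse_m1:.4f}")
print(f"MSE of M1 + M2:  {mse_combined:.4f}")
```

Adding the second model can only lower the training error here, because each leaf of the second tree predicts the mean residual of the points that fall into it.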

Let's take an example where you have a regression problem. Here is my x-axis (a bad x-axis, I don't know why my drawing is not working), and you have data points like this, and you have fit a straight line. For this training example far from the line, the loss will be very high, so the next model will start focusing on it and will fit that residual, covering this point.

If the residual of another point is very high, it will cover that one too. It always tries to minimize the residual, so the combined model can end up as a nonlinear curve rather than a straight line. So what we do is fit the error: we make a model that minimizes the error by fitting on the errors $l_i$, for $i = 1$ all the way to $m$. It focuses more on the data that is either misclassified or has a high error, such as MSE or MAE if it is a regression problem. That's the basic intuition.

Now let me write the full equation. You have one big model, $F_k(x) = \sum_{i=0}^{k} \alpha_i f_i(x)$. So you will not have only one model. You have the first model that fits the data, the second model that fits the residual from the first, the third model that fits the residual from the second, the fourth model that fits the residual from the third, and so on. We have $k$ models, and we go through each and every one with some coefficient $\alpha_i$ (I will also call it lambda). For now just assume that we have computed this constant somehow; we will see how it is computed when we learn about gradient boosting.

Here each $f_i(x)$ is trained to fit the residual from the previous model. $M_2$ is trained to fit the error we get from $M_1$, so it is able to correct the errors of the first model; $M_3$ corrects the errors from $M_2$; and so on. That's the basic intuition behind boosting, and I think that you have understood it. This function will end up giving you a low bias and low variance model.

But there is a problem. You may say, "Hey Ayush, can you name the problem?" Sure, I am here to name it. But before that you can take a break if you want, because I think it will take about one more hour to complete this section; that's my approximate time. You can take a break or do the problem set, but my suggestion is to start with the section, complete it, then see the problem set and the challenges given to you on my GitHub. I really believe that you will be able to do them.

So here is the recap. First you have the training data. Then you have the loss for each training example. Then each model tends to fit the residual error from the previous model, and in that way they end up being a low bias model that is very good on the training set. The problem is that if it is too good on the training set, say 100 percent accuracy, it will start overfitting: when a new example comes, it will not be able to give a good prediction. That's why we have to take care of how many base models we use, and that's also why we have lambda. We will understand all of this.
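To see the additive model and the overfitting concern together, here is a minimal sketch of the boosting loop for squared error. Every stage fits a small tree to the current residual and is added with the same constant coefficient. The value `alpha = 0.5`, the 50 stages, the depth-2 trees, and the train/validation split are all assumptions for illustration; this is not the exact gradient boosting algorithm, only the intuition from this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy 1-D regression data, split into train and validation.
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

alpha = 0.5   # constant coefficient for every stage (assumption)
k = 50        # number of base models

pred_train = np.zeros_like(y_train)
pred_val = np.zeros_like(y_val)
train_mse, val_mse = [], []

for _ in range(k):
    # Each new model is trained on the residual left by the models so far.
    residual = y_train - pred_train
    f_i = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, residual)

    # Additive combination: F(x) <- F(x) + alpha * f_i(x)
    pred_train += alpha * f_i.predict(X_train)
    pred_val += alpha * f_i.predict(X_val)

    train_mse.append(float(np.mean((y_train - pred_train) ** 2)))
    val_mse.append(float(np.mean((y_val - pred_val) ** 2)))

best_k = int(np.argmin(val_mse)) + 1
print(f"train MSE after {k} models: {train_mse[-1]:.4f}")
print(f"validation MSE is lowest with {best_k} models")
```

The training error keeps going down as models are added, while the validation error is what tells you how many base models are worth keeping.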

Great, so we have understood this. The core idea behind boosting (sorry, I said bagging, but this is boosting) is not too hard: it is to reduce the bias on the training data. Boosting converts the weak learners into a strong learner, whereas in bagging we have well-tuned models whose outputs are combined by majority vote. So that's the basic intuition behind boosting.

There are some techniques that we'll study. First, gradient boosted decision trees, often called GBDT: in random forest the base learner is a decision tree, and in gradient boosting the base learner is also a decision tree. Then we have adaptive boosting, often called AdaBoost, which is another boosting algorithm. Then you have extreme gradient boosting, which is XGBoost. It is again part of a good family of algorithms, and it very often outperforms the others; it is the most powerful algorithm I've seen so far. I will show you the implementation of XGBoost also in this section.

So that's the basic intuition behind boosting, and I really hope that you understood it in detail. And why am I saying bagging? It's boosting, okay.

So, just to recap boosting: in boosting we have low variance and high bias models, and we additively combine them, which converts our weak learners into strong learners. The core idea is to reduce the bias, which simply means that we want to improve our error on the training set. We have three algorithms to cover in this section: gradient boosted decision trees, AdaBoost, and XGBoost. I think that you have understood boosting.

Now we will talk about gradient boosting, which is again a very fantastic algorithm. It is used by internet companies and very big companies for ML in production. We'll study it in detail so that you understand each and every concept of gradient boosting. Then we will see the implementation, then AdaBoost, then XGBoost, and then we will be done with ensemble models and move on to unsupervised learning techniques.

So let's start with gradient boosting. We have talked about the basic intuition behind boosting and what boosting actually is. Gradient boosting is one of the most powerful algorithms that I've seen so far, and it is very good to have in your toolkit. I'm on the Wikipedia page for it, which gives a very good intuition behind gradient boosting, and I will also make use of my blackboard.

So what is gradient boosting? Gradient boosting is a boosting algorithm that converts weak learners into strong learners. You can search for "gradient boosting wiki", and you will find the gradient boosting Wikipedia article and can read it in full. I will cover everything in a short period of time; I've already covered some of it, and I will cover more than the Wikipedia article does. I will also give you some real-world examples of where gradient boosting is used. But first of all, let's understand the algorithm.

Let me see whether my pen is working. Yes, it is working, great. If the Ink2Go team is watching, please improve this; it sometimes lags, but it is a very good tool, so check it out.

So you have the input training data, which is simply the set of $x_i$ and $y_i$, where $i$ goes from 1 all the way to $n$, or $m$. Let's stick to our usual notation, $m$, the number of training examples. That's our given data: $D_{train} = \{(x_i, y_i)\}_{i=1}^{m}$.

Then we have a differentiable cost function. What do I mean by a differentiable cost function? It means we can take the derivative of our cost function with respect to some value, like the partial derivative of $J(\theta)$. If you're a calculus student, you will know what differentiable means. If you're not, feel free to skip the word and just think that we must be able to take the derivative, because for gradient descent we have to take the derivative of this loss function with respect to the weights. So we are given a training set $D_{train}$ of $x_i$ and $y_i$, with $i$ from 1 all the way to $m$.
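As a small worked example of what "differentiable" buys us, the sketch below takes the squared error loss for one prediction, writes its derivative with respect to the prediction by hand, and checks it against a numerical finite-difference estimate. The function names and the three sample values are assumptions for illustration.

```python
import numpy as np

def squared_loss(y, f):
    """Squared error between the true value y and the prediction f."""
    return (y - f) ** 2

def squared_loss_grad(y, f):
    """Analytic derivative of (y - f)^2 with respect to the prediction f."""
    return -2.0 * (y - f)

# Hypothetical true values and predictions.
y = np.array([3.0, -1.0, 0.5])
f = np.array([2.5, 0.0, 0.5])

# Central finite difference: nudge the prediction up and down by eps.
eps = 1e-6
numeric = (squared_loss(y, f + eps) - squared_loss(y, f - eps)) / (2 * eps)
analytic = squared_loss_grad(y, f)

print("numeric :", numeric)
print("analytic:", analytic)
```

Note that the negative of this derivative is just two times the residual `y - f`, which is why fitting the next model to residuals and following the gradient point in the same direction for squared error.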

So $i$ goes from 1 all the way to $m$, and then we are given a differentiable cost function. Let's write it as $J(y, F(x))$. This is simply the error, the true value minus your model's predicted value for each example $i = 1$ to $m$, and then we square it: $J(y, F(x)) = \sum_{i=1}^{m} (y_i - F(x_i))^2$. That's our cost function. It can be mean squared error, or it can be log loss for classification, and so on.

Then we are given the number of iterations, $M$. $M$ here is the number of base learners (capital $M$, so that it does not clash with $m$, the number of training examples).

Now let's start with the algorithm. Let me choose a different color so that it is easier to follow. We initialize our model with some constant value lambda. Earlier I called this coefficient alpha, but now we have changed it a little to lambda. We have to find the lambda that minimizes the training loss: $F_0(x) = \arg\min_{\lambda} \sum_{i=1}^{m} J(y_i, \lambda)$. I should have told you this here, but I told you earlier in my excitement. So you initialize your model with the lambda that minimizes this cost function, and we will see how this lambda is computed.

After you initialize, you iterate through each and every model. You do this for the first model, then again for the second model, meaning you apply a for loop over the models: one, two, three, all the way to $M$.

So, after you initialize your model, inside the loop you compute the pseudo-residuals. They are named r_im, where m denotes which model and i denotes which training example. To compute r_im you take the negative partial derivative of your loss function L(y_i, F(x_i)) with respect to F(x_i), your model's predicted value, for i = 1 all the way to n. The derivative is evaluated at F = F_{m-1}, meaning the previous model.

So you compute those residuals and then you fit the base learner, as I showed you earlier in boosting, where we fit the residuals. Let's take an example: we have models M1, M2 and M3. M2 is trained to fit the residuals from M1, and M3 is trained to fit the residuals of M2.

So then you fit a base learner, or weak learner, for example a tree, closed under scaling, h_m(x), to the pseudo-residuals. That means you fit the learner to the residuals from the previous base learners: you train it using the training set of pairs (x_i, r_im), where m is the index of the model. So then you fit your model, just as we have

seen, and then you compute the multiplier, a constant lambda_m; each model has its own lambda_m. You get it by solving a one-dimensional optimization problem. How to solve the 1D optimization is outside the scope of this course, but you can read about it at the link. You compute the lambda for each model that minimizes the loss between y, the ground truth, and the previous model plus lambda times the model that we have just fitted: lambda_m = argmin over lambda of the sum over i of L(y_i, F_{m-1}(x_i) + lambda * h_m(x_i)).

Now you update the model by adding the learner that was fitted to the previous residuals: F_m(x) = F_{m-1}(x) + lambda_m * h_m(x). Here m is the model number, F_{m-1} is the previous model, lambda_m is the constant from the one-dimensional optimization, and h_m(x) is the learner fitted to the pseudo-residuals, which came from the partial derivative of the loss L(y_i, F(x_i)) with respect to F(x_i), the model's predicted value.

When the loop is done you output the final model, F_M(x), which is simply the previous model F_{M-1}(x) plus the new model scaled by the multiplier from the 1D optimization problem. So that's the gradient boosting algorithm. I really hope that you understood it, but let's recap to help you understand it a little better.

So what do we do here? First we are given our training data, (x_i, y_i), going from i = 1 all the way to n.

Here n is the size of your training set. Then you have a differentiable loss function and the number of base learners that you want. Then you initialize the model with some constant. Then you apply a for loop, and in that for loop you compute the residuals and name them r_im, where i is the index of the training example and m is the index of your model. Each one is minus the partial derivative of your loss.

Then you fit a base learner, or weak learner (so called because boosting turns weak learners into a strong learner), closed under scaling, h_m(x), to the residuals. It simply means that you fit your model to the pairs x_i and r_im, where r_im is the residual we just computed.

Now you compute the multiplier lambda_m for each model index m by solving the one-dimensional problem: you find the lambda_m that minimizes the loss between y_i and the output of the whole model, which is the previous model plus lambda times the new model that was fitted to the residuals. Then you update your model. After the full iteration you output the big model F_M(x), which is simply F_{M-1}(x) plus the last scaled learner.

So that's the basic intuition behind gradient boosting. I really think that you understood gradient boosting. Now it's time for getting into something called regularization and shrinkage. But before that, why do we even need regularization and shrinkage?
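The steps of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation, under several assumptions that are not in the lecture: the loss is L(y, F) = (y - F)^2 / 2 (so the pseudo-residual is simply y - F), the inputs are one-dimensional, the base learner is a one-split tree (`fit_stump`, a hypothetical helper), and the 1D optimization for lambda_m uses the closed form that exists for squared error.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) to 1-D inputs by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum(
            (t - right_mean) ** 2 for t in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_learners):
    """Return F_M for the loss L(y, F) = (y - F)^2 / 2."""
    f0 = sum(ys) / len(ys)  # step 1: the constant that minimizes squared error
    learners = []  # (lambda_m, h_m) pairs added so far

    def predict(x):
        return f0 + sum(lam * h(x) for lam, h in learners)

    for _ in range(n_learners):  # step 2: m = 1 .. M
        # 2a. pseudo-residuals: r_im = -dL/dF at F_{m-1}, here y_i - F_{m-1}(x_i)
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        # 2b. fit the base learner h_m to the pairs (x_i, r_im)
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        # 2c. 1-D optimization for lambda_m (closed form for squared error)
        denom = sum(v * v for v in hx)
        lam = sum(r * v for r, v in zip(residuals, hx)) / denom if denom else 0.0
        # 2d. update: F_m = F_{m-1} + lambda_m * h_m
        learners.append((lam, h))
    return predict  # step 3: output F_M


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
mse_1 = mse(gradient_boost(xs, ys, 1), xs, ys)
mse_20 = mse(gradient_boost(xs, ys, 20), xs, ys)
print(mse_1, mse_20)  # training error with 1 learner vs 20 learners
```

On this toy data the training error keeps dropping as learners are added, which is exactly the behavior that motivates the regularization discussed next.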

As we have already discussed, in boosting we have a high-bias, low-variance problem, so we use boosting to reduce the bias. High bias means doing badly on the training set. After each iteration we fit the previous model's residuals, so the model does very, very well on the training set. So it may happen that it starts overfitting: our low variance starts increasing and becomes high variance, and that can cause overfitting. To avoid this we add regularization and shrinkage, which is again shown on the Wikipedia page. I think I can teach you this better than the Wikipedia page.

So you have this big model, F_M(x) = h_0(x) + the sum for m = 1 to M of lambda_m * h_m(x), where M is the number of models. If the number of base models increases, it will fit more residuals; if it fits more residuals, it will start overfitting; and if it starts overfitting, your variance will start going up. That will cause a problem.

So what do you do? You shrink by a factor. It is a Greek letter, nu, that looks like a v, so let's call it the parameter v. It is a parameter that controls how much each new model contributes: it gives a weight to the new model so that a single step does not go too far. It is a hyperparameter that you set, not something learned from the data. Your v should be between 0 and 1, and empirically it is found that v = 0.1 tends to give dramatic improvements in your models. That's why we use v = 0.1.

So you add a new weight to the model that I've just shown you, and that's called the learning rate: the rate at which the model should learn. You shrink the update of the big model: F_m(x) = F_{m-1}(x) + v * lambda_m * h_m(x), where v is between 0 and 1, and in practice it is found that v = 0.1 works

best. So that's it for gradient boosting, and I really hope that you understood this also. Now we will take a look at the implementation of the gradient boosting classifier and the gradient boosting regressor. But before that, let's see the training time complexity of your model.

Let's recall decision trees and their training time complexity. For a decision tree, which I'm writing as DT, we have roughly O(n log n * d), where d is the number of features. For a random forest you have O(n log n * d * k), where k is the number of decision trees. In gradient boosted decision trees we take decision trees as our base learners rather than other models, so we have O(n log n * d * M), where M is the number of models. So that's the basic time complexity of your gradient boosted decision trees.

Great. Now we will see the implementation of gradient boosted decision trees in the scikit-learn API. Let me search for "gradient boosting implementation sklearn". Here we found it; wait a few seconds, it is just loading. So now you are on the scikit-learn API page for GradientBoostingClassifier.

Let's see some of the parameters. First you have loss. The default is deviance, which refers to the log loss (newer scikit-learn releases call it log_loss). The other option is exponential: if you set exponential, then your gradient boosting becomes adaptive boosting, or AdaBoost. AdaBoost focuses more on the misclassified examples, but it's not used that much. So we have loss equal to deviance, which is the logistic regression loss, or log loss.

Then you have learning_rate, which shrinks the contribution of each tree by the learning rate. This is used to reduce variance so that your model does not start overfitting. The default is 0.1, and leave it at 0.1, because it's found to work best in most cases. Then you have n_estimators, which is the number of decision trees, and then subsample, criterion, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_depth and min_impurity_decrease; these are settings for each tree. Then there are a few more, such as validation_fraction, n_iter_no_change, tol and ccp_alpha, which are not that important for us. These are some of the hyperparameters. You can read about the parameters in detail here, and we have already talked about decision trees.

So let's see the implementation. The implementation of gradient boosting is just one line of code, which is here, and you can use predict_proba, which simply gives the probability of each class. First you import it: from sklearn.ensemble import GradientBoostingClassifier. Here is your data, and then you call the classifier with n_estimators, learning_rate 1.0, max_depth and random_state. Then you fit it, and then you get your score. So that's the basic intuition behind the gradient boosting classifier. Now let's see the gradient boosting regressor in sklearn.
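The classifier usage just described can be sketched as follows. The synthetic data from `make_classification` and the specific values `n_estimators=100` and `max_depth=1` are assumptions chosen for illustration; the lecture only names the parameters and the learning rate of 1.0.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption; any tabular data works).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "one line": build the classifier with the parameters mentioned above.
clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)   # mean accuracy on the held-out data
proba = clf.predict_proba(X_test[:3])  # one row per sample, one column per class
print(accuracy)
print(proba)
```

Lowering `learning_rate` toward the default of 0.1 applies the shrinkage discussed earlier, usually at the cost of needing more trees.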

It is very good to learn from the documentation. Here we are on the scikit-learn API documentation; let me use the pen. You can see that we have the loss function to be optimized: ls refers to least squares, lad means least absolute deviation, which is more robust, and then you have huber and quantile. The default is ls. (Newer scikit-learn releases name these squared_error and absolute_error.) Then you have learning_rate, which is for shrinkage, then n_estimators, then subsample, criterion and min_samples_split, which apply to each decision tree, and then the same things that we have seen so far. I hope that you understood.

Now let's see the implementation. I'm going fast because I also have to teach XGBoost and AdaBoost. You import GradientBoostingRegressor and train_test_split, then you split your data, then you create the GradientBoostingRegressor with the default parameters and fit it. Then you score it. The score is not log loss; for a regressor, score returns R^2. So that's the basic intuition behind how we implement the regressor and the classifier, which we will do in a project. I just wanted to show you how you can work from the documentation; it's very good to learn from the documentation.

Now it's time to learn about XGBoost, but before that let's learn about AdaBoost, adaptive boosting. Let me search for its sklearn implementation. If you set your loss to be exponential in gradient boosting, it is equivalent to the AdaBoost classifier. It's the same as what we have already talked about. We have two algorithm options, SAMME and SAMME.R, as you may have seen in the scikit-learn API. So how do you implement it? You import AdaBoostClassifier, then you call it, then you fit your model, and you are done. It's very simple. But remembering the concepts is very, very important, because sometimes you may have to implement your own version of an algorithm in your company.
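A minimal sketch of that AdaBoost usage, next to gradient boosting with the exponential loss for comparison. The synthetic data and the value `n_estimators=50` are assumptions for illustration, and the `algorithm` option is left at its default because its accepted values differ between scikit-learn releases. The two models are related, but their scores are not expected to be identical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost: import it, call it, fit it.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
ada_accuracy = ada.score(X_test, y_test)

# Gradient boosting with the exponential loss (binary classification only).
gb_exp = GradientBoostingClassifier(
    loss="exponential", n_estimators=50, max_depth=1, random_state=0
)
gb_exp.fit(X_train, y_train)
gb_exp_accuracy = gb_exp.score(X_test, y_test)

print(ada_accuracy, gb_exp_accuracy)
```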

You may think: can't we just study the implementation? No, you cannot only study the implementation. The scikit-learn API is great, but it can be very slow, and in some companies you need these models in production. Because it's a Kaggle-winning algorithm, we need to understand what is happening and how we can tune the hyperparameters. Here you can see the regressor again: you have loss equal to linear, and then you have square and exponential. On the classifier page you can also see things like the number of classes.

So now we know about this, and I really hope that you understood it. Now it's time for learning about extreme gradient boosting, one of my favorites. I think AdaBoost is not used much in production; maybe I'm not right here, but I think so.

So now we will learn about XGBoost, which is one of the most popular algorithms I have seen so far. It's a good algorithm, and it's one of the best algorithms for winning Kaggle competitions. So what does XGBoost do? It's an advanced version of gradient boosting: it has gradient boosted decision trees plus randomization with row sampling and column sampling. In plain gradient boosting we are not doing this. Maybe that's why it is named extreme gradient boosting. It works brilliantly in some cases and machine learning problems, but it needs a lot of fine-tuning.

So how does it differ from gradient boosted decision trees? You do the randomization by feature bagging, and then you have row sampling. For simplicity: we sample the rows as well as the columns of our data. In plain bagging we don't do column sampling, but in a random forest we do both. So in XGBoost we do gradient boosted decision trees plus the random-forest-style row and column sampling. That's the randomization, and that's XGBoost.

Now we will see XGBoost in detail from the official documentation. It is of course also available through the scikit-learn-style API, but the official package is fast, and it's better to learn from here rather than there. So let's find the Python package in the XGBoost docs; there are different packages, and here is the page on setting parameters. My Brave browser is not working well, so I have to switch back to Chrome.

Brave is a good browser, but sometimes it behaves badly for me. So let's see some of the parameters. There are different packages that have already implemented this algorithm. Here you have booster, which is gbtree, meaning gradient boosted trees; you can also use gblinear or dart. We'll use gbtree. Then you have validate_parameters, default false, which controls whether it validates your parameters or not. Then nthread, which defaults to the maximum number of threads. Then you have eval_metric, the evaluation metric. Then eta is the learning rate, 0.3 by default. Then gamma, 0 by default, is the minimum loss reduction required to make a further split on a leaf. Then max_depth is the maximum depth of each tree, and you can read more in the documentation about min_child_weight.

Then max_delta_step is the maximum delta step we allow each leaf output to be; if it is 0, it means there is no constraint. These settings help you get a more robust model that overfits less. Then you can see lambda, which is L2 regularization, like ridge, and alpha, which is L1 regularization, like lasso. Then we have tree_method, which is the tree construction algorithm used in XGBoost; you can see the reference, and it also supports hist and others. Then you have scale_pos_weight, updater, process_type, grow_policy, max_leaves, max_bin and predictor, and you can run on GPU or CPU. But this needs a lot of fine-tuning, and it takes a lot of time on CPU.

You can see more about this; it's a very long set of parameters, but we usually use the main parameters that we have listed. Now you can look at the Python package. First, you need to install it.

You can just run pip install xgboost. Then you call xgb.DMatrix: first you convert your data into a DMatrix and then you train on it. You can also work with pandas through the scikit-learn API, and there is a regressor as well, as we have seen. You can see from here how it is used, how we train this model and how we save this model, just like with a neural network. As you may know, TensorFlow has also started adding decision trees and ensemble learning, because these are some of the most powerful algorithms.

So we have seen a lot in ensemble learning, like bagging and boosting. Now one last step is left, which is stacking. After this section we will start with stacking to help you better understand what stacking actually is. We have seen a lot of applications, and I really hope that you enjoyed it. This section is just amazing: we have learned about boosting, gradient boosting, adaptive boosting and XGBoost. In one of our projects we will use XGBoost and AdaBoost with fine-tuning. So that's it for this section. In the next section we'll start with stacking; it is only about a 20-minute session, and then we will finish with a summary or revision of ensemble learning. So let's meet in the next section.

Okay, so now we'll talk about our last ensemble learning technique, which is called stacking of models. We will see how stacking works, how to implement it, and how it changes your accuracy. Let's recall the bias and variance of bagging and boosting. In bagging we have high variance

and low bias, and in boosting you have high bias and low variance. Remember these two for stacking.

Here I am on mlxtend, which is a library that implements the stacking classifier. We will see the implementation of this StackingClassifier, but before that we will learn how stacking actually works. You can get to this page by searching for "mlxtend StackingClassifier" and clicking on the first link.

So let's start. Here is our data set. Let me take out my favorite Ink2Go; I hope you remember it. Sometimes I roast it, but it is very good for annotating these kinds of things. Let me take the pen in red, which I like a lot, at medium width.

In bagging, what were we doing? We have a large training data set, and you take out subsets of this data, D1, D2, D3, with replacement, and train decision trees, or some other base learners of the same kind, on them. In stacking we have different base learners. We take the data and train a different model on each part: C1 is a model trained on the data D1, C2 is trained on D2, all the way to Cm. Here C1, C2, all the way to Cm are different models.

What do I mean by different models? It's quite simple. Let's take an example: you have your favorite logistic regression as C1, then your support vector machine as C2, then your favorite naive Bayes algorithm as C3, maybe with extensive fine-tuning using grid search, and then k-nearest neighbors as C4, maybe with k = 4, also fine-tuned. So we have four base learners: logistic regression, support vector machine, naive Bayes and k-nearest neighbors. They are four different models.

The major difference between bagging and stacking, and this is why I told you to recall bagging, is this. In bagging the base learners are high-variance, low-bias models, and we use bagging to reduce the high variance. In stacking, our base models, or base learners, are highly tuned: their parameters are well tuned and they have a good bias-variance trade-off. So in

bagging we have high variance, where our base learners are not that good, and in boosting our base learners are also not that good. But in stacking our base learners are very good, with extensive fine-tuning. Maybe you have tuned logistic regression with its learning rate; the support vector machine with C, your regularization parameter, and gamma, and maybe some other parameters; the naive Bayes algorithm, where it's not required but you have done some tuning; and k-nearest neighbors, where you have chosen the number k. So you have done extensive fine-tuning of your models. Here I'm taking a classification example.

So the models are highly fine-tuned, very good models, with a very good bias-variance trade-off, and each one gives a prediction. Let's say C1 gives you the prediction y-hat 1, C2 gives you y-hat 2, all the way to y-hat m. So each model gives you its output, the prediction.

As a motivating example, what you are actually doing is taking your training data, D_train, and dividing it into subsets D1, D2 and D3; let's take three for the example. Then you train your base learners, which have a good bias-variance trade-off. The base learners are very good models: they are neither high bias nor high variance. Again, in bagging you have high-variance, low-bias models. What do I mean by high variance? It means the model performs very well on the training data but fails to generalize to the testing data. And in boosting the base learner is underfitting: it has high bias, so it does not perform well on the training data, and low variance. So that's the intuition about bagging and boosting.

Here we have highly tuned models, and you take out their predictions y-hat 1, y-hat 2, y-hat 3. After that you train your meta-classifier, which is trained on the predicted class labels from the base learners, or on their predicted probabilities. So it is usually trained on the predicted class of

our base learners: it takes either the class labels or the probability of each class. So the input may be 0.65, or a particular class such as 0, 1 or 2. That's the basic intuition behind stacking, but let me recap so that it makes sense again.

What do you do in stacking? You take the training data, divide it into subsets, and train different classifiers on that data: C1, C2, all the way to Cm. Each one is trained and then gives its predictions. The first major difference is that the bias-variance trade-off of the base learners in stacking is good, whereas in bagging we have high variance and low bias, and in boosting we have high bias and low variance.

The next thing is that after you get the prediction from each model, which is good because you have done a lot of hyperparameter tuning, you have p1, p2, all the way to pm. Now you train a meta-classifier, usually written h', on the predictions h_1(x) of the first classifier, h_2(x) of the second classifier, all the way to h_m(x). You train it either on the probabilities given by these models or on the predicted class labels. In bagging we aggregate by a majority vote over the models and take that as the final prediction y-hat, but here we

have a different chance of here we have different approach okay so let's let's take a quick quick uh quick look at the at the Second at the stacking algorithm but it's i have already explained you and just just above but as an uh just as a formal definition here you have your training data which is usually x i and y i which is which goes from x i equals to 1 all the way down to the m where x i is the member of r which has an n dimensional maybe it can have a multiple

features and whereas the y i will be the member of the number Of the classes of y okay so that's the thing and then what you do and it's output will be the in symbol classifier which is va edge dash or a big edge okay first what you do learn first level classifiers means you learn uh team t classifiers maybe it can be logistic regression knight bass knight base it can be k nearest neighbors and uh [Music] one one All the way down to the t and t here is the number of base models okay

based on the data D. Now you construct a new data set of pairs (x_i', y_i) for i = 1 all the way to m, where x_i' is simply the predictions from each of the base learners, x_i' = (h_1(x_i), ..., h_T(x_i)). Now what do you do? First you trained the first level; now you train your second-level classifier, and it simply learns

a hypothesis on the outputs of these models. This meta classifier can be anything: logistic regression, naive Bayes, k-nearest neighbors, a support vector machine, or any other model. We are not aggregating, we are not taking the mean, we are not taking anything; we are just training a second-level classifier on the predictions of the base learners, and the base learners have a good bias-variance trade-off. So that's stacking, and I hope that you understood it.
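The three steps of the formal definition above can be written out by hand. This is only a minimal sketch: the use of scikit-learn, the three base learners, and the helper name `stacked_predict` are assumptions for illustration, not the code shown in the video. It trains the meta classifier on predictions for the same training data, which is the simple version described here; library implementations often use out-of-fold predictions instead.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: learn the T first-level (base) classifiers on the data D.
base_learners = [
    KNeighborsClassifier(n_neighbors=5),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]
for h in base_learners:
    h.fit(X, y)

# Step 2: build the new data set {(x_i', y_i)}, where x_i' holds the
# class label predicted for x_i by each base learner.
X_meta = np.column_stack([h.predict(X) for h in base_learners])

# Step 3: train the second-level (meta) classifier on those predictions.
meta = LogisticRegression(max_iter=1000)
meta.fit(X_meta, y)


def stacked_predict(X_new):
    """Hypothetical helper: base-learner predictions go into the meta classifier."""
    features = np.column_stack([h.predict(X_new) for h in base_learners])
    return meta.predict(features)


print(stacked_predict(X[:5]))
```

Swapping `h.predict` for `h.predict_proba` in both places would train the meta classifier on probabilities instead of class labels, which is the other option mentioned above.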

I will just take a look at the implementation of stacking. You can see over here that some research papers call it stacked generalization. Okay, so let's make a simple stacking classifier. First, we have simply loaded the data set from the scikit-learn datasets, which is the Iris data set.

The Iris data set is not too hard: we have sepal length, sepal width, petal length, and petal width. So we have four features, and based on those we have to predict the species of the flower, which can be setosa, versicolor, or virginica. So that's our data set.

Now you can see that first we import model_selection, and we will see what it does. Then we import LogisticRegression from the linear_model API of scikit-learn, then we import KNeighborsClassifier, which is again from the neighbors API of scikit-learn, and then you import GaussianNB from naive_bayes. I have not gone into too much of this algorithm, because you can just look it up on Wikipedia; you are now capable of learning any algorithm.

Now you import RandomForestClassifier from the ensemble API of scikit-learn, and from the classifier module of the mlxtend library you import StackingClassifier. First you have to install it with pip install mlxtend if you have Python, and then you can do this kind of thing. After that we simply import numpy as np, as an alias, and import warnings, because you want to ignore the warnings given by the models.

Your first model is clf1, which is KNeighborsClassifier with the number of neighbors set to one. If it was bagging, as in random forest, we would have taken only decision trees, but here we are taking different models, and that's what makes it work well: a random forest classifier, Gaussian naive Bayes, and logistic regression. Now that we have instantiated every object, we instantiate our stacking classifier. The classifiers will be clf1, which is k-nearest neighbors, clf2, which is the random forest, and clf3; these three are the base learners. It will give output like this.

Now you want a meta classifier, which will be the logistic regression; as I have told you, you can take logistic regression as a meta classifier. Now you perform three-fold cross-validation. In the three-fold cross-validation you are just looping through each clf together with its label, zipping clf1, clf2, clf3, and sclf with the labels KNN, Random Forest, Naive Bayes, and StackingClassifier.

Now you select the best model by scoring each of the models with scoring equal to accuracy. You check the accuracy, and you can see that the stacking classifier has an increase of around point four percent; it is a bit better than the random forest. The stacking classifier works best in this case.
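The comparison walked through above looks roughly like the following. As an assumption, this sketch uses scikit-learn's own `StackingClassifier` instead of the mlxtend class used in the video, so the argument names (`estimators`, `final_estimator`) differ from the code on screen, and scikit-learn builds the meta features with internal cross-validation. It also uses all four Iris features, so the accuracies you get may differ from the ones in the video.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

warnings.simplefilter("ignore")  # ignore the warnings given by the models

X, y = load_iris(return_X_y=True)

# Three different base learners.
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# Logistic regression as the meta classifier.
sclf = StackingClassifier(
    estimators=[("knn", clf1), ("rf", clf2), ("nb", clf3)],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Three-fold cross-validation, looping over each model and its label.
results = {}
for clf, label in zip(
    [clf1, clf2, clf3, sclf],
    ["KNN", "Random Forest", "Naive Bayes", "StackingClassifier"],
):
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    results[label] = scores.mean()
    print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f}) [{label}]")
```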

What you can do is actually plot the decision boundaries of the models, KNN and the others, using matplotlib. So let me annotate what the code is doing. First you import matplotlib, then you import plot_decision_regions from mlxtend, then you import gridspec, then you import itertools. Then you set up the grid and the figure, you loop

through by zipping the classifiers and labels with the product of grid positions, and then you fit each model, plot its decision boundary, and set the title from the label, where the labels are just KNN, Random Forest, Naive Bayes, and StackingClassifier. After you plot it, you are done with the stacking classifier. As I have already stated, for training the meta classifier you use the predictions from the base learners, so you

can take either the predicted classes or the probabilities as the meta features. You can set use_probas equals True and average_probas equals False, and you can see a little bit of the documentation over there. Great, now we have seen this, and it is not too hard. Now here is an example of stacked classification using grid search. It's your task to do this, but I'm just

going to annotate what the grid search will do. First of all, you tune the models so that you have a good bias-variance trade-off: you check values between one and five, and you also tune the meta classifier. Then you fit the grid search, you find the best result, and you print the results, meaning the best parameters and the accuracy. After applying that you will get a set of values which

are the best, for example k-nearest neighbors has these best parameters, and like that. You can do the same thing with k-nearest neighbors, a random forest classifier, and then the one which is the meta classifier. You can also do stacking of classifiers that operate on different subsets of features, by using a column selector and calling make_pipeline to select, say, two features. And you can use pre-fitted classifiers, which have already been fitted.
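The grid search over a stacked model described above can be sketched like this. It is an assumption-laden example: it uses scikit-learn's `StackingClassifier` rather than the mlxtend class from the video, only two base learners to keep it quick, and a made-up parameter grid. The names "knn" and "nb" are just labels chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

sclf = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)

# Parameters of a named base learner are addressed as "<name>__<param>",
# and those of the meta classifier as "final_estimator__<param>".
param_grid = {
    "knn__n_neighbors": [1, 3, 5],
    "final_estimator__C": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(sclf, param_grid=param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)

# The best parameters and the accuracy they reach.
print("Best parameters:", grid.best_params_)
print("Best accuracy: %.3f" % grid.best_score_)
```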

With pre-fitted classifiers in your model, fit_base_estimators equals False means you don't want to fit your base estimators again. You can also plot the ROC curve, and you can see the ROC curve like this. These four examples are very important, and you can read about the ROC curve in more detail on the internet. Great, so now we have talked a lot about the stacking classifier and I showed you its implementation. Here is a very interesting and brilliant diagram

of the stacking algorithm. This is a good example of stacking that I want to give you; thanks to the author, and full credit goes to them. There are some add-ons that I am just annotating. Okay, so now we are done with ensemble learning, so let's recapitulate what we have seen so far. Let me write a good accomplishment: completed ensemble learning. So what have we seen? We

have seen bagging. In bagging the basic intuition is: we have our data, we take out different samples of the data, we train a different model on each, and then we aggregate the majority of votes from the models. In bagging we learned about one algorithm, random forest, which uses base learners with row sampling and feature sampling, that is, feature bagging. Then we learned about boosting, where we are kind

of converting weak learners into strong learners. Then we talked about gradient boosting decision trees, then about adaptive boosting, which is AdaBoost, which just uses the exponential loss, and then we talked about XGBoost. In this subsection we have talked about stacking, where we have seen a lot of examples using grid search and so on. So now I hope that

you understood everything about machine learning, sorry, ensemble learning. If you have watched until now, I would say that you know everything about supervised learning that you need to crack any kind of interview, so just pat your back. Be sure to subscribe to my YouTube channel; it gives me great motivation. The new deep learning course is coming soon, so you can pre-register there for free, or if you want you can go to amtran.com for the detailed

CS01 course, which is again on machine learning. It is not that different, but it helps you to make resume-based projects, and you will get live doubt support, et cetera. You can go to CS01 to see the benefits, the course details, and the launch video, which has already been released. You can also apply for a scholarship if you're

a college student. The course price is like this, but it is totally up to you; if you want you can definitely consider it, and it supports making more free content. Just to answer a question: is this course the same as CS01? Yes, but in that one we go up to neural networks, then GANs and convolutional neural networks, you will get one-to-one sessions, you will get amazing Jupyter notebooks, you will

get early access to my books, you will work with the team, and you will be getting an internship to work on a real-world project, et cetera. So you can consider it if you want, but if you complete this course you will already be very comfortable in machine learning. So that's it for this whole section on ensemble learning, and you can consider subscribing. In between we can do some projects.

From the next section we'll be starting with unsupervised learning, and I really hope that you will enjoy that series also. So let's meet in the next section.

Okay, so now we have covered supervised learning, and you can now consider yourself comfortable with supervised learning. Supervised learning is one of the biggest topics in machine learning, you have covered it very smoothly, and I hope that you understood each and every thing about it. If you have any kind of doubt, you can ask in our Discord community or you can comment, and then we can answer your questions.

So let's start with unsupervised learning. What are we actually doing in supervised learning? We have our data set, which is x_i as well as y_i, with i = 1 all the way to m, and we have a supervisor, which is y_i. We know what our output should look like:

either a continuous value, which is a regression problem, or a class, which is a classification problem. In unsupervised learning there is no supervisor: we have only x_i, that is x1, x2, x3, all the way to x_m. We

don't have any kind of supervisor that will tell us what the output will be or help us train our model. We don't know what our output should look like, neither a continuous value nor a class, so if you don't have labels you cannot frame the problem as a supervised learning problem. So what do we have to do? But first I'm going

to just specify that we have structured data as well as unstructured data. Structured data is simply data in a tabular format, and unstructured data is something like images, which cannot be put into tables, CSV files, or Excel. So this is unsupervised learning, because we don't have our output. The main question to ask is: what do we have to

do if we don't have the labels? In this subsection I'm just going to give you an overview and some of the applications of unsupervised learning. So consider that you have your data points like this on an x and y plane, on the coordinate plane,

and you have other data points like this. You may ask, "Hey Ayush, why can't we make a simple hyperplane?" But you cannot frame it as a classification problem, because you don't have labels; I'm taking this just as a motivating example. So assume that you have this type of data set, which is two-dimensional. You can make it three-dimensional also; let's consider that you have three features.

So now you only have these types of data points, x1, x2, x3, and so on. In unsupervised learning, what you do is make clusters of your data: the points that are closest to one another go into one cluster. As a motivating example, say you want to segment your customers. Let's say you are a data scientist at Amazon or some other company

and you want to segment your customers. How will you do it? You have this data, but you don't have the y labels. So you put similar persons into one group, similar persons into another group, and similar persons into yet another group.

What you have done is divide your customers into segments. Now the data scientists or business stakeholders can decide, okay, we can give these customers a deal based on their activity, or we can make another machine learning model that will detect what they really want. And we can recommend products: maybe these types of customers like milk, or like looking at jewelry on Amazon.
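The customer segmentation idea described above can be sketched with a tiny hand-written k-means, an algorithm this course covers later. Everything concrete here is an assumption for illustration: the two customer features (orders per month, average basket value), the six made-up customers, the function name `kmeans`, and the choice of starting centroids.

```python
import math


def kmeans(points, initial_centroids, n_iter=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(n_iter):
        # Assign every customer to the closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of the customers assigned to it.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical customers: (orders per month, average basket value). No y labels.
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 210), (11, 190)]

labels, centroids = kmeans(customers, [customers[0], customers[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The two segments come out of the data alone; what to do with each segment (a deal, a recommendation) is then a business decision.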

Here it will show products like books or maybe some jewelry that they are interested in, here maybe it will show kitchen items and groceries, and for these customers it can be toys or maybe TVs. So they will be recommended some products, and the company earns a huge amount of money from recommendations, ads, et cetera. So you can see how we frame our problem, and it is now quite clear with the motivating

example of customer segmentation at Amazon. If you go to amazon.com, you will see that I am being recommended products: they know what I recently viewed, and they are recommending products based on my views. So I'm in some segment, and they are simply recommending products on that basis. I hope that you are understanding what I'm saying about

framing the problem as unsupervised learning. So that's the first example, customer segmentation or a recommendation engine, which is what is happening at Amazon. Or say you watch too many dog videos on YouTube. What YouTube will do is add you to the segment of people who watch dogs, and now they will show you dog ads, maybe Pedigree or whatever the dog food is, or recommend you

videos of dogs. So that's why we have customer segmentation and recommendation as motivating examples. I hope it is starting to make sense why we call this unsupervised learning. In supervised learning we have two approaches, regression and classification, but in unsupervised learning we have something called clustering. We cluster our data, and if you plot

in a high-dimensional space, you can imagine that the most similar items are close to each other and the most dissimilar items are very far away. So you can understand it like this. Now let's see some of the applications of unsupervised learning. I will spend a little bit of time telling you about them, and in the next subsection we'll start with making clusters: what is intra, what is inter, how do we make the

clusters, how do we evaluate our clusters, et cetera. So here I'm not talking in detail about clustering; I have just told you that we divide our customers into segments. Now we are taking a look at the applications of unsupervised learning. The first application is in biology, which I have taken from Wikipedia: sequence analysis. It simply

puts genes into particular segments, and this will help you in various cases. If you are a biology student you will understand this concept of sequence analysis. What you do is segment the genes into particular groups, and you can diagnose whether a person has any kind of problem. The next application is in business: grouping similar customers into clusters for business needs. Let's take an example: some company takes your

data and simply groups similar customers into clusters based on the business data. They can see what the activity of each cluster is, and they can provide deals and offers according to that activity. Another application is the recommendation engine, which recommends products based on what we have seen. In recommendation you have content-based filtering and collaborative filtering,

which is fairly advanced, so I'm not going to talk about recommendation engines here. But as a motivating example, whatever you see on Amazon or Google, they show you things based on what you have seen so far, because if you search anything, they have that data for making their models. Next is social network analysis. If you know about Facebook, they work out, hey, this person is likely to share this post, which means

they group users into segments and show profiles accordingly. Another one, our favorite in computer science, is image segmentation. What do I mean by image segmentation? You have an image, and you segment it by framing the problem as unsupervised learning. So you have your image like this, and

someone is here, someone is here, someone is here. You segment the image pixel-wise: pixels that are similar are segmented together. That's what we call segmenting, and it can then be used for classification or object detection. So it is just grouping your image into segments by using the pixels, because we don't have labels. Another motivating example, in natural language processing, is sentiment analysis, which means just

seeing whether a text is negative or positive. Let's take an example: we have this sentence and another sentence, so it will group the positive sentences and the negative sentences. Now what you will do is go and look at one of the sentences in a cluster, and if it is positive,

then you label the whole cluster, which is maybe 10,000 sentences, as positive, and you label the other whole cluster, whatever the number is, as negative. Let me elaborate on this NLP task of sentiment analysis. For example, assume that you have one billion data points which are textual. You convert them into word embeddings, meaning numbers. Now it is very hard to label them:

it will take compute, it will take time, and it will take cost, because if you hire people you have to pay them, and you would have to label one billion. So what you do is convert the texts into a high-dimensional space, which is the word embeddings, and segment the texts whose embeddings are closest together. Then you look at one sentence from this cluster and another sentence from that cluster. Now assume that this cluster is a

positive sentence: you label all, let's say, 50 million as positive and the other whole 50 million as negative. So you don't require a lot of time for labeling in NLP tasks. Another application is anomaly detection. What is anomaly detection? It is where we determine the outliers in our data. As an example, you have the x and y axes like this; now assume that most of your data

points are here, so this one is actually an outlier. If you cluster here, then this point will be ignored, and this can help in removing outliers, using maybe DBSCAN or isolation forest. But some algorithms, for example k-means, just take that outlier into a cluster: if it is closest to this cluster, they put it into that cluster. So we have DBSCAN, which helps us with anomaly detection.
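The outlier idea above can be tried with scikit-learn's `DBSCAN`, which gives the label -1 to points that belong to no dense region. This is a sketch on made-up data: the six points and the values of `eps` and `min_samples` are assumptions chosen so that the example is easy to follow.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data: a dense group of points plus one point far away.
X = np.array(
    [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [1.2, 1.0], [8.0, 8.0]]
)

# eps is the neighborhood radius, min_samples the number of points
# (including the point itself) needed to form a dense region.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# Points labeled -1 are treated as noise, that is, as outliers.
outliers = X[labels == -1]
print(labels)    # [ 0  0  0  0  0 -1]
print(outliers)  # [[8. 8.]]
```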

It is again an unsupervised learning model, and it works on density-based regions. Great, so we have seen a lot of applications of unsupervised learning, and I hope that you understood them. In the next subsection we will deep dive into clustering to help you better understand the topic. We will talk about inter-cluster and intra-cluster distance, how we evaluate whether we have a good clustering, and some of the types

of clustering, like partitional and hierarchical. Then we will talk about center-based, contiguity-based, and density-based clusters, and then about one formal center-based algorithm, which is the k-means clustering algorithm. Then we are done with unsupervised learning, and I will give you a little overview of deep learning so that you can better start your deep learning journey. Then we will do some projects based on machine learning so that you get more of a feel for it and are more comfortable. And be sure to

do the problem sets, which are uploaded on GitHub, to help you master the topic. So that's it for this section, and I hope that you have enjoyed it.

Okay, so now we'll start talking about clustering. We will get into the math and see some algorithms like k-means clustering, then hierarchical clustering algorithms, which are agglomerative and divisive, and then we are done with unsupervised learning. So we are entering the last phases of this course, and I really hope that you enjoyed this

course a lot, as I enjoyed making it with a lot of curiosity and energy. I think that you are also very energetic, like me, and you may also be thinking, "Hey Ayush, what about the projects?" The projects are in the last section, to help you get a feel for everything in machine learning: after you learn all of these things, performance metrics and algorithms, you will be able to build state-of-the-art models. So I'm very excited to see you all there. In the

last section we'll be building back-to-back projects. The course website is also in the description box below; you can see what the problem sets are, download them, and start working on them. I hope that watching this video was totally worth it for you; making it was totally worth it for me. So we are going to start talking about clustering. In our previous subsection

we talked about some of the applications of clustering. We have seen what unsupervised learning is, we have seen the diagram and a lot of applications, in biology, business, et cetera, then I showed you the data, and then I gave you an overview of what we do in clustering. So let's start with

clustering. So what is clustering? Just bear with my handwriting, because I don't know what happened to my pen, it's working badly, but no worries, let's start. Let's assume that you have an x and y plane, where you have two features.

Let's assume that you have some points here, the red ones, and other points, the green ones, which are here. But no, in unsupervised learning all the data points look the same, there is no y; I forgot, no worries. So you have these points here and these points here. What you usually do in clustering is

segment your data into different clusters: you segment these points into the first cluster and these into the second cluster. So this is your first cluster and this is your second cluster. This is the basic idea of clustering that we have talked about. Now we'll see some of the terminology, so let me draw a better representation of this diagram. Let me delete

all of this and choose a black pen; it works best on white. So you have the x and y plane, here you have data points like this, and here you have data points like this. What do you do? You segment your data points into clusters; you cluster them out.

Let me draw a good cluster: you cluster this, and then you cluster this. Okay, let's leave the drawing for now. So here is a terminology alert: we have something called intra-cluster and something called inter-cluster. Just listen to what I'm saying: intra-cluster and inter-cluster. What do I mean by intra and inter?

Inter-cluster is the distance between the clusters, across all the clusters. Here we have one cluster and here we have a second cluster, so inter-cluster is the distance between this cluster and this cluster. That's why it is known as inter-cluster. The second one is intra-cluster, which is the distance between the points within a cluster. So it may be like

this: this is called intra-cluster. I hope that is making sense. You have this x and y plane, and this is called inter-cluster, where we have the distance between two clusters, or across more than two clusters, and intra-cluster is the distance between the data points inside a cluster. Great.

we want what we want in this case if we are taking as a terminology that so we want our interact cluster to be small interact cluster interact cluster to be small to be small And inter cluster inter cluster to be large what do i mean with this small and large i mean with this small and large is you have this intra cluster the internet cluster is the distance between this and this and the clusters takes whoever is similar or group is a cluster is just as in is a grouping of the similar objects that they

are similar to one another so they should be Closest the data point should be closest and the intra cluster of these should be closest okay and the inter cluster inter cluster means the similarity dissimilarity between two clusters should be maximum okay so here we have written our optimum optimization objective uh we have we we want uh our enter cluster our inter cluster which i did with i n should be large and i And a should be intracluster should be small okay so that's the that's the basic definition of that's the basic uh basic terminology that

we have seen, and I hope that is making sense. Now we can frame something over here: we can frame an optimization objective, we can frame an evaluation technique. But before that, why do we even need one? Let's assume you have clusters like this: one, two, three, four, five, six, seven points here, and one, two, three, four, five, six, seven there. Who can tell you that these are the perfect clusters? You will say, "Hey, I'm able to see it; I can tell that these are the perfect clusters." But assume that we have three dimensions, four dimensions, fifty dimensions, eighty dimensions, a hundred dimensions, a lakh (100,000) dimensions. What will you do in that case? For that we have evaluation techniques, which help us evaluate our clustering, that is, our unsupervised learning model.
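When the data has too many dimensions to plot, the intra-cluster and inter-cluster distances can still be computed as plain numbers. Here is a minimal Python sketch under some assumptions that are not from the lecture: intra-cluster distance is taken as the average pairwise distance inside a cluster, inter-cluster distance as the distance between the two cluster centroids, and the function names and the toy 4-dimensional points are made up for illustration.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def mean_intra_distance(cluster):
    """Average pairwise distance inside one cluster (0.0 for a single point)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def inter_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters (an assumed definition)."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))


# Hypothetical 4-dimensional clusters, where a 2-D plot would not help.
cluster_1 = [(0.0, 0.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0)]
cluster_2 = [(10.0, 0.0, 0.0, 0.0), (12.0, 0.0, 0.0, 0.0)]

print(mean_intra_distance(cluster_1))        # we want this to be small
print(inter_distance(cluster_1, cluster_2))  # we want this to be large
```

Other definitions are possible (for example, the largest pairwise distance inside a cluster, or the smallest distance between points of two clusters); the idea of "small inside, large between" stays the same.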

Techniques like these help us understand how good our clustering is. So let's see some of the evaluation techniques. The first one is called the Dunn index, D-U-N-N. It's a funny name, and it's disappearing, but you can still read about it on its Wikipedia page. Let me first write the equation: $D = \frac{\min_{i \neq j} d(i, j)}{\max_{k} d'(k)}$. That is, D, the Dunn index, equals the minimum over i and j of the distance d(i, j) between clusters i and j, divided by the maximum over k of d′(k), the distance within cluster k.

So in the numerator we have the inter-cluster distance and in the denominator we have the maximum intra-cluster distance. It is framing our intra and inter problem: our intra should be small and our inter should be large. In the denominator we are not directly asking for "small"; we are asking what the largest distance is. If every intra-cluster distance is small, then even the largest one will be something like 0.00-something. So don't be confused that we have a maximum over there: wanting a small intra-cluster distance just means that even the cluster with the largest distance should still be small, and that's why we take the largest. The numerator works the same way from the other side: the distance between your clusters should ideally be large, so we check the closest pair of clusters, and if even that pair is far apart, all of them are.

So if you look at it mathematically, D should be high for a good clustering. That is the evaluation technique for your clustering model, and I think you understood the Dunn index. So let's recap what we

have seen so far. Let me remove this. We have seen some of the terminology. We have this small x and y plane with one, two, three, four, five, six points here and one, two, three, four, five, six there, and one more. The distance between two clusters is called the inter-cluster distance; it goes across all the clusters, so maybe one cluster is here and another cluster is there. And the distance within a cluster, between its data points, is known as the intra-cluster distance. The core idea behind any evaluation metric for your clustering model is that your inter-cluster distance should be large and your intra-cluster distance should be small.

Then we talked about the Dunn index, which is simply an evaluation metric. The Dunn index is written as D equals the minimum of the distance between i and j, which are two clusters, so the numerator is obviously the inter-cluster distance, divided by the intra-cluster distance, where you take the maximum distance within a cluster. You may think I should have said that the intra-cluster distance should be small, but we are checking the maximum so that we can evaluate: we check what the largest distance inside any cluster is, and if even that is small, every cluster is compact and our D is high. Okay, so I think you understood the Dunn index.

Before we see one more evaluation technique for clustering, there are just some constraints to add to the equation. The index i should be greater than or equal to 1 and smaller than j, and j should be smaller than or equal to n, where n is the number of clusters: $1 \le i < j \le n$. The constraint on k is that k should be at least 1 and smaller than or equal to n: $1 \le k \le n$. Those are the basic constraints, but they are not the main point. What you need to understand is that the numerator is the inter-cluster distance and the denominator is the maximum intra-cluster distance. We take the maximum because we want to evaluate the worst case: if that maximum is large, the clustering is bad. So our Dunn index should be high, if you look at it mathematically.

Okay, so on to another evaluation technique. Let me go back, actually. Okay, here I am on another page, and let me remove this.
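To make the Dunn index concrete, here is a small Python sketch. It uses the standard form of the index, the smallest inter-cluster distance divided by the largest intra-cluster distance. The specific choices are assumptions for illustration, not the lecture's: inter-cluster distance is the closest pair of points across two clusters, intra-cluster distance is the cluster diameter, and the names `separation`, `diameter` and `dunn_index` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def diameter(cluster):
    """Largest pairwise distance inside a cluster (intra-cluster distance)."""
    return max((euclidean(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def separation(cluster_a, cluster_b):
    """Smallest distance between points of two clusters (inter-cluster distance)."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)


def dunn_index(clusters):
    """Smallest inter-cluster distance divided by largest intra-cluster diameter."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    min_inter = min(separation(a, b) for a, b in combinations(clusters, 2))
    max_intra = max(diameter(c) for c in clusters)
    if max_intra == 0:
        raise ValueError("every cluster is a single point; the index is undefined")
    return min_inter / max_intra


# Same separation between the clusters, but the second pair is more spread out.
tight = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
loose = [[(0, 0), (0, 4)], [(5, 0), (5, 4)]]

print(dunn_index(tight))  # 5.0
print(dunn_index(loose))  # 1.25
```

The compact clusters score higher, which matches the rule that a higher D means a better clustering.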

Okay, so another technique is here in front of you, which is the Davies-Bouldin index. This is also like the Dunn index. We denote it as DB: $DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$. That is, DB equals 1 over n times the sum from i equals 1 all the way to n of the maximum, over j not equal to i, of sigma i plus sigma j divided by the distance between c_i and c_j. Here c_i and c_j are the cluster centroids, and sigma i is the spread of cluster i, the average distance of its points from its centroid. Unlike the Dunn index, a lower DB value means a better clustering. For more information you can refer to the Wikipedia page; the derivation has already been done over there.

So that's the Davies-Bouldin index, but most probably the Dunn index is the one you will see used more for evaluating your model, so let me highlight that one. Okay, so we have seen the Davies-Bouldin index. Now it's time to learn a little bit more

about what clustering actually means, what the definition of clustering is. A one-line definition I could give is... can I write it? Yeah, it's better to write it, and you can write along with me. Clustering is simply the grouping of objects, or elements, in such a way that the objects in a group are similar to each other and differ from the objects in another cluster. That's the basic definition of clustering, and a very good one. In clustering we have two quantities, intra-cluster and inter-cluster: intra-cluster is the measure of distance within a cluster, and inter-cluster is the difference across all the clusters. So this is the basic definition of clustering, and I hope you understood everything.

Okay, so we have seen the Dunn index and we have seen clustering. Now it's time to go deeper into clustering and look at the types of clustering, how many types of clustering we come across. After talking about the types of clustering we will end this subsection. In the next section we will talk about one algorithm, which is k-means, and in the subsection after that we will talk about hierarchical clustering, which is agglomerative and divisive

clustering. But before that, let's see some of the types of clustering. Let me choose the color.

The first one is partitional clustering; I'm just writing "partitional-based approach" in short. What is partitional clustering? Assume you have this x and y plane, with data points like this here and data points like this there. You partition the data into two clusters, and you are done. For the partitional approach we have one algorithm called k-means, which we'll study in our next subsection.

Then we have hierarchical clustering. In hierarchical clustering we have something like a dendrogram; if you have seen a dendrogram, then you already know something about hierarchical clustering. Let's assume you have four data points, p1, p2, p3 and p4, plotted like this. Say p2 and p3 look close, so you cluster them. You draw a dendrogram like this, where p2 and p3 are now one cluster, which we can write as p2 ∪ p3. Now the closest point to that cluster is p4, so you make a nested cluster and attach p4 on the dendrogram. Then p1 is the closest, so you cover it as well, and in the end you have a dendrogram like the one you can see over here, with p1 joined last. Now there is nothing left to merge: we have one large cluster and we are done. This is called the traditional dendrogram; you can read about it on the internet.

So what have we seen so far? We have an agglomerative structure, meaning we are going up. We have p1, p2, p3, p4, and at first each of them is its own cluster, so we have four clusters. Then we cluster p2 and p3, then we add p4 to that cluster because it is close, and then we join p1, so that one cluster covers p1 to p4. Now we have our full dendrogram. That is agglomerative clustering. Then we

have something called divisive. I'm talking about the two approaches within hierarchical clustering, which are agglomerative and divisive. For divisive, let's assume we are given a, b, c, d, and let's say e as well. You divide it like this: first you split off a and b as one cluster, which leaves c, d, e as the other. Then you divide a and b into separate clusters, so you get a dendrogram-like structure like this. This is just the opposite of agglomerative: there we were building clusters up, and here we are breaking them down into different clusters. Next we split off c, which leaves d and e, and then we divide d and e. This is divisive clustering: we divide the whole cluster into groups whose members are closest to each other. We will study this in detail in our next subsection, where we will talk about hierarchical clustering.

After that we have something known as well-separated clusters. Well-separated means your model can easily separate the clusters. Then you have the fourth one, which is center-based; k-means clustering is a center-based algorithm. Then, for contiguity-based,

we have something called nearest neighbors, k-nearest neighbors. And then we have density-based clustering, which is often known through DBSCAN, which you will get in the problem set to learn about. Okay, great. So we have seen a lot about clustering, and I hope you really enjoyed this subsection. From the next section we'll be talking about Lloyd's algorithm, or k-means clustering, and I will try to complete it in no time so that you can get a deeper knowledge of clustering and be able to build unsupervised learning models.

So let's just recap what you have seen so far. We have seen what clustering is: clustering is just a grouping of similar objects in such a way that the objects are similar to each other within a cluster. Our inter-cluster distance should be maximum and our intra-cluster distance should be small. And what is the difference between inter and intra? In inter we have the distance across all the clusters, and that's why we want it to be maximum. In intra we have the distance between the points within a cluster, which should be minimum, meaning the points are more similar to each other than to the points of other clusters. That's the basic intuition about clustering.

Then we have seen how we can evaluate. I have talked about one evaluation index, the Dunn index, which is used to evaluate your clustering model. What does it do? The numerator is the inter-cluster distance and the denominator is the maximum intra-cluster distance. Then we have something called the Davies-Bouldin index. It would take a lot of time to teach and it's outside the scope of this course, but just note that n is the number of clusters, we have sigma i plus sigma j divided by the distance between the two clusters, and for each i we take the maximum over j not equal to i. So that's the full picture of evaluating clustering models, and I hope you understood clustering.

In this subsection we have also talked about four or five types of clustering: partitional, hierarchical, well-separated, center-based, and DBSCAN. We will talk about partitional and hierarchical clustering; well-separated is covered under center-based, and your problem set will be on DBSCAN. So we are done with this subsection, and in the next section we'll start talking about k-means. Till then, have a good day.

Okay, so now we have talked about various things, like clustering. We

have done part one: in the first subsection I gave you the applications of unsupervised learning and an overview of clustering, and in the next subsection I gave you the intuition behind clustering and a formal definition of it. We talked about inter-cluster and intra-cluster distance, we talked about evaluation metrics like the Dunn index and the Davies-Bouldin index, and I helped you understand each equation. I also helped you understand which types of clustering are available: partitional, hierarchical, center-based, well-separated, density-based and contiguity-based.

Now we will talk about partitional and center-based clustering, which is the k-means algorithm. The example is taken from Andrew Ng's machine learning course. That course is several years old, but I have updated the material for 2020; just note that the example is from Andrew Ng. I've also included the k-means++ algorithm, what k-means++ does, and some of the variations of k-means. I have also talked in detail about the limitations of k-means clustering, how we initialize the centroids, and the time complexity of k-means clustering. Then we cover the full k-means clustering algorithm and how we evaluate it with the Euclidean distance. So that's what we are going to talk about. Just be sure to sit somewhere with a notebook and a pen so that you can follow along.

But before that: what we are going to study is the k-means clustering algorithm. A synonym for k-means clustering is Lloyd's algorithm; some people call it Lloyd's algorithm and some call it the k-means clustering algorithm, but I like to call it k-means. As an example, we have this data set over here. Now, what

do we do in clustering? We initialize the centroids. The number of centroids is denoted by k, and it is a hyperparameter. Here we are going to initialize with k equal to two, so two centroids. What a centroid is, you will get to know from the visualization. So you place two points onto the data like this: the first point is over here and the second point is over here. These are called the centroids, and you will get to know why we call them that. In this example k equals two because we have taken two centroids, and we initialize these centroids randomly. We'll look at the initialization itself after a few slides.

You initialize your model, and now this is the first iteration. After initializing, you do the assignment step. What do I mean by the assignment step? You assign every point to its closest centroid: you can see over here that all the points closest to the red centroid are colored red, and all the points closest to the blue centroid are colored blue. Then we take the average of all the blue points and the average of all the red points, and we move each centroid to the average of its points. Then we do the assignment step again: we update the colors of the points according to which centroid is now closest, and you can see that the blue region changes like this. Then we again take the averages and move the centroids, and again color the points closest to blue as blue and those closest to red as red. You keep doing this, taking the average and recoloring the points blue and red.

Now you can see that you have a very good clustering. You initialized here, then you took the average and moved the centroid there, then you updated the cluster assignment, and then you moved it again, and so on. I have just shown you a quick visualization of k-means, so let's see it again in a little more fundamental way. Okay, so here is

our data, and you can see what it looks like. Here we simply initialized two centroids like this. Then we colored the points that are closer to the blue centroid blue and the points closer to the red centroid red. Then we took the average and moved each centroid to that average. And you all know how to take an average: you sum over the observations and divide by the number of observations. You do that and update the centroid, then you do the whole assignment step again, then you take the average again, then you update the points again, as you can see, and so on. You keep doing this until your centroids stop changing. So here is the informal algorithm, and you can see over here that after the seventh iteration nothing changes, so our algorithm has converged. This is how the whole k-means clustering algorithm works as a visualization, so let's see the algorithm in detail.

First of all, I'm going to write out the algorithm, so just bear with my handwriting. Let me write it in red. First, you choose k, the number of centroids; k is a hyperparameter. Then you repeat two steps. The first step is cluster assignment: you assign every data point to the cluster of its nearest centroid. The second step is to recompute the centroids: you take the average of the points assigned to each cluster and move that cluster's centroid to the average. You repeat these two steps until your centroids are not changing, which means that your algorithm has converged and you don't need to continue. So you repeat these two steps: cluster assignment, meaning you assign each point to blue or red, and then updating the centroids by taking the average, again and again. This is the basic k-means clustering algorithm, and I hope you understood k-means.

And let's understand it a little bit more with a recap of k-means. K-means clustering is an algorithm that is also called Lloyd's algorithm. Let's go back to our visualization. Here is our data, and what we do is randomly initialize two centroids. It's not two in every case; k is a hyperparameter, and we have just taken two for this case. Then you do the cluster assignment: you assign every point to the cluster of its closest centroid. Then you recompute the centroids by taking the average and moving each centroid there. Then you do the cluster assignment again, like this, and again you recompute the centroids. Then you assign again, take the average again, and move your centroids again.
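The loop described above (initialize, assign, recompute, repeat until the centroids stop moving) can be sketched in plain Python. This is a minimal illustration, not a production implementation. The function names, the `max_iters` and `seed` parameters, the toy data, and the choice to leave an empty cluster's centroid where it was are all assumptions made for this sketch.

```python
import math
import random


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) after repeating the assign/recompute steps."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Cluster assignment: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Recompute: move each centroid to the average of its points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # An empty cluster keeps its old centroid (a simplifying choice).
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Converged: the centroids did not change in this iteration.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two obvious groups in the x and y plane.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.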

Now, in the next iterations, nothing is changing, so your k-means clustering has converged and you are done. You will find the same algorithm written up on the internet: first you choose k, the number of clusters, then you assign the clusters and recompute the centroids, and you repeat until your centroids are not changing. So this is the whole k-means clustering algorithm, and I really hope that you understood it. Now let's see how we evaluate the k-means clustering algorithm.

This is not really an example; I'm just giving you the optimization objective. You are given X, a set of points in d-dimensional space, as well as k, and the set of clusters C = {C_1, C_2, ..., C_k}, so we have k clusters. You want to choose C to minimize this cost function: $\sum_{i=1}^{k} \sum_{x \in C_i} \operatorname{dist}(x, c_i)^2$, where c_i is the centroid of cluster C_i. That is, for i equals 1 all the way to k, for each x that is a member of cluster C_i, you take the squared distance between x and c_i, and you add all of these up.

Let's see what I mean by this distance between x and c. We are given X in d-dimensional space and the clusters C_1, C_2, all the way to C_k. The sum from i equals 1 to k goes over every cluster, and each x is a member of its cluster, meaning this data point is a member of this cluster. Now you take the distance between this data point and the centroid of its cluster, and this distance should be as small as possible. That's what the equation is telling us, and I hope you understood it with the help of the visualization.
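As a rough illustration of this cost function, the sketch below adds up the squared distance from every point to the centroid of its own cluster. It assumes the straight-line (Euclidean) distance; the names `squared_distance` and `sse`, and the toy clusters and centroids, are hypothetical.

```python
def squared_distance(x, c):
    """Squared straight-line distance between a point x and a centroid c."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def sse(clusters, centroids):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for x in points:
            total += squared_distance(x, c)
    return total


# Two hypothetical clusters on a line, scored against two sets of centroids.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (14.0, 0.0)]]
good_centroids = [(1.0, 0.0), (12.0, 0.0)]  # each one is the mean of its cluster
bad_centroids = [(5.0, 0.0), (5.0, 0.0)]    # both far from their points

print(sse(clusters, good_centroids))  # 10.0
print(sse(clusters, bad_centroids))   # 140.0
```

Centroids placed at the cluster means give the lower cost, which is why the recompute step moves each centroid to the average of its points.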

Okay, so c_i is the cluster centroid. How do we take the distance? For that we have something called the Euclidean distance. There are many other distance measures available, which you can look up online, but the Euclidean distance is very commonly used. What is the Euclidean distance? We have two points, p and q, and $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$: for i equals 1 all the way to n, where n is the number of dimensions of each point, we take q_i minus p_i squared, add these up, and take the square root. This is the Euclidean distance, which just measures the distance between two points like this. You can read more about it on the internet: how the equation is derived, who found it, et cetera.

This cost function of k-means clustering is called SSE, which means sum of squared errors. You can also see that it is just like the Dunn index in following the inter and intra-cluster idea: here we want to minimize our intra-cluster distance. So that's the whole story of the k-means clustering algorithm, and I really hope that you understood k-means. We have also talked about the evaluation technique for k-means, and that's it for k-means itself. Now we will talk about why it matters.
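Before moving on to initialization, here is the Euclidean distance between two points p and q as a short Python sketch. The function name and the example points are chosen for illustration only.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences of p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# (4 - 1)^2 + (6 - 2)^2 = 9 + 16 = 25, and the square root of 25 is 5.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```

On Python 3.8 and later, the standard library's `math.dist(p, q)` computes the same quantity.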

This is about random initialization. As we have seen, we initialize our centroids randomly. So why does that matter, and how can it cause a problem? Look at the first example here: in iteration one we have three randomly chosen centroids; we assign each point to the cluster of its nearest centroid and then recompute each centroid by taking the average. We do the same in the second iteration, then the third, the fourth, the fifth and the sixth, and you can easily see that we end up with good clusters. Now look at the other example. In iteration one the centroids were randomly initialized at different positions, and the whole story changed: the final clusters are clearly worse, as you can see over here.
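To make the effect of the starting centroids concrete, here is a minimal k-means sketch run from two different initializations. The data (four corners of a rectangle), the helper names and the chosen starting points are all made up for illustration; it is not the example shown in the lecture.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means from the given starting centroids; returns (centroids, SSE)."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        # Cluster assignment: each point joins its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: squared_distance(p, centroids[j]))
            groups[nearest].append(p)
        # Centroid update: average of each group (an empty group keeps its centroid).
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break  # centroids stopped changing
        centroids = updated
    sse = sum(squared_distance(p, c)
              for g, c in zip(groups, centroids) for p in g)
    return centroids, sse


# Made-up data: the four corners of a wide rectangle, clustered with k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, good_sse = kmeans(points, [(0.0, 0.0), (4.0, 1.0)])  # starts in opposite corners
_, bad_sse = kmeans(points, [(0.0, 0.0), (0.0, 1.0)])   # both start on the left edge

print(good_sse)  # 1.0: left pair and right pair
print(bad_sse)   # 16.0: stuck with the bottom pair and the top pair
```

Both runs converge, but the second one converges to a much worse clustering, purely because of where it started.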

So that's the big problem, and that's why it is not recommended to choose the centroids purely at random. Researchers studied how to do better and came up with the k-means++ algorithm, which works much better in practice. What does k-means++ do? Instead of picking all the centroids at random, it picks the first centroid at random and then picks each following centroid from the data points with probability proportional to its squared distance from the nearest centroid already chosen, so the starting centroids tend to be spread out. It is commonly combined with running several initializations and keeping the one with the smallest error, that is, the one that minimizes the SSE, the intra-cluster distance. So that's k-means++, and it is used almost everywhere instead of purely random initialization.

So why do we choose k-means++ over random initialization? Because with random initialization, if luck is not with us, a bad start can cause a big problem further on. To deal with these situations we have k-means++, which helps us start from well-separated centroids.
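As a rough sketch, the seeding rule of k-means++ as it is usually defined (first centroid uniformly at random, each later centroid drawn with probability proportional to its squared distance from the nearest centroid already chosen) looks like this. The function name `kmeans_plus_plus_init` and the sample points are assumptions made for the example; trying several initializations and keeping the one with the lowest SSE is a separate step that is not shown here.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_plus_plus_init(points, k, rng):
    """Pick k starting centroids from `points` using k-means++ style seeding."""
    centroids = [rng.choice(points)]  # first centroid: uniformly at random
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Far-away points are more likely to be picked; chosen points have weight 0.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Made-up data: three loose groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.8), (10.0, 0.0), (9.8, 0.3)]
rng = random.Random(0)  # fixed seed only to make the run repeatable
print(kmeans_plus_plus_init(points, 3, rng))
```

Because already chosen points get weight zero, the same point is never picked twice, provided the data contains at least k distinct points.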

Now you may think, "Hey Ayush, how do we select the number of centroids that we need?" That is also a big problem, and for it we have something called the elbow method. We write the number of clusters k on one axis: one, two, three, four, five, six. On the other axis we plot the loss, the SSE. If you take k = 1 your error is high; if you take k = 2 the error goes down, and it keeps going down, with the curve bending like an elbow. Here you can see that the elbow turns at three, so you select k = 3. You might think of using GridSearchCV or RandomizedSearchCV to tune this parameter, but I think that will not work because we don't have labels. The elbow method works best when you plot the curve and then see what value of k you need.

So that's the elbow method. Let's revise so that we are on the same page. What do we do in the k-means clustering algorithm? We choose the number of clusters k using the elbow method (earlier I just chose it arbitrarily; in practice we plot the curve and then choose k), and then we

repeat the cluster assignment, where we assign every point to the cluster of its nearest centroid; then we recompute each centroid by taking the average of its points; and we keep repeating this until the centroids stop changing. The evaluation technique, or loss function, for k-means is that you want to minimize the intra-cluster distance, the distance between the points inside a cluster and their centroid. That is called the sum of squared errors, and you take out the distance using the Euclidean distance, which sums the squared differences between the coordinates of two points and takes the square root.

Great. How do we initialize the centroids? We have seen that random initialization can cause a very big problem, which is why we have k-means++, which helps us select well-spread starting centroids; we can also do multiple runs and keep the one where the SSE is lowest. And how do we select the number of clusters k? By using the elbow method: if you draw the loss curve, it bends like an arm, and the elbow is where the curve turns. The loss is high at k = 1 and keeps decreasing as k grows.
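The elbow method recapped above can be sketched as follows: run k-means for several values of k, keep the lowest SSE found for each k over a few random restarts, and look at where the curve bends. The data, the helper names and the number of restarts are made-up choices for this illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans_sse(points, k, rng, max_iter=100):
    """Run k-means once from k randomly chosen data points; return the final SSE."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            groups[nearest].append(p)
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sum(squared_distance(p, c) for g, c in zip(groups, centroids) for p in g)


def elbow_curve(points, k_values, restarts=10, seed=0):
    """For each k, keep the lowest SSE over several random restarts."""
    rng = random.Random(seed)
    return {k: min(kmeans_sse(points, k, rng) for _ in range(restarts))
            for k in k_values}


# Made-up 1-D data with three obvious groups (around 2, 12 and 22).
points = [(float(x),) for x in (1, 2, 3, 11, 12, 13, 21, 22, 23)]
curve = elbow_curve(points, range(1, 7))
for k, value in curve.items():
    print(k, round(value, 2))  # plot k against the SSE and look for the bend
```

With data like this, the SSE should fall sharply up to the true number of groups and only slowly afterwards; that bend is the elbow.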

So you choose the k at the elbow, here k = 3, and it works well. We have talked about all these things, and now it's time to end the chapter on k-means clustering with two last questions: what is the time complexity, and what are some of the limitations of the k-means clustering algorithm? First, the time complexity. If you don't know DSA (data structures and algorithms), you can ignore this part. The time complexity of k-means is O(n · k · i · d), where n is the input size (the number of points), k is the number of clusters, i is the number of iterations, and d is the number of dimensions. So the running time depends on the size of your data. If you don't know about time complexity, I have a DSA mastery course that we are putting videos into; the first four lectures are based on time complexity, and you can watch

those; I will show you where to find them. Now, some of the limitations of the k-means clustering algorithm. The first is that it is very sensitive to outliers. Let's draw an x and y plane with a cluster of points here and an outlier far away over here. The outlier has to be assigned to some cluster, so it goes into the closest one and pulls that cluster's centroid toward it, which we don't want. So k-means is very prone to outliers, which means we have to detect them. The outlier problem can be handled with density-based clustering such as DBSCAN, which will be a problem set for you: read the Wikipedia page on DBSCAN and write up what you understood in a Google Doc. Other limitations are clusters with different sizes or densities, which can also cause problems, but the main problem is outliers. As I said, DBSCAN and hierarchical clustering help with these types of issues.

So that's it for the k-means clustering algorithm. As for where you can find the time complexity videos: it is very helpful if you subscribe to the YouTube channel, the New Era

YouTube channel, which you can easily find on YouTube by searching for New Era, where I have about 500 subscribers as of now. You can subscribe and watch the DSA mastery course, where more than 23 long lectures are already uploaded, and learn about time complexity from there too. So that's it for this subsection. You now have a toolkit of various algorithms and techniques. Let's meet in the next section, where we'll be talking about hierarchical clustering; I will keep it short, but it should make sense.

So now we have talked about clustering and about unsupervised learning, and we have covered one partitional, or centroid-based, approach: k-means clustering, with the k-means++ algorithm in place of purely random initialization. I hope you understood everything so far in unsupervised learning. Having come this far, we are now ready to learn hierarchical clustering, which is one of the most popular clustering techniques alongside the partitional ones.

Here is what we are going to do. We will start with an introduction to hierarchical clustering to help you understand what it is, with a quick recap of the partitional approach. Then I will show you the visual representation of hierarchical clustering and what a dendrogram is.

Then we will move on to the two types of hierarchical clustering, which are called agglomerative and divisive. We will deep-dive into agglomerative clustering and take just a quick look at divisive clustering. Then we will go through the agglomerative clustering algorithm, work through it by hand, and look at how to measure inter-cluster similarity. If all these technical terms are going over your head, just ignore them for now.

Let's start with an introduction to hierarchical clustering. What is hierarchical clustering? It means that you have a hierarchy of clusters. As an example, I'm going to take four points: p1 here, then p2, then p3, then p4. We want to cluster these four points, so how do we do it? We can simply look at the distances. The distance between p1 and p2 is the smallest, so they are the most similar, and we cluster p1 and p2 together first. That leaves p3 and p4. Next we will draw another, bigger cluster around the one we have.

The cluster of p1 and p2, written p1 ∪ p2, now counts as one cluster. Initially every point is its own cluster, and we keep merging the nearest ones. After merging p1 and p2, the point with the smallest distance to that cluster is p4, so we attach it and make one more cluster. In this bigger cluster we now have three points, and the nearest remaining data point is p3, so we again make one bigger cluster around everything. We continue like this until we are left with a single cluster, and here we have ended up with one cluster. I know this is not making much sense yet, since it is only a rough diagram, so let's see how we do it using a dendrogram; that will make more sense.

Let's assume this is point p1, this is p2, this is p3 and this is p4. Initially each point is its own cluster: p1 is cluster c1, p2 is c2, p3 is c3 and p4 is c4. In this drawing, assume p2 and p3 are the closest pair, so we attach them first; p2 and p3 are now one cluster, which we usually write as p2 ∪ p3. Now the nearest data point to that cluster is p4, so we attach p4 to it, and p2, p3 and p4 form one cluster. Finally we attach p1, and now we have our dendrogram, which is actually a hierarchy of clusters. This is the traditional dendrogram, which I think you have all seen at some point in your journey. What have you done? You have attached whichever clusters are most similar: these two were the most similar, so you attached them, then you attached the next closest one, and so on. In the drawing you don't collapse the earlier links; you attach the new one above them, like this, until we are left with one cluster. So that's basic hierarchical clustering, and I hope you understood the basic intuition, but if

not, let's try to understand it again in a slightly more careful but still easy way. For this example I'm going to take a yellow color. Here you have c1, c2, c3 and c4. c2 and c3 are the most similar, so you attach them like this. c4 is the next nearest, so you make one more link, and now c2, c3 and c4 are attached. Then c1 is attached on top of that. This is a traditional dendrogram, where you convert a set of clusters, c1 to c4, into a hierarchy of clusters. So the intuition is that we have c1, c2, c3 and c4 and we keep attaching them into bigger clusters: c2 ∪ c3, and so on. That is the basic intuition behind hierarchical clustering.

There are two types of hierarchical clustering: agglomerative clustering and divisive clustering. We will deep-dive into agglomerative clustering, and I'll just give you the intuition behind divisive clustering. So what is agglomerative

clustering? Before that, let's understand divisive clustering, which will help you understand both. Agglomerative clustering is a bottom-up approach, and divisive clustering is just the opposite: a top-down approach. What do I mean by this? As an example, let's assume that we have five data points: a, b, c, d and e. In divisive clustering we start with all of them in one cluster and divide it. Let's assume you divide it into {a, b} on one side and {c, d, e}, which are more similar to each other, on the other. Then you divide {a, b} into a and b, and you divide {c, d, e} into c and {d, e}.

Finally you take {d, e} and divide it into d and e. So what you are doing is breaking one cluster into separate pieces, step by step, which gives you a hierarchy: you start from {a, b, c, d, e} at the top and work downward, so it is a top-down approach.

In agglomerative clustering you do something different. Here every point starts as its own cluster, c1, c2 and so on, each containing only one point, and you merge them. So agglomerative clustering is a bottom-up approach. We will look at agglomerative clustering in detail, but the basic intuition is this: you have some points, say p1, p2, p3 and p4, which are four clusters initially. You have some kind of similarity matrix, which we will see shortly, and according to it you merge two clusters at a time and build a dendrogram, a hierarchy of clusters. You keep merging different clusters until you have one cluster left at the top. In divisive clustering you go the other way, top-down, dividing the cluster until every point is in its own single group, like a, b, c, d, e. That is the basic intuition behind agglomerative and divisive clustering.

So let's start the deep dive into agglomerative clustering to help you understand what it does, and we will see a lot more about this. We

will not see the implementation for now; we will see it in another section. After this I think we have completed the theory part and have a good knowledge of everything, so next we will make lots of projects. So here is the algorithm. First I will write the algorithm down, and then I will help you understand it with some good examples that I have already listed here.

First, we compute the proximity matrix, which tells you the similarity (or distance) between every pair of points. Then you repeat two steps: merge the two closest clusters, and update the proximity matrix. You repeat these in a loop until you have only one large cluster left, that is, until every cluster has been merged and nothing is left outside the single cluster. That is the more formal stopping rule; in practice you can stop once you have a good set of clusters, as in the example I showed you.
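The three steps (compute the proximity matrix, merge the two closest clusters, update the proximity information) can be sketched in plain Python as below. This is a naive illustration under stated assumptions: the distance between two clusters is taken to be the smallest point-to-point distance between them (the MIN rule covered later in this lecture), and the function name `agglomerative` and the four points are made up for the example.

```python
import math


def agglomerative(points):
    """Merge clusters bottom-up; return the merges as (members_a, members_b, distance)."""
    n = len(points)
    # Step 1: proximity matrix of pairwise Euclidean distances (zeros on the diagonal).
    proximity = [[math.dist(p, q) for q in points] for p in points]

    # Every point starts as its own cluster; clusters are keyed by an integer id.
    members = {i: (i,) for i in range(n)}
    # Distances between the current clusters, keyed by (smaller id, larger id).
    between = {(i, j): proximity[i][j] for i in range(n) for j in range(i + 1, n)}

    merges = []
    new_id = n
    while len(members) > 1:
        # Step 2: merge the two closest clusters.
        (a, b), distance = min(between.items(), key=lambda item: item[1])
        merges.append((members[a], members[b], distance))

        # Step 3: update the proximity information for the merged cluster (MIN rule).
        updated = {pair: d for pair, d in between.items()
                   if a not in pair and b not in pair}
        for c in members:
            if c not in (a, b):
                d_a = between[(min(a, c), max(a, c))]
                d_b = between[(min(b, c), max(b, c))]
                updated[(c, new_id)] = min(d_a, d_b)
        between = updated
        members[new_id] = members.pop(a) + members.pop(b)
        new_id += 1
    return merges


# Made-up 2-D points standing in for p1..p4 (indices 0..3).
points = [(0, 0), (0, 1), (5, 0), (5, 3)]
for left, right, distance in agglomerative(points):
    print(left, right, round(distance, 2))
```

The printed merge order, read from first to last, is exactly what a dendrogram draws from bottom to top.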

That is the basic algorithm of agglomerative clustering. Now let's understand it in more detail. One note first: initially each point, p1, p2, p3 and so on, is its own cluster, and from these we will build the hierarchy of clusters.

Let's see what a proximity matrix looks like. Assume, as an example, that you have four points: p1, p2, p3 and p4. You write p1, p2, p3, p4 along the rows and again along the columns, and each cell holds the distance between the row point and the column point. The diagonal entries are zero, because the distance between p1 and p1 is obviously zero, and the same for p2 and p2, p3 and p3, p4 and p4. The off-diagonal entries are whatever the distances happen to be; the distance between p1 and p2 may be something like 2.4, for example.

Now let's see how we make a dendrogram from it. Fill the matrix with some example values (the numbers I am writing here are just made up). You look for the smallest distance in the matrix, ignoring the diagonal; the pair with the smallest distance is the most similar. Assume that p1 and p2 have the smallest distance, so you merge them first. Then you find the next smallest; assume it is between p3 and p4, so you merge those. Then you join the two clusters, and now you have the hierarchy of clusters: first you do this merge, then this one, then you link them like this.

After every merge you update the proximity matrix. How do you update it? Once p1 and p2 are attached, p1 ∪ p2 is one cluster, so the separate rows and columns for p1 and p2 are replaced by a single row and column for p1 ∪ p2, holding its distance to every other cluster. I think this was not a good example, but you will

be seeing one more comprehensive example shortly. So how do we measure inter-cluster similarity? Between two points, such as p1 and p2, we can measure the distance using the Euclidean distance, the Manhattan distance, or any of the other distance measures. But how do we measure the similarity between two clusters? Inter-cluster similarity, which we already talked about in the terminology session, is just this: here is one cluster and here is another, so what is the distance, or similarity, between them when we decide whether to merge them? Let's assume you have one cluster with these points and another cluster with these points. To measure the similarity between them, I'm going to talk about three methods: the first is MIN, the next is MAX, and the last is group average.

Let's start with MIN. Assume you have this first cluster of data points and this second cluster. What do you do in MIN? Let me write it mathematically so it makes more sense.

For MIN, the similarity between clusters C1 and C2 (measured here as a distance) is the minimum distance over all pairs of points p_i and p_j, where p_i is a member of C1 and p_j is a member of C2: $\min_{p_i \in C_1,\ p_j \in C_2} d(p_i, p_j)$. So p_i may be this point in C1 and p_j this point in the other cluster, C2. You take out the distance for every such pair and keep the minimum. Here the distance between these two points is the minimum compared with all the other pairs, so that is the distance between the two clusters, and when we merge we merge the clusters for which it is smallest. That is all the MIN approach does: it measures the similarity between C1 and C2 by the minimum of the distances over all i and j with p_i in C1 and p_j in C2.

Let's look at an example to understand it a little further. This example is taken from the CS6530 cluster analysis class lecture notes. Here is your proximity matrix, where 1 stands for p1, 2 for p2, and so on. The distance between 1 and 1 is zero, between 1 and 2 it is 0.24, between 1 and 3 it is 0.22, between 1 and 4 it is 0.37, and so on; in the same way all the diagonal entries are zero. The smallest distance between two points i and j in this matrix is 0.11, between 3 and 6, so we merge 3 and 6 as the first cluster. The next smallest distance is 0.14, between 5 and 2, so you merge them as the second cluster. Next you find that these two clusters, 5 ∪ 2 and 3 ∪ 6, have the smallest distance between them, so you merge them as the third cluster. Now 1 and 4 are left, and point 4 has

the smallest distance to that big cluster, so you merge it, and then you merge in the last point to make one big cluster, and you are done. It may not make complete sense yet, so go through this example and make the dendrogram, or the nested clusters, by yourself; that will make more sense.

The next type is MAX. What do I mean by MAX? It measures the similarity between C1 and C2 by the maximum distance between p_i and p_j, where p_i is a member of C1 and p_j is a member of C2: $\max_{p_i \in C_1,\ p_j \in C_2} d(p_i, p_j)$. So it is the same idea, except that here we take the max. Here we have a comprehensive example with the same proximity matrix again. The smallest distance is still between 3 and 6, so you merge them first. Now, where do you merge 4, and why don't we merge it with 5? Because under MAX the distance between two clusters is their largest pairwise distance, and for 4 and the cluster containing 5 that is larger, while for 4 and the cluster 3 ∪ 6 it is the smallest, so you merge 4 there; that makes sense. So the rule is: for each pair of clusters you take out the maximum pairwise distance, then you compare those maxima across all pairs of clusters and merge the pair with the minimum. You can see here which value is the maximum within a pair and which is the minimum among the pairs, and you can see how the dendrogram was built.

The next type of inter-cluster similarity measure is group average. What do I mean by group average? I'm just going

to give you the intuitive uh just um as a simple equation which is p i with a member of c 1 and p j where the member of c 2 p i be the member of c 1 and p j with a member of c 2 you take Out the distance between p i and p j divided divided by the norm because the freak frequency divided by the norm of c one with whatever the number of points times the i think times the norm of c2 okay so this is what you have to do i

think it's it may be plus yeah it may be plus yeah okay so you take out the norm of what is the number of points freak frequency okay so that's what you are going to do So you can see over here that we have this dendrogram we have this following and then what you do you take you group you group it look for for example you can see the same thing has been done but i'm going to take another example to help you understand this much better to help you understand this much better so you

have this first and then you have this second cluster okay now you have some points you have some points in here you Have some points in here and some points in here okay you take each pair you take each pair each pair like this you take each pair okay okay then you average it out okay and then you take out the distance you average it and then here you go okay so for example here you take out the minimum distance is three and six and then you have this four uh then again you merge it

using the and that's what What we what we were doing but here we are merging one also the reason being the average between will it will be making sense if you sense it mathematically okay and five and two are closest okay so this is what you are doing in group men and let's see some of the uh some just this minimum this uh disadvantage the minimum disadvantage is it it simply it is it is sensitive to outliers it is Sensitive to outliers and bad wrath rather than max is uh will will be less prone to

outliers okay so that's the basic uh disadvantage that i've told you and you can see more onto the cs645 530 cluster analysis lecture notes to understand it much better okay so this is what we are doing and if you have not understood it is little bit advanced but it it should make sense little bit what what we do what we do And this we have make hierarchy of clusters like this okay so now we have talked a lot about hierarchical clustering and i hope that you really understood everything you need to know about hierarchical clustering
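The group average measure described above can be sketched in a few lines. This is only a minimal illustration, assuming Euclidean distance and two tiny made-up clusters; the function name `group_average_distance` is hypothetical and does not come from the lecture material.

```python
import numpy as np

def group_average_distance(c1, c2):
    """Average of all pairwise Euclidean distances between points of c1 and c2."""
    c1 = np.asarray(c1, dtype=float)
    c2 = np.asarray(c2, dtype=float)
    # diffs[i, j] is the difference vector between c1[i] and c2[j]
    diffs = c1[:, None, :] - c2[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)  # shape (|C1|, |C2|)
    # Divide the summed distances by the number of pairs, |C1| * |C2|
    return pairwise.sum() / (len(c1) * len(c2))

# Made-up clusters: the four pairwise distances are 4, 5, 5 and 4
cluster_a = [[0.0, 0.0], [3.0, 0.0]]
cluster_b = [[0.0, 4.0], [3.0, 4.0]]
print(group_average_distance(cluster_a, cluster_b))  # (4 + 5 + 5 + 4) / 4 = 4.5
```

In agglomerative clustering you would compute this value for every pair of current clusters and merge the pair with the smallest one.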

Now I will name the time complexity and the space complexity for agglomerative clustering, just to make sure everyone is on the same page. The space complexity is order of n squared, because we have to store the proximity matrix. The time complexity is order of n cubed in general, but sometimes it goes down to order of n squared log n. So that is your space and time complexity. You can see the limitations, pros and cons of these on the Wikipedia pages; they are very well explained there.

So we are done with hierarchical clustering, and I really hope that you understood everything. Now we are done with unsupervised learning. Next we will do some projects and then we will end this course, and I highly recommend you do lots of projects as well. After learning this machine learning you can go to New Era on YouTube and take a look at the newly launched deep learning series. From there you can learn advanced machine learning and deep learning. So that's it for this subsection; I'll be catching you in the next section.

Okay, so now we will start our

last section of this course, which is the project section. In this section we will build two projects. Maybe I will add more, but initially I planned for two, and you can surely go to my YouTube channel for more projects. These projects will give you an overview of how a machine learning project looks, and the motivation to start a new project of your own, because it's better to make your project by yourself, just taking inspiration from other people.

In this project we will build a heart failure prediction model that will predict, based on some features, whether the person will die or not. That is the problem statement. We have features like age, gender, blood pressure, smoking (whether the person smokes or not), whether the person has diabetes or not, the ejection fraction, and one with a long name which I can't pronounce. We're going to take a look at the data, which is available at this link. I hope you will understand this project; I have gone step by step to help you understand everything. I had not worked on this project before, so I picked this heart failure detection system to make sure I would be making my own project too while narrating it.

So here I'm going to make a heart failure detector based on these features. Our target variable is the death event, that is, whether the particular person died or not. We will see more about the data, but what is the business objective here? Every machine learning problem has some kind of business objective. This is a healthcare problem: we want to build a machine learning model that helps in early detection based on particular features, so that the person can be saved. That's why we are going to build this model, and it should help you better understand AI in healthcare. We'll be building one more project, a spam detection system, which decides whether emails or messages are spam or not.

So let's start with this notebook. First of all I'm going to import the libraries, so I'm going to import pandas,

numpy, seaborn, matplotlib and this. I will run it, and then I will load my data. My data is located in the heart failure data set folder: you click on data set, click on archive, and here is my data set. Then I'm just going to print out the head.

There is one more thing I want to clarify here: the reason why I'm not writing the code live. I thought it would be better if I narrate the code. If I were writing the code, maybe I would forget to narrate something, or the code would not be annotated enough. So for future reference I wrote it down beforehand, and I'm only going to walk through it here, most of it anyway. But let me know what you like, whether I should write the code live or not.

Okay, so I'm going to print out the shape of the data. The shape is (299, 13), which means 13 columns. Here we have age, anaemia, creatinine phosphokinase, diabetes, ejection fraction, high blood pressure, platelets, the serum columns, then gender, smoking, time and death event. The death event is our target variable, the one we need to predict. So that's our basic data exploration. Now we will see how much

data we have, meaning the shape of the data. Here we are given the shape, which is 299 rows and 13 columns; you can get it using data.shape.

Now I'm going to take a look at the information about the data, which you can get from the data.info() method. It will tell you whether a particular column has null values, what its data type is, and what the memory usage is. It also tells you the shape of the data, and you can see that it starts with zero indexing, so we have 13 columns.

You can also take a look at the description. For each numerical column it will tell you the count, the mean, the standard deviation, the minimum, the 25th, 50th and 75th percentiles, and the maximum. If you're a data scientist this will surely help while you are exploring: you can look at what the maximum is, what the standard deviation is, what the mean is, maybe while you're formulating some machine learning problem. So this data frame is very helpful; spend five minutes understanding it. It's very easy.

So what have we seen so far? We have imported the libraries, loaded the data from my local directory, looked at the shape of the data, looked at the information about the data with data.info(), and looked at the description, which describes our numerical data. Then I'm going to take a look at the number of null values. This will tell you, for every column, how many null values there are, and here we have zero null values in each and every column. But let's say you don't want the per-column counts and only want the total: you can add another sum and run it, and you can see that we have a total of zero null values.

So this is the basic exploration of the data. The main part comes in exploratory data analysis, where we do a lot more, so we will see how we do it. And what do I mean by exploratory data analysis?
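The exploration steps walked through above can be sketched as follows. This is a hedged sketch, not the lecture notebook: the four rows are made up for illustration, and the column names are assumptions about how the data set labels its columns. The real file has 299 rows and 13 columns.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
data = pd.DataFrame({
    "age": [75, 55, 65, 50],
    "diabetes": [0, 0, 1, 0],
    "ejection_fraction": [20, 38, 20, 35],
    "DEATH_EVENT": [1, 1, 0, 0],
})

print(data.head())      # first rows of the table
print(data.shape)       # (number of rows, number of columns)
data.info()             # dtype, non-null count and memory usage
print(data.describe())  # count, mean, std, min, quartiles, max

nulls_per_column = data.isnull().sum()  # one count per column
total_nulls = nulls_per_column.sum()    # a second sum gives a single number
print(nulls_per_column)
print(total_nulls)
```

The second `.sum()` is the "add a sum" step: the first one collapses rows into per-column counts, the second collapses those counts into one total.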

EDA, it is not very hard, but it is not very easy either, and it is a very crucial step in machine learning. You should know your data: how it behaves, what its distribution is, whether it is balanced. You have certain questions to ask of your data, which we'll see here, and you answer them using maybe some plots, maybe some numbers. And you have to find answers for your company, because if you're a data scientist or data analyst at some company you will mostly work on understanding the data and finding business solutions. As a data scientist at Artifact, I work a lot with the data, because I think I should know what my data actually is, what the business objectives are, and what answers they want from my data. So I do a lot of exploratory data analysis there.

That's why I picked this problem. I found it very interesting, and I did not pick the usual ones like the advanced house price predictor or the diabetes prediction system, which I have already made and which are available on my GitHub. I want you to try to make your own unique project and showcase it on your resume.

So next we will simply look at the distribution of our classes. What do I mean by classes here? In this example we have a binary classification problem.

In this binary classification problem we are given one and zero: the person living is zero, and the person who died is one. So we have two classes.

First I will take a look at the distribution of my data, and you can see I'm highlighting the code that does it. I take data, index the death event column, and count the number of rows where the event equals zero, and do the same where the event equals one, by taking the length of each selection. The pie chart takes an array, so I put the two lengths in an array, and then I label them, just to make sure everything is right, with Living and Died. I print the total number of living cases and the total number of died cases, and then I plot a pie chart with plt.pie, giving it arr, which contains the length of living and the length of died, with the labels Living and Died. explode says how much each slice is pulled out, and shadow is whether you want a shadow; yes, I want the shadow.

So I run it, and you can see here that our data is imbalanced. I will tell you why, but first of all you can see that we have a total of 203 living cases and 96 died cases. That makes sense too, because in the real world there are more living cases than death cases; I don't know for certain, but I think so. Now let me show you in particular, to help you get a feel for it, what imbalanced data is.
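The counting and plotting steps just described might look like the sketch below. It is an approximation, not the notebook's code: the column name `DEATH_EVENT` and the `explode` values are assumptions, the data frame is a stand-in built only from the two class counts quoted in the lecture, and the `Agg` backend is chosen so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in labels with the class counts quoted in the lecture
data = pd.DataFrame({"DEATH_EVENT": [0] * 203 + [1] * 96})

# Length of each selection: rows where the event is 0, then where it is 1
len_live = len(data["DEATH_EVENT"][data["DEATH_EVENT"] == 0])
len_death = len(data["DEATH_EVENT"][data["DEATH_EVENT"] == 1])

arr = [len_live, len_death]
labels = ["Living", "Died"]
print("Total number of living cases:", len_live)
print("Total number of died cases:", len_death)

fig, ax = plt.subplots()
# explode pulls the first slice outward; shadow draws a drop shadow
ax.pie(arr, labels=labels, explode=[0.2, 0.0], shadow=True, autopct="%1.1f%%")
plt.close(fig)

# The same answer as a proportion of all rows
share_died = len_death / len(data)
print("Share of died cases:", round(share_died, 3))
```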

What is imbalanced data? Let's take an example. Here we have two classes, living and died. You have more examples of living, 203 examples, but about two times fewer examples of died, 96. So here we are actually working on imbalanced data. What can be the issue with this? The issue is that your model is mostly trained on the living class, so you can assume that most of the outputs of your model will be zero rather than one. In some cases it will print one, but in most cases it will be zero. Your model is more prone to fit the 203 examples and may get biased towards that class; in the extreme, your model can end up printing only zero every time. This happens when, let's say, you have 400 examples, with 399 examples for the living case and only one example for death. That will cause a big problem. Working with imbalanced data is a topic in its own right, which you will also meet in deep learning.

So this is what I wanted to highlight. I just wanted to take a

look at my question: is my data imbalanced, and how many examples do I have for each case? So I answered my question: we are actually working on imbalanced data, where living cases equal 203 and died cases equal 96. Imbalanced simply means that your classes are not equally distributed. Let me write the definition as a comment: imbalanced means that your data is not equally distributed between classes.

In balanced data, the data is equally distributed. For example, let's say your training set has 400 examples; if 200 are death examples and 200 are living examples, we have an equal number of examples in both classes. That is an example of balanced data, and our model works best there: it is more robust and not biased towards any class.

So this is our first inference. What have we done so far? We have drawn some EDA by taking the length of each event, for zero and one, putting them in an array, adding the labels Living and Died, printing the counts, and plotting the pie chart with explode and shadow; you can see the shadow over here. And the inference: I answered my question, am I working on imbalanced data? Yes. By how much? About two times, just as an approximation: our died cases are about two times fewer than the living cases. Or we can put

it as a percentage, by dividing by the total length of our data. You can try it out more mathematically, but this is what I wanted to show you about this inference.

Let's move on to the next inference. Here I wanted to take a look at the distribution of age. This tells you what range most of your data falls in, the central tendency. The ages run from 40 to 95, and you can see most of the cases are around 60, with some standard deviation. So this is the distribution of your data. You can try this out for the different numerical columns I've already shown you; there are a lot more, and you

can definitely try them out. Make a pie chart for binary columns, like high blood pressure or gender, and see if it works. I will show you how to filter the columns.

So now we have seen the distribution of our data. Maybe my lead told me: hey, your data dashboard should be answering lots of questions about the data. So here one answer is that the total number of died cases is less than the living cases, about two times less. And you can say to your stakeholder or lead that the ages range from 40 to 95, and you can tell from the plot that most of the cases lie around 50 to 70. That is the distribution of the data.

Now say that you want to select the rows with age above 50 and see whether they died or not. It's a very good fit for a SQL query, but think about how you would do it in pandas. It is possible in SQL, but as an example I've done it in Python. Okay, so

what I'm going to do, my business objective, is: select the rows that are above age 50 and see whether they died or not. So I take the death event column for the rows whose age is above or equal to 50, and keep those who are living; then the same thing again for those who died. Then again I take the length of each, the same as above.

If I run this, you can see that we have a lot of living cases and fewer died cases. Let me print it out: the total number of died cases, taking the length of died, and the total number of not-died cases, taking the length of not died. You can see that the total number of died is 85 and the total number of not died is 167.

So in the whole data we have 203 living against 96 died, so died cases are about two times fewer than living cases. Here, for age 50 and above, we have 167 against 85, which is again about two times fewer. So the proportion looks much the same in both plots. What stands out is that most of the people who died were aged 50 or above: 85 of the 96 died cases. That's how we answer questions from the data. I hope you understood everything, and I think you will be drawing more inferences yourself. Great, so let's see one more column, which is a fairly good one.
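The age filter discussed above can be sketched with boolean masks in pandas. This is a minimal sketch on made-up rows; the column names `age` and `DEATH_EVENT` are assumptions, and the counts it prints are for the toy data only, not the 85 and 167 from the lecture.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed.
data = pd.DataFrame({
    "age": [45, 50, 62, 70, 80, 48],
    "DEATH_EVENT": [0, 1, 0, 1, 1, 1],
})

# Keep only the rows where age is 50 or above
age_above_50 = data[data["age"] >= 50]

# Split that subset by the target
died = age_above_50[age_above_50["DEATH_EVENT"] == 1]
not_died = age_above_50[age_above_50["DEATH_EVENT"] == 0]

print("Total number of died cases:", len(died))
print("Total number of not died cases:", len(not_died))

# A proportion is easier to compare across subsets than raw counts
share_died = len(died) / len(age_above_50)
print("Share of died among age >= 50:", share_died)
```

The same mask pattern works for any other column: swap the condition on `age` for a condition on the column you want to ask about.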

Okay, here it is. This one is for diabetes. The question you have to answer is: how many of the patients who died had diabetes, and how many of the patients who did not die had diabetes? The diabetes column says whether the person does not have or has diabetes. Maybe I've done it a little bit wrong over here: the filter should be on diabetes, and then we have to see whether the person died or not. So you can see that died with diabetes is this, and not died with diabetes is this. Again, compare with the other plot over here, and it will start making more sense.

So we have done extensive exploratory data analysis. A lot more can be done to answer a particular business problem; you can answer more questions from the data, but this was to give you a taste of how I like to answer a problem using my favourite visualizations. It's very good to answer all of this by just visualizing it and

showing it to the lead. So now we have seen a lot of visualizations. Don't worry about these; they are on the course website, mlo one dot native app, and it's very easy to get all these notebooks.

Now we will check the correlation of our variables. What do I mean by checking the correlation? It simply means seeing how your features are correlated with each other. Here I have a very good reference that I took from online; they have explained it extensively. The chart shows the correlation between the variables on each axis. Correlation ranges from minus one to plus one. If the value is close to minus one, the two variables are strongly negatively correlated: as one increases, the other decreases. A value close to zero means there is no linear relationship between the two variables, so a linear fit between them will not capture much. In that sense the correlation tells you how linear the relationship in your data is. The closer to plus one, the more positively correlated they are: as one increases, so does the other. You can read about the Pearson correlation coefficient for more on this; that is what is being used here. And the closer the value is to one in absolute terms, the stronger the relationship is.

The diagonals are all one. That simply indicates that those squares correlate each variable with itself, so it's a perfect correlation. The plot is also symmetrical, and I know that you know about symmetry.

So I'm going to put it in this form and run it. Here you can see the correlations in our data: the darker the square, the more correlated the pair is. You can read the reference; it's very well explained. You can also do the same with pandas, and your diagonals are again one. Okay, so we have talked about various things.
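To make the three cases concrete (close to plus one, close to minus one, close to zero), here is a small sketch on synthetic data. The data and column names are invented for illustration; `DataFrame.corr()` computes the Pearson correlation by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Synthetic columns built to show each kind of relationship
toy = pd.DataFrame({
    "x": x,
    "same_direction": 2 * x + rng.normal(scale=0.1, size=200),
    "opposite_direction": -x + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation matrix
print(corr.round(2))

# Every variable correlates perfectly with itself, so the diagonal is all ones,
# and corr(a, b) equals corr(b, a), so the matrix is symmetrical.
print(np.allclose(np.diag(corr), 1.0))
print(np.allclose(corr, corr.T))
```

To draw the matrix as a coloured grid you could pass it to seaborn, for example `sns.heatmap(corr, annot=True)`.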

You have understood the data and so on, and I hope you got an idea of the process. Now I will start with data set development. As I've said before, you should divide your data into a training set and a testing set, because to check that your model works well you need examples it has not seen, and you don't have new real examples at hand. So you divide your data into a training set and a validation set, here with a 70/30 split.
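A 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The arrays below are placeholders, and `random_state` and `stratify` are my own additions rather than something stated in the lecture: `stratify=y` keeps the class proportions the same in both parts, which is worth considering when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 examples, 2 features, 70 zeros and 30 ones
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# test_size=0.3 holds out 30 percent of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print("positives in the held-out part:", int(y_test.sum()))
```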

So 70 percent is for training and 30 percent for testing. You can run this; we are using the sklearn API, importing train_test_split. I hope you understood this too.

I will not do full feature engineering here, but I will showcase one example of it. In feature engineering we add more features, we encode our categorical variables, and we apply transformations to our data. Here is an example of adding features: we are adding interaction terms. What is an interaction term? For the sake of example, assume you have a data set with two columns, say age and BP, both numerical. An interaction term adds one new column by taking the product of the two. Maybe that particular product will not make much sense, but that is what an interaction term means: we are just adding the product of two features as a new column.

Here is our function for that. First we take the column names, then their count, because we have to iterate through them, and then I copy the data so that we do not change anything in the original X. I iterate through all the columns: i is the index of the first column, and I access that column by name from X. Then I range through j for the second column, because I want to multiply the two, and access it the same way. The new column's name is built from the two names, just to show which columns were multiplied, and its values are their product. Then we return the new X. We call this on X_train and X_test and we are done.

If I run it and show you X_train_mod, you can see that we now have 78 columns: we went from 12 feature columns to 78, the 12 originals plus the 66 pairwise products.
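The function walked through above might look roughly like this. It is a reconstruction from the narration, not the notebook's actual code: the name `add_interactions`, the `_x_` separator and the toy columns are all assumptions.

```python
import pandas as pd

def add_interactions(X):
    """Return a copy of X with one extra column per pair of original columns."""
    features = list(X.columns)
    m = len(features)
    X_int = X.copy()  # leave the caller's frame untouched
    for i in range(m):
        feature_i_name = features[i]
        # Start at i + 1 so each pair appears once and no column is squared
        for j in range(i + 1, m):
            feature_j_name = features[j]
            new_name = feature_i_name + "_x_" + feature_j_name
            X_int[new_name] = X[feature_i_name] * X[feature_j_name]
    return X_int

# Toy frame with three made-up features
X_toy = pd.DataFrame({"age": [60, 50], "bp": [1, 0], "sex": [1, 1]})
X_toy_mod = add_interactions(X_toy)
print(list(X_toy_mod.columns))

# With 12 features there are 12 * 11 / 2 = 66 pairs, so 12 + 66 = 78 columns
wide = pd.DataFrame([[1] * 12], columns=[f"f{k}" for k in range(12)])
print(add_interactions(wide).shape[1])
```

Note that the number of columns grows quadratically with the number of features, so this is something to try on small tables first.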

You can try it, see how the results change, and let me know in the comment box below or in our community. So that is what we are doing by adding the interaction terms; feel free to explore more.

Now we'll start building our model. First I'm going to make a function for evaluating our model, which looks at the accuracy, precision, recall and confusion matrix. We give it the ground truth as well as our predictions.

I will start with logistic regression, with max iterations set to 1000. The reason I set it is that otherwise it was not converging with any solver: if you run it, you can see the warning that lbfgs failed to converge. What you can do is increase the number of iterations, or you can scale your data; maybe my data is not scaled. So I increased the iterations to 2000 and it worked. But the warning is also telling you to standardize or scale your data. For standardizing as well as building the model we have something called make_pipeline. For any incoming example it will first standardize it and then apply logistic regression. So we are doing it all in a small number of lines.
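A pipeline of this shape can be sketched as below. The data is synthetic (generated with `make_classification` as a stand-in for the heart failure table), and the specific sizes, class weights and `random_state` values are assumptions chosen only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 rows, 12 features, roughly a 2:1 class ratio
X, y = make_classification(
    n_samples=300, n_features=12, weights=[0.68, 0.32], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The scaler is fitted on the training data only, then applied before the model
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)  # new examples are standardized, then classified
print("test accuracy:", clf.score(X_test, y_test))
```

Keeping the scaler inside the pipeline means the test data never influences the scaling statistics, which is one reason to prefer it over scaling the whole table up front.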

You can compare the results with standardization against the run that only raised max iterations: the accuracy is better, the precision is better, the recall is better, and so is the confusion matrix. I hope you watched my session on precision and recall, but let me show you what precision, recall and the confusion matrix mean.

Before that I have this note on stochastic gradient descent, just to make sure we are on the same page. It is an optimization algorithm, much like our favourite gradient descent. The problem with plain gradient descent is that it takes a lot of time. Why? Assume you have 10,000 data points and 10 features. The sum of squared residuals has as many terms as there are data points, so you have 10,000 terms, because each one is the difference between the actual value and your model's predicted value. You need to compute the derivative of all 10,000 terms with respect to each of the 10 parameters, one per feature, which is 10,000 times 10, that is one lakh (100,000) computations per iteration. That is very high.

That's why we have stochastic gradient descent. What do we do there? We repeat until an approximate minimum is obtained: we randomly shuffle the training examples, and then for each training example we compute the derivative and update the parameters right at that step, instead of waiting for a pass over the whole data set. It is a little bit outside this course, and it is usually taught in deep learning, but I would highly recommend learning it.
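The shuffle-then-update loop described above can be sketched for a plain linear model with squared error. Everything here is an illustrative assumption: the synthetic data, the learning rate, the number of passes and the variable names are mine, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 3

# Synthetic regression problem with known weights
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=n_samples)

w = np.zeros(n_features)
learning_rate = 0.01

for epoch in range(5):
    order = rng.permutation(n_samples)    # randomly shuffle the examples
    for idx in order:
        residual = X[idx] @ w - y[idx]    # error on ONE example
        gradient = 2 * residual * X[idx]  # derivative of residual**2 w.r.t. w
        w -= learning_rate * gradient     # update immediately, at this step

print(w)
```

Each update touches a single example, so its cost depends on the number of features only, not on the number of data points; the price is that individual steps are noisier than a full-batch step.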

It is otherwise just the same as batch gradient descent, so you will be able to see what the difference is.

Okay, now, as I've told you, we have something called precision, recall and accuracy, and we have not talked about them, so I'll spend a little bit of time on that. Here TP means true positive: your model's outcome correctly predicts the positive class. In any model you have two classes, positive and negative; here, say 0 is the positive class and 1 is the negative class. A true positive is where your model predicts the same output as your ground truth for the positive class, so the model predicted 0 and the actual output is 0. A true negative is where your model predicted 1 and the actual output is also 1. Then you have a false positive, where your model predicted incorrectly: it predicted 0 and the actual output is 1. It is "positive" because of the prediction and "false" because the two do not match. A false negative is the other way round: your model predicted 1 and the actual output is 0, so a negative prediction for a positive class.

So that's true and false positives and negatives, and the confusion matrix is simply laid out like this: first the true positives and false negatives, then the false positives and true negatives that we have just seen. It tells you how many examples are correctly classified in the positive class, how many of the positive class and how many of the negative class the model failed on, and the true negatives, where the model actually worked.

Now precision and recall. Precision is the number of true positives divided by the number of true positives plus the number of false positives, so TP / (TP + FP). It simply answers the question: what proportion of positive identifications was actually correct? Here we have 0.73, which is actually a good proportion, but it can be improved. Recall answers the question: what proportion of actual positives was identified correctly? That is TP / (TP + FN). So this is the recall of our model.
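To make the four counts and the two formulas concrete, here is a small sketch in plain Python. The labels are made up, and the `positive=0` default only mirrors the convention used in the explanation above; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FN, FP, TN, treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1          # positive example, predicted positive
        elif actual == positive:
            fn += 1          # positive example, predicted negative
        elif predicted == positive:
            fp += 1          # negative example, predicted positive
        else:
            tn += 1          # negative example, predicted negative
    return tp, fn, fp, tn


def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # made-up ground truth
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # made-up model output

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
matrix = [[tp, fn], [fp, tn]]             # same layout as described above
p = precision(tp, fp)
r = recall(tp, fn)
```

On this toy data the matrix is [[3, 1], [1, 5]], so precision and recall are both 3 / 4.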

Okay, so now we will see the precision, recall, confusion matrix and so on. We have built our logistic regression with some standardization; now we'll build a support vector classifier with GridSearchCV, with extensive fine tuning. We have C, the regularization parameter (it acts like the inverse of lambda), which controls the width of the margin, so we'll try different values of C: 0.1, 1, 10, 100, 1000; gamma values of 1, 0.1, 0.01 and smaller; and the kernel to be the RBF kernel. Then I will instantiate the grid search with the SVC classifier, the param grid, refit, and verbose equal to 3. I'm going to fit it; it will try each and every combination and check the score. You can find out the best parameters that the search found, where C equals 10 and gamma equals 0.01. You can see it performs a little bit worse compared to logistic regression, but it worked.

Okay, so we are done with the support vector classifier as well as the logistic regression classifier. Now we will make a decision tree classifier. I'm going to import the tree, which we've already talked about in detail, and I'm going to import RandomizedSearchCV. We have talked about grid search; randomized search works the same way, except that it samples a fixed number of parameter combinations instead of trying every one. So I'm going to define a function that takes the parameters, how many runs to do, and clf, the classifier to use. Then we call RandomizedSearchCV on clf with the number of iterations and so on, then fit it, then find the best parameters and the best score, and then report them. This is just a custom-made function, but you can surely remove it and just use the randomized search directly. I have already done this in some of my projects; I've just taken it from there, and it's working great.
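The two searches described above can be sketched roughly as follows, assuming scikit-learn is installed. The data here is synthetic (from `make_classification`), not the heart failure data, and the tree parameter lists and the helper name `randomized_search` are assumptions for illustration rather than the exact ones used in the video.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data set.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Grid search: tries every combination of C and gamma with an RBF kernel.
param_grid = {"C": [0.1, 1, 10, 100, 1000],
              "gamma": [1, 0.1, 0.01],
              "kernel": ["rbf"]}
# verbose=0 keeps the output quiet; verbose=3 prints every fit.
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=3, verbose=0)
grid.fit(X_train, y_train)
svc_test_score = grid.score(X_test, y_test)


def randomized_search(params, runs, clf):
    """Sample `runs` parameter combinations for `clf` and keep the best."""
    search = RandomizedSearchCV(clf, params, n_iter=runs, cv=3,
                                random_state=0)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_, search.best_score_


tree_params = {"criterion": ["gini", "entropy"],
               "splitter": ["best", "random"],
               "max_depth": [2, 3, 4, 5, None],
               "min_samples_leaf": [1, 2, 5, 10]}
best_tree, best_params, best_cv_score = randomized_search(
    tree_params, 10, DecisionTreeClassifier(random_state=0))
tree_test_score = best_tree.score(X_test, y_test)
```

`grid.best_params_` holds the winning C and gamma, and `best_params` holds the winning tree settings. The scores on this synthetic data say nothing about the scores reported in the video.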

Okay, so it will tell which criterion is the best, whether gini or entropy, which splitter, the minimum weight and so on; I have done a lot of fine tuning. I will run it; it will take a little bit of time, trying the values and checking the score. You can see the training score is 0.84 and the test score is 0.75. Now we'll run this classifier with these parameters, so it will give me the output like this. Now I will run it, and you can see how well it is performing. Let's see the best score from the randomized search. It is showing the best classifier, so I have to put it over here. Let's run it: it's actually 0.75. I have obviously done some fine tuning, and this is what I got with good parameters. It does not always work best; in some cases it does.

Now, with a little bit of fine tuning, I'm going to call a random forest classifier and run it with the parameters that we got for the random forest. It will evaluate my model, and we get 0.86, which is actually good. Now we will use XGBoost, which we have already talked about, with a learning rate of 0.1, the max depth and the rest of the parameters. Then I'm going to put the evaluation set into one array, just to see the log loss at the same time, and then I'm going to run it. It reports the log loss from iteration zero onwards, and that is the validation log loss. Now we will evaluate, and here again we get 0.85; it is a more robust model. If you look at the feature importances, it will show which features are more important: time is very important, ejection fraction is very important. In this way you can select features; maybe smoking, anaemia, sex and the like are the ones you can remove as feature selection.

The last classifier we are going to use is a gradient boosting classifier. Gradient boosting is widely used, so let's see how well it works with these hyperparameters. You can also fine-tune it to get better results than me, and it actually worked. Now I will save my XGBoost model, because XGBoost is more robust. I will save my model by just calling joblib.dump, and then I'm going to load it with joblib. If I run it, you can see the predictions, 0, 0, 1, and you can see the heart failure model.pkl file; you can load this model.pkl file to use the model again. So that's the basic thing that we need to understand about this heart failure detection system, and I hope that you have understood a lot from this. Thank you for watching this section; I really hope that you enjoyed it. We have covered this project in 42 minutes. In the next project we will

be talking about a small project, a spam detector system, which is again a very cool project about understanding the problem statement and building a good model. So let's meet in the next project; have a good day.

So now we will make one project, which is a spam and ham detector system. If you have seen Gmail or Microsoft Outlook, you have a spam tab, and your spam emails are there. In the same way, we are going to build a spam and ham detector system. Spam means that a particular message may be unwanted or malicious, and ham means not spam, just the opposite of spam. Here you can see that we are given a message and we are given the target label, which is our y. So this is our message, maybe "Go until jurong point, crazy..", and there is a label given with it. This data set is downloaded from the UCI repository. It is in table form, so I'm not going to use read_csv; I'm going to use read_table to read it, separated by a tab, with header set to None and the names of the columns set to label and messages.

Let's take a look at the first few data points. Let me run this, and then I'm going to show you one of the messages so it starts making sense to you. Here is data, messages, at position zero: "Go until jurong point, crazy.. Available" and so on, and it says that it is ham, which means not spam. For another example, take the one at index 2, because it is labeled as spam: "Free entry ...". It looks like spam. So that's the basic exploration of our data.

We are given a corpus of text. Here we are not given any numbers; we are given messages, which is our X, in text format. A machine learning model does not work with text, so you have to convert the text into numbers. We will see some text vectorizers that convert this corpus of text into a matrix, a vector of numbers. So here we are given data which is text, and then we have a label. Now we have our problem statement: we are given text, and we need to build a pipeline that outputs whether it is ham or spam, where ham means 0 and spam means 1.

Now we'll move forward into exploring our data. I've already explored it, so let me rerun it again. We take a look at the shape of the data, which is 5,572 rows with two columns. Then the null values are zero, because we don't have any null values. Then we have info and describe, which give the count, the unique values, the top value, its frequency and so on. Obviously with text we don't have numerical statistics, so that's why that part is empty.
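The loading and first look described above can be sketched like this, assuming pandas is installed. The real file path is not given in the video, so this sketch reads three made-up rows from an in-memory string instead; with the real data you would pass the file path to `read_table`.

```python
import io

import pandas as pd

# Made-up, tab-separated stand-in for the real file (label <TAB> message).
sample = io.StringIO(
    "ham\tSee you at lunch tomorrow\n"
    "spam\tFree entry, claim your prize now\n"
    "ham\tCan you call me later\n"
)

# No header row in the file, so we name the two columns ourselves.
data = pd.read_table(sample, sep="\t", header=None,
                     names=["label", "messages"])

shape = data.shape                        # (rows, columns)
null_counts = data.isnull().sum()         # missing values per column
summary = data.describe()                 # count, unique, top, freq for text
class_counts = data["label"].value_counts()
first_message = data["messages"][0]
```

On the real data set the shape would be the 5,572 rows and two columns mentioned above; here it is just the three toy rows.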

Okay, great. We have explored the data, so now it's time to start with exploratory data analysis. In our previous project we saw that we have to look at the distribution of the classes. So here we look at the distribution of the classes by counting how many examples each label has. You can see the labels, which are ham and spam, and plotting the pie chart tells you the share of each label. The number of ham examples is 4,825 and the number of spam examples is 747, so we are actually working on imbalanced data, and your model will learn more about ham than about spam. Be sure to keep that in mind.

Next, it's very important to understand the processing of our text: how we clean our text and why there is a need for it. What do I mean by cleaning? For example, assume that you have "Go" with a capital G and "go" with a small g. These two will be considered different, but they are the same word. Or maybe a hashtag does not carry any sense here, so why do we need the hashtag? Maybe we don't need emojis either, although in some cases, like sentiment analysis, emojis play an important role. But here "Go" and "go" are the same thing, and if we give them to a model as they are, these two will be considered different, which is not a good thing. So what we

will do is convert all our text to lowercase. Here we are doing some text preprocessing that you need to know, just using re, regex, and string replacements. I'm given one simple message. First I lower it, after converting it into a string. Then the zeros of a million in the text are replaced by m, and three zeros are replaced by k, where k is a thousand (100k or whatever). This quote character is replaced by the plain apostrophe. Then "won't" becomes "will not"; otherwise it would be considered a different word. "cannot" becomes "can not", "can't" becomes "can not", "n't" becomes "not", "what's" becomes "what is", "it's" becomes "it is", "he's" becomes "he is", "she's" becomes "she is", "'s" becomes "own", the percent sign becomes "percent", and "'ll" becomes "will". So this is the basic text preprocessing that I've done here.

One more thing that I could also do here is stemming and lemmatization. I would like you to explore both of them in implementation. First of all, let me tell you what stemming is. Stemming simply means that, for example, if I write "play", then "played", then "player", stemming will give "play": these three words are converted to this single word because they carry a similar sense. So that's stemming, and lemmatization is just a bigger version of that. Sometimes stemming does not make a meaningful word.
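As a rough illustration of the cleaning steps and the stemming idea above, here is a sketch in plain Python. It uses only a subset of the replacements mentioned, the punctuation stripping is an added assumption, and `toy_stem` is a deliberately naive suffix stripper written for this example. It is not the Porter stemmer from NLTK, which follows more careful rules and can give different results.

```python
import re

# A subset of the replacements described above, applied in this order.
REPLACEMENTS = [
    ("won't", "will not"),
    ("cannot", "can not"),
    ("can't", "can not"),
    ("n't", " not"),
    ("what's", "what is"),
    ("it's", "it is"),
    ("he's", "he is"),
    ("she's", "she is"),
    ("'ll", " will"),
    ("%", " percent "),
]


def clean_text(text):
    """Lowercase, expand contractions, drop symbols, squeeze spaces."""
    text = str(text).lower()
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Assumption: anything that is not a letter, digit or space is noise here
    # (hashtags, emojis, punctuation).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def toy_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


cleaned = clean_text("Go GO go #WIN, it's FREE & you can't lose 100%!")
stems = [toy_stem(w) for w in ["play", "played", "player", "playing", "plays"]]
```

Here `cleaned` has every "Go" variant lowered to the same token, and all five words in `stems` reduce to "play". In the notebook the same kind of function would be applied to every message to fill a processed text column.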

So for making a meaningful word we have lemmatization. For example, take "plays", "played" and "players": lemmatization will convert them to "play", whereas stemming may convert them to something like "pla", which does not make sense. Lemmatization gives something that makes sense after we lemmatize our words, that is, after we convert our group of words into a single word. So that's stemming and lemmatization. I want you to explore stemming and lemmatization using the NLTK library. You can refer to my GitHub, where I've already done that, but I want you to do this yourself; otherwise no task would be left for you in this project.

Then we apply this text preprocessing to each message and make a new column. We do this by calling apply with a lambda; it goes through each message, applies the function to it, and puts the result into the processed text column. Let's take an example, and it will start making more sense. I'm going to print out first the text that is not processed, data messages at element zero, then a simple separator line, and then the processed text. So it shows you what the preprocessed text looks like.

Then apply stemming, using the Porter stemmer; you can go to the stem module in NLTK. There is also lemmatization. You can see some of the tutorials that are available on GeeksforGeeks and elsewhere; read whichever of them you like. Here is a good example: "playing", "plays", "played" all come from the common root "play". Stemming is the process of reducing inflected words to their root form. You can read this; it is very good documentation provided by DataCamp, and you will get to know much more about this. I will be bringing out one full course on natural language processing; this is just a sample of how we work with text. The full course on deep learning will also cover working with textual data.

Now I'll go further into feature engineering. Here we are not going to add any features; we're just going to encode ham to 0 and spam to 1 by calling the map method. Then it's done. Now we'll divide our data with a train test split. Now just see that in the training set we have our text, but a machine learning model will not accept strings or any textual data. So what we have to do is convert our words into numbers; the modern way is word embeddings. There are a lot more techniques: count vectorizer, TF-IDF vectorizer, bag of words, which I'm not going to talk about; again, it's time for you to explore. As of now you can look up the mathematical equation for it, but what it does is simple. You can go to sklearn's feature extraction module: it just extracts the features from the text, converting your text into numbers. We instantiate it here. Then it fits on the training data and returns the matrix. We don't want to fit on the test text, so that's why we are just transforming the test text into the numerical matrix. That's it. This is what we've used to convert our text into a vector of numbers. If I show you the training data, it is a sparse matrix with 55,209 stored elements.

Now we will use Naive Bayes. I hope that you understood, and I hope that you had a look at

Naive Bayes, as you were told to do in the assignment. You can read more about Naive Bayes; it is very simple compared with the other learning algorithms we talked about. So you run this, then you just call predict, and here you have the accuracy.

Now let's give it a piece of text. Let's take an example: here I have a text, and I'm going to write "Ayush is a good boy". Let's see if it is spam or not spam. In production, what we do is write a function, preprocess and predict, that takes a text as input. Inside it we call the count vectorizer that we fitted earlier and use its transform to turn the text into numbers; this will contain the vector for the text. Then I call my model, which here is Naive Bayes, and it simply calls predict on that. Then I return the prediction. So when you call preprocess and predict, it takes text as input and returns the prediction.

Maybe there is some problem: "Iterable over raw text documents expected, string object received." No worries. Maybe the problem was that I passed the messages instead of the processed text, and the vectorizer expects a collection of documents rather than a single string. So now we provide a series, then call the count vectorizer's transform on it, and then Naive Bayes predict, and this will predict whether it is spam or not spam.

So that's the basic spam and ham detector system. Obviously you can try more, like stacking and various other things. This was just to give you a taste of how we work with the data in a natural language processing project. So I think we are done with this project.
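Putting the vectorizer, the Naive Bayes model and the prediction function together, a minimal sketch could look like this, assuming scikit-learn is installed. The six training messages are made up, and the names `count_vect`, `naive_bayes` and `preprocess_and_predict` are chosen for this example; the notebook's own names and cleaning function may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages; 1 means spam, 0 means ham.
train_texts = [
    "free entry win a prize now",
    "win cash claim your free prize",
    "urgent call now to claim free cash",
    "are we still meeting for lunch today",
    "i will call you after the class",
    "see you at home tonight",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Fit the vectorizer on the training text only, then train the model.
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(train_texts)   # sparse count matrix
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, train_labels)


def preprocess_and_predict(text):
    """Clean one message, vectorize it and return 'spam' or 'ham'."""
    cleaned = str(text).lower()
    # transform() expects a collection of documents, so the single message
    # is wrapped in a list; a bare string raises the error quoted above.
    features = count_vect.transform([cleaned])
    prediction = naive_bayes.predict(features)[0]
    return "spam" if prediction == 1 else "ham"


spam_result = preprocess_and_predict("claim your free prize now")
ham_result = preprocess_and_predict("see you at lunch today")
```

Wrapping the message in a list does the same job as the series used in the video: either way the vectorizer receives a collection with one document in it.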

I'll be catching up with you in the next one. I think that we are done with this course; maybe we will talk about a simple perceptron, and then we will wrap up this course. Thank you for watching this course, and I highly congratulate you for completing it.

Machine Learning Course for Beginners