Hello guys, I welcome you all. In this video we are going to discuss our next machine learning project: steel plate defect prediction using machine learning. I haven't uploaded any machine learning project yet that contains multiple target columns, and this dataset contains seven target columns, so this is the first time I am uploading a project with multiple target columns; in this project we have to predict seven targets. As you can see, a competition is currently running for this project, the Steel Plate Defect Prediction Playground Series, and your goal is to predict the probability of various defects on steel plates, as you can see over here. From here you can read the dataset description, and from here the rules for this competition. Now let's click on this Data tab. As usual, three files are given: train.csv, test.csv, and sample_submission.csv. I have already downloaded the train and test.
csv files, which you can download from here by pressing this Download button. We have to submit a file in the format given in sample_submission.csv, and as you can see in this file, we have to submit the probability of the predicted target, not the actual target. At the end we will submit our submission file on Kaggle, so please join this competition first; only after joining will you see this Submit Prediction button. We will discuss more during coding, so let's get started.

Let's jump to the Jupyter notebook. As you can see, we have imported the various libraries that we need: we have imported pandas as pd; from sklearn.model_selection we have imported cross_val_score; from sklearn.ensemble we have imported RandomForestClassifier and ExtraTreesClassifier; from sklearn.multioutput we have imported MultiOutputClassifier; from lightgbm we have imported LGBMClassifier; from xgboost we have imported XGBClassifier; from catboost we have imported CatBoostClassifier; and at the end we have imported joblib to save our model.

As we discussed, we are handling a multi-output classification problem, and that's why I have imported MultiOutputClassifier, a useful utility for addressing such challenges. In this project we need to predict multiple target variables simultaneously, and the MultiOutputClassifier class extends traditional classifiers to handle this task. It works by training a separate classifier for each target variable, effectively transforming a multi-output problem into multiple single-output problems; during prediction, each classifier is used to predict its corresponding target variable. Overall, MultiOutputClassifier provides a convenient and efficient way to tackle multi-output classification problems within the scikit-learn framework. Shortly we will discuss it in more detail.

Now, next: load the data. The first line loads the data from a file called train.csv, which we downloaded from Kaggle, and saves it as a DataFrame named train_data. The second line loads the data from a file called test.csv, which we also downloaded from Kaggle, and saves it as a DataFrame named test_data.

Next: split the data into features and targets. As you can see, this code splits the loaded data into features and targets, since machine learning models expect separate features (the independent variables) and targets (the dependent variables). Now let me write X_train.
head(). As you can see, we have this many features, that is, this many independent variables. And let me write Y_train.head(): we have this many dependent variables, so, as I said, seven targets, seven dependent variables that we are going to predict. Let me write X_train: as you can see, X_Minimum, X_Maximum, and so on.

Now let's discuss all these features. Our dataset contains these features, and I have divided them into six categories: location features, size features, luminosity features, material and index features, logarithmic features, and statistical features.

Location features first: X_Minimum is the minimum x-coordinate of the fault, X_Maximum the maximum x-coordinate of the fault, Y_Minimum the minimum y-coordinate of the fault, and Y_Maximum the maximum y-coordinate of the fault.

Next, the size features: Pixels_Areas is the area of the fault in pixels, X_Perimeter the perimeter along the x-axis of the fault, and Y_Perimeter the perimeter along the y-axis of the fault.

Now the luminosity features: Sum_of_Luminosity is the sum of the luminosity values in the fault area, Minimum_of_Luminosity the minimum luminosity value in the fault area, and Maximum_of_Luminosity the maximum luminosity value in the fault area.

Next, the material and index features: TypeOfSteel_A300 and TypeOfSteel_A400 indicate the type of steel, Steel_Plate_Thickness is the thickness of the steel plate, and Edges_Index and the other index columns you can see over here are various index values related to the edges and geometry of the fault.

Now another set of features, the logarithmic features: LogOfAreas is the logarithm of the area of the fault, and Log_X_Index and Log_Y_Index are logarithmic indices related to the x and y coordinates.
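Before moving on, the load-and-split step we just walked through can be sketched in code. This is a hedged sketch, not the notebook's exact code: the helper name split_features_targets is my own, and it assumes the seven target columns are the last seven columns of train.csv, with an id column to drop, which matches this competition's file layout.

```python
import pandas as pd

# Assumption: in the Kaggle train.csv, the seven fault targets are the
# last seven columns, and an "id" column is present.
TARGET_COUNT = 7

def split_features_targets(df: pd.DataFrame, n_targets: int = TARGET_COUNT):
    """Split a training DataFrame into features X and targets Y.

    Drops the 'id' column if present and treats the last `n_targets`
    columns as the dependent variables.
    """
    df = df.drop(columns=["id"], errors="ignore")
    X = df.iloc[:, :-n_targets]   # independent variables (features)
    Y = df.iloc[:, -n_targets:]   # dependent variables (the fault targets)
    return X, Y

# Usage (assuming the competition files are in the working directory):
# train_data = pd.read_csv("train.csv")
# test_data = pd.read_csv("test.csv")
# X_train, Y_train = split_features_targets(train_data)
# X_train.head(); Y_train.head()
```

The file-reading lines are left commented so the helper stays usable on any DataFrame with the same layout.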
Now another set, the statistical features: Orientation_Index is an index describing orientation, Luminosity_Index is an index related to luminosity, and the last one, SigmoidOfAreas, is the sigmoid function applied to the areas. So our dataset contains all of these features, and together they provide detailed information about each steel plate fault, including its location, size, material characteristics, geometric properties, and statistical attributes. You can take this file from my GitHub account; the link is provided in the description of this video.

Now let's go back to the Jupyter notebook and discuss our target variables. Let me write Y_train. As you can see, these are the targets we are going to predict. First, Pastry: pastry refers to small patches or irregularities on the surface of the steel plate. Second, Z_Scratch: Z scratches are narrow scratches or marks on the surface of the steel plate that run parallel to the rolling direction. Next, K_Scatch: K scratches are similar to Z scratches but run perpendicular to the rolling direction. Next, Stains: stains refer to discolored areas on the surface of the steel plate. Next, Dirtiness: dirtiness indicates the presence of dirt or particulate matter on the surface of the steel plate. Next, Bumps: bumps are raised areas on the surface of the steel plate. Next, Other_Faults: this category covers a broader range of faults or defects not specifically categorized under the other fault types listed; it could include various surface imperfections, irregularities, or abnormalities that affect the quality or usability of the steel plate. So these are our target variables, or dependent variables, that we are going to predict. These fault types are typically identified and categorized during quality control inspections to ensure that steel plates meet the standards and requirements specified for their various applications.

Now, next: as you can see, we are storing the IDs from the test data, which helps in
tracking predictions back to specific test samples. The next line creates a variable test_features containing the features from the test data; here we are removing the id column from the test DataFrame, so test_features only includes the columns representing the features that will be used for making predictions.

Now, next: initialize the classifiers. Here we have initialized five classifiers for our prediction task: RandomForestClassifier, LGBMClassifier, XGBClassifier, CatBoostClassifier, and ExtraTreesClassifier, and we have created instances named rf_classifier, lgbm_classifier, xgb_classifier, catboost_classifier, and extra_trees_classifier.

Next: initialize the multi-output classifiers. This step is very important. As you can see, we are initializing a MultiOutputClassifier for the prediction task with the random forest classifier, with the LGBM classifier, with the XGB classifier, with the CatBoost classifier, and with the extra trees classifier. As we discussed earlier, these multi-output classifiers are particularly useful for tasks where there are multiple target variables to predict simultaneously, which is exactly what we are doing here: we have seven target variables, and that's why we are wrapping each classifier in a MultiOutputClassifier. This extends single-output classifiers to handle multiple target variables, allowing independent training of one classifier per target. Specifically, MultiOutputClassifier extends individual classifiers such as random forest, LGBM, XGB, CatBoost, and extra trees to handle multiple targets independently; each classifier is trained separately to predict a specific target variable. This approach enables efficient modeling of complex relationships between multiple output variables in various machine learning tasks, so that's why here
we are using MultiOutputClassifier together with the other classifiers. As we discussed, MultiOutputClassifier provides a convenient and efficient way to tackle multi-output classification problems within the scikit-learn framework.

Now, next: perform cross-validation to evaluate the models. Here we are performing cross-validation to evaluate the performance of the initialized multi-output classifiers; as you can see, we are using five-fold cross-validation and evaluating the models based on the accuracy score. Next, we print the mean cross-validation score for each classifier.

Now, next: choose the best model. Here we are choosing the best model based on the mean cross-validation scores and printing it; as per the mean cross-validation scores, CatBoost is the best model for this particular dataset.

Next: train the best model on the entire training data. Depending on the highest mean cross-validation score, the code determines whether the best model is the random forest, LightGBM, XGBoost, CatBoost, or extra trees classifier, and the chosen classifier is then trained on the complete training dataset using its fit method. This step ensures that the selected model learns from the entire training data for accurate predictions.

Next, we save our best model with joblib, so that training is not required again and again and we can perform predictions using the saved model.

Now, make predictions: here we are performing predictions on the test data using our best model. Let me copy this and show you the predictions. As you can see, here we are using
predict_proba, because we are interested in the probabilities, not the actual predictions. As you can see, the output of predict_proba contains, for each sample, the probability of the negative class and the probability of the positive class, and the output is divided into seven parts, 1, 2, 3, 4, 5, 6, and 7, because we have seven targets in our dataset. Each part contains the predictions for one of the targets, and within each part every row holds the predicted probabilities for the two classes of that target: the first value corresponds to the probability of the negative class (label 0) and the second value corresponds to the probability of the positive class (label 1). This is how we have performed the predictions using predict_proba, because we are interested in probabilities, not in the actual predictions.

Now, next: generate the submission file. Here we iterate over each target, which you can see with Y_train.columns; let me copy this and show you our target variables. So we iterate over each target, fetch the probability of the positive class for that target (that's why the index is 1 here), and store the positive probabilities in the respective target column. Finally, we save the CSV with the name submission.csv.

Let me demonstrate it slowly so you can understand. Let me import the time module, copy this, paste it over here, print the data, and add a time.sleep of 6 seconds before each step; by looking at this carefully you can understand the entire logic. Let's execute it. As you can see, the positive-class probabilities are added for our first target, then for our second target column, our third, fourth, and fifth target columns, then again for our sixth target variable, and finally for our seventh target. I hope all of you are clear with this: we take the probability of the positive class for the first target and add it to the first target column, and the same process is repeated from the second target up to the seventh. Finally, we convert our DataFrame to a CSV with the name submission.csv.
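To make the whole flow concrete, here is a minimal, hedged sketch of the pipeline described in this video. It is not the notebook's exact code: to keep it self-contained it evaluates only the two scikit-learn ensembles (in the video, LightGBM, XGBoost, and CatBoost are wrapped and compared in exactly the same way), and the helper names evaluate_models and fit_best_and_predict_proba are my own.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputClassifier

def evaluate_models(X, Y, models, cv=5):
    """Mean cross-validation accuracy for each candidate estimator,
    each wrapped in a MultiOutputClassifier for the multi-target task."""
    scores = {}
    for name, estimator in models.items():
        clf = MultiOutputClassifier(estimator)
        scores[name] = cross_val_score(clf, X, Y, cv=cv, scoring="accuracy").mean()
    return scores

def fit_best_and_predict_proba(X, Y, X_test, models, scores):
    """Refit the best-scoring candidate on the full training data and return,
    for every target column, the predicted probability of the positive class."""
    best_name = max(scores, key=scores.get)
    best_model = MultiOutputClassifier(models[best_name]).fit(X, Y)
    # predict_proba returns one (n_samples, 2) array per target;
    # column 1 holds the probability of the positive class (label 1).
    per_target = best_model.predict_proba(X_test)
    positive = {col: probs[:, 1] for col, probs in zip(Y.columns, per_target)}
    return best_name, positive

# Usage sketch (X_train, Y_train, test_features, test_ids as in the notebook):
# models = {"rf": RandomForestClassifier(), "et": ExtraTreesClassifier()}
# scores = evaluate_models(X_train, Y_train, models)
# name, probas = fit_best_and_predict_proba(X_train, Y_train, test_features,
#                                           models, scores)
# joblib.dump(...) could persist the fitted model, as in the video.
# submission = pd.DataFrame({"id": test_ids, **probas})
# submission.to_csv("submission.csv", index=False)
```

Assembling the submission from the positive-class probabilities per target column mirrors the loop over Y_train.columns shown above; only the positive-class column of each predict_proba array goes into the file.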