Hey, and welcome to the second chapter of the Python for AI Development course by AssemblyAI. I am Mısra Turp. Today I'm going to show you how to prepare your data for training a machine learning algorithm, and in the next lesson we're going to learn how to use the scikit-learn library to train one. I'm going to build this code in a Jupyter notebook. You already learned how to set up a Jupyter notebook on your laptop and also on Google Colab; you can use either one for this lesson, it's totally up to you. If you want to follow along with the course, I will leave a link to the code somewhere up here and also in the description, so you can follow along as I code.

Of course we're going to be using Python, and mainly the pandas library, to do our data analysis. What we're going to do is explore the data, understand it a little bit better, deal with any problems we run into during exploration, and prepare the data for model training at the end. I'm going to take a Kaggle data set, the flights data set that I will introduce to you in a second, and go through it as an example, so it's not only going to be theoretical; you can also see how it's done in real life.

But first I want to talk to you about the types of data. If you're new to data science or new to AI, it might be a little bit overwhelming to see what kinds of data sets are out there, because your data set could be tabular data, images, text files, or even audio files. Each of these types has a different way of being explored and cleaned, but the beginner level is to work with tabular data, because it's more structured, it's easier to
understand, and it's a good way to get some experience before you jump into other types of data sets.

There are a couple of steps that you should take before you do any coding, and the first one is to understand where your data is coming from. Sometimes you download your data from the internet, maybe from Kaggle or from a university's research group, or if you're working at a company, maybe you get it from another team. One of the main things you have to look for is documentation, because data sets are not always self-explanatory. There could be a column name that is abbreviated, or one that just looks odd and that you don't understand, and as a data scientist you're not always the subject matter expert in the field you're working in. That's why I always look for documentation. If you're getting your data from Kaggle, there will be a page where you can see details for your data set. If you're getting it from some other place on the internet, there's probably going to be either a txt file or a PDF file explaining how the data was collected, what each of the columns means, and what their units are. Say it's a length column: is it in inches or in meters? If it's a time column, is it in minutes or hours? These are really important things to know. If you're working with internal data in a company, it is important to talk to the people who prepared the data and understand how it was collected. This will give you an idea of what kinds of problems to look for during your exploration, so you'll be a little bit more efficient.

So let's get started. The first thing I want to do, of course, is to import pandas, and I'm also going to import NumPy, because you never know when you're going to need NumPy, which is more or less always. I will also show you how to set up one of my
favorite settings: setting the pandas option display.max_columns to None, so there is no limit on how many columns are shown. I'll show you why that matters in a second. Let me get my data set here, flightsample.csv. As I mentioned, I'm using the flights data set from Kaggle. It has information on delayed flights but also non-delayed flights. We're going to look at it in more detail, and I'm going to tell you what kind of problem I'm solving with this example data set. Just so you know, if you download it from Kaggle you will have a flights.csv, but I sampled a smaller number of data points from that big data set because it was a little too big to run as an example. That's why mine has "sample" in the name. I will read it into a flights variable.

Then let's see how many rows and columns this data set has; you'd use the shape attribute for that. It apparently has more than a million rows and 31 columns. If I try to print this data set, it shows me all of the columns, but as you can see it doesn't show me all of the rows, because there is a max rows limit. And if I also set a max columns limit and run this again, I'm only going to get five columns. So sometimes, if your data set is really big and you're printing it to understand all the columns in there, it will not show you all of them; there will be a limit. If you want to eliminate that limit, all you have to do is set display.max_columns to None, and then you'll be able to see all of your columns. If you want to see only a couple of data points, you can say flights.head(), which shows only the first five data points, or you can specify how many you want to see, say 10, and it will show you the first 10.

As I said, we're going to be doing data exploration today, but what
does that exactly mean? We're going to look for missing values, we're going to look for outliers, we're going to make sure that our categorical and binary values are consistent, and lastly we're going to look at column types to make sure all of the columns have the correct type: if it's a number it's a number, if it's a category it's an object, as the naming goes in pandas.

Let's quickly take a look at our data set. Each row is one flight. We have the date of the flight, which airline is running it, flight number, tail number, origin and destination, when it was scheduled to depart, when it actually departed, how long the flight took, when the arrival was scheduled, when it actually arrived, and the delay. We also have the information of whether it was canceled, if so why, and some reasons for delay.

The problem we're going to try to solve today is predicting whether there's going to be a delay and, if so, what it is going to be caused by. As you can see, we have one, two, three, four, five different delay reasons here, and we're going to change the data set so that when we train the model, it will be able to tell us what the delay is going to be caused by. So it's kind of like predicting the future a little bit. But there are a lot of columns here that I will not really need as I'm doing exploration, so I'm going to remove a bunch of them. I can use flights.columns to see all of the columns, and then I will select only some of them. If you use double brackets in pandas, that means you're choosing a subset of your columns. I will just copy and paste all of the column names here and then keep only the ones I'm interested in.

I want to keep year, month, day, day of week, and airline, because you can imagine that which airline is running the flight will affect whether it is going to be delayed or not; maybe some airlines tend to be late more than others. I don't really need the flight number, because it is basically captured by origin airport, destination airport, scheduled departure, and airline. Tail number I also don't feel like I need. But of course these are decisions you can undo: if you realize that you need some of the information, you can come back and change this.

Scheduled departure stays. I don't really need the actual departure time: when you think about it, this model will be used before the flight, so we will not have information like departure time, arrival time, or arrival delay. That's why I mainly want to keep information that I will already have at that point. One thing to be said, though: if I have some missing information, maybe I will be able to use departure time, for example, to fill it in, so I will actually keep departure time for now, and also departure delay. Taxi out, wheels off, scheduled time, and elapsed time will not be really relevant, I feel, so I will remove those. I think scheduled arrival and arrival time will definitely be helpful. Arrival delay is exactly what I need. I'll also keep canceled and cancellation reason, and all the delay reasons for now. So let's take a look at our data set again. All right, these are the only ones that I want so far. I
will immediately take a look at whether there are any missing values. Okay, so it looks like there are a bunch of missing values. We are missing departure time and departure delay sometimes, arrival time sometimes, and arrival delay. Also cancellation reason, which makes sense, because probably many of the flights are not canceled, so there is no cancellation reason. The interesting thing is that the delay reasons are missing most of the time, and they are all missing the same number of times. So maybe this is not really a missing value; it's just that, because the flight was not delayed, this information is empty.

To understand whether that is the case, I want to see what happens every time arrival delay is missing. So, flights where arrival delay is missing: we are also missing the delay reasons, which makes sense. But let's see how big this is: this is only 7,806 rows, and the delay reasons are missing more often than that. So let me see whether these reasons are all missing at the same time; it could just be a coincidence that the counts are the same. To save you some time I prepared the code beforehand. Basically I'm saying: show me all of the rows where air system delay is null, and security delay is null, airline delay is null, late aircraft delay is null, and weather delay is null (null, None, or NaN, "not a number", are all names for these missing values). All right, that is 354,479 rows, which is exactly how many times we have missing information here. This means that every time air system delay is null, all the other delay reasons are also null, and they're only null when the other ones are null. This is good to know. But one thing that is interesting: I would then have expected arrival delay to have the same number of missing values. So let's put this on a histogram, and then it will be clear to see
what's missing exactly. All right, this is a histogram of arrival delay when air system delay, security delay, airline delay, and late aircraft delay are missing. We see arrival delay actually starts from minus 60 or minus 80 and goes only until about 10 or 15, which is interesting. Let's take a closer look. I want to see every time arrival delay is bigger than 10. Okay, arrival delay is around 14 or 15, but these all look like very small numbers. What if I say arrival delay is longer than 15? Oh, interesting: we do not have any data points where the delay reasons are missing and arrival delay is longer than 15 minutes. What about 14 minutes? Still nothing. 13 minutes? Okay, there's something. So this basically tells me that when arrival delay is shorter than 15 minutes, the plane is not counted as late, and that's why there is no delay reason given. If I worked in the company or in the field I would probably already know this, but it's something you can learn while working with the data. So from now on I will assume that only flights that arrived 15 minutes or more after their scheduled arrival time are counted as late.

Okay, so this is interesting information, but I also want to see why the arrival delay is sometimes missing. Let's take a look at this again. We see here that arrival delay is missing, but for some of these rows we are missing the arrival time, and for others, even though there is scheduled arrival and arrival time information, we are still missing the arrival delay. So I will set aside some of this data to be able to look at it more clearly: give me all of the data points where arrival delay is missing. So what could be the reason? Oh yes, it looks like most of these are canceled. That might explain why we have NaN as arrival time here, but it does not explain these other cases. So let's see how many of them
were actually canceled and how many weren't. Okay, a good number were canceled. Maybe a better way of showing this is to use value_counts. value_counts tells me, for the data points where arrival delay is missing, how many times the flight was canceled and how many times it wasn't. Okay, it looks like most of the time it was canceled; in only a bit more than one thousand of the cases it wasn't, and we see two examples here.

There are a bunch of ways you can deal with this. If you remember, what we're trying to do here is predict why a flight was delayed. Here we see clearly that this flight was supposed to land at 2026. If you are watching from the US you might not be familiar with this; it means 8:26 PM. But it landed at 10:16 PM, so there is one hour and 50 minutes of delay, yet we do not have the delay information or why it was delayed. What we could do here is calculate the arrival delay. If we were building a model to predict how long a delay is going to be, we could definitely have used these two columns to calculate arrival delay. But we're trying to predict the delay reason, and we cannot generate the delay reason from what we have. So, because we have more than a million data points and this only covers about a thousand of them, I'm just going to delete them. It's easier for me to work on my data, there's not going to be any dirty data in there, and I cannot use this information at the end of the day anyway. So what I'm going to do is say: flights, remove the ones where arrival delay is null. Basically you create a filter: you say every time the arrival delay in the flights data set is null, and by using the tilde you negate it, so you get all the data points
from the flights data set where the arrival delay is not missing. Let's do this, and then I also want to see what the shape of my data set is. Right now it is 1,410,212 rows, and before it was about 1,417,000, so there is not a big difference. It's a good thing that we deleted those. We deleted these ones because we cannot use them, and we also deleted the ones that were canceled; again, we cannot use them, they were canceled flights.

All right, let's see what kind of missing information I have left in my data set after this. We do not have any missing arrival delay left. Cancellation reason, as I said, is expected, and we still have the delay reasons as missing information. What I can do is just fill them in as zero, because we saw that the only time these values are missing is when there is no delay. Sometimes there is an arrival delay of less than 15 minutes, but those flights are not considered late, and that's why there is no reason for delay. So I'm going to fill this information in as zero. How I'm going to do that is by saying flights, specifying which columns (I can just copy and paste the names here, a bit easier), and using the fillna function, specifying what I want them to be filled with. This is what I get back, but instead of just looking at it I will assign what the function returns back to my columns, and then when I check my flights data set again you will see that what used to be NaN is now zero.

All right, so it looks like I'm more or less done with missing information. As I said, I do not need to deal with cancellation reason, because it will not make its way to the model training phase; I will remove it before then. Next I want to see if there are any outliers in my data, and using histograms is a really good way to do that. I will show you how to read
histograms in a second. For now I'm just setting the bins to be 60 and the figure size to be 20 by 20, so we have a bigger canvas and can see all of the little histograms a little bit better. What bins means is basically: the more you increase this number, the more granular the information you're going to see. If I set the bins to be 20,
then the granularity of the information is going to be lower. That's why one number I like to use is 60, but depending on the data set you're working on you might need to change that.

All right, so let's take a look to see if there is anything unexpected going on. I only have flight information from 2015; this is expected. Month: these are the number of flights I have each month, so for the first month I have more than 120,000 flights, and the other ones are similar, more or less as expected. Day of the month: we see a little pattern here. Day of the week: again the numbers are from one to seven, which is expected, and we see that maybe on Mondays and Thursdays there are more flights. Scheduled departure, departure time, and departure delay look normal. Scheduled arrival looks normal, arrival time looks normal. Arrival delay: sometimes there are negative delays, which means the flight arrived early, and sometimes it looks like the delay is longer.

There is one thing you need to pay attention to when you're reading histograms, though, so let's just reiterate what this plot shows. It basically tells you, for each value of these columns, how many times that value occurred. Let's say this high bar is at zero: that means arrival delay has been around zero more than 600,000 times in this data set, and the higher the value, the fewer occurrences you see. But one thing you also need to understand is that histograms only extend the x-axis as far as values that exist in your data set. So if it shows you 2,000 here, it means there is actually a value that is 2,000, or maybe a bit less. That's why I want to go ahead and see whether that is an outlier or not: whether it actually makes sense that there is an arrival delay of 2,000 minutes. Let's take a first look at delays of more than 500.
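As a rough sketch of that check, here is what it could look like on a tiny made-up table rather than the real flights data. The column names `ARRIVAL_DELAY` and `AIRLINE_DELAY` and all of the values are assumptions for illustration. The plotting call is left as a comment because it needs matplotlib; `np.histogram` is used instead to show the bin counts as plain numbers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights table; every value here is invented.
flights = pd.DataFrame({
    "ARRIVAL_DELAY": [-12, 0, 3, 14, 45, 510, 1980],
    "AIRLINE_DELAY": [0, 0, 0, 0, 45, 510, 1980],
})

# On the real data, the grid of per-column histograms would come from
# something like this (requires matplotlib):
# flights.hist(bins=60, figsize=(20, 20))

# np.histogram returns the same kind of bin counts as numbers.
# More bins means narrower bins, so finer-grained counts.
counts, edges = np.histogram(flights["ARRIVAL_DELAY"], bins=4)

# A long x-axis means at least one value really sits out there,
# so filter those rows and inspect them instead of guessing.
long_delays = flights[flights["ARRIVAL_DELAY"] > 500]

print(counts)
print(long_delays)
```

The threshold of 500 matches the first look described above; on real data you would raise it step by step to isolate the handful of extreme rows.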
So I have a bit more than 2,000 data points where the arrival delay is higher than 500, and it looks like this actually makes sense. Here the arrival was supposed to be at 12:28 and the flight actually arrived at 8:54, so that's why there is a big delay. Or the flight was supposed to land maybe the day before at 10:56, but it arrived at 7:44 in the morning; that's why there is a big delay. And for all of them we do actually have a delay reason. So this looks legit to me, but let's also look at very big delays. Okay, we only have a handful of these, but it looks like all of this information is actually checking out. Let's see: this flight was supposed to land at 10:49. It arrived what looks like four hours later, but maybe it's actually four hours later the day after, at 2 PM, so maybe that's why there is a big delay, and we also have this delay attributed to a reason. So it looks like there is not actually a problem; some flights are just super, super late. Okay, that's good to know.

What else can I look at? Do I have all of my delay reasons? I have air system delay here (some of them are really long), security delay, airline delay, late aircraft delay. But as you can see, I do not have a plot for weather delay, and that could actually point to something. So let's look at what my column types are; I can use dtypes for this. All of these delays should be numeric, but I see weather delay is object, and object in pandas basically means string. That means there is a problem with weather delay. It could be that there was a little problem when reading the CSV file and pandas was not able to read it properly, or it might mean that there is a value in there that cannot be cast to a number, and that's why pandas read it as a string. So what I'm going to do is try to turn this column, weather delay, into a numeric column and see what happens. pd.to_numeric is what I'm going to use, and I'm going to pass it the column. All right, we get an error, and it says: unable to parse string "-" at position 107. So let's go to position 107 and see what's going on. I'm going to use iloc, which selects the row at position 107.
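Here is a minimal sketch of that sequence on a tiny made-up column, not the real flights data. The column name `WEATHER_DELAY`, the values, and the position of the bad entry (2 here, instead of 107) are assumptions for illustration. The last step, `errors="coerce"`, is one possible way to proceed and is not necessarily what the lesson does next.

```python
import pandas as pd

# Toy stand-in for the weather delay column. The "-" entry is invented to
# mimic a value that makes pandas read the whole column as text.
flights = pd.DataFrame({"WEATHER_DELAY": ["0", "12", "-", "35"]})

# The column is reported as object (or a string dtype in newer pandas),
# not as a number.
print(flights.dtypes)

# A strict conversion fails, and the message names the offending position.
try:
    pd.to_numeric(flights["WEATHER_DELAY"])
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# iloc selects by integer position, so this is the row the error points at.
bad_row = flights.iloc[2]
print(bad_row)

# One possible follow-up (an assumption): turn unparseable entries into NaN
# so the rest of the column becomes numeric.
weather_numeric = pd.to_numeric(flights["WEATHER_DELAY"], errors="coerce")
print(weather_numeric)
```

Note that `iloc` works by position while `loc` works by index label; after rows have been dropped the two can point at different rows, so the position reported in the error should be looked up with `iloc`.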