Hello everyone, this is Ashwin. In this video we're going to do the project fake news detection. I'm going to build an LSTM model for it, and this is a classification model: we are going to find whether a news article is fake or genuine. That is the objective of this project. Because this is a deep learning project, it will be quite different from the machine learning projects: the pre-processing steps are different and the model creation is different, and we are going to cover all the related techniques in this video.

First, let's explore the dataset information. The stated objective is to develop a machine learning program to identify when an article might be fake news; since we are doing deep learning anyway, just read "machine learning" as deep learning here. The dataset attributes: id is a serial number; title is the title of the news article, like a subject or overview; author is the author who wrote the article; text is the text of the article, which could be incomplete, so keep that in mind; and finally label, where 1 means unreliable and 0 means reliable, so you can consider 1 as fake news and 0 as genuine news.

I'm going to work only with the text, because that is the important part: we have to analyze the whole news article to check whether it is genuine. Going by the title alone would be vague; a title is small, maybe ten words, and that is not sufficient to identify fake news. So we will go through the text, the entire article, and based on the contextual information we will identify whether the article is
genuine or not.

We have explored the attributes of the dataset; now let's import the modules, all in a single go: import pandas as pd, import numpy as np, import matplotlib.pyplot as plt, and seaborn, which are the basic visualization modules. We are going to deal with text, so we also need regular expressions, import re, and import nltk, the Natural Language Toolkit. We don't want extra warnings, so apart from the %matplotlib inline magic we call warnings.filterwarnings('ignore'). I made a small typo there at first (it is matplotlib), but now the imports are done.

Next, let's load the dataset: df = pd.read_csv('train.csv'). I'm going to use only the train data. If you want to test, you have to apply the exact same pre-processing steps to the test file and then do the testing, or you can concatenate both train and test, do the pre-processing, and predict with the model at the end; you can use it either way. Displaying the head, you can see the title contains a few words, the text contains the whole article, the label is 1 for unreliable and 0 for reliable, and we also have the author. Let's also check the lengths of the title and the text: len(df['title'][0]) is small, while the text of the first article is clearly much larger, I'd guess more than 500 words, and it also has newlines, brackets and commas, so we have to keep those in mind. We won't filter out the numbers, since they carry meaningful information, but we will be filtering all the punctuation, such as the apostrophes, in a single go. This text is what we will use as input.

Now, the length of the data frame is about 20,800, and some of the text entries are missing. We can't proceed like that; without text there is nothing to process, so we have to drop the null values. Dropping the entire row when it has any null, df.dropna(axis=0), where axis=0 means row-wise, leaves only about 18,285 rows. That is far too few: the author column has so many nulls that it drags whole rows down with it. But we don't need id, title or author anyway, so let's drop the unnecessary columns first: df = df.drop(['id', 'title', 'author'], axis=1), where axis=1 means column-wise. Checking the info and dropping nulls again now leaves about 20,761 samples, which is reasonable.

Now for the other pre-processing: removing special characters and punctuation. Before that, let's see how the processing goes step by step. df['clean_news'] = df['text'].str.lower() lowercases everything, and displaying clean_news confirms all characters are now lowercase. That is the first step; the second step will be removing the special characters while keeping the numbers.
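The column-then-row dropping order described above can be sketched with pandas. The four-row frame here is a hypothetical stand-in for train.csv, which has the same columns (id, title, author, text, label):

```python
import pandas as pd

# Hypothetical stand-in for train.csv: same columns, only four rows
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["A", None, "C", "D"],
    "author": ["x", "y", None, "w"],
    "text": ["first article", "second article", "third article", None],
    "label": [1, 0, 0, 1],
})

# Dropping the unused columns first means rows are not lost merely
# because the author or title field happens to be null
df = df.drop(["id", "title", "author"], axis=1)
df = df.dropna(axis=0)

print(df.shape)  # only the row with missing text is gone
```

If you call dropna before dropping the columns, the null authors also remove rows, which is exactly the roughly 20,800 to 18,285 collapse seen above.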
To remove the special characters I'll move all this code into a single line: df['clean_news'].str.replace() with a regex. I'm specifying a negated character class, everything except the lowercase letters a to z; this is a fairly standard operation. We already converted to lowercase, so the uppercase range isn't necessary, and we will keep the numbers too, so the ranges become [^a-z0-9...]. Should we replace matches with a space, or delete them completely? Deleting looks cleaner, but then the pattern would also eat the normal spaces and cause unnecessary confusion, so we definitely have to keep the whitespace by adding \s to the class: [^a-z0-9\s]. Running it like this removes all the special characters while single spaces are left alone, and you can clearly see it removed the apostrophe and kept the complete word, so it is much better. We have done this pre-processing in a single step: if you know regular expressions you can do it very easily like this, instead of writing a bigger function and pre-processing separately. There are still some double spaces left, though, and we also have to remove the escape characters: first, replace \n with an empty string.
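The whole cleaning chain from this section, lowercasing, stripping everything except letters, digits and whitespace, then collapsing newlines and repeated spaces, can be sketched on a one-row frame. The sample string is made up, and one small deviation: the video replaces \n with an empty string, while a space is used here so words split across lines don't merge:

```python
import pandas as pd

df = pd.DataFrame({"text": ["He's said: 100% FAKE!\nRead more...  now"]})

clean = df["text"].str.lower()
# Delete every character that is not a lowercase letter, digit or
# whitespace; keeping \s in the class preserves the word boundaries
clean = clean.str.replace(r"[^a-z0-9\s]", "", regex=True)
# Replace the escape character, then squeeze runs of whitespace
clean = clean.str.replace(r"\n", " ", regex=True)
clean = clean.str.replace(r"\s+", " ", regex=True).str.strip()

print(clean[0])  # -> "hes said 100 fake read more now"
```

Note that the digits survive ("100"), the punctuation is gone, and the apostrophe is deleted rather than spaced, so "he's" becomes the single token "hes", just as in the video.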
After that, I'm going to replace \s+ (one or more whitespace characters) with a single space, so the real spacing between words is retained. I'm just showing how it works step by step; displaying clean_news afterwards, it is much better, and the additional spaces are gone too. You can combine all of this into a single snippet, like one pre-processing function, which is what I've done here, so we won't keep a separate block for each pre-processing step.

After that we have to remove the stop words. In deep learning projects you can actually skip stop-word removal and techniques like stemming and lemmatization, because they won't impact the result that much: an LSTM is a sequence model and it tries to understand the whole context. Still, we'll remove the stop words in order to do the exploratory data analysis. First import them, from nltk.corpus import stopwords, then stop = stopwords.words('english'), so all the stop words of the English language are stored there. Based on that we remove them from clean_news with a one-line function: df['clean_news'] = df['clean_news'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop])). When you use apply plus lambda, x represents a single sample from the data frame; we split the whole article into single words, keep each word only if it is not present in the stop list, and concatenate the list back into a sentence. It iterates over all the words in the corpus, so it takes some time, but afterwards you can see all the unnecessary stop words are removed: this is the clean news, alongside the original text and the corresponding label.

Pre-processing done, let's jump into the exploratory data analysis. I'm going for a word cloud, a common exploration for text analysis; visualizing the frequent words is enough for data visualization in this project. First concatenate all the text into one big string, all_words = ' '.join(sentence for sentence in df['clean_news']), and import the module, from wordcloud import WordCloud (I'll move the import to the top with the others). Then build it with some default parameters I've used in a previous project, wordcloud = WordCloud(width=..., height=..., random_state=42, max_font_size=100).generate(all_words), and plot it: plt.figure(figsize=(15, 9)), plt.imshow(wordcloud, interpolation='bilinear'), plt.axis('off') and finally plt.show().

Running it, you can see the most frequently occurring words across all the articles: "said", "people", "one", "time", "United States", "Trump", "Hillary Clinton", a lot of data related to the election. Similarly we can filter the genuine and the fake news and compare their frequent words. For the genuine news we take df[df['label'] == 0] (0 is the genuine one, yes), and for the fake news the same thing with label 1. For the genuine news we get "American", "Trump", "said", "people", "Mr", "called", "United States", "President"; it's not obviously distinctive. In sentiment analysis you can clearly point to positive or negative words, but for fake news we can't do that. In the fake cloud, most of the frequent words are "Hillary Clinton", "people", "Trump", "Donald Trump", "time"; judging by the word sizes alone, most of the fake articles are about Hillary Clinton and Donald Trump. You can also dig deeper with some kind of n-gram processing: this cloud shows the frequency of individual words, but you can do the same for sequences of two or three words and find the frequent phrases. I'll leave that for you to search for and try with similar processing. With that, the exploratory data analysis is complete.
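The stop-word removal above can be sketched as follows. The tiny hardcoded stop set is a stand-in for nltk.corpus.stopwords.words("english"), so the sketch runs without the NLTK corpus download, and the two articles are invented:

```python
import pandas as pd

# Tiny stand-in for nltk.corpus.stopwords.words("english")
stop = {"the", "is", "a", "of", "and", "in", "on"}

df = pd.DataFrame({"clean_news": [
    "the election is a test of the system",
    "fake stories spread fast on the internet",
]})

# apply + lambda: x is one article; keep only words not in the stop list
df["clean_news"] = df["clean_news"].apply(
    lambda x: " ".join([word for word in x.split() if word not in stop])
)

print(df["clean_news"].tolist())
# -> ['election test system', 'fake stories spread fast internet']
```

With the real NLTK list the mechanics are identical, only the set is much larger.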
Now let's move on to the input split, which needs some pre-processing of its own: tokenization of the text and padding of the sequences. This is mandatory for a deep learning project, and we are also going to use a word embedding model, which I'll explain in detail. So before the input split let me create a new section, "create word embeddings"; that is more appropriate.

First, tokenize the text. We have to import the tokenizer, from keras.preprocessing.text import Tokenizer, and check whether it works; it imports fine. Apart from that we need one more module, which I'll put in a separate cell: from keras.preprocessing.sequence import pad_sequences. Now initialize and fit: tokenizer = Tokenizer(), then tokenizer.fit_on_texts(df['clean_news']), and after that word_index = tokenizer.word_index, which is like a dictionary for us, mapping every unique word in the whole corpus (the whole data) to an index. Based on it we get the vocabulary size, vocab_size = len(word_index), which will be useful for creating the model. Printing vocab_size gives around 200,000 words, so we have a big corpus.

After this we have to do the padding. Why do we need it? Padding is very important because each article has a different length: one article may have 200 words while another has around 300. To equalize the lengths we pad, which means adding pads to the
end of the text, mostly zeros. So, pad the data: sequences = tokenizer.texts_to_sequences(df['clean_news']) (better not to reuse the same name), then padded_seq = pad_sequences(sequences, maxlen=700, padding='post', truncating='post'). For maxlen it would really be better to compute the maximum article length and put it here; for now I'm setting 700 as a cap. padding='post' means the pads go after the end of the article (you can also pad before the article starts), and truncating='post' means anything beyond the cap is cut off. Some articles may contain a thousand words, and I don't think my machine is capable of training on that much data, but we also don't need to process the whole article to find whether the news is fake, so a cap around 600 is good too. Let's run it; this will take some time.

After that we have to create the word-embedding matrix, which is important because we will hand it to the model. Before that you have to download a file called GloVe, glove.6B.100d.txt, which contains 100-dimensional vectors for words, pretty much all the words in the dictionary. You can go for a higher dimension if you have a lot of data; we have only around 20,000 samples. I've already downloaded the GloVe embedding file; you can download it using the link in the description, or you can also search for it on Google.
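In the video this is done with Keras. As a concept check, here is a minimal pure-Python sketch (with a made-up three-document corpus) of roughly what fit_on_texts, texts_to_sequences and pad_sequences(..., padding='post', truncating='post') produce; it is not the Keras implementation:

```python
from collections import Counter

corpus = ["fake news spreads fast",
          "real news travels slowly",
          "fake fake claims"]

# fit_on_texts: index words by frequency; 0 is reserved for padding
counts = Counter(w for doc in corpus for w in doc.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
vocab_size = len(word_index)

# texts_to_sequences: each article becomes a list of integer indices
sequences = [[word_index[w] for w in doc.split()] for doc in corpus]

# pad_sequences(maxlen=5, padding="post", truncating="post"):
# append zeros after the text, then cut everything past maxlen
maxlen = 5
padded = [(seq + [0] * maxlen)[:maxlen] for seq in sequences]

print(vocab_size, padded)
```

The most frequent word ("fake") gets index 1, every sequence ends up exactly maxlen long, and the zeros sit after the real tokens, which is exactly the shape the embedding layer will later consume.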
With open('glove.6B.100d.txt', encoding='utf8') as f, we loop for line in f, and first let's explore what is inside so it's easier to follow; I'll add a break statement so it doesn't run over all the lines, which would take a long time. values = line.split(), and printing values shows the structure: the first entry is the word, and the rest is its vector representation, the 100 values. (One note before proceeding: I faced an error while opening the GloVe file; you have to pass encoding='utf8' or it shows a character-map codec error.)

Now build the lookup, a dictionary, for each line: word = values[0], then coefficients = np.asarray(values[1:], dtype='float32'), taking all the values from the first index onward, and store it with the word as the key. Printing len(coefficients) confirms the 100 dimensions. I initially called this dictionary embedding_matrix, but sorry, this is the embedding index; only after it is built do we create the embedding matrix itself, and only for the words that are included in our dataset. I'll run the index creation separately because it is a big process; it completes without any error.

Then embedding_matrix = np.zeros((vocab_size + 1, 100)): vocab_size + 1, the size of our vocabulary plus one, is the row count, and 100 dimensions is the column count. After that, for word, i in word_index.items(), that is, over all the words in our dataset from the word index we already created, we look up embedding_vector = embedding_index.get(word) and, if embedding_vector is not None, set embedding_matrix[i] = embedding_vector. If a particular word from our corpus is not present in the embedding index, get() returns None, and at that time we don't add it to the embedding matrix; its row stays zeros. One catch: I first wrote the condition as != None and it showed an error, because we are dealing with NumPy arrays; is not None is the correct check. With that fixed it completes. Displaying the matrix, the first row (index 0, the padding index) is all zeros, and from index 1 onward we have the vectors for the words present in df['clean_news']. The embedding matrix has been created.

Now we have to split the data: padded_seq is our input and df['label'] is the output. Before splitting, look at padded_seq[0]: these are the padded values, and each number is an index; every number represents a particular word.
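Putting the two loops together: parse GloVe lines into an embedding index, then fill one embedding-matrix row per word of our own word index. The two "GloVe" lines and the three-word word_index here are invented 3-dimensional stand-ins for glove.6B.100d.txt and the real tokenizer output, and "xyzzy" deliberately has no vector; note the `is not None` check, since comparing an ndarray with `!= None` is what raised the error in the video:

```python
import numpy as np

# Invented stand-ins for lines of glove.6B.100d.txt (3-d, not 100-d)
glove_lines = ["news 0.1 0.2 0.3",
               "fake -0.5 0.0 0.9"]

# Embedding index: word -> vector
embedding_index = {}
for line in glove_lines:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype="float32")
    embedding_index[word] = coefs

# Hypothetical tokenizer word index; "xyzzy" has no GloVe vector
word_index = {"news": 1, "fake": 2, "xyzzy": 3}
dim = 3

# One zero row per index plus the padding row 0; fill what GloVe covers
embedding_matrix = np.zeros((len(word_index) + 1, dim))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:  # `!= None` fails on arrays
        embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)  # (4, 3); row 3 stays all zeros
```

Words missing from GloVe simply keep their zero row, which the frozen embedding layer will then serve as an all-zero vector.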
Looking at a few more samples, all of them carry a lot of trailing zeros, so I'll just reduce the padded sequence length to 500 and run it again; now it's reasonable and we save some space.

Now we have to split the data; let's use train_test_split and hope this works: from sklearn.model_selection import train_test_split, then X_train, X_test, y_train, y_test = train_test_split(padded_seq, df['label'], test_size=0.20, random_state=42, stratify=df['label']). It works: X_train[0] shows the array and y_train[0] the label, so our input split and the train/test split are both done.

Now we have to create the model. First import a few things: from keras.layers import LSTM, Dropout, Dense, and we also have to import Embedding; then from keras.models import Sequential. Now model = Sequential([...]), and inside the list we add all the layers. The first layer is Embedding(vocab_size + 1, 100, weights=[embedding_matrix], trainable=False): vocab_size + 1 is the input size, 100 is the dimension, the weights come from our embedding matrix, and trainable=False freezes them. The second layer is a Dropout(0.2); you can specify the dropout anywhere from 0.1 to 0.3 as a reasonable range, and sometimes people use 0.5, depending on the analysis or the use case. After the dropout I use the LSTM layers: I first thought of 64 units, but I'll specify both as 128, which is better, with return_sequences=True on the first one so the second LSTM receives the full sequence. Then another Dropout(0.2), a Dense(512) layer, a Dropout(0.2), a Dense(256) layer, and finally the output we need: Dense(1, activation='sigmoid'). That is our whole model; running it, the model is created.

Let's compile it. In model.compile we have to specify the loss, the optimizer and the metrics: loss='binary_crossentropy', because this is binary classification, optimizer='adam', and metrics=['accuracy']. Then model.summary() shows all the layers and how many parameters there are, trainable and non-trainable: the embedding matrix is by far the biggest piece, with so many parameters, but it is non-trainable, and apart from it we have only a few layers. That's good.

Now train the model: history = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test)). If you have a GPU of reasonable size you can go for a batch size of 256 or 512; the batch size decides how much the training process speeds up. These are the basic parameters; you can also fine-tune by specifying the learning rate and the other options. Let's run it and hope it goes without any error, because if you don't do the earlier processing correctly it will definitely error out.

You can see it trains very slowly: whenever we deal with text data we need a lot of computation power, because we process the samples word by word. We have around 20,000 samples here, and when you go for real text datasets they easily run to hundreds of thousands, so you should have a decent GPU for all of this. The training accuracy reaches around 61%, but the validation metrics are not showing, so I stop it and figure out the error: I had passed validation_data as a list, and it has to be a tuple. Changing it to the tuple (X_test, y_test) makes it work; it was just a parameter mistake. Now we get a training accuracy of about 64% and a validation accuracy of about 69%, and I'll let it run for all 10 epochs; it should reach a decent accuracy by the end.
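Gathering the model section into one runnable sketch. Assumptions to note: TensorFlow's bundled Keras is used; the tiny vocab_size, embedding_matrix and random X/y are stand-ins for the real 200,000-word vocabulary, GloVe matrix and 500-wide padded sequences; Constant(embedding_matrix) stands in for the video's weights=[embedding_matrix] argument, which newer Keras versions no longer accept; and validation_data is the tuple form that fixed the error above:

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

# Tiny hypothetical stand-ins so the sketch runs quickly
vocab_size, dim, maxlen = 50, 8, 20
embedding_matrix = np.random.rand(vocab_size + 1, dim)
X = np.random.randint(0, vocab_size + 1, size=(64, maxlen))
y = np.random.randint(0, 2, size=(64,))

model = Sequential([
    # frozen pre-trained embeddings (weights=[...] in the video)
    Embedding(vocab_size + 1, dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    Dropout(0.2),
    LSTM(128, return_sequences=True),  # pass the full sequence onward
    LSTM(128),                         # second LSTM keeps the final state
    Dropout(0.2),
    Dense(512),
    Dropout(0.2),
    Dense(256),
    Dense(1, activation="sigmoid"),    # 1 = fake, 0 = genuine
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# validation_data must be a tuple, not a list
history = model.fit(X[:48], y[:48], epochs=1, batch_size=16,
                    validation_data=(X[48:], y[48:]), verbose=0)
print(sorted(history.history))
```

With the tuple in place, history.history carries val_loss and val_accuracy alongside the training metrics, which is what the plotting step below relies on.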
In the meantime, let's plot the graph. Here I could get a graph for both the loss and the accuracy of training and validation, but plotting how the accuracy score improves is enough for us. To visualize the results: plt.plot(history.history['accuracy']) for the training accuracy, then the same line again with history.history['val_accuracy'] for the validation accuracy. You can specify a title, but it's not necessary; plt.xlabel('epochs'), plt.ylabel('accuracy'), and then specify the legend and show the plot with plt.legend(['train accuracy', 'validation accuracy']) and plt.show().
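The accuracy plot at the end can be sketched as below. The history dict is a hypothetical stand-in for the history.history returned by model.fit (the numbers are invented, loosely echoing the 0.61/0.69 first-epoch figures mentioned above), and the Agg backend plus savefig just make the sketch runnable without a display; in a notebook you would call plt.show() instead:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical stand-in for history.history from model.fit
history = {"accuracy":     [0.61, 0.64, 0.71, 0.78],
           "val_accuracy": [0.69, 0.72, 0.74, 0.75]}

plt.plot(history["accuracy"])
plt.plot(history["val_accuracy"])
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.legend(["train accuracy", "validation accuracy"])
plt.savefig("accuracy.png")  # plt.show() in a notebook
```

If the two curves diverge, training accuracy climbing while validation accuracy stalls, that gap is the usual sign of overfitting to watch for over the 10 epochs.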