Hello everyone, and welcome to this lecture in the Build Large Language Models from Scratch series. In today's lecture we are going to look at one specific component of the large language model architecture: layer normalization.

Before that, let me quickly recap what we covered in the last lecture. There we took a bird's-eye view of what the LLM architecture looks like. The architecture consists of several components arranged together, and in the previous lecture we constructed the GPT backbone, a dummy class that will hold all of these components. The components are layer normalization, the GELU activation, the feed-forward neural network, and shortcut connections. These four components make up the Transformer block, which is one of the key parts of the entire GPT architecture.

Let me show you how this looks. If you zoom into the Transformer block, you'll see that it has a number of layers stacked together: a layer normalization, followed by the masked multi-head attention layer, followed by a dropout layer and a shortcut connection; then another layer normalization, a feed-forward neural network, another dropout layer, and another shortcut connection. If you zoom into the feed-forward neural network, you'll see that it consists of linear layers along with the GELU activation function.

All of these are stacked together, and at the broadest level the picture looks something like this. You have the input text, which you tokenize and then pass into the GPT model. In the GPT model we first do the token embedding, then add the positional embedding vectors, and pass these input embeddings to the Transformer block. Within the Transformer block the input goes through all the layers I just showed you, and then we have the output layers, which essentially decode the
output from the Transformer block, and we predict the next word from the input. This is what happens inside the GPT architecture, and we are going to start looking at every single component of it.

Today we are going to look at the component called layer normalization. It actually comes up at multiple places within the Transformer block itself: there is a layer normalization before the multi-head attention, and there is another layer normalization after the multi-head attention. It also appears outside of the Transformer block. That's why it's important to define a separate class for layer normalization when we code the GPT architecture.

In the last lecture we looked at the dummy GPT model class, and the goal there was to put all of the different components together so that you get a bird's-eye view without really coding too many of them. In this GPT model class there is a forward method that takes in the input. We convert the input into token embeddings, add the positional embeddings to the token embeddings, and apply dropout to the resulting embedding vectors. We then pass them through the Transformer blocks, and even after the result comes out of the Transformer blocks we have a normalization layer. As I showed you, within the Transformer block itself there is a normalization layer before the attention module and one after it, and after we come out of the Transformer block there is another normalization layer. Then there are output processing steps, and the output is returned.

To execute this forward method we need to define several things. When an instance of the dummy GPT model class is created, we define the Transformer blocks, and for that we need a separate class called dummy Transformer block, which will ultimately
become just the Transformer block once we learn what happens inside the Transformer. We also create a separate class for layer normalization. Why a separate class? Because layer normalization does not happen only inside the Transformer block; otherwise we would have just written the procedure there. As you can see, even after the input comes out of the Transformer block we pass it through another layer normalization, so it makes sense to define a separate class for it.

In today's lecture we are going to dive deep into this class: how to create it, what exactly layer normalization does, and why we need it in the first place. So let's get started.

The reason layer normalization comes into the picture is this. When we look at the Transformer block, a number of layers are stacked on top of each other, so there is a huge number of parameters coming from these layers, and we have to train them. Initially we set these parameters to random values, and through backpropagation we train them so that the next word is predicted correctly. We need to make this training process more efficient, and that's where layer normalization becomes very important.

It turns out that training deep neural networks with many layers can be challenging, predominantly because of two things: it can lead to a vanishing gradient problem, or it can lead to an exploding gradient problem, and both lead to unstable training dynamics. So the first advantage of layer normalization is that it improves the stability and efficiency of neural network training.

Let me zoom into this further. Consider a deep neural network like this one: you have an input layer, then multiple hidden layers stacked after it, and then the output layer. What happens is that when you do a forward
pass (initially with randomly initialized weights), you get the outputs, and then you do a backward pass in which you compute the gradients and use them to update the parameters. So for every one of these hidden layers you will have gradients of the loss with respect to its parameters. Think of every layer as having a certain set of gradient values, which get recomputed at every iteration.

Since we are doing the backward pass, suppose we want to find the gradients of this layer. They depend a lot on the output of this layer. Let me show it with a different color: I'm marking the outputs of this layer in red, and if you want to find the gradients of the first layer, they depend on these red outputs, because we are going backward.

Now it turns out that if this layer output, the one shown in red, is too large or too small, that affects the gradients: the gradient magnitudes can themselves become too large or too small.

Think about what happens if gradient magnitudes become too large. When you backpropagate, you are essentially multiplying different gradients together. So if the gradient at this layer becomes too large, then by the time we reach the first layer the gradient will have exploded to a very large value. That is called gradient explosion. On the other hand, if the gradient of the last layer or of one of the intermediate layers is very small, then as you propagate backward, by the time you reach the first or second layer the gradient will have become very small. What happens when the gradient becomes very small? We will not
update the parameters, because the parameter updates depend on the gradient magnitude. If the gradient is too small, learning stagnates; if the gradient is too large, the learning procedure becomes unstable. So both small and large gradients lead to unstable training dynamics, and we do not want that.

One reason for unstable training dynamics is that the layer outputs themselves are very large or very small. As I mentioned, layer outputs affect the gradient values, so if we control the magnitude of the layer outputs, we can ensure that the gradient magnitudes do not become too large or too small. Layer normalization helps with this: it keeps the outputs of the layers within a certain range and prevents their magnitude from being too large or too small. That keeps the gradients stable, which leads to stable training dynamics. (I should be careful to call this layer normalization and not batch normalization. Batch normalization is something different, and we'll come to that later. Today we are only looking at layer normalization, in which the outputs coming from every layer are normalized.)

The second advantage of layer normalization is that it helps with a problem called internal covariate shift. As training proceeds, the inputs to every layer can change. Say we look at the second layer: in the first training iteration its inputs may have a distribution like this, and in the second training iteration the inputs may be skewed. So the input distribution that every layer receives can change from iteration to iteration, and that makes training very difficult. If the input distribution to every layer keeps changing, updating the weights becomes very hard, and that delays the convergence of the parameters, and
that delays the overall solution from reaching an optimal value. We don't want that, and layer normalization really helps to prevent it. Since we are normalizing, which as we'll see means the variance (and therefore the standard deviation) is kept at one, the mean and standard deviation of the output from every layer are fixed. This reduces the problem of internal covariate shift, which accelerates convergence.

So there are two main reasons why layer normalization is employed. The first is that it keeps the training procedure stable by preventing the vanishing gradient or exploding gradient problem. The second is that it reduces the problem of internal covariate shift, which accelerates convergence, so we get to a result far faster. That's why layer normalization is used not just in the GPT or LLM architecture that we are looking at right now, but very frequently in many deep learning architectures.

So what exactly is the main idea of layer normalization? It's very simple. We look at a specific layer, take the outputs of that layer, and adjust them so that they have a mean of zero and a variance of one.

Let me illustrate this through a simple example. Say you are looking at a neural network, and these four values are the outputs from one specific layer: x1 = 1.1, x2 = 0.8, x3 = 2.3, and x4 = 4.4.

In layer normalization you first compute two quantities. The first is the mean:

    μ = (x1 + x2 + x3 + x4) / 4 = 2.15

The second is the variance. We divide by 4 because there are four quantities here:

    σ² = (1/4) [ (x1 - μ)² + (x2 - μ)² + (x3 - μ)² + (x4 - μ)² ]

Then we perform the normalization itself: from every value we subtract the mean and divide by the square root of the variance, which is the standard deviation. So each xi is replaced by

    (xi - μ) / sqrt(σ²),   for i = 1, 2, 3, 4

When you do this normalization, you get these four values. Do you notice something about them? If you add them together, the numerator is x1 + x2 + x3 + x4 - 4μ, which is zero, so the mean of the normalized values is zero. And if you compute the variance of the normalized values with the same formula, you'll see that it is equal to one.

That's the most important thing to understand: after performing the normalization procedure, the values you get have a mean of zero and a variance of one. That's the whole idea behind normalization. We adjust the outputs of every layer of the neural network to have a mean of zero and a variance of one, and it turns out that this simple procedure helps stabilize neural network training and also helps reduce the problem of internal covariate shift.
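The arithmetic in the four-value example above can be checked with a few lines of plain Python. This is a minimal sketch rather than the lecture's own code; the variable names (outputs, normalized, and so on) are chosen here purely for illustration.

```python
import math

# The four layer outputs from the whiteboard example
outputs = [1.1, 0.8, 2.3, 4.4]
n = len(outputs)

# Step 1: mean of the outputs (approximately 2.15)
mean = sum(outputs) / n

# Step 2: variance, dividing by n as in the whiteboard formula (approximately 2.0025)
var = sum((x - mean) ** 2 for x in outputs) / n

# Step 3: subtract the mean and divide by the standard deviation
std = math.sqrt(var)
normalized = [(x - mean) / std for x in outputs]

# Check: the normalized values have mean ~0 and variance ~1
norm_mean = sum(normalized) / n
norm_var = sum((z - norm_mean) ** 2 for z in normalized) / n

print("mean:", mean)
print("variance:", var)
print("normalized:", normalized)
print("mean after normalization:", norm_mean)
print("variance after normalization:", norm_var)
```

The mean after normalization will typically print as a tiny nonzero number rather than exactly zero, which is ordinary floating-point rounding.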
Before we see this in code, I want to tell you where layer normalization is used. We discussed this at the start of the lecture: the input is converted into token embeddings, then we add the positional embeddings, and then, before going into the multi-head attention, we have a layer normalization layer, so that the inputs to the multi-head attention are normalized. After the multi-head attention there is another layer normalization layer, before feeding into the feed-forward neural network module within the Transformer block. Remember, this blue block here is the Transformer block, so layer normalization appears two times inside it, and then it appears once more outside the Transformer block.

So in GPT and modern Transformer architectures, layer normalization is typically applied before and after the multi-head attention module, as we have seen here, and it also appears once before the final output layer. We saw this when we coded the model in the last lecture: within the Transformer block, layer normalization appears twice, before and after the multi-head attention, and we employ it once more before the output. Overall it appears three times, and that's why we need to define a separate class for it.

This figure illustrates the procedure of layer normalization. Say we have a neural network layer; these are its five inputs, and these are its six outputs. Without normalization, you'll see that the mean of the outputs is not equal to zero and their variance is not equal to one. Now we perform normalization on these layer outputs, which means that from every output we subtract the mean and divide by the square root of the variance. You get these as the resulting values after applying layer normalization, and if you take the mean of these values you'll find that the mean is equal to zero and the
variance of these values is equal to one. This simple illustration describes what happens under the hood in layer normalization.

Now I'm going to take you through the code. We are going to implement layer normalization first on a neural network that looks like this, and then we are going to create a separate layer normalization class that we can integrate into our GPT architecture. So let's jump into the code.

Here's the code file for layer normalization. First we'll start with a simple example to illustrate how layer normalization is implemented in practice, and then we'll fill out the layer normalization class that we created in the previous lecture.

Let me take you to the whiteboard to show what we are doing. We have a simple neural network layer, and the setup is constructed such that we have two batches of inputs (strictly speaking, two examples in one batch): batch number one and batch number two. Each has five inputs, so batch one has x1, x2, x3, x4, x5, and batch two has its own x1, x2, x3, x4, x5. We are looking at one layer of neurons with six neurons in it. When the inputs pass through this layer, outputs are produced: six outputs for batch one, y1 through y6, and six outputs for batch two, y1 through y6.

The shape of the inputs is two rows and five columns, because we have two batches and each batch has five inputs. Then we have a sequential layer that takes in five inputs and has six outputs. This is the layer of six neurons I showed you on the whiteboard, and after every neuron we have the ReLU activation function.
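The setup just described can be sketched in PyTorch as follows. This is a sketch under assumptions: the random seed, the use of torch.randn for the inputs, and the variable names batch_example, layer, and out are illustrative choices, not necessarily the ones in the lecture's notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)  # assumed seed, only so that reruns give the same numbers

# Two input rows ("batch one" and "batch two"), five inputs each
batch_example = torch.randn(2, 5)

# One layer of six neurons (5 inputs -> 6 outputs), followed by ReLU
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())

out = layer(batch_example)
print(out)
print(out.shape)  # torch.Size([2, 6]): one row of six outputs per input row
```

Because of the ReLU, every entry of out is zero or positive, so the per-row mean of the raw outputs cannot be negative.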
This is mentioned over here in the code. If you don't know what the ReLU activation function is, that's fine; we don't need to understand it for this lecture. The layer is then applied to the input batch, and we get the output.

Can you see why the output has this shape? We have two rows and six columns. The first row holds the six outputs from the first batch, and the second row holds the six outputs from the second batch. That's exactly what is shown on the whiteboard: batch one has y1, y2, up to y6, and batch two has y1, y2, up to y6. So in the first row this is y1, this is y2, this is y3, and this is y6 for batch one, and in the second row this is y1, this is y2, and this is y6 for batch two.

So this is the layer we have, and after this layer we are going to apply layer normalization. Here I have simply noted that we have a neural network consisting of a linear layer followed by the ReLU activation. To quickly illustrate what the ReLU activation function looks like, take a look at this image: if x is positive, ReLU is just y = x, and if x is negative, ReLU is zero. So there is a nonlinearity here.

Now, as I mentioned, we are going to apply layer normalization. The way it is applied is very similar to the hands-on example we saw on the whiteboard: we apply the same thing to the first batch, and the same thing to the second batch. For the first batch, y1 is replaced by (y1 - μ) / sqrt(σ²), y2 is replaced by (y2 - μ) / sqrt(σ²), and, similarly,
the last output, y6, is replaced by (y6 - μ) / sqrt(σ²). First we process the first batch, and then we process the second batch in exactly the same manner. Now let's go to the code to see how this is done.

We have the output, which is this tensor, and we call output.mean with dim=-1 and keepdim=True. Why dim=-1? Because we have to take the mean along the columns, that is, across the six outputs within each row. The other argument, keepdim=True, is very important: if we don't include it, the returned mean would be a one-dimensional vector with two entries instead of a 2×1 matrix. With keepdim=True, the result for the mean looks like this: the first value is the mean of the first batch, the second value is the mean of the second batch, and the shape is a 2×1 matrix. Without keepdim=True it would not be a matrix but just a vector with two entries, and that's generally not good, because we want the dimensions to be preserved as we do all of these calculations.

Similarly, for the variance we take the variance across the columns for both batches, again with keepdim=True. Then we print the mean and the variance for every batch. For the first batch the mean is 0.1324 and the variance is 0.0231; for the second batch the mean is 0.2170 and the variance is 0.0398.

So remember the two arguments we used here: dim=-1, because we have to perform the operation along the columns, and keepdim=True, because we have to retain the dimensions of the resulting mean and variance tensors. If we did not use keepdim=True, then later, when we subtract the mean from every individual element, it would lead to problems, and we want to avoid that. In the text here I have explained why we used keepdim=True and dim=-1, so if you have any confusion along those lines, please read it when I share this Jupyter notebook with you.

Now we subtract the mean and divide by the square root of the variance, just like on the whiteboard. We have the output matrix; the mean, as you can see, is a tensor with two rows and one column; we subtract the mean from the output and divide by the square root of the variance. This is the main normalization step. The normalized layer outputs look like this: the first row holds the normalized outputs of batch one, and the second row holds the normalized outputs of batch two.

Here I'm printing the mean and variance of batch one and batch two after normalization. The mean of the first batch is on the order of 10^-8, which is really very close to zero, so we can approximate it as zero. For the second batch the mean is again on the order of 10^-8, so that's also almost equal to zero. And if you look at the variance for both batches, you'll see that the variance
is equal to one. Awesome, this is exactly what we wanted: the layer outputs have now been normalized.

Note that a value like 2.9e-08 is scientific notation for 2.9 × 10^-8. This value is very close to zero but not exactly zero, because of small numerical errors from finite floating-point precision. Currently scientific notation is on in the print settings, which is why we are seeing the values in this form. We can turn scientific notation off and print the mean and the variance again, and then you'll see that the mean for both batches is shown as zero and the variance for both batches as one.

Now we'll achieve the goal we started this lecture with: creating a class for layer normalization. What does this class do? It takes in the output of a layer and applies the normalization to it.

Let's look at where the layer normalization step is implemented. It is implemented here, and it is implemented here. At both of these places, the input received by the block has a certain number of tokens, say the context size, which is the number of embedding vectors we have. But the main thing I want to point out is that the number of columns is equal to the embedding size, that is, the embedding vector dimension we are using. For GPT-2 this embedding size is 768.

So when we look at, say, the first row, it corresponds to the embedding of the first token that is input to this layer normalization. We take the mean and the variance along the column dimension, which has size 768, and then do the normalization. We do the same for every single row, both for the input to the first layer normalization and for the input to the second one.
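A minimal sketch of this row-wise normalization in PyTorch, assuming a small made-up input of 4 token embeddings with 768 dimensions each. The number of tokens, the random data, and all variable names are illustrative assumptions. The sketch also shows why keepdim=True matters for the subtraction step and how scientific notation can be switched off when printing.

```python
import torch

torch.manual_seed(0)  # assumed seed, for reproducibility only

num_tokens, emb_dim = 4, 768  # 768 is the GPT-2 embedding size; 4 tokens is arbitrary
x = torch.randn(num_tokens, emb_dim) * 3.0 + 5.0  # rows with nonzero mean, non-unit variance

# Mean and variance along the last (column) dimension, one value per row.
# keepdim=True keeps the shape [4, 1] so it broadcasts against [4, 768].
# Note: torch.var uses the unbiased estimator (divides by n - 1) by default.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True)

x_norm = (x - mean) / torch.sqrt(var)

# Turn off scientific notation so near-zero means print as 0.0000 instead of e-08 values
torch.set_printoptions(sci_mode=False)
print("mean per row:", x_norm.mean(dim=-1))
print("variance per row:", x_norm.var(dim=-1))

# Without keepdim=True the mean has shape [4], which cannot broadcast with [4, 768]
flat_mean = x.mean(dim=-1)
try:
    _ = x - flat_mean
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True
print("subtraction without keepdim failed:", broadcast_failed)
```

The same two calls work unchanged if the input has an extra leading batch dimension, because dim=-1 always refers to the embedding dimension.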
Now look at the layer normalization class. When an instance of this class is created, we have to pass in the embedding dimension. Why? Because, as we'll see, we are going to implement something called scale and shift. These are trainable parameters, and their size is governed by the embedding dimension, which is the size of each input vector to the layer norm module and also the size of each output vector. So the input to the layer norm module will have a certain number of rows, but it will have as many columns as the embedding dimension. The output of the layer normalization, norm_x, will have the same dimensions as the input, because normalization does not change dimensions, and we are going to adjust this output with the scale and shift; I'll come to that in a moment.

The main part of this layer normalization class is the forward method, which takes in the input I just described. When you think of the input, think of it as having a certain number of rows, but mostly focus on the number of columns, which is the embedding dimension; say each row has 768 embedding dimensions. In the first step we take the mean along the columns, exactly as we did in the earlier example, so just keep that example in mind. Then we take the variance along the columns, subtract the mean, and divide by the square root of the variance.

Note that we have added a small constant, epsilon, to the variance to prevent division by zero during normalization. We don't want to divide by the square root of zero, so we add a small variable here, which is called self.