Hello everyone, welcome to lecture 4 in this Building Large Language Models from Scratch series. In the previous lecture we took a look at the differences between the two stages of building an LLM: pre-training and fine-tuning. Pre-training involves training on a large, diverse data set, and fine-tuning is basically refinement by training on a narrower data set specific to a particular task or domain. If you have not seen the previous lecture, I highly encourage you to go through it so that there will be a

good flow between these different lectures. If you are watching today's lecture for the first time, no problem at all, welcome to this series; I have designed this lecture so that it's independently accessible and understandable. So let's get started. Today I'm very excited because today's topic is a basic introduction to Transformers. We are not going to go into the mathematical details or even the coding details of Transformers; we are just going to introduce the flavor of this concept: what does it really mean, what it did for large

language models, what is the history of Transformers in the context of GPT, and whether there are similarities or differences between LLMs and Transformers. People usually use the terms LLMs and Transformers interchangeably; when, if ever, should we do that? We are going to learn about all of these aspects. We are also going to look at a schematic of how Transformers generally work, and in doing so we'll understand the basics of a few terms like embedding, tokenization, etc. So let's get started with today's lecture. So the

secret sauce behind large language models, and the reason why LLMs are so popular, is this word called Transformers. Most of the modern large language models rely on this architecture, which is called the Transformer architecture. So what is a Transformer architecture? Essentially it's a deep neural network architecture which was introduced in a paper released in 2017. This paper is called "Attention Is All You Need", and if you go and search for this paper on Google Scholar right now, so let me do that just quickly, so if I go here

to Google Scholar and type "attention is all you need", let us check the number of citations which it has. It has more than 100,000 citations in just six to seven years. That's incredible, right? It's because this paper led to so many breakthroughs which happened later. The GPT architecture, which is the foundational building block of ChatGPT, originated from this paper. The GPT architecture is not exactly the same as the Transformer architecture proposed in this paper, but it is heavily based on it. So it's very important for us to understand what

this paper really did and what Transformers are. So I have opened this paper here so that you can see it's titled "Attention Is All You Need". You might be thinking, what is attention? It is actually a technical term which is related to how attention is used in our daily life. We'll be touching upon this briefly today and building some intuition behind attention. If you look at this paper, it's a 15-page paper, and this is the Transformer architecture which I'm talking about. Essentially it's a neural network architecture, and

there are so many things to unpack and explain here which we won't be doing today. We'll be doing that in subsequent lectures, because every aspect of this architecture will need a separate lecture; it's that detailed. Today we are just going to look at an overview. To go through this 15-page paper and really understand it will need at least 10 to 15 lectures, and this lecture can serve as an introduction, so it's very important for you all to understand this lecture clearly. The first thing which I want to

explain is that when this paper was proposed, it was actually proposed for translation tasks, which means converting one language into another language. Text completion, which is the predominant role of GPT, was not even in consideration here. They were mostly looking at English-to-French and English-to-German translation, and the Transformer mechanism they proposed led to big advancements on these tasks. Later it was discovered that using an architecture derived from this Transformer architecture we can do so many other things. So that's the

first thing to note: the original Transformer was developed for machine translation tasks, specifically to translate English text into German and French. Okay, now we are going to look at a schematic of the Transformer architecture. The schematic in the paper is fairly detailed, as you can see, so we have a toned-down version of it, which I have borrowed from the book on building LLMs from scratch by Sebastian Raschka, one of the best books on large language models. So let us look

at this schematic, first of all by zooming out. This is a simplified Transformer architecture. First I want to show you that there are eight steps over here; you can see these orange labels: step number one, two, three, four, five, six, seven and eight. If you understand these eight steps as an intuition, you will have understood the intuition of the Transformer architecture. So let's go through it step by step. As we saw, one of the main purposes of the original Transformer architecture was to convert English to

German, so this is the example which we have taken here. Let's look at step number one: this is the input text which is to be translated, and as we can all see, this input text is right now in the English language. The Transformer is designed so that at the end of the eight steps it will convert it into German, but there are a number of things which happen before that. Let's go to step number two. In step number two the input text is basically taken and

pre-processed. What pre-processing means is that there is a process which is called tokenization. Let's say we have a huge amount of input data; as we saw in the previous lecture, such Transformers are usually trained on huge amounts of data. Let's say the input data is in the form of documents, and documents have sentences, right? The entire sentence cannot be fed into the model as it is; the sentence needs to be broken down into simpler

words or tokens. This process is called tokenization. I have a simple schematic here. For now, for simplicity, you can imagine that one word is one token. This is not usually the case, one word is generally not equal to one token, but for understanding this class you can think of tokenizing as breaking down the sentence into individual words. So let's say this is the sentence: "fine tuning is fun for all". Tokenizing basically means breaking this down into individual pieces like "fine", "tun" and "ing", "is", "fun", "for", "all", and then

assigning an ID, a unique number, to each of these words. So basically we have taken the huge amount of data, broken it down into tokens or individual words, and assigned an ID to each token. This is the process of tokenization. So let's say you have English data from Reddit posts or from Wikipedia: you break it down into words and you collect individual subwords from each sentence in the data set. This is what usually happens in the pre-processing step. Then the next step after the

pre-processing step, step number three, is the encoder. This is one of the most important building blocks of the Transformer architecture. The pre-processed input text, that is the tokens, is passed to the encoder, and what the encoder implements is a process called vector embedding. Up till now we have seen that every sentence is broken down into individual words and those words are converted into numerical IDs, right?
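To make the tokenization step concrete, here is a minimal sketch in Python under the same simplification used in this lecture, one word equals one token. The function names `tokenize`, `build_vocab` and `encode` are made up for this illustration, and the vocabulary is built from a single sentence; real LLM tokenizers work on subwords and build their vocabulary from a whole corpus.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (simplification: one word = one token)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(tokens):
    """Assign a unique integer ID to each distinct token, in alphabetical order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


def encode(text, vocab):
    """Convert text into a list of token IDs using an existing vocabulary."""
    return [vocab[token] for token in tokenize(text)]


sentence = "Fine tuning is fun for all"
tokens = tokenize(sentence)
vocab = build_vocab(tokens)
token_ids = encode(sentence, vocab)

print(tokens)     # ['fine', 'tuning', 'is', 'fun', 'for', 'all']
print(token_ids)  # [1, 5, 4, 3, 2, 0]
```

Notice that the IDs are arbitrary: nothing in the numbers says that two tokens are similar in meaning.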

But the main problem is that we also need to encode the semantic meaning between the words, right? Take for example the words dog and puppy. With the tokenization method which I've shown you right now, unrelated IDs will be assigned to dog and puppy, but we need to encode somewhere the information that dog and puppy are actually related to each other. So can we somehow represent the tokens in a way which captures the semantic meaning between the words? That process is called

vector embedding. What is usually done in vector embedding is that words are taken and converted into vectorized representations. This figure illustrates it very simply. Let's say these are the words: king, man, woman, apple, banana, orange, football, golf and tennis. This is a two-dimensional vector embedding I'm showing; in it each of these words is converted into a vector, and the way these vectors are formed is as follows. King, man and woman are terms which are related to each other.
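The idea that related words get vectors that lie close together can be sketched with a few hand-picked two-dimensional vectors. The coordinates below are invented purely for illustration (real embeddings are learned by a neural network and have far more dimensions), and `distance` and `nearest` are hypothetical helper names.

```python
import math

# Hypothetical, hand-picked 2D coordinates; real embeddings are learned, not chosen.
embeddings = {
    "king": (0.90, 0.80), "man": (0.80, 0.90), "woman": (0.85, 0.95),
    "apple": (-0.90, 0.10), "banana": (-0.80, 0.20), "orange": (-0.85, 0.15),
    "football": (0.10, -0.90), "golf": (0.20, -0.80), "tennis": (0.15, -0.85),
}


def distance(word_a, word_b):
    """Euclidean distance between the vectors of two words."""
    return math.dist(embeddings[word_a], embeddings[word_b])


def nearest(word):
    """Return the other word whose vector lies closest to the given word."""
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: distance(word, w))


# Fruits sit closer to each other than a fruit sits to "king".
print(distance("apple", "banana") < distance("banana", "king"))
print(nearest("banana"), nearest("king"), nearest("golf"))
```

With these made-up coordinates, each word's nearest neighbour comes from its own group, which is exactly the property a good embedding is trained to have.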

Right, apple, banana and orange are related, all of them are fruits; football, golf and tennis are related, all of them are sports. So when you convert these words into individual vectors, if you see on the right-hand side, king, man and woman are closer together as vectors. If you look at the green circle here, football, golf and tennis are closer together. If you look at the red circle here, apple, banana and orange, all of them fruits, are closer together. So converting these words

into this kind of vector format is called vector embedding, and this is a difficult task. We cannot place vectors randomly, right? Apple and banana have to be close to each other; all fruits need to be closer to each other than, let's say, banana and king. So there is usually a detailed procedure for this, and neural networks are trained even for this vector embedding step. So that is the main purpose of the encoder: the main purpose of the encoder is actually to take in the input

text from the pre-processing, that is the tokens, and to convert those tokens into vector embeddings. So in step number four we have generated vector embeddings. On the left-hand side of the Transformer architecture, the final goal is to generate vector embeddings: if we have a large amount of data in the English language, we convert it into tokens and we convert the tokens into vectors. That is done not in a two-dimensional space but in a space with maybe 500 or 1,000 dimensions, a huge number of dimensions which we

cannot even imagine, but it is done in such a way that the semantic meaning between the words is captured. That is how the embedding vectors should be formed. Here is another example which visually shows you how the embedding is done. Let's say you have text from documents: that text is converted into token IDs, and those token IDs are converted into vector format like this. This is a three-dimensional vectorized representation, so we can visualize it. Another nice visualization is this one, where we take in the data, put

it into the embedding model, and vectorized embeddings are the result. So that's the first part of the Transformer architecture. You can view it as a left side and a right side: on the left side, in these four steps, we take the input sentences and the final goal is to convert them into vector embeddings so that the semantic meaning between the words is captured. Okay, now what do we do with these embeddings? We feed them to the right-hand side. Look at this arrow here: these embeddings are fed to what

is called the decoder. So let's come to the right-hand side of things, step number five. This is the German translation which our model will be producing, and remember, the model completes one word at a time. "This is an example" is the input, and up till now let's say the model has translated it to "Das ist ein". This is not the complete translation, because the translation of "example" is not yet included, right? So this can be called the partial output text. Remember, this is available to

the model because the model generates only one output word at a time. So by the time we reach the fourth output word, which is the translation of "example", we already have the translated words for "this", "is" and "an", and they are available to the model. This is one of the key features of the Transformer and even the GPT architecture: one output word is produced at a time. So the model has the partial output text "Das ist ein", and even this text is converted into

tokenized IDs, as we saw; this is the pre-processing step. Then this is fed to the decoder. The job of the decoder is basically to do the final translation. Now remember, along with this partial output text the decoder also receives the vector embeddings from the left-hand side of things. So the task of the decoder is this: it has received the vector embeddings, it has received the partial text, and it has to predict what the next word is going to be based on this

information. Then we slowly go to the output layer, and finally you will see that the translation of "example" is completed over here; this is "Beispiel". I don't know how to pronounce it, my German is not that good, and I've not even learned German in the first place, but here you can see the German translation of "example" which the decoder has produced. So step number seven is for the decoder to generate the translated text one word at

a time, and step number eight is that we get the final output. This is how the decoder actually translates the input into the output, one word at a time; that is very important. Now you might be thinking, how does the decoder learn to translate? Remember, it's like a neural network and we are training it. Initially it will make mistakes, of course, but there will be a loss function, and we will eventually train the Transformer to be better and better. So think of this as a

neural network. Let me show you the actual schematic of the Transformer. What we have seen right now is a simplified architecture, but if you see the actual schematic of the Transformer you'll see that there are feed-forward layers, which means there are weights and parameters which need to be optimized so that the decoder predicts the German word correctly. It's very similar to training a neural network, right? So these are the eight steps which are very important in the Transformer. Let me go through these eight steps in the simplified Transformer

architecture again. The first step is to have the input text which is to be translated: "This is an example". The second step is to pre-process the sentences by breaking them down into tokens and assigning a token ID to each token. The third step is to pass these token IDs into the encoder, which converts them into vector embeddings. This means that words are projected into a high-dimensional vector space in such a way that the semantic relationship or the semantic meaning between the

words is captured very clearly. This vector embedding is fed as an input to the decoder, but along with it the decoder also receives the partial output text. Remember, the decoder is decoding the English into German one word at a time. So for decoding "This is an example", it already has the decoded answer for "this is an", and now it wants to translate "example" into German. So it receives this partial output text, it receives the vector embedding, and it is trained to predict the next output word, which is

"Beispiel", the German for "example". This is how English is translated into German in a Transformer. So this is a very, very simplified explanation of how a Transformer works. We have not even covered attention here. You might be thinking, why is this paper titled "Attention Is All You Need"? There is a very specific reason for it. I just want you to not get intimidated or afraid by the Transformer, and that's why I'm showing you this simplified form right now. In the simplest form you can think of a Transformer as a neural

network, and you're optimizing parameters in a neural network; it's as simple as that. What many students do is try to understand the full architecture directly, and that leads to many issues because it's actually fairly complicated, and then they develop a fear of this subject. I wanted to avoid that, so I started with this simplified Transformer architecture. Okay, I hope you have understood until this point. I encourage you all to maybe pause here and think about what you have learned. Now let's go to the next part of the lecture. The Transformer architecture

predominantly consists of two main blocks: the first is the encoder block and the second is the decoder block. We saw both of these here: the encoder was over here and the decoder was over here. The main purpose of the encoder is to convert the input text into embedding vectors, and the main purpose of the decoder is to generate the output text from the embedding vectors and from the partial output which it has received. So the encoder and the decoder are the two key blocks of a Transformer architecture. Remember, the GPT architecture is

actually different from the original Transformer, because it came later, and it does not have the encoder; it only has the decoder. We'll come to that later. Right now remember that the original Transformer has both an encoder and a decoder. Now one key part of the Transformer architecture is this thing called the self-attention mechanism. So let's actually Control-F "attention" in the paper and see how many times it shows up: 97 times. And let's see how they have defined attention. Okay: attention mechanisms have

become an integral part of sequence modeling, allowing modeling of dependencies without regard to their distance in the input or output sequences. Remember this: the attention mechanism allows you to model the dependencies between different words regardless of how close together or how far apart the words are. That is one key thing to remember. And then: self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. This is a bit difficult to understand, so let me explain it to you the way I understood

it, on the whiteboard. What the self-attention mechanism basically does is allow the model to weigh the importance of different words and tokens relative to each other. Let's say you have several sentences: the first sentence is that Harry Potter is on station or platform number something, the second is that Harry Potter wants to board the train, and then there is a third sentence and a fourth sentence. When you are on the fourth sentence, to predict the next word in it, the context is very important, right? So you need

to know what the text was in sentence number one, sentence number two and sentence number three as well. Only then will you be able to really understand the fourth sentence and predict its next word. This is the meaning of long-range dependencies: if I'm predicting the next word in the fourth sentence, I need to know the importance of the previous words. How much attention should I give to each previous word? Which word that has come previously should receive more attention? Maybe Harry should receive more

attention, maybe platform or train should receive more attention. The self-attention mechanism allows you to capture long-range dependencies, so that the model can look far behind as well as at sentences closer to the current one to identify the next word. So the self-attention mechanism allows the model to weigh the importance of different words or tokens relative to each other; that is very important. Basically, if you are to predict the next word, the self-attention mechanism maintains an attention score which tells you which word should be

given more attention when you predict the next word. This is a key part of the intuition for all of you to think about. So let us look at the architecture and at the part where attention comes in. See: multi-head attention, masked multi-head attention. There are these blocks which are called attention blocks, and they make sure you capture long-range dependencies in the sentences. That's why this paper is actually called "Attention Is All You Need": because of the self-attention mechanism and the intuition behind attention which these folks

introduced. As I mentioned, they calculate attention scores; basically it's a matrix, and it tells you which words should be given more importance in relation to other words. For now just understand this intuition, so that later, when we come to the mathematics and coding of it, you are comfortable with this notion. I also want you to appreciate how beautiful this is. As humans we keep context in our mind pretty naturally; when we are reading a story we remember what was written on the

previous page. But for a model to do that is very difficult, and the self-attention mechanism actually allows the model to do that. It allows the model to capture long-range dependencies so that it makes the next-word prediction accurately. In ChatGPT, when you write an input, GPT actually gives attention to every sentence, right, and then it predicts what the next word could be. It doesn't just look at the sentence before the current one; it looks at all sentences, because maybe earlier words are more important. This is possible through the self-attention

mechanism. So that's the second key concept which I wanted to introduce. In the last part of the lecture we are going to look at the later variations of the Transformer architecture. The Transformer paper came out in 2017, right? The GPT models came out after that, and there's another set of models called BERT which also came out as later variations of the Transformer architecture. So there are two later variations which I want to discuss. The first is called BERT; its full form is

Bidirectional Encoder Representations from Transformers. No need to understand the meaning, that's just what BERT stands for; maybe you have heard this term but did not know the full form. The second is GPT models. Of course all of us have heard of ChatGPT, but the full form of GPT is Generative Pre-trained Transformer; pre-trained because it's a pre-trained or foundational model, which we saw in the previous lecture. So now you should start understanding these terms. Now you must be thinking, what's the

difference between BERT and GPT models? There's a big difference. The way BERT operates is that it predicts hidden words in a given sentence: you have a sentence, some words are masked randomly, and it tries to predict those masked or hidden words. That's what BERT does. What does GPT do? As we have all seen, it generates a new word. So there is a pretty big difference between how BERT works and how GPT works. Let's see a schematic. This is how BERT actually works. Let's say we have a sentence: this

is an ___ of how LLM ___ perform. So as input we have a text which is incomplete: BERT receives inputs where words are randomly masked during training, and masked means that we do not know these words. This is the input text; we do the same pre-processing steps as we saw above, converting it into token IDs, then we pass it to the encoder and do the embedding, the same thing as before, and then the main output is that we fill in the missing words.
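The random masking of the input described here can be sketched as follows. This is only a simplified illustration: the `[MASK]` placeholder string, the `mask_tokens` helper and its `mask_prob` and `seed` parameters are assumptions made for this example. It shows only how a masked training input and its hidden target words could be prepared, not how BERT predicts them, and real BERT pre-training is more involved.

```python
import random

MASK = "[MASK]"  # placeholder string, an assumption made for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens; return the masked list and a {position: hidden word} dict."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)       # the model sees only the placeholder
            labels[position] = token  # the word it must learn to recover
        else:
            masked.append(token)
    return masked, labels


tokens = "this is an example of how llm can perform".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)

print(masked)  # the same sentence with some words replaced by [MASK]
print(labels)  # which words were hidden, keyed by position
```

Because any position can be hidden, including the first one, a model trained this way has to use the words on both sides of each gap.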

So the BERT model realizes that the missing words are "example" and "can", and the final sentence is "This is an example of how LLM can perform". This is how BERT works. How GPT works is something completely different. Let's say the input which GPT receives is "This is an example of how LLM can ___". We receive incomplete text and we just have to predict the next word. The way GPT works is that it does the pre-processing like we saw, converts words to token IDs, and then

there is a decoder model; there is no encoder model. The decoder then predicts the last word, the word which we do not know: "perform". So the output is "This is an example of how LLM can perform". GPT models learn to generate one word at a time. Now this leads to big differences. The GPT model is left to right, right? All the information on the left is there, and we are only predicting the rightmost word, which is not known, whereas in BERT random words can be masked, so the model has

to pay attention to different parts of the sentence, and that's why BERT does very well in sentiment analysis. Here I have a text which answers why BERT is so good at sentiment analysis. BERT is called bidirectional, you'll see that the name is Bidirectional Encoder Representations, because it looks at the sentence from both directions, whereas in GPT we just look at the sentence from left to right. BERT looks at the sentence from both directions because even the first

word might be missing, so it has to look from the left side as well as the right side. By looking at the entire sentence from both directions, BERT can capture the nuances and relationships between words that are important for understanding meaning and context. For instance, a BERT model can differentiate between bank as a financial institution and bank as a river bank by looking at the surrounding words. So since it looks at the sentence from both sides, BERT can understand the nuances and relationships between different words in a sentence, and that's why

BERT is very commonly used in sentiment analysis. Even GPT can be used for sentiment analysis, but the speciality of BERT is sentiment analysis. Okay, so that's the difference between BERT and GPT. No one talks too much about BERT these days; all the hype is about GPT and ChatGPT, because it produces one word at a time, it can complete missing text, and it can also do sentiment analysis. But I just wanted to explain these differences to you so that you are aware of what BERT means and what GPT means. One

thing to note is that both of these have the word Transformer in their names, because they originated from the Transformer architecture which we just considered. But as you saw, the GPT model does not have an encoder; it only has a decoder. That's one key thing which I wanted to demonstrate here, whereas the BERT model only has the encoder. So remember these differences between BERT and GPT. Great. Now one thing which I would like to cover before we end the lecture is the difference between Transformers and LLMs. So are

they the same thing? When we say LLMs, can we also say Transformers? The key thing to note is that not all Transformers are LLMs: Transformers can also be used for other tasks like computer vision. So one thing which I would like to show you here is that Transformers are not only used for language tasks, they are also used for vision tasks such as image recognition, image classification, etc. Here is a website which I have pulled up. These are called Vision Transformers, ViT, and they can be used for various applications. So

here, see, the Vision Transformer is being used to detect a pothole on the road. Then there are a number of other important applications, such as classifying tumors as malignant or benign just from images. A number of people have also discussed the similarities and differences between convolutional neural networks and ViT. The page says ViT achieves remarkable results compared to CNNs while requiring substantially fewer computational resources for pre-training, and that in comparison to CNNs, Vision Transformers show a generally weaker inductive bias. So basically you think of only convolutional neural networks when you think

of image classification, right? But Vision Transformers are a newer method which is also gaining a lot of popularity, and they can be used for image classification tasks. So when you think of Transformers, don't think of them only in the context of large language models or text generation; Transformers can also be used for computer vision. So remember, not all Transformers are LLMs. What about the other way: are all LLMs Transformers? That is also not true. LLMs can be based on recurrent or convolutional architectures as well. This is a very

important point to remember. I had made a presentation some time back, and this image has been taken from the StatQuest channel. Before Transformers even came into the picture, here you can see that in 1980 recurrent neural networks were introduced and in 1997 long short-term memory networks were introduced. Both of them could do sequence modeling tasks and text completion tasks, so they can also be called language models. So remember that not all LLMs are Transformers, right? LLMs can also be recurrent neural networks or long short-term memory networks. To give you

a quick introduction, what RNNs do is maintain a kind of feedback loop, and that is how memory is incorporated. An LSTM, on the other hand, incorporates two separate paths, one path for short-term memories and one path for long-term memories; that's why they are called long short-term memory networks. So basically we have one green line, let's say, shown here, which represents long-term memory, and one line which shows the short-term memory, and then basically using

both we can make predictions of what comes next. So even recurrent neural networks, long short-term memory networks and some convolutional architectures can be large language models. As we end, I just want you to remember that not all Transformers are LLMs, and not all LLMs are Transformers. So don't use the terms Transformers and LLMs interchangeably; they are actually different things, but not many people really understand the similarities or the differences between them. One purpose of this set of lectures

is for you to understand everything from the basics, the way it is supposed to be. That way you'll also be much more confident when you transition your career or sit for an LLM interview, and if you don't know the difference between Transformers and LLMs, these lectures can clarify it for you. I'm going to go into a lot of detail in these lectures, like we did right now, and not assume anything. I've written a number of things on the whiteboard so that you can follow along. Let's do a quick recap of what we

learned. First, we saw that most modern LLMs rely on the Transformer architecture, which was proposed in a 2017 paper. It's basically a deep neural network architecture. The paper which proposed the Transformer architecture is called Attention Is All You Need, and the original Transformer was developed for machine translation, for translating English text into German and French. We saw a simplified Transformer architecture which had eight steps. We take an input example and pre-process it by breaking the sentence into words and converting them into token IDs. Then we pass it into the encoder, which converts these tokens

into vector embeddings. The vector embeddings are fed to the decoder. Along with the vector embeddings, the decoder also receives the partial output text, and it generates the translated sentence one word at a time. This is the simplified Transformer architecture, and we saw that the Transformer architecture consists of an encoder and a decoder. However, later we saw that GPT models do not have an encoder; they only have a decoder. In the middle, we had a small discussion on the self-attention mechanism, which is really the heart of why Transformers work so well and why the paper which I

showed you earlier is called Attention Is All You Need. Self-attention allows the model to weigh the importance of different words relative to each other, and it enables the model to capture long-range dependencies. So when we are predicting the next word from a given sentence, we can look at all the context in the past and weigh the importance of which word matters more for predicting the next word. You can also think of self-attention as parallel attention to different parts of a paragraph or different sentences. We will look into this later; it's going to

be one of the key aspects of our course as we move forward. Then we saw the later variations of the Transformer architecture. In particular, we looked at two variations: first is BERT, which stands for Bidirectional Encoder Representations from Transformers, and then we saw GPT. So there is a difference between these, right? BERT predicts hidden words in a sentence, or missing words, which are also called masked words. So basically what this means is that BERT pays attention to a sentence from the left side as well as

from the right side, because any word can be masked. That's why it's called a bidirectional encoder, and it does not have the decoder architecture; it just has the encoder architecture. Since BERT looks at a sentence from both directions, it can capture the meanings of different words and how they relate to each other very well, and that's why BERT models are used a lot for sentiment analysis. GPT, on the other hand, just gets the data and then predicts the next word, so it's a left-to-right model. Basically, it has

data from the left-hand side, and then it has to predict what comes on the right, or what's the next word. So GPT receives incomplete text and learns to generate one word at a time, and the main thing to remember is that GPT does not have an encoder; it only has a decoder. Great. Then, in the last part of the lecture, we saw the difference between Transformers and LLMs. So remember, not all Transformers are LLMs: Transformers can also be used for computer vision tasks like image classification, image segmentation, etc. Similarly, not all LLMs are Transformers.

Before Transformers came, recurrent neural networks, long short-term memory networks, and even convolutional architectures were used for text completion. That's why LLMs can be based on recurrent or convolutional architectures as well. So do not use the terms Transformers and LLMs interchangeably, though many people do; understand the similarities and differences between the two. That brings us to the end of this lecture. We covered five points in today's lecture, and I encourage you to be proactive in the comment section: ask questions, raise doubts, and also make notes about these architectures

as you are learning. That's really one of the best ways to learn this material. As always, I try to show everything on a whiteboard and explain as clearly as possible so that nothing is left out, and I show a lot of examples in the process as well. Thanks a lot, everyone. I hope you are enjoying this series. I look forward to seeing you in the next lecture.
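
As a companion to the recap of self-attention and the BERT versus GPT contrast, here is a minimal sketch in Python. It is not the implementation from the paper or from any library. It assumes toy two-dimensional word vectors (made up for illustration) and uses each vector directly as both query and key, skipping the learned projections a real Transformer has. The names `softmax` and `attention_weights` are hypothetical helpers. With `causal=False` every word can look at every other word (the BERT-style, both-directions view); with `causal=True` each word only sees itself and the words to its left (the GPT-style, left-to-right view).

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors, causal=False):
    """Return one row of attention weights per word.

    Assumption: each word vector serves as both query and key
    (no learned projections), which keeps the sketch short.
    """
    dim = len(vectors[0])
    n = len(vectors)
    rows = []
    for i, query in enumerate(vectors):
        # GPT-style: only positions 0..i are visible; BERT-style: all positions.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(q * k for q, k in zip(query, vectors[j])) / math.sqrt(dim)
            for j in visible
        ]
        row = softmax(scores)
        # Words that are not visible get exactly zero weight.
        row += [0.0] * (n - len(row))
        rows.append(row)
    return rows


# Hypothetical toy vectors standing in for three words of a sentence.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

both_directions = attention_weights(toy_vectors)          # BERT-style view
left_to_right = attention_weights(toy_vectors, causal=True)  # GPT-style view

for row in left_to_right:
    print([round(w, 3) for w in row])
```

Each row says how strongly one word attends to every other word. In the left-to-right case the first word can only attend to itself, so its row puts all of its weight on position 0, while in the both-directions case every entry is greater than zero.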

Lecture 4: What are transformers?