[Music] hello everyone welcome to this lecture in the build large language models from scratch Series this is the lecture five and today we are going to look in detail at GPT or generative pre-trained Transformer we are going to see how the versions of GPT evolved how the different papers on GPT evolved what's the progression from Transformers to GPT to gpt2 to gpt3 and finally to GPT 4 where we are at right now in the previous lectures we have looked at uh Transformers we looked at a simplified architecture for Transformer and we also saw the difference between Bert and GPT model we saw what's the meaning of an encoder what's the meaning of a decoder Etc if you have not seen the previous lecture I would highly encourage you to go through that if not it's totally fine the way I have designed this lecture is such that it is self-contained and you will understand all the concepts which I explaining in this lecture so one of the key Concepts which we are also going to look at today is zero shot versus few short learning but before coming to that let me first take you through the progression of research papers from Transformers to GPT to gpt2 and and then finally to gpt3 and then GPT 4 so let's deep dive a bit into history the first paper which was released which really started the work on GPT is transformers this paper was released in 2017 and this paper was titled attention is all you need the major breakthrough of this paper is introducing the self attention mechanism where you capture the long range dependencies in a sentence allowing allowing you to do a much better job at predicting the next word in a sentence it represented a significance at significant advancement compared to recurrent neural networks and long short-term memory networks Transformers which were introduced in this paper really changed everything the original Transformer architecture looked something like this it had an encoder and it had a decoder but GPT architecture which is generative pre-rain Transformer which came after this paper did not have an encoder the GPT architecture which was developed only has a decoder so this was the paper which came out after the Transformers paper and it introduced the concept of generative pre-training we are going to look at this concept a bit more in detail today but the main idea of this concept is unsupervised learning basically what they basically says that natural language processing as such up till this point had been mostly supervised learning so they were saying that label data for learning is scarse which means it is not easily available making it challenging for train models to perform adequately we demonstrate that large or we demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse Corpus of unlabeled text so two words are important here generative pre-training and unlabeled text we already saw pre-training in the previous lecture what is done here is that the text which is used is not labeled so let's say You have given a sentence right and you use that sentence itself as the training data and the next word prediction which is in that sentence itself as the testing data so everything is self-content and you don't need to provide labels we'll see this idea more in detail but this was the paper which was published in I think this paper was published in 2018 which was the first paper on generative pre-training they took this Transformer architecture they modified it a bit they removed the encoder block and they said that if we trained a language model on a huge amount of data set to just predict the next word in an unsupervised learning manner it could really imp improve language understanding and they called the training phase as generative pre-training why generative because we are generating the next World open AI also released a Blog about this on June 11 2018 improving language understanding with unsupervised learning so their main uh their main claim was that we have obtained state of the art results on language tasks with a scalable system our approach is a combination of two ideas Transformers and unsupervised pre-training just keep this in mind so they used the Transformer architecture which we saw in the 2017 paper and they also used unsupervised pre trining which means labels were not uh given to the training data the labels were taken from the sentences itself we do not need to pre-label the data so of course this was 2018 and no one in fact no one paid that much attention as it is relevant in the commercial sphere today for researchers this was a big deal and this paper also has a huge number of citations today but in the commercial space students teachers professors who did not work in large language models had not heard uh about this because it was still at the research phase it it had not entered the commercial domain then what happened is that in 2019 just the next year came one more paper which is called as language models are unsupervised multitask learner so what they basically did is they just took more amount of data than was used in the earlier paper and they also used a generative pre-train Network and you can see that they actually showed four types of generative pre-train networks they showed a smaller model a slightly larger model and the largest model which they used had 1542 which is 1,000 or 1 billion parameters almost uh so here you can see I have just shown a pictorial representation here this was the gp2 gpt2 architecture which was introduced in this paper gpt2 generative pre-train Transformer 2 and here you can see that they released four things gpt2 small gpt2 medium gpt2 large and gpt2 extra large this was the first time when a paper was published in which a large language model was so large in fact 1 billion parameters were used in gpt2 extra large and it led to very very good results at that time open AI was already working on more complex and more advanced GPT models but when this paper was released in fact even this has very good number of citations if you if you just go to Google Scholar and search this right now you'll see that this has uh around more than 10,000 plus citations so this was the gpt2 paper which had around uh the largest model in gpt2 really had around 1,000 million or 1 billion parameters then in 2020 came the real boss which was gpt3 uh gpt3 had 175 billion parameters let me show you where they actually mention about yeah so they they also released a number of versions of gpt3 small medium large extra large a version with 2. 7 billion parameter a version with 6. 7 billion parameter but there was one specific version which was released which had 175 billion parameters which was gpt3 and when people started exploring this model they could really see that it's amazing it could do so many things although it was just trained to predict the next word it can do number of other things like translation sentiment analysis answering questions uh answering multiple choice questions emotional recognition it can do so many things and this was a huge model 175 billion parameters people had not seen language models of this size then two years after this came GPT 3.
5 which became commercially viral everyone started using it and saw how good it was and right now I'm using chat GPT 4 so if you uh see here I'm using chat GPT 40 so GPT 4 is where we are right now but you just see this gradual transformation which has happened from 2017 to 2024 in a space of 7 years we have gone from this original Transformers paper we have gone then to the GPT paper in 2018 2019 came gpt2 this 2019 came gpt2 then in 2020 came gpt3 which really changed everything then came GPT 3. 5 and then finally we are at GPT 4 this is the whole uh transformation from Transformers to GPT gpt2 gpt3 GPT 3. 5 and then G G pt4 many people don't know the difference between Transformers and GPT GPT essentially borrows from the Transformer architecture but it's a bit different in that it does not have really the encoder block so I just wanted to start off this lecture by giving you this historical perspective of how the generative pre-train Transformer has evolved the next thing which I want to cover today is the difference between zero shot and few shot learning zero shot is basically the ability to generalize to completely unseen tasks without any prior specific examples and few shot is basically learning from a minimum number of examples which the user provides as input good so let me actually directly go to the gpt3 paper G this was the gpt3 paper and they have wonderful illustrations of zero shot and few short learning usually people think research papers are hard to read but they have some very nice examp examples which really clarify the concept so in zero short learning the model predicts the answer given only a description no other Assistance or no other support for example The Prompt can be that hey you have to translate English to French and take the word cheese and translate it into French if the model is able to do that that's an example of a zero shot learning because we have not provided any supporting examples to the model great then there is also one shot learning which means that the model Sees In addition to the task description the model also sees a single example of the task so for example look at this I tell the model that look C otter translates like this to French use this as a supporting guide or like a hint if you may and translate cheese into French so this is one shot learning where the model sees a single example of the task and then is few short learning where the model basically sees a few examples of this task so for example in few short learning uh we say that sea otter translates to this peppermint translates to this and giraffe translates to this use these as the supporting examples and then translate English to French and then translate GES so this is called as few short learning so I hope you understood the difference between zero shot one shot and few shot zero shot is basically you provide no supporting examples to the model you just tell it to do that particular task such as language translation and it does it for you in one shot the model sees a single example of the task and in few shot the model sees a few examples of this task these beautiful examples are provided right in the GPT paper itself so you no no need to look anywhere further so let's see what's the claim of these authors so what they were saying was that uh we train gpt3 and auto regressive language model we'll see in a moment what this means with 175 billion parameters 10 times more than any previous language model and test its performance in a few short setting so let's see what the results are gpt3 provides or achieves a strong performance perance on translation question answering as well as several tasks that require on the-fly reasoning or domain adaptation such as unscrambling words using a novel word in a sentence or performing three-digit arithmetic so this paper basically implied that GPT 3 was a few short learner which means that if it's given certain examples it can do that task very well although it is trained only for the next word prediction what this paper claimed was that gpt3 is a few short learner which means that if you wanted to do a language translation task you just need to give it a few examples let's say if you want gpt3 to do a language translation task you just give it few other examples of how that example is translated into another language and then gpt3 will do that task for you so they claimed that gpt3 is a few short learner you will encounter zero shot versus few shot at a number of different in a number of different books articles and blogs so I just wanted to make sure it's clear for you so then your question would be okay if gpt3 is a few short learner what about GPT 4 which I'm using right now is it a zero short learner or is it a few short learner because it seems that I don't need to give it examples right it it just does many things on its own so let me ask gp4 itself are you a zero short learner or are you a few short learner let's see the answer so gp4 says that I a few short learner this means I can understand and perform tasks better with a few examples while I can handle many tasks with without prior examples which is zero short learning providing examples helps me generate more accurate responses so this is a very smart answer because gp4 is saying that it does amazingly well at few shot learning which means that if you provide it with some examples it does a better job but it can even do zero shot learning so this is very important for you all of you to know when you are interacting with GPT right if you provide some examples of the output which you are looking at or how you want the output to be gp4 will do an amazing job of course it has zero shot capabilities also uh but the two short capabilities are much more than zero short capabilities let me ask it do you also have zero short capabilities so when I ask this question to gp4 it says that yes I also have zero shot capabilities this means that I can perform tasks and answer questions without needing any prior examples or specific context so this is very important I would say GPT 4 is both a zero shot learner as well as a few shot learner but to get a more accurate responses as gp4 says itself you need to probably provide it few examples to get better responses so in that sense it is a better few short learner even when the authors release this paper they say confidently that gpt3 is a few short learner but it can also do zero short learning it just that its responses may not be as accurate awesome so this is really the difference between zero shot learning and fot learning and you need to keep this in mind because when you think about large language models this distinction is generally very very important I think when we go to GPT 5 or GPT 6 even we might get really better at zero shot learning we are already there but it can be better so here I've just shown a few examples of zero short versus few short learning so that this concept becomes even more clear so let's say the input is translate English to French for zero short learning we will just give the input which you want to translate like breakfast and then it's translated here for few short learning as I already showed to you before we give it few examples like uh let's say here we want to unscramble the words or make the words into correct spelling let's say that's the task to do this task we can we we can give some example so let's say this is a wrong spelling and the correct spelling is goat this is a wrong spelling the current correct spelling isue so we give GPT or the llm these two examples and then we tell it that based on these two examples now translate this into a correct word and then it translates it correctly as F so this is an example of few short learning because as you see we do provide two supporting examples so that the llm can actually make a better translation okay so zero shot learning is basically completing task without any example and uh few short learning is completing the task with a few examples okay now let's go to the next section which is utilizing large data sets uh we also looked at this in the previous lecture but I just want to reiterate this based on this paper so let's look at the data set which gpt3 have used we already saw that they the model was 175 billion parameters right like see this but let's see the data on which the model is trained on let's look at this data so the data set is the common craw data let's go to uh internet and search common craw so you can see that this is the common crawl data set and let me click on this right now and it maintains basically a free open repository of web scrw data that can be used by anyone it has over 250 billion Pages spanning 17 years and it is free and open Corpus since 2007 great so gpt3 uses 410 billion tokens from the common craw data and this is the majority of the data it consists of 60% of the entire data set what is one token so one token can be basically you can think of it as a Subs subset of the data set so for the sake of of this lecture just assume one token is equal to one word there are more complex procedures for converting sentences into tokens that's called tokenization but just assume for now just to develop your intuition that one token is equal to one word so that will give you an idea of this quantity so there are 410 billion Words which have been used as a data set from common crawl then gpt3 also utilize data from web text to let's go and search a bit about web text 2 so this is also an enhanced version of the original web Text corpus so this covers all the Reddit submissions from 2015 to 2020 and I think the minimum number of apports here should be three or something but basically open web text Data consists of a huge amount of Reddit posts and how huge so basically gpt3 uses 19 billion words from this web text2 data set which constitutes 22% of the data the remaining I would say 18 to 19% of the data comes from books and it comes from Wikipedia so this is the whole data set on which gpt3 is trained on the total number of tokens on which it is trained on is 310 300 billion tokens although if you add up these tokens they are more than 300 billion but maybe gpt3 took a different mix of uh the data set but overall they took 300 billion words as the training data uh to generate the gpt3 model think of that for a while 300 billion tokens that's huge number of tokens and that's huge amount of data and it would need a huge amount of compute power and also cost so that's an important point to remember training the gpt3 is not easy pre-training rather I should call it pre-training is not easy you need a huge amount of computer power and you need huge amount of data so as I mentioned to you before a token is uh a unit of a text which the model reads this is a good definition to think of token is a unit of a text which the model reads for now you can just think of one token is equal to one word we'll cover this in the tokenization part in the subsequent lectures we already looked at the gpt3 main paper which is titled language models are few short Learners and this paper came out in the year 2020 one more key thing to keep in mind is that the total pre-training cost for gpt3 is $4.
6 million keep this in mind this is extremely important you have a huge amount of data set right and you need compute power to run your model on the data set you need access to gpus and that's expensive but imagine this cost the total pre-training cost for gpt3 is 4. 6 million now you must be thinking what exactly happens in this pre-training why does it cost this much what exactly are we training here so let me go that into a bit more detail for that but before that realize that the pre-trained models are also called as the base models or foundational models which can be used later for fine tuning so when you look at the generative uh pre-training paper they also mention about fine tuning so they say that we do pre-training first and then F tuning this means that let's say if you are a banking company or an airline company or an educational company which wants to use gpt3 but you also want the output to be specific to your data set then you need to fine-tune the model further on label data which belongs to your organization that process is called fine tuning that needs less amount of data than the amount of data needed in pre-training but fine tuning is very important as you go into production level settings so if you are an educational company who is building multiple choice questions let's say you can of course use gpt3 or gp4 but if you want more robust more reliable outcomes you need to fine tune it on a label data set which maybe your company has collected for the past 5 to 10 years okay now remember that many pre-rain llms are available as open-source models uh and can be used as general purpose tools to write extract and edit text which was not part of the training data so even gpt3 and G gp4 you can uh you can use it yourself as a student gpt3 and GPT 4 can be used and it's good you don't need fine tuning for this purpose let's say if you want to get some information or if you want your PDF to be analyzed you can use gpt3 and gp4 as well there is one distinction which I want to point out and that's between open source and closed Source language models so let me show you that so look at the year on the XX so 2022 uh on the this red curve is the closed Source model so gp4 is a closed Source model which means that the parameters and the weights really are not known too much you can just use the end end output like this interface which I have right now but there are many open source models which were releasing during that time it's just that their performance was not as good as the closed Source model such as GPT so on the y- axis you have MML you can think of it as a performance right right now so you can see the green line which is open source performance was much lesser and now as we are actually entering August the performance of the open source models in August 24 is actually comparable to closed Source models so Lama 3. 1 was released recently and uh the Lama 3.
1 llm is an open source model but it's one of the most powerful open source models which was released by meta it has 400 5 billion parameters and its performance is a bit better I think than gp4 so the gap between open source and close source llm is closing that being said for students you can still continue interacting with gp4 and gpt3 that serves very well for you you don't need to think about fine tuning an llm or accessing its weights or parameters great now uh I want to talk a bit about uh so the GPT for architecture so up till now we have looked at the total pre-training cost for gpt3 then we saw pre-training versus fine tuning and then we saw the open source versus closed Source ecosystem now let's come to the point number three which is GPT architecture we have already seen this in the previous lecture the GPT architecture essentially consists of only a decoder block so let me show you this to uh refresh your understanding so this is how the GPT architecture looks like it only consists of the decoder block as I've have shown here um whereas the original Transformer consists of encoder as well as decoder so here the gpt3 is a scaled up version of the original Transformer model which was implemented on a larger data set okay so gpt3 is also a scaled up version of the model which was implemented in the 2018 paper so after the Transformers paper there was this paper as I showed which introduced generative pre-training gpt3 is a scaled up version of this paper as I already mentioned it has around uh uh 175 billion parameters so I think we are aware of this so we can move to the next point now comes the very important task of uh Auto regression why is GPT models called as Auto regressive models and why do they come under the category of unsupervised learning so let's look at that a bit further the main part which I want to mention here is that GPT models are simply trained on next word prediction tasks which means that let's say you have the lion RS in the and you want to predict the next word then you predict that it is going to be jungle this is all what GPT models are trained for so let me show you how the training process actually looks like okay so uh I think this plot sums it up better so let let's say you give multiple examples right so the first input which you give to GPT is second law of Robotics colon this is the input based on this input it has to predict the next word that's the output and then it predicts a in the next round the input will be second law of Thermodynamics colon U so see the output of the previous label is now the input here and then again it has to predict the next word so then it must be robot and then the input is second law of robotics colon a robot that's the input and the output it has to predict the next World and then it predicts must then the next input will be second law of Robotics colon robot must basically you see what we are doing here at every stage of this training process the sentence is broken up into input and output the input consists of let's say three words or four words but the output is always one word now you see we don't give any labels here what happens is that the sentence itself breaks it break is broken into two parts the first half and the second half the second half is where we have to predict the next world that's why it is called as unsupervised learning because we do not give labels the sentence itself consists of the labels because the label is the next word uh okay so uh GPT is trained on next World predictions one awesome thing is that although it is only trained on predicting the next World they can still do a wide range of other tasks like translation spelling correction Etc but for the training itself GPT architectures are only trained to predict the next word in the sequence and this takes a huge amount of compute power the 4. 6 million dollars which I showed you is needed for this because imagine you have those amounts of data which I showed you before let me show you them again so let's say if you have these these many amounts of words 410 billion words so there might be around 40 billion sentences right and each of the sentence will need to be broken up let's say the sentence is uh of 10 words okay each of the sentences will then needed to be broken up and then into input pair and the output uh and then in the output you have to predict the next word it takes a huge amount of time right because you will need to do this for the billions of sentences in the data that's why it takes a huge amount of compute power for this training procedure and uh how is it trained so basically initially it will predict wrong outputs but then we have the correct output U which we know what should be the next word from the sentence itself so then the error will be computed based on the predicted output and the difference between the corrected output and then similar to The Back propagation done in neural networks the weights of the Transformer or the GPT architecture will adapt so that the next word is predicted correctly so please keep in mind that that is why it is an example of self-supervised learning uh because let's say you have a sentence right what is done is that in the sentence itself we are divided we are dividing it into training and we are dividing it into testing so this is the true we know the next word this is the next word and we know its true value what we'll do is that using this as the input we'll try to predict we'll try to predict the next word so then we'll have have something which is called as the predicted word and then we'll train the neural network or train the GPT architecture to minimize the difference between these two and update the weights so these four these 175 billion parameters which you see over here are just the weights of the neural network which we are training to predict the next word so that's why it's called as unsupervised because the label for the next word we we do not have to externally label the data set it already is labeled in a way because we know the true value of the next word so uh to put it in another words we don't collect labels for the training data but use the structure of the data itself to make the labels so next word in a sentence is used as the label and that's why it is called as the auto regressive model why is it called Auto regressive there is one more reason for this the prev previous output is used as the input for the future prediction so let's say let me go over this part again the previous output is used as the input so let's say the first sentence is second law of Robotics the output of this is U right this U becomes an input to the next sentence so now the input is second law of Robotics colon U and then the next word prediction is robot then this robot becomes an input to the next sentence that's why the model is also called Auto regressive model so two things are very important for you to remember here the first thing is that GPT models are the pre-training part rather I would I should say the pre-training part of GPT models is unsupervised why is it unsupervised because we use the structure of the data itself to create the labels the next word in the sentence is used as the label and the second thing which is very important is that these are Auto these are Auto regressive models which means that the previous outputs are used as the inputs for future predictions like I showed you over here so it is very important to note these key things when you pre-train the GPT so in pre-training you predict the next word you break you use the structure of the sentence itself to have training data and labels and then you do the training you train the neural network uh which is the GPT architecture and then you optimize the parameters the 175 billion parameters now can you think why it takes so much compute time for pre-training because 175 billion parameters have to be optimized so that the next word in all sentences is predicted correctly okay now as I have mentioned to you before uh compared to the original Transformer architecture the GPT architecture is actually simpler the GPT architecture only has the decoder block it does not have the encoder block block so again let me show you this just for reference the original Transformer architecture looks like this if you see it has the encoder block as well as the decoder block right but now if you see the GPT architecture here the input text is only passed to the decoder see it does not have the encoder so in a sense the GPT is a more simplified architecture that way uh but also the number of building blocks used are huge in in the GPT there is no encoder but to give you an idea in the original Transformer we had six encoder decoder blocks in the gpt3 architecture on the other hand we have 96 Transformer layers and 175 parameters keep this in mind we have 96 Transformer layers so if you see this if you think of this as one Transformer layer if you see this as one Transformer layer this this there are 96 such Transformer layers like this that's why there are 175 billion parameters now I want to also show you the visual uh I want to show you more visually how the next word prediction happens we already saw Here how we have input and the output but I've also written it on a whiteboard so that you have a much more clearer idea so let's say so the way GPT works is that there are different iterations there is iteration number one there is iteration number two and there is iteration number three let's zoom into iteration number one to see what's going on so in iteration number one we first have only one word as the input this it it's it goes through pre-processing which is converting it into token IDs then it goes to the decoder block then it goes to the output layer and we predict the next word which is is so then now the output is this is so it predicted the next word and now this entire this is now serves as an for the second iteration so now we go into iteration two where this is is an input so is was an output of the first iteration but now it is included in the input of the next iteration and then the same steps happen we do the token ID preprocessing then it goes through the decoder block and then we predict the next word this is an that's the output from the iteration two the next word which is predicted is n and now this output from iteration two is now uh going as the input to the iteration number three so the input to the iteration number three is this is n you see why it is called an auto regressive model the output from the previous iteration which is n is forming the input of the next iteration so the next iteration is called is this is n and then again it goes through the same preprocessing steps and then there is an output layer and then the final output is this is an example so so the next word which has been predicted is example and then similarly these iterations will actually keep on continuing this is how the GPT architecture Works in each iteration we predict the next word and then the prediction actually informs the input of the next iteration so that's why it's unsupervised and auto regressive so this schematic of the GPT architecture and as you can see that uh it only has the decoder there is no encoder if you look at iteration 1 2 and three I did not mention an encoder block here right because encoder block is not present in the GPT architecture schematic only the decoder block is present okay now one last thing which I want to cover in this lecture is something called as emergent Behavior so what is emergent Behavior I've already touched upon this earlier in this lecture and in the previous lectures but remember that GPT is trained only for the next word prediction right GP is trained only for the next word prediction but it's actually quite awesome and amazing that even though GPT is only trained to predict the next World it can perform so many other tasks such as language translation so let me go to gp4 right now and uh convert breakfast into French so gp4 is not trained to do language translation tasks it is trained to predict the next world but while the training or while the pre-training happens it also develops other advantages it develops other capabilities and this is called as emergent Behavior due to these other capabilities although it was not trained to do the translation tasks uh gp4 can also do the translation tasks which I have mentioned here see you can see one more thing which I want to show you is this uh uh McQ generator so as such when GPT was trained it was not trained to do McQ generation right but look at this if I want the GPT to provide me uh three to four multiple choice questions uh so I just clicked on generate right now and you'll see uh McQ questions have been generated on gravity now technically uh GPT was not trained really to generate these questions on gravity but it developed these properties or it developed these capabilities uh on its own while the pre-training was happening to predict the next World and that's why this is called as emergent Behavior actually so many awesome things can be done because of this emergent Behavior although GPT is just train to predict the next word it can do it can answer text questions generate worksheet sheets summarize a text create lesson plan create report cards generate a PP grade essays there are so many wonderful things which GPT can do and in fact this was also mentioned in one of the blogs of open AI where they say that uh uh we noticed so this this mentioned in their blog we noticed that we can use the underlying language model to begin to perform tasks without ever training on them this is amazing right for example ex Le performance on tasks like picking the right answer to a multiple choice question uh steadily increases as the steadily increases as the underlying language model improves this is an clear example of emergent Behavior Uh so basically the formal definition of emergent behavior is the ability of a model to perform tasks that the model wasn't explicitly trained to perform just keep this in mind and that was very surprising to researchers also because it was only trained to do the next door tasks then how can it develop these many capabilities and I think this Still Still Remains an open question that how come emergent behavior is developed by chat GPT so let me actually go to Google Scholar and search about emergent behavior I'm sure there are many papers on this so here you can see I searched emergent behavior and there are all of these papers which came up uh this is an area of active research such as exploring emergent behavior in llms and I'm sure there's a lot of scope for making more contributions here so if any of you are considering looking for research topics emergent Behavior might be a great topic to start your research on this actually brings us to the end of this lecture we covered several things in today's lecture so let me do a quick recap of what all we have covered so initially before looking at zero shot and few shot learning we started with the history we saw that the first paper which was introduced in 2017 is attention is all you need it Incorporated the Transformer architecture then came generative pre-training GPT the architecture is a bit different than Transformer it uses only decoder no encoder and then after uh generative pre-training was developed as a method it shows two main things first is that it's unsupervised second it's Auto regressive and unlabel data which which means it does not need label data for pre-training then came gpt2 one year later in 2019 and uh in fact there were four models of gpt2 which were released by open AI the first one had 117 million parameters the second had 345 the third had 762 and the fourth one had about a billion parameters but then came the big beast in 2020 that was really gpt3 and uh this paper said that language models are few short Learners which means that if gpt3 is actually provided some amount of supplementary data it can do amazing few short tasks and this model used 175 billion parameters which was the largest anyone had ever seen up till that point after looking at this history we looked at the difference between zero shot and few shot learning in particular we saw that in zero shot learning you don't need to provide any example the model can perform the task without example and in few short learning you can give a few supplementary examples so when this gpt3 paper was released the authors claimed that this this was a few short model they did not say zero short Learner in the title because although it can do zero short learning uh it's just much better at few short learning and we actually explored this ourselves we asked gp4 are you a zero short learner or are you a few short learner and gp4 sent that I'm a few short learner it's it also said that it can also do zero short learning but it it's just more accurate uh at few short learning okay that's important to keep in mind then we saw that gpt3 utilizes a huge huge amount of data uh it it uses around 300 billion tokens in total so just writing it down 300 billion tokens in total which is about 300 billion words approximately a token is a unit of text which the model reads it it's not usually just one word but for now you can think of one token as one word and then we saw that training pre-training gpt3 costs $4. 6 million why does it cost this much because we have to predict the next word in a sentence using this architecture so sentences are broken down into training data and testing data it's Auto regressive so one word of the sentence is used for testing or the next word and the remaining is used for training and this has to be done for all the sentences in the billion billions of data files which we have that's why it takes 4.
6 million to train because there are 175 billion parameters in gpt3 remember to optimize the weights for those many it would need a huge amount of computer power access to gpus Etc that's why training process is hard so this schematic also shows the GPT architecture remember it only has the decoder it works in each it works in iterations and the output of one iteration is fed as an input to the next iteration that makes it auto regressive and in each iteration the sentence itself is used to make the label which is the next word prediction that's why it's an unsupervised learning exercise pre-training we also saw that after pre-training there is usually one more step which is fine-tuning which is basically training on a much narrower and specific data to improve the performance usually needed in production level tasks we also briefly looked at the gap between the open source and the closed Source llms really closing with the introduction of Lama 3.