[Music] hello everyone welcome to this lecture in the large language models from scratch Series today we are going to learn about a very important topic which is called as bite pair encoding in many lectures or even video content which you see on large language models when tokenization is covered the concept of bite pair encoding is rarely explained however behind modern algorithms such as gpt2 gpt3 Etc the tokenizer which these modern llms use is usually bite pair encoding so you really need to understand what this means and how encoding how this encoding is done for modern

L large language models so let's get started with today's lecture first let me take you to the Google collab notebook um until now if you have followed the previous lecture what we did in the previous lecture is that we implemented a simple tokenization scheme if you remember what we covered in the previous lecture let me just uh take you quickly through that in the previous lecture what we did was we basically took sentences we converted them into tokens and then we converted this into vocabul so then every word of the token was arranged in an

ascending order and then a ID or a numerical ID or a token ID was assigned to each token so here every word was a unique token along with special characters like comma full stop exclamation mark Etc um now in today's lecture we are going to cover a much more sophisticated tokenizing tokenization scheme why sophistic ated I'll come to that in a moment but just remember that today we are going to learn about the bite pair encoding or BP tokenizer this BP tokenizer which we are going to cover today was used to train large language models

such as gpt2 gpt3 and the original model used in Chad GPT if you have not seen the previous lecture on tokenization I highly encourage you to watch that so that this lecture will become very clear to you because in the previous lecture we implemented our own tokenizer completely from scratch and today we are going to learn about a bit more advanced concept and this scheme this bpe scheme is used in all modern llms and so it's very important for us to learn so let me take you to the Whiteboard right now and explain the concept

of bite pair tokenizer and why do we even need it in the first place so if you look at the tokenization algorithms there are essentially three types of tokenization algorithms the first is the word based tokenizer U the second type is the subword based tokenizer and the third type is the character based tokenizer let me walk you through these step by step in the word based tokenizer what we usually do is every word in the sentence is usually one token the tokenizer which we saw in the previous lecture such as for example in this example

the um the fox chased the dog the tokens were the fox chased the dog so every word was one token right so that's why it's an example of word based tokenizer uh here I'm taking one more example to illustrate this concept to you if the sentence is my hobby is playing cricket and if the tokens are individual words such as my then hobby then is then playing and then Cricket that that is a word based tokenizer awesome now can you think of the problems associated with the word based tokenizer first uh the main problem is

what do you do with Words which are not present in the vocabulary so let's say if you have huge amount of training data and you will break it down into sentences right and then you'll break down the sentences into individual words and then you will assign a token ID to each of these words but when the user is interacting with the llm let's say and the user inputs a word which was not present in the vocabulary these words are also called as out of vocabulary words so you can think of as preparing for an exam

but in the exam a question is asked which is completely different from what you have prepared for uh in word based tokenization schemes it's very difficult to know what do we do with out of vocabulary words because consider this case itself my hobby is playing Cricket let's say if the data set was small and only one sentence if a new word is given such as football and that is not present in this data set then usually these tokenization schemes run into an error this is a very simplified example I have constructed but uh this problem

shows up when you do word based tokenization a lot usually it's very difficult to deal with out of vocabulary or oo words if they are not present in the vocabulary and what do I mean by vocabulary vocabulary is just a dictionary with a collection of tokens arranged in ascending order and every token corresponds to a token ID great now another problem with this word based tokenization scheme is that uh let's say boy and boys are two words in the vocabulary each of them will have different tokens maybe the token IDs for both of them will

be far apart based on uh where they appear or they might be similar also but but the problem is that uh these words are very similar right and when we do this tokenization this similarity is not captured so boys if you think of the word boys the root word is boy so ideally both of these words are very similar to each other but with this tokenization they are treated as separate words and that similarity is not captured these are the main problems with word based tokenization now let's look at the other end of the spectrum

which is character based tokenization this is also a very popular tokenization scheme where what we do is that if the sentence is again my hobby is playing Cricket instead of having individual words as tokens in this kind of character based tokenization individual characters are considered as tokens so for example here let's look at which characters are there m is the first character why is the second character uh I'm ignoring white spaces for now so H is the third character o is the fourth character Etc so then the vocabulary or the tokens would contain m y

h o b Etc these are the tokens instead of having individual words now can you think of what will be the vocabulary size in this case just think about this for a moment if you are not able to answer this just pause and think okay so this will actually lead to a very small vocabulary because every language has a fixed number of characters if you look at the English language it has 256 characters on the other hand if you look at the total words in the English language it has about 170,000 to 200,000 words uh

but what is one of the advantages of the character based tokenization is that it has a very small vocabulary size which means that uh either so if you look at the Engish Eng language there are fixed number of characters right there are 256 characters so we will never have the out of vocabulary problem because let's say any new sentence is given by the user you can always break it down into characters even if you don't know the words in that sentence it's fine you break it down into characters and it will be either of the

256 characters which are already present in your vocabulary or dictionary so the out of vocabulary problem will not come into the picture and uh it solves one more problem if you look at the word BAS tokenization right as I told you English language is about 170,000 to 200,000 words so if you really want to include everything in the vocabulary you need a vocabulary size which is huge and that is one big problem in word based tokenization this problem is completely solved in the character based tokenization because the vocabulary is based on characters and the vocabulary

length is pretty small but then you might think oh this sounds amazing right it literally solves all of the problem problem it solves the out of vocabulary problem it's also computationally and memory efficient because the vocabulary size is very small and uh then that's great then what's the issue there are some problems with character based tokenization and the first major problem is that the meaning which is associated with words is completely lost essentially the advantage of dealing with language models is that words have meanings right so different words and different sentences might be related to

each other boy and boys have a common meaning this is completely lost since you're breaking it down into individual characters that's one of the first biggest problem with uh character level tokenization the second problem is that the tokenized sequence is much longer than the initial raw text so for example if there is a word in the let's say there is a word in the text which is dinosaur now in word based tokenization this will be treated as one single token right but in character based tokenization what will happen is that every every single character here

d i n o s a UR will be treated as separate token so in character based tokenization the word dinosaur will be actually broken down or split into eight tokens and that is another major problem the tokenized sequence is much longer than the initial raw text now uh so as we can see the word based tokenization has its advantages and disadvantages the disadvantages is that we don't know what to do with out of vocabulary words and the vocabulary size is pretty large the advantage is that uh of course the words every word is a token

so the tokenized sequence length will be small like dinosaur it will be just one token so when we tokenize the paragraph it's very small uh because every word will be one token unlike character based tokenization where the tokenized sequence is much longer than the initial text that's the disadvantage of character sequence character tokenization the advantages of character tokenization is that they have very small vocabulary because every language has fixed number of characters and we solve we completely solve the out of vocabulary problem and both of these approaches have a disadvantage that the meaning between words

is not captured boy and boys have a common root right that root is cap is not captured tokenization and modernization both have the isation which is common this meaning is completely lost so now so what do we do then we turn to another tokenization algorithm which is called as the subword based tokenization and the bite pair encoding which we are going to see is an example of subword based tokenization algorithm the subword based tokenization is kind of like a Best of Both words words and let's see how it uh why it is the best best

of both words so the first thing to remember about subword based tokenization is it does capture some root words which come in many other words so boy and boys it will treat boy as a common root word uh and let's see how it does that okay so in subword based tokenization there are essentially two rules the first rule of this tokenization is that when you get the data set you do not do not split frequently used word into smaller subwords so if there are some words which are coming frequently you should retain those words as

it is right so then it retains this from the word tokenization then the rule two is that if there are some words which are very rare which are not occurring too many times then you split these words into smaller meaningful subwords this is extremely important this second part basically says that if there are some words which are rare not appearing too many times you can go on splitting it further you can even go down to the Character level so you can see why it is a mix between word tokenizer and character tokenizer the first rule

implies that if the words are occurring many times you return it as a word so this is taken this is a feature taken from the word tokenizer the second rule implies that if the word is rare you can go on splitting it into further subwords and if needed you can drop down to the Character level we don't always drop down to the character level we even stay in the middle but this is a feature from the character level because we are breaking down the word further so to give you an example if the word boy

is appearing multiple times in the data set it should not be split further that's from rule number one so then boy is retained as a token but boys if the word boys is encountered uh that should be split into boy and it should be split into s uh because boys might not be appearing too many times and again boys also derives from the word boy so we should divide this word boys into smaller meaningful subwords so boys is divided into boy and S and this is the main essence of subword tokenization words are sometimes broken

down into smaller elements and why smaller elements because those smaller elements appear very frequently so for example why is boys broken down into boy and S because boy appears more frequently and S is also another token which appears very frequently this is what is basically done in a subord tokenization scheme at a very high level so let me explain some advantages of subord tokenization the subword splitting helps the model learn that different words with the same root word such as for example token tokens and tokenizing all of these three words essentially have the same root

word right token so subord splitting helps the model understand that these different words essentially have the same root word and they are similar in meaning this meaning is lost in word based tokenization and even character based tokenization that is number one the second advantage of subord tokenization is that it also helps the model learn that let's say tokenization and modernization are made up of different root words token tokenize and modernize but have the same suffix isation and are used in same syntactic situations so basically it derives these patterns like tokenization and modernization maybe zation z

a t i o n is a is a subword token in this tokenization and then token and then modern and another tokens so then it learns that this isation is a common suffix which appears in both of these words all of these are advantages of breaking down one word into subwords this is what is majorly done in subword tokenization so why are we learning about subword tokenization and what's the relation between bite pair encoding and subo tokenization well the main relation is that uh bite pair encoding or bpe is a subord tokenization algorithm this is

very important and that's why we are learning about bpe and that's why modern llms like gpt2 and gpt3 employ bite pair encoding so let us look at the bit of a history of the BP algorithm and let and then we'll see how it is implemented in practice so the bite pair algorithm was actually introduced in 1994 and uh it does a very simple thing it is basically a data compression algorithm uh and what is done is that we we scan the data and then mostov common pair of consecutive bytes so let me actually use a

different color here this sentence is very important most common pair of consecutive bytes of data is replaced with a bite that does not occur in data so we identify pair of consecutive bites which occur the most so we find these pairs which occur the most frequently and then we replace them with a bite which does not exist in the data and we keep on doing this iteratively I'll show an example for this so don't worry if you don't understand this explanation I want to quickly show you the paper so this is the the paper which

was published in 1994 a new algorithm for data compression which introduced bite pair tokenizer or bite pair encoder rather at that time it was not known that this will be so useful for modern large language models but it is quite useful okay so now let us see a practical demonstration of this algorithm so I've taken this example from Wikipedia because I think it is an awesome illustration so let's say we have this original data right and the original data looks like this a a a b d a a a b a c okay and now

in bite pair encoding we are going to compress this data right and let us see what do we do we first identify the most common pair of consecutive bytes so let's see let us see the pair which occurs the most so if you scan this sequence from left to right you will see that the bite pair AA occurs the most right you might say a AA also occurs the most but that's not a bite pair because that three characters the bite pair AA occurs the most it occurs here it occurs here it occurs here and

it occurs the here so it occurs four times really so what we will do next is that we have identified the bite pair which occurs the most we replace this bite pair with a bite that does not exist in the data so what we'll do is that we will replace it with zed and why Z because just Zed does not occur in the data you can use any variable here so then we'll take these a a and wherever AA shows up we'll replace it by Zed so then this first AA will be Zed so then

this new sequence will be Zed a b d then again Zed to replace this a a so then it will be Zed and then a b a c so then we have a b a c so remember what has happened here is that this AA which I'm highlighting right now in circle that has been replaced by Zed and this AA has been replaced by Zed so now we have a compressed data correct very good so now now let us move next next what we do is we keep on repeating this sequentially now we again look

at this sequence and find the next common bite pair so we can see that a is ur occurring once and a is occurring twice so it is repeating two times and which is the most frequent bite pair so we will replace AB by y so this AB will be replaced by Y and this AB will be replaced by y why why because it's a bite which does not occur anywhere in the U data so then this AB will be replaced by y as you can see here and here also the ab will be replaced by

y so then my new compressed data will be z y d z y a great this is a compressed data compared to the original data and this is exactly how the bite pair algorithm works now we do this recur do this again and again right so now let us look at the bite pair so here AC is the only bite pair which is left all others are special tokens but AC only appears once so we we don't need to encode it this process of replacing common bite pairs with another variable is called encoding that's where

the word encoder comes from so we will not encode this further because it only appears once if a c were appearing twice then we would have encoded it with another variable so this compression actually stops here you can go one more layer Deeper by further com replacing the zy with another variable which is let's say w so then it will be WD w a and then you will stop so you will compress it further like this so you see the original data has been compressed to this right uh to this compressed version U using the

bite pair algorithm so the algorithm itself is pretty simple you scan the data from left to right you identify the bite pairs which occur the most and then you replace them with a bite which does not exist in the data and you do this iteratively until you reach a stage where no bite pair occurs more than once that's it that is the simple bite pair encoding algorithm now you might think okay what has that got to do with large language models right I understand this algorithm and I understand how it compresses a given data sequence

but what has that got to do with large language models well it turns out that the we slightly tweak the bite pair encoding algorithm and use this to convert our entire sentence into subwords which will be very useful for us and I'll show you exactly how I'm going to do that so the bite pair encoding for llm ensures that the most common words in the vocabulary are represented as a single token remember rule number one and rare words are broken down into two or more subord tokens this is exactly the same rules which we had

looked at the rule number one and rule number two so rule number one is that most commonly used words should not be split and second is that that rare words should be split into meaningful subwords now let's see how uh how it's related to The Bite pair encoding algorithm which we saw and we will be looking at a practical example for this uh demonstration okay I will use a color which is a bit different from the green one here so that uh there will be a good contrast so let me use the orange color okay

so for the Practical example we are going to look at a vocabulary or rather I should say we are going to look at the data set of Words which is also called as Data Corpus which is this so we have these words in our data set old older finest and lowest right let's say we have these words in the data set right now and I'm going to show you how we are going to use the bite pair encoding algorithm to break these down into tokens and you will also see why this is a subword tokenization

scheme so first of all if we are to use a word based tokenization then this will have four tokens old older finest and lowest that's it similarly if we were to use the Character level tokenization then the tokens will be individual characters like o o will be one token then L will be another token then D will be another token Etc so this is how the word based tokenization and the character based tokenization will work but right now we are going to see the subw based tokenization using the bite pair algorithm before we uh proceed

further we'll need to do a pre-processing step and this is actually done uh even when we train large language models and when we are using these tokenizers so basically when you look at these different tokens there should be some ending right so for example when this old token appears we should have another end token so that we know that this word has ended over here so I'm going to augment every token here with this additional end token which is called slw so whenever the algorithm or the model comes across slw we know that this word

ends over here so I'm going to replace the tokens in my data set with adding a w/w at the end like old becomes old slw older becomes older slw finest becomes finest slw and lowest becomes lowest /w now remember here that if we use the word based tokenization uh there is no meaning which is captured so the fact that old is the common root between old and older is not captured number one EST is the common root between finest and lowest that's not captured so word based tokenization character based tokenization have so many problems because

they don't capture these meanings or root words and towards the end of this section we'll see how subword based tokenization using the bite pair encoding algorithm actually captures these root words like old uh like EST Etc okay so let's get started with the individual steps the first step is basically to split all these words into their characters um and then make a frequency table so what we are going to do is that we are going to take all these words old older finest and lowest so old appears seven times in the data set older appears

three times in the data set these are the frequencies finest appears nine times in the data set and lowest essentially appears four times in the data set so what we are going to do right now is we are going to split these words into individual characters so then here is the table which I have made remember slw is also there uh so old so all words have/ W right and totally how many words do we have 7 + 3 10 + 9 19+ 4 23 and since all words have slw at the end of it

slw comes 23 times similarly we see that o comes 14 times L comes 14 times D comes 10 times e comes 16 times Etc and we make this frequency table list so here you can see that we have 12 tokens and if we did we use character level tokenization our tokenization would end here because all these characters would be individual tokens now what we will do similar to if you remember the bite pair encoding algorithm we looked at the most frequent pairing right uh so let me take you back to this yeah we looked at

AA because AA appeared the most so we looked at that bite pair which occurred the most this is exactly what we'll be doing here we'll be look for the mo we'll be looking for the most frequent pairing in the data set and then what we'll do is that this is the modification compared to the original algorithm when we look for the most frequent pairing we will merge them we will merge the most fre pairing and then perform the same iteration again and again and again so let me show you how this is done so in

the iteration one as I mentioned we start with the uh finding the most frequent pairing right so it so you the way to do it is look at the first character which appears the most so e is that character which appears 16 times right so if you want to look at the pairing which appears most it most probably starts with e so it turns out if you look at these words e and s is the pairing which appears the most number of times so e and s here appears nine times in finest and E and

S appears four times in lowest so e and s is that pairing which appears 13 number of times right so uh most common bite pair starting with e is e and s so what we'll now be doing is that we'll be going through the data set again uh and we'll be merging these two tokens e and s so now e s will be one token and that's why it's called subword es will be one token so now let me show you my token table again everything else is the same up token number 12 but look

at this token number 13 which has been added in token number 13 we have added one more token which is es because it's the most frequent pairing and Es appears 13 times but remember when we add es we have to subtract something from E and we have to subtract from s because now uh ES has been included so we subtract 13 from the e count so now the the number of time only e appears is three the number of time only s appears is zero this is very interesting to know so the number of time

only s appears is zero so s it seems always appears with e so es is a subword see this we would not have discovered if we just did character level tokenization or uh Word level it seems that e and s always so s only comes with e in this data set we have already obtained our first site so now this is my new uh this is my new token library and this is my additional token and now we are going to actually continuously keep on doing this process to find uh frequency or to find tokens

which appear the most number of times so in the previous iteration we saw that e and s was the bite pair right which occurred most number of times ands it appeared 13 times but now es is a separate token for us so now using that token we see that EST is again a bite pair because es is one token and T is another token so es s and t becomes a bite pair in the second iteration and EST appears 9 + 3 which is again 13 times so now in the next iteration what we'll be

doing essentially is that U let me show you iteration number two in the iteration number two we'll merge the tokens es s and t because they have appeared 13 times in the data set see we are doing the same thing what we did in the earlier bite pair encoding for that character so let me just showed you show you here remember here what we did after AA so AA was done right and then we merged this and then we looked at the second sequence which appeared the most which is AB that is actually very similar

to what we are doing here es appears the most so we created one more token for ES then we looked at another bite pair which is appearing the most and that is es s and t so now what we'll be doing is that we'll merge es andt into one token and uh so EST comes to be 13 times and we'll then subtract 13 from the previous token es so now only es appears zero times and EST appears 13 times see we have constructed a new token EST so now remember what I said earlier the previous

World level tokenizers and the Character level tokenizers could not identify that EST is a common roote between finest and lowest but with our bite pair encoding algorithm we have already created a new token for EST so this algorithm has already identified that EST is a common root World great so up till now we have done iteration number three and uh we see that okay I think yeah up till now we have done I think two iterations and now up till now we have merged es and T into one common token awesome now let us take

look at this slw token now we can see that EST and /w basically appears 13 times again it's the same thing over here so EST always comes with slw now EST is one token so EST and slw forms a bite pair and this bite pair again occurs 9 + 4 which is 13 times which is much more than any other bite pair in these words so we'll combine EST and /w into one more token so in the third iteration we are going to combine EST and /w into one more token so now this becomes our

word do you understand why we combined slw with one more token we could have just left it at EST right but if we left it at EST then essentially there would have been no difference between words like estimate and highest so estimate and highest both have EST but the words in our data set are highest and lowest so EST is the ending sequence in all of our words in our data set and we need to encode that information that it is an ending sequence so estw allows us to encode this information so now the tokenizer

knows that whenever EST comes it's always followed by slw which means the word ends after estd so now our algorithm or the tokenizer can differentiate between estimate and highest because in estimate EST does not end with a /w so now if you look at the Tok which we have earlier we had all these 12 tokens but now we created es then we merged it to EST and finally we created this estw so now these two tokens are actually not needed es and EST so now let's look at another other bite pairs which occur a lot

so it turns out that o and L is another bite pair which occurs 10 times because it's present in old and older so what we'll do is that we'll create one more we'll merge these two and create one more token for o and L it appears 10 times and we'll subtract that count 10 from the O and L so o individually or with some other character appears 14 times so we subtract 10 because now we have created one more token for o which appears 10 times similarly L appears 14 times overall and we subtract 10

from it because L comes with o 10 times right and now what we do is that o l is one token so now we see that o l and D has appeared 10 times so this bite pair has appear 10 times so we merge this bite pair so then o d now becomes another token which appears 10 times so you see the meaning which our bite pair encoder has captured we have constructed one token which is old we have another token which is estw these tokens are subwords so they are neither full words nor characters

they are subwords but they encode the root representation so old is one token and this actually tells us that uh old is the root word which comes in Old it which comes in old as well as older and our BP algorithm has actually captured that perfectly that's awesome right all these root words were not captured by just the word uh word em or the word encoder word tokenizer and the Character level tokenizer now you might be seeing here that these f i and N appear nine times uh it's fine that they appear nine times but

we we just have one word with these characters so if you look at our data set again let's see where f i and N appear so the words are finest so all of these words actually only appear in finest so it's no it does not make sense to merge them into one um one token because that token does not the frequency of that token appearing in other words is not too much why did we merge EST into one token because it appears in multiple words why why did we merge old in one token because it

appears in multiple words so see the rule of uh subord tokenization the first rule is that that word which occurs multiple times you keep it as it is old so we kept old as it is right it is a separate token uh but that word which is not used too many times like older it needs to be split into old and then e and then R are separate tokens similarly in finest EST is one token fi and N are separate tokens in lowest EST is one token and L O and W are separate tokens and

this helps us to retain root wordss in sub tokenization I hope you are I hope you have understood this concept now what we can do is that uh this EST and old are final tokens which we constructed by merging now let us actually remove all of those tokens whose frequency is zero so we can remove s we can remove so let me Mark them with a different color so s has a frequency of z t has a frequency of Z es EST has frequency of z o has a frequency of Z so let's remove this

so then our final table looks like these these are the final tokens in our subord tokenizer which is obtained using the bite pair encoding algorithm how did we obtain these tokens we just looked at bite pairs which occur the most then we merged them into one and then we repeated this process uh until we obtained until we reached a stage where enough number of tokens have been created or until we reached a stage where our iterations have stopped and this is the final tokenized uh final uh tokens which we'll be using for next steps of

the large language model training which are vector embedding Etc this is how the subword tokenizer works and this is exactly how bite pair encoder which is a subword tokenizer it works for uh training models like gpt2 or gpt3 very few people have this understanding but I hope this lecture has intuitively made it clear for you how the bite pair tokenizer actually works so now this list of 11 tokens will serve as our vocabulary why is it called subword tokenizer because these are subwords right EST is not a full word neither it's a character it's a

subword but it's the root of many words o is also the root of many words now you must be thinking when do we stop this uh merging when do we stop these iterations so you have to specify a stopping criteria usually if you look at gpt2 or gpt3 you run this bite par coding algorithm on a huge number of tokens right so the stopping criteria can be if the token count becomes a certain number then you stop or just the number of iterations can be the stop uh stopping count so I'm in this lecture I'm

not covering the stopping criteria in too much detail but I just want to give you an intuition of how the stopping criteria can actually be formulated or computed it's based on the token count or the number of iterations to give you an example uh or to mention one more advantage of bite pair encoding here you can see that we it's actually better than Word level encoding right because uh let's say if you have two words boy and boys both won't be separate tokens in this just boy will be a token so bite pair encoder also

reduces the number of tokens which we have compared to word encoding let's see uh and that usually helps so in gpt2 or gpt3 when it was trained I think it used about 50,000 more than around 57,000 tokens I think when when bite pair encoding was done uh for gpt2 or gpt3 so bite pair encoding solves the out of vocabulary problem why does it solve the out of vocabulary problem because we also have subwords now and some of it might even be characters so character level tokenization solves the out of vocabulary problem right and bite pair

encoding or subord encoding retains some properties from it so you can see some tokens here are characters which is good but it also solves the problem of the word level encoding and one of the major problem is that root meanings are not captured but now they are captured over here and it also solves the problem of the word level encoding being just a huge bunch of numbers here that is not the problem because we break it down into characters and subwords so the total length of the vocabulary is also shorter than if you would have

just considered Word level embedding so subord embedding solves it it solves the out of vocabulary problem it also gives us a vocabulary size which is manageable and it also retains the root words meanings such as EST and old awesome right and that's why because of all these advantages this is the bite pair encoding algorithm is used for tokenization in gpt2 and gpt3 now let me return back to the code I could have just taken you through the code today but then you would not have UND OD how the bite pair algorithm actually works now implementing

BP from scratch is relatively complicated so we will be using a python open source Library called tick token so if you I'll share the link of this library in the chat so this is a library which is essentially a bite pair encoder it's a bite pair encoder which is used for open a models so open a themselves use tick token for tokenizing the sentences from the data you can see that it has about 11,000 stars and about 780 Forks so it's a pretty popular repository okay so uh now we are going to implement the bite

pair encoder in Python and I'm going to show you this implementation in a Hands-On demonstration the first thing to do is install tick token because this is we'll be using the bite pair encoder from uh tick token and here you can see that it takes a it takes some time to install this uh for me when I first install insted it it took about 1 minute but now it's a bit faster so requirements are already satisfied um and you'll see that it will get installed now in some time see it's already installed so now I

have installed tick token and let me just print out the version of tick token which I'm using here so you can see that the origion of tick token is 6 which is good enough now what we have to do is that we once the tick token is installed we can instantiate the bite pair encoding tokenizer from tick token so the way to do this is just uh use tick token do get encoding gpt2 and we store it in this object called tokenizer so now tokenizer essentially has is that is similar to the simple tokenizer version

two class or version one class which we defined in python in the last lecture remember we defined this uh simple tokenizer version one class and uh we defined the simple tokenizer version two class here which also include special tokens so now in just one line of code we have essentially defined this tokenizer which is something similar but now it will it's initiated an instance is created through this tick token Library so this is a bite pair tokenizer and uh similar to the tokenizer class which we had created if you see every tokenizer class has an

encode method and a decode method what the encode method does is basically it uses that tokenizer and it converts words into token IDs what the decode method does is that it decodes those token IDs back into individual words so let's see how this uh a tokenizer actually works so this is the bite pair encoding tokenizer from gpt2 and let's give it some text and try to encode and decode right and I'm testing uh this tokenizer because I've given it a complex sentence see so the first part of the sentence is hello do you like T

then we have given one more thing which is called end of text which means that a new text is starting here this is the end of text is usually done to separate entire documents but here I'm just using it to illustrate and in fact gpt2 actively uses end of text so just to give a demonstration to you when gp2 when gpt2 or gpt3 is trained on large amounts of data sets here is how the data sets are loaded so see uh what gpt2 does is that when it has data from different text sources it usually

shows when a particular text has ended and when the other text has started so end of text is actually a part of the gpt2 vocabulary itself so I have given this end of text here and then the second sentence is in The sunl Terraces of some unknown place now look at this if we are using a word level tokenizer this this will lead to the out of vocabulary problem because this word cannot be in the vocabulary of a word level tokenizer and that's the advantage of bite pair encoding in bite pair encoding we have characters

as tokens and we even have subwords as tokens so definitely this will be encoded because might be some character can encode it so some unknown and place might be subwords which are individual tokens they can also be used to encode this so bite pair encoder will take care of this this entire word which usually does not exist as a single word so let us test this actually and uh I've run this right now and let us see the different tokens so see hello do you like T it all of these have been converted into token

IDs I want you to take a look at this 50256 token so this is the token ID of end of text and this is also the vocabulary size of the tokenization scheme used in gpt2 or gpt3 so if we were to use a World level tokenizer English language has about 170,000 to 200,000 words so the vocabulary size would have been this much but now that we have used the bite pair tokenizer or bite pair encoding the vocabulary size has reduced by around 13 from 150,000 and with with a vocabulary of around 50,000 subwords uh we

are able to get the amazing performance from gpt2 gpt3 or GPT 4 I think has much higher level of uh tokens but even gpt2 has good enough performance and it has 50,2 56 tokens awesome and here you can see that I did not not get an error because some unknown place was not encoded the tokenizer is able to encode random words like these which also look wrong and here you can see these are the encodings for this sub some unknown place so the end of text is this in the sunlight Terraces of some unknown place

so I think some unknown place is broken down into subwords three or four subwords and then it's tokenized by gpt2 so this is how subord tokenization can handle out of vocabulary awesome so the code prints these IDs which we just saw and now what we can do is we can convert the token IDs back into text using the decode method remember every tokenizer is encode method as well as decode method so we can do tokenizer do decode and put all of these integers and here you see it decodes exactly the same sentence what we gave

to it so the decoded sentence is hello do you like T then end of text in The sunl Terraces of some unknown place exactly similar to the uh text which it had G which we had given which means that the encoder and decoder is working very well uh and the bite pair encoding has its advantages of dealing with out of vocabulary so here we see two advantages of bite pair encoding first is that it reduces the vocabulary length second is that it knows how to deal with unknown text awesome so we can make two noteworthy

observations based on the token IDs and the decoded text first as I already mentioned the end of text token is assigned to a relatively large token token ID which is 50256 uh that is the last token in the BP tokenizer used for gpt2 so in fact the BP tokenizer which was used to train models like gpt2 gpt3 and the original model used in chat GPT has a total vocabulary size of 50257 keep this in mind now you know what this means and end of text is the last largest token ID and it's also the last

token in the vocabulary the second uh thing to note which I already mentioned is that the BP tokenizer encodes and decodes unknown words such as some unknown Place correctly the BPA tokenizer can handle any unknown word and how does it achieve this without mention so we did not give special tokens for unknown words right uh how did it achieve this and this this is because of how the tokenization was done and because of how the uh BP tokenizer actually goes down to the level of uh let me show this the BP tokenizer goes down to

the level of characters and subwords that's why it is able to deal with unknown text so the algorithm underlying BP breaks down words that aren't in its predefined vocabulary to smaller subword units or even individual characters this is exactly what we saw over here uh and this enables it to handle out of vocabulary words so thanks to the BP algorithm if the tokenizer encounters an unfamiliar word during tokenization it can represent it as a sequence of subword tokens or characters that is the advantage of BP tokenizer and that's why we are having this lecture in

the first place great now let us actually take one last example to illustrate how the BP tokenizer deals with unknown tokens right so let's say we have this this sentence a k w i r w i e r completely random words which do not make any make any sense but if you pass these random words to uh the tick token gpt2 encoder bite pair encoder you'll see that it does not show an error in fact it even encodes these so a k w i r w i e r is encoded as these tokens because it

broken down into subwords maybe e r is a commonly occurring subword in all the vocabulary which we have so then the token for this is 25959 maybe AK is a commonly occurring subo maybe W is just a simple character maybe IR is a commonly occurring subo maybe I is just a character so you can see this what underneath the hood what this tokenizer must have done is that it must have broken this down into words and subwords based on what is present in the vocabulary and then to each of these word to each of the

subword or character it would have assigned a token ID so the tokenizer would have scanned this from left to right and broken it down into characters or subwords and then assigned each character or subword a unique token ID that's how the encoding works and if you decode the integers you get back exactly the same sentence which you started with and the reason I covered this lecture in so much detail is because in the previous lecture we saw how to do tokenization from scratch and today we saw how what is bite pair encoding and how you

can install tick token and Implement your own bite pair encoder or tokenizer with gpt2 is using and we saw how it handles unknown words uh how it captures the root meaning and how the vocabulary size is about 50,000 uh this brings us to the end of today's lecture in the next lecture we'll be looking at data sampling batch sizes context length Etc before we feed the embed or before we feed the tokenizers into the word embedding so in our progression of learning right now let me show you um one plot so that I can illustrate

up till where we have covered until now so if you look at uh if you look at yeah if you look at this schematic right here yeah if you look at this schematic right here until now we have looked at how to tokenize an input text and how to convert it into token IDs and we saw implementing our own tokenizer in the previous lecture and using a bite pair encoder in this lecture we still have to see token embeddings so we'll get to this point and before that we'll also see a bit about data sampling

context length and batch sizes this brings us to the end of this lecture thank you so much everyone I really wanted to have a good mix of uh whiteboard writing and whiteboard understanding which we did a lot in today's lecture so uh if you see our whiteboard notes we did a number of itations of the BP encoder I explained it to you intuitively I hope you understood this please mention in the chat if something is unclear and I'll be happy to explain it in the comment section and then we also saw how to code the

bite pair encoder uh using uh the tick token library in Python so now what I encourage you all to do is take some unknown sentences and try using the uh this tick token library and the bite pair encoder and try to just see what are the results uh if you if you encounter any error for any sentence if the tokenizer is not able to encode it will be awesome and I'll highlight that comment in the next video thank you so much everyone and I look forward to seeing you in the next lecture

Lecture 8: The GPT Tokenizer: Byte Pair Encoding