hello everyone and welcome back in this session we are going to be properly introduced to the lexical analyzer so let's get to learning coming to the outcome of this session today we will learn the working principle of lexical analyzer in details now in the session different phases of compiler we observed when we gave this arithmetic expression to the lexical analysis phase the lexical analyzer produced a stream of tokens and we also came to know that the lexical analyzer uses regular expressions for recognizing the tokens basically it uses the regular grammar or type 3 grammar for recognizing the tokens so if we are to note down the features of lexical analyzer successively well it scans the pure high level language code line by line and while doing so it takes the leg zooms as inputs and as outputs it produces the tokens now coming to tokens there are several types of it it may be an identifier an operator constants basically these are the fixed values which we assign to the variables in the source code itself then there is keyword there is various keywords of the programming language if we consider c programming language if ent return etc are the keywords of that tokens can also be literals that is string literals the punctuators like comma semicolons parenthesis braces brackets etc are also tokens finally the spatial characters like ampersand underscore etc are tokens as well so tokens can be broadly classified into these seven categories now in a lexical analyzer two functions take place first the scanning and then analyzing lexums are given to the scanning phase which in turn eliminates the non-token elements such as comments consecutive white spaces etc thereafter in the analyzing phase the actual magic happens and we get the tokens at the end of that let me illustrate how this analyzing phase works let's consider a few tokens of the c language now if is a keyword of c and that's why it is a token now for recognizing if the lexical analyzer will require a finite state machine or fsm so from the initial state a saying i we will move to the next state b then from b seeing f we will end up in the final state c considering identifiers well we already have briefly observed the finite state machine for that previously so from the initial state say d seeing a symbol may that be any small letter in the range of a to z or any capital one in the same range or an underscore we will end up in the final state e because we know that single later can also be an identifier like for most programmers the variable that runs any loop by incrementing itself is i additionally an identifier can have any number of letters underscores or digits nonetheless they must follow either a letter or an underscore now if we talk about integers for an integer starting from the initial state say f if we see any sign be that a positive or negative sign we will move to the next state g and from there seeing any digit from the range 0 to 9 we will end up in the final state h now this much is sufficient for single legit integers however for two or more digits we will have an epsilon transition from the spinal state h to g that is from h without seeing anything we can move to g now using this portion we can have any number of digits by the way integers can also not have any sign at the beginning because for positive integers the positive sign is implied so we need not mention that therefore to facilitate that property we will have an epsilon transition from the initial state f to this state g now these are individual finite state machines with their own initial states in the analyzing phase of the lexical analyzer we cannot have separate finite state machines it will need a single finite state machine combining all the separate finite state machines assuming our c language has only these three finite state machines and can only recognize the keyword if identifiers and integers let me show you how these will be combined as a single finite state machine so what we will do we will introduce a new initial state say s and we will have epsilon transitions from that to all these different states now this one is an nfa or non-deterministic finite automata to be precise it actually is an epsilon effect epsilon nfa is the nfa which contains epsilon moves moreover nfa is conceptual that is we cannot implement it for implementation we will have to derive the equivalent deterministic finite automata or the dfa now this one here this is the equivalent dfa it will recognize the keyword if the identifiers also the integers let me show you how starting from the initial state s seeing the letter small i we will move to the state one and from there saying f we will move to the state two and since it is a final state if we stop here it will mean that this dfa has recognized the input lexum as the keyword token if now if the lexum has got something more than if suppose it is iffy in that case we won't stop at the state 2. with the first f after if we will move to the state 3 and then for the last y we will use this cell flow and remain in this state only now the question is why one is also a final state if you remember earlier in this session i told you that the identifier using which most programmers run loops is i so if we see i only which doesn't have any following f then it is not a keyword rather it is an identifier so for the dfa if it stops reaching the state 1 it recognizes that as an identifier now observe the transition from 1 to 3. here we have intentionally chosen the range of small letters a to e or g to z so after i accept small f if we have any other small later or any capital letter or an underscore or any digit we will move to the state 3.
now observe this transition here if you observe the ranges of small letters a to h or j to z that is any small letter except i so from the initial state s seeing any small letter except i or any capital letter or an underscore we can move to this final state 3. so if the machine either stops in one or in three through any of these transitions it will recognize the lexus as identifier now coming to the lower portion of the dfa starting from the initial state seeing a digit from the range 0 to 9 we can end up at the final state 4 which will recognize any single digit positive integer thereafter this self loop on this state 4 it will help the dfa to recognize any other positive integer with two or more digits now if in the legazim the sign has explicitly been specified for those starting from the initial state saying any sign be that either positive or negative the dfa will move to the next state that is this state 5 and from this seeing a single digit it will move to the final state 4.