[Music] Hello everyone and welcome to the next lecture of our course reasoning LLMs from scratch. All of you might be aware that we are in the reinforcement learning phase of our course and we have been making steady progress in this field. In the last lecture, we looked at the concept of value functions and we also looked at the Bellman equation and the Bellman optimality equation.

These equations are used to write down equations for estimating the value function for a reinforcement learning problem. And today we are going to start looking at the different methods which can be used to actually solve for the value function in a reinforcement learning problem and to actually find the optimal policy for an agent. So far we have looked at structures agent environment interface uh the mark of decision process.

Then uh we looked at what are value functions, how how you can write down equations for value functions etc. But we haven't yet started to understand how we can solve these equations to actually find the optimal policy for the agent. And this is exactly what we are going to start today.

The methods that we are going to cover starting from today's lecture will equip you with an understanding of u how you can develop or use tools to solve for reinforcement learning problems depending on the problem type. Okay. So uh let's get started and uh the the topic for today is called as dynamic programming.

Some of you might have heard about dynamic programming wiggly in some different context but we are going to define it from scratch over here and use it in the context of reinforcement learning. So dynamic programming is a is a collection of algorithms which helps us to find or calculate the optimum policies for an agent in a reinforcement learning problem. So the the the final aim is going to be to find the optimal policy for the agent and uh the lecture will be organized in such a way that we'll slowly cover the different steps which are involved in dynamic programming algorithms.

So let's get started. Uh the first step or the first phase of dynamic programming methods is called as policy evaluation. And the meaning of this term is that for a given policy pi, can we estimate the value function vi of s?

Remember in the previous lecture we had looked at this notation in detail and what we said was the meaning of this notation is that if you give me any state then I am going to calculate the cumulative rewards obtained by the agent starting from that state and then moving towards the end of the episode. So I'm going to add all the rewards which the agent is receiving. So that's called as the value function for that particular state.

And in policy evaluation what we are doing is that we are already given the policy which the agent is following. And for a given policy our objective is to estimate the value function for that particular policy. Okay.

So this is the definition of policy evaluation. The name is slightly misleading because it might lead you to think that the meaning of policy evaluation is maybe we are evaluating the policy but that's not the case. It has a very specific definition which we have looked at a while back.

So this is also called as the prediction problem. So dynamic programming methods or in general any method in uh to to solve for the optimal policies are divided into uh first is the prediction problem and second is called as the control problem. So we are going to look at the control problem also shortly but first we'll develop a very nice understanding of the prediction problem.

Okay. Now let's start to develop an intuition for how we can evaluate the value function for a given policy. The first starting point u might have already come to your mind is to use the bellman equation which we covered in the last lecture.

What the bellman equation essentially says is that let's say I have a state s and from this state there are multiple actions I can take and after taking those multiple actions I can transition into different states right so these these states are called as sdash So the bellman equation says that you can estimate the value function of uh state s by looking forward to the next state. So essentially I can estimate the value function of state s in terms of the value function of the next state which is sdash. That's the primary intuition behind uh behind the bellman equation.

And then uh what you essentially do is that you you have a uh have a factor with which you u u you know associate the relative importance to all of these different next states. And that that factor is what you do is you multiply it by the discount factor V of S ddash and you add the reward which the agent gets along the way and then you multiply it by the probability of reaching this state from the previous state Sasha R given S comma A. This means that what is the probability that the agent reaches the state sdash and receives a reward R starting from this state and taking action A.

And then what you do is you sum up this for all the possible next states which the agent can encounter starting from the previous state. And then you can write this down in format of a mathematical expression which looks like this. So this is the current state.

uh this is the next state sdash. So the value function of the current state can be expressed in terms of the value function of the next state which is sdash. It is discounted by a factor of gamma and there is a reward which is added along the way.

And then this is just the transition probability. I want all of you to get familiar with this expression because we are going to use it multiple times later as well. So this is transition probability and this is uh this is the policy which states that what's the probability that the agent takes an action a while being in the status.

So this is the bellman equation uh which we have and now we are asked the question that uh given this policy this this policy is given can I predict the value function for all the states and the moment you see this equation you realize there is one issue the issue is that we are saying the value function of the state is dependent on the value function of the next state and then similarly we can write down the equation for the value function of the next state which is going to be dependent on the value function of the state which follows that state. So in every single equation we are going to have value functions. So it's it's it's a recursive problem you see.

So there is no straightforward solution to this. Just by looking at this, the first thought which runs in my mind is that okay for every equation that I write down, I am going to encounter value functions for the next state which are not known to me either. So it's going to be u a system of equations with unknowns in every equation.

And our main focus for solving the prediction problem in dynamic programming is going to find a way to solve this system of equations. So now let's try to break this equation down and see how this system of equations actually look like. So let us say there are five different states which the agent encounters uh during the episode and these states are given by s_ub_1, s_ub_2, s_sub_3, s4 and s5.

So these are the five different states which the agent is going through while it is u navigating the environment. And what we are going to do is that we are going to write down the uh the bellman equation for each of these states. So there are going to be five equations in total because we have to write an equation for each of these states separately.

So let's see what the five equations are and then these are the backup diagrams. So the backup diagrams which we have looked at before essentially give us a representation of what happens later when the agent reaches that particular state. For example, when the agent reaches S1, there are two possible actions A1 and A2.

A1 leads the agent to state S_sub_2 and A2 leads the agent to state S3. And the backup diagram is different for every state. So the back end diagram for state S3 is that there are two options A3 and A4.

But now if I take action A3, I can go into two states. I can either go into S_sub_1 or S_sub_2. But if I take action A4, I'll only go into one state that is S5.

In order to write down the Bellman equations properly, you need to know this backup diagrams really well. And this has to be known for all the states. Otherwise you would not know how to write these equations properly and the backup diagrams essentially tell you that which are the states which comes later from this state and that helps you write down the equations in an organized manner.

So now using the bellman equation for evaluating the value function. The first thing that we can see is that the value function for the state s_sub_1 is going to be a function of the value function of state s_sub_2 and the value function of state s_sub_3 because s_sub_2 and s3 are the subsequent states from that state. Similarly, the value function of state S_sub_2 is going to depend on the value function of state S_sub_1 and the value function of state S4 because these two are the subsequent states from the state S_UB_2.

And in a similar fashion, we can write down the Bellman equations for these five different states. And then these f_sub_1, f_sub_2, f_sub_3, f_sub_4, f_sub_5 are functions which depend on the transition probabilities. The first thing that I see is that there are five equations and five unknowns.

Five equations and five unknowns which I have to solve if I want to find the value function for every single state. And if if you have a problem which is quite general, if you have n different states, then you would have n equations and n unknowns to solve for. So the bellman equations if elaborated for all the states they lead to a series of equations where the number of equations are equal to the number of states and the number of unknowns are also equal to the number of states.

So essentially to solve the prediction problem we need to find a solution for these n equations and n unknowns. Uh so this is uh straightforward but the computation is quite tedious and that's why uh we don't always use this particular method to solve the prediction problem and that's where the method of dynamic programming comes in. It's a method of iteratively solving these equations so that as you increase the number of iterations finally the solution will converged to the true solution.

So we are going to develop an intuition as to how this iterative method would have been constructed in the first place. So imagine yourself in the shoes of people 50 years back in 1960s uh who are trying to solve these kind of equations and you figure out that it's very hard to solve these n equations and n unknowns because it it'll involve huge matrix computations and I do not have the computing power right now. So what do I do?

Can I use some kind of an iterative method? And uh we are going to understand this from the point of view of the scientists 50 years back. What we do is that we first say uh okay let me first uh assign a value of zero to all the states.

So with the value functions for s_ub_1, sub_2, s3, s_ub4 and s5, I am going to assign a value of zero. This is my first iteration which is v 0. And obviously this is wrong.

But this is my starting point. So this is called as V 0 where [Music] assign zero to all states. Now what we are going to do is that we are going to iteratively change these values and for each iteration we are going to get better and better and better.

Now let us use the same backup diagrams which we used uh just a while back. Okay. Uh now using this value function v 0 we are now going to estimate the next value function which is going to be called as v_sub_1.

and try to think how we would estimate the next value function. Now remember that you have you already know the value functions for let's say all of the states and using these value functions we are going to now obtain a better estimate for the value functions for all the states. So let's say now we look at state S1.

Let us focus our attention on state S1. Remember what we saw earlier was the value function of state S1 is dependent on value function of state S_sub_2 and value function of state S3. What if we assign these to the value function calculated previously that is zero and then we will get a new estimate for v_sub_1.

Similarly we will assign these the values which we calculated before which is zero. And then we'll get a new estimate for the state s_ub_2. And we'll do this for all the states.

And after doing this, we will get a new update which we call as v_sub_1. And then imagine that v_sub_1 is something like 2. 5 and 3.

This is v_sub_1. And now what I'll do is I'll I'll calculate v_sub_2 but to calculate v_sub_2 instead of 0 and 0 here what I'll do is I'll substitute 0 5 and 1 here. 5 and 1 so I'll again calculate the next value for this state similarly I'll again reestimate the value for s_ub_2 by substituting 2 in s1 and what is the value for s_ub_42 and then I'll again get a new value of s_sub_2 similarly I'll I'll I'll get v_sub_2 then I'll get v3 I'll I'll proceed so on now here what we are doing is that the amount of computations we are not solving a system of n equations and n unknowns at the same time all we are doing is that we are doing the simple mathematical calculation for each iteration and then we are proceeding to the next iteration.

So we are getting better and better and better. So to write this mathematically we calculate the value function of uh the next iteration for all the states by using the value function of the previous iteration which is denoted by V K. So the K starts from zero.

So first we have v 0 then using v 0 we get v_sub_1 using v_sub_1 we get v_sub_2 using v_sub_2 we get v_ub3 dot dot dot and this sequence has been shown to converge to the true value function that is v pi as you do more and more iterations. So this is the intuition and this is exactly what the scientists actually thought 50 years back. They thought okay what if we do something like this.

What if we assign some initial value to the value function of all the states then we change the value function for all the states by calculating an update and using the previous value function then we get a new value function for all the state. Then what we do is we use this value function to calculate the next value function and so on. So this method is called as the dynamic programming method to find out the value function for all the states given a specific policy pi and this equation is is used to calculate the value function.

So this algorithm is also called as iterative policy evaluation. This is because we have to iterate through the time steps. So directly we don't get the final answer but slowly we get to the final answer.

Now uh one of the major drawbacks of dynamic programming is the requirement of these transition probabilities. So if you have a reinforcement learning problem where you exactly know the model of the environment like uh the recycling robot we knew the transition probabilities exactly. So there we can use dynamic programming methods to solve the problem.

But there are many cases where you do not know the model of the environment before. Imagine an agent being dropped into Mars, right? And there on on I do not know the surface of Mars.

So there I have to learn purely by experience. So dynamic programming methods are not learning methods because there is no learning from experience. Here we are assuming a model of the environment and then we are saying okay now I know the model of the environment.

So I can use bellman equations to solve for the value functions in its exact form. So this is uh okay and and then for going from one iteration to the next iteration we have to update all the values of the states and then only we can move to the next iteration. So imagine there are thousand states right and initial value function you assign to be zero and then to get the next value function in the next iteration you have to calculate the update for all the thousand states then you move to the next iteration then you again calculate the update for all the thousand states then you again move to the next iteration so this is called as a state sweep so for dynamic programming methods we require a full state sweep that is we have to go through all the states and then we have to calculate the update and then only we'll get the uh updated value function for the next iteration.

So it's it's computationally expensive as well. These are the two drawbacks of dynamic programming algorithms and we are going to overcome these drawbacks in the next lecture when we discuss about Monte Carlo methods and temporal different uh difference methods and dynamic programming offers a perfect stepping stone to understand both these other methods with clarity. But for now uh I want you to understand how if we have a perfect model of the environment we can use bellman equations to solve the policy evaluation problem.

But we are nowhere close to finding the optimal policy yet. Remember we assumed that the policy was already given to us. Now we come to the next step which is policy improvement.

So far we have learned how to calculate the value function for a given policy. But what if this policy is not optimal? How do we reach to a policy which is optimal?

We have not looked at that. And policy improvement is a method where we slowly nudge the policy towards the right policy. We don't directly get the right policy at one go but then again we do number of iterations and then we slowly reach to the best policy or the optimal policy.

Okay. So let's let's begin to uh let's let's begin to understand this. Uh suppose we have a policy which is denoted by pi like we assumed in the previous case and let's say our policy tells us to take action A1 for the state S.

So from this state there are three possible actions A1, A2, A3 and depending on what action you take you can transition to different states. Now remember the meaning of a policy is that for every state it tells you which action you should take or which action has the highest probability. So let's say my policy tells me that uh you have to go through this path which is the path this path that you have to take action A1.

Okay. And now what we can do is we can calculate the expected return for this path using the value function which we have calculated in the previous section. So the expected return for this path.

So let's see the expected return for this path is going to be the reward which is which the agent collects along the way plus the discounted value of the next state. So there is there is a mathematical way also to express this. uh let's say I want to calculate the total reward that is or the expected return from state s and taking this path right so what's the formula for expected return we write the expected return as g of t which is equal to r of t + 1 plus gamma * r of t + 2 plus gamma square * r of t + 3 dot dot dot.

Now I have given I have been given this action a1. So I am in state stats and I'm taking this action a1 which my policy is telling me. So I am already collecting one reward.

So this can be written as RT + 1 plus gamma * RT + 2 uh this should be gamma gamma * R2 T + 2 + gamma RT + 3 + dot dot dot and this expression inside is nothing but the value of the next state sdash So the value of this state. So I can now calculate the expected return which I get along this path. Uh forget the mathematics for now.

We'll first try to build an intuition and then we'll go to the mathematics if required. So let's say my policy pi is suggesting me to take action A1 and I calculate the expected return along this path. But it turns out that if I had taken action A3, I would have gotten a larger expected return.

Which means that my policy is not optimal because my policy is telling me to take action A1 which is not yielding me the best returns. Whereas ideally I should have taken action A3. So then what we do is that we say that okay if this is the case I am going to improve my policy.

I'm going to say that for every state I'm going to calculate all these expected returns and I'm going to take the maximum of these expected returns and I'm going to find which action is giving me this maximum and if this action is same as the action suggested by my policy then it's fine but if this action is different than the action which my policy suggests then I'm going to change my policy. For example, in this case my policy is going to change and for state S my policy will recommend A3 instead of A1 and we do this for all the states. So this is the key intuition that I want all of you to understand.

What if action A1 is not the best? What if the expected return was higher if we had taken action A2 or action A3? If this is the case, then our policy is definitely not the optimal policy and we have to change our policy.

But how do we change our policy? For every state, we calculate the expected return after taking all possible actions. and then we find the action which gives the maximum value.

Now we'll come to the mathematics of uh uh after we understand this intuition. So let's think about this how we can calculate the expected return for all these different paths. So as we looked at before the expected return can be written down like this.

This is the expected return which the agent receives uh for all these different parts. And what I have to do is I have to look at this formula and I have to calculate the maximum of this maximum for all possible actions. So if if the action prescribed by our present policy matches this action then we keep the action otherwise we change the previous action to this action.

This is how we update the policy and bring it closer to the optimal policy. Now you might ask me the question uh Raj this is fine but why did we do policy evaluation before? How is that helpful for us over here?

The reason it is helpful is that we cannot estimate the expected return from a state unless we know the value function of the next state. In other words, if we have to calculate this value for all the all the different actions, we need to know the estimate for these for this value function. And this is where policy evaluation comes into the picture because we have already solved the policy evaluation problem.

We know what is the value of vi for state sdash and we can use this estimate as the value function estimate for the next state. So I'm going to again explain this using an intuition. Uh let's say I am at state S.

My policy evaluation has given me the value function for this state S. And I have these three different paths which I can take. These three different paths are going to lead me to different states and my policy evaluation has given me a way to estimate the value function for all these different states as well.

So now what I'm going to do is that I'm going to say that let's say my current policy is saying that you should take action A1 and uh I calculate the expected return for A1. I calculate the expected return for A2 and I calculate the expected return for A3. So let's say this expected return is five, this is three and this is eight.

This means that my policy is not optimal because it is giving me it is suggesting me a path which gives me a return of five. But this action A3 gives an expected return of 8. So I will change my policy.

I will change my policy such that when I am at state S, my agent will now take action A3 instead of action A1. This is the overall intuition. Just rewinding back a little bit, we cannot calculate this expected return unless we know the value function for the next state because the expected return is heavily dependent on that.

The expected return is in fact given by something like this. And this we know because we have solved the policy evaluation problem and we know the value function estimate for the state sdash. So this is how we nudge the policy towards a better policy.

Now this is still not an optimal policy but we have nudged the policy towards a better policy. So now the the remaining question or the final question is that this is fine but how do we get the optimal policy? The the current policy gives us actions for each state which are better but the the policy is still not optimal.

How how how am I guaranteed of the optimal policy? And I imagine that when when people were solving this problem uh five six decades back uh they were wondering about this same question and probably they thought something in these lines. They said that initially I used a policy I calculated the value function for that policy and then I said that okay this policy is not optimal.

So I used that value function to change my policy. But now that value function is not the correct value function for this changed policy. So I have to again calculate the value function for this new changed policy.

So I get a new value function. Then I again have to check whether my policy is optimal or not. If my policy is not optimal then I again nudge my policy make it better.

I again recalculate the value function. I again nudge my policy. I keep on doing this until I reach a stage where for all the states which I'm encountering, whichever action my policy is recommending is giving the best expected return.

And this is exactly what happens in the in policy iteration. In policy iteration, we we do something like this. We assume an initial policy.

we get the value function for that policy. This is policy evaluation. Okay.

Uh then after that what we do is we nudge the policy towards a better policy. uh we take this uh value function and we use this value function to nudge the policy towards a better policy. This is exactly what we saw in this section which is policy improvement.

So this is policy improvement. Then what we do is now this this policy is better than the previous policy but we again have to calculate the value function for this policy. So again we do policy evaluation then again we do policy iteration and we do this step by step again and again so that we finally reach the optimal policy.

So uh essentially it start it it's it's it's it's something like a cone where let's say the top of the cone is the value functions and the bottom part of the cone is the policy. So you start with this then you get a better policy you again update the value function you get a better policy update the value function you get a better policy till you get an optimal policy and you get a value function for that optimal policy you get pi star and you get v star this is called as policy iteration so there are multiple rounds of policy evaluation and policy improvement ment until we get to the stage where we have the optimal policy and this is exactly how uh we solved the recycling robot last example which we took in the last lecture at that time I did not show you how this how the problem was solved but this method was used to solve that problem and to find uh the optimal uh policy and the corresponding optimal value function. This is the whole essence of dynamic programming.

But the two steps we saw the first is the prediction problem and the second problem is actually called as the control problem where we are trying to nudge the policy towards a better policy and find an optimal policy. So the prediction problem and the control problem is form the base for other methods of reinforcement learning type of problems as well. Now uh we are going to look at an example which will help us understand can policy iteration be really used to solve practical problems or not or it's just some theoretical construct to make mathematicians feel better.

So let's let's try to understand that by taking an example. Let's say I own a car rental and people come to my shop to purchase cars on rent. For every car that I rent, I get 10 rupees from the national company.

And I have two warehouses, warehouse one and warehouse 2. So I can rent my cars from either of the warehouses. So you might now think that okay since I'm getting 10 rupees for each car rented I can just rent as many cars as as I can.

So uh that is the objective to maximize the earnings that I get from renting cars. Now let's say the cars are not available in one warehouse. Let's say the cars get over uh so so you can transfer cars from one warehouse to the other warehouse but that incurs you a cost of two uh two rupees.

This is done overnight. uh this transfer can be done overnight and what we are going to assume is that we are going to assume that at the location one the requests and the returns are distributed according to the poisonson distribution requests graph is shown in the blue. It means how many cars are being requested by the people and returns is meaning how many cars are being returned back to me.

Uh so if you are not aware of the poison distribution don't worry it's a it's a probability distribution which tells us the distribution of the car uh and and distribution of an event. So here the there are number of cars and I can see that maximum I am people are requesting for three cars a day. Uh but here yeah maximum three cars a day and here probably four cars a day.

So at at location two there are more number of requests and return wise it's kind of two cars are returned maximum and here we have two and three cars. So it's more or less similar. So we assume that there are there will be no more than 20 cars in each location and maximum of five cars can be transferred from one location to another.

Uh now the the question that we want to ask is that given this information about the warehouses uh the requests and returns etc. Can we develop a policy for this problem? Now you might ask this question.

Okay, how do I even start to think about this problem? For such type of problems, it's always beneficial to think about an agent environment interface. To start off with, you need to think about what are your states, what are your actions and what are your rewards.

So here the states are the number of cars in each location. The actions are how many cars I'm moving between the locations overnight. Uh and the reason this is kept as an action is because the cars I moving overnight is influencing how many cars are are present in each location.

Do I have enough cars to manage the requests given the number of returns? and the reward is the total money made. Now uh we will use policy iteration to find the best policy for this problem.

But before that what is even a policy here. So for every state you need to tell me which is the best action for that state. For example if I have six and seven cars in two locations how many cars should I move from each location between locations overnight?

It can be zero also. So a positive action means cars moved from location one to location two and a negative action means cars moved from location two back to location one. And then uh I have attached a code uh at at the bottom of uh in in the description of this video so that you can run the policy iteration by yourself.

But it it it essentially uses the same logic that we have discussed. So here uh this is the policy uh evaluation where you can see that the value functions are updated with the previous value functions plus this term r + gamma v pi of sdash. And remember this is the the same term which we encounter in uh the bellman equation.

So this part of the code is policy evaluation where we are finding the value function for a given policy and this part of the code is for policy improvement. So here you can see what I'm doing is I'm calculating uh arg max means the maximum value for for all these actions and the action values are given by this. So essentially for every state what I'm doing is I'm calculating which action is giving me the maximum expected return from that state and this is uh embedded here r + gamma * v pi multiplied by the probability and np.

t argmax mean is I'm finding the best value and if my policy is uh already giving me the best value then there is no problem but if my policy is not giving me the best value then I have to iterate my policy and policy table is false okay so if the policy table is true then the policy will converge because for all states it's giving my policy is giving me the best possible action but until that happens I'll keep on improving my policy and finally you get a graph which looks like this this is the optimal policy for Rajat's car rental this is quite interesting because let's say I am in this phase so this phase means that let's say here I have a lot of cars in location one and less cars in location two. Then my policy tells me that you should move cars from location one to location two. But as I move towards the right, you can see that the number of cars moved also also go down.

So for example, if if I am somewhere uh if I am somewhere here or if I move down also it goes down because if I move down it means my cars in location one are less and if I move right it means that the cars in the location two are increasing and a similar pattern is followed over here. If I have a lot of cars in location two then I want to move as many cars to location one as possible. But essentially this graph is giving me exactly for each state how many cars should I move for that state.

So it's it's it's very interesting to see how this policy iteration in uh using the dynamic programming method allows you to solve these practical problems. Now some of you uh might have one one question. In the last lecture, we saw the Bellman optimality equation, right?

Which directly gave us the formula for the optimal value function, which was this formula. You can go back to the last lecture and see. So you might ask the question why don't we directly do this?

Why go through all this pain of policy evaluation and iteration? It's isn't it the fact that we are overdoing our work? Why can't we directly take the maximum for the action values for all the different states and actions?

So this is in fact a method in which what you do is you take the maximum of this expression and you find out which action gives you the maximum which is essentially what we are doing in uh policy improvement. But here the policy iteration, policy evaluation and policy improvement are happening in one single step. Where is the evaluation happening?

Evaluation is happening here. We are updating the values by getting the value values of the previous value function. So VK and VK + one.

And then where is the policy improvement? It is right here. We are choosing the action which gives us the maximum expected return.

And this method is called as value iteration. It reduces the two steps of policy evaluation and policy iteration into one single step. I'm not focusing on this uh in detail because uh for the purposes of this course, we only need to understand the basics of policy evaluation and uh and policy iteration.

uh so we we do not need to go into the depth of value iteration that much. So this brings me to the end of the lecture. Uh I hope you enjoy uh dynamic programming and if you hear the word dynamic programming later in life you don't get uh you don't feel like it's it's something very complex.

Um and just one some things to note are dynamic programming requires a model of the environment. um it it's not a learning method as such. It does not learn from experience and it gives you a method to find out optimal policies by following the two steps.

First step is policy evaluation and second is policy improvement and the process of iterating both of this is called as policy iteration. In the next two lectures, we will cover Monte Carlo methods and temporal different methods which will solve these drawbacks and those methods are also very interesting. Thank you everyone and I'll see you in the next lecture.

Lecture 7 - Dynamic Programming | Reinforcement Learning Phase | Reasoning LLMs from Scratch