A three-part lecture today, still continuing on the theme of reinforcement learning. In part one I'll be speaking: I'm following up on last week's discussion of causal inference and Tuesday's discussion of reinforcement learning, and I'll go into one more subtlety that arises there, one where we can develop some nice mathematical methods to help. Then I'm going to turn the show over to Barbara, whom I'll formally introduce when the time comes. She's going to talk about some of her work on evaluating dynamic treatment regimes, and eventually lead a discussion on the sepsis paper, which was required reading for today's class. So those are the three parts of today's lecture.

I want you to put yourselves back in the mindset of Tuesday's lecture, where we talked about reinforcement learning. Remember that the goal of reinforcement learning was to optimize some reward. Specifically, our goal was to find the policy π* = argmax_π V(π), the argmax over all possible policies π of V(π). Just to remind you, V(π) is the value of the policy π. Formally, it's defined as the expectation, under π, of the sum of the rewards across time: V(π) = E_π[ Σ_t R_t ]. The reason I'm writing this as an expectation subscripted by π is that there's stochasticity both in the environment and, possibly, in π itself, which may be a stochastic policy. And it sums over time steps because this is not a single-time-step problem: we're considering interventions across time, with a reward at each point in time. The reward function could either give a reward at each point in time, or you might imagine that it's zero for all time steps except the last.

So the first question I want to think about is: what are the implications of this as a learning paradigm? If we look at what's going on over here, hidden in my notation is also an expectation over X, the patient, for example, or the initial state. Intuitively, this is saying: let's try to find a policy that has high expected reward, averaging over all patients. And I just want
you to think about whether that is indeed the right goal. Can anyone think of a setting where that might not be desirable? Yes: what if the reward is the patient living or dying? You wouldn't want a policy that looks good in expectation by trading some patients' outcomes against others'. Right. If the reward is something mission-critical, like a patient dying, you really want to avoid that happening as much as possible. Of course there are other criteria we might be interested in as well. Both in Fredrik's lecture on Tuesday and in the readings, we talked about how there might be other aspects, like making sure a patient is not just alive but also healthy, which might play into the reward function, so there might be reward terms for those. And if you were to just put, say, minus infinity on the patient dying, that's a non-starter: unfortunately, in this world we're not always going to be able to keep patients alive, so you would end up with an infeasible optimization problem.

So minus infinity is not an option; we're going to have to put some finite number on it in this type of approach. But then you start trading off between patients. There are two different situations you might imagine: one where the reward is somewhat balanced across patients, and another where some patients have very small rewards and a few patients have very large rewards. Both could give you the same average, obviously, but the two are not necessarily equally useful. We might want to say that we prefer to avoid the worst-case situation. So one could imagine other ways of formulating this optimization problem: maybe you want to control the worst-case reward instead of the average-case reward, or maybe you want to say something about different quartiles. I just wanted to point that out, because it's really the starting place for a lot of the work being done here. So now, returning
to this goal: we've done our policy iteration, or we've done our Q-learning, and we get a policy out. We might now want to know, well, what is the value of that policy? What is our estimate of that quantity? To get it, one could just try to read it off from the results of Q-learning, by computing the estimate V̂(π) as the maximum over actions a of your Q-function evaluated at whatever your initial state is and the optimal choice of action a: V̂(π) = max_a Q(s₀, a). All I'm saying here is that the last step might be to ask, what is the expected reward of this policy? If you remember, the Q-learning algorithm is in essence a dynamic programming algorithm, working its way from large values of time back to the present, and it is indeed computing exactly the expected value you're interested in. So you can just read it off from the Q-values at the very end.
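As a minimal illustration (assuming a tabular setting and a dictionary of learned Q-values, which is my own framing rather than anything specified in lecture), reading the value off might look like this:

```python
def value_from_q(Q, s0):
    """Read the estimated value of the greedy policy off learned Q-values:
    V_hat(pi) = max over actions a of Q(s0, a), for initial state s0.

    Q  : dict mapping (state, action) pairs to values, e.g. from tabular
         Q-learning, which fills the table backwards in time
    s0 : the initial state
    """
    return max(v for (s, a), v in Q.items() if s == s0)
```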
But I want to point out that there's an implicit policy built in here, and I'm going to compare this in a second to what happens under the causal inference scenario, with just a single time step and the potential outcomes framework that we're used to. Notice that the value of this policy (the reason it's a function of π) is that the value depends on every subsequent action that you take as well. So now let's compare that for a second to what happens within the potential outcomes framework. That's our starting place.

So now I'm going to turn our attention for just one moment from reinforcement learning back to causal inference. In reinforcement learning we talked about policies: how do we find policies that do well in terms of the expected reward? But when we were talking about causal inference, we only used terms like the average treatment effect or the conditional average treatment effect (CATE). For example, to estimate the conditional average treatment effect, what we said is that we first learn, if we use a
covariate adjustment approach, some function f(x, t), which is intended to be an approximation of the expected value of the outcome, E[Y(t) | X = x], in potential outcomes notation. So the goal of covariate adjustment was to estimate this quantity, and we could then use it to construct a policy. For example, you could consider the policy π(x) which is simply 1 if your estimate of the CATE for x is positive, and 0 otherwise. Just to remind you, the way we got the estimate of the CATE for an individual x was by looking at f(x, 1) − f(x, 0).

Okay. So now we're going to start thinking about policies in the context of causal inference, just like we were doing in reinforcement learning, and I want us to think through what the analogous value of a policy would be. How good is that policy? It could be any policy, but right now I'm just going
to focus on the policy shown up here. One approach to evaluating how good that policy is, is exactly analogous to what we did in reinforcement learning. In essence, we evaluate the policy by averaging over your empirical data: for each individual x_i, if the policy says to give treatment 1 (that is, π(x_i) = 1), the value on that individual is f(x_i, 1); and if the policy would give treatment 0, the value on that individual is f(x_i, 0). Written out, the per-individual value is π(x_i) · f(x_i, 1) + (1 − π(x_i)) · f(x_i, 0). I'm going to call this an empirical estimate of what you should think of as the reward of the policy π, analogous to the estimate of V(π) you would get in a reinforcement learning context, but now we're talking about the policy explicitly. So let's dig down a little deeper and think about what this is actually saying.
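To make that formula concrete, here's a minimal Python sketch of this covariate-adjustment-based estimate; it assumes a fitted outcome model f(x, t), and the function names are my own illustration:

```python
import numpy as np

def policy_value_direct(pi, f, X):
    """Direct-method estimate of a policy's value from an outcome model:
    (1/n) * sum_i [ pi(x_i) * f(x_i, 1) + (1 - pi(x_i)) * f(x_i, 0) ].

    pi : function x -> {0, 1}, the policy being evaluated
    f  : fitted outcome model, f(x, t) approximating E[Y(t) | X = x]
    X  : array of covariate vectors, one row per individual
    """
    return float(np.mean([pi(x) * f(x, 1) + (1 - pi(x)) * f(x, 0) for x in X]))

def make_cate_policy(f):
    """The policy from the lecture: treat iff the estimated CATE,
    f(x, 1) - f(x, 0), is positive."""
    return lambda x: int(f(x, 1) - f(x, 0) > 0)
```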
Imagine a setting where you just have a single covariate x; we'll think of x as being, let's say, the patient's age. Unfortunately there's just one color here, but I'll do my best. Imagine that the potential outcome Y(0), as a function of the patient's age x, looks like this; and imagine that the other potential outcome, Y(1), looks like that. So I'll call this the Y(1) potential outcome. Suppose now that the policy we're defining is the one from before: give treatment 1 if the conditional average treatment effect is positive, and 0 otherwise. I want everyone to draw what the value of that policy is on a piece of paper. Sorry, I want everyone to write on a piece of paper what the value of the policy would be for each individual. It's going to be a function of x; I'm looking for Y(π(x)). So I'm looking for you to draw that plot, and feel free to talk to your neighbor; in fact, I encourage you to talk to your neighbor.

Just to connect this a little better to what I have up here, I'm going to assume that this curve is f(x, 1) and this one is f(x, 0). All right, any guesses, what does this plot look like? Someone who hasn't spoken in the last week and a half, if possible. Something like this until the intersection point, and then like that? Yes, that's exactly what I'm going for. Let's try to think through why that is the value of the policy. Here, the CATE, which is looking
at the difference between these two lines, is negative. So for every x up to this crossing point, the policy we've defined over there is going to perform action... wait, am I drawing this correctly? Maybe it's actually the opposite; you should be doing action one. Okay: here the CATE is negative, so by my definition the action performed is action zero, and the value of the policy is actually this curve. Wait... oh good, you all knew where I was going, because this is the graph I have in my notes. I was getting worried. So it takes action zero all the way up until you get over here; and then over here the CATE suddenly becomes positive, so the action chosen is one, and so the value of the policy is Y(1). One could write this a little differently, and in the case of just two actions I'm going to write it in a way
that makes it really clear. In the case of just two actions, one can write this equivalently as an average over the data points of the maximum of f(x, 0) and f(x, 1). This simplification, turning the earlier formula into this one, assumes that the π being evaluated is precisely this CATE-thresholding π. The equivalence holds only for that π; for another policy which doesn't look at the CATE, or which, for example, thresholds the CATE at some value γ, it wouldn't quite be this; it would be something else.

But I've gone a step further here. What I've shown you right here is not the average value but the individual values, where I've drawn the max function. What this estimator is actually looking at is the expected reward, which averages across all x. So to truly draw the connection between the plot we're drawing and the average reward of the policy, what we should be looking at is the average height of this upper envelope, which will be, you know, something like that. That value is the expected reward.

This all goes to show that the expected reward of this policy is not a quantity we considered in the previous lectures on causal inference. It is not the same as the average treatment effect, for example. So I'm giving you, number one, one way to think about what policy you might want to derive when you're doing causal inference, and number two, one way to estimate the value of that policy, which goes through the process of
estimating potential outcomes via covariate adjustment. But you might wonder, just like when we talked about causal inference, where I said there are two approaches (there are more than two, but we focused on two: covariate adjustment, and inverse propensity score weighting), is there another approach to this problem altogether? Is there an approach that wouldn't have to go through estimating the potential outcomes at all? That's what I'll spend the rest of this third of the lecture talking about. To help you page this back in, remember that
we derived in last Thursday's lecture an estimator for the average treatment effect, which was

(1/n) Σ_{i : t_i = 1} y_i / e_i − (1/n) Σ_{i : t_i = 0} y_i / (1 − e_i),

where y_i is the observed outcome for data point i and e_i is the propensity score, e_i = P(T = 1 | X = x_i). By the way, there was a lot of confusion in class about why I have a 1/n here and a 1/n here, and not one over the number of treated data points and one over the number of control data points. I expanded the derivation I gave in class and posted new slides online after class; if you're curious about that, go to those slides and look at the derivation.
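Here is that estimator as a minimal Python sketch (the function name is my own); note that both sums are divided by the full n, the point that caused the confusion:

```python
import numpy as np

def ate_ipw(Y, T, e):
    """Inverse propensity weighted estimate of the average treatment effect:
    (1/n) * sum_{i: t_i=1} y_i / e_i  -  (1/n) * sum_{i: t_i=0} y_i / (1 - e_i)

    Y : observed outcomes, shape (n,)
    T : binary treatments actually received, shape (n,)
    e : propensity scores e_i = P(T = 1 | X = x_i), shape (n,)
    """
    Y, T, e = np.asarray(Y), np.asarray(T), np.asarray(e)
    n = len(Y)
    treated = np.sum(Y[T == 1] / e[T == 1]) / n        # note: / n, not / n_treated
    control = np.sum(Y[T == 0] / (1 - e[T == 0])) / n  # note: / n, not / n_control
    return treated - control
```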
In a very analogous way, I'm now going to give you a new estimator for the same quantity I had over here, the expected reward of a policy. Notice that this quantity makes sense for any policy; it doesn't have to be the policy that checks whether the CATE is greater than 0 or not. It holds for any policy; the simplification I gave applied only in that particular setting. I'm going to give you another estimator for the average value of a policy which doesn't go through estimating potential outcomes at all; analogous to the ATE estimator, it's just going to make use of the propensity scores.
I'll call it R̂, with a superscript IPW for inverse propensity weighted; it's a function of π, and it's given by the following formula:

R̂^IPW(π) = (1/n) Σ_i 1[t_i = π(x_i)] · y_i / P(t_i | x_i).

That is, it's a sum over the data points of an indicator function for whether the treatment actually given to the i-th patient equals what the policy would have done for the i-th patient. By the way, here I'm assuming that π is a deterministic function: the policy says, for this patient, you should do this treatment. So we're going to look at just the data points for which the observed treatment is consistent with what the policy would have done for that patient (the indicator is 0 otherwise), take the observed outcome y_i, and divide by the probability of t_i given x_i. The way I'm writing this, by the way, is already very general, so the formula holds for non-binary treatments as well. That's one of the really nice things about thinking in terms of policies: whereas the average treatment effect only makes sense in a comparative sense, comparing one treatment to another, when we ask how good a policy is, it's not a comparative statement at all. The policy does something for everyone, and you can ask, what is the average value of the outcomes you get from the actions it takes for those individuals?

So this is now a new estimator. I'm not going to derive it for you in class, but the derivation is very similar to what we did last week when we derived the average treatment effect estimator, and the critical point is that we're dividing by the propensity score, just like we did over there. If all of the assumptions hold, then with infinite data this should give you exactly the same estimate as the covariate adjustment approach, but here you're not estimating potential outcomes at all; you never have to impute the counterfactuals. All it relies on is knowing the propensity scores for each of the data points in your training set, or in some dataset.
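As a minimal sketch (names are my own), the estimator looks like this in Python:

```python
import numpy as np

def policy_value_ipw(pi, X, T, Y, p_obs):
    """IPW estimate of the value of a deterministic policy pi:
    (1/n) * sum_i 1[t_i = pi(x_i)] * y_i / P(t_i | x_i).

    pi    : function x -> treatment (not necessarily binary)
    X     : covariates, one row per individual
    T     : treatments actually received
    Y     : observed outcomes
    p_obs : p_obs[i] = P(T = t_i | X = x_i), the propensity of the
            treatment actually given (known exactly in an RCT)
    """
    T, Y, p_obs = np.asarray(T), np.asarray(Y), np.asarray(p_obs)
    match = np.array([pi(x) == t for x, t in zip(X, T)])  # 1[t_i = pi(x_i)]
    return float(np.mean(match * Y / p_obs))
```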
So, for example, this opens the door to tons of exciting new directions. Imagine that you had a very large observational dataset and you learned a policy from it; for example, you might have done covariate adjustment and then said, okay, based on covariate adjustment, this is my new policy. Now you want to know: how good is it? Suppose you then run a randomized controlled trial, and in your randomized controlled trial you have 100 people, maybe 200 people. That's not that many, and not nearly enough to have actually estimated your policy alone; you might have needed thousands or millions of individuals to estimate your policy. Now you only have a couple hundred individuals on whom you can actually afford to run a randomized controlled trial. For those people, because you're flipping a coin for which treatment they're going to get (suppose we're in a binary setting with only two treatments), the propensity P(t_i | x_i) is always one-half. And what I'm giving you here is an unbiased estimate of how good that policy is, which you can now estimate using that randomized controlled trial.

This also might lead you to think through the following question: rather than obtaining a policy through the lens of figuring out how to estimate the CATE, maybe we could have skipped that altogether. Suppose we had that randomized controlled trial data, and imagine that rather than 100 individuals, you had a really large randomized controlled trial with 10,000 individuals in it. This now opens
the door to thinking about directly maximizing (or minimizing, depending on whether you want this quantity to be large or small) over π with respect to this quantity, which completely bypasses the goal of estimating the conditional average treatment effect. And you'll notice how this looks exactly like a classification problem: this quantity here looks exactly like a 0-1 loss, and the only difference is that you're weighting each of the data points by this inverse propensity term. So one can reduce the problem of actually finding an optimal policy to that of a weighted classification problem, in the case of a discrete set of treatments.

But there are two big caveats to that line of thinking. The first major caveat is that you have to know these propensity scores. If you have data coming from a randomized controlled trial, you will know those propensity scores. Or if you have some control over the data-generation process, for example if you are a company choosing which ads to show to your customers, and then you look to see who clicks on what, you might know exactly what the policy was that was showing things, and in that case you might know the propensity scores exactly. In healthcare, outside of randomized controlled trials, we typically don't know this value. So we either have to have a large enough randomized controlled trial that we won't overfit by trying to directly optimize this, or we have to work in an observational data setting and estimate the propensity scores ourselves. You would then have a two-step procedure, where first you estimate the propensity scores, for example
by doing logistic regression, and then you attempt to maximize or minimize this quantity in order to find the optimal policy. That has a lot of challenges, because the quantity shown at the very bottom here, the propensity of the observed treatment, could be really small or really large in an observational dataset, due to these issues of having very small overlap between your treatment groups; and this being very small implies that the variance of this estimator is very, very large. So when one wants to use an approach like this (similar to when one wants to use an average treatment effect estimator) and one is estimating these propensities, you often need to do things like clipping the propensity scores in order to prevent the variance from being too large. That, however, typically leads to a biased estimator.
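To make the two-step observational procedure concrete, here's a hedged scikit-learn sketch: estimate propensities by logistic regression, clip them, and fit a policy by weighted classification. Exactly optimizing the weighted 0-1 loss is the reduction described above; using logistic regression as the policy class is a convex-surrogate choice of mine, and the sketch assumes binary treatments and non-negative rewards:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_policy_weighted_classification(X, T, Y, clip=0.05):
    """Two-step sketch: (1) estimate propensities and clip them away from
    zero (trading variance for bias, as discussed); (2) reduce policy
    learning to weighted classification, weighting each example by
    y_i / p_i so that matching the observed action on high-reward,
    low-propensity examples matters most. Assumes Y >= 0.
    """
    prop_model = LogisticRegression().fit(X, T)
    p1 = prop_model.predict_proba(X)[:, 1]
    p_obs = np.where(T == 1, p1, 1 - p1)   # propensity of the action taken
    p_obs = np.clip(p_obs, clip, None)     # the clipping step
    weights = Y / p_obs                    # IPW-derived example weights
    policy = LogisticRegression().fit(X, T, sample_weight=weights)
    return policy                          # policy.predict(x) = learned treatment
```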
I wanted to give you a couple of references here. One is Swaminathan and Joachims, ICML 2015. In that paper they tackle this question, focusing on the setting where the propensity scores are known, such as you would have from a randomized controlled trial, and they recognize that you might prefer something like a biased estimator, because these propensity scores could be really small. So they use generalization results from the machine learning theory community to control the variance of the estimator as a function of the propensity scores, and they then learn the policy directly by what they call counterfactual risk minimization, in order to generalize as well as possible from the small amount of data you might have available.
The second reference I want to give, just to point you to this literature if you're interested, is by Nathan Kallus and his student, I believe Angela Zhou, from NeurIPS 2018; that paper was one of the optional readings for last Thursday's class. In that paper they also start from this perspective, and they ask: now that we're working in this framework, what happens if you actually have unobserved confounding? In that case you might not actually know the true propensity scores, because there are confounders you don't observe, and you can think about trying to bound how wrong your estimator can be as a function of how much you don't know this quantity. They show that if you think about having some backup strategy, where your goal is to find a new policy that performs as well as possible relative to an old policy, then this gives you a really elegant framework for robust optimization of this objective, even accounting for the fact that there might be unobserved
confounding.

So, I'm nearly done now, and I want to finish with this thought: can we do the same thing for policies learned by reinforcement learning? Now that we've built up this language, let's return to the RL setting. There, one can show that you can get a similar estimate for the value of a policy by summing over your observed sequences and over the time steps of each sequence, taking the reward observed at each time step times a ratio of probabilities. The product runs from the first time step up to time t, and each factor is the probability under π that you would take the observed action a_{t′} given the observed state s_{t′}, divided by the probability, under the data-generating process (call it μ, the behavior policy; this is where the knowledge of the propensity scores comes in), of taking that action given that state:

V̂^IS(π) = (1/n) Σ_sequences Σ_t r_t · Π_{t′ ≤ t} [ π(a_{t′} | s_{t′}) / μ(a_{t′} | s_{t′}) ].

If, as we discussed, you had a deterministic policy, then this π would just be a delta function, and so this estimator would only be looking at sequences where the precise sequence of actions taken is identical to the precise sequence of actions the policy would have taken. The difference from before is that instead of a single propensity score, one has a product of propensity scores, corresponding to the propensity of observing each action given the corresponding state at each point along the sequence. This is nice because it gives you one way to do what's called off-policy evaluation, and this estimator is completely analogous to the estimator we got from Q-learning: if all assumptions were correct and you had a lot of data, the two should give you precisely the same answer. But here, as in the causal inference setting, we are not making the assumption that we can do covariate adjustment well; or, said differently, we're not assuming that we can fit the Q-function well. Instead, just as before, this is based on the assumption that we can really accurately know what the propensity scores are.
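Here's a minimal per-decision importance sampling sketch of that estimator; it assumes trajectories stored as (state, action, reward) triples and access to both policies' action probabilities, and the function names are my own:

```python
import numpy as np

def policy_value_is(trajectories, pi, mu):
    """Per-decision importance sampling estimate of V(pi) from off-policy data.

    trajectories : list of trajectories, each a list of (s, a, r) triples
    pi(a, s)     : probability the evaluated policy takes a in s
                   (a 0/1 delta function for a deterministic policy)
    mu(a, s)     : probability the behavior (data-generating) policy took
                   a in s; the sequential analogue of the propensity score
    """
    returns = []
    for traj in trajectories:
        rho, total = 1.0, 0.0
        for (s, a, r) in traj:
            rho *= pi(a, s) / mu(a, s)  # product of ratios up to time t
            total += rho * r            # weight each reward by the ratio so far
        returns.append(total)
    return float(np.mean(returns))
```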
So it now gives you an alternative approach to evaluation, and you could think about checking the robustness of your estimates across these two different estimators. And this is the most naive of the estimators; there are many ways to try to make it better, such as doubly robust estimators, and if you want to learn more, I recommend reading the paper by Thomas and Brunskill from ICML 2016. With that, I want Barbara to come up and get set up, and we're going to transition to the next part of the lecture.

Yes, a question? One easy way to think about this is to suppose you only had a reward at the last time step. Then you wouldn't have this sum over t, because the rewards at the earlier steps would be zero; you would just have the product going from time zero up to the last time step, capital T. The reason you have the product up to each time step is that one wants to appropriately weight, in essence, the likelihood of seeing that reward at that point in time. One could rewrite this in other ways, but I want to hold other questions, because this part of the lecture is going to be much more interesting than my part.

With that, I want to introduce Barbara. I first met her when she invited me to give a talk in her class last year. She is an instructor at Harvard Medical School, or the School of Public Health, and she recently finished her PhD, in 2018; her PhD looked at many questions related to the themes
of the last couple of weeks. Since that time, in addition to continuing her research, she's been really leading the way in creating data science curriculum over at Harvard. So please, take it away.

Thank you so much for the introduction, David. I'm very happy to be here to share some of my work on evaluating dynamic treatment strategies, which you've been talking about over the past few lectures. My goals for today: I'm just going to breeze over defining dynamic treatment strategies, since you're already familiar with them, but I would like to touch on when we need a special class of methods called g-methods, and then we'll talk about two different applications, two different analyses, that have focused on evaluating dynamic treatment strategies. The first will be an application of the parametric g-formula, a powerful g-method, to cancer research; the goal here is to give you my causal inference perspective on how we think about this task of sequential decision making. Then, with whatever time remains, we'll be discussing a recent publication on the AI Clinician, to talk through the reinforcement learning perspective. I think it'll be a really interesting
discussion where we can share these perspectives and talk about the relative strengths and limitations as well; and please stop me if you have any questions.

You already know this: when it comes to treatment strategies, there are three main types. There are point interventions, happening at a single point in time, and there are sustained interventions, happening over time; when it comes to clinical care, these are often what we're most interested in. Within sustained interventions there are static strategies, which are constant over time, and then there are dynamic strategies, which we're going to focus on. These differ in that the intervention over time depends on evolving characteristics. For example: initiate treatment at baseline and continue it over follow-up until a contraindication occurs, at which point you may stop treatment and decide with your doctor whether to switch to an alternate treatment; you would still be adhering to that strategy even though you quit. The comparison here being: do not initiate treatment over follow-up unless an indication occurs, at which point you may start treatment and still be adhering to the strategy.

We're focusing on these because they're the most clinically relevant, and clinicians encounter them every day in practice. When making a recommendation to a patient about a preventive intervention, they're taking into consideration the patient's evolving comorbidities. When deciding the next screening interval, they'll consider the result of the last screening test. Likewise for treatment: when deciding whether to keep the patient on treatment or not, is the patient having any changes in symptoms or lab values that may reflect toxicity? One thing to note is that while many of the strategies you may see in clinical guidelines and in clinical practice are dynamic
strategies, these may not be the optimal strategies; maybe what we're recommending and doing is not optimal for patients. However, the optimal strategies will be dynamic in some way, in that they will adapt to individuals' unique and evolving characteristics. That's why we care about them.

So what's the problem? One problem deals with something called treatment-confounder feedback, which you may have spoken about in this class. Conventional statistical methods cannot appropriately compare dynamic treatment strategies in the presence of treatment-confounder feedback, that is, when time-varying confounders are affected by previous treatment. Let's ground this in a concrete example with this causal diagram. Say we're interested in the effect of some intervention A (vasopressors, or it could be IV fluids) on some outcome Y, which we'll call survival here. We know that vasopressors affect blood pressure, and blood pressure will affect subsequent decisions to treat with vasopressors. We also know that hypotension (so again, blood pressure, L₁) affects survival, based on our clinical knowledge. And in this DAG we also have the node U, which represents disease severity: these could be potentially unmeasured markers of disease severity that affect your blood pressure and also affect your probability of survival.

If we're interested in estimating the effect of a sustained treatment strategy, then we want to know something about the total effect of treatment at all time points. We can see that L₁ here is a confounder for the effect of A₁ on Y, so we have to do something to adjust for it. But if we were to apply a conventional statistical method, we would essentially be conditioning on a collider and inducing a selection bias: an open path from A₀ to L₁ to U to Y. What's the consequence of this? If we look in our dataset, we may see an association between A and Y, but that association is not necessarily because there's an effect of A on Y; it might not be causal, and may be due to this selection bias that we created. So this is the problem, and in these cases we need a special type of method that can handle these settings. A class of methods that was designed specifically to handle this is g-methods, and these are
sometimes referred to as causal methods. They've been developed by Jamie Robins and colleagues and collaborators since 1986, and they include the parametric g-formula, g-estimation of structural nested models, and inverse probability weighting of marginal structural models. In my research, what I do is combine g-methods with large longitudinal databases to try to evaluate dynamic treatment strategies. I'm particularly interested in bringing these methods to cancer research, because they haven't been applied much there, so a lot of my research questions are focused on questions like: how and when can we intervene to best prevent, detect, and treat cancer?

I'd like to share one example with you, which focused on evaluating the effect of adhering to guideline-based physical activity interventions on survival among men with prostate cancer. As motivation for this study: a large clinical organization, ASCO, the American Society of Clinical Oncology, actually called for randomized trials to generate these estimates for several cancers. The thing with prostate cancer is that it's a very slowly progressing disease, so the feasibility of doing a trial to evaluate this is very limited; the trial would probably have to be ten years long. So given the absence of this randomized evidence, we did the next best thing we could do to generate this estimate, which was to combine high-quality observational data with advanced epidemiologic methods, in this case the parametric g-formula. We leveraged data from the Health Professionals Follow-up Study, which is a well-characterized prospective cohort study.

In these cases, there's a three-step process we take to extract the most meaningful and actionable insights from observational data. The first thing we do is specify the protocol
of the target trial that we would have liked to conduct had it been feasible. The second thing we do is make sure that we measure enough covariates to approximately adjust for confounding and achieve conditional exchangeability. And the third thing we do is apply an appropriate method to compare the specified treatment strategies under this assumption of conditional exchangeability.

In this case, eligible men for this study had been diagnosed with non-metastatic prostate cancer, and at baseline they were free of cardiovascular and neurologic conditions that may limit physical ability. For the treatment strategies, men were to initiate one of six physical activity strategies at diagnosis and continue it over follow-up until the development of a condition limiting physical activity. This is what made the strategies dynamic: the intervention over time depended on these evolving conditions. And just to note, we pre-specified the strategies we were evaluating, as well as the conditions. Men were followed from diagnosis until death, loss to follow-up, ten years after diagnosis, or administrative end of follow-up, whichever happened first. Our outcome of interest was all-cause mortality within ten years, and we were interested in estimating the per-protocol effect of not just initiating these strategies but adhering to them over follow-up. And again, we applied the parametric g-formula.

I think you've already heard about the g-formula in a previous lecture, possibly in a slightly different way, so I won't spend too much time on this. The way I think about the g-formula is essentially as a generalization of standardization to time-varying exposures and confounders. It's basically a weighted average of risks, where you can think of the weights as being the probability density functions of the
time-varying confounders, which we estimate using parametric regression models; and we approximate the weighted average using Monte Carlo simulation. Practically, how do we do this? The first thing we do is fit parametric regression models for all of the variables we're going to be studying: for treatment, confounders, and death at each follow-up time. The next thing we do is Monte Carlo simulation, where essentially what we want to do is simulate the outcome distribution under each treatment strategy we're interested in. And then we bootstrap the confidence intervals.

I'd like to show you in a schematic what this looks like, because it might be a little easier to see. The idea is that we're going to make copies of our dataset, where in each copy everyone is adhering to the strategy that we're focusing on in that copy. How do we construct each of these copies? We have to build them each from the ground up, starting with time zero. The values of all of the time-varying covariates at time zero are sampled from their empirical distribution; these are actually observed values of the covariates. How do we get the values at the next time point? We use the parametric regression models that I mentioned we fit in step one. Then what we do is force the level of the intervention variable to be whatever was specified by the intervention strategy, and we estimate the risk of the outcome at each time period given these variables, again using the parametric regression model for the outcome. We repeat this over all time periods to estimate a cumulative risk under that strategy, which is taken as the average of the subject-specific risks.
So, under the hood, what's going on with this method is that you first estimate the Markov decision process, which allows you in essence to simulate from the underlying data distribution: you know the probability of the next observations given the previous observations and previous actions, and then, with that, you can intervene and simulate forward. If you remember, Fredrik gave you three different buckets of approaches and then focused on the middle one; this is the leftmost one. Do I have that right? Right, we didn't talk about it in those terms, but yes, that's very sensible. It seems very hard, though. Yes, that is a challenge; that is the hardest part about this, and it's relying on a lot of assumptions.

So, the primary results that come out after we do all of this: this is the estimated risk of all-cause mortality under several physical activity interventions. I'm not going to focus too much on the results;
I want to focus on two main takeaways from this slide. One thing to emphasize is that we pre-specified the weekly duration of physical activity (you can think of this as the dose of the intervention), and this was based on current guidelines. In the third row of each band we did look at a dose, a level beyond the guidelines, to see if there might be additional survival benefits, but these were all pre-specified. We also pre-specified all of the time-varying covariates that made these strategies dynamic. I mentioned that men were excused from following the recommended physical activity levels if they developed one of these listed conditions: metastasis, MI, stroke, etc. We pre-specified all of those. It's possible that a different dependence, on a different time-varying covariate, may have led to a more optimal strategy; there was a lot that remained unexplored.

We did a lot of sensitivity analyses as part of this project. I'd like to focus, though, on the sensitivity analyses we did for potential unmeasured confounding by chronic disease that may be severe enough to affect both physical activity and survival. The g-formula actually provides a natural way to at least partly address this, by estimating the risk under physical activity interventions that are, at each time point t, only applied to men who are healthy enough to maintain the physical activity level at that time. Again, in the main analysis we excused men from following the recommended levels if they developed one of these serious conditions; in sensitivity analyses, we then expanded this list of serious conditions to also include the conditions shown in blue
text. This attenuated our estimates but didn't change our conclusions. One thing to point out is that the validity of this approach rests on the assumption that at each time t we had the data needed to identify which men were healthy enough at that time to do the physical activity.

Great question: whether we censored men at that time. Because the strategy was pre-specified to say that if you develop one of these conditions, you may essentially do whatever level of physical activity you're able to do, we did not censor men at that time; importantly (I'm glad you brought this up), they were still followed, because they were still adhering to the strategy as defined. Thanks for asking.

So, given that we don't know whether the data contain, at each time t, the information necessary to know whether these men were healthy enough at that time, we conducted a few alternate analyses in which we lagged the physical activity and covariate data by two years, and we also used a negative outcome control to explore potential unmeasured confounding by chronic disease, or disease severity. What's the rationale behind this? In the DAGs below, for the original
analysis, we have physical activity A, we have survival Y, and this may be confounded by disease severity U. When we see an association between A and Y in our data, we want to make sure that it's causal, that is, because of the blue arrow and not because of this confounding bias, the red arrow. So how can we potentially provide evidence about whether that red pathway is there? We selected questionnaire non-response as an alternate outcome, instead of survival: an outcome that we assumed was not directly affected by physical activity, but that we thought would be similarly confounded by disease severity. When we repeated the analysis with this negative outcome control, we found that physical activity had a nearly null effect on questionnaire non-response, as we would expect, which provides some support that in our original analysis the effect of physical activity on death was not confounded through the pathways explored by the negative control. One thing to highlight here is that the sensitivity analyses were driven by our subject-matter knowledge; there was nothing in the data that drove this.

So, just to recap this portion: g-methods are a useful
tool because they let us validly estimate the effects of pre-specified dynamic strategies, and estimate adjusted absolute risks (which are clinically meaningful to us) and appropriately adjusted survival curves, even in the presence of treatment-confounder feedback, which occurs often in clinical questions; and of course, this is under our typical identifiability assumptions. This makes it a powerful approach for estimating the effects of currently recommended or proposed strategies, strategies that we can therefore specify and write out precisely, as we did here. However, these pre-specified strategies may not be the optimal strategies. Again, when I was doing this analysis, I was thinking: there are so many different weekly durations of physical activity that we're not looking at, and so many different time-varying covariates on which these strategies could have different dependencies over time; maybe those would have led to better survival outcomes among these men. But all of that was unexplored.