[Music] Hello everyone, and welcome back to this series on intermediate statistics. Today we're going to be talking about a very important topic known as time series analysis. I'm only going to be doing an introduction to time series today, but I want to illustrate some of the differences and nuances that time series has compared to regression. So first off, what is a time series? A time series is just a sequence of data points. Our data points are commonly represented by $y_t$, though some
people use $S$ to represent stocks in the financial realm, and they are indexed with respect to time $t$. This is the important thing here: time is going to be, in some sense, our predictor variable. With regression, we usually do not have predictor variables that are associated with time. But if we were to pretend that time $t$ is a predictor, then one could say, couldn't I just fit a regression model to this? For example, I could fit a line of best fit via least squares, or I could possibly even fit a
polynomial regression, because according to this graph a polynomial would probably fit a lot better, and I could say, okay, I can use this cubic polynomial to predict the stock value at any time $t$. So a natural question you may be asking: can time series problems be solved via regression? The answer, take a guess: technically no, but practically yes. Keep in mind we can use wrong things and still get accurate models in the end. So if the trend for our
time series does continue in this way, then it's possible that our polynomial regression model will predict the stock values correctly at future times. But technically speaking, $y$ doesn't depend on time; $y$ is just observed through time. That's why it's technically no, but practically yes to some degree. So let's state a couple of things that we're going to be looking at here. In regression, what are we trying to do? Well, we're trying to predict $y$ at some particular value of $x$ via some
constants, $\beta_0 + \beta_1 x$, and usually we have this difference term or error term, usually called a residual in regression, denoted $\varepsilon$. So what exactly does this do? It assumes a linear dependency or correlation between $x$ and $y$ (you could do quadratic, but generally we start at linear). That's the assumption associated with regression models: that there's a linear dependence between them. But in the realm of time series analysis, we're going to be trying to predict $y$ with time, and obviously
there could be some constants $\beta_0$ and $\beta_1$, with the predictor now called $t$: $\beta_0 + \beta_1 t + \varepsilon$. The main difference, at least on the surface, between time series and regression is that time series only assumes that $y$ was observed over time. So time series will not assume that $y$ depends on $t$. It will assume that $y$ has been observed over time, and it only analyzes the differences in the $y$ values and how the $y$'s are changing over time, not with respect to time. All right, so let's formally state
a couple of these important things. Variables, for example $y$, do not depend on time; they are only observed over time. Let's do a brief descriptive example. Let's assume that $y$ represents stock prices for a company. So what does $y$ depend on? Well, $y$ depends on, for example, the GDP of the country in which the company resides. It could depend on uncertainty, it could depend on the price of bonds, it could depend on competition, it could depend on inflation, it could depend on interest
rates, it could depend on politics, it could depend on so many things, and so many of those things are hard to model over time, or maybe we couldn't get data that's uniformly distributed across all categories at all particular times $t$. So instead of measuring a bunch of different variables, it's usually easier to just look at one variable and keep track of how that variable is changing over time. So time series analysis does take a more simplistic approach to data collection, but it obviously is going to come with a
price, because there are so many variables that could be contributing to those residuals, and to changes in those residuals, that these predictions might not be super accurate. That's technically one of the downsides of time series. But don't worry, there are plenty of advanced time series methods that can be used to predict the value of $y$ at a given time $t$ within some margin of error. Now let's start building a bit of the theory that we're going to use to build our very first time series model. So the
first thing that we definitely need to understand is the residuals in the realm of time series. In time series, we usually do not call $\varepsilon$ a residual; we usually call it noise, because that residual or noise is commonly caused by uncertainty, emotions, volatility, and several other things. Obviously, every time we make assumptions, we're probably building a parametric model that's not really applicable to every single time series data set, but nonetheless it's a good starting point for the conversation. So we're going to be assuming a few things about
the noise random variables that are reminiscent of other things such as linear regression and ANOVA. The assumptions for our noise random variable are as follows. Let's assume that $\varepsilon_t$ is a sequence of random variables that satisfies the following. First, the expected value of our noise random variable is equal to zero, $E[\varepsilon_t] = 0$, and let's assume that this is true for all time $t$. Now obviously this is not necessarily the case for all time
series models, but we're going to assume this for now just to keep things basic. The second assumption: there exists a number, which I'm going to represent by $\sigma^2$, that's a non-negative real number (usually it's positive, but technically speaking it could be zero) such that the variance of this noise term is equal to $\sigma^2$, $\operatorname{Var}(\varepsilon_t) = \sigma^2$, and again this needs to be true for all time $t$. So it's very important to notice that the expected value of this random variable is zero for any
time, and the variance of this random variable is equal to $\sigma^2$, and this is independent of time as well: for any time $t$ it's equal to this constant, like five. Now before I list out the other two assumptions, I need to make a theoretical point about this particular expression. Note that the variance of a random variable can be represented as its second moment minus the square of its mean: $\operatorname{Var}(X) = E[X^2] - (E[X])^2$. This is one of those fundamental properties of
expected value and variance and their connection. So what does this mean? We don't know what the second moment of our noise term is, but we do know what the first moment, a.k.a. the expected value, of our noise term is: we are assuming that it's equal to zero. So notice that the variance is just precisely equal to the second moment. The variance of this random variable must be equal to its second moment, and we're assuming that the variance is equal to $\sigma^2$, so that
means, under these assumptions, $E[\varepsilon_t^2] = \sigma^2$ as well. So if we assume that the variance is constant and equal to $\sigma^2$, the second moment of that random variable is also equal to $\sigma^2$. That's actually pretty nice. The next property that I want to assume is that this random variable is a normal random variable for all time $t$, and therefore, by properties 1, 2, and 3, we can say that $\varepsilon_t$ is a normal random variable whose mean
is zero and whose variance is $\sigma^2$, $\varepsilon_t \sim N(0, \sigma^2)$, and that's again for all time $t$. The fourth assumption (take a guess what it's going to be) is that $\varepsilon_{t_1}$ and $\varepsilon_{t_2}$ are independent random variables for $t_1 \neq t_2$. On the calculation end, what does this mean in terms of the covariance between them? Let's go through that. The covariance of $\varepsilon_{t_1}$ and $\varepsilon_{t_2}$, by definition, is equal to the product moment, the expected
value of $\varepsilon_{t_1} \varepsilon_{t_2}$, minus the product of first moments, the expected value of $\varepsilon_{t_1}$ times the expected value of $\varepsilon_{t_2}$: $\operatorname{Cov}(\varepsilon_{t_1}, \varepsilon_{t_2}) = E[\varepsilon_{t_1}\varepsilon_{t_2}] - E[\varepsilon_{t_1}]\,E[\varepsilon_{t_2}]$. And what do we have? If they are independent, then the product moment is equal to the product of first moments, so the first term is just $E[\varepsilon_{t_1}]\,E[\varepsilon_{t_2}]$. We know each of those expected values is equal to 0, so this is going to be equal to $0 \cdot 0 - 0$,
which is also equal to zero. So the covariance between these is equal to zero. Moreover, the correlation between $\varepsilon_{t_1}$ and $\varepsilon_{t_2}$, which keep in mind is the covariance of $\varepsilon_{t_1}$ and $\varepsilon_{t_2}$ over the standard deviation of $\varepsilon_{t_1}$ times the standard deviation of $\varepsilon_{t_2}$, is equal to 0 over something nonzero, which is zero. So independence implies that the correlation is zero.
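
The noise assumptions just listed (mean zero, constant variance $\sigma^2$, normality, and independence across time) can be checked numerically on simulated draws. This is only a rough sketch using Python's standard library; the function names, the value `sigma=2.0`, the sample size, and the seed are illustrative assumptions, not anything from the lecture.

```python
import random

def gaussian_noise(n, sigma, seed=0):
    """Draw n independent N(0, sigma^2) values (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def sample_mean(x):
    return sum(x) / len(x)

def lag1_correlation(x):
    """Sample correlation between x[t] and x[t-1]."""
    m = sample_mean(x)
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den

eps = gaussian_noise(20_000, sigma=2.0)   # arbitrary illustrative values
mean_hat = sample_mean(eps)               # should be near 0
var_hat = sum((v - mean_hat) ** 2 for v in eps) / len(eps)  # near sigma^2 = 4
corr_hat = lag1_correlation(eps)          # should be near 0
print(mean_hat, var_hat, corr_hat)
```

With a large sample, the three printed numbers should sit close to 0, 4, and 0 respectively, which is what the assumptions predict.
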
But keep in mind, it does not go the other way around. So these are the assumptions that we're going to be making for these noise terms $\varepsilon$. The next topic that I want to introduce before we build our first time series model is what a moving average is. If we look at time series data, it's typically not very smooth, so I want to formally state that here: time series data are not usually smooth, and this is likely due to our $\varepsilon$ terms, because they are
random variables, which cause things to randomly go up and down. One thing that helps smooth out this rocky process is moving averages. Moving averages help to smooth out the data, which also helps us figure out any underlying trends. So if we look at two points at a time, for example, we can consider those two points, then the next two points, and so on, and we can look at the average of each pair of points. For example, we can look at that
average here, that average here, and so on. So what I can do is define this average as the moving average from time 1 to time 2, this point as the moving average from time 2 to time 3, this as the moving average from time 3 to time 4, this as the moving average from 4 to 5, and this from 5 to 6. So notice that I'm only looking at the intervals 1 to 2, 2 to 3, 3 to 4, 4 to
5, and 5 to that last point in order to get these points 1, 2, 3, 4, and 5. So each point is associated with its own interval. If I look at the time series through only those moving average points, notice that I get a relatively smoother graph compared to the original, much rockier-looking time series data. So I can very easily define the moving average approximation to my time series at time
$t$ by only looking at one point in the past, and hence two points. It's going to be $y_t$, the point at time $t$, plus the point before it, $y_{t-1}$, divided by two: $\mathrm{MA}_1(t) = (y_t + y_{t-1})/2$, and keep in mind this is for $t = 2, 3, 4, 5$, all the way up to $n$. You might notice that this is similar to a derivative, I guess you could say, so it's reminiscent of backward differences. Since
we're always marching forward in time, using backward differences is typically the way to go. And we can always increase the order by looking at more points in the past. For example, if I want to look at two points in the past, which means I'm going to be using three points, I'm going to be looking at $t$, $t-1$, and also $t-2$, and then dividing by three: $\mathrm{MA}_2(t) = (y_t + y_{t-1} + y_{t-2})/3$, and this would be for $t = 3, 4, 5$, up to $n$.
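
The two-point and three-point averages described here can be sketched in a few lines of Python. The function name `trailing_average`, its `points` parameter, and the sample data are made up for illustration; the only thing taken from the lecture is the idea of averaging each value with the ones just before it.

```python
def trailing_average(y, points):
    """Average each value with the (points - 1) values before it.

    Only positions with enough history get a smoothed value, so the
    result is shorter than y by points - 1 entries.
    """
    if points < 1 or points > len(y):
        raise ValueError("points must be between 1 and len(y)")
    return [
        sum(y[i - points + 1 : i + 1]) / points
        for i in range(points - 1, len(y))
    ]

y = [3.0, 5.0, 4.0, 8.0, 6.0, 9.0]    # made-up observations y_1 .. y_6
two_point = trailing_average(y, 2)    # (y_t + y_{t-1}) / 2 for t = 2..6
three_point = trailing_average(y, 3)  # (y_t + y_{t-1} + y_{t-2}) / 3 for t = 3..6
print(two_point)
print(three_point)
```

Note that looking one point into the past uses two points, and looking two points into the past uses three, which is why the divisor is always one more than the number of past points.
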
So this is looking at the points at $t$, $t-1$, and $t-2$, and I can increase this to an arbitrary order. For example, I can look at order $q$: $\mathrm{MA}_q(t)$ is going to be equal to $y_t + y_{t-1} + y_{t-2}$ and so on, all the way down to $y_{t-q}$ (note that order 2 ends at $t-2$, so order $q$ ends at $t-q$), and since that's $q+1$ points, it's divided by $q+1$. So we have the expression $\mathrm{MA}_q(t) = (y_t + y_{t-1} + \cdots + y_{t-q})/(q+1)$, and this is going
to be for $t = q+1, q+2$, all the way up to $n$ (keep in mind $\mathrm{MA}_2$ starts at $t = 3$ for its domain of definition, so the start is the order plus one), where $n$ is the number of points that we have observed in the past. This is what we refer to as a $q$th-order moving average approximation to $y_t$, where these values have already been observed. All right, so now that we know moving averages and a couple of
the definitions of what a moving average is and what it aims to do, and we have already made a few assumptions about the noise term $\varepsilon$, let's build our first time series model. The first model that we want to build for time series is known as the moving average process, and it's built on these two fundamental questions. First, should each of the $y_k$ be weighted equally, like we did with the moving average? The answer to this is probably no. And second, we typically don't know $y_k$ in the future, so
how can we use those formulas in the first place? And if we don't know $y_k$, could we build $y_k$ from our noise terms $\varepsilon_k$, which we've already made assumptions on? The answer is obviously yes. So the first-order moving average process, which I'm going to abbreviate by MA(1) for moving average order one, is going to be of the following form. We're going to assume that $y_t$ can be represented as $y_t = a + b\,\varepsilon_t + c\,\varepsilon_{t-1}$. So
obviously, we're looking at the noise term at time $t$, which is the time we're concerned with, and we're also looking one step in the past, and it's possible that this has some coefficient, let's call it $c$, in front. We would like this model to be true for all time $t$, so we can use the same numbers $a$, $b$, and $c$ for every time $t$ that we want to predict in the future. So let's
see if we can build a couple of things here. We already know that $\varepsilon_t$ is a random variable and $\varepsilon_{t-1}$ is a random variable, so $y_t$ is going to be a random variable as well. So it has a distribution, it has an expected value, it has a variance. Let's go ahead and find the expected value. The expected value of $y_t$ is going to be the expected value of $a$, plus $b$ times the expected value of
$\varepsilon_t$, plus $c$ times the expected value of $\varepsilon_{t-1}$. We already know by assumption that the expected values of $\varepsilon_t$ and $\varepsilon_{t-1}$ are both equal to zero, so we're just left with the expected value of $a$, which is equal to $a$ by properties of expected values: $E[y_t] = a$. That's actually pretty nice. For notation, since $a$ is just a random letter, I'm actually going to be denoting this letter
by $\mu$, so $\mu$ is the new name for that constant in front of our MA(1) model. Now let's take a moment and think about what this $b\,\varepsilon_t$ term is and what this $c\,\varepsilon_{t-1}$ term is. The $b\,\varepsilon_t$ term corresponds to the effect of noise at time $t$ on $y_t$, and similarly $c\,\varepsilon_{t-1}$ corresponds to the effect of noise at time $t-1$ on $y_t$. But notice that for
$b\,\varepsilon_t$, it's the effect of noise at time $t$ on $y_t$; it happens at the exact same time, so this should be a direct effect. So what we're going to do is let $b$ be equal to one under this assumption. This allows us to rewrite our model in a more meaningful way. And this number $c$, now that $a$ and $b$ have been removed or redefined, we're going to rename $\theta_1$, corresponding to the parameter associated
with one time step in the past, which some people also refer to as one lag. Therefore our MA(1) model is of the form $y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1}$. This is our moving average order one process. $\varepsilon_t$ and $\varepsilon_{t-1}$ are random variables, and $\mu$ and $\theta_1$ are numbers that we need to find. Now can I go to a higher order than order one? Obviously yes, so
we can generalize this to what we call a $q$th-order moving average process. Take a guess what form this will take: $y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$. This is the $q$th-order moving average process, which we will denote by MA($q$). So now we have the structure
that we want, and now we're going to make our final and most important assumption for MA($q$). We've already made this assumption, but I want to re-emphasize it: $y_t$ is a stationary process. What do you mean by stationary? There is a formal definition of stationarity, and we'll get to it a little later, but pretty much it has the same fixed overall pattern; it behaves the same way for all time $t$.
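
To make the MA($q$) form above concrete, here is a small simulation sketch in plain Python. It assumes Gaussian noise as in the earlier assumptions; the function name `simulate_ma`, its arguments, and the parameter values are hypothetical choices for illustration only.

```python
import random

def simulate_ma(mu, thetas, sigma, n, seed=0):
    """Simulate y_t = mu + eps_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}.

    thetas is the list [theta_1, ..., theta_q]; the noise terms are
    independent N(0, sigma^2) draws.
    """
    rng = random.Random(seed)
    q = len(thetas)
    # q extra draws supply the lagged noise needed by the first observations.
    eps = [rng.gauss(0.0, sigma) for _ in range(n + q)]
    ys = []
    for t in range(q, n + q):
        lagged = sum(theta * eps[t - j] for j, theta in enumerate(thetas, start=1))
        ys.append(mu + eps[t] + lagged)
    return ys

# An MA(1) example with arbitrary values mu = 10 and theta_1 = 0.6.
y = simulate_ma(mu=10.0, thetas=[0.6], sigma=1.0, n=5000)
sample_mean = sum(y) / len(y)
print(sample_mean)  # should land near mu, since E[y_t] = mu
```

Because every observation is built from the same constants and identically distributed noise, the simulated series has the same mean level throughout, which is the informal sense of "stationary" used here.
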
So two analytical things follow immediately from this: $\sigma_t^2 = \sigma^2$ and $\mu_t = \mu$, and these both have to be true for all time $t$. So the variance and the mean are each equal to a constant regardless of which time you're dealing with. So now we have our model. We're going to be focusing on the MA(1) calculation today, which again is just
$y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1}$. We have these two numbers, $\mu$ and $\theta_1$, that we need to approximate, and then we'll have our model. Let's target the first and easiest one: how to approximate $\mu$. We're going to be approximating it with something called $\hat\mu$ (mu hat). We're not going to be using $\bar{x}$ notation, although $\bar{x}$ is commonly used to denote the approximation for $\mu$, because we're not going to be talking about unbiasedness here, and $\mu$ technically is a mean of some other things, and I don't want those to be conflated.
So I'm just going to be using hat notation for approximations, because there are actually several different types of approximations that I want to mention in the near future. So we're going to be approximating $\mu$ with $\hat\mu$, and the question is how. Obviously there are several different ways; maximum likelihood estimation is one, but today we're going to be using the method of moments. It's the easiest one to get your head around if you're new to time series. So what are we going to
do? Keep in mind the expected value of $y_t$ is equal to $\mu$, so technically speaking $y_t$, the random variable, is an unbiased estimator for $\mu$. That means we can estimate $\mu$ with the $y_k$ values for all $k$ less than or equal to $t$, that is, the observed $y_k$ values. And we can use all of them since we're assuming that this model is stationary, so we don't have to worry about going too far into the past, because even
if it was 50 million years ago, since it's stationary, the model of $y_t$ still behaves the same. So the more data the better for stationary processes. Therefore, if we can use all of the $y_k$ and we want an unbiased estimator for $\mu$, one can show that a good estimator via the method of moments is $\hat\mu = \frac{1}{m}\sum_{k=1}^{m} y_k$, which is pretty much an arithmetic mean, where the $k$ are less than or equal to $t$ and $m$
equals the number of observed points, and obviously here we're assuming that we have a finite number of points. So here we have an approximation for $\mu$. Okay, so that's the first thing. Now keep in mind, since we know the expected value of $y_t$, you should also be able to find the variance of $y_t$; we're going to need this in just a moment. So what is that going to be? That's the variance of $\mu + \varepsilon_t + \theta_1 \varepsilon_{t-1}$: a constant, a variable, and a variable,
and the variables are independent, we're assuming, so we can distribute the variance over that addition without having to deal with covariance terms. So we have $0 + \sigma^2 + \theta_1^2 \sigma^2$ (keep in mind $\varepsilon_t$ and $\varepsilon_{t-1}$ have the exact same variance), so this is equal to $\operatorname{Var}(y_t) = (1 + \theta_1^2)\sigma^2$. This is the variance of $y_t$, and keep in mind this is true for all $t$. I've seen it so many times: please
do not make the mistake of saying that this is equal to $\sigma^2$. $\sigma^2$ is for the noise term, not the $y$ term. It's only equal to $\sigma^2$ if $\theta_1$ is equal to zero, which is rarely the case. So we have the variance of $y_t$. Now the question is how to approximate $\theta_1$. Let's go through some fancy math. Let's look at the expected value of $y_t\, y_{t-1}$, so it's the product moment of
the thing we're trying to predict today and the thing we observed yesterday. So what is this? This is equal to the expected value of $(\mu + \varepsilon_t + \theta_1 \varepsilon_{t-1})(\mu + \varepsilon_{t-1} + \theta_1 \varepsilon_{t-2})$, and now we need to triple FOIL this and see what comes out. That's the expected value of some fun stuff. The first phase, $\mu$ times each term, gives $\mu^2 + \mu\,\varepsilon_{t-1} + \mu\theta_1 \varepsilon_{t-2}$.
The next phase, $\varepsilon_t$ times each term, gives $\mu\,\varepsilon_t + \varepsilon_t \varepsilon_{t-1} + \theta_1 \varepsilon_t \varepsilon_{t-2}$. And the last phase, $\theta_1 \varepsilon_{t-1}$ times each term, gives $\mu\theta_1 \varepsilon_{t-1} + \theta_1 \varepsilon_{t-1}^2 + \theta_1^2 \varepsilon_{t-1} \varepsilon_{t-2}$.
Close brackets, and begin the fun. So what do we have here? The expected value of each individual noise term is equal to zero, because that's the assumption for our noise terms. So $\mu\,\varepsilon_{t-1}$ goes to zero, $\mu\theta_1 \varepsilon_{t-2}$ goes to zero for the exact same reason, and $\mu\,\varepsilon_t$ is another lonely little noise term that goes to zero as well. Do we have any other lonely little noise
terms here? Oh, there's $\mu\theta_1 \varepsilon_{t-1}$, so that goes to zero as well, because keep in mind $\mu$ and $\theta_1$ are just constants, so we don't really care about them. What else do we have? Let's look at the products of two noise terms. For $\varepsilon_t \varepsilon_{t-1}$, the times $t$ and $t-1$ are different, so the terms are independent and the product moment is equal to zero. For $\varepsilon_t \varepsilon_{t-2}$, they're different, so they're not correlated, and that's equal to zero. The last term
has $t-1$ and $t-2$; they're different, so they're not correlated, and that's equal to zero. Then we're just left with $\mu^2$ and the $\theta_1 \varepsilon_{t-1}^2$ term. So that's equal to the expected value of $\mu^2 + \theta_1 \varepsilon_{t-1}^2$. Now keep in mind the variance of a random variable $X$ is equal to the second moment minus the mean squared, so we can solve that equation for the second moment of the
random variable by adding the mean squared to both sides. The expected value of $\mu^2$ is just $\mu^2$, and then we have $\theta_1$ times the expected value of $\varepsilon_{t-1}^2$, which is just the second moment of $\varepsilon_{t-1}$. The second moment of $\varepsilon_{t-1}$ is equal to the variance of $\varepsilon_{t-1}$ plus the square of the expected value of $\varepsilon_{t-1}$. We already know the expected value is
equal to zero, and the variance of $\varepsilon_{t-1}$ is just $\sigma^2$. So as a beautiful and simplified result, the product moment of $y_t$ and $y_{t-1}$ is $E[y_t\, y_{t-1}] = \mu^2 + \theta_1 \sigma^2$. This is the first important theoretical result, and keep in mind this is true for all time $t$. So that's actually very beautiful. What are we going to do with this identity? We already know the product moment of $y_t$
and $y_{t-1}$, and we also know the expected values of $y_t$ and $y_{t-1}$; both of those are equal to $\mu$. So we should be able to find the covariance between $y_t$ and $y_{t-1}$, which some people refer to as the autocovariance of the process. We'll talk about autocovariance, and also autocorrelation, in more detail later. But technically speaking, the covariance between $y_t$ and $y_{t-1}$, from the formulas that we've already established in statistics, is going to
be equal to the product moment of $y_t$ and $y_{t-1}$ minus the expected value of $y_t$ times the expected value of $y_{t-1}$: $\operatorname{Cov}(y_t, y_{t-1}) = E[y_t\, y_{t-1}] - E[y_t]\,E[y_{t-1}]$. We've already shown that the product moment is $\mu^2 + \theta_1 \sigma^2$. And what are the expected values of $y_t$ and $y_{t-1}$? Both are equal to $\mu$, since it's a stationary process, so their product is $\mu^2$. Notice that the
$\mu^2$ terms cancel in this expression, and we're just left with $\theta_1 \sigma^2$ for the covariance between them. Keep in mind covariance is not bounded above or below; it can range from minus infinity to infinity, and it has very strange units. For example, if $y_t$ is measured in dollars, then covariance is measured in dollars squared, square dollars, whatever that means. So the correlation between them is commonly used as an alternative, because it has no units and is bounded between minus
one and one. We're going to denote this particular expression by $\rho_1$, sometimes called the first autocorrelation of this process, and it's defined as the covariance of $y_t$ and $y_{t-1}$ over the standard deviation of $y_t$ times the standard deviation of $y_{t-1}$. We already have an expression for the covariance: the covariance of $y_t$ and $y_{t-1}$ is just $\theta_1 \sigma^2$; we've already proven that.
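
The two MA(1) results derived so far, $\operatorname{Var}(y_t) = (1 + \theta_1^2)\sigma^2$ and $\operatorname{Cov}(y_t, y_{t-1}) = \theta_1 \sigma^2$, can be sanity-checked on a simulated series. The sketch below is just one way to do that in plain Python; the helper names, the parameter values, and the choice to divide by the series length are all assumptions made for illustration.

```python
import random

def simulate_ma1(mu, theta1, sigma, n, seed=0):
    """Simulate y_t = mu + eps_t + theta1 * eps_{t-1} with N(0, sigma^2) noise."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, sigma) for _ in range(n + 1)]
    return [mu + eps[t] + theta1 * eps[t - 1] for t in range(1, n + 1)]

def sample_moments(y):
    """Return (mean, variance, lag-1 autocovariance), each dividing by len(y)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    cov1 = sum((y[t] - mean) * (y[t - 1] - mean) for t in range(1, n)) / n
    return mean, var, cov1

mu, theta1, sigma = 5.0, 0.5, 2.0          # arbitrary illustrative values
series = simulate_ma1(mu, theta1, sigma, 50_000)
mean_hat, var_hat, cov1_hat = sample_moments(series)

var_theory = (1 + theta1 ** 2) * sigma ** 2   # (1 + 0.25) * 4 = 5.0
cov1_theory = theta1 * sigma ** 2             # 0.5 * 4 = 2.0
print(var_hat, var_theory)
print(cov1_hat, cov1_theory)
```

The sample variance should come out near 5 rather than near $\sigma^2 = 4$, which is exactly the mistake the lecture warns about: $\sigma^2$ belongs to the noise, not to $y_t$.
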
And the standard deviation of $y_t$ is the square root of the variance of $y_t$ (keep in mind $y_t$, not $\varepsilon$), so that's $\sqrt{(1 + \theta_1^2)\sigma^2}$, times the standard deviation of the other one, which is the same, $\sqrt{(1 + \theta_1^2)\sigma^2}$ (that's $\theta$, not $\sigma$), and that gives us the correlation. Notice that we have a $\sigma^2$ on top and $\sigma$ times $\sigma$ on the bottom, so they cancel, and the two square roots multiply out, leaving us with $1 + \theta_1^2$
on the bottom. Therefore the first autocorrelation for this moving average process of order 1 is $\rho_1 = \theta_1 / (1 + \theta_1^2)$, another very important result, and keep in mind this is for moving average order one. Keep in mind this is a correlation, and sometimes it's easy to approximate the correlation of data points in the past. So again, what are we aiming for? We're aiming for $\theta_1$. If we can approximate $\rho_1$, we have an approximation for $\theta_1$, so that's our goal. So what we're
going to do is approximate $\rho_1$ with $\hat\rho_1$. This is similar to $r$, you know, the $r$ and $r^2$ values from regression. Keep in mind this is for MA(1); there are equivalent ways for MA(2), MA(3), and MA($q$). The numerator is the sum from $k = 2$ to $m$, over previously observed values, of $(y_k - \bar{y})(y_{k-1} - \bar{y})$, so it's looking at the correlation
between lags, I guess you could say, and that's divided by the sum from $k = 1$ to $m$ of $(y_k - \bar{y})^2$: $\hat\rho_1 = \sum_{k=2}^{m} (y_k - \bar{y})(y_{k-1} - \bar{y}) \big/ \sum_{k=1}^{m} (y_k - \bar{y})^2$. So this is the approximation to $\rho_1$. Now we have that $\hat\rho_1$ is approximately equal to $\theta_1 / (1 + \theta_1^2)$, and we can solve this equation for $\theta_1$. Cross-multiplying by the denominator will give us $\hat\rho_1 + \hat\rho_1 \theta_1^2$
is equal to $\theta_1$, and then we can form a quadratic equation from this by moving everything to one side and setting it equal to zero: $\hat\rho_1 \theta_1^2 - \theta_1 + \hat\rho_1 = 0$. That's quadratic in $\theta_1$, so we can solve it using the quadratic formula, and we're going to have two values to look at. Keep in mind we're looking at approximations for $\theta_1$; this is not the exact value, because we're approximating $\rho_1$. So therefore we're going
to use $\hat\theta_1$ as the approximation for $\theta_1$, and this is equal to negative $b$ plus or minus the square root of $b^2 - 4ac$, all over $2a$, with $a = \hat\rho_1$, $b = -1$, and $c = \hat\rho_1$: $\hat\theta_1 = \big(1 \pm \sqrt{1 - 4\hat\rho_1^2}\big) / (2\hat\rho_1)$. So we have two potential values for the approximation of $\theta_1$. Keep in mind this $\hat\theta_1$ exists if and only if $-1/2 \le \hat\rho_1 \le 1/2$, and this should actually cause you some theoretical concern, because can't a correlation be anywhere between $-1$ and $1$?
Definitely. So if you get a sample correlation coefficient of your lags of, like, 7/8, which is bigger than a half, then this $\hat\theta_1$ is not going to have an estimate. So this bound on the sample estimate for $\rho_1$ is a requirement for MA(1) if you use estimates from moments, because technically speaking that's the only thing we have used up to this point: the method of moments for $\mu$ and $\theta_1$. So it does have its limitations, but as long
as your approximation is between $-1/2$ and $1/2$, then we have our MA(1) model, because once you have your $\hat\mu$ and your $\hat\theta_1$, you have your model. So those are some of the basics, on the theory side and also the applied side, of time series analysis, and now you have your first method for time series predictions. We'll get into simulations and more advanced methods another time, but I hope you enjoyed this introduction to time series, and I hope to see you in the next video. Take
care. [Music]
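
As a recap, the method-of-moments recipe for MA(1) described in this lecture (sample mean for $\mu$, lag-1 sample autocorrelation for $\rho_1$, then the quadratic $\hat\rho_1 \theta_1^2 - \theta_1 + \hat\rho_1 = 0$ for $\theta_1$) might be sketched as follows. The function names are hypothetical. Ordering the two roots by absolute value is an added convenience that the lecture does not discuss; it relies on the fact that the two roots multiply to 1, so at most one of them lies strictly inside $(-1, 1)$.

```python
import math

def estimate_mu(y):
    """Method-of-moments estimate of mu: the arithmetic mean."""
    return sum(y) / len(y)

def estimate_rho1(y):
    """Lag-1 sample autocorrelation, as in the rho-hat formula."""
    y_bar = estimate_mu(y)
    num = sum((y[k] - y_bar) * (y[k - 1] - y_bar) for k in range(1, len(y)))
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

def theta1_candidates(rho1_hat):
    """Real roots of rho*theta^2 - theta + rho = 0, smallest magnitude first.

    Returns [] when |rho1_hat| > 1/2, where no real solution exists.
    """
    if rho1_hat == 0.0:
        return [0.0]                      # the equation reduces to -theta = 0
    disc = 1.0 - 4.0 * rho1_hat ** 2
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    roots = [(1.0 - root) / (2.0 * rho1_hat), (1.0 + root) / (2.0 * rho1_hat)]
    return sorted(roots, key=abs)

print(theta1_candidates(0.4))     # two candidates, roughly 0.5 and 2.0
print(theta1_candidates(7 / 8))   # empty list: 7/8 is outside [-1/2, 1/2]
```

Given observed data `y`, one would call `estimate_mu(y)` and `theta1_candidates(estimate_rho1(y))`; which of the two candidates to keep is a modeling decision left open here.
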