Hi everyone, welcome to our full and free tutorial about statistics! Before we dive in, we want to say a huge thank you. Last year we launched our first-ever full course on statistics, and the response was absolutely amazing: we are on track to reach 1 million views within just one year. Wow! This inspired us to create a revised version for 2025. It was a big project, but there it is. Thank you for being such an amazing community. So, let's start. This video is designed to guide you through the fundamental concepts and most powerful statistical tests
used in research today. From the basics of descriptive statistics to the complexities of regression and beyond, we'll explore how each method fits into the bigger picture of data analysis. And don't worry if you have no clue about statistics: we'll go through everything step by step. If you like, you can find all topics in our book as well; the link is in the video description. So what is the outline of the video? Our video has three major parts. In the first part, we discuss what statistics is and what the differences between descriptive and inferential statistics
are. In the second part, we go through the most common hypothesis tests, like the t-test and ANOVA, and discuss the difference between parametric and non-parametric tests. In the third part, we take a look at correlation analysis and regression analysis, and finally we talk about cluster analysis. We have prepared detailed videos for each section, so let's start with the video that explains what statistics is. After this video you will know what statistics is, what descriptive statistics is, and what inferential statistics is. So let's start with the first question: what is statistics? Statistics deals with the
collection, analysis, and presentation of data. An example: we would like to investigate whether gender has an influence on the preferred newspaper. Then gender and newspaper are our so-called variables that we want to analyze. In order to analyze whether gender has an influence on the preferred newspaper, we first need to collect data. To do this, we create a questionnaire that asks about gender and preferred newspaper. We will then send out the survey and wait 2 weeks. Afterwards, we can display the received answers in a table. In this table, we have one column for each variable, one
for gender and one for newspaper. Each row, on the other hand, is the response of one surveyed person. The first respondent is male and stated New York Post, the second is female and stated USA Today, and so on and so forth. Of course, the data does not have to be from a survey; the data can also come from an experiment in which you, for example, want to study the effect of two drugs on blood pressure. Now the first step is done: we have collected data and we can start analyzing the data. But what do we
actually want to analyze? We did not survey the entire population; we took a sample. Now the big question is: do we just want to describe the sample data, or do we want to make a statement about the whole population? If our aim is limited to the sample itself, i.e. we only want to describe the collected data, we will use descriptive statistics. Descriptive statistics will provide a detailed summary of the sample. However, if we want to draw conclusions about the population as a whole, inferential statistics are used. This approach allows us to make educated guesses
about the population based on the sample data. Let us take a closer look at both methods, starting with descriptive statistics. Why is descriptive statistics so important? Let's say a company wants to know how its employees travel to work, so the company creates a survey to answer this question. Once enough data has been collected, this data can be analyzed using descriptive statistics. But what is descriptive statistics? Descriptive statistics aims to describe and summarize a data set in a meaningful way. But it is important to note that descriptive statistics only describe the collected data, without drawing conclusions
about a larger population. Put simply: just because we know how some people from one company get to work, we cannot say how all working people get to work. This is the task of inferential statistics, which we will discuss later. To describe data descriptively, we now look at the four key components: measures of central tendency, measures of dispersion, frequency tables, and charts. Let's start with the first one, measures of central tendency. Measures of central tendency are, for example, the mean, the median, and the mode. Let's first have a look at the mean. The
arithmetic mean is the sum of all observations divided by the number of observations. An example: imagine we have the test scores of five students. To find the mean score, we sum up all the scores and divide by the number of scores. The mean test score of these five students is therefore 86.6. What about the median? When the values in a data set are arranged in ascending order, the median is the middle value. If there is an odd number of data points, the median is simply the middle value; if there is an even number of
data points, the median is the average of the two middle values. It is important to note that the median is resistant to extreme values, or outliers. Let's look at this example: no matter how tall the last person is, the person in the middle remains the person in the middle, so the median does not change. But if we look at the mean, it does matter how tall the last person is; the mean is therefore not robust to outliers. Let's continue with the mode. The mode refers to the value or values that appear most
frequently in a set of data. For example, if 14 people travel to work by car, six by bike, five walk, and five take public transport, then car occurs most often and is therefore the mode. Great, let's continue with the measures of dispersion. Measures of dispersion describe how spread out the values in a data set are. Measures of dispersion are, for example, the variance and standard deviation, the range, and the interquartile range. Let's start with the standard deviation. The standard deviation indicates the average distance between each data point and the mean. But what does that mean?
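Before unpacking that question, a quick aside: the three measures of central tendency covered above can be checked in a few lines of Python using only the standard library. The numbers below are made up for illustration; they are not the video's data.

```python
import statistics

# Illustrative numbers, not the video's data
scores = [80, 85, 88, 90, 95]
print(statistics.mean(scores))    # arithmetic mean: 87.6
print(statistics.median(scores))  # middle value of the sorted list: 88

# The mode is simply the most frequent value
print(statistics.mode([1, 2, 2, 3, 2]))  # 2

# The median resists outliers, the mean does not
heights = [160, 165, 170, 175, 250]  # cm; one extreme value
print(statistics.median(heights))    # 170, unchanged by the outlier
print(statistics.mean(heights))      # 184.0, pulled up by the outlier
```

With that aside done, back to the standard deviation.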
Each person has some deviation from the mean. Now we want to know how much the persons deviate from the mean value on average. In this example, the average deviation from the mean value is 11.5 cm. To calculate the standard deviation, we can use the equation σ = √( Σ(xᵢ − x̄)² / n ), where σ is the standard deviation, n is the number of persons, xᵢ is the height of each person, and x̄ is the mean value of all persons. But attention: there are two slightly different equations for the standard deviation. The difference is that we have once 1/n and once 1/(n−1). To keep it simple: if our survey doesn't cover the whole population, we always use the equation with 1/(n−1) to estimate the standard deviation. Likewise, if we have conducted a clinical study, we also use this equation to estimate the standard deviation. But what is the difference between the standard deviation and the variance? As we now know, the standard deviation is the quadratic mean of the distances from the mean; the variance is simply the squared standard deviation. If you want to know more details about the standard deviation and the variance, please watch our video. Let's
move on to the range and interquartile range. These are easy to understand: the range is simply the difference between the maximum and minimum value. The interquartile range represents the middle 50% of the data; it is the difference between the first quartile Q1 and the third quartile Q3. Therefore, 25% of the values are smaller than Q1 and 25% of the values are larger than Q3; the interquartile range contains exactly the middle 50% of the values. Before we get to the last two points, let's briefly compare measures of central tendency and measures of dispersion. Let's say
we measured the blood pressure of patients. Measures of central tendency provide a single value that represents the entire data set, helping to identify a central value around which data points tend to cluster. Measures of dispersion, like the standard deviation, the range, and the interquartile range, indicate how spread out the data points are: whether they are closely packed around the center or spread far from it. In summary, while measures of central tendency provide a central point of the data set, measures of dispersion describe how the data is spread around the center. Let's move on to
tables. Here we will have a look at the most important ones: frequency tables and contingency tables. A frequency table displays how often each distinct value appears in a data set. Let's have a closer look at the example from the beginning: a company surveyed its employees to find out how they get to work. The options given were car, bicycle, walk, and public transport. Here are the results from 30 employees: the first answer car, the next walk, and so on and so forth. Now we can create a frequency table to summarize this data. To do this, we
simply enter the four possible options (car, bicycle, walk, and public transport) in the first column and then count how often they occurred. From the table it is evident that the most common mode of transport among the employees is by car, with 14 employees preferring it. The frequency table thus provides a clear and concise summary of the data. But what if we have not only one but two categorical variables? This is where the contingency table, also called a cross tab, comes in. Imagine the company doesn't have one factory but two: one in Detroit and one in Cleveland.
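Before adding that second variable, note that a frequency table like the one above takes only a few lines of plain Python. The raw answer list below is reconstructed from the counts in the example, purely for illustration:

```python
from collections import Counter

# The 30 commute answers, reconstructed from the counts in the example
answers = ["car"] * 14 + ["bicycle"] * 6 + ["walk"] * 5 + ["public transport"] * 5

frequency_table = Counter(answers)
for option, count in frequency_table.most_common():
    print(f"{option:>16}  {count:2d}  {count / len(answers):6.1%}")
# car tops the table with 14 of 30 answers (46.7%)
```

`Counter.most_common()` sorts the options by frequency, which is exactly the ordering you want in a frequency table.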
So we also ask the employees at which location they work. If we want to display both variables, we can use a contingency table. A contingency table provides a way to analyze and compare the relationship between two categorical variables. The rows of a contingency table represent the categories of one variable, while the columns represent the categories of the other variable. Each cell in the table shows the number of observations that fall into the corresponding category combination. For example, the first cell shows that car and Detroit were answered six times. And what about the charts? Let's take a
look at the most important ones. To do this, let's simply use DATAtab. If you like, you can load this sample data set with the link in the video description, or you can just copy your own data into this table. Here below you can see the variables distance to work, mode of transport, and site. DATAtab gives you a hint about the level of measurement, but you can also change it here. Now, if we only click on mode of transport, we get a frequency table, and we can also display the percentage values. If we scroll down, we
get a bar chart and a pie chart. Here on the left we can adjust further settings: for example, we can specify whether we want to display the frequencies or the percentage values, or whether the bars should be vertical or horizontal. If you also select site, we get a cross table here and a grouped bar chart. For the diagrams, we can specify whether we want the chart to be grouped or stacked. If we click on distance to work and mode of transport, we get a bar chart where the height of the bar shows the
mean value of the individual groups. Here we can also display the dispersion. We also get a histogram, a box plot, a violin plot, and a raincloud plot. If you would like to know more about what a box plot, a violin plot, and a raincloud plot are, take a look at my videos. Let's continue with inferential statistics. At the beginning we briefly go through what inferential statistics is, and then I'll explain the six key components to you. So what is inferential statistics? Inferential statistics allows us to make a conclusion, or inference, about a population based
on data from a sample. What is the population and what is the sample? The population is the whole group we're interested in. If you want to study the average height of all adults in the United States, then the population would be all adults in the United States. The sample is a smaller group we actually study, chosen from the population; for example, 150 adults selected from the United States. Now we want to use the sample to make a statement about the population, and here are the six steps for how to do that. Number one: the hypothesis.
First, we need a statement, a hypothesis, that we want to test. For example, you want to know whether a drug will have a positive effect on blood pressure in people with high blood pressure. But what's next? In our hypothesis we stated that we would like to study people with high blood pressure, so our population is all people with high blood pressure in, for example, the US. Obviously we cannot collect data from the whole population, so we take a sample from the population. Now we use this sample to make a statement about the population. But
how do we do that? For this we need a hypothesis test. Hypothesis testing is a method for testing a claim about a parameter in a population using data measured in a sample. Great, that's exactly what we need. There are many different hypothesis tests, and at the end of this video I will give you a guide on how to find the right test; and of course you can find videos about many more hypothesis tests on our channel. But how does a hypothesis test work? When we conduct a hypothesis test, we start with a research hypothesis, also
called the alternative hypothesis. This is the hypothesis we are trying to find evidence for. In our case, the research hypothesis is: the drug has an effect on blood pressure. But we cannot test this hypothesis directly with a classical hypothesis test, so we test the opposite hypothesis: that the drug has no effect on blood pressure. But what does that mean? First, we assume that the drug has no effect in the population. We therefore assume that, in general, people who take the drug and people who don't take the drug have the same blood pressure on average. If we
now take a random sample and it turns out that the drug has a large effect in the sample, then we can ask how likely it is to draw such a sample, or one that deviates even more, if the drug actually has no effect, so if in reality there is, on average, no difference in the population. If this probability is very low, we can ask ourselves: maybe the drug does have an effect in the population, and we may have enough evidence to reject the null hypothesis that the drug has no effect. And it is this probability that is
called the p-value. Let's summarize this in three simple steps. Number one: the null hypothesis states that there is no difference in the population. Number two: the hypothesis test calculates how much the sample deviates from the null hypothesis. Number three: the p-value indicates the probability of getting a sample that deviates as much as our sample, or one that deviates even more, assuming the null hypothesis is true. But at what point is the p-value small enough for us to reject the null hypothesis? This brings us to the next point: statistical significance.
If the p-value is less than a predetermined threshold, the result is considered statistically significant. This means that the result is unlikely to have occurred by chance alone and that we have enough evidence to reject the null hypothesis. This threshold is often 0.05. Therefore, a small p-value suggests that the observed data, or sample, is inconsistent with the null hypothesis; this leads us to reject the null hypothesis in favor of the alternative hypothesis. A large p-value suggests that the observed data is consistent with the null hypothesis, and we will not reject it. But note:
there is always a risk of making an error. A small p-value does not prove that the alternative hypothesis is true; it only says that it is unlikely to get such a result, or a more extreme one, when the null hypothesis is true. And again: if the null hypothesis is true, there is no difference in the population. The other way around, a large p-value does not prove that the null hypothesis is true; it only says that it is likely to get such a result, or a more extreme one, when the null hypothesis is true.
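To make the p-value less abstract, here is a minimal sketch of one way a p-value can be computed: an exact permutation test for the difference between two group means. The numbers are toy data, not the blood-pressure study, and in practice you would use a ready-made test (such as a t-test from a statistics package) rather than this hand-rolled version.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Under the null hypothesis the group labels are arbitrary, so we
    enumerate every way of splitting the pooled values into two groups
    of the original sizes and ask: how often is the mean difference at
    least as extreme as the one we observed? That fraction is the p-value.
    """
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        total += 1
        if diff >= observed:
            extreme += 1
    return extreme / total

# Clearly separated groups -> tiny p-value, reject the null at 0.05
print(permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105]))
# Heavily overlapping groups -> large p-value, do not reject the null
print(permutation_p_value([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

The first call returns about 0.008, because only 2 of the 252 possible regroupings separate the values as extremely as the observed split; the second returns a large p-value, since overlapping groups produce such a difference all the time.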
So there are two types of errors, which are called type I and type II errors. Let's start with the type I error. In hypothesis testing, a type I error occurs when a true null hypothesis is rejected. So in reality the null hypothesis is true, but we make the decision to reject the null hypothesis. In our example, it means that the drug actually had no effect, so in reality there is no difference in blood pressure whether the drug is taken or not: the blood pressure remains the same in both cases. But our sample happened to
be so far off the true value that we mistakenly thought the drug was working. A type II error occurs when a false null hypothesis is not rejected. So in reality the null hypothesis is false, but we make the decision not to reject the null hypothesis. In our example, this means the drug actually did work: there is a difference between those who have taken the drug and those who have not, but it was just a coincidence that the sample taken did not show much difference, and we mistakenly thought the drug was not working. And now
I'll show you how DATAtab helps you to find a suitable hypothesis test and, of course, calculates it and interprets the results for you. Let's go to datatab.net and copy your own data in here; we will just use this example data set. After copying your data into the table, the variables appear down here. DATAtab automatically tries to determine the correct level of measurement, but you can also change it up here. Now we just click on hypothesis testing and select the variables we want to use for the calculation of a hypothesis test. DATAtab will then
suggest a suitable test: for example, in this case a chi-square test, or in that case an analysis of variance. Then you will see the hypotheses and the results. If you're not sure how to interpret the results, click on Summary in Words. Further, you can check the assumptions and decide whether you want to calculate a parametric or a non-parametric test. Now we know the differences between descriptive and inferential statistics. Our next step is to take a closer look at inferential statistics and at choosing the appropriate hypothesis test. Which hypothesis test you can use depends on the
level of measurement of your data. There are four types of levels of measurement: nominal, ordinal, interval, and ratio. And here is an easy explanation for you. In this video we are going to explore the four levels of measurement: nominal, ordinal, interval, and ratio. Each level gives us important information about the variable and supports different types of statistical analysis. By the end of this video you will know what the levels of measurement are and, especially, you will understand why you need these levels. So whether you are analyzing survey data, optimizing business operations, or studying for
a statistics exam, stay tuned. What are levels of measurement? Levels of measurement refer to the different ways that variables can be quantified or categorized. If you have a data set, then every variable in the data set corresponds to one of the four primary levels of measurement. These levels are nominal, ordinal, interval, and ratio. In practice, interval and ratio data are often used to perform the same analyses; therefore, the term metric level is used to combine these two levels. Why do you need levels of measurement? The level of measurement is crucial in statistics for several key reasons:
it tells us how our data can be collected, analyzed, and interpreted. Here's why understanding these levels is so important: different levels of measurement support different statistical analyses. For instance, the mean and standard deviation are suitable for metric data; in some cases they may be suitable for ordinal data, but only if you know how to interpret the results correctly, and it definitely makes no sense to calculate them for nominal data. The level of measurement also tells us which hypothesis tests are possible and determines the most effective type of data visualization. For example, bar charts are great for nominal
data, while histograms are better suited for metric data. So each level provides different information and supports various types of statistical analysis. But attention: the level of measurement is mainly relevant at the end of the research process; however, the type of data to be collected is determined at the beginning. Therefore, it is crucial to consider the level of measurement of the data from the start, to ensure that the desired tests can be conducted at the end. So let's take a closer look at each level of measurement. What characterizes nominal variables? This is
the most basic level of measurement. Nominal data can be categorized, but it is not possible to rank the categories in a meaningful way. Examples of nominal variables are gender, with the categories male and female; type of animal, with for example the categories dog, cat, and bird; or preferred newspaper. In all these cases you can tell whether one value is the same as another, so you can distinguish the values, but it is not possible to put the categories in a meaningful order. An example: we would like to investigate whether gender has an influence on the preferred newspaper. Both
variables are nominal. So when we create a questionnaire, we simply list the possible answers for both variables; since there is no meaningful order for nominal variables, it usually does not matter in which order the categories are listed in the questionnaire. Then we can display the collected data in a table, where each row is a person with the respective answers. We can now use our data to create frequency tables or bar charts. But what about the ordinal level of measurement? Ordinal data can be categorized and, in comparison with nominal data, it is possible to have a
meaningful ranking of the categories. But differences between ranks do not have a mathematical meaning; this means the intervals between the data points are not necessarily equal. Examples of ordinal variables are all kinds of rankings, like first, second, third; satisfaction ratings, like very unsatisfied, unsatisfied, neutral, satisfied, very satisfied; and levels of education, like high school, bachelor's, master's. In a questionnaire you could ask: how satisfied are you with your current job? In this case we have these five possible options. The answers can be categorized and there is a logical order; that's why the variable satisfaction with the job is
an ordinal variable. What about metric variables? Metric variables are the highest level of measurement. Metric data is like ordinal data, but the intervals between values are equally spaced; this means that differences and sums can be formed meaningfully. Examples of metric variables are income, weight, age, and electricity consumption. If you ask for a metric variable in a questionnaire, there's usually just an input field in which the person directly enters the value, for example age or body weight. Let's look at what we've learned so far using an example. Imagine you're conducting a survey in a school to understand
how pupils get to school. Here are questions you might ask, each corresponding to a different level of measurement. The first question could be: what mode of transportation do you use to get to school? Bus, car, bicycle, walk. This is of course a nominal variable: the answers can be categorized, but there is no meaningful order. This means that bus is not higher than bicycle, walk is not higher than car, and so on and so forth. If you want to analyze the results of this question, you can count how many students use each mode of transportation and
present it in a bar chart. Further, you can ask: how satisfied are you with your current mode of transportation? Choices might include very unsatisfied, unsatisfied, neutral, satisfied, very satisfied. This is of course an ordinal variable: you can rank the responses to see which mode of transportation ranks higher in satisfaction, but the exact difference between, for example, satisfied and very satisfied isn't quantifiable. And the last question: how many minutes does it take you to get to school? Minutes to get to school is a metric variable. Here you can calculate the average time
to get to school and use all standard statistical measures. We can visualize this data with a histogram showing the distribution of times to get to school, and compare the different transportation modes. So, using nominal data we can categorize and count responses but cannot infer any order; ordinal data allows us to rank responses but not to measure precise differences between ranks; metric data enables us to measure exact differences between data points. As already mentioned, the metric level of measurement can be further subdivided into the interval scale and the ratio scale. But what is the difference between the interval and ratio
levels? Let's look at an example. In a marathon, of course, the time of the marathon runners is measured. Let's say the first one took 2 hours and the last one finished the marathon in 6 hours. Here we can say that the fastest runner was three times as fast as the slowest, or, to put it the other way around, the slowest one took three times as long as the fastest one. This is possible because there is a true zero point: at the beginning of the marathon, all runners start from zero. In this case we have the
ratio level of measurement. If, however, someone forgets to start the stopwatch at the beginning of the race and only the differences are measured, starting from the fastest runner, we don't have this true zero; now the runners cannot be put in proportion. In this case we can say how big the interval between the runners is: for example, the fastest runner was 4 hours faster than the slowest runner. But we cannot say that the fastest runner was three times as fast as the slowest, because we don't know the absolute values for both runners. We
still have equal intervals, though: we can say things like runner B finished one hour after the fastest runner and runner C finished 1 hour and 45 minutes after the fastest runner. The time differences are measurable and meaningful, but since there is no true zero point, we cannot say the fastest runner was x times as fast as the slowest runner. We only know how much later the other runners finished relative to the fastest runner, but not their total running times. In this case we have an interval level of measurement. In summary, while both interval and
ratio scales have equal intervals and support similar operations like addition and subtraction, ratio scales have a true zero point: zero represents the absence of the quantity being measured. This allows meaningful multiplication and division. And now a little exercise to check whether everything is clear to you. First we have state of the US, which is a nominal level of measurement; this means the data is used for labeling or naming categories without any quantitative value. In this case, the states are names with no inherent order or ranking. Next we have product ratings on a scale from
1 to 5, which is an example of ordinal data. Here the numbers do have an order or rank (five is better than one), but the intervals between the ratings are not necessarily equal. Moving on to religious confession: like the states, this is also nominal; the categories here, such as different religions, are for categorization and do not imply any order. Next we have CO2 emissions per year, which is measured on a metric ratio scale; this level allows for the full range of mathematical operations, including meaningful ratios. Zero emissions mean no emissions at all. Then
we have telephone numbers. Although telephone numbers are numeric, they are categorized as nominal: they are just identifiers with no numerical value for analysis. Care level of patients is another ordinal example; this might include levels such as low, medium, and high care, which indicate an order but not the exact difference between these levels. Living space in square meters is measured on a ratio scale: as with CO2 emissions, zero square meters means there is no living space, and comparisons like double or half are meaningful. Lastly, we have job satisfaction on a scale from 1 to 4. This is
ordinal data: it ranks satisfaction levels, but the difference between each level isn't quantified. Now we know what the level of measurement is, and we can go through the hypothesis tests that are most popular and discuss when to use which test. Let's start with the video on the most common hypothesis test: the t-test. This video covers everything you need to know about the t-test. After this video you will know what a t-test is and when you use it, what types of t-tests there are, what the hypotheses and the assumptions are, how a
t-test is calculated, and how you interpret the results. Let's start with the first question: what is a t-test? The t-test is a statistical test procedure. And what does the t-test do? The t-test analyzes whether there is a significant difference between the means of two groups. For example, the two groups may be patients who received drug A and drug B, respectively; we would now like to know if there is a difference in blood pressure between these two groups. There are three different types of t-tests: the one-
sample t-test, the independent samples t-test, and the paired samples t-test. When do we use a one-sample t-test? We use the one-sample t-test when we want to compare the mean of a sample with a known reference mean. Example: a chocolate bar manufacturer claims that its chocolate bars weigh an average of 50 g. To check this, we take a sample of 30 bars and weigh them; the mean value of this sample is 48 g. Now we can use a one-sample t-test to check if the mean of 48 g
is significantly different from the claimed 50 g. When do we use the independent samples t-test? We use the t-test for independent samples when we want to compare the means of two independent groups or samples; we want to know if there is a significant difference between these means. Example: we would like to compare the effectiveness of two painkillers. We randomly divide 60 people into two groups: the first group receives drug A and the second group receives drug B. Using an independent t-test, we can now test whether there is a significant difference in
pain relief between the two drugs. When do we use the paired samples t-test? We use the paired samples t-test to compare the means of two dependent groups. Example: we want to know how effective a diet is. To do this, we weigh 30 people before the diet and then weigh exactly the same people after the diet. Now we can look at the difference in weight between before and after for each subject, and we can use a paired samples t-test to test whether there is a significant difference. In a paired sample, the measurements are available
in pairs. The pairs result, for example, from repeated measurements with the same people; independent samples, in contrast, are made up of people and measurements that are independent of each other. Here's an interesting note: the paired samples t-test is very similar to the one-sample t-test. We can also think of the paired samples t-test as having one sample that was measured at two different times. We then calculate the difference between the paired values, giving us one value per pair and thus a single sample of differences: the difference is once −5, once +2, once −1, and so on
and so forth. Now we want to test whether the mean value of the differences just calculated deviates from a reference value, in this case zero; this is exactly what the one-sample t-test does. What are the assumptions for a t-test? Of course, we first need a suitable sample: in the one-sample t-test we need a sample and a reference value, in the independent t-test we need two independent samples, and in the case of a paired t-test, a paired sample. The variable for which we want to test whether there is a
difference between the means must be metric. Examples of metric variables are age, body weight, and income; a person's level of education, for example, is not a metric variable. In addition, the metric variable must be normally distributed in all three test variants. To learn how to test if your data is normally distributed, watch my video on tests for normal distribution. In the case of an independent t-test, the variances in the two groups must be approximately equal. You can check whether the variances are equal using Levene's test; for more information, watch my video on Levene's test. So what
are the hypotheses of the T Test let's start with the one sample T Test in the one sample T Test the null hypothesis is the sample mean is equal to the given reference value so there's no difference and the alternative hypothesis is the sample mean is not equal to the given reference value what about the independent samples T Test in the independent T Test the null hypothesis is the mean values in both groups are the same so there is no Difference between the two groups and the alternative hypothesis is the mean values in both groups
are not equal so there is a difference between the two groups and finally the pair samples T Test in a pair T Test the Nile hypothesis is the mean of the difference between the pairs is zero and the alternative hypothesis is the mean of the difference between the pairs is not zero so now we know what the hypotheses are before we look at how the te test is Calculated let us look at an example of why we actually need a te test let's say there is a difference in the length of study for a bachelor's
degree between men and women in Germany. Our population is therefore made up of all graduates of a bachelor's degree who have studied in Germany. However, as we cannot survey all bachelor graduates, we draw a sample that is as representative as possible. We now use the t-test to test the null hypothesis that there is no difference in the population. Even if there is no difference in the population, we will almost certainly still see a difference in study duration in the sample; it would be very unlikely that we drew a sample where the difference is exactly zero. In simple terms, we now want to know at what difference, measured in a sample, we can say that the duration of study of men and women is significantly different, and this is exactly what the t-test answers. But how do we calculate a t-test? To do this, we first calculate the t value. To calculate the t value we need two things: first the difference between the means, and then the standard deviation of the mean, which is also known as the standard error. In the one sample t-test we calculate the difference between the sample mean and the known reference mean, so t = (x̄ − μ) / (s / √n), where s is the standard deviation of the collected data and n is the number of cases; s divided by the square root of n is then the standard deviation of the mean, i.e. the standard error. In the independent samples t-test we simply calculate the difference between the two sample means. To calculate the standard error, we need the standard deviation and the number of cases from the first and the second sample; depending on whether we can assume equal or unequal variances for our data, there are different formulas
for the standard error; read more about this in our tutorial on datatab.net. In a paired samples t-test we only need to calculate the differences between the paired values and take the mean of them; the standard error is then the same as for a one sample t-test. So what have we learned so far about the t value? No matter which t-test we calculate, the t value will be greater if we have a greater difference between the means, and smaller if the difference between the means is smaller. Further, the t value becomes smaller when we have a larger dispersion of the mean: the more scattered the data, the less meaningful a given mean difference is. Now we want to use the t-test to see if we can reject the null hypothesis or not. To do this, we can use the t value in two ways: either we read the critical t value from a table, or we simply calculate the p value from the t value. We'll go through both in a moment. But what is the p value? A t-test always tests the
null hypothesis that there is no difference. So first we assume that there is no difference in the population. When we draw a sample, this sample deviates from the null hypothesis by a certain amount. The p value tells us how likely it is that we would draw a sample that deviates from the population by the same amount as, or more than, the sample we drew. Thus, the more the sample deviates from the null hypothesis, the smaller the p value becomes. If this probability is very, very small, we can of course ask whether the null hypothesis holds for the population; perhaps there is a difference after all. But at what point can we reject the null hypothesis? This border is called the significance level, which is usually set at 5%. So if there is only a 5% chance that we draw such a sample, or one that is even more different, then we have enough evidence to reject the null hypothesis, and to put it simply, we assume that there is a difference, i.e. that the alternative hypothesis is true. Now that we know what the p value is, we can finally look at how the t value is used to determine whether or not the null hypothesis is rejected. Let's start with the path through the critical t value, which you can read from a table. To do this, we first need a table of critical t values, which we can find on datatab.net under "Tutorials" and "t-distribution". Let's start with the two-tailed case; we'll briefly look at the one-tailed case at the end of this video. Here below we see the table. First we need to decide what level of significance we want to use; let's choose a significance level of 0.05, or 5%.
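As a side note for readers who want to check such table values themselves: the two-tailed p value can be computed directly from the t value and the degrees of freedom. Here is a minimal pure-Python sketch (the function name, the integration cutoff of 100 and the step count are our own choices, not anything from the video); real statistics software uses the exact incomplete beta function instead of numerical integration.

```python
import math

def two_tailed_p(t, df, upper=100.0, steps=100_000):
    """Two-tailed p value: twice the area under the t density beyond |t|.

    Approximated with Simpson's rule on [|t|, upper] -- good enough for a
    quick check against a table of critical values.
    """
    # normalizing constant of Student's t density with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

    a, b = abs(t), upper
    h = (b - a) / steps
    s = pdf(a) + pdf(b)
    for i in range(1, steps):
        s += pdf(a + i * h) * (4 if i % 2 else 2)
    return 2 * (h / 3) * s

# a t value of 2.5 with 9 degrees of freedom
print(round(two_tailed_p(2.5, 9), 3))    # -> 0.034
# the critical value 2.262 should sit right at the 5% limit
print(round(two_tailed_p(2.262, 9), 3))  # -> 0.05
```

These are exactly the numbers that come up in a moment: t = 2.5 with 9 degrees of freedom gives p ≈ 0.034, and the critical value 2.262 gives p = 0.05.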
then we look in this column at 1 − 0.05, which is 0.95. Now we need the degrees of freedom. In the one sample t-test and the paired samples t-test, the degrees of freedom are simply the number of cases minus one; so if we have a sample of 10 people, there are 9 degrees of freedom. In the independent samples t-test we add the number of people from both samples and subtract two, because we have two samples. Note that the degrees of freedom can be determined in a different way depending on whether we assume equal or unequal variances. So if we have a 5% significance level and 9 degrees of freedom, we get a critical t value of 2.262. Now, on the one hand, we've calculated a t value with the t-test, and we have the critical t value: if our calculated t value is greater than the critical t value, we reject the null hypothesis. For example, suppose we calculate a t value of 2.5; this value is greater than 2.262, and therefore the two means are so different that we can reject the null hypothesis. On the other hand, we can also calculate the p value for the t value we've calculated. If we enter 2.5 for the t value and 9 for the degrees of freedom, we get a p value of 0.034. The p value is less than 0.05, and we therefore reject the null hypothesis. As a control, if we enter the critical t value of 2.262, we get exactly a p value of 0.05, which is exactly the limit. If you want to calculate a t-test with DataTab, you just need to copy your own data into this table, click on "Hypothesis test" and
then select the variables of interest. For example, if you want to test whether gender has an effect on income, you simply click on the two variables and automatically get a t-test for independent samples. Here below you can read the p value, and if you're still unsure about the interpretation of the results, you can simply click on "Interpretation in words": a two-tailed t-test for independent samples (equal variances assumed) showed that the difference between female and male with respect to the dependent variable salary was not statistically significant; thus the null hypothesis is retained. The final question now is: what is the difference between a directed and an undirected hypothesis? In the undirected case, the alternative hypothesis is simply that there is a difference, for example: there is a difference between the salary of men and women in Germany. We don't care who earns more; we just want to know if there is a difference or not. In a directed hypothesis we are also interested in the direction of the difference; for example, the alternative hypothesis might be that men earn more than women, or that women earn more than men. If we look at the t-distribution graphically, we can see that in the two-sided case we have a rejection range on the left and a rejection range on the right: we reject the null hypothesis if we land in either of them, and with a 5% significance level both ranges have a probability of 2.5%, together 5%. If we do a one-tailed t-test, the null hypothesis is rejected only if we are in this range or, depending on the direction we want to test, in that range; with a 5% significance level, all 5% fall within this one range. We've seen how the
t-test is a powerful tool for comparing the means of two groups to determine whether they differ significantly. But what if we want to extend our analysis to more than two groups? This is where analysis of variance, or ANOVA, comes into play, so let's get started with ANOVA. This video is about analysis of variance, or ANOVA for short. We discuss what an ANOVA is and why you need it, we look at the hypotheses and the assumptions for an ANOVA, I will show you how to calculate an ANOVA and what the equations behind an ANOVA are, and finally we will discuss how to interpret the results and take a look at the post-hoc test. Let's get started. First of all, there are different types of analysis of variance; the simplest and most common one is the one-way ANOVA, and that's exactly what this video is about. What is a one-way ANOVA? The one-way ANOVA is a hypothesis test. Okay, and what does the one-way ANOVA test? An ANOVA tests whether there are statistically significant differences between three or more groups; more precisely, it is tested whether there is a significant difference between the mean values of the groups. An example: you want to investigate whether there is a difference between fertilizers A, B and C in terms of plant growth. The analysis of variance now helps you to find out whether there is a significant difference in plant growth between the three groups. But doesn't the t-test do something similar? That's true: the t-test tests whether there is a difference between two groups. The analysis of variance is an extension of the t-test, and it tests whether there is a difference between more than two groups. So if you
only had two fertilizers A and B that you wanted to compare, you would use a t-test; if you have three or more different fertilizers, you would use an ANOVA. Okay, that makes sense. What are the hypotheses tested in an ANOVA? In an ANOVA, the null hypothesis is that the mean values of the groups are equal; mathematically, this can be expressed as μ1 = μ2 = … = μk, where μ1 to μk represent the mean values of the different groups. The alternative hypothesis is that the mean values of the groups are not equal; this can be expressed mathematically as follows: at least one μi is different, where i stands for one of the groups under consideration. What are the assumptions of an ANOVA and how is it calculated? Let's look at this with the help of an example. Let's say you are a researcher and you want to find out whether three different drugs have a different effect on blood pressure. So your research question is: do the three drugs have a different effect on blood pressure? The null hypothesis is: there is no difference between the three drugs in terms of blood pressure. And the alternative hypothesis is: there is a difference between the three drugs in terms of blood pressure. In order to calculate an ANOVA we of course need data. How do we get the data? We obtain data by taking a sample from the population we need to analyze, for example all people suffering from high blood pressure. Let's say we took a sample of 24 test subjects; we can then divide these 24 test subjects into three groups, with each group receiving a different drug. Okay, but now we still don't have any data;
we only have groups for now. That's right: in order to be able to calculate an ANOVA, we have to measure something for each subject, in our case the blood pressure. We measure the blood pressure for each test person once at the beginning and once at the end of the treatment; the difference between the two values tells us how the blood pressure has changed while taking the medication. We have now collected data, but what about the assumptions? We still have to check them, don't we? Exactly. So that you can calculate an ANOVA, these assumptions must be met. Number one, level of measurement: the levels of measurement of your variables must be appropriate for an ANOVA. The independent variable should be nominal, such as the type of medication or the type of fertilizer applied; the dependent variable, on the other hand, must be metric, like blood pressure or plant growth. To summarize, we need a variable that defines the different groups, such as the different medications, and a variable that reflects what is measured in the different groups, for example blood pressure. But now to the second assumption, independence: the measurements should be independent, i.e. the measured value of one group should not be influenced by the measured value of another group. The third assumption is that of normal distribution: the data within the groups should be normally distributed. This assumption becomes less important as the sample size increases, but should still be evaluated, especially for small samples. If this assumption is violated, the Kruskal-Wallis test can be used as an alternative. If you would like to know how to test data for normal distribution, please watch my video "Test for normal distribution". And thus to the fourth assumption: the variances in each group should be roughly the same. In other words, it should not be the case that we have a very large variance in one group compared to a very small variance in another group. This assumption can be tested with the Levene test; you can find more information in my video on the Levene test. If this assumption is not met, the Welch ANOVA can be used as an alternative. Okay, but how does an ANOVA work? We want to know whether there is a difference between the groups. To test whether there is a difference, the ANOVA uses the
ratio of the variance between the groups and the variance within the groups. What is the between-group variance and what is the within-group variance? The variance between the groups measures how much the mean values of the groups differ from each other, in our case the deviation of the mean values of groups A, B and C: in this case we would have a big variance, in that case a small one. What is the variance within the groups? The within-group variance measures how much the individual data points in each group fluctuate: in this case we would have a large variance within the groups, in that case a small variance. The ratio of the variance between the groups to the variance within the groups is now the so-called F value, or F statistic. If the variance between the groups is small and the variance within the groups is large, then we get a small F value; in this case it is likely that such deviations, or even larger ones, occur purely by chance, even if in reality there is no difference between the groups. However, if we have a small spread within the groups and a large spread between the groups, then the F value is large; it is then very unlikely that such a difference, or an even more extreme one, would arise purely by chance, or to be more precise, it is very unlikely that we would obtain such an F value or an even larger one. Therefore, to assess whether the differences between the groups are statistically significant, we use the F value. However, this F value does not give us a definite answer; to assess the significance of the results more precisely, we need the p value. So let's calculate
the p value. Let's say this is our data: we have the values from the first, second and third group. So now we start and calculate the variance between the groups and the variance within the groups. First of all, we need the total mean and the mean values of the individual groups. We obtain the total mean by adding up all the values and dividing by the number of values; we have a total of 24 values, and this gives us a total mean of 5.3. We get the mean values of the groups by simply adding up all the values of the respective group and dividing by the number of values per group. Now we've calculated all the required mean values, and we can calculate the so-called sum of squares between the groups. To do this, we calculate the difference between each group mean and the total mean, square it, multiply it by nᵢ, the number of values in the i-th group, and add up the results; so we get (2.5 − 5.3)² + (8 − 5.3)² + (5.5 − 5.3)², each term multiplied by its group size. This gives us the sum of squares between the groups. Next we calculate the sum of squares within the groups. To calculate this, we subtract the mean of each group from each individual value within that group: we subtract the mean of the first group from each of its values, the mean of the second group from each of its values, and the mean of the third group from each of its values. We square the differences obtained and add everything up; for example, in group one we subtract the group average of 2.5 from the first value, from the second value, and so on. In total we get a sum of squares within the groups of 36. So now we've almost made it: we've calculated the total mean and the group mean values, and we have calculated the sum of squares between the groups and within the groups. Now we can calculate the variances within and between the groups; in most cases, however, the term variance is not used but mean squares, which we will now also adopt. We obtain the mean squares between the groups by dividing the sum of squares between the groups by the number of degrees of freedom between the groups. We've just calculated the sum of squares between the groups, and
the degrees of freedom result from the number of groups minus one, so in our case 3 − 1, which is equal to 2; we get a mean square of 60.67. Now we need the mean squares within the groups. These are obtained by dividing the calculated sum of squares within the groups by the degrees of freedom within the groups; the degrees of freedom result from the total number of values, in our case 24, minus the number of groups, in our case three, so we get 21. Now we can calculate the F value. To do this, we divide the mean square between the groups by the mean square within the groups or, if we take the expression from the beginning, the variance between the groups by the variance within the groups. We get an F value of 35.39. But what about the p value? We can calculate the p value with an F distribution. To do this, we simply go to datatab.net; we need the degrees of freedom between the groups, in our case two, the degrees of freedom within the groups, in our case 21, and we need the F value we calculated, i.e. 35.39. We get a p value that is smaller than 0.01, so if we use the usual significance level of 5%, we get a significant difference and we reject the null hypothesis that there is no difference between the three groups. Of course, you can simply calculate an ANOVA online with DataTab: just copy your data into this table and click on "Hypothesis test", then simply select your desired variables, for example blood pressure and medication, and an ANOVA will be calculated automatically. If you don't know exactly how to interpret the results, simply click
on "Summary in words", or you can simply click on "AI interpretation" for most tables. In this table we now get exactly the values that we just calculated by hand, and we can also see here that we get a significant result. As I said, you can simply click on "AI interpretation" to help you understand the results. Now there's one more topic to cover: the post-hoc test. A post-hoc test becomes necessary when an ANOVA indicates a significant difference between groups: while an ANOVA can confirm that differences exist, it doesn't specify which groups are different from each other. A post-hoc test helps to pinpoint exactly which groups differ from each other; it performs pairwise comparisons between the groups, indicating for example whether group A differs from group B or group C. In our case, the post-hoc test reveals that all groups differ significantly from one another. However, this isn't always the outcome: it's possible that only group one differs from group two, or that no individual pair of groups shows a significant difference even if the ANOVA result is significant overall. But what if you have not just one factor but two? In this case you
will need to perform a two-way ANOVA. Let's explore the two-way ANOVA. What is a two-way ANOVA? A two-way ANOVA is a statistical method used to test the effect of two categorical variables on a continuous variable. The categorical variables are the independent variables, for example the variable drug type with drug A and B, and gender with female and male; the continuous variable is the dependent variable, for example the reduction in blood pressure. So the two-way ANOVA is the extension of the one-way ANOVA: while a one-way ANOVA tests the effect of a single independent variable on a dependent variable, a two-way ANOVA tests the effects of two independent variables. The independent variables are called factors. But what is a factor? A factor is, for example, the gender of a person with the levels male and female, the type of therapy with therapies A, B and C, or the field of study with medicine, business administration, psychology and mathematics. In an analysis of variance, a factor is therefore a nominal variable. We use an ANOVA whenever we want to test whether these levels have an influence on the so-called dependent variable. You might want to test whether gender has an effect on salary, whether therapy has an effect on blood pressure, or whether field of study has an effect on length of study; salary, blood pressure and length of study will then be the dependent variables in each of these cases. You test whether the factor has an effect on the dependent variable, and since you only have one factor in these cases, you would use a one-way ANOVA. Okay, you're right: in the first case we have a variable with only two categories, so of course we would use the independent samples t-test. But when do we
use a two-way ANOVA? We use a two-factor analysis of variance when we have a second factor and we want to know whether this factor also has an effect on the dependent variable. We would also like to know whether, in addition to gender, the highest level of education has an impact on salary; or we would like to include gender in addition to the type of therapy; or, in the third case, we would also like to know whether the university attended, in addition to the field of study, has an influence on the length of study. Now we don't have one factor in all three cases but two factors in each case, and since we now have two factors, we use a two-way analysis of variance. So in a one-way ANOVA we have one factor from which we create the groups: if the factor we're looking at has three levels, for example three different types of drug, we will have three groups to compare. In the case of a two-way analysis of variance, the groups result from the combination of the levels of the two factors: if we have one factor with three levels and one with two levels, we have a total of six groups to compare. But what kind of statements can we make with a two-way ANOVA? With the help of a two-way ANOVA we can answer three things: whether the first factor has an effect on the dependent variable, whether the second factor has an effect on the dependent variable, and whether there is an interaction effect between the two factors. But what about the hypotheses? In a two-way ANOVA there are three null hypotheses and therefore also three alternative hypotheses. The first null hypothesis is: there is no significant difference between
the groups of the first factor, and the alternative hypothesis: there is a significant difference between the groups of the first factor. The second null hypothesis is: there is no significant difference between the groups of the second factor, and the alternative hypothesis: there is a significant difference between the groups of the second factor. And the third null hypothesis reflects the interaction effect: one factor has no influence on the effect of the other factor, and the alternative hypothesis: at least one factor has an influence on the effect of the other factor. And what about the assumptions? For the test results to be valid, several assumptions must be met. Number one, normality: the data within the groups should be normally distributed or, alternatively, the residuals should be normally distributed; this can be checked with a quantile-quantile plot. Number two, homogeneity of variances: the variances of the data in the groups should be equal; this can be checked with the Levene test. Number three, independence: the measurements should be independent, i.e. the measured value of one group should not be influenced by the measured value of another group. Number four, measurement level: the dependent variable should have a metric
scale level. But how do we calculate a two-way ANOVA? Let's look at the example from the beginning: we would like to know if drug type and gender have an influence on the reduction in blood pressure. Drug type has the two levels drug A and drug B, and gender has the two levels male and female. To answer the question, we collect data: we randomly assign patients to the treatment combinations and measure the reduction in blood pressure after a month. For example, the first patient receives drug A, is male, and after one month a reduction in blood pressure of six was measured. Now let us answer the questions: is there a main effect of drug type on the reduction in blood pressure, is there a main effect of gender on the reduction in blood pressure, and is there an interaction effect between drug type and gender on the reduction in blood pressure? For the calculation we can either use statistical software like DataTab or do it by hand. First I will show you how to calculate it with DataTab and how to interpret the results; at the end I will show you how to calculate the ANOVA by hand and go through all the equations. To calculate a two-way ANOVA online, simply visit datatab.net and copy your data into this table, then click on "Hypothesis test". Under this tab you will find a lot of hypothesis tests, and depending on which variables you select, you will get an appropriate hypothesis test suggested. We want to know if drug type and gender have an influence on the reduction in blood pressure, so let's just click on all three variables. DataTab now automatically gives us a two-way ANOVA. We can read the three null and the three
alternative hypotheses here. Afterwards we get the descriptive statistics and the Levene test for equal variances; with the Levene test we can check if the variances within the groups are equal. The p value is greater than 0.05, so we assume equality of variances in the groups for this data. And here we see the results of the analysis of variance; we'll look at these in more detail in a moment, but if you don't know exactly how to interpret the results, you can also just click on "Summary in words". In addition, you can check here if the requirements for the analysis of variance are met at all. But now back to the results; let's take a closer look at this table. The first row tests the null hypothesis of whether drug type has an effect on the reduction in blood pressure, the second row tests whether gender has an effect on the reduction in blood pressure, and the third row tests if the interaction has an effect. You can read the p value in each case right at the back here. Let's say we set the significance level at 5%: if our calculated p value is less than 0.05, the null hypothesis is rejected, and if the calculated p value is greater than 0.05, the null hypothesis is not rejected. Thus we see that all three p values are greater than 0.05, and therefore we cannot reject any of the three null hypotheses. Therefore, neither the drug type nor gender has a significant effect on the reduction in blood pressure, and there's also no significant interaction effect. But what does an analysis of variance actually do, and why is the word variance in analysis of variance? In a two-way analysis of variance, the total variance of the
dependent variable is divided into the variance that can be explained by factor A, the variance that can be explained by factor B, the variance of the interaction, and the error variance. Strictly speaking, SS is not the variance but the sum of squares; we will discuss how to calculate the variance in this case in a moment. But how can I imagine that the dependent variable has some variance? In our example, not everyone will have the same reduction in blood pressure. We now want to know if we can explain some of this variance by the variables drug type, gender and their interaction; the part that we cannot explain by these three terms accumulates in the error. If the result looked like this, we would be able to explain almost all the variance by factors A and B and their interaction, and we would only have a very small proportion that could not be explained. This means that we can make a very good statement about the reduction in blood pressure using the variables drug type, gender and the interaction. In this case it would be the other way around: drug type, gender and the interaction have almost no effect on the reduction in blood pressure, and it all adds up in the error. But how do we calculate the sums of squares, the F values and the p values? Here we have our data: once drug type with drug A and B, and once gender with male and female; these individuals, for example, are all male and have been given drug A. First we calculate the mean values we need. We calculate the mean value of each group, so male and drug A, that is 5.8, then male and drug B, that is 5.4, and we do
the same for female. Then we calculate the mean value of all males and of all females, and the mean value of drug A and of drug B. Finally, we need the total mean. We can now start to calculate the sums of squares. Let's start with the total sum of squares: we do this by subtracting the total mean from each individual value, squaring the result, and adding up all the values. The total mean is 5.4, so we calculate (6 − 5.4)² + (4 − 5.4)² up to, finally, (3 − 5.4)², and we get a sum of squares of 84.8. The degrees of freedom are given by n × p × q − 1, where n is the number of people in a group, in our case five, and p and q are the number of categories in each of the factors; in both cases we have two. The total variance is calculated by dividing the sum of squares by the degrees of freedom, so we get 4.46. Now we can calculate the sum of squares between the groups. For this we calculate each group mean minus the total mean, square it, and multiply by the group size, so (5.8 − 5.4)² + (5.4 − 5.4)² and the same for these two values, each multiplied by five; we get 7.6. In this case the degrees of freedom are three, which gives us a variance of 2.53. Now we can calculate the sum of squares of factor A. ā is the mean value of a category of factor A, so we calculate 5.9 minus the total mean value and 4.9 minus the total mean value, square these deviations and weight them by the number of values behind each mean; this results in five. Together with the degrees of freedom we can now calculate the variance for factor A, which is five. We do the same for factor B; in this
case we use the mean values of male and female, and we get a variance of 0.8. Now we can calculate the sum of squares for the interaction: we obtain this by calculating the sum of squares between the groups minus the sums of squares of A and B. The degrees of freedom come to one, and for the interaction we get a variance of 1.8. Finally, we can calculate the sum of squares of the error. We subtract the mean value of each group from the respective group values: in this group we subtract 5.8 from each individual value, in this group we subtract 5.4, here we subtract 6, and then we subtract 4.4. This gives us a sum of squares of 77.2; the degrees of freedom are 16, and we get a variance of 4.83. And now we calculate the F values. These are obtained by dividing the variance of factor A, factor B or the interaction by the error variance. So we get the F value for factor A by dividing the variance of factor A by the error variance, which is equal to 1.04, and we can now do exactly the same for F(B) and F(AB). To verify: we get exactly the same values with DataTab, 1.04, 0.17 and 0.37. For the calculation of the p value you need the degrees of freedom and the F distribution; with these values you can either read the critical F value in a table or, as usual, just use software to calculate the p values. You can find a table of critical F values, e.g. for a significance level of 5%, on datatab.net. If you use this table: if the calculated F value is greater than the value read from the table, the null hypothesis is rejected, otherwise not.
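If you want to retrace this by-hand calculation in code, here is a small Python sketch. It is not DataTab's implementation: it rebuilds the between-groups part from the four cell means reported in the example (5.8 and 5.4 for the two male cells, 6 and 4.4 for the two female cells, with n = 5 per cell), and since the raw data are not listed here, the error sum of squares of 77.2 with 16 degrees of freedom is taken over as given; all variable names are our own.

```python
# Sums of squares for the two-way example, rebuilt from the cell means.
n = 5                                         # observations per cell
cells = {("A", "male"): 5.8, ("A", "female"): 6.0,
         ("B", "male"): 5.4, ("B", "female"): 4.4}
grand = sum(cells.values()) / len(cells)      # total mean: 5.4

# marginal means of the two factors
drug = {d: (cells[(d, "male")] + cells[(d, "female")]) / 2 for d in ("A", "B")}
sex = {s: (cells[("A", s)] + cells[("B", s)]) / 2 for s in ("male", "female")}

# each marginal mean is based on 2 * n = 10 values, each cell mean on n
ss_a = 2 * n * sum((m - grand) ** 2 for m in drug.values())      # 5.0
ss_b = 2 * n * sum((m - grand) ** 2 for m in sex.values())       # 0.8
ss_cells = n * sum((m - grand) ** 2 for m in cells.values())     # 7.6
ss_ab = ss_cells - ss_a - ss_b                                   # 1.8

ms_error = 77.2 / 16                          # error variance: 4.825
f_a, f_b, f_ab = ss_a / ms_error, ss_b / ms_error, ss_ab / ms_error
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))  # -> 1.04 0.17 0.37
```

Because both factors have only two levels, the degrees of freedom for A, B and the interaction are all one, so each mean square equals its sum of squares; dividing by the error variance reproduces the F values 1.04, 0.17 and 0.37 from above.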
we've seen how an ANOVA allows us to compare means across different groups to determine if there are significant differences but what if our research design involves measurements taken from the same subjects at different time points this is where the dependency among the observations comes into play let's dive into how repeated measures ANOVA adjusts our approach to interconnected data points this video is about the repeated measures ANOVA we will go through the following questions what is a repeated measures analysis of variance what are the hypotheses and the assumptions how is a repeated measures
ANOVA calculated how are the results interpreted and what is a post-hoc test and how do you interpret it we'll go through all points using a simple example let's start with the first question what is a repeated measures ANOVA a repeated measures analysis of variance tests whether there is a statistically significant difference between three or more dependent samples what are dependent samples in a dependent sample the same participants are measured multiple times under different conditions or at different time points we therefore have several measurements from each person involved let's take a look at
an example let's say we want to investigate the effectiveness of a training program for this we started looking for volunteers to participate in order to investigate the effectiveness of the program we measure the physical fitness of the participants at several points in time before the training program immediately after completion and two months later so for each participant we have a value for physical fitness before the program a value immediately after completion and a value 2 months later and since we are measuring the same participants at different points in time we are dealing with
dependent samples now of course it doesn't have to be about people or points in time in a generalized way we can say in a dependent sample the same test units are measured several times under different conditions the test units can be people animals or cells for example and the conditions can be time points or treatments for example but what is the purpose of repeated measures ANOVA we want to know whether the fitness program has an influence on physical fitness and it is precisely this question that we can answer with the help of an ANOVA
with repeated measures physical fitness is therefore our dependent variable and time is our independent variable with time points as levels so the analysis of variance with repeated measures checks whether there is a significant difference between the different time points but isn't that what the paired samples t test does doesn't it also test whether there is a difference between dependent samples that's correct the paired samples t test evaluates whether there is a difference between two dependent groups the repeated measures ANOVA extends this concept allowing you to examine differences among three or more dependent groups
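The paired samples t test just mentioned can be sketched in a few lines of Python. The before/after values for six subjects are made up for illustration.

```python
# Sketch of a paired-samples t test: the statistic is the mean of the
# per-subject differences divided by its standard error -- made-up data.
from statistics import mean, stdev

before = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]
after  = [5.6, 5.0, 6.4, 5.9, 5.2, 6.1]

diffs = [a - b for a, b in zip(after, before)]  # one difference per subject
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / n ** 0.5)
print(round(t, 2))  # prints 8.7
```

With n - 1 = 5 degrees of freedom the p value would then be read from the t distribution, the repeated measures ANOVA below generalizes this idea to three or more measurements per subject.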
what are the hypotheses for a repeated measures ANOVA the null hypothesis for a repeated measures ANOVA is that there are no differences between the means of the different conditions or time points in other words the null hypothesis assumes that each person has the same value at all times the values of the individual persons themselves may differ but one and the same person always has the same value the alternative hypothesis on the other hand is that there is a difference between the dependent groups in our example the null hypothesis states that the training program has no
influence on physical fitness i.e. that physical fitness does not change over time and the alternative hypothesis assumes that the training program does have an influence i.e. that physical fitness changes over time to correctly apply a repeated measures ANOVA certain assumptions about the data must be fulfilled number one normality the dependent variable should be approximately normally distributed this can be tested using the QQ plot or the Kolmogorov-Smirnov test for more information please watch my video test for normal distribution you can find the link in the video description number two sphericity the variances of the differences
between all combinations of factor levels or time points should be the same this can be tested with the help of Mauchly's test for sphericity if the resulting P value is greater than 0.05 we can assume that the variances are equal and the assumption is not violated in this case the P value is greater than 0.05 and therefore this assumption is fulfilled if the assumption is violated adjustments such as Greenhouse-Geisser or Huynh-Feldt can be made now I'll show you how to calculate and interpret an analysis of variance online with DataTab and then we'll
go through the formulas to explain how to calculate the analysis of variance with repeated measures by hand to calculate an ANOVA online you simply go to datatab.net and copy your own data into this table I use this example data set you can find a link to load this example data set in the video description make sure that your data is structured correctly i.e. one row per participant and one column per condition or time point now we click on the hypothesis test tab at the bottom we see the three variables before in the middle and end from
the data set if we now click on all of them a repeated measures ANOVA is automatically calculated firstly we can check the assumptions here we see that Mauchly's test for sphericity results in a P value of 0.357 this value is greater than 0.05 and thus the assumption is fulfilled if this is not the case you can apply a sphericity correction I will explain how to test the normal distribution in a separate video in our example we will now simply assume a normal distribution if the assumption is not fulfilled you
can simply calculate the non-parametric counterpart to the repeated measures ANOVA the Friedman test this does not require your data to be normally distributed first of all if you do not know exactly how to interpret the individual tables in your analysis you can simply click on summary in words or on AI interpretation for the tables but now back to the results first we see the null and the alternative hypothesis the null hypothesis is that there is no difference between the dependent variables before in the middle and end and the alternative hypothesis is that there
is a difference at the end of the test we can say whether we reject this null hypothesis or not now we see the descriptive statistics and a box plot we then get the results of the ANOVA with repeated measures in this table the P value is the most important value it is 0.01 and indicates the probability that a sample deviates as much or even more from the null hypothesis as our sample with a P value of 0.01 the results are statistically significant at the conventional significance level of 0.05 which means that there are significant differences
between the mean values of the three levels before in the middle and end this rejects the null hypothesis and we assume that there is a difference between the groups and that the training program or therapy has a significant effect if you want an interpretation of the other values in this table simply click on AI interpretation finally here is the table for the Bonferroni post-hoc test since the P value of the analysis of variance is smaller than 0.05 we know that there is a difference between one or more groups with the post-hoc test we
can now determine between which groups this difference exists we see that there is a significant difference between before and end and between middle and end both have a P value of less than 0.05 how do you calculate an analysis of variance with repeated measures by hand let's say this is our data we have five people each of whom we measured at three different points in time now we can calculate the necessary mean values first we calculate the mean value of all the data which is 5.4 then we calculate the mean value of
the three groups for the first group we get a mean value of five for the second a value of 6.1 and for the third a value of 5.1 and finally we can calculate the mean value of the three measurements for each person so for the first person for example we have an average value of eight over the three measurements and for the last person we have an average value of five now that we have all the mean values we need to calculate the required sums of squares but note our goal is the so-called F
value and subsequently calculate a P value from it there are different ways for getting this F value I will demonstrate one common way how to do this depending on which statistics textbook you use you may come across a different formula but back to the calculation let's start with the sum of squares within the subjects we obtain this by calculating each individual value x_mi minus the mean value of the respective subject squaring this and adding it up so we start with 7 minus 8 squared plus 9 minus 8 squared until finally 3 minus 5 squared and 7 minus 5 squared we can then calculate the sum of squares of the treatment i.e. the sum of squares of the three points in time we obtain this by subtracting the total mean value from each group mean value squaring it and adding it up n is the number of people in a group so we get 5 minus 5.4 squared plus 6.1 minus 5.4 squared plus 5.1 minus 5.4 squared now we can calculate the sum of squares of the residual we get this by simply calculating the sum of squares within the subjects minus the sum of squares of
the treatment alternatively we can also use this formula here x_mi is again the value of each individual person A_i is the mean value of the respective group P_m is the mean value of the respective person over the three points in time and G is the total mean value we can then calculate the mean squares to do this we divide the respective sum of squares by the degrees of freedom the mean square of the treatment is therefore calculated by dividing the sum of squares of the treatment by the degrees of freedom of the treatment the degrees
of freedom of the treatment are the number of factor levels minus one so we have three time points minus one which is two the mean square of the residual is obtained in the same way here the degrees of freedom are the number of factor levels minus 1 times the number of subjects minus 1 we get 2 times 7 which is equal to 14 now we calculate the F value which is done by dividing the mean square of the treatment by the mean square of the residual or error finally we calculate the P value using the
F value and the degrees of freedom from the treatment and residual to calculate the P value you can simply go to this page on DataTab the link can be found in the video description here you can enter your values our F value is 1.69 the numerator degrees of freedom i.e. those of the treatment are two and the denominator degrees of freedom i.e. those of the error are 14 we get a P value of 0.22 the P value is greater than 0.05 and therefore we do not have enough evidence to reject the null hypothesis
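The hand calculation just described can be sketched in plain Python. The data set here is made up (four subjects measured at three time points), so the resulting F value is illustrative only, not the one from the video.

```python
# Sketch of the repeated-measures ANOVA formulas above -- made-up data set.
data = [
    [7.0, 9.0, 8.0],   # subject 1: before, middle, end
    [4.0, 6.0, 5.0],   # subject 2
    [5.0, 5.0, 6.0],   # subject 3
    [3.0, 7.0, 5.0],   # subject 4
]
n = len(data)          # number of subjects
k = len(data[0])       # number of time points

grand_mean = sum(sum(row) for row in data) / (n * k)
time_means = [sum(row[i] for row in data) / n for i in range(k)]
subj_means = [sum(row) / k for row in data]

# sums of squares exactly as in the text:
# within subjects, treatment (time), residual = within - treatment
ss_within = sum((x - subj_means[m]) ** 2 for m, row in enumerate(data) for x in row)
ss_treat = n * sum((t - grand_mean) ** 2 for t in time_means)
ss_resid = ss_within - ss_treat

df_treat = k - 1
df_resid = (k - 1) * (n - 1)
f_value = (ss_treat / df_treat) / (ss_resid / df_resid)
print(round(f_value, 2))  # prints 5.44
```

The p value would then be read from the F distribution with (k - 1) and (k - 1)(n - 1) degrees of freedom, as described above.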
of course we can then compare the results with DataTab to do this we copy the data back into this table and click on the variables we can see that we also get a P value of 0.22 here after exploring how repeated measures ANOVA can be used to analyze data we might wonder how to handle even more complex designs this is where mixed model ANOVA comes in let's find out how this powerful tool can help us what is a mixed model ANOVA what are the hypotheses and assumptions and how to interpret the results of
a mixed model ANOVA this is what we discuss in this video let's start with the first question what is a mixed model ANOVA a mixed model ANOVA is a statistical method used to analyze data that involves both between subjects factors and within subjects factors but what are between subjects factors and within subjects factors let's look at an example let's say we want to test whether different diets have an effect on cholesterol levels we would like to compare the three diets a b and c so the factor diet has the three levels a b and
c to test whether there is a difference between the diets we are conducting a study with 18 participants the individual participants are called subjects now we randomly assign six participants to each of the three groups each participant or subject is assigned to only one group in this case we have a between subjects factor different subjects are exposed to different levels of a factor in this analysis our objective is to determine whether significant differences exist in the mean cholesterol levels among the various groups in the study and this is exactly what a one-way ANOVA does
now of course we could also examine the impact of one diet across multiple time points we could measure the cholesterol levels of each participant at the start of the diet after 2 weeks and after 4 weeks so the factor time has the three levels start two weeks and four weeks and in this case the same subjects are being exposed to all levels of the factor and this is called a within subjects factor the same subjects are exposed to all levels or conditions in this case we want to know if there is a difference in the
mean value of the cholesterol levels between the different points in time and this is exactly what a repeated measures ANOVA does therefore in a repeated measures ANOVA we have within subjects factors so in a between subjects design each subject or participant is only assigned to one factor level so that the different subjects only have the influence of the respective group in contrast in the within subjects design the same subjects or participants are exposed to all factor levels which enables a direct comparison of the reactions to each factor level but what if we want to test
if there is a difference between diet a b and c over the different points in time so we want to test if there is a difference between the diets and if there is a difference between the different time points then we need a mixed model ANOVA because we have both one between subjects factor and one within subjects factor so in a mixed model ANOVA we have at least one between subjects factor and at least one within subjects factor in the same analysis note a mixed model ANOVA is also called a two-way ANOVA
with repeated measures because there are two factors and one of them results from repeated measures therefore the mixed model ANOVA tests whether there is a difference between more than two samples which are divided between at least two factors one of the factors is a result of measurement repetition with the help of a mixed model ANOVA you can now answer three things first whether the within subjects factor has an effect on the dependent variable second whether the between subjects factor has an effect on the dependent variable and third whether there is a so-called interaction
effect between the two factors this gives us a good transition to the hypotheses the first null hypothesis is the mean values of the different measurement times do not differ there are no significant differences between the groups of the within subjects factor then of course there is the second the means of the different groups of the between subjects factor do not differ and the third null hypothesis reflects the interaction effect one factor has no effect on the effect of the other factor what are the assumptions of a mixed model ANOVA normality the dependent variable should be
approximately normally distributed within each group of the independent variables this assumption is especially important when the sample size is small when the sample size is large ANOVA is somewhat robust to violations of normality homogeneity of variances the variances in each group should be equal in a mixed model ANOVA this needs to be true for both the within subjects and between subjects factors Levene's test can be used to check this assumption homogeneity of covariances sphericity this applies to the within subjects factors and assumes that the variances of the differences between all combinations of the different
groups are equal what does that mean let's start with the differences between all combinations to do this we simply need to calculate the difference of the first group minus the second the difference of the first group and the third group and the difference of the second group and the third group these calculated differences should now have the same variance this assumption can be tested using Mauchly's test of sphericity when this assumption is violated adjustments to the degrees of freedom such as Greenhouse-Geisser or Huynh-Feldt can be used independence of observations this assumes that the
observations are independent of each other this is a fundamental assumption in ANOVA and is usually assured by the study design no significant outliers outliers can have a disproportionate effect on ANOVA potentially causing misleading results it's important to identify and address outliers let's calculate an example and I'll show you how to interpret the results let's say this is our data that we want to analyze we want to know whether therapy a b and c and three different time points have an effect on cholesterol levels each row is one person the therapy is the between
subjects factor and the time with the levels before the therapy in the middle and at the end of the therapy is the within subjects factor so the first patient on therapy a had a cholesterol level of 165 before therapy a cholesterol level of 145 in the middle and a cholesterol level of 140 at the end let's first calculate the example online with DataTab and then discuss how to interpret the results to calculate an ANOVA online simply go to datatab.net and copy your data into this table you can also load this example data set
using the link in the video description then click on hypothesis testing under this tab you will find a lot of hypothesis tests and depending on which variable you click on you will get an appropriate hypothesis test suggested if you copy your data up here the variables will appear down here if the correct scale level is not automatically detected you can simply change it in the variables view for example if we click on before middle and end a repeated measures ANOVA is automatically calculated but we also want to include the therapy so we just click on
therapy now we get a mixed model ANOVA we can read the three null and the three alternative hypotheses here then we get the descriptive statistics output and here we see the results of the analysis of variance and also the post-hoc test we'll look at these again in detail in a moment but if you don't know exactly how to interpret the results you can also click on summary in words but now back to the results most important in this table are these three rows with these three rows you can check if the three null hypotheses
we stated before are rejected or not the first row tests the hypothesis of whether the cholesterol level changes over time so whether time has an effect on the cholesterol level the second row tests whether there is a difference between the respective therapy forms with respect to the cholesterol level and the last row checks if there is an interaction between the two factors you can read the P value in each case right at the back here let's say we set the significance level at 5% if our calculated P value is less than 0.05 then the respective null hypothesis is rejected
and if the calculated P value is greater than 0.05 then the null hypothesis is not rejected thus we see here that the P value of before middle and end is less than 0.05 and therefore the values at before middle and end are significantly different in terms of cholesterol levels the P value in the second row is greater than 0.05 therefore the types of therapy have no significant influence on the cholesterol level it is important to note that the mean value over the three time points is considered here it could also be that in one therapy
the cholesterol level increases and in the other therapy the cholesterol level decreases but on average over the time points the cholesterol level is the same if this was the case however we would have an interaction between the therapies and the time we test this with the last hypothesis in this case there is no significant interaction between therapy and time so there is an influence over time but it does not matter which therapy is used the therapy has no significant influence if one of the two factors has a significant influence the following two tables
show which of the combinations differ significantly so far we've explored various types of ANOVA and t tests now these are all so-called parametric tests and require certain assumptions about the data like normality but what happens if our data doesn't meet these assumptions this is where nonparametric tests come into play let's compare these two families of tests to understand their differences hi in this video I explain the difference between parametric and non-parametric hypothesis testing why are you interested in this topic you want to calculate a hypothesis test but you don't know exactly what the difference is between
a parametric and a non-parametric test and you're wondering when to use which test if you want to calculate a hypothesis test you must first check the assumptions for the test one of the most common assumptions is that your data is normally distributed in simple terms if your data is normally distributed parametric tests are used such as the t test analysis of variance or Pearson correlation if your data is not normally distributed nonparametric tests are used such as the Mann-Whitney U test or Spearman's correlation what about the other assumptions of course you still need
to check whether there are other assumptions for the test in general however nonparametric tests make fewer assumptions than parametric tests so why use parametric tests at all parametric tests are generally more powerful than non-parametric tests what does that mean here's an example you have formulated your null hypothesis men and women are paid equally whether this null hypothesis is rejected depends on the difference in salary the dispersion of the data and the sample size in a parametric test a smaller difference in salary or a smaller sample is usually sufficient to reject the null hypothesis if possible always
use parametric tests what is the structural difference between parametric and non-parametric tests let's take a look at Pearson correlation and Spearman's rank correlation as well as the t test for independent samples and the Mann-Whitney U test let's start with the Pearson and Spearman correlation the Spearman rank correlation is the nonparametric counterpart to the Pearson correlation what is the difference between the two correlation coefficients Spearman correlation does not use raw data but the ranks of the data let's look at an example we measure the reaction time of eight computer players and ask their age
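The rank-based idea behind Spearman's correlation can be sketched in plain Python. The reaction times and ages below are made up for eight hypothetical players (no ties), the point is that Spearman's coefficient is simply the Pearson coefficient computed on ranks.

```python
# Sketch: Spearman correlation = Pearson correlation of the ranks.
reaction = [12, 15, 17, 18, 21, 23, 26, 30]  # made-up reaction times
age      = [14, 12, 17, 16, 20, 25, 22, 30]  # made-up ages

def ranks(values):
    # rank 1 for the smallest value (this toy data has no ties)
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

spearman = pearson(ranks(reaction), ranks(age))
print(round(spearman, 3))  # prints 0.929
```

For tie-free data this agrees exactly with the textbook shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference per person.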
when we calculate a Pearson correlation we simply take the two variables reaction time and age and calculate the Pearson correlation coefficient however we now want to calculate Spearman's rank correlation so first we assign a rank to each person for reaction time and age the reaction time is already sorted by size 12 is the smallest value so gets rank one 15 the second smallest so gets rank two and so on and so forth we are now doing the same with age here we have the smallest value there the second smallest here the third smallest fourth
smallest and so on and so forth let's take a look at this in a scatter plot here we see the raw data of age and reaction time but now we would like to use the rankings so we form ranks from the variables age and reaction time through this transformation we have now distributed our data more evenly to get Spearman's correlation we simply calculate the Pearson correlation from the ranks so Spearman correlation is equal to Pearson correlation only that the ranks are used instead of the raw values what about the t test for independent samples and the Mann-Whitney U test the t test for independent samples and the Mann-Whitney U test check whether there is a difference between two groups an example is is there a difference between the reaction time of men and women the Mann-Whitney U test is the nonparametric counterpart to the t test for independent samples but there is an important difference between the two tests the t test for independent samples tests whether there is a mean difference for both samples the mean value is calculated and it is tested whether these mean values differ significantly the Mann-Whitney U
test on the other hand checks whether there is a rank sum difference how do we calculate the rank sums for this purpose we sort all persons from the smallest to the largest value this person has the smallest value so gets rank one that person has the second smallest value so gets rank two and this person has the third smallest value and so on and so forth now we have assigned a rank to each person then we can simply add up the ranks of the first group and the second group in the first group we
get a rank sum of 42 and in the second group a rank sum of 36 now we can investigate whether there is a significant difference between these rank sums if you want to know more about the Mann-Whitney U test check out my related video so we can summarize the raw data is used for parametric tests and the ranks of the raw data are used for nonparametric tests the hypothesis test you use usually depends on how many variables you have and whether it is an independent or dependent sample in most cases there is always
a nonparametric counterpart to parametric tests so if you do not meet the assumptions for the parametric test you can use the non-parametric counterpart but don't worry DataTab will do its best to help you choose the right hypothesis test of course you can calculate the most common parametric and nonparametric tests with DataTab online simply copy your own data into the table and your variables will appear here below now click on the variables you want to calculate a hypothesis test for for example if you choose salary and gender a t test will be calculated here
you can check the assumptions if the assumptions are not met you can simply click on nonparametric and a Mann-Whitney U test will be calculated if you click on salary and company an analysis of variance is calculated or in the nonparametric case the Kruskal-Wallis test as we've seen parametric tests rely heavily on the assumption that data are normally distributed this leads us to an essential step in data analysis checking our data for normality before applying parametric tests it's crucial to check this assumption otherwise we would get inaccurate results let's now look into different
methods and statistical tests to check our data for normal distribution in this video I show you how to test your data for normal distribution first of all why do you need normal distribution let's say you've collected data and you want to analyze this data with an appropriate hypothesis test for example a t test or an analysis of variance one of the most common requirements for hypothesis testing is that the data used must be normally distributed data are normally distributed if the frequency distribution of the data has this bell curve now of course the big question
is how do you know if your data is normally distributed or not or how can you test that there are two ways either you can check the normal distribution analytically or graphically we now look at both in detail let's start with the analytical test for normal distribution in order to test your data analytically for normal distribution there are several test procedures the best known are the Kolmogorov-Smirnov test the Shapiro-Wilk test and the Anderson-Darling test with all these tests you test the null hypothesis that the data are normally distributed so the null
hypothesis is that the frequency distribution of your data fits the normal distribution in order to reject or not reject the null hypothesis you get a P value out of all these tests now the big question is whether this P value is greater or less than 0.05 if the P value is less than 0.05 this is interpreted as a significant deviation from the normal distribution and you can assume that your data are not normally distributed if the P value is greater than 0.05 and you want to be statistically clean you cannot necessarily say that the
frequency distribution corresponds to the normal distribution you just cannot disprove the null hypothesis in practice however values greater than 0.05 are assumed to be normally distributed to be on the safe side you should always take a look at the graphical solution which we will talk about in a moment so in summary all these tests give you a P value if this P value is less than 0.05 you assume no normal distribution if it is greater than 0.05 you assume normal distribution for your information with the Kolmogorov-Smirnov test and the Anderson-Darling test
you can also test distributions other than the normal distribution now unfortunately there is a big disadvantage of the analytical methods which is why more and more people are switching to using the graphical methods the problem is that the calculated P value is influenced by the size of the sample therefore if you have a very small sample your P value may be much larger than 0.05 but if you have a very large sample your P value may be smaller than 0.05 let's assume the distribution in your population deviates very slightly from the normal distribution
then if you take a very small sample you will get a very large P value and thus you will assume that it is normally distributed data however if you take a larger sample then the P value becomes smaller and smaller even though the samples come from the same population with the same distribution therefore if you have a minimal deviation from the normal distribution which isn't actually relevant the larger your sample the smaller the P value becomes with a very large sample you may even get a P value smaller than 0.05 and thus reject the
hypothesis that it is a normal distribution to get around this problem graphical methods are being used more and more we'll come to that now if the normal distribution is checked graphically you either look at the histogram or even better at the QQ plot if you use the histogram you plot the normal distribution in the histogram of your data and then you can see whether the distribution of your data roughly corresponds to the normal distribution curve however it is better if you use the so-called quantile quantile plot or QQ plot for short
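To show how the QQ-plot points are formed, here is a minimal Python sketch using the standard library's NormalDist. The sample values are made up, each sorted observation is paired with the quantile a fitted normal distribution would predict for its position.

```python
# Sketch: computing QQ-plot coordinates by hand -- made-up sample values.
from statistics import NormalDist, mean, stdev

sample = [4.9, 5.4, 5.1, 6.2, 5.8, 4.6, 5.0, 5.5, 5.9, 5.3]
xs = sorted(sample)
n = len(xs)

# theoretical quantiles of a normal distribution fitted to the sample,
# evaluated at the plotting positions (i - 0.5) / n
dist = NormalDist(mu=mean(sample), sigma=stdev(sample))
theoretical = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# if the data were perfectly normal, these pairs would lie on a straight line
for t, x in zip(theoretical, xs):
    print(round(t, 2), x)
```

Plotting the pairs (theoretical quantile, observed value) gives exactly the QQ plot described next, the closer the points hug the diagonal line the more plausible the normality assumption.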
here the theoretical quantiles that the data should have if they are perfectly normally distributed and the quantiles of the measured values are compared if the data is perfectly normally distributed all points would lie on the line the more the data deviates from the line the less it is normally distributed in addition DataTab plots the 95% confidence interval if all or almost all of your data lies within this interval it is a very strong indication that your data is normally distributed your data would not be normally distributed if for example they form an arc and
lie far away from the line in some areas if you use DataTab in order to test for normal distribution you get the following evaluation first you get the analytical test procedures clearly arranged in a table then come the graphical test procedures how you can test your data with DataTab for normal distribution I will show you now just copy your data into this table click on descriptive statistics and then select the variable you want to test for normal distribution for example age after that you can simply click on test for normal distribution here
and you will get the results down here I know the test procedures are not actually descriptive methods but if you want to get an overview of your data it's usually also relevant to look at the distribution of your data furthermore if you calculate a hypothesis test for example whether gender has an influence on the salary of a person then you can check the precond conditions for each hypothesis test and you will also get The test for normal distribution if the pr condition is not met you would click on this and a non-parametric test the man
Whitney U test, would be calculated instead. The Mann-Whitney U test does not need normally distributed data. Another important assumption is the equality of variances. You can check whether two or more groups have the same variance using Levene's test, so let's take a look at it. What is Levene's test? Levene's test tests the hypothesis that the variances are equal in different groups; the aim is to determine whether the variances in the groups differ significantly from each other. The hypotheses for Levene's test are as follows: the null hypothesis is that the variances of the groups are equal, and the alternative hypothesis is that at least one of the groups has a different variance. When is Levene's test most commonly used? Levene's test is most often used to test the assumptions of another hypothesis test. What does that mean? Let's say your hypothesis is that there is a difference between two medications in terms of perceived pain relief. To test this hypothesis you've collected data, and to test the hypothesis based on your data you use a hypothesis test such as a t-test. Many hypothesis tests assume that the variances in each group are equal, and this is where Levene's test comes in: it tells us whether this assumption is fulfilled or not. How is Levene's test calculated? Here's an example: we want to know if there is a significant difference in variance between these groups. First we simply calculate the mean of each group, then we subtract the respective group mean from each person's value. The absolute value of each result is taken, so that negative values become positive. From these new values the group mean can now be calculated again: the larger this group mean, the greater the variance within the group. Thus there is a smaller variance in this group than in that one. In addition, we can calculate the overall mean. Now we can calculate the squared deviations of the group means from the overall mean and sum them up, and then we can calculate the squared deviations of the individual values from the respective group means and add them up. We can now compare the two calculated sums, and that is exactly what Levene's test does. The test statistic of Levene's test is obtained with this equation: N is the total number of cases, n_i the number of cases in the i-th group, z-bar_i the mean value of the i-th group, z-bar the overall average, z_ij the respective value in the groups, and k the number of groups. The calculated test statistic L corresponds to an F statistic; therefore, with the F value and the degrees of freedom, the p value can be calculated. The degrees of freedom are given by the number of groups minus one, and the number of cases minus the number of groups. If the p value is greater than 0.05, the null hypothesis that the variances are equal is not rejected; thus equality of variances can be assumed. If you use DataTab and calculate an analysis of variance, you can find Levene's test under test assumptions; in an independent t-test you will find Levene's test at the bottom of the results. If equality of variances is not given, you can use the t-test for unequal variances. Now that we understand the importance of testing for normal distribution, we might find ourselves in situations where these assumptions are not met. In this case we turn to non-parametric methods, which are less sensitive to the distribution of the data. We
will discuss various tests for this purpose: the Mann-Whitney U test, the Kruskal-Wallis test, the Wilcoxon signed-rank test, and the Friedman test. Let's start with the non-parametric counterpart of the t-test for independent samples, the Mann-Whitney U test. What is a Mann-Whitney U test and how is it calculated? That's what we will discuss in this video. Let's start with the first question: what is a Mann-Whitney U test? A Mann-Whitney U test tests whether there is a difference between two independent samples. An example: is there a difference between the reaction times of women and men? But the t-test for independent samples does the same thing; it also tests whether there is a difference between two independent samples. That's right: the Mann-Whitney U test is the non-parametric counterpart of the t-test for independent samples, but there is an important difference between the two tests. The t-test for independent samples tests whether there is a mean difference: for both samples the mean value is calculated, and it is tested whether these mean values differ significantly. The Mann-Whitney U test, on the other hand, checks whether there is a rank-sum difference. How do we calculate the rank sum? For this purpose we sort all persons from the smallest to the largest value: this person has the smallest value, so gets rank one; this person has the second smallest value, so gets rank two; this person has the third smallest value, and so on and so forth. Now that we have assigned a rank to each person, we can simply add up the ranks of the first group and the second group: in the first group we get a rank sum of 42 and in the second group a rank sum of 36. Now we can investigate whether there is a significant difference between these rank sums, but more on that later. The advantage of using the rank sums rather than the difference in means is that the data need not be normally distributed. So, in contrast to the t-test, the data in the Mann-Whitney U test does not have to be normally distributed. What are the hypotheses of the Mann-Whitney U test? The null hypothesis is: in the two samples, the rank sums do not differ significantly. The alternative hypothesis is: in
the two samples, the rank sums do differ significantly. Now let's go through everything with an example. First we calculate the example with DataTab, and then we see if we can get the same results by hand. If you like, you can load the sample data set to follow the example; you can find the link in the video description. We simply go to datatab.net and open the statistics calculator. I've already loaded the data from the link here; you can also copy your own data into this table. Then all you have to do is click on the hypothesis test tab and simply select the desired variables. We measured the reaction time of a group of men and women and want to know if there is a difference in reaction time, so we click on reaction time and gender. We don't want to calculate a t-test for independent samples but a Mann-Whitney U test, so we just click on non-parametric test. Here we see the results of the Mann-Whitney U test. If you're not sure how to interpret the results, just click on summary in words: for the given data, a Mann-Whitney U test showed that the difference between female and male with respect to the dependent variable reaction time was not statistically significant; thus the null hypothesis is not rejected. So now we calculate the Mann-Whitney U test by hand. For this we have put the values in a table: on one side we have gender, with female and male, and on the other side the values for reaction time. Unfortunately the data is not normally distributed, so we cannot use a t-test and we calculate the Mann-Whitney U test instead. First we assign a rank to each value: we pick the smallest value, which is 33, and it gets rank one; the second smallest value is 34, which gets rank two; the third smallest value is 35, which gets rank three. Now we do the same for all other values. So now we have all ranks assigned, and we can just add up all the ranks of the women and all the ranks of the men. The rank sum is abbreviated with T, and we get T1 for female with 2 + 4 + 7 + 9 + 10 + 5, which is 37. Now we do the
same for male: here we get 11 + 1 + 3 + 6 + 8, which is 29. Again, our null hypothesis is that both rank sums are equal, and now we want to calculate the p value. For this, we have already calculated the rank sum for the female participants, and we have a number of cases of six, so we have six female subjects. We can now calculate U1, that is, the U for the female participants, using this formula: U1 = n1 · n2 + n1(n1 + 1)/2 minus the rank sum of the female participants, where n1 and n2 are the numbers of female and male cases. If we insert our values, we get a U1 of 14. We now do exactly the same for the male participants and get a U2 of 16. So now we have calculated U1 and U2; the U for the Mann-Whitney U test is given by the smaller of the two values, so in our case we take the minimum of 14 and 16, which is of course 14. Next we need to calculate the expected value of U, which we get with n1 × n2 / 2; in our case that is 6 × 5 / 2, which is equal to 15. Last but not least we need the standard error of U. The standard error can be calculated with this formula, and in our case it is equal to 5.4772. With all these values we can now calculate z: the z value results from U minus mu_U divided by the standard error. In our case we get (14 − 15) / 5.4772, which is equal to −0.1826. So now we have the z value, and with the z value we can calculate the p value. However, it should be noted that, depending on how large the sample is, the p value for the Mann-Whitney U test is calculated in different ways: for up to 25 cases the exact values are used, which can be read from a table; for larger samples the normal distribution of the U value can be used as an approximation. In our example we would actually use the exact values; nevertheless, we assume a normal distribution here. For this we can simply go to DataTab and calculate the p value for a given z value. The p value of 0.855 is clearly greater than the significance level of 0.05, and thus the null hypothesis cannot be rejected based on this sample. How to calculate the Mann-Whitney U test with tied ranks you can learn in our tutorial on datatab.net; you find the link in the video description. But what if we want to compare two dependent samples and need a non-parametric test? Let's take a look at the Wilcoxon signed-rank test. In this video I will explain the Wilcoxon test to you: we will go through what the Wilcoxon test is, what its assumptions are, and how
it is calculated, and at the end I will show you how you can easily calculate the Wilcoxon test online with DataTab. Let's get started right now. The Wilcoxon test analyzes whether there is a difference between two dependent samples or not. Therefore, if you have two dependent groups and you want to test whether there is a difference between these two groups, then you can use the Wilcoxon test. Now you rightly say: hey, the t-test for dependent samples does the same thing, it also tests whether there's a difference between two dependent groups.
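A quick sketch of such a paired comparison in Python (the morning and evening reaction times below are made up for illustration; `scipy.stats.wilcoxon` ranks the paired differences exactly as described later in this section):

```python
# Sketch: Wilcoxon signed-rank test for two dependent samples
# (hypothetical reaction times of the same seven people, measured
# once in the morning and once in the evening).
from scipy import stats

morning = [45, 33, 38, 40, 42, 30, 44]
evening = [34, 35, 35, 36, 36, 35, 37]

# H0: no difference in the central tendency of the paired samples.
result = stats.wilcoxon(morning, evening)
print(f"W = {result.statistic}, p = {result.pvalue:.5f}")
```

The reported statistic is the smaller of the positive and negative rank sums; with these made-up values the p value is well above 0.05, so the null hypothesis would not be rejected.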
That's correct: the Wilcoxon test is the non-parametric counterpart of the t-test for dependent samples. The special thing about the Wilcoxon test is that your data does not have to be normally distributed. To put it simply: if your data is normally distributed, you use a parametric test, which in the case of two dependent samples is the t-test for dependent samples; if your data is not normally distributed, you use a non-parametric test, which in the case of two dependent samples is the Wilcoxon test. Now of course you could say: hm, well then I'll just always use the Wilcoxon test and I don't even have to check the normal distribution. At the end of this video I will show you why you should always use the t-test if it's possible to do so. First, a little reminder of what dependent samples are: in dependent samples the measured values are always available in pairs; the pairs result, for example, from repeated measures of the same person. But what is the difference now between the t-test for dependent samples and the Wilcoxon test? The t-test for dependent samples tests whether there is a difference in means. If we have a dependent sample, say we took a value from each person once in the morning and once in the evening, then we can calculate the difference from each pair; for example, we would have 45 − 34, which equals 11. The t-test for dependent samples now tests whether these differences differ from zero or not. In the Wilcoxon test we don't use the differences of means; instead we form ranks and then compare these ranks with each other. Three is the smallest value in terms of absolute amount, so it gets rank one; four is the second smallest value and gets rank two; six gets rank three; and 11 gets rank four. We assign a plus to all positive values and a minus to all negative values. But don't worry, we will go through this slowly and we'll also look at an example. Now we come to the assumptions and the hypotheses: in the Wilcoxon test, only two dependent random samples with at least ordinally scaled characteristics need to be present; the variables do not have to satisfy a distribution
curve. What should be mentioned, however, is that the distribution of the differences of the two dependent samples should be approximately symmetric. The null hypothesis in the Wilcoxon test is that there is no difference in the so-called central tendency of the two samples in the population, that is, that there is no difference between the dependent groups. The alternative hypothesis is that there is a difference in the central tendency in the population, so we expect the two dependent groups to be different. So now we finally look at a quite simple example. Let's say you have measured the reaction time of a small group of people once in the morning and once in the evening, and you want to know if there is a difference between morning and evening. In order to do this you measure the reaction time of seven people in the morning and in the evening; the measured values are therefore available in pairs. Now you want to calculate the difference between morning and evening: if the differences were normally distributed you would use a t-test for dependent samples; if not, we would use a Wilcoxon test. Let's just assume that there is no normal distribution and we need to calculate a Wilcoxon test. In order to do this, the first thing we do is form ranks. We look for the smallest value in terms of absolute amount: that is minus 2, which gets rank one. What is the second smallest value? That is three, which gets rank two, and so on and so forth, until we have ranked all the values. The next thing is that we look at the differences and figure out which difference is positive and which is negative: for the negative differences we simply add a minus. Then we can add up the positive ranks and the negative ranks. For the positive ranks we get 7 + 2 + 3 + 4 + 6, which is equal to 22; for the negative ranks we get 5 + 1, which is equal to 6. If there were no difference between morning and evening, the positive and negative rank sums should be approximately equal; therefore the null hypothesis is that both rank sums are equal. But how can we test this? Now we have the sums of ranks, and we use them to calculate the test statistic W. This is simply the minimum of T+ and T−; in our case it is the minimum of 22 and 6, and therefore the test statistic W in our case is 6. Now we can further calculate the value for T, or W, that we would expect if there were no difference between morning and evening: in this case we would get a value of 14. Therefore, if there is no difference between morning and evening, we would actually expect a value for T+ and T− of 14, and thus W would also be 14. Further, we can calculate the standard deviation; this is given by this, to be fair, a little bit complicated formula. Once we are finished with that, we can calculate the z value: the z value is obtained by calculating W minus mu_W and dividing that by the standard deviation. So we compare the value that would be expected if there were no difference with the value that actually occurred. Note that if there are more than 25 cases, a normal distribution is assumed, in which case we can
calculate the z value using this formula. If there are fewer than 25 values, the critical T value is read from a table of critical T values; therefore, in our case we would actually use the table. Now I will show you how you can easily calculate the Wilcoxon test online, and then I will go into why you should always prefer the dependent t-test to the Wilcoxon test if possible. In order to calculate the Wilcoxon test, simply go to datatab.net (you will also find the link in the video description) and copy your own data into this table. Then you click on this tab and you will see the names of all the variables that you copied into the table above. Underneath this tab many hypothesis tests are summarized, and DataTab automatically suggests the appropriate hypothesis test for your data. If you now select morning and evening, DataTab automatically recognizes that it is a dependent sample and calculates the dependent t-test. But we don't want to calculate a t-test, we want to calculate the Wilcoxon test, so we just click here. Now DataTab automatically calculates the Wilcoxon test: here we can read the negative and positive ranks, and here we see the z value and the p value. If you don't know exactly how this is interpreted, just look at the summary in words: it says that the morning group had lower values than the evening group, and a Wilcoxon test showed that this difference was not statistically significant, p = 0.312. And now we come, as promised, to the point of why you should always prefer a parametric test, for example the t-test, to a non-parametric test. We already discussed that the Wilcoxon test has fewer requirements than a t-test, so of course the question may be: why do I use parametric tests like the t-test at all? Parametric tests usually have a greater test power than non-parametric tests. What does that mean? Say you have formulated your null hypothesis, for example: reaction time is the same in the morning and in the evening. Whether the null hypothesis is rejected depends, among other things, on the difference in reaction times and also on the sample size. In a parametric test, usually a smaller difference or a smaller sample is sufficient to reject the
null hypothesis. So if possible, always use a parametric test. And finally we can take a look at the non-parametric counterparts of the ANOVA; let's start with the Kruskal-Wallis test. This tutorial is about the Kruskal-Wallis test: if you want to know what the Kruskal-Wallis test is and how it can be calculated and interpreted, you're at the right place, and at the end of this video I will show you how you can easily calculate the Kruskal-Wallis test online. Let's get started right now. The Kruskal-Wallis test is a hypothesis test that is used when you want to test whether there is a difference between several independent groups. Now you may wonder a little bit and say: hey, if there are several independent groups I use an analysis of variance. That's right, but if your data is not normally distributed and the assumptions for the analysis of variance are not met, the Kruskal-Wallis test is used. The Kruskal-Wallis test is the non-parametric counterpart of the single-factor analysis of variance; I will now show you what that means. There is an important difference between the two tests: the analysis of variance tests whether there is a difference in means, so for our groups we calculate the mean of each group and check if all the means are equal. With the Kruskal-Wallis test, on the other hand, we don't check if the means are equal; we check if the rank sums of all the groups are equal. What does that mean? What is a rank and what is a rank sum? In the Kruskal-Wallis test we do not use the actual measured values; instead we sort all people by size, and then the person with the smallest value gets the new value, or rank, one; the person with the second smallest value gets rank two; the person with the third smallest value gets rank three; and so on and so forth, until each person has been assigned a rank. Now that we have assigned a rank to each person, we can simply add up the ranks of the first group, the ranks of the second group, and the ranks of the third group. In this case we get a rank sum of 42 for the first group, 70 for the second group, and 47 for the third group. The big advantage is that if we do not look at the mean difference but at the rank sums, the data does not have to be normally distributed: when using the Kruskal-Wallis test our data does not have to follow any distributional form, and therefore it also does not need to be normally distributed. Before we discuss how the Kruskal-Wallis test is calculated (and don't worry, it's really not complicated), we first take a look at the assumptions. When do we use the Kruskal-Wallis
test? We use the Kruskal-Wallis test if we have a nominal or ordinal variable with more than two values and a metric variable. A nominal or ordinal variable with more than two values is, for example, the variable preferred newspaper with the values Washington Post, New York Times, USA Today; it could also be frequency of television viewing with daily, several times a week, rarely, never. A metric variable is, for example, salary, well-being, or the weight of people. What are the assumptions now? Only several independent random samples with at least ordinally scaled characteristics must be available; the variables do not have to satisfy any distribution curve. So the null hypothesis is: the independent samples all have the same central tendency and therefore come from the same population, or in other words, there is no difference in the rank sums. And the alternative hypothesis is: at least one of the independent samples does not have the same central tendency as the other samples and therefore comes from a different population, or to say it in other words again, at least one group differs in its rank sum. So the next question is: how do we calculate a Kruskal-Wallis test? It's not difficult. Let's say you have measured the reaction time of three groups, group A, group B, and group C, and now you want to know if there is a difference between the groups in terms of reaction time. Let's say you have written down the measured reaction times in a table, and let's just assume that the data is not normally distributed, so you have to use the Kruskal-Wallis test. Our null hypothesis is then that there is no difference between the groups, and we're going to test that right now. First we assign a rank to each person: this is the smallest value, so this person gets rank one; this is the second smallest value, so this person gets rank two; and we do this now for all people. If the groups have no influence on reaction time, the ranks should actually be distributed purely randomly. In the second step we now calculate the rank sum and the mean rank sum: for the first group the rank sum is 2 + 4 + 7 + 9, which is equal to 22, and we have four people in the group, so the mean rank sum is 22 / 4, which equals 5.5. Now we do the same for the second group: here we get a rank sum of 27 and a mean rank sum of 6.75. And for the third group we get a rank sum of 29 and a mean rank sum of 7.25. Now we can calculate the expected value of the rank sums: the expected value, if there is no difference between the groups, would be that each group has a mean rank of 6.5. We've now almost got everything we need: we interviewed 12 people, so the number of cases
is 12, the expected value of the ranks is 6.5, and we've also calculated the mean rank sums of the individual groups. The degrees of freedom in our case are two; these are simply given by the number of groups minus one, which makes 3 − 1. Last, we need the variance: the variance of the ranks is given by (n² − 1) / 12, where n is again the number of people, so 12, and we get a variance of 11.92. Now we've got everything we need. With these values we can now calculate our test statistic H. The test statistic H corresponds to the chi-square value and is given by this formula, where n is the number of cases, R-bar is the mean rank sum of the individual groups, E is the expected value of the ranks, and sigma squared is the variance of the ranks. In our case the number of cases is 12; we always have four people per group, so we can factor out the four. 5.5 is the mean rank of group A, 6.75 is the mean rank of group B, and 7.25 is the mean rank of group C.
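As a cross-check, the same statistic can be computed directly from the rank sums with the standard rank-sum form of the Kruskal-Wallis formula (a sketch; the rank sums 22, 27, and 29 and the group sizes are taken from this example):

```python
# Sketch: Kruskal-Wallis H from the rank sums of the example
# (rank sums 22, 27, 29; four people per group, 12 people in total).
from scipy.stats import chi2

rank_sums = [22, 27, 29]
group_sizes = [4, 4, 4]
n_total = sum(group_sizes)  # 12

# Standard rank-sum form: H = 12 / (N (N + 1)) * sum(T_i^2 / n_i) - 3 (N + 1)
h = 12 / (n_total * (n_total + 1)) * sum(
    t**2 / n for t, n in zip(rank_sums, group_sizes)
) - 3 * (n_total + 1)

df = len(rank_sums) - 1   # degrees of freedom: number of groups minus one
p_value = chi2.sf(h, df)  # H is compared against a chi-square distribution

print(f"H = {h:.2f}, df = {df}, p = {p_value:.3f}")
```

This reproduces the H of 0.5 and the p value of 0.779 that DataTab reports for this example.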
This gives us a rounded H value of 0.5. As we just said, this value corresponds to the chi-square value, so now we can easily look up the critical chi-square value in the table of critical chi-square values; you can find this table on datatab.net. We have two degrees of freedom, and if we assume a significance level of 0.05, we get a critical chi-square value of 5.991. Our value is of course smaller than the critical chi-square value, and so, based on our example data, the null hypothesis is retained. And now I will show you how you can easily calculate the Kruskal-Wallis test online with DataTab. In order to do this, you simply visit datatab.net (you will find the link in the video description), click on the statistics calculator, and insert your own data into this table. Further, you click on this tab; under this tab you will find many hypothesis tests, and when you select the variables you want to test, DataTab will suggest the appropriate test. After you've copied your data into the table, you will see reaction time and group right here at the bottom. Now we simply click on reaction time and group, and DataTab automatically calculates an analysis of variance for us. But we don't want an analysis of variance, we want the non-parametric test, so we just click here. Now DataTab automatically calculates the Kruskal-Wallis test. We also get a chi-square value of 0.5; the degrees of freedom are two, and the calculated p value is 0.779. Here below you can read the interpretation in words: a Kruskal-Wallis test showed that there is no significant difference between the categories, p = 0.779. Therefore,
with the data used, the null hypothesis is not rejected. If we have three or more dependent samples, we can use the Friedman test as a non-parametric alternative to the repeated measures ANOVA. This video is about the Friedman test, and we start right away with the first question: what is the Friedman test? The Friedman test analyzes whether there are statistically significant differences between three or more dependent samples. What is a dependent sample again? In a dependent sample the measured values are linked. For example, if a sample is drawn of people who have knee surgery, and these people are interviewed before the surgery and one and two weeks after the surgery, it is a dependent sample; this is the case because the same person was interviewed at multiple points in time. Now you might rightly say that the analysis of variance with repeated measures tests exactly the same thing, since it also tests whether there is a difference between three or more dependent samples. That is correct: the Friedman test is the non-parametric counterpart of the analysis of variance with repeated measures. But what is the difference between the two tests? The analysis of variance tests the extent to which the measured values of the dependent sample differ; in the Friedman test, on the other hand, it is not the actual measured values that are used but the ranks. The time point where a person has the highest value gets rank one, the time point with the second highest value gets rank two, and the time point with the smallest value gets rank three. This is now done for all people, so for all rows. Afterwards, the ranks of the single points in time are added up: at the first time point we get a sum of seven, at the second time point a sum of eight, and at the third time point a sum of nine. Now we can check how much these rank sums differ from each other. But why are rank sums used? The big advantage is that if you don't look at the mean differences but at the rank sums, the data doesn't have to be normally distributed. To put it simply: if your data is normally distributed, parametric tests are used, and for more than two dependent samples this is the analysis of variance with repeated measures; if your data is not normally distributed, non-parametric tests are used, and for more than two dependent samples this is the Friedman test. This leads us to the research question that you can answer with the Friedman test: is there a significant difference between more than two dependent groups? Let's have a look at that with an example. You might be interested to know whether therapy after a slipped disc has an influence on the patient's perception of pain. For this purpose you measure the pain perception before the therapy, in the middle of the therapy, and at the end of the therapy, and now you want to know if there is a difference between the different times. So your independent variable is time, or therapy progressing over time, and your dependent variable is the pain perception. You now have a history of the pain perception of each person over time, and you want to know whether the therapy has an influence on the pain perception. Simplified: in this case the therapy has an influence, and in that case the therapy has no influence on the pain perception; in the course of time the pain perception does not change here, but in this case it does.
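A sketch of such a comparison in Python (the pain scores below are made up for illustration; `scipy.stats.friedmanchisquare` takes one array per time point and does the row-wise ranking internally):

```python
# Sketch: Friedman test on hypothetical pain scores of five patients,
# measured before, in the middle of, and at the end of a therapy.
from scipy.stats import friedmanchisquare

before = [8, 7, 9, 6, 8]
middle = [5, 6, 7, 4, 6]
end    = [2, 3, 4, 3, 5]

# H0: no difference in central tendency between the three time points.
stat, p_value = friedmanchisquare(before, middle, end)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

With these made-up scores the pain clearly decreases over time, so the test comes out significant; with real data the decision simply follows the computed p value.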
Now we also have a good transition to the hypotheses. In the Friedman test, the null hypothesis is that there are no significant differences between the dependent groups, and the alternative hypothesis is that there is a significant difference between the dependent groups. Of course, as already mentioned, the Friedman test does not use the true values but the ranks; we will go through the formula behind the Friedman test in a moment. This brings us to the point of how to calculate the Friedman test. For the calculation you can of course simply use DataTab or calculate it by hand. To be honest, hardly anyone will calculate the Friedman test by hand, but it will help you to understand how the Friedman test works, and don't worry, it's not that complicated. First I will show you how to calculate the Friedman test with DataTab, and then I will show you how to do it by hand. In order to do this, simply go to datatab.net and copy your own data into this table. Let's say you want to investigate whether there is a difference in the reaction time of people in the morning, at noon, and in the evening. We simply click on this tab; under this tab you will find many hypothesis tests, and DataTab will automatically suggest an appropriate test. If we click on all three variables, morning, noon, and evening, DataTab will automatically calculate an analysis of variance with repeated measures, but in our case we want to calculate the non-parametric test, so we click here. Now we get the results for the Friedman test: up here you can read the descriptive statistics, and down here you can find the p value. If you don't know exactly how to interpret the p value, you can just read the interpretation in words down here: a Friedman test showed that there is no significant difference between the variables, chi-square = 2.57, p = 0.276. If your p value is greater than your chosen significance level, then your null hypothesis is retained; the null hypothesis is that there is no difference between the groups. Usually a significance level of 0.05 is used, and this p value is greater than 0.05. Additionally, DataTab gives you the post-hoc tests: if your p value is smaller than 0.05, the post-hoc test helps
you to examine which of the groups really differ so now let's look at what the equation behind the Freedman test are and recalculate this example by hand Here we have the measured values of the seven people in the first step we have to assign ranks to the values in order to do this we look at each rows separately in the first row which is the first person 45 is the largest value this gets rank one then comes 36 with rank two and 34 with rank three we do the same for the second row here 36
is the largest value and gets rank one then comes 33 with rank two and 31 with rank three Now we do this for each row so for all people afterwards we can calculate the rank sum for each point in time for example we just sum up all ranks at one point in time in the morning we get 17 at noon 11 and in the evening 14 if there were no differences between the different time points in terms of reaction time we would expect the expected value at all time points the expected value is obtained with
this formula and in this case it is 14 so if There is no difference between morning noon and evening we would actually expect a ranksum of 14 at all three time points next we can calculate the kai Square value we get it with this formula n is the number of people which is which is seven K is the number of time points so three and the sum of R SAR is 17 squared + 11 SAR + 14 squar so we get a Kai Square value of 2.57 now we need the number of degrees of freedom
This is given by the number of time points minus one, so in our case two. Finally, we can read off the critical chi-square value in the table of critical chi-square values. For this we take the predefined significance level, let's say 0.05, and the number of degrees of freedom; here we can read that the critical chi-square value is 5.99. This is greater than our calculated value, therefore the null hypothesis is not rejected, and based on these data there is no difference in responsiveness between the different time points. If the calculated chi-square value were greater than the critical one, we would reject the null hypothesis. Having explored various nonparametric tests for ordinal or non-normally distributed metric data, let's now shift our focus to nominal data. When our variables are nominal, such as gender or color preference, we require a different kind of statistical test: the chi-square test. The chi-square test is a powerful tool for analyzing nominal data, so let's get started. What is a chi-square
test, and how is the chi-square test calculated? That's what we will discuss in this video. Let's start with the first question: what is a chi-square test? The chi-square test is a hypothesis test that is used when you want to determine whether there is a relationship between two categorical variables. What are categorical variables again? Categorical variables are, for example, gender with the categories male and female, the preferred newspaper with the categories USA Today, The Wall Street Journal, The New York Times and so on, or the highest educational level with the categories without graduation, college, bachelor's degree and master's degree. So gender, preferred newspaper and highest educational level are all categorical variables. Examples of non-categorical variables are the weight of a person, the salary of a person or the power consumption. If we now have two categorical variables and we want to test whether there is a relationship, we use a chi-square test. For example: is there a relationship between gender and the preferred newspaper? We have two categorical variables, so we use a chi-square test. Another example: is there a relationship between preferred newspaper and highest educational level? Here again we have two categorical variables, so we use a chi-square test. However, there are two things to note. First, the assumption for the chi-square test is that the expected frequencies per cell are greater than five; we'll go over what that means in a moment. Second, the chi-square test uses only the categories and not the rankings. In the case of the highest educational level, however, there is a ranking of the categories; if you want to account for rankings, check out our tutorials on the Spearman correlation, the Mann-Whitney U test or the Kruskal-Wallis test. But how do we calculate the chi-square test? Let's go through that with an example. We would like to investigate whether gender has an influence on the preferred newspaper, so our question is: is there a relationship between gender and the preferred newspaper? Our null hypothesis is that there is no relationship between gender and the preferred newspaper, and our alternative hypothesis is that there is a relationship between gender and the preferred newspaper. So first we create a questionnaire that asks about gender and the preferred newspaper, and we then send out the questionnaire. The results of the survey are displayed in a table.
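A raw results table like this, with one respondent per row, can also be tallied directly in Python using only the standard library; a minimal sketch with made-up respondents (these are not the survey data from the video):

```python
from collections import Counter

# Made-up survey rows for illustration: (gender, preferred newspaper) per respondent.
respondents = [
    ("male", "New York Post"), ("female", "USA Today"),
    ("male", "New York Post"), ("female", "New York Post"),
    ("male", "USA Today"),
]

# Count how often each (gender, newspaper) combination occurs;
# these counts are exactly the cells of a contingency table.
counts = Counter(respondents)
print(counts[("male", "New York Post")])  # 2
```

Statistics software does the same tallying for you, but seeing it spelled out makes clear that a contingency table is nothing more than combination counts.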
In this table we see one respondent in each row: the first respondent is male and stated New York Post, the second respondent is female and stated USA Today. We can now copy this table into a statistics software like DataTab. DataTab then gives us the so-called contingency table. In this table you can see the variable newspaper and the variable gender, and the number of times each combination occurs is shown in the cells; for example, in this survey there are 16 people who stated New York Post and male, or 13 people who stated female and New York Post. Now we want to know whether gender has an influence on the preferred newspaper or, put another way, whether there is a relationship between gender and the preferred newspaper. To answer this question we use the chi-square test. There are two ways we can calculate the chi-square test: either we use a statistical software like DataTab, or we calculate the chi-square test by hand. We start with the uncomplicated variant and use DataTab. If you like, you can load the sample data set for the calculation; you can find the link in the video description. To calculate a chi-square test online, simply copy your own data into this table or use the link to load this data set; the variables gender and newspaper then appear down here. Now we click on hypothesis tests. Here you will find a variety of tests, and DataTab will help you to choose the right one. For example, if we click on gender and newspaper, the chi-square test will be calculated automatically. Now we get the results for the chi-square test. Above, we see the contingency table for the variables gender and newspaper. The contingency table shows us how often the respective combinations occur in our survey; female and USA Today, for example, occurs six times. In the second table we can see what the contingency table should look like if the two variables were perfectly independent, that is, if gender had no influence on the preferred newspaper. Here it is important to note that all of the expected frequencies should be larger than five so that the assumptions of the chi-square test are fulfilled, which is the case here. The chi-square test now compares this table with that table.
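This comparison of the observed table with the expected table is straightforward to reproduce in code; here is a sketch using a small made-up 2×2 table (not the survey data from the video), computing the expected frequencies under independence and the chi-square statistic:

```python
# Observed frequencies for a hypothetical 2x2 contingency table.
observed = [[10, 20],
            [30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency per cell under independence: row total * column total / N.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print(round(chi2, 2))  # 0.79 for this made-up table
```

This is exactly what a library routine such as `scipy.stats.chi2_contingency` does internally for the video's table.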
And here we see the results: the p-value is 0.918, which is much higher than our significance level of 0.05, and therefore we keep the null hypothesis. If you don't know exactly how to interpret the results, just click on "summary in words": a chi-square test was performed between gender and newspaper; all expected cell frequencies were greater than five, thus the assumptions for the chi-square test were met; there was no statistically significant relationship between gender and newspaper; this results in a p-value of 0.918, which is above the defined significance level of 5%; the chi-square test is therefore not significant and the null hypothesis is not rejected. If you're unsure what exactly the p-value means, just watch our video about the p-value. And now we come to the question of how to calculate the chi-square test by hand, and we go through the formulas needed; don't worry, it's not difficult. We need the contingency table with the observed frequencies and the contingency table with the expected frequencies, that is, the frequencies that would occur with perfectly independent variables; you can find how to calculate the expected frequencies on DataTab in the tutorial on the chi-square test. We can now calculate the chi-square value with this formula, where the index k stands for the respective cell, o_k is the observed frequency and e_k is the expected frequency. So we get (6 − 6.08) squared divided by 6.08, plus the next cell, (7 − 6.92) squared divided by 6.92; if we do this for all cells and sum them up, we get a chi-square value of 0.504. Now we would like to calculate the critical chi-square value. What do we need it for? If we use a statistical software, we simply get a p-value displayed; if this value is smaller than the significance level, for example 0.05, the null hypothesis is rejected, otherwise not. In our example the null hypothesis is not rejected. By hand, however, you can't really calculate the p-value; therefore you read off in a table which chi-square value you would get at a p-value of 0.05. This chi-square value is called the critical chi-square value. In order to find the critical chi-square value, we need the degrees of freedom.
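If Python is available, you don't actually need a printed table: scipy can produce both the critical value and the exact p-value. A sketch, assuming the chi-square value of 0.504 computed above and the three degrees of freedom that this example turns out to have:

```python
from scipy.stats import chi2

df = 3                   # degrees of freedom for the example table
alpha = 0.05             # significance level
chi2_calculated = 0.504  # chi-square value computed by hand above

critical = chi2.ppf(1 - alpha, df)      # critical chi-square value
p_value = chi2.sf(chi2_calculated, df)  # p-value of the calculated statistic

print(round(critical, 2))  # 7.81
print(p_value > alpha)     # True -> null hypothesis retained
```

`ppf` is the inverse of the cumulative distribution (the "table lookup"), and `sf` is the survival function, i.e. the upper-tail probability that gives the p-value.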
These are obtained by taking the number of rows minus one times the number of columns minus one. We have four rows and two columns, therefore we get 3 × 1 and thus three degrees of freedom. Now let's take a look at the table of critical chi-square values; you can find this table on DataTab, the link is in the video description. We select a significance level of 0.05 and have three degrees of freedom, therefore our critical chi-square value is 7.815. The critical chi-square value of 7.815 is larger than our calculated chi-square value of 0.504, thus the null hypothesis is retained. Up until now we focused on statistical tests designed to compare groups or categories; another fundamental aspect of data analysis is understanding the relationship between variables, and this is where correlation analysis comes into play. Let's transition from tests of differences to measures of association in the following video. This video is about correlation analysis. We start by asking what a correlation analysis is; we will then look at the most important correlation analyses, Pearson correlation, Spearman correlation, Kendall's tau and point-biserial correlation, and finally we will discuss
the difference between correlation and causation. Let's start with the first question: what is a correlation analysis? Correlation analysis is a statistical method used to measure the relationship between two variables. For example, is there a relationship between a person's salary and age? In this scatter plot, every single point is a person. In correlation analysis we usually want to know two things: number one, how strong the correlation is, and number two, in which direction the correlation goes. We can read both from the correlation coefficient, which lies between −1 and 1. The strength of the correlation can be read from a table: if r is between 0 and 0.1, we speak of no correlation; if r is between 0.7 and 1, we speak of a very strong correlation. A positive correlation exists when high values of one variable go along with high values of the other variable, or when small values of one variable go along with small values of the other variable. A positive correlation is found, for example, for body size and shoe size; the result is a positive correlation coefficient. A negative correlation exists when high values of one variable go along with low values of the other variable and vice versa. A negative correlation usually exists between product price and sales volume; the result is a negative correlation coefficient. Now, there are different correlation coefficients; the most popular are the Pearson correlation coefficient r, the Spearman correlation coefficient r_s, Kendall's tau and the point-biserial correlation coefficient r_pb. Let's start with the first, the Pearson correlation coefficient. What is a Pearson correlation? Like all correlation coefficients, the Pearson correlation coefficient r is a statistical measure that quantifies the relationship between two variables. In the case of Pearson correlation, the linear relationship of metric
variables is measured; more about metric variables later. So with the help of the Pearson correlation we can measure the linear relationship between two variables, and of course the Pearson correlation coefficient r tells us how strong the correlation is and in which direction it goes. How is the Pearson correlation calculated? The Pearson correlation coefficient is obtained via this equation, r = Σ(x_i − x̄)(y_i − ȳ) / √(Σ(x_i − x̄)² · Σ(y_i − ȳ)²), where r is the Pearson correlation coefficient, x_i are the individual values of one variable, for example age, y_i are the individual values of the other variable, for example salary, and x̄ and ȳ are the respective mean values of the two variables. In the equation we can see that the respective mean value is first subtracted from both values. So in our example we calculate the mean values of age and salary, we then subtract the mean values from each person's age and salary, then we multiply both values, and we sum up the individual results of the multiplications. The expression in the denominator ensures that the correlation coefficient is scaled between −1 and 1. If we multiply two positive values, we get a positive value, so all values that lie in this area have a positive influence on the correlation coefficient. If we multiply two negative values, we also get a positive value, minus times minus is plus, so all values that lie in this area also have a positive influence on the correlation coefficient. If we multiply a positive value and a negative value, we get a negative value, minus times plus is minus, so all values that lie in these ranges have a negative influence on the correlation coefficient. Therefore, if our values lie predominantly in these two areas, we get a positive correlation coefficient and thus a positive relationship; if our values lie predominantly in these two areas, we get a negative correlation coefficient and thus a negative relationship. If the points are distributed over all four areas, the positive and the negative terms cancel each other out and we get a very small or no correlation. But there is one more thing to consider: the correlation coefficient is usually calculated with data taken from a sample; however, we often want to test a hypothesis about a population. In the case of correlation analysis we then want to know whether there is a correlation in
the population. For this we check whether the correlation coefficient in the sample is statistically significantly different from zero. The null hypothesis in the Pearson correlation is: the correlation coefficient does not differ significantly from zero, there is no linear relationship. And the alternative hypothesis is: the correlation coefficient differs significantly from zero, there is a linear relationship. Attention: it is always tested whether the null hypothesis is rejected or not. In our example the research question is: is there a correlation between age and salary in the British population? To find out, we draw a sample and test whether in this sample the correlation coefficient is significantly different from zero. The null hypothesis then is: there is no correlation between salary and age in the British population. And the alternative hypothesis: there is a correlation between salary and age in the British population. Whether the correlation coefficient is significantly different from zero based on the collected sample can be checked with a t-test, t = r·√(n − 2)/√(1 − r²), where r is the correlation coefficient and n is the sample size. A p-value can then be calculated from the test statistic t; if the p-value is smaller than the specified significance level, which is usually 5%, the null hypothesis is rejected, otherwise it is not. But what about the assumptions for a Pearson correlation? Here we must distinguish whether we just want to calculate the Pearson correlation or whether we want to test a hypothesis. To calculate the Pearson correlation coefficient, only two metric variables need to be present; metric variables are, for example, a person's weight, a person's salary or the electricity consumption. The Pearson correlation coefficient then tells us how large the linear relationship is; if there is a nonlinear relationship, we cannot tell this from the Pearson correlation coefficient. However, if we want to test whether the Pearson correlation coefficient is significantly different from zero, the two variables must be normally distributed; if this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably. Let's continue with the Spearman correlation. The Spearman rank correlation is the nonparametric counterpart of the Pearson correlation, but there is an important difference between the two correlation coefficients: the Spearman correlation does not use the raw data but the ranks of the data. Let's look at this with an example. We measure the reaction time of eight
computer players and ask their age. When we calculate a Pearson correlation, we simply take the two variables reaction time and age and calculate the Pearson correlation coefficient. However, we now want to calculate the Spearman rank correlation, so first we assign a rank to each person for reaction time and age. The reaction time is already sorted by size: 12 is the smallest value, so it gets rank one; 15 is the second smallest value, so it gets rank two, and so on and so forth. We now do the same with age: here we have the smallest value, there the second smallest, there the third smallest, the fourth smallest, and so on. Let's take a look at this in the scatter plot. Here we see the raw data of age and reaction time, but now we would like to use the rankings, so we form ranks from the variables age and reaction time; through this transformation we have now distributed the data more evenly. To calculate the Spearman correlation, we simply calculate the Pearson correlation from the ranks; so the Spearman correlation is equal to the Pearson correlation, except that the ranks are used instead of the raw values. Let's have a quick look at that in DataTab. Here we have reaction time and age, and there we have the just-created ranks of reaction time and age. Now we can either calculate the Spearman correlation of reaction time and age, where we get a correlation of 0.9, or we can calculate the Pearson correlation from the ranks, where we also get 0.9, so exactly the same as before. If you like, you can download the data set; you can find the link in the video description. If there are no rank ties, we can also use this equation to calculate the Spearman correlation, r_s = 1 − 6·Σd_i²/(n·(n² − 1)), where r_s is the Spearman correlation, n is the number of cases and d_i is the difference in ranks between the two variables. Referring to our example, we get the individual d's like this: 1 − 1 = 0, 2 − 3 = −1, 3 − 2 = 1, and so on. Now we square the individual d's and add them all up, so the sum of d_i squared is 8; n, which is the number of people, is eight. If we put everything in, we get a correlation coefficient of 0.9. Just like the Pearson correlation coefficient r, the Spearman correlation coefficient r_s also varies between −1 and 1. Let's continue with Kendall's tau. Kendall's tau is a correlation coefficient and is thus a measure of the relationship between two variables. But what is the difference between the Pearson correlation and Kendall's rank correlation? In contrast to the Pearson correlation, Kendall's rank correlation is a nonparametric test procedure; thus, for the calculation of Kendall's tau, the data need not be normally distributed and the variables need only have ordinal scale levels. Exactly the same is true for the Spearman
rank correlation, right? That's right, Kendall's tau is very similar to Spearman's rank correlation coefficient; however, Kendall's tau should be preferred over the Spearman correlation if only very few data with many rank ties are available. But how is Kendall's tau calculated? We can calculate Kendall's tau with this formula, tau = (C − D)/(C + D), where C is the number of concordant pairs and D is the number of discordant pairs. What are concordant and discordant pairs? We will now go through this with an example. Suppose two doctors are asked to rank six patients according to their physical health. One of the two doctors is defined as the reference, and the patients are sorted from one to six. Now the sorted ranks are matched with the ranks of the other doctor, e.g. the patient who is in third place with the reference doctor is in fourth place with the other doctor. Using Kendall's tau, we want to know whether there is a correlation between the two rankings. For the calculation of Kendall's tau we only need these ranks. We now look at each individual rank and note whether the values below it are smaller or greater than the rank itself. So we start at the first rank, three: one is smaller than three, so it gets a minus; four is greater, so it gets a plus; two is smaller, so it gets a minus; six is greater, so it gets a plus; and five is also greater, so it also gets a plus. Now we do the same for one: here, of course, each subsequent rank is greater than one, so we have a plus everywhere. At rank four, two is smaller and six and five are greater. Now we do this for rank two and rank six, and then we can easily calculate the number of concordant and discordant pairs. We get the number of concordant pairs by counting all the pluses; in our example we have 11 pluses in total. We get the number of discordant pairs by counting all the minuses; in our example we have a total of 4 minuses. C is thus 11 and D is 4. Kendall's tau now is (11 − 4) divided by (11 + 4), and we get a Kendall's tau of 0.47. There is an alternative formula for Kendall's tau, tau = 2S/(n·(n − 1)), where S is C minus D, therefore 7, and n is the number of cases, i.e. 6; if we insert everything, we also get 7 divided by 15. Just like the Pearson correlation coefficient r, Kendall's tau also varies between −1 and +1. We have again calculated the correlation coefficient using data from a sample; now we can test whether the correlation coefficient is significantly different from zero. Thus the null hypothesis is: the correlation coefficient tau is equal to zero, there is no relationship. And the alternative hypothesis is: the correlation coefficient tau is unequal to zero, there is a relationship. Therefore we want to know whether the correlation coefficient is significantly different from zero.
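Before we look at the significance test, note that the plus/minus counting described above is easy to automate; here is a small sketch (not code from the video) that recounts the concordant and discordant pairs for the two doctors' rankings:

```python
# Ranks of the second doctor, after sorting patients by the reference doctor.
other = [3, 1, 4, 2, 6, 5]

concordant = discordant = 0
for i in range(len(other)):
    for j in range(i + 1, len(other)):
        if other[j] > other[i]:    # a "plus": the pair is in the same order
            concordant += 1
        elif other[j] < other[i]:  # a "minus": the pair is in reversed order
            discordant += 1

tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, round(tau, 2))  # 11 4 0.47
```

With no rank ties this matches `scipy.stats.kendalltau` on the same two rankings.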
You can analyze this either by hand or with a software like DataTab. For the calculation by hand we can use the z-distribution as an approximation; however, for this we should have at least 40 cases, so the six cases from our example are actually too few. We get the z-value with this formula, where we have tau and n, the number of cases. This brings us to the last correlation analysis, the point-biserial correlation. The point-biserial correlation is a special case of the Pearson correlation and examines the relationship between a dichotomous variable and a metric variable. What is a dichotomous variable and what is a metric variable? A dichotomous variable is a variable with two values, for example gender with male and female, or smoking status with smoker and non-smoker. A metric variable is, for example, the weight of a person, the salary of a person or the electricity consumption. So if we have a dichotomous variable and a metric variable and we want to know whether there is a relationship, we can use a point-biserial correlation; of course we need to check the assumptions beforehand, but more about that later. How is the point-biserial correlation calculated? As stated at the beginning, the point-biserial correlation is a special case of the Pearson correlation. But how can we calculate the Pearson correlation when one variable is nominal? Let's look at this with an example. Let's say we are interested in investigating the relationship between the number of hours studied for a test and the test result, passed or failed. We've collected data from a sample of 20 students, where 12 students passed the test and 8 students failed, and we have recorded the number of hours each student studied for the test. To calculate the point-biserial correlation, we first need to convert the test result into numbers: we can assign a score of one to students who passed the test and a score of zero to students who failed the test. Now we can either calculate the Pearson correlation of study time and test result, or we use the equation for the point-biserial correlation, where x̄1 is the mean value of the people who passed and x̄2 is the mean value of the people who failed, n1 is the number of people who passed, n2 is the number of people who failed, and n is the total number. But whether we calculate the Pearson correlation or use the equation for the point-biserial correlation, we get the same result both times. Let's take a quick look at this in DataTab. Here we have the learning hours and the test result as passed and failed, and there the test result coded with zero and one; we define the test result with zero and one as metric. If we now go to correlation and calculate the Pearson correlation for these two variables,
we get a correlation coefficient of 0.31. If we calculate the point-biserial correlation for learning hours and the exam result with passed and failed, we also get a correlation of 0.31. Just like the Pearson correlation coefficient r, the point-biserial correlation coefficient r_pb also varies between −1 and 1. If we have a coefficient between −1 and less than zero, there is a negative correlation, thus a negative relationship between the variables; if we have a coefficient between greater than zero and one, there is a positive correlation, that is, a positive relationship between the two variables; if the result is zero, we have no correlation. As always, with the point-biserial correlation we can also check whether the correlation coefficient is significantly different from zero. Thus the null hypothesis is: the correlation coefficient r is equal to zero, there is no relationship. And the alternative hypothesis is: the correlation coefficient r is unequal to zero, there is a relationship. Before we get to the assumptions, here's an interesting note: when we compute a point-biserial correlation, we get the same p-value as when we compute a t-test for independent samples on the same data. So whether we test a correlation hypothesis with the point-biserial correlation or a difference hypothesis with the t-test, we get the same p-value. If we compute a t-test in DataTab with these data and the null hypothesis "there is no difference between the groups failed and passed in terms of the variable learning hours", we get a p-value of 0.179; and if we calculate a point-biserial correlation with the null hypothesis "there is no correlation between learning hours and test result", we also get a p-value of 0.179. In our example the p-value is greater than 0.05, which is most often used as the significance level, and thus the null hypothesis is not rejected. But what about the assumptions for a point-biserial correlation? Here we must distinguish whether we just want to calculate the correlation coefficient or whether we want to test a hypothesis. To calculate the correlation coefficient, only one metric variable and one dichotomous variable must be present; however, if we want to test whether the correlation coefficient is significantly different from zero, the metric variable
must also be normally distributed; if this is not given, the calculated test statistic t or the p-value cannot be interpreted reliably. This brings us to the last question: what is causality, and what is the difference between causality and correlation? Causality is the relationship between a cause and an effect: in a causal relationship we have a cause and a resulting effect. An example: coffee contains caffeine, a stimulating substance. When you drink coffee, the caffeine enters the body, affects the central nervous system and leads to increased alertness. Drinking coffee is the cause of the feeling of alertness that comes afterwards; without drinking coffee, the effect, i.e. the feeling of alertness, would not occur. But causality is not always so easy to determine; clear requirements must be met in order to speak of a causal relationship, but more about that later. So what is the difference between correlation and causality? A correlation tells us that there is a relationship between two variables. Example: there is a positive correlation between ice cream sales and the number of sunburns. However, an existing correlation cannot tell us which variable influences which, or whether a third variable is responsible for the correlation. In our example, both variables are influenced by a common cause, namely sunny weather: on sunny days people buy more ice cream and spend more time outdoors, which can lead to an increased risk of sunburn. Causality means that there is a clear cause-effect relationship between two variables; causality exists when you can say with certainty which variable influences which. However, a common mistake in the interpretation of statistics is that a correlation is immediately assumed to be a causal relationship. Here is an example: the American statistician Darrell Huff found a negative correlation between the number of head lice and the body temperature of the inhabitants of an island. A negative correlation means that people with many head lice generally have a lower body temperature and people with few head lice generally have a higher body temperature. The islanders concluded that head lice were good for health because they reduced fever; their assumption was that head lice have an effect on the temperature of the body. In reality, the correct conclusion is the other way around: in an experiment it was possible to show that high fever drives away the lice, so the high body temperature is the cause, not the effect. What are the conditions for talking about causality? There are two conditions for causality. Number one: there is a significant correlation between the variables; this is easy to check, we simply check whether the correlation coefficient is significantly different from zero. Number two: the second condition can be met in three ways. First, chronological sequence: there is a chronological sequence and the values of one variable occurred before the values of the other variable. Second, experiment: a controlled experiment was conducted in which the two variables can be specifically influenced. And third, theory: there is a well-founded and plausible theory about the direction in which the causal relationship goes. If there is only a significant correlation but none of these three criteria is met, we can only speak of correlation, never of causality. After examining how correlation analysis helps us determine the extent to which variables are related, we now move on to the field of regression. First we'll start with an overview of regression analysis, where we will break down the fundamentals and explore its real-world applications. Next we'll dive into simple linear regression, where you will learn how to model
the relationship between two variables. Then we will move on to multiple linear regression, where we extend the model to include multiple predictors, making our predictions more powerful. And finally we'll cover logistic regression, which is essential when working with categorical outcome variables, like predicting whether something will happen or not. So let's get started with the first question: what is a regression analysis? A regression analysis is a method for modeling relationships between variables; it makes it possible to infer or predict a variable based on one or more other variables. Let's say you want to find out what influences a person's salary. For example, you could take the highest level of education, the weekly working hours and the age of a person. You could now investigate whether these three variables have an influence on the salary of a person; if they do, you can predict a person's salary from the highest level of education, the weekly working hours and the person's age. The variable we want to infer or predict is called the dependent variable; the variables used for prediction are called independent variables. Depending on your field, independent variables may also be called predictor variables or input variables, while the dependent variable might be referred to as the response, output or target variable. Okay, but when do we use a regression analysis? Regression analysis can be used to achieve two goals: you can measure the influence of one or several variables on another variable, or you can predict a variable based on other variables. Let's go through some examples, starting with measuring the influence of one or more variables on another. In the context of your research, you may be interested in understanding the factors that influence children's ability to concentrate; specifically, you aim to determine whether certain parameters have a positive or negative impact on their concentration, but you are not interested in predicting children's ability to concentrate. Or you could investigate whether the educational level of the parents and the place of residence have an influence on the future educational level of children. This area is therefore very research-based and has many applications in the social and economic sciences. The second area, using regression for predictions, is more application-oriented. To get the most out of hospital occupancy, you might be interested in how long a patient will stay in the hospital: based on the characteristics of the prospective patient, such as age, reason for stay and pre-existing conditions, you want to know how long that person is likely to stay in the hospital, and based on this prediction, bed planning can then be optimized. Or, as the operator of an online store, you are very interested in which product a person is most likely to buy, and you want to suggest this product to the visitor in order to increase the sales of the online store. This second area is highly application-oriented, focusing on making predictions to enhance efficiency.
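As a small preview of what follows, here is a minimal sketch in plain Python (all numbers made up) of the kind of prediction a simple linear regression performs, fitting a straight line by ordinary least squares:

```python
# Made-up data: patient age vs. length of hospital stay in days.
ages = [30, 40, 50]
stays = [3, 5, 7]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(stays) / n

# Ordinary least squares: slope = covariance of x and y / variance of x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, stays)) \
        / sum((x - mean_x) ** 2 for x in ages)
intercept = mean_y - slope * mean_x

# Predict the stay for a hypothetical 45-year-old patient.
predicted = intercept + slope * 45
print(round(predicted, 1))  # about 6.0 days
```

The upcoming videos explain where these slope and intercept formulas come from and how to judge how well such a line fits.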
okay great now there are different types of regression analysis there is the simple linear multiple linear and logistic regression in simple linear regression we use just one independent variable to predict the dependent variable for example if we want to predict a person's salary we use only one variable either if a person has studied or not the weekly working hours or the age of a person multiple linear regression on the other hand uses several independent variables to predict or infer the dependent variable i.e. the highest level of education the number of hours worked per week and the age of the person therefore the difference between a simple and a multiple regression is that in one case only one independent variable is used and in the other case several both have in common that the dependent variable is metric metric variables are for example the salary of a person the body size or the electricity consumption in contrast logistic regression is used when you have a categorical dependent variable categorical variables are for example whether a person is at risk of burnout or not whether a person is diseased or not or the type of animal however the most common form of logistic regression is the so-called binary logistic regression in this case the outcome variable is binary meaning it has two possible values like yes or no success or failure or diseased and not diseased therefore in linear regression the dependent variable is a metric variable while in logistic regression it is a categorical variable also known as a nominal variable but what about the independent variables in all cases the level of measurement of the independent variables can be nominal ordinal or metric okay one note in regression you can only directly use categorical variables with two categories
or levels such as gender with male and female in this case we can code one category with zero and the other with one however if a variable has more than two categories like vehicle type there's an easy solution we create dummy variables don't worry we'll explain more about dummy variables later in this playlist okay a quick recap there is the simple linear regression a question could be does the weekly working time have an impact on the hourly wage of employees here we only have one independent variable there is the multiple linear regression do the weekly working hours and the age of employees have an influence on the hourly wage here we have at least two independent variables in this case weekly working hours and age and the last case logistic regression do the weekly working hours and the age of employees have an influence on the probability that they are at risk of burnout where burnout at risk has the categories yes or no now that we've covered the basics of regression analysis and its applications let's take a closer look at the simple linear regression this technique helps us to understand the relationship between
two variables allowing us to predict one based on the other in the next section we'll break down how it works go through a real example and show you how to interpret the results let's get started what is a simple linear regression simple linear regression is a method to understand the relationship between two variables you can infer or predict a variable based on another variable for example you can predict the annual salary of a person based on the years of work experience so a simple linear regression can help us understand how salary changes with an increase in years of experience the variable we want to infer or predict is called the dependent variable the variable we use for prediction is called the independent variable let's look at an example imagine we want to predict house prices the dependent variable the one we want to predict is of course the price of the house the independent variable the one we use to make the prediction could be the size of the house in square feet of course you can also use more than one independent variable for example the construction year or features like whether the house has a swimming pool the number of bathrooms and the house size but in this case it would be a multiple linear regression because you have more than one independent variable more on that in my video about multiple linear regression okay but how do we calculate a simple linear regression first of all we need data so we collect information from 10 houses including their size in square feet and the price they were sold for now we can use this data to calculate our regression model here y is the dependent variable house price and x is
the independent variable house size and we want to use our data to determine the coefficients b and a but how do we do that let's visualize our data using a scatter plot on the x-axis we plot our independent variable the house size and on the y-axis we plot the dependent variable the house price each point is therefore one house with the respective house size and house price okay now we want to summarize all this data using a simple linear regression to do this we draw a straight line through the points on the scatter plot but the line we draw isn't just any random line it's the line that tries to minimize the error or the distance between the actual data points and the line itself if we add up the length of all the red lines we get the total error our goal is to find the regression line that minimizes this error but how do we actually calculate this line this is where the equation of the linear regression comes into play in the equation b is the slope of the line the slope shows how much the house price changes if the house size increases by one square foot a is the y-intercept telling us where the line crosses the y-axis so if we have a house with a size of zero the model will predict a house price of a of course predicting the price of a house with zero size doesn't make sense however every model is a simplification of the real world and in the case of simple linear regression our model is defined by a regression line with a specific slope b and an intercept a let's look at this example
in that case our intercept is 100 so we enter 100 for a but how do we read the slope for this we take a one unit step in the independent variable for example if we move from one to two then we observe how much the dependent variable changes with this one unit increase in this case if the independent variable increases by one unit the dependent variable increases by 50 units so our b is 50 okay but how do we calculate b and a there are two ways to do this we can calculate them by hand
or using statistical software like DATAtab let's look at this example how can we calculate b and a by hand to calculate the slope b we use this formula r is the correlation coefficient between x and y so in our case the correlation between house size and house price we get a correlation coefficient of 0.92 sy is the standard deviation of the dependent variable house price and sx is the standard deviation of the independent variable so house size so in this case our b is 108.35 here's a quick side note for anyone who wants to follow along and recalculate the example there are two slightly different formulas for the standard deviation one divides by n and the other by n minus 1 without diving into the details now almost all statistical software uses the formula with n minus 1 to calculate the standard deviation however for calculating the regression coefficient b we use the formula that divides by n if you'd like a more detailed explanation of the standard deviation feel free to check out my video on that topic all right once we've calculated b we can find the intercept a using this formula here y bar represents the mean of the house prices b is the slope we just calculated and x bar is the mean of the house sizes substituting these values the intercept a comes out to be 6,919.44 so based on this data we have now calculated the coefficients b and a if we insert the numbers for b and a we get this equation if we enter zero for x the house size we get 6,919.44 which is our intercept if we increase the house size by one square foot each time we get a house
price that is $108.35 higher each time okay before we start with the last important topic the assumptions for a simple linear regression let's check the results with DATAtab if you like you can load this sample data set the link is in the video description we want to calculate a regression so we click on regression here we can now select our dependent variable house price and the independent variable house size now let's look at the results we will focus on this table which shows the key information we need if you're curious about the other tables
you can click on AI interpretation for the corresponding table or check out our video on multiple linear regression where we explain these details in depth okay in this table we can see the calculated regression coefficients for the constant we called it intercept and house size the values match exactly with the ones we calculated by hand the intercept and the slope in the results table for a linear regression you'll also see the p-value what does the p-value tell us the p-value helps determine whether the relationship between the independent variable and the dependent variable is statistically significant to test whether the relationship we observe is meaningful or just due to random chance we start by stating the null hypothesis there is no relationship between the independent variable and the dependent variable if the p-value is small typically smaller than 0.05 we reject the null hypothesis suggesting a significant relationship between the variables if the p-value is large typically greater than 0.05 we fail to reject the null hypothesis indicating the observed data may have occurred by chance with no strong evidence for a relationship so in our case the p-value is highly significant indicating strong evidence of a relationship between house price and size all right in this example it's pretty obvious a bigger house typically costs more however there are cases where the relationship isn't that clear and what about the assumptions here are the key assumptions number one linear relationship in linear regression a straight line is drawn through the data this straight line should represent all points as well as possible if the relationship is nonlinear the straight line cannot fulfill this requirement number two independence of errors the errors so the differences
between actual and predicted values should be independent of each other this means that the error of one point doesn't affect another number three homoscedasticity or equal variance of errors if we plot the errors on the y-axis and the dependent variable on the x-axis their spread should be roughly the same across all values of x in other words the variance of the errors should remain constant in this case the assumption is fulfilled but what about that case here we observe unequal variance at low values of x the errors are small while at high values the variance of the errors becomes much larger number four normally distributed errors the errors should be normally distributed the normality of the errors can be tested both analytically and graphically however be cautious with analytical tests with small samples they often fail to detect deviations from normality and with large samples they quickly become significant even for minor deviations because of these limitations graphical methods such as the QQ plot are more commonly used today if you use DATAtab you just need to click here to check the assumptions so let's just go through how the assumptions are checked in practice to check for a linear relationship you can use a scatter plot plot the independent variable against the dependent variable if the points form a clear straight line pattern a linear relationship exists if not the relationship is likely nonlinear in our case we observe a clear linear relationship to test if the errors are normally distributed you can use a QQ plot or one of several analytical tests with a QQ plot the residuals should fall roughly along a straight line indicating normality if you use an analytical test check whether the calculated p-value is greater than 0.05 if it is not there is evidence that the data are not normally distributed the choice of test often depends on your field of research however as mentioned the QQ plot is increasingly preferred as a visual and intuitive way to assess normality independence of errors can be tested using the Durbin-Watson test which checks for autocorrelation in the residuals if the calculated p-value is greater than 0.05 it indicates that there is no significant autocorrelation in the residuals and the independence assumption is likely satisfied homoscedasticity can be checked using a residual plot where the predicted values are plotted on the x-axis and the residuals or errors on the y-axis the residuals should show a consistent spread across the plot a funnel shape indicates heteroscedasticity meaning the variance is not constant in our case the plot looks acceptable though not perfect if these assumptions are violated the regression results might not be reliable or meaningful and the predictions could be inaccurate so always check these assumptions before drawing conclusions from a regression model so far we've seen how simple linear regression helps us model the relationship between two variables one dependent and one independent but what if we have
more than one factor influencing our outcome that's where multiple linear regression comes in in the next section we'll extend our regression model to include multiple predictors making our predictions more accurate and realistic let's dive in what is a multiple linear regression a multiple linear regression is a method for modeling relationships between variables it makes it possible to infer or predict a variable based on other variables an example let's say you want to find out what influences a person's salary you take the highest level of education the weekly working hours and the age of a person you now investigate whether these three variables have an influence on the salary of a person if they do you can predict a person's salary by taking the highest educational level the weekly working hours and a person's age the variables we use for prediction are called independent variables the variable we want to infer or predict is called the dependent variable but what is the difference between a simple linear and a multiple linear regression as we know from the previous video in simple linear regression we use just one independent variable to predict the dependent variable for example if we want to predict a person's salary we use either if a person has studied or not the weekly working hours or the age of a person multiple linear regression on the other hand uses several independent variables to predict or infer the dependent variable therefore the difference between a simple and a multiple linear regression is that in one case only one independent variable is used and in the other case several both have in common that the dependent variable is metric metric variables are for example the salary of a person the body size or the electricity consumption so unlike simple linear regression multiple linear regression can include two or more independent variables but what impact does that have on the regression equation in the case of simple linear regression this was our equation we had one dependent variable y and one independent variable x now in multiple linear regression we have more than one independent variable but don't worry the coefficients b and a are interpreted similarly to those in a simple linear regression if all independent variables are zero the value a is obtained so we get a value of a for the dependent variable y furthermore if an independent variable increases by one unit the associated coefficient b indicates the corresponding change in the dependent variable okay let's make one small adjustment going forward instead of y we'll use y hat but why in the previous video we learned that regression aims to model the dependent variable as accurately as possible however when working with real world data there's always some error in other words the true values often differ from the predictions now y hat represents the predicted values from the regression model while y denotes the observed actual values great you're
gradually becoming an expert now there are four topics to cover the assumptions of regression how to calculate a regression with DATAtab how to interpret the results and finally how to handle categorical variables in regression by creating dummy variables let's start with the first topic so what are the assumptions the first four assumptions of multiple linear regression are similar to those of simple linear regression but there is an additional fifth assumption let's briefly recap the first four assumptions and then we'll go into more detail on the fifth assumption let's start with the first one linear relationship in the case of the simple linear regression we were able to test this assumption easily a straight line is drawn through the data this straight line should represent all points as well as possible if the relationship is nonlinear the straight line cannot fulfill this requirement in simple linear regression we have one independent variable and one dependent variable making it straightforward to plot the data points and the regression line in contrast multiple linear regression involves multiple independent variables which complicates the visualization however you can still plot each independent variable separately against the dependent variable to gain an initial sense of whether a linear relationship might exist number two independence of errors the errors so the differences between actual and predicted values should be independent of each other this means that the error of one point doesn't affect another one we can test this with the Durbin-Watson test number three homoscedasticity or equal variance of errors if we plot the errors on the y-axis and the predicted values from the regression model on the x-axis their spread should be roughly the same across all values of x in other words the variance of the errors should remain constant in this case the assumption is fulfilled but what about that case here we observe unequal variance at low values of x the errors are small while at high values the variance of the errors becomes much larger number four normally distributed errors the errors should be normally distributed we can test this with a QQ plot or with analytical tests if you like you can check out my video test for normal distribution for a deeper dive and what about the fifth assumption no multicollinearity first of all what is multicollinearity in regression
multicollinearity means that two or more independent variables are highly correlated with each other as a result the effect of the individual variables cannot be clearly separated why is that a problem let's look at the regression equation again we have here the dependent variable and there the independent variables with the respective coefficients for example if there is a high correlation between X1 and X2 or if these two variables are almost equal then it is difficult to determine B1 and B2 if both are completely equal the regression model cannot determine how large B1 and how large B2 should be this means that one independent variable can be predicted from the others with a high degree of accuracy an example imagine you're trying to predict the price of a house to do this you use the size of the house the number of rooms and some other variables usually the size of the house is related to the number of rooms large houses tend to have more rooms so these two variables are correlated if we now include both in our regression model the model will struggle to decide how much of the price is influenced by size and how much is influenced by the number of rooms this is because they overlap in the information they provide and this is multicollinearity in this case it becomes impossible to reliably determine the regression coefficients if you just want to use the regression model for a prediction the presence of multicollinearity is less critical in this context the focus is on how accurate the prediction is rather than on understanding the influence of the individual variables however if the regression model is used to assess the influence of the independent variables on the dependent variable there should be no multicollinearity okay but how do we detect multicollinearity if we look at the regression equation again we have the variables X1 X2 and up to XK we now want to determine whether X1 is nearly identical to any other variable or a combination of the other variables for this we simply set up a new regression model in this regression model we take X1 as the new dependent variable and keep the others as independent variables if we can predict X1 accurately using the other independent variables X1 becomes unnecessary its information is already captured by the other variables we can now do this for all the other variables so we estimate X1 by using the other independent variables we estimate X2 by using the other variables and we estimate XK by using the other independent variables okay but what is a method to detect multicollinearity for all K regression models we calculate R squared which is the so-called coefficient of determination what is the coefficient of determination R squared if X1 is the dependent variable and the other independent variables are used as predictors R squared measures how well those variables explain the variability of the dependent variable therefore a high R squared in this context suggests that X1 is highly correlated with the other independent variables this is a sign of multicollinearity using R squared we can calculate the tolerance and the variance inflation factor short VIF basically the VIF is one divided by the tolerance if the tolerance is less than 0.1 it indicates potential multicollinearity and caution is required on the other hand a VIF value greater than 10 is a warning sign of multicollinearity requiring further investigation typically statistical programs like DATAtab provide the tolerance and VIF values for each independent variable okay but how to address multicollinearity there are two common ways number one remove one of the correlated variables so choose the variable that is less significant and remove it or number two combine variables create a single variable by combining the correlated variables e.g. by taking an average if you're using DATAtab and calculate the regression you just need to click on test assumptions here you can see the table with the tolerance and the VIF all right let's work through an example on how to calculate a multiple linear regression and then look
at how to interpret the results our goal is to analyze the influence of age weight and cholesterol level on blood pressure so blood pressure is our dependent variable while age weight and cholesterol level are our independent variables to calculate the regression we just go to datatab.net and copy our data into the table if you like you can load the sample data set using the link in the video description we want to calculate a regression so we click on regression now we simply click on blood pressure under dependent variable and age and weight and cholesterol level
under independent variable afterwards we automatically get the results of the regression we will now discuss how to interpret the individual tables if you need an interpretation of your individual data you can just click on AI interpretation at each table and you will get a detailed explanation of your results and if you want to test assumptions just click here but back to the results let's start with the most important table the table with the regression coefficients and then take a closer look at the model summary table we will focus on these three columns here we can see the three independent variables age weight and cholesterol the first row represents the constant so in our regression equation let's replace X1 X2 and X3 with their corresponding names so we want to predict blood pressure based on a person's age weight and cholesterol level okay in the first column we see the unstandardized regression coefficients these are our coefficients from the regression equation now we can calculate the blood pressure for a given person let's say a person is 55 years old has a weight of 95 kg and a cholesterol level of 180 then our model would predict a blood pressure of 91 so for example if we look at the variable age the coefficient of 0.26 means that for each additional year of age the blood pressure increases by 0.26 units assuming the other variables remain constant and what about the standardized coefficients the standardized coefficient tells us the relative importance of each independent variable after standardizing the variables to the same scale why is this useful our model includes variables measured in different units such as age in years and weight in kilograms comparing their unstandardized coefficients can be misleading because these coefficients are influenced by the units of measurement for instance if weight is measured in tons the coefficient would be larger if measured in grams it would be smaller additionally the values for age in years are generally smaller than the values for cholesterol so you cannot directly compare their unstandardized coefficients with each other in contrast the standardized coefficients remain consistent regardless of the units this allows for a direct comparison of the relative effects of different variables for example we can see that cholesterol level has the largest standardized coefficient indicating that it has the strongest influence on blood pressure in our video about
simple linear regression we explained the p-value in detail the interpretation is similar in multiple linear regression to summarize the p-value shows whether the corresponding coefficient is significantly different from zero in other words it tells us if a variable has a real influence or if the result could just be due to chance if the p-value is smaller than 0.05 it means the effect is significant in our case all p-values are smaller than 0.05 so all variables have a significant influence perfect let's move on to the next table the model summary table first we get the multiple correlation coefficient R R measures the correlation between the dependent variable and the combination of the independent variables what does that mean here we have the equation for linear regression once the coefficients are determined we can sum everything up and calculate the predicted values y hat of the dependent variable so if we use our example data we have the real blood pressure data and we can predict the blood pressure data with the regression model the multiple correlation coefficient R is now the correlation between the predicted values y hat and the actual values
Y in other words the multiple correlation coefficient R indicates the strength of the correlation between the actual dependent variable and its estimated values therefore the greater the correlation the better the regression model in our case an R value of 0.72 indicates a strong positive relationship okay and what about R squared R squared is called the coefficient of determination R squared indicates the proportion of variance in the dependent variable that is explained by the independent variables the greater the explained variance the better the model's performance for example an R squared value of one would mean that
the entire variation in blood pressure can be perfectly explained by the variables age weight and cholesterol level however in reality this is rarely the case an R squared of 0.52 means that 52% of the variation in blood pressure is explained by the model what is the adjusted R squared the adjusted R squared accounts for the number of independent variables in the model this provides a more accurate measure of explanatory power when a model includes many independent variables the regular R squared can overestimate how well the model explains the data in such cases it is recommended to consider the adjusted R squared to avoid overestimation okay and what about the standard error of the estimate the standard error of the estimate measures the average distance between the observed data points and the regression line a standard error of the estimate of 6.6 indicates that on average the model's predictions deviate from the actual values by 6.6 units so if we predict a person's blood pressure using their age weight and cholesterol level our prediction will on average deviate by 6.6 units from the person's actual blood pressure okay if you want an interpretation of
the other tables simply click on AI interpretation earlier in this video I mentioned that independent variables in regression analysis can be nominal okay but what are nominal variables nominal variables are variables with different categories like gender with male and female or vehicle type but how do we use nominal variables in a regression model as independent variables let's keep things simple and start with variables with two categories imagine we have the variable gender with the categories male and female now we can code female as zero and male as one the category coded with zero is our so-called reference category all right let's take a look at the regression equation suppose the variable X1 represents gender then B1 is the regression coefficient for gender but how do we interpret B1 we said zero is female and one is male so let's just insert this for X1 for a female individual we have zero multiplied by B1 and for a male individual we have one multiplied by B1 accordingly B1 represents the difference between males and females now that we've discussed how to handle variables with two values let's explore how to approach variables with more than two values let's say we want to predict the fuel consumption of a car based on its horsepower and vehicle type to keep it simple let's say there are only three vehicle types sedan sports car and family van thus we have a variable vehicle type with more than two categories however as we know in regression we can only include variables with two categories so what's the solution this is where dummy variables come into play dummy variables are artificial variables that make it possible to handle variables with more than two categories for the variable vehicle type we create
a total of three dummy variables is sedan is sports car and is family van each of these dummy variables has only two possible values zero or one a value of one indicates the presence of the specific category while a value of zero indicates its absence instead of having one variable with three categories we now have three variables with two categories each these newly created dummy variables can be included in the regression model okay but what does this mean for our data preparation initially we have one column labeled vehicle type where the individual vehicle types from our sample are listed the first entry is sedan the second is also sedan the third is a sports car and so on from this column we create three new variables for the first vehicle which is a sedan we assign a one under sedan and a zero under the others as it's neither a sports car nor a family van similarly the second vehicle is also a sedan the third vehicle however is a sports car so we assign a one under sports car and a zero under the others by doing this we've successfully created our dummy variables one important thing to note the number of dummy variables you create will always be the number of categories minus one so in our case we have three categories so we basically only need two dummy variables why is that the case if we know a vehicle is a sedan we automatically know it is neither a sports car nor a family van similarly if we know it's a sports car we can infer that it's not a sedan or a family van finally if it's neither a sedan nor a sports car we know it must be a family van
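this categories-minus-one encoding can be sketched in Python with pandas the sample values and column names below are made up for illustration with sedan chosen as the reference category

```python
import pandas as pd

# hypothetical sample: one categorical column with three vehicle types
df = pd.DataFrame({
    "horsepower":   [110, 95, 310, 150, 120],
    "vehicle_type": ["sedan", "sedan", "sports_car", "family_van", "sedan"],
})

# one-hot encode the categorical column into 0/1 dummy columns ...
dummies = pd.get_dummies(df["vehicle_type"], prefix="is")

# ... then drop the reference category so only
# "number of categories minus one" dummy columns remain
X = pd.concat([df[["horsepower"]], dummies.drop(columns="is_sedan")], axis=1)

print(X.columns.tolist())
# a sedan row is simply a row where both remaining dummies are zero
```

note that pd.get_dummies also offers drop_first=True which drops the alphabetically first category automatically here the reference is dropped explicitly to match the choice of sedan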
This means we can express the same information with just two variables instead of three. Including all three dummy variables would introduce perfect multicollinearity, the so-called dummy variable trap, so the coefficients could no longer be estimated uniquely. Okay, but don't worry: if you're using DataTab, it will automatically create the dummy variables for you. For example, if we select fuel consumption as the dependent variable and horsepower and vehicle type as the independent variables, we can then see the three categories here and can select which one to use as the reference category. For example, if we choose sedan as the reference, dummy variables will be created for sports car and family van. When we examine the results, we will see the two variables vehicle type: sports car and vehicle type: family van, along with horsepower. So far we have focused on linear regression, where our goal is to predict a continuous variable like sales or house prices. But what if we need to predict categories instead, like whether a customer will buy a product or not, or whether an email is spam? That's where logistic regression comes into play. So in the next section we'll explore how logistic regression helps us model binary outcomes and make probability-based predictions.
Let's jump in: what is a logistic regression, how is it calculated, and most importantly, how are the results interpreted? Let's start with the first question: what is a regression? In a regression analysis you want to infer or predict an outcome variable based on one or more other variables. Okay, so what about logistic regression? A binary logistic regression is a type of regression analysis used when the outcome variable is binary, meaning it has two possible values, like yes or no, success or failure. Let's look at an example. Let's say we are researchers and we want to know whether a particular medication and a person's age have an influence on whether a person gets a certain disease or not. So the outcome we're interested in is whether the patients developed the disease or did not develop it, and our independent variables are medication and age. Now, with the help of a logistic regression, we want to infer or predict the outcome variable based on the independent variables. Okay, but what is the difference between a linear and a logistic regression? In a linear regression the dependent variable is a metric variable, like salary or electricity consumption. In a logistic regression the dependent variable is a binary variable. So with the help of logistic regression we can determine what has an influence on whether a certain disease is present or not. For example, we could study the influence of age, gender, and smoking status on that particular disease. In this case, one stands for diseased and zero for not diseased. We now want to estimate the probability that a person is diseased. So our data set might look like this: here we have the independent variables and there the dependent variable with zero
and one. We could now investigate what influence the independent variables have on the disease. If there is an influence, we can predict how likely someone is to have the disease. Okay, but why do we need logistic regression in this case? Why can't we just use linear regression? A quick recap: in linear regression, this is our regression equation, with the dependent variable, the independent variables, and the regression coefficients. However, our dependent variable is now binary: regardless of the values of the independent variables, the outcome will always be zero or one. A linear regression would now simply put a straight line through the points, and we can see that in the case of linear regression, values between minus and plus infinity can occur. However, the goal of logistic regression is to estimate a probability of occurrence, so the value range for the prediction should be between zero and one. We therefore need a function that only takes values between zero and one, and that is exactly what the logistic function does: no matter where we are on the x-axis between minus and plus infinity, only values between zero and one result, and that is exactly what we want. The equation for the logistic function looks like this. Logistic regression now uses the logistic function: the equation of the linear regression is simply plugged in, which gives us this equation. This equation gives us the probability that the dependent variable equals one, given specific values of the independent variables. Hm, what does this look like for our example? In our example, the probability of having a certain disease is a function of age, gender, and smoking status.
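As a small numerical sketch in plain Python (the function name is our own): the logistic function 1/(1 + e^(-z)) squeezes any value between minus and plus infinity into the range between zero and one.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-100, -2, 0, 2, 100):
    print(z, round(logistic(z), 4))
# prints 0.0, 0.1192, 0.5, 0.8808, 1.0 respectively
```

However far left or right we go on the x-axis, the output never leaves the zero-to-one range, which is exactly what we need for a probability.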
that help our model best fit the given data this is done using the maximum likelihood method for this There are numerical methods that can solve the problem effectively a statistics program such as data tab therefore calculates the coefficients all right let's work through this example on how to calculate a logistic regression and then look at how to interpret the results to calculate the regression we just go to data.net and copy our data into this table if you like you can load the sample data using the link in the video description we Want to calculate a
logistic regression so we just click on regression we choose disease as the dependent variable and age gender and smoking status as the independent variables datab now calculates a logistic regression for us depending on how our dependent variable is scaled data will calculate either a logistic or a linear regression under the tab regression since we have two categorical variables we can set the reference Category we will just use female and nonsmoker as reference now we can choose for which category we want to build the regression model so we can decide if we want to predict if
a person is diseased or not diseased instead of diseased and not deceased we could of course also have one and zero okay he before we go into detail about the different results a little tip if you don't know how to interpret the results you can also just click on summary inverts a logistic Regression analysis was performed to examine the influence of age gender fale and smoking status smoker on the variable disease to predict the value diseased logistic regression analysis showed that the model as a whole was significant ific and then comes the interpretation of the
different independent variables. Further, you can click on AI interpretation at the different tables. We will now carefully go through each table step by step to ensure everything is clear to you. Let's begin at the top. First we get the result table: here we can see that a total of 36 people were examined with the help of the regression model, and of these 36 people, 26 could be correctly assigned, that is 72.22%. Next is the classification table. This table shows how often the categories not diseased and diseased were observed and how frequently they were predicted. In total, not diseased was observed 16 times. Among these 16 individuals, the regression model correctly classified 11 as not diseased while misclassifying five as diseased. Of the 20 diseased individuals, the regression model misclassified five as not diseased and correctly classified 15 as diseased. But how do we determine whether a person is classified as diseased or not? As mentioned earlier, logistic regression provides the probability of a person being diseased, so we obtain values ranging from 0 to 100%. Now we simply set a threshold of 50%: if a value exceeds 50%, the person is classified as diseased; otherwise they are classified as not diseased.
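The cutoff logic and the accuracy from the result table can be reproduced in a few lines of Python (the helper function is our own; the counts are the ones from the classification table above):

```python
def classify(prob_diseased, threshold=0.5):
    """Apply the cutoff: probability above the threshold -> 'diseased'."""
    return "diseased" if prob_diseased > threshold else "not diseased"

# Counts from the classification table in the video
correct_not_diseased, wrong_not_diseased = 11, 5   # 16 observed "not diseased"
correct_diseased, wrong_diseased = 15, 5           # 20 observed "diseased"

total = (correct_not_diseased + wrong_not_diseased
         + correct_diseased + wrong_diseased)
accuracy = (correct_not_diseased + correct_diseased) / total

print(classify(0.73))             # diseased
print(round(accuracy * 100, 2))   # 72.22
```

So the 72.22% is simply the 26 correctly classified people divided by all 36.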
Of course you can choose a threshold other than 50%; to learn more about this, check out our video on the ROC curve. So let's have a look at the next table. The chi-square test evaluates whether the model as a whole is statistically significant. For this, two models are compared: in one model all independent variables are used, and in the other model the independent variables are not used. Now we can compare how good the prediction is when the independent variables are used and how good it is when they are not used. The chi-square test tells us if there is a significant difference between these two results. The null hypothesis is that both models are the same; if the p-value is less than 0.05, this null hypothesis is rejected. In our example the p-value is less than 0.05, so we assume that there is a significant difference between the models; thus the model as a whole is significant. Next comes the model summary. In this table we can see, on the one hand, the minus 2 log-likelihood value, and on the other hand
we are given different coefficients of determination, R². R² is used to find out how well the regression model explains the dependent variable. In a linear regression, R² indicates the proportion of the variance that can be explained by the independent variables; the more variance can be explained, the better the regression model. In a logistic regression, however, its interpretation differs, and multiple methods exist to calculate R². Unfortunately there's no consensus yet on which method is considered the best; DataTab gives you the R² according to Cox and Snell, according to Nagelkerke, and according to McFadden. And now comes the most important table: the table with the model coefficients. The most important parameters are the coefficient b, the p-value, and the odds ratio; we'll now discuss all three columns. In the first column we can read the calculated coefficients from our model. We can insert these into the regression equation, so we get the coefficients for age, gender, smoker, and the constant. For example, for a person who is 55 years old, male, and a nonsmoker, we get a probability of 36%; thus it is 36% likely that a 55-year-old male nonsmoker is diseased. In reality there would certainly be many other and different independent variables. Okay, but what about the p-value? The p-value shows whether the corresponding coefficient is significantly different from zero. In other words, it tells us if a variable has a real influence or if the result could just be due to chance. If the p-value is smaller than 0.05, the difference is significant. In our case, all p-values are greater than 0.05, indicating that none of the variables has a significant influence. And finally, the
odds ratio. But what are odds, and what is the odds ratio? Let's start with the odds. Let's say we have two possible outcomes of something: success and failure, for example whether a therapy is successful or not. Let's say that the probability that the therapy is successful is 0.7, so 70%, and thus the probability of failure is 1 minus 0.7, so 0.3. Okay, but what about the odds? Odds are defined as the ratio of the probability of success and the probability of failure, or in other words, odds represent the ratio of the probability of an event happening to the probability of it not happening. If we look at our example, the odds are 0.7 divided by 0.3, which equals 2.33. This means the event success is 2.33 times more likely to happen than not. So odds give us a measure of the likelihood of an event happening versus it not happening. In this case we've calculated the odds of success; of course we can also calculate the odds of failure. All right, now that we understand odds, let's talk about odds ratios. So what are odds ratios? Let's look at the example from the beginning: we're studying a new medication to reduce the risk of a certain disease, so we have group A, patients with medication, and group B, patients without medication. Let's say in group A we calculated a probability of 60%, or 0.6, of getting diseased, so the odds of getting diseased are 0.6 divided by 0.4, which is 1.5. Again, odds just represent the ratio of the probability of an event happening to the probability of it not happening. In our case, in group A the likelihood of being diseased is 1.5 times higher than the likelihood of not being diseased.
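The odds arithmetic above is easy to verify in plain Python (the helper name is our own):

```python
def odds(p):
    """Odds: probability of the event divided by probability of no event."""
    return p / (1 - p)

print(round(odds(0.7), 2))  # 2.33  (therapy example: success vs. failure)
print(round(odds(0.6), 2))  # 1.5   (group A: 60% probability of disease)
```

Note that odds of 1 mean the event is exactly as likely to happen as not.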
Let's say in group B, where the patients didn't get the medication, the probability of getting diseased is 80%, or 0.8, so the odds in group B of getting diseased are 0.8 divided by 0.2, so four. Therefore, in group B the likelihood of being diseased is four times higher than the likelihood of not being diseased. What about the odds ratio? With the odds ratio we can now compare the two groups. To do this, we compare the odds of getting the disease in group A relative to the odds of getting the disease in group B. So the odds ratio is simply calculated by dividing the odds in group A by the odds in group B. This results in an odds ratio of 0.38. The odds ratio of 0.38 means that the odds of being diseased in group A are 0.38 times the odds of being diseased in group B. Of course we can also switch the order; then the odds ratio would be the odds in group B divided by the odds in group A. In this case the odds ratio of approximately 2.67 means that the odds of being diseased in group B are 2.67
times higher than the odds of being diseased in group A. So an odds ratio is simply a comparison of the odds of an event occurring in two different groups. The odds ratio indicates how much more likely the event is to occur in one group compared to the other: if the odds ratio is greater than one, the event is more likely to occur in the first group; if it is less than one, the event is less likely in the first group. Okay, now let's put it all together and look at how to interpret the odds ratio in logistic regression. Let's get started. First of all, to calculate a logistic regression we need data. Let's say we have data from 50 patients. Our outcome variable is disease, which is coded as zero for not diseased and one for diseased, and we have two independent variables, medication and age. Now we can use this data to calculate a logistic regression; you can find a link to the data set in the video description. In the first column we can see the coefficients that define our model. These coefficients can be entered into the logistic regression formula: here we can see the coefficients from the table, the constant and the coefficients for medication and for age. Now we just need to enter a value for medication, such as one, indicating the patient received medication, and a value for age, for example 50. Then we can calculate the probability: in this case the probability of being diseased is 0.55, or 55%. So for a patient who took the medication and is 50 years old, the probability of being diseased is 55%. Of course we can simply use DataTab to calculate this probability. To do this, just enter one
here and 50 there. We will then also get a probability of 0.55, and DataTab further gives us the odds. As we know, the odds are calculated as the probability that a certain event will happen divided by the probability that the event will not happen; we therefore get 0.55 / (1 - 0.55), which equals 1.22. Okay, but we are not interested in the odds alone; we're interested in the odds ratio. Again, the odds ratio is simply a comparison of the odds of an event occurring in two different groups. The two groups could be persons who took the medication and persons who did not take the medication. Therefore, going back to DataTab, we just need to compare the odds of a person who took the medication with the odds of a person who did not take the medication. So to get the odds ratio, we just divide the odds of a person who took the medication by the odds of a person who did not take the medication. This results in an odds ratio of 0.64, and surprise: the calculated value matches the odds ratio listed for the variable medication.
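These numbers can be reproduced in plain Python. The coefficients -0.45 (medication) and 0.04 (age) correspond to the odds ratios in the results table; the intercept is not quoted in the video, so the value -1.35 below is an assumption chosen so that the 55% probability comes out:

```python
import math

# Coefficients of the logistic model; the intercept -1.35 is an
# assumed value that reproduces the 55% probability from the video
b0, b_medication, b_age = -1.35, -0.45, 0.04

def prob_diseased(medication, age):
    """Probability of disease for medication (0/1) and age, via the
    logistic function applied to the linear predictor."""
    z = b0 + b_medication * medication + b_age * age
    return 1 / (1 + math.exp(-z))

p_with = prob_diseased(1, 50)              # ~0.55
p_without = prob_diseased(0, 50)
odds_with = p_with / (1 - p_with)          # ~1.22
odds_without = p_without / (1 - p_without)

print(round(odds_with / odds_without, 2))  # 0.64 - odds ratio for medication
print(round(math.exp(b_medication), 2))    # 0.64 - same via exp(coefficient)
```

The last two lines agree because, for a one-unit change in a predictor with everything else fixed, the ratio of the odds is exactly e raised to that predictor's coefficient.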
The odds ratio of 0.64 for medication indicates that for individuals who took the medication, the odds of the outcome diseased are 0.64 times the odds of those who did not take the medication. All right, with medication we have two groups to compare, but what about a continuous variable like age? In this case we simply look at what happens when we increase age by one unit. For example, we might compare the odds of the outcome for someone aged 50 versus someone aged 51. This allows us to calculate the odds ratio by comparing the two odds; in this case we get an odds ratio of 1.04. So for each one-year increase in age, the odds of the outcome diseased increase by a factor of 1.04. But there's one thing I haven't told you yet: the odds ratio can actually be calculated simply by exponentiating each coefficient. So e to the power of -0.45 is 0.64, which is the odds ratio of medication, and e to the power of 0.04 is 1.04, which is the odds ratio for age. To sum it up: odds are simply the ratio of the probability of an event happening to the probability of it not happening, and the odds ratio is the ratio of the odds that an event occurs in two different groups. We just talked about regression, where we predict values based on known relationships. Let's now shift our focus to discovering hidden patterns in data with no apparent relationship. This brings us to cluster analysis, specifically the K-means clustering technique. K-means clustering is a powerful method used to identify hidden groups or clusters within our data. Let's explore how that works and how it can enhance our understanding of complex data sets. In this video I
will explain to you everything you need to know about cluster analysis. I will start with the question: what is K-means clustering? Then I will show you how it can easily be calculated online with DataTab. Now let's start with the question: what is the K-means cluster analysis? The K-means cluster analysis is one of the simplest and most common methods for cluster analysis. By using the K-means method you can cluster your data into a given number of clusters, so you already need to define the number of clusters beforehand. For example, you have a data set and you want to cluster your cases into three clusters; this can be done with the K-means cluster analysis. For example, you could have a data set with 15 European countries and you want to cluster them into three country groups. So now the question is: how does the K-means cluster analysis work? There are five simple steps required. Let's start with the first step: first you have to define the number of clusters, meaning the number of groups you want to find. The number of clusters is the K in K-means; in our case we simply select three clusters, so in this example K was set equal to three. The second step is to set the cluster centers randomly; each of these centers now represents one cluster. Let's come to step three: now that we have selected the number of clusters and set the cluster centers randomly, we assign each element to one cluster, so for example we assign each country to one cluster. Let's start with one element: the distance from the first element to each of the cluster centroids is calculated. So, for example, we calculate the distance from this
element to each cluster centroid. Afterwards, each element is assigned to the cluster to which it has the smallest distance. In our example, the distance between this element and this cluster centroid is the smallest, so we assign this element to the yellow cluster. This step is repeated for all further elements, so at the end we have one yellow cluster, one red cluster, and one green cluster, and all points are initially assigned to a cluster. So let's summarize it again: we first defined the number of clusters, we then placed the cluster centroids randomly, and we assigned each element to a cluster. In step four we now calculate the center of each cluster, so for the green elements, for the yellow elements, and for the red elements the center of each cluster is calculated. These centers are the new cluster centroids; this means that we simply shift the centroids into the cluster centers, so the cluster centroids are moved to the cluster centers. Now in step number five we assign the elements to the new clusters: since the centroids may now be located somewhere else, each element is assigned to the cluster that is closest to it. Now we have finished all steps, and from now on steps four and five are repeated until the cluster solution does not change anymore; then the clustering procedure is over. One disadvantage of the K-means method is that the final result depends very much on the initial clusters we used. To take this into account, the whole procedure is carried out several times, and different randomly chosen starting points are used for each of the calculations. Each time we use different starting points, the outcome could be different, so we do the whole cluster analysis several times in order to get the best possible result.
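The five steps, including the repeated random starts, can be sketched in NumPy (the toy data and function are our own; DataTab's actual implementation may differ):

```python
import numpy as np

def kmeans(X, k, n_starts=25, n_iter=100, seed=0):
    """Plain K-means: place random centroids, assign points, re-center,
    repeat until stable; try several random starts and keep the best
    solution, i.e. the one with the lowest summed squared distance."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_starts):
        # steps 1+2: k is given; centroids start at randomly chosen points
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # step 3: assign each element to its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step 4: move each centroid to the center of its cluster
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            # step 5: repeat steps 3-4 until the solution stops changing
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inertia = (dists.min(axis=1) ** 2).sum()
        if inertia < best[0]:
            best = (inertia, dists.argmin(axis=1), centers)
    return best  # (summed squared distance, labels, centroids)

# Toy data: three well-separated groups of points
X = np.array([[1, 1], [1.5, 1], [1, 1.5],
              [8, 8], [8.5, 8], [8, 8.5],
              [1, 8], [1.5, 8], [1, 8.5]])
inertia, labels, centers = kmeans(X, k=3)
```

With well-separated groups like these, the best of the random starts recovers the three groups regardless of where the initial centroids happened to land.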
If you use DataTab to calculate the cluster analysis, the analysis is, for example, done 10 times with 10 different randomly chosen starting points, and at the end the best cluster solution is chosen. So the next question is: what is the optimal number of clusters? With each new cluster, the summed distance within the clusters gets smaller and smaller. So if we have a look at this picture, where we have two clusters, and that picture, where we have three clusters, for sure these three clusters fit the data better than these two clusters; the distance between the elements and the cluster centers is in this case higher than in that case. So the question now is: how many clusters should be used? In order to answer this question we use the elbow method. With each additional cluster, the summed distance between the elements and the cluster centers becomes smaller and smaller; however, there is a cluster number from which each additional cluster reduces the summed distance only slightly, and this point is used as the number of clusters. So if we have a look at this plot, we can see that there is a big gap between cluster numbers one and two, and there's also a big gap between cluster numbers two and three, but there's only a small gap between cluster numbers three and four. So in this case we select three clusters. Now I'd like to show you how you can easily calculate a K-means cluster analysis online with DataTab. To do this, please visit datatab.net, click on the statistics calculator, and choose the tab cluster.
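The elbow logic can be sketched in plain Python. The summed distances below are invented for illustration, and the 10% cutoff is just one possible rule of thumb; in practice the elbow is usually judged by eye from the plot:

```python
# Hypothetical within-cluster summed squared distances for k = 1..6
inertia = {1: 1200.0, 2: 520.0, 3: 180.0, 4: 150.0, 5: 130.0, 6: 118.0}

# Drop in summed distance gained by each additional cluster
gains = {k: inertia[k - 1] - inertia[k] for k in range(2, 7)}
print(gains)  # {2: 680.0, 3: 340.0, 4: 30.0, 5: 20.0, 6: 12.0}

# Elbow: the last k whose gain is still a sizable share of the biggest gain
threshold = 0.1 * max(gains.values())  # 10% cutoff, chosen arbitrarily here
elbow = max(k for k, g in gains.items() if g >= threshold)
print(elbow)  # 3
```

The big drops from one to two and from two to three clusters, followed by only small drops, are exactly the pattern described above, so three clusters are selected.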
If you want to use your own data, you can clear the table and then copy your own data into this table; I will simply use the example data now. So I want to calculate a K-means cluster analysis. Let's say we want to cluster a group of people by salary and age. First we can define the number of clusters, and we enter that number. DataTab will now calculate everything for you, and you get your results right away. So here you can see the three clusters with the centroids, and there you can see the elbow method. In this case the results indicate that we should use a solution with two clusters; we selected three clusters before, so we can change it in order to get the most suitable number of clusters. But let's go through the results now step by step. Here we can see how many elements are assigned to the different clusters, and here we get the plot with the different clusters: we can see one cluster here, one cluster there, and another cluster here. Further, we get a table where each element is allocated to a cluster. If we now choose two as the number of clusters, we get new results, and we can see them here: in this plot we have one cluster here and one cluster there. Moreover, we can see that the two clusters fit the data quite well. There's one more important concept we need to discuss: confidence intervals. In the next video we'll break down what confidence intervals are, why they matter in statistical analysis, and how to interpret them properly. Let's dive in. In this video we'll uncover the true definition of a confidence
interval, clear up some common misconceptions, and explain the difference between the incorrect and the correct interpretation. So let's get started. First of all, why do we need confidence intervals? In statistics, parameters of the population are often estimated based on a sample. Therefore, on the one hand you have the population, but since in most cases you cannot survey the entire population, you draw a sample. Now we want to use this sample to estimate a parameter of the population; parameters that can be estimated are, for example, the mean or the variance. Let's look at an example: you want to know the height of all professional basketball players in the US. In order to figure this out, you draw a sample. The mean of the sample is most likely different from the mean of the population. Let's assume that we draw not just one but several samples, which of course you don't actually do in practice. Each sample is likely to show a different mean: in the first sample we have one mean, in the second sample we most likely have another mean, and again in another sample we have yet another mean. Of course it is also possible that purely by chance two or more samples have means that are exactly the same, but this is very unlikely. Now, it would be extremely valuable to have a range that we expect to capture the true parameter with a certain level of confidence, and this is precisely where the misconception about confidence intervals comes in. In fact, published studies have shown that scientists frequently misinterpret confidence intervals. Let's dive in and break down exactly what a confidence interval means, and just as importantly, what it does not mean. There are two common ways to
explain the confidence interval. On the one hand, there is a simpler explanation of the confidence interval, but it's not correct when viewed from a frequentist statistics perspective. On the other hand, there's a slightly more complex explanation that is actually true. To make the difference clear, we'll start with the simple but wrong interpretation, then explain why it falls short, and finally arrive at a clearer understanding of the correct interpretation. To keep things simple, let's focus on the 95% confidence interval, but the same goes for the other levels, of course. So let's address the simple but incorrect interpretation. This interpretation goes like this: there is a 95% chance that the true parameter lies within a calculated confidence interval. So what does this actually mean? Imagine we have a population with a true mean value; this true mean value is the one we want to estimate. Although we don't know this true mean, we can make an educated guess by taking a sample from the population. From this sample we calculate both the sample mean and the 95% confidence interval. The simplified interpretation is to say that the confidence interval provides a range within which the true mean lies with a certain probability, or in the case of the 95% confidence interval, we would say there is a 95% chance that the true value falls within this interval. However, this interpretation isn't accurate. But why? In frequentist statistics, the true parameter, in our case the true mean, is treated as a fixed but unknown quantity. So the true parameter does not move around; it is fixed. If we now draw a sample and calculate the confidence interval, the true value either lies inside the interval or it doesn't. In this case, the confidence interval contains the true value. Therefore, there's
no probability associated with the parameter being within this specific interval. But why? Because probabilities in frequentist terms only apply to events that are subject to variability, and again, the true parameter is fixed and cannot change. Therefore, you cannot assign a probability to the true parameter being in a given interval; the parameter is either inside the interval or it's not. The only thing that varies is the sample data we collect: every time we draw a new sample, we have new data and consequently a new mean and confidence interval. So, for example, in this sample the true value falls within the confidence interval; if we take a second sample, maybe the confidence interval will not include the true value. In all these samples the true value falls within the confidence interval, while in those samples it doesn't. In summary, you cannot say that there is a 95% chance that a given interval contains the true parameter, because once the interval is calculated, it either contains the parameter or it doesn't, and there is no probability left to assign in the frequentist sense. But what is the correct interpretation? Let's say we took a lot of random samples and we calculated the mean value and the confidence interval of each sample. The confidence interval can now be interpreted in the following way: if we
were to take an extremely large number of random samples and construct a confidence interval for each sample, 95% of those intervals would contain the true value, while 5% would not. In other words, if we were to take 100 random samples, we would expect that on average 95 of the confidence intervals would contain the true value, while five would not. You can also see it the other way around: the confidence interval can be defined in terms of probability with respect to a single theoretical sample that has yet to be realized. Therefore, if you haven't drawn the sample yet, you can be 95% sure that the interval from the next sample you draw will contain the true value; but once you have taken the sample, the true value is either in the interval or not. Therefore, confidence is about the method, not the specific interval: the 95% confidence refers to the long-run reliability of the method you use to construct the interval. It means that if you use this method repeatedly on different samples, you expect to capture the true parameter 95% of the time. But once you've applied it and obtained a specific interval, you cannot make a probability statement about whether this interval contains the fixed true parameter or not.
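This long-run reading can be checked with a small simulation (the population values and sample size below are our own choices): draw many samples, build the interval mean plus or minus 1.96 times s over the square root of n each time, and count how often it captures the true mean.

```python
import numpy as np

rng = np.random.default_rng(7)
true_mean, sigma, n, runs = 100.0, 15.0, 50, 10_000

hits = 0
for _ in range(runs):
    sample = rng.normal(true_mean, sigma, n)
    m, s = sample.mean(), sample.std(ddof=1)
    half = 1.96 * s / np.sqrt(n)   # half-width of the 95% interval
    hits += (m - half) <= true_mean <= (m + half)

coverage = hits / runs
print(coverage)  # close to 0.95 (slightly less, since s is only estimated)
```

Each individual interval either contains 100 or it doesn't; the 95% describes how often the procedure succeeds over many repetitions.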
A side note: in statistics there are two distinct approaches or frameworks, the frequentist and the Bayesian. The confidence interval is a method used in the frequentist approach. In a Bayesian approach, we would treat the parameter as a random variable with its own probability distribution, reflecting our uncertainty about it. In that framework it would make sense to say that, given our data, there is a certain probability that the parameter falls within a certain range; but compared to the frequentist interpretation, this is a fundamentally different way of thinking. In the Bayesian approach there is a concept known as the credible interval, which serves as the counterpart to the confidence interval in frequentist statistics. But unfortunately there are also critics of the Bayesian way. In short, Bayesian statistics requires the use of a so-called prior distribution, and the main criticism is that credible intervals may not be entirely objective, as they are influenced by the choice of the prior distribution. This makes the results potentially sensitive to subjective inputs; however, this same feature can also be seen as a strength, as
it allows for incorporating prior knowledge into the analysis in a principled way. Okay, but now to the easiest part: how is the confidence interval for the mean calculated? If your data are normally distributed, the confidence interval for the mean can be calculated with this formula: CI = x̄ ± z · s / √n. Here x̄ is the sample mean, z is the z-value for the respective confidence level, n is the sample size, and s is the standard deviation. The plus or minus results from the fact that we have once the upper limit with plus and once the lower limit with minus. Where do we obtain the z-value? The z-value for a given confidence level can be found in a standard normal distribution table, which lists z-values corresponding to the different confidence levels. For example, at a 95% confidence level the z-value is 1.96. Using this, the confidence interval can be expressed as the sample mean plus or minus 1.96 times the standard deviation divided by the square root of the sample size. The confidence interval can of course be calculated for many statistical parameters, not only for the mean value. Thanks for watching, and I hope you enjoyed the video. Bye!