Hello, and welcome to the module on Analysis of Variance.
This module will introduce the analysis of variance, or ANOVA.
By the end of this module, you should be able to:
● Differentiate between a t-test and an ANOVA, and apply the appropriate test for a given problem,
● Formulate the null hypothesis in a one-way ANOVA,
● Identify the conditions necessary to conduct an ANOVA,
● Use Excel to conduct an ANOVA, and
● Calculate and interpret eta squared, a measure of effect size.
As we have previously learned, a t-test allows us to compare two means to determine whether these groups are different, but sometimes we want to compare more than two means.
In other words, we want to compare the mean values of a continuous dependent variable at different levels of a nominal independent variable. We call that nominal variable a factor. In this situation, we use what is called an Analysis of Variance, or ANOVA.
This name may sound counterintuitive: if we want to know about differences in means, why are we analyzing variances? To get a better idea, let's have a look at the logic underlying the statistics we calculate for t-tests and ANOVAs.
The logic of an F-test, which is our ANOVA statistic, is much like the logic for a t-test. In the numerator you have some measure of the difference between your means, which is the effect you’re interested in, and in the denominator you have some measure of the variability we expect in these data.
With the t-test, we look at the size of the difference between means in terms of how many standard deviations apart they are. If the distributions of the groups are really variable (that is, they have large standard deviations), the difference between means needs to be much larger to be significant. The logic is similar with ANOVA.
This time, instead of a t-statistic, we calculate an F statistic. At its heart, the F statistic is a ratio of two different variances.
We look at a measure of the variance BETWEEN the groups we are looking at, and see how big that is compared to the variance WITHIN these groups.
If the groups themselves are really variable, a difference between groups would have to be much larger to be significant.
Essentially we partition the total variability into smaller chunks that are attributed to different sources.
Between-groups variability is our best estimate of the possible differences between population means, while within-groups variability, also referred to as error variance, estimates the expected spread in population scores ignoring group membership.
Therefore, the F ratio compares the variability explained by group differences to our estimate of unexplained, or error, variation in the population.
We estimate all of these values using sums of squares.
In the equation here, i represents the groups and j represents the individuals. Other books and courses may use different letters for individuals or groups. It doesn't really matter, as long as you keep track. So, for example, in the equation shown here, X1,2 would represent the value for the first group and the second person within that group.
Each equation calculates a different sum of squares value, which we denote using SS. We call them sum of squares because these values involve adding up the squared differences from a mean value.
We call the mean of all scores the grand mean, represented in the formula as X-bar with the subscript "gm". The sum of squares group value, which represents our variability between groups, is calculated by subtracting the grand mean from each group's mean, squaring that difference, weighting it by the number of scores in the group, and summing those values. The sum of squares error represents our variability within groups, or the variability not accounted for by our group differences. To calculate this value, we subtract the SS group value from the SS total value. We get SS total by subtracting the grand mean from each individual score, squaring that, and adding up all of those squared differences.
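In standard notation (matching the on-screen equations, with i indexing groups, j indexing individuals, and n_i the number of scores in group i), the three sums of squares are:

$$SS_{total} = \sum_{i}\sum_{j}\left(X_{ij} - \bar{X}_{gm}\right)^2$$

$$SS_{group} = \sum_{i} n_i\left(\bar{X}_{i} - \bar{X}_{gm}\right)^2$$

$$SS_{error} = SS_{total} - SS_{group} = \sum_{i}\sum_{j}\left(X_{ij} - \bar{X}_{i}\right)^2$$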
Don't focus too much on the math; we're going to use a computer to help us calculate the F statistic.
Just notice that these are very similar in form to the variance we learned about in the module on Central Tendencies and Variability. It is also important to try to see how each value relates to its source of variation.
Before we get into the actual calculations, let’s look at this graphically. Here we have 3 normal density curves - representing data from 3 groups we want to test. [It might be helpful to label each of the curves with Group A, B, or C]
Look at the table above. In a moment you’ll see how changes to the curves change our values for sums of squares and F. We'll get into exactly how to calculate this later, but let's think about it more intuitively right now.
As we mentioned, the between groups variability measures the combined squared distance of each mean [highlight each mean] from the grand mean [highlight GM] - this value is the between groups variability and the numerator of our F-test. The within group variability is the spread around each individual group's mean [inside 3 distributions, draw arrows from each group mean to edge of distribution] - the error variability and denominator of our F-test.
Changing either of these values affects our F-test. If the means are more spread apart - like this - our between group sums of squares and F increase. The opposite happens when the means move back together. Now let's look at what happens when within group variability gets smaller - like this. Our within group sums of squares gets smaller and F gets larger. Again, the opposite occurs when we increase within group variability.
When sample means do not differ much from each other, we can assume that the observations for each group were likely sampled from a single population. If the samples are all derived from the same population, then you should expect the variance between the sample means not to differ much from the overall population variance either. Therefore, the numerator and denominator should be relatively similar to one another, resulting in an F value of approximately 1. That is, our between and within group variability should be nearly equal, suggesting that our groups don't really differ.
However, when sample means are very different, we infer that they were probably sampled from populations with different means. It would make sense, then, that the variance of the group means will be larger than the population variance. Therefore, the numerator should have a larger value than the denominator, resulting in a greater F value than in the scenario outlined previously. That is, the ratio of our between to within group variability should be larger, suggesting that at least one of the groups differs in some way from the others.
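To see this logic in action, here is a minimal simulation sketch in Python (numpy and scipy are outside this module's Excel workflow): when three samples come from one population, F hovers near 1; when the population means differ, F grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Three samples drawn from the SAME population: expect F near 1.
same = [rng.normal(20, 4, 30) for _ in range(3)]
f_same, p_same = stats.f_oneway(*same)

# Three samples from populations with DIFFERENT means: expect a larger F.
different = [rng.normal(m, 4, 30) for m in (18, 20, 24)]
f_diff, p_diff = stats.f_oneway(*different)

print(f"same population:       F = {f_same:.2f}, p = {p_same:.3f}")
print(f"different populations: F = {f_diff:.2f}, p = {p_diff:.2g}")
```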
This really demonstrates the crux of the ANOVA, and it explains why we call it an analysis of variance, even though we make hypotheses and conclusions about means.
Citation: Scott R. Colwell, "Single Factor Analysis of Variance", Wolfram Demonstrations Project, http://demonstrations.wolfram.com/SingleFactorAnalysisOfVariance/
Analysis of Variance, like many statistical procedures, requires some assumptions about the data you are dealing with for it to be valid.
It's important to remember that these assumptions must hold if you want to make any meaningful inferences with the ANOVA. Violations of these assumptions will not prevent you from actually computing your calculations; it is up to you, the statistician, to verify that these assumptions are valid for your data before you conduct an ANOVA. We will not discuss assumption-checking methods in this module, but know that they do exist, and it is always best practice to check your assumptions before making any statistical inferences. With that said, there also exist methods to account for violations of the first two of the assumptions we will discuss now.
The first assumption is the Assumption of Normality. This assumption requires that our samples are normally distributed around the population mean. The green graph meets this assumption; however, the red and blue graphs depict data that violate this underlying assumption of an ANOVA.
The second assumption we make is that all the populations we sample from have the same variance, known as the Homogeneity of Variance Assumption. Notice that our assumption is based on the population and not the sample, but given that we only have the sample, we use the sample variance as a best estimate for the population variance.
A common rule of thumb suggests that when we compare pairs of variances, if the larger variance is 4 or more times the smaller variance, we may be violating this assumption. Notice that this would be true if we paired any of the distributions below; they all have quite different variances. This problem is worse when sample sizes are not equal across groups.
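As a sketch, this rule of thumb translates directly into Python (the threshold of 4 is the heuristic above, not a formal test):

```python
import numpy as np

def check_variance_ratio(groups, threshold=4.0):
    """Flag a possible homogeneity-of-variance violation when the largest
    sample variance is at least `threshold` times the smallest."""
    variances = [np.var(g, ddof=1) for g in groups]  # unbiased sample variances
    ratio = max(variances) / min(variances)
    return ratio, ratio >= threshold
```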
The last assumption we make when conducting a one-way ANOVA is the assumption of independent observations. Simply put, we assume that each individual data point or observation does not provide us with any information about any of the other data points.
In other words, each data point is not related to any other, as they might be if measurements were repeated on the same person.
It’s important to remember that the assumptions we just discussed are all assumptions about the population.
When conducting an ANOVA, or any inferential statistics for that matter, we are dealing not with populations, but with samples from these populations. We use samples because it is not feasible to use an entire population.
As a very simple example, imagine investigating the relationship between height and age during adolescence.
It is rather ridiculous to expect researchers to measure the height of every single adolescent in the world as they get older. Instead, measuring the height of a sample of adolescents of different ages should provide an adequate idea of what the relationship is between age and height.
As before, we will denote the population mean and variance using mu and sigma squared, respectively, while we will refer to the sample mean and variance using x-bar and s squared.
Now that we understand when to use ANOVA, the logic behind ANOVA, and the assumptions that must be met to use it, let’s set up an example to help us understand the rest of the procedure.
Referring back to the cars database we’ve been working with, let’s imagine that you’re considering purchasing a car to commute to and from university.
You arrive at a dealership that specializes in Mazdas, Saturns, and Suzukis.
Fortunately, you have no sense of style, so you are content with the look of the hundreds of cars on the lot, but unfortunately, you have to choose one. You want to save some money on gas, so you figure you will narrow your options by first seeing if the different manufacturers differ in their fuel efficiencies.
The dealer quickly pulls together the fuel efficiencies for the Mazdas, Saturns, and Suzukis on the lot. (In this example, fuel efficiency is on a ratio scale, and manufacturer is a nominal variable.)
You want to quickly do some stats to see if Mazdas, Saturns, and Suzukis differ in their fuel efficiencies. You suspect that they do differ from one another, and will go with the manufacturer with the best fuel efficiency.
Let’s dissect what you are doing here:
There are three different car manufacturers, thus there are three levels of the factor Manufacturer. The three levels are Mazda, Saturn, and Suzuki.
Here we are showing you the distribution of each in a stacked histogram and boxplot.
Independent variable → dependent variable
You want to see if the fuel efficiency is different across car manufacturers; you want to test if your independent variable, the factor Manufacturer, has an effect on your dependent variable, Fuel Efficiency.
To help you see this difference, we have graphed the means as a bar plot. On the x axis is car manufacturer and on the y axis, fuel efficiency. The height of each bar represents the average value for that group, while the vertical error bars represent 1 standard error of the mean above and below the mean, that is, 1 standard deviation divided by the square root of the sample size. This is a common measure of spread.
A point graph summarizes the same information - the points represent the means and the bars the standard error. Both of these are common graphical displays. Note, however, that the scales are slightly different.
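For reference, the standard error plotted in those error bars is easy to compute; here is a minimal Python helper (scipy.stats.sem gives the same value):

```python
import numpy as np

def standard_error_of_mean(x):
    """SEM: sample standard deviation divided by the square root of n."""
    x = np.asarray(x, dtype=float)
    return np.std(x, ddof=1) / np.sqrt(len(x))
```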
When dealing with ANOVAs, you do not refer to the individual differences between means. Rather, you use the term main effect when inspecting the effect of an independent variable, on a dependent variable. Here the independent variable is “Manufacturer”, and the dependent variable is “Miles Per Gallon in the City”. The ANOVA informs you if the factor (manufacturer) has a main effect on the dependent variable.
Just like the t-test, a one-way ANOVA has a single null hypothesis. Let’s remind ourselves that the null hypothesis of a two-sample t-test is that the means of two groups are equal. Again, the difference between a t-test and a one-way ANOVA is that the ANOVA is used to evaluate more than two groups.
Therefore, the null hypothesis of a one-way ANOVA is a bit broader, and states that the means of all the groups are equal.
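In symbols, for a one-way ANOVA with k groups:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$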
If the null hypothesis is true and the means of all the groups are not different from each other, then there can be only one possibility: the groups are equivalent.
However, if the null hypothesis is false, then the reason is not as clear: were all the means different from one another? Is the mean of one group different from another, but not from a third? Clearly, there are many possibilities, and these possibilities grow as the levels of your factor increase.
The truth is, the alternative hypothesis simply states that at least one mean is different from the others. If you do an ANOVA and reject the null hypothesis in favour of the alternative, then you can conclude that at least one of the samples was obtained from a different population than the other samples.
Referring to our cars example, if the one-way ANOVA on fuel efficiency yielded a statistically significant result causing us to reject the null hypothesis in favour of the alternative hypothesis, then you would only be able to conclude that the fuel efficiency of [insert logos used previously] at least one of the car manufacturers is different from the others. At the end of this module, we'll discuss the steps you would then take to identify where the difference exists.
As we mentioned earlier in the module, an ANOVA is computed by taking some measure of the difference between groups and comparing that to some measure of variability. Our sums of squares are a measure of the variability in the sample. However, as we’ve already suggested many times, the aim of these statistical tests is to use measures obtained from samples to estimate the population characteristics.
The SS represents the total variation in the current sample. We need to convert the SS to Mean Squares (or MS for short), which is an estimate of the variation in the population. There’s a rather simple way to correct the SS to better reflect the variance in the population, and that’s by dividing the SS by the degrees of freedom in order to obtain the mean squares.
Degrees of freedom (or df) can be calculated easily, and there are unique degrees of freedom associated with each SS. You can think of degrees of freedom as how many squared deviations go into computing the respective SS, minus the number of means used in the SS calculation. So for SS total, there are N squared deviations, where N is the total number of data points, and a single grand mean, so the df is N - 1. Similarly, for the SS group, there are k group means and a single grand mean, so the df is k - 1. Lastly, the df for error is most simply thought of as the leftover degrees of freedom, so you can subtract the degrees of freedom for group from the degrees of freedom total.
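Putting numbers to this for our cars example, here is a small Python sketch. The group size of 8 cars per manufacturer is an assumption, chosen to be consistent with the t(14) two-sample t-tests reported later in this module:

```python
N = 24  # total observations (assumed: 8 cars for each of the 3 manufacturers)
k = 3   # levels of the factor Manufacturer (Mazda, Saturn, Suzuki)

df_total = N - 1                 # N squared deviations, one grand mean -> 23
df_group = k - 1                 # k group means, one grand mean        -> 2
df_error = df_total - df_group   # the leftover df, equivalently N - k  -> 21
```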
The table we just described to organize the variables used in calculating an F value is known as the ANOVA summary table, and it does exactly that: it summarizes the results from your ANOVA. At a glance, you can identify your sources of variation, how much variation is associated with each of these sources (sums of squares), how many parameters are involved with each of these sources (degrees of freedom), and the variation corrected to represent the population variation (mean squares). There is a final column on the right that represents the output of the significance test you've done. This column essentially tells you whether the amount of variation accounted for by the group is significant. In other words, does the group or factor have a significant effect on the dependent measure? This is an important component of the ANOVA, as the F value will determine the inferences you are able to make regarding your data. Up next we will discuss how to compute the F value and the implications of different F values.
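Before we turn to Excel, here is a minimal Python sketch (not the module's required method) that computes every column of the summary table from raw data, so you can see how the pieces fit together:

```python
import numpy as np

def one_way_anova(groups):
    """Compute the quantities in a one-way ANOVA summary table.

    `groups` is a list of 1-D arrays, one per level of the factor."""
    data = np.concatenate(groups)
    grand_mean = data.mean()
    k, N = len(groups), len(data)

    ss_total = np.sum((data - grand_mean) ** 2)
    ss_group = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = ss_total - ss_group          # within-groups variability

    df_group, df_error = k - 1, N - k
    ms_group = ss_group / df_group          # SS corrected by df -> mean squares
    ms_error = ss_error / df_error

    return {"SS_group": ss_group, "SS_error": ss_error, "SS_total": ss_total,
            "df_group": df_group, "df_error": df_error,
            "MS_group": ms_group, "MS_error": ms_error,
            "F": ms_group / ms_error}
```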
We are going to show you how to do an ANOVA using Excel.
The first thing we need to do is make sure that you have the data analysis pack installed in Excel, so let's go over here to the start menu and down to Excel Options. Look through this menu and find "Add-Ins".
From the Add-Ins menu, we will select "Analysis ToolPak" and hit OK. Now in the Data ribbon you should see a section called Data Analysis. We can start by clicking Data Analysis and selecting "Anova: Single Factor". We're going to define our input range, which is to say tell Excel where to find the data, by clicking and dragging over the entire table, including the names.
Then we indicate that the data are grouped by columns, so each manufacturer is a column, and that we have labels for those manufacturers in the first row: Saturn, Mazda, and Suzuki. Now we indicate the alpha level we want to test at and choose to place the output in a new sheet. As you can see, Excel has done it for us. We'll just expand this. We can see some summary statistics at the top and the actual ANOVA results here on the bottom.
Note that you can find instructions for how to install the Excel package and how to complete the ANOVA at the following links.
You’ll notice that we get some summary statistics above our ANOVA table and that this table is laid out like the one we previously showed you. But we do see two additional pieces of information: a p-value and an F-critical value.
Just like a t-test, the F-test compares a value to a theoretical distribution (in this case the F-distribution) to determine if our observed value is more extreme than we would expect by chance. The cutoff point is our "critical F value", or F-critical.
It corresponds to the area shaded red on this F density curve.
Our p-value is the chance of getting an F value as large as the one we observed (or larger) if the null hypothesis were true, that is, if there were no difference between the groups.
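In Python, both values can be read off the F distribution with scipy. The df values assume the group sizes from the earlier sketch, and the observed F here is only a placeholder for illustration:

```python
from scipy import stats

alpha = 0.05
df_group, df_error = 2, 21   # from the cars example above (assumed group sizes)

# F-critical: the cutoff beyond which alpha (5%) of the F distribution lies.
f_crit = stats.f.ppf(1 - alpha, df_group, df_error)

f_observed = 7.3   # placeholder observed F for illustration
p_value = stats.f.sf(f_observed, df_group, df_error)  # P(F >= observed | H0)
print(f"F-critical = {f_crit:.2f}, p = {p_value:.4f}")
```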
Throughout these modules, we've been talking about how confident we are that differences between scores exist, but just as important is the size of those differences. Remember, however, that with an ANOVA we can only talk about a main effect, in which at least one mean, or some combination of means, is not identical to the others.
We can characterize how big our effect is by asking how much variability it explains. A substantial effect should account for a lot of the variability, leaving comparatively little unexplained; a tangential, less important factor will explain much less.
For example we’d expect type of engine, hybrid or not, to have a huge impact on a car’s fuel efficiency. Manufacturer would have a smaller effect by comparison.
We can measure this difference using a statistic called eta squared. We calculate it as the ratio of the sums of squares group over sums of squares total. Because sums of squares group is partitioned off from sums of squares total, this creates a ratio that ranges from zero (when there is absolutely no effect of group differences) to 1 (when all the variability is due to group). In practice, the ratio is usually somewhere in the middle.
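As a formula, eta squared is a one-liner; here is a sketch using the output of the one_way_anova() example from earlier:

```python
def eta_squared(ss_group, ss_total):
    """Share of total variability explained by group membership (0 to 1)."""
    return ss_group / ss_total

# e.g. with the earlier sketch:
# result = one_way_anova(groups)
# eta_squared(result["SS_group"], result["SS_total"])
```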
Although what constitutes a large effect will vary depending on what you are studying, in the social sciences, we often use the following rules of thumb if we have no better way to gauge the size of an effect:
Values around 0.02 are considered small effects
Values around 0.13 are considered to be medium effects, and
Values around 0.26 are considered to be large effects
If we refer back to our example about fuel efficiency, eta squared equals .41. This means 41% of the variance in gas mileage is accounted for by manufacturer.
This seems like a fair amount. But does a statistically significant difference of about 4 MPG really matter to our decision? Maybe, but it's probably not as important to our decision as other factors. We should always keep in mind the distinction between statistical significance and practical significance.
So now we’ve concluded that there is a significant main effect of car manufacturer on fuel efficiency and we’ve thought about how big that effect might be.
But now what? You still don’t know which car has the best fuel efficiency. The purpose of an ANOVA is to tell you whether a difference exists. It’s standard practice to follow up a significant ANOVA with t-tests. In most cases, we wouldn’t do this if the ANOVA didn’t detect a difference.
Now that we've observed a difference, we can go back and do two-sample t-tests between Mazda & Suzuki, Mazda & Saturn, and Suzuki & Saturn.
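Here is a sketch of how those three pairwise tests might be run in Python. The MPG values are synthetic stand-ins generated for illustration, since the module's actual cars data are not reproduced in this transcript:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mpg = {  # synthetic city-MPG samples, 8 cars per manufacturer (assumed)
    "Mazda":  rng.normal(22, 2, 8),
    "Saturn": rng.normal(26, 2, 8),
    "Suzuki": rng.normal(24, 2, 8),
}

for (name_a, a), (name_b, b) in combinations(mpg.items(), 2):
    t, p = stats.ttest_ind(a, b)   # pooled-variance two-sample t-test
    print(f"{name_a} vs {name_b}: t({len(a) + len(b) - 2}) = {t:.2f}, p = {p:.3f}")
```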
Additionally, a plot of the group means will really assist in interpreting your results.
As it turns out, if you do three t-tests between the different car manufacturers, you see that the difference between Mazda and Suzuki is not significant (t(14)=-1.32, p=0.21), nor is the difference between Suzuki and Saturn (t(14)=-1.69, p=0.11), but the difference between Mazdas and Saturns is significant (t(14)=-2.92, p=0.011).
So our ANOVA revealed to us that Car Manufacturer has a significant main effect on Fuel Efficiency. We further investigated the significant main effect with t-tests and can conclude that the main effect of car manufacturer represents a significant difference in fuel efficiency between Mazdas and Saturns (p=0.011), but not between any other groups.
This module has focused on a very basic but fundamental type of ANOVA, known as the one-way ANOVA. In the one-way ANOVA, you have one factor with three or more levels, and you want to see whether the dependent variable is affected by varying the levels of the factor. We won't cover them here, but there are other types of ANOVA for more complicated data sets.
One is a factorial ANOVA, in which there are multiple factors at once. For example, you might compare fuel efficiency by manufacturer AND by whether the vehicle is an SUV or not.
Another is repeated measures. Just like a repeated-measures t-test, it allows you to look at two or more measurements on the same subjects. For example, we might look at each manufacturer's cars and compare their fuel efficiency for both highway and city driving.
To summarize the main learning points in this module, we first learned that a one-way analysis of variance is needed when you are comparing the means of 3 or more different groups. You start with the null hypothesis that all group means are equal, and use an ANOVA to look for evidence against this null.
You do so by measuring the variance in the data using the sum of squares. You then partition the sum of squares into its components, to get a sum of squares group (the variability between the groups) and a sum of squares error (the variability within the groups). Using the respective degrees of freedom for each of the components, you can calculate mean squares group and mean squares error. You then simply divide MS group by MS error to obtain an F value, which you compare to the critical F in the F table to determine whether Group has a significant effect on your dependent measure. If the observed F is greater than the critical F, then you know that Group has a significant effect on the measure, that is, that at least one of the means is significantly different from another. Only then can you identify the source of this effect by computing individual tests between each sample to see which group means significantly differ from one another.
Furthermore, we explained the three main underlying assumptions we make when conducting an ANOVA: the Assumption of Normality, the Assumption of Homogeneity of Variance, and the Assumption of Independent Observations. Although we did not discuss how to test the validity of these assumptions in your sample set, we hope you understand why these assumptions are required for you to make any meaningful inferences from the results of your ANOVA.
Finally, we discussed how to determine not just whether there is a difference between groups, but how large that effect is, using eta squared.