One-way fixed effects ANOVA

Introduction

Analysis of Variance (ANOVA) is a technique used to compare the means of a measurement across several populations. For example, we might wish to know whether mean income is the same in three cities: New York, Chicago, and Los Angeles. We take a sample of people from each city and calculate the average income in each sample. We then ask whether the sample means differ enough to conclude that the population mean income is not the same across the three cities.

Formulas

We have measurements on I populations, indexed by i = 1, 2, ..., I. For population i we have $n_i$ measurements, $x_{ij}$, j = 1, 2, ..., $n_i$. In the one-way ANOVA model, we assume that

$x_{ij} = \mu_i + \varepsilon_{ij}$.

The term $\varepsilon_{ij}$ is a random variable that has a normal distribution with a mean of zero and a variance of $\sigma^2$ (that is, a standard deviation of $\sigma$), and the error terms are independent of one another.
That is, a measurement $x_{ij}$ from population i has a normal distribution with mean $\mu_i$ and variance $\sigma^2$. Only the means of the populations differ; the standard deviations of the populations are the same.

Our null and alternative hypotheses are:

$H_0: \mu_1 = \mu_2 = \cdots = \mu_I$
$H_a:$ not all of the $\mu_i$ are equal

In words, the null hypothesis says that all of the measurements came from the same normal population, with the same mean and standard deviation. The alternative hypothesis says that at least one group of the measurements came from a normal population that has a population mean that is different from the population means for the other groups of measurements.

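Let $N = n_1 + n_2 + \cdots + n_I$ be the total number of measurements, let $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ be the sample mean of group i, and let $\bar{x} = \frac{1}{N}\sum_{i=1}^{I}\sum_{j=1}^{n_i} x_{ij}$ be the overall mean. The test statistic is based on several sum-of-squares calculations:

  • SSG: Sum of Squares Group. This is a measure of the difference between the group means and the overall mean: $SSG = \sum_{i=1}^{I} n_i (\bar{x}_i - \bar{x})^2$.
  • SSE: Sum of Squares Error. This is a measure of the within-group variation (how much does each observation in a group vary from the mean of that group?): $SSE = \sum_{i=1}^{I}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = \sum_{i=1}^{I} (n_i - 1) s_i^2$, where $s_i$ is the sample standard deviation of group i.
  • SST: Sum of Squares Total. This is a measure of the overall variation in the data (how much does each observation vary from the overall mean?): $SST = \sum_{i=1}^{I}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$.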

With some algebra, we could show that SST = SSG + SSE. Intuitively, the variance in the data when it is all grouped together can be divided into the two pieces: a measure of the variance among the group means (SSG) and the variance within the groups (SSE).
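
The algebra behind SST = SSG + SSE can be sketched as follows. The notation is an assumption of this sketch: $\bar{x}_i$ denotes the sample mean of group i and $\bar{x}$ the overall mean of all N measurements. Each deviation from the overall mean is split into a within-group part and a between-group part, then squared and summed:

```latex
\begin{align*}
x_{ij} - \bar{x} &= (x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x}) \\
\sum_{i=1}^{I}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
  &= \sum_{i=1}^{I}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
   + \sum_{i=1}^{I} n_i (\bar{x}_i - \bar{x})^2
   + 2\sum_{i=1}^{I} (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) \\
  &= SSE + SSG + 0
\end{align*}
```

The cross term vanishes because the deviations within each group sum to zero: $\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) = 0$ for every i.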

The following table summarizes the calculation of the F-statistic for the test of the null hypothesis that the means are the same across the I groups:

Source   Degrees of Freedom (DF)   Sum of Squares (SS)   Mean Square (MS)   F
Groups   I - 1                     SSG                   MSG = SSG/DFG      MSG/MSE
Error    N - I                     SSE                   MSE = SSE/DFE
Total    N - 1                     SST

Under the null hypothesis, the statistic F = MSG/MSE has an F-distribution with I - 1 and N - I degrees of freedom. This is written as F(I-1, N-I).

If F exceeds a pre-chosen critical value (the upper-α quantile of F(I-1, N-I) for a test at significance level α), we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis.

Statistical theory

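Note that SST = SSE + SSG. This splits one quadratic form in the data into two. Under the model assumptions:

  • SST/$\sigma^2$ has a $\chi^2$ distribution with N-1 degrees of freedom when the null hypothesis is true
  • SSE/$\sigma^2$ has a $\chi^2$ distribution with N-I degrees of freedom, whether or not the null hypothesis is true
  • SSG/$\sigma^2$ has a $\chi^2$ distribution with I-1 degrees of freedom when the null hypothesis is true, and it is independent of SSE

Note that MSE = SSE/(N-I) is an unbiased estimate of $\sigma^2$. Similarly, under the null hypothesis, MSG = SSG/(I-1) is an unbiased estimate of $\sigma^2$; when the group means differ, MSG tends to be larger than $\sigma^2$. Hence, if the null hypothesis is true, the F statistic tends to be close to one, and large values of F are evidence against the null hypothesis.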
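
The claim that both mean squares estimate the error variance under the null hypothesis can be checked with a small simulation. The following is a minimal sketch, not part of the original text. The function names, the group sizes, the common mean of 50, and the standard deviation of 2 are all hypothetical values chosen for illustration.

```python
import random


def anova_mean_squares(groups):
    """Return (MSG, MSE) for a list of groups, each a list of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return ssg / (len(groups) - 1), sse / (n_total - len(groups))


def simulate_under_null(n_groups=3, n_per_group=4, mu=50.0, sigma=2.0,
                        reps=5000, seed=0):
    """Average MSG and MSE over repeated samples that share one mean (H0 true)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    msg_total = mse_total = 0.0
    for _ in range(reps):
        groups = [[rng.gauss(mu, sigma) for _ in range(n_per_group)]
                  for _ in range(n_groups)]
        msg, mse = anova_mean_squares(groups)
        msg_total += msg
        mse_total += mse
    return msg_total / reps, mse_total / reps


# Hypothetical settings: sigma = 2, so the target variance is 4.
avg_msg, avg_mse = simulate_under_null()
print(f"average MSG = {avg_msg:.2f}, average MSE = {avg_mse:.2f}, sigma^2 = 4.00")
```

Both averages should land near 4. Giving the groups different means in the simulation would leave the average MSE near 4 while pushing the average MSG upward, which is why large F values count against the null hypothesis.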

Example

ANOVA calculations are almost always done using statistical software, since the calculations are tedious. Here is an example worked by hand to demonstrate the calculations.

Suppose we have three instructors teaching a statistics class, Beth, Kathy and Laura. We randomly sample four students from each instructor. On the first exam, the scores for the 12 students were as follows:

Beth Kathy Laura
52 41 51
48 50 32
46 44 40
35 37 31

We wish to test the null hypothesis that the mean scores on the exam were the same for all three instructors. Since we have three populations (the students of Beth, the students of Laura, the students of Kathy), and we want to compare means, we use ANOVA.

In this example, I = 3, $n_i$ = 4 for i = 1, 2, 3, and N = 12. Summary statistics for the three samples from the three populations are given in the following table:

Instructor   Mean ($\bar{x}_i$)   Standard deviation ($s_i$)
Beth         45.25                7.27
Kathy        43.00                5.48
Laura        38.50                9.26

The overall mean for all twelve measurements is $\bar{x}$ = 42.25, and the overall standard deviation is s = 7.39.
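
The summary statistics can be recomputed from the raw exam scores. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and the scores are the twelve values listed in the table of exam scores.

```python
from statistics import mean, stdev

# Exam scores from the example, one list per instructor.
scores = {
    "Beth": [52, 48, 46, 35],
    "Kathy": [41, 50, 44, 37],
    "Laura": [51, 32, 40, 31],
}

# Sample mean and sample standard deviation (n - 1 denominator) per group.
summary = {name: (mean(xs), stdev(xs)) for name, xs in scores.items()}
for name, (m, s) in summary.items():
    print(f"{name:<6} mean = {m:.2f}  s = {s:.2f}")

# Overall mean and standard deviation of all twelve scores pooled together.
all_scores = [x for xs in scores.values() for x in xs]
overall_mean = mean(all_scores)
overall_sd = stdev(all_scores)
print(f"Overall mean = {overall_mean:.2f}  s = {overall_sd:.2f}")
```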

The ANOVA table for this data is:

Source   DF   SS      MS     F      p-value
Group    2    94.5    47.3   0.84   0.463
Error    9    505.8   56.2
Total    11   600.3

To calculate degrees of freedom:

  • there are I = 3 groups, so DFG = I - 1 = 2
  • there are N = 12 measurements and I = 3 groups, so DFE = N - I = 12 - 3 = 9
  • there are N = 12 measurements, so DFT = N - 1 = 11.

To calculate SS:

  • $SSG = 4(45.25 - 42.25)^2 + 4(43.00 - 42.25)^2 + 4(38.50 - 42.25)^2 = 94.5$
  • $SSE = (4 - 1)(7.27)^2 + (4 - 1)(5.48)^2 + (4 - 1)(9.26)^2 = 505.8$ (computed from the unrounded standard deviations)
  • SST = SSG + SSE = 94.5 + 505.8 = 600.3

To calculate MS:

  • MSG = SSG/DFG = 94.5/2 = 47.3
  • MSE = SSE/DFE = 505.8/9 = 56.2

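To calculate the F-statistic: F = MSG/MSE = 47.3/56.2 = 0.84. Under the null hypothesis, this statistic has an F-distribution with (2, 9) degrees of freedom. The probability of observing a value of 0.84 or greater when the null hypothesis is true is 0.463. Since 0.463 is greater than 0.05, we fail to reject the null hypothesis at the 5% significance level: the data do not provide evidence that the mean exam scores differ among the three instructors.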
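
The whole ANOVA table can be reproduced from the raw scores with a few lines of Python. This is a sketch under one stated assumption: the p-value uses a closed form that holds only when the numerator degrees of freedom equal 2, namely P(F > f) = (1 + 2f/DFE)^(-DFE/2). For other designs a library routine such as `scipy.stats.f.sf` would be needed instead. Variable names are illustrative.

```python
# Exam scores from the example, one list per instructor.
scores = {
    "Beth": [52, 48, 46, 35],
    "Kathy": [41, 50, 44, 37],
    "Laura": [51, 32, 40, 31],
}

groups = list(scores.values())
n_groups = len(groups)                      # I
n_total = sum(len(g) for g in groups)       # N
grand_mean = sum(sum(g) for g in groups) / n_total
group_means = [sum(g) / len(g) for g in groups]

# Sums of squares, following the definitions of SSG, SSE and SST.
ssg = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
sst = sum((x - grand_mean) ** 2 for g in groups for x in g)

# Degrees of freedom, mean squares and the F statistic.
dfg, dfe = n_groups - 1, n_total - n_groups
msg, mse = ssg / dfg, sse / dfe
f_stat = msg / mse

# Assumption: closed-form upper tail of F(2, dfe); not valid unless dfg == 2.
if dfg != 2:
    raise ValueError("closed-form p-value below requires 2 numerator df")
p_value = (1 + 2 * f_stat / dfe) ** (-dfe / 2)

print(f"Group  DF={dfg:<3} SS={ssg:7.1f}  MS={msg:5.1f}  F={f_stat:.2f}  p={p_value:.3f}")
print(f"Error  DF={dfe:<3} SS={sse:7.1f}  MS={mse:5.1f}")
print(f"Total  DF={n_total - 1:<3} SS={sst:7.1f}")
```

Computing SST directly from the data, rather than as SSG + SSE, gives an independent check on the decomposition.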

Minitab commands

Stat > ANOVA > One-Way (Unstacked)
Performs a one-way analysis of variance, with each group (factor level) in a separate column.

Stat > ANOVA > One-way
Performs a one-way analysis of variance, with the response variable in one column, factor levels in another.
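
The difference between the two data layouts can be illustrated with the exam scores from the example. The sketch below is plain Python, not Minitab syntax, and the variable names are hypothetical. It converts the unstacked layout (one column per factor level) into the stacked layout (one response column plus one factor-level column).

```python
# Unstacked layout: one column of responses per factor level.
unstacked = {
    "Beth": [52, 48, 46, 35],
    "Kathy": [41, 50, 44, 37],
    "Laura": [51, 32, 40, 31],
}

# Stacked layout: each row pairs a response with its factor level.
stacked = [(score, instructor)
           for instructor, column in unstacked.items()
           for score in column]

responses = [score for score, _ in stacked]      # the single response column
factor_levels = [level for _, level in stacked]  # the matching factor column

for score, level in stacked[:5]:
    print(score, level)
```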

Dialog box items

  1. Responses [in separate columns]: Enter the columns containing the values for the response variable. Each column represents a different factor level.
  2. Store residuals: Check to store residuals in the next available columns. The number of residual columns will match the number of response columns.
  3. Store fits: Check to store the fitted values (level means) in the next available column.
  4. Confidence level: Enter the confidence level. For example, enter 90 for 90%. The default is 95%.

