Introduction
In general, a chi-square test is a test based on the chi-square probability distribution. Here, we discuss Pearson's chi-square test for two-way contingency tables. The data is qualitative: each observation is grouped by a characteristic and by a response or outcome. For example, we might ask a group of people what type of juice they prefer: orange, apple, cranberry or other. We might be curious about whether juice preference differs by gender. Is the percentage of men who prefer orange juice different from the percentage of women who prefer orange juice? The characteristic is gender (male or female) and the response is the preferred juice (orange, apple, cranberry, other). An observation is the gender and juice preference of a person in our sample.
In some instances, the data is summarized in a table. A table entry contains the number of observations that have a particular characteristic and a particular response. The formulas that follow assume that we've summarized the data in a table.
Formulas
Let:
$O_{cr}$ be the observed count for category c and response r
$E_{cr}$ be the expected count for category c and response r
The observed count comes directly from the sample data. The expected count can be calculated as follows:
Expected Count = (Category Total) x (Response Total)/(Total Observations)
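The expected-count formula can be applied to a whole table at once. The following is a minimal sketch in Python, assuming the table is stored as a list of rows (one row per category, one column per response). The function name `expected_counts` and the 2 x 2 example table are hypothetical and serve only as an illustration.

```python
def expected_counts(observed):
    """Return the expected counts for a two-way table of observed counts.

    `observed` is a list of rows: one row per category, one column per response.
    """
    row_totals = [sum(row) for row in observed]        # category totals
    col_totals = [sum(col) for col in zip(*observed)]  # response totals
    grand_total = sum(row_totals)                      # total observations
    return [
        [row_total * col_total / grand_total for col_total in col_totals]
        for row_total in row_totals
    ]


# Hypothetical 2 x 2 table, invented purely for illustration.
example = [[30, 10],
           [20, 40]]
print(expected_counts(example))  # [[20.0, 20.0], [30.0, 30.0]]
```

Each expected count depends only on the row total, the column total and the sample size, never on the individual cell counts.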
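Under the null hypothesis, the following statistic has (approximately) a chi-square distribution with degrees of freedom equal to (C-1)(R-1), where C is the number of possible characteristics and R is the number of possible responses.

$$\chi^2 = \sum_{c=1}^{C} \sum_{r=1}^{R} \frac{(O_{cr} - E_{cr})^2}{E_{cr}}$$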
The hypotheses are:
H0 : $O_{cr} = E_{cr}$ for all c and r
Ha : $O_{cr} \neq E_{cr}$ for at least one combination of c and r
The null hypothesis is that there is no association between the characteristic and the response. The alternative hypothesis is that a relationship exists between the characteristic and the response. Put another way, the alternative hypothesis says that the response differs depending on the value of the characteristic.
Example 1.
An on-line music service company wants to know (for the purposes of designing a marketing campaign) if customer age is important in deciding to subscribe, and, if so, which age groups are most likely to subscribe to its services. The company has gathered a random sample of 1000 people from the population, and asked each person whether he or she would subscribe to the service. The company knows which age group the person falls into: under 18 years of age, 18 - 34 years of age, or 35 years or older. The company could use a chi-square test to answer its questions. How do we know this? We know that a chi-square test is useful because we are asking a question about the differences in outcomes (subscribe or don't subscribe) across people who differ by some characteristic (age group).
The data for this example might be the following:
| | Yes | No | Total |
|---|---|---|---|
| Under 18 | 120 | 41 | 161 |
| 18 - 34 | 262 | 103 | 365 |
| 35 and over | 237 | 237 | 474 |
| Total | 619 | 381 | 1000 |
Here, we have two responses (Yes, No), so R = 2. We have three characteristics (Under 18, 18-34, 35 and Over), so C = 3.
Overall, 619 people said they would subscribe, and 381 said that they would not subscribe. In percentage terms, 61.9% of the people said "yes" and 38.1% said "no". Did that percentage differ by age group?
- In the Under-18 age group, 120 out of 161 (or 74.5%) of the people said "yes"
- For the 18-34 year olds, 262 out of 365 (71.8%) said "yes"
- For the 35 and older group, 237 out of 474 (50%) said "yes"
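These percentages can be checked with a short script. The following is a minimal sketch in Python; the dictionary layout and the `percent_yes` helper are illustrative choices, not part of the original analysis.

```python
# Observed (yes, no) counts for each age group in the sample.
observed = {
    "Under 18": (120, 41),
    "18 - 34": (262, 103),
    "35 and over": (237, 237),
}


def percent_yes(yes, no):
    """Percentage of respondents in a group who said "yes"."""
    return 100 * yes / (yes + no)


for group, (yes, no) in observed.items():
    print(f"{group}: {percent_yes(yes, no):.1f}% yes")

# The overall percentage pools all age groups together.
total_yes = sum(yes for yes, _ in observed.values())
total_no = sum(no for _, no in observed.values())
print(f"Overall: {percent_yes(total_yes, total_no):.1f}% yes")
```

The chi-square test asks whether the gaps between the group percentages and the overall percentage are larger than random sampling alone would plausibly produce.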
The null hypothesis of the chi-square test is that the characteristic (age group) ''does not'' affect the outcome (subscribing to the service). More specifically, the null hypothesis says that the percentage of people overall who said "yes" is the same as the percentage who said "yes" in each age group. If the null hypothesis is true, then we should see 61.9% of the people in each group saying "yes". In other words, we'd expect to see the following data:
| | Yes | No | Total |
|---|---|---|---|
| Under 18 | 99.66 | 61.34 | 161 |
| 18 - 34 | 225.94 | 139.06 | 365 |
| 35 and over | 293.41 | 180.59 | 474 |
| Total | 619 | 381 | 1000 |
For example, 99.66 is 61.9% of the 161 Under 18 year olds who were in the sample. These are called the ''expected counts''. Since we are applying percentages to whole numbers, we won't get whole numbers out of the calculation. This is fine.
To calculate the chi-square statistic, for each cell take the difference between the Observed Count and the Expected Count, square the difference, and divide by the Expected Count, producing the following table:
| | Yes | No | Total |
|---|---|---|---|
| Under 18 | 4.152 | 6.745 | 161 |
| 18 - 34 | 5.757 | 9.353 | 365 |
| 35 and over | 10.844 | 17.618 | 474 |
| Total | 619 | 381 | 1000 |
To calculate the chi-square statistic, add the six cell entries in this table (the Total row and column hold sample counts and are not part of the sum) to obtain the result $\chi^2 = 54.468$. There are (3-1)x(2-1) = 2 degrees of freedom. The p-value for this test can be calculated using statistical software, by looking at a chi-square table, or using a chi-square p-value calculator. The p-value for this statistic is zero to three decimal places, so we reject the null hypothesis and conclude that the age groups differ in their willingness to subscribe to the service.
Specifically, rejection of the null hypothesis leads us to conclude that there is a relationship between the row and column variables. In this case, age is related to whether or not a person would subscribe to the music service.
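The whole calculation for this example can be reproduced in a few lines. The following is a minimal sketch in Python using only the standard library. It relies on one assumption that holds here but not in general: with exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(-x/2). The variable names are illustrative.

```python
import math

# Observed counts from the example: one row per age group, columns are (yes, no).
observed = [
    [120, 41],    # Under 18
    [262, 103],   # 18 - 34
    [237, 237],   # 35 and over
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell of the table.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: the closed form below is valid only for 2 degrees of freedom.
assert dof == 2
p_value = math.exp(-chi_square / 2)

print(f"chi-square = {chi_square:.3f}, df = {dof}, p-value = {p_value:.3g}")
```

For tables with other dimensions the closed form no longer applies, and a statistics library is the practical choice. SciPy's `scipy.stats.chi2_contingency`, for instance, accepts a table of observed counts and returns the statistic, the p-value, the degrees of freedom and the expected counts.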
