Chi-square Tests

Introduction

In general, a chi-square test is any test based on the chi-square probability distribution. Here, we discuss Pearson's chi-square test for two-way contingency tables. The data is qualitative: each observation is classified by a characteristic and by a response or outcome. For example, we might ask a group of people what type of juice they prefer: orange, apple, cranberry or other. We might be curious about whether juice preference differs by gender. Is the percentage of men who prefer orange juice different from the percentage of women who prefer orange juice? The characteristic is gender (male or female) and the response is the preferred juice (orange, apple, cranberry, other). In the random sample, an observation is the gender and juice preference of one person.

In some instances, the data is summarized in a table. A table entry contains the number of observations from the sample that have a particular characteristic and a particular response. The formulas that follow assume that we've summarized the data in a table.

Intuition

The idea behind the chi-square test is to compare what we observe in the random sample to what we expect to observe when we assume that there is no relationship between the characteristic and the response. For example, suppose that 40% of the people in the random sample are men. Then we'd expect (if there is no relationship) 40% of the apple juice drinkers to be male, 40% of the orange juice drinkers to be male, and so on. Equivalently, suppose that 10% of the people in the sample drink cranberry juice. Then we'd expect 10% of the men to drink cranberry juice and 10% of the women to drink cranberry juice.

We'll compare the observed percentages to the expected percentages. If the observed and expected percentages differ by more than is implied by random chance, we'll conclude that there is a relationship between the characteristic and the response. In the example, suppose that 15% of the men drink cranberry juice and 5% of the women drink cranberry juice. This difference could lead us to conclude that there is a significant difference between the percentage of women and men who prefer cranberry juice.

Formulas

Let:
$O_{c,r}$ be the observed count for category c and response r
$E_{c,r}$ be the expected count for category c and response r

The observed count comes directly from the sample data. The expected count can be calculated as follows:

Expected Count = (Category Total) x (Response Total)/(Total Observations)
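The expected-count formula can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts` and the list-of-rows layout are assumptions made here, and the cell counts are the ones used in Example 1 further down this page.

```python
def expected_counts(observed):
    """Expected counts under "no relationship" for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: (row total) x (column total) / (total observations)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


# Cell counts from Example 1 (rows: Yes, No; columns: Under 18, 18-34, 35 and over).
observed = [
    [120, 262, 237],
    [41, 103, 237],
]

expected = expected_counts(observed)
for label, row in zip(["Yes", "No"], expected):
    print(label, " ".join(f"{e:.3f}" for e in row))
```

The row and column totals are computed from the cell counts, so the table passed in should contain only the cells, not the Total row or column.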

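Under the null hypothesis, the following statistic has approximately a chi-square distribution with degrees of freedom equal to (C-1)(R-1), where C is the number of possible characteristics and R is the number of possible responses:

$$\chi^2 = \sum_{c=1}^{C} \sum_{r=1}^{R} \frac{(O_{c,r} - E_{c,r})^2}{E_{c,r}}$$

Here $O_{c,r}$ and $E_{c,r}$ are the observed and expected counts for category c and response r.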

The null hypothesis states that there is no association between the characteristic and the response (or, equivalently, between the row variable and the column variable). The alternative hypothesis is that a relationship exists between the characteristic and the response; in other words, the response differs depending on the value of the characteristic.

The p-value of the test statistic is $P(X^2 \geq \chi^2)$, where $X^2$ is a chi-square random variable with df = (R-1)(C-1) and $\chi^2$ is the observed value of the test statistic.
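In practice the p-value usually comes from statistical software (for example, SciPy provides `scipy.stats.chi2.sf`). As a dependency-free sketch, the upper-tail probability has a simple closed form when the degrees of freedom are even. The helper name `chi2_sf_even_df` is an assumption made for this illustration, and the function deliberately refuses odd degrees of freedom, where this shortcut does not apply.

```python
import math


def chi2_sf_even_df(x, df):
    """P(X^2 >= x) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    if x < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    half = x / 2.0
    term = 1.0   # (x/2)^j / j! for j = 0
    total = 1.0  # running sum over j = 0 .. df/2 - 1
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# With df = 2 the formula reduces to exp(-x / 2).
print(chi2_sf_even_df(5.991, 2))  # close to 0.05, the usual 5% cut-off for df = 2
```

For odd degrees of freedom, or as a cross-check, a statistics library or a printed chi-square table is the safer route.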

Example 1.

An on-line music service company wants to know (for the purposes of designing a marketing campaign) whether customer age is important in deciding to subscribe and, if so, which age groups are most likely to subscribe to its services. The company has gathered a random sample of 1000 people from the population and asked each person whether he or she would subscribe to the service. The company also knows which age group each person falls into: under 18 years of age, 18-34 years of age, or 35 years or older. The company could use a chi-square test to answer its questions. How do we know this? A chi-square test is useful because we are asking about differences in outcomes (subscribe or don't subscribe) across people who differ by some characteristic (age group).

The data for this example is:

         Under 18   18-34   35 and over   Total
Yes         120      262        237         619
No           41      103        237         381
Total       161      365        474        1000

Here, we have two responses (rows) (Yes, No), so R = 2. We have three characteristics (columns) (Under 18, 18-34, 35 and Over), so C = 3.

Overall, 619 people said they would subscribe, and 381 said that they would not subscribe. In percentage terms, 61.9% of the people said "yes" and 38.1% said "no". Did that percentage differ by age group?

  • In the Under-18 age group, 120 out of 161 (or 74.5%) of the people said "yes"
  • For the 18-34 year olds, 262 out of 365 (71.8%) said "yes"
  • For the 35 and older group, 237 out of 474 (50%) said "yes"
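The percentages in the list above can be reproduced with a short Python sketch. The variable names are chosen here for illustration; the counts are the "Yes" and "No" cells from the data table.

```python
age_groups = ["Under 18", "18-34", "35 and over"]
yes_counts = [120, 262, 237]
no_counts = [41, 103, 237]

yes_percent = {}
for group, yes, no in zip(age_groups, yes_counts, no_counts):
    group_total = yes + no  # everyone sampled in this age group
    yes_percent[group] = 100 * yes / group_total
    print(f"{group}: {yes} of {group_total} said yes ({yes_percent[group]:.1f}%)")
```

Note that each group total is the sum of that group's "Yes" and "No" counts, so the 35 and over group has 237 + 237 = 474 people.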

The null hypothesis of the chi-square test is that the characteristic (age group) ''does not'' affect the outcome (subscribing to the service). More specifically, the null hypothesis says that the percentage of people who said "yes" in each age group is the same as the percentage who said "yes" overall. If the null hypothesis is true, then we should see about 61.9% of the people in each group saying "yes". In other words, we'd expect to see the following data:

         Under 18   18-34    35 and over   Total
Yes        99.66    225.94      293.41       619
No         61.34    139.06      180.59       381
Total     161       365         474         1000

For example, 99.66 is 61.9% of the 161 people under 18 who were in the sample. These are called the ''expected counts''. Since we are applying percentages to whole numbers, we won't generally get whole numbers out of the calculation. This is fine.

To calculate the chi-square statistic, take the difference between the Observed Count and the Expected Count in each cell, square the difference, and divide by the Expected Count. This produces the following table of cell contributions (the Total row and column simply repeat the totals from the data):

         Under 18   18-34    35 and over   Total
Yes        4.152     5.757      10.844       619
No         6.745     9.353      17.618       381
Total    161       365         474          1000

To calculate the chi-square statistic, add the six cell values in this table (excluding the Total row and column) to obtain the result $\chi^2 = 54.468$. There are (3-1)x(2-1) = 2 degrees of freedom. The p-value for this test can be calculated using statistical software, by looking at a chi-square table, or using a chi-square p-value calculator. The p-value for this statistic is zero to three decimal places, so we reject the null hypothesis and conclude that the age groups differ in their decision to purchase the service.
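The whole calculation for this example can be checked with a short Python sketch. It is an illustration under stated assumptions rather than a general-purpose routine: the variable names are chosen here, and the p-value line uses the fact that with exactly 2 degrees of freedom the upper-tail chi-square probability equals exp(-x/2), which does not hold for other degrees of freedom.

```python
import math

# Observed cell counts (rows: Yes, No; columns: Under 18, 18-34, 35 and over).
observed = [
    [120, 262, 237],
    [41, 103, 237],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total) x (column total) / (total observations).
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Contribution of each cell: (observed - expected)^2 / expected.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

statistic = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: this shortcut is valid only because df == 2 in this example.
assert df == 2
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.3g}")
```

For tables with other dimensions, the p-value step would need a chi-square table or a statistics library in place of the exp(-x/2) shortcut.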

Specifically, rejection of the null hypothesis leads us to conclude that there is a relationship between the row and column variables. In this case, age is related to whether or not a person would subscribe to the music service.
