The goal of multiple regression is to develop a linear model that can be used to explain variation in, or to predict values of, the Y variable. Multiple regression refers to a model in which a dependent variable, Y, is a linear function of a set of independent variables, $X_1, X_2, \dots, X_p$. The population model is written as:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$
The Y variable is sometimes called the response variable, while the X's are sometimes referred to as the predictor variables. The term ε is a random variable that is uncorrelated with the independent variables.
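Suppose we have a random sample of N observations from the population: $(Y_i, X_{1i}, X_{2i}, \dots, X_{pi})$, $i = 1, \dots, N$. The sample regression line is written:
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_p X_{pi}$$
where $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ are the sample estimates of the population coefficients.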
We will use the sample to estimate the parameters of the population regression model. In order to do so, we assume that:
- The data $(Y_i, X_{1i}, \dots, X_{pi})$, $i = 1, \dots, N$, are a random sample from the population
- The random variable ε has a mean of zero
In order to conduct inference in the multiple regression model, we often make an additional assumption:
- The random variable ε has a normal distribution with a variance, $\sigma^2$, that is constant
Least squares estimation
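As in simple regression, the parameters of the regression model are estimated by finding the values of $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ that minimize the sum of squared errors:
$$SSE = \sum_{i=1}^{N} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \cdots - \hat{\beta}_p X_{pi} \right)^2$$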
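We can write the set of X variables in matrix form as $X = [\mathbf{1} \;\; X_1 \;\; X_2 \;\; \cdots \;\; X_p]$, where $\mathbf{1}$ is an N x 1 vector of ones and each $X_j$ is the N x 1 vector of observations on the j-th independent variable. The matrix X is N x (p+1). We can write the vector of coefficients as $\beta = (\beta_0, \beta_1, \dots, \beta_p)'$ and the N x 1 vector of observations on the dependent variable as Y. Then the least squares estimates of the coefficients are given by:
$$\hat{\beta} = (X'X)^{-1} X'Y$$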
The estimate is unbiased, which means that, on average, the estimate will equal the population value. More formally,
$$E[\hat{\beta}] = \beta$$
If we assume that the random variable ε has a normal distribution, then the estimate has a normal distribution with a mean of β and variance equal to ....
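
The least squares calculation described in this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are simulated, and the coefficients 1, 2, and -3, the error scale, and the sample size are made-up values that do not come from any real data set. The sketch solves the normal equations and cross-checks the result against NumPy's own least squares routine.

```python
import numpy as np

# Hypothetical simulated data: N = 50 observations, p = 2 predictors.
rng = np.random.default_rng(0)
N = 50
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
eps = rng.normal(scale=0.5, size=N)    # error term with mean zero
y = 1.0 + 2.0 * x1 - 3.0 * x2 + eps    # made-up population coefficients

# Design matrix X = [1, x1, x2], of shape N x (p + 1).
X = np.column_stack([np.ones(N), x1, x2])

# Least squares estimate: solve the normal equations (X'X) b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimum, the residuals are orthogonal to every column of X.
residuals = y - X @ beta_hat

print("normal equations:", beta_hat)
print("np.linalg.lstsq :", beta_lstsq)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly; in practice `np.linalg.lstsq` is usually the safer choice when the columns of X are nearly collinear.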
ANOVA
The analysis of variance (ANOVA) table summarizes several quantities associated with the regression equation:
| Source | DF | SS | MS | F |
|---|---|---|---|---|
| Regression | p | SSR | MSR | MSR/MSE |
| Error | N-p-1 | SSE | MSE | |
| Total | N-1 | SST | | |
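SST = Sum of Squares Total = $\sum_{i=1}^{N} (Y_i - \bar{Y})^2$
SSE = Sum of Squares Error = $\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$
SSR = Sum of Squares Regression = $\sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2$
Here $\bar{Y}$ is the sample mean of Y and $\hat{Y}_i$ is the fitted value for observation i.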
With some algebra, we could show that SST = SSE + SSR.
The degrees of freedom for SST are labeled DFT and equal N - 1. The degrees of freedom for SSE are labeled DFE and equal N - p - 1. The degrees of freedom for SSR are labeled DFR and equal p. So DFT = DFE + DFR.
MSR is the mean squared regression, and equals SSR/DFR. MSE is the mean squared error, and equals SSE/DFE.
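The F-statistic, F = MSR/MSE, is relevant for testing the null hypothesis that all of the population slope coefficients are equal to zero against the alternative that at least one of the population slope coefficients is not equal to zero:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{versus} \quad H_a: \text{at least one } \beta_j \neq 0$$
Under the null hypothesis and the normality assumption, F has an F distribution with p and N - p - 1 degrees of freedom.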
If we reject the null hypothesis, we conclude that there is at least one significant variable; in other words, we have enough evidence to conclude that at least one X variable explains variation in the Y variable. If we fail to reject the null hypothesis, we conclude that we don't have enough evidence to show that any of the X variables are significant.
The $R^2$ of the regression, $R^2 = SSR/SST$, measures the proportion of the variation in Y that is explained by the X variables (often quoted as a percentage). Note that $R^2 = 1 - SSE/SST$.
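
The quantities in the ANOVA table can be computed directly from their definitions. The sketch below is a minimal illustration, not a replacement for a statistics package: `anova_table` is a hypothetical helper name, and the six observations are made-up numbers chosen only so the example runs.

```python
import numpy as np

def anova_table(X, y):
    """Return the ANOVA quantities for a regression of y on X.

    X is assumed to be N x (p + 1) with a leading column of ones.
    """
    N, k = X.shape
    p = k - 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat
    y_bar = y.mean()

    sst = np.sum((y - y_bar) ** 2)      # total variation
    sse = np.sum((y - y_hat) ** 2)      # unexplained variation
    ssr = np.sum((y_hat - y_bar) ** 2)  # explained variation

    msr = ssr / p                       # SSR / DFR
    mse = sse / (N - p - 1)             # SSE / DFE
    return {
        "DFR": p, "DFE": N - p - 1, "DFT": N - 1,
        "SSR": ssr, "SSE": sse, "SST": sst,
        "MSR": msr, "MSE": mse,
        "F": msr / mse, "R2": ssr / sst,
    }

# Made-up data: N = 6 observations, p = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([9.1, 7.8, 19.2, 17.9, 29.0, 28.1])
X = np.column_stack([np.ones(len(y)), x1, x2])

table = anova_table(X, y)
for name, value in table.items():
    print(f"{name:>3}: {value:.4f}")
```

Printing the table for any data set with an intercept should show SST equal to SSE + SSR up to rounding, and the same $R^2$ whether it is computed as SSR/SST or as 1 - SSE/SST.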
Interpretation
The regression coefficient for independent variable $X_j$ measures the effect of a one unit change in $X_{ji}$ on $Y_i$, on average, holding the values of the other independent variables constant. For example, suppose we estimate a regression of the price of a new home (Price) on the square footage of the house (SqFt) and the number of rooms (Number):
Price = $20,000 + $150 SqFt + $1500 (Number)
Holding the number of rooms in a house constant, an increase in the square footage of the house by one square foot is associated with a $150 increase in the price of a house, on average. Holding the square footage of a house constant, an increase of one room is associated with a $1500 increase in the price of a house, on average.
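
The "holding constant" interpretation can be checked with a little arithmetic on the estimated equation. In the sketch below, `predicted_price` is a hypothetical helper, and the 2,000 square foot, 6 room house is an arbitrary example that does not come from the text.

```python
def predicted_price(sqft, rooms):
    """Predicted price from the estimated equation in the text."""
    return 20_000 + 150 * sqft + 1_500 * rooms

# Arbitrary example house: 2,000 square feet and 6 rooms.
base = predicted_price(2_000, 6)

# One more square foot, same number of rooms.
one_more_sqft = predicted_price(2_001, 6)

# One more room, same square footage.
one_more_room = predicted_price(2_000, 7)

print(base)                  # 329000
print(one_more_sqft - base)  # 150
print(one_more_room - base)  # 1500
```

The two differences equal the coefficients regardless of which starting house is used, which is what it means for the model to be linear in SqFt and Number.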
Significance of individual coefficients
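Testing the significance of the coefficients proceeds along the same lines as in simple regression. The hypothesis we wish to test is:
$$H_0: \beta_j = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0$$
The test statistic is:
$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$
where $SE(\hat{\beta}_j)$ is the estimated standard error of $\hat{\beta}_j$. Under the null hypothesis, this statistic has a t-distribution with N - p - 1 degrees of freedom.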
If we reject the null hypothesis, we conclude that the variable X_j is significant, given all of the other variables in the model. If we fail to reject the null hypothesis, then we conclude that there is not enough evidence to show that X_j is significant, given all of the other variables in the model. The outcome of a test always depends on the other variables in the model. When we say a variable is significant, we mean that variation in that variable helps to explain variation in the Y variable.
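
A t-statistic for each coefficient can be sketched as follows. The text does not spell out the standard error formula, so this example assumes the usual textbook estimate, the square root of MSE times the corresponding diagonal element of $(X'X)^{-1}$. The helper name `t_statistics` and the data are hypothetical.

```python
import numpy as np

def t_statistics(X, y):
    """Return estimates, standard errors, t-statistics, and error df.

    X is assumed to be N x (p + 1) with a leading column of ones.
    Standard errors use the usual formula sqrt(MSE * diag((X'X)^-1)),
    which is an assumption of this sketch.
    """
    N, k = X.shape
    df_error = N - k                      # N - p - 1
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    residuals = y - X @ beta_hat
    mse = residuals @ residuals / df_error
    std_err = np.sqrt(mse * np.diag(xtx_inv))
    return beta_hat, std_err, beta_hat / std_err, df_error

# Made-up data: N = 6 observations, p = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([9.1, 7.8, 19.2, 17.9, 29.0, 28.1])
X = np.column_stack([np.ones(len(y)), x1, x2])

beta_hat, std_err, t_stats, df_error = t_statistics(X, y)
for j, (b, s, t) in enumerate(zip(beta_hat, std_err, t_stats)):
    print(f"beta_{j}: estimate={b:.3f}  se={s:.3f}  t={t:.2f}")
print("error degrees of freedom:", df_error)
```

Each t-statistic would then be compared with a critical value from the t-distribution with the printed degrees of freedom. Dropping or adding a column of X changes every t-statistic, which mirrors the point that the outcome of a test depends on the other variables in the model.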
Model building
The best models are built on theoretical knowledge about the situation being studied. If we wish to explain the price of houses, for example, we would want to know what variables affect the price of houses (e.g., number of bathrooms, location, square footage, etc.).
