In simple linear regression, the goal is to estimate the linear relationship between a dependent variable, Y, and an independent variable, X. We call this simple regression because there is only one independent variable. We assume that the relationship between the two variables is linear and that variation in X can explain variation in Y, but not the reverse: variation in Y is not used to explain variation in X. The population regression model is written as:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where εi is the error term.
The essential assumption of the linear regression model is:
- the error term, εi, has mean zero
For purposes of testing, we often assume that:
- the error term has a normal distribution with a constant standard deviation σ
The parameters of the model are the intercept, β0, the slope, β1, and the standard deviation of the error term, σ.
Estimation
Our random sample consists of N pairs (xi, yi). We use the data to estimate the parameters of the regression line, β0 and β1. The estimates are denoted by b0 and b1. Our goal is to find the sample regression line,
$$\hat{y}_i = b_0 + b_1 x_i$$
that is "closest" to the population regression line. Because the population line is not observed, closeness is judged in practice by the vertical distances between the observed points and the fitted line.
The estimates of the parameters of the population regression line can be calculated from the following formulas:
$$b_1 = r \, \frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
where r is the sample correlation coefficient between x and y, sx is the sample standard deviation of x, sy is the sample standard deviation of y, and x̄ and ȳ are the sample means. These formulas give the ordinary least squares (OLS) estimates, which minimize the sum of squared errors of the regression:
$$\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2$$
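
The estimation formulas can be checked numerically. The following is a minimal sketch using only the Python standard library. The data values and the function names `ols_from_correlation` and `sum_squared_errors` are hypothetical, chosen for illustration. It computes b1 as r times sy/sx and b0 as ȳ minus b1 times x̄, then compares the sum of squared errors at the estimates with the value at a slightly different line.

```python
from statistics import mean, stdev

def ols_from_correlation(x, y):
    """Return (b0, b1) computed as b1 = r * sy / sx and b0 = ybar - b1 * xbar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (N - 1 divisor)
    # Sample correlation coefficient r
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = cross / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_errors(x, y, b0, b1):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, chosen only to keep the arithmetic easy to follow
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = ols_from_correlation(x, y)
best = sum_squared_errors(x, y, b0, b1)
# Moving either estimate away from the OLS values should not lower the SSE
nudged = sum_squared_errors(x, y, b0 + 0.1, b1 - 0.05)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {best:.3f}, nudged SSE = {nudged:.3f}")
```

Any other choice of intercept and slope should give a sum of squared errors at least as large as the one at (b0, b1), which is what "least squares" refers to.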
We can estimate the standard deviation of the error term, σ, with the following formula:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2}{N - 2}}$$
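
As an illustration, the estimate of σ can be computed directly from the residuals. This sketch assumes the same hypothetical five-point data set as a running example, with the intercept and slope already fitted by OLS. The function name `residual_standard_error` is an illustrative choice, not an established API.

```python
import math

def residual_standard_error(x, y, b0, b1):
    """Estimate sigma as sqrt(SSE / (N - 2)) for a fitted line y = b0 + b1 * x."""
    n = len(x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # N - 2 degrees of freedom: two parameters (b0 and b1) were estimated
    return math.sqrt(sse / (n - 2))

# Hypothetical data and the OLS estimates that correspond to them
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

s = residual_standard_error(x, y, b0, b1)
print(f"s = {s:.4f}")
```

The divisor is N-2 rather than N because two parameters are estimated from the data before the residuals are formed.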
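Under the second assumption (normally distributed errors), the regression estimates, b0 and b1, have normal distributions with means of β0 and β1, respectively. We can form a confidence interval for the slope coefficient using the t distribution with N-2 degrees of freedom:
$$b_1 \pm t^* \, SE_{b_1}$$
where t* is a critical value from the t distribution with N-2 degrees of freedom and SE is the standard error of the slope coefficient, computed from the estimated standard deviation of the error term, s:
$$SE_{b_1} = \frac{s}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}}$$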
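
A confidence interval for the slope might be computed along these lines. This is a sketch under stated assumptions: the data are hypothetical, the function name `slope_confidence_interval` is illustrative, and the critical value t* is passed in by the caller (here 3.182, the tabulated 95% two-sided value for 3 degrees of freedom) rather than computed. The slope is calculated as the sum of cross-deviations divided by the sum of squared x-deviations, which is algebraically the same as r times sy/sx.

```python
import math

def slope_confidence_interval(x, y, t_star):
    """Return (lower, upper) for b1 +/- t_star * SE(b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # OLS slope and intercept (equivalent to b1 = r * sy / sx)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Estimated standard deviation of the error term
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    se_b1 = s / math.sqrt(sxx)
    return b1 - t_star * se_b1, b1 + t_star * se_b1

# Hypothetical data; N = 5, so the t distribution has 3 degrees of freedom
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_star = 3.182  # 95% two-sided critical value for 3 df, taken from a t table

lower, upper = slope_confidence_interval(x, y, t_star)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

With so few observations the interval is wide. In this hypothetical case it includes zero.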
The predicted value of yi for a particular value of xi is denoted by ŷi. Since the prediction uses estimates of the parameters, it has a sampling distribution and, hence, a standard error. The standard error of prediction for an individual response when x = x* is:
$$SE_{\hat{y}} = s \sqrt{1 + \frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}}$$
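The standard error for predicting the mean (or average) response, μ̂y, for a particular value of x, x*, is given by:
$$SE_{\hat{\mu}} = s \sqrt{\frac{1}{N} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}}$$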
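
The two standard errors can be compared side by side. The sketch below uses hypothetical data and an illustrative function name, `prediction_standard_errors`. It evaluates the standard error for an individual response and for the mean response at the same x*. The only difference between them is the extra 1 under the square root for the individual response.

```python
import math

def prediction_standard_errors(x, y, x_star):
    """Return (SE for an individual response, SE for the mean response) at x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # OLS estimates and the estimated standard deviation of the error term
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    # Term shared by both formulas; grows as x_star moves away from x_bar
    shared = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_individual = s * math.sqrt(1 + shared)
    se_mean = s * math.sqrt(shared)
    return se_individual, se_mean

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

se_individual, se_mean = prediction_standard_errors(x, y, x_star=4)
print(f"SE (individual) = {se_individual:.4f}, SE (mean) = {se_mean:.4f}")
```

The standard error for an individual response is always the larger of the two, since it also reflects the variability of a single observation around the regression line.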
Inference
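We can test whether a linear relationship exists between X and Y by testing the following hypotheses:
$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0$$
The relevant test statistic is
$$t = \frac{b_1}{SE_{b_1}}$$
which, under the null hypothesis, has a t distribution with N-2 degrees of freedom.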
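
A sketch of the test on a small hypothetical data set follows. The function name `slope_t_statistic` is illustrative. The critical value 3.182 (two-sided, 5% level, 3 degrees of freedom) is taken from a t table rather than computed, so that only the Python standard library is needed.

```python
import math

def slope_t_statistic(x, y):
    """Return (t, degrees of freedom) for testing H0: beta1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    se_b1 = s / math.sqrt(sxx)
    return b1 / se_b1, n - 2

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

t_stat, df = slope_t_statistic(x, y)
t_star = 3.182  # two-sided 5% critical value for 3 df, taken from a t table
reject = abs(t_stat) > t_star
print(f"t = {t_stat:.3f} on {df} df; reject H0 at the 5% level: {reject}")
```

Comparing |t| with the critical value is equivalent to checking whether the corresponding confidence interval for the slope excludes zero.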
Prediction
As above, to predict a value of y given a particular value of x, denoted by x*, we substitute x* into the estimated regression equation:
$$\hat{y} = b_0 + b_1 x^*$$
This is also the estimated mean (or average) response of y when x = x*.
Although the two share the same point estimate, the average response is different from the predicted individual response, because their standard errors differ. Suppose we have a sample of students for whom we know college GPA and high school GPA. We could use simple linear regression to predict college GPA from high school GPA. If we then wish to predict the college GPA of a particular student who had a high school GPA of 3.25, we would substitute 3.25 into the estimated regression equation. The appropriate standard error for that prediction is the standard error for an individual response.
If we wish to predict the average college GPA for all students whose high school GPAs were 3.25, we would again substitute 3.25 into the estimated regression equation. However, the appropriate standard error in this case is the standard error for the average.
