What is Linear Regression?
In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.
In this AP Statistics tutorial, we focus on the case where there is only one independent variable. This is called simple regression. In another tutorial (see Regression Tutorial), we cover multiple regression, which handles two or more independent variables.
Tip: The next lesson presents a simple linear regression example that shows how to apply the material covered in this lesson. Since this lesson is a little dense, you may benefit from reading the next lesson as well.
Requirements for Regression
Simple linear regression is appropriate when the following conditions are satisfied.
- Linearity. The relationship between the independent variable X and the dependent variable Y should be linear. To check this, make sure that the XY scatterplot shows a roughly linear pattern and that the residual plot shows a random pattern. (In a future lesson, we'll explain how to check linearity with a residual plot.)
- Homoscedasticity. The variance of residuals should be constant across all levels of the independent variable. To check for homoscedasticity, plot residuals against the independent variable. If the spread is roughly constant, homoscedasticity holds. (Bartlett's test and Hartley's Fmax test can also be used to test for homogeneity of variance; but these tests are not part of the AP Statistics curriculum, and they will not appear on the AP Statistics test.)
- Independence. Residuals should be independent of each other. The value of one residual should not provide any information about the value of another. Plot residuals against time or observation order. If the residuals fluctuate randomly around the zero line with no clear pattern, they are likely independent. If they show a trend (e.g., increasing or decreasing) or cyclical behavior, this indicates dependence.
- Normality. The residuals should be normally distributed, especially for small sample sizes. Plot a histogram of the residuals and check for a bell-shaped distribution. Or produce a normal probability plot. If points on the plot fall approximately along a straight line, the residuals are approximately normally distributed. (This assumption is less critical when the sample size is large.)
Checking these assumptions helps you judge whether a simple linear regression model is valid for your data. If any assumptions are violated, appropriate transformations or other analytical methods should be considered. (We'll cover transformations in a future lesson.)
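As a rough illustration of the residual checks described above, the sketch below fits a line to a small dataset, lists the residuals in order of x, and compares their spread in the low-x and high-x halves. The data values and variable names are invented for this example, and the spread comparison is only an informal stand-in for looking at a residual plot; it is not a formal test.

```python
from statistics import mean, pstdev

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Fit the least squares line so that residuals can be computed.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# List residuals in order of x; look for curvature, trends, or a funnel shape.
pairs = sorted(zip(x, residuals))
for xi, ei in pairs:
    print(f"x = {xi}: residual = {ei:+.3f}")

# Informal spread comparison between the low-x and high-x halves.
half = len(pairs) // 2
low_spread = pstdev([e for _, e in pairs[:half]])
high_spread = pstdev([e for _, e in pairs[half:]])
print(f"spread (low x) = {low_spread:.3f}, spread (high x) = {high_spread:.3f}")
```

If the two spreads differ greatly, or the signs of the residuals follow an obvious pattern, that is a cue to look more closely at a residual plot.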
The Least Squares Regression Line
Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate dataset. Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:
$$Y = \beta_0 + \beta_1 X$$
where $\beta_0$ is a constant, $\beta_1$ is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by a sample regression line. The sample regression line is:
$$\hat{y} = b_0 + b_1 x$$
where $b_0$ is a constant, $b_1$ is the regression coefficient, x is the value of the independent variable, and $\hat{y}$ is the predicted value of the dependent variable.
How to Define a Regression Line
Normally, you will use a computational tool - a software package (e.g., Excel) or a graphing calculator - to find b0 and b1. You enter the x and y values into your program or calculator, and the tool solves for the regression constant (b0) and for the regression coefficient (b1).
In the unlikely event that you find yourself on a desert island without a computer or a graphing calculator, you can solve for b0 and b1 "by hand". Here are the equations.
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$b_1 = r \cdot \frac{s_y}{s_x}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
where $b_0$ is the constant in the regression equation, $b_1$ is the regression coefficient, r is the correlation between x and y, $x_i$ is the x value for observation i, $y_i$ is the y value for observation i, $\bar{x}$ is the sample mean of x, $\bar{y}$ is the sample mean of y, $s_x$ is the standard deviation of x, and $s_y$ is the standard deviation of y.
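To make the "by hand" equations concrete, here is a minimal Python sketch that applies them to a tiny dataset. The five (x, y) pairs and all variable names are invented for illustration; the sketch assumes sample standard deviations (dividing by n - 1) for the second slope formula.

```python
from math import sqrt

# Hypothetical sample data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# First slope formula: cross-products divided by the sum of squares of x.
b1 = sxy / sxx

# Second slope formula: b1 = r * (sy / sx), using sample standard deviations.
r = sxy / sqrt(sxx * syy)
s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))
b1_alt = r * (s_y / s_x)

# Intercept: the line must pass through the point of means.
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.4f}, b1 (via r) = {b1_alt:.4f}, b0 = {b0:.4f}")
```

For this particular dataset, both slope formulas give 0.6 and the intercept is 2.2, so the fitted line is ŷ = 2.2 + 0.6x.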
Properties of the Regression Line
When the regression parameters (b0 and b1) are defined as described above, the regression line has the following properties.
- The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
- The regression line passes through the point $(\bar{x}, \bar{y})$, where $\bar{x}$ is the mean of the x values and $\bar{y}$ is the mean of the y values.
- The regression constant (b0) is equal to the y intercept of the regression line.
- The regression coefficient (b1) is the average change in the dependent variable (y) for a 1-unit change in the independent variable (x). It is the slope of the regression line.
The least squares regression line is the only straight line that has all of these properties.
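The first two properties can be checked numerically. The sketch below is an illustration under stated assumptions: the dataset is invented, and `fit_line` and `sum_squared_errors` are hypothetical helper names. It confirms that the fitted line passes through the point of means and that nudging the intercept or slope in any direction increases the sum of squared differences.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line through the data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(b0, b1, x, y):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical sample data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Property: the prediction at the mean of x equals the mean of y.
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-9

# Property: every nearby line has a larger sum of squared errors.
best = sum_squared_errors(b0, b1, x, y)
nearby = [
    sum_squared_errors(b0 + d0, b1 + d1, x, y)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]
is_minimum = all(s > best for s in nearby)

print(f"passes through means: {passes_through_means}")
print(f"smaller SSE than all nearby lines: {is_minimum}")
```

Trying a handful of nearby lines is not a proof, only a sanity check, but it shows what "minimizes the sum of squared differences" means in practice.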
The Coefficient of Determination
The coefficient of determination (denoted by $R^2$) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
- The coefficient of determination ranges from 0 to 1.
- An $R^2$ of 0 means that the dependent variable cannot be predicted from the independent variable.
- An $R^2$ of 1 means the dependent variable can be predicted without error from the independent variable.
- An $R^2$ between 0 and 1 indicates the extent to which the dependent variable is predictable. An $R^2$ of 0.10 means that 10 percent of the variance in y is predictable from x; an $R^2$ of 0.20 means that 20 percent is predictable; and so on.
The formula for computing the coefficient of determination for a linear regression model with one independent variable is given below.
Coefficient of determination. The coefficient of determination ($R^2$) for a linear regression model with one independent variable is:
$$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
where $y_i$ is the value of the dependent variable for observation i, $\hat{y}_i$ is the predicted value of the dependent variable for observation i, and $\bar{y}$ is the mean of observed values of the dependent variable.
If you know the linear correlation (r) between the independent variable and the dependent variable, then the coefficient of determination is easily computed using the following formula: $R^2 = r^2$.
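The two routes to the coefficient of determination can be compared side by side. This is a minimal sketch on an invented dataset (the values and variable names are assumptions made for illustration): it computes the ratio of sums of squares from the formula above and, separately, the square of the correlation.

```python
from math import sqrt

# Hypothetical sample data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line and its predicted values.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Route 1: predicted variation divided by total variation.
ss_predicted = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_total = syy
r_squared = ss_predicted / ss_total

# Route 2: square of the linear correlation.
r = sxy / sqrt(sxx * syy)

print(f"R^2 from sums of squares = {r_squared:.4f}")
print(f"r^2 from the correlation = {r ** 2:.4f}")
```

For this dataset both routes give 0.6, meaning 60 percent of the variance in y is predictable from x.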
Standard Error of the Estimate
The standard error of the estimate (also known as the residual standard error) is a measure of the average amount that the regression equation over- or under-predicts. For a given dataset, the higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.
For simple linear regression (regression with only one independent variable), the standard error of the estimate (SE) can be calculated from this formula:
$$SE = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
where $y_i$ is the actual value of the dependent variable for observation i, $\hat{y}_i$ is the predicted value of the dependent variable for observation i, and n is the sample size.
Here is how to interpret the standard error of the estimate.
- The standard error tells you roughly how far the actual data points typically fall from the regression line.
- A smaller standard error indicates the regression model fits the data more closely.
- The standard error has the same units as the dependent variable y.
You can think of the standard error as the standard deviation of the residuals: if the standard error is, say, 2.3, then the actual values typically fall about 2.3 units away from the predicted values.
Note: The standard error of the estimate is different from the standard error of the slope. In future lessons, we'll describe the standard error of the slope; and we'll explain how the standard error of the slope is used to test hypotheses about the slope and to define a confidence interval around the slope.
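As a worked illustration of the standard error of the estimate, the sketch below applies the formula to a small invented dataset. The data and variable names are assumptions made for this example only.

```python
from math import sqrt

# Hypothetical sample data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residuals: actual y minus predicted y.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Standard error of the estimate: divide by n - 2, then take the square root.
sum_sq_residuals = sum(e ** 2 for e in residuals)
se = sqrt(sum_sq_residuals / (n - 2))

print(f"sum of squared residuals = {sum_sq_residuals:.4f}")
print(f"SE = {se:.4f}")
```

Here the sum of squared residuals is 2.4 and n - 2 is 3, so SE is the square root of 0.8, or about 0.89, in the same units as y.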
Test Your Understanding
Problem 1
A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between heating bills and home size is 0.70. What is the correct interpretation of this finding?
 
(A) 70% of the variability in home heating bills can be explained by home size.
(B) 49% of the variability in home heating bills can be explained by home size.
(C) For each added square foot of home size, heating bills increased by 70 cents.
(D) For each added square foot of home size, heating bills increased by 49 cents.
(E) None of the above.
    
Solution
The correct answer is (B). The coefficient of determination measures the proportion of variation in the dependent variable that is predictable from the independent variable. In simple linear regression, the coefficient of determination equals the square of the correlation: $R^2 = r^2 = (0.70)^2 = 0.49$. Therefore, 49% of the variability in heating bills can be explained by home size.