Chapter 10: Linear Regression and Correlation

Chapter 10 Review

Chapter Review from 10.1

The most basic type of association is a linear association. This type of relationship can be defined algebraically by the equations used, numerically with actual or predicted data values, or graphically from a plotted curve. (Lines are classified as straight curves.) Algebraically, a linear equation typically takes the form y = mx + b, where m and b are constants, x is the independent variable, and y is the dependent variable. In a statistical context, a linear equation is written in the form y = a + bx, where a and b are the constants. This form is used to help readers distinguish the statistical context from the algebraic context. In the equation y = a + bx, the constant b that multiplies the x variable is called the slope (b is also called a coefficient). The slope describes the rate of change between the independent and dependent variables; in other words, it describes the change that occurs in the dependent variable as the independent variable changes. In the equation y = a + bx, the constant a is called the y-intercept. Graphically, the y-intercept is the y-coordinate of the point where the graph of the line crosses the y-axis. At this point, x = 0.

The slope of a line is a value that describes the rate of change between the independent and dependent variables. The slope tells us how the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x). The y-intercept describes the value of the dependent variable when the independent variable equals zero. Graphically, the slope corresponds to one of three line types in elementary statistics: a line that rises from left to right (positive slope), a line that falls from left to right (negative slope), or a horizontal line (zero slope).

Formula Review

y = a + bx where a is the y-intercept and b is the slope. The variable x is the independent variable and y is the dependent variable.

Use the following information to answer the next three exercises. A vacation resort rents SCUBA equipment to certified divers. The resort charges an up-front fee of $25 and another fee of $12.50 an hour.

What are the dependent and independent variables?

Solution

dependent variable: fee amount; independent variable: time

Find the equation that expresses the total fee in terms of the number of hours the equipment is rented.

Graph the equation from [link].

Solution
This is a graph of the equation y = 25 + 12.50x. The x-axis is labeled in intervals of 1 from 0 - 7; the y-axis is labeled in intervals of 25 from 0 - 100. The equation's graph is a line that crosses the y-axis at 25 and is sloped up and to the right, rising 12.50 units for every one unit of run.
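As a quick numerical check of the slope and y-intercept described in this graph, the sketch below evaluates y = 25 + 12.50x at a few rental times. The function name `total_fee` and the choice of hours are illustrative assumptions, not part of the exercise.

```python
def total_fee(hours, upfront=25.0, hourly=12.50):
    """Total rental fee y = a + b*x, with a = up-front fee and b = hourly rate."""
    return upfront + hourly * hours

# Evaluate the line at whole-hour rental times from 0 to 4.
fees = [total_fee(h) for h in range(5)]
print(fees)

# The y-intercept is the fee at x = 0; the slope is the change per extra hour.
intercept = total_fee(0)
slope = total_fee(3) - total_fee(2)
print(intercept, slope)
```

Each additional hour adds the same amount to the fee, which is what makes the relationship linear.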

Use the following information to answer the next two exercises. A credit card company charges $10 when a payment is late, and $5 a day each day the payment remains unpaid.

Find the equation that expresses the total fee in terms of the number of days the payment is late.

Graph the equation from [link].

Solution
This is a graph of the equation y = 10 + 5x. The x-axis is labeled in intervals of 1 from 0 - 7; the y-axis is labeled in intervals of 10 from 0 - 50. The equation's graph is a line that crosses the y-axis at 10 and is sloped up and to the right, rising 5 units for every one unit of run.

Is the equation y = 10 + 5x – 3x² linear? Why or why not?

Which of the following equations are linear?

a. y = 6x + 8

b. y + 7 = 3x

c. yx = 8x²

d. 4y = 8

Solution

y = 6x + 8, 4y = 8, and y + 7 = 3x are all linear equations.
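One way to check each choice is to try rewriting it in the form y = a + bx. A short sketch of that rearrangement for the three linear choices:

```latex
\begin{aligned}
\text{a. } & y = 6x + 8 && \Rightarrow\; y = 8 + 6x  && (a = 8,\ b = 6)\\
\text{b. } & y + 7 = 3x && \Rightarrow\; y = -7 + 3x && (a = -7,\ b = 3)\\
\text{d. } & 4y = 8     && \Rightarrow\; y = 2 + 0x  && (a = 2,\ b = 0)
\end{aligned}
```

Choice c, as written, contains the product yx and a squared x term, so it does not match this form.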

Does the graph show a linear equation? Why or why not?

This is a graph of an equation. The x-axis is labeled in intervals of 1 from -5 to 5; the y-axis is labeled in intervals of 1 from 0 - 8. The equation's graph is a parabola, a u-shaped curve that has a minimum value at (0, 0).

[link] contains real data for the first two decades of AIDS reporting.

Adults and Adolescents only, United States
Year # AIDS cases diagnosed # AIDS deaths
Pre-1981 91 29
1981 319 121
1982 1,170 453
1983 3,076 1,482
1984 6,240 3,466
1985 11,776 6,878
1986 19,032 11,987
1987 28,564 16,162
1988 35,447 20,868
1989 42,674 27,591
1990 48,634 31,335
1991 59,660 36,560
1992 78,530 41,055
1993 78,834 44,730
1994 71,874 49,095
1995 68,505 49,456
1996 59,347 38,510
1997 47,149 20,736
1998 38,393 19,005
1999 25,174 18,454
2000 25,522 17,347
2001 25,643 17,402
2002 26,464 16,371
Total 802,118 489,093

Use the columns “year” and “# AIDS cases diagnosed.” Why is “year” the independent variable and “# AIDS cases diagnosed” the dependent variable (instead of the reverse)?

Solution

The number of AIDS cases depends on the year. Therefore, year becomes the independent variable and the number of AIDS cases is the dependent variable.

Use the following information to answer the next two exercises. A specialty cleaning company charges an equipment fee and an hourly labor fee. A linear equation that expresses the total amount of the fee the company charges for each session is y = 50 + 100x.

What are the independent and dependent variables?

What is the y-intercept and what is the slope? Interpret them using complete sentences.

Solution

The y-intercept is 50 (a = 50). At the start of the cleaning, the company charges a one-time fee of $50 (this is when x = 0). The slope is 100 (b = 100). For each session, the company charges $100 for each hour they clean.

Use the following information to answer the next three questions. Due to erosion, a river shoreline is losing several thousand pounds of soil each year. A linear equation that expresses the total amount of soil lost per year is y = 12,000x.

What are the independent and dependent variables?

How many pounds of soil does the shoreline lose in a year?

Solution

12,000 pounds of soil

What is the y-intercept? Interpret its meaning.

Use the following information to answer the next two exercises. The price of a single issue of stock can fluctuate throughout the day. A linear equation that represents the price of stock for Shipment Express is y = 15 – 1.5x where x is the number of hours passed in an eight-hour day of trading.

What are the slope and y-intercept? Interpret their meaning.

Solution

The slope is –1.5 (b = –1.5). This means the stock is losing value at a rate of $1.50 per hour. The y-intercept is $15 (a = 15). This means the price of stock before the trading day was $15.

If you owned this stock, would you want a positive or negative slope? Why?

Chapter Review from 10.2

Scatter plots are particularly helpful graphs when we want to see if there is a linear relationship among data points. They indicate both the direction of the relationship between the x variables and the y variables, and the strength of the relationship. We calculate the strength of the relationship between an independent variable and a dependent variable using linear regression.

Does the scatter plot appear linear? Strong or weak? Positive or negative?

This is a scatterplot with several points plotted in the first quadrant. The points form a clear pattern, moving upward to the right. The points do not line up, but the overall pattern can be modeled with a line.
Solution

The data appear to be linear with a strong, positive correlation.

Does the scatter plot appear linear? Strong or weak? Positive or negative?

This is a scatterplot with several points plotted in the first quadrant. The points move downward to the right. The overall pattern can be modeled with a line, but the points are widely scattered.

Does the scatter plot appear linear? Strong or weak? Positive or negative?

This is a scatter plot with several points plotted all over the first quadrant. There is no pattern.
Solution

The data appear to have no correlation.

Chapter Review from 10.3

A regression line, or line of best fit, can be drawn on a scatter plot and used to predict values of y from values of x for a given data set or sample. There are several ways to find a regression line, but the least-squares regression line is usually used because it minimizes the Sum of Squared Errors (SSE). Residuals, also called “errors,” measure the vertical distance between the actual value of y and the estimated value of y. The line of best fit is the line for which the SSE is at its minimum. Regression lines can be used to predict values within the range of the given data, but should not be used to make predictions for values outside that range.

The correlation coefficient r measures the strength of the linear association between x and y. The value of r is always between –1 and +1. When r is positive, x and y tend to increase and decrease together. When r is negative, y tends to decrease as x increases, and to increase as x decreases. The coefficient of determination, r², is equal to the square of the correlation coefficient. When expressed as a percent, r² represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression line.

Use the following information to answer the next five exercises. A random sample of ten professional athletes produced the following data where x is the number of endorsements the player has and y is the amount of money made (in millions of dollars).

 
x y x y
0 2 5 12
3 8 4 9
2 7 3 9
1 3 0 3
5 13 4 10

Draw a scatter plot of the data.

Use regression to find the equation for the line of best fit.

Solution

ŷ = 2.23 + 1.99x
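The equation above can be reproduced from the ten data pairs with the usual least-squares formulas for slope and intercept. The following is a minimal sketch in plain Python; the function name `least_squares` is an illustrative choice, and a calculator or statistics package would give the same coefficients.

```python
# Sample data from the exercise: x = endorsements, y = pay in millions of dollars.
x = [0, 3, 2, 1, 5, 5, 4, 3, 0, 4]
y = [2, 8, 7, 3, 13, 12, 9, 9, 3, 10]

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

a, b = least_squares(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")   # prints: y-hat = 2.23 + 1.99x
```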

Draw the line of best fit on the scatter plot.

What is the slope of the line of best fit? What does it represent?

Solution

The slope is 1.99 (b = 1.99). It means that for every endorsement deal a professional player gets, he gets an average of another $1.99 million in pay each year.

What is the y-intercept of the line of best fit? What does it represent?

What does an r value of zero mean?

Solution

It means that there is no linear correlation between the two variables.

When n = 2 and r = 1, are the data significant? Explain.

When n = 100 and r = -0.89, is there a significant correlation? Explain.

Solution

Yes, there are enough data points and the value of r is strong enough to show that there is a strong negative correlation between the data sets.
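One common way to make a judgment like this precise is the t statistic for a sample correlation, t = r√(n – 2) / √(1 – r²), with n – 2 degrees of freedom. The sketch below assumes that standard test. The function name is illustrative, and the resulting t still has to be compared with a critical value from a t table.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether the population correlation is zero."""
    if n <= 2:
        # With n = 2 there are zero degrees of freedom, so no test is possible.
        raise ValueError("need more than two points (n - 2 must be positive)")
    if abs(r) >= 1:
        raise ValueError("t is only defined when |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = correlation_t_statistic(-0.89, 100)
print(round(t, 2))
```

A t value this far from zero with 98 degrees of freedom points to a significant correlation. The n = 2 case raises an error because two points always lie on a line and leave no degrees of freedom.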

Chapter Review from 10.5

After determining the presence of a strong correlation coefficient and calculating the line of best fit, you can use the least squares regression line to make predictions about your data.

Use the following information to answer the next two exercises. An electronics retailer used regression to find a simple model to predict sales growth in the first quarter of the new year (January through March). The model is good for 90 days, where x is the day. The model can be written as follows:

ŷ = 101.32 + 2.48x where ŷ is in thousands of dollars.

What would you predict the sales to be on day 60?

Solution

$250,120
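The prediction comes from substituting x = 60 into the model and converting from thousands of dollars to dollars. A minimal sketch of that arithmetic follows; the function name `predicted_sales` is an illustrative assumption.

```python
def predicted_sales(day):
    """Predicted sales in dollars from y-hat = 101.32 + 2.48x (y-hat in thousands)."""
    thousands = 101.32 + 2.48 * day
    return thousands * 1000   # convert thousands of dollars to dollars

print(f"${predicted_sales(60):,.0f}")   # prints: $250,120
```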

What would you predict the sales to be on day 90?

Use the following information to answer the next three exercises. A landscaping company is hired to mow the grass for several large properties. The total area of the properties combined is 1,345 acres. The rate at which one person can mow is as follows:

ŷ = 1350 – 1.2x where x is the number of hours and ŷ represents the number of acres left to mow.

How many acres will be left to mow after 20 hours of work?

Solution

1,326 acres

How many acres will be left to mow after 100 hours of work?

How many hours will it take to mow all of the lawns? (When is ŷ = 0?)

Solution

1,125 hours, or when x = 1,125
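The answer follows from setting ŷ = 0 in the model and solving for x. As a short worked derivation:

```latex
\begin{aligned}
0 &= 1350 - 1.2x\\
1.2x &= 1350\\
x &= \frac{1350}{1.2} = 1125
\end{aligned}
```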

[link] contains real data for the first two decades of AIDS reporting.

Adults and Adolescents only, United States
Year # AIDS cases diagnosed # AIDS deaths
Pre-1981 91 29
1981 319 121
1982 1,170 453
1983 3,076 1,482
1984 6,240 3,466
1985 11,776 6,878
1986 19,032 11,987
1987 28,564 16,162
1988 35,447 20,868
1989 42,674 27,591
1990 48,634 31,335
1991 59,660 36,560
1992 78,530 41,055
1993 78,834 44,730
1994 71,874 49,095
1995 68,505 49,456
1996 59,347 38,510
1997 47,149 20,736
1998 38,393 19,005
1999 25,174 18,454
2000 25,522 17,347
2001 25,643 17,402
2002 26,464 16,371
Total 802,118 489,093

Graph “year” versus “# AIDS cases diagnosed” (plot the scatter plot). Do not include pre-1981 data.

Perform linear regression. What is the linear equation? Round to the nearest whole number.

Solution

Check student’s solution.

Write the equations:

Linear equation: __________
a = ________
b = ________
r = ________
n = ________

Solve.

When x = 1985, ŷ = _____
When x = 1990, ŷ =_____
When x = 1970, ŷ =______ Why doesn’t this answer make sense?
Solution
When x = 1985, ŷ = 25,525
When x = 1990, ŷ = 34,275
When x = 1970, ŷ = –725 Why doesn’t this answer make sense? The range of x values was 1981 to 2002; the year 1970 is not in this range. The regression equation does not apply, because predicting for the year 1970 is extrapolation, which requires a different process. Also, a negative number does not make sense in this context, where we are predicting AIDS cases diagnosed.
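The three predictions can be checked with the rounded regression line ŷ = –3,448,225 + 1750x. The sketch below also flags any year outside the fitted range 1981 to 2002 as extrapolation. The constant and function names are illustrative assumptions.

```python
# Rounded regression line for the AIDS data: y-hat = -3,448,225 + 1750x.
INTERCEPT = -3_448_225
SLOPE = 1750
X_MIN, X_MAX = 1981, 2002   # range of years used to fit the line

def predict_cases(year):
    """Return (prediction, is_extrapolation) for the given year."""
    y_hat = INTERCEPT + SLOPE * year
    is_extrapolation = not (X_MIN <= year <= X_MAX)
    return y_hat, is_extrapolation

for year in (1985, 1990, 1970):
    y_hat, outside = predict_cases(year)
    note = " (outside 1981-2002: extrapolation)" if outside else ""
    print(f"x = {year}: y-hat = {y_hat:,}{note}")
```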

Does the line seem to fit the data? Why or why not?

What does the correlation imply about the relationship between time (years) and the number of diagnosed AIDS cases reported in the U.S.?

Solution

The correlation is r = 0.4526. When r is compared to the value in the 95% Critical Values of the Sample Correlation Coefficient Table, r > 0.423, so r is significant, and you would think that the line could be used for prediction. But the scatter plot indicates otherwise.
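The value of r can be reproduced from the table (omitting the pre-1981 row) with the standard formula for the sample correlation coefficient. The following is a sketch in plain Python; the function name `correlation` is an illustrative choice.

```python
import math

years = list(range(1981, 2003))   # 1981 through 2002; pre-1981 row excluded
cases = [319, 1170, 3076, 6240, 11776, 19032, 28564, 35447, 42674, 48634,
         59660, 78530, 78834, 71874, 68505, 59347, 47149, 38393, 25174,
         25522, 25643, 26464]

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation(years, cases)
print(f"n = {len(years)}, r = {r:.4f}, r^2 = {r * r:.4f}")
```

The small r² shows that a straight line explains only a modest share of the variation in diagnosed cases, which fits the warning from the scatter plot.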


Plot the two given points on the following graph. Then, connect the two points to form the regression line.

Blank graph with horizontal and vertical axes.

Obtain the graph on your calculator or computer.


Write the equation: ŷ= ____________

Solution

ŷ = –3,448,225 + 1750x

Hand draw a smooth curve on the graph that shows the flow of the data.


Does the line seem to fit the data? Why or why not?

Solution

There was an increase in AIDS cases diagnosed until 1993. From 1993 through 2002, the number of AIDS cases diagnosed declined each year. It is not appropriate to use a linear regression line to fit to the data.

Do you think a linear fit is best? Why or why not?

What does the correlation imply about the relationship between time (years) and the number of diagnosed AIDS cases reported in the U.S.?

Solution

Since there is no linear association between year and # of AIDS cases diagnosed, it is not appropriate to calculate a linear correlation coefficient. When there is a linear association and it is appropriate to calculate a correlation, we cannot say that one variable “causes” the other variable.


Graph “year” vs. “# AIDS cases diagnosed.” Do not include pre-1981. Label both axes with words. Scale both axes.

Enter your data into your calculator or computer. The pre-1981 data should not be included. Why is that so?

Write the linear equation, rounding to four decimal places:

Solution

We don’t know if the pre-1981 data was collected from a single year. So we don’t have an accurate x value for this figure.

Regression equation: ŷ (#AIDS Cases) = –3,448,225 + 1749.777 (year)

Coefficients
Intercept –3,448,225
X Variable 1 1,749.777

Calculate the following:

  1. a = _____
  2. b = _____
  3. correlation = _____
  4. n = _____

Chapter Review from 10.6

To determine if a point is an outlier, do one of the following:

Input the following equations into the TI-83, 83+, 84, or 84+:

[latex]\begin{array}{l}{y}_{1}=a+bx\hfill \\ {y}_{2}=a+bx+2s\hfill \\ {y}_{3}=a+bx-2s\hfill \end{array}[/latex] where s is the standard deviation of the residuals

If any point is above y2 or below y3, then the point is considered to be an outlier.

Use the residuals and compare their absolute values to 2s where s is the standard deviation of the residuals. If the absolute value of any residual is greater than or equal to 2s, then the corresponding point is an outlier.
Note: The calculator function LinRegTTest (STAT TESTS LinRegTTest) calculates s.
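The residual-based check can be sketched in plain Python as follows. The data are hypothetical: every point lies on y = 45 + 5x except (6, 58), which was chosen to sit well below the others. The function name `find_outliers` is also an assumption, and s is computed as √(SSE / (n – 2)).

```python
import math

def find_outliers(xs, ys):
    """Flag points whose residual is at least 2s from the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (n - 2))   # standard deviation of the residuals
    flagged = [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) >= 2 * s]
    return flagged, s

# Hypothetical data: points on y = 45 + 5x except for (6, 58).
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [50, 55, 60, 65, 70, 58, 80, 85, 90, 95]
outliers, s = find_outliers(hours, scores)
print(outliers, round(s, 2))
```

Only the point (6, 58) has a residual larger than 2s in absolute value, so it is the only point flagged.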

Use the following information to answer the next four exercises. The scatter plot shows the relationship between hours spent studying and exam scores. The line shown is the calculated line of best fit. The correlation coefficient is 0.69.

Do there appear to be any outliers?

Solution

Yes, there appears to be an outlier at (6, 58).

A point is removed, and the line of best fit is recalculated. The new correlation coefficient is 0.98. Does the point appear to have been an outlier? Why?

What effect did the potential outlier have on the line of best fit?

Solution

The potential outlier flattened the slope of the line of best fit because it was below the data set. It made the line of best fit less accurate as a predictor for the data.

Are you more or less confident in the predictive ability of the new line of best fit?

The Sum of Squared Errors for a data set of 18 numbers is 49. What is the standard deviation?

Solution

s = 1.75
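This value follows from the formula for the standard deviation of the residuals, assuming the 18 numbers are n = 18 data points:

```latex
s = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{49}{18 - 2}} = \sqrt{3.0625} = 1.75
```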

The standard deviation of the residuals for a data set is 9.8. What is the cutoff for the vertical distance that a point can be from the line of best fit to be considered an outlier?

License


Introductory Statistics Copyright © 2024 by LOUIS: The Louisiana Library Network is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.
