
Chapter 13 - Simple Linear Regression

Academic year: 2019


(1)

Statistics for Managers

Using Microsoft® Excel

5th Edition

Chapter 13

(2)

Learning Objectives

In this chapter, you learn:

- To use regression analysis to predict the value of a dependent variable based on an independent variable
- The meaning of the regression coefficients b₀ and b₁
- To evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
- To make inferences about the slope and correlation coefficient

(3)

Correlation vs. Regression

- A scatter plot (or scatter diagram) can be used to show the relationship between two numerical variables
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
- Correlation is concerned only with the strength of the relationship

(4)

Regression Analysis

Regression analysis is used to:

- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable you wish to explain
Independent variable: the variable used to explain the dependent variable

(5)

Simple Linear Regression Model

- Only one independent variable, X
- The relationship between X and Y is described by a linear function

(6)

Types of Relationships

[Figure: scatter plots of Y vs. X showing linear relationships with positive and negative slopes]

(7)

Types of Relationships

[Figure: scatter plots of Y vs. X showing curvilinear and weaker relationships]

(8)

Types of Relationships

[Figure: scatter plots of Y vs. X showing little or no relationship]

(9)

The Linear Regression Model

The population regression model:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

where
Yᵢ = dependent variable
Xᵢ = independent variable
β₀ = population Y intercept
β₁ = population slope coefficient
εᵢ = random error term

β₀ + β₁Xᵢ is the linear component; εᵢ is the random error component.

(10)

The Linear Regression Model

[Figure: plot of Y vs. X showing the population regression line Yᵢ = β₀ + β₁Xᵢ + εᵢ, with intercept β₀ and slope β₁. For a given Xᵢ, the observed value of Y differs from the predicted value on the line by the random error εᵢ.]

(11)

Linear Regression Equation

The simple linear regression equation provides an estimate of the population regression line:

Ŷᵢ = b₀ + b₁Xᵢ

where
Ŷᵢ = estimated (or predicted) Y value for observation i
b₀ = estimate of the regression intercept
b₁ = estimate of the regression slope

(12)

The Least Squares Method

b₀ and b₁ are obtained by finding the values of b₀ and b₁ that minimize the sum of the squared differences between Yᵢ and Ŷᵢ:

min Σ(Yᵢ − Ŷᵢ)² = min Σ(Yᵢ − (b₀ + b₁Xᵢ))²
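The minimization above has a closed-form solution: b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and b₀ = Ȳ − b₁X̄. As a sketch of how the Excel results later in the chapter could be cross-checked, here is a minimal Python version using the chapter's house-price data:

```python
# Least-squares estimates for the house-price example (Y in $1000s, X in square feet).
# Data taken from the chapter's example slides.
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# b1 = S_xy / S_xx minimizes the sum of squared residuals; b0 follows from the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(round(b1, 5))  # 0.10977, matching the Excel output
print(round(b0, 5))  # 98.24833
```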

(13)

Finding the Least Squares

Equation

The coefficients b₀ and b₁, and other regression results in this chapter, will be found using Excel

(14)

Interpretation of the

Intercept and the Slope

b₀ is the estimated mean value of Y when the value of X is zero

b₁ is the estimated change in the mean value of Y as a result of a one-unit change in X

(15)

Linear Regression Example

A real estate agent wishes to examine the

relationship between the selling price of a

home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (Y) = house price in $1000s
Independent variable (X) = square footage of the house

(16)

Linear Regression Example

Data

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700

(17)

Linear Regression Example

[Figure: scatter plot of house price ($1000s) vs. square feet; the X axis runs from 0 to 3000 square feet]

(18)

Linear Regression Example

Using Excel: Tools → Data Analysis

(19)

Linear Regression Example

Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

The regression equation is:

house price = 98.24833 + 0.10977 (square feet)

(20)

Linear Regression Example

Graphical Representation

[Figure: house price model scatter plot with fitted regression line; X axis is square feet, from 0 to 3000]

house price = 98.24833 + 0.10977 (square feet)

(21)

Linear Regression Example

Interpretation of b₀

house price = 98.24833 + 0.10977 (square feet)

b₀ is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

Because the square footage of the house cannot be 0, the Y intercept has no practical application.

(22)

Linear Regression Example

Interpretation of b₁

house price = 98.24833 + 0.10977 (square feet)

b₁ measures the change in the mean value of Y for a one-unit change in X

Here, b₁ = 0.10977 tells us that the mean value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size

(23)

Linear Regression Example

Making Predictions

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85 ($1000s)
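The same prediction can be sketched in a couple of lines of Python, using the rounded coefficients from the slide:

```python
# Predicted price (in $1000s) for a 2000-square-foot house,
# using the rounded coefficients from the slide.
b0, b1 = 98.25, 0.1098
price = b0 + b1 * 2000
print(round(price, 2))  # 317.85, i.e. about $317,850
```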

(24)

Linear Regression Example

Making Predictions

When using a regression model for prediction, only predict within the relevant range of the data (the range of X values used to fit the model)

[Figure: scatter plot marking the relevant range for interpolation; do not try to extrapolate beyond the range of observed X values]

(25)

Measures of Variation

Total variation is made up of two parts:

SST = SSR + SSE

SST = Total Sum of Squares
SSR = Regression Sum of Squares
SSE = Error Sum of Squares

where
Ȳ = mean value of the dependent variable
Yᵢ = observed value of the dependent variable
Ŷᵢ = predicted value of Y for the given Xᵢ value

(26)

Measures of Variation

SST = total sum of squares
- Measures the variation of the Yᵢ values around their mean Ȳ

SSR = regression sum of squares
- Explained variation attributable to the relationship between X and Y

SSE = error sum of squares
- Variation attributable to factors other than the relationship between X and Y

(27)

Measures of Variation

SST = Σ(Yᵢ − Ȳ)²

SSR = Σ(Ŷᵢ − Ȳ)²

SSE = Σ(Yᵢ − Ŷᵢ)²

[Figure: plot of Y vs. X showing, for a single observation Xᵢ, how the total deviation (Yᵢ − Ȳ) splits into the explained deviation (Ŷᵢ − Ȳ) and the unexplained deviation (Yᵢ − Ŷᵢ)]
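The three sums of squares can be computed directly from the data. A Python sketch using the chapter's house-price data reproduces the ANOVA values from the Excel output:

```python
# Partition of total variation for the house-price data: SST = SSR + SSE.
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
Y_hat = [b0 + b1 * x for x in X]

sst = sum((y - y_bar) ** 2 for y in Y)              # total variation
ssr = sum((yh - y_bar) ** 2 for yh in Y_hat)        # explained by the regression
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat)) # unexplained (error)

print(round(sst, 4))  # 32600.5
print(round(ssr, 4))  # 18934.9348
print(round(sse, 4))  # 13665.5652
```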

(28)

Coefficient of Determination, r²

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

The coefficient of determination is also called r-squared and is denoted r²:

r² = SSR / SST     where 0 ≤ r² ≤ 1

(29)

Coefficient of Determination, r²

r² = 1

[Figure: two scatter plots with every point exactly on the fitted line, one with positive slope and one with negative slope; in both cases r² = 1]

Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X

(30)

Coefficient of Determination, r²

0 < r² < 1

[Figure: scatter plots with points scattered around the fitted line]

Weaker linear relationships between X and Y: some but not all of the variation in Y is explained by variation in X

(31)

Coefficient of Determination, r²

r² = 0

[Figure: horizontal band of points; Y values do not change systematically with X]

No linear relationship between X and Y: the value of Y is not related to X (none of the variation in Y is explained by variation in X)

(32)

Linear Regression Example

Coefficient of Determination, r²

From the Excel output:

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

r² = SSR/SST = 18934.9348/32600.5 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

(33)

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by:

S_YX = √( SSE / (n − 2) ) = √( Σ(Yᵢ − Ŷᵢ)² / (n − 2) )
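A short Python sketch of this formula, applied to the chapter's house-price data, reproduces the "Standard Error" value from the Excel output:

```python
import math

# Standard error of the estimate: S_YX = sqrt(SSE / (n - 2)),
# computed from the chapter's house-price data.
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))  # 41.33032, matching the Excel "Standard Error"
```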

(34)

Linear Regression Example

Standard Error of Estimate

From the Excel output:

Standard Error = 41.33032

S_YX = 41.33032

(35)

Comparing Standard Errors

S_YX is a measure of the variation of observed Y values from the regression line

[Figure: two scatter plots with fitted lines; one where the points lie close to the line (small S_YX) and one where the points are widely scattered (large S_YX)]

(36)

Assumptions of Regression

Use the acronym L.I.N.E.:

Linearity
- The relationship between X and Y is linear

Independence of Errors
- Error values are statistically independent

Normality of Error
- Error values are normally distributed for any given value of X

Equal Variance (also called homoscedasticity)
- The probability distribution of the errors has constant variance

(37)

Residual Analysis

The residual for observation i, eᵢ, is the difference between its observed and predicted value:

eᵢ = Yᵢ − Ŷᵢ

Check the assumptions of regression by examining the residuals:
- Examine for Linearity assumption
- Evaluate Independence assumption
- Evaluate Normal distribution assumption
- Examine Equal variance for all levels of X

Graphical Analysis of Residuals
- Can plot residuals vs. Xᵢ

(38)

Residual Analysis for Linearity

[Figure: two columns of plots. Left: Y vs. X with a curved pattern and residuals vs. X showing a systematic curve → not linear. Right: Y vs. X with a straight-line pattern and residuals vs. X scattered randomly around zero → linear.]

(39)

Residual Analysis for

Independence

[Figure: residual plots. Left: residuals follow a systematic pattern → not independent. Right: residuals scatter randomly → independent.]

(40)

Checking for Normality

- Examine the Stem-and-Leaf Display of the Residuals
- Examine the Box-and-Whisker Plot of the Residuals
- Examine the Histogram of the Residuals
- Construct a Normal Probability Plot of the Residuals

(41)

Residual Analysis for

Equal Variance

[Figure: residual plots vs. X. Left: the spread of the residuals increases as X increases → unequal variance. Right: the spread of the residuals is constant across X → equal variance.]

(42)

Linear Regression Example

Excel Residual Output

[Figure: House Price Model residual plot, residuals vs. square feet; X axis from 0 to 3000]

RESIDUAL OUTPUT

Observation   Predicted House Price   Residuals
1             251.92316               -6.92316
2             273.87671               38.12329
3             284.85348               -5.85348
4             304.06284               3.93716
5             218.99284               -19.99284
6             268.38832               -49.38832
7             356.20251               48.79749
8             367.17929               -43.17929
9             254.66736               64.33264
10            284.85348               -29.85348
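The residual column can be reproduced directly from the fitted model; a Python sketch using the chapter's data:

```python
# Residuals e_i = Y_i - Yhat_i for the house-price model,
# matching Excel's RESIDUAL OUTPUT.
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]

print(round(residuals[0], 5))  # -6.92316 (house 1)
print(round(residuals[1], 5))  # 38.12329 (house 2)
```

A useful property for checking the computation: least-squares residuals always sum to (numerically) zero.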

(43)

Measuring Autocorrelation:

The Durbin-Watson Statistic

Used when data are collected over time to detect if autocorrelation is present

Autocorrelation exists if residuals in one time period are related to residuals in another period

(44)

Autocorrelation

Autocorrelation is correlation of the errors (residuals) over time

- Violates the regression assumption that residuals are statistically independent

[Figure: Time (t) residual plot with a cyclical pattern; here, the residuals suggest positive autocorrelation]

(45)

The Durbin-Watson Statistic

The Durbin-Watson statistic is used to test for autocorrelation:

D = Σ(eₜ − eₜ₋₁)² / Σeₜ²   (numerator summed over t = 2, …, n; denominator over t = 1, …, n)

D should be close to 2 if no autocorrelation is present. D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation.

(46)

The Durbin-Watson Statistic

H₀: positive autocorrelation does not exist
H₁: positive autocorrelation is present

Calculate the Durbin-Watson test statistic D (the Durbin-Watson statistic can be found using Excel)

Find the values d_L and d_U from the Durbin-Watson table (for sample size n and number of independent variables k)

Decision rule: reject H₀ if D < d_L; do not reject H₀ if D > d_U; otherwise (d_L ≤ D ≤ d_U) the test is inconclusive

[Diagram: number line from 0 to 2 with the "Reject H₀" region below d_L and the "Do not reject H₀" region above d_U]

(47)

The Durbin-Watson Statistic

Example with n = 25. Excel output:

Durbin-Watson Calculations
Sum of Squared Difference of Residuals   3296.18
Sum of Squared Residuals                 3279.98
Durbin-Watson Statistic                  1.00494

The fitted trend line is y = 30.65 + 4.7038x

D = 3296.18 / 3279.98 = 1.00494

(48)

The Durbin-Watson Statistic

Here, n = 25 and there is k = 1 independent variable

Using the Durbin-Watson table, d_L = 1.29 and d_U = 1.45

Decision: reject H₀ since D = 1.00494 < d_L = 1.29, and conclude that significant positive autocorrelation exists

Therefore the linear model is not the appropriate model to predict sales

[Diagram: number line from 0 to 2 with d_L = 1.29 and d_U = 1.45 marked; D = 1.00494 falls in the rejection region]
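The Durbin-Watson statistic is simple to compute once the residuals are in hand. A Python sketch (the n = 25 residual series behind this example is not reproduced on the slides, so the helper below is demonstrated on the slide's two reported sums and on a small hypothetical residual series):

```python
# Durbin-Watson statistic: D = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# The slide's example reports the two sums directly:
D = 3296.18 / 3279.98
print(round(D, 5))  # 1.00494, matching the Excel calculation

# Hypothetical smoothly drifting residuals (positive autocorrelation) push D toward 0:
print(round(durbin_watson([3, 2, 1, -1, -2, -3]), 2))  # 0.29
```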

(49)

Inferences About the Slope:

t Test

t test for a population slope:

- Is there a linear relationship between X and Y?

Null and alternative hypotheses:
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship does exist)

Test statistic:

t = (b₁ − β₁) / S_b₁ , with d.f. = n − 2

where b₁ = regression slope coefficient, β₁ = hypothesized slope (0 under H₀), and S_b₁ = standard error of the slope

(50)

Inferences About the Slope:

t Test Example

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700

Estimated Regression Equation:

house price = 98.25 + 0.1098 (sq.ft.)

The slope of this model is 0.1098

(51)

Inferences About the Slope:

t Test Example

H₀: β₁ = 0
H₁: β₁ ≠ 0

From Excel output:

              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

(52)

Inferences About the Slope:

t Test Example

Test Statistic: t = 3.329

d.f. = 10 − 2 = 8, α/2 = .025, so the critical values are ±t_{α/2} = ±2.3060

Decision: Reject H₀, since t = 3.329 falls in the upper rejection region (beyond 2.3060)

Conclusion: There is sufficient evidence that square footage affects house price

[Diagram: t distribution with rejection regions beyond −2.3060 and 2.3060; the test statistic 3.329 lies in the upper rejection region]
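The test statistic can be recomputed from the raw data: S_b₁ = S_YX / √Σ(Xᵢ − X̄)², then t = b₁ / S_b₁. A Python sketch:

```python
import math

# t test for the slope: t = b1 / S_b1, with S_b1 = S_YX / sqrt(S_xx), d.f. = n - 2.
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(sse / (n - 2))
s_b1 = s_yx / math.sqrt(s_xx)
t = b1 / s_b1

print(round(s_b1, 5))  # 0.03297, matching the Excel "Standard Error" for Square Feet
print(round(t, 3))     # 3.329
print(t > 2.3060)      # True: falls in the rejection region, so reject H0
```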

(53)

Inferences About the Slope:

t Test Example

H₀: β₁ = 0
H₁: β₁ ≠ 0

From Excel output:

              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

P-value = 0.01039: since 0.01039 < α = .05, reject H₀.

There is sufficient evidence that square footage affects house price.

(54)

F-Test for Significance

F Test statistic:

F = MSR / MSE

where MSR = SSR / k and MSE = SSE / (n − k − 1)

where F follows an F distribution with k numerator degrees of freedom and (n − k − 1) denominator degrees of freedom (k = the number of independent variables in the regression model)

(55)

F-Test for Significance

Excel Output

From the Excel output:

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

F = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848, with Significance F (the p-value) = 0.01039

(56)

F-Test for Significance

H₀: β₁ = 0
H₁: β₁ ≠ 0
α = .05
df₁ = 1, df₂ = 8

Critical Value: F_α = 5.32

Test Statistic: F = 11.0848

Decision: Reject H₀ at α = 0.05, since F = 11.0848 > 5.32

Conclusion: There is sufficient evidence that house size affects selling price
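The F statistic follows directly from the ANOVA table; a Python sketch using the sums of squares from the Excel output:

```python
# F test for significance: F = MSR / MSE, with df1 = k = 1 and df2 = n - k - 1 = 8.
ssr, sse = 18934.9348, 13665.5652  # from the ANOVA table
k, n = 1, 10
msr = ssr / k
mse = sse / (n - k - 1)
F = msr / mse

print(round(F, 4))  # 11.0848
print(F > 5.32)     # True: exceeds the critical value, so reject H0
```

With a single independent variable, F equals the square of the slope's t statistic (here 3.32938² ≈ 11.0848), so the F test and the slope t test give the same p-value.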

(57)

Confidence Interval Estimate

for the Slope

Confidence Interval Estimate of the Slope:

b₁ ± t_{α/2} S_b₁ , with d.f. = n − 2

Excel Printout for House Prices:

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)
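The interval can be reproduced from the Excel coefficients with t_{.025, 8} = 2.3060; a Python sketch:

```python
# 95% confidence interval for the slope: b1 ± t * S_b1,
# with b1 and S_b1 from the Excel output and t_{.025, 8} = 2.3060.
b1, s_b1, t_crit = 0.10977, 0.03297, 2.3060
lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 5), round(upper, 5))  # 0.03374 0.1858
```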

(58)

Confidence Interval Estimate

for the Slope

Since the units of the house price variable are $1000s, you are 95% confident that the mean change in sales price is between $33.74 and $185.80 per square foot of house size

From the Excel output, the interval for Square Feet is (Lower 95%, Upper 95%) = (0.03374, 0.18580)

Because this 95% confidence interval does not include 0, there is a significant relationship between house price and square feet at the .05 level of significance.

(59)

t Test for a Correlation Coefficient

Hypotheses:
H₀: ρ = 0 (no correlation between X and Y)
H₁: ρ ≠ 0 (correlation exists)

Test statistic:

t = r / √( (1 − r²) / (n − 2) ) , with n − 2 degrees of freedom

where r = +√(r²) if b₁ > 0 and r = −√(r²) if b₁ < 0

(60)

t Test for a Correlation Coefficient

Is there evidence of a linear relationship

between square feet and house price at the .05

level of significance?
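Using r² = 0.58082 from the Excel output (with r positive because the slope is positive), the test statistic can be sketched in Python:

```python
import math

# t test for the correlation coefficient: t = r / sqrt((1 - r^2) / (n - 2)).
r_sq, n = 0.58082, 10
r = math.sqrt(r_sq)  # positive sign because b1 > 0
t = r / math.sqrt((1 - r_sq) / (n - 2))

print(round(r, 4))  # 0.7621, the "Multiple R" from the Excel output
print(round(t, 3))  # 3.329, the same value as the t test for the slope
```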

(61)

t Test for a Correlation Coefficient

d.f. = 10 − 2 = 8, α/2 = .025, so the critical values are ±t_{α/2} = ±2.3060

Test statistic: t = 3.329, which falls in the upper rejection region

Decision: Reject H₀

Conclusion: There is evidence of a linear association at the 5% level of significance

[Diagram: t distribution with rejection regions beyond −2.3060 and 2.3060; t = 3.329 lies in the upper rejection region]

(62)

Estimating Mean Values and

Predicting Individual Values

Goal: Form intervals around Ŷ to express uncertainty about the value of Y for a given Xᵢ

Ŷ = b₀ + b₁X

- Confidence Interval for the mean of Y, given Xᵢ
- Prediction Interval for an individual Y, given Xᵢ

[Figure: regression line Ŷ = b₀ + b₁X with a narrower confidence-interval band for the mean of Y and a wider prediction-interval band for an individual Y at Xᵢ]

(63)

Confidence Interval for

the Average Y, Given X

Confidence interval estimate for the mean value of Y given a particular Xᵢ:

Ŷᵢ ± t_{α/2} S_YX √hᵢ , where hᵢ = 1/n + (Xᵢ − X̄)² / Σ(Xᵢ − X̄)² and d.f. = n − 2

(64)

Prediction Interval for

an Individual Y, Given X

Prediction interval estimate for an individual value of Y given a particular Xᵢ:

Ŷᵢ ± t_{α/2} S_YX √(1 + hᵢ) , where hᵢ = 1/n + (Xᵢ − X̄)² / Σ(Xᵢ − X̄)² and d.f. = n − 2

The extra 1 under the square root adds to the interval width to reflect the added uncertainty for an individual case

(65)

Estimation of Mean Values:

Example

Find the 95% confidence interval for the mean price of 2,000-square-foot houses

Predicted Price Ŷᵢ = 317.85 ($1000s)

Confidence Interval Estimate for μ_{Y|X=Xᵢ}: Ŷ ± t_{α/2} S_YX √(1/n + (Xᵢ − X̄)²/Σ(Xᵢ − X̄)²) = 317.85 ± 37.12

The confidence interval endpoints are 280.66 and 354.90, or from $280,660 to $354,900

(66)

Estimation of Individual

Values: Example

Find the 95% prediction interval for an individual house with 2,000 square feet

Predicted Price Ŷᵢ = 317.85 ($1000s)

Prediction Interval Estimate for Y_{X=Xᵢ}: Ŷ ± t_{α/2} S_YX √(1 + 1/n + (Xᵢ − X̄)²/Σ(Xᵢ − X̄)²) = 317.85 ± 102.28

The prediction interval endpoints are 215.50 and 420.07, or from $215,500 to $420,070
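Both intervals can be recomputed from the raw data; a Python sketch using t_{.025, 8} = 2.3060 and Xᵢ = 2000:

```python
import math

# 95% confidence interval for the mean of Y and prediction interval for an
# individual Y at X_i = 2000 square feet, recomputed from the raw data.
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(sse / (n - 2))

xi, t_crit = 2000, 2.3060                    # t_{.025, 8}
y_hat = b0 + b1 * xi
h = 1 / n + (xi - x_bar) ** 2 / s_xx         # 1/n + (X_i - x_bar)^2 / S_xx
ci_half = t_crit * s_yx * math.sqrt(h)       # half-width for the mean of Y
pi_half = t_crit * s_yx * math.sqrt(1 + h)   # half-width for an individual Y

print(round(ci_half, 2))  # 37.12
print(round(pi_half, 2))  # 102.28
print(round(y_hat - ci_half, 2), round(y_hat + ci_half, 2))  # 280.66 354.9
```

The prediction interval is always wider than the confidence interval at the same Xᵢ because of the extra 1 under the square root.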

(67)

Pitfalls of Regression Analysis

- Lacking an awareness of the assumptions underlying least-squares regression
- Not knowing how to evaluate the assumptions
- Not knowing the alternatives to least-squares regression if a particular assumption is violated
- Using a regression model without knowledge of the subject matter

(68)

Strategies for Avoiding

the Pitfalls of Regression

- Start with a scatter plot of X vs. Y to observe a possible relationship
- Perform residual analysis to check the assumptions
- Plot the residuals vs. X to check for violations of assumptions such as equal variance
- Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to check for normality

(69)

Strategies for Avoiding

the Pitfalls of Regression

- If there is violation of any assumption, use alternative methods or models
- If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
- Avoid making predictions or forecasts outside the relevant range of the data

(70)

Chapter Summary

- Introduced types of regression models
- Reviewed assumptions of regression and correlation
- Discussed determining the simple linear regression equation
- Described measures of variation
- Discussed residual analysis
- Addressed measuring autocorrelation

(71)

Chapter Summary

- Described inference about the slope
- Discussed correlation: measuring the strength of the association
- Addressed estimation of mean values and prediction of individual values
- Discussed possible pitfalls in regression and recommended strategies to avoid them
