Chapter 14 - Introduction to Multiple Regression

Academic year: 2019

(1)

Statistics for Managers

Using Microsoft® Excel

5th Edition

Chapter 14

(2)

Learning Objectives

In this chapter, you learn:

How to develop a multiple regression model

How to interpret the regression coefficients

How to determine which independent variables to include in the regression model

How to determine which independent variables are most important in predicting a dependent variable

How to use categorical variables in a regression model

(3)

The Multiple Regression

Model

Idea: Examine the linear relationship between 1 dependent variable (Y) and 2 or more independent variables (Xi).

Multiple Regression Model with k Independent Variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

(4)

Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.

(5)

Multiple Regression Equation

Example with two independent variables:

Ŷi = b0 + b1X1i + b2X2i

[3-D plot: the fitted regression plane over the (X1, X2) plane, showing the slope for variable X1]

(6)

Multiple Regression Equation

2 Variable Example

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.

Dependent variable: Pie sales (units per week)
Independent variables: Price (in $) and Advertising ($100s)

(7)

Multiple Regression Equation

2 Variable Example

Sales = b0 + b1(Price) + b2(Advertising)

Sales = b0 + b1X1 + b2X2

where X1 = Price
      X2 = Advertising

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3

(8)

Multiple Regression Equation

2 Variable Example, Excel

Regression Statistics
Multiple R            0.72213
R Square              0.52148
Adjusted R Square     0.44172
Standard Error        47.46341
Observations          15

ANOVA         df   SS          MS          F         Significance F
Regression     2   29460.027   14730.013   6.53861   0.01201
Residual      12   27033.306    2252.776
Total         14   56493.333

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     306.52619      114.25389         2.68285   0.01993    57.58835   555.46404
Price         -24.97509       10.83213        -2.30565   0.03979   -48.57626    -1.37392
Advertising    74.13096       25.96732         2.85478   0.01449    17.55303   130.70888

(9)

Multiple Regression Equation

2 Variable Example

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising

b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price

where
Sales is in number of pies per week
Price is in $
Advertising is in $100s

(10)

Multiple Regression Equation

2 Variable Example

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

Sales = 306.526 - 24.975(5.50) + 74.131(3.5) = 428.62

Predicted sales is 428.62 pies

Note that Advertising is in $100s, so $350 means X2 = 3.5.
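The prediction above can be sketched in a few lines of Python (not part of the original slides), using the coefficients reported in the Excel output:

```python
# Applying the fitted pie-sales equation from the Excel regression output.
b0, b1, b2 = 306.52619, -24.97509, 74.13096

def predict_sales(price, advertising_100s):
    """Predicted weekly pie sales for a given price ($) and advertising ($100s)."""
    return b0 + b1 * price + b2 * advertising_100s

# $5.50 price, $350 advertising -> X2 = 3.5, since advertising is measured in $100s
sales = predict_sales(5.50, 3.5)
print(round(sales, 2))  # 428.62
```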

(11)

Coefficient of

Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together:

r² = SSR / SST = regression sum of squares / total sum of squares

(12)

Coefficient of

Multiple Determination (Excel)

(Same Excel output as before.)

r² = SSR / SST = 29460.0 / 56493.3 = .52148

(13)

Adjusted r²

r² never decreases when a new X variable is added to the model

This can be a disadvantage when comparing models

What is the net effect of adding a new variable?

We lose a degree of freedom when a new X variable is added

Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

(14)

Adjusted r²

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²adj = 1 - [(1 - r²)(n - 1) / (n - k - 1)]

(where n = sample size, k = number of independent variables)

Penalizes excessive use of unimportant independent variables

Smaller than r²

Useful in comparing models
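Both measures can be checked directly from the ANOVA sums of squares on the earlier Excel output. A quick sketch (not part of the original slides):

```python
# r-squared and adjusted r-squared from the ANOVA sums of squares (n = 15, k = 2).
SSR, SST = 29460.027, 56493.333
n, k = 15, 2

r2 = SSR / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2, 5), round(r2_adj, 5))  # 0.52148 0.44172
```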

(15)

Adjusted r²

(Same Excel output as before.)

r²adj = .44172

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables.

(16)

F-Test for Overall Significance

F-Test for Overall Significance of the Model

Shows if there is a linear relationship between all of the X variables considered together and Y

Use F test statistic

Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)

(17)

F-Test for Overall Significance

Test statistic:

F = (SSR / k) / (SSE / (n - k - 1)) = MSR / MSE

where F has df1 (numerator) = k and df2 (denominator) = (n - k - 1) degrees of freedom

(18)

F-Test for Overall Significance

(Same Excel output as before.)

F = 14730.013 / 2252.776 = 6.53861

With 2 and 12 degrees of freedom; Significance F (the p-value) is 0.01201

(19)

F-Test for Overall Significance

H0: β1 = β2 = 0
H1: β1 and β2 not both zero

α = .05
df1 = 2   df2 = 12

Test Statistic: F = 6.5386

Critical Value: F = 3.885

Decision: Since the F test statistic is in the rejection region (p-value < .05), reject H0

Conclusion: There is evidence that at least one independent variable affects Y
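The overall F statistic can be reproduced from the ANOVA table. A minimal sketch (not part of the original slides):

```python
# Overall F statistic from the ANOVA table (df1 = k = 2, df2 = n - k - 1 = 12).
SSR, SSE = 29460.027, 27033.306
k, n = 2, 15

MSR = SSR / k            # regression mean square
MSE = SSE / (n - k - 1)  # error mean square
F = MSR / MSE

print(round(F, 4))  # 6.5386
# 6.5386 > F_crit = 3.885 at alpha = .05, so reject H0
```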

(20)

Residuals in Multiple Regression

Two variable model

The best fitting linear regression equation, Ŷ, is found by minimizing the sum of squared errors, Σe²

Residual for a sample observation: ei = (Yi – Ŷi)

[Plot: the fitted plane over (X1, X2) with a sample observation; the residual is the vertical distance from the observation to the plane]

(21)

Multiple Regression Assumptions

Assumptions:

The errors are independent
The errors are normally distributed
Errors have an equal variance

Errors (residuals) from the regression model: ei = (Yi – Ŷi)

(22)

Multiple Regression Assumptions

These residual plots are used in multiple regression:

Residuals vs. Ŷi
Residuals vs. X1i
Residuals vs. X2i
Residuals vs. time (if time series data)

Use the residual plots to check for violations of regression assumptions

(23)

Individual Variables

Tests of Hypothesis

Use t-tests of individual variable slopes

Shows if there is a linear relationship between the variable Xi and Y

Hypotheses:
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship does exist between Xi and Y)

(24)

Individual Variables

Tests of Hypothesis

H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)

Test Statistic:

t = bj / Sbj     (df = n – k – 1)

(25)

Individual Variables

Tests of Hypothesis

(Same Excel output as before.)

t-value for Price is t = -2.306, with p-value .0398

t-value for Advertising is t = 2.855, with p-value .0145

(26)

Individual Variables

Tests of Hypothesis

H0: βj = 0
H1: βj ≠ 0

d.f. = 15 - 2 - 1 = 12
α = .05
tα/2 = 2.1788

The test statistic for each variable falls in the rejection region (p-values < .05)

Decision: Reject H0 for each variable

Conclusion: There is evidence that both Price and Advertising affect pie sales at α = .05

              Coefficients   Standard Error   t Stat     P-value
Price         -24.97509       10.83213        -2.30565   0.03979
Advertising    74.13096       25.96732         2.85478   0.01449
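Each slope's t statistic is just the coefficient divided by its standard error. A sketch (not part of the original slides), using the values in the table above:

```python
# t statistics for the individual slopes (df = n - k - 1 = 12).
coefs = {"Price": (-24.97509, 10.83213), "Advertising": (74.13096, 25.96732)}

t_stats = {name: b / s for name, (b, s) in coefs.items()}
print({name: round(t, 5) for name, t in t_stats.items()})
# Price: -2.30565, Advertising: 2.85478; both exceed |t_crit| = 2.1788, so reject H0
```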

(27)

Confidence Interval Estimate

for the Slope

Confidence interval for the population slope βj:

bj ± t(n–k–1) Sbj     where t has (n – k – 1) d.f.

Here, t has (15 – 2 – 1) = 12 d.f.

              Coefficients   Standard Error
Intercept     306.52619      114.25389
Price         -24.97509       10.83213
Advertising    74.13096       25.96732

Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales, holding constant the effects of advertising:

-24.975 ± (2.1788)(10.832): So the interval is (-48.576, -1.374)

(28)

Confidence Interval Estimate

for the Slope

Confidence interval for the population slope βj

Example: Excel output also reports these interval endpoints:

              Coefficients   Standard Error   Lower 95%   Upper 95%
Price         -24.97509       10.83213        -48.57626   -1.37392

Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each increase of $1 in the selling price, holding constant the effects of advertising.
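The interval is easy to reproduce by hand. A sketch (not part of the original slides), using t_crit = 2.1788 with 12 d.f.:

```python
# 95% confidence interval for the Price slope: b1 +/- t_crit * S_b1.
b1, s_b1 = -24.97509, 10.83213
t_crit = 2.1788  # t with 12 d.f., alpha = .05, two-tailed

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 3), round(upper, 3))  # approx -48.576 -1.374
```

The small difference from Excel's (-48.57626, -1.37392) comes from rounding t_crit to four decimals.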

(29)

Testing Portions of the Multiple

Regression Model

Contribution of a Single Independent Variable Xj

SSR(Xj | all variables except Xj)
= SSR(all variables) – SSR(all variables except Xj)

Measures the contribution of Xj in explaining the total variation in Y

(30)

Testing Portions of the Multiple

Regression Model

Contribution of a Single Independent Variable Xj, assuming all other variables are already included (consider here a 3-variable model):

SSR(X1 | X2 and X3)
= SSR(all variables) – SSR(X2 and X3)

From the ANOVA section of the regression for Ŷ = b0 + b1X1 + b2X2 + b3X3, and from the ANOVA section of the regression for Ŷ = b0 + b2X2 + b3X3

Measures the contribution of X1 in explaining SST

(31)

The Partial F-Test Statistic

Consider the hypothesis test:

H0: variable Xj does not significantly improve the model after all other variables are included
H1: variable Xj significantly improves the model after all other variables are included

Test using the F-test statistic (with 1 and n-k-1 d.f.):

F = SSR(Xj | all variables except j) / MSE

(32)

Testing Portions of Model:

Example

Test at the α = .05 level to determine whether the price variable significantly improves the model, given that advertising is included.

(33)

Testing Portions of Model:

Example

H0: X1 (price) does not improve the model with X2 (advertising) included
H1: X1 does improve the model

α = .05, df = 1 and 12
F critical value = 4.75

(For X1 and X2)
ANOVA         df   SS            MS
Regression     2   29460.02687   14730.01343
Residual      12   27033.30647    2252.775539
Total         14   56493.33333

(For X2 only)
ANOVA         df   SS
Regression     1   17484.22249

(34)

Testing Portions of Model:

Example

F = SSR(X1 | X2) / MSE(all) = (29460.027 – 17484.222) / 2252.776 = 5.316

Conclusion: Since F = 5.316 > 4.75, reject H0; adding X1 does improve the model

(For X1 and X2)
ANOVA         df   SS            MS
Regression     2   29460.02687   14730.01343
Residual      12   27033.30647    2252.775539
Total         14   56493.33333

(For X2 only)
ANOVA         df   SS
Regression     1   17484.22249
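The partial F computation can be sketched from the two ANOVA tables above (not part of the original slides):

```python
# Partial F test for Price (X1) given Advertising (X2), df = 1 and 12.
SSR_all = 29460.02687     # regression SS with X1 and X2
SSR_x2 = 17484.22249      # regression SS with X2 only
MSE_all = 2252.775539     # MSE of the full model

SSR_x1_given_x2 = SSR_all - SSR_x2   # contribution of X1
F = SSR_x1_given_x2 / MSE_all

print(round(F, 3))  # 5.316, which exceeds F_crit = 4.75
# Note: F equals the square of the Price t statistic, (-2.30565)^2
```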

(35)

Relationship Between Test

Statistics

The partial F test statistic developed in this section and the t test statistic are both used to determine the contribution of an independent variable to a multiple regression model.

The hypothesis tests associated with these two statistics always result in the same decision (that is, the p-values are identical):

t²a = F1,a

(where a = the degrees of freedom)

(36)

Coefficient of Partial Determination

for k Variable Model

Measures the proportion of variation in the dependent variable that is explained by Xj while controlling for (holding constant) the other independent variables.

For the k variable model:

r²Yj.(all variables except j)
= SSR(Xj | all variables except j) / [SST – SSR(all variables) + SSR(Xj | all variables except j)]
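Applying this formula to the pie-sales numbers from the earlier slides gives the partial determination for Price, a computation sketched here (not shown on the original slides):

```python
# Coefficient of partial determination for Price (X1), controlling for X2.
SST = 56493.33333
SSR_all = 29460.02687
SSR_x1_given_x2 = 29460.02687 - 17484.22249  # = SSR(X1 | X2)

r2_y1_2 = SSR_x1_given_x2 / (SST - SSR_all + SSR_x1_given_x2)
print(round(r2_y1_2, 3))  # approx 0.307
# Holding advertising constant, price explains about 30.7% of the remaining variation.
```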

(37)

Using Dummy Variables

A dummy variable is a categorical independent variable with two levels:
yes or no, on or off, male or female
coded as 0 or 1

Assumes equal slopes for other variables

If more than two levels, the number of dummy variables needed is one less than the number of levels

(38)

Dummy Variable Example

Let:
Y = pie sales
X1 = price
X2 = holiday
(X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)

Ŷ = b0 + b1X1 + b2X2

(39)

Dummy Variable Example

[Plot: Y (sales) vs. X1 (Price), two parallel lines with the same slope; the Holiday line has intercept b0 + b2, the No Holiday line has intercept b0]

(40)

Dummy Variable Example

Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week
         0 if no holiday occurred

Sales = 300 - 30(Price) + 15(Holiday)

b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price
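A sketch of the fitted dummy-variable model (not part of the original slides), showing that the holiday effect is a constant shift of b2 = 15 pies at any price:

```python
# Fitted dummy-variable model: Sales = 300 - 30(Price) + 15(Holiday).
def predict_sales(price, holiday):
    """holiday = 1 if a holiday occurred during the week, else 0."""
    return 300 - 30 * price + 15 * holiday

# At the same price, a holiday week is predicted to sell exactly b2 = 15 more pies:
diff = predict_sales(5.0, 1) - predict_sales(5.0, 0)
print(diff)  # 15
```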

(41)

Dummy Variable Model

More Than Two Levels

The number of dummy variables is one less than the number of levels

Example:
Y = house price; X1 = square feet

If style of the house is also thought to matter:
Style = ranch, split level, colonial

(42)

Dummy Variable Model

More Than Two Levels

Example: Let “colonial” be the default category, and let X2 and X3 be used for the other two categories:

Y = house price
X1 = square feet
X2 = 1 if ranch, 0 otherwise
X3 = 1 if split level, 0 otherwise

The multiple regression equation is:

Ŷ = b0 + b1X1 + b2X2 + b3X3

(43)

Dummy Variable Model

More Than Two Levels

Consider the regression equation:

Ŷ = b0 + b1X1 + 23.53X2 + 18.84X3

With the same square feet, a ranch will have an estimated average price of 23.53 thousand dollars more than a colonial.

With the same square feet, a split-level will have an estimated average price of 18.84 thousand dollars more than a colonial.

(44)

Interaction Between

Independent Variables

Hypothesizes interaction between pairs of X variables

Response to one X variable may vary at different levels of another X variable

Contains a two-way cross product term

(45)

Effect of Interaction

Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

Without interaction term, effect of X1 on Y is measured by β1

With interaction term, effect of X1 on Y is measured by β1 + β3X2

Effect changes as X2 changes

(46)

Interaction Example

Y = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1: Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

Slopes are different if the effect of X1 on Y depends on X2 value

[Plot: Y vs. X1 for 0 ≤ X1 ≤ 1.5, showing the two lines 4 + 6X1 and 1 + 2X1 diverging]
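The two conditional slopes can be verified directly from the interaction model. A sketch (not part of the original slides):

```python
# Interaction model from the slide: Y = 1 + 2*X1 + 3*X2 + 4*X1*X2.
def y(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope in X1 at each level of X2 (rise over a unit run in X1):
slope_at_x2_0 = y(1, 0) - y(0, 0)  # X2 = 0: Y = 1 + 2*X1
slope_at_x2_1 = y(1, 1) - y(0, 1)  # X2 = 1: Y = 4 + 6*X1
print(slope_at_x2_0, slope_at_x2_1)  # 2 6
```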

(47)

Significance of Interaction

Term

Can perform a partial F-test for the contribution of a variable to see if the addition of an interaction term improves the model

Multiple interaction terms can be included

Use a partial F-test for the simultaneous contribution of multiple variables to the model

(48)

Simultaneous Contribution of

Independent Variables

Use partial F-test for the simultaneous contribution of multiple variables to the model

Let m variables be an additional set of variables added simultaneously

To test the hypothesis that the set of m variables improves the model:

F = [SSR(all) – SSR(all except new set of m variables)] / m / MSE(all)
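A sketch of this simultaneous partial F (not part of the original slides). Since the deck gives no multi-variable example, it is illustrated with m = 1 (adding Price to the Advertising-only model), which reproduces the single-variable partial F from the earlier slides:

```python
# Simultaneous partial F for a set of m variables added to the model:
#   F = [SSR(all) - SSR(all except the new set)] / m / MSE(all)
def partial_f(ssr_all, ssr_reduced, m, mse_all):
    return (ssr_all - ssr_reduced) / m / mse_all

# m = 1 example from the pie-sales data: adding Price to the X2-only model.
F = partial_f(29460.02687, 17484.22249, 1, 2252.775539)
print(round(F, 3))  # 5.316
```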

(49)

Chapter Summary

Developed the multiple regression model

Tested the significance of the multiple regression model

Discussed adjusted r²

Discussed using residual plots to check model assumptions

(50)

Chapter Summary

Tested individual regression coefficients

Tested portions of the regression model

Used dummy variables

Evaluated interaction effects
