Multiple Regression Model Building
Chapter 15
Objectives
In this chapter, you learn:
How to use quadratic terms in a regression model.
How to use transformed variables in a regression model.
How to measure the correlation among the independent variables.
How to build a regression model using either the stepwise or best-subsets approach.
How to avoid the pitfalls involved in developing a
multiple regression model.
Nonlinear Relationships
The relationship between the dependent variable and an independent variable may not be linear.
Review the scatter plot to check for nonlinear relationships.
Example: Quadratic model:
Y_i = β_0 + β_1 X_1i + β_2 X_1i² + ε_i
The second independent variable is the square of the first independent variable.
DCOVA
Quadratic Regression Model
Model form:
Y_i = β_0 + β_1 X_1i + β_2 X_1i² + ε_i
where:
β_0 = Y intercept
β_1 = regression coefficient for linear effect of X on Y
β_2 = regression coefficient for quadratic effect on Y
ε_i = random error in Y for observation i
Linear vs. Nonlinear Fit
Linear fit does not give random residuals.
Nonlinear fit gives random residuals.
[Figure: Y vs. X with a linear fit and its curved residual plot, and Y vs. X with a nonlinear fit and its patternless residual plot.]
Quadratic Regression Model
Y_i = β_0 + β_1 X_1i + β_2 X_1i² + ε_i
β_1 = the coefficient of the linear term
β_2 = the coefficient of the squared term
Quadratic models may be considered when the scatter plot takes on one of the following shapes:
[Figure: four Y vs. X_1 curves: β_1 < 0 with β_2 > 0; β_1 > 0 with β_2 > 0; β_1 < 0 with β_2 < 0; β_1 > 0 with β_2 < 0.]
Testing the Overall Quadratic Model
Estimate the quadratic model to obtain the regression equation:
Ŷ_i = b_0 + b_1 X_1i + b_2 X_1i²
Test for overall relationship:
H_0: β_1 = β_2 = 0 (no overall relationship between X and Y)
H_1: β_1 and/or β_2 ≠ 0 (there is a relationship between X and Y)
F_STAT = MSR / MSE
Testing for Significance: Quadratic Effect
Testing the quadratic effect:
Compare the quadratic regression equation
Ŷ_i = b_0 + b_1 X_1i + b_2 X_1i²
with the linear regression equation
Ŷ_i = b_0 + b_1 X_1i
Testing for Significance: Quadratic Effect (continued)
Testing the quadratic effect:
Consider the quadratic regression equation:
Ŷ_i = b_0 + b_1 X_1i + b_2 X_1i²
Hypotheses:
H_0: β_2 = 0 (the quadratic term does not improve the model)
H_1: β_2 ≠ 0 (the quadratic term improves the model)
Testing for Significance: Quadratic Effect (continued)
Testing the quadratic effect:
Hypotheses:
H_0: β_2 = 0 (the quadratic term does not improve the model)
H_1: β_2 ≠ 0 (the quadratic term improves the model)
The test statistic is:
t_STAT = (b_2 - β_2) / S_b2,   with d.f. = n - 3
where:
b_2 = squared-term slope coefficient
β_2 = hypothesized slope (zero)
S_b2 = standard error of the slope
Testing for Significance: Quadratic Effect (continued)
Testing the quadratic effect:
Compare the r² from the simple regression to the adjusted r² from the quadratic model.
If the adjusted r² from the quadratic model is larger than the r² from the simple model, the quadratic model is likely the better model.
Example: Quadratic Model
Purity increases as filter time increases.

Purity | Filter Time
3      | 1
7      | 2
8      | 3
15     | 5
22     | 7
33     | 8
40     | 10
54     | 12
67     | 13
70     | 14
78     | 15
85     | 15
87     | 16
99     | 17
Example: Quadratic Model (In Excel) (continued)
Simple regression results:
Ŷ = -11.283 + 5.985 Time

Regression Statistics:
R Square: 0.96888
Adjusted R Square: 0.96628
Standard Error: 6.15997
F: 373.57904   Significance F: 2.0778E-10

           Coefficients   Standard Error   t Stat      P-value
Intercept  -11.28267      3.46805          -3.25332    0.00691
Time       5.98520        0.30966          19.32819    2.078E-10

[Figure: Time residual plot: the residuals vs. Time show a clear curved pattern.]

The t statistic, F statistic, and r² are all high, but the residuals are not random.
Example: Quadratic Model (In Excel) (continued)
Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²

              Coefficients   Standard Error   t Stat    P-value
Intercept     1.53870        2.24465          0.68550   0.50722
Time          1.56496        0.60179          2.60052   0.02467
Time-squared  0.24516        0.03258          7.52406   1.165E-05

The quadratic term is statistically significant (p-value very small).
Example: Quadratic Model (In Excel) (continued)
Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²

Regression Statistics:
R Square: 0.99494
Adjusted R Square: 0.99402
Standard Error: 2.59513

The adjusted r² of the quadratic model is higher than the adjusted r² of the simple regression model. The quadratic model explains 99.4% of the variation in purity.
Example: Quadratic Model Residual Plots (continued)
Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²
[Figure: the residuals plotted versus both Time and Time-squared show a random pattern.]
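The Excel results above can be reproduced with a short script. The sketch below, assuming only NumPy is available, fits both the simple and quadratic models to the purity data by least squares and computes the t statistic for the squared term; the helper function and variable names are illustrative, not part of the slides.

```python
import numpy as np

# Purity / filter-time data from the slides
time = np.array([1, 2, 3, 5, 7, 8, 10, 12, 13, 14, 15, 15, 16, 17], dtype=float)
purity = np.array([3, 7, 8, 15, 22, 33, 40, 54, 67, 70, 78, 85, 87, 99], dtype=float)

def fit_ols(X, y):
    """Least-squares fit: returns coefficients, t stats, r2, and adjusted r2."""
    n, p = X.shape                          # p counts the intercept column
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    mse = sse / (n - p)                     # d.f. = n - p (n - 3 for the quadratic model)
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    return b, b / se, r2, adj_r2

ones = np.ones_like(time)
# Simple model: Purity = b0 + b1*Time
b_lin, t_lin, r2_lin, adj_lin = fit_ols(np.column_stack([ones, time]), purity)
# Quadratic model: Purity = b0 + b1*Time + b2*Time^2
b_quad, t_quad, r2_quad, adj_quad = fit_ols(
    np.column_stack([ones, time, time ** 2]), purity)

print(b_lin, r2_lin)      # ~ [-11.283, 5.985], r2 ~ 0.9689
print(b_quad, adj_quad)   # ~ [1.539, 1.565, 0.245], adjusted r2 ~ 0.9940
print(t_quad[2])          # t stat for the squared term, ~ 7.52
```

The coefficients, r² values, and t statistic match the Excel output shown above, confirming that the quadratic term is significant.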
Using Transformations in Regression Analysis
Idea:
Nonlinear models can often be transformed to a linear form.
They can then be estimated by least squares.
Transform X or Y or both to get a better fit or to deal with violations of the regression assumptions.
Transformations can be based on theory, logic, or scatter plots.
The Square Root Transformation
The square-root transformation:
Y_i = β_0 + β_1 √X_1i + ε_i
Used to:
Overcome violations of the normality assumption.
Overcome violations of the constant-variance assumption.
Fit a nonlinear relationship.
The Square Root Transformation (continued)
Shape of the original relationship: Y_i = β_0 + β_1 √X_1i + ε_i
Relationship when transformed: Y_i = β_0 + β_1 X'_1i + ε_i, where X'_1i = √X_1i
[Figure: Y vs. X curves for b_1 > 0 and b_1 < 0, and the corresponding linear relationships after transformation.]
Example Of Square Root Transformation
Consider this data: the scatter plot and residual plot show that variability increases as X increases. This is a violation of the constant-variance assumption.
[Figure: scatter plot and residual plot with increasing spread.]

Example Of Square Root Transformation (continued)
After transforming the dependent variable by taking the square root, the plots show that the constant-variance assumption is no longer violated.
[Figure: scatter plot and residual plot after the square-root transformation.]
The Log Transformation
The Multiplicative Model:
Original multiplicative model: Y_i = β_0 X_1i^β_1 ε_i
Transformed multiplicative model: log Y_i = log β_0 + β_1 log X_1i + log ε_i

The Exponential Model:
Original exponential model: Y_i = e^(β_0 + β_1 X_1i + β_2 X_2i) ε_i
Transformed exponential model: ln Y_i = β_0 + β_1 X_1i + β_2 X_2i + ln ε_i
Interpretation of Coefficients
For the multiplicative model:
log Y_i = log β_0 + β_1 log X_1i + log ε_i
When both the dependent and independent variables are logged:
The coefficient of the independent variable X_k can be interpreted as follows: a 1 percent change in X_k leads to an estimated b_k percent change in the mean value of Y.
Therefore b_k is the elasticity of Y with respect to a change in X_k.
Example Of The Log Transformation
Consider this data: the scatter plot and the residual plot both show a nonlinear pattern, indicating a need to transform Y.
[Figure: scatter plot and residual plot with a curved pattern.]

Example Of The Log Transformation (continued)
After transforming the dependent variable, the new model fits the data well and the residual plot shows no pattern.
[Figure: scatter plot and residual plot after the log transformation.]
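The logged multiplicative model can also be illustrated with simulated data: generate Y = β_0 · X^β_1 · ε and recover β_1 (the elasticity) by fitting the logged model. This is a minimal sketch with made-up data and coefficients, not the slides' dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a multiplicative model: Y = b0 * X^b1 * error
x = rng.uniform(1.0, 10.0, size=200)
error = np.exp(rng.normal(0.0, 0.05, size=200))   # multiplicative error term
y = 2.0 * x ** 1.5 * error

# Transformed model: log Y = log b0 + b1 * log X + log error
X = np.column_stack([np.ones_like(x), np.log(x)])
coef = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

b0_hat = np.exp(coef[0])
b1_hat = coef[1]           # elasticity: a 1% change in X gives ~b1% change in Y
print(b0_hat, b1_hat)      # close to the true values 2.0 and 1.5
```

Fitting the logged model recovers the exponent directly as the slope, which is why the slope in a log-log regression is read as an elasticity.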
Collinearity
Collinearity: High correlation exists among two or more independent variables.
This means the correlated variables contribute redundant information to the multiple regression model.
Collinearity
Including two highly correlated independent variables can adversely affect the regression results.
No new information provided.
Can lead to unstable coefficients (large standard error and low t-values).
Coefficient signs may not match prior expectations.
(continued)
Some Indications of Strong Collinearity
Incorrect signs on the coefficients.
Large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes nonsignificant when a new independent variable is added.
The estimate of the standard deviation of the model increases when a variable is added to the model.
Detecting Collinearity (Variance Inflationary Factor)
VIF_j is used to measure collinearity:
VIF_j = 1 / (1 - R²_j)
where R²_j is the coefficient of determination from regressing variable X_j on all the other X variables.
If VIF_j > 5, X_j is highly correlated with the other independent variables.
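The VIF definition above translates directly into code: regress each X_j on the remaining predictors and compute 1 / (1 - R²_j). The sketch below, assuming NumPy, uses synthetic data in which two predictors are deliberately collinear; all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors: x2 is nearly a copy of x1 (collinear); x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), regressing X_j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)   # x1 and x2 have large VIFs (> 5); x3's VIF is near 1
```

The collinear pair shows VIFs far above the rule-of-thumb cutoff of 5, while the independent predictor does not, mirroring how the criterion is applied in the slides.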
Model Building
The goal is to develop a model with the best set of independent variables:
A model is easier to interpret if unimportant variables are removed.
There is a lower probability of collinearity.
Model Building Example
Develop a model to predict Standby Hours using four independent variables:
Total Staff Present
Remote Hours
Dubner Hours
Total Labor Hours

First Check For Collinearity Using All The Independent Variables
The VIFs are small, indicating little evidence of collinearity.
Analysis Continues Looking For An Adequate Subset Of Independent Variables
Two model-building methods are commonly used to study multiple regression models:
Stepwise regression procedure:
Provides evaluation of alternative models as variables are added and deleted.
Best-subsets approach:
Tries all combinations and selects the best using the highest adjusted r² and the C_p criterion.
Stepwise Regression
Develop the least squares regression equation in steps.
Add one independent variable at a time and evaluate whether existing variables should remain in the model or be removed.
The coefficient of partial determination measures the marginal contribution of each independent variable, given that the other independent variables are in the model.
Stepwise Regression In Excel
• Use a significance level of 0.05 to enter a new variable into the model or to delete a variable from the model.
• The first variable entered was Total Staff because it was the most highly correlated with Standby Hours and its p-value < 0.05.
• Next, the variable Remote is added because it makes the largest contribution to the current model and its p-value < 0.05.
• Both variables are kept in the model at this point because each of their p-values is < 0.05.
• At this point, no other variable is added to the model because none meets the significance criterion of a p-value < 0.05.
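The entry step of stepwise regression can be sketched as a forward-selection loop: at each step, add the candidate variable with the largest partial F statistic, provided it clears a significance threshold. The sketch below (NumPy only, synthetic data) is a simplified forward-entry-only version of the procedure; it compares the partial F statistic against an approximate 5% critical value (about 4.0 for these degrees of freedom) rather than computing an exact p-value, and it omits the deletion check that full stepwise regression performs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60

# Synthetic data: y depends on x0 and x1, but not on x2
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def sse(cols):
    """SSE and parameter count for a model with an intercept plus the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return resid @ resid, A.shape[1]

F_CRIT = 4.0   # approximate 5% critical value of F(1, n - p) for n near 60
selected, remaining = [], [0, 1, 2]
while remaining:
    sse_reduced, _ = sse(selected)
    best_j, best_f = None, -1.0
    for j in remaining:
        sse_full, p_full = sse(selected + [j])
        # Partial F statistic for adding x_j to the current model
        f_stat = (sse_reduced - sse_full) / (sse_full / (n - p_full))
        if f_stat > best_f:
            best_j, best_f = j, f_stat
    if best_f > F_CRIT:        # entry criterion (roughly p-value < 0.05)
        selected.append(best_j)
        remaining.remove(best_j)
    else:
        break

print(selected)   # x0 and x1 enter the model
</```>

The informative predictors enter in order of their contribution, which mirrors how Total Staff and then Remote entered the Excel model above.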
Best Subsets Regression
Estimate all possible regression equations using all possible combinations of independent variables.
Choose the best fit by looking for the highest adjusted r² and the C_p criterion.
Best Subsets Criterion
Calculate the value of C_p for each potential regression model.
Consider models with C_p values close to or below k + 1, where k is the number of independent variables in the model under consideration.
Best Subsets Criterion (continued)
The C_p statistic:
C_p = ((1 - R²_k)(n - T)) / (1 - R²_T) - (n - 2(k + 1))
where:
k = number of independent variables included in a particular regression model
T = total number of parameters to be estimated in the full regression model
R²_k = coefficient of multiple determination for a model with k independent variables
R²_T = coefficient of multiple determination for the full model with all T parameters
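The C_p formula above is a one-liner in code. A minimal sketch (the function name is illustrative); note that for the full model itself, where k + 1 = T and R²_k = R²_T, the formula reduces to C_p = T, so the full model always meets the C_p ≤ k + 1 criterion exactly.

```python
def cp_statistic(r2_k, r2_full, n, k, T):
    """C_p = (1 - R^2_k)(n - T) / (1 - R^2_full) - (n - 2(k + 1))."""
    return (1 - r2_k) * (n - T) / (1 - r2_full) - (n - 2 * (k + 1))

# Full model (k + 1 = T, R^2_k = R^2_full): C_p equals T = k + 1
print(cp_statistic(0.90, 0.90, n=30, k=3, T=4))   # 4.0

# A smaller model with a noticeably lower R^2 gets a C_p far above k + 1
print(cp_statistic(0.85, 0.90, n=30, k=2, T=4))   # 15.0
```

A subset model whose R² is close to the full model's earns a small C_p; a subset that sacrifices too much fit is penalized, which is what makes C_p a useful screen in best-subsets regression.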
Best Subsets In Excel
• The C_p statistic often provides several alternative models for you to evaluate in depth (just not the case here).
• The best model or models using C_p might not include the model selected using stepwise regression.
• Note that here, the stepwise model has C_p = 8.4193 > k + 1 = 3.
• There may not be a uniquely best model, but there may be several equally appropriate models.
• Final model selection often involves subjective criteria, such as parsimony, interpretability, and departure from model assumptions.
Residual Analysis Should Be Done On The Chosen Model
Residual plots versus each independent variable show only random scatter (no pattern).

Residual Analysis (continued)
Residuals versus predicted Y show constant variance.
The Final Model Building Step Is Model Validation
Models can be validated via multiple methods:
Collect new data and compare the results.
Compare the results of the regression model to previous results.
If the data set is large enough, split the data into two parts and cross-validate the results: split the data prior to building the model, use one half to build the model, and use the other half to validate it.
Steps in Model Building
1. Choose the independent variables to be considered for the multiple regression model.
2. Develop a regression model that includes all of the independent variables.
3. Compute the VIF for each independent variable:
   If no VIF > 5, go to step 4.
   If exactly one VIF > 5, eliminate that variable and go to step 4.
   If more than one VIF > 5, eliminate the variable with the highest VIF and go back to step 2.
4. Perform best-subsets regression with the remaining variables.
Steps in Model Building (continued)
5. List all models with C_p close to or less than (k + 1).
6. Choose the best model from the models listed in step 5.
7. Perform a complete analysis with the chosen model, including residual analysis.
8. Based on the residual analysis, add quadratic and/or interaction terms, or transform variables as appropriate; repeat steps 2 through 6.
9. Validate the regression model. If validated, use the regression model for prediction and inference.
Pitfalls and Ethical Considerations
To avoid pitfalls and address ethical considerations:
Interpret the regression coefficient for a particular independent variable from the perspective of holding all other independent variables constant.
Evaluate residual plots for each independent variable.
Evaluate interaction and quadratic terms.
Additional Pitfalls and Ethical Considerations (continued)
To avoid pitfalls and address ethical considerations:
Obtain VIFs for each independent variable before determining which variables should be included in the model.
Examine several alternative models using best-subsets regression.
Use logistic regression when the dependent variable is categorical.
Validate the model before implementing it.
Chapter Summary
In this chapter we discussed:
How to use quadratic terms in a regression model.
How to use transformed variables in a regression model.
How to measure the correlation among the independent variables.
How to build a regression model using either the stepwise or best-subsets approach.
How to avoid the pitfalls involved in developing a
multiple regression model.