Multiple Regression Model Building
Chapter 15
Objectives
In this chapter, you learn:
How to use quadratic terms in a regression model.
How to use transformed variables in a regression model.
How to measure the correlation among the independent variables.
How to build a regression model using either the stepwise or best-subsets approach.
How to avoid the pitfalls involved in developing a
multiple regression model.
Nonlinear Relationships
The relationship between the dependent variable and an independent variable may not be linear.
Review the scatter plot to check for nonlinear relationships.
Example: Quadratic model:
Y_i = β_0 + β_1 X_1i + β_2 X_1i² + ε_i
The second independent variable is the square of the first independent variable.
DCOVA
Quadratic Regression Model
Model form:
Y_i = β_0 + β_1 X_1i + β_2 X_1i² + ε_i
where:
β_0 = Y intercept
β_1 = regression coefficient for linear effect of X on Y
β_2 = regression coefficient for quadratic effect on Y
ε_i = random error in Y for observation i
Linear vs. Nonlinear Fit
Linear fit does not give random residuals.
Nonlinear fit gives random residuals.
[Figure: Y vs. X with a linear fit and its curved residual plot, and Y vs. X with a nonlinear fit and its patternless residual plot.]
Quadratic Regression Model
Y_i = β_0 + β_1 X_1i + β_2 X_1i² + ε_i
β_1 = the coefficient of the linear term
β_2 = the coefficient of the squared term
Quadratic models may be considered when the scatter plot takes on one of the following shapes:
[Figure: four Y vs. X_1 curves: β_1 < 0 with β_2 > 0; β_1 > 0 with β_2 > 0; β_1 < 0 with β_2 < 0; β_1 > 0 with β_2 < 0.]
Testing the Overall Quadratic Model
Estimate the quadratic model to obtain the regression equation:
Ŷ_i = b_0 + b_1 X_1i + b_2 X_1i²
Test for overall relationship:
H_0: β_1 = β_2 = 0 (no overall relationship between X and Y)
H_1: β_1 and/or β_2 ≠ 0 (there is a relationship between X and Y)
F_STAT = MSR / MSE
Testing for Significance: Quadratic Effect
Testing the quadratic effect:
Compare the quadratic regression equation
Ŷ_i = b_0 + b_1 X_1i + b_2 X_1i²
with the linear regression equation
Ŷ_i = b_0 + b_1 X_1i
Testing for Significance: Quadratic Effect (continued)
Testing the quadratic effect:
Consider the quadratic regression equation:
Ŷ_i = b_0 + b_1 X_1i + b_2 X_1i²
Hypotheses:
H_0: β_2 = 0 (the quadratic term does not improve the model)
H_1: β_2 ≠ 0 (the quadratic term improves the model)
Testing for Significance: Quadratic Effect (continued)
Testing the quadratic effect:
Hypotheses:
H_0: β_2 = 0 (the quadratic term does not improve the model)
H_1: β_2 ≠ 0 (the quadratic term improves the model)
The test statistic is:
t_STAT = (b_2 - β_2) / S_b2,   with d.f. = n - 3
where:
b_2 = squared-term slope coefficient
β_2 = hypothesized slope (zero)
S_b2 = standard error of the slope
Testing for Significance: Quadratic Effect (continued)
Testing the quadratic effect:
Compare the r² from the simple regression to the adjusted r² from the quadratic model.
If the adjusted r² from the quadratic model is larger than the r² from the simple model, the quadratic model is likely the better model.
Example: Quadratic Model
Purity increases as filter time increases.

Purity | Filter Time
3      | 1
7      | 2
8      | 3
15     | 5
22     | 7
33     | 8
40     | 10
54     | 12
67     | 13
70     | 14
78     | 15
85     | 15
87     | 16
99     | 17
Example: Quadratic Model (In Excel) (continued)
Simple regression results:
Ŷ = -11.283 + 5.985 Time

Regression Statistics:
R Square: 0.96888
Adjusted R Square: 0.96628
Standard Error: 6.15997
F: 373.57904   Significance F: 2.0778E-10

           Coefficients   Standard Error   t Stat      P-value
Intercept  -11.28267      3.46805          -3.25332    0.00691
Time       5.98520        0.30966          19.32819    2.078E-10

[Figure: Time residual plot: the residuals vs. Time show a clear curved pattern.]

The t statistic, F statistic, and r² are all high, but the residuals are not random.
Example: Quadratic Model (In Excel) (continued)
Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²

              Coefficients   Standard Error   t Stat    P-value
Intercept     1.53870        2.24465          0.68550   0.50722
Time          1.56496        0.60179          2.60052   0.02467
Time-squared  0.24516        0.03258          7.52406   1.165E-05

The quadratic term is statistically significant (p-value very small).
Example: Quadratic Model (In Excel) (continued)
Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²

Regression Statistics:
R Square: 0.99494
Adjusted R Square: 0.99402
Standard Error: 2.59513

The adjusted r² of the quadratic model is higher than the adjusted r² of the simple regression model. The quadratic model explains 99.4% of the variation in purity.
Example: Quadratic Model Residual Plots (continued)
Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²
[Figure: the residuals plotted versus both Time and Time-squared show a random pattern.]
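The Excel results above can be reproduced with a short script. The sketch below, assuming only NumPy is available, fits both the simple and quadratic models to the purity data by least squares and computes the t statistic for the squared term; the helper function and variable names are illustrative, not part of the slides.

```python
import numpy as np

# Purity / filter-time data from the slides
time = np.array([1, 2, 3, 5, 7, 8, 10, 12, 13, 14, 15, 15, 16, 17], dtype=float)
purity = np.array([3, 7, 8, 15, 22, 33, 40, 54, 67, 70, 78, 85, 87, 99], dtype=float)

def fit_ols(X, y):
    """Least-squares fit: returns coefficients, t stats, r2, and adjusted r2."""
    n, p = X.shape                          # p counts the intercept column
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
    mse = sse / (n - p)                     # d.f. = n - p (n - 3 for the quadratic model)
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    return b, b / se, r2, adj_r2

ones = np.ones_like(time)
# Simple model: Purity = b0 + b1*Time
b_lin, t_lin, r2_lin, adj_lin = fit_ols(np.column_stack([ones, time]), purity)
# Quadratic model: Purity = b0 + b1*Time + b2*Time^2
b_quad, t_quad, r2_quad, adj_quad = fit_ols(
    np.column_stack([ones, time, time ** 2]), purity)

print(b_lin, r2_lin)      # ~ [-11.283, 5.985], r2 ~ 0.9689
print(b_quad, adj_quad)   # ~ [1.539, 1.565, 0.245], adjusted r2 ~ 0.9940
print(t_quad[2])          # t stat for the squared term, ~ 7.52
```

The coefficients, r² values, and t statistic match the Excel output shown above, confirming that the quadratic term is significant.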
Using Transformations in Regression Analysis
Idea:
Nonlinear models can often be transformed to a linear form.
They can then be estimated by least squares.
Transform X or Y or both to get a better fit or to deal with violations of the regression assumptions.
Transformations can be based on theory, logic, or scatter plots.
The Square Root Transformation
The square-root transformation:
Y_i = β_0 + β_1 √X_1i + ε_i
Used to:
Overcome violations of the normality assumption.
Overcome violations of the constant-variance assumption.
Fit a nonlinear relationship.
The Square Root Transformation (continued)
Shape of the original relationship: Y_i = β_0 + β_1 √X_1i + ε_i
Relationship when transformed: Y_i = β_0 + β_1 X'_1i + ε_i, where X'_1i = √X_1i
[Figure: Y vs. X curves for b_1 > 0 and b_1 < 0, and the corresponding linear relationships after transformation.]
Example Of Square Root Transformation
Consider this data: the scatter plot and residual plot show that variability increases as X increases. This is a violation of the constant-variance assumption.
[Figure: scatter plot and residual plot with increasing spread.]

Example Of Square Root Transformation (continued)
After transforming the dependent variable by taking the square root, the plots show that the constant-variance assumption is no longer violated.
[Figure: scatter plot and residual plot after the square-root transformation.]
The Log Transformation
The Multiplicative Model:
Original multiplicative model: Y_i = β_0 X_1i^β_1 ε_i
Transformed multiplicative model: log Y_i = log β_0 + β_1 log X_1i + log ε_i

The Exponential Model:
Original exponential model: Y_i = e^(β_0 + β_1 X_1i + β_2 X_2i) ε_i
Transformed exponential model: ln Y_i = β_0 + β_1 X_1i + β_2 X_2i + ln ε_i
Interpretation of Coefficients
For the multiplicative model:
log Y_i = log β_0 + β_1 log X_1i + log ε_i
When both the dependent and independent variables are logged:
The coefficient of the independent variable X_k can be interpreted as follows: a 1 percent change in X_k leads to an estimated b_k percent change in the mean value of Y.
Therefore b_k is the elasticity of Y with respect to a change in X_k.
Example Of The Log Transformation
Consider this data: the scatter plot and the residual plot both show a nonlinear pattern, indicating a need to transform Y.
[Figure: scatter plot and residual plot with a curved pattern.]

Example Of The Log Transformation (continued)
After transforming the dependent variable, the new model fits the data well and the residual plot shows no pattern.
[Figure: scatter plot and residual plot after the log transformation.]
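The logged multiplicative model can also be illustrated with simulated data: generate Y = β_0 · X^β_1 · ε and recover β_1 (the elasticity) by fitting the logged model. This is a minimal sketch with made-up data and coefficients, not the slides' dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a multiplicative model: Y = b0 * X^b1 * error
x = rng.uniform(1.0, 10.0, size=200)
error = np.exp(rng.normal(0.0, 0.05, size=200))   # multiplicative error term
y = 2.0 * x ** 1.5 * error

# Transformed model: log Y = log b0 + b1 * log X + log error
X = np.column_stack([np.ones_like(x), np.log(x)])
coef = np.linalg.lstsq(X, np.log(y), rcond=None)[0]

b0_hat = np.exp(coef[0])
b1_hat = coef[1]           # elasticity: a 1% change in X gives ~b1% change in Y
print(b0_hat, b1_hat)      # close to the true values 2.0 and 1.5
```

Fitting the logged model recovers the exponent directly as the slope, which is why the slope in a log-log regression is read as an elasticity.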
Collinearity
Collinearity: High correlation exists among two or more independent variables.
This means the correlated variables contribute redundant information to the multiple regression model.
Collinearity
Including two highly correlated independent variables can adversely affect the regression results.
No new information provided.
Can lead to unstable coefficients (large standard error and low t-values).
Coefficient signs may not match prior expectations.
(continued)
Some Indications of Strong Collinearity
Incorrect signs on the coefficients.
Large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes nonsignificant when a new independent variable is added.
The estimate of the standard deviation of the model increases when a variable is added to the model.
Detecting Collinearity (Variance Inflationary Factor)
VIF_j is used to measure collinearity:
VIF_j = 1 / (1 - R²_j)
where R²_j is the coefficient of determination from regressing variable X_j on all the other X variables.
If VIF_j > 5, X_j is highly correlated with the other independent variables.
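The VIF definition above translates directly into code: regress each X_j on the remaining predictors and compute 1 / (1 - R²_j). The sketch below, assuming NumPy, uses synthetic data in which two predictors are deliberately collinear; all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors: x2 is nearly a copy of x1 (collinear); x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), regressing X_j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)   # x1 and x2 have large VIFs (> 5); x3's VIF is near 1
```

The collinear pair shows VIFs far above the rule-of-thumb cutoff of 5, while the independent predictor does not, mirroring how the criterion is applied in the slides.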
Model Building
The goal is to develop a model with the best set of independent variables:
A model is easier to interpret if unimportant variables are removed.
There is a lower probability of collinearity.
Model Building Example
Develop a model to predict Standby Hours using four independent variables:
Total Staff Present
Remote Hours
Dubner Hours
Total Labor Hours

First Check For Collinearity Using All The Independent Variables
The VIFs are small, indicating little evidence of collinearity.
Analysis Continues Looking For An Adequate Subset Of Independent Variables
Two model-building methods are commonly used to study multiple regression models:
Stepwise regression procedure:
Provides evaluation of alternative models as variables are added and deleted.
Best-subsets approach:
Tries all combinations and selects the best using the highest adjusted r² and the C_p criterion.
Stepwise Regression
Develop the least squares regression equation in steps.
Add one independent variable at a time and evaluate whether existing variables should remain in the model or be removed.
The coefficient of partial determination measures the marginal contribution of each independent variable, given that the other independent variables are in the model.
Stepwise Regression In Excel
• Use a significance level of 0.05 to enter a new variable into the model or to delete a variable from the model.
• The first variable entered was Total Staff because it was the most highly correlated with Standby Hours and its p-value < 0.05.
• Next, the variable Remote is added because it makes the largest contribution to the current model and its p-value < 0.05.
• Both variables are kept in the model at this point because each of their p-values is < 0.05.
• At this point, no other variable is added to the model because none meets the significance criterion of a p-value < 0.05.
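The entry step of stepwise regression can be sketched as a forward-selection loop: at each step, add the candidate variable with the largest partial F statistic, provided it clears a significance threshold. The sketch below (NumPy only, synthetic data) is a simplified forward-entry-only version of the procedure; it compares the partial F statistic against an approximate 5% critical value (about 4.0 for these degrees of freedom) rather than computing an exact p-value, and it omits the deletion check that full stepwise regression performs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60

# Synthetic data: y depends on x0 and x1, but not on x2
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def sse(cols):
    """SSE and parameter count for a model with an intercept plus the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return resid @ resid, A.shape[1]

F_CRIT = 4.0   # approximate 5% critical value of F(1, n - p) for n near 60
selected, remaining = [], [0, 1, 2]
while remaining:
    sse_reduced, _ = sse(selected)
    best_j, best_f = None, -1.0
    for j in remaining:
        sse_full, p_full = sse(selected + [j])
        # Partial F statistic for adding x_j to the current model
        f_stat = (sse_reduced - sse_full) / (sse_full / (n - p_full))
        if f_stat > best_f:
            best_j, best_f = j, f_stat
    if best_f > F_CRIT:        # entry criterion (roughly p-value < 0.05)
        selected.append(best_j)
        remaining.remove(best_j)
    else:
        break

print(selected)   # x0 and x1 enter the model
</```>

The informative predictors enter in order of their contribution, which mirrors how Total Staff and then Remote entered the Excel model above.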
Best Subsets Regression
Estimate all possible regression equations using all possible combinations of independent variables.
Choose the best fit by looking for the highest adjusted r² and the C_p criterion.
Best Subsets Criterion
Calculate the value of C_p for each potential regression model.
Consider models with C_p values close to or below k + 1, where k is the number of independent variables in the model under consideration.
Best Subsets Criterion (continued)
The C_p statistic:
C_p = ((1 - R²_k)(n - T)) / (1 - R²_T) - (n - 2(k + 1))
where:
k = number of independent variables included in a particular regression model
T = total number of parameters to be estimated in the full regression model
R²_k = coefficient of multiple determination for a model with k independent variables
R²_T = coefficient of multiple determination for the full model with all T parameters
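The C_p formula above is a one-liner in code. A minimal sketch (the function name is illustrative); note that for the full model itself, where k + 1 = T and R²_k = R²_T, the formula reduces to C_p = T, so the full model always meets the C_p ≤ k + 1 criterion exactly.

```python
def cp_statistic(r2_k, r2_full, n, k, T):
    """C_p = (1 - R^2_k)(n - T) / (1 - R^2_full) - (n - 2(k + 1))."""
    return (1 - r2_k) * (n - T) / (1 - r2_full) - (n - 2 * (k + 1))

# Full model (k + 1 = T, R^2_k = R^2_full): C_p equals T = k + 1
print(cp_statistic(0.90, 0.90, n=30, k=3, T=4))   # 4.0

# A smaller model with a noticeably lower R^2 gets a C_p far above k + 1
print(cp_statistic(0.85, 0.90, n=30, k=2, T=4))   # 15.0
```

A subset model whose R² is close to the full model's earns a small C_p; a subset that sacrifices too much fit is penalized, which is what makes C_p a useful screen in best-subsets regression.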
Best Subsets In Excel
• The C_p statistic often provides several alternative models for you to evaluate in depth (just not the case here).
• The best model or models using C_p might not include the model selected using stepwise regression.
• Note that here, the stepwise model has C_p = 8.4193 > k + 1 = 3.
• There may not be a uniquely best model, but there may be several equally appropriate models.
• Final model selection often involves subjective criteria, such as parsimony, interpretability, and departure from model assumptions.
Residual Analysis Should Be Done On The Chosen Model
Residual plots versus each independent variable show only random scatter (no pattern).

Residual Analysis (continued)
Residuals versus predicted Y show constant variance.
The Final Model Building Step Is Model Validation
Models can be validated via multiple methods:
Collect new data and compare the results.
Compare the results of the regression model to previous results.
If the data set is large enough, split the data into two parts and cross-validate the results: split the data prior to building the model, use one half to build the model, and use the other half to validate it.
Steps in Model Building
1. Choose the independent variables to be considered for the multiple regression model.
2. Develop a regression model that includes all of the independent variables.
3. Compute the VIF for each independent variable:
   If no VIF > 5, go to step 4.
   If exactly one VIF > 5, eliminate that variable and go to step 4.
   If more than one VIF > 5, eliminate the variable with the highest VIF and go back to step 2.
4. Perform best-subsets regression with the remaining variables.
Steps in Model Building (continued)
5. List all models with C_p close to or less than (k + 1).
6. Choose the best model from the models listed in step 5.
7. Perform a complete analysis with the chosen model, including residual analysis.
8. Based on the residual analysis, add quadratic and/or interaction terms, or transform variables as appropriate; repeat steps 2 through 6.
9. Validate the regression model. If validated, use the regression model for prediction and inference.
Pitfalls and Ethical Considerations
To avoid pitfalls and address ethical considerations:
Interpret the regression coefficient for a particular independent variable from the perspective of holding all other independent variables constant.
Evaluate residual plots for each independent variable.
Evaluate interaction and quadratic terms.
Additional Pitfalls and Ethical Considerations (continued)
To avoid pitfalls and address ethical considerations:
Obtain VIFs for each independent variable before determining which variables should be included in the model.
Examine several alternative models using best-subsets regression.
Use logistic regression when the dependent variable is categorical.
Validate the model before implementing it.
Chapter Summary
In this chapter we discussed:
How to use quadratic terms in a regression model.
How to use transformed variables in a regression model.
How to measure the correlation among the independent variables.
How to build a regression model using either the stepwise or best-subsets approach.
How to avoid the pitfalls involved in developing a
multiple regression model.