5.3 Simplifying the Model
5. CASE STUDY: FIVE INDEPENDENT VARIABLES
able at a time from the regression model. Removing one variable from the model will cause the regression coefficients and the partial sums of squares for the remaining variables to change (unless they are orthogonal to the variable dropped). These results do indicate that not all five independent variables are needed in the model. It would appear that any one of the four, SALINITY, pH, Na, or K, could be dropped without causing a significant decline in predictability of BIOMASS. It is not clear at this stage of the analysis, however, that more than one can be dropped.
There are several approaches for deciding which variables to include in the final model. These are studied in Chapter 7. For this example, one variable at a time is eliminated: the one whose elimination will cause the smallest increase in the residual sum of squares. The process will stop when the partial sums of squares for all variables remaining in the model are significant (α = .05). As discussed in Chapter 7, data-driven variable selection and multiple testing to arrive at the final model alter the true significance levels; probability levels and confidence intervals should be used with caution.
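The elimination rule just described can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the computation used for the Linthurst data: the helper names (`solve`, `residual_ss`, `backward_eliminate`), the made-up variables `x1` to `x3`, and the single approximate critical value 4.08 (roughly the 5% point of F with 1 and 40 degrees of freedom) are all assumptions introduced here.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def residual_ss(columns, y):
    """Residual SS from regressing y on an intercept plus the given columns."""
    design = [[1.0] * len(y)] + [list(c) for c in columns]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def backward_eliminate(data, y, f_crit):
    """Drop the variable with the smallest partial SS until all are significant."""
    kept = list(data)
    while kept:
        full = residual_ss([data[k] for k in kept], y)
        s2 = full / (len(y) - len(kept) - 1)  # residual mean square
        # Partial SS of a variable = increase in residual SS when it alone is dropped.
        partial = {
            k: residual_ss([data[j] for j in kept if j != k], y) - full
            for k in kept
        }
        weakest = min(partial, key=partial.get)
        if partial[weakest] / s2 >= f_crit:
            break  # the weakest variable is significant, so all of them are
        kept.remove(weakest)
    return kept

# Synthetic data (hypothetical): y depends on x1 only; x2 and x3 are irrelevant.
n = 45
x1 = [i / 10 for i in range(n)]
x2 = [math.sin(i) for i in range(n)]
x3 = [math.cos(2 * i) for i in range(n)]
y = [2 + 3 * a + 0.5 * math.sin(7 * i + 1) for i, a in enumerate(x1)]

kept = backward_eliminate({"x1": x1, "x2": x2, "x3": x3}, y, f_crit=4.08)
print(kept)
```

A fixed critical value is a simplification: the residual degrees of freedom grow by one each time a variable is dropped, so a careful implementation would look up the critical value at every stage.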
A 4-Variable Model
The variable Na has the smallest partial sum of squares in the five-variable model. This means that Na is the least important of the five variables in accounting for the variability in BIOMASS after the contributions of the other four variables have been taken into account. As a result, Na is the logical variable to eliminate first. And, since the partial sum of squares for Na, R(β4 | β1 β2 β3 β5 β0) = 47,011, is not significant, there is no reason X4 = Na should not be eliminated.
Dropping Na means that X must be redefined to be the 45 × 5 matrix consisting of 1, X1 = SALINITY, X2 = pH, X3 = K, and X5 = Zn; the column vector of Na observations X4 is removed from X. Similarly, β must be redefined by removing β4. The regression analysis using these four variables (Table 5.4) shows the decrease in the regression sum of squares, now with four degrees of freedom, and the increase in the residual sum of squares to be exactly equal to the partial sum of squares for Na in the previous stage. This demonstrates the meaning of “partial sum of squares.”
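The bookkeeping behind this statement can be written out as a short worked identity. The subscripts 5 and 4, denoting the number of independent variables in the model, are notation introduced here. The five-variable sums of squares are back-calculated from Table 5.4 and the quoted partial sum of squares for Na rather than copied from a table in this section, so they serve only as a consistency check.

```latex
\begin{align*}
SS(\text{Regr})_{5} - SS(\text{Regr})_{4}
  &= R(\beta_4 \mid \beta_1\,\beta_2\,\beta_3\,\beta_5\,\beta_0) = 47{,}011, \\
SS(\text{Res})_{4} - SS(\text{Res})_{5}
  &= R(\beta_4 \mid \beta_1\,\beta_2\,\beta_3\,\beta_5\,\beta_0) = 47{,}011, \\
SS(\text{Regr})_{5} &= 12{,}937{,}689 + 47{,}011 = 12{,}984{,}700, \\
SS(\text{Res})_{5}  &= 6{,}233{,}274 - 47{,}011 = 6{,}186{,}263.
\end{align*}
```

The two back-calculated values sum to 19,170,963, the corrected total sum of squares, as they must, because the total does not depend on which variables are in the model.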
In the absence of independent information on σ², the residual mean square from this reduced model is now used (tentatively) as the estimate of σ², s² = 155,832. (Notice that the increase in the residual sum of squares does not necessarily imply an increase in the residual mean square.)
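The parenthetical remark can be checked with a few lines of arithmetic. The sketch below takes the four-variable residual line from Table 5.4 and the quoted partial sum of squares for Na; the five-variable residual line is inferred from those two numbers (an assumption of this sketch, not a value quoted in this section).

```python
# Values from Table 5.4 (four-variable model) and the quoted partial SS for Na.
res_ss_4, res_df_4 = 6_233_274, 40
partial_ss_na = 47_011

# Back-calculated five-variable residual line (an inference, not a quoted value):
# dropping Na raised the residual SS by its partial SS and freed one d.f.
res_ss_5 = res_ss_4 - partial_ss_na
res_df_5 = res_df_4 - 1

ms_4 = res_ss_4 / res_df_4
ms_5 = res_ss_5 / res_df_5
f_na = partial_ss_na / ms_5  # F ratio for Na in the five-variable model

print(f"five-variable residual MS: {ms_5:,.0f}")
print(f"four-variable residual MS: {ms_4:,.0f}")
print(f"F ratio for Na: {f_na:.2f}")
```

Under these assumptions the residual mean square falls when Na is dropped, even though the residual sum of squares rises, because the extra degree of freedom more than offsets the small partial sum of squares.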
A 3-Variable Model
The partial sums of squares at the four-variable stage (Table 5.4) show SALINITY and Zn to be equally unimportant to the model; the partial sum of squares for Zn is slightly smaller and both are nonsignificant. The next step in the search for the final model is to eliminate one of these two variables. Again, it is not safe to assume that both variables can be dropped since they are not orthogonal.
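The significance statements at this stage can be reproduced from Table 5.4 by dividing each partial sum of squares by the residual mean square, which gives an F ratio with 1 and 40 degrees of freedom (equal, up to rounding, to the square of the tabled t). The critical value 4.08 used below is an approximate 5% point supplied for illustration; it is not taken from the source.

```python
# Partial sums of squares, t values, and residual mean square from Table 5.4.
partial_ss = {"SALINITY": 436_496, "pH": 1_885_805, "K": 732_606, "Zn": 434_796}
t_values = {"SALINITY": -1.67, "pH": 3.48, "K": -2.17, "Zn": -1.67}
s2 = 155_832  # residual mean square, 40 d.f.

# Assumed approximate 5% critical value of F with 1 and 40 d.f. (not from the source).
F_CRIT = 4.08

f_ratios = {name: ss / s2 for name, ss in partial_ss.items()}
for name, f in f_ratios.items():
    verdict = "significant" if f > F_CRIT else "not significant"
    print(f"{name:9s} F = {f:6.2f}  t^2 = {t_values[name] ** 2:6.2f}  {verdict}")

# The elimination rule picks the variable with the smallest partial SS.
candidate = min(partial_ss, key=partial_ss.get)
print("candidate for elimination:", candidate)
```

The margin between SALINITY and Zn is tiny relative to the residual mean square, which is why the choice between them is close to arbitrary.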
Since Zn has the slightly smaller partial sum of squares, Zn will be eliminated and pH, SALINITY, and K retained as the three-variable model. One could have used the much higher simple correlation between Zn and BIOMASS, r = −.62 versus r = −.10, to argue that SALINITY is the variable to eliminate at this stage. This is a somewhat arbitrary choice with the information at hand, and illustrates one of the problems of this sequential method of searching for the appropriate model. There is no assurance that choosing to eliminate Zn first will lead to the best model by whatever criterion is used to measure “goodness” of the model.

TABLE 5.4. Results of the regression of BIOMASS on the four independent variables SALINITY, pH, K, and Zn (Linthurst data).

Variable      βj       s(βj)       t      Partial SS
SAL         −35.94     21.48     −1.67       436,496
pH          293.9      84.5       3.48     1,885,805
K            −0.439     0.202    −2.17       732,606
Zn          −23.45     14.04     −1.67       434,796

Analysis of variance
Source        d.f.   Sum of Squares   Mean Square
Total          44        19,170,963
Regression      4        12,937,689     3,234,422
Residual       40         6,233,274       155,832

TABLE 5.5. Results of the regression of BIOMASS on the three independent variables SALINITY, pH, and K (Linthurst data).

Variable      βj       s(βj)       t      Partial SS
SAL         −12.06     16.37      −.74        88,239
pH          410.21     48.83      8.40    11,478,835
K             −.490      .204    −2.40       935,178

Analysis of variance
Source        d.f.   Sum of Squares   Mean Square
Total          44        19,170,963
Regression      3        12,502,893     4,167,631
Residual       41         6,668,070       162,636
Again, X and β are redefined, so that Zn is eliminated, and the computations repeated. This analysis gives the results in Table 5.5. The partial sum of squares for pH increases dramatically when Zn is dropped from the model, from 1.9 million to 11.5 million. This is due to the strong correlation between pH and Zn (r = −.72). When two independent variables are highly correlated, either positively or negatively, much of the predictive information contained in either can be usurped by the other. Thus, a very important variable may appear as insignificant if the model contains a correlated variable and, conversely, an otherwise unimportant variable may take on false significance.

TABLE 5.6. Results of the regression of BIOMASS on the two independent variables pH and K (Linthurst data).

Variable      βj       s(βj)       t      Partial SS
pH          412.04     48.50      8.50    11,611,782
K            −0.487     0.203    −2.40       924,266

Analysis of variance
Source        d.f.   Sum of Squares   Mean Square
Total          44        19,170,963
Regression      2        12,414,654     6,207,327
Residual       42         6,756,309       160,865
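The way a correlated companion variable absorbs a predictor's partial sum of squares can be illustrated with the standard two-predictor formula: the partial sum of squares for x1 after x2 is the corrected total sum of squares times (r_y1 − r_y2 r_12)² / (1 − r_12²). The sketch below uses hypothetical correlations and a hypothetical total of 100, chosen only to mimic the signs in this example. It is not a reconstruction of the Linthurst analysis, which involves additional variables.

```python
def partial_ss_given_other(ss_total, r_y1, r_y2, r_12):
    """Partial SS for x1 after x2, in a model with exactly two predictors."""
    return ss_total * (r_y1 - r_y2 * r_12) ** 2 / (1 - r_12 ** 2)

ss_total = 100.0               # hypothetical corrected total SS
r_y1, r_y2 = 0.8, -0.5         # hypothetical correlations of x1 and x2 with y
alone = ss_total * r_y1 ** 2   # SS for x1 when x2 is not in the model

# As x1 and x2 become more strongly (negatively) correlated, x2 takes over
# more of the information that x1 would otherwise carry.
for r_12 in (0.0, -0.4, -0.7):
    after_x2 = partial_ss_given_other(ss_total, r_y1, r_y2, r_12)
    print(f"r_12 = {r_12:+.1f}: x1 alone = {alone:.1f}, x1 after x2 = {after_x2:.1f}")
```

When r_12 is zero the two quantities coincide, which is the orthogonal case in which dropping one variable leaves the other's partial sum of squares unchanged.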
A 2-Variable Model
The contribution of SALINITY in the three-variable model is even smaller than it was before Zn was dropped and is far from being significant. The next step is to drop SALINITY from the model. In this particular example, one would not have been misled by eliminating both SALINITY and Zn at the previous step. This is not true in general.
The two-variable model containing pH and K gives the results in Table 5.6. Since the partial sums of squares for both pH and K are significant, the simplification of the model will stop with this two-variable model. The degree to which the linear model consisting of the two variables pH and K accounts for the variability in BIOMASS is R² = .65, only slightly smaller than the R² = .68 obtained with the original five-variable model.
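The two coefficients of determination quoted here follow directly from the analysis of variance lines. In the sketch below the two-variable figures come from Table 5.6, while the five-variable regression sum of squares is back-calculated as the Table 5.4 regression sum of squares plus the partial sum of squares for Na (an inference, since the five-variable table is not part of this section).

```python
total_ss = 19_170_963   # corrected total SS (same in Tables 5.4 to 5.6)
regr_ss_2 = 12_414_654  # regression SS for pH and K (Table 5.6)

# Back-calculated five-variable regression SS (not a quoted value):
# four-variable regression SS plus the partial SS for Na.
regr_ss_5 = 12_937_689 + 47_011

r2_two = regr_ss_2 / total_ss
r2_five = regr_ss_5 / total_ss
print(f"R^2, two variables:  {r2_two:.2f}")
print(f"R^2, five variables: {r2_five:.2f}")
```

Removing three of the five variables therefore costs only about three percentage points of explained variability.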