Simplifying the Model - INDEPENDENT VARIABLES


5.3 Simplifying the Model

168 5. CASE STUDY: FIVE INDEPENDENT VARIABLES

able at a time from the regression model. Removing one variable from the model will cause the regression coefficients and the partial sums of squares for the remaining variables to change (unless they are orthogonal to the variable dropped). These results do indicate that not all five independent variables are needed in the model. It would appear that any one of the four, SALINITY, pH, Na, or K, could be dropped without causing a significant decline in predictability of BIOMASS. It is not clear at this stage of the analysis, however, that more than one can be dropped.

There are several approaches for deciding which variables to include in the final model. These are studied in Chapter 7. For this example, one variable at a time is eliminated: the one whose elimination will cause the smallest increase in the residual sum of squares. The process will stop when the partial sums of squares for all variables remaining in the model are significant (α = .05). As discussed in Chapter 7, data-driven variable selection and multiple testing to arrive at the final model alter the true significance levels; probability levels and confidence intervals should be used with caution.
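The elimination rule described above can be sketched in a few lines of Python. The sketch does not fit any regressions; it only replays the decision at each stage, using the partial sums of squares and residual mean squares reported in Tables 5.4 to 5.6 of this section. The function name `next_to_drop` and the critical values of F (about 4.08 for 1 and 40 or 41 degrees of freedom, about 4.07 for 1 and 42) are assumptions supplied for illustration, not values taken from the text.

```python
def next_to_drop(partial_ss, s2, f_crit):
    """Return the variable to eliminate next, or None if all are significant.

    partial_ss: dict mapping variable name -> partial sum of squares (1 d.f. each)
    s2:         residual mean square of the current model
    f_crit:     assumed critical value of F for alpha = .05 (1 and residual d.f.)
    """
    # The candidate is the variable whose removal raises the residual SS least.
    name = min(partial_ss, key=partial_ss.get)
    f_ratio = partial_ss[name] / s2
    # If even the smallest partial SS is significant, every variable is.
    return None if f_ratio >= f_crit else name


# (partial sums of squares, residual mean square, assumed critical F)
stages = [
    ({"SAL": 436_496, "pH": 1_885_805, "K": 732_606, "Zn": 434_796}, 155_832, 4.08),
    ({"SAL": 88_239, "pH": 11_478_835, "K": 935_178}, 162_636, 4.08),
    ({"pH": 11_611_782, "K": 924_266}, 160_865, 4.07),
]

decisions = [next_to_drop(ss, s2, f) for ss, s2, f in stages]
print(decisions)  # ['Zn', 'SAL', None]
```

Because every partial sum of squares has one degree of freedom and is tested against the same residual mean square, checking only the smallest one is enough to decide whether the process stops.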

A 4-Variable Model

The variable Na has the smallest partial sum of squares in the five-variable model. This means that Na is the least important of the five variables in accounting for the variability in BIOMASS after the contributions of the other four variables have been taken into account. As a result, Na is the logical variable to eliminate first. And, since the partial sum of squares for Na, R(β₄ | β₁ β₂ β₃ β₅ β₀) = 47,011, is not significant, there is no reason X4 = Na should not be eliminated.

Dropping Na means that X must be redefined to be the 45×5 matrix consisting of 1, X1 = SALINITY, X2 = pH, X3 = K, and X5 = Zn; the column vector of Na observations X4 is removed from X. Similarly, β must be redefined by removing β4. The regression analysis using these four variables (Table 5.4) shows the decrease in the regression sum of squares, now with four degrees of freedom, and the increase in the residual sum of squares to be exactly equal to the partial sum of squares for Na in the previous stage. This demonstrates the meaning of “partial sum of squares.”

In the absence of independent information on σ², the residual mean square from this reduced model is now used (tentatively) as the estimate of σ², s² = 155,832. (Notice that the increase in the residual sum of squares does not necessarily imply an increase in the residual mean square.)
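The arithmetic behind the last two paragraphs can be checked directly. The sketch below is only a numerical check, with variable names chosen for illustration. It backs out the five-variable residual sum of squares from the Table 5.4 residual and the partial sum of squares for Na, and it assumes 39 residual degrees of freedom for the five-variable model (45 observations less six parameters).

```python
n_obs = 45
partial_ss_na = 47_011          # R(beta_4 | other betas), from the text
resid_ss_4var = 6_233_274       # Table 5.4, 40 d.f.

# Dropping Na raises the residual SS by exactly its partial SS, so the
# five-variable residual SS is recovered by subtraction (derived, not quoted).
resid_ss_5var = resid_ss_4var - partial_ss_na

resid_df_5var = n_obs - 6       # intercept plus five slopes
resid_df_4var = n_obs - 5       # intercept plus four slopes

ms_5var = resid_ss_5var / resid_df_5var
ms_4var = resid_ss_4var / resid_df_4var

print(resid_ss_5var)                     # 6186263
print(round(ms_5var), round(ms_4var))    # 158622 155832
```

The residual sum of squares rises when Na is dropped, yet the residual mean square falls, because the one degree of freedom returned to the residual more than offsets the small increase in the sum of squares.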

A 3-Variable Model

The partial sums of squares at the four-variable stage (Table 5.4) show SALINITY and Zn to be equally unimportant to the model; the partial sum of squares for Zn is slightly smaller and both are nonsignificant. The next step in the search for the final model is to eliminate one of these two variables. Again, it is not safe to assume that both variables can be dropped since they are not orthogonal.

Since Zn has the slightly smaller partial sum of squares, Zn will be eliminated and pH, SALINITY, and K retained as the three-variable model.

One could have used the much higher simple correlation between Zn and BIOMASS, r = −.62 versus r = −.10, to argue that SALINITY is the variable to eliminate at this stage. This is a somewhat arbitrary choice with the information at hand, and illustrates one of the problems of this sequential method of searching for the appropriate model. There is no assurance that choosing to eliminate Zn first will lead to the best model by whatever criterion is used to measure “goodness” of the model.

TABLE 5.4. Results of the regression of BIOMASS on the four independent variables SALINITY, pH, K, and Zn (Linthurst data).

Variable       βj       s(βj)      t       Partial SS
SAL          −35.94     21.48    −1.67        436,496
pH           293.9      84.5      3.48      1,885,805
K             −0.439     0.202   −2.17        732,606
Zn           −23.45     14.04    −1.67        434,796

Analysis of variance

Source        d.f.   Sum of Squares   Mean Square
Total          44      19,170,963
Regression      4      12,937,689      3,234,422
Residual       40       6,233,274        155,832

TABLE 5.5. Results of the regression of BIOMASS on the three independent variables SALINITY, pH, and K (Linthurst data).

Variable       βj       s(βj)      t       Partial SS
SAL          −12.06     16.37     −.74         88,239
pH           410.21     48.83     8.40     11,478,835
K             −.490      .204    −2.40        935,178

Analysis of variance

Source        d.f.   Sum of Squares   Mean Square
Total          44      19,170,963
Regression      3      12,502,893      4,167,631
Residual       41       6,668,070        162,636
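As a consistency check on Table 5.4, each partial sum of squares carries one degree of freedom, so the ratio of a partial sum of squares to the residual mean square should equal the square of the corresponding t statistic, up to rounding in the printed values. This identity is a standard regression result rather than something stated in this section, and the 1% tolerance below is an arbitrary choice made for illustration.

```python
s2 = 155_832  # residual mean square, Table 5.4

# variable: (t statistic, partial sum of squares), as printed in Table 5.4
table_5_4 = {
    "SAL": (-1.67, 436_496),
    "pH": (3.48, 1_885_805),
    "K": (-2.17, 732_606),
    "Zn": (-1.67, 434_796),
}

for name, (t, partial_ss) in table_5_4.items():
    f_ratio = partial_ss / s2
    # Relative gap between F and t^2; small gaps reflect rounding of t.
    gap = abs(f_ratio - t**2) / f_ratio
    print(f"{name:>3}: F = {f_ratio:5.2f}, t^2 = {t**2:5.2f}, agree = {gap < 0.01}")
```

The F ratios for SALINITY and Zn are both close to 2.8, which is why the text describes the two variables as equally unimportant at this stage.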

Again, X and β are redefined, so that Zn is eliminated, and the computations repeated. This analysis gives the results in Table 5.5. The partial sum of squares for pH increases dramatically when Zn is dropped from the model, from 1.9 million to 11.5 million. This is due to the strong correlation between pH and Zn (r = −.72). When two independent variables are highly correlated, either positively or negatively, much of the predictive information contained in either can be usurped by the other. Thus, a very important variable may appear as insignificant if the model contains a correlated variable and, conversely, an otherwise unimportant variable may take on false significance.

TABLE 5.6. Results of the regression of BIOMASS on the two independent variables pH and K (Linthurst data).

Variable       βj       s(βj)      t       Partial SS
pH           412.04     48.50     8.50     11,611,782
K             −0.487     0.203   −2.40        924,266

Analysis of variance

Source        d.f.   Sum of Squares   Mean Square
Total          44      19,170,963
Regression      2      12,414,654      6,207,327
Residual       42       6,756,309        160,865
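The way one correlated predictor can absorb another's contribution can be illustrated with the standard two-predictor formula for R² in terms of simple correlations. The correlations below are hypothetical values chosen for illustration, not the Linthurst data, and the function name `partial_r2_x1` is likewise an assumption. The sketch holds each predictor's correlation with the response fixed and varies only the correlation between the two predictors.

```python
def partial_r2_x1(r_y1, r_y2, r_12):
    """Share of total variation credited to x1 after x2 is already in the model."""
    # R^2 for the regression of y on both x1 and x2.
    r2_both = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    # R^2 for x2 alone is r_y2^2; the difference is x1's partial contribution.
    return r2_both - r_y2**2


r_y1, r_y2 = 0.7, 0.6  # hypothetical correlations with the response
for r_12 in (0.0, 0.5, 0.8):
    share = partial_r2_x1(r_y1, r_y2, r_12)
    print(f"r_12 = {r_12:.1f}: x1 alone = {r_y1**2:.2f}, x1 after x2 = {share:.3f}")
```

For these particular values the share credited to x1 falls from .49 when the predictors are uncorrelated to about .13 when their correlation is .8, even though the simple correlation of x1 with the response never changes. Other combinations of correlations can behave differently, so this is an illustration of the mechanism rather than a general rule.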

A 2-Variable Model

The contribution of SALINITY in the three-variable model is even smaller than it was before Zn was dropped and is far from being significant. The next step is to drop SALINITY from the model. In this particular example, one would not have been misled by eliminating both SALINITY and Zn at the previous step. This is not true in general.

The two-variable model containing pH and K gives the results in Table 5.6. Since the partial sums of squares for both pH and K are significant, the simplification of the model will stop with this two-variable model. The degree to which the linear model consisting of the two variables pH and K accounts for the variability in BIOMASS is R² = .65, only slightly smaller than the R² = .68 obtained with the original five-variable model.
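The two values of R² quoted above follow from the analysis of variance tables, since R² is the regression sum of squares divided by the total sum of squares. In the sketch below the five-variable regression sum of squares is not quoted from the text; it is recovered by adding the partial sum of squares for Na to the four-variable regression sum of squares in Table 5.4.

```python
total_ss = 19_170_963          # corrected total, 44 d.f.
reg_ss_2var = 12_414_654       # Table 5.6
reg_ss_4var = 12_937_689       # Table 5.4
partial_ss_na = 47_011         # regression SS lost when Na was dropped

# Derived value: regression SS of the original five-variable model.
reg_ss_5var = reg_ss_4var + partial_ss_na

r2_2var = reg_ss_2var / total_ss
r2_5var = reg_ss_5var / total_ss
print(round(r2_2var, 2), round(r2_5var, 2))  # 0.65 0.68
```

Removing three of the five variables therefore costs about three percentage points of explained variation in BIOMASS.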

From the document Applied Regression Analysis: A Research Tool (pages 182-185)