
ANALYSIS OF VARIANCE AND QUADRATIC FORMS

4.8 Exercises

(a) How many observations were in the data set?

(b) How many linearly independent columns are in $X$; that is, what is the rank of $X$? How many degrees of freedom are associated with the model sum of squares? Assuming the model contained an intercept, how many degrees of freedom are associated with the regression sum of squares?

(c) Suppose $Y = (82 \;\; 80 \;\; 75 \;\; 67 \;\; 55)'$. Compute the estimated mean $\hat{Y}_1$ of $Y$ corresponding to the first observation. Compute $s^2(\hat{Y}_1)$. Find the residual $e_1$ for the first observation and compute its variance. For which data point will $\hat{Y}_i$ have the smallest variance? For which data point will $e_i$ have the largest variance?

4.3. The following $(X'X)^{-1}$, $\hat{\beta}$, and residual sum of squares were obtained from the regression of plant dry weight (grams) from $n = 7$ experimental fields on percent soil organic matter ($X_1$) and kilograms of supplemental nitrogen per 1,000 m$^2$ ($X_2$). The regression model included an intercept.

$$(X'X)^{-1} = \begin{bmatrix} 1.7995972 & -.0685472 & -.2531648 \\ -.0685472 & .0100774 & -.0010661 \\ -.2531648 & -.0010661 & .0570789 \end{bmatrix}, \quad \hat{\beta} = \begin{bmatrix} 51.5697 \\ 1.4974 \\ 6.7233 \end{bmatrix}, \quad \text{SS(Res)} = 27.5808.$$

(a) Give the regression equation and interpret each regression coefficient. Give the units of measure of each regression coefficient.

(b) How many degrees of freedom does SS(Res) have? Compute $s^2$, the variance of $\hat{\beta}_1$, and the covariance of $\hat{\beta}_1$ and $\hat{\beta}_2$.

(c) Determine the 95% univariate confidence interval estimates of $\beta_1$ and $\beta_2$. Compute the Bonferroni and the Scheffé confidence intervals for $\beta_1$ and $\beta_2$ using a joint confidence coefficient of .95.

(d) Suppose previous experience has led you to believe that one percentage point increase in organic matter is equivalent to .5 kilogram/1,000 m$^2$ of supplemental nitrogen in dry matter production. Translate this statement into a null hypothesis on the regression coefficients. Use a $t$-test to test this null hypothesis against the alternative hypothesis that supplemental nitrogen is more effective than this statement would imply.

(e) Define $K$ and $m$ for the general linear hypothesis $H_0\colon K'\beta - m = 0$ for testing $H_0\colon 2\beta_1 = \beta_2$. Compute $Q$ and complete the test of significance using the $F$-test. What is the alternative hypothesis for this test?

(f) Give the reduced model you obtain if you impose the null hypothesis in (e) on the model. Suppose this reduced model gave a SS(Res) = 164.3325. Use this result to complete the test of the hypothesis.
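The arithmetic behind parts (b), (e), and (f) of Exercise 4.3 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the variable names (`C`, `k`, `Q_from_models`, and so on) are hypothetical, the hypothesis $2\beta_1 = \beta_2$ is encoded as the single coefficient row `k = (0, 2, -1)` with `m = 0`, and the quadratic form is taken to be the usual (estimate)$^2$ divided by $k'(X'X)^{-1}k$ for a one-degree-of-freedom hypothesis.

```python
# Values printed in Exercise 4.3.
C = [
    [1.7995972, -0.0685472, -0.2531648],
    [-0.0685472, 0.0100774, -0.0010661],
    [-0.2531648, -0.0010661, 0.0570789],
]  # (X'X)^{-1}
beta_hat = [51.5697, 1.4974, 6.7233]
ss_res = 27.5808
n, p_prime = 7, 3  # observations; parameters including the intercept

# Residual degrees of freedom and residual mean square.
df_res = n - p_prime
s2 = ss_res / df_res

# Estimated variance of beta1-hat and covariance of beta1-hat, beta2-hat.
var_b1 = C[1][1] * s2
cov_b1_b2 = C[1][2] * s2

# One-row hypothesis 2*beta1 - beta2 = 0 (assumed encoding, m = 0).
k = [0.0, 2.0, -1.0]
m = 0.0
estimate = sum(ki * bi for ki, bi in zip(k, beta_hat)) - m
kCk = sum(k[i] * C[i][j] * k[j] for i in range(3) for j in range(3))
Q = estimate ** 2 / kCk
F_from_Q = Q / s2  # one numerator degree of freedom

# Part (f): the same sum of squares from the full and reduced models.
ss_res_reduced = 164.3325
Q_from_models = ss_res_reduced - ss_res

print(f"s2 = {s2:.4f}, var(b1) = {var_b1:.6f}, cov(b1, b2) = {cov_b1_b2:.6f}")
print(f"Q = {Q:.4f}, SS(Res) difference = {Q_from_models:.4f}, F = {F_from_Q:.3f}")
```

The two routes to the numerator sum of squares should agree up to the rounding in the printed inputs; the comparison of `F_from_Q` with a tabled critical value is left to the reader.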

4.4. The following analysis of variance summarizes the regression of $Y$ on two independent variables plus an intercept.

Source          d.f.      SS       MS
Total (corr)     26     1,211
Regression        2     1,055    527.5
Residual         24       156      6.5

Variable    Sequential SS    Partial SS
X1               263             223
X2               792             792

(a) Your estimate of $\beta_1$ is $\hat{\beta}_1 = 2.996$. A friend of yours regressed $Y$ on $X_1$ and found $\hat{\beta}_1 = 3.24$. Explain the difference in these two estimates.

(b) Label each sequential and partial sum of squares using the $R$-notation. Explain what $R(\beta_1 \mid \beta_0)$ measures.

(c) Compute $R(\beta_2 \mid \beta_0)$ and explain what it measures.

(d) What is the regression sum of squares due to $X_1$ after adjustment for $X_2$?

(e) Make a test of significance (use $\alpha = .05$) to determine if $X_1$ should be retained in the model with $X_2$.

(f) The original data contained several sets of observations having the same values of $X_1$ and $X_2$. The pooled variance from these replicate observations was $s^2 = 3.8$ with eight degrees of freedom. With this information, rewrite the analysis of variance to show the partitions of the "residual" sum of squares into "pure error" and "lack of fit." Make a test of significance to determine whether the model using $X_1$ and $X_2$ is adequate.
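The bookkeeping in Exercise 4.4 is mostly subtraction, and a short sketch may make the partitions easier to follow. It assumes the sequential sums of squares were fitted in the order $X_1$, then $X_2$, and that the lack-of-fit partition simply subtracts the pure-error sum of squares and degrees of freedom from the residual line; the variable names are hypothetical.

```python
# Values printed in the two tables of Exercise 4.4.
ss_total, ss_regr, ss_res = 1211.0, 1055.0, 156.0
df_res = 24
sequential = {"X1": 263.0, "X2": 792.0}  # assumed order: X1 first, then X2
partial = {"X1": 223.0, "X2": 792.0}     # each variable adjusted for the other

# The sequential sums of squares add to the regression sum of squares.
seq_total = sum(sequential.values())

# R(beta2 | beta0): what X2 alone explains, obtained by removing the
# partial contribution of X1 from the regression sum of squares.
r_b2_given_b0 = ss_regr - partial["X1"]

# F ratio for retaining X1 in a model that already contains X2.
ms_res = ss_res / df_res
f_x1 = partial["X1"] / ms_res

# Split the residual into pure error and lack of fit.
s2_pure, df_pure = 3.8, 8
ss_pure = s2_pure * df_pure
ss_lof = ss_res - ss_pure
df_lof = df_res - df_pure
ms_lof = ss_lof / df_lof
f_lof = ms_lof / s2_pure

print(f"R(b2|b0) = {r_b2_given_b0:.0f}, F for X1 = {f_x1:.2f}")
print(f"lack of fit: SS = {ss_lof:.1f}, d.f. = {df_lof}, MS = {ms_lof:.2f}, F = {f_lof:.2f}")
```

Each F ratio still has to be compared with the appropriate tabled value before drawing a conclusion.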

4.5. The accompanying table presents data on one dependent variable and five independent variables.

Y       X1      X2      X3       X4       X5
6.68    32.6    4.78    1,092    293.09   17.1
6.31    33.4    4.62    1,279    252.18   14.0
7.13    33.2    3.72      511    109.31   12.7
5.81    31.2    3.29      518    131.63   25.7
5.68    31.0    3.25      582    124.50   24.3
7.66    31.8    7.35      509     95.19     .3
7.30    26.4    4.92      942    173.25   21.1
6.19    26.2    4.02      952    172.21   26.1
7.31    26.6    5.47      792    142.34   19.8

(a) Give the linear model in matrix form for regressing Y on the five independent variables. Completely define each matrix and give its order and rank.

(b) The following quadratic forms were computed.

$$Y'PY = 404.532, \qquad Y'Y = 405.012, \qquad Y'(I - P)Y = 0.480,$$

$$Y'(I - J/n)Y = 4.078, \qquad Y'(P - J/n)Y = 3.598, \qquad Y'(J/n)Y = 400.934.$$

Use a matrix algebra computer program to reproduce each of these sums of squares. Use these results to give the complete analysis of variance summary.

(c) The partial sums of squares for $X_1$, $X_2$, $X_3$, $X_4$, and $X_5$ are .895, .238, .270, .337, and .922, respectively. Give the $R$-notation that describes the partial sum of squares for $X_2$. Use a matrix algebra program to verify the partial sum of squares for $X_2$.

(d) Assume that none of the partial sums of squares for $X_2$, $X_3$, and $X_4$ is significant and that the partial sums of squares for $X_1$ and $X_5$ are significant (at $\alpha = .05$). Indicate whether each of the following statements is valid based on these results. If it is not a valid statement, explain why.

(i) $X_1$ and $X_5$ are important causal variables whereas $X_2$, $X_3$, and $X_4$ are not.

(ii) $X_2$, $X_3$, and $X_4$ can be dropped from the model with no meaningful loss in predictability of $Y$.

(iii) There is no need for all five independent variables to be retained in the model.
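For parts (b) and (c) of Exercise 4.5, one possible "matrix algebra program" is NumPy. The sketch below is an assumption-laden illustration, not the book's own procedure: it types in the table exactly as printed above, uses `np.linalg.lstsq` in place of forming $P = X(X'X)^{-1}X'$ explicitly, and obtains the partial sum of squares for $X_2$ as the increase in SS(Res) when $X_2$ is dropped. All variable names are hypothetical. The quantities that depend only on $Y$ can be checked by hand; those that involve $P$ depend on every entry of the table being transcribed correctly.

```python
import numpy as np

# Columns: Y, X1, X2, X3, X4, X5 (values as printed in the table).
data = np.array([
    [6.68, 32.6, 4.78, 1092, 293.09, 17.1],
    [6.31, 33.4, 4.62, 1279, 252.18, 14.0],
    [7.13, 33.2, 3.72, 511, 109.31, 12.7],
    [5.81, 31.2, 3.29, 518, 131.63, 25.7],
    [5.68, 31.0, 3.25, 582, 124.50, 24.3],
    [7.66, 31.8, 7.35, 509, 95.19, 0.3],
    [7.30, 26.4, 4.92, 942, 173.25, 21.1],
    [6.19, 26.2, 4.02, 952, 172.21, 26.1],
    [7.31, 26.6, 5.47, 792, 142.34, 19.8],
])
Y = data[:, 0]
n = len(Y)
X = np.column_stack([np.ones(n), data[:, 1:]])  # intercept plus five X's

# Least squares fit; X @ beta_hat equals PY without building P.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat
resid = Y - Y_hat

ss_total_uncorr = Y @ Y                     # Y'Y
ss_mean = Y.sum() ** 2 / n                  # Y'(J/n)Y
ss_model = Y_hat @ Y_hat                    # Y'PY
ss_res = resid @ resid                      # Y'(I - P)Y
ss_total_corr = ss_total_uncorr - ss_mean   # Y'(I - J/n)Y
ss_regr = ss_model - ss_mean                # Y'(P - J/n)Y

# Partial sum of squares for X2: refit without the X2 column.
X_without_x2 = np.delete(X, 2, axis=1)
b_reduced, *_ = np.linalg.lstsq(X_without_x2, Y, rcond=None)
resid_reduced = Y - X_without_x2 @ b_reduced
partial_x2 = resid_reduced @ resid_reduced - ss_res

for name, value in [("Y'Y", ss_total_uncorr), ("Y'(J/n)Y", ss_mean),
                    ("Y'PY", ss_model), ("Y'(I-P)Y", ss_res),
                    ("Y'(I-J/n)Y", ss_total_corr), ("Y'(P-J/n)Y", ss_regr),
                    ("partial SS for X2", partial_x2)]:
    print(f"{name:>18s} = {value:.3f}")
```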

4.6. This exercise continues with the analysis of the peak water flow data used in Exercise 3.12. In that exercise, several regressions were run to relate $Y = \ln(Q_0/Q_p)$ to three characteristics of the watersheds and a measure of storm intensity. $Y$ measures the discrepancy between peak water flow predicted from a simulation model ($Q_p$) and observed peak water flow ($Q_0$). The four independent variables are described in Exercise 3.12.

(a) The first model used an intercept and all four independent variables.

(i) Compute SS(Model), SS(Regr), and SS(Res) for this model and summarize the results in the analysis of variance table. Show degrees of freedom and mean squares.

(ii) Obtain the partial sum of squares for each independent variable and the sequential sums of squares for the variables added to the model in the order $X_1$, $X_4$, $X_2$, $X_3$.

(iii) Use tests of significance ($\alpha = .05$) to determine which partial regression coefficients are different from zero. What do these tests suggest as to which variables might be dropped from the model?

(iv) Construct a test of the null hypothesis $H_0\colon \beta_0 = 0$ using the general linear hypothesis. What do you conclude from this test?

(b) The second model used the four independent variables but forced the intercept to be zero.

(i) Compute SS(Model), SS(Res), and the partial and sequential sums of squares for this model. Summarize the results in the analysis of variance table.

(ii) Use the difference in SS(Res) between this model with no intercept and the previous model with an intercept to test $H_0\colon \beta_0 = 0$. Compare the result with that obtained under (iv) in Part (a).

(iii) Use tests of significance to determine which partial regression coefficients in this model are different from zero. What do these tests tell you in terms of simplifying the model?

(c) The third model used the zero-intercept model and only $X_1$ and $X_4$.

(i) Use the results from this model and the zero-intercept model in Part (b) to test the composite null hypothesis that $\beta_2$ and $\beta_3$ are both zero.

(ii) Use the general linear hypothesis to construct the test of the composite null hypothesis that $\beta_2$ and $\beta_3$ in the model in Part (b) are both zero. Define $K$ and $m$ for this hypothesis, compute $Q$, and complete the test of significance. Compare these two tests.

4.7. Use the data on annual catch of Gulf Menhaden, number of fishing vessels, and fishing effort given in Exercise 3.11.

(a) Complete the analysis of variance for the regression of catch ($Y$) on fishing effort ($X_1$) and number of vessels ($X_2$) with an intercept in the model. Determine the partial sums of squares for each independent variable. Estimate the standard errors for the regression coefficients and construct the Bonferroni confidence intervals for each using a joint confidence coefficient of 95%. Use the regression equation to predict the "catch" if number of vessels is limited to $X_2 = 70$ and fishing effort is restricted to $X_1 = 400$. Compute the variance of this prediction and the 95% confidence interval estimate of the prediction.

(b) Test the hypothesis that the variable "number of vessels" does not add significantly to the explanation of variation in "catch" provided by "fishing effort" alone (use $\alpha = .05$). Test the hypothesis that "fishing effort" does not add significantly to the explanation provided by "number of vessels" alone.

(c) On the basis of the tests in Part (b) would you keep both $X_1$ and $X_2$ in the model, or would you eliminate one from the model? If one should be eliminated, which would it be? Does the remaining variable make a significant contribution to explaining the variation in "catch"?

(d) Suppose consideration is being given to controlling the annual catch by limiting either the number of fishing vessels or the total fishing effort. What is your recommendation and why?

4.8. This exercise uses the data in Exercise 3.14 relating $Y = \ln(\text{days survival})$ for colon cancer patients receiving supplemental ascorbate to the variables sex ($X_1$), age of patient ($X_2$), and ln(average survival of control group) ($X_3$).

(a) Complete the analysis of variance for the model using all three variables plus an intercept. Compute the partial sum of squares for each independent variable using the formula $\hat{\beta}_j^2 / c_{jj}$. Demonstrate that each is the same as the sum of squares one obtains by computing $Q$ for the general linear hypothesis that the corresponding $\beta_j$ is zero. Compute the standard error for each regression coefficient and the 95% confidence interval estimates.

(b) Does information on the length of survival time of the control group ($X_3$) help explain the variation in $Y$? Support your answer with an appropriate test of significance.

(c) Test the null hypothesis that "sex of patient" has no effect on survival beyond that accounted for by "age" and survival of the control group. Interpret the results.

(d) Test the null hypothesis that "age of patient" has no effect on survival beyond that accounted for by "sex" and survival time of the control group. Interpret the results.

(e) Test the composite hypothesis that $\beta_1 = \beta_2 = \beta_3 = 0$. From these results, what do you conclude about the effect of sex and age of patient on the mean survival time of patients in this study receiving supplemental ascorbate? With the information available in these data, what would you use as the best estimate of the mean ln(days survival)?
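The equivalence asked for in part (a) of Exercise 4.8 follows from a one-line substitution, sketched here under the assumption that the general linear hypothesis is written $K'\beta = m$ with $Q = (K'\hat{\beta} - m)'[K'(X'X)^{-1}K]^{-1}(K'\hat{\beta} - m)$, and that $c_{jj}$ denotes the diagonal element of $(X'X)^{-1}$ corresponding to $\hat{\beta}_j$. Taking $K'$ to be the row vector with a single 1 in the position of $\beta_j$:

```latex
\begin{aligned}
K' &= (0, \ldots, 0, 1, 0, \ldots, 0), \qquad m = 0, \\
K'\hat{\beta} - m &= \hat{\beta}_j, \qquad K'(X'X)^{-1}K = c_{jj}, \\
Q &= (K'\hat{\beta} - m)'\,[K'(X'X)^{-1}K]^{-1}\,(K'\hat{\beta} - m)
   = \hat{\beta}_j \, c_{jj}^{-1} \, \hat{\beta}_j
   = \frac{\hat{\beta}_j^{2}}{c_{jj}}.
\end{aligned}
```

The numerical demonstration the exercise asks for then amounts to confirming that both expressions give the same value for each $j$.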

4.9. The Lesser–Unsworth data (Exercise 1.19) was used in Exercise 3.9 to estimate a bivariate regression equation relating seed weight to cumulative solar radiation and level of ozone pollution. This exercise continues with the analysis of that model using the centered independent variables.

(a) The more complex model used in Exercise 3.9 included the independent variables cumulative solar radiation, ozone level, and the product of cumulative solar radiation and ozone level (plus an intercept).

(i) Construct the analysis of variance for this model showing sums of squares, degrees of freedom, and mean squares. What is the estimate of $\sigma^2$?

(ii) Compute the standard errors for each regression coefficient. Use a joint confidence coefficient of 90% and construct the Bonferroni confidence intervals for the four regression coefficients. Use the confidence intervals to draw conclusions about which regression coefficients are clearly different from zero.

(iii) Construct a test of the null hypothesis that the regression coefficient for the product term is zero (use $\alpha = .05$). Does your conclusion from this test agree with your conclusion based on the Bonferroni confidence intervals? Explain why they need not agree.

(b) The simpler model in Exercise 3.9 did not use the product term. Construct the analysis of variance for the model using only the two independent variables cumulative solar radiation and ozone level.

(i) Use the residual sums of squares from the two analyses to test the null hypothesis that the regression coefficient on the product term is zero (use $\alpha = .05$). Does your conclusion agree with that obtained in Part (a)?

(ii) Compute the standard errors of the regression coefficients for this reduced model. Explain why they differ from those computed in Part (a).

(iii) Compute the estimated mean seed weight for the mean level of cumulative solar radiation and .025 ppm ozone. Compute the estimated mean seed weight for the mean level of radiation and .07 ppm ozone. Use these two results to compute the estimated mean loss in seed weight if ozone changes from .025 to .07 ppm. Define a matrix of coefficients $K'$ such that these three linear functions of $\beta$ can be written as $K'\beta$. Use this matrix form to compute their variances and covariances.

(iv) Compute and plot the 90% joint confidence region for $\beta_1$ and $\beta_2$ ignoring $\beta_0$. (This joint confidence region will be an ellipse in the two dimensions $\beta_1$ and $\beta_2$.)

4.10. This is a continuation of Exercise 3.10 using the number of hospital days for smokers from Exercise 1.21. The dependent variable is $Y = \ln(\text{number of hospital days for smokers})$. The independent variables are $X_1 = (\text{number of cigarettes})^2$ and $X_2 = \ln(\text{number of hospital days for nonsmokers})$. Note that $X_1$ is the square of number of cigarettes.

(a) Plot $Y$ against number of cigarettes and against the square of number of cigarettes. Do the plots provide any indication of why the square of number of cigarettes was chosen as the independent variable?

(b) Complete the analysis of variance for the regression of $Y$ on $X_1$ and $X_2$. Does the information on number of hospital days for nonsmokers help explain the variation in number of hospital days for smokers? Make an appropriate test of significance to support your statement. Is $Y$, after adjustment for number of hospital days for nonsmokers, related to $X_1$? Make a test of significance to support your statement. Are you willing to conclude from these data that number of cigarettes smoked has a direct effect on the average number of hospital days?

(c) It is logical in this problem to expect the number of hospital days for smokers to approach that of nonsmokers as the number of cigarettes smoked goes to zero. This implies that the intercept in this model might be expected to be zero. One might also expect $\beta_2$ to be equal to one. (Explain why.) Set up the general linear hypothesis for testing the composite null hypothesis that $\beta_0 = 0$ and $\beta_2 = 1.0$. Complete the test of significance and state your conclusions.

(d) Construct the reduced model implied by the composite null hypothesis under (c). Compute the regression for this reduced model, obtain the residual sum of squares, and use the difference in residual sums of squares for the full and reduced models to test the composite null hypothesis. Do you obtain the same result as in (c)?

(e) Based on the preceding tests of significance, decide which model you feel is appropriate. State the regression equation for your adopted model. Include standard errors on the regression coefficients.

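4.11. You are given the following matrices computed for a regression analysis.

$$X'X = \begin{bmatrix} 9 & 136 & 269 & 260 \\ 136 & 2114 & 4176 & 3583 \\ 269 & 4176 & 8257 & 7104 \\ 260 & 3583 & 7104 & 12276 \end{bmatrix}, \qquad X'Y = \begin{bmatrix} 45 \\ 648 \\ 1{,}283 \\ 1{,}821 \end{bmatrix},$$

$$(X'X)^{-1} = \begin{bmatrix} 9.610932 & .0085878 & -.2791475 & -.0445217 \\ .0085878 & .5099641 & -.2588636 & .0007765 \\ -.2791475 & -.2588636 & .1395 & .0007396 \\ -.0445217 & .0007765 & .0007396 & .0003698 \end{bmatrix},$$

$$(X'X)^{-1}X'Y = \begin{bmatrix} -1.163461 \\ .135270 \\ .019950 \\ .121954 \end{bmatrix}, \qquad Y'Y = 285.$$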

(a) Use the preceding results to complete the analysis of variance table.

(b) Give the computed regression equation and the standard errors of the regression coefficients.

(c) Compare each estimated regression coefficient to its standard error and use the $t$-test to test the simple hypothesis that each regression coefficient is equal to zero. State your conclusions (use $\alpha = .05$).

(d) Define the $K$ and $m$ for the composite hypothesis that $\beta_0 = 0$, $\beta_1 = \beta_3$, and $\beta_2 = 0$. Give the rank of $K$ and the degrees of freedom associated with this test.

(e) Give the reduced model for the composite hypothesis in Part (d).

4.12. You are given the following sequential and partial sums of squares from a regression analysis.

$$R(\beta_3 \mid \beta_0) = 56.9669, \qquad R(\beta_3 \mid \beta_0\, \beta_1\, \beta_2) = 40.2204,$$

$$R(\beta_1 \mid \beta_0\, \beta_3) = 1.0027, \qquad R(\beta_1 \mid \beta_0\, \beta_2\, \beta_3) = .0359,$$

$$R(\beta_2 \mid \beta_0\, \beta_1\, \beta_3) = .0029, \qquad R(\beta_2 \mid \beta_0\, \beta_1\, \beta_3) = .0029.$$

Each sequential and partial sum of squares can be used for the numerator of an $F$-test. Clearly state the null hypothesis being tested in each case.

4.13. A regression analysis using an intercept and one independent variable gave

$$\hat{Y}_i = 1.841246 + .10934\, X_{i1}.$$

The variance–covariance matrix for $\hat{\beta}$ was

$$s^2(\hat{\beta}) = \begin{bmatrix} .1240363 & -.002627 \\ -.002627 & .0000909 \end{bmatrix}.$$

(a) Compute the 95% confidence interval estimate of $\beta_1$. The estimate of $\sigma^2$ used to compute $s^2(\hat{\beta}_1)$ was $s^2 = 1.6360$, the residual mean square from the model using only $X_0$ and $X_1$. The data had $n = 34$ observations.

(b) Compute $\hat{Y}$ for $X_1 = 4$. Compute the variance of $\hat{Y}$ if it is being used to estimate the mean of $Y$ when $X_1 = 4$. Compute the variance of $\hat{Y}$ if it is being used to predict a future observation at $X_1 = 4$.
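The calculations in Exercise 4.13 can be sketched as follows. The sketch assumes the standard results that the variance of an estimated mean at $x_0' = (1, X_1)$ is $x_0'\, s^2(\hat{\beta})\, x_0$ and that the variance for predicting a single future observation adds $s^2$ to it. The critical value `t_crit = 2.037` is an assumed lookup of the two-tailed 5% $t$ value with $n - 2 = 32$ degrees of freedom and should be checked against your own table; the variable names are hypothetical.

```python
import math

# Values printed in Exercise 4.13.
b0, b1 = 1.841246, 0.10934
cov = [[0.1240363, -0.002627],
       [-0.002627, 0.0000909]]   # s^2(beta-hat)
s2 = 1.6360                      # residual mean square
n = 34
df_res = n - 2                   # intercept and one slope

# (a) 95% confidence interval for beta1.
t_crit = 2.037                   # assumed tabled t(.025; 32)
se_b1 = math.sqrt(cov[1][1])
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# (b) Estimated mean at X1 = 4 and the two variances.
x0 = [1.0, 4.0]
y_hat = b0 * x0[0] + b1 * x0[1]
var_mean = sum(x0[i] * cov[i][j] * x0[j] for i in range(2) for j in range(2))
var_pred = var_mean + s2         # adds the variance of a new observation

print(f"se(b1) = {se_b1:.6f}, 95% CI = ({ci_b1[0]:.5f}, {ci_b1[1]:.5f})")
print(f"Y-hat = {y_hat:.6f}, var(mean) = {var_mean:.7f}, var(pred) = {var_pred:.7f}")
```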

4.14. You are given the following matrix of simple (product moment) correlations among a dependent variable $Y$ (first variable) and three independent variables.

$$\begin{bmatrix} 1.0 & -.538 & -.543 & .974 \\ -.538 & 1.0 & .983 & -.653 \\ -.543 & .983 & 1.0 & -.656 \\ .974 & -.653 & -.656 & 1.0 \end{bmatrix}.$$

(a) From inspection of the correlation matrix, which independent variable would account for the greatest variability in $Y$? What proportion of the corrected sum of squares in $Y$ would be accounted for by this variable? If $Y$ were regressed on all three independent variables (plus an intercept), would the coefficient of determination for the multiple regression be smaller or larger than this proportion?

(b) Inspection of the three pairwise correlations among the $X$ variables suggests that at least one of the independent variables will not be useful for the regression of $Y$ on the $X$s. Explain exactly the basis for this statement and why it has this implication.
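For Exercise 4.14, the proportion of the corrected sum of squares explained by a single regressor is the square of its simple correlation with $Y$, so the inspection in part (a) reduces to squaring the first row of the matrix. A minimal sketch, with hypothetical variable names and the matrix typed in as printed:

```python
# Correlation matrix from Exercise 4.14; order is Y, X1, X2, X3.
R = [
    [1.0, -0.538, -0.543, 0.974],
    [-0.538, 1.0, 0.983, -0.653],
    [-0.543, 0.983, 1.0, -0.656],
    [0.974, -0.653, -0.656, 1.0],
]
names = ["X1", "X2", "X3"]

# Squared simple correlation of each X with Y.
r2_with_y = [R[0][j] ** 2 for j in range(1, 4)]
best = max(range(3), key=lambda j: r2_with_y[j])

# Squared pairwise correlations among the X's, relevant to part (b).
r2_among_x = {
    (names[i], names[j]): R[i + 1][j + 1] ** 2
    for i in range(3) for j in range(i + 1, 3)
}

for name, r2 in zip(names, r2_with_y):
    print(f"r^2(Y, {name}) = {r2:.6f}")
print("largest:", names[best])
for pair, r2 in r2_among_x.items():
    print(f"r^2{pair} = {r2:.6f}")
```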

4.15. Let $X$ be an $n \times p$ matrix with rank $p$. Suppose the first column of $X$ is $\mathbf{1}$, a column of 1s. Then, show that

(a) $P\mathbf{1} = \mathbf{1}$.

(b) $(J/n)$, $(P - J/n)$, and $(I - P)$ are idempotent and pairwise orthogonal, where $P = X(X'X)^{-1}X'$ and $J/n$ is given in equation 4.22.
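A numerical check is no substitute for the proof requested in Exercise 4.15, but it can confirm what is to be shown before starting. The sketch below assumes that $J$ in equation 4.22 is the $n \times n$ matrix of ones, and uses an arbitrary simulated $X$ (seeded random numbers) whose first column is all 1s; the names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3

# Arbitrary full-rank X with a leading column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
one = np.ones(n)
I = np.eye(n)
J_over_n = np.ones((n, n)) / n          # assumed form of J/n
P = X @ np.linalg.inv(X.T @ X) @ X.T    # projection onto the columns of X

# (a) P reproduces the column of ones.
p_one_ok = np.allclose(P @ one, one)

# (b) Idempotency and pairwise orthogonality of the three matrices.
mats = {"J/n": J_over_n, "P - J/n": P - J_over_n, "I - P": I - P}
idempotent = {name: np.allclose(M @ M, M) for name, M in mats.items()}
labels = list(mats)
orthogonal = {
    (a, b): np.allclose(mats[a] @ mats[b], 0.0, atol=1e-10)
    for i, a in enumerate(labels) for b in labels[i + 1:]
}

print("P1 = 1:", p_one_ok)
print("idempotent:", idempotent)
print("pairwise orthogonal:", orthogonal)
```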

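4.16. Let $X$ be a full rank $n \times p$ matrix given in equation 3.2. For $J$ given in equation 4.22,

(a) show that

$$(I - J/n)X = \begin{bmatrix} 0 & x_{11} & x_{12} & \cdots & x_{1p} \\ 0 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix},$$

where $x_{ij} = X_{ij} - \bar{X}_{.j}$ and $\bar{X}_{.j} = n^{-1}\sum_{i=1}^{n} X_{ij}$; and

(b) hence, show that $X'(I - J/n)X$ has zero in each entry of the first row and first column.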
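The centering effect described in Exercise 4.16 can also be seen numerically. This sketch assumes $J$ is the $n \times n$ matrix of ones and uses a small simulated $X$ made of a column of 1s followed by three arbitrary columns; it illustrates the claim and does not prove it, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3

# Column of ones followed by k arbitrary columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
centering = np.eye(n) - np.ones((n, n)) / n   # I - J/n (assumed form of J)

# (a) The first column becomes zero; the others become deviations from means.
CX = centering @ X
first_column_zero = np.allclose(CX[:, 0], 0.0)
columns_centered = np.allclose(CX[:, 1:], X[:, 1:] - X[:, 1:].mean(axis=0))

# (b) X'(I - J/n)X has a zero first row and a zero first column.
S = X.T @ centering @ X
first_row_col_zero = np.allclose(S[0, :], 0.0) and np.allclose(S[:, 0], 0.0)

print("first column of (I - J/n)X is zero:", first_column_zero)
print("remaining columns are centered:", columns_centered)
print("first row and column of X'(I - J/n)X are zero:", first_row_col_zero)
```

The lower-right block of `S` is the matrix of corrected sums of squares and products of the non-intercept columns.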


From the document Applied Regression Analysis: A Research Tool (pages 165-176)