
ANALYSIS OF VARIANCE AND QUADRATIC FORMS


4.8 Exercises

(a) How many observations were in the data set?

(b) How many linearly independent columns are in X; that is, what is the rank of X? How many degrees of freedom are associated with the model sum of squares? Assuming the model contained an intercept, how many degrees of freedom are associated with the regression sum of squares?

(c) Suppose Y = ( 82 80 75 67 55 )′. Compute the estimated mean Ŷ₁ of Y corresponding to the first observation. Compute s²(Ŷ₁). Find the residual e₁ for the first observation and compute its variance. For which data point will Ŷᵢ have the smallest variance? For which data point will eᵢ have the largest variance?

4.3. The following (X′X)⁻¹, β̂, and residual sum of squares were obtained from the regression of plant dry weight (grams) from n = 7 experimental fields on percent soil organic matter (X₁) and kilograms of supplemental nitrogen per 1,000 m² (X₂). The regression model included an intercept.

\[
(X'X)^{-1} =
\begin{bmatrix}
1.7995972 & -.0685472 & -.2531648 \\
-.0685472 & .0100774 & -.0010661 \\
-.2531648 & -.0010661 & .0570789
\end{bmatrix},
\quad
\hat{\beta} =
\begin{bmatrix}
51.5697 \\ 1.4974 \\ 6.7233
\end{bmatrix},
\quad
\mathrm{SS(Res)} = 27.5808.
\]

(a) Give the regression equation and interpret each regression coefficient. Give the units of measure of each regression coefficient.

(b) How many degrees of freedom does SS(Res) have? Compute s², the variance of β̂₁, and the covariance of β̂₁ and β̂₂.

(c) Determine the 95% univariate confidence interval estimates of β₁ and β₂. Compute the Bonferroni and the Scheffé confidence intervals for β₁ and β₂ using a joint confidence coefficient of .95.

(d) Suppose previous experience has led you to believe that one percentage point increase in organic matter is equivalent to .5 kilogram/1,000 m² of supplemental nitrogen in dry matter production. Translate this statement into a null hypothesis on the regression coefficients. Use a t-test to test this null hypothesis against the alternative hypothesis that supplemental nitrogen is more effective than this statement would imply.

(e) Define K′ and m for the general linear hypothesis H₀: K′β − m = 0 for testing H₀: 2β₁ = β₂. Compute Q and complete the test of significance using the F-test. What is the alternative hypothesis for this test?

(f) Give the reduced model you obtain if you impose the null hypothesis in (e) on the model. Suppose this reduced model gave SS(Res) = 164.3325. Use this result to complete the test of the hypothesis.
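The arithmetic behind parts (b), (e), and (f) of Exercise 4.3 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the variable names are invented here, the hypothesis 2β₁ = β₂ is coded as the single-row contrast (0, 2, −1) with m = 0, and the inputs are the rounded values printed in the exercise.

```python
# Quantities printed in Exercise 4.3 (order: intercept, X1, X2).
xtx_inv = [
    [1.7995972, -0.0685472, -0.2531648],
    [-0.0685472, 0.0100774, -0.0010661],
    [-0.2531648, -0.0010661, 0.0570789],
]
beta_hat = [51.5697, 1.4974, 6.7233]
ss_res = 27.5808
n, n_params = 7, 3

# Residual mean square: SS(Res) divided by its degrees of freedom.
df_res = n - n_params
s2 = ss_res / df_res

# Estimated variance of beta1-hat and covariance of beta1-hat with beta2-hat.
var_b1 = xtx_inv[1][1] * s2
cov_b1_b2 = xtx_inv[1][2] * s2

# General linear hypothesis H0: 2*beta1 - beta2 = 0 (assumed coding of K').
k = [0.0, 2.0, -1.0]
m = 0.0
k_beta = sum(ki * bi for ki, bi in zip(k, beta_hat)) - m
# K'(X'X)^{-1}K is a scalar because K' has a single row.
k_c_k = sum(k[i] * xtx_inv[i][j] * k[j] for i in range(3) for j in range(3))
q = k_beta ** 2 / k_c_k
f_stat = q / s2  # numerator has 1 degree of freedom

# Same sum of squares from the full-versus-reduced comparison in part (f).
q_from_models = 164.3325 - ss_res

print(f"s2 = {s2:.4f}, var(b1) = {var_b1:.6f}, cov(b1, b2) = {cov_b1_b2:.6f}")
print(f"Q = {q:.3f}, Q from reduced model = {q_from_models:.3f}, F = {f_stat:.2f}")
```

The two versions of Q should agree up to the rounding in the printed inputs, which is the point of part (f).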

4.4. The following analysis of variance summarizes the regression of Y on two independent variables plus an intercept.

Source          d.f.      SS       MS
Total (corr)     26     1,211
Regression        2     1,055    527.5
Residual         24       156      6.5

Variable    Sequential SS    Partial SS
X₁               263             223
X₂               792             792

(a) Your estimate of β₁ is β̂₁ = 2.996. A friend of yours regressed Y on X₁ and found β̂₁ = 3.24. Explain the difference in these two estimates.

(b) Label each sequential and partial sum of squares using the R-notation. Explain what R(β₁|β₀) measures.

(d) What is the regression sum of squares due to X₁ after adjustment for X₂?

(e) Make a test of significance (use α = .05) to determine if X₁ should be retained in the model with X₂.

(f) The original data contained several sets of observations having the same values of X₁ and X₂. The pooled variance from these replicate observations was s² = 3.8 with eight degrees of freedom. With this information, rewrite the analysis of variance to show the partitions of the “residual” sum of squares into “pure error” and “lack of fit.” Make a test of significance to determine whether the model using X₁ and X₂ is adequate.
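The partition asked for in part (f) of Exercise 4.4, together with the partial F-ratio from part (e), reduces to a few subtractions and divisions. The sketch below uses only the numbers printed in the exercise; the variable names are illustrative, and the comparison against a tabled F critical value is left to the reader.

```python
# Values from the analysis of variance table in Exercise 4.4.
ss_res, df_res = 156.0, 24
ms_res = ss_res / df_res

# Part (e): partial sum of squares for X1 given X2 has 1 degree of freedom.
partial_ss_x1 = 223.0
f_x1 = partial_ss_x1 / ms_res

# Part (f): pure error from replicates is the pooled variance times its d.f.
s2_pure, df_pure = 3.8, 8
ss_pure = s2_pure * df_pure

# Lack of fit is what remains of the residual sum of squares.
ss_lof = ss_res - ss_pure
df_lof = df_res - df_pure
ms_lof = ss_lof / df_lof
f_lof = ms_lof / s2_pure  # compare with F on (df_lof, df_pure) d.f.

print(f"F for X1 | X2: {f_x1:.2f} on (1, {df_res}) d.f.")
print(f"Pure error:  SS = {ss_pure:.1f}, d.f. = {df_pure}")
print(f"Lack of fit: SS = {ss_lof:.1f}, d.f. = {df_lof}, MS = {ms_lof:.2f}")
print(f"F for lack of fit: {f_lof:.3f} on ({df_lof}, {df_pure}) d.f.")
```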

4.5. The accompanying table presents data on one dependent variable and five independent variables.


  Y      X₁     X₂      X₃       X₄      X₅
 6.68   32.6   4.78   1,092   293.09   17.1
 6.31   33.4   4.62   1,279   252.18   14.0
 7.13   33.2   3.72     511   109.31   12.7
 5.81   31.2   3.29     518   131.63   25.7
 5.68   31.0   3.25     582   124.50   24.3
 7.66   31.8   7.35     509    95.19     .3
 7.30   26.4   4.92     942   173.25   21.1
 6.19   26.2   4.02     952   172.21   26.1
 7.31   26.6   5.47     792   142.34   19.8

(a) Give the linear model in matrix form for regressing Y on the five independent variables. Completely define each matrix and give its order and rank.

(b) The following quadratic forms were computed.

Y′PY = 404.532
Y′Y = 405.012
Y′(I − P)Y = 0.480
Y′(I − J/n)Y = 4.078
Y′(P − J/n)Y = 3.598
Y′(J/n)Y = 400.934.

Use a matrix algebra computer program to reproduce each of these sums of squares. Use these results to give the complete analysis of variance summary.

(c) The partial sums of squares for X₁, X₂, X₃, X₄, and X₅ are .895, .238, .270, .337, and .922, respectively. Give the R-notation that describes the partial sum of squares for X₂. Use a matrix algebra program to verify the partial sum of squares for X₂.

(d) Assume that none of the partial sums of squares for X₂, X₃, and X₄ is significant and that the partial sums of squares for X₁ and X₅ are significant (at α = .05). Indicate whether each of the following statements is valid based on these results. If it is not a valid statement, explain why.

(i) X₁ and X₅ are important causal variables whereas X₂, X₃, and X₄ are not.

(ii) X₂, X₃, and X₄ can be dropped from the model with no meaningful loss in predictability ofY.

(iii) There is no need for all five independent variables to be retained in the model.
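Part (b) of Exercise 4.5 asks for the quadratic forms to be reproduced with a matrix algebra program. One possible sketch with NumPy follows. The choice of NumPy, the variable names, and the use of a QR factorization to build P (rather than inverting X′X directly) are assumptions made here for numerical convenience, not something the exercise prescribes.

```python
import numpy as np

# Data table from Exercise 4.5: columns Y, X1, X2, X3, X4, X5.
data = np.array([
    [6.68, 32.6, 4.78, 1092, 293.09, 17.1],
    [6.31, 33.4, 4.62, 1279, 252.18, 14.0],
    [7.13, 33.2, 3.72, 511, 109.31, 12.7],
    [5.81, 31.2, 3.29, 518, 131.63, 25.7],
    [5.68, 31.0, 3.25, 582, 124.50, 24.3],
    [7.66, 31.8, 7.35, 509, 95.19, 0.3],
    [7.30, 26.4, 4.92, 942, 173.25, 21.1],
    [6.19, 26.2, 4.02, 952, 172.21, 26.1],
    [7.31, 26.6, 5.47, 792, 142.34, 19.8],
])
y = data[:, 0]
n = y.size
X = np.column_stack([np.ones(n), data[:, 1:]])  # intercept plus X1..X5

# P = X (X'X)^{-1} X' equals QQ' when X = QR has full column rank.
q_mat, _ = np.linalg.qr(X)
P = q_mat @ q_mat.T
J_n = np.full((n, n), 1.0 / n)  # the matrix J/n
I = np.eye(n)

forms = {
    "Y'PY": y @ P @ y,
    "Y'Y": y @ y,
    "Y'(I-P)Y": y @ (I - P) @ y,
    "Y'(I-J/n)Y": y @ (I - J_n) @ y,
    "Y'(P-J/n)Y": y @ (P - J_n) @ y,
    "Y'(J/n)Y": y @ J_n @ y,
}
for name, value in forms.items():
    print(f"{name:12s} = {value:.3f}")
```

The same P can be reused for part (c): refitting without X₂ and differencing the two model sums of squares gives the partial sum of squares for X₂.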

4.6. This exercise continues with the analysis of the peak water flow data used in Exercise 3.12. In that exercise, several regressions were run to relate Y = ln(Q0/Qp) to three characteristics of the watersheds and a measure of storm intensity. Y measures the discrepancy between peak water flow predicted from a simulation model (Qp) and observed peak water flow (Q0). The four independent variables are described in Exercise 3.12.

(a) The first model used an intercept and all four independent variables.

(i) Compute SS(Model), SS(Regr), and SS(Res) for this model and summarize the results in the analysis of variance table. Show degrees of freedom and mean squares.

(ii) Obtain the partial sum of squares for each independent variable and the sequential sums of squares for the variables added to the model in the order X₁, X₄, X₂, X₃.

(iii) Use tests of significance (α=.05) to determine which partial regression coefficients are different from zero. What do these tests suggest as to which variables might be dropped from the model?

(iv) Construct a test of the null hypothesis H₀: β₀ = 0 using the general linear hypothesis. What do you conclude from this test?

(b) The second model used the four independent variables but forced the intercept to be zero.

(i) Compute SS(Model), SS(Res), and the partial and sequential sums of squares for this model. Summarize the results in the analysis of variance table.

(ii) Use the difference in SS(Res) between this model with no intercept and the previous model with an intercept to test H₀: β₀ = 0. Compare the result with that obtained under (iv) in Part (a).

(iii) Use tests of significance to determine which partial regression coefficients in this model are different from zero. What do these tests tell you in terms of simplifying the model?

(i) Use the results from this model and the zero-intercept model in Part (b) to test the composite null hypothesis that β₂ and β₃ are both zero.

(ii) Use the general linear hypothesis to construct the test of the composite null hypothesis that β₂ and β₃ in the model in Part (b) are both zero. Define K′ and m for this hypothesis, compute Q, and complete the test of significance. Compare these two tests.

4.7. Use the data on annual catch of Gulf Menhaden, number of fishing vessels, and fishing effort given in Exercise 3.11.


(a) Complete the analysis of variance for the regression of catch (Y) on fishing effort (X₁) and number of vessels (X₂) with an intercept in the model. Determine the partial sums of squares for each independent variable. Estimate the standard errors for the regression coefficients and construct the Bonferroni confidence intervals for each using a joint confidence coefficient of 95%.

Use the regression equation to predict the “catch” if number of vessels is limited to X₂ = 70 and fishing effort is restricted to X₁ = 400. Compute the variance of this prediction and the 95% confidence interval estimate of the prediction.

(b) Test the hypothesis that the variable “number of vessels” does not add significantly to the explanation of variation in “catch” provided by “fishing effort” alone (use α = .05). Test the hypothesis that “fishing effort” does not add significantly to the explanation provided by “number of vessels” alone.

(c) On the basis of the tests in Part (b) would you keep both X₁ and X₂ in the model, or would you eliminate one from the model? If one should be eliminated, which would it be? Does the remaining variable make a significant contribution to explaining the variation in “catch”?

(d) Suppose consideration is being given to controlling the annual catch by limiting either the number of fishing vessels or the total fishing effort. What is your recommendation and why?

4.8. This exercise uses the data in Exercise 3.14 relating Y = ln(days survival) for colon cancer patients receiving supplemental ascorbate to the variables sex (X₁), age of patient (X₂), and ln(average survival of control group) (X₃).

(a) Complete the analysis of variance for the model using all three variables plus an intercept. Compute the partial sum of squares for each independent variable using the formula β̂ⱼ²/cⱼⱼ. Demonstrate that each is the same as the sum of squares one obtains by computing Q for the general linear hypothesis that the corresponding βⱼ is zero. Compute the standard error for each regression coefficient and the 95% confidence interval estimates.

(b) Does information on the length of survival time of the control group (X₃) help explain the variation in Y? Support your answer with an appropriate test of significance.

(c) Test the null hypothesis that “sex of patient” has no effect on survival beyond that accounted for by “age” and survival of the control group. Interpret the results.

(d) Test the null hypothesis that “age of patient” has no effect on survival beyond that accounted for by “sex” and survival time of the control group. Interpret the results.

(e) Test the composite hypothesis that β₁ = β₂ = β₃ = 0. From these results, what do you conclude about the effect of sex and age of patient on the mean survival time of patients in this study receiving supplemental ascorbate? With the information available in these data, what would you use as the best estimate of the mean ln(days survival)?

4.9. The Lesser–Unsworth data (Exercise 1.19) was used in Exercise 3.9 to estimate a bivariate regression equation relating seed weight to cumulative solar radiation and level of ozone pollution. This exercise continues with the analysis of that model using the centered independent variables.

(a) The more complex model used in Exercise 3.9 included the independent variables cumulative solar radiation, ozone level, and the product of cumulative solar radiation and ozone level (plus an intercept).

(i) Construct the analysis of variance for this model showing sums of squares, degrees of freedom, and mean squares. What is the estimate of σ²?

(ii) Compute the standard errors for each regression coefficient. Use a joint confidence coefficient of 90% and construct the Bonferroni confidence intervals for the four regression coefficients. Use the confidence intervals to draw conclusions about which regression coefficients are clearly different from zero.

(iii) Construct a test of the null hypothesis that the regression coefficient for the product term is zero (use α = .05). Does your conclusion from this test agree with your conclusion based on the Bonferroni confidence intervals? Explain why they need not agree.

(b) The simpler model in Exercise 3.9 did not use the product term. Construct the analysis of variance for the model using only the two independent variables cumulative solar radiation and ozone level.

(i) Use the residual sums of squares from the two analyses to test the null hypothesis that the regression coefficient on the product term is zero (use α = .05). Does your conclusion agree with that obtained in Part (a)?

(ii) Compute the standard errors of the regression coefficients for this reduced model. Explain why they differ from those computed in Part (a).

(iii) Compute the estimated mean seed weight for the mean level of cumulative solar radiation and .025 ppm ozone. Compute the estimated mean seed weight for the mean level of radiation and .07 ppm ozone. Use these two results to compute the estimated mean loss in seed weight if ozone changes from .025 to .07 ppm. Define a matrix of coefficients K′ such that these three linear functions of β can be written as K′β. Use this matrix form to compute their variances and covariances.

(iv) Compute and plot the 90% joint confidence region for β₁ and β₂ ignoring β₀. (This joint confidence region will be an ellipse in the two dimensions β₁ and β₂.)

4.10. This is a continuation of Exercise 3.10 using the number of hospital days for smokers from Exercise 1.21. The dependent variable is Y = ln(number of hospital days for smokers). The independent variables are X₁ = (number of cigarettes)² and X₂ = ln(number of hospital days for nonsmokers). Note that X₁ is the square of number of cigarettes.

(a) Plot Y against number of cigarettes and against the square of number of cigarettes. Do the plots provide any indication of why the square of number of cigarettes was chosen as the independent variable?

(b) Complete the analysis of variance for the regression of Y on X₁ and X₂. Does the information on number of hospital days for nonsmokers help explain the variation in number of hospital days for smokers? Make an appropriate test of significance to support your statement. Is Y, after adjustment for number of hospital days for nonsmokers, related to X₁? Make a test of significance to support your statement. Are you willing to conclude from these data that number of cigarettes smoked has a direct effect on the average number of hospital days?

(c) It is logical in this problem to expect the number of hospital days for smokers to approach that of nonsmokers as the number of cigarettes smoked goes to zero. This implies that the intercept in this model might be expected to be zero. One might also expect β₂ to be equal to one. (Explain why.) Set up the general linear hypothesis for testing the composite null hypothesis that β₀ = 0 and β₂ = 1.0. Complete the test of significance and state your conclusions.

(d) Construct the reduced model implied by the composite null hypothesis under (c). Compute the regression for this reduced model, obtain the residual sum of squares, and use the difference in residual sums of squares for the full and reduced models to test the composite null hypothesis. Do you obtain the same result as in (c)?

(e) Based on the preceding tests of significance, decide which model you feel is appropriate. State the regression equation for your adopted model. Include standard errors on the regression coefficients.

4.11. You are given the following matrices computed for a regression analysis.

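\[
X'X =
\begin{bmatrix}
9 & 136 & 269 & 260 \\
136 & 2114 & 4176 & 3583 \\
269 & 4176 & 8257 & 7104 \\
260 & 3583 & 7104 & 12276
\end{bmatrix},
\quad
X'Y =
\begin{bmatrix}
45 \\ 648 \\ 1{,}283 \\ 1{,}821
\end{bmatrix},
\]

\[
(X'X)^{-1} =
\begin{bmatrix}
9.610932 & .0085878 & -.2791475 & -.0445217 \\
.0085878 & .5099641 & -.2588636 & .0007765 \\
-.2791475 & -.2588636 & .1395 & .0007396 \\
-.0445217 & .0007765 & .0007396 & .0003698
\end{bmatrix},
\]

\[
(X'X)^{-1}X'Y =
\begin{bmatrix}
-1.163461 \\ .135270 \\ .019950 \\ .121954
\end{bmatrix},
\quad
Y'Y = 285.
\]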

(a) Use the preceding results to complete the analysis of variance table.

(b) Give the computed regression equation and the standard errors of the regression coefficients.

(c) Compare each estimated regression coefficient to its standard error and use the t-test to test the simple hypothesis that each regression coefficient is equal to zero. State your conclusions (use α = .05).

(d) Define the K′ and m for the composite hypothesis that β₀ = 0, β₁ = β₃, and β₂ = 0. Give the rank of K′ and the degrees of freedom associated with this test.

(e) Give the reduced model for the composite hypothesis in Part (d).
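For parts (a) through (c) of Exercise 4.11, the sums of squares follow directly from the printed matrices. The sketch below is plain Python with illustrative names. It assumes that X′Y is (45, 648, 1283, 1821)′, a reading of the printed vector that agrees with multiplying X′X by the printed coefficients, and it takes n = 9 from the leading element of X′X.

```python
# Matrices from Exercise 4.11. The X'Y vector is an assumed reading of the
# printed values; it matches (X'X) times the printed coefficient vector.
xty = [45.0, 648.0, 1283.0, 1821.0]
beta_hat = [-1.163461, 0.135270, 0.019950, 0.121954]
c_diag = [9.610932, 0.5099641, 0.1395, 0.0003698]  # diagonal of (X'X)^{-1}
yty = 285.0
n, n_params = 9, 4

ss_model = sum(b * t for b, t in zip(beta_hat, xty))  # beta-hat' X'Y
ss_mean = xty[0] ** 2 / n                             # (sum of Y)^2 / n
ss_regr = ss_model - ss_mean
ss_res = yty - ss_model
df_res = n - n_params
s2 = ss_res / df_res

# Standard errors and t-ratios for H0: beta_j = 0.
std_errors = [(c * s2) ** 0.5 for c in c_diag]
t_ratios = [b / se for b, se in zip(beta_hat, std_errors)]

print(f"SS(Model) = {ss_model:.4f}, SS(Regr) = {ss_regr:.4f}")
print(f"SS(Res) = {ss_res:.4f} on {df_res} d.f., s2 = {s2:.4f}")
for j, (b, se, t) in enumerate(zip(beta_hat, std_errors, t_ratios)):
    print(f"beta{j}: estimate = {b:.6f}, s.e. = {se:.4f}, t = {t:.2f}")
```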

4.12. You are given the following sequential and partial sums of squares from a regression analysis.

Sequential                   Partial
R(β₃|β₀) = 56.9669           R(β₃|β₀ β₁ β₂) = 40.2204
R(β₁|β₀ β₃) = 1.0027         R(β₁|β₀ β₂ β₃) = .0359
R(β₂|β₀ β₁ β₃) = .0029       R(β₂|β₀ β₁ β₃) = .0029.


Each sequential and partial sum of squares can be used for the numerator of an F-test. Clearly state the null hypothesis being tested in each case.

4.13. A regression analysis using an intercept and one independent variable gave

Ŷᵢ = 1.841246 + .10934Xᵢ₁. The variance–covariance matrix for β̂ was

\[
s^2(\hat{\beta}) =
\begin{bmatrix}
.1240363 & -.002627 \\
-.002627 & .0000909
\end{bmatrix}.
\]

(a) Compute the 95% confidence interval estimate of β₁. The estimate of σ² used to compute s²(β̂₁) was s² = 1.6360, the residual mean square from the model using only X₀ and X₁. The data had n = 34 observations.

(b) Compute Ŷ for X₁ = 4. Compute the variance of Ŷ if it is being used to estimate the mean of Y when X₁ = 4. Compute the variance of Ŷ if it is being used to predict a future observation at X₁ = 4.
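The two variances in part (b) of Exercise 4.13 differ only by the residual variance s², which a short calculation makes concrete. In this sketch the names are illustrative, and the critical value 2.037 for t(.025) with 32 degrees of freedom is taken from a standard t-table and should be treated as an assumption.

```python
# Results quoted in Exercise 4.13.
b0, b1 = 1.841246, 0.10934
var_b0, var_b1, cov_b0_b1 = 0.1240363, 0.0000909, -0.002627
s2 = 1.6360
n = 34
df_res = n - 2

# Assumed table value of t(.025) with 32 degrees of freedom.
t_crit = 2.037
se_b1 = var_b1 ** 0.5
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

x1 = 4.0
y_hat = b0 + b1 * x1
# Var(Y-hat) = x0' s2(beta-hat) x0 with x0 = (1, x1)'.
var_mean = var_b0 + x1 ** 2 * var_b1 + 2 * x1 * cov_b0_b1
# Predicting a future observation adds the residual variance s2.
var_pred = var_mean + s2

print(f"95% CI for beta1: ({ci_b1[0]:.5f}, {ci_b1[1]:.5f})")
print(f"Y-hat at X1 = 4: {y_hat:.6f}")
print(f"variance as estimate of the mean: {var_mean:.7f}")
print(f"variance as prediction of a future observation: {var_pred:.7f}")
```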

4.14. You are given the following matrix of simple (product moment) correlations among a dependent variable Y (first variable) and three independent variables.

\[
\begin{bmatrix}
1.0 & -.538 & -.543 & .974 \\
-.538 & 1.0 & .983 & -.653 \\
-.543 & .983 & 1.0 & -.656 \\
.974 & -.653 & -.656 & 1.0
\end{bmatrix}.
\]

(a) From inspection of the correlation matrix, which independent variable would account for the greatest variability in Y? What proportion of the corrected sum of squares in Y would be accounted for by this variable? If Y were regressed on all three independent variables (plus an intercept), would the coefficient of determination for the multiple regression be smaller or larger than this proportion?

(b) Inspection of the three pairwise correlations among the X variables suggests that at least one of the independent variables will not be useful for the regression of Y on the Xs. Explain exactly the basis for this statement and why it has this implication.

4.15. Let X be an n × p′ matrix with rank p′. Suppose the first column of X is 1, a column of 1s. Then, show that

(a) P1 = 1.

(b) (J/n), P − J/n, and (I − P) are idempotent and pairwise orthogonal, where P = X(X′X)⁻¹X′ and J/n is given in equation 4.22.
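Before attempting the proofs in Exercise 4.15, it can help to see the claims hold numerically for one arbitrary design matrix. The sketch below is a sanity check, not a proof: it assumes NumPy, takes J/n to be the n × n matrix with every element equal to 1/n, and uses a randomly generated X with a leading column of ones (the sizes and seed are arbitrary choices made here).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
# Arbitrary full-rank X whose first column is the vector of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

P = X @ np.linalg.inv(X.T @ X) @ X.T
J_n = np.full((n, n), 1.0 / n)  # assumed form of J/n
I = np.eye(n)
ones = np.ones(n)

parts = {"J/n": J_n, "P - J/n": P - J_n, "I - P": I - P}

# Part (a): P reproduces the column of ones.
p_ones_ok = np.allclose(P @ ones, ones)
print("P1 = 1:", p_ones_ok)

# Part (b): each matrix is idempotent and each pairwise product is zero.
idempotent = {name: np.allclose(A @ A, A) for name, A in parts.items()}
names = list(parts)
orthogonal = {
    (names[i], names[j]): np.allclose(parts[names[i]] @ parts[names[j]], 0.0)
    for i in range(3)
    for j in range(i + 1, 3)
}
print("idempotent:", idempotent)
print("pairwise orthogonal:", orthogonal)
```

The algebraic argument rests on the same two facts the check relies on: 1 lies in the column space of X, so P1 = 1 and hence P(J/n) = J/n.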

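4.16. Let X be a full rank n × p′ matrix given in equation 3.2. For J given in equation 4.22,

(a) show that

\[
(I - J/n)X =
\begin{bmatrix}
0 & x_{11} & x_{12} & \cdots & x_{1p} \\
0 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix},
\]

where $x_{ij} = X_{ij} - \bar{X}_{.j}$ and $\bar{X}_{.j} = n^{-1}\sum_{i=1}^{n} X_{ij}$; and

(b) hence, show that X′(I − J/n)X has zero in each entry of the first row and first column.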


Source: Applied Regression Analysis: A Research Tool (pages 165-176)