Exercises - MULTIPLE REGRESSION IN MATRIX NOTATION

3.7 Exercises

3.1. The linear model in ordinary least squares is $Y = X\beta + \epsilon$. Assume there are 30 observations and five independent variables (containing no linear dependencies). Give the order and rank of:

(a) $Y$.

(b) $X$ (without an intercept in the model).

(c) $X$ (with an intercept in the model).

(d) $\beta$ (without an intercept in the model).

(e) $\beta$ (with an intercept in the model).

(f) $\epsilon$.

(g) $(X'X)$ (with an intercept in the model).

(h) $P$ (with an intercept in the model).

3.2. For each of the following matrices, indicate whether there will be a unique solution to the normal equations. Show how you arrived at your answer.

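\[
X_1 = \begin{bmatrix} 1 & 2 & 4 \\ 1 & 3 & 8 \\ 1 & 0 & 6 \\ 1 & -1 & 2 \end{bmatrix}, \quad
X_2 = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}, \quad
X_3 = \begin{bmatrix} 1 & 2 & 4 \\ 1 & 1 & 2 \\ 1 & -3 & -6 \\ 1 & -1 & -2 \end{bmatrix}.
\]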

3.3. You have a data set with four independent variables and $n = 42$ observations. If the model is to include an intercept, what would be the order of $X'X$? Of $(X'X)^{-1}$? Of $X'Y$? Of $P$?

3.4. A data set with one independent variable and an intercept gave the following $(X'X)^{-1}$:

\[
(X'X)^{-1} = \begin{bmatrix} \frac{31}{177} & \frac{-3}{177} \\ \frac{-3}{177} & \frac{6}{177} \end{bmatrix}.
\]

How many observations were there in the data set? Find $\sum X_i^2$. Find the corrected sum of squares for the independent variable.
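For a model with an intercept and one independent variable, $X'X$ has $n$ and $\sum X_i$ in its first row and $\sum X_i$ and $\sum X_i^2$ in its second, so inverting the given matrix recovers these quantities. A small sketch of that check, assuming NumPy and assuming the entries of $(X'X)^{-1}$ are 31/177, -3/177, -3/177, and 6/177:

```python
import numpy as np

# (X'X)^{-1} as read from Exercise 3.4 (assumed entries, all over 177)
XtX_inv = np.array([[31.0, -3.0], [-3.0, 6.0]]) / 177.0

# Inverting recovers X'X = [[n, sum X], [sum X, sum X^2]]
XtX = np.linalg.inv(XtX_inv)
n = XtX[0, 0]
sum_x = XtX[0, 1]
sum_x2 = XtX[1, 1]

# Corrected sum of squares: sum X^2 - (sum X)^2 / n
corrected_ss = sum_x2 - sum_x**2 / n
print(n, sum_x, sum_x2, corrected_ss)
```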

3.5. The data in the accompanying table relate grams plant dry weight $Y$ to percent soil organic matter $X_1$, and kilograms of supplemental soil nitrogen added per 1,000 square meters $X_2$:

            Y      X1     X2
         78.5       7    2.6
         74.3       1    2.9
        104.3      11    5.6
         87.6      11    3.1
         95.9       7    5.2
        109.2      11    5.5
        102.7       3    7.1
Sums:   652.5      51   32.0
Means:  93.21    7.29   4.57

(a) Define $Y$, $X$, $\beta$, and $\epsilon$ for a model involving both independent variables and an intercept.

(b) Compute $X'X$ and $X'Y$.

(c) $(X'X)^{-1}$ for this problem is

\[
(X'X)^{-1} = \begin{bmatrix}
1.7995972 & -.0685472 & -.2531648 \\
-.0685472 & .0100774 & -.0010661 \\
-.2531648 & -.0010661 & .0570789
\end{bmatrix}.
\]

Verify that this is the inverse of $X'X$. Compute $\hat{\beta}$ and write the regression equation. Interpret each estimated regression coefficient. What are the units of measure attached to each regression coefficient?

(d) Compute $\hat{Y}$ and $e$.

(e) The $P$ matrix in this case is a 7×7 matrix. Illustrate the computation of $P$ by computing $v_{11}$, the first diagonal element, and $v_{12}$, the second element in the first row. Use the preceding results and these two elements of $P$ to give the appropriate coefficient on $\sigma^2$ for each of the following variances.

(i) $\text{Var}(\hat{\beta}_1)$ (ii) $\text{Var}(\hat{Y}_1)$ (iii) $\text{Var}(\hat{Y}_{\text{pred}_1})$ (iv) $\text{Var}(e_1)$.
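The matrix computations in Exercise 3.5 can be checked against hand calculations with a short script. This is a sketch only, assuming NumPy; the variable names are illustrative and the data are typed from the table above.

```python
import numpy as np

# Data from the table in Exercise 3.5
Y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7])
X1 = np.array([7, 1, 11, 11, 7, 11, 3], dtype=float)
X2 = np.array([2.6, 2.9, 5.6, 3.1, 5.2, 5.5, 7.1])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(Y), X1, X2])

XtX = X.T @ X
XtY = X.T @ Y
XtX_inv = np.linalg.inv(XtX)

beta_hat = XtX_inv @ XtY      # least squares estimates
P = X @ XtX_inv @ X.T         # 7 x 7 projection matrix
Y_hat = P @ Y                 # fitted values
e = Y - Y_hat                 # residuals

v11, v12 = P[0, 0], P[0, 1]   # the two elements of P asked for in part (e)
print(XtX, XtY, beta_hat, v11, v12, sep="\n")
```

Comparing `XtX_inv` with the matrix printed in part (c) is a quick way to confirm that the data were entered correctly.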

3.6. Use the data in Exercise 3.5. Center each independent variable by subtracting the column mean from each observation in the column. Compute $X'X$, $X'Y$, and $\hat{\beta}$ using the centered data. Were the computations simplified by using centered data? Show that the regression equation obtained using centered data is equivalent to that obtained with the original uncentered data. Compute $P$ using the centered data and compare it to that obtained using the uncentered data.
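A numerical comparison of the centered and uncentered fits can support the algebra in Exercise 3.6. The sketch below assumes NumPy and retypes the Exercise 3.5 data; the helper `fit` is a hypothetical name.

```python
import numpy as np

Y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7])
X1 = np.array([7, 1, 11, 11, 7, 11, 3], dtype=float)
X2 = np.array([2.6, 2.9, 5.6, 3.1, 5.2, 5.5, 7.1])
ones = np.ones_like(Y)

def fit(X, Y):
    """Return (beta_hat, P) for the least squares fit of Y on X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ X.T @ Y, X @ XtX_inv @ X.T

# Uncentered and centered design matrices; the intercept column is not centered
X = np.column_stack([ones, X1, X2])
Xc = np.column_stack([ones, X1 - X1.mean(), X2 - X2.mean()])

b, P = fit(X, Y)
bc, Pc = fit(Xc, Y)

print("X'X (centered):\n", Xc.T @ Xc)   # off-diagonal terms in row 1 vanish
print("uncentered:", b)
print("centered:  ", bc)
```

With centered variables the first row and column of $X'X$ are zero apart from $n$, which is why the intercept can be estimated separately from the slopes.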

3.7. The matrix $P$ for the Heagle ozone data is given in Example 3.3. Verify that $P$ is symmetric and idempotent. What is the linear function of $Y_i$ that gives $\hat{Y}_3$?

3.8. Compute $(I - P)$ for the Heagle ozone data. Verify that $(I - P)$ is idempotent and that $P$ and $(I - P)$ are orthogonal to each other. What does the orthogonality imply about the vectors $\hat{Y}$ and $e$?
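The properties in Exercises 3.7 and 3.8 hold for any full-rank $X$, so they can be explored numerically before working with the Heagle data. The sketch below assumes NumPy and uses a small made-up design matrix and response; these numbers are hypothetical and are not the Heagle ozone data.

```python
import numpy as np

# Hypothetical 4 x 2 design matrix (intercept plus one variable) and response
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 7.0]])
Y = np.array([3.0, 5.0, 4.0, 10.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(len(Y)) - P                 # I - P

Y_hat = P @ Y
e = M @ Y

print("P symmetric:       ", np.allclose(P, P.T))
print("P idempotent:      ", np.allclose(P @ P, P))
print("(I - P) idempotent:", np.allclose(M @ M, M))
print("P(I - P) = 0:      ", np.allclose(P @ M, 0.0))
print("Y_hat' e:          ", Y_hat @ e)
```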

3.9. This exercise uses the Lesser–Unsworth data in Exercise 1.19, in which seed weight is related to cumulative solar radiation for two levels of exposure to ozone. Assume that “low ozone” is an exposure of .025 ppm and that “high ozone” is an exposure of .07 ppm.

(a) Set up $X$ and $\beta$ for the regression of seed weight on cumulative solar radiation and ozone level. Center the independent variables and include an intercept in the model. Estimate the regression equation and interpret the result.

(b) Extend the model to include an independent variable that is the product term between centered cumulative solar radiation and centered ozone level. Estimate the regression equation for this model and interpret the result. What does the presence of the product term contribute to the regression equation?

3.10. This exercise uses the data from Exercise 1.21 (number of hospital days for smokers, number of cigarettes smoked, and number of hospital days for control groups of nonsmokers). Exercise 1.21 used the information from the nonsmoker control groups by defining the dependent variable as $Y$ = ln(number of hospital days for smokers/number of hospital days for nonsmokers). Another method of taking into account the experience of the nonsmokers is to use $X_2$ = ln(number of hospital days for nonsmokers) as an independent variable.

(a) Set up $X$ and $\beta$ for the regression of $Y$ = ln(number of hospital days for smokers) on $X_1$ = (number cigarettes)² and $X_2$ = ln(number of hospital days for nonsmokers).

(b) Estimate the regression equation and interpret the results. What value of $\beta_2$ would correspond to using the nonsmoker experience as was done in Exercise 1.21?

3.11. The data in the table relate the annual catch of Gulf Menhaden, Brevoortia patronus, to fishing pressure for 1964 to 1979 (Nelson and Ahrenholz, 1986).

Year   Catch                Number of   Pressure
       (Met. Ton ×10⁻³)     Vessels     (Vessel-Ton-Weeks ×10⁻³)
1964   409.4                76          282.9
1965   463.1                82          335.6
1966   359.1                80          381.3
1967   317.3                76          404.7
1968   373.5                69          382.3
1969   523.7                72          411.0
1970   548.1                73          400.0
1971   728.2                82          472.9
1972   501.7                75          447.5
1973   486.1                65          426.2
1974   578.6                71          485.5
1975   542.6                78          536.9
1976   561.2                81          575.9
1977   447.1                80          532.7
1978   820.0                80          574.3
1979   777.9                77          533.9

Run a linear regression of catch ($Y$) on fishing pressure ($X_1$) and number of vessels ($X_2$). Include an intercept in the model. Interpret the regression equation.
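One possible setup for Exercise 3.11 is sketched below. It assumes NumPy and uses `numpy.linalg.lstsq` for the fit rather than forming $(X'X)^{-1}$ explicitly; the variable names are illustrative, and the data are typed from the table above.

```python
import numpy as np

# Catch (Y), number of vessels (X2), and fishing pressure (X1), 1964 to 1979
catch = np.array([409.4, 463.1, 359.1, 317.3, 373.5, 523.7, 548.1, 728.2,
                  501.7, 486.1, 578.6, 542.6, 561.2, 447.1, 820.0, 777.9])
vessels = np.array([76, 82, 80, 76, 69, 72, 73, 82,
                    75, 65, 71, 78, 81, 80, 80, 77], dtype=float)
pressure = np.array([282.9, 335.6, 381.3, 404.7, 382.3, 411.0, 400.0, 472.9,
                     447.5, 426.2, 485.5, 536.9, 575.9, 532.7, 574.3, 533.9])

# Columns: intercept, X1 = pressure, X2 = vessels
X = np.column_stack([np.ones_like(catch), pressure, vessels])
Y = catch

# Least squares solution of the normal equations
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
residuals = Y - X @ beta_hat

print("beta_hat (intercept, pressure, vessels):", beta_hat)
```

The coefficient units follow from the table headings: each slope is in thousands of metric tons of catch per unit of its own variable.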

3.12. A simulation model for peak water flow from watersheds was tested by comparing measured peak flow (cfs) from 10 storms with predictions of peak flow obtained from the simulation model. $Q_o$ and $Q_p$ are the observed and predicted peak flows, respectively. Four independent variables were recorded:

$X_1$ = area of watershed (mi²),

$X_2$ = average slope of watershed (in percent),

$X_3$ = surface absorbency index (0 = complete absorbency, 100 = no absorbency), and

$X_4$ = peak intensity of rainfall (in/hr) computed on half-hour time intervals.

  Q_o     Q_p     X1     X2    X3    X4
   28      32    .03    3.0    70    .6
  112     142    .03    3.0    80   1.8
  398     502    .13    6.5    65   2.0
  772     790   1.00   15.0    60    .4
2,294   3,075   1.00   15.0    65   2.3
2,484   3,230   3.00    7.0    67   1.0
2,586   3,535   5.00    6.0    62    .9
3,024   4,265   7.00    6.5    56   1.1
4,179   6,529   7.00    6.5    56   1.4
  710     935   7.00    6.5    56    .7

(a) Use $Y = \ln(Q_o/Q_p)$ as the dependent variable. The dependent variable will have the value zero if the observed and predicted peak flows agree. Set up the regression problem to determine whether the discrepancy $Y$ is related to any of the four independent variables. Use an intercept in the model. Estimate the regression equation.

(b) Further consideration of the problem suggested that the discrepancy between observed and predicted peak flow $Y$ might go to zero as the values of the four independent variables approach zero. Redefine the regression problem to eliminate the intercept (force $\beta_0 = 0$), and estimate the regression equation.

(c) Rerun the regression (without the intercept) using only $X_1$ and $X_4$; that is, omit $X_2$ and $X_3$ from the model. Do the regression coefficients for $X_1$ and $X_4$ change? Explain why they do or do not change.

(d) Describe the change in the standard errors of the estimated regression coefficients as the intercept was dropped [Part (a) versus Part (b)] and as $X_2$ and $X_3$ were dropped from the model [Part (b) versus Part (c)].
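The three fits in Exercise 3.12 differ only in which columns enter $X$, so one helper can serve all of them. This is a sketch assuming NumPy; the helper `fit` and the dictionary labels are hypothetical, and standard errors are taken as the square roots of the diagonal of $s^2 (X'X)^{-1}$ with $s^2$ the residual mean square.

```python
import numpy as np

# Data from the table in Exercise 3.12
Qo = np.array([28, 112, 398, 772, 2294, 2484, 2586, 3024, 4179, 710], dtype=float)
Qp = np.array([32, 142, 502, 790, 3075, 3230, 3535, 4265, 6529, 935], dtype=float)
X1 = np.array([0.03, 0.03, 0.13, 1.0, 1.0, 3.0, 5.0, 7.0, 7.0, 7.0])
X2 = np.array([3.0, 3.0, 6.5, 15.0, 15.0, 7.0, 6.0, 6.5, 6.5, 6.5])
X3 = np.array([70, 80, 65, 60, 65, 67, 62, 56, 56, 56], dtype=float)
X4 = np.array([0.6, 1.8, 2.0, 0.4, 2.3, 1.0, 0.9, 1.1, 1.4, 0.7])
Y = np.log(Qo / Qp)

def fit(X, Y):
    """Return (beta_hat, standard errors) for the least squares fit of Y on X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    s2 = resid @ resid / (X.shape[0] - X.shape[1])   # residual mean square
    return beta_hat, np.sqrt(s2 * np.diag(XtX_inv))

ones = np.ones_like(Y)
designs = {
    "(a) intercept, X1 to X4": np.column_stack([ones, X1, X2, X3, X4]),
    "(b) no intercept, X1 to X4": np.column_stack([X1, X2, X3, X4]),
    "(c) no intercept, X1 and X4": np.column_stack([X1, X4]),
}
results = {label: fit(X, Y) for label, X in designs.items()}

for label, (beta_hat, se) in results.items():
    print(label)
    print("  coefficients:   ", beta_hat)
    print("  standard errors:", se)
```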

3.13. You have fit a linear model using $Y = X\beta + \epsilon$, where $X$ involves $r$ independent variables. Now assume that the true model involves an additional $s$ independent variables contained in $Z$. That is, the true model is

\[
Y = X\beta + Z\gamma + \epsilon,
\]

where $\gamma$ is the vector of regression coefficients for the independent variables contained in $Z$.

(a) Find $E(\hat{\beta})$ and show that, in general, $\hat{\beta} = (X'X)^{-1}X'Y$ is a biased estimate of $\beta$.

(b) Under what conditions would $\hat{\beta}$ be unbiased?
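One way to start part (a) is to take expectations under the true model, treating $X$ and $Z$ as fixed and assuming $E(\epsilon) = 0$. The steps below are a sketch of that approach:

\[
E(\hat{\beta}) = (X'X)^{-1}X'\,E(Y) = (X'X)^{-1}X'(X\beta + Z\gamma) = \beta + (X'X)^{-1}X'Z\gamma .
\]

Part (b) then amounts to asking when the second term on the right is zero.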

3.14. The accompanying table shows the part of the data reported by Cameron and Pauling (1978) related to the effects of supplemental ascorbate, vitamin C, in the treatment of colon cancer. The data are taken from Andrews and Herzberg (1985) and are used with permission.

Sex   Age   Days^a   Control^b
F     76      135    18
F     58       50    30
M     49      189    65
M     69    1,267    17
F     70      155    57
F     68      534    16
M     50      502    25
F     74      126    21
M     66       90    17
F     76      365    42
F     56      911    40
M     65      743    14
F     74      366    28
M     58      156    31
F     60       99    28
M     77       20    33
M     38      274    80

^a Days = number of days survival after date of untreatability.

^b Control = average number of days survival of 10 control patients.

Use $Y$ = ln(days) as the dependent variable and $X_1$ = sex (coded −1 for males and +1 for females), $X_2$ = age, and $X_3$ = ln(control) in a multiple regression to determine if there is any relationship between days survival and sex and age. Define $X$ and $\beta$, and estimate the regression equation. Explain why $X_3$ is in the model if the purpose is to relate survival to $X_1$ and $X_2$.
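The coding of the variables in Exercise 3.14 can be sketched as follows, assuming NumPy. The data are typed from the table above, the variable names are illustrative, and `numpy.linalg.lstsq` is used for the fit.

```python
import numpy as np

# Data from the table in Exercise 3.14 (17 patients)
sex = np.array(list("FFMMFFMFMFFMFMFMM"))
age = np.array([76, 58, 49, 69, 70, 68, 50, 74, 66,
                76, 56, 65, 74, 58, 60, 77, 38], dtype=float)
days = np.array([135, 50, 189, 1267, 155, 534, 502, 126, 90,
                 365, 911, 743, 366, 156, 99, 20, 274], dtype=float)
control = np.array([18, 30, 65, 17, 57, 16, 25, 21, 17,
                    42, 40, 14, 28, 31, 28, 33, 80], dtype=float)

Y = np.log(days)
X1 = np.where(sex == "F", 1.0, -1.0)   # +1 for females, -1 for males
X2 = age
X3 = np.log(control)

# Columns: intercept, sex code, age, ln(control)
X = np.column_stack([np.ones_like(Y), X1, X2, X3])

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("beta_hat (intercept, sex, age, ln control):", beta_hat)
```

With the ±1 coding, the difference between the fitted values for a female and a male of the same age and control value is twice the coefficient on $X_1$.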

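3.15. Suppose $U \sim N(\mu_u, V_u)$. Let

\[
U = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \quad
\mu_u = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \text{and} \quad
V_u = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}.
\]

Use equation 3.32 to show that $u_1$ and $u_2$ are independent if $V_{12} = 0$. That is, if $U$ is multivariate normal, then $u_1$ and $u_2$ uncorrelated implies $u_1$ and $u_2$ are independent. (The joint density of $u_1$ and $u_2$ is the product of the marginal densities of $u_1$ and $u_2$ if and only if $u_1$ and $u_2$ are independent.)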

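3.16. Consider the model $Y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^2 I)$. Let

\[
U = \begin{pmatrix} \hat{\beta} \\ e \end{pmatrix}
  = \begin{bmatrix} (X'X)^{-1}X' \\ (I - P) \end{bmatrix} Y .
\]

Find the distribution of $U$. Show that $\hat{\beta}$ and $e$ are independent. (Hint: Use the result in equation 3.31 and Exercise 3.15.)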

From the document Applied Regression Analysis: A Research Tool (pages 110-117)