3.5 Properties of Regression Estimates

This result is based on the assumption that the linear model used is the correct model. If important independent variables have been omitted or if the functional form of the model is not correct, $X\beta$ will not be the expectation of $Y$. Assuming that the model is correct, the joint probability density function of $Y$ is given by

$$
(2\pi)^{-n/2}\,|I\sigma^2|^{-1/2}\,e^{-(1/2)(Y-X\beta)'(I\sigma^2)^{-1}(Y-X\beta)}
 = (2\pi)^{-n/2}\sigma^{-n}\,e^{-(1/2\sigma^2)(Y-X\beta)'(Y-X\beta)}. \tag{3.38}
$$

Expressing $\hat{\beta}$ as $\hat{\beta} = [(X'X)^{-1}X']Y$ shows that the estimates of the regression coefficients are linear functions of the dependent variable $Y$, with the coefficients being given by $A' = (X'X)^{-1}X'$. Since the $X$s are constants, the matrix $A'$ is also constant. If the model $Y = X\beta + \epsilon$ is correct, the expectation of $Y$ is $X\beta$ and the expectation of $\hat{\beta}$ is

$$
\begin{aligned}
E(\hat{\beta}) &= [(X'X)^{-1}X']E(Y) \\
 &= [(X'X)^{-1}X']X\beta \\
 &= [(X'X)^{-1}X'X]\beta \\
 &= \beta. \tag{3.39}
\end{aligned}
$$

This shows that $\hat{\beta}$ is an unbiased estimator of $\beta$ if the chosen model is correct. If the chosen model is not correct, say $E(Y) = X\beta + Z\gamma$ instead of $X\beta$, then $[(X'X)^{-1}X']E(Y)$ does not necessarily simplify to $\beta$.

Assuming that the model is correct,

$$
\begin{aligned}
\operatorname{Var}(\hat{\beta}) &= [(X'X)^{-1}X'][\operatorname{Var}(Y)][(X'X)^{-1}X']' \\
 &= [(X'X)^{-1}X']\,I\sigma^2\,[(X'X)^{-1}X']'.
\end{aligned}
$$

Recalling that the transpose of a product is the product of the transposes in reverse order [i.e., $(AB)' = B'A'$], that $X'X$ is symmetric, and that the inverse of a transpose is the transpose of the inverse, we obtain

$$
\begin{aligned}
\operatorname{Var}(\hat{\beta}) &= (X'X)^{-1}X'X(X'X)^{-1}\sigma^2 \\
 &= (X'X)^{-1}\sigma^2. \tag{3.40}
\end{aligned}
$$

Thus, the variances and covariances of the estimated regression coefficients are given by the elements of $(X'X)^{-1}$ multiplied by $\sigma^2$. The diagonal elements give the variances, in the order in which the regression coefficients are listed in $\hat{\beta}$, and the off-diagonal elements give their covariances. When $\epsilon$ is normally distributed, $\hat{\beta}$ is also multivariate normally distributed. Thus,

$$
\hat{\beta} \sim N\big(\beta,\ (X'X)^{-1}\sigma^2\big). \tag{3.41}
$$

Example 3.7. In the ozone example, Example 3.3,

$$
(X'X)^{-1} =
\begin{bmatrix}
 1.0755 & -9.4340 \\
 -9.4340 & 107.8167
\end{bmatrix}.
$$

Thus, $\operatorname{Var}(\hat{\beta}_0) = 1.0755\sigma^2$ and $\operatorname{Var}(\hat{\beta}_1) = 107.8167\sigma^2$. The covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$ is $\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -9.4340\sigma^2$.
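As a concrete illustration of how (3.40) is used, the following is a minimal numpy sketch that computes $\hat{\beta}$ and the multiplier matrix $(X'X)^{-1}$ for a straight-line model. The data values are hypothetical stand-ins (the ozone data of Example 3.3 are not reproduced in this section); only the formulas come from the text.

```python
import numpy as np

# Hypothetical data for illustration only; these are NOT the ozone data
# of Example 3.3.
x = np.array([0.02, 0.07, 0.11, 0.15])        # levels of the regressor
Y = np.array([242.0, 237.0, 231.0, 201.0])    # hypothetical responses

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

XtX_inv = np.linalg.inv(X.T @ X)

# Least squares estimates: beta_hat = (X'X)^{-1} X'Y.
beta_hat = XtX_inv @ X.T @ Y

# Var(beta_hat) = (X'X)^{-1} sigma^2, equation (3.40): the diagonal elements
# (times sigma^2) are the variances, the off-diagonals the covariances.
print(beta_hat)
print(XtX_inv)    # multiply by sigma^2 to get Var(beta_hat)
```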

Recall that the vector of estimated means $\hat{Y}$ is given by $\hat{Y} = [X(X'X)^{-1}X']Y = PY$. Therefore, using $PX = X$, the expectation of $\hat{Y}$ is

$$
E(\hat{Y}) = P\,E(Y) = PX\beta = X\beta. \tag{3.42}
$$

Thus, $\hat{Y}$ is an unbiased estimator of the mean of $Y$ for the particular values of $X$ in the data set, again if the model is correct. The fact that $PX = X$ can be verified using the definition of $P$:

$$
\begin{aligned}
PX &= [X(X'X)^{-1}X']X \\
 &= X[(X'X)^{-1}(X'X)] \\
 &= X. \tag{3.43}
\end{aligned}
$$

The variance–covariance matrix of $\hat{Y}$ can be derived using either the relationship $\hat{Y} = X\hat{\beta}$ or $\hat{Y} = PY$. Recall that $P = X(X'X)^{-1}X'$. Applying the rules for variances of linear functions to the first relationship gives

$$
\begin{aligned}
\operatorname{Var}(\hat{Y}) &= X[\operatorname{Var}(\hat{\beta})]X' \\
 &= X(X'X)^{-1}X'\sigma^2 \\
 &= P\sigma^2. \tag{3.44}
\end{aligned}
$$

The derivation using the second relationship gives

$$
\begin{aligned}
\operatorname{Var}(\hat{Y}) &= P[\operatorname{Var}(Y)]P' \\
 &= PP'\sigma^2 \\
 &= P\sigma^2, \tag{3.45}
\end{aligned}
$$

since $P$ is symmetric and idempotent. Therefore, the matrix $P$ multiplied by $\sigma^2$ gives the variances and covariances for all $\hat{Y}_i$. $P$ is a large $n \times n$ matrix, and at times only a few of its elements are of interest. The variances of any subset of the $\hat{Y}_i$ can be determined by using only the rows of $X$, say $X_r$, that correspond to the data points of interest and applying the first derivation. This gives

$$
\operatorname{Var}(\hat{Y}_r) = X_r[\operatorname{Var}(\hat{\beta})]X_r' = X_r(X'X)^{-1}X_r'\,\sigma^2. \tag{3.46}
$$
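A brief continuation of the same hypothetical sketch shows how $P$ is formed and numerically checks the properties used above (symmetry, idempotency, and $PX = X$), as well as the subset shortcut in (3.46).

```python
import numpy as np

# Same hypothetical design matrix as in the previous sketch.
x = np.array([0.02, 0.07, 0.11, 0.15])
X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)

# Hat matrix P = X (X'X)^{-1} X'; P sigma^2 is Var(Y_hat), equation (3.44).
P = X @ XtX_inv @ X.T

# P is symmetric and idempotent, so derivation (3.45) gives the same answer
# as (3.44): P P' = P.  PX = X, equation (3.43), also checks numerically.
assert np.allclose(P @ P.T, P)
assert np.allclose(P @ X, X)

# Equation (3.46): variances for a subset of fitted values use only the
# relevant rows of X instead of forming the full n x n matrix P.
Xr = X[[0, 2]]                   # e.g., the first and third data points
Var_Yr = Xr @ XtX_inv @ Xr.T     # multiply by sigma^2
print(Var_Yr)
```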

When $\epsilon$ is normally distributed,

$$
\hat{Y} \sim N\big(X\beta,\ P\sigma^2\big). \tag{3.47}
$$

Recall that the vector of residuals $e$ is given by $(I - P)Y$. Therefore, the expectation of $e$ is

$$
\begin{aligned}
E(e) &= (I - P)E(Y) = (I - P)X\beta \\
 &= (X - PX)\beta = (X - X)\beta = 0, \tag{3.48}
\end{aligned}
$$

where $0$ is an $n \times 1$ vector of zeros. Thus, the residuals are random variables with mean zero.

The variance–covariance matrix of the residual vector $e$ is

$$
\operatorname{Var}(e) = (I - P)\sigma^2, \tag{3.49}
$$

again using the result that $(I - P)$ is a symmetric idempotent matrix. If the vector of regression errors $\epsilon$ is normally distributed, then the vector of regression residuals satisfies

$$
e \sim N\big(0,\ (I - P)\sigma^2\big). \tag{3.50}
$$
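Continuing the hypothetical sketch, the residual variance matrix of (3.49) is simply the complement of $P$:

```python
import numpy as np

x = np.array([0.02, 0.07, 0.11, 0.15])
X = np.column_stack([np.ones_like(x), x])
P = X @ np.linalg.inv(X.T @ X) @ X.T
n = X.shape[0]

# Var(e) = (I - P) sigma^2, equation (3.49); I - P is symmetric idempotent.
I_minus_P = np.eye(n) - P
assert np.allclose(I_minus_P @ I_minus_P, I_minus_P)

# Residual variances (times sigma^2): note they are not all equal, and the
# off-diagonal covariances are not zero.
print(np.diag(I_minus_P))
```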

Prediction of a future random observation, $Y_0 = x_0'\beta + \epsilon_0$, at a given vector of independent variables $x_0$ is given by $\hat{Y}_0 = x_0'\hat{\beta}$. It is easy to see that

$$
\hat{Y}_0 \sim N\big(x_0'\beta,\ x_0'(X'X)^{-1}x_0\,\sigma^2\big). \tag{3.51}
$$

This result is used to construct confidence intervals for the mean $x_0'\beta$.

If the future $\epsilon_0$ is assumed to be a normal random variable with mean zero and variance $\sigma^2$, and is independent of the historic errors $\epsilon$, then the prediction error $Y_0 - \hat{Y}_0 = x_0'(\beta - \hat{\beta}) + \epsilon_0$ satisfies

$$
Y_0 - \hat{Y}_0 \sim N\big(0,\ [1 + x_0'(X'X)^{-1}x_0]\sigma^2\big). \tag{3.52}
$$

This result is used to construct a confidence interval for an individual $Y_0$, which we call a prediction interval for $Y_0$. Recall that the variance of $(Y_0 - \hat{Y}_0)$ is denoted by $\operatorname{Var}(\hat{Y}_{\text{pred}_0})$.
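A hedged sketch of (3.51) and (3.52) with the same hypothetical data; the prediction point $x_0$ is an arbitrary choice for illustration. The code computes only the variance multipliers of $\sigma^2$, leaving the estimation of $\sigma^2$ itself aside.

```python
import numpy as np

x = np.array([0.02, 0.07, 0.11, 0.15])
Y = np.array([242.0, 237.0, 231.0, 201.0])    # hypothetical responses
X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

x0 = np.array([1.0, 0.09])   # hypothetical prediction point (leading 1 for intercept)

# Equation (3.51): variance multiplier for the estimated mean x0'beta.
var_mean = x0 @ XtX_inv @ x0          # times sigma^2

# Equation (3.52): variance multiplier for the prediction error Y0 - Y0_hat;
# the extra 1 accounts for the future error epsilon_0.
var_pred = 1.0 + var_mean             # times sigma^2

Y0_hat = x0 @ beta_hat
print(Y0_hat, var_mean, var_pred)
```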

Example 3.8. The matrix $P = X(X'X)^{-1}X'$ was computed for the ozone example in Example 3.3. Thus, with some rounding of the elements in $P$,

$$
\operatorname{Var}(\hat{Y}) = P\sigma^2 =
\begin{bmatrix}
 .741 & .377 & .086 & -.205 \\
 .377 & .283 & .208 & .132 \\
 .086 & .208 & .305 & .402 \\
 -.205 & .132 & .402 & .671
\end{bmatrix}
\sigma^2.
$$

The variance of the estimated mean of $Y$ when the ozone level is .02 ppm is $\operatorname{Var}(\hat{Y}_1) = .741\sigma^2$. For the ozone level of .11 ppm, the variance of the estimated mean is $\operatorname{Var}(\hat{Y}_3) = .305\sigma^2$. The covariance between the two estimated means is $\operatorname{Cov}(\hat{Y}_1, \hat{Y}_3) = .086\sigma^2$.

The variance–covariance matrix of the residuals is obtained as $\operatorname{Var}(e) = (I - P)\sigma^2$. Thus,

$$
\begin{aligned}
\operatorname{Var}(e_1) &= (1 - .741)\sigma^2 = .259\sigma^2 \\
\operatorname{Var}(e_3) &= (1 - .305)\sigma^2 = .695\sigma^2 \\
\operatorname{Cov}(e_1, e_3) &= -\operatorname{Cov}(\hat{Y}_1, \hat{Y}_3) = -.086\sigma^2.
\end{aligned}
$$

It is important to note that the variances of the least squares residuals are not equal to $\sigma^2$ and the covariances are not zero. The assumption of equal variances and zero covariances applies to the $\epsilon_i$, not the $e_i$.

The variance of any particular $\hat{Y}_i$ and the variance of the corresponding $e_i$ will always add to $\sigma^2$ because

$$
\begin{aligned}
\operatorname{Var}(Y) &= \operatorname{Var}(\hat{Y} + e) \\
 &= \operatorname{Var}(\hat{Y}) + \operatorname{Var}(e) + \operatorname{Cov}(\hat{Y}, e) + \operatorname{Cov}(e, \hat{Y}) \\
 &= P\sigma^2 + (I - P)\sigma^2 + P(I - P)'\sigma^2 + (I - P)P'\sigma^2 \\
 &= P\sigma^2 + (I - P)\sigma^2 \\
 &= I\sigma^2, \tag{3.53}
\end{aligned}
$$

where the cross-product terms vanish because $P(I - P)' = P - PP' = 0$.

Since variances cannot be negative, each diagonal element of $P$ must be between zero and one: $0 < v_{ii} < 1.0$, where $v_{ii}$ is the $i$th diagonal element of $P$. Thus, the variance of any $\hat{Y}_i$ is always less than $\sigma^2$, the variance of the individual observations. This shows the advantage of fitting a continuous response model, assuming the model is correct, over simply using the individual observed data points as estimates of the mean of $Y$ for the given values of the $X$s. The greater precision from fitting a response model comes from the fact that each $\hat{Y}_i$ uses information from the surrounding data points. The gain in precision can be quite striking. In Example 3.8, the variances obtained on the estimates of the means for the two intermediate levels of ozone using the linear response equation were $.283\sigma^2$ and $.305\sigma^2$. To attain the same degree of precision without using the response model would have required more than three observations at each level of ozone.
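The complementarity in (3.53) and the bounds on $v_{ii}$ are easy to verify numerically; again the $X$ matrix is the hypothetical one used in the sketches above.

```python
import numpy as np

x = np.array([0.02, 0.07, 0.11, 0.15])
X = np.column_stack([np.ones_like(x), x])
P = X @ np.linalg.inv(X.T @ X) @ X.T
v = np.diag(P)

# Equation (3.53) elementwise: Var(Y_hat_i) + Var(e_i) = sigma^2, i.e., the
# diagonal elements of P and of I - P sum to one for every data point.
assert np.allclose(v + np.diag(np.eye(len(x)) - P), 1.0)

# Each v_ii lies strictly between 0 and 1, so Var(Y_hat_i) < sigma^2.
assert np.all((v > 0) & (v < 1))
print(v)
```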

Equation 3.53 implies that data points having low variance on $\hat{Y}_i$ will have high variance on $e_i$, and vice versa. Belsley, Kuh, and Welsch (1980) show that the diagonal elements of $P$, the $v_{ii}$, can be interpreted as measures of the distance of the corresponding data points from the center of the $X$-space (from $\bar{x}$ in the case of one independent variable). Points that are far from the center of the $X$-space have relatively large $v_{ii}$ and, therefore, relatively high variance on $\hat{Y}_i$ and low variance on $e_i$. The smaller variance of the residuals for the points far from the "center of the data" indicates that the fitted regression line or response surface tends to come closer to the observed values for these points. This aspect of $P$ is used later to detect the more influential data points.
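The distance interpretation can be illustrated directly. For a single regressor with an intercept, the diagonal elements reduce to the standard form $v_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, so a point far from $\bar{x}$ has a large $v_{ii}$. The data below are hypothetical, with one deliberately outlying level.

```python
import numpy as np

# Hypothetical levels; the last point is far from the others.
x = np.array([0.02, 0.07, 0.11, 0.15, 0.40])
X = np.column_stack([np.ones_like(x), x])
P = X @ np.linalg.inv(X.T @ X) @ X.T
v = np.diag(P)

# For one regressor, v_ii = 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2).
n = len(x)
v_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
assert np.allclose(v, v_formula)

# The outlying x has the largest v_ii, hence the smallest Var(e_i).
print(v)
```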

The variances (and covariances) have been expressed as multiples of $\sigma^2$. The coefficients are determined entirely by the $X$ matrix, a matrix of constants that depends on the model being fit and the levels of the independent variables in the study. In designed experiments, the levels of the independent variables are subject to the control of the researcher. Thus, except for the magnitude of $\sigma^2$, the precision of the experiment is under the control of the researcher and can be known before the experiment is run. The efficiencies of alternative experimental designs can be compared by computing $(X'X)^{-1}$ and $P$ for each design. The design giving the smallest variances for the quantities of interest would be preferred.
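As a sketch of such a comparison, the following contrasts $\operatorname{Var}(\hat{\beta}_1)/\sigma^2$ for two hypothetical four-run designs for a straight-line model; the levels are invented for illustration, and the comparison criterion (the slope variance) is one choice among several a researcher might use.

```python
import numpy as np

def xtx_inv(x):
    """(X'X)^{-1} for a straight-line model at the given regressor levels."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.inv(X.T @ X)

# Two hypothetical designs with the same number of runs: levels bunched
# near the middle versus pushed to the extremes of the same interval.
design_a = np.array([0.06, 0.08, 0.09, 0.11])
design_b = np.array([0.02, 0.02, 0.15, 0.15])

# Var(beta1_hat) is sigma^2 times the (1,1) element of (X'X)^{-1}; the
# design with the smaller element estimates the slope more precisely.
print(xtx_inv(design_a)[1, 1], xtx_inv(design_b)[1, 1])
```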
