Chapter 1 Introduction
5.2 The Generalized Linear Mixed Model
5.2.7 Estimation approaches by Schall and by Breslow and Clayton
models. Such models are quite easy to fit in WinBUGS.
• It may be computationally demanding
• It is not widely implemented in existing commercial software packages (it is only freely available in MIXORE and also available in Stata versions 6 and 9)
One might wonder why we opt to use the likelihood conditional on the random effects $u$ in the case of a non-normal response. The answer is that in the case of a normal response and the identity link, the random effects do not appear explicitly in the likelihood, but only through the variance-covariance parameters $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$ in the case of $k$ random effects, or equivalently through the “ratios or gammas” $\gamma_i = \sigma_i^2/\sigma^2$, where $\sigma^2$ is the residual variance. In the above example $E(z_j) = 0$, thus $E[x_{ij}\beta + \gamma z_j] = x_{ij}\beta$.
However, $E[g(\mu)] \neq g[E(\mu)]$ in the case of a non-identity link function $g$, so taking expectations will not cause the random terms to vanish. Hence we will consider a number of alternative approaches based on modifications to the mixed model equations.
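As a small numerical illustration of this inequality, let $g$ be the log link and let $\mu$ take the values 1 and 3 with equal probability. The two-point distribution is an assumption chosen purely for arithmetic convenience and is not taken from the text.

```latex
\begin{align*}
E[g(\mu)] &= \tfrac{1}{2}\log 1 + \tfrac{1}{2}\log 3 \approx 0.549, \\
g[E(\mu)] &= \log\!\left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 3\right) = \log 2 \approx 0.693 .
\end{align*}
```

With the identity link both quantities equal 2, which is why the random terms average out only in that case.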
5.2.7 Estimation approaches by Schall and by Breslow and Clayton
variate $y^*$ is created by linearizing the link function applied to the data about the mean values, as follows:
$$y^* = X\beta + Zu + D(\mu)(y - \mu)$$
where $D(\mu) = dg/d\mu = \mathrm{diag}\{g'(\mu_1), g'(\mu_2), \ldots\}$. Thus the working variate has three components:
(i) fixed effects represented by $X\beta$
(ii) random effects represented by $Zu$ and
(iii) an error term represented by $D(\mu)(y - \mu)$, which depends on the distribution of $y$ and the link function $g$ through $D(\mu)$
The first and third terms numbered above are the same as those of a standard working variate for the standard generalized linear model discussed under the PQL and MQL methods. The second and third terms give:
$\mathrm{var}(y^*) = ZGZ' + DVD$, where $V$ is the variance of $y$ conditional on the random effects $u$, and $DVD$ is a diagonal matrix.
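For two common cases the diagonal elements of $DVD$ can be written out explicitly. The following worked example assumes a Bernoulli response with logit link and a Poisson response with log link, both with unit scale parameter; these choices are illustrative and are not singled out in the text.

```latex
\begin{align*}
\text{Bernoulli, logit:}\quad & g'(\mu_i) = \frac{1}{\mu_i(1-\mu_i)},\quad V_{ii} = \mu_i(1-\mu_i)
  \;\Rightarrow\; (DVD)_{ii} = \frac{1}{\mu_i(1-\mu_i)}, \\
\text{Poisson, log:}\quad & g'(\mu_i) = \frac{1}{\mu_i},\quad V_{ii} = \mu_i
  \;\Rightarrow\; (DVD)_{ii} = \frac{1}{\mu_i}.
\end{align*}
```

In both cases the inverse of $DVD$ is the familiar iterative weight of the corresponding generalized linear model, $\mu_i(1-\mu_i)$ and $\mu_i$ respectively.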
Thus the working variate $y^*$ is described by a linear mixed model with fixed effects $\beta$, random effects $u$ and weights $W = (DVD)^{-1}$. Given an estimate of $\mu$, the standard mixed model equations can be used to estimate $\beta$ and $u$:
$$\begin{bmatrix} X'WX & X'WZ \\ Z'WX & Z'WZ + G^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{u} \end{bmatrix} = \begin{bmatrix} X'Wy^* \\ Z'Wy^* \end{bmatrix} \qquad (5.12)$$
These equations can be solved using the REML algorithm discussed in Chapter 3, which will also give estimates of the variance components $G$. It has been found that restricting the REML algorithm to two iterations, to provide an approximate solution rather than allowing it to converge, does not affect the speed of convergence of the GLMM algorithm and saves computing time (Welham, 1993). Given estimates of $\beta$ and $u$, further estimates of $\mu$ are formed and used to update the working variate $y^*$ and weights $W$ (which is also true in the case of generalized linear models), and the estimation is repeated until convergence.
The method of Schall (1991) uses an estimate of the conditional mean of $y$ given $u$, i.e. $\hat{\mu} = g^{-1}(X\hat{\beta} + Z\hat{u}) = h(X\hat{\beta} + Z\hat{u})$.
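The iteration between the working variate, the weights and the mixed model equations can be sketched in a few lines of code. The following is a minimal illustration only, not the implementation used by any of the packages mentioned in this chapter. It assumes a Poisson response with log link (so that $g'(\mu) = 1/\mu$, $V = \mu$ and $W = \mu$) and, unlike the full algorithm, holds $G$ fixed rather than updating it by REML. The function names `solve` and `pql_poisson_fixed_g` and the toy data are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def pql_poisson_fixed_g(y, X, Z, g_inv, n_iter=100, tol=1e-10):
    """Iterate the mixed model equations for a Poisson response with log link.

    g_inv is the inverse of G = var(u). It is held fixed here, which is an
    assumption of this sketch; the text estimates G by REML at each cycle.
    """
    n, p, q = len(y), len(X[0]), len(Z[0])
    k = p + q
    C = [X[i] + Z[i] for i in range(n)]        # rows of the combined design [X Z]
    eta = [math.log(yi + 0.5) for yi in y]     # conventional GLM starting values
    coef = [0.0] * k
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]        # conditional mean h(X beta + Z u)
        # working variate: y* = eta + g'(mu) (y - mu), with g'(mu) = 1/mu
        y_star = [eta[i] + (y[i] - mu[i]) / mu[i] for i in range(n)]
        w = mu                                 # W = (D V D)^{-1} = mu for Poisson/log
        lhs = [[sum(w[i] * C[i][a] * C[i][b] for i in range(n)) for b in range(k)]
               for a in range(k)]
        for a in range(q):                     # add G^{-1} to the Z'WZ block
            for b in range(q):
                lhs[p + a][p + b] += g_inv[a][b]
        rhs = [sum(w[i] * C[i][a] * y_star[i] for i in range(n)) for a in range(k)]
        new = solve(lhs, rhs)
        change = max(abs(new[j] - coef[j]) for j in range(k))
        coef = new
        eta = [sum(C[i][j] * coef[j] for j in range(k)) for i in range(n)]
        if change < tol:
            break
    return coef[:p], coef[p:]

# Hypothetical toy data: 3 clusters of 4 counts, one covariate, random intercepts.
counts = [2, 1, 3, 0, 4, 5, 3, 6, 1, 0, 2, 1]
x_vals = [-1.0, -0.5, 0.5, 1.0] * 3
X = [[1.0, x] for x in x_vals]
Z = [[1.0 if i // 4 == j else 0.0 for j in range(3)] for i in range(12)]
g_inv = [[4.0 if a == b else 0.0 for b in range(3)] for a in range(3)]  # G = 0.25 I
beta_hat, u_hat = pql_poisson_fixed_g(counts, X, Z, g_inv)
print(beta_hat, u_hat)
```

At convergence the solution satisfies $X'(y - \hat{\mu}) = 0$ and $Z'(y - \hat{\mu}) = G^{-1}\hat{u}$, which is a convenient check on any implementation of this step. Replacing the conditional mean by $h(X\hat{\beta})$ when forming the working variate and weights would give the marginal variant instead.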
This method takes slightly different forms suggested by a number of authors.
Breslow and Clayton (1993) refer to this method as penalized quasi-likelihood or PQL and derive the method using a quasi-likelihood argument, which, following Littell et al. (1996, Chapter 11), can be summarized as follows. In the case of generalized linear models the estimating equations are determined by
$$Q(\mu_i, y_i) = \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)}$$
where $\mu_i$ enters the right-hand side of the equation through the function $\theta_i = \theta(\mu_i)$.
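To make the notation concrete, the functions $\theta$ and $b$ can be written out for two standard distributions. This is a sketch assuming unit scale, $a(\phi_i) = 1$, and the usual exponential family parametrization; neither example is taken from the text.

```latex
\begin{align*}
\text{Poisson:}\quad & \theta_i = \log\mu_i,\quad b(\theta_i) = e^{\theta_i} = \mu_i,
  \quad Q(\mu_i, y_i) = y_i\log\mu_i - \mu_i, \\
\text{Bernoulli:}\quad & \theta_i = \log\frac{\mu_i}{1-\mu_i},\quad
  b(\theta_i) = \log(1+e^{\theta_i}) = -\log(1-\mu_i), \\
  & Q(\mu_i, y_i) = y_i\log\frac{\mu_i}{1-\mu_i} + \log(1-\mu_i).
\end{align*}
```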
In the normal “errors” mixed model of the form
$$y = X\beta + Zu + e$$
the conditional mean of the observations given the random model effects is
$$E[y|u] = X\beta + Zu$$
and the conditional variance is
$$\mathrm{var}[y|u] = R = \mathrm{var}(e).$$
The observations can then be described as
$$y = \mu + e$$
where µ denotes the conditional mean E[y|u]. In the generalized linear mixed model the conditional distribution of y given u plays the same role
as the distribution of y in the fixed effects generalized linear model i.e. the conditional quasilikelihood of an observation yi given µi is
$$Q(\mu_i, y_i|u_i) = \frac{y_i\theta_i - b(\theta_i)}{a(\phi_i)}.$$
The joint quasilikelihood of the observations $y$ is the sum of the quasilikelihood of $y$ given $u$ and the quasilikelihood of $u$. In matrix terms the joint quasilikelihood is
$$Q(\mu, u; y) = y'A^{-1}\theta - (b_\theta^{0.5})'A^{-1}b_\theta^{0.5} - \frac{1}{2}u'G^{-1}u \qquad (5.13)$$
where $A$ is the matrix of $a(\phi_i)$'s, $\theta$ is the vector of $\theta(\mu_i)$'s, $b_\theta$ is the vector of $b(\theta_i)$'s and $G$ is defined as $\mathrm{var}(u)$.
Breslow and Clayton (1993) and Wolfinger and O'Connell (1993) show that solutions for $\beta$ and $u$ can be obtained from $Q(\mu, u; y)$ by iteratively solving the modified mixed model equations as described above. Engel and Keen (1994, 1996) discuss the relationship between Schall's method, which is an extension of the iteratively reweighted least squares algorithm used in the estimation of generalized linear models, and PQL, which is an approximation of ML estimation using Laplace integration, and show that they are equivalent. Schall's method (1991) and the PQL procedure assume that the scale parameter $a(\phi) = 1$. The Wolfinger-O'Connell procedures, which are implemented in the SAS macro GLIMMIX and which they call pseudolikelihood (PL) or restricted pseudolikelihood (REPL), assume that $a(\phi)$ is unknown.
PL obtains a maximum likelihood type estimate of $a(\phi)$, while REPL obtains a REML-like estimate. PQL is a special case of PL when $a(\phi) = 1$ (Wolfinger and O'Connell, 1993). Note that an additional dispersion factor is also incorporated as a residual variance in the IRREML approach of Engel and Keen (1994). In the terminology of Zeger et al. (1988), Schall's method is a ‘subject specific’ (SS) model.
Goldstein (1995) derived the PQL method as an extension of IGLS to nonlinear models, using a Taylor series expansion for the nonlinear function.
We should note that in his notation, a(φ) is not assumed to be unity i.e. his model corresponds to the PL and REPL models of Wolfinger and O’Connell.
The PQL model of Goldstein can be fitted using the software MLn.
An alternative to Schall's model is what the developers of the statistical software Genstat refer to as the marginal model of Breslow and Clayton, variously referred to by other authors (for example Goldstein, 1995) as “marginal quasilikelihood” or MQL. The algorithm for MQL is identical to that for Schall's model, except that the mean is estimated as $\mu = g^{-1}(X\beta)$ (the random term $Zu$ is not used). This model is implemented in Genstat's GLMM procedure and in MLn. It can be considered as an approximation to that of Schall when the $\sigma_i^2$'s are small, and in the terminology of Zeger et al. (1988) as a population averaged or PA model.
Breslow and Clayton (1993) point out that the detailed derivation of the PQL (Schall) model using quasilikelihood involves several ad hoc adjustments and approximations for which no formal justification is given; thus the derivation should be considered as providing heuristic motivation for the estimating algorithm that is used, and this algorithm should be studied in its own right. They further note that some of the approximations made in the derivation are likely to improve as the normal approximation becomes more plausible, for example as the denominators of binomial proportions increase or as the means of Poisson observations increase. This will not be the case for a binary outcome, which is what we observe in many surveys.
Breslow and Clayton (1993) feel that even with binary data the PQL estimates of the fixed effects and their standard errors are sufficiently accurate for many practical purposes; however, inference on variance parameters is less satisfactory, and in simulation studies the procedure often converged to a non-positive definite matrix when the binomial denominator was 1 or 2. They point out that this is due to the fact that when the response probabilities are small and the data are highly discrete, only limited information is present for estimating random effects and their associated variances and covariances.
While this applies to ML estimates as well, these may be consistent where PQL estimates are not, for example for paired binary data with random pair effects.
Breslow and Clayton (1993) derive the model leading to MQL estimation (the marginal model of Breslow and Clayton as implemented in Genstat) by specifying the generalized linear model in terms of the marginal mean as $E[y_i] = \mu_i = h(x_i'\beta)$, where $h$ is the inverse link function. If the link function is not the identity, the marginal mean so defined does not in general coincide with the marginal mean calculated from the conditional formulation
$$E[y|u] = h(X\beta + Zu).$$
Zeger et al. (1988) investigated marginal models of the above form for longitudinal designs. They showed that the true marginal mean of the hierarchical model with normally distributed random effects could (at least approximately) be expressed in the form of the equation above, but with altered values for the regression variables or regression coefficients. With the log link, for example, they showed that $E[y_i] = \mu_i = \exp(x_i'\beta + 0.5 z_i'Dz_i)$, where $D$ here denotes the covariance matrix of the random effects, so the random effects add an offset to the equations for the marginal mean.
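The log-link result follows from the moment generating function of the normal distribution. A short derivation, assuming that the random effects $u_i$ are normal with mean zero and covariance matrix $D$, so that $z_i'u_i \sim N(0, z_i'Dz_i)$:

```latex
\begin{align*}
E[y_i] = E\{E[y_i \mid u_i]\}
  &= E[\exp(x_i'\beta + z_i'u_i)] \\
  &= \exp(x_i'\beta)\, E[\exp(z_i'u_i)] \\
  &= \exp(x_i'\beta + 0.5\, z_i'Dz_i).
\end{align*}
```

For a random intercept with variance $\sigma_u^2$ the offset is simply $0.5\sigma_u^2$, so only the intercept of the marginal model differs from that of the conditional model.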
With the logit link the $\beta$ coefficients are attenuated by a factor $c_i$ such that
$$E(y_i) = \frac{\exp(c_i x_i'\beta)}{1 + \exp(c_i x_i'\beta)} \qquad (5.14)$$
where $c_i = \det(c^2 D z_i z_i' + I)^{-0.5} = (1 + c^2 z_i'Dz_i)^{-0.5}$ and $c = 16\sqrt{3}/(15\pi)$ (Breslow and Clayton (1993), equation (18)).
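The quality of the attenuation formula (5.14) can be checked numerically. The sketch below assumes a random intercept model, so that $z_i = 1$ and $z_i'Dz_i = \sigma_u^2$, and compares the approximation with the marginal mean obtained by integrating the conditional mean over $u \sim N(0, \sigma_u^2)$ using a simple trapezoid rule. The function names and the grid of values are hypothetical choices made for illustration.

```python
import math

C_CONST = 16 * math.sqrt(3) / (15 * math.pi)  # the constant c of (5.14), about 0.588

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def normal_mean(f, sigma, half_width=10.0, steps=4000):
    """E[f(U)] for U ~ N(0, sigma^2), by the trapezoid rule on z = U / sigma."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def marginal_exact(eta, sigma_u):
    """Marginal mean E[expit(eta + u)] by numerical integration."""
    return normal_mean(lambda u: expit(eta + u), sigma_u)

def marginal_approx(eta, sigma_u):
    """Marginal mean from (5.14) with c_i = (1 + c^2 sigma_u^2)^(-1/2)."""
    c_i = (1.0 + C_CONST ** 2 * sigma_u ** 2) ** -0.5
    return expit(c_i * eta)

for sigma_u in (0.5, 1.0, 2.0):
    for eta in (-2.0, 0.5, 2.0):
        print(sigma_u, eta,
              round(marginal_exact(eta, sigma_u), 4),
              round(marginal_approx(eta, sigma_u), 4))
```

The two columns should agree to within a few hundredths on the probability scale, with both moving towards 0.5 as $\sigma_u$ grows.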
This result could be used to alleviate the bias in the estimation of both mean and variance parameters by treating the ci as a multiplicative offset whose
values depend on the current model parameters θ. A number of suggested improvements to the estimation algorithms for both PQL and MQL will be considered below.
Breslow and Clayton (1993) compare the PQL and MQL methods; they point out that the regression coefficients of PQL depend strongly on the estimated variance components when we do not have an identity link, even in large samples, whereas this is not the case for MQL. They regard marginal models as being more appropriate when interest is focussed on the marginal relationship between covariates and the response, in which case the random effects model serves mainly to suggest a plausible covariance structure that enables us to get reasonably efficient estimating equations for the mean value parameters, while PQL is the method of choice for estimating parameters in the hierarchical model.
To consider this in more detail, we will now return to the concepts of population average (PA) and cluster specific (CS) models in survey analysis.
The terms subject specific (SS) and population average (PA) were introduced by Zeger et al. (1988) in an article discussing extensions of GLMs for the analysis of longitudinal data. As an example they considered data for 537 children from Steubenville, Ohio, each of whom was examined annually from age 7 to 10. Whether the child had a respiratory infection in the year prior to each examination was reported by the mother. The mother's smoking status [regular smoker (1) or not (0)] was determined at the first interview, and was regarded as a time-independent variable. They considered the following models:
(1) In the population average (PA) model the marginal probability of respiratory infection $\mu_{it}$ was assumed to satisfy the model
$$\mathrm{logit}(\mu_{it}) = \beta_0^* + \beta_1^* x_1 + \beta_2^* x_2 + \beta_3^* x_3$$
where
$x_1$ = 1 if the mother smokes and 0 otherwise,
$x_2$ = age in years since the 9th birthday, and
$x_3 = x_1 x_2$ = age if the mother smokes and 0 otherwise.
Zeger et al. (1988) argue that the β∗ describe how the population averaged response depends on covariates, so in particular β1∗ compares the rate of respiratory disease for children whose mothers smoke to the rate for children whose mothers do not smoke.
(2) In the subject specific (SS) model, the probability of respiratory infection for an individual is described as
$$\mathrm{logit}(w_{it}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u_i$$
where $w_{it} = E[y_{it}|u_i]$ and it is assumed that $u_i \sim N(0, \sigma_u^2)$. Zeger et al. (1988) argue that in this case $\beta_1$ indicates how one child's risk would change if his/her mother changed her smoking status. They present a number of findings:
1. As has been noted above, with the logit link the PA parameters are attenuated, i.e. the random effects variability shrinks the fixed effects parameters towards 0, with the degree of shrinkage depending on the variance of the random effects.
2. Estimates of the SS parameters are correlated with the variance parameters of the random effects, even asymptotically. Thus the precision of estimating $\beta$ depends on that of the variance parameters, and these parameters are more difficult to estimate from longitudinal data with a nonlinear link and few observations per subject.
3. In PA models, only the link function needs to be correctly specified to make consistent inferences about PA coefficients. On the other hand, the SS model uses not only the information contained in the population averaged response, but also a distributional assumption about the heterogeneity among subjects; thus for the SS models both the link function and the random effects distribution must be correctly specified for consistent inferences.
4. SS models are desirable when the response for an individual rather than the population is the focus; they give growth curves as an example. PA models are most efficiently used in population studies, such as in epidemiology, where the difference in the population averaged response between two groups with different risk factors is more the focus than the change in an individual's response.
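The first finding can be illustrated by converting a subject specific coefficient into the population averaged coefficient it implies. The sketch below uses a random intercept logistic model with a single binary covariate (such as maternal smoking). The coefficient values and the random effects standard deviation are hypothetical and are not estimates from the Steubenville data; the marginal probabilities are obtained by numerical integration over $u_i \sim N(0, \sigma_u^2)$.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))

def normal_mean(f, sigma, half_width=10.0, steps=4000):
    """E[f(U)] for U ~ N(0, sigma^2), by the trapezoid rule on z = U / sigma."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(sigma * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

# Hypothetical subject specific (SS) parameters, chosen for illustration only.
beta0, beta1, sigma_u = -1.5, 0.5, 1.5

def pa_probability(x1):
    """Population averaged probability: the SS probability averaged over u."""
    return normal_mean(lambda u: expit(beta0 + beta1 * x1 + u), sigma_u)

p_unexposed = pa_probability(0)
p_exposed = pa_probability(1)
beta1_pa = logit(p_exposed) - logit(p_unexposed)  # implied PA log odds ratio

print("SS log odds ratio:", beta1)
print("PA log odds ratio:", round(beta1_pa, 3))
```

The implied PA log odds ratio lies between 0 and the SS value, and it moves closer to 0 as `sigma_u` is increased.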
Pendergast et al. (1996) comment in detail on the example given in Zeger et al. (1988) discussed above. They point out that the difference in the interpretation of the maternal smoking effect in the two models is difficult to conceptualize, and that insight may be gained from considering a slightly different scenario. Suppose that the covariate measured was whether or not the mother currently smoked, and that this was measured at the same time points as the child's respiratory disease status, i.e. “mother's smoking status” could now be considered as a within-cluster covariate, since it could change from timepoint to timepoint (i.e. from subunit to subunit). The SS model would then allow direct observation and estimation of the average effect (in terms of the log odds ratio) of the change in smoking status upon respiratory disease status, assuming that there were subjects in which a change in smoking status was observed. The coefficient for “maternal smoking” represents the common log odds ratio for respiratory disease of mother's smoking status across children. The PA model, on the other hand, ignores the fact that the effect of a change in the mother's smoking status for a given child had actually been measured, and succeeds in estimating only the odds ratio between smoking and nonsmoking mothers. Mothers who had changed smoking status would appear in both groups.
However, if in fact no mother changes smoking status during the study, the covariate would again be a between-subject covariate and no effect of changing smoking status can be directly observed. The PA model measures the log odds ratio between groups of mothers, whereas the SS model purports to measure the effect of a change in the mother's smoking status, and the interpretation of that coefficient is entirely model based; in fact it can be considered as a form of extrapolation with no data available to check its validity. Thus Pendergast et al. (1996) conclude that SS interpretation of covariates which do not change within a subject is difficult. On the other hand, with covariates that do change within a subject, marginal models ignore the observable information obtained when subjects serve as their own controls and a change is observed.
Neuhaus, Kalbfleisch and Hauck (1991) compare SS and PA model approaches by analyzing clustered binary data, looking in particular at parameter interpretation. The data used are from a study of breast disease conducted in San Francisco. One component of the study consisted of obtaining a sample of fluid from both breasts of all study women. The binary outcome was whether a sample of breast fluid could ($Y = 1$) or could not ($Y = 0$) be obtained from each breast, and the covariates considered were $X_2$ = age in years, $X_3$ = age in years at menarche, a binary indicator $X_4 = 1$ if the woman was parous and $X_4 = 0$ if not, and a binary indicator of whether ($X_1 = 1$) or not ($X_1 = 0$) physical examination of each breast found evidence of dysplasia. The sample comprised 490 white premenopausal women who had no breast disease. Their comparison is between a model fitted using numerical quadrature and a model fitted using GEE, rather than PQL and MQL. They examine one representative of each approach: for SS models they look at the mixed effects logistic model, while for PA models they use the GEE approach of Liang et al. (1986). They compare the two approaches for the case of a single covariate $x$, pointing out that the results generalize easily to the case of several covariates.