Applications of Measurement Models
8.2 Multiple Populations in IRT
8.2.1 Differences between populations
Suppose respondents are sampled from two populations, say males and females, and the interest is in evaluating the difference in the mean ability level of the two populations. A background variable gender is introduced as
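$$x_i = \begin{cases} 1 & \text{if respondent } i \text{ is male} \\ 0 & \text{if respondent } i \text{ is female} \end{cases} \qquad (1)$$
and gender differences in ability are modeled as
$$\theta_i = \mu + \beta x_i + e_i \qquad (2)$$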
where it is assumed that ei has a normal distribution with mean zero and variance σ². Note that µ is the mean ability level of the females, while µ+β is the mean of the males. So β is the effect of being male. In IRT, the assumption that the variance σ² is equal over groups is easily generalized to the assumption that groups have unique variances, say σg² (g=1,2).
Maximum marginal likelihood (MML) estimation is probably the most used technique for parameter estimation in IRT models (Bock & Aitkin, 1981; Thissen, 1982; Mislevy, 1984, 1986; Glas & Verhelst, 1989). In this approach, a distinction is made between structural parameters, which need to be consistently estimated, and nuisance parameters, which are not of primary interest. MML estimation derives its name from maximizing the log-likelihood that is marginalized with respect to the nuisance parameters. In the present case, the likelihood is marginalized with respect to the ability parameters θ. Writing p(yi | θi, b) for the probability of the response pattern yi of respondent i given ability θi and item parameters b, this leads to the marginal likelihood
$$L(b, \mu, \beta, \sigma) = \prod_{i} \int p(y_i \mid \theta_i, b)\, g(\theta_i \mid \mu, \beta, \sigma, x_i)\, d\theta_i \qquad (3)$$
where g(θi | µ, β, σ, xi) stands for the normal density. The reason for maximizing the marginal rather than the joint likelihood of all parameters is that maximizing the latter does not lead to consistent estimates. The number of person parameters grows in proportion to the number of observations, and, in general, this leads to inconsistency (Neyman & Scott, 1948). Simulation studies by Wright and Panchapakesan (1969) show that these inconsistencies can indeed occur in IRT models.
Kiefer and Wolfowitz (1956) have shown that MML estimates of structural parameters, say the item and population parameters of an IRT model, are consistent under fairly reasonable regularity conditions, which motivates the general use of MML in IRT models.
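The integral over θ in the marginal likelihood has no closed form, so in practice it is approximated numerically. The following is a minimal sketch of one common way to do this, Gauss-Hermite quadrature, for the 1PLM with a single group effect β and a common standard deviation σ. The function name marginal_loglik, the NaN coding for items that were not administered, and the number of quadrature nodes are assumptions made for illustration; this is not the estimation routine used by the authors.

```python
import numpy as np

def marginal_loglik(y, x, b, mu, beta, sigma, n_nodes=21):
    """Marginal log-likelihood of the 1PLM with a group effect on ability.

    y: (N, K) array of 0/1 responses, np.nan where an item was not administered
    x: (N,) array of 0/1 group indicators
    b: (K,) array of item difficulties
    sigma: common standard deviation of ability (a scalar in this sketch)
    """
    # Gauss-Hermite nodes and weights for integrals of the form f(t) * exp(-t^2)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Substituting theta = mean_i + sqrt(2) * sigma * t turns the normal density
    # into the exp(-t^2) kernel; a factor 1 / sqrt(pi) remains.
    theta = (mu + beta * x)[:, None] + np.sqrt(2.0) * sigma * nodes[None, :]  # (N, M)
    # 1PLM success probability for every person, node and item: (N, M, K)
    p = 1.0 / (1.0 + np.exp(-(theta[:, :, None] - b[None, None, :])))
    observed = ~np.isnan(y)
    y0 = np.where(observed, y, 0.0)
    # Log-probability of each response at each node; missing items contribute 0
    ll = y0[:, None, :] * np.log(p) + (1.0 - y0[:, None, :]) * np.log1p(-p)
    ll = np.where(observed[:, None, :], ll, 0.0).sum(axis=2)                  # (N, M)
    # Weighted sum over nodes approximates the integral over theta per person
    person_lik = (np.exp(ll) * weights[None, :]).sum(axis=1) / np.sqrt(np.pi)
    return np.log(person_lik).sum()
```

Maximizing this function over b, β and σ (with µ fixed to identify the scale) with a general-purpose optimizer would give MML estimates for this simple case. Group-specific variances would require passing one σ per respondent instead of a scalar.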
Table 8.7 gives a small simulated example of the procedure. The data were generated with the 1PLM. The design consisted of 9 items administered to two groups. Group 1 consisted of 100 simulees, who responded to the items 1 to 6. The second group consisted of 400 simulees, responding to the items 4 to 9. So the items in the so-called “anchor” (items 4 to 6) were responded to by 500 simulees. The true item parameters are shown in the second column of Table 8.7; the MML parameter estimates and their standard errors are shown in the third and fourth column, respectively.
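The anchor design described above can be reproduced in a few lines. The sketch below generates 1PLM responses for the two groups and counts how many simulees answered each item. The generating difficulties, β and σg are taken from Table 8.7, but the text does not say which group receives the shift β, so assigning it to group 1 is an assumption of this sketch, as are the seed and the helper name simulate_group.

```python
import numpy as np

rng = np.random.default_rng(1)

b = np.array([-1.0, 0.0, 1.0] * 3)       # generating difficulties from Table 8.7
n1, n2 = 100, 400                        # group sizes
K = b.size

def simulate_group(n, mean, sd, items):
    """Draw abilities and 1PLM responses; items not administered stay NaN."""
    theta = rng.normal(mean, sd, size=n)
    y = np.full((n, K), np.nan)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[items][None, :])))
    y[:, items] = (rng.random(p.shape) < p).astype(float)
    return y

# Assumption: group 1 has mean beta = 1 and sd 1.0, group 2 has mean 0 and sd 1.5
y1 = simulate_group(n1, mean=1.0, sd=1.0, items=np.arange(0, 6))   # items 1 to 6
y2 = simulate_group(n2, mean=0.0, sd=1.5, items=np.arange(3, 9))   # items 4 to 9
y = np.vstack([y1, y2])

responses_per_item = (~np.isnan(y)).sum(axis=0)
print(responses_per_item)   # [100 100 100 500 500 500 400 400 400]
```

The counts make the linking role of the anchor visible: only items 4 to 6 are answered by both groups, which is what places all nine item parameters on a common scale.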
Table 8.7 Parameter Values and Estimates.
Item   True bk   Estimate of bk   se(bk)
1      −1.00     −0.71            0.33
2       0.00     −0.04            0.30
3       1.00      1.18            0.29
4      −1.00     −1.10            0.14
5       0.00     −0.17            0.13
6       1.00      1.09            0.14
7      −1.00     −1.09            0.15
8       0.00      0.00            0.14
9       1.00      0.94            0.15
Pop    True β    Estimate of β    se(β)
1       1.00      1.07            0.22
Pop    True σg   Estimate of σg   se(σg)
1       1.00      1.13            0.18
2       1.50      1.45            0.10
Note that the standard errors decrease as the number of simulees responding to an item increases, roughly in proportion to the inverse of its square root: items 1 to 3, answered by 100 simulees, have standard errors of about 0.30, whereas the items answered by 400 or 500 simulees have standard errors of about 0.14. The bottom lines of the table give the generating values for β and σg (g=1,2), their estimates and their standard errors. In this example, µ has been set equal to zero to identify the scale of θ. The test whether the two groups have the same mean ability level, that is, the test of the null hypothesis β=0 against the alternative β≠0, can be based on the ratio of the estimate to its standard error: z = β̂/se(β̂). In the present case, the outcome is 1.07/0.22=4.864. Under the null hypothesis the statistic has a standard normal distribution, so the null hypothesis is clearly rejected.
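The test described above can be carried out with the Python standard library alone. This short sketch computes the ratio and a two-sided p-value from the standard normal distribution; the variable names are chosen for illustration.

```python
import math

beta_hat, se_beta = 1.07, 0.22           # estimate and standard error from Table 8.7
z = beta_hat / se_beta
# Two-sided p-value under the standard normal:
# 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2.0))
print(f"z = {z:.3f}, two-sided p = {p_value:.2e}")
```

A value of z this far beyond the conventional critical value of 1.96 corresponds to a p-value far below 0.05.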
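This approach can be generalized in various ways. One could introduce a second variable, say
$$x_{2i} = \begin{cases} 1 & \text{if respondent } i \text{ lives in an urban area} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
and consider the model
$$\theta_i = \mu + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{12} x_{1i} x_{2i} + e_i \qquad (5)$$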
If x1i stands for gender, then β12 stands for the interaction of being male and living in an urban area, and, as above, a test of the hypothesis β12=0 against the alternative β12≠0 can be based on the parameter estimate relative to its standard error. The next section gives further generalizations of this approach.
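To see what β12 measures, it may help to write out the expected ability in the four groups. The lines below assume that the model has the usual form with two dummy variables and their product, with x1i equal to 1 for males and x2i equal to 1 for respondents living in an urban area; the coding of x2i is an assumption made here.

```latex
\begin{aligned}
E(\theta_i \mid x_{1i}=0,\; x_{2i}=0) &= \mu \\
E(\theta_i \mid x_{1i}=1,\; x_{2i}=0) &= \mu + \beta_1 \\
E(\theta_i \mid x_{1i}=0,\; x_{2i}=1) &= \mu + \beta_2 \\
E(\theta_i \mid x_{1i}=1,\; x_{2i}=1) &= \mu + \beta_1 + \beta_2 + \beta_{12}
\end{aligned}
```

Under this coding the difference between males and females is β1 outside urban areas and β1+β12 within them, so β12 is the difference between the two gender gaps, and β12=0 means the gender effect is the same in both settings.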
8.2.2 Multilevel regression models on ability
In much social research, elementary units are clustered in higher-level units. A well-known example is educational research, where pupils or students are nested within classrooms, classrooms within schools, schools within districts and so on. Multilevel models (ML models) have been developed to take the resulting hierarchical structure into account, mostly by using regression-type models with random coefficients (Aitkin, Anderson & Hinde, 1981; Goldstein, 1986; Raudenbush & Bryk, 1986; Longford, 1987).
However, if variables in these multilevel models contain large measurement errors, the resulting statistical inferences can be very misleading (Fuller, 1987). Measurement error can be modeled in the framework of classical test theory (see, for instance, Longford, 1993) and IRT (Mislevy & Bock, 1989; Adams, Wilson & Wu, 1997; Fox & Glas, 2001, 2002). In the classical framework, the variance component due to unreliability can either be imputed in the model, or it can be estimated within the model, for instance by splitting test scores into subtest scores. The IRT framework is a generalization of the linear model described in the previous section. The approach entails the definition of a multilevel linear model, where latent variables from IRT measurement models are entered either as dependent or as independent variables. The resulting model is the so-called multilevel IRT model (MLIRT model, Fox & Glas, 2001, 2002).
The general model is defined as follows. The dependent variables are observed item scores yijk, where the index i (i=1,…, nj) signifies the respondents, the index j (j=1,…, J) signifies the level two clusters, say the schools, and the index k (k=1,…, K) signifies the items. The first level of the structural multilevel model is formulated as
$$\theta_{ij} = \beta_{0j} + \beta_{1j} x_{1ij} + \cdots + \beta_{q'j} x_{q'ij} + \beta_{(q'+1)j} \xi_{(q'+1)ij} + \cdots + \beta_{Qj} \xi_{Qij} + e_{ij} \qquad (6)$$
where the covariates xqij (q=1,…, q’) are manifest predictors and the covariates ξqij (q=q’+1,…, Q) are latent predictors. Finally, eij are independent and normally distributed error variables with mean zero and variance σ². In general, it is assumed that the regression coefficients βqj are random over groups, but they can also be fixed parameters. In that case, βqj=βq for all j. The Level 2 model for the random coefficients is given by
$$\beta_{qj} = \gamma_{q0} + \gamma_{q1} z_{1qj} + \cdots + \gamma_{qs'} z_{s'qj} + \gamma_{q(s'+1)} \zeta_{(s'+1)qj} + \cdots + \gamma_{qS} \zeta_{Sqj} + u_{qj} \qquad (7)$$
where zsqj (s=1,…, s’) and ζsqj (s=s’+1,…, S) are manifest and latent predictors, respectively. Further, uqj are error variables which are assumed independent over j and have a Q-variate normal distribution with a mean equal to zero and a covariance matrix T.
Figure 8.9 Path diagram of a multilevel IRT model.
An example of an MLIRT model is given in the path diagram in Figure 8.9. The structural multilevel part is presented in the big square box in the middle. The structural model has two levels: the upper part of the box gives the first level (a within-schools model), and the lower part of the box gives the second level (a between-schools model). The dependent variable θij, say math ability, is measured by three items. The responses to these items are modeled by the 2PLM with item parameters ak and bk, k=1,…, 3. Note that the measurement error models are presented by the ellipses. Both levels have three independent variables: two are observed directly, and one is a latent variable with three binary observed variables. For instance, on the first level, X1ij could be gender, X2ij could be age, and ξ3ij could be intelligence as measured by a three-item test. On the second level, Z10j could be school size, Z20j could be the school budget and ζ30j could be a school’s pedagogical climate, again measured by a three-item test. In order not to complicate the model, it is assumed that only the intercept β0j is random, so the Level 2 predictors are only related to this random intercept and the slopes are fixed.
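The structure in Figure 8.9 can be made concrete by generating data from a reduced version of it. The sketch below keeps the random intercept, two manifest Level 1 predictors, one manifest Level 2 predictor and a three-item 2PLM for the dependent variable, and it leaves out the latent predictors for brevity. All numerical values are hypothetical, and the parameterization a_k(θ − b_k) of the 2PLM is an assumption, since the text does not state which form is used.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n_j, K = 50, 20, 3                    # schools, pupils per school, items (hypothetical)

# Level 2: random intercept beta_0j = gamma_00 + gamma_01 * z_j + u_0j
gamma_00, gamma_01, tau = 0.0, 0.5, 0.6  # hypothetical generating values
z = rng.normal(size=J)                   # a manifest school-level predictor
beta_0 = gamma_00 + gamma_01 * z + rng.normal(0.0, tau, size=J)

# Level 1: theta_ij = beta_0j + beta_1 * x_1ij + beta_2 * x_2ij + e_ij (fixed slopes)
beta_1, beta_2, sigma = 0.3, 0.2, 1.0    # hypothetical generating values
school = np.repeat(np.arange(J), n_j)    # school index of every pupil
x1 = rng.integers(0, 2, size=J * n_j)    # e.g. gender
x2 = rng.normal(size=J * n_j)            # e.g. standardized age
theta = beta_0[school] + beta_1 * x1 + beta_2 * x2 + rng.normal(0.0, sigma, size=J * n_j)

# Measurement model: 2PLM with discriminations a_k and difficulties b_k
a = np.array([1.0, 1.2, 0.8])
b = np.array([-0.5, 0.0, 0.5])
p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
y = (rng.random(p.shape) < p).astype(int)   # y[i, k]: response of pupil i to item k
```

Only y, x1, x2, z and the school index would be available to the analyst; theta and beta_0 are unobserved, which is why the structural and measurement parts have to be estimated together.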
The parameters in the MLIRT model can be estimated in a Bayesian framework with a version of the Markov chain Monte Carlo (MCMC) estimation procedure: the Gibbs sampler (Fox & Glas, 2001, 2002). There are many considerations when choosing between a frequentist framework (such as MML) and a Bayesian framework (such as MCMC), but the reason for adopting the Bayesian approach given by Fox and Glas (2001, 2002) is a practical one: MML involves integration over the nuisance parameters, and in the present case these integrals become quite complex. In the Bayesian approach, the interest is in the posterior distribution of the parameters, say p(θ, δ, β, µ, σ | y). In the MCMC approach, samples are drawn from the posterior distribution of the parameters of interest, and in this process, nuisance parameters can play a role as auxiliary variables. So the problem of complex multiple integrals does not arise here.
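The idea of replacing integration by sampling can be illustrated on a much simpler model than the MLIRT model. The sketch below is a Gibbs sampler for a random-intercept model with a directly observed dependent variable, so it is not the sampler of Fox and Glas; in the full MLIRT model the abilities and item parameters would be additional blocks drawn in every sweep. The priors (flat on the grand mean, inverse-gamma on both variances), the starting values and the function name are assumptions of this sketch.

```python
import numpy as np

def gibbs_random_intercept(y, school, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for y_ij = beta_j + e_ij with beta_j ~ N(gamma, tau2)
    and e_ij ~ N(0, sigma2).

    Assumed priors: flat on gamma, inverse-gamma(a0, b0) on sigma2 and tau2.
    Returns an array of retained draws with columns (gamma, sigma2, tau2).
    """
    rng = np.random.default_rng(seed)
    a0, b0 = 0.01, 0.01
    J = school.max() + 1
    n_j = np.bincount(school, minlength=J)               # pupils per school
    sum_j = np.bincount(school, weights=y, minlength=J)  # sum of scores per school
    gamma, sigma2, tau2 = y.mean(), y.var(), 1.0         # starting values
    draws = []
    for it in range(n_iter):
        # 1. School intercepts given all other parameters (normal full conditional)
        prec = n_j / sigma2 + 1.0 / tau2
        beta = rng.normal((sum_j / sigma2 + gamma / tau2) / prec, np.sqrt(1.0 / prec))
        # 2. Grand mean given the intercepts
        gamma = rng.normal(beta.mean(), np.sqrt(tau2 / J))
        # 3. Variances given the residuals (inverse-gamma full conditionals,
        #    drawn as the reciprocal of a gamma variate)
        resid = y - beta[school]
        sigma2 = 1.0 / rng.gamma(a0 + y.size / 2.0, 1.0 / (b0 + 0.5 * resid @ resid))
        dev = beta - gamma
        tau2 = 1.0 / rng.gamma(a0 + J / 2.0, 1.0 / (b0 + 0.5 * dev @ dev))
        if it >= burn_in:
            draws.append((gamma, sigma2, tau2))
    return np.array(draws)
```

Each step draws one block of parameters from its distribution conditional on the current values of all the others. The school intercepts are never integrated out; they are simply sampled along with everything else, which is the sense in which nuisance parameters act as auxiliary variables.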
To give some idea of the output of the procedure, consider an application reported by Shalabi (2002). The data were a cluster sample of 3,384 grade 7 students in 119 schools. At student level the variables were gender (0=male, 1=female), SES (with two indicators: the father’s and mother’s education, scores ranged from 0 to 8), and IQ (range from 0 to 80). At school level the variables were leadership (measured by a scale consisting of 25 five-point Likert items, administered to the school teachers), school climate (measured by a scale consisting of 23 five-point Likert items) and mean IQ (the IQ scores aggregated at school level). The item scores for the leadership and climate variables were recoded to be dichotomous (scores 0, 1 and 2 were recoded to 0; scores 3 and 4 were recoded to 1). The dependent variable was a mathematics achievement test consisting of 50 multiple-choice items. The 2PLM was used to model the responses on the leadership and school climate questionnaire and the mathematics test.
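The recoding of the Likert items is a simple threshold. As a small illustration, assuming the item scores are stored as integers 0 to 4 (the function name dichotomize is hypothetical):

```python
import numpy as np

def dichotomize(scores):
    """Recode five-point Likert scores 0..4 to binary: 0, 1, 2 -> 0 and 3, 4 -> 1."""
    scores = np.asarray(scores)
    return (scores >= 3).astype(int)

print(dichotomize([0, 1, 2, 3, 4]))   # [0 0 0 1 1]
```

Dichotomizing makes the items suitable for the 2PLM, at the cost of discarding the distinction between categories that are merged.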
The parameters were estimated with the Gibbs sampler. For a complete description of all analyses, one is referred to Shalabi (2002); here only the estimates of the final model are given as an example. The model is given by
and
The results are given in Table 8.8. The estimates of the MLIRT model are compared with a traditional multilevel analysis where all variables were manifest. The observed mathematics, leadership and school climate scores were transformed in such a way that their scale was comparable to the scale used in the MLIRT model. Further, the parameters of the ML model were also estimated with a Bayesian approach using the Gibbs sampler. The columns labeled C.I. give the 90% credibility intervals of the point estimates; they were derived from the posterior standard deviation. Note that the credibility intervals of the regression coefficients do not contain zero, so all coefficients can be considered significant at the 90% level. It can be seen that the magnitudes of the fixed effects in the MLIRT model were larger than the analogous estimates in the ML model. This finding is in line with other findings (Fox & Glas, 2001, 2002; Shalabi, 2002), which indicate that the MLIRT model has more power to detect effects in hierarchical data where some variables are measured with error.
Table 8.8 Estimates of the Effects of Leadership, Climate and Mean IQ.
        MLIRT estimates               ML estimates
        Estimate   C.I.               Estimate   C.I.
γ00     −1.096     [−2.080, −0.211]   −0.873     [−1.20, −0.544]
β1       0.037     [0.029, 0.044]      0.031     [0.024, 0.037]
β2       0.148     [0.078, 0.217]      0.124     [0.061, 0.186]
β3       0.023     [0.021, 0.025]      0.021     [0.019, 0.022]
γ01      0.017     [0.009, 0.043]      0.014     [0.004, 0.023]
γ02      0.189     [0.059, 0.432]      0.115     [0.019, 0.210]
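The text states that the credibility intervals were derived from the posterior standard deviation. One common way to do this, assumed here since the exact procedure is not described, is a normal approximation: posterior mean plus or minus 1.645 posterior standard deviations for a 90% interval. The numbers in the sketch are hypothetical and are not taken from Table 8.8.

```python
from statistics import NormalDist

def normal_interval(post_mean, post_sd, level=0.90):
    """Interval from a normal approximation to the posterior distribution."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.645 for a 90% interval
    return post_mean - z * post_sd, post_mean + z * post_sd

# Hypothetical posterior summary, not taken from Table 8.8
lo, hi = normal_interval(0.15, 0.04)
excludes_zero = lo > 0 or hi < 0    # the criterion used in the text
print(f"90% interval: [{lo:.3f}, {hi:.3f}], excludes zero: {excludes_zero}")
```

An alternative that does not rely on normality is to take the 5th and 95th percentiles of the retained MCMC draws directly.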