

Measurement Models in Assessment and Evaluation


Takane and de Leeuw (1987) point out that this model, too, is equivalent both to a MIRT model for graded scores (Samejima, 1969) and to a factor analysis model for ordered categorical data (Muthén, 1984). The model can be estimated using standard software for factor analysis (see above) or using a fully Bayesian approach in combination with an MCMC algorithm (Shi & Lee, 1998).

In the framework of adjacent categories models, the logistic version of the probability of a response in category m can be written as

p_km(θi) = exp( Σ_{h=1}^{m} ( Σ_q a_kq θ_iq − b_kh ) ) / h(θi, ak, bk),

where h(θi, ak, bk) is a normalizing factor that assures that the sum of the probabilities over all possible responses on an item is equal to one. The probability p_km(θi) is determined by the compound Σ_q a_kq θ_iq, so every item addresses the abilities of a respondent in a unique way.

Given this ability compound, the probabilities of responding in a certain category are analogous to the unidimensional partial credit model by Masters (1982). Firstly, the factor m indicates that the response categories are ordered and that the expected item score increases as the ability compound increases. Secondly, the item parameters bkh are the points where the ability compound has such a value that the odds of scoring in category m−1 or in category m are equal.
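The category probabilities of this multidimensional adjacent categories model can be sketched in a few lines of Python. The parametrization below (cumulative sums of the compound minus the category parameters, normalized to sum to one) is one standard way to write the partial credit form described above; the item and ability values are hypothetical.

```python
import numpy as np

def mpcm_probs(theta, a, b):
    """Category probabilities for one multidimensional partial credit item.

    theta : (Q,) ability vector of the respondent
    a     : (Q,) loading vector of the item (defines the ability compound)
    b     : (M,) item parameters b_k1, ..., b_kM for categories 1..M
    Returns the probabilities of categories 0..M.
    """
    eta = a @ theta                          # ability compound: sum_q a_kq * theta_iq
    # category m gets sum_{h<=m} (eta - b_kh); category 0 gets 0
    z = np.concatenate(([0.0], np.cumsum(eta - b)))
    z -= z.max()                             # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()                       # the denominator plays the role of h(theta, a, b)

# Hypothetical two-dimensional item with three score categories (0, 1, 2)
p = mpcm_probs(np.array([0.5, -0.2]), np.array([1.0, 0.8]), np.array([-0.5, 0.7]))
```

Note that the normalizing factor h(θi, ak, bk) never has to be stored: it is simply the sum of the unnormalized category terms.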

7.5 Multilevel IRT Model

7.5.1 Models for item parameters

When items are generated from a common parent item, the generation process introduces (slight) random variation between the items derived from the same parent; it then becomes efficient to model the item parameters as random and to shift the interest to the hyperparameters that describe the distributions of the item parameters within parents (Glas & van der Linden, 2001, 2003). Another example of a model with random item parameters is given in Janssen, Tuerlinckx, Meulders and de Boeck (2000). To support standard setting on a criterion-referenced test with sections of items in the test grouped under different criteria, an IRT model is developed with random item parameters drawn from different distributions for different sections. A Bayesian argument for this approach is that if the only thing known a priori about the items is that they are grouped under common criteria, they are exchangeable given the criterion and can be treated as if they are a random sample.

Glas and van der Linden (2001, 2003) define the model as follows. Consider a set of item populations p=1,…, P of sizes K1,…, KP, respectively. The items in population p are labeled kp=1,…, Kp. The first-level model is the 3PLM, which describes the probability of a correct response as in Formula (7), but with the subscript changed from k to kp. In the Level 2 model, the values of the item parameters are considered realizations of a random vector. It is assumed that the transformed item parameters, say (log a_kp, b_kp, logit c_kp), have a 3-variate normal distribution with mean µp and covariance matrix Σp. To support the assumption of normality, the discrimination parameter is transformed as log a_kp, and the guessing parameter as

logit c_kp = log( c_kp / (1 − c_kp) ).

The logit transformation is the standard way to map a probability, such as c_kp, to the real continuum, and taking the logarithm of a_kp assures that a_kp is positive.
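The Level 2 step can be sketched as follows: draw the transformed parameters from the 3-variate normal hyperdistribution and invert the transformations to recover (a, b, c) on their natural scales. The hyperparameter values below are hypothetical illustrations, not values from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters for one item population p:
# mean and covariance of (log a, b, logit c)
mu_p = np.array([0.0, 0.0, np.log(0.2 / 0.8)])   # e.g. an average guessing probability of 0.2
sigma_p = np.diag([0.05, 0.30, 0.10])

def draw_item_parameters(mu, sigma, size, rng):
    """Draw (a, b, c) for `size` items from one population: sample the
    transformed parameters from the 3-variate normal, then transform back."""
    xi = rng.multivariate_normal(mu, sigma, size=size)
    a = np.exp(xi[:, 0])                   # invert log a -> a, so a > 0
    b = xi[:, 1]                           # difficulty needs no transformation
    c = 1.0 / (1.0 + np.exp(-xi[:, 2]))    # invert logit c -> c, so 0 < c < 1
    return a, b, c

a, b, c = draw_item_parameters(mu_p, sigma_p, 1000, rng)
```

The back-transformations make clear why the normality assumption is placed on (log a, b, logit c) rather than on (a, b, c) directly: every draw automatically respects the range restrictions of the 3PLM parameters.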

In general, the model can be estimated by Bayesian methods based on the MCMC procedure (for the 1PLM, see Janssen, Tuerlinckx, Meulders & de Boeck, 2000; for the 3PLM, see Glas & van der Linden, 2001) or by MML (Glas & van der Linden, 2003).

However, Glas and van der Linden (2001) point out that the application can interfere with the feasibility of certain estimation procedures. This arises in computer-based item generation where the computer generates a new item for each examinee (“item generation on the fly”). An example is an arithmetic task where new values of the variables are drawn in every presentation of the item. In that case, the item parameters a_kp, b_kp and c_kp are unique for each examinee, and there is only one item response available to estimate these three parameters. Because of this under-determination, these parameters cannot play a role as auxiliary variables in the MCMC procedure. In the MML estimation, however, they can be treated as nuisance parameters and integrated out of the likelihood function.
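The nuisance-parameter treatment can be illustrated with a simple Monte Carlo approximation: the marginal probability of a correct response is the 3PLM probability averaged over the population distribution of the item parameters. This is only a sketch of the integration step (MML implementations would typically use quadrature inside a likelihood maximization); the hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def marginal_p_correct(theta, mu, sigma, n_draws, rng):
    """Monte Carlo approximation of P(correct | theta) with the item
    parameters integrated out over their population distribution."""
    xi = rng.multivariate_normal(mu, sigma, size=n_draws)
    a = np.exp(xi[:, 0])                   # back-transform log a
    b = xi[:, 1]
    c = 1.0 / (1.0 + np.exp(-xi[:, 2]))    # back-transform logit c
    # 3PLM probability for every draw of (a, b, c), then average
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return float(p.mean())

# Hypothetical population: mean logit guessing of -1.4 (c around 0.2)
p_bar = marginal_p_correct(0.0, np.array([0.0, 0.0, -1.4]),
                           np.diag([0.05, 0.30, 0.10]), 5000, rng)
```

Because the item parameters are averaged out, only the hyperparameters (µp, Σp) and the ability θ enter the marginal response probability, which is exactly what makes MML estimation feasible for on-the-fly generated items.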

7.5.2 Testlet models

A testlet is a subset of items related to some common context; Haladyna (1994) refers to them as context-dependent item sets. Usually, these sets take the form of a number of multiple-choice items organized under or within some text. Haladyna (1994) gives examples of comprehension-type item sets and problem-solving-type item sets. When a test consists of a number of testlets, both the within-testlet and the between-testlet dependence between the items play a role. One approach is to ignore this hierarchical dependence structure and analyze the test as a set of atomistic items. This generally leads to an overestimate of measurement precision and bias in the item parameter estimates (Sireci, Wainer & Thissen, 1991; Yen, 1993; Wainer & Thissen, 1996). Another approach is to aggregate the item scores within a testlet to a testlet score and analyze the testlet scores using an IRT model for polytomously scored items. This approach discards part of the information in the item responses, which will lead to some loss of measurement precision; however, this effect seems to be small (Wainer, 1995). The rigorous way to solve the problem is to model the within and between dependence explicitly. Bradlow, Wainer and Wang (1999; also see Wainer, Bradlow & Du, 2000) introduce a generalization of the 3PLM given by

P(Yik = 1 | θi) = ck + (1 − ck) Ψ( ak(θi − bk − γit(k)) ),

where t(k) is the testlet to which item k belongs and γit(k) a student-specific testlet effect. It is assumed that γit(k) has a normal distribution with a mean equal to zero and a variance that gauges the importance of the testlet effect. Further, it is assumed that θ has a standard normal distribution.

The parameters in the model can be estimated in a Bayesian framework using MCMC (Bradlow, Wainer & Wang, 1999; Wainer, Bradlow & Du, 2000) or in a frequentist framework using MML (Glas, Wainer & Bradlow, 2000).

7.5.3 Models for ratings

Closely related to the testlet model is the IRT model for analyzing ratings by Patz and Junker (1999a, 1999b). Suppose that a student i performs tasks labeled k=1,…, K, and the tasks are rated by raters labeled t=1,…, T. For simplicity, it is assumed that the response variables yikt are dichotomous. The problem is that ratings pertaining to the same task cannot be viewed as independent, because they relate to the same response by student i. First a model will be presented; then it will be shown that this model provides an acceptable specification of the dependence between the responses of different raters pertaining to the same task. Consider a model where the students’ ability parameters have a standard normal distribution with density g(θi). For every task k, student i gives a response ξik. The raters base their ratings on this response, but in the model it is an unobserved latent response that depends on the ability level of the respondent θi. This dependence is modeled by introducing a distribution for ξik that depends on θi. It is assumed that this distribution is normal, with a density denoted by h(ξik | θi, σ). Further, the model contains parameters for constant effects: bk models the difficulty of the task and δt models the leniency of the rater. With these assumptions, the likelihood is given by

L(b, δ, σ) = ∏i ∫ [ ∏k ∫ ∏t Pkt(ξik − bk − δt)^yikt (1 − Pkt(ξik − bk − δt))^(1−yikt) h(ξik | θi, σ) dξik ] g(θi) dθi,

where Pkt(ξik − bk − δt) is the probability of a correct response, P(Yikt = 1 | ξik, bk, δt). This probability can, for instance, be modeled by a logistic function, say Pkt(ξik − bk − δt) = Ψ(ξik − bk − δt). This model could be further enhanced with item discrimination and guessing parameters. The model can be estimated by MML, after integrating out the unobserved variables ξik and θi, or in a Bayesian framework using the MCMC algorithm (see Patz & Junker, 1999a, 1999b).
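The double integration in the likelihood can be sketched by brute-force Monte Carlo: draw θi from its standard normal distribution, draw the latent responses ξik around θi, and average the product of the Bernoulli rating probabilities over the draws. This is only an illustration of the structure of one student's likelihood contribution, with hypothetical task, rater, and data values; a serious implementation would use quadrature or MCMC as the cited work does.

```python
import numpy as np

rng = np.random.default_rng(2)

def rating_likelihood_i(y, b, delta, sigma, n_draws, rng):
    """Monte Carlo sketch of one student's likelihood contribution:
    integrate theta_i and the latent responses xi_ik out of the product
    of Bernoulli rating probabilities.

    y     : (K, T) 0/1 ratings of student i, task k by rater t
    b     : (K,) task difficulties
    delta : (T,) rater leniencies
    sigma : sd of the latent response xi_ik around theta_i
    """
    n_tasks, n_raters = y.shape
    theta = rng.standard_normal(n_draws)                                # theta_i ~ g
    xi = theta[:, None] + sigma * rng.standard_normal((n_draws, n_tasks))  # xi_ik ~ h
    # P_kt(xi_ik - b_k - delta_t) with a logistic response function Psi
    z = xi[:, :, None] - b[None, :, None] - delta[None, None, :]
    p = 1.0 / (1.0 + np.exp(-z))
    like = np.where(y[None, :, :] == 1, p, 1.0 - p)
    # product over tasks and raters, then average over the draws
    return float(like.reshape(n_draws, -1).prod(axis=1).mean())

L_i = rating_likelihood_i(np.array([[1, 0], [1, 1]]), np.array([0.0, 0.5]),
                          np.array([0.2, -0.2]), 0.5, 4000, rng)
```

The sketch makes the dependence structure visible: the ratings of different raters on the same task share one draw of ξik, and the ratings across tasks share one draw of θi, so the responses are conditionally independent only given these latent quantities.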
