

Measurement Models in Assessment and Evaluation


7.2.5 Local and global reliability

In the first method, the prior distribution is fixed; in the second approach, which is often labelled an empirical Bayes approach, the parameters of the prior distribution are estimated along with the other parameters. As in MML, the student parameters are integrated out of the likelihood. Further, point estimates are computed as the maximum of the posterior distribution (hence the name Bayes modal estimation; for more details, see Mislevy, 1986).

The Bayes modal procedure essentially serves to keep the parameters from wandering off. However, the procedure still entails integrating out the ability parameters, and for complex models this integration often becomes infeasible. To solve this problem, Albert (1992) proposed a Markov chain Monte Carlo (MCMC) procedure. In this procedure, a graphical representation of the posterior distribution of every parameter in the model is constructed by drawing from this distribution. This is done using the so-called Gibbs sampler (Gelfand & Smith, 1990). To implement the Gibbs sampler, the parameter vector is divided into a number of components, and each successive component is sampled from its conditional distribution given the sampled values for all other components. This sampling scheme is repeated until the sampled values form stable posterior distributions.

If we apply this to the 1PLM, we could divide the parameters into two components, say the item and ability parameters, and the Gibbs sampler would then imply that we first sample from p(θ|b, Y) and then from p(b|θ, Y), and repeat these iterations until the chain has converged, that is, when the drawn values are relatively stable and the number of draws is sufficient to produce an acceptable graph of the posterior distribution. Starting points for the MCMC procedure can be provided by the Bayes modal estimates described above, and the procedure first needs a number of burn-in iterations to stabilize.
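The alternating scheme for the 1PLM can be sketched with a small Metropolis-within-Gibbs sampler, a common way to implement the conditional draws when they are not available in closed form. Everything below — the simulated data, the proposal scale, the chain length, and the centering of the difficulties used for identification — is an illustrative assumption, not the exact procedure of Albert (1992), which uses data augmentation for the normal-ogive model.

```python
import math
import random

random.seed(1)

# --- simulate a small 1PLM (Rasch) data set; all values are illustrative ---
N, K = 200, 10
true_theta = [random.gauss(0.0, 1.0) for _ in range(N)]
true_b = [random.gauss(0.0, 1.0) for _ in range(K)]

def p_correct(theta, b):
    """P(correct) under the 1PLM."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

Y = [[1 if random.random() < p_correct(true_theta[i], true_b[k]) else 0
      for k in range(K)] for i in range(N)]

def log_post_theta(t, i, b):
    """log p(theta_i | b, Y) up to a constant, with a N(0,1) prior."""
    ll = -0.5 * t * t
    for k in range(K):
        p = p_correct(t, b[k])
        ll += math.log(p) if Y[i][k] else math.log(1.0 - p)
    return ll

def log_post_b(x, k, theta):
    """log p(b_k | theta, Y) up to a constant, with a vague N(0, 10^2) prior."""
    ll = -x * x / 200.0
    for i in range(N):
        p = p_correct(theta[i], x)
        ll += math.log(p) if Y[i][k] else math.log(1.0 - p)
    return ll

def mh_step(current, log_post):
    """One random-walk Metropolis step for a single scalar component."""
    proposal = current + random.gauss(0.0, 0.5)
    if math.log(random.random()) < log_post(proposal) - log_post(current):
        return proposal
    return current

# --- alternate draws from p(theta | b, Y) and p(b | theta, Y) ---
theta, b = [0.0] * N, [0.0] * K
theta_draws = []
n_iter, burn_in = 300, 100
for it in range(n_iter):
    for i in range(N):
        theta[i] = mh_step(theta[i], lambda t, i=i: log_post_theta(t, i, b))
    for k in range(K):
        b[k] = mh_step(b[k], lambda x, k=k: log_post_b(x, k, theta))
    b_mean = sum(b) / K
    b = [x - b_mean for x in b]   # crude identification: center the difficulties
    if it >= burn_in:
        theta_draws.append(list(theta))

# posterior means of the retained draws approximate the EAP estimates
eap = [sum(d[i] for d in theta_draws) / len(theta_draws) for i in range(N)]
```

In practice one would run much longer chains and monitor convergence; the point of the sketch is only the alternation between the two conditional distributions.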

For more complicated models, MCMC procedures were developed by Johnson and Albert (1999) and Béguin and Glas (2001).

Var(θ̂i) = 1 / I(θ̂i), with I(θ) = Σk Ik(θ) (16)

Up to now, we have only swapped one definition for the next one, but these definitions have a background. For every student i, the solution of the estimation equation given in (11) is the maximum of the likelihood function L(θ, b) with respect to the variable θi. This maximum is found by setting the first-order derivative of the logarithm of this function with respect to θi equal to zero. So the equation is given by d log L(θ, b)/dθi = 0.

The variance of the estimate is found by taking the second-order derivative, changing its sign, inverting, and inserting the estimate. That is, the variance is equal to the inverse of −d² log L(θ, b)/dθi², evaluated at θi = θ̂i. If we work out this second-order derivative, we see that the local independence between the item responses results in a summation of item information components, as in Formula (16), and, for the 1PLM, the item components are given by the item information function

Ik(θ) = Pk(θ)(1 − Pk(θ)) (17)

Note that the right-hand side has the same form as the variance of a Bernoulli variable. So, due to the independence of student and item responses given the parameters, the IRT model is nothing other than a model for N times K Bernoulli trials, with the peculiarity that the probability of success differs from trial to trial. The fact that the test information is additive in item information is very convenient: every item has a unique and independent contribution to the total reliability of the test. Further, item information is always positive, so we can infer that the test information increases with the number of items in the test, and, as a consequence, the standard error decreases if the number of items goes up. Further, we can now meaningfully define the optimal item for the measurement of a student’s ability parameter θ. Suppose that the 1PLM holds, that is, the probability of a correct response Pk(θ) is given by Formula (4). If we enter this definition into the definition of item information given by Formula (17), it can be verified that item information is maximal if θ=bk, that is, if the item difficulty matches the ability parameter (this can be verified by taking the first-order derivative of the logarithm of item information with respect to θ and equating it to zero). At the point θ=bk, the probability of a correct response is equal to a half, that is, Pk(θ)=0.5. Intuitively, this result is quite plausible. If students are administered items which are far too difficult, hardly any correct response will be given and we hardly learn anything about their ability level. The same holds for the opposite: if the items are far too easy, all responses will be correct. We have the maximum a-priori uncertainty if the odds of a correct response are even, and in that case we learn most when we observe the actual outcome.
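The claim that information peaks where difficulty matches ability is easy to check numerically. A minimal sketch, where the item difficulty bk = 0.8 is an arbitrary illustrative value:

```python
import math

def p_1pl(theta, b):
    """Probability of a correct response under the 1PLM."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def info_1pl(theta, b):
    """Item information for the 1PLM, Formula (17): Pk(theta)(1 - Pk(theta))."""
    p = p_1pl(theta, b)
    return p * (1.0 - p)

b_k = 0.8                                    # arbitrary illustrative difficulty
grid = [i / 100.0 for i in range(-300, 301)]
best = max(grid, key=lambda t: info_1pl(t, b_k))
# information peaks exactly where ability matches difficulty, with P = 0.5
print(best, p_1pl(best, b_k), info_1pl(best, b_k))   # -> 0.8 0.5 0.25
```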

For the 2PLM, item information is given by

Ik(θ) = ak²Pk(θ)(1 − Pk(θ)).

Note that item information in the 2PLM has the same form as in the 1PLM, with the exception of the presence of the square of the item discrimination parameter ak. As a result, item information is positively related to the magnitude of the discrimination parameter. For the 3PLM, item information is given by

Ik(θ) = ak² [(Pk(θ) − ck)²/(1 − ck)²] [(1 − Pk(θ))/Pk(θ)].

Note that if ck=0.0 and ak=1.0, item information is the same as for the 1PLM, as it should be. Further, item information decreases as the guessing parameter ck increases, so for a given ak it is maximal when ck=0.0.
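These properties of the 3PLM information function can be verified numerically; the θ, a, b and c values below are arbitrary illustrations:

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Item information for the 3PLM."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((p - c) ** 2 / (1.0 - c) ** 2) * ((1.0 - p) / p)

def info_1pl(theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

theta, b = 0.3, -0.5                         # arbitrary illustrative values
# with a=1 and c=0 the 3PLM information reduces to the 1PLM information
assert abs(info_3pl(theta, 1.0, b, 0.0) - info_1pl(theta, b)) < 1e-12
# information decreases as the guessing parameter grows
assert info_3pl(theta, 1.0, b, 0.2) < info_3pl(theta, 1.0, b, 0.0)
# and grows with the square of the discrimination parameter at the peak
assert info_3pl(b, 2.0, b, 0.0) > info_3pl(b, 1.0, b, 0.0)
```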

In a Bayesian framework, inferences about a student’s ability parameter are based on the posterior distribution of θ. For the 1PLM, the number-correct score ri is a sufficient statistic for the ability parameter, so it makes no difference whether we use the posterior distribution given ri, given by p(θ|ri, b, µ, σ), or the posterior distribution given the complete response pattern yi, say p(θ|yi, b, µ, σ). In the 2PLM and 3PLM, the number-correct score is not a sufficient statistic, so here we generally use the second posterior distribution. A point estimate of the student’s ability can for instance be computed as the posterior expectation, the so-called EAP estimate, given by

E(θ | yi, b, µ, σ) = ∫ θ p(θ | yi, b, µ, σ) dθ.

Of course, the posterior median or mode can also be used as a point estimate for ability.

The posterior variance, given by

Var(θ | yi, b, µ, σ) = ∫ (θ − E(θ | yi, b, µ, σ))² p(θ | yi, b, µ, σ) dθ,

serves as an indication of the local reliability. Computation of these indices needs estimates of the item parameters b and the population parameters µ and σ. In a likelihood-based framework such as MML, point estimates obtained in the MML estimation procedure can be inserted as constants. In a fully Bayesian framework, posterior distributions are the very outcome of the estimation procedure, and the posterior indices of central tendency and the posterior variance can easily be computed from the draws generated in the MCMC procedure.
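For the 1PLM, the EAP estimate and posterior variance can be approximated by simple grid quadrature over the prior. The sketch below is illustrative: the item difficulties, the response pattern, and the equally spaced grid are made-up choices, not the values behind Table 7.6.

```python
import math

def eap_and_sd(y, b, mu=0.0, sigma=1.0, n_nodes=81):
    """EAP estimate and posterior SD under the 1PLM, by grid quadrature.
    y: 0/1 response pattern, b: item difficulties, N(mu, sigma^2) prior."""
    grid = [mu + sigma * (-4.0 + 8.0 * j / (n_nodes - 1)) for j in range(n_nodes)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * ((t - mu) / sigma) ** 2)   # prior density (unnormalized)
        for yk, bk in zip(y, b):
            p = 1.0 / (1.0 + math.exp(-(t - bk)))
            w *= p if yk else (1.0 - p)                # times the likelihood
        weights.append(w)
    total = sum(weights)
    eap = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - eap) ** 2 * w for t, w in zip(grid, weights)) / total
    return eap, math.sqrt(var)

# hypothetical 10-item test; not the item parameters behind Table 7.6
b = [-1.5, -1.0, -0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 1.0, 1.5]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]    # a pattern with number-correct score 5
theta_hat, sd = eap_and_sd(y, b)
print(round(theta_hat, 3), round(sd, 3))
```

In production one would use Gauss-Hermite quadrature rather than an equally spaced grid, but the logic is the same.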

ML and EAP estimates of ability generally differ. Consider the example of Table 7.6. The example pertains to the 1PLM; the same data and item parameter estimates as in Table 7.5 were used.

Table 7.6 Estimates of Latent Abilities.

Score   Freq    WML      SE      EAP      SD
 0         9   −3.757   1.965   −1.809   .651
 1        28   −2.416   1.056   −1.402   .626
 2        70   −1.647    .860   −1.022   .608
 3       130   −1.039    .779    −.661   .595
 4       154    −.502    .741    −.311   .588
 5       183     .007    .730     .033   .586
 6       157     .515    .740     .378   .589
 7       121    1.049    .775     .728   .596
 8        97    1.648    .855    1.091   .609
 9        39    2.403   1.049    1.472   .628
10        12    3.728   1.953    1.882   .654

For all possible scores on a test of 10 items, the table gives the frequency, an ML estimate, its standard error, the EAP estimate and the posterior standard deviation. ML estimates for the zero and perfect score do not exist, and further, the ML estimates are slightly biased. Therefore, the estimates in Table 7.6 are computed using a weighted maximum likelihood (WML) procedure where a weighting function is added to the left-hand side of the estimation equations given by Formula (11), such that the estimators are unbiased to the order K−1 (Warm, 1989). WML also gives estimates for the zero and perfect score, though it can be seen in Table 7.6 that the standard errors of these estimates are quite large. Comparing the WML and EAP estimates, it can be seen that the EAP estimates are shrunken towards zero. From the frequency distribution, it can be inferred that the mean of the ability distribution might be zero, and this is in fact true, because the latent scale was identified by imposing the restriction µ=0. In Bayesian statistics, this phenomenon is known as shrinking towards the mean. It is caused by the fact that Bayesian estimates are often a compromise between the estimate implied by the likelihood alone and the influence of the prior. If parameters have some common distribution, and the likelihood gives little information about the values of individual parameters, one might say that the estimates of these parameters borrow strength from each other via their common distribution. The result is that they are shrunken towards their mutual mean. Therefore, one might suggest that the WML estimates are preferable, unless information is sparse and a prior distribution must support the estimates.

Both the WML and EAP estimates have an asymptotic normal distribution. This can be used to determine confidence or credibility regions around the estimates. The assumption of normality is acceptable for most of the score range of a test longer than 5 items, except for the extreme scores, say the three highest and lowest scores. Consider the ability estimate associated with a number-correct score of 5. The ability estimate is 0.007 and its standard error is 0.730. For this short test of 10 items, this means that a 95% confidence region would range from approximately −1.42 to 1.44. This means that the ability estimates associated with all scores from 3 to 7 are well within this confidence region. A 95% credibility region built using the posterior standard deviation and the assumption of normality also includes the scores 2 and 8.

If this kind of reasoning is applied to assess to what extent scores on a test can be distinguished, it is reasonable to take the uncertainty of both scores into account.

Consider the scores 5 and 6. Based on the WML estimates, they have the normal distributions shown in the first panel of Figure 7.4. The means are 0.007 and 0.515, respectively, and the standard deviations are 0.730 and 0.740, respectively. The conclusion is that these two distributions greatly overlap. If we assume that the two distributions represent distributions of ability values, the probability that a draw from the first one would exceed a draw from the second one would be 0.31. This probability is on a scale ranging from a probability of 0.50 for complete unreliability (the distributions are identical) to a probability of 0.00 for complete reliability (the distributions are infinitely far apart). If three tests identical to the original test of 10 items could be added, that is, if the test length were 40, the variance would be divided by 4. This situation is depicted in the second panel of Figure 7.4. As a consequence, the probability that a random draw from the distribution associated with a score of 5 would exceed a random draw from the distribution associated with a score of 6 drops to approximately 0.16.
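The overlap probabilities quoted above follow directly from the normality assumption: for independent normals, P(X1 > X2) = Φ((µ1 − µ2)/√(σ1² + σ2²)). A quick check with the values from Table 7.6:

```python
import math

def prob_first_exceeds(mu1, sd1, mu2, sd2):
    """P(X1 > X2) for independent normals X1 ~ N(mu1, sd1^2), X2 ~ N(mu2, sd2^2)."""
    z = (mu1 - mu2) / math.sqrt(sd1 ** 2 + sd2 ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# WML estimates and standard errors for scores 5 and 6 from Table 7.6
p10 = prob_first_exceeds(0.007, 0.730, 0.515, 0.740)
# quadrupling the test length divides the variances by 4, i.e. halves the SDs
p40 = prob_first_exceeds(0.007, 0.730 / 2, 0.515, 0.740 / 2)
print(round(p10, 2), round(p40, 2))   # -> 0.31 0.16
```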

The above calculus of local reliability was done for the 1PLM, where the number-correct score is a sufficient statistic for the ability parameter. In the 2PLM and 3PLM, the ability estimate depends on the weighted sum score or the complete response pattern, respectively. However, tests are usually scored using number-correct scores. As noted above, number-correct scores and weighted scores are usually highly correlated and the loss of precision is usually limited. To index the precision of a test that follows the 2PLM or 3PLM but is scored with number-correct scores, the same logic as above can be used in combination with the Bayesian framework. That is, one can compute the posterior expectation E(θ|r,a,b,c) and variance Var(θ|r,a,b,c) conditional on the number-correct score r, but with the 2PLM or 3PLM as the response model.
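Computing E(θ|r) under the 2PLM requires the score distribution P(r|θ), which can be built up with the standard recursion over items (elementary symmetric functions). A minimal sketch under assumed item parameters; the five-item bank below is hypothetical:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_probs(theta, a, b):
    """P(number-correct = r | theta) for the 2PLM, via the recursion over items."""
    probs = [1.0]                      # score distribution after 0 items
    for ak, bk in zip(a, b):
        p = p_2pl(theta, ak, bk)
        new = [0.0] * (len(probs) + 1)
        for r, pr in enumerate(probs):
            new[r] += pr * (1.0 - p)   # item answered incorrectly
            new[r + 1] += pr * p       # item answered correctly
        probs = new
    return probs

def eap_given_score(r, a, b, mu=0.0, sigma=1.0, n_nodes=81):
    """Posterior mean and variance of theta given number-correct r, 2PLM."""
    grid = [mu + sigma * (-4.0 + 8.0 * j / (n_nodes - 1)) for j in range(n_nodes)]
    w = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) * score_probs(t, a, b)[r]
         for t in grid]
    total = sum(w)
    eap = sum(t * wi for t, wi in zip(grid, w)) / total
    var = sum((t - eap) ** 2 * wi for t, wi in zip(grid, w)) / total
    return eap, var

# hypothetical 2PLM item bank
a = [0.8, 1.0, 1.2, 1.5, 0.9]
b = [-1.0, -0.5, 0.0, 0.5, 1.0]
print([round(eap_given_score(r, a, b)[0], 2) for r in range(6)])
```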

Figure 7.4 Confidence intervals for ability estimates.

Local reliability plays an important role in educational measurement in activities such as standard setting. For instance, if a certain number-correct score serves as a cut-off score on some examination, the dispersion of ability estimates around the scores can serve as an indication of the number of misclassifications made. However, in many other instances, it is also useful to have an indication of global reliability. This index is based on the following identity:

Var(θ) = E[Var(θ | r)] + Var[E(θ | r)].

The identity entails that the total variance of the ability parameters is a sum of two components. The first component, E[Var(θ|r)], relates to the uncertainty about the ability parameter. Above, it was shown that the posterior variance of ability, Var(θ|r), gives an indication of the uncertainty with respect to the ability parameter once we have observed the score r. By considering its expectation over the score distribution, we obtain an estimate of the average uncertainty over the respondents’ ability parameters. The second term, Var[E(θ|r)], is related to the systematic measurement component. The expectation E(θ|r) serves as an estimate of ability, and by considering the variance of these expectations over the score distribution, we get an indication of the extent to which the respondents can be distinguished on the basis of the test scores. Therefore, a reliability index taking values between zero and one can be computed as the ratio of the systematic variance and the total variance, that is

ρ = Var[E(θ | r)] / (E[Var(θ | r)] + Var[E(θ | r)]).

For the example of Table 7.6 the global reliability was equal to 0.63.
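The global reliability can be approximated directly from Table 7.6, weighting the posterior variances and the EAP estimates by the score frequencies. Because the table entries are rounded and the observed score distribution stands in for the model-implied one, this reproduces the reported value only approximately (it gives roughly 0.61, close to the reported 0.63):

```python
# values copied from Table 7.6
freq = [9, 28, 70, 130, 154, 183, 157, 121, 97, 39, 12]
eap  = [-1.809, -1.402, -1.022, -0.661, -0.311, 0.033,
        0.378, 0.728, 1.091, 1.472, 1.882]
psd  = [0.651, 0.626, 0.608, 0.595, 0.588, 0.586,
        0.589, 0.596, 0.609, 0.628, 0.654]

n = sum(freq)
# E[Var(theta|r)]: average posterior variance over the score distribution
e_var = sum(f * s ** 2 for f, s in zip(freq, psd)) / n
# Var[E(theta|r)]: variance of the EAP estimates over the score distribution
mean_eap = sum(f * e for f, e in zip(freq, eap)) / n
var_e = sum(f * (e - mean_eap) ** 2 for f, e in zip(freq, eap)) / n
rho = var_e / (var_e + e_var)
print(round(rho, 2))   # -> 0.61
```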

Optimal Test Assembly

Theunissen (1985) has pointed out that test assembly problems can be solved by binary programming, a special branch of linear programming. Binary programming problems have two ingredients: an objective function and one or more restrictions. This will first be illustrated by a very simple test assembly problem. Consider a test where the interest is primarily in a point θ0 of the ability continuum and one wants to construct a test of L items that has maximal information at θ0. Selection of items is represented by a selection variable dk: dk=1 if item k is selected for the test and dk=0 if this is not the case. The objective function is given by

Σk dk Ik(θ0),

which should be maximized as a function of the item selection variables dk. This objective function must be maximized under the restriction that the test length is L, which translates to the restriction

Σk dk = L.

If item k is selected, dk=1, so in that case the item contributes both to the target number of items L and to the total test information.
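For this simplest problem — maximal information at a single point θ0 subject only to a length constraint — the binary program has an obvious solution: take the L items with the highest information at θ0. A sketch for the 1PLM with a hypothetical item bank:

```python
import math

def info_1pl(theta, b):
    """Item information for the 1PLM."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def assemble(bank_b, theta0, L):
    """Maximize sum_k d_k * I_k(theta0) subject to sum_k d_k = L.
    With only a length constraint, the optimum is the L items
    with the largest information at theta0."""
    ranked = sorted(range(len(bank_b)),
                    key=lambda k: info_1pl(theta0, bank_b[k]), reverse=True)
    return sorted(ranked[:L])

bank_b = [-2.0, -1.2, -0.4, 0.1, 0.3, 0.9, 1.6, 2.4]   # hypothetical item bank
selected = assemble(bank_b, theta0=0.0, L=3)
print(selected)   # -> [2, 3, 4]: the items with difficulties closest to theta0
```

Once further constraints are added (content balancing, multiple tests, overlap control), this greedy shortcut no longer works and a genuine binary programming solver is needed.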

This basic optimization problem can be generalized in various directions, such as choosing more points on the ability scale, constructing more tests simultaneously, introducing time and cost constraints, and taking the constraints of the table of test specifications into account. More on optimal test assembly problems and the algorithms to solve these problems can be found in Boekkooi-Timminga (1987, 1989, 1990), van der Linden and Boekkooi-Timminga (1988, 1989) and Adema and van der Linden (1989). In this section, some of the possibilities of optimal test assembly will be illustrated using an application by Glas (1997).

The problem is assembling tests for pass/fail decisions, where the tests have both the same observed cut-off score and the same cut-off point on the ability scale and are approximately equally reliable at the cut-off point. It is assumed that the Rasch model for binary items holds for all items in the item bank. For this problem the item selection variables are dti for t=1,…, T and i=1,…, K, where again dti is equal to one if item i is selected for test t and zero otherwise. The binary programming problem that must be solved in the variables dti is maximizing the information of the tests at the cut-off point θ0,

Σt Σi dti Ii(θ0),

subject to

Σt dti ≤ 1 for all i, (18)

|Σi d1i Ii(θ0) − Σi d2i Ii(θ0)| ≤ c1, (19)

|Σi dti Pi(θ0) − r0| ≤ c2 for t = 1,…, T, (20)

where r0 denotes the observed cut-off score. Inequality (18) imposes the restriction that every item should be present in one test only.

Inequality (19) must assure that the difference in information of the two tests at the cut-off point is small. This is accomplished by choosing c1 as a sufficiently small constant.

The inequalities (20) must assure that the observed and latent cut-off scores of the two tests are sufficiently close. Here too, this is accomplished by choosing c2 as a sufficiently small constant. If the restrictions imposed by (18)–(20) are too tight, a solution to the optimization problem may, of course, not exist. For instance, it is not possible to construct two unique tests of 50 items from an item bank of 70 items. Besides the size of the item bank, the distribution of the item parameters, the distribution of items over the classification variables and the magnitude of c1 and c2 also determine the feasibility of a solution.

As already noted, the test assembly problem discussed here is just one of the large variety of problems addressed in the literature. Test assembly procedures have been developed for minimizing test length under the restriction that test information is above a certain target for a number of specified points on the ability scale, minimization of administration time with a fixed number of items in a test, maximization of classical reliability, and optimal matching of a target for the observed score distribution of a test.

Tables with objective functions and constraints covering most of the options available can be found in van der Linden and Boekkooi-Timminga (1989).