
Measurement Models in Assessment and Evaluation

7.2 Unidimensional Models for Dichotomous Items

7.2.6 Model fit

Firstly, the true values of the respondents’ ability parameters θ are not available. Secondly, even if they were available, only a limited number of distinct θ-values would be available. And thirdly, for every available θ-value, only one observed response per item is available. So we cannot accumulate responses to obtain sufficiently large values for the numbers of observed and expected responses to assess the (asymptotic) distribution of possible test statistics. The first point could be solved using estimates of θ, but the CML and MML frameworks do not directly provide these estimates, and item-fit statistics that import estimates from outside these frameworks have theoretical problems. The fact that the 1PLM has a sufficient statistic for the ability parameter, however, suggests an alternative approach.

Respondents with the same number-correct score will probably be quite homogeneous with respect to their ability level (more precisely, the number-correct score has a monotone likelihood ratio in θ). In an MML framework, the 2PLM and 3PLM do not have sufficient statistics for θ, but usually the number-correct score and the estimate of θ are highly correlated. Therefore, a test of the item response curves can be based on the difference between the observed proportion of correct responses given the sum score and the analogous model-based probability. This leads to the test statistic

Q_{1k} = \sum_{g=1}^{G} N_g \, \frac{(O_{gk} - E_{gk})^2}{E_{gk}(1 - E_{gk})} ,   (21)

where Ng is the number of respondents in subgroup g, and Ogk and Egk are the observed and expected proportions of respondents in subgroup g with a correct response to item k, respectively. Usually, the subgroups are groups with the same number-correct score, but there are alternatives, to which we return later. The way in which the expected value Egk is computed differs depending on the estimation procedure. First, a likelihood-based framework will be treated, and then we will discuss the Bayesian approach.
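To make the computation concrete, the following Python sketch (not taken from OPLM or any of the cited programs; the function names, quadrature grid, and the 2PLM-with-standard-normal-ability setup are illustrative assumptions) computes Egk for number-correct score groups with the recursive score-distribution algorithm on which Orlando and Thissen's approach relies, and then accumulates statistic (21):

```python
import numpy as np

def irf(a, b, theta):
    """2PLM item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def score_dist(p):
    """Recursive (Lord-Wingersky) distribution of the number-correct score
    at one theta; p holds the success probabilities at that theta."""
    f = np.array([1.0])
    for pk in p:
        f = np.concatenate((f, [0.0])) * (1 - pk) + np.concatenate(([0.0], f)) * pk
    return f  # f[s] = P(score = s | theta)

def expected_props(a, b, k, nodes, weights):
    """E_sk = P(item k correct | number-correct score s), marginalized
    over a N(0,1) ability distribution approximated on a grid."""
    K = len(a)
    num, den = np.zeros(K + 1), np.zeros(K + 1)
    rest = [j for j in range(K) if j != k]
    for t, w in zip(nodes, weights):
        p = irf(a, b, t)
        den += w * score_dist(p)                    # P(score = s)
        num[1:] += w * p[k] * score_dist(p[rest])   # item k correct, s-1 on the rest
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den                            # meaningful for s = 1, ..., K-1

def q1(y, a, b, k, nodes, weights):
    """Pearson-type fit statistic (21) for item k, using number-correct score groups."""
    K = y.shape[1]
    s = y.sum(axis=1)
    E = expected_props(np.asarray(a), np.asarray(b), k, nodes, weights)
    stat = 0.0
    for g in range(1, K):                           # scores 0 and K carry no information
        m = s == g
        if m.sum() == 0:
            continue
        O = y[m, k].mean()
        stat += m.sum() * (O - E[g]) ** 2 / (E[g] * (1 - E[g]))
    return stat

# Illustrative quadrature grid for a standard normal ability distribution:
nodes = np.linspace(-4.0, 4.0, 61)
weights = np.exp(-0.5 * nodes ** 2)
weights /= weights.sum()
# e.g., q1(y, a_hat, b_hat, k=4, nodes=nodes, weights=weights) for item 5,
# given a response matrix y and estimated parameters a_hat, b_hat.
```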

Orlando and Thissen (2000) give the formulas needed to compute Egk in an MML framework. The test statistic can also be computed in a CML framework (van den Wollenberg, 1982). In older versions of the test (Yen, 1981, 1984; Mislevy & Bock, 1990), subgroups were formed on the basis of their ability estimates rather than on the basis of their total score, and the expectation Egk was computed as the mean predicted probability of a correct response in subgroup g. However, this approach (which is still prominent in some software packages) leads neither to a statistic with a tractable distribution under the null hypothesis nor to acceptable power characteristics (Glas & Suárez-Falcón, 2003). The reason is that in this case the grouping of respondents is not based on some directly observable statistic, such as the number-correct score; therefore, the observed frequencies are not solely a function of the data but also of model-based trait estimates, which violates the assumptions of the traditional χ²-goodness-of-fit test.

The approach used in the Q1-test as defined by Orlando and Thissen (2000) also has a disadvantage: a tractable distribution under the null hypothesis and acceptable power characteristics can only be achieved when the test is computed with the sample of respondents partitioned on the basis of their individual number-correct scores (Glas & Suárez-Falcón, 2003). Especially for long tests, it would be practical if a number of adjacent scores could be combined, so that the total number of groups G would remain limited, say to 4 to 6 groups. The problem that score groups cannot be combined is caused by the fact that the dependencies between the observed proportions Ogk and their expectations Egk induced by the parameter estimation are not taken into account. To obtain a test statistic for which score groups can be combined and for which the asymptotic distribution can be derived analytically, the complete covariance matrix of the differences Ogk − Egk has to be taken into account. These problems are solved in the framework of the so-called Lagrange multiplier (LM) statistic (Rao, 1947; Aitchison & Silvey, 1958). The application to IRT is labeled the LM-Q1-test (Glas, 1988; Glas & Verhelst, 1995; Glas, 1998, 1999).

An example

The following artificial example shows some major steps in a fit analysis. In this example, responses of 1000 respondents to 10 items were generated according to the 2PLM. As in the previous example, the item difficulty parameters bk were equal to

−2.0(0.5)2.0 (that is, ranging from −2.0 to 2.0 in steps of 0.5), with the middle value 0.0 appearing twice. The item discrimination parameters ak were all equal to one, except for items 5 and 6, which had discrimination parameters equal to 2.0 and 0.5, respectively (as reflected in the estimates in Tables 7.7 and 7.8 below). Ability parameters were drawn from a standard normal distribution. First, the data were analyzed with the 1PLM.
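A minimal data-generating sketch of this setup (Python; the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
b = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0])
a = np.ones(10)
a[4], a[5] = 2.0, 0.5                 # items 5 and 6 deviate from the 1PLM
theta = rng.standard_normal(N)        # standard normal abilities
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
y = (rng.random((N, 10)) < p).astype(int)   # N x 10 response matrix
```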

The item parameter estimates and their standard errors are given in Table 7.7, under the heading “Model 1”. It can be seen that the estimates of the difficulty parameters are sufficiently close to the true generating values, even though items 5 and 6 violated the model. All results in Table 7.7 were obtained using the computer program OPLM (Verhelst, Glas & Verstralen, 1995). For every item, the LM-Q1-statistic was computed.

The results are again shown under the heading “Model 1”. The values of the computed statistics are given under the label “LM-Q1”; the degrees of freedom and the significance probabilities are given under the labels “df” and “p”, respectively. It can be seen that the test was significant for items 5 and 6. That is, the differences between the observed and expected proportions of correct responses in the score groups were such that the hypothesis that the observed proportions were properly described by the 1PLM had to be rejected.

Table 7.7 CML Estimates and Model fit.

Model 1
Item      b  SE(b)  LM-Q1  df     p
  1   −1.97   .090   3.33   5  .649
  2   −1.62   .082   4.18   5  .523
  3   −1.02   .074   5.03   5  .412
  4    −.51   .070   4.81   6  .567
  5    −.05   .068  29.08   6  .000
  6    −.09   .068  41.52   6  .000
  7     .35   .069   4.06   6  .669
  8    1.13   .074   5.38   5  .371
  9    1.80   .084   5.12   5  .401
 10    1.99   .088   7.88   5  .163
Global: LM-Q1=71.73  df=27  p=.000

Model 2
Item      b  SE(b)  LM-Q1  df     p
  1   −1.99   .089   2.15   3  .541
  2   −1.64   .082    .63   3  .889
  3   −1.04   .074   3.58   4  .465
  4    −.53   .070   3.28   4  .511
  5      −      −      −    −    −
  6      −      −      −    −    −
  7     .33   .069    .47   4  .976
  8    1.11   .074   2.47   4  .649
  9    1.78   .085   1.91   3  .589
 10    1.98   .089   4.06   3  .255
Global: LM-Q1=15.53  df=21  p=.796

Model 3
Item  a      b  SE(b)  LM-Q1  df     p
  1   4   −.50   .030   2.70   5  .746
  2   4   −.41   .029   6.27   6  .394
  3   4   −.25   .027   3.05   6  .802
  4   3   −.21   .088   5.50   3  .138
  5   9    .02   .014    .87   3  .831
  6   1    .05   .187   2.87   5  .719
  7   4    .08   .026   5.02   6  .541
  8   5    .27   .026    .51   4  .972
  9   4    .44   .029   1.55   5  .907
 10   4    .49   .030   6.40   5  .269
Global: LM-Q1=33.24  df=25  p=.125

The presence of the two misfitting items did not interfere with the fit of the eight other items: none of the tests for these items was significant. The bottom line gives the outcome of a version of the test in which the item response curves of all items are evaluated simultaneously. It can be seen that this test of global model fit rejected the model.

Three routes are open to obtaining a fitting model for this data set: removing the misfitting items, using the OPLM, or using the 2PLM to model the data. The results of removing the two items are shown in Table 7.7 under the heading “Model 2”. The tests for the eight remaining items are not significant, and the global test statistic, shown at the bottom of the table, no longer rejects the model either. An alternative to removing items is trying to model the response behavior. Here two approaches were used: the OPLM and the 2PLM. The first approach entails defining integer-valued discrimination indices that are imputed as constants in a CML estimation procedure. Initial values of these discrimination indices are found by estimating them under the 2PLM and then rounding to integer values. One might ask why the 2PLM is then not used straightaway.

The reason is that both approaches have their advantages and disadvantages. The 2PLM is more flexible than the OPLM (though Verhelst & Glas, 1995, show that the coverage of the parameter space using integer values for the discrimination parameters is quite compact), but it needs an assumption on the distribution of ability to obtain MML estimates. For the OPLM, CML estimation is feasible, and this estimation procedure needs no assumption on the distribution of ability. So both approaches have their merits and drawbacks. The estimates and the outcomes of the fit tests are shown in Table 7.7 under the heading “Model 3”. The discrimination parameters are given under the label “a”. The generating values of the discrimination parameters of items 1–4 and 7–10 were all equal to one; in the present analysis they are scaled to approximately 4. The original generating value for item 5 was twice that of the discrimination parameters of items 1–4 and 7–10, and the value for item 6 was half that value. Note that these ratios are still reflected in the values as they are given in Table 7.7. Further, the parameterization of the model is as in Formula (9), so the item difficulty parameters should be multiplied by the discrimination indices to make them comparable to the item parameters in the 1PLM.

Both the outcomes of the item-oriented tests and the global test now support the OPLM. In the present example, the data fit the OPLM right away. In practical situations, however, finding a set of discrimination parameters ak that yields model fit is usually an iterative process. Discrimination parameters ak of misfitting items can be adjusted using the pattern of differences between observed and expected proportions of correct responses: increasing ak when these differences suggest that the item response curve under the OPLM should be steeper, and decreasing ak when they suggest the opposite. This iterative process is repeated until the proportion of misfitting items falls below the nominal significance level and the global test is no longer significant.
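The initialization of the integer discrimination indices described above might be sketched as follows (Python; the input vector reuses the 2PLM estimates from Table 7.8, and the scaling constant is an illustrative choice that maps a typical item to index 4, in the spirit of Table 7.7; the resulting integers need not reproduce the table exactly):

```python
import numpy as np

# 2PLM discrimination estimates from a preliminary analysis (here: Table 7.8)
a_2pl = np.array([1.01, 1.05, 1.16, 0.91, 1.98, 0.54, 1.07, 1.27, 1.04, 1.09])

scale = 4.0 / np.median(a_2pl)                    # map a "typical" item to index 4
a_oplm = np.maximum(np.round(a_2pl * scale), 1).astype(int)
print(a_oplm)   # integer indices to impute as constants in the CML procedure
```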

Table 7.8 MML Estimates and Model fit.

Item     a  SE(a)      b  SE(b)  LM-Q1    p  df
  1   1.01   .135  −1.99   .127   0.67  .88   5
  2   1.05   .130  −1.66   .113   5.97  .11   4
  3   1.16   .123  −1.10   .097   2.88  .41   6
  4   0.91   .104  −0.53   .079   4.57  .21   6
  5   1.98   .229  −0.12   .106   2.59  .46   4
  6   0.54   .079  −0.12   .069   1.18  .76   6
  7   1.07   .116   0.30   .081   1.07  .78   6
  8   1.27   .141   1.15   .104   2.33  .51   6
  9   1.04   .128   1.73   .114   2.87  .41   5
 10   1.09   .143   1.95   .129  10.15  .02   5
Global: LM-Q1=33.24  df=25  p=.125

The other approach is applying the 2PLM in combination with MML estimation. Table 7.8 gives the MML estimates of the item parameters computed using the program Bilog-MG (Zimowski, Muraki, Mislevy, & Bock, 1996). Note that the estimates reflect the true generating values very well. The fit statistics computed by Bilog-MG are not used here; they have the problems with their Type I error rate and power already mentioned above. Instead, the LM-Q1 statistics were computed using dedicated software. It can be seen that all tests supported the 2PLM. The item-oriented LM-Q1 tests were computed using four score groups. At the bottom of the table, the outcome of a global version of the LM-Q1-test is displayed. It can be seen that the model is supported, which is, of course, as expected.

Testing local independence and multidimensionality

The statistics of the previous section can be used for testing whether the data support the form of the item response functions. Another assumption underlying the IRT models presented above is unidimensionality. If the model is unidimensional and the student’s position on the latent trait is fixed, the assumption of local stochastic independence requires that the association between the items vanishes. In the case of more than one dimension, however, the student’s position in the latent space is not sufficiently described by one unidimensional ability parameter and, as a consequence, the association between the responses to the items given this one ability parameter will not vanish. Therefore, tests for unidimensionality are based on the association between the items.

Yen (1984, 1993) proposed a test statistic based on the argument that the random error scores on items k and l, defined by dk = yk − Pk(θ) and dl = yl − Pl(θ), are approximately bivariate normally distributed with a zero correlation. The test statistic is the correlation (taken over respondents) of dk and dl, that is,

Q_{3kl} = \frac{\sum_i (d_{ik} - \bar{d}_k)(d_{il} - \bar{d}_l)}{\sqrt{\sum_i (d_{ik} - \bar{d}_k)^2 \, \sum_i (d_{il} - \bar{d}_l)^2}} ,

where Pk(θ) and Pl(θ) are evaluated using the EAP estimate of θ. If the model holds, the Fisher r-to-z transform of this statistic approximately follows a normal distribution with zero mean and variance 1/(N−3). Simulation studies reported by Yen (1984, 1993) showed that this approximation produces quite acceptable results.
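As an illustration, Q3 and the Fisher-transformed flags might be computed as follows (a Python sketch; it assumes a response matrix y and a matrix p_hat of model probabilities evaluated at the EAP ability estimates, both of which would come from a preceding calibration):

```python
import numpy as np
from scipy.stats import norm

def q3(y, p_hat):
    """Yen's Q3: correlations between item residuals d_ik = y_ik - P_k(theta_i)."""
    d = y - p_hat                        # N x K residual matrix
    return np.corrcoef(d, rowvar=False)  # K x K matrix of pairwise Q3 values

def q3_flags(q3_mat, n, alpha=0.05):
    """Fisher r-to-z transform; under the model z is approximately
    N(0, 1/(N-3)), so |z| * sqrt(N-3) is compared with a standard normal."""
    q3_mat = q3_mat.copy()
    np.fill_diagonal(q3_mat, 0.0)        # the diagonal is not a test statistic
    z = np.arctanh(q3_mat) * np.sqrt(n - 3)
    return np.abs(z) > norm.ppf(1 - alpha / 2)
```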

In the framework of the 1PLM and CML estimation, van den Wollenberg (1982) showed that violations of local independence can be tested using a statistic based on an evaluation of the association between items in a 2-by-2 table. Applying this idea to the 3PLM in an MML framework, a statistic can be based on the difference between observed and expected frequencies given by

D_{kl} = n_{kl} - E(N_{kl}),

where nkl is the observed number of respondents answering both item k and item l correctly, in the group of respondents obtaining a score between 2 and K−2, and E(Nkl) is its expectation.

Only scores between 2 and K−2 are considered, because respondents with a score less than 2 cannot answer both items correctly, and respondents with a score greater than K−2 cannot answer both items incorrectly; these respondents therefore contribute no information to the 2-by-2 table. Using Pearson’s X² statistic for association in a 2-by-2 table results in

X^2_{kl} = \frac{(n_{kl} - E(N_{kl}))^2}{E(N_{kl})} + \frac{(n_{k\bar{l}} - E(N_{k\bar{l}}))^2}{E(N_{k\bar{l}})} + \frac{(n_{\bar{k}l} - E(N_{\bar{k}l}))^2}{E(N_{\bar{k}l})} + \frac{(n_{\bar{k}\bar{l}} - E(N_{\bar{k}\bar{l}}))^2}{E(N_{\bar{k}\bar{l}})} ,

where E(N_{k\bar{l}}) is the expected number of respondents answering item k correctly and item l incorrectly, and E(N_{\bar{k}l}) and E(N_{\bar{k}\bar{l}}) are defined analogously. Glas and Suárez-Falcón (2003) describe a version of the statistic for the MML framework, and Glas (1999) proposes a Lagrange-multiplier version of the test labeled LM-Q2.
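The bookkeeping for the 2-by-2 statistic can be sketched as follows (Python; the exact MML expectations E(N) require integration over the ability distribution, so this sketch plugs in probabilities evaluated at point estimates of ability and therefore only approximates the statistic, not the LM-Q2 test itself):

```python
import numpy as np

def pair_x2(y, p_hat, k, l):
    """Pearson X^2 for the 2-by-2 table of items k and l, restricted to
    respondents with a number-correct score between 2 and K-2."""
    K = y.shape[1]
    s = y.sum(axis=1)
    m = (s >= 2) & (s <= K - 2)
    yk, yl = y[m, k], y[m, l]
    pk, pl = p_hat[m, k], p_hat[m, l]
    x2 = 0.0
    for ck, cl in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        obs = np.sum((yk == ck) & (yl == cl))
        qk = pk if ck else 1 - pk
        ql = pl if cl else 1 - pl
        exp_ = np.sum(qk * ql)          # plug-in expectation of the cell count
        x2 += (obs - exp_) ** 2 / exp_
    return x2
```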

An example

An artificial example was generated using the 2PLM with the same item parameters as in the previous example. However, local independence between items 5 and 6 was violated: if a correct response was given to item 5, the difficulty parameter of item 6 was decreased by 0.50. Parameters were estimated by MML, and the LM-Q2-statistic was computed to verify whether the misfit could be detected. The results are shown in Table 7.9. Note that the model violation did not result in a substantial bias in the estimates of the item parameters. Further, only the LM-Q2-test for the pair of items 5 and 6 was significant; the other pairs were not affected. The observed and expected values on which the test was based are shown in the last two columns.

Table 7.9 Testing for Local Dependence.

Item     a  SE(a)      b  SE(b)    Pair  LM-Q2     p  Observed  Expected
  1   0.89    .14  −1.93    .12      −      −      −       −        −
  2   1.00    .13  −1.43    .10   (1,2)   0.12   0.73     670    665.6
  3   0.70    .11  −0.81    .07   (2,3)   0.89   0.35     555    541.7
  4   0.78    .10  −0.43    .07   (3,4)   0.01   0.91     424    425.5
  5   1.28    .14  −0.07    .08   (4,5)   0.01   0.92     346    347.2
  6   1.36    .16  −0.49    .09   (5,6)   5.00   0.03     392    365.3
  7   0.93    .11   0.44    .07   (6,7)   0.38   0.54     280    287.1
  8   0.86    .12   0.86    .08   (7,8)   0.08   0.77     166    163.0
  9   0.81    .12   1.36    .09   (8,9)   0.00   0.96      97     96.6
 10   0.60    .13   1.76    .10   (9,10)  0.13   0.72      46     48.2

Note: the LM-Q2 value and the observed and expected counts in each row refer to the pair consisting of the item in that row and the preceding item.

Testing differential item functioning

Differential item functioning (DIF) is a difference in item responses between equally proficient members of two or more groups. For instance, a dichotomous item is subject to DIF if, conditional on ability level, the probability of a correct response differs between groups. One might think of a test of foreign language comprehension, where items referring to football might impede girls. The poor performance of the girls on the football-related items must then not be attributed to a low ability level but to their lack of knowledge of football. Since DIF is highly undesirable in fair testing, methods for the detection of DIF have been studied extensively. Overviews are provided in the books by Holland and Wainer (1993) and by Camilli and Shepard (1994). Several techniques for the detection of DIF have been proposed. Most of them are based on an evaluation of differences in response probabilities between groups, conditional on some measure of ability. The most generally used technique is based on the Mantel-Haenszel statistic (Holland & Thayer, 1988); others are based on log-linear models (Kok, Mellenbergh & Van der Flier, 1985; Swaminathan & Rogers, 1990) or on IRT models (Hambleton & Rogers, 1989; Kelderman, 1989; Thissen, Steinberg & Wainer, 1993; Glas & Verhelst, 1995; Glas, 1998).

In the Mantel-Haenszel (MH) approach, the respondent’s number-correct score is used as a proxy for ability, and DIF is evaluated by testing whether the response probabilities differ between the groups within the score levels. Though the MH test works quite well in practice, Fischer (1995) points out that its application is based on the assumption that the 1PLM holds. When the MH test is applied in other cases, such as with data following the 2PLM or the 3PLM, the number-correct score is no longer the optimal ability measure. In an IRT model, ability is represented by a latent variable θ, and an obvious solution to the problem is to evaluate whether the same item parameters apply in subgroups that are homogeneous with respect to θ. As with the previous model violations, DIF can also be assessed with a Lagrange multiplier statistic (Glas, 1998).
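For reference, the MH chi-square (without continuity correction) can be sketched as follows (Python; group is a 0/1 vector, and the stratification is on the number-correct score):

```python
import numpy as np

def mh_chi2(y, group, k):
    """Mantel-Haenszel chi-square for DIF on item k, stratifying on the
    number-correct score; group is coded 0 (reference) / 1 (focal)."""
    s = y.sum(axis=1)
    num, var = 0.0, 0.0
    for t in np.unique(s):
        m = s == t
        ref, foc = m & (group == 0), m & (group == 1)
        n_ref, n_foc = ref.sum(), foc.sum()
        n_tot = n_ref + n_foc
        m1 = y[m, k].sum()              # correct answers in this score stratum
        m0 = n_tot - m1
        if n_ref == 0 or n_foc == 0 or m1 == 0 or m0 == 0:
            continue                    # stratum carries no information
        a = y[ref, k].sum()             # correct answers in the reference group
        num += a - n_ref * m1 / n_tot
        var += n_ref * n_foc * m1 * m0 / (n_tot ** 2 * (n_tot - 1))
    return num ** 2 / var               # approx. chi-square, 1 df, under no DIF
```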

Table 7.10 Parameter Generating Values, Estimates and the LM Statistic.

Item       bk   Est.    SE     LM     p
  1     −1.00  −0.91  0.12   1.33  0.25
  2      0.00   0.13  0.11   0.27  0.61
  3      1.00   1.13  0.12   1.14  0.29
  4     −1.00  −0.93  0.11   1.14  0.29
  5   0.0/0.5   0.41  0.11  18.03  0.00
  6      1.00   1.04  0.12   0.02  0.90
  7     −1.00  −0.77  0.12   0.05  0.83
  8      0.00   0.11  0.11   0.01  0.92
  9      1.00   1.03  0.11   0.11  0.74

Pop     µg   Est.    SE
  1   1.00   1.00  0.11
  2   0.00   0.00     −

Pop     σg   Est.    SE
  1   1.00   1.01  0.07
  2   1.50   1.41  0.08

An example

Tables 7.10 and 7.11 give a small simulated example of the procedure. The data were generated as follows. Both groups consisted of 400 simulees, each responding to 9 items.

The 1PLM was used to generate the responses. The item parameter values are shown in the second column of Table 7.10. To simulate DIF, for the first group the parameter of item 5 was changed from 0.00 to 0.50. Further, in DIF research it is usually implausible to assume that the populations of interest have the same ability distribution. Therefore, the mean and the standard deviation of the first group were chosen equal to 1.0 and 1.0, respectively, while the mean and the standard deviation of the second group were chosen equal to 0.0 and 1.5, respectively. Item and population parameters were estimated concurrently by MML. Table 7.10 gives the generating values of the parameters, the estimates and the standard errors.

The last two columns of Table 7.10 give the values of the LM statistic and the associated significance probabilities. In this case, the LM statistic has an asymptotic chi-square distribution with one degree of freedom. The test is highly significant for item 5.

For the analysis of Table 7.11, item 5 has been split into two virtual items: item 5 was assumed to be administered to group 1, and item 10 was assumed to be administered to group 2. So the data are now analyzed assuming an incomplete item administration design, where group 1 responded to items 1 to 9 and group 2 responded to items 1 to 4, 10, and 6 to 9 (in that order). As a consequence, the virtual items 5 and 10 were each responded to by only one group, and the LM test cannot be performed for these items. It can be seen in Table 7.11 that the values of the LM statistics for the other items are not significant, which gives an indication that the model now fits.
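The construction of the virtual items can be illustrated as follows (a sketch; NaN marks responses that are treated as not administered in the incomplete design):

```python
import numpy as np

def split_dif_item(y, group, k):
    """Split item k into two virtual items: column k keeps group 0's
    responses, and a new last column receives group 1's responses."""
    y_ext = np.column_stack([y.astype(float), np.full(len(y), np.nan)])
    foc = group == 1
    y_ext[foc, -1] = y_ext[foc, k]      # group 1 responds to the virtual item
    y_ext[foc, k] = np.nan              # ... and no longer to the original one
    return y_ext
```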

Table 7.11 Parameter Generating Values, Estimates and the LM Statistic after Splitting the DIF Item into Two Virtual Items.

Item     bk   Est.    SE     LM     p
  1   −1.00  −0.88  0.12   0.32  0.57
  2    0.00   0.18  0.11   1.27  0.26
  3    1.00   1.18  0.12   0.23  0.63
  4   −1.00  −0.90  0.12   0.23  0.63
  5    0.50   0.81  0.15     −     −
  6    1.00   1.10  0.12   0.22  0.63
  7   −1.00  −0.73  0.12   0.11  0.74
  8    0.00   0.16  0.11   0.23  0.63
  9    1.00   1.09  0.11   0.90  0.34
 10    0.00   0.08  0.15     −     −

Pop     µg   Est.    SE
  1   1.00   1.00  0.11
  2   0.00   0.00     −

Pop     σg   Est.    SE
  1   1.00   1.01  0.07
  2   1.50   1.41  0.08

Person fit

Applications of IRT models to the analysis of test items, tests, and item score patterns are only valid if the model holds. Fit of items can be investigated across students, and fit of students can be investigated across items. Item fit is important because, in psychological and educational measurement, instruments are developed to be used in a population of students; item fit can then help the test constructor to develop an instrument that fits an IRT model in that particular population. Examples of the most important item-fit statistics were given in the previous sections. As a next step, the fit of an individual’s item score pattern can be investigated. Although a test may fit an IRT model, students may produce patterns that are unlikely given the model, for example, because they are unmotivated or unable to give proper responses that relate to the relevant ability variable, because they have preknowledge of the correct answers, or because they are cheating.

As with item-fit statistics, person-fit statistics are defined for the 1-, 2-, and 3-parameter models, in a logistic or normal-ogive formulation, and in a frequentist or Bayesian framework. Analogous to item-fit statistics, person-fit statistics are based on differences between observed and expected frequencies. A straightforward example is the W-statistic introduced by Wright and Stone (1979), which is defined as

W = \frac{\sum_{k=1}^{K} (y_k - P_k(\theta))^2}{\sum_{k=1}^{K} P_k(\theta)(1 - P_k(\theta))} .

Since the statistic is computed for individual students, we drop the index i. Usually, the statistic is computed using maximum likelihood estimates of the item parameters (obtained using CML or MML) and a maximum likelihood estimate of the ability parameter. A related statistic was proposed by Smith (1985, 1986). The set of test items is divided into G non-overlapping subtests denoted A_g (g = 1, …, G), and the test is based on the discrepancies between the observed scores and the expected scores under the model, summed within subsets of items. That is, the statistic is defined as

UB = \frac{1}{G-1} \sum_{g=1}^{G} \frac{\left[ \sum_{k \in A_g} (y_k - P_k(\theta)) \right]^2}{\sum_{k \in A_g} P_k(\theta)(1 - P_k(\theta))} .

One of the problems with these statistics is that the effects of the estimation of the parameters are not taken into account. However, as above, a test based on the UB-statistic can be defined as an LM test (Glas & Dagohoy, 2003).
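Sketches of W and of the subset-based statistic as reconstructed above (Python; p_i holds the item probabilities evaluated at the respondent's ability estimate, and the 1/(G−1) normalization of UB is part of the reconstruction, to be checked against Smith, 1985, 1986):

```python
import numpy as np

def w_stat(y_i, p_i):
    """Wright-Stone W for one respondent: summed squared residuals over
    the summed item response variances."""
    return np.sum((y_i - p_i) ** 2) / np.sum(p_i * (1 - p_i))

def ub_stat(y_i, p_i, subsets):
    """Subset-based statistic: squared summed residuals per subset A_g,
    standardized and averaged over the G subsets."""
    total = 0.0
    for idx in subsets:                 # idx: indices of the items in A_g
        r = np.sum(y_i[idx] - p_i[idx])
        total += r ** 2 / np.sum(p_i[idx] * (1 - p_i[idx]))
    return total / (len(subsets) - 1)
```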

Person-fit statistics can also be computed in a fully Bayesian framework. Above it was outlined that in the Bayesian approach, the posterior distribution of the parameters of the 3PNO model can be simulated using a Markov chain Monte Carlo (MCMC) method.

Person fit can then be evaluated using a posterior predictive check based on an index T(y, ξ), where y stands for the ensemble of the response patterns and ξ stands for the ensemble of the model parameters. When the Markov chain has converged, draws from the posterior distribution can be used to generate model-conform data y^rep and to compute a so-called Bayes p-value defined by

p_B = P\left( T(y^{rep}, \xi) \geq T(y, \xi) \mid y \right) = \iint 1\left[ T(y^{rep}, \xi) \geq T(y, \xi) \right] p(y^{rep} \mid \xi) \, p(\xi \mid y) \, dy^{rep} \, d\xi .

So person fit is evaluated by computing the relative proportion of replications, that is, draws of ξ from the posterior distribution p(ξ | y), in which the person-fit index computed using the data, T(y, ξ), has a smaller value than the analogous index computed using data generated to conform to the IRT model, that is, T(y^rep, ξ). Posterior predictive checks are constructed by inserting the UB and UD statistics for T(y, ξ). After the burn-in period, when the Markov chain has converged, in every nth iteration the current draw of ξ is used to generate a replicated data set y^rep, and the indices T(y, ξ) and T(y^rep, ξ) are compared.
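A posterior predictive check for person fit might then be sketched as follows (assuming posterior draws of the item and ability parameters are available from an earlier MCMC run; a 2PL response function and the W index are used here purely for illustration, whereas the text uses the 3PNO model with the UB and UD statistics):

```python
import numpy as np

def _w(y, p):
    """Person-fit index T: the W statistic sketched above."""
    return np.sum((y - p) ** 2) / np.sum(p * (1 - p))

def bayes_p_value(draws, y_i, rng=None):
    """Bayes p-value for one respondent: the proportion of posterior draws
    in which T on replicated data is at least T on the observed data.
    `draws` is an iterable of (a, b, theta_i) posterior draws."""
    rng = rng if rng is not None else np.random.default_rng(0)
    exceed, n = 0, 0
    for a, b, th in draws:
        p = 1.0 / (1.0 + np.exp(-a * (th - b)))     # probabilities at this draw
        y_rep = (rng.random(p.shape) < p).astype(int)
        exceed += _w(y_rep, p) >= _w(y_i, p)
        n += 1
    return exceed / n   # values near 0 or 1 signal person misfit
```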