Applications of Measurement Models
8.1 Test Equating and Linking of Assessments
8.1.1 Data collection designs
8.1.2 Multi-stage testing
In the previous section, it was mentioned that missing responses could be the result of many possible causes: presenting items in an incomplete design, skipping items, or not reaching items. Whenever data are collected, however carefully, the possibility, origin and treatment of “missing responses” should be considered. Even if care is taken to ensure that all appropriate respondents are contacted and provide some data, responses on individual variables may be missing, uncodeable or in a category such as ‘don’t know’ or ‘not applicable’. If missing observations are present, then the mechanism causing the incompleteness in the data can be characterized according to its degree of randomness.
Rubin (1976) described and named a number of types of mechanism. Let D be the missing data indicators; in the present case, D can be viewed as a matrix whose entries are the missing data indicators $d_{ik}$ defined by Formula (3). Further, a distinction is made between the observed data $y_{obs}$, say the observed response patterns of the students, and the unobserved or missing data $y_{miss}$, say the parts of the person-by-item matrix where the related design variable $d_{ik}$ equals zero. Following Rubin, data are missing at random (MAR) if the distribution of the design does not depend on the missing data, that is,
$$p(D \mid y_{obs}, y_{miss}, \phi, x) = p(D \mid y_{obs}, \phi, x),$$
where φ is a vector of the parameters of the missing data process, and x are covariates that might also determine the missing data process. So the data are MAR if the variables determining the missingness are all observed. In a likelihood-based framework, there is an additional requirement that the space of the parameters of interest (say the item, person and population parameters) and the space of the parameters of the missing data process should be distinct. If MAR and distinctness hold, then maximizing the likelihood of the actually observed data is equivalent to a maximization that takes the missing data process into account. That is, we can use the actually observed data alone to obtain estimates of the parameters of interest. In a Bayesian framework, besides MAR, it should also hold that the prior of the parameters of interest and the prior of the parameters of the missing data process φ are independent. In that case, inferences based on the posterior given the actually observed data suffice.
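The step from MAR and distinctness to "the observed data alone suffice" can be written out as a short derivation. The following is a sketch, not a quotation of the original argument: the symbol λ for the parameters of interest is introduced here for illustration only, and the integral is to be read as a sum when the missing responses are discrete.

```latex
\begin{align*}
p(y_{obs}, D \mid \lambda, \phi, x)
  &= \int p(y_{obs}, y_{miss} \mid \lambda, x)\,
          p(D \mid y_{obs}, y_{miss}, \phi, x)\, dy_{miss} \\
  &= p(D \mid y_{obs}, \phi, x)
     \int p(y_{obs}, y_{miss} \mid \lambda, x)\, dy_{miss}
     \qquad \text{(MAR)} \\
  &= p(D \mid y_{obs}, \phi, x)\; p(y_{obs} \mid \lambda, x).
\end{align*}
```

Under MAR the first factor does not involve the missing data, and under distinctness it does not involve λ either, so it acts as a constant when the likelihood is maximized over λ.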
This has various implications. MAR does not hold, for example, in the following situations:
1) If difficult items are differentially skipped by high and low ability students;
2) If a time limit is imposed and speed is correlated with ability;
3) If the test administration design is based on a priori estimates of ability, or on other covariates that correlate with ability, and these estimates or covariates are not part of the model.
However, there are situations where MAR does hold that are very useful. We will discuss the case of response-contingent designs, such as multi-stage testing and computerized adaptive testing.
Consider the design of Figure 8.4. In this design, all respondents are administered a so-called routing test, say a test of 10 items. If a respondent’s score is less than or equal to 5, an easy follow-up test is administered; if the score is more than 5, a difficult test is administered. The procedure is motivated by the fact that matching the ability level of the respondents with the difficulty level of the items results in optimization of the precision of both the item and ability parameter estimates, as was shown in Section 5.2.5.
In this case, MML estimates of the item and population parameters are consistent because the data are MAR: the design is completely determined by the observed sum scores on the routing test. A small simulated example may illustrate this further. Consider the item parameter estimates in Table 8.1. The design was as in Figure 8.4: the routing test consisted of 10 items, and the two follow-up tests consisted of 5 items each. The 1PLM was used to generate the data of 2000 respondents. The ability parameters had a standard normal distribution. From the true item parameters in the second column, it can be seen that the first follow-up test was easy, while the second was difficult. The MML estimation procedure was used to obtain the item parameter estimates. Note that the response-contingent design did not bias the estimates.
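The data-generating side of this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for Table 8.1: the item ordering (items 1 to 10 as routing test, 11 to 15 as easy and 16 to 20 as difficult follow-up test), the function names and the random seed are all choices made here, and the MML estimation step is not included. The point of the sketch is that the design indicators are a function of the observed routing score only, which is what makes the data MAR.

```python
import math
import random

# True difficulties as listed in Table 8.1 (assumed item order: routing test,
# easy follow-up test, difficult follow-up test).
B = [-1.0, -0.5, 0.0, 0.5, 1.0, -1.0, -0.5, 0.0, 0.5, 1.0,
     -1.0, -0.5, 0.0, -0.5, -1.0,
     1.0, 0.5, 0.0, 0.5, 1.0]
ROUTING, EASY, HARD = range(0, 10), range(10, 15), range(15, 20)
CUTOFF = 5  # routing score <= 5 leads to the easy follow-up test


def rasch_response(theta, b, rng):
    """Draw one 0/1 response under the 1PLM."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return 1 if rng.random() < p else 0


def administer(theta, rng):
    """Return (d, y): design indicators and responses (None if not given)."""
    d = [0] * len(B)
    y = [None] * len(B)
    for i in ROUTING:
        d[i], y[i] = 1, rasch_response(theta, B[i], rng)
    # The branching rule uses only observed data: the routing sum score.
    routing_score = sum(y[i] for i in ROUTING)
    follow_up = EASY if routing_score <= CUTOFF else HARD
    for i in follow_up:
        d[i], y[i] = 1, rasch_response(theta, B[i], rng)
    return d, y


def simulate(n=2000, seed=1):
    """Simulate n respondents with standard normal abilities."""
    rng = random.Random(seed)
    return [administer(rng.gauss(0.0, 1.0), rng) for _ in range(n)]


data = simulate()
n_easy = sum(1 for d, _ in data if d[10] == 1)
print("easy booklet:", n_easy, "difficult booklet:", len(data) - n_easy)
```

Because the seed and the random number generator differ from the original simulation, the booklet frequencies will not match those reported in Table 8.2.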
Figure 8.4 Two-stage testing design.
Table 8.1 MML Item Parameter Estimates Obtained in a Multi-Stage Testing Design.
Item  True b  Estimated b  Se(b)
1 −1.0 −.901 .039
2 −.5 −.460 .037
3 .0 .026 .034
4 .5 .479 .037
5 1.0 1.038 .042
6 −1.0 −1.012 .043
7 −.5 −.542 .041
8 .0 .030 .033
9 .5 .467 .039
10 1.0 .968 .043
11 −1.0 −1.089 .076
12 −.5 −.536 .071
13 .0 −.093 .069
14 −.5 −.436 .065
15 −1.0 −1.066 .075
16 1.0 1.054 .073
17 .5 .593 .070
18 .0 .077 .069
19 .5 .490 .065
20 1.0 1.099 .075
The estimates of the ability parameters and their standard errors are given in Table 8.2. The estimates were obtained by weighted maximum likelihood with the MML item parameter estimates imputed as constants. Note that a certain observed score on the second booklet represents a higher ability level than the same score on the first booklet. This is as expected, because the second booklet was more difficult. In Table 8.1, it can be seen that the mean difficulty of the first booklet is −.75, while the mean difficulty of the second booklet is .75. In Table 8.2, it can be seen that −.75 and .75 are indeed the locations on the latent scale where the respondents administered the first and second booklet attain the smallest standard errors.
Table 8.2 Ability Estimates Obtained in a Multi-Stage Testing Design.
Booklet 1 Booklet 2
Score Freq θ Se(θ) Freq θ Se(θ)
0 13 −3.99 1.83 0 −3.47 1.85
1 28 −2.80 .96 0 −2.26 .97
2 57 −2.19 .75 0 −1.63 .76
3 81 −1.75 .65 0 −1.16 .67
4 112 −1.38 .60 0 −.78 .61
5 171 −1.06 .57 0 −.44 .58
6 186 −.76 .55 20 −.13 .56
7 193 −.47 .55 96 .15 .55
8 162 −.18 .55 123 .45 .55
9 105 .10 .56 158 .74 .56
10 38 .41 .58 132 1.04 .57
11 0 .75 .61 130 1.37 .60
12 0 1.12 .66 86 1.74 .66
13 0 1.587 .764 53 2.187 .755
14 0 2.215 .972 46 2.800 .963
15 0 3.426 1.851 10 3.995 1.837
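The mapping from booklet scores to ability estimates in Table 8.2 can be illustrated with a small sketch of weighted maximum likelihood estimation for the 1PLM. Several assumptions are made here and should be read as such: the true difficulties from Table 8.1 are used instead of the MML estimates, the estimating equation is the usual Warm-type weighted likelihood equation solved by bisection, and the standard error is approximated by the inverse square root of the test information, which need not be the formula behind the table. The function names are hypothetical, and the output is not expected to reproduce Table 8.2 exactly.

```python
import math

ROUTING_B = [-1.0, -0.5, 0.0, 0.5, 1.0, -1.0, -0.5, 0.0, 0.5, 1.0]
BOOKLET_1 = ROUTING_B + [-1.0, -0.5, 0.0, -0.5, -1.0]  # routing + easy test
BOOKLET_2 = ROUTING_B + [1.0, 0.5, 0.0, 0.5, 1.0]      # routing + difficult test


def _moments(theta, bs):
    """Expected score, test information I and the term J of the WLE equation."""
    ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in bs]
    expected = sum(ps)
    info = sum(p * (1.0 - p) for p in ps)
    j = sum(p * (1.0 - p) * (1.0 - 2.0 * p) for p in ps)
    return expected, info, j


def wle(score, bs, lo=-15.0, hi=15.0, tol=1e-8):
    """Solve score - E(score) + J / (2 I) = 0 for theta by bisection."""
    def g(theta):
        expected, info, j = _moments(theta, bs)
        return score - expected + j / (2.0 * info)

    # g is positive at the lower bound and negative at the upper bound for
    # every score from 0 to len(bs), so zero and perfect scores get finite
    # estimates.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    _, info, _ = _moments(theta, bs)
    return theta, 1.0 / math.sqrt(info)  # information-based SE (an assumption)


for s in range(16):
    t1, se1 = wle(s, BOOKLET_1)
    t2, se2 = wle(s, BOOKLET_2)
    print(f"{s:2d}  {t1:6.2f} {se1:5.2f}   {t2:6.2f} {se2:5.2f}")
```

In this sketch, as in the table, the same score on the second booklet maps to a higher ability estimate than on the first booklet, and extreme scores receive finite estimates with large standard errors.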
The example shown here is a two-stage testing design. Of course, the design can be branched further, for instance, with four tests in the third stage, eight tests in the fourth stage, etc. A limiting case of multi-stage testing is computerized adaptive testing (CAT). Here, every test administered consists of one item, and every item administered is selected from an item bank in such a way that the item parameters and the running estimate of ability are matched to obtain maximum precision. A good introduction to CAT can be found in the introductory volume edited by Wainer (1990); for a more advanced overview, refer to van der Linden and Glas (2000).
With the advent of powerful computers, the application of CAT in large-scale high-stakes testing programs has expanded rapidly. Well-known examples in the United States are the nursing licensing exam (NCLEX/CAT) of the National Council of State Boards of Nursing and the Graduate Record Examination (GRE). Since then, many other large-scale testing programs have followed. It seems safe to state that at the moment the majority of large-scale testing programs either have already been computerized or are in the process of becoming so. The main motivations for CAT are: (1) CAT makes it possible for students to schedule tests at their convenience; (2) tests are taken in a more comfortable setting and with fewer people around than in large-scale paper-and-pencil administrations; (3) electronic processing of test data and reporting of scores is faster; and (4) wider ranges of questions and test content can be put to use (Educational Testing Service, 1996). In the current CAT programs, these advantages have certainly been realized and appreciated by the examinees. When offered the choice between a paper-and-pencil and a CAT version of the same test, most examinees typically choose the CAT version.
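The item-selection step of CAT mentioned in this section, matching item parameters to the running ability estimate, can be sketched for the 1PLM as follows. This is an illustrative maximum-information rule only: the item bank, the function names and the current ability estimate are hypothetical, and operational CAT programs typically add constraints such as content balancing and exposure control that are not shown here.

```python
import math


def item_information(theta, b):
    """Fisher information of a 1PLM item with difficulty b at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)


def select_next_item(theta_hat, bank, administered):
    """Return the index of the unused item with maximum information.

    For the 1PLM this is the item whose difficulty lies closest to the
    running ability estimate theta_hat.
    """
    candidates = [i for i in range(len(bank)) if i not in administered]
    if not candidates:
        raise ValueError("item bank exhausted")
    return max(candidates, key=lambda i: item_information(theta_hat, bank[i]))


# Hypothetical item bank of difficulties and a running ability estimate.
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
next_item = select_next_item(0.4, bank, administered={4})
print("next item index:", next_item, "difficulty:", bank[next_item])
```

In a full CAT loop, the ability estimate would be updated after each response and the selection step repeated until a stopping rule is met.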