Basic Elements of Educational Measurement
these two are no longer relevant. Therefore, they are removed from the denominator, which leads to the coefficient

(4)

The final remark of this section pertains to the relation between reliability as it is defined here using CTT and the alternative definition based on IRT that will be discussed in the next chapter. Reliability in CTT pertains to the extent to which students can be distinguished on the basis of their test scores. For this, it is essential that the true scores of the students vary: the reliability coefficients defined above by formulas (2), (3) and (4) are all equal to zero if the variance of the true scores is zero. This dependence of the reliability on the variance of the true scores can be misused. Consider a test for visual ability administered to a sample of eight-year-old children enrolled in regular schools. It turns out that the reliability is too low to make an impression on test publishers. The test constructor therefore adds a sample of eight-year-old children enrolled in schools for the visually handicapped. These children score very low, which inflates the score variance and leads to a much higher reliability. As such, this circumstance does not invalidate the concept of reliability; what the test constructor has done is change the definition of the population of interest. The first coefficient relates to distinguishing among the non-handicapped children, while the second relates to distinguishing visually impaired children from children who are not visually impaired. The point is that CTT test and item indices must always be interpreted relative to some population. In the next chapter, it will be shown that the main motivation for the development of IRT is separating the effects of tests and items on the one hand from those of the population of students on the other. It will be shown that this leads to another definition of reliability.
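To make the dependence of reliability on the variance of the true scores concrete, recall the basic CTT decomposition of the observed-score variance into true-score variance and error variance; the illustration below uses purely hypothetical numbers and does not refer to any particular one of the coefficients defined above:

\rho_{XX'} \;=\; \frac{\sigma^2_\tau}{\sigma^2_X} \;=\; \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_E},
\qquad
\frac{1}{1+9} = 0.10
\quad \text{versus} \quad
\frac{36}{36+9} = 0.80 .

With the error variance held fixed at 9, inflating the true-score variance from 1 to 36 by adding a very low-scoring group raises the coefficient from 0.10 to 0.80, which is exactly the mechanism described above.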
6.7 Assessment Systems

Computerized assessment systems have been developed, for instance, at the Dutch Open University (the SYS system) and at the National Institute for Educational Measurement (Cito, the Cito-TSS system and the pupil monitoring system).
Another important influence of information technology derives from its power to control multi-media environments, which can be used to integrate auditory and visual media in the testing process. Multi-media environments and virtual reality are already widely used in education for simulation purposes and for educational testing.
The purpose of this section is to present an overview of the elements of educational assessment systems or computerized testing service systems (TSS). Today’s practice shows many different applications, and needs vary greatly from one user to another.
On the one hand, there are network versions, run by professionals ranging from item writers to psychometricians and processing hundreds of students in test centers daily; on the other hand, there are systems implemented on a personal computer and used by individual teachers to support their educational work. This means that there is not just one TSS that suits everybody; in fact, for some users certain elements of a TSS may not be of interest at all. For instance, the individual teacher who processes the data of classes may not have much use for advanced psychometrics, if only because the samples obtained in classes are too small for computing reliable test and item statistics. In this section, the TSS will be sketched in a broad perspective, meaning that an overview of the aspects and relations of a TSS is presented and details are mostly left aside. The aim is to offer potential users a framework for deciding which elements play a role in their specific situation. This overview is given by considering a theoretical model of a TSS, developed both at the Dutch Open University (Breukers et al., 1992) and at Cito (Glas, 1997). Many of the features discussed here are implemented in the Cito-TSS, which is a large-scale, professionally oriented network system.
Further, it will be indicated which commercial and non-commercial systems developed by others may play a role in computerizing one’s testing process.
Figure 6.1 Overview of an assessment system.
Though the focus of this section is mainly on educational measurement, most of the material presented here also applies to other fields of assessment. An overview of the application of computerized testing in the field of psychological assessment can be found in Butcher (1987).
An educational assessment system, or testing service system, is an integrated, computerized system for the construction and storage of items, the assembly and delivery of tests, and the analysis of the test results. The structure of the TSS follows from the various activities that can be distinguished in the functional flow of the testing process.
These activities are item banking, item construction, test assembly, test administration, test analysis, and calibration. A distinct module supports every activity in the system; the complete system is depicted in Figure 6.1. The modules are given numbers related to the order in which they come into play.
6.7.1 Item banking
Item banking supports storage, maintenance and retrieval of items. An item is viewed in the broadest sense, so it may be any task or series of tasks presented to the students for the purpose of measurement. This may include items for paper-and-pencil administration or computer-based test administration and tasks for performance assessments. Apart from the actual test material, the stored information may also include scoring instructions and feedback, both for the student and the test administrator.
Before the items can be entered into the system, a so-called item bank structure must be defined. The variables in the item bank structure will generally reflect the variables used in test specifications, so items may be categorized with respect to content matter and the level of cognitive behavior tested. Since items and tasks may be used in different situations, the item bank structure will probably be broader than the table of specifications of a specific test. Further, since an item bank usually covers a whole range of tests, possible relations between items must be stored. For instance, a cluster of items may belong to the same case and must always be presented together, or two items may be too much alike to appear in the same test.
Finally, the item bank may also store empirical information, such as the frequency of use of the items or the groups of students to which certain items were administered. Empirical information also includes psychometric information, such as statistics on item difficulty and on the ability level of the populations that responded to the item. Psychometric information can be used for norming new assessments assembled from the item bank.
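To fix ideas, the sketch below shows what a minimal item record and a simple retrieval function might look like in Python; the class and field names are hypothetical and do not correspond to the Cito-TSS or to any other existing system.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Item:
    item_id: str
    stem: str                                 # the actual test material
    content_category: str                     # subject-matter classification
    cognitive_level: str                      # level of cognitive behavior tested
    scoring_instruction: str = ""
    feedback: str = ""
    cluster_id: Optional[str] = None          # items of the same case are presented together
    enemies: List[str] = field(default_factory=list)   # items too much alike to co-occur in a test
    times_used: int = 0                       # empirical information
    p_value: Optional[float] = None           # classical difficulty from earlier administrations
    irt_difficulty: Optional[float] = None    # calibrated item parameter, if available

@dataclass
class ItemBank:
    items: Dict[str, Item] = field(default_factory=dict)

    def add(self, item: Item) -> None:
        self.items[item.item_id] = item

    def select(self, content_category: str, cognitive_level: str) -> List[Item]:
        # Retrieve the items in one cell of the item bank structure.
        return [it for it in self.items.values()
                if it.content_category == content_category
                and it.cognitive_level == cognitive_level]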
6.7.2 Item construction
Item writing for paper-and-pencil test administration is nowadays generally done using a word processor. This has several advantages, such as the availability of layout facilities, easy integration of text and graphics and availability of a spelling checker and a thesaurus. Linking a word processor with the item bank creates the possibility of improving the quality of the items to be written by accessing existing items with a comparable item classification and the empirical information gathered from administration of these items.
Millman and Westman (1989) give an overview of the further possibilities of computerizing the process of item writing. They distinguish five levels of automation of item writing, labeled the author-supplied approach, replacement-set procedures, computer-supplied prototype items, subject-matter mapping and discourse analysis.
The Author-supplied approach is item writing by an author using the word processor as a tool for checking spelling, integrating text and graphics, importing layout macros, etc. In its ultimate form, the author-supplied approach entails an expert and management system that supports defining item structures and tables of specifications for tests, writing parallel items, item reviewing, formal acceptance of the items, etc.
In Replacement-set procedures, test items are generated from a blueprint by the computer. As already outlined in a previous section, this entails that the computer puts the answers to multiple-choice items in random order, picks distractors from a list of possible wrong answers and, in numerical problems, substitutes random numbers at a specific spot in the item stem.
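As an illustration, the following Python sketch generates a multiple-choice arithmetic item from such a blueprint; the stem, the number ranges and the rules for the distractors are invented for the example.

import random

def generate_multiplication_item(rng: random.Random):
    # Substitute random numbers in the item stem.
    a, b = rng.randint(12, 49), rng.randint(12, 49)
    key = a * b
    stem = f"What is {a} x {b}?"
    # Pick three distractors from a list of plausible wrong answers.
    wrong_answers = sorted({a * b + 10, a * b - 10, a * (b + 1), (a + 1) * b, a + b} - {key})
    distractors = rng.sample(wrong_answers, 3)
    # Put the response alternatives in random order.
    options = distractors + [key]
    rng.shuffle(options)
    return stem, options, options.index(key)

rng = random.Random(42)
stem, options, key_position = generate_multiplication_item(rng)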
In the Computer-supplied prototype items procedure, proposed by Millman and Westman (1989), the author and the computer interact to write the text of the item. Item writers specify the mental operation they wish to measure, in response to which the computer generates a set of prototype items. In a further dialogue, these prototypes are refined into an empty item, which is then supplied with content by accessing a database.
In Subject-matter mapping, the author and the computer work interactively to build a concept taxonomy (Merrill & Tennyson, 1977; see also Minsky, 1975; Novak & Gowin, 1984) in which the key concepts of the achievement domain and their relations are modeled.
Based on this structure and a lexicon provided by the author, the computer generates crude test items for the item writer to review.
In the Discourse analysis approach, attempts were made to construct items directly from text. Wolfe (1976) developed algorithms to transform sentences into test questions, but the resulting items were either very simple or did not make sense, so this line of development seems to have been abandoned.
6.7.3 Item bank calibration
In the following two chapters, the role of measurement models, in particular IRT models, will be outlined in detail. Here, a concise introduction and a review of two important applications of IRT related to item banking will be given: optimal test construction and computerized adaptive testing. IRT models (Rasch, 1960; Birnbaum, 1968; Lord, 1980; Fischer & Molenaar, 1995) are characterized by three features: (1) they pertain to the responses of persons to items, (2) parameter separation, meaning that the influences of items and persons on the responses are modeled by disjoint sets of parameters, say item difficulty parameters and person ability parameters, and (3) the stochastic nature of the responses of persons to items. Item and person parameters need not be scalars; it might well be the case that ability is multidimensional and must be represented by a vector. Parameter separation makes it possible to store item characteristics, that is, item parameters, in an item bank, independent of the characteristics of the sample of students who responded to the items.
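For instance, in the Rasch model, treated in the next chapter, the probability of a correct response depends on one person ability parameter and one item difficulty parameter only; the short Python sketch below illustrates this parameter separation with hypothetical parameter values.

import math

def rasch_probability(theta: float, b: float) -> float:
    # P(X = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)),
    # with theta the person ability and b the item difficulty.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The difficulty b is a property of the item alone and can be stored in the
# item bank; the same value applies whatever the ability distribution of the
# students who happen to take the test.
p = rasch_probability(theta=0.5, b=-0.2)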
An important aspect of IRT models is that they are models and, as a consequence, their legitimacy must be tested. In other words, statistical proof must be presented that the representation of person ability and item difficulty of a specific IRT model sufficiently describes the observed responses of the persons to the items. If the specific IRT model does not fit, another IRT model should be sought. Fortunately, for a large class of IRT models, statistical testing procedures have been developed that will not only evaluate
model fit, but also give information with respect to specific model violations and with respect to the direction in which the IRT model should be altered to obtain a fitting model.
As already outlined above, for sound interpretation and use of educational and psychological measures, evidence of construct validity is essential (see, for instance, Messick, 1975, 1984, 1989; AERA, APA & NCME, 1985). IRT models can be used to describe the relations between the responses to test items at the level of latent variables.
Fit to an IRT model is empirical evidence that the observed responses can be explained by some underlying structure. The latent variables of the IRT model should, of course, be an appropriate representation of the hypothesis of the test constructor with regard to the construct to be measured. For instance, the hypothesis that a unidimensional construct is measured does not comply with a multidimensional IRT model. Having a fitting IRT model may corroborate construct validity, but it does not imply reliability of the test.
However, in the next chapter it will be shown that, given a fitting IRT model, the reliability of a test can be computed. Further, it will be shown that the test objective can also be taken into account in the definition of reliability.
6.7.4 Optimal test assembly
Above, it was outlined that test assembly is based on the definition of a table of specifications. In optimal test assembly, too, the table of specifications plays an important role. The extension here is that the items are selected in such a way that the test is optimal in some psychometric sense. Optimal test assembly can only be carried out if data from previous item administrations are available. Usually, the items are calibrated via an IRT model, and this IRT model is also used for specifying the criteria of optimality. However, procedures have also been proposed in which the criteria are defined in terms of classical test theory.
One of the most important features of IRT models is the fact that the characteristics of persons and items are separately parameterized, which makes it possible to describe the characteristics of a test in terms of the item parameters only. Once the item parameters are considered known through pre-testing, the characteristics of any test or examination constructed from the item bank can be predicted. Another important feature of IRT models is that test information consists of additive and independent contributions of the items (see, for instance, Hambleton, Swaminathan & Rogers, 1991).
The fact that test information is evaluated locally, as a function of ability, makes it possible to construct a test that has optimal measurement properties at a certain ability level. The choice of the ability level of interest depends on the test objective. For a test where the objective is making pass/fail decisions, it is plausible to require that the test has maximal information at the cut-off point. If, on the other hand, the interest is in selecting high-ability or low-ability students, maximal information should be in the high or low region of the latent continuum, respectively.
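A minimal Python sketch of this idea, assuming Rasch-calibrated items (for which the item information at ability theta equals P(theta)(1 - P(theta))) and ignoring the content constraints of the table of specifications; the item bank and the cut-off point are hypothetical.

import math

def rasch_information(theta: float, b: float) -> float:
    # Fisher information of a Rasch item at ability level theta.
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def assemble_test(difficulties: dict, cutoff: float, test_length: int) -> list:
    # Greedy selection of the items with maximal information at the cut-off point.
    ranked = sorted(difficulties,
                    key=lambda item_id: rasch_information(cutoff, difficulties[item_id]),
                    reverse=True)
    return ranked[:test_length]

# Hypothetical calibrated item bank: item identifier -> Rasch difficulty parameter.
bank = {"it01": -1.2, "it02": -0.4, "it03": 0.1, "it04": 0.3, "it05": 1.5}
selected = assemble_test(bank, cutoff=0.2, test_length=3)   # the items nearest the cut-off win

In operational test assembly the same objective is combined with the constraints of the table of specifications, which turns the selection into a constrained optimization problem.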
6.7.5 Computer based testing
In its simplest form, a computer based test need not be more than a paper-and-pencil test with multiple-choice items delivered on a computer system. The items are presented one
at a time, the student keys one of the response alternatives, the computer counts the number-correct score and produces the test result. However, in computer based tests it is possible to offer a much wider variety of item formats and test contents.
Test material may include graphics, even three-dimensional and rotational graphics, split screens, for instance, for showing reading material and questions at the same time, and simulations of real-life situations. The student may be offered support in the form of an on-screen calculator, mathematical tables and reference material. By linking the computer with CD-ROM or CD-I equipment, both the quality of simulations and the quantity of reference material may be increased to a large extent.
The way in which the student must respond to the material can also be widely varied in computer based testing. For multiple-choice items, instead of entering the character of one of the response alternatives, the student can point and click with a mouse to choose an option. Pointing and clicking can also be used in connection with graphic displays, for instance for items that require the student to point out parts of the body, of some machine, etc. This can be extended to having the student shade, block or replace areas of a graphic display. The computer mouse can also be used to have the student draw something, which is then evaluated and scored by the computer. Notice that with this last example the realm of constructed-response formats is entered. A very advanced example of this kind of constructed-response format is the test for licensing architects developed by ETS, in which a complete design environment is incorporated in the test and the student has to produce an architectural design that is evaluated by the computer.
Another salient feature of computer based testing is that it offers the possibility of response-driven branching. Here, the response history of the student determines the next item to be delivered. The criteria for branching may be content-based. For instance, in a diagnostic multiple-choice test, the distractors may be constructed in such a way that the wrong answers reflect specific erroneous lines of reasoning, which can then be analyzed further by presenting the proper follow-up questions. However, branching need not be content-based; as will be outlined in the following section, it can also serve psychometric objectives.
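A content-based branching rule can be as simple as a table that maps a chosen distractor to a follow-up item probing the corresponding erroneous line of reasoning; the item identifiers and rules below are hypothetical.

# Hypothetical diagnostic rule base: (item, chosen distractor) -> follow-up item.
branch_rules = {
    ("frac01", "B"): "frac_diag_adds_denominators",
    ("frac01", "C"): "frac_diag_ignores_numerator",
}

def next_item(item_id: str, response: str, correct_key: str, default_next: str) -> str:
    # Deliver a diagnostic follow-up after a recognized wrong answer,
    # otherwise continue with the regular next item in the test.
    if response != correct_key and (item_id, response) in branch_rules:
        return branch_rules[(item_id, response)]
    return default_next

nxt = next_item("frac01", "B", correct_key="A", default_next="frac02")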
6.7.6 Adaptive testing
Tests assembled using the methodology of the section on test assembly do not depend on the responses given by the student. Although they may be optimal in the sense that they meet all the specifications imposed on the test, such tests do not necessarily provide maximum information for each individual student in the population. Tailoring a test using the student’s responses can be motivated in two ways: minimizing the length of the test and maximizing the information with respect to ability. In the following chapter, it will be shown that the information obtained from an item response has a maximum if the item difficulty parameter (in some sense) matches the ability parameter. So if the ability of a respondent were known, the optimal item could be chosen from the item bank on the basis of the relation between the ability parameter and the item parameters. This suggests the following procedure. First, a (small) number of items is given to obtain an initial estimate of ability; one might choose some items that sufficiently cover the difficulty spectrum of the content matter to be tested. Then, the next item administered is the item with maximal information at the current ability estimate. Following the response to this item, a new estimate of ability is computed using the response pattern obtained thus far. The next item is selected as above, but now using the new ability estimate, and this is repeated until the estimation error is smaller than a pre-specified tolerance.
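The following Python sketch implements this loop for Rasch-calibrated items, using Newton-Raphson for the maximum likelihood estimate of ability and stopping when the standard error of that estimate falls below a tolerance; the item bank, the starting items and the tolerance are hypothetical, and refinements such as exposure control and content balancing are ignored.

import math, random

def p_correct(theta: float, b: float) -> float:
    # Rasch probability of a correct response.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_info(theta: float, b: float) -> float:
    p = p_correct(theta, b)
    return p * (1.0 - p)

def ml_estimate(responses, difficulties, start=0.0):
    # Newton-Raphson for the maximum likelihood estimate of ability under the
    # Rasch model; returns the estimate and its standard error. The estimate is
    # clipped to [-4, 4] because it diverges for all-correct/all-wrong patterns.
    theta = start
    for _ in range(25):
        probs = [p_correct(theta, b) for b in difficulties]
        info = sum(p * (1.0 - p) for p in probs)
        theta += sum(x - p for x, p in zip(responses, probs)) / info
        theta = max(-4.0, min(4.0, theta))
    se = 1.0 / math.sqrt(sum(item_info(theta, b) for b in difficulties))
    return theta, se

def adaptive_test(bank, true_theta, tolerance=0.5,
                  start_items=("it05", "it20", "it35"), rng=random.Random(1)):
    # Administer the start items, then repeatedly select the item with maximal
    # information at the current ability estimate until the standard error of
    # the estimate drops below the tolerance (or the bank is exhausted).
    administered = list(start_items)
    responses = [int(rng.random() < p_correct(true_theta, bank[i])) for i in administered]
    while True:
        theta, se = ml_estimate(responses, [bank[i] for i in administered])
        remaining = [i for i in bank if i not in administered]
        if se < tolerance or not remaining:
            return administered, theta, se
        nxt = max(remaining, key=lambda i: item_info(theta, bank[i]))
        administered.append(nxt)
        responses.append(int(rng.random() < p_correct(true_theta, bank[nxt])))

# Hypothetical calibrated bank: 41 Rasch items with difficulties from -2.0 to 2.0.
bank = {f"it{k:02d}": -2.0 + 0.1 * k for k in range(41)}
items_given, theta_hat, se = adaptive_test(bank, true_theta=0.8)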
Delivery of computer-based tests can be done using standard modules in general-purpose item banking systems such as the Examiner, the CAT-System and MicroCAT.
The last two packages also support adaptive testing. Besides in general-purpose packages, these facilities are also available in packages that are specifically designed to deliver particular tests. For instance, the Accuplacer, developed by ETS for the College Board (ETS, 1990), was designed for administering computerized placement tests. These tests cover reading comprehension, sentence skills, arithmetic skills, elementary algebra skills and college-level mathematics. Besides test administration, the package also includes a placement management system, in which placement advice is given on the basis of the test results, and a placement research service system for updating the placement procedure.
ACT developed a comparable system called Compass (ACT, 1993).
One step further from general-purpose software are the systems developed to support complete examination organizations. A good example is the system for the Graduate Record Examinations (GRE) in America, organized by ETS in combination with Sylvan Learning Systems. The GRE is a computerized adaptive test that is administered in test centers; in 1994 ETS had 245 test centers in operation. Some of the arguments given by ETS (ETS, 1994) for switching to computerized test administration are:
• It makes it possible for students to schedule tests at their convenience, rather than limiting testing to a few unmovable dates;
• Tests can be taken in a more comfortable setting and with fewer people than in large, paper-and-pencil administrations;
• Faster score reporting to the student and electronic processing of results;
• Wider range of questions and test content.
Operating this system, of course, involves more than just software for computer adaptive testing; it involves an organization and systems for scheduling, administration, accounting, identification of students, troubleshooting, reporting and the handling of complaints. Summing up, computer-based test administration is implemented in various models: as general-purpose software, as a medium for specific tests, and as part of an extended examination system.