Basic Elements of Educational Measurement
6.5 Test Formats
6.5.1 Selected response formats
Selected response items require the student to choose an alternative from a pre-specified list of alternatives. There are three basic versions.
a) True-false items. The item consists of a statement and the student has to determine whether this statement is true or false.
b) Multiple-choice items. The item consists of a stem and a number of response alternatives. One of the alternatives is the correct answer; the others are so-called distractors. The student has to pick the correct alternative or the best alternative.
c) Matching items. The item consists of two lists of alternatives, and the student has to match alternatives from each list.
These three formats will be discussed further below. In addition to the three basic formats, a number of more complex selected response formats have been developed that are aimed at testing complex thinking and problem solving.
d) Context-dependent item sets. In this format a number of selected response items with a basic format are organized in some larger framework. For instance, a test of language comprehension may consist of a number of texts, and the comprehension of each text is assessed using a number of multiple-choice items nested under the text. Item sets are described using labels such as interpretative exercises, scenarios, vignettes, item bundles, problem sets, super-items, and testlets.
Finally, the widespread use of computers has opened up a whole new range of possibilities in testing. The three basic formats as well as context-dependent item sets can be adapted straightforwardly for presentation on a computer. However, the use of computers has also made possible a number of new item types, which are discussed below under the heading “Innovations”.
True-false items
Compared to multiple-choice items, true-false items are relatively easy to construct, because there is no need to construct response alternatives. It is essential that the statement in the stem can only be classified correctly as true or false if the student really has knowledge or understanding of the content. That is, a student without the proper knowledge and understanding should not be able to infer the truth of the statement through intelligent use of unintended clues. One of the main faults in this format is a wording of the statement that closely matches the wording used in the instructional materials.
In that case the item no longer measures knowledge or understanding but memory and rote recall.
The main advantage of true-false items is that a good content coverage can be achieved, because many items can be administered in a relatively short time. A point often made against using true-false items is that they are only suited for testing knowledge. However, Ebel and Frisbie (1991) and Frisbie and Becker (1991) give a number of suggestions on how to use true-false items to measure higher thought processes.
Another point of criticism of true-false items is that the probability of guessing the correct response is 50%. This high guessing probability should, of course, be taken into account when determining a cut-off score. On the other hand, the guessing probability is the same for all students, so it does not systematically distort the ordering of the students’ performances. In the next chapter it is explained that the guessing probability is negatively related to test reliability, with the consequence that the number of items that must be administered increases with the guessing probability.
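To make the effect of the 50% guessing probability concrete, the following small computation is added here as an illustration; the test length of 50 items and the naive cut-off of 30 correct are arbitrary choices, not values from the text. It shows that blind guessing alone produces scores above such a cut-off with non-negligible probability.

# Chance-score distribution on a true-false test: a student who guesses
# blindly answers each item correctly with probability 0.5, so the total
# score follows a binomial distribution. (Illustrative numbers only.)
from math import comb

n_items, p_guess = 50, 0.5
mean_chance = n_items * p_guess                          # expected chance score: 25
sd_chance = (n_items * p_guess * (1 - p_guess)) ** 0.5   # about 3.5

# probability that a blind guesser reaches a naive cut-off of 30 correct
cutoff = 30
p_reach_cutoff = sum(
    comb(n_items, k) * p_guess ** k * (1 - p_guess) ** (n_items - k)
    for k in range(cutoff, n_items + 1)
)
print(mean_chance, round(sd_chance, 2), round(p_reach_cutoff, 3))  # roughly 25, 3.5, 0.10

This is one way of seeing why a high guessing probability must be compensated by a higher cut-off score or by administering more items.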
A last important point concerns the context in which the true-false items function.
When students have the opportunity to peruse the items in advance, they can attain remarkably high scores with a strategy in which they study only the items in the “true” category during test preparation and merely recognize these items during the test itself. Hence, the rationale that mastering all items in a large (public) item bank is analogous to mastering the target domain cannot be used in this context.
Multiple-choice items
Multiple-choice items are so common that many novices in education are unaware of the specific difficulties attached to this format. However, experience in large-scale high-stakes testing programs shows that it takes substantial training before item writers produce items that function without problems in a high-stakes setting. Quality criteria for multiple-choice items and item-writing rules can be found in Haladyna (1994, 1997); important older contributions are Ebel (1951), Wesman (1971), and Woods (1977). In general, the items should be accurate and valid, there should be one and only one correct response, the alternatives should be mutually exclusive, and the wording of the stem and the response alternatives should be unambiguous. Further, all options should be plausible and attractive to students who lack the specific knowledge or understanding addressed by the item. Intelligent and test-wise students should not be given clues to the correct alternative that are not based on the actual domain measured. Common clues are a correct response that is longer or worded differently than the incorrect alternatives. Finally, in most situations the possibility of decreasing the guessing probability by increasing the number of response alternatives is very limited. When the test constructor runs out of proper alternatives, highly illogical or even corny alternatives are added that students can eliminate on sight.
Properly constructed multiple-choice items have many advantages: the test is standardized and can be objectively scored, and the possibility of administering many items supports content coverage. The two main disadvantages are that the construction of proper items is time consuming, and that the format is unsuitable for testing proficiencies that require writing skills, the presentation of arguments, or mathematical reasoning, and for testing performance in real-life situations.
Matching items
Matching items are often recommended for testing associations, definitions, or characteristics or examples of concepts (Haladyna, 1994). Further, matching items are efficient because several questions are implicitly integrated into one item, and since the format does not entail the construction of distractors, matching items are easier to construct than common multiple-choice items.
In practice, there is a tendency to make both lists of choices equally long. This has the disadvantage that making one wrong combination automatically induces additional errors. A solution is to have the students match the elements of a short list to the elements of a much longer list. Constructing two lists of matching options that are logically homogeneous and homogeneous with respect to difficulty level is no minor task, so successful practical examples of this format are not numerous.
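A small simulation can make both points concrete; the list lengths used below (five premises matched against either five or ten response options, with options reusable in the longer list) are illustrative assumptions only.

# Blind guessing on matching items. With two equally long lists and a
# one-to-one matching, the guessed answer is a random permutation, so a
# score of exactly n - 1 is impossible: one misplaced element always
# drags at least one other with it. With a short premise list matched
# against a longer response list (options reusable), errors are independent.
import random
from collections import Counter

def guess_equal_lists(n, trials=100_000, rng=random.Random(0)):
    """Score distribution when both lists have n elements."""
    scores = Counter()
    for _ in range(trials):
        perm = rng.sample(range(n), n)                    # random one-to-one matching
        scores[sum(i == p for i, p in enumerate(perm))] += 1
    return scores

def guess_short_vs_long(n, m, trials=100_000, rng=random.Random(0)):
    """Score distribution when n premises are matched against m options."""
    scores = Counter()
    for _ in range(trials):
        scores[sum(rng.randrange(m) == i for i in range(n))] += 1
    return scores

print(guess_equal_lists(5))        # a score of 4 out of 5 never occurs
print(guess_short_vs_long(5, 10))  # a score of 4 out of 5 now occurs

In the equal-lists design a guessing student's errors thus come in clusters, which complicates the interpretation of part scores.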
Context-dependent item sets
A context-dependent item set consists of context material followed by a set of selected response items, usually multiple-choice items. The context material may be in textual form, such as a problem, scenario, or case study, or in pictorial form, such as a photo, figure, table, or chart. As an example, Haladyna (1992) considers a so-called vignette, which is a problem-solving context in which examinees respond with decisions or actions via multiple-choice items. The main motivation for using context-dependent item sets is that they can be used for evaluating higher-order thinking, such as problem solving and critical thinking. The analysis of results from tests consisting of context-dependent item sets requires specific psychometric models to account for the variability of response behavior within and between the item sets. In the psychometric literature, these models are referred to as testlet models or item bundle models (Wainer & Kiely, 1987). These models will be further explored in the next chapter.
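To give an impression of what such models look like, one common formulation (offered here only as an illustration, not necessarily the version treated in the next chapter) extends a standard item response model with a person-by-testlet effect:

\[
P(X_{ij} = 1 \mid \theta_j, \gamma_{j d(i)}) \;=\;
\frac{\exp\{a_i(\theta_j - b_i - \gamma_{j d(i)})\}}
     {1 + \exp\{a_i(\theta_j - b_i - \gamma_{j d(i)})\}},
\]

where \(\theta_j\) is the proficiency of student \(j\), \(a_i\) and \(b_i\) are the discrimination and difficulty of item \(i\), and \(\gamma_{j d(i)}\) is an effect specific to student \(j\) and the testlet \(d(i)\) containing item \(i\); the variance of these testlet effects captures the extra dependence among items that share the same context.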
Innovations
Roughly speaking, the innovations in item types made possible by computers fall into three categories: the mode of presentation, the response mode, and item generation.
Item presentation
One of the major advantages of administering tests via the computer is the possibility of using non-text media in items. This may increase the authenticity of the test. For instance, audio presentations can be integrated in tests of listening skills in language and music.
Examples are tests of the English proficiency of non-native speakers (ACT, Inc., 1999; ETS, 1998; Godwin, 1999; Nissan, 1999), tests of listening skills for employees and professionals, and a listening comprehension test investigated by the Law School Admissions Council for possible inclusion in its exam program (ACT, Inc., 1998). Using video can also enhance task authenticity in tests. An interesting application is the video-based test of conflict resolution skills by Olson-Buchanan, Drasgow, Moberg, Mead, Keenan, and Donovan (1998).
Response mode
Traditional selected-response item types require students to mark a correct response alternative. Computer presentation can broaden this basic principle in various ways. One may ask students to click on the proper sentence in a reading passage, to select a part of a graphic, or to make selections in a database (Parshall, Stewart, & Ritter, 1996). In a placement test for adult basic education in the Netherlands, Verschoor and Straetmans (2000) use mathematics items where students have to select points in a figural histogram, on a scale, or on a dial.
For an extensive overview of innovative item types in computerized testing, the reader is referred to Parshall, Davey, and Pashley (2000).
Item shells and item cloning
In item-cloning techniques (see, for instance, Bejar, 1993, or Roid & Haladyna, 1982) operational items are derived from “parent items” via one or more transformation rules.
These parent items have been known as “item forms”, “item templates”, or “item shells”, whereas the items generated from them are now widely known as “item clones”.
Closely related to this approach are so-called “replacement set procedures” (Millman &
Westman, 1989), where test items are generated from a parent item by the computer. In this approach, the computer puts the answers to multiple-choice items in random order, picks distractors from a list of possible wrong answers, and, in numerical problems, substitutes random numbers at specific spots in the item stem and adjusts the alternatives accordingly. Items are thus generated “on-the-fly”, that is, the computer generates a new version of the item for every student.
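As a concrete sketch of such a procedure (the percentage problem, the error-based distractors, and the function name below are illustrative assumptions, not a published generator), a parent item for a simple numerical problem could be instantiated as follows.

# On-the-fly generation from a parent item: random numbers are substituted
# at fixed spots in the stem, distractors are derived from plausible wrong
# solution strategies, and the alternatives are put in random order.
# All content choices here are illustrative only.
import random

def generate_item(seed=None):
    rng = random.Random(seed)
    price = rng.randrange(20, 200)               # number substituted in the stem
    discount = rng.choice([10, 15, 20, 25, 30])  # second substituted number
    stem = (f"A coat costs {price} euros and is sold at a {discount}% discount. "
            f"What is the sale price in euros?")

    correct = round(price * (1 - discount / 100), 2)
    distractors = {
        round(price * discount / 100, 2),        # reports the discount, not the price
        round(price - discount, 2),              # treats the percentage as an amount
        round(price * (1 + discount / 100), 2),  # adds instead of subtracts
    }
    distractors.discard(correct)                 # guard against accidental coincidences
    options = [correct] + sorted(distractors)
    rng.shuffle(options)                         # answers in random order
    key = "ABCD"[options.index(correct)]
    return stem, options, key

# every call (i.e., every student) gets a fresh version of the item
stem, options, key = generate_item()
print(stem)
for label, value in zip("ABCD", options):
    print(f"{label}. {value}")
print("Key:", key)

Whether versions generated in this way can be treated as statistically exchangeable is exactly the question addressed next.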
An important question is whether clones and items generated on the fly from the same parent item have comparable statistical characteristics. Empirical studies addressing this question are reported in, for example, Hively, Patterson, and Page (1968), Macready (1983), Macready and Merwin (1973), and Meisner, Luecht, and Reckase (1993). The general impression from these studies is that the variability between clones from the same parent is much smaller than the variability between parents, but not small enough to justify the assumption of identical values. Of course, the size of the remaining variability depends on various factors, such as the type of knowledge or skill tested and the implementation of the item-cloning technique. Psychometric models for analyzing results of tests based on item shells and clones will be discussed in the next chapter.