Basic Elements of Educational Measurement
6.4 Test Specifications
The ultimate goal of testing is to make valid inferences regarding some domain of proficiency or ability. An important step in the creation of assessments is the specification of the content that should be assessed and the level of cognitive behavior that should be targeted. The content and cognitive behavior dimensions can be combined in a so-called table of specifications, where the cell entries give the relative importance of a specific combination of content and cognitive behavioral level. Other important test specifications are the time allowed for the test, the item format and the number of items.
6.4.1 Specification of test content
A specification of test content concerns the identification of specific areas of subject matter that should be included in the test. Millman and Green (1989) consider five substantive characteristics of test content that should provide a clear demarcation of the domain to be assessed in relation to the test purpose. The characteristics are as follows.
Sources of test content
The specification of the sources of test content depends on the domain to which the inferences will be made: the curricular domain, the competency domain or the domain of
a future criterion. In the curricular domain, the content can be derived from explicit curricular objectives, curricular outlines and blueprints, textbooks or other instructional material. In the competency domain, the content cannot be derived from a specific curriculum; instead, it is derived from theoretical conceptualizations of knowledge, skills, mental abilities and achievements. Tests developed to predict performance in a future criterion setting must be based on an analysis of the requirements of that setting. Millman and Green (1989) distinguish three steps in this analysis. First, the specific cognitive requirements of the criterion setting are identified, through job or task analysis or, for future academic settings, through research or research synthesis. Second, the content specification is developed, using either the criterion directly or indicators known or hypothesized to be related to it. Third, the relationship between performance on the predictive test and performance in the criterion setting must be established.
Dimensionality of test content
Dimensionality refers to the conceptual or theoretical homogeneity or heterogeneity of the content domain. In principle, assessing a heterogeneous domain implies reporting separate test scores for the different dimensions, which may affect the overall reliability of the test. The overall reliability is positively related to the correlations between the dimensions. As an alternative to reporting subscores, one may combine the subscores on the dimensions into a composite score. In that case, the weighting of the subscores should reflect the relative importance of each dimension in the conceptualization of the domain.
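As a simple illustration, such a composite can be computed as a weighted sum of the subscores. The Python sketch below is minimal and uses hypothetical dimension names, subscores and weights; in practice, the weights would follow from the conceptualization of the domain.

```python
# Sketch of a weighted composite score. The dimension names, subscores
# and weights are hypothetical; the weights are meant to reflect the
# relative importance of each dimension in the domain conceptualization.

def composite_score(subscores, weights):
    """Weighted sum of the subscores, with weights normalized to sum to 1."""
    total_weight = sum(weights[dim] for dim in subscores)
    return sum(weights[dim] * score for dim, score in subscores.items()) / total_weight

subscores = {"reading": 32, "listening": 24}  # observed subscores per dimension
weights = {"reading": 0.6, "listening": 0.4}  # relative importance of each dimension
print(f"{composite_score(subscores, weights):.1f}")  # 0.6 * 32 + 0.4 * 24 = 28.8
```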
Domain- versus norm-referenced interpretation
A domain-referenced interpretation refers to an absolute performance level, whereas a norm-referenced interpretation refers to the performance level relative to a population.
The content specification of a domain-referenced test requires a detailed description of the entire domain, and subscores are usually reported for all aspects of the performance.
Norm-referenced tests, on the other hand, require a summary score, and the selected content should support the meaningfulness of this score.
Bandwidth versus fidelity
Closely related to the two previous points is the tradeoff between bandwidth and fidelity.
Choosing test content involves a tradeoff between the breadth of content coverage and the reliability of the subscores. Tests with a very narrow scope can be very reliable, but their relevance to the content domain may be negligible.
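The reliability side of this tradeoff can be illustrated with the Spearman-Brown formula, which predicts the reliability of a k-item test from the average inter-item correlation. The Python sketch below is a minimal illustration; the correlation of .25 and the test lengths are arbitrary, assumed values.

```python
# Illustration of the bandwidth-fidelity tradeoff with the
# Spearman-Brown formula. The average inter-item correlation (0.25)
# and the test lengths are arbitrary, assumed values.

def spearman_brown(k, r):
    """Predicted reliability of a k-item test with average inter-item correlation r."""
    return k * r / (1 + (k - 1) * r)

r = 0.25
# A broad domain reported as a single 40-item score ...
print(f"one 40-item score:      {spearman_brown(40, r):.2f}")  # ~0.93
# ... versus the same item budget split into four 10-item subscores.
print(f"four 10-item subscores: {spearman_brown(10, r):.2f}")  # ~0.77 each
```

At a fixed test length, widening the bandwidth thus shortens each subscore and lowers its reliability.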
Content distribution
The distribution of items across the content domain should reflect the relevant domain conceptualization.
6.4.2 Specification of cognitive behavior level
Besides a detailed content specification, making valid inferences with respect to some domain of proficiency or ability requires an analysis of the cognitive level at which the target behavior should be performed. The most widely used taxonomy of levels of cognitive behavior is the well-known taxonomy by Bloom (Bloom, Engelhart, Furst, Hill & Krathwohl, 1956). Bloom distinguishes six hierarchical categories: knowledge, comprehension (translation and interpretation), application, analysis, synthesis and evaluation. In many practical testing situations, this taxonomy is somewhat simplified and directly linked with item types. Here, a distinction is made between items assessing knowledge, understanding, application and problem solving. In this breakdown, items assess knowledge if they require reproduction without any substantial extension. Items assessing understanding require production of new information based on the supplied information. Application requires the use of the information in some outside setting, where only one possible solution is valid. Finally, problem solving involves productive and creative thinking, in some outside setting, where there is usually more than one feasible solution.
An alternative taxonomy that has lately gained attention is based on the distinction between declarative and procedural knowledge (Snow & Lohman, 1989). Declarative knowledge involves facts, concepts, principles and procedures that are stored in an organizational framework and retrieved when needed. Procedural knowledge, on the other hand, involves many related semantic networks. Snow and Lohman (1989) distinguish between context-bound organizational frameworks, which are more difficult to construct but easy to retrieve, and semantic memory, which is easy to construct, for instance by rote learning, but only offers short-term benefits. Procedural knowledge is conceptualized as developing from declarative knowledge in a number of stages, at the end of which the behavior is automated. One step further in this hierarchy is so-called strategic knowledge (Greeno, 1980), which involves the development of goals and strategies for attaining certain objectives. Several authors (Crooks, 1988; Green, Halpin & Halpin, 1990; Stiggins, Griswold & Wikelund, 1989) argue that traditional achievement testing does not properly assess higher-level educational outcomes, such as procedural and strategic knowledge. Roid and Haladyna (1982), on the other hand, attribute this apparent shortcoming to a lack of adequate conceptualization of higher-level outcomes. Haladyna (1992, 1994) developed several advanced item formats, such as the context-dependent item set, that appear quite suitable for measuring higher-level outcomes.
6.4.3 Table of specifications
The relation between the content and cognitive behavior dimensions can be defined in a so-called table of specifications, also known as a content-by-process matrix. The table serves as a blueprint for the test and can be used as an item-writing scheme. The principal objective of the table of specifications is to assure that the test is a valid reflection of the domain and the test purpose. An example is given in Table 6.2. The artificial example pertains to a test for a course on research in the social sciences.
For the cognitive behavior dimension, a distinction is made between four item types:
items assessing knowledge, items assessing understanding, items assessing application
and items assessing problem solving. The content dimension is hierarchically ordered in topics and subtopics. The cell entries of the table give the relative importance of a specific combination of content and cognitive behavioral level in the test. The percentages in the example of Table 6.2 define how the items are distributed over the content-by-level grid. When the total number of items has been fixed, the percentages translate to numbers of items, as illustrated in the sketch following the table.
Table 6.2 Table of Specifications.

                                                    Cognitive level addressed by items
Topic        Subtopic                  Knowledge  Understanding  Application  Problem Solving  Total
Statistics   Discrete Distributions        3            3             4              0           10
             Continuous Distributions      3            3             4              0           10
             Estimation                    2            2             2              2            8
             Hypothesis Testing            2            4             4              0           10
Methodology  Experiments                   3            4             4              4           15
             Observational Studies         3            3             4              5           15
             Case Studies                  3            3             3              1           10
Measurement  Reliability                   3            3             3              3           12
             Validity                      1            2             3              4           10
Total                                     23           27            31             19          100
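The Python sketch below shows one way this translation can be carried out, applied to the row totals of Table 6.2. The 40-item test length is an assumed value, and the largest-remainder rounding is just one plausible scheme for keeping the item counts summing exactly to the test length; the same function can be applied to the individual cell percentages.

```python
# Sketch: translating the percentages of a table of specifications into
# item counts for a fixed test length. The 40-item length is an assumed
# value; largest-remainder rounding is one plausible rounding scheme.

def items_per_cell(percentages, n_items):
    """Apportion n_items in proportion to percentages that sum to 100."""
    exact = {cell: p * n_items / 100 for cell, p in percentages.items()}
    counts = {cell: int(x) for cell, x in exact.items()}  # floor of each share
    leftover = n_items - sum(counts.values())
    # Hand the remaining items to the cells with the largest remainders.
    for cell in sorted(exact, key=lambda c: exact[c] - counts[c],
                       reverse=True)[:leftover]:
        counts[cell] += 1
    return counts

# Row totals of Table 6.2, apportioned over a 40-item test.
row_totals = {"Discrete Distributions": 10, "Continuous Distributions": 10,
              "Estimation": 8, "Hypothesis Testing": 10, "Experiments": 15,
              "Observational Studies": 15, "Case Studies": 10,
              "Reliability": 12, "Validity": 10}
print(items_per_cell(row_totals, n_items=40))
```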