applied, the evidentiary bases for test score interpretations can be derived through a variety of methods. Contributions to the validity evidence of test scores can be made by any systematic research that supports or adds to their meaning, regardless of who conducts it or when it occurs. As long as sound scientific evidence for a proposed use of test scores exists, qualified test users are free to employ scores for their purposes, regardless of whether these were foreseen by the developers of the test. This proposition helps to explain the multifaceted nature of validation research, as well as its often redundant and sometimes conflicting findings. It also accounts for the longevity of some instruments, such as the MMPI and the Wechsler scales, for which a vast literature—encompassing numerous applications in a variety of contexts—has been accumulated over decades of basic and applied research.
The alert reader may have gathered at this point that validity, like reliability, is not a quality that characterizes tests in the abstract, or any specific test or test data. Rather, validity is a matter of judgments that pertain to test scores as they are employed for a given purpose and in a given context. Hence, the process of validation is akin to hypothesis testing: It subsumes the notions of test score meaning and test score reliability, discussed in the two previous chapters, as well as the ways in which the applications of test data to psychological research and practice can be justified, the topic covered in the present chapter. Rapid Reference 5.1 lists some of the most significant contributions to the topic of validity from the 1950s to the 1990s.
HISTORICAL PERSPECTIVES ON VALIDITY
The rise of modern psychological testing took place at about the same time that psychology was becoming an established scientific discipline. Both fields date their beginnings to the late 19th and early 20th centuries.
As a result of this historical coincidence, our understanding of the nature, functions, and methodology of psychological tests and measurements has evolved over the past century in tandem with the development and growing sophistication of psychological science.

DON’T FORGET

Perhaps no other theorist has been more influential in reshaping the concept of validity than Samuel Messick. According to Messick (1989, p. 13), “validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment.”
At the beginning, scientific psychology was primarily concerned with establishing psychophysical laws, through the experimental investigation of the functional relationship between physical stimuli and the sensory and perceptual responses they arouse in humans. Theoretical psychology consisted primarily of armchair speculation of a philosophical nature until well into the first quarter of the 20th century. Neither of these statements implies that the contributions of the pioneers of psychology were not of value (see, e.g., Boring, 1950; James, 1890).
Nevertheless, arising against this backdrop, the first psychological tests came to be seen, somewhat naively, as scientific tools that measured an ever-expanding catalog of mental abilities and personality traits in much the same way as the psychophysicists were measuring auditory, visual, and other sensory and perceptual responses to stimuli such as sounds, light, and colors of various types and intensities. Furthermore, as we saw in Chapter 1, the success that the Stanford-Binet and the Army Alpha had in helping to make practical decisions about individuals in education and employment settings led to a rapid proliferation of tests in the first two decades of the 20th century. The wide range of applications for which these instruments were used soon overtook the theoretical and scientific rationales for them that were available at the time. In short, many early psychological tests were developed and used without the benefit of the psychometric theory, ethical principles, and practical guidelines that would begin to accumulate in later decades (von Mayrhauser, 1992).

Rapid Reference 5.1

Basic References on Validity

Samuel Messick articulated his views on validity most explicitly in a chapter that appeared in Educational Measurement (3rd ed., pp. 13–103), a notable volume edited by Robert L. Linn and published jointly by the American Council on Education and Macmillan in 1989. Messick’s Validity chapter, and his other works on the topic (e.g., Messick, 1988, 1995), have directly influenced its treatment in the current version of the Testing Standards (AERA, APA, NCME, 1999). Other key contributions that are widely acknowledged as having shaped the evolution of theoretical concepts of validity include the following:

• Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
• Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph Supplement]. Psychological Reports, 3, 635–694.
• Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
• Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
The Classic Definition of Validity
Recognition of this state of affairs within the profession resulted in the first attempts to delineate the characteristics that would distinguish a good test from a bad one. Thus, the first definition of validity as “the extent to which a test measures what it purports to measure” was formulated in 1921 by the National Association of the Directors of Educational Research (T. B. Rogers, 1995, p. 25). It was ratified by many testing experts—including Anne Anastasi in all the editions of her influential textbook on Psychological Testing (1954–1988) as well as Anastasi and Urbina (1997, p. 8). The view that “test validity concerns what the test measures and how well it does so” (Anastasi & Urbina, 1997, p. 113) is still regarded by many as the heart of the issue of validity. In spite of its apparent simplicity, this view poses a number of problems, especially when it is seen from the perspective of the current Testing Standards (AERA, APA, NCME, 1999) and of the flux that still exists with regard to defining some of the most basic constructs within the field of psychology.
Problematic Aspects of the Traditional View of Validity
The issues that the classic definition of validity raises revolve around its unstated but clear assumptions that
1. validity is a property of tests, rather than of test score interpretations;
2. in order to be valid, test scores should measure some purported construct directly; and
3. score validity is, at least to some extent, a function of the test author’s or developer’s understanding of whatever construct she or he intends to measure.
While these assumptions may be justified in certain cases, they definitely are not justified in every case. The first assumption, for instance, is tenable only as long as validation data support the stated purpose of the test and as long as the test is used specifically for that purpose and with the kinds of populations for which validity data have been gathered. The second and third assumptions are justified only for tests that measure behavior which can be linked to psychological constructs in fairly unequivocal ways, such as certain memory functions, speed and accuracy in the performance of various cognitive processing tasks, or extent of knowledge of a well-defined content universe. They are not necessarily tenable for (a) tests designed to assess multidimensional or complex theoretical constructs about which there is still much debate, such as intelligence or self-concept; (b) tests developed on the basis of strictly empirical—as opposed to theoretical or logical—relationships between scores and external criteria, such as the original MMPI; or (c) techniques whose purpose is to reveal covert or unconscious aspects of personality, such as projective devices. For instruments of this nature, what is being measured is behavior that can be linked more or less directly to the constructs that are of real interest, primarily through a network of correlational evidence. Rapid Reference 5.2 defines the various meanings of the word construct and may help to clarify the distinctions just made, as well as those to come later in this chapter.
The idea that test score validity is a function of the degree to which tests measure what they purport to measure also leads to some confusion between the consistency or precision of measurements (i.e., their reliability) and their validity. As we saw in Chapter 4, if a test measures whatever it measures well, its scores may be deemed to be reliable (consistent, precise, or trustworthy), but they are not necessarily valid in the contemporary, fuller sense of the term. In other words, test scores may be relatively free of measurement error, and yet may not be very useful as bases for making the inferences we need to make.
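Classical test theory makes this asymmetry concrete. Because a validity coefficient is attenuated by measurement error in both the test and the criterion, it can never exceed the square root of the product of their reliability coefficients:

$r_{xy} \le \sqrt{r_{xx} \, r_{yy}}$

Thus a test with a reliability of .90, correlated with a perfectly reliable criterion, could attain a validity coefficient no higher than about .95; yet nothing prevents the observed correlation from being a mere .10 (these figures are illustrative, not drawn from any actual instrument). Reliability sets a ceiling on validity; it does not supply validity itself.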
Moreover, the implication that a test score reflects what the test author intends it to reflect has been a source of additional misunderstandings. One of them concerns the titles of tests, which should never be—but often are—taken at face value. Test titles range from those that are quite accurate and empirically defensible to those that merely reflect test authors’ (unfulfilled) intentions or test publishers’ marketing concerns. A second, and even more important, problem with the notion that valid test scores reflect their expressed purpose is that it can lead to superficial or glib empirical definitions of psychological constructs. Possibly the most famous example of this is E. G. Boring’s 1923 definition of intelligence as “whatever it is that intelligence tests measure” (cited by Sternberg, 1986, p. 2).
As a result of these misunderstandings, the field of psychological testing has been saddled with instruments—purporting to measure ill-defined or faddish constructs—whose promises vastly overstate what they can deliver, whose use in psychological research impedes or delays progress in the discipline, and whose existence, by association, diminishes the image of the field as a whole. Early measures of masculinity-femininity are a prime example of this sort of problem (Constantinople, 1973; Lenney, 1991; Spence, 1993), although there are many others.
Perhaps the most significant consequence of the traditional definition of validity is that it became attached to tests and to what they purport to measure, rather than to test scores and the interpretations that could justifiably be based on them. By implication, then, any evidence labeled as test validity came to be seen as proof that the test was valid and worthy of use, regardless of the nature of the link between test score data and the inferences that were to be drawn from them. Consequently, innumerable studies in the psychological literature have used scores from a single instrument to classify research participants into experimental groups, many clinicians have relied exclusively on test scores for diagnosis and treatment planning, and an untold number of decisions in educational and employment settings have been based on cutoff scores from a single test. Too often, choices like these are made without considering their appropriateness in specific contexts or without reference to additional sources of data, and are justified simply on the basis that the test in question is supposed to be “a valid measure of . . .” whatever its manual states.

Rapid Reference 5.2

Deconstructing Constructs

Because the term construct is used so frequently in this chapter, a clarification of its meaning is necessary at this point. Generally speaking, a construct is anything that is devised by the human mind but not directly observable. Constructs are abstractions that may refer to concepts, ideas, theoretical entities, hypotheses, or inventions of many sorts.

In psychology, the word construct is applied to concepts, such as traits, and to the theoretical relationships among concepts that are inferred from consistent empirical observations of behavioral data. Psychological constructs differ widely in terms of

• their breadth and complexity,
• their potential applicability, and
• the degree of abstraction required to infer them from the available data.

As a rule, narrowly defined constructs require less abstraction but have a smaller range of application. Moreover, since it is easier to obtain consensual agreement about constructs that are narrow, simple, and less abstract, these are also more easily assessed than broader and multifaceted constructs that may have acquired different meanings across diverse contexts, cultures, and historical periods.

Examples:
• Whereas manual dexterity is a construct that can be linked to specific behavioral data readily, creativity is far more abstract. Thus, when it comes to evaluating these traits, determining who has greater manual dexterity is much easier than determining who is more creative.
• Introversion is a simpler and more narrowly defined construct than conscientiousness. While the latter is potentially useful in predicting a broader range of behaviors, it is also more difficult to assess.

Synonyms: The terms construct and latent variable are often used interchangeably. A latent variable is a characteristic that presumably underlies some observed phenomenon but is not directly measurable or observable. All psychological traits are latent variables, or constructs, as are the labels given to factors that emerge from factor analytic research, such as verbal comprehension or neuroticism.
An important signpost in the evolution of the concept of validity was the publication of the Technical Recommendations for Psychological Tests and Diagnostic Techniques (APA, 1954), the first in the series of testing standards that were retitled, revised, and updated in 1955, 1966, 1974, 1985, and, most recently, in 1999 (AERA, APA, NCME). With each subsequent revision, the Testing Standards—previously discussed in Chapter 1—have attempted to promote sound practices for the construction and use of tests and to clarify the basis for evaluating the quality of tests and testing practices.
The Technical Recommendations published in 1954 introduced a classification of validity into four categories to be discussed later in this chapter: content, predictive, concurrent, and construct validity. Subsequently, the 1974 Standards reduced these categories to three, by subsuming predictive and concurrent validity under the rubric of criterion-related validity, and further specified that content, criterion-related, and construct validity are aspects of, as opposed to types of, validity. In the same year, the Standards also introduced the notion that validity “refers to the appropriateness of inferences from test scores or other forms of assessment” (APA, AERA, NCME, 1974, p. 25).
In spite of the specifications proposed by the 1974 Standards more than a quarter century ago, the division of validity into three types (which came to be known as the tripartite view of validity) became entrenched. It has survived up to the present in many test manuals and test reviews, as well as in much of the research that is conducted on psychometric instruments. Nevertheless, successive revisions of the Standards—especially the current one—have added stipulations that make it increasingly clear that whichever classification is used for validity concepts should be attached to the types of evidence that are adduced for test score interpretation rather than to the tests themselves. With this in mind, we turn now to a consideration of the prevailing view of validity as a unitary concept and to the various sources of evidence that may be used to evaluate possible interpretations of test scores for specific purposes. For further information on the evolution of validity and related concepts, see Anastasi (1986), Angoff (1988), and Landy (1986).
CURRENT PERSPECTIVES ON VALIDITY
Beginning in the 1970s and continuing up to the present, there has been a concerted effort within the testing profession to refine and revise the notion of validity and to provide a unifying theory that encompasses the many strands of evidence from which test scores derive their significance and meaning. One consistent theme of this effort has been the integration of almost all forms of validity evidence as aspects of construct validity (Guion, 1991; Messick, 1980, 1988, 1989; Tenopyr, 1986). This, in turn, has prompted a reexamination of the meaning of construct—defined in general terms in Rapid Reference 5.2—as it applies specifically in the context of validity in psychological testing and assessment (see, e.g., Braun, Jackson, & Wiley, 2002; Embretson, 1983).
The Integrative Function of Constructs in Test Validation
In psychological testing, the term construct has been used, often indistinctly, in two alternate ways:
1. To designate the traits, processes, knowledge stores, or characteristics whose presence and extent we wish to ascertain through the specific behavior samples collected by the tests. In this meaning of the word, a construct is simply what the test author sets out to measure—that is, any hypothetical entity derived from psychological theory, research, or observation of behavior, such as anxiety, assertiveness, logical reasoning ability, flexibility, and so forth.
2. To designate the inferences that may be made on the basis of test scores. When used in this way, the term construct refers to a specific interpretation of test data, or any other behavioral data—such as the presence of clinical depression or a high probability of success in some endeavor—that may be made based on a network of preestablished theoretical and empirical relationships between test scores and other variables.
Several theorists have tried to explain how these two meanings relate to the notion of test score validity. One of the earliest formulations was Cronbach’s (1949) classification of validity into two types, namely, logical and empirical. Subsequently, in an influential paper he coauthored with Meehl in 1955, Cronbach suggested the use of the term construct validity to designate the nomological net, or network of interrelationships between and among theoretical and observable elements that support a construct. In an attempt to clarify how these two meanings could be distinguished in the process of test development, construction, and evaluation, Embretson (1983) proposed a separation between two aspects of construct validation research, namely, construct representation and nomothetic span. According to Embretson (p. 180), construct representation research “is concerned with identifying the theoretical mechanisms that underlie task performance.” From an information-processing perspective, the goal of construct representation is task decomposition. The process of task decomposition can be applied to a variety of cognitive tasks, including interpersonal inferences and social judgments. It entails an examination of test responses from the point of view of the processes, strategies, and knowledge stores involved in their performance. Nomothetic span, on the other hand, concerns “the network of relationships of a test to other measures” (Embretson, p. 180); it refers to the strength, frequency, and pattern of significant relations between test scores and other measures of the same—or different—traits, between test scores and criterion measures, and so forth.
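To give a feel for how nomothetic span is examined in practice, the sketch below simulates the kind of convergent and discriminant correlational evidence Embretson describes. It is a minimal illustration in Python with fabricated data; the variable names, loadings, and sample size are hypothetical, not taken from any published study.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 300  # hypothetical sample of test takers

# Two uncorrelated latent traits (constructs).
reading_ability = rng.normal(size=n)
sociability = rng.normal(size=n)

# Observed scores = latent trait + measurement error. The loadings
# and error weights below are arbitrary, chosen purely for illustration.
new_reading_test = 0.8 * reading_ability + 0.6 * rng.normal(size=n)
established_reading_test = 0.7 * reading_ability + 0.7 * rng.normal(size=n)
sociability_scale = 0.8 * sociability + 0.6 * rng.normal(size=n)

# Nomothetic span concerns the pattern of relations between the new
# test and other measures: correlations with a measure of the same
# trait (convergent) should be substantial, while correlations with
# a measure of a different trait (discriminant) should be near zero.
r_convergent = np.corrcoef(new_reading_test, established_reading_test)[0, 1]
r_discriminant = np.corrcoef(new_reading_test, sociability_scale)[0, 1]
print(f"convergent r:   {r_convergent:.2f}")   # typically around .55
print(f"discriminant r: {r_discriminant:.2f}") # typically near .00
```

A real nomothetic-span study would, of course, use actual score data and a far broader network of measures, but the logic of inspecting the pattern of correlations is the same.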
Embretson (1983) described additional features of the concepts of construct representation and nomothetic span that help to clarify the differences between these two aspects of construct validation research. Two of the points she made, concerning the distinction between the functions the two kinds of research can serve, are particularly useful in considering the role of sources of validity evidence:
1. Construct representation research is concerned primarily with identifying differences in the test’s tasks, whereas nomothetic span research is concerned with differences among test takers. In construct representation research, a process, strategy, or knowledge store identified through task decomposition (e.g., phonetic coding, sequential reasoning, or ability to comprehend elementary-level texts) may be deemed essential to the performance of a test task, but yield no systematic differences across a test-taking population made up of readers. On the other hand, in order to investigate the nomothetic span of test scores (i.e., the network of relationships between them and other measures), it is necessary to have data on individual differences and variability across test takers. This reinforces the crucial importance that score variability has for deriving information that can be used to make determinations or decisions about people, which was discussed previously in Chapters 2 and 3. If the scores of a group of people on a test designed to assess the ability to comprehend elementary-level texts, for instance, are to be correlated with anything else—or used to determine anything other than whether these people,