Rapid Reference 5.5
Interpretation of the results of any single factor analytic study cannot extend beyond the data used in the analysis, in terms either of what is being measured or of their generalizability across populations.
• What is being measured? Both the manual of the Beta III (a test that purports to measure nonverbal intellectual ability) and the analysis displayed in Table 5.3 suggest that the five subtests of the Beta III can be configured into two clusters that have a good deal of variance in common. Examination of the tasks involved in the five subtests confirms that the two factors are nonverbal. However, these data cannot reveal whether or to what extent these factors capture the essential aspects of nonverbal intellectual ability.
• Generalizability of factor analytic results: The correlational data derived from 95 college students (Urbina & Ringby, 2001) and presented in Table 5.3 yield results similar to those obtained with the Beta III standardization group, a much larger (N = 1,260), and far more representative, sample of the U.S. population. Although this convergence of results supports the factor structure obtained in both investigations, it leaves open the question of whether this structure would generalize to other populations, such as people of different cultures, individuals with uncorrected hearing or visual impairments, or any other group whose experiential history differs significantly from that of the samples used in the two analyses at hand.
techniques—and other sophisticated methods of analysis of multivariate correlational data, such as multiple regression analyses—can be used to investigate latent variables (i.e., constructs) and the possible direct and indirect causal links or paths of influence among them.
One rapidly evolving set of procedures that can be used to test the plausibility of hypothesized interrelationships among constructs as well as the relationships between constructs and the measures used to assess them is known as structural equation modeling (SEM). The essential idea of all SEM is to create one or more models—based on theories, previous findings, or prior exploratory analyses—of the relations among a set of constructs or latent variables and to compare the covariance structures or matrices implied by the models with the covariance matrices actually obtained with a new data set. In other words, the relationships obtained with empirical data on variables that assess the various constructs (factors or latent variables) are compared with those predicted by the models. The correspondence between the data and the models is evaluated with statistics, appropriately named goodness-of-fit statistics. SEM provides several advantages over traditional regression analyses. Its advantages derive primarily from two characteristics of this methodology: (a) SEM is based on analyses of covariance structures (i.e., patterns of covariation among latent variables or constructs) that can represent the direct and indirect influences of variables on one another, and (b) SEM typically uses multiple indicators for both the dependent and independent variables in models and thus provides a way to account for measurement error in all the observed variables. Readers who wish to pursue the topic of SEM, and related techniques such as path analysis, in more detail might consult one or more of the sources suggested in Rapid Reference 5.6.
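To make the comparison of model-implied and observed covariance matrices concrete, the following minimal sketch (in Python with NumPy; every loading, variance, and data value is invented for illustration and is not taken from any test discussed here) builds the covariance matrix implied by a simple one-factor model and evaluates the standard maximum-likelihood discrepancy between it and a sample covariance matrix. Goodness-of-fit statistics such as the model chi-square are functions of this discrepancy.

```python
import numpy as np

def implied_covariance(loadings, factor_cov, unique_var):
    """Covariance matrix implied by a factor model: Sigma = Lambda @ Phi @ Lambda.T + Theta."""
    L = np.atleast_2d(loadings)        # p x k matrix of factor loadings
    Phi = np.atleast_2d(factor_cov)    # k x k factor covariance matrix
    Theta = np.diag(unique_var)        # p x p diagonal matrix of unique (error) variances
    return L @ Phi @ L.T + Theta

def ml_discrepancy(S, Sigma):
    """Maximum-likelihood fit function: F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p.
    F equals zero when the model reproduces the sample covariance matrix exactly."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

# Hypothetical one-factor model for four observed variables (values are illustrative only).
loadings = np.array([[0.8], [0.7], [0.6], [0.5]])
factor_cov = np.array([[1.0]])
unique_var = 1.0 - (loadings ** 2).ravel()     # keeps the implied variances at 1.0
Sigma = implied_covariance(loadings, factor_cov, unique_var)

# Simulated "observed" scores stand in for real test data; S is their covariance matrix.
rng = np.random.default_rng(0)
scores = rng.multivariate_normal(np.zeros(4), Sigma, size=500)
S = np.cov(scores, rowvar=False)

print("Model-implied covariance matrix:\n", np.round(Sigma, 2))
print("ML discrepancy (smaller = better fit):", round(ml_discrepancy(S, Sigma), 4))
```

In practice, SEM software estimates the model parameters rather than fixing them in advance, and it reports fit indexes derived from a discrepancy of this kind.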
As far as psychological test validation is concerned, SEM techniques are used for the systematic exploration of psychological constructs and theories through research that employs psychological testing as a method of data collection for one or more of the indicators in a model. SEM can provide supporting evidence for the reliability of test scores as well as for their utility as measures of one or more constructs in a model. However, at present, the most extensive application of SEM techniques in test score validation is through confirmatory factor analyses.
Confirmatory factor analysis (CFA), mentioned briefly in a preceding section, involves the a priori specification of one or more models of the relationships between test scores and the factors or constructs they are designed to assess. In CFA, as in all other SEM techniques, the direction and strength of interrelationships estimated by various models are tested against results obtained with actual data for goodness of fit. These analyses have been facilitated by the development of computer programs—such as LISREL (Jöreskog & Sörbom, 1993)—designed to generate values for the hypothesized models that can then be tested against the actual data.
Examples of this kind of work are becoming increasingly abundant both in the psychological literature and in the validity sections of test manuals. The confirmatory factor analyses conducted with the standardization sample data of the WAIS-III (Psychological Corporation, 1997, pp. 106–110) are typical of the studies designed to provide validity evidence for psychological test scores. In those analyses, four possible structural models—a two-factor, a three-factor, a four-factor, and a five-factor model—were successively evaluated and compared to a general, one-factor model to determine which of them provided the best fit for the data in the total sample and in most of the age bands of the WAIS-III normative group. The results of the CFA indicated that the four-factor model provided the best overall solution and confirmed the patterning previously obtained with EFAs of the same data. These results, in turn, were used as the bases for determining the composition of the four index scores (Verbal Comprehension, Perceptual Organization, Working Memory, and Processing Speed) that can serve to organize the results of 11 of the 14 WAIS-III subtests into separate domains of cognitive functioning.
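As a rough illustration of how competing measurement models of this sort might be specified and compared, the sketch below assumes the Python semopy package (whose model syntax resembles that of the R package lavaan) and uses invented subtest names and a made-up data file rather than the actual WAIS-III variables. The model with the more favorable fit statistics (e.g., chi-square, CFI, RMSEA) would be preferred.

```python
import pandas as pd
import semopy  # assumed SEM package; the R package lavaan is a common alternative

# A data frame of observed subtest scores; the file and column names here are invented.
data = pd.read_csv("subtest_scores.csv")

# Two a priori measurement models specified in lavaan-style syntax.
one_factor = """
g =~ vocab + similarities + arithmetic + digit_span + coding + symbol_search
"""

two_factor = """
verbal =~ vocab + similarities + arithmetic
speed  =~ digit_span + coding + symbol_search
"""

for name, description in [("one-factor", one_factor), ("two-factor", two_factor)]:
    model = semopy.Model(description)
    model.fit(data)
    # calc_stats reports goodness-of-fit statistics (chi-square, CFI, RMSEA, etc.)
    print(name)
    print(semopy.calc_stats(model).T)
```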
Rapid Reference 5.6
Sources of Information on Structural Equation Modeling (SEM)
• For a good introduction to SEM that does not presuppose knowledge of statistical methods beyond regression analysis, see the following:
Raykov, T., & Marcoulides, G. A. (2000). A first course in structural equation modeling. Mahwah, NJ: Erlbaum.
• A more advanced presentation of SEM that includes contributions from many of the leading authorities on this methodology can be found in the following:
Bollen, K. A., & Long, J. S. (Eds.). (1993). Testing structural equation models. Newbury Park, CA: Sage.
• The Internet is an excellent source of information about many topics, including SEM techniques that are being used in a number of fields. One of the best starting points is a site created and maintained by Ed Rigdon, a professor in the Marketing Department of Georgia State University. The address for this site is as follows:
http://www.gsu.edu/~mkteer/sem.html
In contrast to the kind of CFA conducted with the WAIS-III data, other CFA studies involve more basic work aimed at clarifying the organization of cognitive and personality traits. An example of this type of study can be found in Gustafsson's (2002) description of his reanalyses of data gathered by Holzinger and Swineford in the 1930s from a group of seventh- and eighth-grade students (N = 301) who were tested with a battery of 24 tests designed to tap abilities in five broad ability areas (verbal, spatial, memory, speed, and mathematical deduction). Using two different models of CFA, Gustafsson found support for the hypothesized overlap between the G (general intelligence) and Gf (fluid intelligence or reasoning ability) factors that had been suggested by the analyses done in the 1930s. Gustafsson also used contrasts in the patterns of results from the original factor analysis done in the 1930s and his own contemporary CFAs to illustrate significant implications that various measurement approaches, as well as sample composition, have for the construct validation of ability test scores.
Confirmatory factor analysis and other structural modeling techniques are still evolving and are far from providing any definitive conclusions. However, the blending of theoretical models, empirical observations, and sophisticated statistical analyses that characterizes these techniques offers great promise in terms of advancing our understanding of the measures we use and of the relationships between test scores and the constructs they are designed to assess.
Validity Evidence Based on Relationships Between Test Scores and Criteria
If the goals of psychological testing were limited simply to describing test takers' performance in terms of the frames of reference discussed in Chapter 3 or to increasing our understanding of psychological constructs and their interrelationships, the sources of evidence already discussed might suffice. However, the valid interpretation of test scores often entails applying whatever meaning is inherent in scores—whether it is based on norms, test content, response processes, established patterns of convergence and divergence, or any combination of these sources of evidence—to the pragmatic inferences necessary for making decisions about people. When this is the case, validity evidence needs to address the significance test scores may have in matters that go beyond those scores or in realms that lie outside the direct purview of the test. In other words, test scores must be shown to correlate with the various criteria used in making decisions and predictions.
Some Essential Facts About Criteria
Merriam-Webster’s Collegiate Dictionary (1995) defines criterion as “a standard on which a judgment or decision may be based” or “a characteristic mark or trait.”
Although the plural form, criteria, is often used as a singular, in the present context it is necessary to apply both forms of the word appropriately because they are both central to our purposes. A criterion may also be defined, more loosely, as that which we really want to know. This last definition, though less formal than the dictionary's, highlights the contrast between what test scores tell us and the practical reasons why we use tests.
For psychological tests that are used in making judgments or decisions about people, evidence of a relationship between test scores and criterion measures is an indispensable, but not necessarily sufficient, basis for evaluating validity. Criterion measures are indexes of the criteria that tests are designed to assess or predict and that are gathered independently of the test in question. Rapid Reference 5.7 provides a list of the kinds of criteria typically used in validating test scores. Since the nature of criteria depends on the questions one wishes to answer with the help of tests, it follows that validation procedures based partly or entirely on relationships between test scores and criterion measures must produce evidence of a link between the predictors (test scores) and the criteria.
Rapid Reference 5.7
Typical Criteria Used in Validating Test Scores
Although there is an almost infinite number of criterion measures that can be employed in validating test scores, depending on what the purposes of testing are, the most frequent categories are the following:
• Indexes of academic achievement or performance in specialized training, such as school grades, graduation records, honors, awards, or demonstrations of competence in the area of training through successful performance (e.g., in piano playing, mechanical work, flying, computer programming, bar exams, or board certification tests).
• Indexes of job performance, such as sales records, production records, promotions, salary increases, longevity in jobs that demand competence, accident-free job performance, or ratings by supervisors, peers, students, employees, customers, and so forth.
• Membership in contrasted groups based on psychiatric diagnoses, occupational status, educational achievement, or any other relevant variable.
• Ratings of behavior or personality traits by independent observers, relatives, peers, or any other associates who have sufficient bases to provide them.
• Scores on other relevant tests.
Criterion measures or estimates may be naturally dichotomous (e.g., graduating vs. dropping out) or artificially dichotomized (e.g., success vs. failure); polytomous (e.g., diagnoses of anxiety vs. mood vs. dissociative disorders, or membership in artistic vs. scientific vs. literary occupations); or continuous (e.g., grade point average, number of units sold, scores on a depression inventory, etc.). Whereas the nature of criteria depends on the decisions or predictions to be made with the help of test scores, the methods used to establish the relationships between test scores and criteria vary depending on the formal characteristics of both test scores and criterion measures. In general, when the criterion measure is expressed in a dichotomous fashion (e.g., success vs. failure) or in terms of a categorical system (e.g., membership in contrasted groups), the validity of test scores is evaluated in terms of hit rates. Hit rates typically indicate the percent of correct decisions or classifications made with the use of test scores, although mean differences and suitable correlation indexes may also be used. When criterion measures are continuous (e.g., scores on achievement tests, grades, ratings, etc.), the principal tools used to indicate the extent of the relationship between test scores and the criterion measure are correlation coefficients. However, if a certain value on a continuous criterion, such as a 2.0 grade point average, is used as a cutoff to determine a specific outcome, such as graduation from college, scores on the predictor test can also be evaluated in terms of whether they differentiate between those who meet or exceed the criterion cutoff and those who do not.
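The following brief sketch illustrates both situations with invented scores and outcomes (the variable names, cutoff, and data are hypothetical and are not drawn from any test discussed here): with a dichotomous criterion, decisions made from a cutoff score are tallied as hits, false positives, and false negatives; with a continuous criterion, the validity coefficient is the Pearson correlation between predictor and criterion.

```python
import numpy as np

# Invented predictor scores and criterion data for ten test takers.
test_scores = np.array([45, 62, 38, 71, 55, 49, 66, 58, 42, 74])
graduated   = np.array([0, 1, 0, 1, 1, 1, 1, 1, 0, 1])                      # dichotomous criterion
gpa         = np.array([1.8, 2.9, 1.6, 3.4, 2.7, 2.1, 3.1, 2.8, 1.9, 3.6])  # continuous criterion

# Dichotomous criterion: classify with a (hypothetical) cutoff and tabulate the decisions.
# All three rates below are expressed as a share of all decisions made.
cutoff = 50
predicted = test_scores >= cutoff
hit_rate        = np.mean(predicted == graduated.astype(bool))   # correct decisions
false_positives = np.mean(predicted & (graduated == 0))          # predicted success, did not graduate
false_negatives = np.mean(~predicted & (graduated == 1))         # predicted failure, did graduate
print(f"hit rate = {hit_rate:.0%}, false positives = {false_positives:.0%}, "
      f"false negatives = {false_negatives:.0%}")

# Continuous criterion: the validity coefficient is a Pearson correlation.
r = np.corrcoef(test_scores, gpa)[0, 1]
print(f"validity coefficient r = {r:.2f}")
```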
The history of psychological testing in recent decades reflects not only an evolution in understanding the nature and limitations of tests and test scores but also an increased appreciation of the significance and complexity of criterion measures (see, e.g., James, 1973; Tenopyr, 1986; Wallace, 1965). As a result, with rare exceptions, the notion that there is such a thing as "a criterion" against which a test may be validated is no longer any more tenable than the proposition that the validity of a test can be determined on an all-or-none basis. Instead, the following facts about criteria are now generally understood:
1. In most validation studies, there are many possible indexes (both quantitative and qualitative) that can qualify as criterion measures, including scores from tests other than those undergoing validation. Therefore, careful attention must be given to the selection of criteria and criterion measures.
2. Some criterion measures are more reliable and valid than others. Thus, the reliability and validity of criterion measures need to be evaluated, just as the reliability and validity of test scores do.
3. Some criteria are more complex than others. As a result, there may or may not be a correlation among criterion measures, especially when criteria are multifaceted.
4. Some criteria can be gauged at the time of testing; others evolve over time. This implies that there may or may not be substantial correlations between criterion measures that are available shortly after testing and more distant criteria that may be assessed only over a longer period of time.
5. Relationships between test scores and criterion measures may or may not generalize across groups, settings, or time periods. Therefore, criterion-related validity evidence needs to be demonstrated anew for populations that differ from the original validation samples in ways that may affect the relationship between test scores and criteria, as well as across various settings and times.
6. The strength or quality of validity evidence with regard to the assessment or prediction of a criterion is a function of the characteristics of both the test and the criterion measures employed. If the criterion measures are unreliable or arbitrary, indexes of test score validity will be weakened, regardless of the quality of the test used to assess or predict the criteria, as the brief sketch following this list illustrates.
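Point 6 can be made concrete with the classical correction-for-attenuation relationship, under which the validity coefficient that can actually be observed is limited by the reliabilities of both the predictor and the criterion measure. The numbers below are invented solely to show the effect of an unreliable criterion.

```python
import math

def observed_validity(r_true, reliability_test, reliability_criterion):
    """Observed validity coefficient implied by classical test theory:
    r_observed = r_true * sqrt(reliability_test * reliability_criterion)."""
    return r_true * math.sqrt(reliability_test * reliability_criterion)

r_true = 0.60  # hypothetical "true" predictor-criterion correlation
print(observed_validity(r_true, reliability_test=0.90, reliability_criterion=0.90))  # about .54
print(observed_validity(r_true, reliability_test=0.90, reliability_criterion=0.50))  # about .40: the unreliable criterion weakens validity
```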
Criterion-Related Validation Procedures
The criterion-related decisions for which test scores may be of help can be classified into two basic types: (a) those that involve determining a person's current status and (b) those that involve predicting future performance or behavior. In a sense, this dichotomy is artificial because regardless of whether we need to know something about a person's current status or future performance, the only information test scores can convey derives from current behavior—that is, from the way test takers perform at the time of testing. Nevertheless, criterion-related validation procedures are frequently categorized as either concurrent or predictive, depending on the measures employed as well as their primary goals.
Concurrent and Predictive Validation
Concurrent validation evidence is gathered when indexes of the criteria that test scores are meant to assess are available at the time the validation studies are conducted. Strictly speaking, concurrent validation is appropriate for test scores that will be employed in determining a person's current status with regard to some classificatory scheme, such as diagnostic categories or levels of performance. Predictive validation evidence, on the other hand, is relevant for test scores that are meant to be used to make decisions based on estimating future levels of performance or behavioral outcomes. Ideally, predictive validation procedures require collecting data on the predictor variable (test scores) and waiting for criterion data to become available so that the two sets of data can be correlated. This process is often impractical because of the time element involved in waiting for criteria to mature and also because of the difficulty of finding suitable samples to use in such studies. As a result, concurrent validation is often used as a substitute for predictive validation, even for tests that will be used for estimating future performance, such as college admissions or preemployment tests. In these cases, the test under development is administered to a group of people, such as college sophomores or employees, for whom criterion data are already available.
Often the distinction between the two kinds of validation procedures hinges on the way test users pose the questions they want to answer with the help of a test. Rapid Reference 5.8 contains examples of some typical questions and decision-making situations that may call for either concurrent or predictive validation evidence, depending on how the question is posed and on the time horizon that is chosen. To illustrate the distinction between concurrent and predictive validation strategies, a relatively simple example of each type of study will be presented, followed by a discussion of the major issues pertinent to criterion-related validation.
Concurrent Validation Example: The Whitaker Index of Schizophrenic Thinking
Tests that are used to screen for psychiatric disorders, such as schizophrenia or depression, usually undergo concurrent validation. Typically, these studies employ two or more samples of individuals who differ with respect to their independently established diagnostic status. One of the many instruments whose scores are validated in this fashion is the Whitaker Index of Schizophrenic Thinking (WIST; Whitaker, 1980). The WIST was designed to identify the kind of thinking impairment that often accompanies schizophrenic syndromes. Each of its two forms (A and B) consists of 25 multiple-choice items.
In standardizing the WIST, Whitaker used samples of acute and chronic schizophrenic patients (S), as well as three groups of nonschizophrenics (NS), to derive cutoff scores that would optimally differentiate S from NS individuals. The cutoff scores established with the standardization groups discriminated between S and NS groups with 80% efficiency for Form A and 76% efficiency for Form B. Thus, depending on the form, the cutoff index resulted in 20% to 24% incorrect decisions. With Form A, incorrect decisions of the false negative type—those in which S subjects were classified as NS by the index—were much higher (33%) than false positive decisions, wherein NS subjects were classified as S (10%). The same pattern (38% false negatives vs. 13% false positives) was obtained with