The concept of test validity corresponds to the question of whether a test or an assessment procedure supplies the kind of information needed for a particular interpretation. This issue is important because test scores in and of themselves are meaningless unless they refer to a defined realm of observable phenomena. In clinical psychology, the consequence of interest may be a particular personality style or behavioral disorder.
For clinical neuropsychology, the consequence of interest may range from whether brain impairment exists to the implications of the test results for adaptive behavior. Before particular issues that affect the validity of neuropsychological tests are described, generally applicable concepts of validity and threats to validity are reviewed.
TYPES OF VALIDITY
Recent texts reinterpret validity as "validation"; that is, the issue of interest is not only what a test measures, but also what a test's strengths and weaknesses are in terms of particular subjects, interpretive questions, and settings. Answers to these questions may be found in various sources: the test manual, test reviews, general test standards, and published research. Throughout this book, the reviews of neuropsychological assessment instruments use the test manuals as sources as well as the publicly available research data from the literature and, in some cases, from conference presentations or directly from the researcher.
Validity is not a unitary construct, nor is it a construct that can be readily and permanently decided by a single piece of research. In addition, there is sometimes disagreement among clinical neuropsychologists on the topic of test validity. We do not attempt to resolve the issues of disagreement, but attempt to enlarge the discussion of validity by including issues of contextual and comparative quality as discussed by Cronbach (1984). We also attempt to keep the discussion focused on a very important central point: tests themselves are not valid or invalid, but inferences drawn from test results can be described as either valid or invalid.
VALIDITY: BASIC DEFINITIONS
Validity may be defined in an empirical sense as a statistical relationship between the results of a particular procedure and a characteristic of interest, that is, between a contrived procedure and other independently observed events (Anastasi, 1982; Nunnally, 1978).
These relationships may be defined in terms of the content of a test, related criteria, and underlying constructs. All of these can be qualified in terms of contextual and subject variables.
Content Validity
Content validity is the degree to which a test adequately samples from the domain of interest. For example, if a test is intended to be sensitive to the global aspects of memory dysfunction, there need to be sufficiently representative tasks that tap the construct defining "global memory processes" as understood by the test developer. Therefore, it is important that the developer describe the definition of the construct fairly completely, so that discussions in the literature can focus on the theoretical adequacy of the definition, rather than on content validity. In our example of a test of memory, if the test designer subscribed to the Atkinson and Shiffrin (1968) model of memory, the content of the test would optimally include tasks that demonstrate the acquisition of information at the sensory level, decay from short-term storage, decay from long-term storage, and the effects of rehearsal and interference.
The process of content validation begins when the initial test items or procedures are selected during the design of an instrument. Careful selection and design in the initial stages will help to ease the task of later validation. Most arguments used to support content validation are theoretical and logical. Content validation may be most easily evaluated or demonstrated when the procedure is an operational version of a well-defined theory or a component of a theory. A "table of specifications" (Hopkins & Antes, 1978) that demonstrates the translation of theory into test items can further facilitate the demonstration of content validity.
A table of specifications can be described as a two-dimensional matrix in which one dimension represents the skills or traits of interest and the other dimension represents the behaviors characteristic of those areas of interest. This matrix can be used to determine the number of items that ought to be assigned to a given topic area as well as the type of item (e.g., timed vs. untimed or spoken vs. written) that needs to be included. With the use of these guides, a neuropsychological battery can be expected to include diverse and apparently unrelated items, whereas a test of language comprehension should contain items that are more consistent with each other.
As an example, Table 5.1 demonstrates a table of specifications for the Visual Processes Scale of the Luria-Nebraska Neuropsychological Battery (LNNB). This table has been greatly simplified for descriptive purposes. The functions described here could also be further specified as visual recognition of objects, visual recognition of line drawings, visually mediated inferential reasoning, and so on, and topic areas could be weighted according to their perceived theoretical importance (20% of the items could be selected to represent operations based on mental rotations).
TABLE 5.1. Specifications for the Luria-Nebraska Neuropsychological Battery: Visual Processes Scale

Function                         Simple              Complex
Visually mediated recognition    Items 86, 87        Items 88, 89
Figure-ground perception         Items 90, 91
Spatially mediated perception    Items 94, 95, 96
Mental imagery                   Items 97, 99

Items are usually derived by a variety of methods, ranging from examples of behavior
that are directly stated in the theoretical model (e.g., the expectation of increased skin conductance as anxiety increases) to the selection of items by expert judges. In terms of neuropsychological procedures, the sensory modality (visual, auditory, or tactile) through which a stimulus item is presented and the likely internal processes used to achieve a solution (rote vs. reasoned) are both important to content validity as well as to construct validity. The pertinent issue is the likely generalizability (Nunnally, 1978) of the information derived from an item and, consequently, the likely generalizability of the test as a whole. In dealing with content validity, the test designer must be aware of the tradeoff between lack of specificity or insufficient sampling of the domain and tedious redundancy that does not usefully articulate the characteristic of interest.
The success with which a test designer resolves this tradeoff is an issue for future validational research. Later investigators may propose a contrasting theory or may suggest a different set of items as better translating the initially proposed theory. When tests are shown to have limited content validity, the issue raised is insufficient sampling of the purported domain of interest. The causes of limited content validity are usually incomplete understanding of the underlying theory, lack of a guiding theory, or a tendency at the time of construction to assume overly general interpretations of item performance. One's intent may be to design a widely applicable neuropsychological screening test, but the items may all require that the subject copy simple geometric designs. There is a large subset of brain-impaired subjects (e.g., subjects with specific impairments in language skills) for whom performance on such a test will not be diagnostically useful.
Face validity is often erroneously used as a synonym for content validity. However, a more accurate description of the relationship between content validity and face validity is that face validity is a category of content validity. Face validity refers primarily to the perceptions of the examinee. The subject forms some conclusions regarding what the test is measuring on the basis of the content or language of the test procedures. In more ambiguous test situations (e.g., personality evaluation), the examinee's assumptions about the nature of the test may be widely disparate from the examiner's intent. Face validity becomes an important concern to the extent that the perceptions of the examinee affect his or her performance on the test; the result may be confounded with the intended measurement purpose of the test. An examinee is more likely to participate in an assessment that uses procedures that appear to provide information that will answer his or her concerns.
Face validity can also affect the decision-making process of potential users of the test.
The astute clinician will ultimately choose tests on the basis of the available empirical literature, but the original decision to investigate the utility of a test may be influenced by
the "look" of a test. Face validity may also influence later empirical validation work by researchers other than the test designer. Subsequent investigators may assume different intents for the items or test scales based on their interpretation of the intent of an item or the meaning of a scale name. Some of these problems can be avoided by including in the manual a specific discussion of the theoretical background of a test as well as a discussion of particular item and scale intents. The manual can also describe the scale titles as somewhat arbitrary labels and give titles that have readily shared meanings in the commu- nity of test researchers and consumers. The latter is not always possible because there is no standard dictionary of neuropsychological terms. Even if such a dictionary existed, there would not be universal acceptance of the definitions, given the diverse theoretical, epis- temological, and procedural backgrounds of the group of individuals who constitute the practitioners of clinical neuropsychology.
It is advisable to give titles that have a relatively large amount of acceptance and agreement among clinical neuropsychologists. It is further advisable to give names to the scales and to the skill areas purported to be measured that reflect behavioral descriptions rather than unobservable constructs. For example, a test might be described as a test of memory, but it would be more accurate to describe the scales and procedures as being measures of new learning, immediate recall, or delayed recall. From the preceding discussion, it can be seen that, although face validity does not play a direct role in the empirical definition of a test, it does play a mediating role in the understanding and evaluation of the test data.
Content validity plays a large role in the development of a test. Most of the evaluative processes involved in investigating content validity are based on logical considerations.
Nunnally (1978) suggested a few ways in which content validity can be evaluated. For example, if a scale is thought to measure a single construct, the items can reasonably be expected to demonstrate a fair amount of internal consistency, a relationship that can be empirically evaluated. For those constructs thought to be affected by experience, content validity can be evaluated by demonstrating improved performance following a training procedure. Content validity can also be evaluated by comparing performance on the test in question with performance on other tests thought to tap the same or similar constructs.
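The internal-consistency check mentioned above can be carried out with coefficient alpha. The following is a minimal sketch in Python; the score matrix and the function name are invented for illustration and are not part of any published scoring procedure.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for an (n_subjects, n_items) matrix of item scores."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical scores for 6 subjects on a 4-item scale
scores = np.array([
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
])
print(f"coefficient alpha = {cronbach_alpha(scores):.2f}")
```

A high alpha is consistent with, though it does not prove, the claim that the items sample a single construct.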
Anastasi (1982) suggested a qualitative method of checking content validity by having subjects think out loud during the testing. Of course, this suggestion has limited utility for assessing brain-impaired populations.
Still another method of assessing content validity is to conduct an error analysis. When persons with similar brain dysfunctions fail an item, a review of their error patterns for signs of consistency may suggest how the brain dysfunction translates into behavior. For example, failures during the block design subtest of the WAIS-R that are due to the individual's breaking the external configuration of the target design have been related to right cerebral dysfunction (Kaplan, 1983).
The demonstration of content validation is limited by the extent to which extraneous factors affect test performance. The relative contributions of motivation, academic experience, prior intellectual resources, socioeconomic status, sex, and psychological status must be understood before an accurate assessment of content relevance and generalizability can be conducted. Differences in timing procedures, administration instructions, or scoring procedures may cause relationships with external criteria to vary considerably. Clinical education optimally emphasizes the role of these variables in the process of measurement,
but it is wise for both researchers and clinicians to be mindful of the need to minimize the effect of the variables on the derived score.
Criterion Validity
Anastasi (1982) described criterion-validating procedures as those that demonstrate a test's effectiveness in a given context. Criterion validity is commonly expressed as the correlation between a test score and some external variable, which may be another test that is assumed to measure the same characteristic of interest or future behavior that is assumed to demonstrate the characteristic of interest. In the former case, one usually speaks of concurrent validity, as allied behavior is measured at the same time. In the latter case, one usually speaks of predictive validity, as performance on the test being evaluated is used to temporally predict the criterion, whether it is diagnostic group membership or behavioral change.
Although the information is rarely offered in test manuals, a validity coefficient may be computed as the correlation between the score achieved on a particular test and some criterion variable. For example, one may compute the correlation between a score on the Category Test (Reitan & Wolfson, 1993) and a dummy-coded variable indicating diagnostic group (i.e., 1 = normal, 2 = neurologically impaired). In this case, a significant positive relationship would indicate that a high score on the Category Test was associated with a subject's being part of the neurologically impaired diagnostic group. Assignment to the diagnostic group would have to be made on the basis of external, independent criteria, such as a neurological evaluation, the results of neuroradiological evaluation, or previous history.
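The computation itself is straightforward. The sketch below uses invented scores, not actual Category Test data; it simply shows that the validity coefficient is the Pearson correlation between the test score and the dummy-coded criterion (a point-biserial correlation when the criterion has two levels).

```python
import numpy as np

# Invented Category Test error scores and dummy-coded diagnostic group
# (1 = normal, 2 = neurologically impaired), following the coding in the text.
scores = np.array([38, 45, 52, 41, 62, 88, 97, 76, 103, 91], dtype=float)
group = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype=float)

# Validity coefficient: the correlation between test score and criterion.
r = np.corrcoef(scores, group)[0, 1]
print(f"validity coefficient r = {r:.2f}")
```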
The validity coefficient can be negatively affected by the degree of homogeneity of the sample. This can occur both numerically and conceptually. Numerically, the magnitude of the correlation coefficient decreases as the overall sample becomes more similar in test scores because of the restriction of range. Conceptually, a high validity coefficient might be associated with a particular type of neurological impairment, but not with others. Validity coefficients can also be affected by gender, level of education, age, and so on. Such moderating effects provide important information for determining the utility of the instrument; however, they can be disheartening when they are not anticipated by the test author.
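The numerical effect of restriction of range is easy to demonstrate with simulated data. In the sketch below (invented numbers, not an empirical result), the test-criterion correlation observed in the full sample shrinks markedly when only subjects within a narrow band of test scores are retained.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulate a test score and a criterion that correlate in the full sample.
test = rng.normal(size=n)
criterion = 0.6 * test + 0.8 * rng.normal(size=n)
full_r = np.corrcoef(test, criterion)[0, 1]

# Retain only subjects whose test scores fall in a narrow band, as happens
# when the sample is homogeneous on the measure of interest.
keep = np.abs(test) < 0.5
restricted_r = np.corrcoef(test[keep], criterion[keep])[0, 1]

print(f"full-range correlation:       {full_r:.2f}")
print(f"restricted-range correlation: {restricted_r:.2f}")
```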
The relationship between a validity coefficient and decision theory is important in deciding the significance of the validity coefficient. Decision theory, as applied to tests, is a mathematical operationalization of the decision-making process. Early models of the theory were based on the net improvement in accuracy over the base rate (the natural frequency of occurrence of a given phenomenon) that one could gain at various levels of validity. Applications to psychological instruments were developed by Cronbach and Gleser (1965). We may see the application of the theory in the following example: If we were to identify all persons as neurologically impaired, even a test with a perfect validity coefficient (r = 1.0) would not contribute additional information to the decision. On the other hand, if we knew ahead of time that approximately 70% of the people referred for neuropsychological evaluation were neurologically impaired, that same test with the perfect validity coefficient would improve our accuracy by approximately 16%. If the prior base rate were 20%, the same test would improve our accuracy by 80%. As a rule of thumb, a base rate of 50% optimizes the utility of an instrument. As the likelihood of seeing a condition becomes more rare, the utility of an instrument in terms of improving on the base rate decreases. The cost-benefit ratio of developing and applying an instrument for rare conditions is low. It might be more useful to learn the qualitative and historical symptoms of the disorder and to assess for them rather than to develop a specific identification instrument. Tables for computing the relationship between base rate and validity coefficients for specific decision levels can be found in Taylor and Russell (1939).
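The general logic can be made concrete with a small calculation. The sketch below is not a reproduction of the Taylor and Russell (1939) tables; it uses a hypothetical test described by fixed sensitivity and specificity (rather than a validity coefficient) and compares its overall accuracy with the accuracy obtained by assigning everyone to the more frequent group, illustrating how the gain shrinks as the base rate departs from 50%.

```python
def accuracy_with_test(base_rate, sensitivity, specificity):
    """Overall classification accuracy of a test at a given base rate of impairment."""
    return sensitivity * base_rate + specificity * (1.0 - base_rate)

def gain_over_base_rate(base_rate, sensitivity, specificity):
    # Without the test, the best naive strategy is to call everyone a member of the
    # more frequent group, which is correct max(base_rate, 1 - base_rate) of the time.
    naive = max(base_rate, 1.0 - base_rate)
    return accuracy_with_test(base_rate, sensitivity, specificity) - naive

for br in (0.50, 0.70, 0.90):
    gain = gain_over_base_rate(br, sensitivity=0.85, specificity=0.85)
    print(f"base rate {br:.0%}: gain over the base-rate decision = {gain:+.1%}")
```

With these hypothetical values the gain is largest at a 50% base rate and disappears, or even becomes negative, as the condition becomes very common or very rare.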
An elaborate alternative to computing a validity coefficient is a research design sometimes referred to as the method of contrasted groups (Anastasi, 1982). Here, the researcher evaluates the results of testing two groups that are assumed to differ on the criterion of interest. For example, persons with no known neurological impairment and persons with known neurological impairment are both administered the Luria-Nebraska Neuropsychological Battery. The scores are then compared by some statistical method. A traditional method of comparing test performance has been to test for the significance of the difference between mean scores for the two groups. Unfortunately, the two groups may contain members whose scores lie in the region where the distributions of the two groups overlap. An alternative and preferable method begins with the use of the statistical technique known as discriminant function analysis. The results of such an analysis produce a linear composite that may be used to demonstrate maximal separation between the two diagnostic groups. The weightings derived from this analysis may then be used to assign the subjects statistically to "predicted" groups. Predicted group membership is then compared with actual group membership, and an accuracy rate, or "hit rate," is computed. The latter part of this analysis is referred to as classification analysis.
Although many examples of this set of procedures may be found in even a quick review of current journals, misapplications of the procedures are common. In applying such a design, two major points should be kept in mind. First, the discriminant function analysis and the classification analysis are separate procedures. The discriminant analysis identifies those variables that significantly contribute to the function that statistically separates the groups of interest, and the classification analysis applies an equation based on those significant variables in order to predict group membership. Optimally, applying the equation to the group on which it was derived should result in a relatively high hit rate. Second, demonstrating a significant hit rate in such a derivation sample is only the first step in determining predictive validity. It is not until the derived equation can significantly predict group membership in a second, or cross-validation, sample that predictive validity can be said to have been demonstrated for that set of test procedures.
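A minimal sketch of this two-step logic, using simulated scale scores and the scikit-learn implementation of linear discriminant analysis, is given below. The groups, sample sizes, and scores are invented; the point is only that the function is derived and its hit rate computed on one sample, and predictive validity is then examined by classifying an independent cross-validation sample.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Invented scale scores: 60 controls and 60 impaired subjects on 5 scales,
# with the impaired group shifted upward.
derivation_X = np.vstack([rng.normal(50, 10, (60, 5)), rng.normal(58, 10, (60, 5))])
derivation_y = np.array([0] * 60 + [1] * 60)

# Step 1: derive the discriminant function and its hit rate on the derivation sample.
lda = LinearDiscriminantAnalysis().fit(derivation_X, derivation_y)
derivation_hit_rate = lda.score(derivation_X, derivation_y)

# Step 2: classification analysis on an independent cross-validation sample.
crossval_X = np.vstack([rng.normal(50, 10, (40, 5)), rng.normal(58, 10, (40, 5))])
crossval_y = np.array([0] * 40 + [1] * 40)
crossval_hit_rate = lda.score(crossval_X, crossval_y)

print(f"derivation-sample hit rate:       {derivation_hit_rate:.2%}")
print(f"cross-validation-sample hit rate: {crossval_hit_rate:.2%}")
```

The cross-validation hit rate is typically lower than the derivation hit rate, which is why the second step is required before predictive validity can be claimed.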
The use of such a design is a sophisticated demonstration and is open to many sources of contamination. One of the major sources is sample representativeness. If the procedure being evaluated is intended to have broad applications rather than to demonstrate the presence of a particular type of neurological impairment, and if the effects of ancillary subject characteristics are to be ruled out, both the diagnostic subgroup characteristic of neurological impairment and the diagnostic subgroup characteristic of "normality" need to be as heterogeneous as possible. To the extent that they are not heterogeneous, the generalizability of the validating investigations needs to be clearly defined. In addition, the usual subject-matching procedures in the contrasting-groups methodology should be observed just as they would be in any clinical research.
The hit rate itself may be contaminated. Willis (1984) demonstrated that the meaning of a given rate of accuracy may be inflated if one does not take into account the prior probabilities of group membership (the base rate). Thus, when one is comparing two