The Wechsler Adult Intelligence Scale-Revised and Wechsler Adult Intelligence Scale-III

6

The Wechsler Intelligence Scales are the most popular method of estimating IQ in clinical settings. There have been Spanish (Escala Inteligencia de Wechsler or EIWA) and Japanese translations (Hattori & Lynn, 1997) as well as adaptations for British and for Canadian subjects (Pugh & Boer, 1991).

The Wechsler Adult Intelligence Scale-Revised (WAIS-R; Wechsler, 1981) is, as its name implies, a revised form of the Wechsler Adult Intelligence Scale (WAIS), which was first published in 1955. There were two main reasons the original version was revised. First, some of the items on the WAIS had become dated and were replaced with more contemporary items. An attempt was also made to make the test more culture-fair by including items related to black people and black history. Whether these attempts succeeded is an empirical question. Second, with the passage of three decades, the norms for the WAIS were unlikely to describe deviation intelligence. Scores could be interpreted as deviation IQs from the original sample, not from the general contemporary population.

The structure of the WAIS was maintained in the WAIS-R. There are 11 subtests, each of which is purported to measure components of intelligence. These 11 subtests are grouped under two general headings: the Information, Digit Span, Vocabulary, Arithmetic, Comprehension, and Similarities subtests are included in Verbal IQ (VIQ), and the Picture Completion, Picture Arrangement, Block Design, Object Assembly, and Digit Symbol subtests are included in Performance IQ (PIQ). All of the subtests are included in Full Scale IQ (FSIQ).

Interpretation can take place at the level of the subtest scores, the PIQ and VIQ scores, or the FSIQ scores. Although the WAIS and the WAIS-R appear similar in content, structure, and procedure, there are important differences in the scores obtained from the two tests (Ryan, Nowak, & Geisser, 1987). Wechsler (1981), Lippold and Claiborn (1983), Smith (1983), Simon and Clopton (1984), Wagner and Gianakos (1985), and Whitworth and Gibbons (1986) have all reported that use of the WAIS-R results in significantly lower estimates of IQ than does use of the WAIS in multiple populations.

The finding of lower IQ estimates when more contemporary assessment methods are used is very common and is seen in almost all revisions of intelligence tests. The lower estimates are necessary to counteract another phenomenon that has been labeled the Flynn effect. The Flynn effect is the consistent observation that, regardless of the standardized method used to evaluate intelligence and regardless of the population studied, IQ scores tend to increase between 2 and 3 points per decade. This is not likely to be due to general increases in average human intelligence, as any observer of politics or even of daily human behavior can easily ascertain. Instead, this increase may be due to shifts in the typical knowledge and skills of the average individual. If the deviation IQ is to have a consistent meaning as an index of difference from the average, then the normative transformation must change as well.

Although it is not as clear whether this gain is true for other neuropsychological skill areas, it certainly seems to be the case for general intelligence. The gain has occurred for standardized tests of intelligence in American (Flynn, 1984) and in Scottish subjects (Flynn, 1990), as well as for tests of intelligence in other countries even when those tests are relatively less dependent on culture (Flynn, 1987). The reasons for this gain have been debated without complete resolution, but regardless of the reason, the effect poses problems for clinical assessment unless tests are renormed consistently. Flynn (1998) provided suggestions for compensating for obsolete norms prior to the publication of revised intelligence tests.
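The arithmetic behind norm obsolescence can be illustrated with a minimal sketch. It assumes a constant linear gain per decade, taken from the 2- to 3-point range cited above, and it is not the specific procedure suggested by Flynn (1998). The function name, its parameters, and the example years are hypothetical.

```python
def flynn_adjusted_iq(observed_iq, test_year, norm_year, points_per_decade=3.0):
    """Subtract the score inflation expected from aging norms.

    Assumes a constant linear gain; points_per_decade is an assumption
    chosen from the 2- to 3-point range cited in the text.
    """
    years_elapsed = test_year - norm_year
    if years_elapsed < 0:
        raise ValueError("test_year must not precede norm_year")
    inflation = points_per_decade * years_elapsed / 10.0
    return observed_iq - inflation


# Hypothetical case: norms collected 20 years before the testing date.
adjusted = flynn_adjusted_iq(100, test_year=1998, norm_year=1978)
print(adjusted)
```

Under these assumptions, an observed score of 100 obtained 20 years after the norms were collected would correspond to roughly 94 against contemporary norms. The value of such a correction depends entirely on how well the assumed rate fits the test and population in question.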

NORMATIVE INFORMATION

The WAIS-R was normed on a sample of 1880 individuals in the United States who were between 16 and 74 years of age. By the use of information contained in the 1970 U.S. Census, the sample was stratified into groups on the basis of sex, age, race, geographic location, occupation, education, and urban versus rural residence. Raw scores on the WAIS-R are converted to scale scores by comparing the performance of the subject with the performance of the normative sample. The subtest scale scores have a mean of 10 and a standard deviation of 3. By examining the conversion tables for the age groups, one can see that it is mainly decrements in Performance subtests that make the age norms necessary.

Although scale scores are available for the different age groups, the determination of IQ is made by obtaining scale scores based on the performance of the normative sample as a whole. It is when the subtest scale scores are summed that the age norms are used in the determination of IQ. IQ scores have a mean of 100 and a standard deviation of 15. Although the standardization sample did not demonstrate differences in performance due to race, Marcopulos, McLain, and Giuliano (1997) offer evidence that, at least for elderly subjects, race (Caucasian vs. African-American) was an important predictor of performance.
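The two score metrics described above can be sketched as linear transformations of a z score. This is an approximation for illustration only: in practice the conversions are read from the tables in the manual, and the normative means and standard deviations used below are hypothetical values, not figures from the WAIS-R.

```python
def z_score(value, norm_mean, norm_sd):
    """Standardize a value against a normative mean and standard deviation."""
    return (value - norm_mean) / norm_sd


def to_scaled_score(raw, norm_mean, norm_sd):
    """Subtest metric: mean 10, standard deviation 3."""
    return 10 + 3 * z_score(raw, norm_mean, norm_sd)


def to_deviation_iq(sum_of_scaled, norm_mean, norm_sd):
    """IQ metric: mean 100, standard deviation 15."""
    return 100 + 15 * z_score(sum_of_scaled, norm_mean, norm_sd)


# Hypothetical norms: a raw score one SD above the normative mean.
print(to_scaled_score(30, norm_mean=24, norm_sd=6))
# Hypothetical norms: a sum of scaled scores half an SD above the mean.
print(to_deviation_iq(110, norm_mean=100, norm_sd=20))
```

With these made-up norms the first call yields a scaled score of 13 and the second a deviation IQ of 107.5. The point of the sketch is only that both metrics express distance from a normative mean, so their meaning depends on which normative group supplies that mean.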

RELIABILITY

The scoring reliability of the WAIS-R was evaluated in a study reported by Ryan, Prifitera, and Powers (1983). In this study, 19 Ph.D. psychologists and 20 psychology graduate students scored two WAIS-R protocols that had been obtained from vocational counseling clients. The consequent IQ scores varied as much as 8 points across scorers. The variability in the scores of the two groups (Ph.D.s vs. students) was approximately equal, except in the case of the PIQ, where the Ph.D.s had a larger variance in scores that was statistically significant for both of the protocols. The results of this study underline the necessity of strict adherence to the standardized scoring system when using the WAIS-R.

The manual for the WAIS-R (Wechsler, 1981) gives split-half reliability information for each of the subtests except for the speeded subtest of Digit Symbol and the power subtest of Digit Span. In both of these cases, the test-retest reliability is reported. Reliability information is given for each of the nine age groups and is then averaged, by means of Fisher's r-to-z transformation, to yield a reliability coefficient averaged across the age groups. The reliability coefficients for the IQ scores were obtained by a methodology for computing the reliability of a composite group of tests. The split-half reliability coefficients are corrected for the length of the tests by use of the Spearman-Brown formula, but no information is given regarding the nature of the split. The reliability coefficients range from .68 for Object Assembly to .96 for Vocabulary. The reliability coefficients for the IQ scores are .97 for VIQ, .93 for PIQ, and .97 for FSIQ. It is important to remember that the IQ reliability coefficients are estimates based on composite scores. Equivalent internal consistency reliability coefficients have been reported in two separate samples of psychiatric inpatients (Boone, 1992a; Piedmont, Sokolove, & Fleming, 1989).
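The two computations named above, the Spearman-Brown correction and averaging by Fisher's r-to-z transformation, can be sketched as follows. The sketch assumes an unweighted average across groups; whether the manual weighted the age groups by sample size is not stated here. The input coefficients are hypothetical.

```python
import math


def spearman_brown(r_half, length_factor=2.0):
    """Project the reliability of a test lengthened by length_factor.

    With length_factor=2 this is the usual correction that turns a
    half-test correlation into a full-length reliability estimate.
    """
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)


def fisher_average(coefficients):
    """Average correlations by way of Fisher's r-to-z transformation.

    Unweighted mean in the z metric (an assumption), transformed back to r.
    """
    z_values = [math.atanh(r) for r in coefficients]
    return math.tanh(sum(z_values) / len(z_values))


# Hypothetical half-test correlation and hypothetical age-group coefficients.
print(round(spearman_brown(0.50), 4))
print(round(fisher_average([0.70, 0.80, 0.90]), 4))
```

A half-test correlation of .50 corrects to about .67 for the full-length test. Averaging in the z metric gives a value slightly higher than the arithmetic mean of the coefficients, because the transformation stretches the scale near 1.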

Test-retest reliability is presented for each of the subtests as well as for IQ scores. The test-retest values were computed from the results of two administrations to a group of 71 subjects between the ages of 25 and 34 years and a sample of 48 subjects between the ages of 45 and 54 years. The intertest intervals ranged from 2 to 7 weeks. The test-retest reliability coefficients ranged from .69 for Digit Symbol in the 25- to 34-year-old group to .94 for Information in the 45- to 54-year-old group. The IQ test-retest coefficients were high, the lowest being .89 for the PIQ in the 25- to 34-year-old group and the highest being .97 for the VIQ in the 45- to 54-year-old group. There was no test for the difference in level across the two test sessions. This is a potentially important issue, as studies have indicated that, at least for the WAIS, there are substantial gains in IQ scores across time even in the presence of significant correlation coefficients (Catron & Thompson, 1979; Matarazzo, Carmody, & Jacobs, 1980). Carvajal, Schrader, and Holmes (1996) reported test-retest data similar to those in the manual but with 18- to 19-year-old subjects.

Hopp, Dixon, Grut, and Bäckman (1997) administered the WAIS-R to elderly normal subjects (aged 77 to 87 years) at 6-month intervals over a 2-year period. The WAIS-R demonstrated highly stable internal consistency estimates across periods as well as stable score values. When healthy elderly individuals were administered the WAIS-R twice across a 1-year period, there was a high correlation but also an average increase of 3 points in FSIQ (Raguet, Campbell, Berry, Schmitt, & Smith, 1996). Snow, Tierney, Zorzitto, Fisher, and Reid (1989) found good reliability for IQ scores but not for subtest scores or VIQ-PIQ discrepancy scores in a sample of healthy elderly individuals. This lack of reliability for the VIQ-PIQ split was also found in the standardization sample (Matarazzo & Herman, 1984).

The temporal reliability of WAIS-R IQs in 16-year-old subjects tested twice across either a 3-month or an 18-month interval was equivalent to the values reported in the manual (Thompson & Molly, 1993).

The temporal stability of the WAIS-R in populations that would be administered the test for clinical reasons is also pertinent. Ryan, Georgemiller, Geisser, and Randall (1985) report a wide range of gains or losses on retest in a heterogeneous sample, and this change was correlated with years of education. Watkins and Campbell (1992) found that the WAIS-R had good reliability in a sample of mentally retarded adults who were tested over an average interval of 2 years and 8 months. Boone (1992a) reported good temporal reliability in psychiatric inpatients tested twice across a 2- to 8-week interval. Hawkins and Sayward (1994) found that the test-retest reliability in psychiatric inpatients tested again after discharge (mean interval was 15 months) was comparable to the values reported in the manual.

Knowledge about the presence of significant practice effects is as important as knowledge about reliability. Axelrod, Brines, and Rapaport (1997) found that subtracting 6 points from the retest FSIQ resulted in the best approximation of the original FSIQ when a group of normal individuals was administered the WAIS-R twice across a 14-day interval. There may be differences in stability when the subjects are initially impaired. Therefore, it is vital that information be gathered regarding the test-retest reliability of the WAIS-R in those populations that are likely to be administered the instrument more than once. Rawlings and Crewe (1992) presented information regarding the average gains in IQ scores when individuals with traumatic brain injury (TBI) were tested at 2, 4, 8, and 12 months post-injury as compared to individuals with TBI who were tested twice at 2 and 12 months post-injury.

Over the 12-month period, FSIQ increased nearly 14 points, VIQ increased nearly 9 points, and PIQ increased nearly 20 points. These researchers concluded that the practice effects were secondary to the improvement via recovery of skill in these subjects. In another look at this issue but with a different sample of head-injured patients as subjects, mean improvement scores were less than half those reported in the other study (Moore et al., 1990). Here, improvement correlated negatively with months post-injury, which may also account for the differences between the two studies.

McSweeny, Naugle, Chelune, and Lueders (1993) presented a method for evaluating individual change based on regression analysis of test-retest values in subjects known to be stable that may be helpful when comparison subjects with similar clinical characteristics are available.
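The general logic of a regression-based change index can be sketched as follows. This is an illustration of the approach under simple assumptions (one predictor, ordinary least squares) and does not reproduce the equations published by McSweeny et al. (1993). The function names and the reference data are hypothetical.

```python
import math


def fit_retest_regression(baseline, retest):
    """Fit retest = intercept + slope * baseline in a stable reference group.

    Returns the intercept, the slope, and the standard error of estimate.
    """
    n = len(baseline)
    mean_x = sum(baseline) / n
    mean_y = sum(retest) / n
    sxx = sum((x - mean_x) ** 2 for x in baseline)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(baseline, retest))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(baseline, retest)
    )
    see = math.sqrt(residual_ss / (n - 2))
    return intercept, slope, see


def change_z(baseline_score, retest_score, intercept, slope, see):
    """Standardized distance of an observed retest score from its prediction."""
    predicted = intercept + slope * baseline_score
    return (retest_score - predicted) / see


# Hypothetical stable reference group showing a small practice gain.
reference_baseline = [90, 100, 110, 120]
reference_retest = [95, 104, 116, 125]
intercept, slope, see = fit_retest_regression(reference_baseline, reference_retest)

# Hypothetical patient whose retest score falls below the predicted value.
print(change_z(100, 101, intercept, slope, see))
```

Because the prediction already incorporates the practice gain and regression to the mean observed in the reference group, a negative value indicates a retest score lower than would be expected from stability alone. A real application would need a far larger reference sample than the four cases used here.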

The stability of test scores is important, as is the stability of other indices such as VIQ-PIQ splits and indices of scatter. Paolo and Ryan (1996) reported that when 61 normal elderly subjects were tested twice over an average period of 65 days, although the VIQ-PIQ split showed good reliability, there was great variability in both intersubtest and intrasubtest scatter with 52% disagreement of relative strengths and weaknesses.

The manual also presents information regarding the standard error of measurement for each of the subtest scores and the IQ scores for the nine age groups. These values are then averaged to provide an estimate of the overall standard error of measurement. The values are presented as scaled score units. These values range from .61 for Vocabulary to 1.54 for Object Assembly. The standard error of measurement values for the IQ scores are 2.74 for the VIQ, 4.14 for the PIQ, and 2.53 for the FSIQ. Naglieri (1982) presented values for confidence limits for the IQ scores. However, there is a problem with the values reported in the manual and by Naglieri. There are three formulas for the computation of the standard error of measurement. The formula used in the manual provides an estimate of the amount of error variance in the score and does not provide confidence boundaries for the scores, as the manual incorrectly states. Knight (1983) presented values computed from the formula that does allow interpretation as confidence boundaries. Ryan, Paolo, and Brungardt (1990) present data regarding the reliability of the WAIS-R in elderly people and recommend its use in that population.
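The distinction can be illustrated with three standard errors commonly described in classical test theory. Treating these as the three formulas referred to above is an assumption, as is the choice of the standard error of estimate for the confidence interval; the text does not specify which formula Knight (1983) applied. The observed score of 130 is hypothetical.

```python
import math


def se_measurement(sd, reliability):
    """Spread of observed scores around a fixed true score."""
    return sd * math.sqrt(1 - reliability)


def se_estimate(sd, reliability):
    """Spread of true scores around the true score estimated from an observed score."""
    return sd * math.sqrt(reliability * (1 - reliability))


def se_prediction(sd, reliability):
    """Spread of retest scores around the retest score predicted from an observed score."""
    return sd * math.sqrt(1 - reliability ** 2)


def true_score_interval(observed, mean, sd, reliability, z=1.96):
    """Interval centered on the estimated true score (regressed toward the mean)."""
    estimated_true = mean + reliability * (observed - mean)
    half_width = z * se_estimate(sd, reliability)
    return estimated_true - half_width, estimated_true + half_width


# IQ metric (SD 15) with the FSIQ reliability of .97 quoted in the text.
print(round(se_measurement(15, 0.97), 3))
print(true_score_interval(130, mean=100, sd=15, reliability=0.97))
```

With a reliability of .97 the first formula gives about 2.6 IQ points, close to but not identical with the averaged values quoted from the manual. The interval is centered on the regressed estimate of the true score (129.1 in this example) and not on the observed score of 130, which is the practical difference between describing error variance and setting confidence boundaries.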

The manual provides a table of the minimum value required for the reliability of differences between pairs of scale scores and between the IQ scores. Atkinson (1991a) examined the reliability of intersubtest comparisons in mentally retarded subjects and found somewhat lower reliability than was reported in the manual for the normative sample; however, this difference did not seem to be of sufficient magnitude to invalidate interpretations based on these differences. D.E. Boone (1998) examined the specificity of variances for the subtests in psychiatric inpatients and found evidence in support of ipsative comparisons and interpretation only for the Information, Arithmetic, Vocabulary, Comprehension, Similarities, Picture Arrangement, and Digit Symbol subtests. Whether this finding applies to other populations and the relationship of the specific variance estimates to the size of the observed comparisons needs further explication, but hypotheses regarding ipsative comparisons with the remaining subtests should be interpreted cautiously when the observed differences are close to the standard errors. Atkinson (1992a) presents values for ipsative comparisons in individuals with mental retardation. This set of data is perhaps more relevant to neuropsychological practice than similar data gained from studies of individuals with no impairment.

Many people erroneously interpret a statistically reliable difference between PIQ and VIQ as a clinically significant one. This is an especially important issue when the VIQ-PIQ split is used to generate hypotheses about the presence of organicity. For example, according to the manual, a VIQ-PIQ split of 12.04 is statistically reliable at the .05 level for the 16- to 17-year-old group. However, Knight (1983) reported that more than 25% of the normative sample in that age group exceeded that difference. Clearly, a difference of that magnitude cannot be clinically significant. Grossman (1983) also examined the prevalence of different magnitudes of VIQ-PIQ splits in the normative sample for the WAIS-R. He reported that 10% of the normative sample, averaged across age groups, had a VIQ-PIQ split of at least 18 points. The normative sample was chosen partly on the basis of a negative history of neurological impairment. The fact that a large percentage of subjects manifested reliably large VIQ-PIQ splits argues against the use of the split to diagnose organicity.
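The contrast between a statistically reliable difference and a common one can be made concrete with a short sketch. The critical-difference function uses a common classical-test-theory formula based on the two standard errors of measurement; that the manual used exactly this computation is an assumption. The base-rate function simply counts cases, and the sample of differences passed to it is invented for illustration.

```python
import math


def critical_difference(sem_a, sem_b, z=1.96):
    """Smallest difference unlikely to arise from measurement error alone."""
    return z * math.sqrt(sem_a ** 2 + sem_b ** 2)


def base_rate(differences, threshold):
    """Proportion of a sample whose absolute difference meets the threshold."""
    hits = sum(1 for d in differences if abs(d) >= threshold)
    return hits / len(differences)


# Averaged standard errors of measurement quoted in the text for VIQ and PIQ.
print(round(critical_difference(2.74, 4.14), 2))

# Hypothetical VIQ minus PIQ differences in a small normal sample.
sample_differences = [0, 5, -13, 20, -3]
print(base_rate(sample_differences, threshold=12))
```

The first value, about 9.7 points, is based on the averaged standard errors and therefore will not match the age-specific value of 12.04 cited above. The second value shows that a difference can exceed the reliability threshold and still occur in a large share of normal subjects, which is the argument made by Knight (1983) and Grossman (1983).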

Piedmont, Sokolove, and Fleming (1989) had earlier cautioned against strict interpretation of the split when the minimum values reported in the manual are met. Atkinson (1991b) presented values for VIQ-PIQ differences based on confidence intervals placed around an estimate of the true score regressed against the mean rather than observed scores. [But see also Atkinson (1992b) for corrections].

An important distinction must be made among the reliability, the abnormality, and the validity of the VIQ-PIQ split (Berk, 1982b). Reliability of the split refers to a computation that denotes the probability that a split of a certain magnitude will occur as the result of error of measurement. Abnormality of the split refers to the observed percentage of subjects manifesting a split of a certain magnitude. Validity of the split refers to the clinical meaning of a split of a certain magnitude. Ryan (1984) calculated values for the abnormality of VIQ-PIQ differences in the normative sample. Associated with each difference value is a probability level that can be interpreted as the proportion of subjects who exhibit that magnitude of difference. Ryan reported that a difference of 28 points could be found in 156 individuals of the normative sample. The fact that a split of that magnitude could occur in normal subjects emphasizes the need for an empirical evaluation of the validity of the VIQ-PIQ split.

To assess the validity of the VIQ-PIQ split in neuropsychological diagnosis, Bornstein (1983b) examined the VIQ-PIQ split in patients with unilateral and bilateral cerebral dysfunction using a sample of 89 patients. Of these patients, 20 had left-hemisphere dysfunction, 24 had right-hemisphere dysfunction, and 45 had bilateral dysfunction as determined by neurodiagnostic techniques. It was found that the left-hemisphere patients had significantly lower VIQ than PIQ; however, the mean difference was only 4.9 points. Similarly, the right-hemisphere patients had significantly lower PIQ than VIQ, but the mean difference was 10.7 points. Most tellingly, the bilaterally dysfunctional patients had significantly lower PIQ than VIQ, and the mean difference was 7.8 points. It appears that cerebral dysfunction may have effects on the VIQ-PIQ split, but the absence of a control group in this study lessens our confidence in such a conclusion. However, the results of this study in conjunction with the examination of the VIQ-PIQ data in the WAIS-R normative sample emphasize the tenuous nature of performing a neuropsychological diagnosis using WAIS-R data only.

Some authors have suggested that the VIQ-PIQ split in unilateral lesions is differential by sex. McGlone (1977, 1978) presented data and Inglis and Lawson (1981) provided a literature review that seem to support this idea. However, a report by Bornstein (1984) concludes that sex does not make a difference in the validity of this split. Snow, Freedman, and Ford (1986) reviewed published studies that evaluated the possibility of differential laterality results in WAIS and Wechsler-Bellevue scores for males and females. They concluded that the effect of sex was nonrobust and might therefore be a reflection of a relationship between sex and other variables, such as education level. Sundet (1986) examined VIQ-PIQ differences in 83 subjects with unilateral brain impairment. When the magnitude of the split was used, only 70% of the subjects were correctly classified as left- or right-hemisphere-impaired. Female subjects were not found to be any more or less lateralized than male subjects. The argument is far from settled, and researchers will need to take sex into account when planning the design and executing the statistical analysis of a future study to examine this idea.

The WAIS-R may be valuable in providing collateral evidence of the existence of lateralized deficits once the existence of cerebral dysfunction has been determined. The differences in VIQ-PIQ splits for the unilaterally damaged subjects in the above study conform to the expectations derived from the widely held assumption that verbal abilities are largely mediated by the dominant hemisphere, whereas spatial tasks are largely mediated by the nondominant hemisphere. To investigate the possibility that a more empirically derived indicator of lateralized deficits may result in more useful neuropsychological information, Lawson, Inglis, and Stroud (1983) performed a principal-components factor analysis of the WAIS-R normative data. They derived a solution with two factors, the first of which they interpreted as general intelligence, and the second of which they interpreted as laterality. These authors then derived a weighted-sum method of combining subtest scale scores into a single laterality index. They then used a formula for computing the reliability of a composite score from the reliabilities of the constituent scale scores and reported an average split-half reliability of .78 for their laterality index. A test-retest coefficient was similarly derived from the data contained in the WAIS-R manual and an average value of .79 was reported. A very important point needs to be made here. The values for reliability reported by Lawson et al. (1983) were derived from a sample of normal subjects. These authors contended that their index is valid only for unilaterally damaged males, although no data are presented to substantiate that claim. The laterality index appears to be a good idea, but much more work, including an investigation of the reliability and validity of the index in the population for which it is intended, is needed before it can be recommended for clinical use.
