ESSENTIALS OF TEST SCORE INTERPRETATION - of Psychological Testing

meanings depending on the test from which it was derived, the areas the test covers, and how recent its norms are, as well as specific aspects of the situation in which the score was obtained and the characteristics of the test taker.

FRAMES OF REFERENCE FOR TEST-SCORE INTERPRETATION

Underlying all other issues regarding score interpretation, in one way or another, is the matter of the frames of reference used to interpret a given score. Depending on their purpose, tests rely on one or both of the following sources of information to derive frames of reference for their meaning:

1. Norms. Norm-referenced test interpretation uses standards based on the performance of specific groups of people to provide information for interpreting scores. This type of test interpretation is useful primarily when we need to compare individuals with one another or with a reference group in order to evaluate differences between them on whatever characteristic the test measures. The term norms refers to the test performance or typical behavior of one or more reference groups. Norms are usually presented in the form of tables with descriptive statistics, such as means, standard deviations, and frequency distributions, that summarize the performance of the group or groups in question. When norms are collected from the test performance of groups of people, these reference groups are labeled normative or standardization samples.

Gathering norms is a central aspect of the process of standardizing a norm-referenced test.

2. Performance criteria. When the relationship between the items or tasks of a test and standards of performance is demonstrable and well defined, test scores may be evaluated via criterion-referenced interpretation. This type of interpretation makes use of procedures, such as sampling from content domains or work-related behaviors, designed to assess whether and to what extent the desired levels of mastery or performance criteria have been met.

Rapid Reference 3.1

For more extensive information on technical aspects of many of the topics discussed in this chapter, see any one of the following sources:

• Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.

• Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: American Council on Education/Macmillan.

• Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Erlbaum.

NORM-REFERENCED TEST INTERPRETATION

Norms are, by far, the most widely used frame of reference for interpreting test scores. The performance of defined groups of people is used as a basis for score interpretation in both ability and personality testing. When norms are the frame of reference, the question they typically answer is “How does the performance of this test taker compare to that of others?” The score itself is used to place the test taker’s performance within a preexisting distribution of scores or data obtained from the performance of a suitable comparison group.

Developmental Norms

Ordinal Scales Based on Behavioral Sequences

Human development is characterized by sequential processes in a number of behavioral realms. A classic example is the sequence that normal motor development follows during infancy. In the first year of life, most babies progress from the fetal posture at birth, through sitting and standing, to finally walking alone.

Whenever a universal sequence of development involves an orderly progression from one behavioral stage to another, more advanced stage, the sequence itself can be converted into an ordinal scale and used normatively. In such cases, the frame of reference for test score interpretation is derived from observing and noting certain uniformities in the order and timing of behavioral attainments across many individuals. The pioneer in the development of this type of scale was Arnold Gesell, a psychologist and pediatrician who published the Gesell Developmental Schedules in 1940 based on a series of longitudinal studies conducted by him and his associates at Yale over a span of several decades (Ames, 1989).

The Provence Birth-to-Three Developmental Profile. A current example of an instrument that uses ordinal scaling is the Provence Birth-to-Three Developmental Profile (“Provence Profile”), which is part of the Infant-Toddler Developmental Assessment (IDA; Provence, Erikson, Vater, & Palmeri, 1995). The IDA is an integrated system designed to help in the early identification of children who are developmentally at risk and possibly in need of monitoring or intervention.

Through naturalistic observation and parental reports, the Provence Profile provides information about the timeliness with which a child attains developmental milestones in eight domains, in relation to the child’s chronological age. The developmental domains are Gross Motor Behavior, Fine Motor Behavior, Relationship to Inanimate Objects, Language/Communication, Self-Help, Relationship to Persons, Emotions and Feeling States (Affects), and Coping Behavior. For each of these domains, the profile groups items into age brackets ranging from 0 to 42 months. The age brackets are as small as 2 months at earlier ages and as wide as 18 months in some domains at later ages. Most span between 3 and 6 months.

The number of items in each age group differs as well, as does the number of items that need to be present or competently performed to meet the criterion for each age bracket. Table 3.1 lists four sample items from each of three developmental domains of the IDA’s Provence Profile. The scores on items at each age range and in each domain are added to arrive at a performance age that can then be evaluated in comparison to the child’s chronological age. Discrepancies between performance and chronological age levels, if any, may then be used to determine the possible presence and extent of developmental delays in the child.

Table 3.1 Sample Items From the Provence Profile of the Infant-Toddler Developmental Assessment

Domain                    Age Range (in months)   Item
Gross Motor Behavior      4 to 7                  Sits alone briefly
Gross Motor Behavior      7 to 10                 Pulls to stand
Gross Motor Behavior      13 to 18                Walks well alone
Gross Motor Behavior      30 to 36                Walks up and down stairs
Language/Communication    4 to 7                  Laughs aloud
Language/Communication    7 to 10                 Responds to “no”
Language/Communication    13 to 18                Shows shoe when asked
Language/Communication    30 to 36                Knows rhymes or songs
Self-Help                 4 to 7                  Retrieves lost pacifier or bottle
Self-Help                 7 to 10                 Pushes adult hand away
Self-Help                 13 to 18                Partially feeds self with spoon or fingers
Self-Help                 30 to 36                Puts shoes on

Source: Adapted from the Infant-Toddler Developmental Assessment (IDA) Administration Manual by Sally Provence, Joanna Erikson, Susan Vater, and Saro Palmeri and reproduced with permission of the publisher. Copyright © 1995 by The Riverside Publishing Company. All rights reserved.

Theory-Based Ordinal Scales

Ordinal scales may be based on factors other than chronological age. Several theories, such as Jean Piaget’s proposed stages of cognitive development from infancy to adolescence or Lawrence Kohlberg’s theory of moral development, posit an orderly and invariant sequence or progression derived at least partly from behavioral observations. Some of these theories have generated ordinal scales designed to evaluate the level that an individual has attained within the proposed sequence; these tools are used primarily for purposes of research rather than for individual assessment. Examples of this type of instrument include standardized scales based on Piaget’s delineation of the order in which cognitive competencies are acquired during infancy and childhood, such as the Ordinal Scales of Psychological Development, also known as the Infant Psychological Development Scales (Užgiris & Hunt, 1975).

Mental Age Scores

The notion of mental age scores was discussed in Chapter 2 in connection with the ratio IQs of the early Stanford-Binet intelligence scales. The mental age scores derived from those scales were computed on the basis of the child’s performance, which earned credits in terms of years and months, depending on the number of chronologically arranged tests that were passed. In light of the difficulties presented by this procedure, described in Chapter 2, this particular way of arriving at mental age scores has been abandoned. However, several current tests still provide norms that are presented as age equivalent scores and are based on the average raw score performance of children of different age groups in the standardization sample.

Age equivalent scores, also known as test ages, simply represent a way of equating the test taker’s performance on a test with the average performance of the normative age group with which it corresponds. For example, if a child’s raw score equals the mean raw score of 9-year-olds in the normative sample, her or his test age equivalent score is 9 years. In spite of this change in the procedures used to obtain age equivalent scores, inequalities in the rate of development at different ages remain a problem when this kind of age norm is used, because the differences in behavioral attainments that can be expected with each passing year diminish greatly from infancy and early childhood to adolescence and adulthood.

If this is not understood, or if the meaning of a test age is extended to realms other than the specific behavior sampled by the test, the use of such scores can be quite misleading. This happens, for example, when an adolescent who gets a test age score of 8 years is described as having “the mind of an 8-year-old.”
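The lookup behind an age equivalent score, and the shrinking year-to-year gains that make such scores hard to interpret, can be illustrated with a minimal Python sketch. The norms table, its values, and the function names below are hypothetical and are not taken from any published test; matching a raw score to the nearest age-group mean is one simple convention assumed here for illustration.

```python
# Hypothetical norms table: mean raw score earned by each age group
# (age in years) in a standardization sample. Values are illustrative only.
MEAN_RAW_BY_AGE = {6: 12.0, 7: 17.0, 8: 21.0, 9: 24.0, 10: 26.0, 11: 27.5}


def age_equivalent(raw_score, norms=MEAN_RAW_BY_AGE):
    """Return the age whose mean raw score is closest to raw_score."""
    return min(norms, key=lambda age: abs(norms[age] - raw_score))


def yearly_gains(norms=MEAN_RAW_BY_AGE):
    """Mean raw-score gain from each age group to the next one."""
    ages = sorted(norms)
    return {f"{a}->{b}": norms[b] - norms[a] for a, b in zip(ages, ages[1:])}


if __name__ == "__main__":
    # A raw score equal to the 9-year-old mean yields a test age of 9.
    print(age_equivalent(24.0))
    # In this made-up table the gains shrink with age, so one "year" of
    # test age stands for less and less raw-score difference.
    print(yearly_gains())
```

In this illustrative table a one-year difference in test age corresponds to 5 raw-score points between ages 6 and 7 but only 1.5 points between ages 10 and 11, which is the kind of inequality in units described above.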

Grade Equivalent Scores

The sequential progression and relative uniformity of school curricula, especially in the elementary grades, provide additional bases for interpreting scores in terms of developmental norms. Thus, performance on achievement tests within school settings is often described by grade levels. These grade equivalent scores are derived by locating the performance of test takers within the norms of the students at each grade level (and fractions of grade levels) in the standardization sample.

If we say, for instance, that a child has scored at the seventh grade in reading and the fifth grade in arithmetic, it means that her or his performance on the reading test matches the average performance of the seventh-graders in the standardization sample and that, on the arithmetic test, her or his performance equals that of fifth-graders.
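As a rough illustration of how a raw score might be located within grade norms, including fractions of grade levels, the sketch below interpolates linearly between adjacent grade means. Linear interpolation, the clamping at the ends of the table, and all of the numbers are assumptions made for this example; test publishers may derive fractional grade equivalents differently.

```python
# Hypothetical (grade level, mean raw score) pairs from a standardization
# sample. Values are illustrative only and assumed to rise with grade.
GRADE_MEANS = [(3.0, 20.0), (4.0, 28.0), (5.0, 34.0), (6.0, 38.0), (7.0, 41.0)]


def grade_equivalent(raw_score, grade_means=GRADE_MEANS):
    """Locate raw_score among the grade means by linear interpolation.

    Scores below or above the tabled range are clamped to the lowest or
    highest tabled grade rather than extrapolated.
    """
    pts = sorted(grade_means)
    if raw_score <= pts[0][1]:
        return pts[0][0]
    if raw_score >= pts[-1][1]:
        return pts[-1][0]
    for (g_lo, m_lo), (g_hi, m_hi) in zip(pts, pts[1:]):
        if m_lo <= raw_score <= m_hi:
            frac = (raw_score - m_lo) / (m_hi - m_lo)
            return round(g_lo + frac * (g_hi - g_lo), 1)


if __name__ == "__main__":
    # Halfway between the fourth- and fifth-grade means.
    print(grade_equivalent(31.0))
```

The result only says where a raw score falls relative to the average scores of each grade in the sample; it carries no information about which curriculum content the test taker has mastered.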

In spite of their appeal, grade equivalent scores also can be misleading for a number of reasons. To begin with, the content of curricula and quality of instruction vary across schools, school districts, states, and so forth; therefore, grade equivalent scores do not provide a uniform standard. In addition, the advance expected in the early elementary school grades, in terms of academic achievement, is much greater than it is in middle school or high school; thus, just as with mental age units, a difference of one year in retardation or acceleration is far more meaningful in the early grades than it is by the last years of high school.

Moreover, if a child who is in the fourth grade scores at the seventh grade in arithmetic, it does not mean the child has mastered seventh-grade arithmetic; rather, it means that the child’s score is significantly above the average for fourth-graders in arithmetic. Furthermore, grade equivalent scores are sometimes erroneously viewed as standards of performance that all children in a given grade must meet, whereas they simply represent average levels of performance that, due to the inevitable variability across individuals, some students will meet, others will not, and still others will exceed.


DON’T FORGET

• All developmental norms are relative, except as they reflect a behavioral sequence or progression that is universal in humans.

• Theory-based ordinal scales are more or less useful depending on whether the theories on which they are based are sound and applicable to a given segment of a population or to the population as a whole.

• Mental age norms or age equivalent score scales reflect nothing more than the average performance of certain groups of test takers of specific age levels, at a given time and place, on a specific test. They are subject to change over time, as well as across cultures and subcultures.

• Grade-based norms or grade equivalent score scales also reflect the average performance of certain groups of students in specific grades, at a given time and place. They too are subject to variation over time, as well as across curricula in different schools, school districts, and nations.

Within-Group Norms

Most standardized tests use some type of within-group norms. These norms essentially provide a way of evaluating a person’s performance in comparison to the performance of one or more appropriate reference groups. For proper interpretation of norm-referenced test scores it is necessary to understand the numerical procedures whereby raw scores are transformed into the large variety of derived scores that are used to express within-group norms. Nevertheless, it is good to keep in mind that all of the various types of scores reviewed in this section serve the simple purpose of placing a test taker’s performance within a normative distribution. Therefore, the single most important question with regard to this frame of reference concerns the exact make-up of the group or groups from which the norms are derived. The composition of the normative or standardization sample is of utmost importance in this kind of test score interpretation because the people in that sample set the standard against which all other test takers are measured.
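To make the idea of placing a performance within a normative distribution concrete, the following sketch computes two familiar derived scores, a percentile rank and a z score, from a small made-up normative sample. The data and function names are hypothetical; the use of the population standard deviation and the convention of counting half of the tied scores as falling below are choices made for this illustration, and specific tests may define their derived scores differently.

```python
from statistics import mean, pstdev

# Hypothetical raw scores from a (very small) normative sample.
NORM_SCORES = [12, 15, 15, 18, 20, 20, 20, 23, 25, 32]


def z_score(raw, norm_scores):
    """Distance of raw from the normative mean, in standard deviation units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)


def percentile_rank(raw, norm_scores):
    """Percentage of the normative sample scoring below raw.

    Half of the scores tied with raw are counted as falling below it.
    """
    below = sum(s < raw for s in norm_scores)
    equal = sum(s == raw for s in norm_scores)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


if __name__ == "__main__":
    for raw in (15, 20, 25):
        print(raw, round(z_score(raw, NORM_SCORES), 2),
              percentile_rank(raw, NORM_SCORES))
```

Both numbers depend entirely on who is in `NORM_SCORES`: the same raw score would receive a different percentile rank and z score if the normative sample were composed differently.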

The Normative Sample

In light of the important role played by the normative sample’s performance, the foremost requirement of such samples is that they should be representative of the kinds of individuals for whom the tests are intended. For example, if a test is to be used to assess the reading skills of elementary school students in Grades 3 to 5 from across the whole nation, the normative sample for the test should represent the national population of third-, fourth-, and fifth-graders in all pertinent respects. The demographic make-up of the nation’s population on variables like gender, ethnicity, language, socioeconomic status, urban or rural residency, geographic distribution, and public- or private-school enrollment must be reflected in the normative sample for such a test. In addition, the sample needs to be sufficiently large to ensure the stability of the values obtained from the performance of its members.
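One simple way to examine representativeness on a single demographic variable is to compare the sample's proportions with the corresponding population proportions. The sketch below does this under stated assumptions: the population figures, the category labels, and the 5-percentage-point tolerance are all hypothetical, and a real evaluation would cover several variables and their combinations.

```python
from collections import Counter

# Hypothetical population proportions for one demographic variable.
POPULATION = {"urban": 0.80, "rural": 0.20}


def proportion_gaps(sample_labels, population_props):
    """Sample proportion minus population proportion for each category."""
    counts = Counter(sample_labels)
    n = len(sample_labels)
    return {cat: counts[cat] / n - p for cat, p in population_props.items()}


def misrepresented(sample_labels, population_props, tolerance=0.05):
    """Categories whose sample share is off by more than the tolerance."""
    gaps = proportion_gaps(sample_labels, population_props)
    return sorted(cat for cat, gap in gaps.items() if abs(gap) > tolerance)


if __name__ == "__main__":
    # A made-up sample that overrepresents urban residents.
    sample = ["urban"] * 90 + ["rural"] * 10
    print(proportion_gaps(sample, POPULATION))
    print(misrepresented(sample, POPULATION))
```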

The sizes of normative samples vary tremendously depending on the type of test that is standardized and on the ease with which samples can be gathered. For example, group ability tests used in school settings may have normative samples numbering in the tens or even hundreds of thousands, whereas individual intelligence tests, administered to a single person at a time by a highly trained examiner, are normed on much smaller samples—typically consisting of 1,000 to 3,000 individuals—gathered from the general population. Tests that require specialized samples, such as members of a certain occupational group, may have even smaller normative samples. The recency of the normative information is also important if test takers are to be compared with contemporary standards, as is usually the case.
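The link between sample size and the stability of normative values can be illustrated with the standard error of the mean, which equals the standard deviation divided by the square root of the sample size. The sketch below is only an illustration: the standard deviation of 15 and the sample sizes are hypothetical, and the formula assumes simple random sampling, which real standardization samples often only approximate.

```python
import math


def standard_error_of_mean(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sd / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical score scale with a standard deviation of 15 points.
    for n in (100, 1_000, 3_000, 100_000):
        print(n, round(standard_error_of_mean(15.0, n), 3))
```

Because the standard error shrinks with the square root of the sample size, moving from a few thousand to a hundred thousand cases adds comparatively little precision to the estimated mean.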

Relevant factors to consider in the make-up of the normative sample vary depending on the purpose of the test as well as the population on which it will be used. In the case of a test designed to detect cognitive impairment in older adults, for instance, variables like health status, independent versus institutional living situation, and medication intake would be pertinent, in addition to the demographic variables of gender, age, ethnicity, and such. Rapid Reference 3.2 lists some of the most common questions that test users should ask concerning the normative sample when they are in the process of evaluating the suitability of a test for their purposes.

Reference groups can be defined on a continuum of breadth or specificity depending upon the kinds of comparisons that test users need to make to evaluate test scores. At one extreme, the reference group might be the general population of an entire nation or even a multinational population. At the other end, reference groups may be drawn from populations that are narrowly defined in terms of status or settings.

CAUTION

Although the three terms are often used interchangeably (here and elsewhere) and may actually refer to the same group, strictly speaking, the precise meanings of standardization sample, normative sample, and reference group are somewhat different:

• The standardization sample is the group of individuals on whom the test is originally standardized in terms of administration and scoring procedures, as well as in developing the test’s norms. Data for this group are usually presented in the manual that accompanies a test upon publication.

• The normative sample is often used as synonymous with the standardization sample, but can refer to any group from which norms are gathered. Additional norms collected on a test after it is published, for use with a distinct subgroup, may appear in the periodical literature or in technical manuals published at a later date. See, for example, the study of older Americans by Ivnik and his associates (1992) at the Mayo Clinic wherein data were collected to provide norms for people beyond the highest age group in the standardization sample of the Wechsler Adult Intelligence Scale–Revised (WAIS-R).

• Reference group, in contrast, is a term that is used more loosely to identify any group of people against which test scores are compared. It may be applied to the standardization group, to a subsequently developed normative sample, to a group tested for the purpose of developing local norms, or to any other designated group, such as the students in a single class or the participants in a research study.

Subgroup norms. When large samples are gathered to represent broadly defined populations, norms can be reported in the aggregate or can be separated into subgroup norms. Provided that they are of sufficient size and fairly representative of their categories, subgroups can be formed in terms of age, sex, occupation, ethnicity, educational level, or any other variable that may have a significant impact on test scores or yield comparisons of interest. Subgroup norms may also be collected after a test has been standardized and published to supplement and expand the applicability of the test. For instance, before the MMPI was revised to create the MMPI-2 and a separate form for adolescents (the MMPI-A), users of the original test, which had been normed exclusively on adults, developed special subgroup norms for adolescents at various age levels (see, e.g., Archer, 1987).
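Separating aggregate norms into subgroup norms amounts to grouping the normative records by some variable and summarizing each group, while leaving out groups too small to give stable values. The sketch below assumes hypothetical record fields (`age_group`, `score`) and an arbitrary minimum subgroup size of 30; neither comes from the text, and the appropriate minimum depends on the test and its purpose.

```python
from collections import defaultdict
from statistics import mean, stdev


def subgroup_norms(records, key, min_n=30):
    """Mean and standard deviation of 'score' for each subgroup of `key`.

    Subgroups with fewer than min_n cases (and never fewer than 2, since
    a standard deviation needs at least two scores) are omitted.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["score"])
    floor = max(min_n, 2)
    return {
        group: {"n": len(scores), "mean": mean(scores), "sd": stdev(scores)}
        for group, scores in groups.items()
        if len(scores) >= floor
    }


if __name__ == "__main__":
    # Tiny made-up data set; min_n is lowered only so the example runs.
    data = [
        {"age_group": "14", "score": 50},
        {"age_group": "14", "score": 54},
        {"age_group": "15", "score": 60},
    ]
    print(subgroup_norms(data, "age_group", min_n=2))
```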

Local norms. On the other hand, there are some situations in which test users may wish to evaluate scores on the basis of reference groups drawn from a specific geographic or institutional setting. In such cases, test users may choose to

Rapid Reference 3.2

Information Needed to Evaluate the Applicability of a Normative Sample

In order to evaluate the suitability of a norm-referenced test for a specific purpose, test users need to have as much information as possible regarding the normative sample, including answers to questions such as these:

• How large is the normative sample?

• When was the sample gathered?

• Where was the sample gathered?

• How were individuals identified and selected for the sample?

• Who tested the sample?

• How did the examiner or examiners qualify to do the testing?

• What was the composition of the normative sample, in terms of

  - age?

  - sex?

  - ethnicity, race, or linguistic background?

  - education?

  - socioeconomic status?

  - geographic distribution?

  - any other pertinent variables, such as physical and mental health status or membership in an atypical group, that may influence test performance?

Test users can evaluate the suitability of a norm-referenced test for their specific purposes only when answers to these questions are provided in the test manual or related documents.
