meanings depending on the test from which it was derived, the areas the test covers, and how recent its norms are, as well as specific aspects of the situation in which the score was obtained and the characteristics of the test taker.
FRAMES OF REFERENCE FOR TEST-SCORE INTERPRETATION
Underlying all other issues regarding score interpretation, in one way or another, is the matter of the frames of reference used to interpret a given score. Depending on their purpose, tests rely on one or both of the following sources of information to derive frames of reference for their meaning:
1. Norms. Norm-referenced test interpretation uses standards based on the performance of specific groups of people to provide information for interpreting scores. This type of test interpretation is useful primarily when we need to compare individuals with one another or with a reference group in order to evaluate differences between them on whatever characteristic the test measures. The term norms refers to the test performance or typical behavior of one or more reference groups. Norms are usually presented in the form of tables with descriptive statistics, such as means, standard deviations, and frequency distributions, that summarize the performance of the group or groups in question. When norms are collected from the test performance of groups of people, these reference groups are labeled normative or standardization samples. Gathering norms is a central aspect of the process of standardizing a norm-referenced test.
2. Performance criteria. When the relationship between the items or tasks of a test and standards of performance is demonstrable and well defined, test scores may be evaluated via criterion-referenced interpretation. This type of interpretation makes use of procedures, such as sampling from content domains or work-related behaviors, designed to assess whether and to what extent the desired levels of mastery or performance criteria have been met.
78 ESSENTIALS OF PSYCHOLOGICAL TESTING
Rapid Reference 3.1
For more extensive information on technical aspects of many of the topics discussed in this chapter, see any one of the following sources:
• Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.
• Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: American Council on Education/Macmillan.
• Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Erlbaum.
NORM-REFERENCED TEST INTERPRETATION
Norms are, by far, the most widely used frame of reference for interpreting test scores. The performance of defined groups of people is used as a basis for score interpretation in both ability and personality testing. When norms are the frame of reference, the question they typically answer is “How does the performance of this test taker compare to that of others?” The score itself is used to place the test taker’s performance within a preexisting distribution of scores or data obtained from the performance of a suitable comparison group.
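The idea of placing a score within a preexisting distribution can be illustrated with a minimal Python sketch. The scores below and the function name percent_below are hypothetical and serve only to show the comparison; they do not come from any actual test.

```python
from bisect import bisect_left

def percent_below(norm_scores, raw_score):
    """Percentage of the comparison group scoring strictly below raw_score."""
    ordered = sorted(norm_scores)
    return 100.0 * bisect_left(ordered, raw_score) / len(ordered)

# Hypothetical raw scores from a small comparison group (illustration only).
comparison_group = [12, 15, 15, 18, 20, 21, 23, 25, 27, 30]

print(percent_below(comparison_group, 23))  # 60.0
```

In this made-up group, a raw score of 23 exceeds the scores of 6 of the 10 members, which is the kind of answer a normative frame of reference provides.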
Developmental Norms
Ordinal Scales Based on Behavioral Sequences
Human development is characterized by sequential processes in a number of behavioral realms. A classic example is the sequence that normal motor development follows during infancy. In the first year of life, most babies progress from the fetal posture at birth, through sitting and standing, to finally walking alone.
Whenever a universal sequence of development involves an orderly progression from one behavioral stage to another, more advanced, stage, the sequence itself can be converted into an ordinal scale and used normatively. In such cases, the frame of reference for test score interpretation is derived from observing and noting certain uniformities in the order and timing of behavioral attainments across many individuals. The pioneer in the development of this type of scale was Arnold Gesell, a psychologist and pediatrician who published the Gesell Developmental Schedules in 1940 based on a series of longitudinal studies conducted by him and his associates at Yale over a span of several decades (Ames, 1989).
The Provence Birth-to-Three Developmental Profile. A current example of an instrument that uses ordinal scaling is the Provence Birth-to-Three Developmental Profile (“Provence Profile”), which is part of the Infant-Toddler Developmental Assessment (IDA; Provence, Erikson, Vater, & Palmeri, 1995). The IDA is an integrated system designed to help in the early identification of children who are developmentally at risk and possibly in need of monitoring or intervention.
Through naturalistic observation and parental reports, the Provence Profile provides information about the timeliness with which a child attains developmental milestones in eight domains, in relation to the child’s chronological age. The developmental domains are Gross Motor Behavior, Fine Motor Behavior, Relationship to Inanimate Objects, Language/Communication, Self-Help, Relationship to Persons, Emotions and Feeling States (Affects), and Coping Behavior. For each of these domains, the profile groups items into age brackets ranging from 0 to 42 months. The age brackets are as small as 2 months at earlier ages and as wide as 18 months in some domains at later ages. Most span between 3 and 6 months.
The number of items in each age group differs as well, as does the number of items that need to be present or competently performed to meet the criterion for each age bracket. Table 3.1 lists four sample items from each of three developmental domains of the IDA’s Provence Profile. The scores on items at each age range and in each domain are added to arrive at a performance age that can then be evaluated in comparison to the child’s chronological age. Discrepancies between performance and chronological age levels, if any, may then be used to determine the possible presence and extent of developmental delays in the child.
Table 3.1 Sample Items From the Provence Profile of the Infant-Toddler Developmental Assessment

Domain                    Age Range (in months)    Item
Gross Motor Behavior      4 to 7                   Sits alone briefly
Gross Motor Behavior      7 to 10                  Pulls to stand
Gross Motor Behavior      13 to 18                 Walks well alone
Gross Motor Behavior      30 to 36                 Walks up and down stairs
Language/Communication    4 to 7                   Laughs aloud
Language/Communication    7 to 10                  Responds to “no”
Language/Communication    13 to 18                 Shows shoe when asked
Language/Communication    30 to 36                 Knows rhymes or songs
Self-Help                 4 to 7                   Retrieves lost pacifier or bottle
Self-Help                 7 to 10                  Pushes adult hand away
Self-Help                 13 to 18                 Partially feeds self with spoon or fingers
Self-Help                 30 to 36                 Puts shoes on

Source: Adapted from the Infant-Toddler Developmental Assessment (IDA) Administration Manual by Sally Provence, Joanna Erikson, Susan Vater, and Saro Palmeri and reproduced with permission of the publisher. Copyright © 1995 by The Riverside Publishing Company. All rights reserved.

Theory-Based Ordinal Scales
Ordinal scales may be based on factors other than chronological age. Several theories, such as Jean Piaget’s proposed stages of cognitive development from infancy to adolescence or Lawrence Kohlberg’s theory of moral development, posit an orderly and invariant sequence or progression derived at least partly from behavioral observations. Some of these theories have generated ordinal scales designed to evaluate the level that an individual has attained within the proposed sequence; these tools are used primarily for purposes of research rather than for individual assessment. Examples of this type of instrument include standardized scales based on Piaget’s delineation of the order in which cognitive competencies are acquired during infancy and childhood, such as the Ordinal Scales of Psychological Development, also known as the Infant Psychological Development Scales (Užgiris & Hunt, 1975).
Mental Age Scores
The notion of mental age scores was discussed in Chapter 2 in connection with the ratio IQs of the early Stanford-Binet intelligence scales. The mental age scores derived from those scales were computed on the basis of the child’s performance, which earned credits in terms of years and months, depending on the number of chronologically arranged tests that were passed. In light of the difficulties presented by this procedure, described in Chapter 2, this particular way of arriving at mental age scores has been abandoned. However, several current tests still provide norms that are presented as age equivalent scores and are based on the average raw score performance of children of different age groups in the standardization sample.
Age equivalent scores, also known as test ages, simply represent a way of equating the test taker’s performance on a test with the average performance of the normative age group with which it corresponds. For example, if a child’s raw score equals the mean raw score of 9-year-olds in the normative sample, her or his test age equivalent score is 9 years. In spite of this change in the procedures used to obtain age equivalent scores, inequalities in the rate of development at different ages remain a problem when this kind of age norm is used, because the differences in behavioral attainments that can be expected with each passing year diminish greatly from infancy and early childhood to adolescence and adulthood.
If this is not understood, or if the meaning of a test age is extended to realms other than the specific behavior sampled by the test (as it is, for example, when an adolescent who gets a test age score of 8 years is described as having “the mind of an 8-year-old”), the use of such scores can be quite misleading.
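A minimal Python sketch may help make the age equivalent lookup and its unequal units concrete. The table of mean raw scores and the function name age_equivalent are hypothetical; published tests use their own norm tables and conversion rules.

```python
# Hypothetical mean raw scores by age group in a standardization sample.
# The yearly gains shrink with age, mirroring the unequal units described above.
mean_raw_by_age = {6: 20.0, 7: 28.0, 8: 34.0, 9: 38.0, 10: 41.0, 11: 43.0, 12: 44.0}

def age_equivalent(raw_score, norms):
    """Return the age whose group mean is closest to raw_score."""
    return min(norms, key=lambda age: abs(norms[age] - raw_score))

print(age_equivalent(38, mean_raw_by_age))  # 9

# Raw-score gain from each age group to the next.
ages = sorted(mean_raw_by_age)
yearly_gains = [mean_raw_by_age[b] - mean_raw_by_age[a] for a, b in zip(ages, ages[1:])]
print(yearly_gains)  # [8.0, 6.0, 4.0, 3.0, 2.0, 1.0]
```

In these invented norms, one year of test age corresponds to 8 raw-score points between ages 6 and 7 but to only 1 point between ages 11 and 12, so a one-year discrepancy does not mean the same thing at every age.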
Grade Equivalent Scores
The sequential progression and relative uniformity of school curricula, especially in the elementary grades, provide additional bases for interpreting scores in terms of developmental norms. Thus, performance on achievement tests within school
settings is often described by grade levels. These grade equivalent scores are derived by locating the performance of test takers within the norms of the students at each grade level (and fractions of grade levels) in the standardization sample.
If we say, for instance, that a child has scored at the seventh grade in reading and the fifth grade in arithmetic, it means that her or his performance on the reading test matches the average performance of the seventh-graders in the standardization sample and that, on the arithmetic test, her or his performance equals that of fifth-graders.
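One way fractions of grade levels might be obtained is by interpolating between the mean scores of adjacent grades. The Python sketch below assumes linear interpolation and clamps scores outside the table to its ends; the norm values, the function name grade_equivalent, and the interpolation rule itself are illustrative assumptions, not the procedure of any particular test.

```python
# Hypothetical (grade placement, mean raw score) pairs from a standardization sample.
grade_means = [(3.0, 18.0), (4.0, 26.0), (5.0, 32.0), (6.0, 36.0), (7.0, 39.0)]

def grade_equivalent(raw_score, norms):
    """Linearly interpolate a grade equivalent between adjacent grade means."""
    # Scores outside the table are clamped to its lowest or highest grade (an assumption).
    if raw_score <= norms[0][1]:
        return norms[0][0]
    if raw_score >= norms[-1][1]:
        return norms[-1][0]
    for (g_lo, m_lo), (g_hi, m_hi) in zip(norms, norms[1:]):
        if m_lo <= raw_score <= m_hi:
            return g_lo + (g_hi - g_lo) * (raw_score - m_lo) / (m_hi - m_lo)

print(grade_equivalent(29, grade_means))  # 4.5
```

Here a raw score of 29 falls halfway between the hypothetical fourth-grade and fifth-grade means, so it is reported as a grade equivalent of 4.5.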
In spite of their appeal, grade equivalent scores also can be misleading for a number of reasons. To begin with, the content of curricula and quality of instruction vary across schools, school districts, states, and so forth; therefore, grade equivalent scores do not provide a uniform standard. In addition, the advance expected in the early elementary school grades, in terms of academic achievement, is much greater than it is in middle school or high school; thus, just as with mental age units, a difference of one year in retardation or acceleration is far more meaningful in the early grades than it is by the last years of high school.
Moreover, if a child who is in the fourth grade scores at the seventh grade in arithmetic, it does not mean the child has mastered seventh-grade arithmetic; rather, it means that the child’s score is significantly above the average for fourth-graders in arithmetic. Furthermore, grade equivalent scores are sometimes erroneously viewed as standards of performance that all children in a given grade must meet, whereas they simply represent average levels of performance that, due to the inevitable variability across individuals, some students will meet, others will not, and still others will exceed.
DON’T FORGET
• All developmental norms are relative, except as they reflect a behavioral sequence or progression that is universal in humans.
• Theory-based ordinal scales are more or less useful depending on whether the theories on which they are based are sound and applicable to a given segment of a population or to the population as a whole.
• Mental age norms or age equivalent score scales reflect nothing more than the average performance of certain groups of test takers of specific age levels, at a given time and place, on a specific test. They are subject to change over time, as well as across cultures and subcultures.
• Grade-based norms or grade equivalent score scales also reflect the average performance of certain groups of students in specific grades, at a given time and place. They too are subject to variation over time, as well as across curricula in different schools, school districts, and nations.
Within-Group Norms
Most standardized tests use some type of within-group norms. These norms essentially provide a way of evaluating a person’s performance in comparison to the performance of one or more appropriate reference groups. For proper interpretation of norm-referenced test scores it is necessary to understand the numerical procedures whereby raw scores are transformed into the large variety of derived scores that are used to express within-group norms. Nevertheless, it is good to keep in mind that all of the various types of scores reviewed in this section serve the simple purpose of placing a test taker’s performance within a normative distribution. Therefore, the single most important question with regard to this frame of reference concerns the exact make-up of the group or groups from which the norms are derived. The composition of the normative or standardization sample is of utmost importance in this kind of test score interpretation because the people in that sample set the standard against which all other test takers are measured.
The Normative Sample
In light of the important role played by the normative sample’s performance, the foremost requirement of such samples is that they should be representative of the kinds of individuals for whom the tests are intended. For example, if a test is to be used to assess the reading skills of elementary school students in Grades 3 to 5 from across the whole nation, the normative sample for the test should represent the national population of third-, fourth-, and fifth-graders in all pertinent respects. The demographic make-up of the nation’s population on variables like gender, ethnicity, language, socioeconomic status, urban or rural residency, geographic distribution, and public- or private-school enrollment must be reflected in the normative sample for such a test. In addition, the sample needs to be sufficiently large to ensure the stability of the values obtained from its performance.
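A simple way to examine representativeness on a single variable is to compare the sample’s proportions with those of the target population. The Python sketch below does this for urban or rural residency; the counts, the population proportions, and the function name proportion_gaps are hypothetical.

```python
def proportion_gaps(sample_counts, population_props):
    """Sample proportion minus population proportion for each category."""
    total = sum(sample_counts.values())
    return {k: sample_counts[k] / total - population_props[k] for k in population_props}

# Hypothetical figures for one variable (urban or rural residency).
sample_counts = {"urban": 900, "rural": 100}
population_props = {"urban": 0.80, "rural": 0.20}

gaps = proportion_gaps(sample_counts, population_props)
print({k: round(v, 2) for k, v in gaps.items()})  # {'urban': 0.1, 'rural': -0.1}
```

In this invented case, urban residents are overrepresented by 10 percentage points. The same comparison would need to be repeated for every variable pertinent to the test’s intended use.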
The sizes of normative samples vary tremendously depending on the type of test that is standardized and on the ease with which samples can be gathered. For example, group ability tests used in school settings may have normative samples numbering in the tens or even hundreds of thousands, whereas individual intelligence tests, administered to a single person at a time by a highly trained examiner, are normed on much smaller samples, typically consisting of 1,000 to 3,000 individuals, gathered from the general population. Tests that require specialized samples, such as members of a certain occupational group, may have even smaller normative samples. The recency of the normative information is also important if test takers are to be compared with contemporary standards, as is usually the case.
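The link between sample size and the stability of normative values can be illustrated with the standard error of the mean, which equals the standard deviation divided by the square root of the sample size. In the Python sketch below, the standard deviation of 15 is an assumed value chosen only for illustration, and the sample sizes echo the ranges mentioned above.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of a sample mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

# Assumed score standard deviation of 15 (illustration only).
for n in (100, 1000, 3000, 100000):
    print(n, round(standard_error_of_mean(15.0, n), 2))
```

Under this assumption, the standard error falls from 1.5 points with 100 cases to well under half a point with 1,000 or more, which is one reason why larger samples yield more stable norms. Size alone does not make a sample representative, however.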
Relevant factors to consider in the make-up of the normative sample vary depending on the purpose of the test as well as the population on which it will be used. In the case of a test designed to detect cognitive impairment in older adults, for instance, variables like health status, independent versus institutional living situation, and medication intake would be pertinent, in addition to the demographic variables of gender, age, ethnicity, and such. Rapid Reference 3.2 lists some of the most common questions that test users should ask concerning the normative sample when they are in the process of evaluating the suitability of a test for their purposes.
Reference groups can be defined on a continuum of breadth or specificity depending upon the kinds of comparisons that test users need to make to evaluate test scores. At one extreme, the reference group might be the general population of an entire nation or even a multinational population. At the other end, reference groups may be drawn from populations that are narrowly defined in terms of status or settings.
CAUTION
Although the three terms are often used interchangeably (here and elsewhere) and may actually refer to the same group, strictly speaking, the precise meanings of standardization sample, normative sample, and reference group are somewhat different:
• The standardization sample is the group of individuals on whom the test is originally standardized in terms of administration and scoring procedures, as well as in developing the test’s norms. Data for this group are usually presented in the manual that accompanies a test upon publication.
• The normative sample is often used as synonymous with the standardization sample, but can refer to any group from which norms are gathered. Additional norms collected on a test after it is published, for use with a distinct subgroup, may appear in the periodical literature or in technical manuals published at a later date. See, for example, the study of older Americans by Ivnik and his associates (1992) at the Mayo Clinic wherein data were collected to provide norms for people beyond the highest age group in the standardization sample of the Wechsler Adult Intelligence Scale–Revised (WAIS-R).
• Reference group, in contrast, is a term that is used more loosely to identify any group of people against which test scores are compared. It may be applied to the standardization group, to a subsequently developed normative sample, to a group tested for the purpose of developing local norms, or to any other designated group, such as the students in a single class or the participants in a research study.

Subgroup norms. When large samples are gathered to represent broadly defined populations, norms can be reported in the aggregate or can be separated into subgroup norms. Provided that they are of sufficient size and fairly representative of their categories, subgroups can be formed in terms of age, sex, occupation, ethnicity, educational level, or any other variable that may have a significant impact on test scores or yield comparisons of interest. Subgroup norms may also be collected after a test has been standardized and published to supplement and expand the applicability of the test. For instance, before the MMPI was revised to create the MMPI-2 and a separate form for adolescents (the MMPI-A), users of the original test, which had been normed exclusively on adults, developed special subgroup norms for adolescents at various age levels (see, e.g., Archer, 1987).
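The separation of aggregate data into subgroup norms can be sketched in Python as follows. The records, the age bands, the function name subgroup_norms, and the min_size threshold are all hypothetical; the threshold of 2 is far too small for real norms and is used only to keep the example short.

```python
from collections import defaultdict
from statistics import mean, stdev

def subgroup_norms(records, min_size=2):
    """Mean and SD of raw scores per subgroup, skipping subgroups that are too small."""
    groups = defaultdict(list)
    for subgroup, score in records:
        groups[subgroup].append(score)
    return {
        name: {"n": len(scores), "mean": mean(scores), "sd": stdev(scores)}
        for name, scores in groups.items()
        if len(scores) >= min_size
    }

# Hypothetical (age band, raw score) records, for illustration only.
records = [("14-15", 40), ("14-15", 44), ("14-15", 48),
           ("16-17", 50), ("16-17", 54), ("18+", 60)]

norms = subgroup_norms(records, min_size=2)
print(sorted(norms))  # ['14-15', '16-17']
```

The "18+" band is dropped because it contains a single case, which reflects the requirement that subgroups be of sufficient size before separate norms are reported.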
Local norms. On the other hand, there are some situations in which test users may wish to evaluate scores on the basis of reference groups drawn from a specific geographic or institutional setting. In such cases, test users may choose to
Information Needed to Evaluate the Applicability of a Normative Sample
In order to evaluate the suitability of a norm-referenced test for a specific purpose, test users need to have as much information as possible regarding the normative sample, including answers to questions such as these:
• How large is the normative sample?
• When was the sample gathered?
• Where was the sample gathered?
• How were individuals identified and selected for the sample?
• Who tested the sample?
• How did the examiner or examiners qualify to do the testing?
• What was the composition of the normative sample, in terms of
—age?
—sex?
—ethnicity, race, or linguistic background?
—education?
—socioeconomic status?
—geographic distribution?
—any other pertinent variables, such as physical and mental health status or membership in an atypical group, that may influence test performance?
Test users can evaluate the suitability of a norm-referenced test for their specific purposes only when answers to these questions are provided in the test manual or related documents.