PERFORMANCE ASSESSMENT
Dilemma 5: Can performance assessments provide teachers with instructionally useful information?
Historically, teachers have criticized standardized test reporting on the grounds that it provides little instructionally useful information. As we mentioned previously, standardized test reporting is typically characterized by a single norm-referenced score, usually a percentile or a grade-norm score.
Whether scores from performance assessments will improve the type of information teachers gain for diagnostic and instructional purposes is uncertain. Performance assessments frequently report a single, holistic score. This single score differs from the type of score reported for standardized tests in that it represents an entire assessment, which may have taken several days (in the case of a complete performance task) or several months (in the case of a portfolio in which artifacts are gathered over time) to complete. As we pointed out in another paper (García and Pearson, 1991), holistic scores avoid the decomposition fallacy—‘the mistaken idea that by breaking an integrated performance into component processes, each can be evaluated and remediated independently’ (p. 380). What we don’t know is whether a holistic score will provide teachers with the type of information they need to improve the performance of students who do not meet high standards.
The answers to these questions may depend upon the assumptions made about the role a rubric is supposed to play in the decision-making process.
Rubrics are the generic descriptions of performance represented by the levels within a scoring system. For example, consider the differences between score points 6 and 4 in the now-defunct California Learning Assessment System (CLAS, 1994, front matter), as illustrated in Figure 2.1.
Clearly, both level 6 and level 4 readers are pretty good at what they do, and level 4 readers demonstrate many of the same behaviours and characteristics as level 6 readers, albeit with less confidence, consistency, clarity and ardour. However, when we assign a score to an individual student’s performance, we explicitly claim that, of all the rubric descriptions available, this description provides the best fit to the data in the student’s response; we also implicitly ascribe all, or most, of the attributes of that level to the individual who generated the performance. Therefore, a performance assessment system will be useful in instructional decision making if, and only if, its rubrics provide teachers with guidance for instruction.
Even though the description of performance from a rubric may be richer than that provided by a standard score or percentile rank, teachers need (or at least say they want) much more detailed information in order to plan instruction for individual students. This desire for specificity may explain the popularity of dimensional scoring systems, which provide separate scores for a number of important dimensions or features of performance.
A good example of dimensional scoring is a writing assessment system in which teachers are given information about students’ voice, audience awareness, style, organization or content coherence, and mechanics. However, when dimensional scoring is used, there is a natural, if not compelling, tendency for educators to look for the particular weakness—that one valley in a diagnostic profile of peaks and valleys—that will guide them in providing exactly the right instruction for a particular student. This type of approach could have two adverse repercussions: First, teachers might overemphasize individual dimensions by providing isolated or decontextualized instruction on them. Second, they might ignore or fail to capitalize on the ‘peaks’, or strong features, of performance.
Figure 2.1 Excerpts from the CLAS reading rubric
Flexible use of rubrics might be the answer. Holistic rubrics require teachers to apply a ‘best fit’ approach to scoring. It is unlikely that any response will possess all of the characteristics of, say, a level 4 response exactly as described in the rubric. Individual responses are much more likely to mix elements of ‘six’, ‘five’ and ‘four’. When teachers realize this inevitable blurring among levels and dimensions, one of the important lessons they learn is that there are many routes to a ‘four’, ‘five’ or ‘six’, depending upon the particular mix of features in a given response. The broader lesson may be that the lack of statistical independence among dimensions is probably mirrored by their lack of instructional independence—instruction designed to improve performance on one dimension is likely to improve performance on others.
Research possibilities
We know very little about either the perceived or the real utility of different types of evaluative information. We need studies of the ways in which all sorts of assessment information, including norm-referenced, criterion-referenced and standards-referenced (a term we shall use to refer to these newer rubric-driven performance assessments), is used by teachers and schools to plan programmes, modify curriculum, or create activities for schools, classes and individuals. All of this debate could turn out to be moot if we learn, for example, that curriculum planning, for either individuals or groups, is based more on tradition, textbook adoption, or some other authoritative basis than on information provided by any sort of assessment.
More specifically, we need to understand the ways in which rubrics are actually used to guide instruction. We need to know whether there is any warrant for the claim that teachers apply all, or even most, of the features identified in the rubric. And, if not, what interpretation is being applied to various score points—what exactly do these scores mean in the minds of teachers, students, parents and others who use them?
With the advent of so many forms of performance and portfolio assessment, the time is certainly ripe for careful case studies—a combination of observations and interviews with key constituents—to determine the impact that these assessments have on the lives of teachers and students and to contrast that impact with that of standardized tests. We need to know how everyone involved in these assessments uses the resulting data to construct portraits of individual and collective performance. Put differently, we need to determine the instructional and consequential validity of these assessments. It will be essential, in conducting these studies, to examine the effects on students in different tracks and programmes, especially those in compensatory programmes, in order to evaluate whether similar data profiles bear similar consequences across programmes. In other words, if two students have similar profiles but live and work in different instructional contexts, one in regular education and the other in compensatory education, how do their instructional programmes and opportunities compare?
Dilemma 6: Can we provide teachers with the knowledge and