PERFORMANCE ASSESSMENT
Dilemma 7: Will performance assessments stand up to conventional criteria for measurement, such as reliability and generalizability? Or do we need new criteria?
Early efforts to hold performance assessments to traditional psychometric standards, such as reliability, generalizability and validity, have produced mixed results. In New Standards, for example, within-task agreement among scorers approaches acceptable standards. This suggests that teachers can learn how to score open-ended tasks with the help of elaborate rubrics, well-chosen anchor papers and commentaries that explain the logic with which rubrics are applied to the papers to obtain different scores. In the 1993 scoring conference for New Standards, about a third of the English language arts teachers and half of the mathematics teachers had never scored performance assessments before. To qualify as scorers, teachers had to match the benchmark sets on 80 per cent of the papers (8/10 on two consecutive sets). Even with this strict criterion, 90 per cent of the teachers qualified at the national scoring conference. Moreover, these teachers then returned to their states and trained their colleagues on the same tasks and materials; they were able to achieve nearly the same rate of qualification. However, they were not as consistent in matching one another's scores as they were in matching the scores in the benchmark sets. Using a direct-match criterion (as opposed to the ± 1 score point criterion used in most states), between-judge agreement ranged from 50 per cent to 85 per cent. Agreement was a little higher in mathematics (69 per cent) than in writing (64 per cent) or reading (62 per cent). When agreement was calculated on the cut line between passing and not passing, interjudge agreement ranged from 84 per cent to 96 per cent across tasks.
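The arithmetic behind these agreement figures is straightforward. The sketch below shows how direct-match, ± 1 score point and cut-line agreement can each be computed from a set of paired judge scores; the score pairs and the cut score are invented for illustration and are not drawn from the New Standards data.

```python
# Minimal sketch of three interjudge agreement indices: direct match, match
# within +/- 1 score point, and agreement on the pass/fail cut line.
# Score pairs and cut score are hypothetical, not New Standards data.

pairs = [(4, 4), (3, 2), (2, 2), (4, 3), (1, 1), (3, 3), (2, 4), (4, 4)]  # (judge 1, judge 2)
cut = 3  # hypothetical passing score on a 1-4 rubric

n = len(pairs)
exact = sum(a == b for a, b in pairs) / n                       # direct-match criterion
adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / n           # +/- 1 score point criterion
cut_line = sum((a >= cut) == (b >= cut) for a, b in pairs) / n  # agreement at the cut line

print(f"direct match:   {exact:.0%}")
print(f"within +/- 1:   {adjacent:.0%}")
print(f"cut-line match: {cut_line:.0%}")
```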
While the data support the generalizability of scores across raters, there is little evidence to support generalizability across tasks. The data gathered from the first scoring of New Standards tasks (Linn, DeStefano, Burton and Hanson, 1995) indicate considerable covariation between task components
and holistic/dimensional scores. However, when we compared the holistic scores across the mathematics tasks of New Standards, the resulting indices of generalizability were quite low, indicating that performance on any one task is not a good predictor of scores on other mathematics tasks. Shavelson and his colleagues have encountered the same lack of generalizability with science tasks (Shavelson, Baxter and Pine, 1992), as have other scholars (e.g., Linn, 1993) on highly respected enterprises such as the Advanced Placement tests sponsored by the College Board. The findings in the College Board analysis are noteworthy for the striking variability in generalizability found as a function of subject matter; for example, in order to achieve a generalizability coefficient of 0.90, estimates of testing time range from a low of 1.25 hours for Physics to over 13 hours for European History. These findings are consistent with the conceptual problems cited earlier, and they suggest that we need to measure students' performance on a large number of tasks before we can feel confident in having a stable estimate of their accomplishment in a complex area such as reading, writing, or subject matter knowledge. They also suggest that portfolios may provide a way out of the generalizability problem by ensuring that we include multiple entries to document performance on any standard of significance.
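Testing-time estimates of this kind are typically obtained by extrapolating single-task generalizability with the Spearman-Brown prophecy formula. The sketch below illustrates that logic only; the single-task coefficients and task lengths are hypothetical and are not figures from the College Board analysis.

```python
# Sketch of the extrapolation behind 'hours of testing needed to reach G = 0.90',
# via the Spearman-Brown prophecy formula. The coefficients and task lengths are
# hypothetical, chosen only to show how low single-task generalizability inflates
# the required testing time.

def tasks_needed(single_task_g: float, target_g: float = 0.90) -> float:
    """Number of comparable tasks needed for the aggregate score to reach target_g."""
    return target_g * (1 - single_task_g) / (single_task_g * (1 - target_g))

for subject, g_one_task, minutes_per_task in [("Subject A", 0.55, 25),
                                              ("Subject B", 0.15, 30)]:
    k = tasks_needed(g_one_task)
    hours = k * minutes_per_task / 60
    print(f"{subject}: about {k:.1f} tasks, roughly {hours:.1f} hours of testing")
```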
In our early work in New Standards, we were very aware of this tension, and we tried to evaluate the efficacy and independence of various approaches to scoring. We examined carefully the statistical relationships (indexed by first-order correlation coefficients) between analytic, holistic and dimensional scoring systems (Greer and Pearson, 1993). The data were generated by scoring a large sample of student papers in three ways: (1) holistically; (2) analytically (question by question), using the same rubric for each question but with the requirement that scorers assign an overall score after assigning the question-by-question scores; and (3) dimensionally, using dimensions not unlike those implied in the California rubric: thoroughness, interconnectedness, risk and challenge. The bottom line is pretty straightforward (Greer and Pearson, 1993): holistic scores, summed analytic scores and summed dimensional scores tend to correlate with one another at a magnitude in the 0.60 to 0.70 range. There are also consistently positive part-part and part-whole correlations in both the dimensional and analytic scoring systems.
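As a rough illustration of the analysis just described, the sketch below computes the pairwise first-order correlations among holistic, summed analytic and summed dimensional scores. The score vectors are invented stand-ins for one score of each type per student paper; the Greer and Pearson (1993) data are not reproduced here.

```python
# Invented scores standing in for one holistic, one summed analytic and one summed
# dimensional score per student paper (hypothetical data, for illustration only).
from statistics import correlation  # Pearson correlation, Python 3.10+

holistic    = [4, 3, 2, 4, 1, 3, 2, 4, 3, 2]
analytic    = [11, 9, 6, 12, 4, 8, 7, 11, 10, 5]   # summed question-by-question scores
dimensional = [10, 8, 7, 11, 3, 9, 6, 12, 9, 6]    # summed across the four dimensions

print("holistic vs summed analytic:   ", round(correlation(holistic, analytic), 2))
print("holistic vs summed dimensional:", round(correlation(holistic, dimensional), 2))
print("analytic vs dimensional:       ", round(correlation(analytic, dimensional), 2))
```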
As indicated earlier, our New Standards data are not very compelling on the inter-task generalizability front, although a great deal of research remains to be done before we can legitimately reject even our current crop of tasks and portfolio entries. Even so, we suspect that we will always be hard pressed to argue, as some proponents wish to, that when we include a performance task or a portfolio entry in an assessment system, we are more or less randomly sampling from a large domain of tasks that share common attributes, dimensions or competencies and, more importantly, somehow represent the domain of competence about which we think we are drawing inferences.
The most provocative criticisms of our current paradigms and criteria for evaluating assessments have been provided by Moss (1994, 1996), who has challenged the very notion of reliability, at least in the way in which we have used it for the better part of this century. She points out that many assessments, particularly those outside school settings, involve a high degree of unreliability, or at least disagreement among those charged with making judgements: scoring performances in athletic or musical contests, deciding which of several candidates deserves to be hired for a job opening, awarding a contract in the face of several competing bids, or reviewing manuscripts for potential publication. She notes further that none of us label these assessments as invalid simply because they involve disagreements among judges. Yet this is exactly what we have done in the case of educational assessments; witness the allegedly 'scandalous' interjudge reliabilities reported for Vermont (Koretz, Klein, McCaffrey and Stecher, 1992) and Kentucky (Kentucky Department of Education, 1994) in their statewide portfolio assessments.
Moss (1994, 1996) argues for a more 'hermeneutic' approach to studying the validity of assessments. In accepting the hermeneutic ideal and its emphasis on interpreting 'the meaning reflected in any human product' (p. 7), we would not only be admitting that decisions involve interpretation and judgement; we would also be doing everything possible to understand and account for the roles played by interpretation and judgement in the assessment process.
Instead of scoring performances as ‘objectively’ and independently as possible, we would seek ‘holistic, integrative interpretations of collected performances’.
We would privilege deep knowledge of the students on the part of judges rather than regard it as a source of prejudice. We would seek as much and as diverse and as individualized an array of artifacts as would be needed to portray each student’s performance fully and fairly. And when differences in process or judgement arose, instead of fidgeting about sources of unreliability, we would try to account for them in our ‘interpretation’ of performance; we might even opt to document the differences in our account of either individual or group performance. From a hermeneutic perspective, differences can be both interesting and informative, not just ‘noise’ or error, as they are assumed to be in a psychometric account of interpretation.
In many of our social, professional and, indeed, legal activities, we find other mechanisms for dealing with differences in activity, judgement and interpretation; for example, the human practices of consensus, moderation, adjudication and appeals are all discursive mechanisms for dealing with difference. All represent attempts to understand the bases of our disagreements.
Additionally, they entail, to greater or lesser degrees, attempts to get inside of or underneath our surface disagreements to see if there is common ground (consensus), to see things from the point of view of others (moderation), to submit our claims to independent evaluation (adjudication), and to ensure a fair hearing for all who have a stake in the issue at hand (appeals). These are all mechanisms for promoting trustworthiness in human judgement, and we use them daily in the most significant of everyday human activities. Why, then, do we seem to want to exclude them from the assessment arena?
Research possibilities
We are not sure what to recommend for research initiatives on this front. After all, the measurement community has conducted reliability, generalizability and validity studies for decades, and we see little reason to believe that this situation will change. What may change, however, is the set of criteria used to evaluate the efficacy of assessments, particularly