PERFORMANCE ASSESSMENT
Dilemma 1: How do we examine the relationship between an assessment and its underlying cognitive domain?
The case of reading assessment
Years ago (see, for example, Thorndike, 1917), when we first realized that understanding what we read was a legitimate index of reading accomplishment, we started to measure reading comprehension with indirect indices, such as open-ended and multiple-choice questions. We settled on indirect measures largely because we knew we could not observe ‘the real thing’—comprehension as it takes place online during the process of reading. In a very real sense, the history of reading assessment has been a history of developing the best possible artifacts or products from which to draw inferences about what must have been going on during the act of comprehension. We never really see either the clicks or the clunks of comprehension directly; we only infer them from distant indices.
The important question is whether, by moving to performance assessments, we are moving closer to the online process of comprehension or even further away from it. With performance assessments offering multiple opportunities for the inclusion of multiple texts, multimedia formats, collaborative responses and problem solving, and a heavy burden on writing, it is hard to argue that we are very close to observing and documenting the process of comprehension itself. Here is the dilemma: on the one hand, performance tasks reflect authentic literacy activities and goals—the kind of integrated, challenging activities students ought to be engaged in if they are to demonstrate their ability to apply reading and writing to everyday problems.
On the other hand, we can question whether judgements about performance on these tasks really measure reading. Or are these judgements simply an index of the uses to which reading (or, perhaps more accurately, the residue or outcomes of reading) can be put when it is complemented by a host of other media and activities?
We know much too little about the impact of task characteristics and response format on the quality and validity of the judgements we make about individual students to answer these questions. Intuition tells us, for example, that for many students the requirement to write responses creates a stumbling block that yields a gross underestimate of their comprehension. Even when task writers try to escape the boundaries of conventional writing by encouraging students to use semantic webs or visual displays, they do not fully achieve their goals. Even in these formats, the student for whom writing is difficult is still at a decided disadvantage. As we have pointed out (García and Pearson, 1994), the matter of response format is all the more problematic in the case of second-language learners, where not only writing but also language dominance comes into play. For
example, when Spanish bilingual students are permitted to respond to English texts in Spanish rather than English, they receive higher scores on a range of tests (see García and Pearson, 1994, for a summary of this work).
An obvious way to address this dilemma would be to examine systematically the relationships among task characteristics, response formats and judgements of performance. Through controlled administrations of carefully constructed assessments, it would be possible to identify the relative impact of task characteristics, such as group discussion, video presentation, and various oral and written response formats, on the scores assigned to student performance. However, while such an analysis would clarify, by partitioning out the score variance, how complex tasks affect student responses, it does not really address the more central construct validity issue of how well the assessment represents the domain of reading.
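As a minimal illustration of what partitioning score variance might look like in practice, the sketch below fits a simple two-factor model to hypothetical data. The variable names, the simulated scores, and the two facets of the design (task format and response mode) are assumptions made for the sake of the example, not features of any actual assessment.

```python
# A hypothetical sketch: partition variance in judged scores across two
# assumed task facets (task format and response mode) using a two-way ANOVA.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "task_format": rng.choice(["group_discussion", "video", "solo_text"], n),
    "response_mode": rng.choice(["written", "oral", "semantic_web"], n),
})
# Simulated judge-assigned scores; a real study would use actual ratings.
df["score"] = (
    3.0
    - 0.4 * (df["response_mode"] == "written")   # assumed writing penalty
    + 0.2 * (df["task_format"] == "video")
    + rng.normal(0, 1, n)
)

model = smf.ols("score ~ C(task_format) + C(response_mode)", data=df).fit()
table = anova_lm(model, typ=2)
# Proportion of total sum of squares attributable to each facet of the design.
print(table["sum_sq"] / table["sum_sq"].sum())
```

Variance-components or generalizability-theory models would serve the same purpose; the point is simply that the relative weight of each task characteristic in the assigned scores can be estimated empirically, even though such estimates leave the domain-representation question untouched.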
The more general question of domain representation
This issue can be examined conceptually as well as psychometrically, and when it is, the question becomes one of judging the validity of different conceptualizations of the domain under consideration, in this case, reading accomplishment. It is a question of infrastructure and representation—determining the components of the domain and their interrelations. How validly does the assessment in question measure the cognitive domain underlying its construction and use? This would not be an issue if there were only one, or even a small number, of commonly held conceptualizations of reading accomplishment. Nor would it be an issue if our measurement tools could easily distinguish between empirically valid and invalid conceptualizations.
Alas, neither assumption holds for reading, any more than it does for most domains of human performance. The complexity of the act of reading, along with the seemingly inevitable covariation among its hypothesized components, renders the statistical evaluation of competing models quite complex, often baffling even the statistical dexterity of our most sophisticated multivariate and factor-analytic approaches.
These conceptual shortcomings ultimately devolve into epistemological and ethical issues, as performance is subjected to judgements and action in the form of decisions about certification, mastery, or future curriculum events.
And here the interpretive and the realist perspectives on research and evaluation meet head on. Those who take a naive realist perspective tend to view the mapping of performance onto standards and the eventual judgements about competence as a transparent set of operations: tasks are designed to measure certain phenomena, and if experts and practitioners agree that they measure them, then they do. Students who do well on particular tasks exhibit high levels of accomplishment in the domain measured; those who do not do well exhibit low levels. But those who take a more interpretive perspective (see Moss, 1994, 1996; Delandshere and Petrosky, 1994) view the mapping
problem as much more complex and indeterminate. Students do not always interpret the tasks we provide in the way we might have intended.
To ground this contrast in a real example, take the case of the early versions of the National Board for Professional Teaching Standards (NBPTS) Early Adolescent English Language Arts assessment for Board Certification. One of the tasks required teachers to submit a portfolio entry (actually a video plus commentary) to document their ability to engage students in discussions of literature (NBPTS, 1993). In the realist view, the assumption is that the level of performance captured on the video indexes expertise in leading literature discussions. But the interpretivist might argue that even though the task may have been designed to elicit evidence about accomplishment of a standard about teaching literature, particular candidates may decline the invitation to use that task to offer such evidence—not because they cannot meet the standard, but because they interpret the task differently. Ironically, in completing the post-reading discussion entry, candidates may provide very useful and insightful evidence about other standards, such as sensitivity to individual differences or appreciation of multicultural perspectives. Conversely, they may provide evidence of meeting the literature standard on another task, itself designed to elicit evidence about other standards; for example, an exercise about how to diagnose student learning might permit a teacher to reveal deep knowledge about how to engage students in literature.
To further complicate the matter, test-takers are not the only group involved in the interpretation of performance assessments. The application of standards and scoring criteria is itself subject to individual interpretation, thus introducing another threat to construct validity. Once tasks are completed, judges are likely to disagree about the nature, quality and extent of evidence provided by particular individuals and particular tasks. Even if one judge concludes that a particular task yields evidence about standard X, other judges may disagree, thinking that it provides evidence about other standards or that other tasks provide better evidence about standard X. We will elaborate on the roles played by interpretation and judgement when we discuss reliability and generalizability issues in Dilemma 7, but even here they raise the question, ‘What construct are we measuring anyway?’ The one intended by the assessment developers, the test-takers, or the judges?
Research possibilities
This is an area in which some useful and important research could be conducted without great expense. We need to examine carefully the process of creating and evaluating performance assessments and portfolios, particularly the manner in which: (a) tasks are selected to represent particular domains, such as reading; (b) test-takers interpret what is being asked of them and how it will be evaluated; and (c) scorers assign value to different sorts of evidence provided by different entries or tasks.
At the heart of this research agenda is a need for traditional construct validity studies in which the constructs to be measured are operationalized and explicitly linked to the content of the assessment and its scoring criteria or standards. Coherence at this level is not enough, however. It is also necessary to demonstrate relationships between the construct as measured and other constructs or outcomes. For example, do judgements of a student’s reading accomplishment on performance tasks correlate with teachers’ judgements of accomplishment? With performance on other tasks or measures of reading? With general academic success or real-world self-sufficiency?
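As a minimal sketch of the kind of convergent-validity check these questions imply, the example below simply correlates performance-task judgements with two other hypothetical indicators of reading accomplishment. The column names and values are invented for illustration; a real study would require defensible measures and far larger samples.

```python
# Hypothetical convergent-validity check: correlate judged performance-task
# scores with other indicators of reading accomplishment (all data invented).
import pandas as pd

scores = pd.DataFrame({
    "performance_task": [3.2, 2.8, 3.9, 2.1, 3.5, 2.6],
    "teacher_judgement": [3.0, 2.5, 4.0, 2.0, 3.8, 2.4],
    "standardized_reading": [410, 385, 455, 360, 430, 375],
})

# Strong positive correlations would provide one (fallible) piece of evidence
# that the performance tasks tap the same underlying construct of reading.
print(scores.corr(method="pearson"))
```

Such correlations are only one part of a validity argument, but they mark the kind of external evidence the questions above call for.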
While proponents of performance assessment hold that the authentic nature of the tasks attests to their validity as measures of reading, a collection of ‘authentic’ tasks may still fall short of representing the broad domain of reading. Without studies of this kind, it is impossible to know whether the domain is adequately represented.
Think-aloud procedures could be useful for understanding how participants perceive the tasks and the standards used to judge their responses or work (Ericsson and Simon, 1984; Garner, 1987). By asking students to tell us the decisions they make as they construct and present their responses, we can begin to determine the fit between the task as intended and as perceived by the participant and assess the magnitude of the threat to validity imposed by a lack of fit. This research could help us create tasks that are more resistant to multiple interpretations and improve scoring criteria so that they accommodate a variety of interpretations.
Likewise, we could ask scorers to think aloud while scoring performance tasks to gain insight into how individual judges interpret standards and assign scores to student work. Through such think-alouds, it may be possible to determine the extent to which the underlying conceptualization of reading, as represented in the task and the scoring rules, is guiding judges’ decisions, and the extent to which extraneous factors are influencing scoring.
Dilemma 2: How seriously should we take the inclusion of