PERFORMANCE ASSESSMENT
Dilemma 10: Can we use performance assessments for wide-scale monitoring and accountability requirements?
Many educators who are quite willing to support portfolios and performance assessments as useful tools within the classroom—for teachers and students to use in making decisions about progress within a curriculum-embedded framework—balk at the suggestion that those data might travel beyond the classroom walls (Tierney, in press; Hansen, 1992), either for high-stakes decisions about individuals (e.g., the certifying function of the portfolios used by New Standards, Central Park East High School, or Walden III High School) or as accountability indices (for comparisons between schools and districts, much as standardized tests or state exams are most frequently used). Nonetheless, in the past five years, a few states have jumped headlong into wide-scale use of performance assessments (e.g., California, Maryland, Wisconsin and Indiana) or portfolios (e.g., Kentucky, Vermont and Oregon).
These efforts have met with political and technical obstacles. Some of the state efforts (California, Wisconsin and Indiana) have faltered, but others continue to develop and are beginning to be used for monitoring and accountability purposes.
In principle, there is no reason why an assessment built to provide scores or even narrative descriptions of individual students cannot be used for school and district level accountability. In fact, one can argue that precisely such a relationship should hold between individual and aggregate assessment:
why would we want to hold schools accountable to standards that differ substantially from those used for individual students? If we can use an assessment to draw valid inferences about individuals within important instructional contexts, why shouldn’t we use those same measures for school accountability indices? All we need is a valid, reliable and defensible aggregate index. Who is to say that the percentage of students who score at or above a particular standard—for example, the accomplished or proficient level in a rubric used to evaluate individual performance—is not just as useful an indicator of school-level programme effectiveness as a mean grade equivalent score, the percentile of the school mean, or a mean scale score on some invented distribution with a mean of 250 and a standard deviation of 50?
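To make the arithmetic of such an aggregate index concrete, here is a minimal sketch in Python, using hypothetical rubric scores, an assumed proficiency cut, and invented state-wide reference statistics (none of which come from any actual assessment system), of how individual rubric ratings might be rolled up into the two kinds of school-level indicators mentioned above.

from statistics import mean

# Hypothetical 4-point rubric scores for one school's students
# (1 = novice, 2 = apprentice, 3 = proficient, 4 = accomplished).
rubric_scores = [2, 3, 4, 1, 3, 3, 2, 4, 3, 2]
PROFICIENT_CUT = 3  # assumed cut score for 'at or above proficient'

# Index 1: percentage of students at or above the proficient standard.
percent_proficient = 100 * sum(s >= PROFICIENT_CUT for s in rubric_scores) / len(rubric_scores)

# Index 2: mean score on an invented scale (mean 250, SD 50), anchored to
# hypothetical state-wide reference statistics for the raw rubric scores.
STATE_MEAN, STATE_SD = 2.5, 0.9
mean_scale_score = 250 + 50 * (mean(rubric_scores) - STATE_MEAN) / STATE_SD

print(f"Percent at or above proficient: {percent_proficient:.0f}%")  # 60%
print(f"Mean scale score: {mean_scale_score:.1f}")                   # 261.1

Either figure can be computed directly from classroom-level rubric judgements; the choice between them is a question of which summary stakeholders find most useful, not of measurement technology.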
If we could aggregate up from the individual and the classroom level, we could meet our school and district accountability (reports to our constituencies) and monitoring (census-like) needs and responsibilities without losing valuable class time getting ready for and administering standardized tests of questionable benefit to students and teachers. As near as we can tell, most standardized tests, and even state tests, serve only this accountability (monitoring programme effectiveness) function in our schools.
They convey little instructional information. If accountability functions could be met by aggregating data from assessments that also tell us something about teaching and learning, so much the better.
To suggest that aggregating scores from locally generated assessment tools to provide classroom, school, or district scores might serve accountability purposes is to suggest that we might not need standardized tests. The implications of such a suggestion are politically and economically charged.
External assessments, like textbooks, are a key anchor in the economy of education. Of course, knowing what we know about the viability and adaptability of American business, the most likely result is that test publishers, rather than oppose such a movement, will try to become a key part of it.
Indeed, the process has already begun with commercially available performance assessments and portfolio evaluation systems.
Research possibilities
We need to examine the ‘aggregatibility’ of performance assessment information from individual to classroom to school to district to state levels. Aggregatibility depends not only on the technical adequacy of the scores; it also depends on whether the information collected at the individual level has relevance for accountability and monitoring at the other levels. Such a programme of research could begin by assessing the information needs at various levels within the system, then asking stakeholders at each level to rate the usefulness of different types of information (e.g. average percentile ranks vs. percentage of students at various achievement levels) for the decisions they have to make.
Conclusion
At the outset of this critique, we argued that we were undertaking this critical review of performance assessment in order to improve it, not to discredit it. We hope that our underlying optimism about the promise of portfolio assessment shines through the barrage of difficulties and dilemmas that we have raised. Perhaps our optimism would have been more apparent had we extended our critique to standardized, multiple-choice assessments.
For then readers would have realized that any criticisms we have of performance assessment pale in comparison to the concerns we have about these more conventional tools (García and Pearson, 1994; Pearson and DeStefano, 1993a). For example, for all of their shortcomings, performance assessments stand up much better than their more conventional counterparts to criteria such as meaningfulness, consequential validity and authenticity (Linn, Baker and Dunbar, 1991), and they are much less susceptible to phenomena such as test score pollution (Haladyna, Nolan and Haas, 1991)—a rise in a test score without any accompanying increase in the underlying cognitive process being measured—that often results from frantic teaching to the test.
Because so many of these questions and issues are dilemmas rather than problems, attempts at finding solutions are likely to uncover even more problems. So the best we can hope for is to decide which problems we are willing to live with in the process of solving those we believe are intolerable.
Unlike problems, which may, in principle, be solved, the best we can hope for with dilemmas is to ‘manage’ them (Cuban, 1992).
The issue of privilege brings us to this most central of dilemmas—one that we must all, both the testers and the tested, come to terms with. At every level of analysis, assessment is a political act. Assessments tell people how they should value themselves and others. Assessments open doors for some and close them for others. The very act of giving an assessment is a demonstration of power: one individual tells the other what to read, how to respond, how much time to take. One insinuates a sense of greater power because of greater knowledge (i.e. possession of the right answers).
The brightest ray of hope emanating from our recent candidates for assessment reform, the very performance assessments that have been the object of our criticism, is their public disposition. If assessment becomes a completely open process in all of its phases from conception to development to interpretation, then at least the hidden biases will become more visible and at best everyone will have a clearer sense of what counts in our schools and perhaps even a greater opportunity to become a part of the process.
References
Bleich, D. (1978) Subjective Criticism. Baltimore, MD: Johns Hopkins University Press.
California Learning Assessment System. (1994) Elementary Performance Assessments:
Integrated English-language Arts Illustrative Material. Sacramento, CA: California Department of Education.
Council of Chief State School Officers (n.d.) ‘Model standards for beginning teacher licensing and development: A resource for state dialogue’. Draft from the Interstate New Teacher Assessment and Support Consortium. Washington, DC: Council of Chief State School Officers.
Cuban, L. (1992) ‘Managing dilemmas while building professional communities’.
Educational Researcher, 21(1): 4–11.
Delandshere, G. and Petrosky, A.R. (1994) ‘Capturing teachers’ knowledge: Performance assessment a) and post-structuralist epistemology b) from a poststructuralist
perspective c) and post-structuralism d) none of the above’. Educational Researcher, 23(5): 11–18.
DeStefano, L., Pearson, P.D. and Afflerbach, P. (1996) ‘Content validation of the 1994 NAEP in Reading: Assessing the relationship between the 1994 Assessment and the reading framework’. In R.Linn, R.Glaser and G.Bohrnstedt (eds), Assessment in Transition: 1994 Trial State Assessment Report on Reading: Background Studies.
Stanford, CA: The National Academy of Education.
Ericsson, K.A. and Simon, H.A. (1984) Protocol Analysis: Verbal Reports as Data.
Cambridge, MA: MIT Press.
Feuerstein, R.R., Rand, Y. and Hoffman, M.B. (1979) The Dynamic Assessment of Retarded Performance. Baltimore, MD: University Park Press.
García, G.E. (1991) ‘Factors influencing the English reading test performance of Spanish-speaking Hispanic students’. Reading Research Quarterly, 26, 371–92.
—— (1994) ‘Equity challenges in authentically assessing students from diverse backgrounds’. The Educational Forum, 59, 64–73.
García, G.E. and Pearson, P.D. (1991) ‘The role of assessment in a diverse society’. In E.Hiebert (ed.), Literacy in a Diverse Society: Perspectives, Practices, and Policies.
New York: Teachers College Press, 253–278.
—— (1994) ‘Assessment and diversity’. In L.Darling-Hammond (ed.), Review of Research in Education (vol. 20). Washington, DC: American Educational Research Association, 337–391.
Garner, R. (1987). Metacognition and Reading Comprehension. Norwood, NJ: Ablex.
Gavelek, J. and Raphael, T.E. (1996) ‘Changing talk about text: New roles for teachers and students’. Language Arts, 73, 182–92.
Gearhart, M., Herman, J., Baker, E. and Whittaker, A.K. (1993) Whose Work is It? A Question for the Validity of Large-scale Portfolio Assessment (CSE Technical Report 363). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing, University of California at Los Angeles.
Geisinger, K.F. (1992) ‘Fairness and psychometric issues’. In K.F.Geisinger (ed.), Psychological Testing of Hispanics. Washington, DC: American Psychological Association, 17–42.
Gergen, K.J. (1994) Realities and Relationships: Soundings in Social Construction.
Cambridge, MA: Harvard University Press.
Gifford, B.R. (1989) ‘The allocation of opportunities and politics of testing: A policy analytic perspective’. In B.Gifford (ed.), Test Policy and the Politics of Opportunity Allocation: The Workplace and the Law. Boston: Kluwer Academic Publishers, 3–32.
Greer, E.A. and Pearson, P.D. (1993) ‘Some statistical indices of the efficacy of the New Standards performance assessments: A progress report’. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC, April.
Haladyna, T.M., Nolan, S.B. and Haas, N.S. (1991) ‘Raising standardized achievement test scores and the origins of test score pollution’. Educational Researcher, 20, 2–7.
Hansen, J. (1992). ‘Evaluation: “My portfolio shows who I am”.’ Quarterly of the National Writing Project and the Center for the Study of Writing and Literacy, 14(1): 5–9.
Kentucky Department of Education. (1994) Measuring Up: The Kentucky Instructional Results Information System (KIRIS). Kentucky.
Koretz, D., Klein, S., McCaffrey, D. and Stecher, B. ‘Interim report: The reliability of the Vermont portfolio scores in the 1992–93 school year’.
Linn, R.L. (1993) ‘Educational assessment: Expanded expectations and challenges’.
Educational Evaluation and Policy Analysis, 15, 1–16.
Linn, R.L., Baker, E.L. and Dunbar, S.B. (1991) ‘Complex, performance-based assessment:
Expectations and validation criteria’. Educational Researcher, 20, 15–21.
Linn, R.L., DeStefano, L., Burton, E. and Hanson, M. (1995) ‘Generalizability of New Standards Project 1993 Pilot Study Tasks in Mathematics’. Applied Measurement in Education, 9(2): 33–45.
Mabry, L. (1992) ‘Twenty years of alternative assessment at a Wisconsin high school’.
The School Administrator, December, 12–13.
Meier, D. (1995) The Power of Their Ideas. Boston: Beacon Press.
Moss, P. (1994) ‘Can there be validity without reliability?’ Educational Researcher, 23(2): 5–12.
—— (1996) ‘Enlarging the dialogue in educational measurement: Voices from interpretive research traditions’. Educational Researcher, 25(1): 20–8.
National Board for Professional Teaching Standards. (1993) Post-Reading Interpretive Discussion Exercise. Detroit, MI: NBPTS.
—— (1994) What Teachers Should Know and Be Able To Do. Detroit, MI: NBPTS.
New Standards. (1994) The Elementary Portfolio Rubric. Indian Wells, CA, July.
—— (1995) Performance Standards: Draft 5.1, 6/12/95. Rochester, NY: New Standards.
Pearson, P.D. (in preparation). ‘Teacher’s evaluation of the New Standards portfolio process’. Unpublished paper. East Lansing, MI: Michigan State University.
Pearson, P.D. and DeStefano, L. (1993a) ‘Content validation of the 1992 NAEP in Reading: Classifying items according to the reading framework’. In The Trial State Assessment: Prospects and Realities: Background Studies. Stanford CA: The National Academy of Education.
—— (1993b) ‘An evaluation of the 1992 NAEP reading achievement levels, report one: A commentary on the process’. In Setting Performance Standards for Student Achievement: Background Studies. Stanford CA: The National Academy of Education.
—— (1993c) ‘An evaluation of the 1992 NAEP reading achievement levels, report two: An analysis of the achievement level descriptors’. In Setting Performance Standards for Student Achievement: Background Studies. Stanford CA: The National Academy of Education.
—— (1993d) ‘An evaluation of the 1992 NAEP reading achievement levels, report three: Comparison of the cutpoints for the 1992 NAEP Reading Achievement Levels with those set by alternate means’. In Setting Performance Standards for Student Achievement: Background Studies. Stanford CA: The National Academy of Education.
Resnick, L.B. and Resnick, D.P. (1992) ‘Assessing the thinking curriculum: New tools for educational reform’. In B.R.Gifford and M.C.O’Connor (eds), Changing Assessments: Alternative Views of Aptitude, Achievement, and Instruction. Boston:
Kluwer Academic Publishers, 37–75.
Shavelson, R.J., Baxter, G.P. and Pine, J. (1992) ‘Performance Assessments: Political Rhetoric and Measurement Reality’, Educational Researcher, 21(4): 22–7.
Shepard, L. (1989) ‘Why we need better tests’. Educational Leadership, 46(7): 4–9.
Simmons, W. and Resnick, L. (1993) ‘Assessment as the catalyst of school reform’.
Educational Leadership, 50(5): 11–15.
Sizer, T. (1992). Horace’s School: Redesigning the American High School. Boston:
Houghton-Mifflin.
Smith, M.S., Cianci, J.E. and Levin, J. (1996) ‘Perspectives on literacy: A response’.
Journal of Literacy Research, 28, 602–9.
Smith, M.S. and O’Day, J. (1991) ‘Systemic school reform’. In S.H.Fuhrman and B.Malen (eds), The Politics of Curriculum and Testing. Bristol, PA: Falmer Press, 233–67.
Thorndike, E.L. (1917) ‘Reading as reasoning’. Journal of Educational Psychology, 8(6): 323–32.
Tierney, R.J. (in press). ‘Literacy assessment reform: Shifting beliefs, principled possibilities, and emerging practices’. The Reading Teacher.
Valencia, S. and Pearson, P.D. (1987) ‘Reading assessment: Time for a change’. The Reading Teacher, 40, 726–33.
Valencia, S.W., Pearson, P.D., Peters, C.W. and Wixson, K.K. (1989) ‘Theory and practice in statewide reading assessment: Closing the gap’. Educational Leadership, 47(7): 57–63.
Vygotsky, L. (1978) Mind in Society: The Development of Higher Psychological Processes.
Cambridge: Harvard University Press.
Wells, G. and Chang-Wells, L. (1992) Constructing Meaning Together. Portsmouth, NH: Heinemann Educational Books.
Wertsch, J.V. (1985) Vygotsky and the Social Formation of Mind. Cambridge, MA:
Harvard University Press.
Wiggins, G. (1993) Assessing Student Performance. Exploring the Purpose and Limits of Testing. San Francisco: Jossey-Bass Publishers.