[Screenshot of the web scoring demonstration: a results table reports a Modified T Score for each of the Overall, Content, Creativity, Style, Mechanics, and Organization traits. A note on the page explains that the Modified T Score theoretically ranges from 40 through 100 (actual results may fall somewhat outside this range) and is computed as (z-score × 10) + 70, with 70 representing the "Average" result.]
Fig. 10.1 Demonstration site for a FIPSE-funded project that will establish norms for longer essays.
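As a simple illustration of the scaling shown in the demonstration screen, the following sketch applies the Modified T Score transformation, T = (z × 10) + 70, to a raw trait score; the norm-group mean and standard deviation used here are invented for the example.

# Minimal sketch (not the production scoring code) of the Modified T Score
# transformation described on the demonstration page: T = 10 * z + 70.
def modified_t_score(raw_score, norm_mean, norm_sd):
    """Convert a raw trait score to a Modified T Score using norm-group statistics."""
    z = (raw_score - norm_mean) / norm_sd
    return 10.0 * z + 70.0

# Example: a raw Content score of 4.2 against hypothetical norms (mean 3.5, SD 0.7)
# yields a Modified T Score of 80, one standard deviation above the "Average" of 70.
print(modified_t_score(4.2, 3.5, 0.7))  # 80.0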
The problem for establishing norms in this national study is that although samples of student writing are probably relatively stable from year to year, the number and scope of institutions that are adopting electronic portfolios will likely change in the near future. A sample that is representative today may, in a few years, no longer reflect the institutions then using electronic portfolios.
Alternate Norms
If one is concerned that examinee performance will be linked to some demographic or other characteristic of concern, then a test constructor might create developmental norms or, alternatively, different forms of the test to match the different characteristic levels (Peterson et al., 1989). The most common demographics used for achievement tests are those of "age" and "grade." In AES, the norms developed for entering college students may not extrapolate well to middle school students. Consequently, norms may have to be developed at multiple grade levels, depending on the purpose of the test. Age norms can also be helpful. For example, if a student skips a grade level in school, she or he may write at an "average" level for the grade, but be in the "superior" group by age.
Occasionally one might extrapolate norms to look for development over time using the same empirical model. If one measured the same individual at different points in time with different essays (and it was appropriate to measure the different essays with the same model), then the differences in normal curve percentiles might represent a shift in developmental performance (positive or negative). This would be a way to document writing growth.
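To make the percentile comparison concrete, the sketch below converts two scores from the same student, obtained at different points in time but referenced to the same hypothetical norm group, into normal curve percentiles; the shift in percentile rank is the kind of change that might be read as developmental growth.

from math import erf, sqrt

def normal_percentile(score, norm_mean, norm_sd):
    """Percentile rank of a score under a normal norm-group distribution."""
    z = (score - norm_mean) / norm_sd
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical essay scores for one student at two points in time,
# both referenced to the same norm group (mean 3.5, SD 0.7).
time1 = normal_percentile(3.2, 3.5, 0.7)   # about the 33rd percentile
time2 = normal_percentile(4.0, 3.5, 0.7)   # about the 76th percentile
print(round(time1), round(time2))          # a positive shift suggests growth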
To date, little research has been conducted on the use of automated essay scorers with English as a Second Language subgroups. From a teaching perspective, it is quite possible that AES can provide a helpful feedback mechanism that accelerates learning, but evaluating such students against norms based on an English as a First Language sample may be inappropriate. Norms for gender and ethnicity may also be appropriate, or at least warrant study. Because most AES engines use an empirical base for modeling, any group differences present in the human ratings on which they are trained are likely to be replicated through automated scoring. If the differences are based on rater bias, then it would be desirable to eliminate them. If not, then it would be desirable to identify the variables or combination of factors for which the differences exist.
Equating
Equating is a statistical technique that is used to ensure that different versions of the test (prompts) are equivalent. As is true with objective tests, it is quite likely that the difficulty level differs from one prompt to the next (Shermis, Rasmussen, Rajecki, Olson, & Marsiglio, 2001). Although some of the AES engines may use a separate model for each prompt, it is likely that from one test group to the next, the prompts would be treated as being equal unless either the prompts or the models used to score them were equated.
Shermis, Rasmussen, Rajecki, Olson, and Marsiglio (2001) investigated the equivalency of prompts using both Project Essay Grade and Multiple Content Classification Analysis (MCCA), a content analysis package that had been used to evaluate the content of television ads for children (Rajecki, Dame, Creek, Barrickman, Reid, & Appleby, 1993). The study was based on a Project Essay Grade model that involved 1,200 essays, each with four raters (800 essays for model building and 400 for validation; Shermis, Mzumara, Olson, & Harrington, 2001). One thousand essays were randomly selected and analyzed using MCCA. The essays included ratings across 20 different prompts. These researchers concluded that essays written to prompts that elicited more "analytical" responses were rated higher than essays written to prompts that elicited more "emotional" responses. That is, raters had a bias for "analytical" themes. The authors concluded that prompts might be differentially weighted in much the same way that dives in a diving competition are assigned a variety of difficulty levels.
Finally, little research has been done on trying to incorporate IRT in the calibration of AES models, although some foundational work has been performed in IRT calibration of human rating responses (de Ayala, Dodd, & Koch, 1991; Tate & Heidorn, 1998). For example, de Ayala et al. (1991) used an IRT partial credit model of direct writing assessment to demonstrate that expository items tended to yield more information than did the average holistic rating scale. Tate and Heidorn (1998) employed an IRT polytomous model to study the reliability of performance differences among schools of various sizes.
The hope is that future theoretical research will permit the application of graded response or other polytomous models to AES formulations. The purpose would be to create models that are more robust to changes in time, populations, or locations. One might also speculate that IRT could help address the sticky issue of creating a separate model for each content prompt in those engines that focus on content. A major challenge in applying IRT techniques to AES has to do with the underlying assumptions of the models (e.g., unidimensionality).
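To show what a polytomous IRT formulation looks like in practice, the sketch below computes score-category probabilities under a generalized partial credit model; the ability, discrimination, and step parameters are invented for illustration and are not taken from any calibrated AES model.

from math import exp

def gpcm_probabilities(theta, a, steps):
    """Category probabilities for one item under the generalized partial
    credit model.  `steps` holds the step difficulties b_1..b_K, so an item
    with K steps has K + 1 score categories (0..K)."""
    # Cumulative "numerators": category 0 contributes exp(0) = 1.
    numerators, cumulative = [1.0], 0.0
    for b in steps:
        cumulative += a * (theta - b)
        numerators.append(exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical 1-5 rating scale (4 steps), scored internally as 0-4.
probs = gpcm_probabilities(theta=0.5, a=1.2, steps=[-1.5, -0.5, 0.8, 1.9])
print([round(p, 3) for p in probs])   # the probabilities sum to 1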
Differential Item Functioning
Differential Item Functioning (DIF) exists when examinees of comparable ability, but from different groups, perform differently on an item (Penfield & Lam, 2000). Bias is attributed when something other than the construct being measured manifests itself in the test score. Because performance on writing may be contingent on mastery of several skills (i.e., it is not unidimensional) and may be influenced by rating biases, it is good practice to check for DIF in AES.
Use of performance ratings does not lend itself to dichotomous analysis of DIF. Dichotomous items are usually scored as 0 for incorrect responses and 1 for correct responses. Polytomous items are usually scaled so that increasing credit is given to better performance (e.g., a score rating from 1 to 5 on a written essay).
However, there are at least three problems limiting the use of polytomous DIF measures: (a) low reliability of polytomous scores, (b) the need to define an estimate of ability on which to match examinees from different demographic groups, and (c) the requirement of creating a measure of item performance for the multiple categories of polytomous scores (Penfield & Lam, 2000).
For the moment, no single method will address all types of possible DIF under all possible situations (e.g., uniform and nonuniform DIF). Penfield and Lam (2000) recommend using three approaches: Standardized Mean Difference, SIBTEST, and Logistic Regression. All of these approaches, with perhaps the exception of Logistic Regression, require fairly sophisticated statistical and measurement expertise. Standardized Mean Difference is conceptually simple and performs reliably with well-behaved items. SIBTEST, although computationally complex, is robust to departures from equality of the mean abilities. Finally, Logistic Regression is generally more familiar to consumers and developers of tests than are some of the feasible alternatives (e.g., discriminant function analysis; French & Miller, 1996).
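As an illustration of the most accessible of these approaches, the sketch below computes a Standardized Mean Difference for a polytomously scored essay item, with examinees matched on a hypothetical total-score stratum; it follows the general SMD idea rather than reproducing Penfield and Lam's exact procedures.

from collections import defaultdict

def standardized_mean_difference(focal, reference):
    """Standardized Mean Difference for a polytomous item.

    `focal` and `reference` are lists of (stratum, item_score) pairs, where the
    stratum is the matching variable (e.g., a total-score band).  Strata are
    weighted by the focal group's distribution over strata."""
    def by_stratum(pairs):
        groups = defaultdict(list)
        for stratum, score in pairs:
            groups[stratum].append(score)
        return groups

    f, r = by_stratum(focal), by_stratum(reference)
    n_focal = sum(len(scores) for scores in f.values())
    smd = 0.0
    for stratum, f_scores in f.items():
        if stratum not in r:          # skip strata with no comparison group
            continue
        weight = len(f_scores) / n_focal
        r_scores = r[stratum]
        smd += weight * (sum(f_scores) / len(f_scores) - sum(r_scores) / len(r_scores))
    return smd

# Hypothetical essay ratings (1-5) for two groups matched on total-score band.
focal = [("low", 2), ("low", 3), ("mid", 3), ("mid", 3), ("high", 4)]
reference = [("low", 3), ("low", 3), ("mid", 4), ("mid", 3), ("high", 5)]
print(round(standardized_mean_difference(focal, reference), 3))  # -0.6; negative favors reference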
In this chapter we have attempted to lay out some of the norming and scaling concerns that face AES researchers as they try to gain wider acceptance of the new technology. A few of the challenges will be unique because AES is a type of performance assessment that typically utilizes human ratings as the criterion measure. Even with extensive training and experience, raters have been known to depart from the specifications of their rubrics or to introduce biases into their evaluations. When this is the case, it is important to check the ratings for differential item functioning.
REFERENCES
Cohen, R. J., & Swerdlik, M. E. (1999). Psychological testing and assessment (4th ed.). Mountain View, CA: Mayfield Publishing Company.
de Ayala, R. J., Dodd, B. G., & Koch, W. R. (1991). Partial credit analysis of writing ability. Educational and Psychological Measurement, 51, 103-114.
French, A. W., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315-332.
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology. Needham Heights, MA: Allyn & Bacon.
Harrington, S., Shermis, M. D., & Rollins, A. (2000). The influence of word processing on English placement test results. Computers and Composition, 17, 197-210.
Jarmer, D., Kozel, M., Nelson, S., & Salsberry, T. (2000). Six-trait writing model improves scores at Jennie Wilson Elementary. Journal of School Improvement, 1. Retrieved from http://www.hcacasi.org/jsi/2000vli2/six-trait-model.adp
Landauer, T., Laham, D., & Foltz, P. (1998). The Goldilocks principle for vocabulary acquisition and learning: Latent semantic analysis theory and applications. Paper presented at the American Educational Research Association, San Diego, CA.
Northwest Educational Research Laboratories (NWREL). (1999, December). 6+1 Traits of Writing rubric [Website]. http://www.nwrel.org/eval/pdfs/6plusltraits.pdf
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 48, 238-243.
Page, E. B., Keith, T., & Lavoie, M. J. (1995, August). Construct validity in the computer grading of essays. Paper presented at the annual meeting of the American Psychological Association, New York.
Page, E. B., Lavoie, M. J., & Keith, T. Z. (1996, April). Computer grading of essay traits in student writing. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
Page, E. B., & Petersen, N. S. (1995). The computer moves into essay grading:
Updating the ancient test. Phi Delta Kappan, 76, 561-565.
Page, E. B., Poggio, J. P., & Keith, T. Z. (1997, March). Computer analysis of student essays: Finding trait differences in the student profile. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19(3), 5-15.
Petersen, N. S., & Page, E. B. (1997, April). New developments in Project Essay Grade: Second ETS blind test with GRE essays. Paper presented at the American Educational Research Association, Chicago, IL.
Peterson, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.
Rajecki, D. W., Dame, J. A., Creek, K. J., Barrickman, P. J., Reid, C. A., & Appleby, D. C. (1993). Gender casting in television toy advertisements: Distributions, message content analysis, and evaluations. Journal of Consumer Psychology, 2, 307-327.
Shermis, M. D. (2000). Automated essay grading for electronic portfolios. Washington, D.C.: Fund for the Improvement of Post-Secondary Education (funded grant proposal).
Shermis, M. D., Koch, C. M., Page, E. B., Keith, T., & Harrington, S. (2002). Trait ratings for automated essay grading. Educational and Psychological Measurement, 62, 5-18.
Shermis, M. D., Mzumara, H. R., Olson, J., & Harrington, S. (2001). On-line grading of student essays: PEG goes on the web at IUPUI. Assessment & Evaluation in Higher Education, 26, 247-259.
Shermis, M. D., Rasmussen, J. L., Rajecki, D. W., Olson, J., & Marsiglio, C. (2001). All prompts are created equal, but some prompts are more equal than others. Journal of Applied Measurement, 2, 154-170.
Tate, R., & Heidorn, M. (1998). School-level IRT scaling of writing assessment data. Applied Measurement in Education, 11, 371-383.
11
Bayesian Analysis of Essay Grading
Steve Ponisciak and Valen Johnson
Duke University
The scoring of essays by multiple raters is an obvious area of application for hierarchical models. We can include effects for the writers, the raters, and the characteristics of the essay that are rated. Bayesian methodology is especially useful because it allows one to include previous knowledge about the parameters. As explained in Johnson and Albert (1999), the situation that arises when multiple raters grade an essay is like that of a person who has more than one watch: if the watches do not show the same time, the person cannot be sure what time it is.
Similarly, the essay raters may not agree on the quality of an essay; each rater may have a different opinion of the quality and relative importance of certain characteristics of any given essay. Some raters are more stringent than others, and some may have less well-defined standards. To determine the overall quality of the essay, one may want to pool the ratings in some way.
Bayesian methods make this process easy. In our analysis of a dataset that includes multiple ratings of essays by multiple raters, we examine the differences between the raters and the categories in which the ratings are assessed. In the end, we are most interested in the differences in the precision of the raters (as measured by their variances) and the relationships between the ratings.
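As a simple stand-in for the kind of pooling just described (and not the hierarchical Bayesian model developed in this chapter), the following sketch combines raters' scores for a single essay by weighting each rater inversely by the variance of that rater's ratings, so that more precise raters count for more; the rating histories are invented.

import statistics

def precision_weighted_pool(essay_ratings, rater_histories):
    """Pool one essay's ratings across raters, weighting each rater by the
    precision (1 / variance) estimated from that rater's past ratings.

    `essay_ratings` maps rater -> rating for the essay of interest;
    `rater_histories` maps rater -> list of that rater's ratings overall."""
    weights = {r: 1.0 / statistics.variance(h) for r, h in rater_histories.items()}
    total_weight = sum(weights[r] for r in essay_ratings)
    return sum(weights[r] * score for r, score in essay_ratings.items()) / total_weight

# Hypothetical data: two raters score the same essay; rater "B" is noisier.
histories = {"A": [3, 3, 4, 4, 3, 4, 3, 4], "B": [1, 6, 2, 5, 3, 6, 1, 5]}
essay = {"A": 4, "B": 6}
print(round(precision_weighted_pool(essay, histories), 2))  # pulled toward rater A's 4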
Our dataset consists of ratings assigned to essays written by 1,200 individuals.
Each essay received six ratings, each on a scale of 1 to 6 (with 6 as the highest rating), from each of six raters. Each rater gave an overall rating and five subratings;
the categories in which the essays were rated were content, creativity, style, mechanics, and organization. Each essay was rated in all six categories by all six raters, so the data constitute a full matrix. Histograms of the grades assigned by one rater in each category are shown in Fig. 11.1 to illustrate some of the differences among the raters and categories. One can see in Figs. 11.1a and 11.1b that the first and second raters rate very few essays higher than 4. As another illustration, the graph in Fig. 11.1d shows that in the creativity category, the fourth rater rates a higher proportion of essays at 5 or 6 and probably has a larger variance. The graph in Fig. 11.1c, for Rater Three, is somewhat skewed and shows little variability.
FIG. 11.1 Histograms of the scores assigned in one category by each rater. (a) Organization rating by Rater One. (b) Mechanics rating by Rater Two. (c) Content rating by Rater Three. (d) Creativity rating by Rater Four. (e) Style rating by Rater Five. (f) Overall rating by Rater Six.
In Figs. 11.1e and 11.1f, one can see that Raters Five and Six tend to rate items more similarly to each other than to Rater Four. The essay ratings for all pairs of categories for a given rater are all positively correlated, as shown in Table 11.1.
The correlations are least variable for Rater Four, ranging from 0.889 to 0.982, and most variable for Rater Five, ranging from 0.577 to 0.895. One can conclude from these values that there is a relationship between the category ratings. For Raters One, Three, and Five, the lowest correlation is observed for the ratings of mechanics and creativity, and the highest for content and the overall rating.
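The pairwise category correlations discussed above can be computed directly from the ratings matrix; the sketch below does so for a single rater using NumPy, with randomly generated ratings standing in for the actual dataset.

import numpy as np

CATEGORIES = ["overall", "content", "creativity", "style", "mechanics", "organization"]

def category_correlations(ratings):
    """Pairwise Pearson correlations between rating categories for one rater.

    `ratings` is an (n_essays x 6) array with columns ordered as CATEGORIES."""
    return np.corrcoef(ratings, rowvar=False)

# Stand-in data: 1,200 simulated essays rated 1-6 in six categories.
rng = np.random.default_rng(0)
base = rng.integers(1, 7, size=(1200, 1))                  # shared quality signal
noise = rng.integers(-1, 2, size=(1200, len(CATEGORIES)))  # per-category wobble
ratings = np.clip(base + noise, 1, 6)

corr = category_correlations(ratings)
print(np.round(corr, 3))   # positive off-diagonal entries, as in Table 11.1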