
Using Multifaceted Rasch Measurement to Evaluate Complex Data Sets in Science Education Research



Utilizing Multifaceted Rasch Measurement Through FACETS to Evaluate Science Education Data Sets Composed of Judges, Respondents, and Rating Scale Items: An Exemplar Utilizing the Elementary Science Teaching Analysis Matrix Instrument

WILLIAM J. BOONE,1 J. SCOTT TOWNSEND,2 JOHN R. STAVER3

1Department of Educational Psychology, Miami University, Oxford, OH 45056, USA; 2School of Clinical Education Preparation, Eastern Kentucky University, Richmond, KY 40475, USA; and 3Department of Curriculum and Instruction, Purdue University, West Lafayette, IN 47907, USA

Received 19 March 2015; revised 9 September 2015; accepted 28 September 2015. DOI: 10.1002/sce.21210

Published online 25 November 2015 in Wiley Online Library (wileyonlinelibrary.com).

ABSTRACT: When collecting data, science education researchers frequently have multiple respondents evaluate multiple artifacts using multiple criteria. Herein, the authors introduce Multifaceted Rasch Measurement (MFRM) analysis and explain why MFRM must be used when “judges’” data are collected. The authors use data from elementary science methods students enrolled in a senior-level, semester-length course. Acting as judges, students used the 23-item Elementary Science Teaching Analysis Matrix (ESTAM) instrument to evaluate five separate science lessons taught by five different teachers. Our MFRM analysis documents that multiple judges need not act as identical robots. Rather, differences in judges’ severity can be computed and corrections made in an analysis to take into consideration differing judge severity. This paper extends the current literature in Science Education regarding the application of Rasch measurement. © 2015 Wiley Periodicals, Inc. Sci Ed 100:221–238, 2016

Correspondence to: William J. Boone; e-mail: [email protected]

INTRODUCTION

In science education research, a common data collection practice is to have multiple judges evaluate numerous artifacts using varied criteria. For example, science teachers (acting as “judges”) may evaluate portfolios of students (“respondents”) using a rating scale (“items”). A common practice has been to train judges to rate in an identical manner and to evaluate the quality of judges’ agreement using computations such as Cohen’s kappa (Chiappetta & Fillman, 2007; Fairbrother, 1975; Jonsson & Svingby, 2007; Marshall, Smart, & Horton, 2010). Kappa is usually presented as an indicator of the similarity of judges’ evaluations (Rijlaarsdam, Couzijn, Janssen, Braaksma, & Kieft, 2006; Van Boxtel, Van der Linden, & Kanselaar, 2000). Following the computation of an acceptable value of kappa, typically an assertion is made that it is reasonable to treat all data as if they were evaluated by a single judge. A total raw score is computed for each respondent (e.g., John is awarded 50 of 70 possible points by the judge who evaluated his science portfolio), and then parametric tests are used to examine comparisons such as: How do the raw scores of males compare to the raw scores of females?
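As a point of reference for the agreement-based approach just described, the following is a minimal sketch of computing kappa for two judges. The rating lists are hypothetical, and the quadratic weighting is simply one common choice for ordered categories rather than a recommendation drawn from this article.

```python
# Hedged sketch: Cohen's kappa for two hypothetical judges rating the same
# ten portfolios on a 1-5 scale. A real study would use the actual ratings.
from sklearn.metrics import cohen_kappa_score

judge_a = [3, 4, 2, 5, 4, 3, 2, 4, 5, 3]
judge_b = [3, 4, 3, 5, 4, 2, 2, 4, 4, 3]

# Quadratic weighting penalizes large disagreements more than small ones,
# which is often preferred for ordered rating categories.
kappa = cohen_kappa_score(judge_a, judge_b, weights="quadratic")
print(f"Quadratic-weighted kappa: {kappa:.2f}")
```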

Typically, judges and items are not further considered following the computation of a respondent’s total raw score; however, both should be further considered. For example, what is the impact of judges not acting in unison? Regarding items, what is the range of difficulty of the instrument’s items? Are there judges who respond unpredictably to the instrument’s items? In most science education studies, little further consideration is given to the difficulty of tasks, the difficulty of individual items, and the hierarchy of item difficulties.

Herein we introduce the science education community to Multifaceted Rasch Measurement (MFRM; Linacre, 1989) analysis and explain in depth why this technique must be used when most judging data are collected.

Judges are frequently used to collect science education data for analyses. One example is groups of experts grading open-ended essay items (Shaw, Van Horne, Zhang, & Boughman, 2008; Sungur, Tekkaya, & Geban, 2006). Two well-known examples of judging data can be helpful to facilitate readers’ comprehension. First, consider figure skaters competing in the Winter Olympics. A group of judges evaluates skaters on a number of tasks. The judges are the “judges,” the skaters are the “respondents,” and tasks are the “items” of an instrument (think of skaters responding to an item as a skill they must attempt). A second example is a high-stakes open-ended student science exam, which is shipped to a central location for scoring by judges. In this example, the individuals scoring the exam are “judges,” the students completing the exam are “respondents,” and the exam’s items are “items.”

We build upon a set of previous Science Education articles that describe how and why Rasch measurement must be used for analyzing multiple-choice test data (Boone & Scantlebury, 2006), rating scale data (Boone, Townsend, & Staver, 2011), and partial credit data (Eggert & Bögeholz, 2010). Utilizing an organizational framework similar to these previous Rasch articles, we demonstrate MFRM techniques with a science education data set, interpret the results to identify and highlight an error in some past science education data analyses, and discuss the importance of MFRM for science education researchers who collect data using judges. Typically, MFRM is best taught and used following exposure to the application of Rasch to dichotomous items, rating scale items, and partial credit items.

Currently, the Rasch model is used for the analysis of many well-known data sets such as PISA (Programme for International Student Assessment), TIMSS (Trends in International Mathematics and Science Study), and PIRLS (Progress in International Reading Literacy Study) data (Bond & Fox, 2007; Endler & Bond, 2000; Lee, 2004), as well as the analysis of many NCLB (No Child Left Behind) state tests (e.g., Ohio, Illinois, Pennsylvania, Texas). Moreover, a number of research groups worldwide use Rasch. This study makes use of a data set format that will be familiar to science educators, regardless of their personal foci on research and teaching. The data set was collected from preservice science teachers in a semester-length methods course who used an instrument to judge the teaching of five exemplar teachers via a set of video tapes/video discs.

Six Key Reasons for Utilizing Rasch Measurement Theory in Science Education Research

To understand the need for MFRM for data from judges and the promise of MFRM, we begin with a brief review of six core components of Rasch measurement and follow with a description of an organizational framework for the analyses presented herein.

Ordinal Rating Scale Data (Ordinal Data Are Not Linear). Raw ratings from a Likert-type rating scale are ordinal data (Ravid, 2010). Ordinal data are not linear and cannot be analyzed as if they are linear. The use of parametric tests to evaluate raw Likert-type scale data violates assumptions of parametric tests (Meyers, Gamst, & Guarino, 2006), and applying parametric tests to raw rating scale data can often result in incorrect conclusions.
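To illustrate this point, below is a minimal sketch showing that equal raw-score gaps do not correspond to equal gaps once scores are re-expressed as log-odds; this is the kind of nonlinearity Rasch measurement corrects for. The maximum raw score of 70 simply echoes the portfolio example above and is otherwise arbitrary.

```python
import math

def raw_to_logit(raw_score: int, max_score: int) -> float:
    """Convert a raw score to a simple log-odds value.

    This is only an illustration of the ordinal-to-linear issue, not a full
    Rasch person-measure estimate (which also depends on item difficulties,
    judge severities, and category thresholds).
    """
    p = raw_score / max_score
    return math.log(p / (1 - p))

MAX_SCORE = 70  # hypothetical maximum raw score
for raw in (35, 45, 55, 65):
    print(raw, round(raw_to_logit(raw, MAX_SCORE), 2))
# Equal 10-point raw gaps map to unequal logit gaps
# (0.00, 0.59, 1.30, 2.56), so raw scores are not linear measures.
```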

Data Quality Control (Are Data of High Quality? Can Person Measures Be Confidently Computed?). In Rasch measurement, great effort is taken to evaluate the quality of the data (Smith, Rush, Fallowfield, Velikova, & Sharpe, 2008; Wang & Chen, 2005; Wright & Masters, 1982).

Construct Validity (Do Items Define the Construct? Are Items Part of the Construct?). Rasch measurement can address the construct validity of an instrument in a multitude of ways (Liu, 2010; Neumann, Neumann, & Nehm, 2011; Stewart-Brown et al., 2009; Thomas, Anderson, & Nashon, 2008).

Item Targeting (Do Items Optimize the Quality of Respondent Measures?). The presentation of an instrument’s items, be they test items or survey items, along a variable using a Rasch metric is a component of what is now commonly referred to as a Wright map (Eggert & Bögeholz, 2010; Wilson, 2008; Wilson, Allen, & Li, 2006). Using Wright maps, researchers can quickly evaluate the manner in which the items of an instrument work together to define different portions of a trait.

Items and Respondents Expressed on the Same Metric. A unique aspect of the Rasch model developed by Georg Rasch (Rasch, 1960) is that, in the case of tests and surveys, respondents and items can be expressed on the same scale. In the case of judging data, judge severity, respondents, and items can be expressed on the same scale. In the case of tests, this enables respondents’ performances to be expressed using the context of all test items that define a trait.

Analysis of Rating Scale Functioning (How Well Does Your Rating Scale Work?). It is not enough to simply present a set number of survey items to a respondent and provide, for example, a five-step rating scale. Rather, it is important to assess the functioning of the rating scale with respect to respondents as a function of each individual survey item.

WHY MFRM FOR SCIENCE EDUCATION JUDGING DATA?

In addition to the above-mentioned reasons for using Rasch analysis with many types of data sets, several broad payoffs await science education researchers who consider the use of MFRM when a task, for example, is being judged to create numbers. Our perspective is that perhaps the most critical issue is that no matter the context of their training, judges differ in the manner in which they “judge.” Put more succinctly, judges differ; some are tough, some are easy, and some are neither tough nor easy. Although some researchers will attempt to train judges to act identically, as robots, in reality judges do not act in the same manner. Therefore, to assume that all judges act as robots will penalize some respondents who are evaluated by tough judges and unfairly reward respondents who are evaluated by easy judges. Because Rasch measurement (1) enables judges’ severity and leniency to be evaluated; and (2) enables corrections in respondents’ measures to be computed based on the mix of judges that evaluate a respondent, researchers can have high confidence in the data later analyzed with parametric statistical tests.

An additional reason to choose Rasch measurement for analyses of science education data sets in which judges are used focuses on the time and cost of “judging.” In almost all scenarios, the cost of employing a large number of judges to evaluate all item responses of all respondents is prohibitive. A payoff of Rasch measurement is grounded on the concept that measuring involves one dimension; therefore, researchers can have a mix of judges evaluate subsets of respondents and items. This means that judges do not have to evaluate all respondents and all items. This ability to avoid having all judges evaluate all items and all respondents provides great savings in time and cost. We believe this opportunity will result in an increased use of judges in science education research and will, in time, provide new research opportunities for all science education researchers. The recent text Rasch Analysis in the Human Sciences (Boone, Staver, & Yale, 2014) provides guidance for science educators wishing to learn Rasch, be it for tests, quizzes, or data sets with judges.

AN EXEMPLAR APPLICATION OF MFRM TO ADDRESS A SCIENCE EDUCATION RESEARCH QUESTION

Preparing future elementary teachers in the many aspects of teaching science can be difficult in light of the well-documented profile of such students (Beyer & Davis, 2012; Gunning & Mensah, 2011; Hechter, 2011; Nilsson & Loughran, 2012). For example, low self-efficacy and minimal science content backgrounds are frequent characteristics. Often the elementary science methods course is the first and only science experience these students have prior to full-time employment as teachers. It is difficult to characterize the components and sequence of the “typical” elementary science methods course curriculum. However, more often than not, several fundamental concepts are presented to students, such as the nature of science; student misconceptions; discrepant events; conceptual change; science process skills; cooperative learning; conducting inquiry-based science activities; questioning strategies; teacher and student interactions; targeted science content; learning environment; science standards; and formative and summative assessment.

For college students studying to be science teachers, as well as for those currently employed, one of the most difficult skills to master is the ability to design, deliver, and evaluate one’s own science lessons (e.g., Alake-Tuenter et al., 2012; Black, 2004; Kim & Tan, 2011). Science teacher educators have proposed a wide range of techniques (e.g., Marble, 2006, 2007; Abell, Appleton, & Hanuscin, 2010) that can be used to enhance science teachers’ abilities to design, deliver, and evaluate science lessons in such a manner so as to optimize their students’ science learning in the elementary classroom. To help preservice teachers (PSTs) reach these goals, common sequences are often followed. For example, PSTs may participate “as students” in science lessons that are modeled by the science methods instructor in the college classroom. Participating in lessons may be furthered by allowing PSTs to (1) design and deliver their own science lessons in peer-teaching contexts and (2) teach science lessons to small groups of elementary students. A culminating experience may involve PSTs teaching science lessons to entire classrooms of elementary students.

For many decades, researchers have attempted to develop some manner of collecting data to improve science teacher preparation. For example, instruments have been designed to collect data from PSTs to better target efforts to improve instruction. Past instrumentation has used a variety of data collection formats, for example, check-off sheets for indicating the presence or absence of a particular science teaching technique (e.g., Jarvis, McKeon, Coates, & Vause, 2001). Likert scales (e.g., Burry-Stock & Oxford, 1994) are also frequently used. Data are collected and summarized in a variety of ways. For some instruments, a total raw score is computed for each teacher using all instrument items (Burry-Stock & Oxford, 1994). In other instruments, a subscale raw score is computed for a specific subset of items (Luft, 1999; Marshall et al., 2010).

RESEARCH QUESTIONS

Using the setting of science teacher training, we demonstrate the applicability of Multifaceted Rasch Measurement with FACETS (Linacre, 2014). Acting as judges, PSTs of science were asked to evaluate five different lessons (each lesson was presented by a different teacher) using the 23-item Elementary Science Teaching Analysis Matrix (ESTAM) instrument (Gallagher & Lindsey, 1997). The research questions are:

Do PSTs (judges) differ in their use of the ESTAM instrument, and if so, how much do they differ? After correcting for differences in judge severity and ESTAM item difficulty, what general patterns in respondents’ measures and item measures can be observed?

What are the implications for science education research in light of the MFRM analysis?

DATA COLLECTION AND INSTRUMENTATION

In the study reported herein, we collected data from elementary science methods students enrolled in a senior-level, semester-length course at a public university in the United States. In an effort to improve students’ comprehension of inquiry teaching, a culminating course assignment was added to the course. This assignment required students to use the ESTAM (Gallagher & Lindsey, 1997) instrument to evaluate five 45-minute online, videotaped science lessons taught by five different teachers. By requiring students to evaluate a lesson, we hypothesized that students would further synthesize core ideas that were presented as central themes of the science methods course.

The ESTAM consists of 23 items that were used as the judging criteria by each PST. Each item is rated using a five-step scale. The lowest step (assigned a rating score of 1) is labeled “Didactic/Teacher-Centered.” That rating step is followed by categories that represent ever-increasing sophistication of science teaching and learning: 2: Hands-on/Student-Centered; 3: Conceptual; 4: Constructivist; 5: Constructivist/Inquiry. The developers of ESTAM argue that the 23 items rated with this five-step scale can be combined for an overall assessment of teacher effectiveness. A number of authors (e.g., Adams & Krockover, 1997; Gallagher & Parker, 1995) have discussed the secondary version of this matrix, but little work has been reported to date on the elementary version. Table 1 presents a summary of the 23 ESTAM items as well as the shorthand abbreviations that are used in the tables and figures we provide.

Data were collected from a sample of 150 students who were enrolled in an elementary science methods class during their senior year. Each of the 150 PSTs (judges) used the 23-item (items) ESTAM to judge each of the five different science teacher videos (respondents).

DATA ANALYSIS

These data were analyzed utilizing the Rasch FACETS analysis program (Linacre, 2014). FACETS corrects not only for the nonlinearity of rating scale data but also for differences in judge severity.

The sequence of steps in collecting the ESTAM data and analyzing them with MFRM in FACETS is the same sequence of steps often used for collecting and analyzing high-stakes science test data. PSTs function as judges, the five videotaped teaching lessons represent five different candidates (respondents), and the 23 ESTAM items represent 23 different parts of a single trait. Table 2 presents this relationship as well as two additional examples.
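To make the facet structure concrete, here is a minimal sketch of a long-format layout for such data: one row per judge-by-respondent-by-item observation. The judge IDs, ratings, and column names are hypothetical illustrations and do not reproduce the study’s data or the actual FACETS input format.

```python
import pandas as pd

# Each row records which judge rated which respondent (video-taped lesson)
# on which ESTAM item, and the 1-5 rating that was awarded.
ratings = pd.DataFrame(
    {
        "judge":      ["PST_001", "PST_001", "PST_002", "PST_002"],
        "respondent": ["Force and Motion", "Water Cycle",
                       "Force and Motion", "Water Cycle"],
        "item":       ["C1", "ER5", "C1", "ER5"],
        "rating":     [4, 1, 5, 2],
    }
)
print(ratings)
```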

Table 3 presents the mathematical equation describing the MFRM Rasch model.

RESULTS

Our multifaceted Rasch analysis was conducted using FACETS Rasch software (Linacre, 2014) and limited to an initial generation of item measures, respondent measures, and judge measures, as well as fit statistics for items, respondents, and judges. Also, we created a Wright map in which the measures of judges (PSTs), items, and respondents (video-taped teachers) are provided.

Judge (PST) Measures

The FACETS software facilitated the computation of three sets of linear measures, all on the same metric. One set of measures expresses the severity or leniency of the PSTs (judges) on a scale that ranges, for this study, from approximately –2.2 to 1.5 scale score units (logits). To understand the meaning of each PST measure and how these “PSTs as judges” measures can be compared, consider two students, Mindy with a judge measure of 1.46 and Bob with a judge measure of –0.81. A student (judge) who has a “more positive” measure than another student (judge) is someone who was more severe in her/his use of the ESTAM rating scale. In the case of this ESTAM data analysis, a more severe judge (e.g., Mindy) gave lower raw numerical ratings (using the ESTAM scale) for a respondent (one of the five teachers who supplied a video lesson) than did a less severe judge (e.g., Bob). Recall that the ESTAM rating scale ranges from “1” (Didactic/Teacher-Centered) to “5” (Constructivist/Inquiry). This means that Judge Mindy is less likely to view a teacher as exhibiting Constructivist/Inquiry teaching. Judge Bob, in comparison to Mindy, is more likely to view a teacher on tape as exhibiting more Constructivist/Inquiry teaching.
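A small sketch may help make this comparison concrete. It applies a rating-scale form of the many-facet model (see Table 3) to compute the expected ESTAM rating that a single respondent would receive from judges of differing severity. The Rasch-Andrich thresholds, the respondent measure, and the item difficulty used below are hypothetical illustrations, not estimates reported in this study; only the two judge measures (1.46 and –0.81) come from the text.

```python
import numpy as np

def category_probs(theta: float, thresholds) -> np.ndarray:
    """Rating-scale model category probabilities at location theta, where
    theta = respondent measure - judge severity - item difficulty.
    'thresholds' are the Rasch-Andrich thresholds F_2..F_K."""
    logits = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    p = np.exp(logits - logits.max())        # subtract max for stability
    return p / p.sum()

# Hypothetical values for illustration only.
thresholds = [-2.0, -0.5, 0.5, 2.0]          # F_2..F_5 for the 5-step scale
respondent_measure = 0.93                    # e.g., a relatively strong lesson
item_difficulty = 0.0

for judge, severity in [("Mindy", 1.46), ("Bob", -0.81)]:
    theta = respondent_measure - severity - item_difficulty
    probs = category_probs(theta, thresholds)
    expected = float(np.dot(np.arange(1, 6), probs))
    print(f"{judge}: expected ESTAM rating = {expected:.2f}")
# Under these assumed values the severe judge's expected rating is roughly
# 2.6 versus roughly 4.2 for the lenient judge, for the same lesson.
```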

A very important second point must be made in describing the importance of computing judge measures. In the real world, judges, no matter how well they are trained to use a rating scale in a similar fashion, will always differ because they will interpret and apply a rating scale in their own unique manner. Therefore, it is critical to compute a judge severity measure for each judge, evaluate the range of judge severity that is present in a sample, and correct for differences in judge severity to compute accurate measures of respondents (video-taped teachers) and items.


TABLE 1

The 23 ESTAM Items

C1. ConceptFocus: Teacher fosters conceptual understanding with students
C2. ContentEmphasis: Teacher emphasizes student application and personal sense-making of concepts
P1. Process/Content: Teacher balances content with inquiry processes
P2. Habits/Mind: Teacher emphasizes scientific habits of mind with students' understanding of content
TVA1. BodyRelatedConcepts: Teacher emphasizes science as a body of related concepts rather than facts/processes
TVA2. LinkUnderlyConcepts: Teacher relates activities/ideas with standards and underlying science concepts
TVA3. MakeConnections: Teacher interacts with students to promote connections between activities and ideas
TVA4. SciHabitsMind: Teacher promotes/facilitates students' development of scientific habits of mind
TVA5. StudentApplications: Teacher fosters application of concepts/ideas
SA1. FocusSequence: Teacher fosters student connection between physical materials and concepts and standards
SA2. StudentGroups: Teacher uses student groups to support one another's understanding
SA3. StudentsReport: Students report results and understandings in multiple formats
SA4. StudentQuestions: Teacher and students' questions focus on sense-making of ideas
A1. AssessmentGrades: Assessment used for both grades and understanding students' ideas
A2. AssessmentProbe: Teacher probes student ideas for understanding
A3. AssessmentConcept: Teacher focuses assessment on conceptual understanding of standards
A4. UseStudentIdeas: Teacher uses own knowledge of student ideas for instructional decisions
A5. Pre/EmbeddedAssess: Teacher uses pre and embedded assessments to inform instruction
ER1. ClassSetup: Teacher's use of classroom layout (theater-style, lab or activity table, etc.)
ER2. Focus/Work: Teacher uses displayed student work to promote understanding
ER3. Teacher/StudentInteract: Teacher or student
ER4. S-SInteract: Teacher fosters student-to-student interaction
ER5. MultipleResources: Teacher provides various resources for student learning
ER6. ResourceTypes: Teacher uses self, commercially, or student prepared types of resources

Owing to the length of each item, we present an item ID, an abbreviated item name, and a brief summary of the item topic. The item ID (e.g., SA2) is used to denote each item in the text, tables, and figures.


TABLE 2

Examples of the Use of Judges, Items, and Respondents in Medical Credentialing, Science Education, and Sports

Facet         Medical Credentialing                     This Study              Sports (e.g., Figure Skating)
Judges        Board certified experts                   Preservice teachers     Experts in the sport
Respondents   Candidates seeking board certification    Video-taped teachers    Figure skaters
Items         Test items                                ESTAM items (traits)    Required elements

TABLE 3

The Multifaceted Rasch Model

$\log_e\!\left(\frac{P_{njimk}}{P_{njim(k-1)}}\right) = B_n + L_j - D_i - T_m - F_k$

where $P_{njimk}$ is the probability that person $n$ of ability $B_n$ is observed by judge $j$ of leniency $L_j$ in category $k$ of item $i$ of difficulty $D_i$ while performing task $m$ of difficulty $T_m$, and $F_k$ is the difficulty of category $k$ relative to category $k-1$.


Owing to the large number of PSTs serving as judges, we provide a sample of the Facets judge measure table as Table 4. Table 4 contains a sample of severe judges who tended to use the lower part of the rating scale, meaning they viewed much of the teaching as teacher centered, a sample of less severe, somewhat easier judges who were a little more frequent in their use of the higher parts of the rating scale, a sample of judges who were substantially easier and even more frequent in their use of the higher (more constructivist/inquiry) portion of the rating scale, and a fourth sample of judges who were the most lenient in their use of the five-step ESTAM rating scale. Judges Mindy and Bob detailed above are listed in the table. Added details concerning the content of Table 4 are provided in the table footnote.

Respondent Measures

The FACETS software also allowed the computation of respondents’ measures (Table 5), in this case the measures of the five teachers who were judged by the 150 PSTs. In the case of utilizing the ESTAM, the measure provided a linear quantitative gauge of the level of “teaching effectiveness”: the higher the measure, the higher the teaching effectiveness as measured via the set of 23 ESTAM items. Added details regarding table organization are provided in the table footnote.

Item (the ESTAM Items That Define the Scale) Measures

In addition to the computation of respondents’ (teachers on video tape) measures and the computation of judges’ (PSTs) measures, the measures of the 23 ESTAM items that define the scale were also computed. The item measures also take into consideration such issues as the varied severity of judges. Table 6 presents all 23 ESTAM item measures, as well as a set of Rasch indices for all 23 items. Details regarding the table organization are provided in the table footnote.


TABLE 4

Portions of a PST as Judge Measure Table

Total Score   Total Count   Observed Average   Measure   Model SE   Outfit MNSQ   Judge (Num)

27 24 1.1 1.46 0.55 0.59 Mindy

27 24 1.1 1.46 0.55 0.67 9,444

269 118 2.3 1 0.1 0.71 84,761

302 118 2.6 0.67 0.1 1.48 43,660

78 24 3.3 0.63 0.21 0.38 943

79 24 3.3 0.58 0.22 0.28 9,440

317 120 2.6 0.57 0.1 0.52 Billy

324 120 2.7 0.5 0.1 0.6 84,830

327 119 2.7 0.45 0.1 1 77,311

339 120 2.8 0.35 0.1 1.1 56,661

375 120 3.1 –0.02 0.1 0.86 18,816

383 120 3.2 –0.11 0.1 0.57 73,164

331 96 3.4 –0.41 0.12 1.12 82,290

419 117 3.6 –0.61 0.11 0.77 60,130

446 120 3.7 –0.81 0.11 0.87 Bob

488 120 4.1 –1.37 0.12 0.64 78,940

500 120 4.2 –1.55 0.13 0.8 38,434

536 120 4.5 –2.21 0.15 0.95 93,719

The numerical ID of each PST as judge is presented in the far right column. The total numerical score awarded by the judge across the answered ESTAM items for all rated respondents (the five teachers) is provided in the first column of data. The total number of ESTAM items answered by the PST as judge is provided in the second column. The fourth column of data provides the “measure” of the judge. This measure is corrected for nonlinearity, and the measure also takes into account the ESTAM items answered by the judge and the respondents evaluated by the judge. Some judges did not evaluate all respondents and/or utilize all ESTAM items. However, through the use of Rasch, a judge measure can still be computed. Judge Mindy answered only 24 ESTAM items, which means she only judged one respondent. Judge Billy provided 120 ESTAM item answers, which meant that he evaluated all five respondents (the five teachers on tape) using all 24 ESTAM items (5 × 24 = 120). Additional columns of the table provide a number of quality control Rasch indices. A total of 150 PSTs acted as judges. The mean Outfit MNSQ for the judges was 0.97. Judges with higher measures were tougher (they gave the “best ratings” of 5 [Constructivist Teaching] less often than judges with a lower measure).
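For readers unfamiliar with the fit index reported in the table, the following is a minimal sketch of how an outfit mean-square is formed: it is the average of squared standardized residuals over a judge’s (or an item’s, or a respondent’s) observations. The observed ratings, expected ratings, and model variances below are hypothetical; in practice they come from the FACETS estimation.

```python
import numpy as np

def outfit_mnsq(observed, expected, variance) -> float:
    """Outfit mean-square: mean of squared standardized residuals.

    Values near 1.0 indicate observations that match the model's
    expectations; values well above 1.0 flag erratic (unexpected) ratings,
    and values well below 1.0 flag overly predictable ratings.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    z_squared = (observed - expected) ** 2 / variance
    return float(z_squared.mean())

# Hypothetical ratings by one judge, with model expectations and variances.
observed = [3, 4, 2, 5, 3]
expected = [3.4, 3.2, 2.9, 4.1, 3.6]
variance = [1.1, 1.0, 1.2, 0.8, 1.1]
print(f"Outfit MNSQ: {outfit_mnsq(observed, expected, variance):.2f}")
```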


TABLE 5

The Measures of the Five Teachers Who Were Judged by the 150 PSTs

Total Score   Total Count   Observed Average   Measure   Model SE   Outfit MNSQ   Teacher on Tape (Lesson)

13,267 3,155 4.2 0.93 0.02 1.25 Force and Motion

13,330 3,253 4.1 0.74 0.02 0.90 All Sorts of Leaves

11,678 3,158 3.7 0.15 0.02 0.90 Water Purification

10,681 3,126 3.4 –0.20 0.02 0.98 Completing the Circuit

6,627 3,167 2.1 –1.62 0.02 1.03 Water Cycle

Each line of the table provides data for each of the five teachers who provided lessons. The last column of data provides the topic taught by the teacher. The first column provides the total raw score which a teacher received from the PST judges. This means that the teacher who taught the lesson “Force and Motion” received a total of 13,267 raw score points from all the judges that evaluated them. The second column “Total Count” provides the total number of ESTAM items which were rated by the judges in their assessment of the teacher on tape. This means that a total of 3,155 judge ratings were provided for the “Force and Motion” teacher. The measure column provides the “measure” of the teacher on the linear Rasch scale. A higher Rasch measure for the teachers on tape means that the teacher was rated as exhibiting more constructivist teaching than a teacher with a lower measure. This means that the “Water Cycle” teacher (a measure of –1.62) exhibited a lower level of constructivist teaching than did the “Force and Motion” teacher (a measure of 0.93). As is the case in Table 4, additional columns in the table provide some of the Rasch statistics which can be used to evaluate data quality.


DISCUSSION

MFRM was used successfully to compute person measures, item measures, and judge measures for a data set. One hundred fifty PST judges used the 23-item ESTAM instrument to evaluate five different teachers whose science lessons were videotaped for evaluation. Given the properties of the Rasch model, we were able to (1) convert ordinal rating data to linear measures for items, judges, and respondents and (2) assess the data quality of items, PSTs, and respondents. Missing data (a judge skipping an item) did not impact the analysis.

Analysis of quality control indices suggests that the data set appears to align with the requirements of the Rasch model. One item, ER5, “Using multiple resources” (MNSQ outfit 1.99) of the ESTAM may not define the same trait as well as that observed for the remaining 22 items. This item, of all ESTAM items, was the hardest item to receive a high rating on (see Table 6); thus, as one considers any final decisions on this item (e.g., retain the item, remove the item, monitor the item), it is important to be alert to the fact that this item may be the item that defines the most difficult to achieve part of the trait. This is one possible explanation for the observed item misfit. If additional data are collected, analysts should closely monitor the behavior of this item. Whereas this item needs to be monitored, it should be retained, as this item helps identify very high performing teachers.

Review of all respondents suggests no misfit for the five teachers whose science lessons were recorded on tape (the five respondents). This result supports the inference that the measures computed for the five teachers are expressed along the same trait.


TABLE 6

The Measure Table for the ESTAM Items

Total Score   Total Count   Observed Average   Measure   Model SE   Outfit MNSQ   Nu   Trait

1,801 660 2.7 1.01 0.04 1.99 23 ER5

2,234 662 3.4 0.18 0.05 1.06 12 SA4

2,230 659 3.4 0.18 0.05 0.80 9 TVA5

2,247 665 3.4 0.18 0.05 0.93 3 P1

2,225 655 3.4 0.16 0.05 1.10 20 ER2

2,255 663 3.4 0.15 0.05 0.98 13 SA4

2,287 661 3.5 0.06 0.05 0.87 8 TVA4

2,293 662 3.5 0.06 0.05 0.98 5 TVA1

2,285 657 3.5 0.04 0.05 0.87 21 ER3

2,321 662 3.5 0.00 0.05 0.81 19 ER1

2,332 658 3.5 –0.05 0.05 0.97 18 A5

2,336 659 3.5 –0.05 0.05 0.89 16 A3

2,351 660 3.6 –0.07 0.05 0.99 24 ER6

2,361 663 3.6 –0.07 0.05 1.15 4 P2

2,362 663 3.6 –0.08 0.05 1.05 15 A2

2,366 662 3.6 –0.09 0.05 1.05 6 TVA2

2,377 663 3.6 –0.10 0.05 0.99 2 C2

2,378 661 3.6 –0.12 0.05 1.03 10 SA1

2,395 664 3.6 –0.14 0.05 0.93 1 C1

2,407 659 3.7 –0.20 0.05 1.04 17 A4

2,421 662 3.7 –0.21 0.05 1.08 7 TVA3

2,449 664 3.7 –0.26 0.05 0.85 11 SA2

2,426 656 3.7 –0.27 0.05 0.96 14 A1

2,444 659 3.7 –0.29 0.05 0.88 22 ER4

The far right column provides an item name. The first column provides the total raw score provided by all judges to the item. The second column provides the number of judge ratings provided for the specific item. The fourth column provides the ESTAM item measure. The first item in the table was rated by judges a total of 660 times, for a raw score total of 1,801. The measure of the item is 1.01 on the Rasch logit scale that is also used to express the judge severity and the respondent measures. Recalling that a lower rating scale category means less constructivist teaching, one can observe that a higher Rasch measure means that an item was overall less often observed in a constructivist manner. This means that ER5 “MultipleResources: Teacher provides various resources for student learning” (measure 1.01) was observed less often in a constructivist manner (among the teachers on tape) than item ER3 (measure 0.04). Additional columns in the table provide indices which can be used to evaluate data quality.

In any data set with a large number of respondents and/or items and/or judges, idiosyncratic behavior will sometimes be observed. Although some judges (PSTs) were observed (see Table 4) with outfit MNSQ greater than 1.3, generally, for the large number of judges (n = 150), the data suggest that the judges’ responses fulfilled the requirements for productive measurement. A mean outfit mean-square (MNSQ) of 0.97 was observed across this very large number of judges.

Regarding analysis of the functioning of the ESTAM rating scale, Rasch analysis is, in part, the mathematical application of what it means to measure. One requirement of measurement with a rating scale and a set of items is that items must define a single trait. Also, the rating scale is applied in a manner that matches what it means to measure.


TABLE 7

A Data Summary Table Which Provides a Technique by Which the Functioning of the Measurement Scale Can Be Assessed

Data Quality Control

Category Score   Counts Used   Per Cent   Cum. %   Average Measure   Outfit MNSQ

1 1,516 10 10 –1.23 0.9

2 1,521 10 19 –0.36 1.1

3 3,905 25 44 0.25 1.0

4 5,275 33 77 0.95 1.0

5 3,618 23 100 1.37 1.0

Average measures should increase with higher rating scale categories (Category Score). Category Score reports the coding used for the ESTAM rating scale. “Counts Used” refers to how many times in the entire data set a rating scale score was observed. The “Per Cent” and “Cum. %” columns provide the percentage of the data set that was observed for each rating scale step.

Table 7 provides a number of statistics. Perhaps most important are the first (Score) and the fifth (Average Measure) columns of data. Readers should observe that, with each increase in the Category Score (1 to 2 to 3 to 4 to 5), an increase in the logit values occurs in the Average Measure column of data (–1.23 to –0.36 to 0.25 to 0.95 to 1.37). When such an increase is observed, as it is here, the observation is evidence that the rating scale is functioning in at least one of the ways that is required to conduct sound measurement.
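The monotonicity check described above can be expressed in a few lines of code; the dictionary below simply re-enters the Category Score and Average Measure columns of Table 7.

```python
# Average measure by ESTAM rating category, taken from Table 7.
average_measure = {1: -1.23, 2: -0.36, 3: 0.25, 4: 0.95, 5: 1.37}

values = [average_measure[k] for k in sorted(average_measure)]
increases_with_category = all(a < b for a, b in zip(values, values[1:]))
print("Average measure increases with category:", increases_with_category)
```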

When MFRM is employed to compute measures, a second technique is to examine, across combinations of items, respondents, and judges, whether each rating scale category is “most likely” somewhere along the variable. When a rating scale category is never “most likely,” the rating scale should be considered for revision (e.g., improvement, removal of the category). In Figure 1, categories 1, 3, 4, and 5 each appear as the “most likely” category (the top trace of curves) for at least some portion of the graph from left to right. However, the lack of a “2” on the top trace suggests that the instrument might not be measuring as well as possible with this category. These data suggest that if additional data are collected, an analyst should monitor the use of category “2.” If the pattern persists, the category might be removed or its wording changed, and a second analysis should then be conducted to evaluate data quality. Measurement is optimized when each rating scale category is most probable at some point along the variable shown in Figure 1. When science education researchers use a rating scale involving items, judges, and respondents, it is important to verify that each rating scale category is “most probable” (observed in the top trace of curves) for at least some combinations of items and respondents.
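The “most likely category” check illustrated by Figure 1 can also be sketched numerically. The code below scans a grid of locations and records which category has the highest probability at each point, using the same rating-scale category probabilities as in the earlier sketch; the thresholds are hypothetical and are deliberately chosen so that category 2 is never modal, mimicking the pattern described above.

```python
import numpy as np

def category_probs(theta: float, thresholds) -> np.ndarray:
    """Rating-scale model category probabilities at location theta."""
    logits = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Hypothetical thresholds with F_3 < F_2 (disordered), which squeezes out
# category 2 so that it is never the most probable category.
thresholds = [-1.0, -1.2, 0.5, 2.0]

grid = np.linspace(-4.0, 4.0, 801)
modal_categories = sorted({int(np.argmax(category_probs(t, thresholds))) + 1
                           for t in grid})
print("Categories that are modal somewhere:", modal_categories)
# Prints [1, 3, 4, 5]: category 2 never reaches the top trace, the kind of
# pattern that would prompt monitoring or revising that category.
```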

We now focus on the topics of item targeting and construct validity. Figure 2 provides perhaps one of the most powerful aspects of Rasch analysis: the Wright map. In this map, the PSTs who served as judges are plotted in the left column of *s, the measures of each of the five teachers who taught lessons to be judged are plotted in the middle column, and the measures of the 23 ESTAM items are provided in the right column. Owing to the large number of judges, *s are used to plot the measures of judges. The Wright map not only plots linear measures, all on the same scale, but it also allows quick assessment of construct validity, assessment of judges, and a summary of the overall measure of the respondents, who are the teachers teaching on tape. Details about the content of Figure 2 are in the figure caption.


Figure 1. A FACETS plot which allows for an evaluation of the manner in which each of the rating scale steps is utilized. Measurement is optimized when each rating scale category is most probable in this plot. What is important for the science education researcher using a rating scale involving items, judges, and respondents is that each rating scale category (1, 2, 3, 4, 5 in this data set) should be observed as “most probable” for some portion of the horizontal axis (this means that each rating scale category is observed in the top trace of curves).

First, to understand the Wright map, readers should recall that the ordering and spacing of judges, respondents, and items are expressed on a linear scale. This means that one can think of items, for example, as markings on a meter stick. Given the distribution of ESTAM items, one can see a distribution of items from a low measure ER4 S-SInteract (easier to demonstrate constructivist teaching) to a higher measure, ER2 Focus Work, then a very high measure, ER5 Multiple Resources (harder to demonstrate constructivist teaching).

Careful examination of the plot of item measures from a measurement perspective reveals a gap between ER2 and ER5 and suggests that authoring new items to fill the gap between ER2 and ER5 could improve the ESTAM. Continuing a measurement perspective, a second observation reveals a number of items that mark the same portion of the trait (e.g., TVA1 (Body-Related Concepts), TVA4 Science Habits of Mind). Certainly, numerous issues can be considered when adding or removing items, but from a measurement perspective, those items with very similar measures are not optimal items. This means that one might best remove some items that are redundant from a measurement perspective. Then one can either have a shorter instrument or add new items to “fill the measurement gaps.” Since rating scales (as opposed to dichotomous scales) provide some added measurement power, such gaps may not be as terrible as gaps that are observed in a right/wrong test. However, when there are gaps, it may mean that some improvements are needed in the conceptualization of the variable, and/or improvements are needed in terms of the set of items presented to measure the trait.


Figure 2. A Wright map presenting the measures of judges (PSTs), items, and respondents (teachers on tape).

The ordering, spacing, and location of each of the three facets (items, judges, respondents) provide important information with respect to the trait measured and the functioning of the instrument. The ordering of judges provides guidance to the science educator, as does the ordering of items and the ordering of respondents. ER5 (top of the plot) is the ESTAM item for which it was hardest to demonstrate constructivist teaching; SA2, ER4, and A1 are the ESTAM items for which it was easiest to demonstrate constructivist teaching.

The location and pattern of traits are of great use to science educators who continue to try to understand science learning and help improve science teaching. The pattern from lower measures to higher measures suggests that the PSTs (judges) follow a distribution very close to a bell-shaped curve from lenient (lower logit measures) to harsh (higher logit measures).

In addition to the measures of the 23 ESTAM items, measures of the five teachers who were judged via video tape are included. These measures are the result of correcting for the nonlinearity of raw rating scales. The measures, in part, result from an understanding that not all survey items mark the trait in the same manner. Also, the respondent measures take into consideration that not all PSTs who acted as judges were equally severe or easy in using the ESTAM instrument as they evaluated the five teachers. Given the distribution of the five teachers (noted by the topic of the lesson they taught), we observed that the lessons marked a range on the metric. If all five teachers were at a similar measure, then the PSTs would have been given the task of rating very similar teachers. That might not be the best way to push the PSTs to evaluate teachers with differing ranges of constructivist teaching.

If a teacher were added to the activity, it might be good to add a teacher with a measure between the teacher who taught the Water Cycle lesson and the teacher who taught the Completing the Circuit lesson. Also, the greater the range of teachers being judged (just as is the case by presenting a greater variety of ESTAM items), the more certainty one can have regarding the severity of the PSTs as judges.

When conducting science education research, researchers should always place a high priority on using robust measurement. In addition to evaluating the functioning of an instrument, such an analysis also reveals research patterns in the data. As we considered how the ESTAM items mark the trait, the research team evaluated the ordering and spacing of items. This was done to learn if the ordering and spacing were predictable and made sense. Additionally, this review of item ordering and spacing was conducted to search for verification of theories that have been suggested to date. The same steps can be taken with the five rated respondents.

The measures computed for each of the five teachers represent a composite measure, using the ESTAM items, a correction for the use of an ordinal scale, and corrections for judge severity. Review of the teacher ordering resulting from an analysis of all the PST judges revealed that the Force and Motion teacher exhibited the highest level of constructivist teaching (using the 150 PST judges, the 23 ESTAM items, and correcting for differences in judge severity). The teacher presenting a lesson concerning the Water Cycle was the teacher who was judged by the PSTs as presenting the least constructivist teaching lesson.

Analysis of the Interplay of PST Judges, Respondents, ESTAM Items

The far left portion of the Wright map presents the 150 measures of the PSTs who used the ESTAM instrument to judge the five teachers. The most important issue is the very wide range in judge severity. The severity of judges, for the most part, is distributed very much in the pattern of a bell-shaped curve. The one exception to this is the tail at the high end of the judge severity distribution. From a science teacher education perspective, one of the most important results of the judge severity computation is that the judges (the PSTs) do exhibit a normal curve distribution of severity. This might in fact be predicted when using a sample of judges who have not been asked to act as robots.

Our goal herein is to stress the impact of the judges and the importance of correcting for judges’ severity. In light of this goal, we ask readers to note the logit range of the PST judges and the logit range of the respondents (teachers judged on tape). The range of the PST judges is approximately 3.5 logits, which is a little larger than the range of the respondents. This means that when a subset of judges is used, the “luck of the draw” will be highly influential in the computation of respondents’ measures. Therefore, one must correct for differences in judge severity before using data in which judges are utilized.

There is a nuance in the data set that we will briefly mention. In designing this study we wished to explore MFRM for science education. We also wanted to develop a study that would allow us to improve our science teacher preparation efforts. This study allowed us to review the ordering of items by difficulty, the overall measures of the five sample teachers, as well as review the range of judge severity exhibited by this sample of PSTs.

One advantage for us in the use of these PSTs was the prediction that there would be a range of judge severity, which provided us the opportunity to explore how MFRM could allow one to confront a spread of judge severity. It is certainly the case that the PSTs were novices with respect to their science teacher preparation. As a result, these judges do differ from the more traditional concept of a judge. As we consider extensions of the work we present here, one avenue of research is the collection of judge data from experienced science teachers who are asked to judge the same five sample teachers using the ESTAM. Such a second data set would allow many comparisons to be made, one of which is a comparison of the manner in which the PST judges used the ESTAM in comparison to the way in which more experienced teachers used the ESTAM.


CONCLUSION

Our goal is to present details concerning the importance, and promise, of multifaceted Rasch measurement techniques for the science education community. Judges are used to evaluate respondents using a set of items in many instances. Well-known examples are high-stakes observations of classroom science teachers in which a set of items is used for an overall (hoped for) measure. This article extends previous articles appearing in Science Education that have presented the basics of Rasch analysis of multiple-choice tests, partial credit data sets, and rating scale data sets.

The steps of conceptualizing a variable, correcting for the nonlinearity of rating scales, and evaluating the fit of items and persons are similar to steps taken when conducting a Rasch analysis of a multiple-choice test or a rating scale. In this paper, the Rasch model is extended to the case in which there are items, judges, and respondents. By understanding that not all judges act as robots, one can understand the critical importance of correcting for judge severity. From a measurement perspective, it is important to have a range of item difficulty on a test. If all items function at the same difficulty level, then the test functions as a one-item test. This assertion also applies to judges. It is better to have judges with differing severity than to have all judges act in the same manner.

For the science educator designing data collection plans, the ability to design a data collection plan in which not all judges evaluate all respondents (missing data are not a problem for Rasch analysis) will not only provide cost savings but, we believe, will open the door to increased use of judges for the collection of science education data. Assigning a specific mix of judges to evaluate respondents is a commonly used technique in medical board certification, and this technique should be extended to the use of MFRM in science education research. We believe the ability to use a mix of judges (who do not have to evaluate all respondents, and who in fact do not have to use all items) will open the door to more sophisticated research projects in science education that were not possible until the development of MFRM techniques. Time and money can be saved, judges need not be worn out, and more items can be administered to respondents.

REFERENCES

Abell, S. K., Appleton, K., & Hanuscin, D. L. (2010). Designing and teaching the elementary science methods course. New York, NY: Routledge.

Adams, P. E., & Krockover, G. H. (1997). Beginning science teacher cognition and its origins in the preservice secondary science teacher program. Journal of Research in Science Teaching, 34(6), 633 – 653.

Alake-Tuenter, E., Biemans, H. J., Tobi, H., Wals, A. E., Oosterheert, I., & Mulder, M. (2012). Inquiry-based science education competencies of primary school teachers: A literature study and critical review of the American National Science Education Standards. International Journal of Science Education, 34(17), 2609 – 2640.

Beyer, C. J., & Davis, E. A. (2012). Learning to critique and adapt science curriculum materials: Examining the development of preservice elementary teachers’ pedagogical content knowledge. Science Education, 96(1), 130 – 157.

Black, K. (2004). Science in the trenches: An exploration of four pre-service teachers’ first attempts at teaching science in the classroom. International Journal of Science and Mathematics Education, 2(1), 25 – 44.

Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Erlbaum.

Boone, W. J., & Scantlebury, K. (2006). The role of Rasch analysis when conducting science education research utilizing multiple-choice tests. Science Education, 90, 253 – 269.

Boone, W. J., Staver, J., & Yale, M. (2014). Rasch analysis in the human sciences. Dordrecht, The Netherlands: Springer.

Boone, W., Townsend, S., & Staver, J. (2011). Using Rasch theory to guide the practice of survey development and survey data analysis in science education and to inform science reform efforts: An exemplar utilizing STEBI self-efficacy data. Science Education, 95(2), 258 – 290.


Burry-Stock, J. A., & Oxford, R. L. (1994). Expert science teaching educational evaluation model (ESTEEM): Measuring excellence in science teaching professional development. Journal of Personnel Evaluation in Education, 8(3), 267 – 297.

Chiappetta, E. L., & Fillman, D. A. (2007). Analysis of five high school biology textbooks used in the United States for inclusion of the nature of science. International Journal of Science Education, 29(15), 1847 – 1868.

Eggert, S., & Bögeholz, S. (2010). Students’ use of decision-making strategies with regard to socioscientific issues: An application of the Rasch partial credit model. Science Education, 94(2), 230 – 258.

Endler, L. C., & Bond, T. R. (2000). Cognitive development in a secondary science setting. Research in Science Education, 30(4), 403 – 416.

Fairbrother, R. W. (1975). The reliability of teachers’ judgment of the abilities being tested by multiple choice items. Educational Research, 17(3), 202 – 210.

Gallagher, J., & Lindsey, S. (1997). Elementary science teaching analysis matrix. Unpublished manuscript.

Gallagher, J., & Parker, J. (1995). Secondary science teacher analysis matrix (SSTAM). East Lansing: Michigan State University.

Gunning, A. M., & Mensah, F. M. (2011). Preservice elementary teachers’ development of self-efficacy and confidence to teach science: A case study. Journal of Science Teacher Education, 22(2), 171 – 185.

Hechter, R. P. (2011). Changes in preservice elementary teachers’ personal science teaching efficacy and science teaching outcome expectancies: The influence of context. Journal of Science Teacher Education, 22(2), 187 – 202.

Jarvis, T., McKeon, F., Coates, D., & Vause, J. (2001). Beyond generic mentoring: Helping trainee teachers to teach primary science. Research in Science and Technology Education, 19(1), 5 – 23.

Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130 – 144.

Kim, M., & Tan, A. L. (2011). Rethinking difficulties of teaching inquiry-based practical work: Stories from elementary pre-service teachers. International Journal of Science Education, 33(4), 465 – 486.

Lee, J. (2004). Evaluating the effectiveness of instructional resource allocation and use: IRT and HLM analysis of NAEP teacher survey and student assessment data. Studies in Educational Evaluation, 30(2), 175 – 199.

Linacre, J. M. (1989). Many faceted Rasch measurement. Chicago, IL: University of Chicago Press.

Linacre, J. M. (2014). Facets (version 3.67.0) [Rasch measurement computer software]. Chicago, IL: Winsteps.com.

Liu, X. (2010). Using and developing measurement instruments in science education: A Rasch modeling approach. Charlotte, NC: Information Age.

Luft, J. A. (1999). Assessing science teachers as they implement inquiry lessons: The extended inquiry observational rubric. Science Educator, 8(1), 9 – 18.

Marble, S. T. (2006). Learning to teach through lesson study. Action in Teacher Education, 28(3), 86 – 96.

Marble, S. T. (2007). Inquiring into teaching: Lesson study in elementary science methods. Journal of Science Teacher Education, 18(6), 935 – 953.

Marshall, J. C., Smart, J., & Horton, R. M. (2010). The design and validation of EQUIP: An instrument to assess inquiry-based instruction. International Journal of Science and Mathematics Education, 8(2), 299 – 321.

Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, CA: Sage.

Neumann, I., Neumann, K., & Nehm, R. (2011). Evaluating instrument quality in science education: Rasch-based analyses of a nature of science test. International Journal of Science Education, 33(10), 1373 – 1405.

Nilsson, P., & Loughran, J. (2012). Exploring the development of pre-service science elementary teachers’ pedagogical content knowledge. Journal of Science Teacher Education, 23(7), 699 – 721.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Ravid, R. (2010). Practical statistics for educators. Lanham, MD: Rowman & Littlefield.

Rijlaarsdam, G., Couzijn, M., Janssen, T., Braaksma, M., & Kieft, M. (2006). Writing experiment manuals in science education: The impact of writing, genre, and audience. International Journal of Science Education, 28(2 – 3), 203 – 233.

Shaw, K. R. M., Van Horne, K., Zhang, H., & Boughman, J. (2008). Essay contest reveals misconceptions of high school students in genetics content. Genetics, 178(3), 1157 – 1168.

Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33.

Stewart-Brown, S., Tennant, A., Tennant, R., Platt, S., Parkinson, J., & Weich, S. (2009). Internal construct validity of the Warwick – Edinburgh mental well-being scale (WEMWBS): A Rasch analysis using data from the Scottish health education population survey. Health and Quality of Life Outcomes, 7(1), 15 – 22.


Sungur, S., Tekkaya, C., & Geban, Ö. (2006). Improving achievement through problem-based learning. Journal of Biological Education, 40(4), 155 – 160.

Thomas, G., Anderson, D., & Nashon, S. (2008). Development of an instrument designed to investigate elements of science students’ metacognition, self-efficacy and learning processes: The SEMLI-S. International Journal of Science Education, 30(13), 1701 – 1724.

Van Boxtel, C., Van der Linden, J., & Kanselaar, G. (2000). Collaborative learning tasks and the elaboration of conceptual knowledge. Learning and Instruction, 10(4), 311 – 330.

Wang, W. C., & Chen, C. T. (2005). Item parameter recovery, standard error estimates, and fit statistics of the WINSTEPS program for the family of Rasch models. Educational and Psychological Measurement, 65(3), 376 – 404.

Wilson, M. (2008). Cognitive diagnosis using item response models. Zeitschrift für Psychologie/Journal of Psychology, 216(2), 74 – 88.

Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: Introducing item response modeling. Health Education Research, 21(suppl 1), i4 – i18.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.

