Submitted in Partial Fulfillment of the Requirements for the Degree of Sarjana Pendidikan (S.Pd) in Teaching English


Siti Ghoniyya








Ghoniyya, Siti. (2016). “The Intra-Rater Reliability Consistency of English Teachers at Al-Amin Islamic Boarding School Mojokerto in Scoring Essay Test”. A thesis. English Teacher Education Department, Faculty of Education and Teacher Training, Sunan Ampel State Islamic University, Surabaya. Advisor: Sigit Pramono Jati.

Keywords: intra-rater reliability, consistency, scoring essay test

This study explores the consistency of English teachers’ intra-rater reliability in scoring an essay test. By quantitatively analyzing the teachers’ pre- and post-scoring, carried out two months apart, this study showed that the intra-rater reliability of the English raters at Al-Amin Islamic Boarding School Mojokerto in scoring the essay test was consistent. This was shown by the Cronbach’s alpha coefficient computed in SPSS 23 as the descriptive statistical analysis and by the paired t-test as the inferential statistical analysis. The alpha coefficient of the intraclass correlation for the average scores was 0.942. It can be said that the English teachers, as the raters of the essay test in this study, had good reliability, since the coefficient has to be above 0.7 for raters to be considered internally consistent. In addition, the paired t-test showed that the t value exceeded the t-table value (5%, N-1), 3.541 > 2.776, and that Sig. = 0.02 was below 0.05, the significance level of this study. These results met the conditions for rejecting the null hypothesis. They also indicated that the result of the descriptive statistics was a real result rather than one that occurred by chance.



Ghoniyya, Siti. (2016). “The Intra-Rater Reliability Consistency of English Teachers at Al-Amin Islamic Boarding School Mojokerto in Scoring Essay Test”. Undergraduate thesis (skripsi). English Education Study Program, Faculty of Tarbiyah and Teacher Training, Universitas Islam Negeri Sunan Ampel, Surabaya. Advisor: Sigit Pramono Jati.

Keywords: intra-rater reliability, consistency, scoring essay test

This research examines the consistency of the self-reliability (intra-rater reliability) of English teachers in scoring students’ essays. By quantitatively analyzing two scores (pre- and post-) given by the English teachers to the same essay, with a two-month interval between scorings, this research shows that the intra-rater reliability of the English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring students’ essays is consistent. This is shown by the Cronbach’s alpha coefficient in SPSS 23 as the descriptive statistic and by the paired t-test as the inferential statistical analysis. The alpha coefficient of the intraclass correlation for the average scores is 0.942. It can be said that the English teachers, as scorers of the students’ essays, have very good reliability, given that the coefficient must exceed 0.7 for a teacher to be considered sufficiently consistent in scoring students’ essays. In addition, the paired t-test analysis shows that the t value exceeds the t-table value (5%, N-1), 3.541 > 2.776, and that Sig. = 0.02 is less than 0.05, the significance level of this research. These calculations meet the requirements for rejecting the null hypothesis. Furthermore, the paired t-test result indicates that the result of the descriptive statistics is a real result and not due to chance.







A. Background of the Study

Today, ESL teachers or tutors (English language program teachers) increasingly choose performance-based assessment to evaluate students’ language ability, using tasks such as oral production, written production, open-ended responses, integrated performance (across skill areas), group performance and other interactive tasks.1 It can be called direct assessment since it provides more direct evidence of the meaningful application of knowledge and skill.




There are many kinds of writing test, such as the paragraph construction test, the short-answer and sentence completion test, the picture-cued test, the essay test, etc. This study focuses on the essay test as the most appropriate test to measure students’ critical thinking and conscious mental processes. The essay test is a form of written production evaluation that requires the student to structure a rather long written response of up to several paragraphs. Its scoring requires expert judgment rather than the application of a clerical key.2 This test is commonly chosen to evaluate students’ writing ability because of its strengths: it can test complex learning outcomes in writing, it poses a more realistic task, and it requires students to use their own writing skill.

Although constructed-response essay items ask students to produce samples of normative language, which makes such items a more valid measurement of communicative writing ability, the process of scoring essay items is quite complex. Here human raters face a new challenge in defending the reliability and construct validity of test scores, as each rater has different perceptions of performance and different tendencies toward leniency or severity. This creates a high possibility of subjectivity, as raters have their own personal scoring criteria.




Subjectivity in scoring a test can cast doubt on the stability of the estimation. As explained above, not only subjectivity but also human error and bias during the scoring process can produce inconsistent grades for each student. This inconsistency lowers rater reliability, which in turn reduces the accuracy of the assessment. In evaluating writing ability, rater reliability is difficult to achieve since writing proficiency involves numerous traits that are hard to define.3 This agrees with James Dean Brown’s statement that the very difficulty of measuring mental traits explains why consistency is of particular concern to language testers.4 It means that all English teachers and testers need to be aware of scoring consistency, since it determines the objectivity of the “real” score that each student should get.

Rater consistency in scoring an essay test can be measured in two ways, inter-rater and intra-rater reliability. Inter-rater reliability is estimated by looking at the scores produced by two raters and calculating a correlation coefficient between the two sets of scores.5 Many researchers abroad have preferred this type for evaluating rater stability, as objectivity is more visible when comparing scores produced by two or more raters. However, English teachers in Indonesia are not accustomed to teaching and assessing their students’ tests in pairs.

Intra-rater reliability therefore receives the main attention here. Intra-rater unreliability is a common occurrence for classroom teachers, such as teachers in Indonesia, because of unclear scoring criteria, fatigue, bias toward particular “good” and “bad” students, or simple carelessness.6 This type of reliability allows us to evaluate an individual rater by comparing two scores produced by the same rater for the same essay at different times (pre- and post-score). Unfortunately, Indonesian English teachers pay little attention to evaluating their own consistency in scoring their students’ tests, especially essay tests, which are subjective tests in which raters may judge subjectively during the scoring process. This is suggested by the lack of available studies that examine rater consistency using this method.

The researcher has surveyed theses on language testing, especially those written by students of the English Teacher Education Department of UIN Sunan Ampel Surabaya. Most of them concentrate on assessment types, such as paper project assessment, self-assessment, formative assessment, peer assessment and other assessments. A minority have analyzed the validity of tests, such as content validity, face validity, index of difficulty and item discrimination. There is one study about essay tests, by Ita Faradillah, entitled “An Analysis of Essay Test on English Final Test for Grade Eleven Students of SMAN 1 Lamongan”, but it only analyzed the content validity, item discrimination and index of difficulty of the essay test.7 The researcher did not find any studies about reliability, especially intra-rater reliability (internal rater consistency) in scoring essay tests, which plays a crucial part as explained above. Therefore, the researcher wants to examine this topic.


Al-Amin Islamic Boarding School is the pioneer of bilingual Islamic boarding schools in Mojokerto, which means that it serves as a model for newer bilingual schools. As it is a bilingual school, the students have to speak English or Arabic in their daily conversation. In addition, they attend an additional language class in the evening to deepen their language skills, one of which is writing.

Language teachers certainly play an important part here. They have to master the language very well in order to teach and test the students. Mastery of the language here means that an English teacher must have a college degree and a national certificate in language. It can be said that they are professional enough to be good teachers, as acknowledged by a college in the form of a certificate, and that they have good validity and reliability as language testers. Because the number of language teachers is limited, each teacher is in charge of only one class. They teach, test and grade their students by themselves, especially when examining students’ writing skill, for example through an essay test. Therefore, it is very important to know whether the English teachers’ intra-rater reliability in scoring essay tests is consistent or not, in order to understand how objective they are in assessing a subjective test.

B. Problem of the Study

Based on the background of the problem, the research question of this study can be formulated as follows:

- Is the intra-rater reliability of English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring essay test consistent?

C. Objective of the Study

In this study, the researcher wants to know whether the intra-rater reliability of English teachers in scoring essay test is consistent or not.

D. Hypothesis of the Study



E. Significance of the Study

The findings of this study are expected to raise English teachers’ awareness of the importance of intra-rater reliability consistency in writing assessment, especially in essay tests. They need to know whether they themselves are consistent or not in evaluating their students’ writing skill. This awareness will make them more objective in assessing a subjective test. As a result, English teachers will be more careful in scoring essay tests so that students receive their “real” score and proper feedback.

F. Scope and Limitation of the Study

The scope of this study is to determine whether the intra-rater reliability of English teachers in scoring the daily essay test of grade eleven at Al-Amin Islamic Boarding Senior High School Mojokerto is consistent or not. This research will not examine their reliability by comparing scores across teachers (inter-rater reliability), nor any other kind of reliability. It is therefore limited to measuring the English teachers’ level of intra-rater reliability.

The second limitation of this research is that the researcher conducted the

research only once. Once means the research only analyzed their consistency



of this assessment, it has possibility that the result will be changed in different


G. Definition of Key Terms

1. Consistency

According to the Oxford Dictionary, consistency is the quality of being consistent. Consistent means behaving in the same way, or having the same opinions, standards, etc.8 In short, consistency is the quality of having the same opinion of something.

In this study, consistency refers to the investigation of how consistent English language teachers are in scoring an essay test, obtained from their intra-rater reliability level in a statistical analysis of pre- and post-scoring.

2. Intra-rater Reliability

Intra-rater reliability is one kind of rater reliability. It is typically estimated by getting two sets of scores produced by the same rater for the same group of students (say a rater scores one group’s set of compositions on two successive occasions about two weeks apart) and calculating a correlation coefficient between those two sets of scores.9
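The calculation described in this definition can be sketched in Python as follows. This is only a minimal illustration: the function name `pearson_r` and the example scores are hypothetical and are not data from this study, which used SPSS 23 for its analysis.

```python
from math import sqrt


def pearson_r(first_scores, second_scores):
    """Pearson correlation between two sets of scores from the same rater."""
    if len(first_scores) != len(second_scores) or len(first_scores) < 2:
        raise ValueError("both occasions need the same number (>= 2) of scores")
    n = len(first_scores)
    mean_x = sum(first_scores) / n
    mean_y = sum(second_scores) / n
    # Sums of cross-products and of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(first_scores, second_scores))
    sxx = sum((x - mean_x) ** 2 for x in first_scores)
    syy = sum((y - mean_y) ** 2 for y in second_scores)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five compositions rated on two occasions
occasion_1 = [70, 75, 80, 65, 90]
occasion_2 = [72, 78, 81, 69, 95]
r = pearson_r(occasion_1, occasion_2)
```

A value of `r` close to 1 would suggest that the rater ranked the compositions in nearly the same way on both occasions.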

8 “Consistency” and “Consistent”, Oxford Dictionary, 4th Ed. (UK: Oxford University Press, 2008), 91.




Intra-rater reliability in this study is used to determine English teachers’ self-consistency in scoring an essay test. The Cronbach’s alpha coefficient is the most appropriate formula to establish this reliability. The average result has to be above 0.7 for the raters to be considered internally consistent and reliable. In addition, a reliability level is reported to give detailed information on each rater’s result.

3. Intra-rater Reliability Consistency

In this research, intra-rater reliability consistency means the English teachers’ self-consistency as raters of an essay test. The scores they produce in pre- and post-scoring will be examined to decide whether they are consistent or not. As this study investigates the consistency of intra-rater reliability, the calculation uses the intraclass reliability estimation in SPSS 23 to analyze the scores produced by the raters.

4. Scoring Essay Test

Scoring itself is defined as gaining marks in a test or exam.10 An essay test is one or more essay questions administered to a group of students under standard conditions for the primary purpose of collecting evaluation data.11 In this research, scoring an essay test means grading students’ essays as

10 “Score”, Oxford Dictionary, 4th Ed. ….. 393.




their daily writing evaluation. The kind of essay test used in this study is an exposition essay about a current issue. This essay is chosen as the English




This chapter presents theories from the literature related to the topic of the study. It also reviews previous studies to show the differences between this research and research done by others.

A. Review of Previous Study

Before conducting this study, the researcher read some previous studies on related topics. The first is the thesis by Ita Faradillah entitled An Analysis of Essay Test on English Final Test for Grade Eleven Students of SMAN 1 Lamongan.1 That study investigated the content validity, index of difficulty and item discrimination of an essay test given to two different classes, XIA5 and XIA6. The result showed that the essay test had good content validity, as shown by 80% concurrence. It means that most of the test items represented the material taught in grade eleven, although some items fell outside it. The index of difficulty and item discrimination of the essay test, however, produced different results in the two classes. They were acceptable only in XIA5 and were rejected in XIA6, where the results were around 0.1-1.0 for the index of difficulty and 0.00-0.19 for the index of discrimination, which means that the items were too easy or too difficult. Therefore, the essay test should be revised for XIA6 because it could not discriminate students’ achievement properly.

However, that study did not measure the level of raters’ consistency in scoring the essay test. A good test must meet the requirements of both validity and reliability, and that study investigated only validity, not reliability. In this study, therefore, the researcher wants to examine reliability, viewed from the raters’ consistency, to complement the previous study.

Second, Classroom Writing Teacher’s Intra- and Inter-rater Reliability: Does It Matter by Viphavee Vongpumivitch2 from National Tsing Hua University investigated the reliability of each scale in an analytic rating scale for essay tests and the teachers’ consistency in scoring essay tests using the scale. The result showed that the teachers had very low intra- and inter-rater reliability; that each scale of the analytic rating scale had low correlation, especially content and organization; and that each teacher understood the scale’s criteria differently depending on their experience, personality, and personal agenda. That study measured the inter- and intra-rater reliability of each rater thoroughly in using rubrics (the scale’s criteria) for assessing writing.




The next is Rater Discrepancy in the Spanish University Entrance Examination by Marian Amengual Pirazzo3 from University of Balearic Island. This study reported that there were no significant differences between the holistic pre- and post-scores, but that there were important differences in the raters’ behavior regarding scoring consistency. In short, the intra-rater reliability was quite high, with some exceptions related to factors such as the raters’ condition while scoring.

The last two studies examined the intra-rater reliability of scorers in scoring essay tests. They measured raters who used both holistic and authentic assessment. As this is the first research on intra-rater reliability in Indonesia, especially in the English Education Department of UIN Sunan Ampel Surabaya, it will measure intra-rater reliability in general. What makes this research distinctive is its subject: the English teachers of Al-Amin Islamic Boarding Senior High School Mojokerto, where English learning focuses on the students’ language skills, especially writing.

Another previous study is Reliability and Validity of Rubrics for Assessment through Writing by Ali Reza Rezaei from California State University and Michael Lovorn from The University of Alabama, USA.4 This study set out to investigate the reliability and validity of rubrics in assessing students’ writing. The results showed that rubrics may not improve the reliability


Marian Amengual Pirazzo. Rater Discrepancy in the Spanish University Entrance Examination. Journal of English Studies University of Balearic Island Vol.4 page 23-26, 2003-2004.




and validity of assessment if raters are not well trained in how to use and implement the rubrics. This result rejected the writers’ hypothesis that the rubric would improve the raters’ reliability and validity because it was more analytic. The evidence was that the raters were influenced more by mechanical characteristics than by content, even though they used a rubric.

Although the present study uses a rubric for scoring the essay test, its focus is not the rubric’s impact on the raters’ assessment. This research focuses only on the consistency of the raters’ intra-rater reliability in scoring the essay test based on all categories in the rubric, such as Content, Organization, Grammar, Vocabulary and Mechanics.

B. Theoretical Background

1. Understanding of Consistency

According to several dictionaries, consistency is the ability to remain the same in behavior, attitude or qualities. It means that consistency indicates how stable someone is in doing something. In this case, analyzing the consistency of English language teachers in scoring an essay test means measuring the quality of the human raters, that is, the English language teachers, in rating their students’ essay tests.



interpreted as the percent of systematic, or consistent, or reliable variance in the scores on a test.5

According to James Dean Brown, a reliability coefficient differs from a correlation coefficient in that it can only go as low as 0.00, because a test cannot logically have less than zero reliability.6 Therefore, if the reliability of a test comes out negative, he suggests that researchers first check their calculation for errors. If the calculation is right, they should round the negative result up to 0.00 and accept that the results of the test have zero reliability, which means they are totally unreliable or inconsistent.

2. Understanding of Intra-rater Reliability

Reliability is one way to measure the quality of a good test. To be reliable, a test must be consistent in its measurement. In other words, a test score has to be free of measurement error.7 For example, when a teacher gives the same test to the same students on two different occasions, the test should produce the same result. If the results differ, there are many possible factors: the difference may come from the students, the examiner or rater, the condition when


James Dean Brown, Testing in Language Program….. 175.


Ibid., 175.




the test happens (test administration) or the test itself (the length of the questions, the paper used, etc.).

Student-related reliability concerns factors affecting reliability that come from the test takers, that is, the students. Such factors include temporary illness, a “bad day”, anxiety, and other internal physical and psychological factors.8 Test administration reliability concerns the conditions during the testing process, such as street noise that prevents students sitting beside the window from hearing the tape recorder clearly in a listening test, photocopying problems, the amount of light in different parts of the room, temperature problems or the condition of desks and chairs. Another factor that can affect reliability comes from the test itself, for example when the length of the test is not balanced with the time allotted.

The last factors that may influence reliability come from the examiner or rater of the test. There are two kinds of rater reliability. The first is inter-rater reliability, which applies when two or more scorers examine the same test. If they produce different scores, the test has low reliability or is unreliable. The second is intra-rater reliability, which concerns not two or more scorers but a single scorer examining the test individually. This type carries a greater possibility of subjectivity because of unclear scoring criteria, fatigue, bias, simple carelessness or sympathy toward students.




Rater reliability, especially intra-rater reliability, is a form of repeated-measurement reliability as conceptualized in quantitative research. It has to do with the ability to measure the same thing at different times, known as the test-retest method. Using this method, this study investigates how strong the relationship is between the scores at the two time points, which here indicates consistency. The intra-rater reliability coefficient can be derived from the average or the sum of the two sets of scores in the decision-making process. Cronbach’s alpha is the most appropriate formula for calculating this coefficient. It is the easiest formula in the split-half reliability method. The result is expressed as a decimal and shows the reliability level of each teacher. In addition, it helps to give the final result on whether the intra-rater reliability of English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring the essay test is consistent or not.
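To make the coefficient concrete, the sketch below computes Cronbach's alpha by hand from the standard formula, treating each scoring occasion (pre- and post-) as one "item" and each essay as one case. It is an illustration under stated assumptions rather than the procedure of this study, which relied on SPSS 23; the function name `cronbach_alpha` and the scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(*occasions):
    """Cronbach's alpha for k scoring occasions of the same essays.

    alpha = k / (k - 1) * (1 - sum of occasion variances / variance of totals)
    """
    k = len(occasions)
    if k < 2:
        raise ValueError("at least two scoring occasions are required")
    n = len(occasions[0])
    if n < 2 or any(len(o) != n for o in occasions):
        raise ValueError("every occasion must score the same (>= 2) essays")
    # Sample variance of the scores within each occasion
    sum_of_variances = sum(variance(o) for o in occasions)
    # Sample variance of each essay's total score across occasions
    totals = [sum(scores) for scores in zip(*occasions)]
    return k / (k - 1) * (1 - sum_of_variances / variance(totals))


# Hypothetical pre- and post-scores given by one teacher to five essays
pre_scores = [70, 75, 80, 65, 90]
post_scores = [72, 78, 81, 69, 95]
alpha = cronbach_alpha(pre_scores, post_scores)
consistent = alpha > 0.7  # threshold used in this study
```

Note that alpha reflects how similarly the essays are ordered and spaced on the two occasions; a rater who adds the same number of points to every essay on the second occasion would still obtain an alpha of 1.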

One important question in measuring intra-rater reliability is how much time should pass before post-scoring. This is very difficult to answer, as every study on this topic uses a different time interval. For example, the journal article entitled Writing Teacher’s Intra- and Inter-rater Reliability: Does It Matter by Viphavee Vongpumivitch9 from National Tsing Hua University had a one-week interval between the pre- and post-training stages, whereas another study entitled Rater Discrepancy in the Spanish University Entrance Examination by Marian Amengual Pirazzo10 from University of Balearic Island set a three-month interval between pre- and post-scoring. If the time interval is too short, the raters may remember how they scored last time and simply give the same score because of this. This is called the carryover effect and can lead to overestimating the reliability of the test.11 Conversely, if the interval is too long, the raters’ opinions may have genuinely changed. One to two weeks is often recommended as an optimal time, though the risk of some carryover effect remains.12 To reduce the risk of carryover, this study used a two-month interval, as it was neither as short as one week nor as long as three months.

Unreliability or inconsistency is clearly a problem. An inconsistent rater leads to an unreliable test, which affects the scores produced. The score may be influenced by many factors and so become an unreal grade, meaning that it does not represent the students’ real condition. Students will then not get appropriate feedback and marks based on their true ability.

This research focuses only on the intra-rater reliability of the test rater. It is the most suitable kind of reliability to study given the situation of English teachers in Indonesia, who teach their classes individually. Therefore, the researcher wants to know the consistency of English teachers in scoring the essay


Marian Amengual Pirazzo. Rater Discrepancy in the Spanish University Entrance Examination. Journal of English Studies University of Balearic Island Vol.4 page 23-26, 2003-2004.


Daniel Mujis. Doing Quantitative Research…. 72. 12



test, as a subjective test, in grade eleven of Al-Amin Islamic Boarding Senior High School Mojokerto.

3. Understanding of Essay Test

According to Coffman, an essay test is one or more questions administered to a group of students under standard conditions for the primary purpose of collecting evaluation data.13 Essay items are useful when teachers are interested in learning how students arrive at an answer, as they do not ask students to choose one of several responses, as an objective test does, but to share their ideas in their own words. In this test type, students decide how to approach the problem, how to set it up, what factual information or opinion to use, and how specifically to express their answer.

Based on Stalnaker’s definition, an essay test should meet the following criteria:14

- Requires examinees to compose rather than select their response.

- Elicits student responses that must consist of more than one sentence.

- Allows different or original responses or patterns of responses.

- Requires subjective judgment by a competent specialist to judge the accuracy and quality of responses.


W. E. Coffman, Essay Examination.….. 271. 14



The essay test is the most appropriate instrument to measure students’ cognitive skill because it explores students’ critical thinking and conscious mental processes. Age influences human cognition, which develops rapidly throughout the first sixteen years of life and less rapidly thereafter.15 Therefore, this research concentrates on grade eleven students as the participants whose essay tests are scored, since the students in the class are on average sixteen to seventeen years old. In addition, the essay material is focused on grade eleven, so the researcher is confident that all students have received the material and will do their best on this test.

Nowadays, most English teachers have increasingly turned to the essay test. They have several motives for choosing it over the multiple-choice (MC) test, even though MC tests are easier to assess because they are a kind of objective test. The reasons are:16

- Assess students’ higher-order or critical thinking skills, meaning that this test can measure complex learning outcomes that cannot be effectively assessed by other assessment procedures.

- Evaluate students’ thinking and reasoning, meaning that this test can examine thought processes through how the students select, organize, and evaluate facts, ideas, etc.


H. Douglas Brown, Principles of Language Learning and Teaching Fourth Edition (San Francisco State University: Longman Inc., 2000), 61.




- Provide an authentic experience, meaning that this test can assess the students’ ability to construct solutions and decisions.

- Require students to use their own writing skill;17 the students select their own words, sentences and paragraphs and apply correct grammar and spelling.

Besides these strengths, the essay test has many weaknesses, such as a lack of validity and reliability, unpredictable results, difficulty of assessment and the longer time needed for marking. The main problem of this test appears in examiner reliability. Testers face many difficulties in deciding the score of each essay test. Even when there is a rubric containing scale criteria, subjectivity remains during the scoring process. In addition, English teachers in Indonesia are not aware of their objectivity in scoring the essay test as a subjective test. They score the test individually without considering whether their scores would remain stable if they scored on a different occasion. Therefore, there is a strong possibility that their assessment of each essay test will change depending on the raters’ internal or external situation when they score the essay (low intra-rater reliability).

There are two types of essay test, extended and restricted response questions.18 They are distinguished by the freedom given over content and form. The extended type allows students to decide the content and the format freely, while


William E. Chasin. Idea Paper No.17: Improving….1.




the restricted type limits students in choosing both. Most writers agree that this type is the most appropriate form when teachers wish to test content. This study uses a restricted response question in the form of an exposition essay, as this essay test is one of the daily examinations held to determine students’ achievement in specific material based on the curriculum. Therefore, the teachers can examine each student’s writing ability clearly.

4. Understanding of Scoring Essay Test

As the essay test is a subjective test that involves complex judgment, its assessment has received considerable attention, especially regarding human raters. Although many applications offer automated scoring of essay tests, humans still play a large part in this assessment, as they can understand both the content and the quality of writing. Some of the strengths of scoring by human raters are that they can (a) cognitively process the information given in a text, (b) connect it with their prior knowledge, (c) make a judgment on the quality of the text based on their understanding of the content, and (d) recognize and appreciate students’ creativity and style.19

Besides these strengths, human scoring has limitations. One weakness is that it requires good-quality human raters who have been instructed in how to use the scoring rubric, so they must be monitored continuously.20 In addition, they can make mistakes owing to cognitive limitations, which are difficult to quantify and cause systematic bias in the scores.21 This bias can lower the validity and reliability of the essay test. Therefore, it is very important to check them promptly. As there has already been a study examining the content validity, index of difficulty and item discrimination of an essay test, the researcher will examine another aspect of the essay test. This study will measure the reliability of the essay test as observed from the individual grader of the test, that is, intra-rater reliability.

There are two tools that testers can choose from in scoring an essay test, holistic (global) and analytic (point-score) assessment. The main difference between them lies in how they evaluate and describe performance. Analytic assessment allows for separate evaluation of each factor (e.g., persuasive argument and grammar in writing), and a description of what is expected at each score level is provided. In short, analytic assessment is characterized by specific scale criteria which determine how much of each maximum subtotal the student’s answer is judged to have earned.22

I. I. Bejar, (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy & Practice, 18(3), 319




criteria set for evaluation of different factors. In addition, it supports broader judgments concerning the quality of the process or the product. Simply, this tool is indicated by a whole evaluation which makes an overall judgment about how successfully the students have covered everything that was expected in the answer and assigns the paper to a grade.




This chapter presents the procedure for conducting the research. It covers the research design, the population and sample of the research, the setting of the research (the time and place of the study), the data and source of data, the technique of data collection and research instrument, and the data analysis technique.

A. Research Design

The research design is the overall plan or structure used to answer the research question.1 The design serves as a guide for the researcher in collecting and analyzing the data. This research used a quantitative design because numerical data were needed to measure the English teachers’ intra-rater reliability, and these data were analyzed mathematically.

“Quantitative research is explaining phenomena by collecting numerical data that are analyzed using mathematically based methods (in particular statistics).”2

Statistics is the mathematical method used to analyze data in this design. The statistical procedure used in this study was Descriptive Statistics, which helps the researcher to organize, summarize and describe observations.3 This means that after collecting the numerical data of the English teachers’ pre- and post-scoring of the essay test, the researcher analyzed the data statistically and then interpreted and summarized them descriptively. In addition, Inferential Statistics was used in this study to generalize findings to the entire population from which the sample was drawn.4 With this kind of statistic, the research aimed to show whether the result of the descriptive statistics was a real result or only incidental.

Phyllis Tharenou, et.al., Ross Donohue, Brian Cooper. Management Research Methods (New York: Cambridge University Press, 2007), 16.

B. Subject of the Research

1. Population

A population is defined as all members of any well-defined class of people, events or objects about which the generalization is made.5 The population of this research was all English teachers at Al-Amin Islamic Boarding School Mojokerto. There are eight teachers with the same qualifications as English teachers at Al-Amin Islamic Boarding School: each has a college degree or a national certificate in language.


Donald Ary, et.al., Lucy Cheser Jacobs, Christin K. Sorensen. Introduction to Research in Education Eight Edition (USA: Wadsworth Cengage Learning, 2010), 101.


Ibid., 148.




2. Sample

The sample is a smaller group or subset of the total population, chosen so that the knowledge gained is representative of the population under study.6 This study used purposive sampling as the sampling method. Purposive sampling, or judgment sampling, is a sampling method in which the sample elements are judged to be typical, or representative, of the population.7 The sample of the study was six English teachers, because the other two teachers had less experience in scoring essay tests. The six teachers were asked to take part because they have the same understanding of how essay tests are scored at Al-Amin Islamic Boarding School. As a new study on intra-rater reliability consistency, this study aimed to keep the conditions of data collection as close as possible to the way the English teachers usually score essay tests.

C. Setting of the Research

1. Place

The place of the study is the location where the researcher carried out the research activity. This study took place in grade eleven of Al-Amin Islamic Boarding Senior High School Mojokerto, which is located at RA. Basuni Street No. 18, Sooko, Mojokerto, East Java 61361.


Louis Cohen, et.al., Lawrence Manion, Keith Morrison. Research Methods in Education Sixth Edition (London: Routledge, 2007), 100.




2. Time

The time of the research is the period from the start of the study until its end, especially for collecting data. This research ran from February 20th, 2016 to April 23rd, 2016. All teachers took part in two scheduled data collection sessions:

1) One on February 22nd, 2016 (pre-scoring)

2) One on April 18th, 2016 (post-scoring)

D. Data and Source of Data

The data needed to answer the research question were the students’ essay test scores in pre- and post-scoring. This research focused only on the essay test given in grade eleven of Al-Amin Islamic Boarding Senior High School Mojokerto, as the essay material is taught at this level. The question guide of the essay test was drawn up in discussion between the grade eleven English teacher and the researcher, so that it matched the teachers’ habits in organizing essay tests and was as close as possible to their daily tests. See Appendix 1.

The source of data was the English teachers’ marks on their students’ essay tests. There were six English language teachers; three



E. Technique of Data Collection and Research Instrument

Collecting data means identifying and selecting individuals for a study, obtaining their permission to study them, and gathering information by asking people questions or observing their behaviors, in the form of a collection of numbers (test scores, frequency of behaviors) or words (responses, opinions, quotes).8 This study used the documentation technique, which usually involves quantitative data in the form of archival records.9 Here the records were the English teachers’ pre- and post-scoring of the essay test. The teachers had already used some specific scores that students needed to achieve, but these were not detailed. Therefore, the teachers were given a rubric to help them be more specific in their assessment. The rubric was adapted from an article by Viphavee Vongpumivitch of National Tsing Hua University entitled Classroom Writing Teacher’s Intra- and Inter-rater Reliability: Does It Matter10 and served as the analytical rating grid, adjusted to the teachers’ previous scoring scale. See Appendix 2.

These data collection sessions are described below.

a. PRE: From the 40 essays in grade eleven, 20 were chosen randomly. The three male teachers were asked to score the same 10 essays, and the three female teachers were asked to score another set of the same 10 essays.


John W. Creswell, Educational Research….10.


Phyllis Tharenou, et.al., Ross Donohue, Brian Cooper. Management Research Methods….. 124.




b. POST: Two months later, all teachers as raters were given another 10 essays. The 10 essays consisted of 5 essays that had been scored in pre-scoring and 5 different essays. As in pre-scoring, they were asked to score those essays. Some teachers did not remember having read and scored the same essays before, and others remembered having read them but had forgotten the specific scores they gave.

A research instrument is a tool for measuring, observing or documenting the data of the research.11 The instrument was the students’ answers to the essay test. For the pre-score, the English teachers graded the students’ essays a day after the test. For the post-score, the English teachers scored copies of the same essays two months later, as they had in the pre-score.

F. Data Analysis Technique

According to Creswell, analyzing data involves drawing conclusions about it; representing it in tables, figures and pictures to summarize it; and explaining the conclusions in words to provide answers to the research question.12 As a quantitative approach was used, the researcher analyzed the data using descriptive statistics. The students of the XI classes took the essay test. After the students had completed the test in handwriting, the English teachers gave the essays to the researcher to be copied. Then, the researcher gave the first copy to the English teachers for the first score (pre-score). After scoring, the teachers returned the first copy to the researcher for recapitulation.

John W. Creswell, Educational Research …. 14.

The second score (post-score) was given two months later. As with the first score, the researcher gave the second copy to the teachers and they graded it for the post-score. Then the researcher made the second recapitulation to be analyzed and compared with the pre-score.

After this step, the data were analyzed with the simplest rater reliability calculation, the Cronbach alpha coefficient (α), which is an alternative to the split-half method of calculating reliability. It served as the descriptive statistical analysis that summarizes the intra-rater reliability consistency of the English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring the essay test.
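For reference, the standard textbook form of the Cronbach alpha coefficient is sketched below. The symbols are notation assumed here and are not taken from the thesis: k is the number of measurements per essay (here k = 2, the PRE and POST scores), s_i^2 is the sample variance of measurement i across the essays, and s_T^2 is the sample variance of the summed scores.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} s_i^{2}}{s_T^{2}}\right),
\qquad
\alpha_{k=2} = 2\left(1 - \frac{s_{\mathrm{PRE}}^{2} + s_{\mathrm{POST}}^{2}}{s_{\mathrm{PRE}+\mathrm{POST}}^{2}}\right)
```

When the PRE and POST scores covary negatively, the sum of the two variances exceeds the variance of the summed scores and α becomes negative.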

The data were analyzed using the reliability analysis in SPSS 23 for each rater in each category of assessment. The steps were:

1. Opening the SPSS 23 software;

2. Changing the names of both variables to “PRE” and “POST”;

3. Changing the number in the “Decimal” column to 1 and choosing “Scale” in the “Measure” column;

4. Putting the data in the “Variable View”;

5. Clicking “Analyze”, then choosing “Scale” and “Reliability”;

7. Choosing “Alpha”, as this research used the Cronbach alpha coefficient for the reliability analysis;

8. Clicking “Statistics” and checking “Intraclass Correlation Coefficient”;

9. Choosing “Two-Way Mixed” in category, as this study had a population of raters;

10. Choosing “Average Measures” in values, as this research needed the mean for the calculation;

11. Choosing “Consistency” in reliability, as this study wanted to use the subsequent values for other analyses;

12. Clicking “Continue” and “OK”.
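The reliability analysis in the steps above can also be approximated outside SPSS. The following is a minimal sketch, not the procedure used in the study: it computes the Cronbach alpha coefficient from the PRE and POST scores of one rater in one category, which should coincide with the “Two-Way Mixed”, “Consistency”, “Average Measures” value. The function name `cronbach_alpha` and the example scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(*measurements):
    """Cronbach's alpha for k measurements of the same essays.

    Each argument is one measurement occasion (for example PRE or POST),
    given as a list with one score per essay.
    """
    k = len(measurements)
    if k < 2:
        raise ValueError("at least two measurements are needed")
    # Sample variance (n - 1 denominator) of each occasion across essays.
    item_variances = [variance(scores) for scores in measurements]
    # Variance of the summed score per essay (PRE + POST when k = 2).
    totals = [sum(scores) for scores in zip(*measurements)]
    total_variance = variance(totals)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores for five essays, NOT data from this study.
pre = [14, 11, 15, 12, 13]
post = [13, 11, 14, 12, 12]
print(round(cronbach_alpha(pre, post), 3))
```

The sketch raises a `ZeroDivisionError` when every essay has the same summed score, because the coefficient is undefined in that case.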

The result of this analysis is a decimal number (e.g., 0.333). The coefficient had to be over 0.7 before it could be concluded that the test was internally consistent.13 In addition, the coefficient can be interpreted at several levels of reliability. These levels were used to identify the consistency of each rater in each category. The interpretation is described below.14

3.1 Table of Reliability Interpretation


Daniel Mujis. Doing Quantitative Research…. 73. 14


Therefore, the researcher put 0.00 into the reliability interpretation table above to account for results that produced a negative score.

The table presented two columns: the range of the Cronbach alpha coefficient (α) and the interpretation of that range. For instance, the range 0.800 < α ≤ 1.000 means that a coefficient of more than 0.800 and up to 1.000 was interpreted as Very High reliability, and so on.
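The interpretation rule can be sketched as a small lookup. Only the Very High band (0.800 < α ≤ 1.000) is spelled out in the text; the other cut-offs below are assumptions inferred from how the coefficients in Table 4.2 are labeled in Table 4.3, and the name `interpret_alpha` is hypothetical.

```python
# Lower edges of each band; a band covers lower_edge < alpha <= next edge.
# Only the 0.800 < alpha <= 1.000 band is stated in the text. The other
# cut-offs are ASSUMPTIONS inferred from Tables 4.2 and 4.3.
BANDS = [
    (0.800, "Very High"),
    (0.600, "High"),
    (0.400, "Enough"),
    (0.200, "Low"),
    (0.000, "Very Low"),
]


def interpret_alpha(alpha):
    """Return the reliability level for a Cronbach alpha coefficient."""
    for lower_edge, label in BANDS:
        if alpha > lower_edge:
            return label
    # Zero and negative coefficients are treated as 0.00.
    return "Unreliable"


for value in (0.925, 0.800, 0.426, 0.158, -1.333):
    print(value, interpret_alpha(value))
```

Because each band excludes its lower edge, a coefficient of exactly 0.800 falls in the High band, which matches the way that value is labeled in Table 4.3.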

After the intra-rater reliability consistency had been found with the Cronbach alpha coefficient, inferential statistics were used in the next analysis step to test significance. The level of significance is the predetermined level at which a null hypothesis would be rejected.15 This means that the finding was examined to see whether it was only incidental or a true result, as this research was carried out only once. As the two variables of this study came from the same subjects, it was appropriate to analyze them with the t-test for dependent samples, called the paired t-test.

The paired t-test was calculated using SPSS 23. The data were imported into each cell in “Variable View”, the number in “Decimal” was changed to 1, and “Scale” was chosen in “Measure”, as in the reliability analysis before, for PRE and POST. The test was run by clicking the “Analyze” menu, choosing “Compare Means” and then selecting “Paired-Samples T Test”. In the dialog box that appeared, both variables were selected and moved to the “Paired Variables” box, and “OK” was clicked. SPSS then presented the output of the data.

The steps in analyzing the inferential statistics using the paired t-test were:

1. Deciding the level of significance. The most commonly used levels of significance in the behavioral sciences are the 0.05 and 0.01 levels.16 If the result was not significant at either level, other levels such as 0.10, 0.20 or 0.50 would be tried.

2. Calculating the paired t-test in SPSS 23.

3. Comparing the t-test value with the t-table value, and the p-value “Sig. (2-tailed)” with the level of significance, to decide whether the null hypothesis was rejected or not.
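The calculation in step 2 can be sketched without SPSS. This is an illustrative sketch only: the function name `paired_t` and the scores are hypothetical. The statistic is the mean of the PRE minus POST differences divided by its standard error, so it tests whether the mean difference between the two scorings is zero.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, degrees of freedom) for paired PRE and POST scores."""
    if len(pre) != len(post):
        raise ValueError("PRE and POST must have the same length")
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    # t = mean difference / standard error of the mean difference.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical scores for five essays, NOT data from this study.
pre = [80, 76, 84, 82, 79]
post = [77, 74, 83, 80, 78]
t, df = paired_t(pre, post)
# 2.776 is the two-tailed t-table value at the 5% level for df = 4.
print(round(t, 3), df, abs(t) > 2.776)
```

The p-value is left to SPSS or a statistics library here, since the standard library has no t-distribution function.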

G. Reliability and Validity

This research used the students’ work, in the form of essays, as the instrument that was examined in pre- and post-scoring. The instrument of the research must satisfy the requirements of reliability and validity. The reliability of a measuring instrument is the degree of consistency with which it measures whatever it is measuring.17 This means that the instrument has to produce the same result whenever it is measured. As the instrument took the form of essays, its reliability could not be measured mathematically. Nonetheless, this study tried to make the instrument reliable by keeping the data authentic and avoiding possible sources of subjectivity. Several steps were taken to protect the reliability of the data:

- Not changing any data (students’ work) scored by the English teachers in pre- and post-scoring (the same essays).

- Photocopying the essay answers twice, one copy for pre-scoring and another for post-scoring two months after the first scoring, so that the examiners scored the students’ original handwriting.

- Removing the students’ names from each answer sheet of the essay test and replacing them with their attendance-list numbers, so that the raters did not know whose writing it was.

Not only the reliability but also the validity of the research instrument is very important. Validity means the extent to which an instrument measures what it claims to measure.18 As in the daily essay examination, the question guide of the essay test was made by the teacher based on the grade eleven even-semester material in the K-13 curriculum for basic competences 3.10 and 4.14, on exposition texts about current issues. Therefore, the content of the students’ essays was 100% valid as

Ibid., 236.




This chapter presents the research findings and discussion. It provides the analysis and interpretation of the data collected to answer the research question about the scoring consistency of the English teachers of Al-Amin Islamic Boarding Senior High School Mojokerto on the essay test.

As explained in chapter III, the grade eleven teachers held a weekly writing examination, an exposition essay test, to measure the students’ achievement in writing that kind of text. The students wrote the essays by hand to preserve the originality of their writing. Although handwriting might increase subjectivity, the researcher limited this by removing the students’ names before copying the essays and giving them to the examiners or raters, that is, the teachers. Each of the six teachers or raters was asked to grade 10 papers, chosen randomly by the researcher from 40 essays, in both pre- and post-scoring. Five of the 10 papers in post-scoring were essays that the raters had already rated in pre-scoring. After the two-month interval, some teachers said that they did not remember having seen those papers before. The others said that although they remembered having seen the papers, they could not remember the grades they had given.



Alpha coefficient as the descriptive statistical analysis and the paired t-test as the inferential statistical analysis, to check the significance of the finding and test the null hypothesis in SPSS. The study found varied results for each rater.

A. Findings

1. Descriptive Statistics

The raters were asked to grade the same five essays in pre- and post-scoring, two months apart, and some of them did not realize that they were scoring the same essays. The others may have realized it, but they had completely forgotten what score they gave to each essay. Here is the table of the five essays in pre- and post-scoring, based on the rubric used in the assessment.

4.1 Table of Raters’ Pre- and Post-Score of Five Same Essays

No. Essay   Content (PRE/POST)   Organization (PRE/POST)   Grammar (PRE/POST)   Vocabulary (PRE/POST)   Mechanic (PRE/POST)   Total (PRE/POST)

1st   14/10   14/10   40/35   13/10   4/3   85/68

2nd   11/11   10/10   32/30   10/11   3/3   66/65

3rd   14/14   14/14   45/44   14/12   3/4   90/88

4th   13/12   13/12   40/40   13/12   3/3   82/79

5th   13/12   12/11   32/35   11/12   4/3   72/73

RATER 6

2nd   13/11.5   12.5/11.5   41/38   12.5/10   4.3/3.8   83.3/74.8

3rd   13/11.5   12.5/11.5   41/36   12.5/10.5   4.3/4   83.3/73.5

4th   12.5/13   11.5/12.5   40/43   11/12.5   3.8/4.3   78.8/85.3

5th   13/11.5   12.5/11.5   41/36   12.5/10.5   4.3/4   83.3/73.5

The table above shows the actual scores of the English teachers’ pre- and post-scoring. The score in each category follows the rubric given. “Raters” means the teachers who graded the essays. The teachers’ identities are not given or explained in detail. The important point is that they met the same criterion: each held a degree or a language certificate. In other words, the teachers were regarded as having equal capability in English. The essay number identifies the essay. Although the essays were numbered randomly in pre- and post-scoring, the five repeated essays were given a specific mark to help the researcher analyze them. Therefore, it was certain that those five essays in post-scoring were the same as in pre-scoring.



With the table above, the researcher wanted to show that each rater’s scores changed in almost all categories of the assessment. A post-score could be higher or lower than the pre-score, and only a few post-scores were the same as the pre-score. Naturally, this influenced the total score of each essay. The researcher could be confident that the changes were not deliberate, as some raters admitted that they had forgotten the scores they gave and even the essays themselves.

After the data were collected, the essay scores in the table above were submitted to intraclass analysis in SPSS 23 using the Cronbach alpha coefficient. The Cronbach alpha coefficients of the teachers’ intra-rater reliability consistency for the five papers are presented below.

4.2 Table of Cronbach Alpha Coefficient Result of Raters’ Intra-Rater Reliability Consistency

Content Organization Grammar Vocabulary Mechanic Total

RATER 1 0.426 0.682 0.925 0.158 -1.333 0.832

RATER 2 0.813 0.768 0.684 0.696 -0.468 0.722

RATER 3 0.571 -1.396 0.718 0.390 0.989 0.917

RATER 4 0.947 0.750 0.708 0.800 0.375 0.796

RATER 5 -10.435 -0.667 -4.850 -2.400 -0.667 -3.969

RATER 6 -3.000 -2.444 -3.800 -6.087 -10.556 -10.248



rater’s grammar scores for the five essays were entered and analyzed in SPSS and produced 0.684 as the Cronbach alpha coefficient. Because the scores varied, each rater got a different result in each category. Unfortunately, some data produced negative values, which are treated as a score of 0.00. Nevertheless, the researcher used the actual results to analyze whether the raters were consistent or not.

To determine the intra-rater reliability level of each teacher, the numerical Cronbach alpha coefficients were interpreted according to the Reliability Interpretation presented in the Data Analysis Technique subsection. The interpretation result is presented below.

4.3 Table of Raters’ Reliability Interpretation Result

Content Organization Grammar Vocabulary Mechanic Total

RATER 1 Enough High Very High Very Low Unreliable Very High

RATER 2 Very High High High High Unreliable High

RATER 3 Enough Unreliable High Low Very High Very High

RATER 4 Very High High High High Low High

RATER 5 Unreliable Unreliable Unreliable Unreliable Unreliable Unreliable

RATER 6 Unreliable Unreliable Unreliable Unreliable Unreliable Unreliable



the score was included in the range of 0.800 < α ≤ 1.000. Therefore, the table above presents each result in words.

As the table above shows, the 1st rater reached the Very High reliability level in Grammar and the High level in Organization. The rater was reliable Enough for Content but Very Low in Vocabulary. Mechanic was the worst category, as its negative score means that it was unreliable. Even so, the rater’s Total score was very reliable, as it reached the Very High level.

The 2nd rater was different from the 1st. Most categories reached the High level, namely Organization, Grammar and Vocabulary, and this may have influenced the Total score. The most reliable category was Content, where 0.813 falls in the “Very High” level. In contrast, Mechanic was unreliable, as it got a negative score like that of the 1st rater.

The Mechanic and Total scores of the 3rd rater were almost perfect, showing Very High reliability. The rater reached the High level in Grammar, was reliable Enough in Content and was Low in Vocabulary. Unfortunately, like the two raters before, this rater had a negative value, which means unreliable, in one category, here Organization.



was the lowest reliability level, as it got a score of 0.375, which was still a positive value.

The least reliable raters were the 5th and 6th raters. All of their categories showed negative values. It can be said that they got a score of 0.000, or zero reliability. Put simply, they were placed in the inconsistent or unreliable level.

Because the marks varied, all grades needed to be averaged so that a single result could represent and cover all raters in all categories. Here is the table of average results. The table shows the average results of pre- and post-scoring of the five repeated essays. For example, the PRE value of the 1st essay is the average of all pre-scoring results in all categories, and likewise for the POST value.

4.4 Table of All Essays’ Average Results

No. Essay PRE POST

1st 81.8 77.4

2nd 75.8 72.9

3rd 83 80.7

4th 81.9 81.5


that English teachers or raters had good reliability consistency. Although two raters were inconsistent in all categories, this did not influence the other results that make up the average.

4.5 Output of Intraclass Correlation in Cronbach Alpha Coefficient of SPSS 23

[SPSS output table: Intraclass Correlation Coefficient. Columns: Intraclass Correlation; 95% Confidence Interval; F Test with True Value 0.]



2. Inferential Statistics

After the result of the descriptive statistics had been obtained, its significance was checked using inferential statistics, namely the paired t-test in SPSS 23. The values checked were not the whole set of results but only the average results, as these cover all values of all raters in pre- and post-scoring. The result of the paired t-test in SPSS 23 is given below.

4.6 Output of Paired T Test Result Of Average Result in SPSS 23



The first table is Paired Samples Statistics, which shows the statistical summary of pre- and post-scoring. The table shows that the average score was 80.360 in pre-scoring and 78.020 in post-scoring, a reduction of 2.340. The standard deviation shows the variation of the data in each variable: 2.887 in pre-scoring and 3.394 in post-scoring. N is the number of data points, that is, the five essays graded twice by the raters at a two-month interval.

Paired Samples Correlations shows the correlation between the two variables, which was 0.902 with a significance of 0.036. This means that pre- and post-scoring were strongly correlated.
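The summary figures above can be cross-checked with a short script. This is only a sketch under a stated assumption: the first four pairs are the rows of Table 4.4 as they survive in this copy, while the fifth pair (79.3, 77.6) is not in the surviving table and is back-calculated here from the reported means of 80.360 and 78.020. The function name `pearson_r` is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Rows 1-4 are taken from Table 4.4. Row 5 is an ASSUMPTION: it is
# back-calculated from the reported means (80.360 and 78.020).
pre = [81.8, 75.8, 83.0, 81.9, 79.3]
post = [77.4, 72.9, 80.7, 81.5, 77.6]


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cross / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


print("mean PRE :", round(mean(pre), 3))
print("mean POST:", round(mean(post), 3))
print("reduction:", round(mean(pre) - mean(post), 3))
print("SD PRE   :", round(stdev(pre), 3))
print("SD POST  :", round(stdev(post), 3))
print("r        :", round(pearson_r(pre, post), 3))
```

With the assumed fifth pair, the standard deviations and the correlation come out at about 2.887, 3.394 and 0.902, in line with the SPSS figures reported above.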

The last table is Paired Samples Test. It can be interpreted as follows:

- Hypothesis

H0 = the intra-rater reliability of English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring essay test is not consistent.

H1 = the intra-rater reliability of English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring essay test is consistent.

- Significance level

Sig = 0.05

- Critical area

Based on t-test:

Reject H0 = t-test > t-table (5%, N-1)

Accept H0 = t-test < t-table (5%, N-1)

Based on p-value (Sig.):

Reject H0 = p-value < 0.05

Accept H0 = p-value > 0.05

- Decision

t-test = 3.541 > t-table (5%, N-1) = 2.776; Sig. = 0.02 < 0.05; this means H0 is rejected.

The intra-rater reliability of English teachers at Al-Amin Islamic Boarding School Mojokerto in scoring essay test was consistent.
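The two decision rules can be written as a small check. This is a hedged sketch: the function name `reject_h0` is hypothetical, and taking the absolute value of t assumes the two-tailed test implied by “Sig. (2-tailed)”.

```python
def reject_h0(t_value, t_table, p_value, alpha=0.05):
    """Apply both decision rules and report whether each rejects H0."""
    # Rule 1: the t-test value exceeds the t-table value (two-tailed).
    by_t = abs(t_value) > t_table
    # Rule 2: the p-value is below the level of significance.
    by_p = p_value < alpha
    return by_t, by_p


# Values reported in the Decision step above.
by_t, by_p = reject_h0(t_value=3.541, t_table=2.776, p_value=0.02)
print(by_t, by_p)
```

Both rules are reported separately so that any disagreement between the t-table comparison and the p-value comparison would be visible.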

B. Discussion



inconsistent in Organization. Unfortunately, the fifth and sixth teachers or raters appeared to be the least consistent. Their ratings were poor in all categories. In fact, they even contradicted their own pre-scoring ratings, so that the coefficient was negative.

To make it easier to draw a conclusion, all the varied results were averaged and the average was analyzed with the Cronbach alpha coefficient. Based on the reliability interpretation, the result showed Very High consistency, with an intraclass correlation of 0.942 in SPSS 23. This value was above 0.7, the standard Cronbach alpha threshold for deciding reliability. In short, it showed that the English teachers of Al-Amin Islamic Boarding School Mojokerto had good reliability. In addition, the paired t-test result, as the significance calculation of the inferential statistics, met the rules for rejecting the null hypothesis. The rules were: 1) the t-test value was more than the t-table value, 3.541 > 2.776, and 2) Sig. = 0.02 was less than 0.05, the level of significance. This means that the result was a real result, not an incidental one. Although two raters were inconsistent or unreliable in all categories, this did not affect the calculation, which showed that the intra-rater reliability of the English teachers at Al-Amin Islamic Boarding School Mojokerto was internally consistent. It can be said that the inconsistent scores happened by chance, with many exceptions on the raters’ part that can be investigated in the next



table were analyzed in SPSS 23 based on the categories. For instance, the 2nd


