An Analysis on the Difficulty Level of English Summative Test for Second Grade of Junior High School at Odd Semester 2010/2011


(A Case Study at the Second Grade of SMP Negeri 13 Tangerang Selatan)

A “Skripsi”

Presented to the Faculty of Tarbiya and Teachers’ Training

in Partial Fulfillment of the Requirements for the Degree of S.Pd. (Bachelor of Arts) in

English Language Education

Written By:

Andrian Dwi Prayoga 107014000882

ENGLISH EDUCATION DEPARTMENT

FACULTY OF TARBIYA AND TEACHERS’ TRAINING

“SYARIF HIDAYATULLAH” STATE ISLAMIC UNIVERSITY

JAKARTA


Semester 2010/2011 (A Case Study at the Second Grade of SMPN 13

South Tangerang), Skripsi, English Education Department, Faculty of

Tarbiya and Teachers’ Training, Syarif Hidayatullah State Islamic

University Jakarta.

Key words: Item Difficulty Level, Summative Test

This study aims to measure the difficulty level of the English summative test items administered to the second grade of SMPN 13 South Tangerang in the odd semester of the 2010/2011 academic year. Through this study, it can be determined which test items are easy, moderate, or difficult.

This study is quantitative research because the researcher uses numerical data that are analyzed statistically. It is also categorized as descriptive analysis because it is intended to describe the objective condition of the difficulty level of the English summative test for the second grade of SMPN 13 South Tangerang in the odd semester of the 2010/2011 academic year.

The findings of this study are that moderate items have the highest percentage, 66.7%, followed by difficult items with 20% and easy items with 13.3%. Overall, the difficulty level of the test is moderate, with an index of 0.50. Therefore, this test has a good difficulty level.


Semester 2010/2011 (A Case Study at the Second Grade of SMPN 13

South Tangerang), Skripsi, English Education Department, Faculty of Tarbiya and Teachers’ Training, Syarif Hidayatullah State Islamic University Jakarta.

Key words: Item Difficulty Level, Summative Test

This study aims to measure the difficulty level of the items of the English summative test administered to the second grade of SMPN 13 Tangerang Selatan in the odd semester of the 2010/2011 academic year. Through this study, it can be determined which items are too easy, moderate, and difficult.

This study is quantitative research because the researcher uses numerical data that are analyzed statistically. It is also categorized as descriptive analysis because it describes the objective condition of the difficulty level of the English summative test for the second grade of SMPN 13 Tangerang Selatan in the odd semester of the 2010/2011 academic year.

The result of this study is that moderate items have the highest percentage, 66.7%, followed by difficult items at 20% and easy items at 13.3%. Overall, the difficulty level of the test is moderate, at 0.50. Therefore, this test has a good difficulty level.


and Blessing to the writer, so that this “Skripsi” can be finished completely. Peace and Salutation be upon our prophet Muhammad, his family, companions, and his followers.

The writer would like to express his gratitude to Mr. Dr. H. Muhammad Farkhan, M.Pd., the writer’s advisor, who kindly spent his time giving valuable advice, guidance, corrections, and suggestions in the composition of this “Skripsi.”

Also, on this occasion, the writer would like to express his greatest appreciation, honor, gratitude, and love to his beloved mother, Mrs. Tri Hastuti, S.Pd., who has been a great motivator in every condition, and also to his father, Mr. Juraid Umar, M.Pd., who has given him much inspiration. He thanks them for their prayers, guidance, patience, and encouragement, which motivated the writer to finish his study.

The writer would like to express his highest appreciation and gratitude to all lecturers of the English Education Department for teaching precious knowledge, sharing the values of life, and giving unforgettable study experiences.

The writer gives many thanks to Mr. Rohman, S.Pd., the Headmaster of “SMPN” 13 South Tangerang, who gave the writer permission to do the research there. His gratitude also goes to Ms. Dahlia Muflikhati, S.Pd., one of the English teachers at “SMPN” 13 South Tangerang, who gave the writer great contribution and cooperation while he was doing this research.

His gratitude also goes to Mr. Drs. Syauki, M.Pd., the Head of the English Education Department, and Ms. Neneng Sunengsih, S.Pd., the Secretary of the English Education Department. His thanks are also given to the staff of the English Education Department, especially Ms. Aida Ainul Wardah, S.Pd., who always gives excellent service and contribution to the writer.

The writer would like to express his thanks and love to all his beloved friends,


while studying together.

Finally, the writer realizes that this “Skripsi” is still far from perfect. Constructive criticism and suggestions are welcome to make it better.

Jakarta, November 2011


ABSTRAK
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF TABLES

CHAPTER I : INTRODUCTION
A. Background of the Study
B. Limitation of the Study
C. Statement of the Problem
D. Objective of the Study
E. Significance of the Study
F. Method of the Study

CHAPTER II : THEORETICAL FRAMEWORK
A. Test
1. Definition of Test
2. Types of Test
a. Achievement Test
1. Placement Test
2. Formative Test
3. Diagnostic Test
4. Summative Test
b. Proficiency Test
c. Progress Test
d. Aptitude Test
B. Categories of Good Test
1. Validity
a. Content Validity
2. Reliability
3. Practicality
C. Types of Test Item
1. Objective Test
a. Selection-Type Test Item
1. Multiple Choice
2. True-False
3. Matching
4. Rearrangement
b. Supply-Type Test Item
1. Short-Answer
2. Fill-in
2. Essay Test
D. Item Analysis
1. Definition of Item Analysis
2. Kinds of Item Analysis
a. Level of Difficulty
b. Discriminating Power
c. The Effectiveness of Distractors
E. The Importance of Item Analysis

CHAPTER III : THE IMPLEMENTATION OF THE RESEARCH
A. Research Methodology
1. Purpose of the Study
2. Place and Time of the Study
3. Population and Sample
4. Method of the Study
1. Data Description
2. Data Analysis
3. Data Interpretation

CHAPTER IV : CONCLUSION AND SUGGESTIONS
A. Conclusion
B. Suggestions

BIBLIOGRAPHY


Table 2. The Group Position Based on the Test Result

Table 3. Format of Item Analysis of the English Summative Test

Table 4. Classification of Items Based on the Proportion of Difficulty Level

Appendix

Table 5. Students’ Answer in the Upper Group (Multiple-Choice Items)

Table 6. Students’ Answer in the Lower Group (Multiple-Choice Items)


CHAPTER I

INTRODUCTION

This chapter presents the background of the study, the limitation of the study, the statement of the problem, the objective of the study, the significance of the study, and the method of the study.

A. Background of the Study

Evaluation plays an important role in every stage of education. It is

integrated in the school program so it contributes directly to the teaching and

learning process. According to Norman E. Gronlund, “Carefully collected

evaluation data help teachers understand the learners, plan learning

experiences for them, and determine the extent to which the instructional

objectives are being achieved.”1

Evaluation refers to the process of drawing conclusions from a study of the data gathered in order to make value judgments about students’ performance. Lyle F. Bachman states that “evaluation can be defined as the systematic gathering of information for the purpose of making decisions.”2 In summary, evaluation plays a very important role because teachers must always be concerned with the quality of their instructional process and with whether students have reached the instructional goals stated beforehand.

1

Norman E. Gronlund, Measurement and Evaluation in Teaching, (New York: Macmillan Publishing Co., Inc., 1981), 4th Ed., p. 3.

2


There are many ways of collecting data as information in the process of evaluation. One of them is by using a test. A test is a set of questions, each of which has a correct answer, which examinees usually answer orally or in writing.3 There are several types of test. One of them is the achievement test, which is designed to find out how successfully students have mastered the knowledge, abilities, and skills taught in past learning activities.

According to Wilmar Tinambunan, there are four types of achievement

test which are commonly used. First, a placement test is done at the beginning

of learning to know student’s early performance. Next, a formative test is

used to monitor student’s progress during the learning process. Third, a

diagnostic test is intended to detect student’s weaknesses during instruction.

Finally, a summative test is used to show the standard that students have

reached in relation to other students at the same stage.4 In this research, the test that the writer would like to analyze is the summative test.

As one of the methods of measuring students’ achievement in the learning process, a test should be well constructed. A well-constructed test should have three main characteristics: validity, reliability, and practicality. Validity in language testing refers to how well the test really evaluates what we actually want to measure. Reliability means that a test has to be consistent and reproducible. Practicality is concerned with a wide range of factors of economy, convenience, and interpretability.5

Making a well constructed test is the teachers’ responsibility because

Wilmar Tinambunan, Evaluation of Students Achievement, (Jakarta: Depdikbud, 1988), p. 3.

4

Wilmar Tinambunan, Evaluation of Students ...p. 7-9.

5

Robert L. Thorndike and Elizabeth Hagen, Measurement and Evaluation in Psychology and


particular classes. More of these tests are administrated than any other kind.

Unfortunately, they are carelessly constructed and interpreted.”6

Based on the explanation above, teachers need to evaluate the effectiveness of the test items because it is necessary for them to know whether the test items work well or not. Harold S. Maiden explains that “the selections of appropriate language items are not enough by itself to ensure a good test. Each question needs to function properly; otherwise, it can weaken the exam. Fortunately, there are some rather simple statistical ways of checking individuals’ items. This procedure is called as item analysis.”7 This is done by analyzing the students’ responses to each item.

Item analysis of a test can be a valuable activity that improves the test’s reliability and validity. Item analysis procedures provide information for evaluating the functional effectiveness of each item and for detecting weaknesses, which should be corrected. This information is useful when reviewing the test, and it is indispensable when building a set of high-quality items for the next test.

Item analysis has three main components: level of difficulty, discriminating power, and effectiveness of the distractors. The difficulty level shows what percentage of students answer an item correctly. Discriminating power indicates whether the test can discriminate among students of different ability. The effectiveness of the distractors indicates whether all the alternatives of the items function well or not.
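As a minimal sketch of the difficulty level described above, the proportion of students who answer an item correctly can be computed as follows. The function names, the sample responses, and the cutoff values of 0.30 and 0.70 are illustrative assumptions; they are not taken from the test data or from the classification used in this study.

```python
from typing import List


def item_difficulty(responses: List[bool]) -> float:
    """Return the proportion of examinees who answered the item correctly."""
    if not responses:
        raise ValueError("at least one response is required")
    # True counts as 1 and False as 0, so sum() gives the number of correct answers.
    return sum(responses) / len(responses)


def classify(p: float, low: float = 0.30, high: float = 0.70) -> str:
    """Label an item by its difficulty index.

    The cutoffs `low` and `high` are assumed values for illustration only.
    """
    if p < low:
        return "difficult"
    if p > high:
        return "easy"
    return "moderate"


# Hypothetical responses of ten students to a single item (True = correct).
answers = [True, True, False, True, False, False, True, False, True, False]
p = item_difficulty(answers)
print(p, classify(p))  # prints: 0.5 moderate
```

A higher index means an easier item, because more students answered it correctly.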

The writer limits the problem of the study; he focuses only on the difficulty level of the test. The analysis should show whether the test is classified as easy, moderate, or difficult. In addition, he needs to analyze what percentage of the items are easy, moderate, and difficult. Moreover, the test should be able to distinguish between the students who have studied well and those who have not.

6

J. Stanley Ahmann and Marvin D. Glock, Evaluating Pupil Growth Principles of Tests and

Measurement, (Boston: Allyn and Bacon, Inc, 1967), p. 17.

7


The writer intends to analyze the difficulty level of the English summative test because he found some problems at the second grade of SMP Negeri 13 Tangerang Selatan. First, some students commented that the test was too difficult or too easy. Second, and most importantly, many students got low scores. The writer therefore investigated this problem to find out how difficult the test is.

Based on the description above, the writer would like to perform an item analysis of the English Summative Test items for the second grade of SMP Negeri 13 Tangerang Selatan. The writer did the research under the title “AN ANALYSIS ON THE DIFFICULTY LEVEL OF ENGLISH SUMMATIVE TEST FOR SECOND GRADE OF JUNIOR HIGH SCHOOL AT ODD SEMESTER 2010/2011 (A Case Study at the Second Grade of SMP Negeri 13 Tangerang Selatan)”.

B. Limitation of the Study

To make this study easier to understand, the writer limits the study as follows:

1. The research focuses only on the difficulty level of the English Summative Test in the odd semester of 2010/2011.

2. The test analyzed is the English Summative Test for the second grade in the odd semester of the 2010/2011 academic year.

3. The research focuses only on the second grade students of SMP Negeri 13 Tangerang Selatan.

C. Statement of the Problem

Based on the limitation of the study explained above, the writer formulates the statement of the problem in this research as follows:

“Does the English Summative Test for the second grade of SMP Negeri 13 Tangerang Selatan at the odd semester 2010/2011 fulfill the criteria of a good


D. Objective of the Study

In line with the limitation of the study, the objective of the study is to measure the quality of the English Summative Test for the second grade of SMP Negeri 13 Tangerang Selatan in the odd semester of 2010/2011 and to find out the difficulty level of each item.

E. Significance of the Study

The result of this study is expected to have some benefits for English teaching. It informs test makers or classroom teachers when a test item has a high or low difficulty level. They can review which items make the test too easy or too difficult and follow up by revising the test. Thus, this study can provide useful input and feedback as a basis for improving the English Summative Test.

In addition, the study fulfills the writer’s final assignment for his bachelor’s degree. Finally, other researchers who are interested in analyzing difficulty level can get basic information from this study for further research.

F. Method of the Study

The methods used in the research are descriptive analysis and quantitative analysis. The writer took the English Summative Test paper and the students’ answer sheets, then analyzed the difficulty level of each item. Quantitatively, the writer used numerical data, which were analyzed statistically. The writer also did library research by studying a number of references and literature related to the topic of discussion to support the theoretical aspect of


CHAPTER II

THEORETICAL FRAMEWORK

In this chapter, the writer gives a clear description of the theoretical framework, which covers the definition and types of test, the characteristics of a good test, the types of test item, the definition and kinds of item analysis, and the importance of item analysis.

A. Test

1. Definition of Test

In the process of evaluation, one of the methods that can be used to gather data is a test. Many experts have defined the term test. In his book, Educational Test and Measurement, an Introduction, Anthony J. Nitko writes, “Test is a systematic procedure for observing and describing one or more characteristics of a person with the aid of either a numerical scale or category system.”1

Another opinion states that a test is a technique consisting of questions, statements, or tasks that are given to students in order to measure their performance or behavior.2 Victor H. Noll also writes

1

Anthony J. Nitko, Educational Test and Measurement, an Introduction, (New York: Harcourt Brace Jovanovich, Inc., 1983), p. 6.

2


that a test usually involves the use of a certain instrument or set of instruments to determine a specific quality or trait.3

Moreover, Jum C. Nunnally states that, “A test is a standardized

situation that provides an individual with a score.”4

Based on the definitions above, it can be concluded that a test is a method of measuring the behavior or performance of individuals, and it consists of systematic procedures for gathering data about their achievement. It is usually carried out under standardized conditions in the teaching and learning process.

2. Types of Test

There are many types of test used to measure students’ achievement.

However, there are four basic types of language tests: achievement tests,

proficiency tests, progress tests, and aptitude tests.5

a. Achievement Test

In his book, Language Testing, Tim McNamara writes,

“Achievement tests accumulate evidence during, or at the end of, a course of study in order to see whether and where progress has been made in terms of the goals of learning. They relate to the past in that they measure what language the students have learned as a

result of teaching.”6

Furthermore, Nunnally states that, “The purpose of achievement

test is to measure progress in school up to a particular point in time.

Achievement test is based on the core educational objectives shared by

the educators across the country.”7

3

Victor H. Noll, Educational Measurement, (Boston: Houghton Mifflin Company, 1965), 2nd Ed., p. 13.

Tim McNamara, Language Testing, (New York: Oxford University Press, 2000), p. 6.

7


In addition, according to Rebecca M. Valette, “achievement tests are usually not built around one set of teaching materials but are

designed for use with students from a variety of different schools and

programs.”8

In the writer’s opinion, an achievement test is a test designed to find out how successfully students have mastered the materials of a long course and whether they have achieved the educational objectives. Thus, an achievement test makes it possible to compare the progress of individual students, classes, and schools with that of others across the country.

According to Wilmar Tinambunan, there are four types of achievement test: placement, formative, diagnostic, and summative test.9

1. Placement Test

the top class. In other centres the students’ ability in different skills such as reading and writing may need to be identified. In such a centre a student could conceivably be placed in the top reading class, but in the bottom writing class, or some other combination. In yet other centres placement test may have the purpose of deciding whether students need any further tuition

at all.”10

Also, James Dean Brown, in his book Testing in Language Programs, states that the purpose of this test is to group students of the same ability level so that teachers can concentrate on the problems or learning points suitable for that level.11

Moreover, placement tests provide information that helps to place students in the part of the learning program most appropriate to their level of ability. They are most successful when they are constructed for particular situations.12 Most placement tests constructed by classroom teachers are pretests, which function to determine students’ readiness to begin the instruction and to place the students in the part of the learning activity with the proper instruction.

2. Formative Test

Norman E. Gronlund writes that “formative tests are given periodically during instruction to monitor pupil learning progress and to provide ongoing feedback to pupils and teachers.”13 A formative test usually covers a part of instruction, such as a unit or a chapter.

In line with the opinion above, formative tests are carried out while the instruction is ongoing to identify the learning progress students have made and to give continuous feedback in terms of the strengths and weaknesses of the learning activity.14 Furthermore, “the formative test is given during the course of instruction; its purpose is to show which aspects of the chapter the student has mastered and where remedial work is necessary.”15

11

James Dean Brown, Testing in Language Programs, (New Jersey: Prentice Hall Regents, 1996), p. 11.

12

Arthur Hughes, Testing for Language Teachers, (Cambridge: Cambridge University Press, 2003), 2nd Ed., p. 16-17.

13

Norman E. Gronlund, Measurement and Evaluation in Teaching, (New York: Macmillan Publishing Co., Inc., 1981), 4th Ed., p. 125.

14

Wilmar Tinambunan, Evaluation of Students..., p. 8.

15


Its result gives information about how well students have mastered a particular material and provides them with immediate feedback. With this feedback, students can identify their learning errors or weaknesses and then correct them with or without the teacher’s help.

Thus, in the writer’s opinion, a formative test is designed to check students’ progress during the instruction in mastering one particular learning point and to give students feedback directly.

3. Diagnostic Test

The result of a diagnostic test is intended to show the specific weaknesses and strengths in a particular material or skill.16 It can be said to be more comprehensive and detailed because it identifies the major causes of learning difficulties and then helps prepare a plan for remedial activity.

In his book, Testing for Language Teachers, Arthur Hughes states that, “Diagnostic tests are used to identify learners’ strengths and weaknesses. They are intended primarily to ascertain what learning still needs to take place.”17 In addition, “a diagnostic test is designed to determine the degree to which the specific instructional objectives of the course have been accomplished.”18

Therefore, by using diagnostic tests, the teacher knows what students have mastered and in what areas a student needs further help. A diagnostic test is given while students are learning the language, so it is typically delivered at the beginning or in the middle of a language course.

16

Robert Lado, Language Testing, The Construction and Use of Foreign Language Tests, (London: Longman Group Limited, 1961), p. 369.

17

Arthur Hughes, Testing for Language ..., p. 15.

18


4. Summative Test

According to Wilmar Tinambunan, “the summative test is

intended to show the standard which the students have now reached

in relation to other students at the same stage. It typically comes at

the end of a course or unit of instruction.” 19

To support the opinion above, summative assessment methods are made to determine what a student has accomplished at the beginning or the end of a language course, so that teachers can give a final mark to students.20 Moreover, Rebecca M. Valette states that “the summative test is usually given at the end of a marking period and measures the “sum” total of the material covered.”21

In conclusion, the summative test is a test that is usually administered at the end of a language course, a semester, or an academic year to find out how successfully students have mastered a wide range of material within a certain period. On this type of test, students are usually ranked and graded.

b. Proficiency Test

James Dean Brown writes, “a proficiency test assesses the general

knowledge or skills commonly required or prerequisite to entry into a group of similar institutions. Such tests are very general in language. The content of a proficiency tests, therefore, is not based on the content or objectives of language courses that people taking the test may have followed. Rather, it is based on a specification of

19

Wilmar Tinambunan, Evaluation of Students ..., p. 9.


what candidates have to be able to do in the language in order to be

considered proficient.”23

To sum up, proficiency tests measure someone’s general ability in

a language and they are not related to some previous courses of

instruction. The proficiency tests usually consist of standardized

multiple-choice items on grammar, vocabulary, reading

comprehension, aural comprehension, and sometimes on writing.

c. Progress Test

Based on the book Language Test Construction and Evaluation,

“progress tests are given at various stages throughout a language

course to see what the students have learnt.”24

Meanwhile, another opinion states that, “the progress test measures

how much the student has learned in a specific course of instruction.

The tests that the classroom teacher prepares for administration at the

end of a unit or end of a semester are progress tests.”25

Thus, a progress test is used to check students’ progress in learning one particular lesson, and the teacher can administer it at any time during a language course.

d. Aptitude Test

According to Robert Lado, “aptitude tests are designed to predict

the degree of success that individual students will have in studying a


tests imply prediction. They give us a basis for predicting future level

of performance.”28

Because it measures the potential capacity of an individual, an aptitude test can be used to estimate how long students will need to master a foreign language sufficiently. It is also often used in selecting individuals for language training, for jobs, for scholarships, and for many other purposes.

B. Categories of Good Test

A test, as an instrument for obtaining information, should be of good quality. The quality of a test will influence the result of the test itself. If the test is of good quality, the right information will be gained and used to make accurate decisions about the students’ achievement.

intended.”30 Also, Norman E. Gronlund writes that “validity refers to the extent to which the results of an evaluating procedure serve the particular uses for which they are intended.”31

So, validity of a test means that the test really measures what it is

supposed to measure. According to some experts, three types of validity

have been identified and are commonly used in educational measurement.

28

Howard B. Lyman, Test Scores and What They Mean, (Boston: Allyn and Bacon, 1998), 6th Ed., p. 22.

29

David P. Harris, Testing English as a Second Language, (New York: McGraw-Hill Inc., 1969). p. 13.

30

J. Stanley Ahmann and Marvin D. Glock, Evaluating Pupil Growth Principles of Tests and

Measurement, (Boston: Allyn and Bacon, 1967), 3rd Ed., p. 285.

31


a. Content Validity

A test can be said to have content validity if it is built with a representative sample of the language skills, structures, etc. with which it is meant to be concerned.32 In line with that, Anthony J. Nitko writes

that, “content validity is the extent the items on a test are representative

of the domain or universe that they are supposed to represent.”33

Thus, the degree of content validity of a test relates to how well the test measures the content of the subject matter that students have studied. Therefore, it is important to make sure that the test covers all the areas of material that are supposed to be assessed. For example, a grammar test should be made up of items relating to the knowledge of grammar.

b. Construct Validity

This type of validity relates to any underlying ability that is

formulated in a theory of language ability. Construct validity is “the

extent that a test measures the trait, attribute, or mental process it

should measure, and whether descriptions of persons in terms of such

constructs can follow using the scores from that test.”34

Moreover, Arthur Hughes writes that “it is a matter of empirical research to establish whether or not such a distinct ability exists, can be measured, and is indeed measured in that test.”35

In other words, a test has construct validity if it is able to measure certain specific characteristics in accordance with a theory of language and behavior in learning.

c. Criterion-Related Validity

Criterion-related validity concerns the extent to which the results of the test agree with the results of another independent

32

33

Anthony J. Nitko, Educational Test ..., p. 413

34

Anthony J. Nitko, Educational Test ..., p. 413.

35


and trustworthy assessment of student’s competence.36

In addition, in

his book, Educational Tests and Measurement, An Introduction,

Anthony J. Nitko states that, “criterion-related validity questions

concern the extent to which scores on a test permit inferences about

examinees’ likely standing on another measure called a criterion.”37

This type of validity can be divided into two parts; namely,

individual’s test scores with his other assessment taken at about the

same time.

2. Predictive Validity

Predictive validity is intended to predict how well someone

will perform in the future. It is supported by a quote, “predictive

validity concerns the degree to which a test can predict candidates’

future performance.”39

To carry out this validation, the earlier test scores of individual students are correlated with the grades obtained at the end of the first semester.
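The correlation step can be sketched as follows, assuming that the Pearson product-moment correlation is used (the text does not name a specific coefficient). The function name and the sample scores are hypothetical and serve only as an illustration.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("both lists need the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return cross / (dev_x * dev_y)


# Hypothetical data: earlier test scores and end-of-semester grades.
earlier_scores = [55, 60, 72, 80, 91]
semester_grades = [58, 65, 70, 84, 88]
print(round(pearson_r(earlier_scores, semester_grades), 2))
```

A coefficient close to 1 would suggest that the earlier test predicts later grades well, while a value near 0 would suggest little predictive validity.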

2. Reliability

Consistent measurement is a necessary condition for high-quality educational testing. This consistency of a test is called reliability.

36

37

38

J. Stanley Ahmann and Marvin D. Glock, Evaluating Pupil ..., p. 288.

39


“Reliability refers to the consistency of measurement – that is, to how consistent test scores or other evaluation results are from one measurement

to another.”40

According to Desmond Allison, “the reliability of a test concerns the

accuracy and trustworthiness of its results. Reliable test results will

accurately reflect each student’s understanding of whatever is being

tested.”41

To sum up, a test is reliable if it consistently produces the same, or nearly the same, result or rank for the same individual taking the test several times on different occasions.

3. Practicality

The last quality that a good test should have is practicality or usability.

In selecting a test and other instruments, practical considerations cannot be

neglected. These are some factors relevant to the practicality when

In addition, ease of administration involves simple and clear directions, a minimum number of subtests, and easy timing.

b. Time Required for Administration

The test’s length is directly related to the reliability of a test, so enough time should be made available. “A safe procedure is to

40

Norman E. Gronlund, Measurement and Evaluation ..., p. 93.

41

Desmond Allison, Language Testing ..., p. 85.

42

Norman E. Gronlund and Robert L. Linn, Measurement and Evaluation in Teaching, (New York: Macmillan Publishing Company, 1990), 6th Ed., p. 102-103.

43


allot as much time as is necessary to obtain valid and reliable

results.”44

c. Ease of Interpretation and Application

If the test is interpreted correctly and applied effectively, the teacher can make accurate educational decisions about students’ performance.

d. Availability of Equivalent or Comparable Forms

Equivalent tests measure the same aspect and are alike in content, level of difficulty, and other characteristics. They are useful if the teacher wants to remove the factor of memory when retesting students on the same domain. Comparable forms are especially useful in measuring progress in the basic skills.

e. Cost of Testing

Cost is not actually a major factor in selecting a

test, because testing is relatively inexpensive. However, the test

should still be as economical as possible.

C. Types of Test Item

An item is the basic unit of language testing. According to James Dean

Brown, an item “is the smallest unit that produces distinctive

and meaningful information on a test or rating scale.”45

The items used in classroom tests are commonly divided into two broad

categories: (1) the objective item, and (2) the essay test.

1. Objective Test

In constructing an achievement test, the test maker may choose from a

variety of item types. One of them is the objective item. This

item type can be scored objectively. Furthermore, “equally

competent scorers can score them independently and obtain the same

results.”46 In addition, Rebecca M. Valette defines an objective test as “any

item for which there is a single predictable correct answer.”47

Thus, when scoring this test, any subjective judgement from the scorer

is pushed aside because every item in that test has only one absolutely

right answer. So, even if the test is scored several times by

different scorers, it will yield the same result.

The objective item can be classified into two types, which are

selection-type test item and supply-type test item.

a. Selection-Type Test Item

1. Multiple Choice

According to Anthony J. Nitko, “a multiple choice item

consists of one or more introductory sentences followed by a list of

two or more suggested responses from which the examinee

chooses one as the correct answer.”48

The other responses, which

are incorrect answers, function to draw students’ attention

away from the correct answer when they are uncertain of

it.

In line with that quote, “multiple choice items are made up of

an item stem, or the main part of the item at the top, a correct

answer, which is obviously the choice that will be counted correct,

and the distractors, which are those choices that will be counted as

incorrect.”49

For example:

Budi has been here ____________ half an hour.


The multiple choice item is commonly recognized as the most

applicable and useful type of objective test item. It can be used to

measure both knowledge outcomes and many types of skills. In

addition, it can measure a variety of learning outcomes from simple

to complex material.

The multiple choice item is a discrete point test.

A discrete point test takes language skills apart. Oller states that

“discrete items attempt to test knowledge of language one bit at a

time.”50

It means that language knowledge can be divided into a

number of components, such as grammar, vocabulary, spelling,

punctuation, pronunciation, intonation, and stress. This test only

measures the knowledge of language in one particular component.

Actually, it is not too difficult for a test maker or teacher to

construct multiple choice items. However, there are some

suggestions that they should consider in constructing this type of test

item:51

a. The stem of the item should be meaningful by itself and should show a specific problem.

b. The item stem should include as much of the item as possible and should be free of irrelevant material.

c. A negatively stated item stem can be used only when significant outcomes need it.

d. All of the alternatives should be grammatically consistent with the stem.

e. An item should contain only one clearly correct answer.

f. Items used to measure understanding should contain some novelty, but beware of too much.

g. All distracters should be plausible.

h. Verbal associations between the stem and the correct answer should be avoided.

i. The relative length of the alternatives should not provide a clue to the answer.

j. The correct answer should appear in each of the alternative positions and in equal number but in random order.

50 John W. Oller, Language Tests ..., p. 37.

k. The special alternatives such as “none of the above” or “all

of the above” can be used sparingly.

l. Do not use multiple choice item when other item types are more appropriate.

Although it is regarded as the most applicable and useful type

of test item, the multiple choice item has some limitations, such as:52

a. The technique tests only recognition knowledge. A multiple

choice item gives a quite inaccurate result of students’

ability in productive and receptive skills.

b. Guessing may have a considerable but unknowable effect

on test scores. We never know what part of any individual’s

score comes through guessing. So, we cannot identify the answer, no correct answer, the obvious clues in the options, ineffective distractors.

e. Backwash may be harmful. Practice at multiple choice items will not usually be the best way for students to improve their command of a language.

f. Cheating may be facilitated. Because responding to a multiple choice item is so simple, students can easily communicate answers to each other non-verbally.

Besides its limitations, the multiple choice item also has some

advantages. Wilmar Tinambunan lists the advantages of the multiple

choice item as follows:53

a. The multiple choice item can be used for subject matter content

at many different levels of behaviour, such as the ability to reason,

discriminate, interpret, analyze, infer, and solve problems.

b. It gives students less chance of guessing the right answer than the

true-false item does because it offers four or five

alternatives.

52 Arthur Hughes, Testing for Language ..., p. 76-78.

c. One advantage of the multiple choice item over the true-false

item is that students must also know what is correct rather than only

knowing that a statement is incorrect.

In the writer’s opinion, a multiple choice item includes at least

three components, which are the stem, the distractors, and the

correct answer. The stem can be a direct question or an incomplete

statement to which students have to respond. The distractors are

presented to draw students who have not studied well away from

the correct answer. This type is especially useful for

measuring learning outcomes that require the understanding,

application, or interpretation of factual information.

2. True-False

In the book, Criterion-Referenced Language Testing, true-false

item “requires student to respond to the language by selecting one

of two choices, for instance, between true and false or between

correct and incorrect.”54

In line with that opinion, Norman gives

the definition of true-false item as follows:

“True-false item is simply a declarative statement that the student must judge as true or false. There are modifications of

this basic form in which the student must respond “yes” or “no,” “agree” or “disagree,” “right” or “wrong,” “fact” or

“opinion,” and the like. Such variations are usually given the

more general name of alternative-response items. In any event this item type is characterized by the fact that only two

responses are possible.”55

For example:

Direction: Read each of the following statements. If the statement

is grammatically correct, circle the T. If the statement is

grammatically incorrect, circle the F!

54 James Dean Brown and Thom Hudson, Criterion-Referenced Language Testing, (Cambridge: Cambridge University Press, 2002), p. 66.

T F 1. Toni usually help her mother in cooking.

T F 2. Every student must bring their own book.

T F 3. If I had much money, I would buy a house.

T F 4. She is smarter in our classroom.

T F 5. The men are gathered in a conference room.

The most common use of the true-false item is to measure the

ability to identify the correctness of statements of fact, definitions of

terms, principles, etc., and to distinguish fact from opinion.56 It is

a. Include only one central, significant idea in each statement

b. Word the statement so precisely that it can be judged true or false unequivocally

c. Keep the statement short, and use simple language structure

d. Use negative statements sparingly, and avoid double negatives

e. Statements of opinion should be attributed to some source

f. Avoid extraneous clues to the answer

Moreover, Anthony J. Nitko states that this item type has some

advantages and criticisms:58

Advantages:

a. Certain aspects of the subject matter lend themselves to verbal propositions that can be judged true or false

b. Such items are relatively easy to write

c. They can be scored easily and objectively

d. They can cover a wide range of content with a relatively short period of testing

57 Norman E. Gronlund, Constructing Achievement ..., p. 55-56.

Criticisms:

a. They are often used only to test specific, frequently trivial, facts

b. They can be ambiguously worded

c. They can be answered correctly by blind guessing

d. They may encourage students to study and accept only oversimplified statements of truth and factual details

Thus, the true-false item is an item type which contains a single

written statement that students must judge as either

true or false. It is constructed to check and measure whether a

simple particular point has been comprehended or not.

3. Matching

“The matching item consists of two parallel columns with

each word, number, or symbol in one column being matched to a word, sentence, or phrase in the other column. The items in the column for which a match is sought are called premises and the items in the column from which the selection is made are called responses. They are useful in measuring students’ ability to make associations, discern relationship, make

interpretations or measure knowledge of a series of facts.”59

In other words, this item type presents students with two

columns of information in which they have to match the correct

response to each premise. It is typically used to measure

factual information or knowledge based on simple relationships.

Therefore, when learning outcomes concern the ability to

identify the relationship between two things, the matching item should

be the most appropriate. For example:

Match the following words on the left with their synonyms on the

right!


4. ( ) Appear d. Accept

5. ( ) Improve e. Accomplish

Furthermore, James Dean Brown formulates three guidelines

that teachers should apply in constructing matching items:60

a. More responses should be supplied than premises so that students cannot narrow down the choices as they go along by simply keeping track of the options that they have already used.

b. The responses should usually be shorter than the premises because most students will read a premise and then search through the options for the correct match.

c. The premises and responses should be logically related to one central theme that is obvious to the students.

Moreover, the matching item has some advantages

in testing. The first advantage is “its compact form, which

makes it possible to measure a large amount of related factual

material in a relatively short time.”61 Secondly, “the effects of

guessing is reduced since the student will have one chance out of a

number of responses available of guessing correctly.”62

Lastly, it is easy to construct.

4. Rearrangement

“Rearrangement items require the pupil to put into some

specified order a series of randomly presented material.”63

In the

book, Measurement and Evaluation in the Schools, Louis J.

Karmel states that any kind of specified order may be called for,

such as chronology, order of difficulty, order of importance,

length, weight, logic, and so on.64

60 James Dean Brown, Testing in Language ..., p. 57.

61 Norman E. Gronlund and Robert L. Linn, Measurement and Evaluation ..., p. 159.

63 H. H. Remmers, et. al., A Practical Introduction ..., p. 243.

For example:

Rearrange the following sentences into a good paragraph!

1. Suddenly, it was getting dark and he realized that he got lost

2. Once upon a time, there was a bee named Bumbee

3. Bumbee could get home and gathered with his family happily

4. One day, he felt so happy and flew alone in the forest

5. Fortunately, a butterfly appeared and she liked to help him

b. Supply-Type Test Item

1. Short-Answer

In his book, Constructing Achievement Test, Norman E. Gronlund

states that “the short answer (or completion)

item is the only objective item type that requires the examinee to

supply, rather than select, the answer.”65

In line with that opinion,

this item type “generally requires the students to examine a

statement or question then respond to it with a phrase or two, or a

sentence or two, in the space provided.”66

Both short answer item and completion item can be answered

by a word, phrase, sentence, number, or symbol. In the short

answer item, the question is presented as a direct question:

For example:

a. What is the capital city of West Java? (Bandung)

b. Who invented the lightbulb? (Thomas Alva Edison)

In contrast, the completion item requires students to supply the

answer in an incomplete statement.

For example:

a. The capital city of West Java is ... (Bandung)

b. The name of the man who invented the lightbulb is ...

(Thomas Alva Edison)

65 Norman E. Gronlund, Constructing Achievement ..., p. 57.

It seems obvious that short answer item or completion item

order not to make the items in a careless way:67

a. Require short, definite, clean-cut answers

b. If several correct answers (synonyms) are possible, count

e. Specify the terms in which the response is to be given

f. In testing for a knowledge and understanding of definitions, it is often better to provide the term and require a definition than to provide a definition and require the term

g. Direct questions are probably preferable to incomplete declarative sentences

h. Hints concerning the correct answer, in the form of the first letter of a word, or a number indicating the number of letters in a word, should generally not be given

i. The space for the response should usually be at the right of the question

j. Allow enough space for the responses to permit legible writing

k. Arranging the answer spaces in a column at the right-hand margin of the page makes scoring more convenient

Furthermore, the short answer item has some advantages and

disadvantages, as Arthur Hughes writes in his book, Testing for

Language Teachers:68

a. Advantages:

1. Guessing will (or should) contribute less to test scores

2. The technique is not restricted by the need for

3. Cheating is likely to be more difficult

4. Though great care must be taken, items should be easier to write

contents or parts are removed. Then, students are asked to fill in those

blank spaces. As James Dean Brown and Thom Hudson write,

“this format provides a language context of some sort and then removes part of the context and replaces it with a blank. The

student’s job is to fill in that blank.”69

For example:

1. He failed another exam, __________ he had studied very hard.

2. She does not come today. She __________ be sick.

3. Once upon a __________, there was a farmer living in a small

village in England. His __________ was Jack. He was a kind

and wise man. He liked to help his neighbors. Jack __________

a mill machine. People came to his place to __________ their

grain. Jack served them happily. However, his wife was a very

__________ woman. She often complained. She __________

angry every time Jack __________ some food to the

neighbors.70

69 James Dean Brown and Thom Hudson, Criterion-Referenced ..., p. 73.

In addition, the fill-in item measures the student’s ability to

produce language, even if only a small amount. However,

for the fill-in item to yield valid data, it is

important to tell students clearly that only one word can be put

in each blank or gap.

Furthermore, to use the fill-in item efficiently

in measuring students’ performance, teachers should

remember five considerations proposed by James Dean

Brown:71

a. Teachers should check to make sure that each item has one very concise correct answer

b. Teachers should make sure that enough context has been provided that the purpose, or intent, of the item is clear to those students who know the answer

c. All the blanks in a fill-in test should be the same length

d. Teachers should also consider putting the main body of the item before the blank in most of the items so that the students have the information necessary to answer the item once they encounter the blank

e. In situations, where the blanks may be very difficult and frustrating for the students, teachers might consider supplying a list of responses from which the students can choose in filling in the blanks

Furthermore, as one type of test item, the fill-in item has some

advantages and limitations:72

Advantages:

a. It is relatively easy to construct

b. It is flexible to use from a test writer’s point of view

c. It requires a short amount of time to administer

Limitations:

a. It is generally very narrowly focused on testing a single word

or short phrase at most

b. It may have a number of possible answers

71 James Dean Brown, Testing in Language ..., p. 58-59.

2. Essay Test

According to J. Stanley Ahmann and Marvin D. Glock in the book,

Educating Pupil Growth Principles of Test and Measurements, “an

essay test item demands a response composed by the pupil, usually in one or more sentences, of a nature that no single response or pattern of responses can be listed as correct, and the accuracy and quality of which can be judged subjectively only by one skilled and informed in

the subject, customarily the classroom teacher.”73

In addition, the major characteristic of essay test is the freedom of

response it provides. It means that students have to produce their own

answer.74 To support the opinion above, Wilmar Tinambunan states that,

“the essay-type question requires the examinee to read the question,

formulate his response and express the response in his own words.”75

Essay questions can be classified into two types, which are:

a. Restricted Response Type

The student is not given complete freedom in making his response.

“It usually limits both the content and the response. The content is

usually restricted by the scope of topic to be discussed. Limitations of

response are commonly indicated in the question.”76

For example:

1. State the main differences between the objective test and the

subjective test according to Norman E. Gronlund!

2. Explain two advantages and two disadvantages of using the

multiple choice item in testing English as a foreign language!

b. Extended Response Type

In this type, the student is given complete freedom in composing

his response. “It allows pupils to select any factual information that they

think is pertinent, to organize the answer in accordance with their best

judgment, and to integrate and evaluate ideas as they deem

appropriate.”77

73 J. Stanley Ahmann and Marvin D. Glock, Educating Pupil ..., p. 157.

74 Norman E. Gronlund, Constructing Achievement ..., p. 71.

For example:

1. Why is English so important nowadays?

2. Describe the roles of the teacher in language testing!

Moreover, building the essay item as a measurement of complex

learning outcomes should be done in a proper and careful way. Here are

some suggestions to construct a good essay item:78

1. Make definite provisions for preparing students for taking essay examinations

2. Make sure that questions are carefully focused

3. Structure the content and length of questions

4. Have a colleague review and critique the essay questions

5. Avoid the use of optional questions, except when one is assessing writing ability where a choice of questions is desirable

6. Restrict the use of the essay as an achievement test to those objectives for which it is best

As a method of measuring complex learning outcomes, the essay item

has several advantages and weaknesses.

Advantages:79

1. It measures complex learning outcomes that cannot be measured by other means

2. It emphasizes the integration and application of thinking and problem-solving skills

3. It is regarded as a device for improving writing skills

4. It has ease of construction. Most teachers can formulate several essay questions in a matter of minutes

Weaknesses:80

1. The sampling of achievement is limited because only a small

number of questions can be included in an essay test

78 Kenneth D. Hopkins, et. al., Educational and Psychological Measurement and Evaluation, (Englewood Cliffs, New Jersey: Prentice Hall Inc., 1990), 7th Ed., p. 216.

2. Scoring the essay test is influenced by the student’s writing ability. Poor

expression and errors in punctuation, spelling, and grammar usually lower

the score

3. While scoring essay test, the standards can be shifted because of

variations in the content of the answers from paper to paper

4. It requires much time to score the answers81

Thus, in an essay item, students are asked to demonstrate their ability to

select, organize, integrate and review ideas in responding freely to the

question. In addition, this item type is scored subjectively, since it may

yield different results when it is scored by different people. The

people who are assigned to score the answers are typically influenced by

their own judgment or opinion.

To sum up, based on the previous explanation, an essay test is used to

measure students’ comprehension of certain knowledge. Students are

asked to answer in their own words, expressing themselves effectively and organizing

their own ideas, using information from their own background and

knowledge.

D. Item Analysis

1. Definition of Item Analysis

Obtaining valid data is very valuable for making a

clear judgment about students’ performance in an evaluation activity. For

that reason, the test should be of good quality and every item should function

properly. The teacher or test maker should find out whether the test is

a good test or not by evaluating every item in it. This

activity is called item analysis.

According to Anthony J. Nitko, “item analysis refers to the process of

collecting, summarizing, and using information about individual test

items, especially information about pupil’s response to item.”82


In addition, “item analysis as a whole will be defined here as the

systematic statistical evaluation of the effectiveness of individual test items. Item analysis is usually done for purposes of selecting which items will remain on future revised and improved versions of the test. Sometimes, however, item analysis is performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of

interest.”83

Moreover, Arthur Hughes proposes the purpose of item analysis

which is “to examine the contribution that each item is making to the test.

Items that are identified as faulty or inefficient can be modified or

rejected.”84

Although item analysis is done primarily for response-choice items,

teachers can use several of the techniques described with any

items that are scored dichotomously (simply as correct or incorrect).85

In the writer’s opinion, item analysis is a statistical evaluation of

the quality of a test that identifies whether every item on the test works

appropriately or not. It is done by collecting students’ responses to each

item so that it can be known which items are good

and which items weaken the test. It is very useful for teachers to

perform item analysis since it can be a device for test improvement.

2. Kinds of Item Analysis

Item analysis usually concentrates on three vital features: level of

difficulty, discriminating power, and the effectiveness of each alternative.

“Thus, item analysis can tell us if an item was too difficult or too easy,

how well it discriminated between high and low scores on the test, and

whether all the alternatives functioned as intended.”86

83 James Dean Brown and Thom Hudson, Criterion-Referenced ..., p. 113.

a. Level of Difficulty

The first area in item analysis is level of difficulty, which concerns

how easy or difficult each item is. According to Kathleen M.

Bailey, difficulty level is “an index of how easy an individual item was

for the people who took it. It is typically printed as a decimal, ranging

from 0.0 to 1.0. It represents the proportion of people who got the item

right.”87

Furthermore, in the book, Language Tests at School, “difficulty

level (or item facility) has to do with how easy (or difficult) an item is

from the viewpoint of the group of students or examinees taking the

test of which that item is a part.”88

In the writer’s opinion, level of difficulty deals with the

percentage of students who respond to an item correctly and those who

respond incorrectly. By analyzing the difficulty level of each item, it

can be inferred whether an item is an easy, moderate, or

difficult item.
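As an illustration of the proportion described above, the following minimal Python sketch computes, for each item, the share of students who answered it correctly. The function name item_difficulty and the sample scores are hypothetical and are not taken from this study. The sketch assumes that every answer has been scored dichotomously (1 for correct, 0 for incorrect).

```python
def item_difficulty(responses):
    """Return, for each item, the proportion of students who answered correctly.

    responses: one list per student, holding 1 (correct) or 0 (incorrect)
    for every item, in the same item order for all students.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(student[i] for student in responses) / n_students
        for i in range(n_items)
    ]


# Hypothetical scored answers: 5 students, 3 items (illustration only).
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

difficulty = item_difficulty(scores)
for number, value in enumerate(difficulty, start=1):
    print(f"Item {number}: {value:.2f}")
```

In this made-up data, a value closer to 1.0 marks an easier item and a value closer to 0.0 marks a more difficult one, in line with the 0.0 to 1.0 range quoted from Bailey.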

Level of difficulty is interpreted in the form of a percentage. The

larger the percentage of correct answers, the easier the item is. Conversely,

the fewer the students who answer correctly, the more difficult the

test and the valid data of information about student’s achievement will

not be acquired.

In addition, level of difficulty analysis can be applied to either

a large group of students or a small one.

87 Kathleen M. Bailey, Learning about Language Assessment: Dilemmas, Decisions, and Directions, (New York: Heinle & Heinle Publishers, 1998), p. 132.

Lyle F. Bachman states that “to conduct an item analysis, we first arrange the scored test papers or answer sheets in order from the highest score to the lowest score. Next, we separate the papers into upper and lower groups, according to their total test scores. For large groups, we would choose the upper and lower 27 percent, while for smaller groups, we would typically choose the upper and lower one-third.”89

The formula used for analyzing the difficulty level of each item in

a large group is stated below:

$$TK = \frac{U + L}{T}$$

In which:

TK : Index of difficulty

U : The number of students in the upper group who answer the

item correctly

L : The number of students in the lower group who answer the

item correctly

T : The number of students in upper and lower group90
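The procedure for a large group can be sketched in Python as follows. This is only an illustration under stated assumptions: the index is taken to be the proportion of correct answers in the two groups combined, (U + L) / T, which is what the symbol definitions above suggest. The function name upper_lower_difficulty, the rounding of the group size, and the sample data are hypothetical and do not come from this study.

```python
def upper_lower_difficulty(responses, fraction=0.27):
    """Compute TK = (U + L) / T for every item from upper and lower groups.

    responses: one list per student, holding 1 (correct) or 0 (incorrect).
    fraction:  share of students placed in each group (0.27 for large
               groups, about one-third for smaller groups).
    """
    # Arrange the papers from the highest total score to the lowest.
    # Students with equal totals keep their original relative order.
    ranked = sorted(responses, key=sum, reverse=True)

    # Assumption: the group size is rounded to the nearest whole student.
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    total = 2 * group_size  # T: students in the upper and lower groups

    indices = []
    for i in range(len(ranked[0])):
        u = sum(student[i] for student in upper)  # U: correct in upper group
        l = sum(student[i] for student in lower)  # L: correct in lower group
        indices.append((u + l) / total)
    return indices


# Hypothetical scored answers: 10 students, 4 items (illustration only).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

tk = upper_lower_difficulty(scores)
for number, value in enumerate(tk, start=1):
    print(f"Item {number}: TK = {value:.2f}")
```

With ten students and a fraction of 0.27, each group holds three students, so T is 6 in this example. Passing fraction=1/3 would apply the one-third split mentioned for smaller groups.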

Next, for a small group, the teacher or test maker can easily evaluate

an item by using all the students’ answer sheets. Then, the formula is:

89