The Analysis of Validity, Reliability, Discrimination Power, and Level of Difficulty of First Mid-Term Test. The case of eighth grade students of SMP 33 Semarang.


THE ANALYSIS OF VALIDITY, RELIABILITY,

DISCRIMINATION POWER AND LEVEL OF DIFFICULTY

OF FIRST MID-TERM TEST IN THE CASE OF THE EIGHTH

GRADE STUDENTS OF SMP 33 SEMARANG

(In the Academic Year of 2008/2009)

a final project

submitted in partial fulfillment of requirements for the degree of Sarjana Pendidikan

in English

by Ajeng Desy H

2201405080

ENGLISH DEPARTMENT

FACULTY OF LANGUAGES AND ARTS

SEMARANG STATE UNIVERSITY


APPROVAL

This final project was approved by the Board of Examiners of the English Department of the Faculty of Languages and Arts of Semarang State University on 19th August 2009.

Board of Examiners

1. Chairperson

Dra. Malarsih, M.Sn NIP. 131764021

2. Secretary

Drs. Ahmad Sofwan, PhD NIP. 131813664

3. First examiner

Drs. Suprapto, M. Hum NIP. 131125925

4. Second examiner/second advisor

Frimadhona Syafri, S.S, M. Hum NIP. 132300419

5. Third examiner/first advisor

Drs. Amir Sisbiyanto, M. Hum NIP. 131281220

Approved by

Dean of Faculty of Languages and Arts


STATEMENT

I, the undersigned,

Name : Ajeng Desy Hidayati

NIM (Student ID Number) : 2201405080

Study Program/Department : English Education

Faculty of Languages and Arts, Semarang State University, hereby truly declare that the skripsi/final project entitled "THE ANALYSIS OF VALIDITY, RELIABILITY, DISCRIMINATION POWER AND LEVEL OF DIFFICULTY OF FIRST MID-TERM TEST IN THE CASE OF EIGHTH GRADE STUDENTS OF SMP 33 SEMARANG IN THE ACADEMIC YEAR 2008/2009", which I wrote in partial fulfillment of the requirements for the bachelor's degree, is truly my own work, produced after going through research, supervision, discussion, and presentation/examination. All quotations, whether direct or indirect, and whether obtained from library sources, electronic media, direct interviews, or other sources, have been acknowledged in the manner customary in academic writing.

Therefore, although the board of examiners and the advisors of this skripsi/final project have signed it as a sign of its validity, the entire content of this academic work remains my own responsibility. If any irregularity is found later, I am willing to accept the consequences.

Accordingly, this statement may be used as necessary.

Semarang, August 2009 The undersigned


They only are the (true) believers whose hearts feel fear when Allah is mentioned,

and when the revelations of Allah are recited unto them they increase their faith,

and who trust in their Lord (Al-Anfaal: 2)

Where there is a will, there is a way.

Dedicated to: My parents My brothers


ACKNOWLEDGEMENT

Foremost, I wish to take this opportunity to express my gratitude to God the Almighty for His blessing and inspiration and for leading me to complete this final project.

First of all, I address my deepest appreciation to Drs. Amir Sisbiyanto, M. Hum, my first adviser, who gave me valuable guidance and unfailing encouragement from the beginning until this final project was completed. I also extend my gratitude to Frimadhona Syafri, S.S, M. Hum, my second adviser, who gave many suggestions and corrections for its improvement.

In addition, my thanks go to Mrs. Endang Sarwo Sri, S. Pd, the headmaster of SMP 33 Semarang, who gave me permission to conduct the experimental study there. Special thanks also go to Mrs. Aniek Rita, the eighth grade teacher of SMP 33 Semarang, who helped me to conduct the experimental study there.

Furthermore, I owe a special debt of gratitude to all members of the teaching staff of the English Department for their continuous guidance during my years of study there.


ABSTRACT

Hidayati, Ajeng Desy. 2009. The Analysis of Validity, Reliability, Discrimination Power, and Level of Difficulty of First Mid-Term Test. The case of eighth grade students of SMP 33 Semarang. In the academic year 2008/2009. Final Project. English Department. Faculty of Languages and Arts. Semarang State University. First advisor: Drs. Amir Sisbiyanto M. Hum. Second advisor: Frimadhona Syafri S.S M.Hum

Key words: validity, reliability, discrimination power, level of difficulty, test item.

A good English test helps students learn the language by requiring them to study hard, emphasizing the objectives of the course, and showing them which parts of the course they need to improve. A test intended to measure students' achievement has to fulfill the requirements of a good test, such as validity and reliability. Several factors contribute to building a good test: relevance, balance, efficiency, specificity, difficulty, discrimination, variability, and reliability.

In this study, the writer focuses her research on the English mid-term test administered to eighth grade students of SMP 33 Semarang in the academic year 2008/2009. The study seeks to answer the following question: "How good is the English mid-term test for the eighth grade of SMP 33 Semarang in the academic year 2008/2009?" The general objective of the study is to obtain an objective description of the structure of a good test item. The writer used a quantitative approach to analyze the data. In writing this final project, the writer conducted two activities. The first is library activity, in which the writer selected books that give information or supporting data for reference. The second is field activity, which was used to collect the data.

The analysis shows that the test has 33 valid items and 17 invalid items. The reliability of the test is 0.39, so the test is still considered reliable. In terms of discrimination power, the test can be considered poor because the mean discrimination power is 0.17: there are 8 good items, 13 marginal items, and 29 poor items. In terms of difficulty level, the test is categorized as moderate because the mean difficulty index is 0.41: there are 11 difficult items, 34 moderate items, and 5 easy items. Based on these results, the writer offers some suggestions.


LIST OF TABLES

Table


LIST OF APPENDICES

Appendix

1. Analysis of Each Item

2. Computation of Reliability

3. Computation of Discrimination Power

4. Computation of Level of Difficulty

5. List of Students in the Upper Group and Lower Group


CHAPTER I

INTRODUCTION

1.1 Background of the Study

Language is a means of communication. By using language, people can express their feelings, thoughts, and ideas. People use language to communicate with others in fulfilling their daily needs. In fact, language plays an important role in human life. English, as the first international language, is important in global communication. The English subject develops the ability to communicate in English, both spoken and written.

Realizing the role of English, the government has included English as a compulsory subject in Junior High School, Senior High School, and even Elementary School. In Elementary School, English is taught as part of the local content curriculum. The government makes an effort to increase the quality of education, especially of English. Therefore, children are introduced to English as early as possible.


According to Madsen (1983:3-5), testing is an important part of every teaching and learning experience. A well-made test of English can help students in at least two ways. First, an English test can help create positive attitudes toward instruction by giving students a sense of accomplishment and a feeling that the teacher's evaluation of them matches what he has taught them. Second, good English tests help students learn the language by requiring them to study hard, emphasizing course objectives, and showing them where they need to improve.

A good test should fulfill some requirements, such as validity and reliability. According to Ebel (1979:232), several factors build a good test: relevance, balance, efficiency, specificity, difficulty, discrimination, variability, and reliability. To make a good test, teachers should make sure that the test meets the requirements stated above.

Finally, by analyzing the items of the mid-term test of eighth grade students of SMP 33 Semarang in terms of validity, reliability, discrimination power, and level of difficulty, the writer hopes that test writers can build good tests for each grade.

1.2 Reason for Choosing the Topic


good quality. A good test must be reliable and valid, and it must be moderate in terms of difficulty level, which means that the test is neither too difficult nor too easy. The test must also meet the criterion of discrimination power, under which items are classified as satisfactory, good, reasonably good, or poor. Items considered poor should be discarded, because such items cannot distinguish well between the students in the upper group and the students in the lower group.

1.3 Statement of the Problem

Through this study, the writer would like to find the answer to the following question: "How good is the mid-term test for eighth grade students of Junior High School?"

More specifically, in analyzing the test items, the writer limits the problem to the following questions:

(1) What is the validity of the test items?

(2) What is the reliability of the test items?

(3) What is the difficulty level of the test items?

(4) What is the discrimination power of the test items?

1.4 Objective of the Study

The general objective of the study is to obtain an objective description of the structure of a good test item. This objective is then specified into the following goals:

(1) To describe the validity of each test item.

(2) To describe the reliability of each test item.


1.5 Significance of the Problem

The benefits that can be gained from this study are as follows:

(1) For students: Students can use the result of the study to make their study more effective with regard to the right materials.

According to Madsen (1983:4), a good test of English can help students in at least two ways. First, such a test can help create positive attitudes toward instruction by giving students a sense of accomplishment. Second, the English test can help students learn the language by requiring them to study hard, emphasizing course objectives, and showing them where they need to improve.

(2) For the teacher: Teachers can use the result of the study as a reference when they want to analyze test items. The test plays several important roles, such as providing insight into ways of improving the evaluation process and providing a means for teachers to diagnose whether they have taught effectively.

(3) For the test maker: The test maker may use it as a supplement in constructing tests.

(4) For the writer: The study can increase the writer's own skill in constructing test items.

1.6 Limitation of the Study

The writer analyzes the English mid-term test of eighth grade students of Junior High School, which is in the form of multiple-choice items, in the belief that:


(2) By using this type of item analysis, the discrimination power and difficulty level of the test can be practically determined.

1.7 Outline of the Research


CHAPTER II

REVIEW OF RELATED LITERATURE

2.1 Characteristics of a Good Test

Tests contribute directly to the teaching-learning process used in classroom instruction, and they are useful in programmed instruction, curriculum development, marking, guidance and counseling, school administration, and research (Gronlund, 1976:7).

Regarding the test roles, Vallete (1977:3) states “… classroom test plays three important roles in the second language program: they define course objective, they stimulate students’ progress, and they evaluate class achievement.”

Tests provide information that teachers and students can use to judge the success of their efforts to teach and to learn. The need for good tests of educational achievement is becoming more intense. According to Ebel (1979:14), we are far better served by the imperfect tests we now use than we would be by no tests at all.

Before constructing a test, we must recognize the characteristics of a good test. A good test should be valid and reliable. As Harris (1969:13) states, any test we use must be appropriate in terms of our objectives, dependable in the evidence it provides, and applicable to our particular situation.


with the specific use to be made of the results and with the truthfulness of our proposed interpretation.

The second characteristic that a good test must meet is reliability. Reliability refers to the consistency of a test score, that is, how consistent it is from one measurement to another. Vallete and Harris make the same point about reliability: it refers to the stability of the test score. An important consideration, then, is determining whether or not a test is reliable.

Besides that, tests can be classified into several types according to the following criteria:

A. Based on the function of the test.

Vallete divides tests into the following types:

(1) Summative test. This test is usually given at the end of a marking period (at the end of the academic year or term) and measures the ‘sum’ total of the material covered.

(2) Formative test. This test is given during the course of instruction; its purpose is to show which aspects of the chapter the student has mastered and where remedial work is necessary (Vallete, 1977:11).

B. Based on the way the test is scored.

This type of test is also divided into two kinds. They are:

(2) Subjective test. It is one that does not have a single right answer. The result of the test may differ if it is scored by different persons (Vallete, 1977:10).

C. Based on the test constructor.

Harris divides this type of test as follows:

(1) Standardized test. It is a formal, large-scale test prepared by professional testing services to assist institutions in the selection, placement, and evaluation of students. Usually, its validity and reliability have been established.

(2) Teacher-made test. It is generally prepared, administered, and scored by a teacher (Harris, 1969:1).

D. Based on the objective of the test.

Harris divides this type of test into three kinds. They are:

(1) Achievement test, to measure the extent of students' achievement of the instructional goals.

(2) Aptitude test (prognostic test), to determine whether or not students will be successful in a certain field of study.

(3) General proficiency test, to measure what a person already knows (has learned) in the target language, with the aim of determining whether this language ability corresponds to specific language requirements (Harris, 1969:2).


the students in the lower group. The level of difficulty refers to the percentage of students who got the item right.

2.2 Multiple-choice Items

According to McNamara (2000:5), the multiple-choice format is a format for test questions in which candidates have to choose from a number of presented alternatives, only one of which is correct. Brown (2004:194) states that the most popular method of testing reading knowledge of vocabulary and grammar is the multiple-choice format, mainly for practical reasons: it is easy to administer and can be scored quickly.

According to John Boker in http://www.uab.edu/uasomume/cdm/test.htm, the advantages and disadvantages of multiple-choice items are as follows:

a. Advantages: Multiple-choice tests can measure all levels of student ability, they enable wide sampling of subject content, they are quick and easy to score, they enable objective scoring, and they can be analyzed for effectiveness.

b. Disadvantages: Good multiple-choice items are difficult to construct, and they tend to measure simple recall.

2.2.1 The Uses of Multiple-Choice Items

According to Gronlund (1998:60-75) multiple-choice items are appropriate for both classroom based and large-scale situations. Gronlund (1982:39) suggested that the multiple-choice items could be used to measure both knowledge outcomes and various types of intellectual skills.

Gronlund (1976:190-195) states that multiple-choice items can be used to measure:


b. Knowledge of specific facts

c. Knowledge of conventions

d. Knowledge of trends and sequences

e. Knowledge of classifications and categories

f. Knowledge of criteria

g. Knowledge of methodology

h. Knowledge of principles and generalizations

i. Knowledge of theories and structures

2.2.2 Characteristic of Multiple-choice Items

A multiple-choice item consists of a problem and a list of suggested solutions (Gronlund, 1976:188). The problem, which may be stated in the form of a direct question or an incomplete statement, presents the problem situation and is called the stem of the item. The list of suggested solutions, which may include words, numbers, symbols, or phrases, is called the alternatives.

The alternatives include the correct answer and several plausible wrong answers, which are called distracters. The function of the distracters is to distract those students who are not certain of the answer.

2.2.3 Rules for Constructing Multiple-Choice Items


a. The stem of the item should be meaningful by itself and should present a definite problem.

b. Present a single, clearly formulated problem in the stem of the item.

c. State the stem of the item in simple, clear language.

d. Put as much of the wording as possible in the stem of the item.

e. State the stem in positive form, wherever possible.

f. Emphasize negative wording whenever it is used in the stem of an item.

g. Make certain that the intended answer is correct or clearly best.

h. Make all alternatives grammatically consistent with the stem of the item and parallel in form.

i. Avoid verbal clues that might enable students to select the correct answer or eliminate an incorrect alternative.

j. Make the distracters plausible and attractive to the uninformed.

k. Vary the relative length of the correct answer to eliminate length as a clue.

l. Vary the position of the correct answer in a random manner.

Most of the test items in final tests or mid-term tests in junior or senior high schools are in the form of multiple-choice items. The writer considers that the reason lies in the principles of constructing test items: multiple-choice items are practical, and their scoring procedure is specific and efficient. In addition, both test takers and test makers are used to the multiple-choice format.


Validity refers to whether or not a test measures what it proposes to measure. Thus, a test cannot be valid unless it is also reliable, for an unreliable test does not measure anything consistently.

Gronlund (1976:81-97) claims that there are three basic types of validity that are commonly used in educational and psychological measurement. They are:

a. Content validity

It may be defined as the extent to which a test measures a representative sample of the subject-matter content and the behavioral changes under consideration.

b. Criterion-related Validity

It is the extent to which test performance is related to some other valued measure of performance, for example when test scores are to be used to predict future performance on some valued measure other than the test itself.

c. Construct validity

Construct validity may be defined as the extent to which test performance can be interpreted in terms of certain psychological constructs. A number of factors tend to influence the validity of test results. Gronlund (1979:98-100) points out that some of these influences can be found in the instrument itself, some in the pupils' typical responses to the test situation, and still others in the nature of the group tested and in the composition of the criterion measures used.


this case, the items of the mid-term test of Junior High School have to meet the criteria of what the test measures.

2.4 Reliability

Reliability deals with the consistency of the results, that is, how consistent test scores or other evaluation results are from one measurement to another. If a test is reliable, then a student's score on it, when compared to the scores of his classmates, should be similar to his relative score on another test measuring the same information. Gronlund (1982:132) claims that reliability refers to the consistency of test scores, that is, how consistent they are from one measurement to another. Reliability measures provide an estimate of how much variation we might expect under different conditions.

Reliability is the consistency of the test: how consistent or repeatable the test is. If the test is reliable, the first test and the next test give the same measure. In other words, a student's standing relative to his classmates on one test will not change on the next test covering the same field.

2.5 Item Discrimination Power

Item discrimination or discrimination power explains how well the items perform in separating the better students from the poorer ones. If the good students tend to do well on an item and the poor students do badly on the same item, then the item is a good one because it distinguishes the good students from the poor ones. This is the principle underlying the index of discrimination.


(1) Find the number of students in the upper group who got the item right.

(2) Find the number of students in the lower group who got the item right.

(3) Subtract the number getting it right in the lower group from the number getting it right in the upper group.

(4) Divide this figure by one half of the total number of papers in the upper and lower groups.

$$D = \frac{RU - RL}{\tfrac{1}{2}T}$$

D : Discrimination power

RU: The number of the students in the upper group who answer the item correctly.

RL: The number of the students in the lower group who answer the item correctly.

½ T: One half of the total number of the students included in the item analysis.

(Gronlund, 1981:259)

(1.0): All the students in the upper group answer correctly and no one in the lower group does.

(.00): obtained when an equal number of students in the upper and lower groups answer the item correctly.

(-): obtained when more students in the lower group than in the upper group answer correctly.
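The four steps and the formula above can be sketched in Python as follows. This is a minimal illustration assuming the counts RU and RL have already been tallied from the answer sheets; the function name `discrimination_index` and the sample counts are hypothetical, not taken from the study's data.

```python
def discrimination_index(ru: int, rl: int, total: int) -> float:
    """Return D = (RU - RL) / (T / 2).

    ru:    students in the upper group who answered the item correctly (RU)
    rl:    students in the lower group who answered the item correctly (RL)
    total: students in the upper and lower groups together (T)
    """
    if total <= 0:
        raise ValueError("total must be a positive number of students")
    return (ru - rl) / (total / 2)


# Hypothetical counts for four items, 10 students in each group (T = 20).
for ru, rl in [(10, 0), (8, 3), (5, 5), (2, 6)]:
    print(ru, rl, discrimination_index(ru, rl, 20))
```

The first, third, and fourth hypothetical items reproduce the three reference cases listed above: D = 1.0 when only the upper group answers correctly, D = 0.00 when both groups do equally well, and a negative D when the lower group outperforms the upper group.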


discrimination or negative discrimination power means that the item is bad. The item with negative value means that the students in the lower group perform better than the students in the upper group. This item must be revised or discarded.

2.6 Item Difficulty

The difficulty of a test item is indicated by the percentage of students who get the item right. The more difficult the item, the fewer the students who select the correct option; the easier the item, the more students select the correct one.

Teachers often hold a mistaken opinion: some feel that they can win their students' respect by giving them easy tests, while others give more difficult test items in order to gain respect from students and parents.

There are some factors in determining the difficulty level of test items. According to Mahren and Lehman (1984:31), the concept of difficulty, or the decision of how difficult the test should be, depends on a variety of factors. They are:

(1) The purpose of the test

(2) The ability level of the students

(3) The age or grade level of the students

The index of item difficulty (P) can be computed in two ways. The first is by dividing the number of students who answered an item correctly (R) by the total number of students tested (T), and then multiplying by one hundred.


The second way is by dividing the students into the upper group and the lower group only, and assuming that the responses of the students in the middle group follow essentially the same pattern.

The index of difficulty runs from 0.00 to 1.00, with 1.00 indicating the easiest possible item. An index in the range from 0.31 to 0.70 indicates that the item is moderate or acceptable, while the most difficult items fall in the range from 0.00 to 0.30.

By knowing how many students answered the item correctly, we can calculate the item difficulty. It can be calculated in two ways, but in this final project the writer used the first way, that is, dividing the number of students who got the item right by the total number of students tested and multiplying by one hundred. In this way, we can know which items are difficult, moderate, and easy.
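The two ways of computing the difficulty index can be contrasted in a short Python sketch. The first function follows the text directly (P = R / T). For the second way the text gives no formula, so the sketch assumes the usual form P = (RU + RL) / T, where T counts only the students in the upper and lower groups; that form, the function names, and the sample counts are all hypothetical.

```python
def difficulty_all_students(r: int, t: int) -> float:
    """First way: P = R / T using every student tested.

    Multiply the result by 100 to express it as a percentage.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    return r / t


def difficulty_upper_lower(ru: int, rl: int, t_groups: int) -> float:
    """Second way (assumed form): P = (RU + RL) / T, where T counts only the
    students in the upper and lower groups; the middle group is assumed to
    follow the same pattern and is left out."""
    if t_groups <= 0:
        raise ValueError("t_groups must be positive")
    return (ru + rl) / t_groups


# Hypothetical item: 30 of 40 students answered correctly.
p = difficulty_all_students(30, 40)
print(p, p * 100)  # 0.75 75.0
# Hypothetical groups of 10: 9 correct in the upper group, 3 in the lower.
print(difficulty_upper_lower(9, 3, 20))  # 0.6
```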

2.7 Curriculum

The curriculum used in the eighth grade of SMP 33 Semarang is KTSP (Kurikulum Tingkat Satuan Pendidikan). The KTSP materials given to the students include:

1) Descriptive Text

2) Recount Text

3) Narrative Text

4) Asking and Giving Permission

5) Asking for and Giving Help

6) Refusing Help

9) Invitation

10) Announcement

11) Short Message


CHAPTER III

METHODS OF INVESTIGATION

In the third chapter, the writer presents the population, sample and sampling technique, identification of the problems, technique of data collection, and technique of data analysis. In this research, the writer used two kinds of methods in order to get the data required in this study, namely library research and analysis of students' work.

3.1 Population

The population of this study is the eighth grade students of Junior High School in the first semester, which covers three classes, each consisting of forty students. Therefore, the total population is one hundred and twenty students.

3.2 Sample and Sampling Technique

To make the study effective, the writer selects a sample. The writer takes three classes at the school, each consisting of forty students, so the total sample is one hundred and twenty students. In this study, the writer uses the random sampling technique to take the sample.


3.3 Identification of the Problem

Most teachers of Junior and Senior High Schools still do not know how to construct a good test. They make tests without paying attention to the characteristics or the quality of a good test.

There are four problems related to the teacher-made English test items. The problems are:

a. The validity level

b. The reliability level

c. The difficulty level

d. The discrimination power

3.4 Technique of Data Collection

In this study, the intended test is the mid-term test of eighth grade students of Junior High School. The data are in the form of the students' answer sheets and the items of the mid-term test of eighth grade students of Junior High School. The writer selected the eighth grade students of Junior High School to get the required data. Before the test was administered to the students, the writer contacted the English teacher of the selected school to ensure that they were no longer used. Then, she began to analyze the test.

3.5 Technique of Data Analysis


The purpose of this analysis is to identify the quality of each item, that is, whether it belongs to the good, moderate, or bad items. Through item analysis, we can also find information about the weaknesses or shortcomings of the items. Here, the item analysis consists of the following:

3.5.1 Analysis of Validity

Validity refers to whether or not a test measures what it is supposed to measure. In this study, the writer used content validity, which means that the items should measure what they are supposed to measure. The writer compared each item with the KTSP curriculum: if an item met the criteria of the curriculum material, it was considered valid; otherwise, it was considered invalid.

3.5.2 Analysis of Reliability

The formula used to estimate the reliability of the test is the Kuder-Richardson 20 formula. According to Brown (2005:185), the Kuder-Richardson 20 formula is the most accurate and flexible formula to calculate reliability. The formula is:

$$KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum S_i^2}{S_t^2}\right)$$

KR20 = Kuder-Richardson formula 20

k = number of items

Si² = item variance

St² = test score variance (Brown, 2005:181)

$$S_t^2 = \frac{\sum Y^2 - \frac{(\sum Y)^2}{N}}{N}$$

St² = the variance of the total scores

Y = the total score of each respondent (ΣY is the sum of these total scores)

N = the number of respondents
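The KR-20 computation, together with the test score variance formula above, can be sketched in Python. This sketch assumes each item is scored 1 (right) or 0 (wrong), so each item variance Si² reduces to p(1 − p), where p is the proportion of students answering the item correctly. The test score variance divides by N, matching the formula above. The function name `kr20` and the small data set are hypothetical.

```python
def kr20(responses: list[list[int]]) -> float:
    """Kuder-Richardson 20 for items scored 1 (right) or 0 (wrong).

    responses[s][i] is the score of student s on item i.
    """
    n = len(responses)                      # N, number of respondents
    if n == 0:
        raise ValueError("no responses")
    k = len(responses[0])                   # k, number of items
    if k < 2:
        raise ValueError("KR-20 needs at least two items")

    # Test score variance: St² = (ΣY² - (ΣY)² / N) / N
    totals = [sum(row) for row in responses]  # Y, each student's total score
    st2 = (sum(y * y for y in totals) - sum(totals) ** 2 / n) / n
    if st2 == 0:
        raise ValueError("total scores have no variance")

    # Sum of item variances: for a 0/1 item, Si² = p * (1 - p)
    sum_si2 = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        sum_si2 += p * (1 - p)

    return k / (k - 1) * (1 - sum_si2 / st2)


# Hypothetical data: 4 students, 3 items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(scores))  # 0.75
```

In this toy data the total scores are 3, 2, 1, and 0, giving St² = 1.25 and ΣSi² = 0.625, so KR-20 = 1.5 × (1 − 0.5) = 0.75.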

3.5.3 Difficulty Level Analysis

An item is considered easy for a group of students if more than 75 percent of the group answer it correctly. If between 25 percent and 75 percent of the students in a group answer an item correctly, the item is considered moderate. A hard item is one that fewer than 25 percent of the students answer correctly.

Based on the above criteria, the difficulty levels used in this study are:

(1) 0.00 ≤ P ≤ 0.25: difficult item

(2) 0.25 < P ≤ 0.75: moderate item

(3) 0.75 < P ≤ 1.00: easy item

The formula is:

$$P = \frac{R}{T}$$

P = difficulty level or index of difficulty

R = the number of students responding correctly to the item

T = the total number of students tested
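The difficulty formula and the three bands can be applied in a few lines of Python. This is a minimal sketch assuming the bands exactly as listed above (P ≤ 0.25 difficult, 0.25 < P ≤ 0.75 moderate, otherwise easy). The function names and the correct-answer counts are hypothetical, chosen only to land one item in each band.

```python
def difficulty_index(r: int, t: int) -> float:
    """P = R / T: the proportion of students who answered the item correctly."""
    if t <= 0:
        raise ValueError("t must be positive")
    return r / t


def classify_difficulty(p: float) -> str:
    """Apply the bands above: P <= 0.25 difficult, P <= 0.75 moderate, else easy."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie between 0 and 1")
    if p <= 0.25:
        return "difficult"
    if p <= 0.75:
        return "moderate"
    return "easy"


# Hypothetical counts of correct answers for three items, 120 students tested.
correct_counts = [20, 60, 100]
indices = [difficulty_index(r, 120) for r in correct_counts]
for r, p in zip(correct_counts, indices):
    print(r, round(p, 2), classify_difficulty(p))
print("mean P:", round(sum(indices) / len(indices), 2))
```

The mean of the item indices, printed on the last line, is the kind of figure used in Chapter IV to characterize the test as a whole.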


Item discrimination tells how well the item performs in separating the better students from the poorer students. The formula for computing item discrimination power is as follows:

$$D = \frac{RU - RL}{\tfrac{1}{2}T}$$

D = the index of discrimination power (DP)

RU = the number of students in the upper group who answer the item correctly

RL = the number of students in the lower group who answer the item correctly

½T = one half of the total number of students included in the item analysis

(Gronlund, 1982:103)

(.00) is obtained when an equal number of students in each group answer correctly.

(1.00) is the highest value, indicating that all students in the upper group got the item right and all students in the lower group got it wrong.

(-) is obtained when more students in the lower group answer correctly than students in the upper group.

Items with zero or negative DP should be removed from the test and then discarded or improved. Ebel and Frisbie (1991:232) classify the discrimination power values as follows:

Discrimination index | Item evaluation

0.40 and above | Very good items

0.30-0.39 | Reasonably good, but possibly subject to improvement

0.20-0.29 | Marginal items, usually needing and being subject to improvement

0.19 and below | Poor items, to be rejected or improved by revision
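Ebel and Frisbie's bands can be applied to computed D values as in the following sketch. It assumes that any D below 0.20 counts as poor, which closes the small gap between 0.19 and 0.20 in the table. That boundary choice, the function name `evaluate_discrimination`, and the sample D values are hypothetical. Zero and negative values are flagged separately, following the rule stated above that such items should be removed.

```python
def evaluate_discrimination(d: float) -> str:
    """Map a discrimination index D to the Ebel and Frisbie categories.

    Values below 0.20 (including zero and negative D) fall in the last band.
    """
    if d >= 0.40:
        return "very good"
    if d >= 0.30:
        return "reasonably good"
    if d >= 0.20:
        return "marginal"
    return "poor"


# Hypothetical D values for five items.
for d in [0.55, 0.35, 0.20, 0.10, -0.25]:
    flag = "  (zero or negative: remove)" if d <= 0 else ""
    print(f"{d:+.2f} {evaluate_discrimination(d)}{flag}")
```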

In estimating the discrimination power, the writer divided the sample into three groups, namely the upper group, the lower group, and the middle group. To divide the sample into the upper and lower groups, the writer ranked the sample from 1 to 120. From the ranks, the writer classified the 27% of students who got the highest scores in the whole sample as the upper group, and the 27% of students who got the lowest scores as the lower group. The remaining students were categorized as the middle group. The list of upper group and lower group students can be seen in appendix 6.
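The 27% split can be sketched in Python as follows. The text does not say how 27% of 120 (32.4 students) was rounded or how tied scores were ranked. This sketch rounds to the nearest whole student and keeps tied students in their input order, and both choices are assumptions. The function name `split_groups` and the student labels are hypothetical.

```python
def split_groups(scores: dict[str, int], fraction: float = 0.27):
    """Rank students by score and return (upper, middle, lower) lists of names.

    The group size is len(scores) * fraction rounded to the nearest whole
    student; ties in score keep the input order (Python's sort is stable).
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ranked = sorted(scores, key=scores.get, reverse=True)  # rank 1 = highest
    size = round(len(ranked) * fraction)
    upper = ranked[:size]
    lower = ranked[len(ranked) - size:]
    middle = ranked[size:len(ranked) - size]
    return upper, middle, lower


# Hypothetical class of 120 students whose scores are simply 1..120.
scores = {f"S{i:03d}": i for i in range(1, 121)}
upper, middle, lower = split_groups(scores)
print(len(upper), len(middle), len(lower))  # 32 56 32
```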


CHAPTER IV

RESULT OF THE STUDY

4.1 Result of the Analysis

This study analyzed four aspects of the test items, namely validity, reliability, discrimination power, and level of difficulty. The aim of this study is to analyze the items of the mid-term test of eighth grade students of SMP 33 Semarang.

By analyzing these four aspects of the test items, we can identify whether each item is good, moderate, or poor. We can also identify the weaknesses of the items and how to revise them. From the analysis of the mid-term test items of eighth grade students of SMP 33 Semarang, we obtained the following data.

4.2.1 Analysis of Item Validity

The writer used Pearson's product-moment correlation to calculate the validity coefficient of each test item. The value for each item was then compared with the table of r product-moment values. A test item is categorized as valid if its calculated r is higher than the value in the table, and as invalid otherwise.
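The item validity check can be sketched in Python with the raw-score form of Pearson's product-moment formula. The sketch assumes each item's 0/1 scores are correlated with the students' total test scores, including the item itself (the text does not say whether the item was removed from the total). The function names, the toy data, and the `r_table` value of 0.5 are hypothetical placeholders; the real critical value has to be read from an r table for the actual N and the chosen significance level.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation in its raw-score form:
    r = (NΣXY - ΣXΣY) / sqrt((NΣX² - (ΣX)²)(NΣY² - (ΣY)²))
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length, at least 2")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("one of the variables has no variance")
    return numerator / denominator


def item_is_valid(item_scores, total_scores, r_table: float) -> bool:
    """An item counts as valid when its calculated r exceeds the critical r."""
    return pearson_r(item_scores, total_scores) > r_table


# Hypothetical data: 4 students' scores on one item (0/1) and on the whole test.
item = [1, 1, 0, 0]
total = [4, 3, 2, 1]
print(round(pearson_r(item, total), 3))
# r_table = 0.5 is a placeholder, not a value from an actual r table.
print(item_is_valid(item, total, r_table=0.5))
```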

From the validity calculation, the writer obtained the following result:

(1) 33 test items fulfill the validity requirement: items number 1, 2, 3, 5, 6, 7, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 24, 25, 26, 27, 30, 35, 36, 37, 38, 39, 40, 42, 43, 44, 47, 49, and 50.


Of those 50 test items, several can still be used in the following test.

4.1.2 Analysis of Item Reliability

To obtain the reliability coefficient of the test, the writer applied the Kuder-Richardson 20 (KR-20) formula. The calculation gives a reliability coefficient of 0.39. This value was then compared with the table of r product moment values at the 0.05 (5%) level of significance. Because the calculated r is higher than the table value, it can be concluded that the test items used in the English mid-term test for the eighth grade of SMP 33 Semarang in the academic year 2008/2009 are reliable. The reliability calculation is listed in appendix 2.
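KR-20 is usually written as r = (k / (k - 1)) × (1 - Σpq / S²), where k is the number of items, p and q = 1 - p are the proportions of correct and wrong answers on each item, and S² is the variance of the total scores. The sketch below implements that form for items scored 0 or 1. It is illustrative only: the function name and the use of the population variance (dividing by N) are assumptions, since this section does not show which variance the writer used.

```python
def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously (0/1) scored items.

    responses: one row per student, each row a list of 0/1 item scores.
    """
    n_students = len(responses)
    k = len(responses[0])  # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    # Population variance of total scores (assumption: divide by N, not N - 1).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_students  # proportion correct
        sum_pq += p * (1 - p)  # p * q, with q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Usage with four hypothetical students answering three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(data))  # 0.75
```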

4.1.3 Analysis of the Difficulty Level

Using Nitko’s formula, the difficulty level of an item is analyzed by calculating the proportion of students who answered the item correctly. Item difficulty is categorized into three levels:

(1) An index of 0.00 to 0.25 is categorized as difficult.

(2) An index of 0.26 to 0.75 is categorized as moderate.

(3) An index of 0.76 to 1.00 is categorized as easy.

The results of the data analysis are as follows:

(1) The difficult items are items number 10, 12, 15, 17, 18, 19, 20, 21, 26, 28, and 33.

(2) The moderate items are items number 1, 3, 4, 5, 6, 7, 8, 9, 11, 13, 14, 16, 23, 24, 25, 27, 29, 32, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, and 50.


Across the 50 items, the English mid-term test for the eighth grade students of SMP 33 Semarang can be categorized as moderate in terms of difficulty level, since the mean difficulty index is 0.41. The items considered easy can still be used to encourage and motivate weaker students. An example of the difficulty level calculation can be seen in appendix 3.
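Nitko’s formula is cited without being written out in this section; the sketch below assumes the common proportion-correct form P = R / N (R = students answering correctly, N = all students) together with the three bands listed above. The function names are hypothetical. Because the published bands leave small gaps (for example, between 0.25 and 0.26), the sketch treats any value above 0.25 and up to 0.75 as moderate, which is an assumption.

```python
def difficulty_index(n_correct, n_students):
    """Proportion of students who answered the item correctly (P = R / N)."""
    return n_correct / n_students


def difficulty_category(p):
    """Bands used in this study: 0.00-0.25 difficult, 0.26-0.75 moderate, 0.76-1.00 easy."""
    if p <= 0.25:
        return "difficult"
    if p <= 0.75:
        return "moderate"
    return "easy"


# Usage with counts of correct answers out of 120 students, taken from the item tables.
for n_correct in (55, 108, 14):
    p = difficulty_index(n_correct, 120)
    print(n_correct, round(p, 2), difficulty_category(p))
# 55 0.46 moderate
# 108 0.9 easy
# 14 0.12 difficult
```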

4.1.4 Analysis of Discrimination Power

In analyzing the discrimination power of the test items, the writer used Gronlund’s formula, which shows how well an item separates upper-group students from lower-group students. The discrimination power of the test items is classified into four categories, as follows:

(1) An index of 0.40 and above indicates very good items.

(2) An index of 0.30 to 0.39 indicates reasonably good items that may still be subject to improvement.

(3) An index of 0.20 to 0.29 indicates marginal items, which usually need improvement.

(4) An index of 0.19 and below indicates poor items, which should be rejected or improved by revision.
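Gronlund’s formula is named but not written out in this section. A common form is D = (RU - RL) / (½T), where RU and RL are the numbers of students answering correctly in the upper and lower groups and T is the number of students in both groups. With two groups of 32 this means dividing by 32, which reproduces the D values in the item tables (for example, (28 - 11) / 32 ≈ 0.53 for the item on how many children Rini’s parents have). The sketch below assumes that form; the function names are hypothetical.

```python
def discrimination_index(correct_upper, correct_lower, group_size):
    """D = (RU - RL) / (T / 2); T / 2 equals the size of one extreme group."""
    return (correct_upper - correct_lower) / group_size


def discrimination_category(d):
    """Four bands listed above (after Ebel and Frisbie, 1991)."""
    if d >= 0.40:
        return "very good"
    if d >= 0.30:
        return "reasonably good"
    if d >= 0.20:
        return "marginal"
    return "poor"


# Usage: correct answers in the upper and lower groups (32 students each),
# copied from four of the item tables.
for upper, lower in ((28, 11), (16, 5), (14, 6), (18, 12)):
    d = discrimination_index(upper, lower, 32)
    print(upper, lower, round(d, 2), discrimination_category(d))
# 28 11 0.53 very good
# 16 5 0.34 reasonably good
# 14 6 0.25 marginal
# 18 12 0.19 poor
```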

The results of the data analysis are as follows:

(1) 29 items are categorized as poor: items number 1, 2, 4, 5, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 24, 27, 28, 30, 31, 33, 36, 37, 41, 42, 43, 46, and 47.

(2) 13 items are classified as marginal: items number 6, 7, 8, 10, 18, 23, 26, 29, 32, 38, 40, 44, and 50.


(4) Only 2 items are classified as very good: items number 3 and 35.

The mean discrimination power is 0.17, so as a whole the mid-term test items are categorized as poor. The analysis found 5 items with negative discrimination power. A negative value means that more students in the lower group than in the upper group answered the item correctly. These items are item number 13 (discrimination power -0.063), item number 15 (-0.125), item number 30 (-0.03), item number 31 (-0.06), and item number 42 (-0.06). These items should be rejected. An example of the discrimination power calculation is listed in appendix 4.
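Following the rule given earlier that items with zero or negative discrimination power should be removed, the check can be sketched in Python as below. The counts are the upper- and lower-group correct answers from the item tables for items 3, 13, and 15 (groups of 32 students). The function name is hypothetical, and D is assumed to be (RU - RL) divided by the group size, which reproduces the values reported above.

```python
def items_to_reject(counts, group_size=32):
    """Return {item: D} for items whose discrimination index is zero or negative.

    counts: {item number: (correct in upper group, correct in lower group)}
    """
    flagged = {}
    for item, (upper, lower) in counts.items():
        d = (upper - lower) / group_size
        if d <= 0:
            flagged[item] = d
    return flagged


counts = {3: (28, 11), 13: (17, 19), 15: (6, 10)}
print(items_to_reject(counts))  # {13: -0.0625, 15: -0.125}
```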

4.2 Discussions


difficult. Then, from the point of view of discrimination power, a good item is one that can discriminate between lower-group and upper-group students.

According to Gronlund (1981:151-160), there are criteria to determine which items can still be used, which should be revised, and which should be discarded.

(1) An item is used if it meets one of the following criteria:

a. Valid, reliable, good discrimination power and moderate difficulty level.

b. Valid, reliable, satisfactory discrimination power and moderate difficulty level.

(2) An item is used with some revisions if it meets one of the following criteria:

a. Valid, reliable, good discrimination power, but the difficulty level is too easy or too difficult.

b. Valid, reliable, satisfactory discrimination power, but the difficulty level is too easy or too difficult.

c. Valid, reliable, poor discrimination power and moderate difficulty level.

d. Not valid, reliable, good discrimination power and moderate difficulty level.

e. Not valid, reliable, satisfactory discrimination power and moderate difficulty level.

(3) An item should be discarded if it meets one of the following criteria:

a. Valid, reliable, poor discrimination power, and the difficulty level is too easy or too difficult.

b. Not valid, reliable, good discrimination power, but the difficulty level is too easy or too difficult.

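c. Not valid, reliable, satisfactory discrimination power, but the difficulty level is too easy or too difficult.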

d. Not valid, reliable, poor discrimination power and moderate difficulty level.

e. Not valid, reliable, poor discrimination power, and the difficulty level is too easy or too difficult.
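The decision rules above can be expressed as a small lookup, sketched here in Python. Every criterion as listed assumes a reliable test (which the reliability analysis found), so reliability is not a parameter. The function names are hypothetical, and collapsing the four discrimination bands into "good", "satisfactory", and "poor" is an assumption inferred from the results lists (for example, the marginal items 6, 7, 38, 40, 44, and 50 are listed as having satisfactory discrimination power).

```python
def dp_level(d):
    """Collapse the four D bands into the three levels used by the criteria.

    Assumption inferred from the results lists: 'good' = D >= 0.30,
    'satisfactory' = 0.20-0.29 (marginal), 'poor' = 0.19 and below.
    """
    if d >= 0.30:
        return "good"
    if d >= 0.20:
        return "satisfactory"
    return "poor"


def gronlund_decision(valid, dp, difficulty):
    """Return 'use', 'revise', or 'discard' for an item of a reliable test.

    dp: 'good', 'satisfactory', or 'poor'; difficulty: 'moderate', 'easy', or 'difficult'.
    """
    moderate = difficulty == "moderate"
    acceptable_dp = dp in ("good", "satisfactory")
    if valid and acceptable_dp and moderate:
        return "use"  # criteria (1)a and (1)b
    if valid and acceptable_dp:
        return "revise"  # (2)a and (2)b: too easy or too difficult
    if valid and moderate:
        return "revise"  # (2)c: poor discrimination power
    if not valid and acceptable_dp and moderate:
        return "revise"  # (2)d and (2)e
    return "discard"  # (3)a to (3)e


print(gronlund_decision(True, dp_level(0.53), "moderate"))  # use
print(gronlund_decision(True, dp_level(0.09), "easy"))  # discard
print(gronlund_decision(False, dp_level(0.25), "moderate"))  # revise
```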

Based on the item analysis, which covers the validity, reliability, difficulty level, and discrimination power of the items, the results are explained in detail as follows:

(1) The items that can be used again are as follows:

a. 5 items are classified as valid and reliable, with good discrimination power and moderate difficulty level: items number 3, 25, 35, 39, and 49.

b. 6 items are classified as valid and reliable, with satisfactory discrimination power and moderate difficulty level: items number 6, 7, 38, 40, 44, and 50.

(2) The items that can still be used but need some revisions are as follows:

a. No items are classified as valid and reliable, with good discrimination power but a difficulty level that is too easy or too difficult.

b. 2 items are classified as valid and reliable, with satisfactory discrimination power but a difficulty level that is too easy or too difficult: items number 10 and 26.


d. 3 items are classified as not valid but reliable, with good discrimination power and moderate difficulty level: items number 34, 45, and 48.

e. 5 items are classified as not valid but reliable, with satisfactory discrimination power and moderate difficulty level: items number 8, 18, 23, 29, and 32.

(3) The items that should be discarded are the following:

a. 7 items are classified as valid and reliable, with poor discrimination power and a difficulty level that is too easy or too difficult: items number 2, 12, 15, 17, 20, 30, and 36.

b. No items are classified as not valid but reliable, with good discrimination power but a difficulty level that is too easy or too difficult.

c. No items are classified as not valid but reliable, with satisfactory discrimination power but a difficulty level that is too easy or too difficult.

d. 5 items are classified as not valid but reliable, with poor discrimination power and moderate difficulty level: items number 4, 9, 22, 41, and 46.


For further explanation, see the following item-by-item descriptions.

Item number 1

Question I am an SMP student. My name is Rini. I have one sister and two brothers. My sister’s name is Tuti and my Brother’s names are Fauzan and Doni. My father’s name is Syahbudin and my mother’s name is anis. I live with my family at Gunung Talang street. I am twelve years old. Tuti is ten years old and fauzan is six years old. My father is forty-three and my mother is thirty-five. We are happy family

Rini’s sister is… years old.

a. 4 b. 6 c. 10 d. 12

Result A B C* D

Upper 27% 0 1 18 13

Middle 46% 7 1 25 22

Lower 27% 11 1 12 21

Total 28 3 55 56

Validity Valid

P value 0.46


The item above meets the criterion for the material, which is descriptive text, so it can be considered valid because it meets the criteria of the KTSP curriculum. With a P value of 0.46, it is classified as a moderate item, since 55 students chose the correct answer. From the D value, this item is categorized as poor, since only 18 students from the upper group and 12 students from the lower group chose the correct answer. Based on these criteria, this item can still be used in the next test with some revisions.
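As a worked check of the figures above, assuming Nitko’s difficulty index takes the usual proportion-correct form and Gronlund’s index divides the upper-lower difference by half of the T = 64 students in the two extreme groups (neither formula is written out in this section):

```latex
P = \frac{R}{N} = \frac{55}{120} \approx 0.46,
\qquad
D = \frac{R_U - R_L}{\tfrac{1}{2}T} = \frac{18 - 12}{32} \approx 0.19
```

Here R is the number of correct answers, N = 120, and R_U and R_L are the correct answers in the upper and lower groups. The value 0.19 falls in the poor band (0.19 and below), in line with the description.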

Question What is the text about?

a. Rini’s address b. Rini’s school c. Rini’s family d. Rini’s age

Result A B C* D

Upper 27% 1 0 30 1

Middle 46% 1 2 51 3

Lower 27% 1 1 27 2

Total 3 3 108 6

P value 0.9

D value 0.09


poor item, because 27 students from the lower group chose the correct answer and only 30 students from the upper group did the same. Although this item can be considered valid, in the writer’s opinion it should be discarded.

Question How many children’s do Rini’s parents have?

a. 6 b. 5 c. 4 d. 3

Result A B C* D

Upper 27% 1 0 28 3

Middle 46% 15 8 6 17

Lower 27% 3 0 11 18

Total 19 8 45 38

P value 0.37

D value 0.53


Question “I can’t pack my bag and go home.” The underlined words mean…the bag.

a. ask someone to bring b. put his clothes in c. put his belongings into d. carry things with

Result A B C* D

Upper 27% 4 8 15 5

Middle 46% 9 7 22 12

Lower 27% 2 10 11 9

Total 15 25 48 26

Validity Invalid

P value 0.4

D value 0.125


Question The tree trunk conducts water and dissolves materials from the roots to the leaves, flowers, and fruits of the plant. It also supports the branches and the twigs. The leaves, flowers, and fruits grow along the twigs.

The roots absorb water and minerals from the soil to feed all parts of the tree. They also anchor a tree in the soil to hold the tree upright against the force of strong wind.

The roots have…function.

a. two b. three c. four d. five

Result A* B C D

Upper 27% 16 13 1 2

Middle 46% 16 37 2 1

Lower 27% 15 16 1 0

Total 47 66 4 3

P value 0.39

D value 0.03


descriptive text, so it can be considered valid because it tests descriptive text. In terms of discrimination power, it is considered poor, since its D value is only 0.03. In terms of difficulty level, it is a moderate item, since 47 students chose the correct answer. Based on Gronlund’s criteria above, this item can still be used with some revisions.

Question Which one is the correct order of the parts of the tree, from bottom to the upper parts?

a. trunk - roots – branches – twigs – fruits b. trunk – twig - roots – branches – leaves c. roots – trunk – twigs – branches – fruits d. roots – trunk – branches – twig – leaves

Result A B C D*

Upper 27% 9 7 2 14

Middle 46% 13 23 1 19

Lower 27% 6 19 1 6

Total 28 49 4 39


D value 0.25

Because its P value is 0.32, this item is considered moderate in terms of difficulty level. In terms of discrimination power, it is a marginal item, since its D value is 0.25. By the validity criterion, it is classified as valid. Based on the criteria above, this item can still be used in the next test, since it has satisfactory discrimination power and moderate difficulty level.

Question The first paragraph above is told about…

a. the function of the trunk b. the function of the roots c. the function of the leaves d. the function of the twigs

Result A* B C D

Upper 27% 15 15 2 0

Middle 46% 22 29 4 1

Lower 27% 8 17 1 1

Total 45 61 7 2

P value 0.37


The P value of this item is 0.37, which shows that it is a moderate item, since 45 students chose the correct answer. The D value of 0.25 shows that it is a marginal item, because 15 students from the upper group chose the correct answer and only 8 students from the lower group did the same. In terms of validity, it is a valid item. This item is therefore categorized as good, and it can still be used in the next test.

Question Ali : “What is Mr. Bakri’s profession? Amat: “He is a…

Look at his table!

It is full of tools. There are hammer, an axe, a handsaw, a pencil, etc.

a. sailor b. painter c. carpenter d. teacher

Result A B C* D

Upper 27% 1 6 24 1

Middle 46% 0 12 37 5

Lower 27% 3 9 16 4

Total 4 27 77 10

P value 0.64

D value 0.25


power point of view, this item can be considered marginal, since 24 students from the upper group and 16 students from the lower group chose the correct answer. Based on the validity criteria, however, this item is classified as invalid because it does not meet the criteria of the curriculum. Even so, as a marginal item it can still be used in the next test with some revisions.

Question My father is a farmer he is work at…,especially in the rainy season, he grows rice.

a. in the farm b. in the rice field c. in the garden d. in the park

Result A B* C D

Upper 27% 15 15 1 1

Middle 46% 38 20 0 0

Lower 27% 23 10 1 0

Total 76 45 2 1

P value 0.37

D value 0.15


upper group and the lower group is only 5 students. In addition, this item is classified as invalid. In the writer’s opinion, this item should be discarded since it is invalid.

Question I still have an assignment to do. I… it after lunch.

a. have been finishing b. finishing c. was finishing d. will finish

Result A B C D*

Upper 27% 4 16 3 9

Middle 46% 7 31 12 7

Lower 27% 3 16 10 2

Total 14 63 25 18

P value 0.15

D value 0.21


tense, so based on the criteria above, the item can still be used in the next test with some revisions.

Question Adhi : How many times do you swim a week? Tio : Actually twice but last week I only…once because I prepared the exam.

a. swam b. swim c. will swim d. have swim

Result A* B C D

Upper 27% 18 6 5 3

Middle 46% 40 11 3 2

Lower 27% 18 8 3 1

Total 76 25 11 6

P value 0.63

D value 0


because it tests the past tense. So, this item can still be used in the following test with some revisions.

Question Lia : Can I have some apples? Dio : …do you want?

Lia : The Australian ones.

a. how many b. what c. which d. how much

Result A B C* D

Upper 27% 12 10 8 2

Middle 46% 37 15 4 0

Lower 27% 17 12 2 1

Total 66 37 14 3

P value 0.12

D value 0.19

From the number of students who answered the item correctly, it can be seen that the item is difficult. In terms of discrimination power, this item has a D value of 0.19, so it is categorized as poor. From the validity point of view, it is valid because it tests asking for something. Although it is valid, this item should be discarded because it is poor in discrimination power and difficult in difficulty level.


Question A clever crow

One day a crow was tired and 13)…He looked everywhere for some 14)…to drink, but he could not find any. At last he 15)…an old jar which there was a little water. The jar was so tall and the water was so low that he could not reach it with his short bill. He thought for a while, then he 16)…away to pick up some stones. She 17)…the stones into the jar one after another, and the water came up higher and higher. At last the crow was able to drink as much as she liked.

a. hungry

b. insects

c. water

d. leaves

Result A* B C D

Upper 27% 17 2 10 3

Middle 46% 36 7 12 1

Lower 27% 19 1 11 1

Total 72


D value -0.06

This item tests vocabulary. There are some clues in the question, from which the students have to infer the answer. The item is neither too easy nor too difficult, so it is moderate in terms of difficulty level. In terms of discrimination power, it is a poor item: more students in the lower group than in the upper group chose the correct answer, so its D index is negative. In terms of validity, it tests narrative text, so it meets the validity criteria. Based on the criteria above, this item can still be used with some revisions.

Question

a. meat b. angry c. amazing d. thirsty

Result A B C* D

Upper 27% 5 0 26 0

Middle 46% 13 0 41 0

Lower 27% 6 3 23 0


P value 0.75

D value 0.09

The table above shows that this item is moderate in terms of difficulty level, since 90 students answered it correctly. In terms of discrimination power, it is classified as poor, since 26 students in the upper group and 23 students in the lower group chose the correct answer. By the validity criteria, this item is valid. Therefore, this item can still be used in the following test, but with some revisions.

Question

a. find b. found c. founded d. finding

Result A B* C D

Upper 27% 16 6 3 7

Middle 46% 21 9 1 24

Lower 27% 9 10 3 10

Total 46 25 7 41


D value -0.125

This item is poor in terms of discrimination power, since its value is negative, which means that the students in the lower group did better than the students in the upper group. By the difficulty-level criterion, it is a difficult item, since only 25 of the 120 students answered it correctly. Although it is valid, this item should be discarded.

Question a. flew b. put c. threw d. saw

Result A* B C D

Upper 27% 9 4 6 13

Middle 46% 22 6 9 16

Lower 27% 6 4 4 16

Total 37 14 19 45

P value 0.3


From the number of students who answered the item correctly, it can be seen that this item is moderate. In terms of discrimination power, it is poor, because 9 students from the upper group and 6 students from the lower group chose the correct answer. Based on validity, this item is classified as valid. Considering the criteria of validity, discrimination power, and difficulty level, this item can still be used in the following test with some revisions.

Question a. dropped b. found c. reached d. sent

Result A* B C D

Upper 27% 9 0 22 1

Middle 46% 9 1 36 7

Lower 27% 6 3 20 3

Total 24 4 78 11

P value 0.2

D value 0.09


the criterion of the difficulty level, it is a difficult item, since only 24 out of 120 students chose the correct answer. Although it is categorized as valid, it should not be used in the next test.

Question Bita : Mom I want to make an omelet. What ingredients do I need to prepare?

Mom: Prepare three eggs, five…of onion, a little salt, and some vegetable oil.

a. sheets b. slices c. packs d. sacks

Result A B* C D

Upper 27% 0 14 12 6

Middle 46% 2 11 26 15

Lower 27% 1 5 12 14

Total 3 30 50 35

P value 0.25


The table above shows that this item is difficult, since only 30 students chose the correct answer. Based on the discrimination power criterion, it is a marginal item, since its D value is 0.28. By the validity criterion, this item is invalid, but it can still be used in the next test with some revisions.

Question Ani : I want to take my pill…. Lia : Sure. Wait a minute please.

a. Do you want some?

b. Can you get me a glass of water please? c. Can you take me to the doctor please? d. Will you buy it for me please?

Result A B* C D

Upper 27% 7 13 11 1

Middle 46% 13 9 31 1

Lower 27% 8 7 16 1

Total 28 29 58 3

P value 0.24

D value 0.19


this item has only 0.19, because only 13 students in the upper group and 7 students in the lower group answered the item correctly. This item is considered valid because it meets the criteria of the curriculum, namely testing how to ask for help. Although it is valid, this item should be discarded because of its poor discrimination power and its difficulty level.

Question Reno : I think this shirt needs ironing. Leo : No, I …it. Touch it. It is still warm.

a. iron b. will iron c. am ironing d. have ironed

Result A B C D*

Upper 27% 2 9 15 6

Middle 46% 8 5 32 9

Lower 27% 4 1 22 5

Total 14 15 69 20

P value 0.16

D value 0.19


group students only 6. From the validity point of view, this item is classified as valid because it tests the present perfect tense.

Question Anto and Willy are my cousins. The word cousins means…

a. aunt’s sisters b. uncle’s daughters c. aunt’s brothers d. uncle’s sisters

Result A* B C D

Upper 27% 7 8 7 10

Middle 46% 10 17 3 26

Lower 27% 7 8 5 12

Total 24 33 15 48

P value 0.2

D value 0

This item is almost the same as the previous one, number 20, and can be classified as difficult and poor in terms of difficulty level and discrimination power. Its difficulty index is 0.2, since only 24 of the 120 students answered the item correctly. Its D value is 0, which means that there is no difference between the students in the upper group and those in the lower group. This item is also considered invalid, so it should be discarded.


Question Euro 2000, one of the biggest 22)…on earth beside the World Cup is about to begin. Football freaks everywhere will be performed in Europe’s top teams. Who will 23) …in the mini World Cup?

a. shows b. clubs c. festivals d. tournament

Result A B C D*

Upper 27% 1 3 0 28

Middle 46% 1 10 5 40

Lower 27% 0 6 3 23

Total 2 19 8 91

P value 0.7

D value 0.16

From the number of students who answered the item correctly, it can be seen that this item is easy. Its difficulty index is 0.7, while its discrimination power index is 0.16, which means that the item is poor, because 28 students in the upper group and 23 students in the lower group chose the correct answer. Because the item is also invalid, it is better discarded.


Result A B* C D

Upper 27% 5 13 0 14

Middle 46% 7 13 2 50

Lower 27% 6 5 0 21

Total 18 31 2 85

P value 0.25

D value 0.25

The table above shows that the item is moderate, since 31 of the 120 students answered it correctly. Its discrimination power index is 0.25, which means that it is a marginal item, because the difference between the upper-group and lower-group students who answered correctly is 8 students. Based on the validity criterion, this item is classified as invalid. Considering difficulty level, discrimination power, and validity, the item is good enough to be used in the following test with some revisions.


a. lions b. leopards c. tigers d. panthers

Result A B C* D

Upper 27% 0 6 26 0

Middle 46% 1 21 33 0

Lower 27% 6 5 22 0

Total 7 42 81 0

P value 0.67

D value 0.125

Based on the difficulty level index, this item is moderate, since 81 students answered it correctly. Based on the discrimination power index, its value is 0.125, which means that it is a poor item, because only 26 students in the upper group and 22 students in the lower group answered it correctly. From the validity point of view, this item is valid because it tests descriptive text.


toward them. The cat ran up a tree and disappeared. Then he said that was the only trick he knew. He then asked to the fox which trick he was going to use. The fox sat there trying to decide which trick to use. He thought a long time. Then he decided to run. But it was too late. What happened then? At then the wild dog got there before he could run away and ate him up.

The word “trick” in the passage above has the similar meaning with…

a. ideas b. ways c. strategies d. solutions

Result A B C* D

Upper 27% 7 0 16 9

Middle 46% 13 1 12 26

Lower 27% 9 2 5 16

Total 29 3 33 51

P value 0.27


33 students answered this item correctly, which means that it is a moderate item. From the discrimination power point of view, it is classified as a good item, since 16 students in the upper group answered it correctly and only 5 students in the lower group chose the correct answer. In terms of validity, it is valid, because it tests narrative text, which meets the criteria of the curriculum. Based on the criteria above, this item can still be used in the following test with some revisions.

Question
