

6.6 Test and Item Analysis

In the next chapter, an introduction to educational measurement theory will be given that is completely based on item response theory (IRT). IRT provides the theoretical underpinning for the construction of measurement instruments, the linking and equating of measurements, the evaluation of test bias, item banking, optimal test construction, and computerized adaptive testing. However, all these applications require fitting an appropriate IRT model. If such a model is found, it provides a framework for the evaluation of item and test reliability. Fitting an IRT model can be quite a tedious process, however, and one can never be absolutely sure that the model fit is good enough. Therefore, in practical situations one needs indices of item and test reliability that can be effortlessly computed and do not depend on a model. These indices are provided by classical test theory (Gulliksen, 1950; Lord & Novick, 1968). Before the principles and major consequences of classical test theory are outlined, first an example of its application will be given. Consider the item and test analysis given in Table 6.5.

Table 6.5 Example of a Test and Item Analysis (Number of observations: 2290)

Mean = 8.192    S.D. = 1.623    Alpha = 0.432

Item   p-value    ρkc     ρkd (distractor 1)   ρkd (distractor 2)
  1     .792      .445         .014                 .145
  2     .965      .419         .035                 .019
  3     .810      .354         .147                 .054
  4     .917      .052         .155                 .052
  5     .678      .378         .388                 .072
  6     .756      .338         .048                 .032
  7     .651     −.091         .061                 .291
  8     .770      .126         .100                 .121
  9     .948      .472         .182                 .172
 10     .905      .537         .193                 .037

The example concerns a test consisting of ten multiple-choice items with three response alternatives. The mean and the standard deviation of the frequency distribution of the number-correct scores are given in the heading of the table. Also given is the value of Cronbach’s Alpha coefficient, which serves as an indication of test reliability. Alpha takes values between zero and one. As a rule of thumb, a test is usually considered sufficiently reliable if the value of Alpha is above 0.80. In the present example, this is not the case. Probably the test is too short and, further, there may be items that do not function properly. This can be further assessed by inspecting the information on the items given below the heading. The column labeled p-value gives the proportion of correct responses for each item.

The columns labeled ρkc and ρkd give the correlations between the total score and the item scores for the correct response and the two distractors, respectively. (The index k stands for the item, c and d for correct and incorrect, respectively.) If the total score were a perfect proficiency measure, the item-test correlation for the correct response, ρkc, should be high, because giving a correct response is also an indication of proficiency. By analogous reasoning, the item-test correlation for a distractor, ρkd, should be close to zero, or even negative, because keying a wrong alternative indicates a low proficiency level.

The number-correct score is, of course, not a perfect measure of ability, because of the unreliability of the test scores and because of the presence of poorly functioning items, which degrade the number-correct score. However, the essence of the rationale for expecting an item-test correlation ρkc that is higher than ρkd remains valid. From the item statistics in Table 6.5, it can be inferred that items 4 and 7 did not function properly: ρkc is close to zero for item 4 and negative for item 7. Further, for these two items, it can be seen that ρkd > ρkc for the first distractor of item 4 and the second distractor of item 7. So these response alternatives were more appealing to proficient students than the supposedly correct alternatives. Practice shows that items with these patterns of item-test correlations are usually faulty. An administrative error might have been made in identifying the correct response alternative, or there might be some error in the content of the correct alternative.
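As an illustration of how statistics of this kind can be obtained, the following minimal sketch computes p-values and item-test correlations per response alternative. The array `responses`, the vector `key`, and the function name are hypothetical illustrations, not part of the original example.

```python
import numpy as np

def item_analysis(responses: np.ndarray, key: np.ndarray):
    """Classical item analysis for multiple-choice data.

    responses : (n_students, n_items) array with the chosen alternative (1, 2, 3, ...)
    key       : (n_items,) array with the correct alternative for each item
    Returns the p-values and the item-test correlation of every alternative.
    """
    n_students, n_items = responses.shape
    correct = (responses == key).astype(float)      # 0/1 item scores
    total = correct.sum(axis=1)                     # number-correct scores
    p_values = correct.mean(axis=0)                 # proportion correct per item

    n_alternatives = int(responses.max())
    r_alt = np.zeros((n_items, n_alternatives))
    for k in range(n_items):
        for a in range(1, n_alternatives + 1):
            chosen = (responses[:, k] == a).astype(float)
            # point-biserial correlation of choosing alternative a with the total score
            # (NaN if nobody chose the alternative)
            r_alt[k, a - 1] = np.corrcoef(chosen, total)[0, 1]
    return p_values, r_alt
```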

Reliability theory

The inferences made in the example about the reliability of the test and the quality of the items are based on classical test theory (CTT). CTT starts with considering two random experiments. In the first experiment a student, who will be labeled i (i=1,…, N), draws a test score from some distribution. So the test score of a student is a random variable. It will be denoted by Xi. The fact that the student’s test score is a random variable reflects the fact that the measurement is unreliable and subject to random fluctuations. The expected value of the test score is equal to the true score, that is,

Ti = E(Xi).

The difference between the test score and the true score is the error component ei, that is, ei = Xi − Ti. The second random experiment concerns drawing students from some population. Now not only the observed scores Xi but also the true scores Ti are random variables. As a result, for a given population, the observed score of a randomly drawn student, say X, is equal to the sum of a true score, say T, and an error term e. That is,

X = T + e.   (1)

Since the test is only administered to each student once, the error terms within students and the error terms between students cannot be assessed separately; they are confounded.

It is assumed that the expectation of the error term, both within students and between students, is equal to zero, that is,

E(ei) = 0 and E(e) = 0.

Further, the true score and the error term are independent, so their covariance is zero:

Cov(T, e) = 0.

These assumptions do not imply a model; all that was done is making a decomposition of observed scores into two components: true scores and errors. This decomposition provides the basis for defining reliability. Suppose that there are two tests that are strictly parallel, that is, students have the same true score T on both tests and the error components, say e and e′, have exactly the same variance. The observed scores on the two tests will be denoted by X and X′. Then reliability, denoted by ρXX′, can be defined as the correlation between two strictly parallel tests, so

ρXX′ = ρ(X, X′).

From this definition, a second definition of reliability can be derived. Since it holds that

Cov(X, X′) = Cov(T + e, T + e′) = Cov(T, T) + Cov(T, e′) + Cov(e, T) + Cov(e, e′) = Var(T) + 0 + 0 + 0 = Var(T),

and since strictly parallel tests have equal observed score variances, reliability can also be defined as

ρXX′ = Cov(X, X′) / Var(X) = Var(T) / Var(X).   (2)
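The equivalence of these two definitions can be illustrated with a small simulation. The sketch below uses arbitrary variance values (not taken from the text): it draws true scores and independent error terms for two strictly parallel forms and compares the observed correlation with the variance ratio Var(T)/Var(X).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                   # number of simulated students

var_t, var_e = 4.0, 1.0                       # arbitrary true-score and error variances
t = rng.normal(0.0, np.sqrt(var_t), n)        # true scores
x1 = t + rng.normal(0.0, np.sqrt(var_e), n)   # parallel form 1
x2 = t + rng.normal(0.0, np.sqrt(var_e), n)   # parallel form 2

print(np.corrcoef(x1, x2)[0, 1])              # correlation between the parallel forms
print(var_t / (var_t + var_e))                # Var(T) / Var(X) = 0.8
```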

So reliability is also the ratio of the true score variance and the observed score variance. In other words, it can be viewed as the proportion of systematic variance in the total variance. The reliability ρXX′ can be estimated in several ways. The first is through the administration of two strictly parallel tests. In practice, this is often hampered by the fact that strictly parallel tests are difficult to find, and the administration of two tests may induce learning effects. Another way is based on viewing all items in the test as parallel measures. Under this assumption, reliability can be estimated by Cronbach’s Alpha coefficient (see Lord & Novick, 1968). If the items are dichotomous (that is, either correct or incorrect), Cronbach’s Alpha becomes the well-known KR-20 coefficient.
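As an illustration, a minimal sketch of the Alpha computation is given below. It assumes the item scores are available as a (students × items) NumPy array; for 0/1 scored items the same function yields KR-20.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's Alpha for an (n_students, n_items) matrix of item scores.

    For dichotomously (0/1) scored items this is identical to KR-20.
    """
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()     # sum of the item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)      # variance of the total scores
    return k / (k - 1) * (1.0 - item_var / total_var)
```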

Two important consequences of CTT can be derived from the following considerations. Suppose that there is a test with test scores X, and, as in formula (1), X = T + e. Further, there is a criterion, say some other test, where the scores C can also be written as C = T′ + e′. So the observed criterion scores are the sum of true scores T′ and error terms e′. The interest is in the correlation between the test scores and the scores on the criterion. In principle, the interest is not so much in the correlation between the observed scores, but in the correlation between the true scores T and T′. This correlation can be written as

ρ(T, T′) = ρ(X, C) / √(ρXX′ · ρCC′).

So the correlation between the true scores is equal to the observed correlation divided by the square root of the product of the reliabilities of the test scores and the criterion scores.

From this identity, two important conclusions can be drawn. First, it follows that

ρ(X, C) = ρ(T, T′) · √(ρXX′ · ρCC′) ≤ ρ(T, T′),

that is, the observed correlation is lower than the correlation between the true scores, say, the true correlation. This is the so-called attenuation effect: the unreliability of the scores suppresses the correlation between the observed scores. If the test and the criterion were perfectly reliable, the observed correlation would equal the true correlation. The number of items in a test is an important antecedent of reliability, so the bottom line is that it is useless to compute correlations between very short, unreliable instruments. In principle, this could be solved by correcting the observed correlation, that is, by dividing it by the square root of the product of the reliabilities of the test scores and the criterion scores. However, the bias in the estimates of these reliabilities has unpredictable consequences. In the next chapter an alternative approach using IRT will be sketched.
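For illustration, the correction for attenuation described above can be sketched as follows; the numerical values in the usage line are invented for the example.

```python
import math

def disattenuate(r_observed: float, rel_test: float, rel_criterion: float) -> float:
    """Correct an observed test-criterion correlation for attenuation."""
    return r_observed / math.sqrt(rel_test * rel_criterion)

# e.g. an observed correlation of 0.42 with reliabilities 0.70 and 0.80
print(disattenuate(0.42, 0.70, 0.80))   # approx. 0.56: the estimated true correlation
```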

The second consequence follows from the fact that it can be derived that

ρ(X, C) ≤ √(ρXX′ · ρCC′).

Now, if the criterion C is a strictly parallel test, ρCC′ = ρXX′ and the bound becomes ρXX′, which is at most √ρXX′. This inequality thus implies that the square root of the reliability is an upper bound for validity. So validity and reliability are related, and one cannot ignore reliability when pursuing validity.

To assess the reliability of a measurement, the variance of the observed scores X, say σ²_X, has been split up into two components: the variance of the true scores T, say σ²_T, and the error variance, say σ²_e. That is,

σ²_X = σ²_T + σ²_e.

In many instances, however, specific sources of error variance can be identified, such as tasks, raters, occasions, and so forth. Estimation of these variance components can be done using an analysis of variance technique, known as generalizability theory (Brennan, 1992; Cronbach, Gleser, Nanda & Rajaratnam, 1972). As an example, consider an application to performance assessments (Brennan & Johnson, 1995), where there is interest in the effects of two so-called facets: raters (r) and tasks (t). The total observed score variance of the assessments by raters of the responses of students to tasks can now be decomposed as

σ²_X = σ²_p + σ²_t + σ²_r + σ²_pt + σ²_pr + σ²_rt + σ²_ptr,e,

where σ²_p is the true score variance of the students (persons), σ²_t is the variance attributable to the tasks, σ²_r is the variance attributable to the raters, and σ²_pt, σ²_pr and σ²_rt are the variances attributable to the interactions between persons and tasks, persons and raters, and raters and tasks, respectively; σ²_ptr,e is the residual variance, in which the three-way interaction and random error are confounded. All variance components on the right-hand side, except σ²_p, can be considered as error variances.

Estimation of the variance components is a so-called G-study. With the components estimated, a D-study can then be done to assess the number of tasks nt and raters nr that must be invoked to attain a certain level of reliability. These numbers can be computed in two ways, depending on the definition of reliability that is used. If the variance of the tasks, the variance of the raters, and all variances attributed to the interactions and to the residual are considered as error, the reliability coefficient becomes

σ²_p / (σ²_p + σ²_t/nt + σ²_r/nr + σ²_pt/nt + σ²_pr/nr + σ²_rt/(nt·nr) + σ²_ptr,e/(nt·nr)).   (3)

This coefficient is relevant if absolute judgments are to be made about the true score level of the student. However, if relative judgments are to be made, that is, if differentiations are made between the students irrespective of the difficulty of the tasks or the strictness of the raters, the variance of the tasks, the variance of the raters, and the variance of the interaction between these two are no longer relevant. Therefore, they are removed from the denominator, which leads to the coefficient

σ²_p / (σ²_p + σ²_pt/nt + σ²_pr/nr + σ²_ptr,e/(nt·nr)).   (4)
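As an illustration of a D-study, the sketch below computes the coefficients of formulas (3) and (4) as given above, from a set of estimated variance components; the component values and the numbers of tasks and raters are invented for the example.

```python
def d_study(var: dict, n_t: int, n_r: int) -> tuple:
    """Absolute (formula 3) and relative (formula 4) reliability coefficients
    for a persons x tasks x raters design, given estimated variance components."""
    abs_error = (var["t"] / n_t + var["r"] / n_r + var["pt"] / n_t
                 + var["pr"] / n_r + var["rt"] / (n_t * n_r)
                 + var["ptr,e"] / (n_t * n_r))
    rel_error = var["pt"] / n_t + var["pr"] / n_r + var["ptr,e"] / (n_t * n_r)
    absolute = var["p"] / (var["p"] + abs_error)     # formula (3)
    relative = var["p"] / (var["p"] + rel_error)     # formula (4)
    return absolute, relative

# invented variance components, purely for illustration
components = {"p": 0.50, "t": 0.10, "r": 0.05, "pt": 0.20,
              "pr": 0.10, "rt": 0.02, "ptr,e": 0.30}
print(d_study(components, n_t=6, n_r=2))
```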

The final remark of this section pertains to the relation between reliability as it is defined here using CTT, and the alternative definition based on IRT that will be discussed in the next chapter. Reliability in CTT pertains to the extent to which students can be distinguished based on their test scores. For this, it is essential that the true scores of the students vary. The reliability coefficients defined above by formulas (2), (3) and (4) are all equal to zero if the variance of the true scores is zero. The dependence of the reliability on the variance of the true scores can be misused. Consider a test for visual ability administered to a sample of 8-year-old children enrolled in regular schools. It turns out that the reliability is too low to make an impression on test publishers. Therefore, the test constructor adds to the sample a group of 8-year-old children enrolled in schools for the visually impaired. These children score very low, which inflates the score variance and leads to a much higher reliability. As such, this circumstance does not invalidate the concept of reliability. What the test constructor has done is change the definition of the population of interest. The first coefficient relates to distinguishing among children without visual impairments, while the second relates to distinguishing visually impaired children from children who are not visually impaired. The point is that CTT test and item indices must always be interpreted relative to some population. In the next chapter, it will be shown that the main motivation for the development of IRT is separating the effects of tests and items on the one hand, and the population of students on the other hand. It will be shown that this leads to another definition of reliability.