

Chapter 2 Literature Review

2.2 The Rasch Model and Its Paradigm

As indicated earlier, in Indonesia, the achievement admission test to enter public universities has been complemented with a scholastic aptitude test since 2009.

It appears that despite the controversy, the aptitude test will continue to be used in practice. It is used either on its own or to provide information not given by an achievement test and therefore to complement the achievement test, which may yield a better prediction of performance.

There are at least four ways in which to use achievement and aptitude tests together in selection. The first is simply by taking a total score. In this way scores on the two assessments compensate each other. A second way is to require high scores on both.

This would restrict entry more than if only one test were used. A third way is to require a high score on only one of these tests, with perhaps a minimum score on the other. This approach widens entry relative to requiring high scores on both. In particular, students from educationally disadvantaged backgrounds would have a better chance of being selected.

This might operate differently in different areas of university study. A fourth way is to form a prediction equation with a criterion and use multiple regression to derive empirical weights.

Rasch (1977) argued that the model he developed is in line with the scientific concept of measurement. According to him, scientific measurement deals with comparison and this comparison must be objective. The principles of comparison, in his words, are as follows.

The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which other stimuli within the considered class were or might also have been compared.

Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison;

and it should also be independent of which other individuals were also compared, on the same or some other occasion. (Rasch, 1961, pp. 331-332)

2.2.1 Features of the Class of Rasch Models

(a) Invariant Comparison within a Specified Frame of Reference

Rasch developed a model which fulfilled the principles of invariant comparison. The model has two related components, statistical sufficiency and person and item parameter separation (Andrich, 2005a). The realization of statistical sufficiency in the model is that the total person score is a sufficient statistic for the estimate of the person's ability and the total item score is a sufficient statistic for the item's difficulty (Andrich, 1988). In other words, given the total score of a person, no other information is needed to estimate the person's ability; in particular, there is no additional information in the response pattern.

Similarly, given the total score of an item, there is no other information needed to estimate the item’s difficulty. The consequence of this statistical sufficiency is that persons who have completed the same items and have the same total score will have the same ability estimate and items with the same total score will have the same difficulty estimate.

Separation of item and person parameters means that comparisons of the difficulties between two items can be made independently of the ability of any person and the comparisons between people can be made independently of the difficulties of the items (Andrich, 1988). This separation also results from sufficiency. Specifically, conditional on the total scores of persons, the distribution of responses depends only on the relative difficulties of the items. This means that in estimating item difficulties the distribution of person abilities does not have to be a particular shape, such as a normal distribution.
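The role of conditioning on the total score can be illustrated with a short sketch in Python. The item difficulties and the response pattern below are hypothetical illustrative values, and the response probability is the dichotomous Rasch model of Equation 2.1. Whatever ability is assumed, the probability of a given pattern conditional on its total score is unchanged:

```python
import math
from itertools import product

def p_correct(beta, delta):
    # Dichotomous Rasch probability of a correct response (Equation 2.1)
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

def p_pattern_given_total(beta, deltas, pattern):
    """Probability of a response pattern conditional on its total score."""
    def p_pattern(pat):
        return math.prod(
            p_correct(beta, d) if x == 1 else 1 - p_correct(beta, d)
            for x, d in zip(pat, deltas))
    total = sum(pattern)
    same_total = [pat for pat in product([0, 1], repeat=len(deltas))
                  if sum(pat) == total]
    return p_pattern(pattern) / sum(p_pattern(pat) for pat in same_total)

# Hypothetical item difficulties; the pattern (1, 0, 1) has total score 2.
deltas = [-1.0, 0.5, 1.2]
for beta in (-2.0, 0.0, 3.0):   # very different abilities
    print(round(p_pattern_given_total(beta, deltas, (1, 0, 1)), 6))
# All three printed values are the same: the ability parameter cancels out.
```

Algebraically, the person parameter appears identically in the numerator and in every term of the denominator, so the conditional distribution of patterns depends only on the item difficulties.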

Likewise, in estimating a person's ability there is no requirement for any particular shape of the distribution of item difficulties (Andrich, 1988). However, for the purpose of precision, and for assessing the quality of the items of a test, good engagement between persons and items is required. Thus, the items of a test should be distributed across the relevant region of the continuum on which persons are located.

In addition, it does not mean that person characteristics, such as gender, are not important. In fact, the comparison Rasch referred to is within a specified frame of reference. Rasch used a two-way frame of reference (Andrich, 2005a). The frame of reference (F) is a specification of the collection of some elements, namely agents (A), objects (O), and reaction (R) or outcome as a result of contact between an agent and an object (Rasch, 1977). In Andrich’s words (Andrich, 1985a, p. 44), it includes “a definition of the class of persons, the class of items, and any other relevant conditions that would ensure that the objective relationships were maintained”. The two-way frame of reference is “the smallest order for constructing measures” (Andrich, 1988, p. 19). In other cases it may extend to more than a two-way frame of reference. The two-way frame of reference is shown in Table 2.1.

Rasch called the comparison in the specified frame of reference “specifically objective”. It is called objective because “any comparison of two objects within O is independent of the choice of the agents within A and also of the other elements in the collection of objects”. It is specific because “the objectivity of these comparisons is restricted to the frame of reference F defined” (Rasch, 1977, p. 77).

Table 2.1. Rasch’s Two-way Frame of Reference of Objects, Agents and Responses

                              Agents (Items)
                    A1    A2    …    Ai    …    AI
  Objects     O1    x11   x12   …    x1i   …    x1I
  (Persons)   O2    x21   x22   …    x2i   …    x2I
              ⋮     ⋮     ⋮          ⋮          ⋮
              Ov    xv1   xv2   …    xvi   …    xvI
              ⋮     ⋮     ⋮          ⋮          ⋮
              OV    xV1   xV2   …    xVi   …    xVI

Note. x = response

(b) Dichotomous and Polytomous Models

Rasch formulated a model that met the invariant comparison requirement with a probability function (Andrich, 1988). For dichotomous response data, the probability of a person answering an item correctly is a function of the difference between the person parameter (ability) and item parameter (difficulty). The function can be expressed in a logarithmic metric (logits).

The model for dichotomous response data, called the Simple Logistic Model (SLM) or Dichotomous Rasch Model (DRM), is presented as

Pr{Xni = x} = exp[x(βn − δi)] / [1 + exp(βn − δi)]     (2.1)

where x = 1 or 0.

Equation 2.1 can take two forms:

Pr{Xni = 1} = exp(βn − δi) / [1 + exp(βn − δi)];   Pr{Xni = 0} = 1 / [1 + exp(βn − δi)]     (2.2)

where Pr{Xni = 1} is the probability that person n will answer item i correctly, Pr{Xni = 0} is the probability that person n will answer item i incorrectly, βn is the location or ability of person n on the latent variable, and δi is the location or difficulty of item i on the latent variable.

The above equations show that the relation between parameters (βn and δi ) is additive.

It is clear that in the Rasch model it is only the difference between βn and δi that governs the probability that a person will get an item correct.
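A minimal sketch in Python makes this concrete (the abilities and difficulties below are arbitrary illustrative values): the probability in Equation 2.1 depends only on the difference βn − δi, and equals 0.5 when the two parameters coincide.

```python
import math

def rasch_prob(x, beta, delta):
    # Pr{Xni = x} under the dichotomous Rasch model (Equation 2.1)
    return math.exp(x * (beta - delta)) / (1 + math.exp(beta - delta))

# Only the difference beta - delta matters: shifting both parameters by
# the same constant leaves the probability unchanged.
print(round(rasch_prob(1, beta=1.0, delta=0.3), 6))
print(round(rasch_prob(1, beta=2.0, delta=1.3), 6))   # same value

# When beta equals delta, the probability of a correct response is 0.5.
print(rasch_prob(1, beta=0.7, delta=0.7))             # 0.5
```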

An extension of the DRM which applies to items with polytomous responses in ordered categories is called the Polytomous Rasch Model (PRM) or Extended Logistic Model.

The polytomous Rasch model takes the form

Pr{Xni = x} = exp[∑(k=1 to x)(βn − δi − τki)] / ∑(x′=0 to mi) exp[∑(k=1 to x′)(βn − δi − τki)]     (2.3)

where x ∈ {0, 1, 2, …, mi} is the integer response variable for person n with ability βn responding to item i with difficulty δi, and τ1i, τ2i, …, τmii are thresholds between the mi + 1 ordered categories, where mi is the maximum score of item i and τ0 ≡ 0, so that the sum in the numerator is defined to be zero when x = 0 (Andrich, 1978, 2005a; Wright & Masters, 1982).
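Equation 2.3 can be sketched directly in Python; the thresholds and locations below are hypothetical illustrative values:

```python
import math

def prm_prob(x, beta, delta, taus):
    """Pr{Xni = x} under the polytomous Rasch model (Equation 2.3).
    taus holds the thresholds tau_1 .. tau_m; tau_0 = 0 is implicit."""
    def kernel(score):
        # exp of the sum over k = 1..score of (beta - delta - tau_k)
        return math.exp(sum(beta - delta - taus[k] for k in range(score)))
    gamma = sum(kernel(s) for s in range(len(taus) + 1))  # normalising sum
    return kernel(x) / gamma

# Hypothetical three-category item (mi = 2) with thresholds -0.5 and 0.5.
probs = [prm_prob(x, beta=0.0, delta=0.0, taus=[-0.5, 0.5]) for x in (0, 1, 2)]
print([round(p, 4) for p in probs])   # the three probabilities sum to 1
```

Because the person is located exactly at the item location and the thresholds are symmetric about it, the probabilities of the two extreme categories are equal here.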

When the thresholds are equidistant the model takes the form

Pr{Xni = x} = exp[x(βn − δi) + x(mi − x)θ] / ∑(x′=0 to mi) exp[x′(βn − δi) + x′(mi − x′)θ]     (2.3a)

where θ is the average half distance between thresholds. It is clear that θ indicates the spread of thresholds.
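One way to realise equidistant thresholds, used here purely for illustration, is τk = (2k − mi − 1)θ, which spaces the thresholds 2θ apart and centres them on the item location. Under that assumption, a short Python check confirms that the general model and the equidistant-threshold form give identical probabilities:

```python
import math

def prm_prob(x, beta, delta, taus):
    # General polytomous Rasch model (Equation 2.3); tau_0 = 0 implicit
    kernel = lambda s: math.exp(sum(beta - delta - taus[k] for k in range(s)))
    return kernel(x) / sum(kernel(s) for s in range(len(taus) + 1))

def equidistant_prob(x, beta, delta, m, theta):
    # Equidistant-threshold form (Equation 2.3a)
    kernel = lambda s: math.exp(s * (beta - delta) + s * (m - s) * theta)
    return kernel(x) / sum(kernel(s) for s in range(m + 1))

# Thresholds spaced 2*theta apart: tau_k = (2k - m - 1) * theta, k = 1..m.
m, theta = 3, 0.6
taus = [(2 * k - m - 1) * theta for k in range(1, m + 1)]
beta, delta = 0.4, -0.1
for x in range(m + 1):
    print(round(prm_prob(x, beta, delta, taus), 6),
          round(equidistant_prob(x, beta, delta, m, theta), 6))  # pairs agree
```

The agreement follows because, with this spacing, the negated sum of the first x thresholds reduces algebraically to x(mi − x)θ.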

Equation 2.3 is a general model; therefore it can be applied to dichotomous and polytomous responses. In the case of dichotomous data, Equation 2.3 becomes a special case in which there is only one threshold. It can be presented as

Pr{Xni = x} = exp[x(βn − δi)] / [1 + exp(βn − δi)]     (2.4)

where x ∈ {0, 1} and there is only one threshold, δi.
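This special case can be verified numerically. The sketch below, with arbitrary illustrative parameter values, shows that Equation 2.3 with a single threshold reproduces Equation 2.4:

```python
import math

def prm_prob(x, beta, delta, taus):
    # Polytomous Rasch model, Equation 2.3 (tau_0 = 0 implicit)
    kernel = lambda s: math.exp(sum(beta - delta - taus[k] for k in range(s)))
    return kernel(x) / sum(kernel(s) for s in range(len(taus) + 1))

def slm_prob(x, beta, delta):
    # Dichotomous special case, Equation 2.4
    return math.exp(x * (beta - delta)) / (1 + math.exp(beta - delta))

# With a single threshold (taken here at zero) the general model
# reproduces the dichotomous model exactly.
beta, delta = 0.8, -0.2
for x in (0, 1):
    print(round(prm_prob(x, beta, delta, taus=[0.0]), 6),
          round(slm_prob(x, beta, delta), 6))   # pairs agree
```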

(c) Parallel Item Characteristics Curves (ICCs) in Items with Dichotomous Responses

Graphically, the probability of a person getting an item correct based on Equation 2.1 is shown by an ICC. To illustrate, the ICCs of three items with locations of -1.59, -0.64, and 0.68 respectively, are presented in Figure 2.1.

Figure 2.1. ICCs of three items with dichotomous responses

It follows that as the person location or ability increases, the probability of getting an item correct also increases. For example, a person whose location is above -1.59 has a probability of more than 0.5 of getting item 1 correct, while a person with a location below it has a probability of less than 0.5. A person with a location of exactly -1.59 has a probability of 0.5 of getting item 1 correct. This shows that the location of an item is the point on the continuum at which a person has a probability of 0.5 of getting the item correct. This means that for dichotomous responses the item location is where a person has an equal probability of answering an item correctly (1) or incorrectly (0).

In the Rasch model the ICCs are parallel. This is different from other item response theory (IRT) models, namely the two-parameter logistic model (2PL)1 and the three-parameter logistic model (3PL)2, where the ICCs can cross each other. When the ICCs cross, the ordering of items by difficulty is not the same for persons at different locations. This means the requirement of invariant comparison is not met, because the comparison of the difficulties between two items cannot be made independently of the ability of any person. Therefore, parallel ICCs for items with dichotomous responses are a distinctive property of the Rasch model and reflect the property of invariance of comparison (Wright, 1997).

(d) Category Characteristic Curves and Threshold Characteristic Curves for Items with Polytomous Responses

It was shown earlier that ICCs for items with dichotomous responses depict the probability of persons getting an item correct. In the case of items with polytomous responses, category characteristic curves (CCCs) show the probability of each response category and add to the information provided by the ICC.

Figure 2.2 shows CCCs (the curves with bold lines) for an item with three response categories (0, 1, 2). It shows that an item with three categories has two points where adjacent category curves intersect. The first intersection is between categories 0 and 1, and the second between categories 1 and 2. In the PRM, these intersection points are called thresholds (τ). In the case of three categories, there are two thresholds, τ1 and τ2.

1 The 2PL model parameterizes difficulty and discrimination of an item

2 The 3PL model parameterizes difficulty, discrimination, and guessing of an item

Figure 2.2. CCCs and TCCs of an item with three response categories

It appears that the curve for the first category (score 0) shows a monotonic decreasing pattern and that for the third category (score 2) shows a monotonic increasing pattern.

However, the curve for the middle category (score 1) is not monotonic, but shows a single peak. As the proficiency increases, the probability of a score of 1 increases. At some point, however, as the proficiency increases, the probability of getting a score of 1 starts to decrease.

In Figure 2.2, Threshold Characteristic Curves (TCCs) are presented as dotted lines, and they are parallel. The TCCs show the conditional probability of success at each latent threshold, given that the response is in one of the two categories adjacent to that threshold. The figure also shows the distances between thresholds. For responses to fit the model, thresholds are expected to be in their natural order, with a reasonable distance between them. When the thresholds are very close to each other, the ordered categories may not be working as intended.
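The two properties described above — that CCCs of adjacent categories intersect at the thresholds, and that each TCC has the dichotomous form — can be checked with a short sketch; the item parameters below are hypothetical:

```python
import math

def category_probs(beta, delta, taus):
    # Category probabilities for a polytomous Rasch item (Equation 2.3)
    kernels = [math.exp(sum(beta - delta - taus[k] for k in range(s)))
               for s in range(len(taus) + 1)]
    gamma = sum(kernels)
    return [k / gamma for k in kernels]

delta, taus = 0.0, [-0.7, 0.7]   # hypothetical three-category item

# At the first threshold the CCCs of categories 0 and 1 intersect.
p = category_probs(beta=delta + taus[0], delta=delta, taus=taus)
print(abs(p[0] - p[1]) < 1e-12)   # True

# The TCC is the probability of the higher of two adjacent categories,
# conditional on being in one of them; it has the dichotomous form.
beta = 0.3
p = category_probs(beta, delta, taus)
tcc1 = p[1] / (p[0] + p[1])
direct = math.exp(beta - delta - taus[0]) / (1 + math.exp(beta - delta - taus[0]))
print(abs(tcc1 - direct) < 1e-12)  # True
```

Because every TCC reduces to the same dichotomous form, differing only in its threshold location, the TCCs are parallel, mirroring the parallel ICCs of the dichotomous model.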

Lastly, Figure 2.2 shows that the distance between thresholds is 2θ. The θ parameter, as shown in Equation 2.3a, is the average half-distance between thresholds.

This parameter will be used in examining local dependence and in detecting a distractor with information.

(e) Resolution of Paradox: Attenuation, Differences between Two Scores, and Standard Error

Application of the Rasch model inherently overcomes the problem of the attenuation paradox found in classical test theory (CTT). The paradox refers to a situation where an increase in reliability with items of increasing discrimination does not lead to an increase in validity (Andrich, 2010).

It is generally understood that tests need both high reliability and validity. In addition, it is understood that to have high validity it is necessary to have high reliability.

Therefore, to facilitate validity it is considered important to have a high reliability. In general, high reliability is achieved with items which have high discriminations.

Therefore, in CTT, in which the focus is on reliability, it is assumed that the higher the discrimination of an item the better. However, the paradox is that it is possible to increase reliability in a way that, for the same number of items, decreases validity. Such an increase in reliability, at the expense of validity, arises when items discriminate very highly for artificial reasons, for example, when items are redundant with other items.

These redundant items, with artificially high discrimination, are not adding new information and therefore the increase in reliability they produce is at the expense of validity.

In contrast, in the Rasch model, using the ICC as a criterion, extremely high and low discriminations are considered violations of the model and therefore violations of sound assessment, including invariance of comparison. In particular, items with very high discrimination need to be studied as they may indicate redundancy. Therefore, the test will consist of items with non-extreme (average) discrimination based on the criterion of the ICC, and not those with very high discrimination, which may produce artificially high reliability. Also, by choosing items at different locations but still around person locations, validity and precision of measurement are both increased.

A second paradox resolved by the Rasch model relates to the difference between two raw scores and the standard error (Andrich, 2010). In CTT, the difference between two raw scores is considered the same across the score continuum. Similarly, the standard error of measurement is the same for every score. However, it is acknowledged that differences between scores in the middle of the continuum and at the extremes have different meanings (Andrich, 2010; Wright, 1997). Likewise, the standard errors should not be the same for every score.

The Rasch model resolves the paradox by transforming raw scores onto a linear scale, so that the same raw-score difference has a different meaning at different locations on the continuum. As such, the same raw-score difference is greater at the extremes than in the middle, and the standard error in the middle of the measurement continuum is smaller than at the extremes.

This implies that the raw score is a misleading measure. It favours middle scores over extreme scores (Wright, 1997). Wright showed a typical relationship between raw scores and a linear scale (logits) to illustrate the magnitude of the effect that can result from using raw scores. He showed that a 10 percent difference in raw scores in the middle of the continuum, for example between raw scores of 45 and 55, is equal to 0.6 logits, while a 10 percent difference in raw scores at the extremes of the continuum, for example from 88 to 98, is equal to 2.8 logits. Thus, a difference that seems equal in raw scores is actually approximately five times greater on the logit scale.
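Wright's figures come from an empirical test, but the order of the effect can be illustrated with a crude logit transform of proportion-correct scores on a hypothetical 100-item test; this simple transform only approximates Wright's exact values:

```python
import math

def logit(p):
    # Log-odds of a proportion-correct score
    return math.log(p / (1 - p))

# The same 10-point raw-score difference on a 100-item test:
middle = logit(0.55) - logit(0.45)    # raw scores 45 -> 55
extreme = logit(0.98) - logit(0.88)   # raw scores 88 -> 98
print(round(middle, 2))           # about 0.4 logits
print(round(extreme, 2))          # about 1.9 logits
print(round(extreme / middle, 1)) # several times larger at the extreme
```

Even this simple transform shows that an apparently equal raw-score gain is several times larger in logits near the extreme of the scale than in the middle.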

(f) Unidimensional Model and Statistical Independence

The Rasch model is a unidimensional model which requires statistical independence in the responses (Andrich, 1988). It is unidimensional because it has only one person parameter, that is ability or proficiency in a particular dimension (β ). In this way persons can be distinguished based on their performance on one variable or dimension.

However, this does not mean that no other factors influence a person's response, because many factors, cognitive and non-cognitive, determine human behaviour, including test performance. In measuring a person's attribute it is considered convenient to focus on only one variable. In doing so, a comparison between persons can be made based on the difference between them on the variable measured (Andrich, 1988). A total score on a unidimensional test can then be used to characterize a person.

Statistical independence means the probability of a certain outcome is independent of other outcomes. In relation to a person’s responses to more than one item, this means the person’s response on one item does not depend on the responses to other items (Andrich, 1988).
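Statistical independence can be expressed directly: the probability of a whole response pattern is the product of the item probabilities. A minimal sketch, with hypothetical item difficulties:

```python
import math
from itertools import product

def p_item(x, beta, delta):
    # Dichotomous Rasch probability for a single item
    return math.exp(x * (beta - delta)) / (1 + math.exp(beta - delta))

def p_pattern(beta, deltas, pattern):
    # Under statistical independence, the probability of a response
    # pattern is the product of the individual item probabilities.
    return math.prod(p_item(x, beta, d) for x, d in zip(pattern, deltas))

deltas = [-1.0, 0.0, 1.0]   # hypothetical item difficulties
print(round(p_pattern(0.5, deltas, (1, 1, 0)), 6))

# The pattern probabilities form a proper distribution: summed over all
# 2**3 possible patterns they equal 1.
total = sum(p_pattern(0.5, deltas, pat) for pat in product([0, 1], repeat=3))
print(round(total, 6))   # 1.0
```

Response dependence breaks exactly this factorisation: if one response alters the probability of another, the product no longer gives the pattern probability.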

Marais and Andrich (2008b) called the violation of unidimensionality trait dependence and the violation of statistical independence response dependence. In the literature, trait dependence and response dependence are usually not distinguished, and they are both categorised as violations of local independence.

2.2.2 The Rasch Paradigm

The Rasch model is often called the one-parameter logistic model because there is only one item parameter, namely its difficulty (δi), in the model. It is also considered the simplest model of IRT (Embretson & Reise, 2000). “IRT” is the generic term used to cover a range of response models for test data, of which the Rasch model is algebraically the most special case. However, the Rasch model is not just the simplest model of IRT (Andrich, 2004; Ryan, 1983).

The Rasch model differs from other IRT models in terms of its paradigm (Andrich, 2004). IRT models, according to Andrich, are set in a traditional statistical paradigm of data analysis, while the Rasch model has a different paradigm. In the former, the function of a model is to account for the data: one model is chosen over another because it fits the data better. In the Rasch paradigm, on the other hand, the model is not chosen to describe the data but to serve as a frame of reference in constructing the measurement of variables. It serves as a prescriptive and diagnostic tool to construct and check measurements. When the data do not fit the Rasch model, the requirement of invariance is not met; the data need to be checked and explanations for misfit sought. In this sense the model serves as a diagnostic tool. Rasch set the precedent for such an approach in the early 1950s when he found inconsistencies between his model and data from a military intelligence test (Rasch, 1960/1980). Instead of modifying his model he checked the data and then proposed changes in item construction, which resulted in better fitting data. This showed that a model for measurement can serve as a guide to data collection (Andrich, 2004).

2.2.3 The Function of Measurement in Science and the Rasch Paradigm

The Rasch paradigm is compatible with Kuhn's view (1961) of the function of measurement in science. According to Kuhn, measurement has a specific function, and measurement attempts should be directed by theory. Specifically, theory should precede or guide measurement. Measurement conducted without a theory provides nothing but numbers. In Kuhn's words (p. 175), “numbers gathered without some knowledge of the regularity to be expected almost never speak for themselves. Almost certainly they remain just numbers”. Measurements conducted on the basis of a theory can show whether they deviate from the theory or not. In Kuhn's terms, the function of measurement is to disclose anomalies. In his words,

To the extent that measurement and quantitative technique play an especially significant role in scientific discovery, they do so precisely because, by displaying serious anomaly, they tell scientists when and where to look for a new qualitative phenomenon (Kuhn, 1961, p. 180).

Based on the Rasch paradigm, as mentioned earlier, a model is chosen independent of any data. The model serves as a guide or a frame of reference in constructing measurement. Therefore, using this approach, which is referenced to a model derived from measurement theory, anomalies in the data can also be disclosed.

2.2.4 Criticism of the Rasch Model

The Rasch model has been criticised mainly for its simplicity. Specifically, it is criticised for not incorporating an item discrimination parameter. With fewer parameters, in general, the model does not fit the data as well as a model with more parameters. Bock (1997), for example, indicated that, in practice, equal item discrimination is almost impossible to find. In addition, information on item discrimination is needed in test construction to “ensure good test reliability and favourable score distribution” (p. 27). Divgi (1986) concluded that the Rasch model did not work for multiple-choice items. In his research, other models, which incorporate more parameters, fitted the data better than the Rasch model.

Embretson and Reise (2000), although acknowledging the strengths of the Rasch model, still do not recommend applying the model in all situations. According to them, for some psychological measures, varying item discrimination is unavoidable. Therefore, they consider it better to apply a more complex model than the Rasch model. They argue that this prevents deleting important items, which may lead to changes in the construct. However, it could be argued that such an approach implies deleting misfitting items based purely on statistical criteria. In the Rasch paradigm statistical criteria are not sufficient grounds for deleting items.

It appears from the above exposition that the major criticism of the Rasch model is that it is unlikely to fit the data. There is a view that a model of measurement works when it can explain the data. Data, from this perspective, are considered always correct.

Therefore, Andrich (2004) notes that controversy surrounding the Rasch model is primarily because of the different paradigms held by the proponents of each model. The criticism illustrated above comes from those who hold to a traditional paradigm that the function of a model is to explain the data, while according to the Rasch paradigm, the Rasch model functions as a guide in constructing measures.

2.2.5 Implication of Using the Rasch Model and its Paradigm in Evaluating Tests

Using the Rasch model and its paradigm in evaluating a test means evaluating a test based on the properties of the model. The Rasch model is a unidimensional model which requires statistical independence in the responses (Andrich, 1988). Therefore, violations of these two conditions (unidimensionality and statistical independence) by the data guide the analysis and recommendations in this study. Not only are these two properties central to the model, they are also central to what is required of the test data.

Specifically, unidimensionality is a reflection of being able to use the total score on items to characterize a person, while statistical independence implies that the different items provide relevant but distinct information in building up the total score. This gives focus to the source of any anomalies disclosed by evidence of misfit of data to the model. Accordingly, in this study, fit to the Rasch model in general and, more specifically, violations of unidimensionality and statistical independence are examined.

In addition, targeting and reliability are also reported. Targeting provides information on how well matched the distributions of item and person locations are. To obtain accurate measurement, the items administered to a person should be well targeted. As Wright (1997) established, administering well-targeted items is one way of minimizing guessing and a factor contributing to the accurate estimation of item and person locations.

Reliability shows internal consistency among the items in measuring the variable of interest. Specifically, the Person Separation Index (PSI) provides information on how well the items separate persons on the variable to be measured and how powerful the items are in disclosing misfit. A detailed explanation of the PSI is presented in Chapter 3.

As described earlier, the Rasch model arises from the requirement of invariant comparisons within a specified frame of reference. Thus, any subset of responses should result in the same item parameter estimates. As an implication, invariance of estimates across classes of persons needs to be examined. Because some classes of persons are readily identified, such as those defined by gender and educational background, differential item functioning (DIF) with respect to these classes is examined.

Another implication of invariant comparison for the present study is examining the stability of item parameters in the ISAT item bank. As noted earlier, ISAT items were obtained from an item bank. The responses used to estimate item parameters from the item bank and from this study come from different persons. As such, whether the item parameters from the item bank and from the analyses in this study are invariant needs to be checked.