
Chapter 3 Methods

3.1 Rationale and Procedure in Examining Internal Consistency

3.1.1 Treatment of Missing Responses

In administering educational and psychological tests, it is common to find missing responses. This occurs when test takers skip some items or when there is no time to attempt the later items. The former is called an omitted response; the latter is called a not-reached response (Ludlow & O'Leary, 1999; Zhang & Walker, 2008).

Missing responses have been of concern because of their potential impact on the estimation of person and item parameters (Longford, 1994; Ludlow & O'Leary, 1999; Mislevy & Wu, 1988; Zhang & Walker, 2008). A recent study conducted by Zhang and Walker (2008), for example, indicates that treating missing responses as incorrect responses is not advisable because it underestimates person abilities and misidentifies the persons as not fitting the IRT model.

Ludlow and O’Leary (1999) also studied the effect of missing responses on person ability estimates. Their research examined four scoring strategies. The first is treating omitted and not-reached items as not administered, which means only items which have complete responses are analysed. The problem with this strategy is that a student who attempted fewer items and got a lower raw score could have a higher ability estimate than one who attempted more items and had a higher raw score. In practice, this strategy encourages students to attempt fewer items to maximize their score.

The second strategy is treating any omitted and not-reached response as an incorrect response, which means that all persons are expected to attempt all items. The problem with this strategy is that it can be considered unfair to persons who did not attempt all items.

Another problem with this second strategy relates to the item statistics for the last items. Firstly, these items become more difficult, not because many persons had incorrect responses, but because not many persons attempted these items. Secondly, the fit statistic may be inflated, either because persons with higher ability failed on the easy items or those with lower ability succeeded on the difficult items. This problem is especially relevant in item development for item banks because the item difficulty estimate for an item bank should not depend on where the item is located in a test.

The third strategy is treating omitted responses as incorrect and not-reached items as not administered. The problem with this strategy is almost the same as with the first strategy. There is an effect on person ability estimates because of not taking into account the not-reached items. This strategy also encourages students not to attempt all items.

The fourth strategy is to derive the difficulty and ability estimates from a two-phased procedure. This strategy is expected to minimize the effect of omitted responses and encourage students to attempt all items. In the first phase, only item difficulty is estimated. Omitted responses are treated as incorrect responses and not-reached items as not administered. In the second phase, student abilities are estimated using item parameters estimated in the first phase. However, in this second phase, omitted responses and not-reached items are all scored as incorrect responses.

Ludlow and O'Leary's work shows that each missing response treatment has its own impact and that it is possible to take advantage of each treatment. They show that the treatment of missing responses may need to be differentiated for different testing purposes: the treatment of missing responses in estimating item parameters may differ from the treatment in estimating ability parameters. This is especially true in high-stakes testing situations, such as certification or selection. It could be considered unfair if, in estimating ability parameters, omitted responses or not-reached items were scored as missing responses instead of incorrect responses.

This study, the analysis of the ISAT, is an analysis of a high-stakes test in which all missing responses were initially scored as incorrect responses. However, as mentioned earlier, this study examines the validity of the test, so there is concern about the item and person estimates. In this situation, using the data exactly as obtained from the testing situation may not be wise because, as indicated earlier, the estimates might be less accurate. Therefore, it needs to be examined whether different missing response treatments have an effect on the results of the analysis, particularly the item estimates.

Three missing response treatments were identified and are summarised in Table 3.1.

The effect of these three missing response treatments on the item estimates was examined by considering differences in the reliability index and the mean item fit residual.

Data which result in a greater reliability index and smaller mean item fit residual will then be used as a base data set for subsequent analyses. Any significant differences among respective indices would be studied with respect to the completion of the items at the end of each subtest. The description of the statistics is provided in the next section; the reliability index in the discussion on targeting and reliability, and the fit residual in the section on item fit.

Table 3.1. Treatment of Missing Responses for Item Estimates

Condition   Treatment

A           Any omitted or not-reached response scored as incorrect.

B           Any omitted or not-reached response scored as missing/not administered.

C           Omitted responses scored as incorrect; not-reached responses scored as missing/not administered.

Missing responses are considered omitted when they occur anywhere in the subtest and not only at its end; they are considered missing/not administered (not reached) when they occur only at the end of the subtest.
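As an illustration of Table 3.1, the following sketch (in Python; not the procedure used by RUMM2030) shows how a single response vector could be recoded under conditions A, B and C. It assumes that blanks at the end of a subtest are not-reached responses and that all other blanks are omissions; the response vector and coding conventions are invented for this example.

```python
# A minimal sketch of the three treatments in Table 3.1 applied to one person's
# responses. Conventions assumed here: responses are 0/1, None marks a blank
# response, trailing blanks are "not reached", and any other blank is "omitted".
from typing import List, Optional

INCORRECT = 0
NOT_ADMINISTERED = None  # scored as missing/not administered

def split_missing(row: List[Optional[int]]) -> List[str]:
    """Label each response as 'answered', 'omitted' or 'not_reached'."""
    labels = ["answered" if r is not None else "omitted" for r in row]
    # Walk backwards: blanks at the end of the subtest are not-reached.
    for i in range(len(row) - 1, -1, -1):
        if row[i] is not None:
            break
        labels[i] = "not_reached"
    return labels

def apply_treatment(row: List[Optional[int]], condition: str) -> List[Optional[int]]:
    """Recode one person's responses under condition A, B or C of Table 3.1."""
    labels = split_missing(row)
    recoded = []
    for response, label in zip(row, labels):
        if label == "answered":
            recoded.append(response)
        elif condition == "A":                      # all missing -> incorrect
            recoded.append(INCORRECT)
        elif condition == "B":                      # all missing -> not administered
            recoded.append(NOT_ADMINISTERED)
        else:                                       # condition C
            recoded.append(INCORRECT if label == "omitted" else NOT_ADMINISTERED)
    return recoded

# Example: one person skips item 2 and never reaches the last two items.
row = [1, None, 0, 1, None, None]
print(apply_treatment(row, "A"))  # [1, 0, 0, 1, 0, 0]
print(apply_treatment(row, "B"))  # [1, None, 0, 1, None, None]
print(apply_treatment(row, "C"))  # [1, 0, 0, 1, None, None]
```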

3.1.2 Item Difficulty Order

As indicated earlier, the Rasch paradigm is employed as a guide in constructing a measure of a variable. The goal is a valid test with a specific internal structure. Therefore, the order in which the items are presented also needs to be examined.

Differences in item difficulty are central to constructing and understanding the variable measured by a test. However, test administration also needs to take into account the order of the items in terms of their difficulty. To ensure maximum validity of person engagement, the presentation of items should be in the order of difficulty with the easiest item as the first item (Andrich, 2005b).

Much research has been conducted to investigate the effect of item order on test performance. Leary and Doran (1985) reviewed studies examining this topic. Some studies examined the effect of item arrangement (from easy to difficult items or from difficult to easy items) on test performance. In those studies it was found that when items were arranged from difficult to easy, examinees did not reach the end of the test. The explanation was that the testing had a time limit, even though some tests were categorized as power tests, that is, tests in which it is intended that all examinees have time to complete all the items. As a consequence, different item arrangements led to different test difficulty. In addition, when items were presented from difficult to easy, the test became more difficult, as shown by a lower number of correct responses, than when items were presented from easy to difficult (Hambleton & Traub, 1974; MacNicol, in Leary & Doran, 1985). Hambleton and Traub also found that arranging items from difficult to easy generated a more stressful situation for the examinees.

In principle, the ISAT items have been arranged according to their difficulty from the item bank. However, in the case of multiple items with the same stimulus, the difficulty order is within the same stimulus or the same type of task. In the Verbal and Reasoning subtests, where the type of task is the same in each section, the easy items come first in each section. In the Quantitative subtest, especially the Arithmetic and Algebra and the Geometry sections, where there are different kinds of questions in each section, items are not always in difficulty order. For example, in Arithmetic and Algebra, non-applied problems are presented earlier than applied ones; within each type there are easy and difficult questions, and within each type they are in difficulty order. Similarly, in terms of how problems are displayed, in Geometry, pictorial items are presented earlier than narrative items, and among pictorial items, plane-figure items are presented earlier than solid-figure items.

Because of these discipline-specific arrangements, even within one section, we cannot expect that the earlier items are always the easier ones. This is nevertheless a problem from the perspective of students taking a test, who may use up time on early difficult items and then not have time to answer a later, easier item. It is therefore necessary to ensure that there is enough time for all examinees to complete the test.

In checking whether the items were ordered according to their difficulty, one needs to distinguish between item order as presented in the test booklet and item order as estimated from examinee responses. Two questions can be asked regarding item order. The first is whether the items have been arranged based on their difficulty, in this case based on the item locations from the item bank. The second is whether the estimated locations are in the same order as the items are presented, and whether this is the same order as the item locations from the item bank. To examine the consistency of item order under these conditions, scatter plots are shown which display the item order under the two conditions: according to the locations from the item bank and according to the analyses of postgraduate and undergraduate responses.
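The sketch below illustrates, with invented item locations, how such a comparison of item orders could be made: a rank correlation between the presentation order and each set of locations, and a scatter plot of item-bank locations against estimated locations. It is only an illustration of the check described above, not the analysis reported in this study.

```python
# Comparing presentation order, item-bank locations and estimated locations
# for a hypothetical ten-item subtest.
import numpy as np
from scipy.stats import spearmanr
import matplotlib.pyplot as plt

presentation_order = np.arange(1, 11)                     # order in the test booklet
bank_locations = np.array([-1.8, -1.2, -0.9, -0.4, 0.0,   # hypothetical logit values
                           0.3, 0.7, 1.1, 1.5, 2.0])      # from the item bank
estimated_locations = np.array([-1.6, -1.3, -0.5, -0.7, 0.1,
                                0.2, 0.9, 1.0, 1.7, 1.9])  # from a response analysis

# Rank-order agreement between presentation order and each set of locations.
print(spearmanr(presentation_order, bank_locations))
print(spearmanr(bank_locations, estimated_locations))

# Scatter plot of item-bank locations against estimated locations.
plt.scatter(bank_locations, estimated_locations)
plt.xlabel("Item-bank location (logits)")
plt.ylabel("Estimated location (logits)")
plt.title("Consistency of item difficulty order")
plt.show()
```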

3.1.3 Targeting and Reliability

As the aim of measurement is to generate precise estimates of persons' locations on a variable continuum, the targeting of the items to persons and the reliability of the measurements need to be examined in evaluating a test.

Items are considered well targeted when they are neither too easy nor too difficult for the persons (Andrich, 2005b). How well matched the item and person locations are can be examined by comparing the mean person location with the mean item location (Tennant & Conaghan, 2007). The mean person location should be around 0, as the mean location of the items is 0 by default.

RUMM2030 reports two reliability indices: Cronbach's Alpha (α) for complete data and the PSI for both complete and incomplete data. The value of each index ranges from 0.00 to 1.00, and the interpretation of both is the same: they indicate the proportion of non-error variance among persons relative to the total variance. The higher the index, the greater the proportion of non-error variance. This index of reliability is also referred to as an index of internal consistency.

Both Cronbach's α and the PSI are derived from the CTT proposition that the observed score of person n (Xn) comprises two components, the true score (Tn) and the measurement error (En). These two components (T and E) are not correlated. The relation can be expressed as

Xn = Tn + En (3.1)

Because T and E are not correlated, the variance of the observed scores is the sum of the variance of the true scores and the variance of the errors. This can be presented as

\sigma_X^2 = \sigma_T^2 + \sigma_E^2 \qquad (3.2)

where \sigma_X^2 is the variance of the observed scores, \sigma_T^2 is the variance of the true scores, and \sigma_E^2 is the variance of the errors.

Reliability is defined as the ratio of true variance to total variance. It can be expressed as

r_{tt} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_X^2 - \sigma_E^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2} \qquad (3.3)

For a test consisting of items scored dichotomously or polytomously and administered on a single occasion, Cronbach's α can be calculated using the formula shown in Equation 3.4.

\alpha = \frac{I}{I-1}\left(1 - \frac{\sum_{i=1}^{I}\sigma_i^2}{\sigma_X^2}\right) \qquad (3.4)

where I is the number of items in the test, \sigma_i^2 is the variance of the scores on item i, and \sigma_X^2 is the variance of the total test scores.
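As a worked illustration of Equation 3.4, the following sketch computes Cronbach's α for a small, invented matrix of dichotomous scores; it assumes complete data, as the formula requires.

```python
# Cronbach's alpha (Equation 3.4) for a complete persons-by-items score matrix.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """(I / (I - 1)) * (1 - sum of item variances / variance of total scores)."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item's scores
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([[1, 1, 0, 1],
                   [1, 0, 0, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 1],
                   [1, 1, 1, 0]])
print(cronbach_alpha(scores))
```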

The PSI, the reliability index from the Rasch model, has a similar structure to Cronbach's Alpha in CTT. However, the latent ability estimate, \hat{\beta}_n, is used instead of the observed score, and the actual latent value, \beta_n, is used instead of the true score in the calculation of the PSI. This index is calculated using Equation 3.5.

r_\beta = 1 - \frac{\bar{\sigma}_\varepsilon^2}{\sigma_{\hat{\beta}}^2} \qquad (3.5)

where

\bar{\sigma}_\varepsilon^2 = \frac{1}{N}\sum_{n=1}^{N}\hat{\sigma}_n^2 \qquad (3.6)

is the average of the error variances of individual person estimates.
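A corresponding sketch of Equations 3.5 and 3.6 is given below. It assumes that person ability estimates and their standard errors are available from a prior Rasch analysis (the values shown here are invented); RUMM2030 reports the PSI directly, so this is only an illustration of the structure of the index.

```python
# PSI (Equations 3.5 and 3.6) from ability estimates and their standard errors.
import numpy as np

def person_separation_index(ability_estimates: np.ndarray,
                            standard_errors: np.ndarray) -> float:
    """PSI = 1 - (mean error variance) / (variance of ability estimates)."""
    mean_error_variance = np.mean(standard_errors ** 2)   # Equation 3.6
    estimate_variance = np.var(ability_estimates, ddof=1)
    return 1 - mean_error_variance / estimate_variance    # Equation 3.5

abilities = np.array([-1.4, -0.6, -0.1, 0.3, 0.8, 1.5])   # logits (invented)
errors = np.array([0.45, 0.40, 0.38, 0.38, 0.41, 0.47])   # SEs of the estimates
print(person_separation_index(abilities, errors))
```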

Although both reliability indices show a similar structure and have the same interpretation, they are different in some ways. Firstly, as shown above, the PSI is based on estimated abilities while Cronbach's α is based on raw scores. Secondly, the PSI can be calculated even when the data are incomplete, while Cronbach's α can be calculated only when there are no missing responses. Lastly, the PSI indicates how well the test separates persons. A higher index pertains to higher precision of the test in separating persons. This relates to the power of the test in detecting misfit: the higher the PSI, the more power the test has in detecting misfit. In ideal conditions with respect to targeting and with complete data, Cronbach's α and the PSI give virtually the same numerical values (Andrich, 1982).

In this study, the PSI is used as a reliability index as it provides more information than Cronbach’s α.

3.1.4 General Item Fit

In principle, fit to the model is a comparison between observed values from the data, and theoretical values as predicted by the model. Fit can be shown graphically by the ICC and statistically by fit indices (Andrich, 2009; Andrich, Sheridan, & Luo, 2004), as will be discussed in the following sections.

Fit in general is examined irrespective of person groups such as gender. If it is examined by person group, the analysis is called Differential Item Functioning (DIF) analysis. This section is devoted to fit in general; the rationale and procedure for DIF analysis is discussed in later sections.

Two principles in the assessment of fit need to be noted. Firstly, a model which incorporates a smaller number of parameters is less likely to fit the data. This means that, in general, data are less likely to fit the Rasch model than the 2PL or 3PL.

Secondly, the greater the precision of the parameter estimates, the less likely the data will fit the model. When the parameter estimates are very precise, misfit is easily shown. Data with a larger sample size is one factor that leads to greater precision, and therefore, the larger the sample size, the less likely the data will fit the model.

(a) Fit Shown by an ICC

In using the ICC to examine fit, the persons are classified into several class intervals based on their ability estimates. The observed mean in each class interval is calculated and compared with the theoretical probability. In dichotomous data, the observed mean is the proportion of persons who had a score of 1 (correct answer) while the theoretical probability is the theoretical mean. The observed proportion correct is represented in the ICC by a dot (●). When data fit the model, the observed means should be close to the theoretical means.
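The sketch below illustrates this class-interval comparison for a single dichotomous item under the Rasch model, using simulated abilities and responses; the grouping into equal-sized class intervals is an assumption made for the illustration and need not match the intervals formed by RUMM2030.

```python
# Observed proportion correct versus model-implied mean per class interval,
# for one dichotomous item with simulated data.
import numpy as np

def rasch_probability(ability: np.ndarray, difficulty: float) -> np.ndarray:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def class_interval_means(abilities, responses, difficulty, n_intervals=5):
    """Print observed and expected means for each ability class interval."""
    order = np.argsort(abilities)
    groups = np.array_split(order, n_intervals)          # equal-sized ability groups
    for g, idx in enumerate(groups, start=1):
        observed = responses[idx].mean()
        expected = rasch_probability(abilities[idx], difficulty).mean()
        print(f"class interval {g}: observed {observed:.2f}, expected {expected:.2f}")

rng = np.random.default_rng(1)
abilities = rng.normal(0, 1, size=500)
difficulty = 0.2
responses = (rng.random(500) < rasch_probability(abilities, difficulty)).astype(int)
class_interval_means(abilities, responses, difficulty)
```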

Figure 3.1 presents the ICCs of two items. Item 1 fits the model (left panel) and item 2 misfits (right panel). As seen in the ICC of item 1, the observed mean of all class intervals is close to the theoretical mean. In the ICC of item 2, the observed means of lower class intervals are higher than the theoretical mean while those for higher class intervals are lower than the theoretical mean. Item 2 shows poor discrimination as it does not distinguish between persons in lower and higher groups.

Figure 3.1. ICCs of two items indicating fit (left) and misfit (right)

(b) Chi-square Test of Fit

To complement the ICC, a chi-square (χ2) test of fit can be calculated. This tests statistically whether the observed mean in each class interval g = 1, ..., G is different from the theoretical mean. For this, the standardized residual for each class interval of each item (Z_{gi}) is calculated using the equation below (Andrich, et al., 2004).

Z_{gi} = \frac{\sum_{n \in g} x_{ni} - \sum_{n \in g} \mathrm{E}[X_{ni}]}{\sqrt{\sum_{n \in g} \mathrm{V}[X_{ni}]}} \qquad (3.7)

where, for item i, Z_{gi} is the standardized residual for class interval g; x_{ni} is the observed score of person n; \mathrm{E}[X_{ni}] is the expected score of person n under the model; \mathrm{V}[X_{ni}] is the model variance of that score; and the summations are over the persons n in class interval g.

To obtain a total chi-square for an item, the squares of the standardized residuals for each class interval are summed. It can be expressed as:

\chi_i^2 = \sum_{g=1}^{G} Z_{gi}^2 \qquad (3.8)

on approximately G − 1 degrees of freedom.
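As an illustration of Equations 3.7 and 3.8, the sketch below computes the standardized residual for each class interval and the total chi-square for one simulated dichotomous item; the class intervals and simulated values are assumptions of the example, not the RUMM2030 implementation.

```python
# Chi-square fit statistic for one dichotomous item (Equations 3.7 and 3.8).
import numpy as np

def item_chi_square(abilities, responses, difficulty, n_intervals=5):
    """Chi-square fit statistic for one item on approximately G - 1 df."""
    expected = 1.0 / (1.0 + np.exp(-(abilities - difficulty)))  # E[X_ni]
    variance = expected * (1 - expected)                        # V[X_ni], dichotomous
    order = np.argsort(abilities)
    groups = np.array_split(order, n_intervals)
    chi_square = 0.0
    for idx in groups:
        # Equation 3.7: standardized residual for class interval g.
        z_g = (responses[idx].sum() - expected[idx].sum()) / np.sqrt(variance[idx].sum())
        chi_square += z_g ** 2                                  # Equation 3.8
    return chi_square

rng = np.random.default_rng(2)
abilities = rng.normal(0, 1, size=500)
difficulty = -0.3
p = 1.0 / (1.0 + np.exp(-(abilities - difficulty)))
responses = (rng.random(500) < p).astype(int)
print(item_chi_square(abilities, responses, difficulty))
```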

The data are considered to fit the model when there is no significant difference between the observed and theoretical means. The significance of the difference is tested at the 0.05 or the 0.01 level of significance. If the probability that the observed difference occurred by chance is greater than 0.05 (at the 0.05 level) or 0.01 (at the 0.01 level), the difference is not statistically significant, indicating that the item fits the model. The smaller the value of χ2, the greater the probability that the item fits the model.

In this study, the probability value used as the criterion is the Bonferroni-adjusted probability provided by RUMM2030. The Bonferroni adjustment is made to reduce the risk of a Type I error, which occurs when a difference is found to be significant when in fact it is not. In testing the significance of fit for a set of items, the test is conducted not once, but as many times as there are items in the set. The level of significance for an individual item, therefore, should be smaller than the level of significance for the set of items. With the Bonferroni adjustment, the criterion probability for an item is obtained by dividing the significance level for the set of items (0.01 in this study) by the number of tests, that is, the number of items (n): α/n. This results in a smaller value than α (Bland & Altman, 1995). For example, for a hypothetical subtest of 60 items, the adjusted level for each item would be 0.01/60 ≈ 0.00017.

Andrich, et al. (2004) warn that the χ2 test is very general and will not pick up specific violations of the model. Likewise, the χ2 statistic is inflated when the estimated probabilities lie close to the extremes (0 or 1). Therefore the χ2 test should not be used as the only indicator of fit.

(c) Item Fit Residual

The item fit residual is another statistic to indicate an item’s fit to the model. This statistic is also a standardized residual. While in the χ2 test the standardized residual is calculated based on responses to the item in each class interval, the standardized residual in the item fit residual statistic is not based on class intervals. The standardized residual of each person for an item is calculated by

z_{ni} = \frac{x_{ni} - \mathrm{E}[X_{ni}]}{\sqrt{\mathrm{V}[X_{ni}]}} \qquad (3.9)

The total standardized residual for an item can be calculated by squaring the standardized residual for each person and summing over the persons as shown in Equation 3.10.

Y_i^2 = \sum_{n=1}^{N} z_{ni}^2 \qquad (3.10)

The total standardized residual for an item is then transformed to an approximately standard normal deviate. When data fit the model, the mean of these deviates, that is, of the differences between observed and theoretical values, is close to 0 and the standard deviation is close to 1. The transformation into a standardized deviate is

Z_i = \frac{Y_i^2 - \mathrm{E}[Y_i^2]}{\sqrt{\mathrm{V}[Y_i^2]}} \qquad (3.11)

A logarithmic transformation, which makes this statistic more symmetric, is used in the interpretation of fit (Marais & Andrich, 2008a).
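The sketch below illustrates Equations 3.9 to 3.11 for one simulated dichotomous item. The moments E[Y_i^2] and V[Y_i^2] are approximated here by N and 2N respectively, which is only an assumed stand-in for the exact values used by RUMM2030, and the logarithmic symmetrising transformation mentioned above is omitted.

```python
# Item fit residual for one dichotomous item (Equations 3.9 to 3.11, simplified).
import numpy as np

def item_fit_residual(abilities, responses, difficulty):
    expected = 1.0 / (1.0 + np.exp(-(abilities - difficulty)))  # E[X_ni]
    variance = expected * (1 - expected)                        # V[X_ni]
    z = (responses - expected) / np.sqrt(variance)              # Equation 3.9
    y_squared = np.sum(z ** 2)                                  # Equation 3.10
    n = len(responses)
    # Equation 3.11 with assumed moments E[Y^2] ~ N and V[Y^2] ~ 2N.
    return (y_squared - n) / np.sqrt(2 * n)

rng = np.random.default_rng(3)
abilities = rng.normal(0, 1, size=500)
difficulty = 0.5
p = 1.0 / (1.0 + np.exp(-(abilities - difficulty)))
responses = (rng.random(500) < p).astype(int)
print(item_fit_residual(abilities, responses, difficulty))
```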

In this study, how well the items fit the model is reported using the χ2 probability and the item fit residual criteria. The χ2 probability indicates the statistical significance of the item's fit. The item fit residual indicates the direction of the fit and its value can be positive or negative. An extreme positive item fit residual indicates under-discrimination and an extreme negative item fit residual indicates over-discrimination.

3.1.5 Local Independence

Local independence is an assumption of IRT models, that is it is not only a requirement of the Rasch model but also of other IRT models. Because it has been shown that violations of local independence lead to inaccurate estimates of item and person parameters, concern about checking this assumption has increased (Marais & Andrich, 2008a, 2008b; Wainer & Wang, 2000; Zenisky, Hambleton, & Sireci, 2002).

The Rasch model is a unidimensional model which requires statistical independence in the responses (Andrich, 1988). Not only are these two properties central to the model, they are also central to the requirements of test data. Specifically, unidimensionality refers to being able to use the total score on items to characterize a person, while statistical independence implies that the different items provide relevant but distinct information in building up the total score.

In a violation of unidimensionality, person parameters other than β (proficiency or ability) could influence the response. Marais and Andrich (2008b) called this violation "trait dependence". A violation of statistical independence occurs when, for the same person, the response to an item depends on the responses to previous items. Marais and Andrich (2008b) referred to this violation of the model as "response dependence". In the literature, trait dependence and response dependence are usually not distinguished, both being categorised as violations of local independence (Marais & Andrich, 2008b).

Trait dependence may occur in tests comprising some subtests or sections measuring different aspects of the variable, and where the items of those subtests are summed. Another example of trait dependence can be found in tests in which some items have a common stimulus. In general, this structure is called a subtest (Andrich, 1985b), testlet (Wainer & Lewis, 1990), or item bundle (Rosenbaum, 1988).

Even though some item structures are known to increase dependencies between items, tests with these structures are still being used. This is because the construct could be better represented by including items from different relevant aspects, as in tests comprising some subtests, or by constructing context-dependent items using the same stimulus stem, as in testlet structures. The purpose of using these item structures is to improve construct validity (Marais & Andrich, 2008b; Zenisky, et al., 2002) in an efficient way (Andrich, 1985b).

While the purpose of constructing tests with a trait dependent structure is to increase validity, tests which have response dependence do not serve the same purpose (Andrich & Kreiner, 2010; Marais & Andrich, 2008b). In fact, response dependence, which