Internal Consistency Analysis of the Verbal Subtest


Chapter 4 Internal Consistency Analysis of the Postgraduate Data

4.2 Internal Consistency Analysis of the Verbal Subtest

It will be recalled from Chapter 3 that in the Verbal postgraduate set, 49 items were analysed. The Verbal subtest comprises four sections: Synonyms (items 1–13), Antonyms (items 14–25), Analogies (items 26–38), and Reading Comprehension (items 39–50), which consists of three passages. The following sections show the results of the analysis of aspects related to the internal consistency of the Verbal subtest.

4.2.1 Treatment of Missing Responses

In Chapter 3, three approaches to treating missing responses were identified. The criterion for choosing among them was that the treatment resulting in the greater reliability index (PSI) and the smaller mean item fit residual would be used to score missing responses in this study.

The three treatments of missing data produced no appreciable differences in the reliability indices or in the means of the item fit residuals. As shown in Table 4.2, the largest differences in the reliability indices and in the means of the item fit residuals were 0.006 and 0.013 respectively.

Table 4.2. The Effect of Different Treatments of Missing Responses in the Verbal Subtest

Statistics               A (Incorrect)   B (Missing)   C (Mixed)
PSI                      0.774           0.768         0.772
Mean item fit residual   0.148           0.137         0.150

The findings show that the three treatments of missing data yielded virtually the same results in terms of fit and reliability. Because of this evidence, the scoring system used for further analyses of the data in this study was the one used initially to score the data in the testing situation; that is, missing responses were treated as incorrect responses. The advantage of this scoring system is that it provides a data set with no missing responses.

4.2.2 Item Difficulty Order

In presenting the difficulty order of the items of the Verbal subtest, six sections instead of only four are presented. It was decided that in the last section, Reading Comprehension, the three reading passages would be considered as three separate sections because each passage has its own stimulus; accordingly, the items were arranged within each passage.

Figure 4.1 shows the order of the items in each section of the test and their relative difficulty according to the item bank (top panel) and according to the postgraduate analysis (bottom panel). The sections are separated by lines: some are straight, some are not. This is because items are not necessarily in order of difficulty. The corresponding item difficulties are presented in Appendix A1. The first section contains Synonyms, followed by Antonyms, Analogies, Reading Comprehension passage 1 (Reading 1), Reading Comprehension passage 2 (Reading 2), and Reading Comprehension passage 3 (Reading 3).

It appears from the top panel that, except for Reading 1, which started from the most difficult item in the section, the items in other sections were arranged according to their difficulty from the item bank. In particular, items at the beginning of the section are the easier ones. Consistent with the arrangement, the correlation between item order and item bank location in Reading 1 is -0.80, while in other sections the correlation ranges from 0.98 to 1.00. The correlation is taken as a general indicator of a relationship.

However, the difficulty estimates from the student responses, as shown in the bottom panel, are not consistent with the item order, except in Reading 2 and Reading 3, each of which consists of only a few items. Reading 1 showed the least consistency with item order, with a correlation of -0.30, followed by Synonym with a correlation of 0.48. Antonym and Analogy showed a similar degree of consistency, with a correlation of 0.79.
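The correlation between item order and item location within a section can be illustrated with a short sketch. It assumes a Pearson product-moment correlation, which the text does not specify, and it uses hypothetical locations for a five-item section; the actual locations are those reported in Appendix A1.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Position of each item within a hypothetical five-item section.
item_order = [1, 2, 3, 4, 5]

# Hypothetical locations (logits), NOT taken from Appendix A1.
bank_locations = [-1.2, -0.6, 0.0, 0.5, 1.1]       # ordered by difficulty
estimated_locations = [-0.4, -0.9, 0.8, 0.1, 0.6]  # order partly disturbed

r_bank = pearson(item_order, bank_locations)
r_estimated = pearson(item_order, estimated_locations)
print(round(r_bank, 2), round(r_estimated, 2))
```

A value close to 1 indicates that the items are arranged from easiest to most difficult, while a lower or negative value indicates that the order of presentation departs from the order of difficulty.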

Figure 4.1. Item Order of the Verbal subtest according to the location from the item bank (top panel) and from the postgraduate analysis (bottom panel)

The above results show that in general the items were arranged according to their difficulty in the item bank. However, in some sections the order of item difficulties from the examinees' responses was not consistent with the item order in the item bank.

Despite inconsistency in some sections, the findings reported in the previous section that missing responses had no systematic effect on fit and reliability suggest that the examinees’ responses resulted from a valid engagement with the test. This is perhaps because the items were relatively well targeted for the examinees, as will be seen in the next section. Hence, it is expected that the data used in this study yield valid results.

4.2.3 Targeting and Reliability

In the Verbal subtest the range of the item locations was similar to that of the person locations. The item locations ranged from -1.311 to 2.679, with a mean of 0.0 (fixed) and a standard deviation of 0.865. The person locations ranged from -1.516 to 2.658, with a mean of 0.436 and a standard deviation of 0.692. However, as shown in Figure 4.2, the items did not spread evenly along the continuum. Especially at the more difficult end of the continuum, there were regions not represented by any item. In fact, many items were located at the easier regions while only some items were located at the more difficult regions.

From the means of the person and item locations, it appears that the Verbal subtest was of moderate difficulty. With a mean of person and item locations of 0.436 and 0.00 respectively, the probability of success for a person of mean ability (0.436) on an item of mean difficulty (0.0) is 0.61. In terms of engagement between persons and items in general, the subtest seems reasonably acceptable because it was neither too difficult nor too easy. However, for a test aimed at selecting high ability students, some more difficult items are needed.
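The probability of 0.61 follows from the dichotomous Rasch model, in which the probability of success depends only on the difference between the person location and the item location. A minimal sketch of the calculation, in which the function name is illustrative only:

```python
import math


def rasch_probability(person_location, item_location):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


mean_person_location = 0.436  # mean person location in the Verbal subtest
mean_item_location = 0.0      # mean item location, fixed at zero

p_success = rasch_probability(mean_person_location, mean_item_location)
print(round(p_success, 2))
```

When the person and item locations are equal, the same function returns 0.5, which is why a mean person location above the mean item location corresponds to a probability of success above one half.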

The PSI of 0.774 indicates that in general the items separated persons reasonably well, although the index could have been higher. A higher value would be achieved with more items of greater difficulty.

The PSI value also indicates that the power of the test to detect misfit was reasonable.

Figure 4.2. Person-item location distribution for the Verbal subtest

4.2.4 Item Fit

In terms of the fit residual, the range was relatively large, from -3.473 to 3.543. Using the criterion of ±2.5, there were three items (items 18, 32 and 35) with a fit residual below -2.5 and five items (items 4, 6, 7, 25 and 41) with a fit residual above +2.5. However, with the χ² probability as the criterion, only two items (items 18 and 35) misfitted significantly at a Bonferroni-adjusted probability with N = 49, for p = 0.01. Details of the fit statistics for these eight items are shown in Table 4.3, while the fit statistics for all items are shown in Appendix A1.

Table 4.3. Fit Statistics of Misfitting Items for the Verbal Subtest

Item   Section      Location   SE      FitResid   ChiSq    DF    Prob
35     Analogy       0.675     0.101   -3.473     25.619   5     0.000107^a
32     Analogy      -0.277     0.105   -3.177     22.435   5     0.000434
18     Antonym      -0.463     0.108   -3.029     26.791   5     0.000063^a
...    ...           ...       ...      ...       ...      ...   ...
7      Synonym       0.635     0.101    2.518      7.021   5     0.219104
4      Synonym       0.466     0.1      2.869     12.085   5     0.033636
25     Antonym       0.845     0.102    3.003     10.574   5     0.060504
41     Reading C.    0.681     0.101    3.509     15.189   5     0.009586
6      Synonym       0.478     0.1      3.543     13.807   5     0.016883

Note. ^a Below Bonferroni-adjusted probability of 0.000204 for individual item level of p = 0.01
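The two misfit criteria can be checked against the values in Table 4.3 with a short sketch. The variable names are illustrative; the Bonferroni-adjusted probability is taken to be the individual item level of 0.01 divided by the 49 items, which reproduces the value of 0.000204 given in the note.

```python
ALPHA = 0.01          # individual item significance level
N_ITEMS = 49          # number of items analysed in the Verbal subtest
RESIDUAL_LIMIT = 2.5  # fit residual criterion (absolute value)

bonferroni_cutoff = ALPHA / N_ITEMS

# (item, fit residual, chi-square probability), copied from Table 4.3
fit_stats = [
    (35, -3.473, 0.000107),
    (32, -3.177, 0.000434),
    (18, -3.029, 0.000063),
    (7, 2.518, 0.219104),
    (4, 2.869, 0.033636),
    (25, 3.003, 0.060504),
    (41, 3.509, 0.009586),
    (6, 3.543, 0.016883),
]

# Items outside the fit residual limits.
flagged_by_residual = sorted(
    item for item, resid, _ in fit_stats if abs(resid) > RESIDUAL_LIMIT
)

# Items that also fall below the Bonferroni-adjusted probability.
flagged_by_both = sorted(
    item
    for item, resid, prob in fit_stats
    if abs(resid) > RESIDUAL_LIMIT and prob < bonferroni_cutoff
)

print(round(bonferroni_cutoff, 6), flagged_by_residual, flagged_by_both)
```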

Table 4.3 shows that two of the three most discriminating items, items 18 and 35, which had negative residuals of large magnitude, had a χ² probability less than the critical value, while all five of the poorest discriminating items, items 7, 4, 25, 41, and 6, which had positive fit residuals of large magnitude, had a χ² probability greater than the critical value. Thus, the five poorest discriminating items according to the fit residual did not deviate significantly from the Rasch model according to the χ² probability. In summary, only two items, items 18 and 35, misfit according to both criteria. The ICCs of both items, as shown in Figure 4.3, confirmed that they were relatively over-discriminating. One possible reason for over-discrimination is local dependence. The results for detecting local dependence are presented in the next section.

Figure 4.3. The ICCs of items 18 and 35

4.2.5 Local Independence

To identify local dependence among items, the residual correlations between items are examined. As noted in Chapter 3, relatively large positive and large negative correlations indicate some form of dependence that cannot be accounted for by the person and item locations. When local dependence is observed, a testlet analysis may be performed to confirm dependence. In this case subtests are formed based on the pattern of high correlation. Another way of confirming local dependence is to form subtests based on an a priori structure among the items. Because there are two analyses involved, the results of these analyses are discussed in two subsections.

(a) Examining Local Dependence of All Items

The residual correlations between items in the Verbal subtest were relatively low. None of the correlations exceeded 0.2, except for the correlation between items 18 and 19 which was 0.21. This suggests there is no particular pattern in the residuals indicating local dependence. Therefore, a testlet analysis to confirm dependence between the items was not performed.

(b) Examining Local Dependence from an a Priori Structure

As mentioned earlier, the Verbal subtest comprises separate sections. In the original analysis to estimate item and person locations, items from all sections were analysed together. This kind of analysis does not consider that separate sections may have a dependent structure that may lead to trait dependence. Therefore, a testlet analysis which takes the dependent structure into account was performed to investigate the evidence of any local dependence.

For the testlet analysis, six testlets were formed: three from the Synonym, Antonym, and Analogy sections of the Verbal subtest, and three from Reading Comprehension. As mentioned earlier, the Reading Comprehension section in the postgraduate set contains three passages; therefore, three testlets were formed in the testlet analysis, one for the items within each passage.

As indicated in Chapter 3, four statistics are used as indicators of dependence in a testlet analysis, namely the spread value (θ), the reliability index (PSI), the overall test of fit index, and the variance of the person estimates.

The spread value (θ) for each testlet is presented in Table 4.4. The spread values of Reading 2 and Reading 3 were smaller than the minimum value. They fell short of the minimum value by 0.20 and 0.30 respectively, so that in each case the spread value was only about half of the minimum value. This suggests substantial dependence in Reading 2 and Reading 3.

Table 4.4. Spread Value and the Minimum Value Indicating Dependence in the Verbal Subtest

Testlet     Range of Item Locations   No of Items   Minimum Value   Spread Value (θ)   Dependence Confirmed
Synonym     3.15                      13            0.15            0.18               No
Antonym     3.84                      12            0.15            0.20               No
Analogy     3.52                      13            0.15            0.18               No
Reading 1   1.38                      5             0.35            0.38               No
Reading 2   0.92                      4             0.41            0.21               Yes
Reading 3   0.47                      2             0.69            0.39               Yes

Note. The minimum value provided in Andrich (1985b) is only up to eight items. The value for number of items greater than 8, in this case 12 and 13, was calculated following Andrich (1985b).

In contrast, the spread values for Synonym, Antonym, Analogy, and Reading 1 were greater than the minimum value. However, the magnitude of the difference was small, between 0.03 and 0.05. It was indicated in Chapter 3 that while a spread value smaller than the minimum value confirms dependence, a spread value greater than the minimum value does not necessarily mean that there is no dependence. This is because, other than dependence among items of a testlet, a spread value also accounts for difference in the difficulties of the items (Andrich, 1985b). Therefore, in Table 4.4 the range of item locations within a testlet is also included.

It appears from Table 4.4 that the four testlets with spread values greater than the minimum value had greater differences in item difficulties than the two testlets with lower spread values. This may suggest that the minimum value in the four testlets was not reached because of the large range of item difficulties in the testlet.
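The comparison of spread values with minimum values can be reproduced from Table 4.4 with a short sketch. The decision rule encoded here is the one stated in the text (a spread value below the minimum value confirms dependence); the variable names are illustrative only.

```python
# (testlet, range of item locations, number of items, minimum value, spread value)
# Values copied from Table 4.4.
TABLE_4_4 = [
    ("Synonym", 3.15, 13, 0.15, 0.18),
    ("Antonym", 3.84, 12, 0.15, 0.20),
    ("Analogy", 3.52, 13, 0.15, 0.18),
    ("Reading 1", 1.38, 5, 0.35, 0.38),
    ("Reading 2", 0.92, 4, 0.41, 0.21),
    ("Reading 3", 0.47, 2, 0.69, 0.39),
]

# A spread value below the minimum value confirms dependence.
dependence_confirmed = [
    name for name, _, _, minimum, spread in TABLE_4_4 if spread < minimum
]

# Difference between spread value and minimum value for each testlet.
margins = {
    name: round(spread - minimum, 2)
    for name, _, _, minimum, spread in TABLE_4_4
}

# Smallest range of item locations among testlets where dependence was not confirmed.
smallest_unconfirmed_range = min(
    loc_range for name, loc_range, _, _, _ in TABLE_4_4
    if name not in dependence_confirmed
)

print(dependence_confirmed, margins, smallest_unconfirmed_range)
```

The four testlets in which dependence was not confirmed all have a range of item locations of at least 1.38, which is larger than the ranges of the two testlets in which it was confirmed.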

To investigate whether dependence occurs not only in two sections (Reading 2 and Reading 3) but perhaps also in the other four sections (Synonym, Antonym, Analogy, and Reading 1), the reliability indices (PSI) from three analyses were compared. In the first analysis, all 49 items were treated as dichotomous items (the original analysis). In the second, the items were formed into six testlets as shown in Table 4.4. In the third, the four items in Reading 2 and the two items in Reading 3 were formed into two testlets while the remaining 43 items were treated as dichotomous items. If dependence occurs not only in Reading 2 and Reading 3 but also in other sections, the decrease in the PSI relative to the first analysis is expected to be greater in the second analysis, in which all items were formed into six testlets, than in the third, in which only two testlets were formed. The results are presented in Table 4.5.

Table 4.5. PSIs in Three Analyses to Confirm Dependence in Six Verbal Testlets

Analysis   Analysed Items                                                       PSI
1          49 dichotomous items                                                 0.774
2          49 items forming 6 testlets                                          0.729
3          43 dichotomous items of Synonym, Antonym, Analogy, and Reading 1,    0.766
           and 6 items forming Reading 2 and Reading 3 testlets

It is apparent that the decrease was greater in the second analysis (by 0.045) than in the third analysis (by 0.008). This shows that dependence not only occurs in Reading 2 and Reading 3, but also in Synonym, Antonym, Analogy, and Reading 1 although it was small and not observed in the spread value.
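The two decreases can be verified directly from the PSI values in Table 4.5. The dictionary keys below are illustrative labels for the three analyses.

```python
# PSI values copied from Table 4.5.
psi = {
    "dichotomous": 0.774,   # analysis 1: 49 dichotomous items
    "six_testlets": 0.729,  # analysis 2: 49 items forming 6 testlets
    "two_testlets": 0.766,  # analysis 3: 43 dichotomous items plus 2 testlets
}

# Decrease in PSI relative to the original dichotomous analysis.
drop_six = round(psi["dichotomous"] - psi["six_testlets"], 3)
drop_two = round(psi["dichotomous"] - psi["two_testlets"], 3)

# Part of the decrease not attributable to the Reading 2 and Reading 3 testlets.
additional_drop = round(drop_six - drop_two, 3)

print(drop_six, drop_two, additional_drop)
```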

Two other indicators of dependence, namely, the overall test of fit index, and the variance of person estimates, were examined. These statistics resulting from the dichotomous and the testlet analysis were compared.

With regard to fit, the total chi-square probability was p = 0.000 for the dichotomous analysis compared to p = 0.156 for the testlet analysis. The better fit from the testlet analysis indicates that the local dependence present in the original (dichotomous) analysis has been taken into account in the testlet analysis.

In terms of the variance of person estimates, a decrease in variance is expected when trait dependence is present (Marais & Andrich, 2008b). The person estimates obtained from the original analysis had a greater variance, 0.692, than the person estimates from the testlet analysis, 0.582.

The dependence hypothesized in the Verbal subtest due to its structure was confirmed. The magnitude of dependence, however, was not substantial, as evidenced by the relatively small decrease in the reliability index. Therefore, although more precise measurements were obtained when the items were analysed in testlets, similar reasonable measurements were generated when items were analysed as discrete, dichotomous independent items using the dichotomous Rasch model.

4.2.6 Evidence of Guessing

As described in Chapter 3, some analyses were carried out to examine the presence of guessing. These analyses can be categorized into three major steps. Firstly, graphical evidence of the observed means in class intervals relative to the ICCs was examined. Secondly, statistical evidence comparing the item estimates from a tailored and an anchored analysis, in which the mean of a group of easy items is anchored to their mean in the tailored analysis, was examined. Thirdly, an analysis which confirms the presence of guessing and distinguishes the effect of guessing from the effect of differences in item discrimination was performed. In addition, an examination was made of the content of the items showing evidence of guessing to try to understand the source of guessing.

(a) Examining Graphical Evidence

It is considered that guessing is more likely to occur in items that are relatively difficult for a person. As indicated earlier, when guessing occurs, the observed proportion correct of persons in lower class intervals is greater than expected and that of higher class intervals is lower than expected. This pattern is observed in Figure 4.4, which shows the ICC of item 36, an Analogy item and the fourth most difficult in the Verbal subtest.

Figure 4.4. The ICC of item 36 indicating guessing graphically

(b) Examining Statistical Evidence

To examine statistical evidence of guessing, an original, a tailored, and an anchored analysis, as described in Chapter 3, were conducted. In the tailored analysis, the response of a person whose probability of answering an item correctly was below 0.2, that is, below the chance level with five alternatives, was recorded as a missing response. For the anchored analysis, the mean of the estimates of 10 of the 12 easiest items (2, 5, 9, 11, 14, 15, 17, 19, 42, and 44), which fitted the model and which did not indicate guessing based on the ICC in the original analysis, was anchored to their mean in the tailored analysis and all responses reanalysed. Then the relative difficulties of the items from the tailored and anchored analyses were compared.
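The tailoring step can be sketched as follows. This is an illustration only: it assumes that the probabilities are computed from the dichotomous Rasch model using person and item locations from a preliminary analysis, and the locations and responses shown are hypothetical, not taken from the study data. The function names are likewise illustrative.

```python
import math

CHANCE_LEVEL = 0.2  # one correct option among five alternatives


def rasch_probability(person_location, item_location):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


def tailor_responses(responses, person_locations, item_locations,
                     cutoff=CHANCE_LEVEL):
    """Return a copy of the response matrix in which every response whose
    model probability of success is below the cutoff is set to None (missing)."""
    tailored = []
    for person_row, beta in zip(responses, person_locations):
        new_row = []
        for score, delta in zip(person_row, item_locations):
            if rasch_probability(beta, delta) < cutoff:
                new_row.append(None)   # treated as missing in the tailored analysis
            else:
                new_row.append(score)  # response retained unchanged
        tailored.append(new_row)
    return tailored


# Hypothetical data: three persons (rows) by three items (columns).
person_locations = [-1.0, 0.4, 2.0]
item_locations = [-1.3, 0.7, 2.6]
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]

tailored = tailor_responses(responses, person_locations, item_locations)
for row in tailored:
    print(row)
```

Under this sketch, responses are removed only where an item is much more difficult than the person is able, so easy items retain all of their responses while the most difficult items lose the responses of the least able persons.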

Figure 4.5 presents the plot of the item locations from tailored and anchored analyses. It shows that the four most difficult items (from the most difficult one: 37, 21, 13 and 36), had noticeably different item locations in the two analyses, although the direction of the change is not the same in all these items. In three items (37, 13, 36), the difficulty is higher in the tailored than in the anchored analysis, which is expected when guessing occurs. For item 21, on the other hand, the difficulty is lower in the tailored than in the anchored analysis.

Figure 4.5. The plot of item locations from the tailored and anchored analyses for the Verbal subtest

Table 4.6 provides the statistical significance of the differences in difficulties between the tailored and anchored analyses for selected items. It includes the item location and the standard error of the estimate from the tailored and the anchored analyses; the difference in item locations (d); the standardized difference in locations (stdz d); and the significance of the difference at p < 0.01. The size of the sample for each item in the tailored analysis is also included, which reflects the number of persons eliminated due to the tailoring procedure. For example, if the tailored sample is 426, it indicates that the responses of 14 persons were eliminated. The item location from the original analysis is also shown. The statistics presented in Table 4.6 are in order of the difficulty of the items from the tailored analysis. This is the analysis that is considered to have the least effect due to guessing. As the item difficulty is in order from the smallest to the largest, the items with the greatest difference in difficulty between the tailored and anchored analyses will be towards the bottom of the table.

It is evident that, among the four items showing a substantial difference, only in one item (36) is the difference statistically significant. The difference in locations between the tailored and anchored analyses in the other three items which showed graphical differences (37, 21, 13) is not significant. This is because of the relatively large standard error of the difference in estimates. The statistics for all items are presented in Appendix A2.

As mentioned in Chapter 3, whether a difference in location is significant is determined by the size of the difference in location estimates and by the size of the standard error of the difference. Accordingly, a significant difference can be observed in items showing only a small difference in location, provided the standard error of the difference is also relatively small. This is the case for items 12, 25, and 41. As shown in Table 4.6, in these three items the magnitude of the difference between the two estimates is not substantial and hence was not noticeable in Figure 4.5. The difference is statistically significant because of the small standard error of the difference.

Table 4.6. Statistics of Some Verbal Items after Tailoring Procedure

Item   Loc        Loc        Loc        SE         SE         d            SE(d)^a   stdz d^b    >2.58   Tailored
       original   tailored   anchored   tailored   anchored   (tail-anc)                                 sample
5^c    -1.311     -1.336     -1.328     0.132      0.132      -0.008       0.000     undefined           440
27     -1.200     -1.216     -1.217     0.128      0.127       0.001       0.016     0.063               440
...    ...        ...        ...        ...        ...         ...         ...       ...         ...     ...
16     -0.437     -0.447     -0.455     0.108      0.108       0.008       0.000     undefined           440
1      -0.430     -0.441     -0.448     0.108      0.108       0.007       0.000     undefined           440
30     -0.385     -0.402     -0.402     0.107      0.107       0.000       0.000     undefined           440
26     -0.370     -0.382     -0.387     0.107      0.107       0.005       0.000     undefined           440
...    ...        ...        ...        ...        ...         ...         ...       ...         ...     ...
22      0.629      0.647      0.612     0.102      0.101       0.035       0.014     2.457               426
12      0.686      0.715      0.669     0.102      0.101       0.046       0.014     3.229       *       423
41      0.681      0.717      0.664     0.102      0.101       0.053       0.014     3.720       *       423
25      0.845      0.893      0.827     0.105      0.102       0.066       0.025     2.648       *       404
10      1.380      1.396      1.363     0.120      0.109       0.033       0.050     0.658               323
36      1.422      1.575      1.404     0.125      0.110       0.171       0.059     2.880       *       310
13      1.839      2.048      1.821     0.155      0.120       0.227       0.098     2.314               221
21      2.679      2.379      2.661     0.330      0.155      -0.282       0.291     -0.968              42
37      2.324      2.590      2.306     0.237      0.138       0.284       0.193     1.474               103
mean    0.000      0.000     -0.017     0.117      0.110       0.017       0.023
SD      0.865      0.886      0.865     0.037      0.011       0.071       0.050

Note. ^a The standard error of the difference is SE(d) = √(SE_tail² − SE_anch²); ^b the standardized difference (z) is d / SE(d); ^c anchor items. The items in bold are those that showed a significant difference between tailored and anchored estimates and showed evidence of guessing from the ICC. The items in italics are those that showed a significant difference between tailored and anchored estimates but did not indicate guessing from their ICCs.

The table also shows that for the easy items, the number of persons in the tailored analysis remains at 440, indicating that no person responses were eliminated. This is as expected, as in easy items the probability of persons getting an item correct is greater than chance. However, it does not mean that in easy items the location and the standard error of the estimate from the tailored and the anchored analyses are the same. They may be the same or different in any analysis. The location estimates may be different because only the mean of the anchored items is fixed. The standard error may be different because it is a function of both the location and the number of responses. The standard error is inversely related to the number of persons responding to an item. For example, items 1, 16, and 26 show a slight difference in locations but no difference in standard errors. Others, such as item 30, showed the same location and the same standard error. In both cases the standard errors were the same, so the standard error of the difference was 0 and the standardized difference in locations is undefined.
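The standardized difference can be recomputed from the locations and standard errors in Table 4.6. The sketch below takes the standard error of the difference to be the square root of the difference between the squared standard errors, which reproduces the tabled SE(d) values; the function name and the handling of the undefined case are illustrative choices.

```python
import math

CRITICAL_Z = 2.58  # critical value used in Table 4.6 for p < 0.01


def standardized_difference(loc_tailored, loc_anchored, se_tailored, se_anchored):
    """Return (d, SE(d), z) for one item; SE(d) and z are None when undefined."""
    d = loc_tailored - loc_anchored
    variance_of_d = se_tailored ** 2 - se_anchored ** 2
    if variance_of_d <= 0:
        # With equal standard errors the variance of the difference is zero
        # (a negative value would also be unusable), so z is left undefined.
        return d, None, None
    se_d = math.sqrt(variance_of_d)
    return d, se_d, d / se_d


# item: (Loc tailored, Loc anchored, SE tailored, SE anchored), from Table 4.6
table_rows = {
    5: (-1.336, -1.328, 0.132, 0.132),
    36: (1.575, 1.404, 0.125, 0.110),
    13: (2.048, 1.821, 0.155, 0.120),
    21: (2.379, 2.661, 0.330, 0.155),
}

significant = {}
for item, row in table_rows.items():
    d, se_d, z = standardized_difference(*row)
    significant[item] = z is not None and abs(z) > CRITICAL_Z
    print(item, round(d, 3), se_d, z, significant[item])
```

Of the four items in this sketch, only item 36 exceeds the critical value. Item 13 has a larger difference in location than item 36 but also a larger standard error of the difference, and item 5 has no defined standardized difference.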

As indicated in Chapter 3, a difference in locations between the tailored and anchored analyses does not necessarily mean that guessing took place. Therefore, for guessing to be diagnosed in an item there should be evidence of guessing according to both criteria: graphically in the ICC and statistically by a significant change in locations. In the case of the Verbal items, and as confirmed in the next subsection, only item 36 met both the graphical and the statistical criteria. The other three items (12, 41, 25) which showed a significant difference in location did not show a guessing pattern in the ICC. Later, in the section dealing with information from distractors, it will be shown that item 36 also has a problem with distractors. Specifically, two distractors can be the correct answer. This is a possible explanation for the guessing in item 36.

(c) Confirming Guessing

To confirm that guessing rather than a difference in discrimination is the source of the difference in difficulty estimates between the anchored and tailored analyses, it will be recalled that a fourth analysis has to be conducted. In this analysis, the locations of all items from the tailored analysis are fixed and all the data reanalysed. This is called the anchored all analysis. In this analysis no new item difficulty estimates are obtained, but new person estimates, ICCs and fit statistics are obtained. If the difference in item locations is a result of guessing, then the observed means of the more able students will

From the document Evaluation of the Indonesian Scholastic Aptitude Test According to the Rasch Model and Its Paradigm (pages 128-165)