Aptitude Test According to the Rasch Model and Its Paradigm
Asrijanty Asril
Bachelor in Psychology (Gadjah Mada University, Indonesia)
Master in Social Research and Evaluation (Murdoch University, Australia)
This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia
Graduate School of Education 2011
Abstract
This study evaluates a high stakes test, the Indonesian Scholastic Aptitude Test (ISAT), from the perspective of the Rasch model and its paradigm. This test has been developed by the Center for Educational Assessment (CEA) in Jakarta and has been used as one of the admission tests for undergraduate and postgraduate levels of study in some public universities in Indonesia. The CEA has formed a bank of items which is used to construct different sets of items for different purposes. For this study the data from two different sets of items from the item bank, one administered to students for undergraduate entry, and one for postgraduate entry, were available for analysis. Each test consists of three subtests, called Verbal, Quantitative, and Reasoning, to reflect the capacities they are intended to assess. Firstly, this study examines the internal structure of the subtests by applying the Rasch model and its paradigm. Secondly, this study examines the stability of item bank parameters for the items of the subtests. Thirdly, the predictive validity of the test is examined.
The Rasch model can be applied as primarily a statistical model used to model data.
However, its use in this thesis goes beyond this narrow focus: rather the Rasch paradigm is used as a framework for the whole study. The case for the model is that the comparisons among persons are invariant with respect to which items are used from a class of relevant items, and that the comparisons among items are invariant with respect to the class of persons. These invariance properties are independent of any particular data set. They are especially important when not all persons can attempt the same items on every occasion, which occurs, for example, when item banks are used. However, data will have these invariant properties only if they fit the model. It follows that data
are examined for fit to the model, and that if data do not fit the model, it is the data that need to be examined and a substantive explanation for the misfit sought. The purpose of the examination is a better understanding of the design of the instrument and the variable and context of measurement. It is this perspective that involves the broader Rasch paradigm, not merely the application of the model. In this paradigm validity, reliability and fit of the data to the model are integrated.
In this study the test is examined not only according to the Rasch model but also according to the Rasch paradigm. Accordingly, the aspects that were examined in addition to the fit of data to the Rasch model were factors that may affect the validity of responses and inferences, including the accuracy of person and item estimates. General fit to the model included standard checks on evidence of (i) violation of local independence, (ii) differential item functioning, (iii) unidimensionality, and (iv) reliability based on Rasch estimates, which also provided evidence of the power to detect misfit. Less standard aspects included checks on (i) the effects of missing responses, (ii) item difficulty order in relation to the item order in the tests, (iii) targeting of the person and item distributions, (iv) possible information in distractors of multiple choice items, (v) the presence of, and accounting for, guessing, drawing on recent contributions to the study of guessing with the Rasch model, (vi) differences in units of measurement in the item bank and in the analyses, and (vii) the comparison of item difficulties from the item bank and from the analyses. Examining the data from these perspectives demonstrated a comprehensive understanding of the data and the frame of reference.
Data for this study consisted of the responses of 440 postgraduate examinees and responses of 833 undergraduate examinees. All items were multiple choice items with five alternatives with one of these being the correct response. For the analysis of the fit
of data to the Rasch model, all these data were analysed. However, for the analysis of predictive validity, data for only 327 postgraduate examinees and 177 undergraduate examinees were examined. These examinees had been accepted into a university program and academic performance records for these students were available. The undergraduate examinees were located in Economics and Engineering. The postgraduate examinees were located in Life Science, Economics, Law, Literature, Natural Science, Medicine, Psychology and Social Studies. For purposes of predictive validity, a grade point average (GPA) in the first two years of study was used as a criterion.
The findings show that in all data sets three different ways of scoring missing responses did not have a significant effect on reliability or item fit. Therefore, missing responses in all data sets were scored as incorrect responses. This is consistent with how the responses were scored in the selection situation. This scoring also resulted in data sets with no missing responses, which had some advantages for the analyses.
It is shown that, in general, the items in the test booklets were arranged according to their difficulties in the item bank. However, the item difficulties obtained from the analysed data were not in the same order as in the test booklets. Despite this inconsistency, it was inferred that the ordering of items did not have an impact on the validity and reliability of the test, because missing responses had no impact on fit and reliability.
The analyses showed that, in general, the internal structure of the undergraduate and postgraduate tests was reasonably consistent with the Rasch model. The items were relatively well targeted and had reasonable power, indicated by the reliability index, to disclose misfit and to differentiate examinees. In all subtests of the ISAT for both the postgraduate and undergraduate tests, there was some misfit to the model. However,
because misfit was observed in only a few items in each subtest, its effect on reliability was small. The analyses also showed that low or high discrimination, guessing and DIF were evident in some items. Some local dependence, due to the structure of the items, was also evident in all subtests. Dependence between specific items, which was not directly a result of the structure of the test, was observed only between two items in the Quantitative undergraduate set. Information in a distractor was also found in some items. In each case, where an item showed misfit or rescoring was suggested by the statistical analysis, a substantive explanation was sought and provided.
Item parameter estimates from the analysis of the postgraduate and undergraduate tests were compared with item parameters from the item bank at the CEA and considerable differences were found. However, using the standard deviations of the same items in the item bank and in the data analysed to assess the relative units in the two contexts, little difference was found between the units in the item bank and in the data analysed.
Despite differences in the estimates of the individual item parameters, the person estimates were virtually the same whether item bank parameters were used or parameters from the analysis of the postgraduate/undergraduate test data were used.
This is partly because of each of the following: (i) the arbitrary origin was adjusted by setting the mean difficulty of the items from the item bank to zero, as in the data analysed; (ii) all students had responses to all items; (iii) the total score in the Rasch model is a sufficient statistic for the person parameter estimate; and (iv) the units were virtually the same.
The differences in the relative item difficulties from those of the item bank suggest that the frames of reference of the original application and the new application are not exactly the same. Further study to understand this instability, and regular checks of the stability of the item bank parameters, need to be undertaken.
In terms of predictive validity, for the postgraduate data a positive correlation between the GPA and the ISAT estimates was found for most fields of study. However, some correlations were relatively small and not statistically significant. Only in three fields of study (Literature, Social Studies and Psychology) was academic performance at university, as indicated by the GPA, significantly predicted by the ISAT estimates. The variance explained ranged from 11.9 % to 94.2 %. The Verbal subtest was a significant predictor in Literature, accounting for 31.4 % of the variance, and the Reasoning subtest was a significant predictor in Social Studies, accounting for 11.9 % of the variance. In Psychology, all the subtests were significant predictors, jointly accounting for 94.2 % of the variance. It was noted, however, that there were only nine students in Psychology, although the high predictive validity was considered worth reporting.
In both Economics and Engineering undergraduate studies, the GPA was significantly correlated with all the ISAT estimates. The correlation was consistently higher in Economics than in Engineering despite the standard deviation of the GPA distribution being greater in Engineering than in Economics. When the three subtest estimates were included as predictors in a multiple regression analysis, the variance accounted for was 27.9 % in Economics and 10.4 % in Engineering. The Quantitative subtest predicted better than the other subtests, both in Economics and Engineering.
That the correlation between ISAT estimates and the GPA was small in some fields of study and not statistically significant in others at the postgraduate level can perhaps be explained by the very small range of the GPA in the postgraduate data, especially in some fields such as Medicine. The standard deviation of the GPA in the postgraduate data was approximately half of the standard deviation in the undergraduate data. Therefore, as expected, the correlation between GPA and ISAT estimates was stronger in the undergraduate studies.
Another factor which needs to be taken into account in interpreting the result of predictive validity analysis is that the sample size in each field of study, especially in the postgraduate data, was very small. This may lead to sampling errors and unstable estimates.
This study provides comprehensive evidence of the degree of the broadly defined reliability and validity of the ISAT. It shows that the ISAT met the basic criteria of the Rasch model and that it had some predictive validity in regard to academic performance in postgraduate and undergraduate studies as assessed by correlations with the students’
GPAs. However, it is necessary to consider further the implications of the differences in the relative difficulties of the item bank and those observed in the data analysed.
This study is significant in two ways. Firstly, it contributes to the specific item development process for the ISAT. The results of this study can be used to provide better items and a better test to measure the construct more validly, reliably and efficiently. Secondly, the study contributes to the field of measurement in general by illustrating an application of not only the Rasch model, but the Rasch paradigm, in constructing and evaluating a test. The differences between applying a measurement model within the Rasch paradigm and within a general item response theory (IRT) paradigm are demonstrated.
Declaration
In accordance with the regulations for presenting theses and other work for higher degrees, I hereby declare that this thesis is entirely my own work and that it has not been submitted for a degree at this or any other university. I have the permission of my co-author to include the work from the following publication in my thesis.
Asril, Asrijanty and Marais, Ida (2011). Applying a Rasch Model Distractor Analysis: Implications for Teaching and Learning. In Robert Cavanagh and Russell F. Waugh (Eds.), Application of Rasch Measurement for Learning Environments Research (pp. 77-100). The Netherlands: Sense Publishers. ISBN: 978-94-6091-492-1 (paperback), 978-94-6091-492-8 (hardback).
Asrijanty Asril
The University of Western Australia August 2011
Note. This thesis has been formatted in accordance with modified American Psychological Association (2010) publication guidelines.
Table of Contents
Abstract ... i
Declaration ... vii
Table of Contents ... viii
Acknowledgements... x
List of Acronyms ... xi
List of Tables ... xii
List of Figures ... xv
List of Appendices ... xix
Chapter 1 Introduction ... 1
1.1 Selection for Higher Education Studies ... 2
1.2 The Indonesian Scholastic Aptitude Test (ISAT) ... 5
1.3 Present Study... 10
1.4 Significance of the Study ... 14
1.5 Overview of the Dissertation ... 16
Chapter 2 Literature Review ... 18
2.1 Aptitude Testing for Selection ... 18
2.2 The Rasch Model and Its Paradigm ... 29
Chapter 3 Methods ... 46
3.1 Rationale and Procedure in Examining Internal Consistency ... 47
3.2 Rationale and Procedure in Examining the Stability of Item Bank Parameters ... 90
3.3 Rationale and Procedure in Examining Predictive Validity ... 100
3.4 ISAT Items Analysed in this Study... 102
Chapter 4 Internal Consistency Analysis of the Postgraduate Data ... 103
4.1 Examinees of the Postgraduate Data ... 103
4.2 Internal Consistency Analysis of the Verbal Subtest ... 104
4.3 Internal Consistency Analysis of the Quantitative Subtest ... 141
4.4 Internal Consistency Analysis of the Reasoning Subtest ... 169
4.5 Summary of Internal Consistency Analysis of the Postgraduate Data ... 192
Chapter 5 Internal Consistency Analysis of the Undergraduate Data ... 194
5.1 Examinees of the Undergraduate Data ... 194
5.2 Treatment of Missing Responses and Item Difficulty Order for the Undergraduate Data ... 195
5.3 Internal Consistency Analysis of the Verbal Subtest ... 196
5.4 Internal Consistency Analysis of the Quantitative Subtest ... 201
5.5 Internal Consistency Analysis of the Reasoning Subtest ... 209
5.6 Summary of Internal Consistency Analysis of the Undergraduate Data ... 214
Chapter 6 Stability of the Item Bank Parameters in the Postgraduate and Undergraduate Data ... 217
6.1 Correlations between Item Locations... 217
6.2 Comparisons between Item Locations ... 218
6.3 The Effect of Unstable Item Parameters on Person Measurement ... 223
6.4 Summary ... 226
Chapter 7 Predictive Validity of the ISAT for Postgraduate and Undergraduate Studies ... 227
7.1 The Predictor and Criterion for the Predictive Validity Analysis ... 227
7.2 Analysis of the Postgraduate Data ... 230
7.3 Analysis of the Undergraduate Data ... 245
7.4 Summary ... 254
Chapter 8 Discussion and Conclusion ... 256
8.1 Discussion ... 256
8.2 Conclusion ... 267
References ... 269
Appendices...277
Acknowledgements
I would like to express my gratitude to David Andrich for his guidance and continuous support. His understanding and generosity in guiding me made the journey of finishing this study rewarding and enjoyable. This study applies much of his work on the Rasch model.
I would also like to thank and to acknowledge the support and constructive input of my co-supervisors, Ida Marais and Stephen Humphry, throughout the study. The frequent discussions that we had helped me gain a better understanding of Rasch analysis. This study also applies their recent work on the Rasch model.
I would like to acknowledge and to thank Irene Styles for reading my thesis. Her suggestions improved the final thesis.
The data I used in this study were obtained from the Center for Educational Assessment, Jakarta. I would like to thank N.Y. Wardani for granting me permission to use the data, and all my colleagues in the Center for their support, especially Mbak Tuti, Nana, Irma, Daru, and Yoyok for their assistance in preparing the data.
I would like to acknowledge and to thank the Department of Education, Employment, and Workplace Relations (DEEWR) of Australia for providing financial support throughout my studies through the Endeavour Postgraduate Award.
Lastly, I would like to thank my family and friends for their support and encouragement. Special thanks go to Vitti for her assistance in editing my first draft and her support throughout.
List of Acronyms
CEA     Center for Educational Assessment
CCC     Category Characteristic Curve
CTT     Classical Test Theory
DIF     Differential Item Functioning
DRM     Dichotomous Rasch Model
GPA     Grade Point Average
ICC     Item Characteristic Curve
IRT     Item Response Theory
ISAT    Indonesian Scholastic Aptitude Test
PRM     Polytomous Rasch Model
PSI     Person Separation Index
SNMPTN  National Selection to Enter Public Universities
SPMB    Selection for Admission of New Students
TCC     Threshold Characteristic Curve
List of Tables
Table 1.1. ISAT Specifications ... 7
Table 2.1. Rasch’s Two-way Frame of Reference of Objects, Agents and Responses .. 32
Table 3.1. Treatment of Missing Responses for Item Estimates ... 49
Table 4.1. Composition of Postgraduate Examinees ... 104
Table 4.2. The Effect of Different Treatments of Missing Responses in the Verbal Subtest ... 105
Table 4.3. Fit Statistics of Misfitting Items for the Verbal Subtest ... 109
Table 4.4. Spread Value and the Minimum Value Indicating Dependence ... 112
Table 4.5. PSIs in Three Analyses to Confirm Dependence in Six Verbal Testlets ... 113
Table 4.6. Statistics of Some Verbal Items after Tailoring Procedure ... 118
Table 4.7. Results of Rescoring 17 Verbal Items ... 122
Table 4.8. Results of Rescoring Four Verbal Items ... 122
Table 4.9. Results of Rescoring Items 13 and 36 ... 129
Table 4.10. Problematic Items in the Verbal Subtest Postgraduate Data ... 141
Table 4.11. The Effect of Different Treatments of Missing Responses in the Quantitative Subtest ... 142
Table 4.12. Item Difficulty Order in the Quantitative Subtest... 144
Table 4.13. Spread Value and the Minimum Value in the Quantitative Subtest ... 148
Table 4.14. PSIs in Three Analyses to Confirm Dependence ... 149
Table 4.15. Statistics of Some Quantitative Items after Tailoring Procedure ... 153
Table 4.16. Results of Rescoring for 22 Quantitative Items ... 161
Table 4.17. Results of Rescoring Three Quantitative Items ... 162
Table 4.18. Problematic Items in the Quantitative Subtest Postgraduate Data ... 168
Table 4.19. The Effect of Different Treatments of Missing Responses
in the Reasoning Subtest ... 169
Table 4.20. Spread Value and the Minimum Value Indicating Dependence ... 172
Table 4.21. PSIs in Three Analyses to Confirm Dependence ... 173
Table 4.22. Statistics of Some Reasoning Items after Tailoring Procedure... 178
Table 4.23. Results of Rescoring 19 Reasoning Items... 184
Table 4.24. Results of Rescoring 6 Reasoning Items ... 185
Table 4.25. Problematic Items in the Reasoning Subtest Postgraduate Data... 191
Table 5.1. Composition of Undergraduate Examinees ... 195
Table 5.2. Problematic Items in the Verbal Subtest Undergraduate Data ... 201
Table 5.3. Problematic Items in the Quantitative Subtest Undergraduate Data... 209
Table 5.4. Problematic Items in the Reasoning Subtest Undergraduate Data ... 214
Table 6.1. Correlations between Item locations of the Item Bank and of the Postgraduate/Undergraduate Analyses... 218
Table 6.2. Standard Deviation of the Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses ... 219
Table 6.3. Significance of the Difference in Variance of Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses ... 219
Table 6.4. Identification of Unstable Items without Adjusting the Units for the Verbal Subtest Postgraduate Data ... 221
Table 6.5. Identification of Unstable Items with Adjusting the Units for the Verbal Subtest Postgraduate Data ... 221
Table 6.6. The Effect of Adjusting the Units as a Function of a Unit Ratio and Correlation between Item locations of the Item Bank and Postgraduate/Undergraduate Analyses ... 222
Table 6.7. Comparisons of the Means of Person Locations Using Item Bank Values and Item Estimate from the
Postgraduate/Undergraduate Analyses ... 225
Table 7.1. Number of Examinees who had Academic Records in Each Semester ... 230
Table 7.2. Descriptive Statistics of ISAT Location Estimates for all Postgraduate Examinees ... 233
Table 7.3. Descriptive Statistics of the ISAT and the GPA per Field of Study for the Postgraduate Data ... 235
Table 7.4. Summary of Correlations between Subtests ... 241
Table 7.5. Correlation between the ISAT and GPA in the Postgraduate Data ... 242
Table 7.6. Summary of Regression Analyses for the Postgraduate Data ... 245
Table 7.7. Descriptive Statistics of ISAT Location Estimates for All Undergraduate Examinees ... 247
Table 7.8. Descriptive Statistics of ISAT and GPA per Field of Study for the Undergraduate Data ... 250
Table 7.9. Summary of Correlation between Subtests ... 252
Table 7.10. Correlation between the ISAT and GPA in the Undergraduate Data ... 253
Table 7.11. Summary of Regression Analyses for the Undergraduate Data ... 254
List of Figures
Figure 1.1. ISAT development process ... 8
Figure 2.1. ICCs of three items with dichotomous responses ... 34
Figure 2.2. CCCs and TCCs of an item with three response categories ... 36
Figure 3.1. ICCs of two items indicating fit (left) and misfit (right) ... 55
Figure 3.2. Examples of items showing guessing (right) and no guessing (left). ... 68
Figure 3.3. ICCs of an Item where guessing is confirmed, before tailoring (left) and after tailoring (right) ... 73
Figure 3.4. ICCs of an item where guessing is not confirmed, before tailoring (left) and after tailoring (right) ... 74
Figure 3.5. CCCs and TCCs for polytomous responses with three category responses ... 76
Figure 3.6. Plots of distractors with potential information ... 79
Figure 3.7. CCC (left) and TCC (right) of an Item showing categories working as intended (top) and not working as intended (bottom) ... 82
Figure 3.8. An item showing uniform DIF ... 85
Figure 4.1. Item Order of the Verbal subtest according to the location from the item bank (top panel) and from the postgraduate analysis (bottom panel) ... 107
Figure 4.2. Person-item location distribution for the Verbal subtest ... 109
Figure 4.3. The ICCs of items 18 and 35 ... 110
Figure 4.4. The ICC of item 36 indicating guessing graphically ... 115
Figure 4.5 The plot of item locations from the tailored and anchored analyses for the Verbal subtest ... 116
Figure 4.6. ICCs for item 36 from the original analysis (left) and
the anchored all analysis (right) to confirm guessing ... 120
Figure 4.7. Graphical fit for item 3 ... 124
Figure 4.8. Graphical fit for item 13 ... 125
Figure 4.9. Graphical fit for item 36 ... 125
Figure 4.10. Graphical fit for item 41 ... 126
Figure 4.11. The content of item 13 ... 127
Figure 4.12. Distractor plot of item 13... 127
Figure 4.13. The Content of item 36 ... 128
Figure 4.14. Distractor plots of item 36 ... 128
Figure 4.15. The graphical fit for rescored item 13 into three categories ... 130
Figure 4.16. The graphical fit for rescored item 36 into four categories ... 131
Figure 4.17. The ICCs of Verbal items indicating DIF for gender, educational level and program of study ... 133
Figure 4.18. ICCs for males and females for resolved item 7 ... 135
Figure 4.19. ICCs for Masters and doctorates for resolved item 18 ... 137
Figure 4.20. ICCs for social sciences and non-social sciences for resolved item 11 .... 138
Figure 4.21. Item order of the Quantitative subtest according to the location from the item bank (top) and from the postgraduate analysis (bottom) ... 143
Figure 4.22. Person-item location distribution of the Quantitative subtest ... 146
Figure 4.23. The ICC of item 74 ... 147
Figure 4.24. ICCs of four Quantitative items indicating guessing graphically... 151
Figure 4.25. The plot of tailored and anchored locations for the Quantitative subtest ... 152
Figure 4.26. The ICCs from original analysis for four Quantitative items which indicate significant location difference between tailored and anchored analyses but did not indicate guessing from the ICC ... 155
Figure 4.27. ICCs of four Quantitative items from the original
analysis (left) and anchored all analysis (right) to confirm guessing ... 157
Figure 4.28. The content of four Quantitative items indicating guessing ... 159
Figure 4.29. Graphical fit of item 55... 162
Figure 4.30. Graphical fit of item 58... 163
Figure 4.31. Graphical fit of item 80... 163
Figure 4.32. Graphical fit for rescored item 58 only... 164
Figure 4.33. The content of item 58 ... 165
Figure 4.34. Distractor plots of item 58 ... 165
Figure 4.35. Reasoning item order according to item location from the item bank (top panel) and from postgraduate analysis (bottom panel) ... 170
Figure 4.36. Person-item location distribution of the Reasoning subtest ... 171
Figure 4.37. ICCs of four Reasoning items indicating guessing graphically ... 175
Figure 4.38. The Plot of item locations from the tailored and anchored analyses for the Reasoning subtest ... 176
Figure 4.39. The ICC of item 110 ... 179
Figure 4.40. The ICCs of four Reasoning items from the original (left) and the anchored all analysis (right) to confirm guessing ... 180
Figure 4.41. The content of items 96, 108, 109, and 112... 183
Figure 4.42. Graphical fit for item 92 ... 186
Figure 4.43. Graphical fit for item 94 ... 186
Figure 4.44. Graphical fit item 95 ... 187
Figure 4.45. Content of items 92, 94, and 95 ... 188
Figure 4.46. Distractor plots of items 92, 94, 95 ... 189
Figure 7.1. Distribution of ISAT location for admitted and non-admitted groups ... 231
Figure 7.2. Distribution of location estimate in Verbal for each field of study ... 236
Figure 7.3. Distribution of location estimate in Quantitative for each field of study ... 237
Figure 7.4. Distribution of location estimates in Reasoning for each field of study ... 238
Figure 7.5. Distribution of the location estimates in Total for each field of study ... 239
Figure 7.6. Distribution of the location estimates in GPA for each field of study ... 240
Figure 7.7. Distribution of ISAT location estimates for the sample predictive validity group and other groups ... 248
Figure 7.8. Distribution of ISAT subtest location estimates and GPA for Economics and Engineering of undergraduate studies ... 251
List of Appendices
Appendix A1. Item Fit Statistics for Verbal (Postgraduate) Subtest ……... 277
Appendix A2. Statistics of Verbal (Postgraduate) Items after Tailoring Procedure....279
Appendix A3. Results of DIF Analysis for Verbal (Postgraduate) Subtest...281
Appendix B1. Item Fit Statistics for Quantitative (Postgraduate) Subtest... 293
Appendix B2. Statistics of Quantitative (Postgraduate) Items after Tailoring Procedure... 294
Appendix B3. Results of DIF Analysis for Quantitative (Postgraduate) Subtest...295
Appendix C1. Item Fit Statistics Analysis for Reasoning (Postgraduate) Subtest...298
Appendix C2. Statistics of Reasoning (Postgraduate) Items after Tailoring Procedure... ... ...299
Appendix C3. Results of DIF Analysis for Reasoning (Postgraduate) Subtest... 300
Appendix D1.Treatment of Missing Responses for Verbal Subtest in Undergraduate Data...303
Appendix D2. Item Difficulty Order for Verbal (Undergraduate) Subtest...304
Appendix D3. Targeting and Reliability for Verbal (Undergraduate) Subtest...305
Appendix D4. Item Fit Statistics for Verbal (Undergraduate) Subtest...306
Appendix D5. Local Independence in Verbal Subtest of Undergraduate Data...308
Appendix D6. Evidence of Guessing in Verbal Subtest of Undergraduate Data...309
Appendix D7. Distractor Information in Verbal Subtest of Undergraduate Data... 313
Appendix D8. Results of DIF Analysis for Verbal (Undergraduate) Subtest... 321
Appendix E1.Treatment of Missing Responses for Quantitative Subtest of Undergraduate Data... 330
Appendix E2. Item Difficulty Order for Quantitative (Undergraduate) Subtest... 331
Appendix E3. Targeting and Reliability for Quantitative (Undergraduate) Subtest ... 332
Appendix E4. Item Fit Statistics for Quantitative Subtest of Undergraduate Data ... 333
Appendix E5. Local Independence in Quantitative Subtest of Undergraduate Data ... 334
Appendix E6. Evidence of Guessing in Quantitative Subtest of Undergraduate Data ... 335
Appendix E7. Distractor Information for Quantitative Subtest of Undergraduate Data ... 340
Appendix E8. Results of DIF Analysis for Quantitative Subtest of Undergraduate Data ... 345
Appendix E9. Content of Problematic Items in Quantitative (Undergraduate) Subtest ... 349
Appendix F1. Treatment of Missing Responses for Reasoning Subtest of Undergraduate Data ... 351
Appendix F2. Item Difficulty Order for Reasoning Subtest of Undergraduate Data ... 352
Appendix F3. Targeting and Reliability for Reasoning Subtest of Undergraduate Data ... 353
Appendix F4. Item Fit Statistics for Reasoning Subtest of Undergraduate Data ... 354
Appendix F5. Local Independence in Reasoning Subtest of Undergraduate Data ... 355
Appendix F6. Evidence of Guessing in Reasoning Subtest of Undergraduate Data ... 356
Appendix F7. Distractor Information in Reasoning Subtest of Undergraduate Data ... 359
Appendix F8. Results of DIF Analysis for Reasoning Subtest of Undergraduate Data ... 366
Appendix F9. Content of Problematic Items in Reasoning (Undergraduate) Subtest ... 368
Appendix G1. Correlations between Item Location from the Item Bank and from Postgraduate Analysis ... 370
Appendix G2. Correlations between Item Location from the Item Bank and from Undergraduate Analysis ... 371
Appendix G3. Identification of Unstable Items after Adjusting the Units in Postgraduate Data ... 372
Appendix G4. Identification of Unstable Items after Adjusting the Units in Undergraduate Data ... 376
Appendix G5. Correlations between Person Location from the Item Bank and from Postgraduate Analysis ... 380
Appendix G6. Correlations between Person Location from the Item Bank and from Undergraduate Analysis ... 381
Appendix H1. Relationship between the ISAT and GPA in Postgraduate Data ... 382
Appendix H2. The Results of Multiple Regression Analyses for Postgraduate Data ... 383
Appendix H3. Relationship between the ISAT and GPA in Undergraduate Data ... 387
Appendix H4. The Results of Multiple Regression Analyses for Undergraduate Data ... 390
Chapter 1 Introduction
Selection for entry to higher education is considered an important issue in many countries. There are at least three reasons for its importance: tertiary selection determines the quality of graduates, it affects curricula and teaching methods in secondary schools, and it affects social equity and social cohesion within societies (Harman, 1994).
Accordingly, ensuring an admission test is reliable and that the inferences made from test scores are valid becomes crucial. To achieve this, the internal structure of the test and its relation to external criteria need to be examined. In particular, to ensure that the test meets important measurement criteria, an examination based on a model which has properties of fundamental measurement, namely the Rasch model, has advantages compared to other approaches.
Andrich (2004) argues that the distinction between the Rasch model and other measurement models, namely item response theory (IRT) models, is not only a distinction between model properties but also between statistical paradigms. The IRT models are used within the traditional statistical paradigm (Andrich, 2004). In the traditional paradigm, the function of a model is to account for the data. Thus, when the data do not fit the model, another model which explains or describes the data better is used. In contrast, in the Rasch paradigm a model serves as a frame of reference. When the data do not fit the Rasch model, the data need to be examined and an explanation of the misfit sought. Thus, the Rasch model serves as a prescriptive and diagnostic tool.
Applying the Rasch model and its paradigm can help in developing better items to measure a construct more validly, reliably, and efficiently.
This study evaluates the Indonesian Scholastic Aptitude Test (ISAT) internally, through the Rasch model and its paradigm, and externally through its predictive validity. In addition, the stability of the estimates of item difficulty relative to the item bank is also examined. The test, developed by the Center for Educational Assessment (CEA) in Jakarta, has been used as one of the admission tests for undergraduate and postgraduate levels in some public universities in Indonesia. However, although it has been analysed and an item bank developed based on the Rasch model, it has not been reviewed comprehensively using the Rasch model and its paradigm.
The chapter starts with the context and background of this study. Selection for higher education and the development of the ISAT are discussed first. This is followed by a description of the study, its significance, and an outline of the structure of the dissertation.
1.1 Selection for Higher Education Studies
Selection for higher education generally takes place because the number of applicants is greater than the available places. The greater the ratio of applicants to places the more competitive the selection. In Asian countries where the number of applicants is increasing rapidly (Harman, 1994), the competition is inevitably very high.
Competition, however, does not occur only in developing countries but also in developed countries. In the United States (US), for example, in general the chance for applicants to enter a university (four-year institution) is relatively high. At least three-quarters of applicants are admitted to about 65 % of the institutions. Still, the competition in some prestigious colleges is very high (Zwick, 2004). In many of these
countries there is strong competition for particular professional studies, for example, Law and Medicine.
Higher education institutions differ in how they select students. However, in general, variation in selection method originates from three sources: the evidence of applicants' quality (either aptitude or achievement); the reference of assessment (either criterion-based or norm-based); and the context of assessment (either secondary school-based or national/external assessment) (Fulton, 1992).
The issue which attracts much attention is the choice between assessment of aptitude and assessment of achievement. Some argue that selection should be based on the assessment of achievement, not of potential or aptitude; others consider the assessment of aptitude more relevant. Different countries apply different criteria for selection and these criteria are usually a function of a country's education context. In the US, both achievement and aptitude are used as admission criteria. Most US universities accept either a score on the SAT, developed by the College Board (New York), which measures reasoning, or a score on the ACT, developed by American College Testing (Iowa), which measures achievement (Briggs, 2009). In other countries, such as the United Kingdom (UK) and Australia, the criterion of admission is student achievement in prescribed subjects (Andrich & Mercer, 1997).
In Indonesia, where public (state) universities are generally preferred to private universities, selection for undergraduate studies into all public universities until 2001 was based on a centralized achievement test as the selection tool. The applicants for all public universities sit for the same admission test at the same time, generally over two days. The subjects that all applicants are tested on are Basic Mathematics, Indonesian, and English. In addition, applicants for Natural Science programs sit for Natural Science subject tests, namely Biology, Chemistry, Physics, Science Mathematics and
Applied Natural Science. Those who apply for social science programs sit for social science subject tests including History, Geography, Economics, and Applied Social Science. To study Kinesiology and Arts, applicants are required to take additional tests.
From 2002, the system for selection was changed as a consequence of Ministerial decree 173/U/2001. The decree states that student selection, including criteria and procedures, is set by each university. Nevertheless, there is an agreement among public universities to continue to use the previous system, which is centralized, and to use the same criteria. This selection system is called "Selection for Admission of New Students" (SPMB).
However, SPMB is not the only scheme for recruiting students. In addition to SPMB, the universities, especially the prestigious ones, also apply other schemes for recruiting students. These schemes may differ from each other in their criteria and selection procedures. The criteria may be outstanding performance in an academic national or international competition (for example, a Physics or Mathematics Olympiad), outstanding academic performance nominated by the region, outstanding performance in school and in a scholastic aptitude test, outstanding performance in a school with a low socioeconomic background, and outstanding performance in sport and the arts.
It is clear, then, that from 2002, especially in some prestigious universities, the schemes for recruiting students for undergraduate studies can in general be classified into two groups. The first is SPMB (the centralised selection procedure with achievement tests as the selection tool). The second comprises schemes other than SPMB, in which the selection procedures and criteria vary.
In 2008 the SPMB changed to SNMPTN (National Selection to Enter Public Universities). However, except for the name, the selection system, including the
selection tool, did not change. Only from 2009 has a scholastic aptitude test been added as an admission test to complement the achievement tests.
Meanwhile, selection at postgraduate level has never been centralised. Each university sets and applies its own selection system. Although the procedures are different, the criteria are the same. For doctorate programs, three components are generally assessed, namely English, scholastic aptitude, and subject matter. The last component may be assessed from a research proposal, interview, written test or portfolio. For Masters programs, some fields of study use the three components as for the doctorate level or just English and scholastic aptitude.
1.2 The Indonesian Scholastic Aptitude Test (ISAT)
1.2.1 The Background
As indicated earlier, in the 1980s selection to enter public universities in Indonesia at the undergraduate level was based only on performance on an admission test, which was an achievement test in certain subjects. There had been concerns about this selection system.
The system was considered not to provide adequate information about an applicant's potential for further study, because it captures only an applicant's knowledge in certain subjects. Some argued that certain students may not perform well in the achievement test for some reason even though they may be capable of succeeding in university studies.
For example, applicants from low social and economic backgrounds may not perform well, not because they are incapable of further study, but because they have been disadvantaged in their schooling. Although it is not always the case, there is a trend for students from high social and economic status backgrounds to attend high-quality schools and for students from low social and economic status backgrounds to attend lower-quality
schools. Similarly, those who live in big cities (urban areas) tend to get better service in education than those in small cities (rural areas). In remote areas, in particular, the learning process is hindered by limited resources which, in turn, lead to low levels of academic achievement.
Also, many students, especially in big cities, attend test preparation courses before sitting for university entrance tests. Some test preparation institutions are well known for their success in helping students get a place in universities. It is suspected that some students get a place in a university due to the drilling process in the preparation program even though their academic ability is relatively low.
The CEA, formerly the Research and Development Center for the Examination System, organized a national seminar for student selection methods as a response to these concerns in the late 1980s. One of the recommendations that followed from this seminar was to develop a scholastic aptitude test to be used as one of the selection instruments for higher education admission. It was thought that using a scholastic aptitude test to complement an achievement test would provide a better prediction of future success than an achievement test alone. Since then, the Indonesian Scholastic Aptitude Test (ISAT) has been developed by the CEA.
1.2.2 Description
The ISAT has been developed to measure individual scholastic aptitude or academic capability. This aptitude is considered a significant factor contributing to the success in higher education studies at both undergraduate and postgraduate levels. Therefore, although the idea of developing the ISAT was originally for selection at undergraduate level, during its development it was considered that it would be useful for selection at the postgraduate level as well.
The test consists of three subtests, Verbal, Quantitative, and Reasoning, and uses multiple choice item formats with five alternatives. The Verbal subtest measures reasoning in a verbal context; the Quantitative subtest measures reasoning in a numerical context; the Reasoning subtest measures the ability to draw a conclusion from a hypothetical situation or condition. The details of the test including the sections in each subtest, the number of items, and the time allocated to complete the subtest are shown in Table 1.1.
Table 1.1. ISAT Specifications
Subtest        Section                          Number of Items    Allocated Time
Verbal         Synonyms                         12
               Antonyms                         13
               Analogies                        13
               Reading Comprehension            12
               Subtest total                    50 items           30 min
Quantitative   Number Sequence                  10
               Arithmetic & Algebra concepts    10
               Geometry                         10
               Subtest total                    30 items           60 min
Reasoning      Logic                             8
               Diagrams                          8
               Analytical                       16
               Subtest total                    32 items           40 min
Total                                           112 items          130 min
1.2.3 Test Development
As indicated previously, the ISAT has been developed over almost 20 years. In the first years of its development the focus was on the development of the test specifications, the result of which is shown in Table 1.1. In the latter years the focus has been on the development of an item bank. For this purpose each year the CEA organizes activities related to item development, including item writing, item review, item trial, and item analysis.
In the item trials, in which the respondents are normally high school students (year 12), each student does not take all three subtests. Only one set of a subtest (about 40-50 items) is given to each group (class). It takes approximately 90-120 minutes to complete the test. Some linking items across trial forms are included.
Items are then analysed using classical test theory and Rasch measurement theory.
Classical item analysis, which is undertaken before Rasch analysis, is conducted to examine how well the items work from the perspective of classical test theory. The main statistic used is the item discrimination index. The Rasch analysis is conducted only for items which show a positive discrimination index for the correct answer (key). It may be argued that this step of first using classical test theory is not necessary when applying the Rasch model. However, here the process which is currently used is described. Items which show a negative discrimination index for the key are not included for further analysis. If it is found that these items can be revised, they are retained for retrial. Those items for which an explanation of negative discrimination cannot be offered and which could not be revised are dropped. In the Rasch analysis, items are examined in terms of fit to the Rasch model; in this case the criterion is the item fit statistic. The steps in ISAT development are summarized in Figure 1.1.
Figure 1.1. ISAT development process
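The classical screening step described above can be illustrated with a minimal Python sketch. The function name, the data layout and the use of a corrected item-total (point-biserial) correlation as the discrimination index are assumptions for illustration only, not the CEA's actual procedure.

import numpy as np

def discrimination_indices(responses):
    """Corrected item-total (point-biserial) discrimination index for each item.

    responses: 0/1 array with one row per examinee and one column per item,
    where 1 means the keyed (correct) answer was chosen.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    indices = []
    for i in range(responses.shape[1]):
        rest_score = total - responses[:, i]   # exclude the item from the total
        indices.append(np.corrcoef(responses[:, i], rest_score)[0, 1])
    return np.array(indices)

# Items with a negative index for the key would be set aside for revision,
# retrial or removal; the remaining items would go on to the Rasch analysis.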
1.2.4 Test Administration, Scoring and Reported Results
To administer the test, testers need to attend a coaching session and to follow the instruction manual. Normally, it takes about 15 minutes for testing preparation including reading test instructions and filling in the identity details on a computer answer sheet. The testing time is 130 minutes with allocated time for each subtest as described in Table 1.1.
The examinees are informed that the ISAT scoring does not apply a penalty for incorrect responses. Each correct answer is scored 1 and each incorrect response is scored 0. A missing response is also scored 0. It is apparent that this scoring system encourages examinees to guess and thus, theoretically, the ISAT data may contain guessed responses.
There are four scores reported: Verbal, Quantitative, Reasoning, and the Total. In each subtest a person's proficiency estimate in logits is converted to a scale with a mean of 300 and a standard deviation of 40. In this way a score in each subtest ranges approximately between 100 and 500. A total score is obtained by summing the scaled scores on the three subtests. The total score ranges from approximately 300 to 1500 and is scaled to have a mean of 900 and a standard deviation of 120.
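The conversion just described can be sketched in Python as follows. The reference mean and standard deviation of the logit estimates are hypothetical placeholders, since the calibration constants used by the CEA are not given here.

def scale_subtest(theta_logit, ref_mean=0.0, ref_sd=1.0):
    """Convert a logit proficiency estimate to the reported subtest scale
    (mean 300, standard deviation 40, range roughly 100-500)."""
    return 300 + 40 * (theta_logit - ref_mean) / ref_sd

def total_score(verbal, quantitative, reasoning):
    """The reported Total is the sum of the three scaled subtest scores
    (range roughly 300-1500, mean 900, standard deviation 120)."""
    return verbal + quantitative + reasoning

# Example: an examinee 0.5 reference standard deviations above the mean on each
# subtest receives 320 on each subtest and 960 in total.
v = q = r = scale_subtest(0.5)
print(v, total_score(v, q, r))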
1.2.5 Test Usage
Although the test has been developed over about 20 years, it has not been used widely until recently. From the early 1990s until the early 2000s, it was used only by one private university as one of its selection instruments. Only since 2004 has the test been employed by some public universities in Indonesia, notably the prestigious ones, as part of their selection tools.
The ISAT has been used as a selection instrument for undergraduate and postgraduate levels in some fields of study in different ways. Some universities use the ISAT along
with other instruments, such as an achievement test and/or interview, while others may use the ISAT as the only selection tool. The role of the ISAT in the selection process also varies. Some give more weight to the ISAT score than to other scores, and some do not. Some use the ISAT scores for filtering applicants; some use the ISAT scores and other results simultaneously. When the ISAT is used for filtering, generally the cut off score is 900 or above for more selective programs.
For security and for aligning students to the difficulties of the items, different item sets are used for different groups of examinees. In terms of security, for example, the same item set would not be administered as an admission test in two different universities where there is a possibility that the examinees could sit both tests. In terms of aligning students to the difficulties of the items, a more difficult set would be given to higher proficiency examinees. However, because the examinees' proficiency level in scholastic aptitude is usually not known, the examinees' level of proficiency is inferred from the competitiveness of the selection system. It is assumed that in more competitive selection systems the proficiency of the examinees is higher than in less competitive systems. A more difficult test is given to examinees in more selective selection procedures.
It should be noted that the scholastic aptitude test which has been used in SNMPTN (National Selection to Enter Public Universities) since 2009 is not the ISAT which has been developed by the CEA. The SNMPTN test was prepared by the SNMPTN Committee.
1.3 Present Study
As stated earlier the ISAT has been used for 20 years. However, until now, no study has been conducted to examine this test comprehensively, especially based on the Rasch
paradigm of measurement. It is considered critical to examine thoroughly an instrument that serves as such a high stakes test.
In addition, because the items used in this study were obtained from an item bank, it is necessary to examine the stability of the item parameters of the test with respect to their item bank values. Although in practice it is assumed that item parameters are invariant over time they may change over time or across different groups.
Another area examined is the predictive validity of the test. The ISAT, as described earlier, is used as a selection tool for entry to higher education studies. Therefore, the extent to which the test predicts academic performance in higher education studies needs to be studied. This can be considered as an effort to build a sound validity argument to support the intended use of the test according to the Standards for Educational and Psychological Testing set by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (AERA, APA, & NCME, 1999).
Therefore, this study examines the validity of the test by examining its internal structure based on the Rasch model and the Rasch paradigm, the stability of the item bank parameters and its predictive validity.
For the predictive validity purpose, responses of the examinees on the ISAT and their academic performance in universities are needed. Although the data of the ISAT responses can be obtained from the CEA, academic performance data are available only from the universities. Thus, the predictive validity of the ISAT can be studied only with the cooperation of universities.
To provide comprehensive results, it is desirable that data are obtained from examinees from as many fields of study as possible, both at undergraduate and postgraduate levels, and with evidence of their academic performance in universities. Therefore, the
universities chosen were those that used the ISAT to select students for various programs of study, had academic performance records for at least one year, and were willing to supply such data.
Two years before this study started, that is in 2005, two universities, which will be referred to as A and B, used the ISAT to select students for undergraduate studies for almost all fields of study. In the same year, for postgraduate studies a third university, C, used the ISAT, to select students for postgraduate studies in all fields of study in that university.
However, only university A and university C were able to provide data for the academic performance of those who were tested in 2005. Although university C was able to provide data of students’ academic performance for postgraduate studies from all fields of study, university A which had undergraduate data, provided students’ academic performance from only two fields of study, namely Economics and Engineering.
In 2005 university A used the ISAT to select students for undergraduate studies in a special scheme (not SNMPTN). In this scheme students who were in the top ten in their class in their third year of high school could apply to take the test and the ISAT was the only test administered to the applicants. In contrast, in selection for postgraduate studies by university C, the ISAT was not the only admission test. Tests in specific areas were also used.
As indicated earlier, in a selection situation, the number of applicants is generally greater than the number who are admitted. In this study, although the number of applicants is known, it is not clear how many applicants were actually admitted. Also not known was the cut score of the ISAT or the role of the ISAT in the admission decisions and whether there were some criteria or considerations in the admission decision other than the admission test results.
The complexity of the selection situation and the difficulty in obtaining accurate information regarding the selection decisions in general was acknowledged by Gulliksen (1950) 60 years ago. He asserted that in practical situations what other variables in selection were involved and how much weight was given to these variables is generally not known. To overcome this situation he suggested making the most reasonable guesses based on the available data.
Information on the selection ratio and the role of the ISAT in admission decisions will help describe the distribution of scores of those admitted and will show the degree of homogeneity of the scores. Homogeneity of scores is relevant in studying predictive validity. For example, if the selection ratio is very small and the ISAT is the only selection criterion, then it is expected that the scores will be more homogeneous and that high predictive validity in terms of correlations using those scores will not be observed.
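The effect of such homogeneity (restriction of range) on a validity correlation can be illustrated with a small simulation; the values below are hypothetical and are not the ISAT data.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_r = 0.5  # assumed correlation between aptitude and GPA in the applicant pool

aptitude = rng.standard_normal(n)
gpa = true_r * aptitude + np.sqrt(1 - true_r ** 2) * rng.standard_normal(n)

# Correlation in the full applicant pool
r_full = np.corrcoef(aptitude, gpa)[0, 1]

# Correlation in the admitted group only (here, the top 10 % on the predictor)
admitted = aptitude >= np.quantile(aptitude, 0.90)
r_admitted = np.corrcoef(aptitude[admitted], gpa[admitted])[0, 1]

print(round(r_full, 2), round(r_admitted, 2))  # about 0.50 in the pool, about 0.23 after selection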
Although the selection ratio and the role of the ISAT in admission are not known, the distributions of the ISAT scores of all applicants, including those who had academic records (the admitted group) and those who did not (the non-admitted group), were available. These distributions are examined to show the degree of heterogeneity of the ISAT scores in the predictive validity sample.
The undergraduate and postgraduate groups may have different characteristics which may lead to different predictive validities. Therefore, predictive validity is examined in both groups, although the available data show that there is a considerable difference between the postgraduate and undergraduate data in terms of the fields of study and the number of students available.
Because different item sets were used for the undergraduate and postgraduate examinees, separate analyses have to be carried out for each group. Although they are
from the same item bank, the characteristics of the items in the sets may be different and the interaction with the persons may have an impact on the predictive validity of the test.
In summary, this study examines the internal consistency of the ISAT used in the selection of undergraduate and postgraduate students, the stability of the item parameters with respect to their item bank values, and the predictive validity of the test.
Details of the aspects examined with regard to the internal consistency analysis are provided at the end of Chapter 2 and in Chapter 3. Because of the many aspects that are assessed in the internal consistency analysis, and to prevent redundancy, the results are reported in detail only for one set of data, in this case the postgraduate data. The results of the internal consistency analysis for the undergraduate data are reported as a summary. The reason that the postgraduate data were chosen to be reported in detail is that the postgraduate data were available earlier than the undergraduate data. The rationale and the procedure in examining all these aspects are presented in Chapter 3 and they are the same for both sets of data.
1.4 Significance of the Study
Until relatively recently, many measures in the social sciences could not be categorized as scientific measurements (Bond & Fox, 2001). Most of the available measures are not constructed according to standard scientific measurement criteria such as those in the physical sciences. The requirements for scientific measures such as objectivity and equal units are not met. Without these properties, objective comparisons between measurements cannot be made, or as Wright summarises, "one ruler for everyone, everywhere, every time" cannot be provided (Wright, 1997). Furthermore, non-scientific measures, when applied in statistical analyses, lead to biased inferences (Embretson & Reise, 2000).
It is argued that only the Rasch model provides the objective measurement required in scientific measurement (Wright, 1997). Such measurement has also been called fundamental measurement. Andrich (1988) notes that fundamental measurement in principle can be achieved in the social sciences by constructing instruments based on a sound substantive theory and by applying the Rasch model carefully. This is because the Rasch model fulfils the requirement of fundamental measurement, that is additivity, invariant comparisons, and constant units. It is additive because the relation between variables (person and item parameters) is additive. It provides invariant comparisons because the comparison between persons is independent of the items used to compare them, and comparison between items is independent of persons used to compare them.
The Rasch model produces a constant unit, which means the difference between two numbers or location of objects has the same meaning across the measurement continuum.
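For the dichotomous case this invariance can be shown directly. In the following standard derivation, $\beta_n$ denotes the location (proficiency) of person $n$ and $\delta_i$ the difficulty of item $i$:
\[
\Pr\{X_{ni}=1\} \;=\; \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)} .
\]
If two persons $n$ and $m$ attempt the same item $i$ and exactly one of them answers correctly, then
\[
\Pr\{X_{ni}=1, X_{mi}=0 \mid X_{ni}+X_{mi}=1\}
  \;=\; \frac{\exp(\beta_n - \beta_m)}{1 + \exp(\beta_n - \beta_m)} ,
\]
which does not involve $\delta_i$: the comparison of the two persons is independent of which item is used. By symmetry, the comparison of two items, conditional on the persons' total scores, does not involve the person parameters.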
Therefore, this study, which evaluates the ISAT according to the Rasch model and its paradigm and examines validity based on external criteria, provides evidence of objective measurement as well as comprehensive evidence of test validity. This is especially important because, as stated earlier, until now no study has been conducted to examine the ISAT comprehensively.
In particular, this research makes two significant contributions. The first is to the item development process for the ISAT. As indicated earlier, in the development of the ISAT both classical test theory and the Rasch model are used in item analysis.
However, the use of Rasch analysis is limited. It is used only to examine consistency with the model, and it is based on a single index. The extensive use of the Rasch model as a prescriptive and diagnostic tool has not been explored in the development of the ISAT, which means that the effort made to construct and understand the instrument is not optimal. The obvious result is that some items may be dropped on the basis of a statistical index even though no substantive explanation for their misfit is offered. This is not good practice, as item development is costly: the greater the number of items dropped, the less efficient the item development process. This study, which investigates the use of the Rasch model as a prescriptive and diagnostic tool with the ISAT, potentially provides a better set of items to measure the construct more validly, reliably and efficiently.
The second contribution of the study is to provide a comprehensive and illustrative application of the Rasch model and its paradigm in constructing and evaluating a test. This is distinctive because it demonstrates the differences between applying the Rasch model within the Rasch paradigm and applying it within a general IRT paradigm. In addition, relatively recent research in Rasch measurement is applied in this study. These aspects are, first, local dependence (Marais & Andrich, 2008b; Andrich & Kreiner, 2010); second, guessing (Andrich, Marais, & Humphry, in press); third, distractor information (Andrich & Styles, 2009); and fourth, the concept of the unit of measurement as a group factor for a set of items (Humphry & Andrich, 2008; Humphry, 2010).
Thus, this research contributes to the field of measurement in general and to the construction of a scholastic aptitude measure in particular.
1.5 Overview of the Dissertation
The structure of this dissertation is as follows. Chapter 2 reviews the literature on aptitude testing in a selection context and on the Rasch model and its paradigm. In Chapter 3 the rationale and procedure of the data analysis used to assess internal validity, the stability of item parameters, and external validity are described. A description of the examinees and the results are presented in Chapters 4 and 5. Chapter 4 is devoted to the analysis of the postgraduate data; it consists of a description of the postgraduate examinees and the detailed results of the internal consistency analysis. In Chapter 5, a description of the undergraduate examinees and a summary of the results of the internal consistency analysis of the undergraduate test are presented. Chapter 6 presents the results concerning the stability of the item bank parameters for both the postgraduate and undergraduate data. The results of the predictive validity analysis for the postgraduate and undergraduate data are presented in Chapter 7. The last chapter, Chapter 8, contains a discussion and concluding remarks.
Chapter 2 Literature Review
In the first part of this chapter the place of aptitude testing in selection is reviewed. It covers the strengths and limitations of the aptitude test as a selection tool and its prospects in selection contexts. In the second section the Rasch model and its paradigm as a frame of reference in this study are described. This section covers the features of the Rasch model, the difference between the Rasch paradigm and the traditional paradigm, a critique of the Rasch model and the implications of using the Rasch model and its paradigm in evaluating tests.
2.1 Aptitude Testing for Selection
Although exactly where and when the first university admission test was used is debatable, most historians agree that institutionalized admission testing began in Germany and England by the mid-1800s (Zwick, 2004). In the US, the first admission test was developed by the College Entrance Examination Board in the early 1900s. The College Board’s earliest tests were achievement tests in nine subject areas with essay-format questions. Later, the College Board changed the type of test and the test format: it was no longer an achievement test in essay format but a more general ability test with a multiple-choice item format. The items were similar to those of the Army Alpha intelligence test. This new test, called the Scholastic Aptitude Test (SAT), was administered for the first time in 1926 to about 8,000 test takers (Zwick, 2004).
Many admission tests have been developed and are in use across the world for undergraduate and postgraduate levels. In Australia, for example, admission tests include the Special Tertiary Admission Test (STAT), the Australian aptitude test for non-school leavers; the Undergraduate Medical Admissions Test (UMAT); and the Graduate Australian Medical Schools Admissions Test (GAMSAT). In Sweden, there is the Högskoleprovet, also known as the Swedish Scholastic Aptitude Test. Admission tests in the UK include the History Aptitude Test, the National Admissions Test for Law (LNAT), and the United Kingdom Clinical Aptitude Test (UKCAT).
In the US there are many admission tests, but the most well known ones are the SAT, ACT, the Graduate Record Examination (GRE), the Medical College Admission Test (MCAT), the Graduate Management Admission Test (GMAT), and the Law School Admission Test (LSAT). The SAT and ACT are admission tests for undergraduate levels and the rest are for postgraduate levels.
Although many countries have admission tests, there is not much information about these tests in the literature. Discussions as well as research studies that have been reported mostly concentrate on the admission tests in the US. Therefore, the information about admission tests in this section is mainly drawn from the US literature.
Some admission tests used in the US are categorized as aptitude tests such as SAT I or SAT Reasoning, GRE General, GMAT, and LSAT, while others such as SAT II or SAT subjects, ACT, and GRE Subject Tests are categorized as achievement tests. A test such as the Medical College Admission Test (MCAT) measures both aptitude and achievement as it consists of a Verbal Reasoning section which measures aptitude and a Science section which measures knowledge in science subjects. In Australia, the admission tests such as STAT, UNITEST, and UMAT can be categorized as aptitude tests. They all measure reasoning and thinking skills, while the GAMSAT measures both aptitude and achievement (ACER, 2007b).
2.1.1 The case of SAT
The SAT is perhaps the most widely known admission test in the US. The test attracts much attention and controversy, and many scholars have criticized it (Crouse & Trusheim, 1988; Lemann, 1999; Owen & Doerr, 1999; Zwick, 2004). As indicated earlier, the SAT consists of two types of tests: SAT I, now called SAT Reasoning, measures reasoning and thinking skills; SAT II, or SAT Subject, measures knowledge in certain subject areas. The SAT I, or SAT Reasoning, is the more controversial of the two tests.
The test has been criticized as being biased against minority groups and women, lacking predictive validity, having limited utility in making admission decisions, and being vulnerable to coaching effects (Crouse & Trusheim, 1988; Linn, 1990). It has also been criticized for being used as an indicator of school quality and for disadvantaging lower social class students, who do not have the same access to test preparation (Syverson, 2007). A further criticism is that the SAT leads to an overemphasis on preparation for test content that is not relevant to school subjects, and that it provides no information about how well students perform or how they can improve their skills (Atkinson, 2004).
Several changes have been made to the SAT during its development. These involve the question types; the testing times, to ensure that the speed factor does not affect test performance; and the test administration, such as permission to use a calculator in the mathematics section (Lawrence, Rigol, Van Essen, & Jackson, 2004). The name has also changed. Originally, SAT stood for the “Scholastic Aptitude Test”; it was then changed to the “Scholastic Assessment Test”. Now SAT is no longer an acronym, but simply the name of the test (Noddings, 2007; Zwick, 2004).
The modifications to the test were partly made in response to the criticisms. The new SAT, administered in 2005, for example, was a result of criticisms made by the University of California president, Richard Atkinson, in 2001 (Zwick, 2004). However, the criticism has not lessened. Since the new version was released it has drawn more criticism than ever before (Syverson, 2007).
Changes have taken place not only in the SAT but also for other admission tests. The GRE, for example, consisted of Verbal, Math, and Analytical Ability sections prior to 2002. In the current version, the Analytical Ability section has been replaced by Analytical Writing (GRE, 2007). Despite some significant changes in some major admission tests (SAT, MCAT, GRE, LSAT), “the fundamental character of the tests remains largely constant” (Linn, 1990, p. 298).
2.1.2 Aptitude versus Achievement
In general, as stated above, admission tests can be categorized into two groups, achievement and aptitude. A popular but misleading conception is that aptitude tests measure innate abilities (Lohman, 2004). Criticism about aptitude tests partly results from this misconception (Atkinson, 2004) and misunderstandings about the relationship between aptitude and achievement tests (Gardner, 1982).
Both aptitude and achievement tests measure developed abilities since “all tests reflect what a person has learned” (Anastasi, 1981, p. 1086). The difference between the two is that achievement tests measure learning from particular experiences which can be identified, while aptitude tests measure learning from broad life experience (Anastasi, 1981). However, it is not easy to distinguish between aptitude and achievement tests; the difference between them is relatively subtle. As Gardner (1982, p. 317) puts it, “aptitude tests cannot be designed that are completely independent of past learning and experience and achievement tests cannot be constructed that are completely independent of aptitude”.