INTERNATIONAL LARGE-SCALE ASSESSMENTS: A PROBLEM WE CAN IGNORE?
Presentation to the Department of Science Education, Faculty of Education, Chulalongkorn University, August 2023
Prof. Gavin T. L. Brown, FAPS, The University of Auckland; Bualuang ASEAN Chair Professor, Thammasat University
ILSA PROBLEM
AGENDA
Is China really #1?
Are PISA tests really comparable?
Assessment means something different in eastern Asia
Is effort the same when country reputation is at stake?
ILSA
• PISA, TIMSS, PIRLS, etc.
• Reports assume that greater performance is explained by greater ability rather than by motivation or cultural factors
• Between-country comparisons are made assuming the score is a pure and accurate measure
• But the importance of ILSAs at the student level differs across jurisdictions (Eklöf, Pavešič, & Grønmo, 2014) or contexts (Knekta, 2017)
APPLES WITH APPLES?
Is China really #1?
Sampling
• Top PISA results
• Singapore
• Shanghai
• Beijing
• Macau
• Hong Kong
• Can cities be compared with countries?
SELECTION BY HUKOU
• 2009–2012: Shanghai's household registration system (hukou) controls access to educational resources
• Migrant workers do not have Shanghai hukou
• No more than 6% of PISA participants were non-hukou migrant students, despite migrants being about 31% of all 15-year-olds living in Shanghai
Limited Sample
• Limited province selection:
• 2015: Beijing, Shanghai, Guangdong, Jiangsu
• 2018: Beijing, Shanghai, Jiangsu, Zhejiang
• What about the rest of China, where the rural poor live or where the less developed regions are?
• They are NOT reporting the nation.
• No one else cherry-picks like that.
Province     Population
Beijing      22M
Shanghai     25M
Jiangsu      85M
Zhejiang     65M
Guangdong    126M
CHINA        1,419M

2015 sample provinces: 258M/1,419M = 18%
2018 sample provinces: 197M/1,419M = 14%
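The coverage figures above follow directly from the slide's population numbers; a minimal sketch (populations rounded to millions, as listed):

```python
# Share of China's population covered by the provinces PISA sampled.
# Populations (millions) as listed on the slide.
populations = {
    "Beijing": 22, "Shanghai": 25, "Jiangsu": 85,
    "Zhejiang": 65, "Guangdong": 126,
}
china_total = 1419  # millions

samples = {
    2015: ["Beijing", "Shanghai", "Guangdong", "Jiangsu"],
    2018: ["Beijing", "Shanghai", "Jiangsu", "Zhejiang"],
}

for year, provinces in samples.items():
    covered = sum(populations[p] for p in provinces)
    print(year, f"{covered}M / {china_total}M = {covered / china_total:.0%}")
# 2015 258M / 1419M = 18%
# 2018 197M / 1419M = 14%
```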
Conclusion
Apples are not being compared to apples.
Ignore China's sample because it isn't fair.
TECHNICALLY INCOMPARABLE
Are PISA tests really comparable?
To compare scores
• Items need to be judged valid for each jurisdiction
• Items need to belong to the same factor structure (configural equivalence)
• The regression weights from the factor to each item should vary only by chance (metric equivalence)
• The regression intercepts (starting points) of each item on the factor should vary only by chance (scalar equivalence)
• Only then can factor scores be compared
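The stakes of metric equivalence can be illustrated with a small simulation (a minimal sketch with made-up loadings and intercepts, not PISA's actual scaling model): if even one item's loading differs between two groups, students of identical ability earn different expected scores, so score comparisons confound ability with measurement.

```python
import numpy as np

rng = np.random.default_rng(0)

# One latent ability distribution shared by both hypothetical groups.
ability = rng.normal(0.0, 1.0, size=100_000)

# Three-item factor model: expected item score = intercept + loading * ability.
intercepts = np.array([0.0, 0.1, -0.1])   # identical in both groups (scalar OK)
loadings_a = np.array([0.8, 0.7, 0.6])    # Group A
loadings_b = np.array([0.8, 0.7, 0.3])    # Group B: item 3 loads lower (metric violation)

def expected_total(loadings, ability):
    # Noise-free expected sum score for each person.
    return intercepts.sum() + loadings.sum() * ability

gap = expected_total(loadings_a, ability) - expected_total(loadings_b, ability)

# Among high-ability students the gap is systematic: Group B looks weaker
# even though abilities are identical - a pure measurement artifact.
print(round(gap[ability > 1].mean(), 2))
```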
PISA is NOT comparable
• Many previous studies report lack of measurement invariance (MI) in PISA tests:
• Arffman, I. (2010). Equivalence of translations in international reading literacy studies. Scandinavian Journal of Educational Research, 54(1), 37–59.
• He, J., & van de Vijver, F. (2012). Bias and equivalence in cross-cultural research. Online Readings in Psychology and Culture, 2(2). doi:10.9707/2307-0919.1111
• Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210–231.
• Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53(3), 315–333.
• Wetzel, E., & Carstensen, C. H. (2013). Linking PISA 2000 and PISA 2009: Implications of instrument design on measurement invariance. Psychological Test and Assessment Modeling, 55(2), 181–206.
• But PISA persists
PISA Reading 2009 Booklet 11
• 28 reading literacy items
• Multiple-choice items were scored 0 or 1; polytomous items ranged from 0 to 2.
• Reading processes measured were
• Access and Retrieve (11 items),
• Integrate and Interpret (11 items), and
• Reflect and Evaluate (6 items).
• Reading literacy used various text formats & types
• N = 32,704 from 55 countries
• Pairwise comparison: Australia vs. 54 countries
• CFA (1 factor), invariance testing; metric equivalence
Asil, M., & Brown, G. T. L. (2016). Comparing OECD PISA Reading in English to other languages: Identifying potential sources of non-invariance. International Journal of Testing, 16(1), 71–93. https://doi.org/10.1080/15305058.2015.1064431
Reject all countries above the line as differing by more than chance at N = 500 per country.
Accept all countries below the line as differences are trivial or small enough to be ignored.
Understanding the problem
• The PISA ESCS index “captures a range of aspects of a student’s family and home background that combines information on parents’ education and occupations and home possessions” (OECD, 2010c, p. 29).
• The relationship between ESCS and ΔCFI and dMACS was investigated for the 47 countries for which ESCS was available. There was a moderate negative relationship between ESCS and ΔCFI (r = –0.61, p < 0.05) and between ESCS and dMACS (r = –0.54, p < 0.05).
• Lower levels of ESCS tended to be associated with less equivalence to Australia and much greater effect sizes in the difference.
• Complex factors to do with educational practice and socioeconomic resourcing of education contribute to the non-invariance of the PISA reading comprehension results, with the impact seen most strongly among the poorer economies.
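The ESCS–non-invariance relationship is an ordinary Pearson correlation over country-level values. A minimal sketch with invented numbers (illustration only, not the study's data) shows the computation and the expected negative sign:

```python
import numpy as np

# Hypothetical country-level values: mean ESCS index and the dMACS
# non-invariance effect size relative to Australia (made-up numbers).
escs  = np.array([ 0.6,  0.4,  0.1, -0.2, -0.5, -0.9, -1.3])
dmacs = np.array([0.10, 0.12, 0.15, 0.18, 0.22, 0.27, 0.33])

# Pearson correlation: poorer economies (lower ESCS) show larger
# non-invariance, i.e. a negative r, as in Asil & Brown (2016).
r = np.corrcoef(escs, dmacs)[0, 1]
print(round(r, 2))
```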
Possible solution
• Group countries into clusters of "countries-like-me". For example:
• East Asian societies which use Mandarin (i.e., Singapore, China, Macao, and Taiwan) and have strong dependence on testing and public examinations and shared cultural approaches to schooling and testing.
• Possibly include other East Asian societies that have different writing scripts and languages but similar cultural histories and forces (i.e., Japan, Korea, and Hong Kong). The range of dMACS relative to Australia for all seven economies was just 0.136 to 0.199.
• Nordic countries (i.e., Finland, Sweden, Norway, Iceland, and Denmark).
• Continental Western European countries, because of their similarities in ESCS despite differences in language.
• Anglo-developed nations: Australia, New Zealand, Canada, USA, UK.
• Latin America; MENA; South-East Asia; Oceania; etc.
DO TESTS MEAN THE SAME THING?
Assessment means something different in eastern Asia
Chinese context
• Chinese culture has a long history of:
• Using examinations and tests to select and reward talent;
• Regarding high academic performance on high-stakes examinations as a legitimate, meritocratic basis for upward social mobility regardless of social background;
• Considering the person with high academic success as morally virtuous;
• Doing well on tests fulfills obligations to families.
Chinese context
• Chinese parents expect students to become better academically, attitudinally, and behaviourally through schooling, and will enforce such expectations with harsh authoritarian parenting practices
• (tiger/dragon mom?)
• Demand for higher education exceeds space available at fully funded institutions (25% in HK; 50% in PRC)
• The "Confucian-heritage learner" is a popular construct, but only superficially Confucian…
• PRC entrance to university is no longer simply based on gao kao scores. Non-academic criteria include:
• demonstrating right moral character (e.g., not participating in anti-government activities or protests),
• giving first choices to students who are resident near specific institutions,
• membership in a specified minority group,
• having a recommendation that permits bypassing the examination altogether,
• having economic resources to move to locations with lower entry standards or to simply buy access
Culture vs Context
Chinese teacher beliefs are not narrow
• Hui, 2012
• Brown & Gao, 2015
Chinese: assessing leads to personal development
• Chinese-TCoA
• Hong Kong + China teachers
• Improvement includes using assessment to help students develop personally
• Accountability includes controlling schools and teachers
• Improvement & Accountability are strongly connected
• But Irrelevance correlates with Accountability and is inverse to Improvement
• (Brown, Hui, Yu, & Kennedy, 2011)
Student Survey
• Large-scale survey of HK & PRC university students with a NEW Chinese Student Conceptions of Assessment Inventory
• Factors recovered:
• Confucian-heritage societies: Competition, Societal Use, Exam Accuracy, and Family Effects
• Jurisdictional differences in institutional practices and policies: Teacher Use, School Quality, Class Benefit, and Negative Effects
Models
[SEM path diagrams for PRC and Hong Kong not reproduced]
Mean Score Differences

C-SCoA(HE) Scale    HK Baseline (M, SD)   PRC Pre (M, SD)   PRC Post (M, SD)   F(2)    p       Cohen's d
Culturally-similar factors
Competition         3.86, 0.89            3.40, 0.88        3.69, 0.97         13.02   <.001   .50
Societal Use        3.65, 1.09            2.84, 0.86        2.98, 0.95         45.64   <.001   .82
Exam Accuracy       3.01, 0.87            2.54, 0.93        2.91, 1.00         13.88   <.001   .50
Family Effects      2.49, 0.97            2.43, 0.90        2.49, 0.93         0.27    .76     .06
Jurisdictional policy factors
Teacher Use         3.98, 0.89            3.50, 1.05        3.83, 1.07         12.57   <.001   .49
School Quality      3.63, 0.81            2.54, 0.89        3.39, 0.88         17.35   <.001   .56
Class Benefit       3.00, 0.90            3.09, 0.94        3.09, 1.03         0.65    .52     .09
Negative Effects    2.80, 0.75            2.39, 0.75        2.46, 0.82         19.13   <.001   .52
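The Cohen's d values in the table can be reproduced from the group means and SDs. A minimal sketch using an equal-n pooled-SD approximation (the study's exact pooling may weight by group size), applied to Societal Use for HK vs. PRC Pre:

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d with a pooled SD (equal-n approximation)."""
    pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)
    return (m1 - m2) / pooled_sd

# Societal Use: HK (M=3.65, SD=1.09) vs PRC Pre (M=2.84, SD=0.86)
d = cohens_d(3.65, 1.09, 2.84, 0.86)
print(d)  # close to the table's .82
```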
HK students higher in the GREEN
• Assessment
• Can help teachers and students know what to do next
• Being evaluated can help improve performance
• If you want high scores, copy the Chinese model of assessment
• 考考考，老师的法宝；分分分，学生的命根 (exam, exam, exam, the teacher's magic weapon; grades, grades, grades, the students' lifeblood)
• But be prepared for an unhappy, stressed life
• If testing matters so much in PRC, maybe that's why East Asia scores so high?
Implications
COUNTRY REPUTATION
Is effort the same when country reputation is at stake?
Test-taking motivation (TTM) matters
• TTM predicts performance
• It is possible that country-reputation consequences elicit different effort in different societies, rather than reflecting ability
• Students in Shanghai scored world-best on PISA in 2009 and 2012, but, given the importance of political-ideological education in China (Chen & Brown, 2018), that score MAY depend upon TTM effort for national reputation
Proposed Model
2-way Comparison
• Between-subjects experiment: 3 conditions
• The consequence of a hypothetical test (i.e., none, my country, or me personally)
• In Shanghai (n = 1003) and New Zealand (n = 479), jurisdictions that range from high to middle on PISA
• Instruments
• Four-factor Student Conceptions of Assessment (SCoA; Brown, 2008) inventory administered before random assignment to experimental condition
• TTM estimated for effort, the importance of the test, and anxiety about that test; adapted from Knekta and Eklöf (2015) and Thelk et al. (2009)
• Analysis
• Confirmatory factor analysis for measurement models; invariance testing between jurisdictions
• Structural equation modeling to examine relations between SCoA and TTM
Zhao, A., Brown, G. T. L., & Meissel, K. (2020). Manipulating the consequences of tests: How Shanghai teens react to different consequences. Educational Research and Evaluation, 26(5-6), 221–251. https://doi.org/10.1080/13803611.2021.1963938
Zhao, A., Brown, G. T. L., & Meissel, K. (2022). New Zealand students' test-taking motivation: An experimental study examining the effects of stakes. Assessment in Education: Principles, Policy & Practice, 29(4), 397–421. https://doi.org/10.1080/0969594X.2022.2101043
Measurement Model Results
• SCoA measurement model
• Bi-factor: four orthogonal specific factors and one general factor, dropping two items (i.e., sq1 and pe1); fit both the NZ and Shanghai data
• TTM measurement model
• Invariant within jurisdictions across consequence conditions,
• but only configural equivalence across jurisdictions
Structural Models
[Path diagrams for New Zealand and Shanghai not reproduced]
SCoA Validation
• The bi-factor structure of the SCoA replicated previous studies (Matos et al., 2019; Weekers et al., 2009).
• SCoA had a statistically significant but small impact on test-taking motivation, largely through the general factor, in both the Shanghai and New Zealand samples.
• In the Shanghai sample, the stronger the general conception of tests as improving learning, being related to external attributions, class climate, and personal emotions, the more students thought tests were important, the more anxiety they reported, and the more effort they exerted.
• In the New Zealand sample, students' general conception of tests was not a significant predictor of their reported anxiety.
TTM
• Regardless of condition, greater anxiety resulted in greater effort in both jurisdictions.
• Consistent with control-value theory, in that the negative emotion anxiety is an activating force for greater effort and performance (Pekrun et al., 2002).
• Non-invariance in the structural model between Shanghai and New Zealand was predominantly because of the country-at-stake condition, indicating students from the two contexts perceived this consequence quite differently.
Structural Equation Model Results
• Invariant within Shanghai across the three experimental conditions,
• but only configural invariance in New Zealand.
• Supports ecological rationality (Rieskamp & Reimer, 2007) in how students respond;
• that is, student TTM changes in New Zealand according to consequence, but does not in Shanghai.
Mean Score Differences by Jurisdiction & Condition (TTM only)
• Compared to no or personal stakes:
• NZ: big differences
• Shanghai: small to no differences
[Figure: Test Importance by condition & jurisdiction]
[Figures: difference in score by condition; impact of condition by jurisdiction on Effort and Anxiety]
Conclusion
• Students in high-performing societies might have tried harder on PISA than students elsewhere, because of the importance of national reputation within that jurisdiction.
• Motivational factors (i.e., Effort, Importance, and Anxiety) all increased as consequences shifted from none to country at stake and were highest under personal consequences.
• The margin of increase in motivational factors by stakes was trivial or small for Shanghai and moderate to large for New Zealand.
• Corroborates Gneezy et al. (2019) and Chen and Brown (2018): there may be no such thing as a low-consequence test in Shanghai insofar as test-taking motivation is concerned.
• Thus, PISA scores from Shanghai may reflect the continuing importance of being tested in that society as much as greater competence.
• Does a society permit students not to try or care?
SO WHAT?
ILSA do not compare well
• NZ has fallen in rank, but average performance within NZ has not changed.
• We are being beaten by test-fetish countries and by distorted samples.
• Push back against bad uses.
• Travel to visit more successful nations only if there is similarity in context.
• Use it if it suits internal politics, as Germany did 20 years ago.
• Focus on socio-economic development:
• Successful, healthy families produce kids who learn.