INTERNATIONAL LARGE-SCALE ASSESSMENTS: A PROBLEM WE CAN IGNORE?
Presentation to the Department of Science Education, Faculty of Education, Chulalongkorn University, August 2023
Prof. Gavin T. L. Brown, FAPS, The University of Auckland; Bualuang ASEAN Chair Professor, Thammasat University
ILSA PROBLEM
AGENDA
Is China really #1?
Are PISA tests really comparable?
Assessment means something different in eastern Asia
Is effort the same when country reputation is at stake?
ILSA
• PISA, TIMSS, PIRLS, etc.
• Reports assume that greater performance is explained by greater ability rather than by motivation or cultural factors
• Between-country comparisons are made assuming the score is a pure and accurate measure
• But the importance of ILSAs at the student level differs across jurisdictions (Eklöf, Pavešič, & Grønmo, 2014) or contexts (Knekta, 2017)
APPLES WITH APPLES?
Is China really #1?
Sampling
• Top PISA results
• Singapore
• Shanghai
• Beijing
• Macau
• Hong Kong
• Can cities be compared with countries?
SELECTION BY HUKOU
• 2009–2012: Shanghai's household registration system (hukou) controls access to educational resources
• Migrant workers do not have Shanghai hukou
• No more than 6% of PISA participants were non-hukou migrant students, despite migrants being about 31% of all 15-year-olds living in Shanghai
Limited Sample
• Limited province selection:
• 2015: Beijing, Shanghai, Guangdong, Jiangsu
• 2018: Beijing, Shanghai, Jiangsu, Zhejiang
• What about the rest of China, where the rural poor live or where the less developed regions are?
• They are NOT reporting the nation.
• No one else cherry-picks like that.
Province     Population
Beijing      22M
Shanghai     25M
Jiangsu      85M
Zhejiang     65M
Guangdong    126M
CHINA        1,419M

2015 sample provinces: 258M/1,419M = 18%
2018 sample provinces: 197M/1,419M = 14%
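The coverage figures above follow directly from the slide's population numbers; a minimal sketch (populations rounded to millions, as listed):

```python
# Share of China's population covered by the provinces PISA sampled.
# Populations (millions) as listed on the slide.
populations = {
    "Beijing": 22, "Shanghai": 25, "Jiangsu": 85,
    "Zhejiang": 65, "Guangdong": 126,
}
china_total = 1419  # millions

samples = {
    2015: ["Beijing", "Shanghai", "Guangdong", "Jiangsu"],
    2018: ["Beijing", "Shanghai", "Jiangsu", "Zhejiang"],
}

for year, provinces in samples.items():
    covered = sum(populations[p] for p in provinces)
    print(year, f"{covered}M / {china_total}M = {covered / china_total:.0%}")
# 2015 258M / 1419M = 18%
# 2018 197M / 1419M = 14%
```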
Conclusion
Apples are not being compared to apples.
Ignore China's sample because it isn't fair.
TECHNICALLY INCOMPARABLE
Are PISA tests really comparable?
To compare scores
• Items need to be judged valid for each jurisdiction
• Items need to belong to the same factor structure (configural equivalence)
• The regression weights from the factor to each item should vary only by chance (metric equivalence)
• The regression intercepts (starting points) of each item on the factor should vary only by chance (scalar equivalence)
• Only then can factor scores be compared
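The stakes of metric equivalence can be illustrated with a small simulation (a minimal sketch with made-up loadings and intercepts, not PISA's actual scaling model): if even one item's loading differs between two groups, students of identical ability earn different expected scores, so score comparisons confound ability with measurement.

```python
import numpy as np

rng = np.random.default_rng(0)

# One latent ability distribution shared by both hypothetical groups.
ability = rng.normal(0.0, 1.0, size=100_000)

# Three-item factor model: expected item score = intercept + loading * ability.
intercepts = np.array([0.0, 0.1, -0.1])   # identical in both groups (scalar OK)
loadings_a = np.array([0.8, 0.7, 0.6])    # Group A
loadings_b = np.array([0.8, 0.7, 0.3])    # Group B: item 3 loads lower (metric violation)

def expected_total(loadings, ability):
    # Noise-free expected sum score for each person.
    return intercepts.sum() + loadings.sum() * ability

gap = expected_total(loadings_a, ability) - expected_total(loadings_b, ability)

# Among high-ability students the gap is systematic: Group B looks weaker
# even though abilities are identical - a pure measurement artifact.
print(round(gap[ability > 1].mean(), 2))
```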
PISA is NOT comparable
• Many previous studies report lack of measurement invariance (MI) in PISA tests:
• Arffman, I. (2010). Equivalence of translations in international reading literacy studies. Scandinavian Journal of Educational Research, 54(1), 37–59.
• He, J., & van de Vijver, F. (2012). Bias and equivalence in cross-cultural research. Online Readings in Psychology and Culture, 2(2). doi:10.9707/2307-0919.1111
• Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness. A new look at the PISA scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210–231.
• Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53(3), 315–333.
• Wetzel, E., & Carstensen, C. H. (2013). Linking PISA 2000 and PISA 2009: Implications of instrument design on measurement invariance. Psychological Test and Assessment Modeling, 55(2), 181–206.
• But PISA persists
PISA Reading 2009 Booklet 11
• 28 reading literacy items
• Multiple-choice items were scored 0 or 1; polytomous items ranged from 0 to 2.
• Reading processes measured were
• Access and Retrieve (11 items),
• Integrate and Interpret (11 items), and
• Reflect and Evaluate (6 items).
• Reading literacy used various text formats & types
• N = 32,704 from 55 countries
• Pairwise comparison: Australia vs. 54 countries
• CFA (1 factor), invariance testing; metric equivalence
Asil, M., & Brown, G. T. L. (2016). Comparing OECD PISA Reading in English to other languages: Identifying potential sources of non-invariance. International Journal of Testing, 16(1), 71–93. https://doi.org/10.1080/15305058.2015.1064431
Reject all countries above the line as differing by more than chance at N = 500 per country.
Accept all countries below the line as differences are trivial or small enough to be ignored.
Understanding the problem
• The PISA ESCS index “captures a range of aspects of a student’s family and home background that combines information on parents’ education and occupations and home possessions” (OECD, 2010c, p. 29).
• The relationship between ESCS and ΔCFI and dMACS was investigated for the 47 countries for which ESCS was available. There was a moderate negative relationship between ESCS and ΔCFI (r = –0.61, p < 0.05) and between ESCS and dMACS (r = –0.54, p < 0.05).
• Lower levels of ESCS tended to be associated with less equivalence to Australia and much greater effect sizes in the difference.
• Complex factors to do with educational practice and socioeconomic resourcing of education contribute to the non-invariance of the PISA reading comprehension results, with the impact seen most strongly among the poorer economies.
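The ESCS–non-invariance relationship is an ordinary Pearson correlation over country-level values. A minimal sketch with invented numbers (illustration only, not the study's data) shows the computation and the expected negative sign:

```python
import numpy as np

# Hypothetical country-level values: mean ESCS index and the dMACS
# non-invariance effect size relative to Australia (made-up numbers).
escs  = np.array([ 0.6,  0.4,  0.1, -0.2, -0.5, -0.9, -1.3])
dmacs = np.array([0.10, 0.12, 0.15, 0.18, 0.22, 0.27, 0.33])

# Pearson correlation: poorer economies (lower ESCS) show larger
# non-invariance, i.e. a negative r, as in Asil & Brown (2016).
r = np.corrcoef(escs, dmacs)[0, 1]
print(round(r, 2))
```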
Possible solution
• Group countries into clusters of "countries-like-me". For example:
• East Asian societies which use Mandarin (i.e., Singapore, China, Macao, and Taiwan) and have strong dependence on testing and public examinations and shared cultural approaches to schooling and testing.
• Possibly include other East Asian societies that have different writing scripts and languages but similar cultural histories and forces (i.e., Japan, Korea, and Hong Kong). The range of dMACS relative to Australia for all seven economies was just 0.136 to 0.199.
• Nordic countries (i.e., Finland, Sweden, Norway, Iceland, and Denmark).
• Continental Western European countries, because of their similarities in ESCS despite differences in language.
• Anglo-developed nations: Australia, New Zealand, Canada, USA, UK.
• Latin America; MENA; South-East Asia; Oceania; etc.
DO TESTS MEAN THE SAME THING?
Assessment means something different in eastern Asia
Chinese context
• Chinese culture has a long history of:
• Using examinations and tests to select and reward talent;
• Regarding high academic performance on high-stakes examinations as a legitimate, meritocratic basis for upward social mobility regardless of social background;
• Considering the person with high academic success as morally virtuous;
• Doing well on tests fulfills obligations to families.
Chinese context
• Chinese parents expect students to become better academically, attitudinally, and behaviourally through schooling, and will enforce such expectations with harsh authoritarian parenting practices
• (tiger/dragon mom?)
• Demand for higher education exceeds space available at fully funded institutions (25% in HK; 50% in PRC)
• The "Confucian-heritage learner" is a popular construct, but only superficially Confucian…
• PRC entrance to university is no longer simply based on gao kao scores. Non-academic criteria include:
• demonstrating right moral character (e.g., not participating in anti-government activities or protests),
• giving first choices to students who are resident near specific institutions,
• membership in a specified minority group,
• having a recommendation that permits bypassing the examination altogether,
• having economic resources to move to locations with lower entry standards or to simply buy access
Culture vs Context
Chinese teacher beliefs are not narrow
• Hui, 2012
• Brown & Gao, 2015
Chinese: assessing leads to personal development
• Chinese-TCoA
• Hong Kong + China teachers
• Improvement includes using assessment to help students develop personally
• Accountability includes controlling schools and teachers
• Improvement & Accountability are strongly connected
• But Irrelevance correlates with Accountability and is inverse to Improvement
• (Brown, Hui, Yu, & Kennedy, 2011)
Student Survey
• Large-scale survey of HK & PRC university students with a NEW Chinese Student Conceptions of Assessment Inventory
• Factors recovered:
• Confucian-heritage societies: Competition, Societal Use, Exam Accuracy, and Family Effects
• Jurisdictional differences in institutional practices and policies: Teacher Use, School Quality, Class Benefit, and Negative Effects
Models
[SEM path diagrams for PRC and Hong Kong not reproduced]
Mean Score Differences

C-SCoA(HE) Scale    HK Baseline (M, SD)   PRC Pre (M, SD)   PRC Post (M, SD)   F(2)    p       Cohen's d
Culturally-similar factors
Competition         3.86, 0.89            3.40, 0.88        3.69, 0.97         13.02   <.001   .50
Societal Use        3.65, 1.09            2.84, 0.86        2.98, 0.95         45.64   <.001   .82
Exam Accuracy       3.01, 0.87            2.54, 0.93        2.91, 1.00         13.88   <.001   .50
Family Effects      2.49, 0.97            2.43, 0.90        2.49, 0.93         0.27    .76     .06
Jurisdictional policy factors
Teacher Use         3.98, 0.89            3.50, 1.05        3.83, 1.07         12.57   <.001   .49
School Quality      3.63, 0.81            2.54, 0.89        3.39, 0.88         17.35   <.001   .56
Class Benefit       3.00, 0.90            3.09, 0.94        3.09, 1.03         0.65    .52     .09
Negative Effects    2.80, 0.75            2.39, 0.75        2.46, 0.82         19.13   <.001   .52
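The Cohen's d values in the table can be reproduced from the group means and SDs. A minimal sketch using an equal-n pooled-SD approximation (the study's exact pooling may weight by group size), applied to Societal Use for HK vs. PRC Pre:

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d with a pooled SD (equal-n approximation)."""
    pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)
    return (m1 - m2) / pooled_sd

# Societal Use: HK (M=3.65, SD=1.09) vs PRC Pre (M=2.84, SD=0.86)
d = cohens_d(3.65, 1.09, 2.84, 0.86)
print(d)  # close to the table's .82
```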
HK students higher in the GREEN
• Assessment
• Can help teachers and students know what to do next
• Being evaluated can help improve performance
• If you want high scores, copy the Chinese model of assessment
• 考考考，老师的法宝；分分分，学生的命根 (exam, exam, exam, the teacher's magic weapon; grades, grades, grades, the students' lifeblood)
• But be prepared for an unhappy, stressed life
• If testing matters so much in PRC, maybe that's why East Asia scores so high?
Implications
COUNTRY REPUTATION
Is effort the same when country reputation is at stake?
Test-taking motivation (TTM) matters
• TTM predicts performance
• It is possible that country-reputation consequences elicit different effort in different societies, rather than reflecting ability
• Students in Shanghai scored world-best on PISA in 2009 and 2012, but, given the importance of political-ideological education in China (Chen & Brown, 2018), that score MAY depend upon TTM effort for national reputation
Proposed Model
2-way Comparison
• Between-subjects experiment: 3 conditions
• The consequence of a hypothetical test (i.e., none, my country, or me personally)
• In Shanghai (n = 1003) and New Zealand (n = 479), jurisdictions that range from high to middle on PISA
• Instruments
• Four-factor Student Conceptions of Assessment (SCoA; Brown, 2008) inventory administered before random assignment to experimental condition
• TTM estimated for effort, the importance of the test, and anxiety about that test; adapted from Knekta and Eklöf (2015) and Thelk et al. (2009)
• Analysis
• Confirmatory factor analysis for measurement models; invariance testing between jurisdictions
• Structural equation modeling to examine relations between SCoA and TTM
Zhao, A., Brown, G. T. L., & Meissel, K. (2020). Manipulating the consequences of tests: How Shanghai teens react to different consequences. Educational Research and Evaluation, 26(5-6), 221–251. https://doi.org/10.1080/13803611.2021.1963938
Zhao, A., Brown, G. T. L., & Meissel, K. (2022). New Zealand students' test-taking motivation: An experimental study examining the effects of stakes. Assessment in Education: Principles, Policy & Practice, 29(4), 397–421. https://doi.org/10.1080/0969594X.2022.2101043
Measurement Model Results
• SCoA measurement model
• Bi-factor: four orthogonal specific factors and one general factor, dropping two items (i.e., sq1 and pe1); fit both the NZ and Shanghai data
• TTM measurement model
• Invariant within jurisdictions across consequence conditions,
• but only configural equivalence across jurisdictions
Structural Models
[Path diagrams for New Zealand and Shanghai not reproduced]
SCoA Validation
• The bi-factor structure of the SCoA replicated previous studies (Matos et al., 2019; Weekers et al., 2009).
• SCoA had a statistically significant but small impact on test-taking motivation, largely through the general factor, in both the Shanghai and New Zealand samples.
• In the Shanghai sample, the stronger the general conception of tests as improving learning, being related to external attributions, class climate, and personal emotions, the more students thought tests were important, the more anxiety they reported, and the more effort they exerted.
• In the New Zealand sample, students' general conception of tests was not a significant predictor of their reported anxiety.
TTM
• Regardless of condition, greater anxiety resulted in greater effort in both jurisdictions.
• Consistent with control-value theory, in that the negative emotion anxiety is an activating force for greater effort and performance (Pekrun et al., 2002).
• Non-invariance in the structural model between Shanghai and New Zealand was predominantly because of the country-at-stake condition, indicating students from the two contexts perceived this consequence quite differently.
Structural Equation Model Results
• Invariant within Shanghai across the three experimental conditions,
• but only configural invariance in New Zealand.
• Supports ecological rationality (Rieskamp & Reimer, 2007) in how students respond;
• that is, student TTM changes in New Zealand according to consequence, but does not in Shanghai.
Mean Score Differences by Jurisdiction & Condition (TTM only)
• Compared to no or personal stakes:
• NZ: big differences
• Shanghai: small to no differences
[Figure: Test Importance by condition & jurisdiction]
[Figures: difference in score by condition; impact of condition by jurisdiction on Effort and Anxiety]
Conclusion
• Students in high-performing societies might have tried harder on PISA than students elsewhere, because of the importance of national reputation within that jurisdiction.
• Motivational factors (i.e., Effort, Importance, and Anxiety) all increased as consequences shifted from none to country at stake and were highest under personal consequences.
• The margin of increase in motivational factors by stakes was trivial or small for Shanghai and moderate to large for New Zealand.
• Corroborates Gneezy et al. (2019) and Chen and Brown (2018): there may be no such thing as a low-consequence test in Shanghai insofar as test-taking motivation is concerned.
• Thus, PISA scores from Shanghai may reflect the continuing importance of being tested in that society as much as greater competence.
• Does a society permit students not to try or care?
SO WHAT?
ILSA do not compare well
• NZ has fallen in rank, but average performance within NZ has not changed.
• We are being beaten by test-fetish countries and by distorted samples.
• Push back against bad uses.
• Travel to visit more successful nations only if there is similarity in context.
• Use it if it suits internal politics, as Germany did 20 years ago.
• Focus on socio-economic development:
• Successful, healthy families produce kids who learn.