The modern approach to statistical analysis of MCQ
items: Quality testing
Prof. Gavin T L Brown [email protected] Presentation to the Computer Science Dept. June 15, 2018
The goal of a test score: Estimation
Quantity
◦ How much of a specified domain do you know or can you do?
◦ >80%=A
◦ 65-79%=B
◦ 50-64%=C
◦ <50%=D
Quality
◦ How well do you know the content and skills you are supposed to learn?
◦ A=Excellent
◦ B=Good
◦ C=Satisfactory
◦ D=Unsatisfactory
◦Observed—What you actually get on a test
◦True—The score you are likely to get taking into account uncertainty and error in our estimation.
◦Ability—What you really are able to do or know of a domain independent of what’s in any one test
The multiple meanings of test scores
[Diagram: a person's Ability, True Score range, and Observed score shown as nested ranges]
But what if the test was designed so that it didn't stretch the person's ability?
Answers are marked as right =1, wrong=0
◦ Categorical/Nominal labelling of each response
Tests use the sum of all answers marked as right to indicate total score
◦ 1,0,1,1,1,0,0,0,1,0 = 5/10: a continuous, numeric scale (see the sketch below)
What other factors might be at work and how do we account for them?
◦ Difficulty of items, quality of distractors, probability of getting an item right despite low knowledge, varying ability of people
Scoring items
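A minimal sketch of this scoring rule in Python (the response vector is the hypothetical 5/10 example above):

```python
# Dichotomous scoring: right = 1, wrong = 0; the test score is the sum of the vector.
responses = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical answers to 10 items

total = sum(responses)                        # 5
percent = 100 * total / len(responses)        # 50.0
print(f"Score: {total}/{len(responses)} ({percent:.0f}%)")
```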
Students A, B and C all sit the same test
Test has ten items
All items are dichotomous (score 0 or 1)
All three students score 6 out of 10
What can we say about the ability of these three students?
Test Scores: The problem
Classical Test Theory (CTT)
◦Served test industry well for most of 20th century
◦Relatively weak theoretical assumptions – easy to meet
◦Simple frequency of observed correct rate, converted to %
◦Easy to apply – hand-calculated statistics, but
◦Major limitations
◦Focus at TEST level
Item Response Theory (IRT)
◦‘Modern’ test theory
◦Strong theoretical assumptions … which may not always be met!
◦Estimates the point on the ability scale at which a test-taker has a 50% chance of answering items of that difficulty correctly
◦Requires computers to apply
◦Focus at ITEM level
Two Major Models
All items in test are like bricks in a wall held together by mortar of the total impact of the test
All items have equal weight in making up test statistics and scores
Items only mean something in context of the test they're in
Classical Approach
[Diagram: a TEST made up of its individual items]
Item Difficulty (p): How hard is the item?
◦ % of people who answered correctly
◦ Mean correct across people is p
◦ Applications (see the sketch below)
◦ A generalised ability test: usually delete items too easy (p>.9) or too hard (p<.1)
◦ A mastery test: maximise items that fit the difficulty of the cut score (e.g., p=.35, i.e., 65% answer incorrectly)
Item Difficulty
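A minimal sketch of the difficulty index, assuming numpy and a small hypothetical 0/1 response matrix (rows = students, columns = items); p is just the column mean, and items can then be screened against the p > .9 / p < .1 rule above:

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = students, columns = items
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 0],
])

p = X.mean(axis=0)                    # item difficulty = proportion answering correctly
print(np.round(p, 2))                 # [1.   0.75 0.75 0.75 0.25]

# For a generalised ability test, flag items that are too easy or too hard
flagged = np.where((p > .9) | (p < .1))[0]
print("Items to review:", flagged)    # [0]: everyone answered item 0 correctly
```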
Item Discrimination: Who gets the item right?
◦ Correlation between each item and the total score without the item in it
If the item behaves like the rest of the test the correlation should be positive (i.e., students who do best on the test tend to get each item right more than people who get low total scores)
Ideally, look for values > .20
◦ Beware negative or zero discrimination items: the item might be testing a different construct, or the distractors might be poorly crafted
Item Discrimination (r_pb)
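A minimal sketch of the discrimination index, assuming numpy: correlate each 0/1 item column with the total score, optionally dropping the item from the total first (the corrected version described above). With corrected=False it returns the simple item-total correlation used in the worked example on the next slide.

```python
import numpy as np

def item_discrimination(X, corrected=True):
    """Point-biserial correlation of each 0/1 item with the total score.
    If corrected, the item itself is removed from the total before correlating."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    r = []
    for j in range(X.shape[1]):
        rest = total - X[:, j] if corrected else total
        r.append(np.corrcoef(X[:, j], rest)[0, 1])
    return np.array(r)
```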
Selecting Items for Test:
Using difficulty and discrimination
Student    Q1    Q2    Q3    Q4    Q5   Tot.
1           1     1     0     0     0     2
2           1     0     1     1     0     3
3           0     1     1     1     1     4
Diff p    .67   .67   .67   .67   .33
Disc r   -.87   .00   .87   .87   .87
ITEMS
◦ All items acceptable difficulty
◦ Poor items: Q1 (reverse discrimination), Q2 (zero discrimination)
◦ Need many more students to have confidence in estimates
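For reference, a short numpy check of this table: the difficulty row is the column means, and the discrimination row is reproduced by the simple item-total correlation (item left in the total); the corrected, item-removed index gives lower values on a sample this small.

```python
import numpy as np

# Worked example above: 3 students x 5 items
X = np.array([[1, 1, 0, 0, 0],    # student 1, total 2
              [1, 0, 1, 1, 0],    # student 2, total 3
              [0, 1, 1, 1, 1]],   # student 3, total 4
             dtype=float)

total = X.sum(axis=1)
p = X.mean(axis=0)
r = np.array([np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])])
print(np.round(p, 2))   # [0.67 0.67 0.67 0.67 0.33]
print(np.round(r, 2))   # [-0.87  0.    0.87  0.87  0.87]
```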
CTT Scores
Item        1   2   3   4   5   6   7   8   9   10
% correct:  A = 60, B = 60, C = 60
Assumptions: Each item is equally difficult and has equal
weight towards total score. Total score based on sum of items correct is a good estimate of true ability
Inference: these students are equally able.
But what if the items aren’t equally difficult?
All statistics for persons, items, and tests are sample dependent
◦ Requires robust representative sampling (expensive, time consuming, difficult)
Items have equal weight towards total score
Easy statistics to obtain & interpret for TESTS only
Has major limitations, however
Summary of CTT
Dependency on test and item statistics
Indices of difficulty and discrimination are sample dependent
◦ change from sample to sample
Trait or ability estimates (test scores) are test dependent
◦ change from test to test
Comparisons require parallel tests or test equating – not a trivial matter
Reliability depends on SEM, which is assumed to be of equal magnitude for all examinees
◦ But we know SEM is not constant across all examinees
Major limitations of CTT
Rethink items and tests because of CTT weaknesses
Often called “latent trait theory” –
◦ assumes existence of unobserved (latent) trait OR ability which leads to consistent performance
Focus at item level, not at level of the test
Calculates parameters as estimates of population characteristics, not sample statistics
Item Response Theory (IRT)
All items are like dollars and cents in a bank
◦ Different denominations have different purchasing power ($5<$10<$20….)
◦ All coins & bills on same scale
Can be assembled into a test flexibly tailored to the ability or difficulty required AND reported on a common scale
All items in test are systematic sample of domain
All items in test have different weight in making up test score
Item Response Theory
Probability of getting an item correct is a function of ability—Pi(θ)
◦ as ability increases the chance of getting item correct increases
People and items can be located on one scale
Item statistics are invariant across groups of examinees
S-shaped curves (ogive) plot relationship of parameters
Multiple statistical models to account for components
IRT Assumptions
Items of varying difficulty
People of varying ability
[Figure: items and people located on a single scale running from -∞ to +∞; the probability of getting an item right rises as ability increases relative to difficulty]
Probability does NOT increase in linear fashion
Probability increases in relation to ability, difficulty, and chance factors in an ogive fashion
[Figure: ogive-shaped item characteristic curve for an example item x]
NB: both the 1.00 and 0.00 probability levels are asymptotes
Difficulty:
◦ the ability point at which the probability of getting it right is 50% (b)
◦ Note: in 1PL/Rasch this is the ONLY parameter
Discrimination:
◦ The slope of the curve at the difficulty point (a)
◦ The 2PL model uses the a and b parameters
Pseudo-Chance:
◦ The probability of getting it right when no TRUE ability exists (c)
◦ The 3PL model uses a, b, and c parameters
Three IRT Parameters
P_g(\theta) = c_g + (1 - c_g)\frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}}
IRT - the generalised model
Where
◦ g = an item on the test
◦ a_g = gradient (slope) of the ICC at the point b_g (item discrimination)
◦ b_g = the ability level at which the gradient is maximised (item difficulty)
◦ c_g = probability of low-ability candidates correctly answering item g (pseudo-chance)
◦ e = base of the natural logarithm, ≈ 2.718
◦ D = scaling factor to make the logistic function close to the normal ogive, = 1.7
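A minimal sketch of the generalised (3PL) model as code, using only the Python standard library; the parameter values in the example calls are illustrative:

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of answering an item correctly at ability theta, given
    discrimination a, difficulty b, and pseudo-chance c."""
    z = D * a * (theta - b)
    return c + (1 - c) / (1 + math.exp(-z))    # = c + (1 - c) * e**z / (1 + e**z)

def p_rasch(theta, b):
    """Rasch / 1PL special case: c = 0, a = 1, D dropped."""
    return 1 / (1 + math.exp(-(theta - b)))

print(round(p_rasch(0.0, b=0.0), 2))                    # 0.5: ability equals difficulty
print(round(p_correct(0.0, a=1.2, b=0.0, c=0.2), 2))    # 0.6: the chance floor lifts the curve
```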
The Rasch (1PL) model removes the c, D, and a parameters (a is effectively fixed at 1 for every item)
Creates an estimate of the point at which the student has a 50% chance of getting items correct
Note: answering more items correctly at the same difficulty point only increases the precision of the estimate, not the ability estimate itself
The score is on a scale with M=0, SD=1
◦ Range is usually -3.00 to +3.00
◦ So the score needs to be transformed, using judgment, to its proper location on the target reporting scale (illustrated below)
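As an illustration of that transformation (the constants below are hypothetical, not the actual e-asTTle scaling), the logit estimate is typically mapped onto a reporting scale with a simple linear rule, and grade boundaries are then set by judgment:

```python
# Hypothetical linear rescaling of an IRT ability estimate (M = 0, SD = 1).
# The reporting mean of 500 and spread of 100 are illustrative only.
def to_reporting_scale(theta, mean=500, sd=100):
    return mean + sd * theta

for theta in (-1.5, 0.0, 1.5):
    print(f"theta {theta:+.1f} -> {to_reporting_scale(theta):.0f}")   # 350, 500, 650
```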
The relationship of NORMS and STANDARDS is knotty.
◦ "My point is that if you scratch a criterion-referenced interpretation, you will very likely find a norm-referenced set of assumptions underneath." (Angoff, 1974, p. 4)
Angoff, W. H. (1974). Criterion-referencing, norm-referencing, and the SAT (Research Memorandum RM-74-1). Princeton, NJ: Educational Testing Service.
Using an IRT Estimate
Content of test located vs. ability of people
Judgments made about what evidence each group of items provides concerning the QUALITY of learning
◦ Not how many you got right
◦ It is what probability you have of getting items of this difficulty right
IRT person-item analysis
Dunne, T., Long, C., Craig, T., & Venter, E. (2012). Meeting the requirements of both classroom-based and systemic assessment of mathematics proficiency: The potential of Rasch measurement theory. Pythagoras, 33(3), Art. #19. doi:10.4102/pythagoras.v33i3.19
Where would you put UOA grades?
Equal scores but what if the items are not equal….so?
Item         1    2    3    4    5    6    7    8    9   10
Difficulty  -3   -2   -1   -1    0    0    1    1    2    3
% correct:   A = 60, B = 60, C = 60
CTT and IRT Test Scores Compared
Item         1    2    3    4    5    6    7    8    9   10
Difficulty  -3   -2   -1   -1    0    0    1    1    2    3

            % correct   asTTle v4
A               60         530
B               60         545
C               60         593
Conclusions: C > A and B, with B ≈ A, because C answered all the hardest items correctly; there is no penalty for skipping or getting easy items wrong
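A hedged sketch of why the answer pattern can matter: the toy 2PL grid-search estimate below (not the asTTle v4 engine; the item discriminations are hypothetical) gives two students the same raw score of 6/10 but a higher ability estimate to the one who succeeds on the harder, more discriminating items, echoing student C above.

```python
import math

# Toy 2PL illustration only. Item difficulties echo the slide; discriminations are hypothetical.
b = [-3, -2, -1, -1, 0, 0, 1, 1, 2, 3]
a = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 1.5, 1.2, 1.0]

easy_six = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # six easiest items correct
hard_six = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # six hardest items correct (cf. student C)

def p(theta, a_i, b_i):
    return 1 / (1 + math.exp(-1.7 * a_i * (theta - b_i)))

def log_lik(theta, pattern):
    return sum(x * math.log(p(theta, ai, bi)) + (1 - x) * math.log(1 - p(theta, ai, bi))
               for x, ai, bi in zip(pattern, a, b))

def mle_theta(pattern):
    grid = [i / 100 for i in range(-400, 401)]            # crude grid search over theta
    return max(grid, key=lambda t: log_lik(t, pattern))

print(mle_theta(easy_six), mle_theta(hard_six))           # the 'hard' pattern earns the higher theta
```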
2 real University Tests: We too have the same problem
Item Characteristic Curves show reverse or close-to-zero discrimination, especially in the mid-term test; most items in the final exam were good
Brown, G. T. L., & Abdulnabi, H. (2017). Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. Frontiers in Education, 2(24). doi:10.3389/feduc.2017.00024
When was the last time you checked that the items behaved properly even with CTT?
Academics will have to admit that knowing the content does not mean they can write discriminating items or hard questions.
Going from CTT to IRT means going from counting quantity to judging quality based on probability of success estimates
If we write easy tests, high scores will not indicate EXCELLENCE, just satisfactory or good….
◦ This changes scores downward if we exercise that judgment
◦ High-scoring students on easy tests will get lower marks… will they believe, trust, and accept it?
The Politics of Change
IRT is used in New Zealand standardised tests for schools
◦ e-asTTle & PATs
IRT is used in high-stakes selection examinations in the USA
◦ GRE, GMAT, LSAT, etc.
IRT is used in international comparative testing systems
◦ PISA, TIMSS, PIRLS, etc.
So why are universities so far behind the industry gold standard?
In operation
Maybe what we need
• A tool to eliminate bad items and help teachers set quality grade boundaries
• Learning Enhancement Grant to build a tool to force judgment of quality
• We can demonstrate that if you want