The modern approach to statistical analysis of MCQ
items: Quality testing
Prof. Gavin T L Brown [email protected] Presentation to the Computer Science Dept. June 15, 2018
The goal of a test score: Estimation
Quantity
◦ How much of a specified domain do you know or can you do?
◦ >80%=A
◦ 65-79%=B
◦ 50-64%=C
◦ <50%=D
Quality
◦ How well do you know the content and skills you are supposed to learn?
◦ A=Excellent
◦ B=Good
◦ C=Satisfactory
◦ D=Unsatisfactory
◦Observed—What you actually get on a test
◦True—The score you are likely to get taking into account uncertainty and error in our estimation.
◦Ability—What you really are able to do or know of a domain independent of what’s in any one test
The multiple meanings of test scores
[Diagram: a person's Ability, True Score range, and Observed score shown as nested ranges]
But what if the test was designed so that it didn't stretch the person's ability?
Answers are marked as right =1, wrong=0
◦ Categorical/Nominal labelling of each response
Tests use the sum of all answers marked as right to indicate total score
◦ 1,0,1,1,1,0,0,0,1,0 = 5/10: a continuous, numeric scale (see the sketch below)
What other factors might be at work and how do we account for them?
◦ Difficulty of items, quality of distractors, probability of getting an item right despite low knowledge, varying ability of people
Scoring items
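A minimal sketch of this scoring rule in Python (the response vector is the hypothetical 5/10 example above):

```python
# Dichotomous scoring: right = 1, wrong = 0; the test score is the sum of the vector.
responses = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical answers to 10 items

total = sum(responses)                        # 5
percent = 100 * total / len(responses)        # 50.0
print(f"Score: {total}/{len(responses)} ({percent:.0f}%)")
```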
Students A, B and C all sit the same test
Test has ten items
All items are dichotomous (score 0 or 1)
All three students score 6 out of 10
What can we say about the ability of these three students?
Test Scores: The problem
Classical Test Theory (CTT)
◦Served test industry well for most of 20th century
◦Relatively weak theoretical assumptions – easy to meet
◦Simple frequency of observed correct rate, converted to %
◦Easy to apply – hand-calculated statistics, but
◦Major limitations
◦Focus at TEST level
Item Response Theory (IRT)
◦‘Modern’ test theory
◦Strong theoretical assumptions … which may not always be met!
◦Estimates the point on the ability scale at which a test-taker has a 50% chance of answering items of that difficulty correctly
◦Requires computers to apply
◦Focus at ITEM level
Two Major Models
All items in test are like bricks in a wall held together by mortar of the total impact of the test
All items have equal weight in making up test statistics and scores
Items only mean something in context of the test they're in
Classical Approach
[Diagram: a TEST made up of its individual items]
Item Difficulty (p): How hard is the item?
◦ % of people who answered correctly
◦ Mean correct across people is p
◦ Applications (see the sketch below)
◦ A generalised ability test: usually delete items too easy (p>.9) or too hard (p<.1)
◦ A mastery test: maximise items that fit the difficulty of the cut score (e.g., p=.35, i.e., 65% answer incorrectly)
Item Difficulty
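A minimal sketch of the difficulty index, assuming numpy and a small hypothetical 0/1 response matrix (rows = students, columns = items); p is just the column mean, and items can then be screened against the p > .9 / p < .1 rule above:

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = students, columns = items
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 1, 0],
])

p = X.mean(axis=0)                    # item difficulty = proportion answering correctly
print(np.round(p, 2))                 # [1.   0.75 0.75 0.75 0.25]

# For a generalised ability test, flag items that are too easy or too hard
flagged = np.where((p > .9) | (p < .1))[0]
print("Items to review:", flagged)    # [0]: everyone answered item 0 correctly
```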
Item Discrimination: Who gets the item right?
◦ Correlation between each item and the total score without the item in it
If the item behaves like the rest of the test the correlation should be positive (i.e., students who do best on the test tend to get each item right more than people who get low total scores)
Ideally, look for values > .20
◦ Beware negative or zero discrimination items: the item might be testing a different construct, or the distractors might be poorly crafted
Item Discrimination (r_pb)
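A minimal sketch of the discrimination index, assuming numpy: correlate each 0/1 item column with the total score, optionally dropping the item from the total first (the corrected version described above). With corrected=False it returns the simple item-total correlation used in the worked example on the next slide.

```python
import numpy as np

def item_discrimination(X, corrected=True):
    """Point-biserial correlation of each 0/1 item with the total score.
    If corrected, the item itself is removed from the total before correlating."""
    X = np.asarray(X, dtype=float)
    total = X.sum(axis=1)
    r = []
    for j in range(X.shape[1]):
        rest = total - X[:, j] if corrected else total
        r.append(np.corrcoef(X[:, j], rest)[0, 1])
    return np.array(r)
```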
Selecting Items for Test:
Using difficulty and discrimination
Student    Q1    Q2    Q3    Q4    Q5   Tot.
1           1     1     0     0     0     2
2           1     0     1     1     0     3
3           0     1     1     1     1     4
Diff p    .67   .67   .67   .67   .33
Disc r   -.87   .00   .87   .87   .87
ITEMS
◦ All items acceptable difficulty
◦ Poor items: Q1 (reverse discrimination), Q2 (zero discrimination)
◦ Need many more students to have confidence in estimates
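For reference, a short numpy check of this table: the difficulty row is the column means, and the discrimination row is reproduced by the simple item-total correlation (item left in the total); the corrected, item-removed index gives lower values on a sample this small.

```python
import numpy as np

# Worked example above: 3 students x 5 items
X = np.array([[1, 1, 0, 0, 0],    # student 1, total 2
              [1, 0, 1, 1, 0],    # student 2, total 3
              [0, 1, 1, 1, 1]],   # student 3, total 4
             dtype=float)

total = X.sum(axis=1)
p = X.mean(axis=0)
r = np.array([np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])])
print(np.round(p, 2))   # [0.67 0.67 0.67 0.67 0.33]
print(np.round(r, 2))   # [-0.87  0.    0.87  0.87  0.87]
```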
CTT Scores
Item        1   2   3   4   5   6   7   8   9   10
% correct:  A = 60, B = 60, C = 60
Assumptions: Each item is equally difficult and has equal
weight towards total score. Total score based on sum of items correct is a good estimate of true ability
Inference: these students are equally able.
But what if the items aren’t equally difficult?
All statistics for persons, items, and tests are sample dependent
◦ Requires robust representative sampling (expensive, time consuming, difficult)
Items have equal weight towards total score
Easy statistics to obtain & interpret for TESTS only
Has major limitations, however
Summary of CTT
Dependency on test and item statistics
Indices of difficulty and discrimination are sample dependent
◦ change from sample to sample
Trait or ability estimates (test scores) are test dependent
◦ change from test to test
Comparisons require parallel tests or test equating – not a trivial matter
Reliability depends on SEM, which is assumed to be of equal magnitude for all examinees
◦ But we know SEM is not constant across all examinees
Major limitations of CTT
Rethink items and tests because of CTT weaknesses
Often called “latent trait theory” –
◦ assumes existence of unobserved (latent) trait OR ability which leads to consistent performance
Focus at item level, not at level of the test
Calculates parameters as estimates of population characteristics, not sample statistics
Item Response Theory (IRT)
All items are like dollars and cents in a bank
◦ Different denominations have different purchasing power ($5<$10<$20….)
◦ All coins & bills on same scale
Can be assembled into a test flexibly tailored to the ability or difficulty required AND reported on a common scale
All items in test are systematic sample of domain
All items in test have different weight in making up test score
Item Response Theory
Probability of getting an item correct is a function of ability—Pi(θ)
◦ as ability increases the chance of getting item correct increases
People and items can be located on one scale
Item statistics are invariant across groups of examinees
S-shaped curves (ogive) plot relationship of parameters
Multiple statistical models to account for components
IRT Assumptions
Items of varying difficulty
People of varying ability
[Figure: items and people located on a single scale running from -∞ to +∞; the probability of getting an item right rises as ability increases relative to difficulty]
Probability does NOT increase in linear fashion
Probability increases in relation to ability, difficulty, and chance factors in an ogive fashion
[Figure: ogive-shaped item characteristic curve for an example item x]
NB: both the 1.00 and 0.00 probability levels are asymptotes
Difficulty:
◦ the ability point at which the probability of getting it right is 50% (b)
◦ Note: in 1PL/Rasch this is the ONLY parameter
Discrimination:
◦ The slope of the curve at the difficulty point (a)
◦ The 2PL model uses the a and b parameters
Pseudo-Chance:
◦ The probability of getting it right when no TRUE ability exists (c)
◦ The 3PL model uses a, b, and c parameters
Three IRT Parameters
P_g(\theta) = c_g + (1 - c_g)\frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}}
IRT - the generalised model
Where
◦ g = an item on the test
◦ a_g = gradient (slope) of the ICC at the point b_g (item discrimination)
◦ b_g = the ability level at which the gradient is maximised (item difficulty)
◦ c_g = probability of low-ability candidates correctly answering item g (pseudo-chance)
◦ e = base of the natural logarithm, ≈ 2.718
◦ D = scaling factor to make the logistic function close to the normal ogive, = 1.7
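A minimal sketch of the generalised (3PL) model as code, using only the Python standard library; the parameter values in the example calls are illustrative:

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of answering an item correctly at ability theta, given
    discrimination a, difficulty b, and pseudo-chance c."""
    z = D * a * (theta - b)
    return c + (1 - c) / (1 + math.exp(-z))    # = c + (1 - c) * e**z / (1 + e**z)

def p_rasch(theta, b):
    """Rasch / 1PL special case: c = 0, a = 1, D dropped."""
    return 1 / (1 + math.exp(-(theta - b)))

print(round(p_rasch(0.0, b=0.0), 2))                    # 0.5: ability equals difficulty
print(round(p_correct(0.0, a=1.2, b=0.0, c=0.2), 2))    # 0.6: the chance floor lifts the curve
```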
The Rasch (1PL) model removes the c, D, and a parameters (a is effectively fixed at 1 for every item)
Creates an estimate of the point at which the student has a 50% chance of getting items correct
Note: answering more items correctly at the same difficulty point only increases the precision of the estimate, not the ability estimate itself
The score is on a scale with M=0, SD=1
◦ Range is usually -3.00 to +3.00
◦ So the score needs to be transformed, using judgment, to its proper location on the target reporting scale (illustrated below)
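As an illustration of that transformation (the constants below are hypothetical, not the actual e-asTTle scaling), the logit estimate is typically mapped onto a reporting scale with a simple linear rule, and grade boundaries are then set by judgment:

```python
# Hypothetical linear rescaling of an IRT ability estimate (M = 0, SD = 1).
# The reporting mean of 500 and spread of 100 are illustrative only.
def to_reporting_scale(theta, mean=500, sd=100):
    return mean + sd * theta

for theta in (-1.5, 0.0, 1.5):
    print(f"theta {theta:+.1f} -> {to_reporting_scale(theta):.0f}")   # 350, 500, 650
```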
The relationship of NORMS and STANDARDS is knotty.
◦ "My point is that if you scratch a criterion-referenced interpretation, you will very likely find a norm-referenced set of assumptions underneath." (Angoff, 1974, p. 4)
Angoff, W. H. (1974). Criterion-referencing, norm-referencing, and the SAT (Research Memorandum RM-74-1). Princeton, NJ: Educational Testing Service.
Using an IRT Estimate
Content of test located vs. ability of people
Judgments made about what evidence each group of items provides concerning the QUALITY of learning
◦ Not how many you got right
◦ It is what probability you have of getting items of this difficulty right
IRT person-item analysis
Dunne, T., Long, C., Craig, T., & Venter, E. (2012). Meeting the requirements of both classroom-based and systemic assessment of mathematics proficiency: The potential of Rasch measurement theory. Pythagoras, 33(3), Art. #19. doi:10.4102/pythagoras.v33i3.19
Where would you put UOA grades?
Equal scores but what if the items are not equal….so?
Item         1    2    3    4    5    6    7    8    9   10
Difficulty  -3   -2   -1   -1    0    0    1    1    2    3
% correct:   A = 60, B = 60, C = 60
CTT and IRT Test Scores Compared
Item         1    2    3    4    5    6    7    8    9   10
Difficulty  -3   -2   -1   -1    0    0    1    1    2    3

            % correct   asTTle v4
A               60         530
B               60         545
C               60         593
Conclusions: C > A and B, with B ≈ A, because C answered all the hardest items correctly; there is no penalty for skipping or getting easy items wrong
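A hedged sketch of why the answer pattern can matter: the toy 2PL grid-search estimate below (not the asTTle v4 engine; the item discriminations are hypothetical) gives two students the same raw score of 6/10 but a higher ability estimate to the one who succeeds on the harder, more discriminating items, echoing student C above.

```python
import math

# Toy 2PL illustration only. Item difficulties echo the slide; discriminations are hypothetical.
b = [-3, -2, -1, -1, 0, 0, 1, 1, 2, 3]
a = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 1.5, 1.2, 1.0]

easy_six = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # six easiest items correct
hard_six = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # six hardest items correct (cf. student C)

def p(theta, a_i, b_i):
    return 1 / (1 + math.exp(-1.7 * a_i * (theta - b_i)))

def log_lik(theta, pattern):
    return sum(x * math.log(p(theta, ai, bi)) + (1 - x) * math.log(1 - p(theta, ai, bi))
               for x, ai, bi in zip(pattern, a, b))

def mle_theta(pattern):
    grid = [i / 100 for i in range(-400, 401)]            # crude grid search over theta
    return max(grid, key=lambda t: log_lik(t, pattern))

print(mle_theta(easy_six), mle_theta(hard_six))           # the 'hard' pattern earns the higher theta
```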
2 real University Tests: We too have the same problem
Item Characteristic Curves show reverse or close-to-zero discrimination, especially in the mid-term test; most items in the final exam were good
Brown, G. T. L., & Abdulnabi, H. (2017). Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. Frontiers in Education, 2(24). doi:10.3389/feduc.2017.00024
When was the last time you checked that the items behaved properly even with CTT?
Academics will have to admit that knowing the content does not mean they can write discriminating items or hard questions.
Going from CTT to IRT means going from counting quantity to judging quality based on probability of success estimates
If we write easy tests, high scores will not indicate EXCELLENCE, just satisfactory or good….
◦ This changes scores downward if we exercise that judgment
◦ High-scoring students on easy tests will get lower marks… will they believe, trust, and accept it?
The Politics of Change
IRT is used in New Zealand standardised tests for schools
◦ e-asTTle & PATs
IRT is used in high-stakes selection examinations in the USA
◦ GRE, GMAT, LSAT, etc.
IRT is used in international comparative testing systems
◦ PISA, TIMSS, PIRLS, etc.
So why are universities so far behind the industry gold standard?
In operation
Maybe what we need
• A tool to eliminate bad items and help teachers set quality grade boundaries
• Learning Enhancement Grant to build a tool to force judgment of quality
• We can demonstrate that if you want