
The multiple meanings of test scores


(1)

The modern approach to statistical analysis of MCQ items: Quality testing

Prof. Gavin T. L. Brown ([email protected]). Presentation to the Computer Science Dept., June 15, 2018

(2)

The goal of a test score: Estimation

Quantity – How much of a specified domain do you know or can you do?
80–100% = A
65–79% = B
50–64% = C
<50% = D

Quality – How well do you know the content and skills you are supposed to learn?
A = Excellent
B = Good
C = Satisfactory
D = Unsatisfactory

(3)

Observed—What you actually get on a test

True—The score you are likely to get taking into account uncertainty and error in our estimation.

Ability—What you really are able to do or know of a domain independent of what’s in any one test

The multiple meanings of test scores

[Figure: a person's Ability and the True Score range around an observed score, located on one continuum.]

But what if the test was designed so that it didn't stretch the person's ability?

(4)

Answers are marked as right = 1, wrong = 0

Categorical/nominal labelling of each response

Tests use the sum of all answers marked right to indicate the total score

1,0,1,1,1,0,0,0,1,0 = 5/10: a continuous, numeric scale

What other factors might be at work and how do we account for them?

Difficulty of items, quality of distractors, probability of getting an item right despite low knowledge, varying ability of people

Scoring items
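A minimal sketch of this scoring rule (plain Python; the response vector is the example above):

```python
responses = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]  # nominal right/wrong marks per item
total = sum(responses)                       # total score: count of answers marked right
print(f"{total}/{len(responses)} = {total / len(responses):.0%}")  # 5/10 = 50%
```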

(5)

Students A, B and C all sit the same test

Test has ten items

All items are dichotomous (score 0 or 1)

All three students score 6 out of 10

What can we say about the ability of these three students?

Test Scores: The problem

(6)

Classical Test Theory (CTT)

Served test industry well for most of 20th century

Relatively weak theoretical assumptions – easy to meet

Simple frequency of observed correct rate, converted to %

Easy to apply – hand-calculated statistics, but …

Major limitations

Focus at TEST level

Item Response Theory (IRT)

‘Modern’ test theory

Strong theoretical assumptions … which may not always be met!

Estimation of the difficulty point at which the test-taker has a 50% chance of answering an item correctly

Requires computers to apply

Focus at ITEM level

Two Major Models

(7)

All items in a test are like bricks in a wall, held together by the mortar of the total impact of the test

All items have equal weight in making up test statistics and scores

Items only mean something in the context of the test they're in

Classical Approach

[Diagram: a TEST depicted as a wall built from item bricks.]

(8)

How hard is the item? Item Difficulty (p):

% of people who answered correctly

Mean correct across people is p

Applications:

For a generalised ability test, usually delete items that are too easy (p > .9) or too hard (p < .1)

For a mastery test, maximise the items that fit the difficulty of the cut score (e.g., p = .65, meaning 65% answer correctly)

Item Difficulty
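As a sketch, the CTT difficulty index is just the column mean of a 0/1 score matrix (the function name is mine; numpy assumed):

```python
import numpy as np

def item_difficulty(scores):
    """p per item: the proportion of examinees answering it correctly.
    scores: rows = examinees, columns = items, entries 0 or 1."""
    return np.asarray(scores, dtype=float).mean(axis=0)
```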

(9)

Who gets the item right? Item Discrimination

Correlation between each item and the total score without the item in it

If the item behaves like the rest of the test the correlation should be positive (i.e., students who do best on the test tend to get each item right more than people who get low total scores)

Ideally, look for values > .20

Beware items with negative or zero discrimination: the item might be testing a different construct, or the distractors might be poorly crafted

Item Discrimination (r_pb)
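A matching sketch for the discrimination index; as defined above it correlates each item with the total score computed without that item (set corrected=False for the plain item-total correlation):

```python
import numpy as np

def item_discrimination(scores, corrected=True):
    """Point-biserial correlation of each 0/1 item with the total score,
    by default with the item itself removed from the total."""
    X = np.asarray(scores, dtype=float)
    total = X.sum(axis=1)
    r = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        rest = total - X[:, j] if corrected else total
        r[j] = np.corrcoef(X[:, j], rest)[0, 1]
    return r
```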

(10)

Selecting Items for a Test: Using difficulty and discrimination

Student   Q1    Q2    Q3    Q4    Q5    Tot.
1          1     1     0     0     0     2
2          1     0     1     1     0     3
3          0     1     1     1     1     4
Diff p    .67   .67   .67   .67   .33
Disc r   -.87   .00   .87   .87   .87

All items are of acceptable difficulty

Poor items: Q1 (reverse discrimination), Q2 (zero discrimination)

Many more students are needed to have confidence in these estimates
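Running the two sketches above on this table reproduces the printed p row; the printed r row matches the uncorrected item-total correlation (corrected=False), and with only three students the values are extreme, which is exactly why more students are needed:

```python
scores = [[1, 1, 0, 0, 0],   # student 1, total 2
          [1, 0, 1, 1, 0],   # student 2, total 3
          [0, 1, 1, 1, 1]]   # student 3, total 4

print(item_difficulty(scores))                       # ≈ [.67 .67 .67 .67 .33]
print(item_discrimination(scores, corrected=False))  # ≈ [-.87 .00 .87 .87 .87]
```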

(11)

CTT Scores

[Table: ten items; students A, B, and C each answer 6 of the 10 correctly, so each scores 60%.]

Assumptions: each item is equally difficult and has equal weight towards the total score; the total score based on the sum of items correct is a good estimate of true ability

Inference: these students are equally able.

But what if the items aren't equally difficult?

(12)

All statistics for persons, items, and tests are sample dependent

Requires robust representative sampling (expensive, time-consuming, difficult)

Items have equal weight towards total score

Easy statistics to obtain & interpret for TESTS only

Has major limitations, however

Summary of CTT

(13)

Dependency on test and item statistics:

Indices of difficulty and discrimination are sample dependent – they change from sample to sample

Trait or ability estimates (test scores) are test dependent – they change from test to test

Comparisons require parallel tests or test equating – not a trivial matter

Reliability depends on the SEM, which is assumed to be of equal magnitude for all examinees – but we know the SEM is not constant across all examinees

Major limitations of CTT
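For reference, the standard CTT formula behind that last point, with s_X the standard deviation of observed scores and r_XX' the test's reliability:

\[ \mathrm{SEM} = s_X \sqrt{1 - r_{XX'}} \]

A single SEM is then attached to every examinee's observed score, which is precisely the equal-magnitude assumption criticised above.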

(14)

A rethink of items and tests in response to CTT's weaknesses

Often called “latent trait theory” – assumes the existence of an unobserved (latent) trait or ability which leads to consistent performance

Focus at the item level, not at the level of the test

Calculates parameters as estimates of population characteristics, not sample statistics

Item Response Theory (IRT)

(15)

All items are like the $ and ¢ in a bank

Different denominations have different purchasing power ($5 < $10 < $20 …)

All coins & bills are on the same scale

Items can be assembled into a test flexibly tailored to the ability or difficulty required AND reported on a common scale

All items in a test are a systematic sample of the domain

Items in a test have different weights in making up the test score

Item Response Theory

(16)

Probability of getting an item correct is a function of ability—Pi(θ)

as ability increases, the chance of getting the item correct increases

People and items can be located on one scale

Item statistics are invariant across groups of examinees

S-shaped (ogive) curves plot the relationship between the parameters

Multiple statistical models to account for components

IRT Assumptions

[Diagram: items of varying difficulty and people of varying ability located on a single scale running from −∞ to +∞.]

(17)

Probability of getting an item right

[Figure: item characteristic curve for item x – the chance of getting it right increases as ability increases relative to difficulty.]

Probability does NOT increase in linear fashion

Probability increases in relation to ability, difficulty, and chance factors in an ogive fashion

NB: both 1.00 and 0.00 probability rates are asymptotes

(18)

Difficulty:

the ability point at which the probability of getting it right is 50% (b)

Note: in the 1PL/Rasch model this is the ONLY parameter

Discrimination:

The slope of the curve at the difficulty point (a)

The 2PL model uses the a and b parameters

Pseudo-Chance:

The probability of getting it right when no TRUE ability exists (c)

The 3PL model uses a, b, and c parameters

Three IRT Parameters

(19)

IRT - the generalised model

\[ P_g(\theta) = c_g + (1 - c_g)\,\frac{e^{D a_g (\theta - b_g)}}{1 + e^{D a_g (\theta - b_g)}} \]

where
g = an item on the test
a_g = gradient of the ICC at its steepest point (item discrimination)
b_g = the ability level at which that slope is maximised (item difficulty)
c_g = probability of low-ability candidates correctly answering item g
e = base of the natural logarithm, ≈ 2.718
D = scaling factor that makes the logistic function close to the normal ogive, = 1.7

The Rasch model removes c, D, and a (fixing a = 1, c = 0, D = 1)

It creates an estimate of the ability point at which the student has a 50% chance of getting items correct

Note: answering more items correctly at the same difficulty point only increases the accuracy of the estimate, not the estimate of ability
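A runnable sketch of this family of models (numpy assumed; the function name is mine):

```python
import numpy as np

def p_correct(theta, b, a=1.0, c=0.0, D=1.7):
    """Generalised (3PL) probability of answering an item correctly.
    b = difficulty, a = discrimination, c = pseudo-chance, D = scaling.
    With a = 1, c = 0, D = 1 this collapses to the Rasch model."""
    z = D * a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

theta = np.array([-2.0, 0.0, 2.0])             # three ability levels
print(p_correct(theta, b=0.0))                 # 1PL with the D = 1.7 scaling
print(p_correct(theta, b=0.0, D=1.0))          # Rasch
print(p_correct(theta, b=0.0, a=1.2, c=0.25))  # 3PL: curve floors at c, not 0
```

Note that at θ = b the 3PL gives P = (1 + c)/2, so the "50% point" reading of b is exact only when c = 0.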

(20)

The score is on a scale with M = 0, SD = 1

Range is usually -3.00 to +3.00

So the score needs to be transformed, using judgment, to its proper location on the target scale

The relationship of NORMS and STANDARDS is knotty.

My point is that if you scratch a criterion-referenced interpretation, you will very likely find a norm-referenced set of assumptions underneath. (Angoff, 1974, p. 4)

Angoff, W. H. (1974). Criterion-referencing, norm-referencing, and the SAT (Research Memorandum RM-74-1). Princeton, NJ: Educational Testing Service.

Using an IRT Estimate
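A sketch of such a transformation; the anchor and spread constants below are hypothetical placeholders for whatever the target reporting scale requires:

```python
def to_reporting_scale(theta, anchor=500.0, spread=100.0):
    # Linear relocation of theta (M = 0, SD = 1) onto a reporting scale.
    # anchor and spread are judgment/policy choices, not sample statistics.
    return anchor + spread * theta

print(to_reporting_scale(-3.0), to_reporting_scale(0.0), to_reporting_scale(3.0))
# 200.0 500.0 800.0 – the usual -3..+3 range lands on 200..800
```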

(21)

Content of the test is located against the ability of the people

Judgments are made about what evidence each group of items provides concerning the QUALITY of learning

It is not how many you got right; it is what probability you have of getting items of this difficulty right

IRT person-item analysis

Dunne, T., Long, C., Craig, T., & Venter, E. (2012). Meeting the requirements of both classroom-based and systemic assessment of mathematics proficiency: The potential of Rasch measurement theory. Pythagoras, 33(3), Art. #19, doi:10.4102/pythagoras.v33i3.19

Where would you put UOA grades?

(22)

Equal scores – but what if the items are not equal … so?

Item:       1   2   3   4   5   6   7   8   9   10
Difficulty: -3  -2  -1  -1   0   0   1   1   2    3

[Table: A, B, and C each answer 6 of the 10 items correctly, so each scores 60%.]

(23)

CTT and IRT Test Scores Compared

Item:       1   2   3   4   5   6   7   8   9   10
Difficulty: -3  -2  -1  -1   0   0   1   1   2    3

[Table: each student answers 6 of the 10 items correctly]
     % correct   asTTle v4
A        60         530
B        60         545
C        60         593

Conclusions: C > A, B and B ≈ A, because C answered all the hardest items correctly; there is no penalty for skipping or getting easy items wrong
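To see how the response pattern, and not just the raw count, can move an IRT ability estimate, here is a grid-search maximum-likelihood sketch. The discrimination values are hypothetical: under a pure Rasch model (all discriminations equal) identical raw scores on the same items yield identical θ, so pattern effects like C's require item weighting as in the 2PL/3PL (or different test forms):

```python
import numpy as np

def icc(theta, a, b, D=1.7):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def theta_mle(resp, a, b, grid=np.linspace(-4, 4, 801)):
    # Log-likelihood of the whole response pattern at each grid point
    p = icc(grid[:, None], a[None, :], b[None, :])
    ll = (resp * np.log(p) + (1 - resp) * np.log(1.0 - p)).sum(axis=1)
    return grid[np.argmax(ll)]

b = np.array([-3, -2, -1, -1, 0, 0, 1, 1, 2, 3], dtype=float)
a = np.array([1, 1, 1, 1, 1, 1, 1.5, 1.5, 1.5, 1.5])  # hypothetical slopes

easiest_six = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # hypothetical pattern
hardest_six = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # hypothetical pattern
print(theta_mle(easiest_six, a, b), theta_mle(hardest_six, a, b))
# same raw score of 6, but the harder pattern earns the higher theta
```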

(24)

Two real University tests: we too have the same problem

Item characteristic curves show reverse or close-to-zero discrimination, especially in the mid-term test; most items in the final exam were good

Brown, G. T. L., & Abdulnabi, H. (2017). Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. Frontiers in Education, 2(24). doi:10.3389/feduc.2017.00024

When was the last time you checked that the items behaved properly even with CTT?

(25)

Academics will have to admit that knowing the content does not mean they can write discriminating items or hard questions.

Going from CTT to IRT means going from counting quantity to judging quality, based on probability-of-success estimates

If we write easy tests, high scores will not indicate EXCELLENCE, just satisfactory or good …

Exercising that judgment shifts scores downward: high-scoring students on easy tests will get lower marks … will they believe, trust, and accept?

The Politics of Change

(26)

IRT is used in New Zealand standardised tests for schools

e-asTTle & PATs

IRT is used in high-stakes selection examinations in the USA

GRE, GMAT, LSAT, etc.

IRT is used in international comparative testing systems

PISA, TIMSS, PIRLS, etc.

So why are universities so far behind the industry gold standard?

In operation

(27)

Maybe what we need

A tool to eliminate bad items and help teachers set quality grade boundaries

A Learning Enhancement Grant to build a tool that forces judgment of quality

We can demonstrate it if you want
