Now With Users Involved

(1)

EMPIRICAL EVALUATION

Introduction to Empirical Evaluation

Experiment Design

Participation, IRB, and Ethics

Data Collection

(2)

Why Evaluate?

Recall:

• Users and their tasks were identified

• Needs and requirements were specified

• Interface was designed, prototype built

• But is it any good? Does the system support the users in their tasks? Is it better than what was there before (if anything)?

Types of Evaluation

• Interpretive and Predictive (a reminder)

(3)

Now With Users Involved

• Interpretive (naturalistic) vs.

Empirical:

• Naturalistic

– In realistic setting, usually

includes some detached

observation, careful study of

users

• Empirical

– People use system, manipulate

independent variables and

(4)

Why Gather Data?

• Design the experiment to collect the data to test the hypotheses to evaluate the interface to refine the design

• Information gathered can be: objective or subjective

• Information also can be:

(5)

Conducting an Experiment

• Determine the TASK

• Determine the performance

measures

• Develop the experiment

• IRB approval

• Recruit participants

• Collect the data

• Inspect & analyze the data

• Draw conclusions to resolve design

problems

(6)

The Task

• Benchmark tasks - gather quantitative

data

• Representative tasks - add breadth,

can help understand process

• Tell them what to do, not how to do it

• Issues:

– Lab testing vs. field testing

– Validity - typical users; typical tasks; typical setting?

(7)

“Benchmark” Tasks

• Specific, clearly stated task for

users to carry out

• Example: Email handler

– “Find the message from Mary

and reply with a response of

‘Tuesday morning at 11’.”

(8)

Defining Performance

• Based on the task

• Specific, objective

measures/metrics

• Examples:

– Speed (reaction time, time to

complete)

– Accuracy (errors, hits/misses)

– Production (number of files

processed)

(9)

Types of Variables

• Independent

– What you’re studying, what you

intentionally vary (e.g., interface

feature, interaction device,

selection technique)

• Dependent

(10)

“Controlling” Variables

• Prevent a variable from affecting

the results in any systematic way

• Methods of controlling for a

variable:

– Don’t allow it to vary

• e.g., all males

– Allow it to vary randomly

• e.g., randomly assign participants to different groups

– Counterbalance - systematically vary it

• e.g., equal number of males, females in each group
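The last two methods can be combined. The following is a minimal Python sketch, not a prescribed procedure: it shuffles participants at random within each sex and then deals them out evenly, so every group ends up with the same number of males and females. The function name, the `(id, sex)` record format, and the condition labels are assumptions made for illustration.

```python
import random

def assign_counterbalanced(participants, conditions, seed=0):
    """Randomly assign participants to conditions, balancing sex across groups.

    `participants` is a list of (participant_id, sex) pairs (hypothetical format).
    """
    rng = random.Random(seed)              # fixed seed keeps the assignment reproducible
    groups = {c: [] for c in conditions}
    by_sex = {}
    for pid, sex in participants:
        by_sex.setdefault(sex, []).append(pid)
    for sex, pids in by_sex.items():
        rng.shuffle(pids)                  # random order within each sex
        for i, pid in enumerate(pids):
            # deal out round-robin so each condition gets an equal share of this sex
            groups[conditions[i % len(conditions)]].append((pid, sex))
    return groups

# Eight hypothetical participants: odd-numbered are male, even-numbered are female
participants = [(f"P{i}", "M" if i % 2 else "F") for i in range(1, 9)]
groups = assign_counterbalanced(participants, ["color", "bw"])
for condition, members in groups.items():
    print(condition, members)
```

Who lands in which group is left to chance; only the male/female split per group is fixed.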

(11)

Hypotheses

• What you predict will happen

• More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)

• “Null” hypothesis (H₀)

– Stating that there will be no effect

– e.g., “There will be no difference in performance between the two groups”

– Data used to try to disprove this null

(12)

Example

• Do people complete operations

faster with a black-and-white

display or a color one?

– Independent - display type (color or b/w)

– Dependent - time to complete task (minutes)

– Controlled variables - same number of males and females in each group

– Hypothesis: Time to complete the task will be shorter for users with color

display

(13)

Experimental Designs

• Within Subjects Design

– Every participant provides a

score for all levels or conditions

        Color      B/W
P1      12 secs.   17 secs.
P2      19 secs.   15 secs.
P3      13 secs.   21 secs.
...

• Between Subjects

– Each participant provides results

for only one condition

Color B/ W

(14)

Within Subjects Designs

• More efficient:

– Each subject gives you more data - they complete more “blocks” or “sessions”

• More statistical “power”:

– Each person is their own control

• Therefore, can require fewer

participants

• May mean more complicated

design to avoid “order effects”
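One common way to handle order effects in a within-subjects design is to counterbalance the order of conditions across participants. The sketch below assumes full counterbalancing (every possible ordering is used) and uses hypothetical participant IDs and condition names.

```python
from itertools import permutations

def counterbalanced_orders(participant_ids, conditions):
    """Cycle through every ordering of the conditions so each order is used equally often."""
    orders = list(permutations(conditions))
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}

schedule = counterbalanced_orders(["P1", "P2", "P3", "P4"], ["color", "bw"])
for pid, order in schedule.items():
    print(pid, "->", " then ".join(order))
```

With two conditions, half the participants see color first and half see black-and-white first, so learning and fatigue affect both conditions about equally.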

(15)

Between Subjects Designs

• Fewer order effects

– Participant may learn from first

condition

– Fatigue may make second

performance worse

• Simpler design & analysis

• Easier to recruit participants

(only one session)

(16)

IRB, Participants, & Ethics

• Institutional Review Board (IRB)

– http://www.osp.gatech.edu/compliance.htm

• Reviews all research involving human (or animal) participants

• Safeguarding the participants, and thereby the researcher and university

• Not a science review (i.e., not to assess your research ideas); only safety & ethics

• Complete Web-based forms, submit research summary, sample consent forms, etc.

(17)

Recruiting Participants

• Various “subject pools”

– Volunteers

– Paid participants

– Students (e.g., psych undergrads) for course credit

– Friends, acquaintances, family, lab members

– “Public space” participants - e.g., observing people walking through a museum

• Must fit user population (validity)

• Motivation is a big factor - not only $$ but also explaining the importance of the

research

(18)

Ethics

• Testing can be arduous

• Each participant should

consent to be in experiment

(informal or formal)

– Know what experiment involves,

what to expect, what the

potential risks are

• Must be able to stop without

danger or penalty

(19)

Consent

• Why important?

– People can be sensitive about this process and issues

– Errors will likely be made, participant may feel inadequate

– May be mentally or physically strenuous

• What are the potential risks (there

are always risks)?

– Examples?

• “Vulnerable” populations need

special care & consideration (& IRB

review)

(20)

Attribution Theory

• Studies why people believe

that they succeeded or

failed--themselves or outside factors

(gender, age differences)

(21)

Evaluation is Detective Work

• Goal: gather evidence that can

help you determine whether

your hypotheses are correct or

not.

• Evidence (data) should be:

– Relevant

– Diagnostic

– Credible

(22)

Data as Evidence

• Relevant

– Appropriate to address the hypotheses

• e.g., Does measuring “number of errors” provide insight into how effectively your new air traffic control system supports the users’ tasks?

• Diagnostic

– Data unambiguously provide evidence one way or the other

• e.g., Does asking the users’ preferences clearly tell you if the system performs

(23)

Data as Evidence

• Credible

– Are the data trustworthy?

• Gather data carefully; gather enough data

• Corroborated

– Does more than one source of evidence support the hypotheses?

• e.g., Both accuracy and user opinions indicate that the new

(24)

General Recommendations

• Include both objective &

subjective data

– e.g., “completion time” and “preference”

• Use multiple measures, within a

type

– e.g., “reaction time” and “accuracy”

• Use quantitative measures

where possible

– e.g., preference score (on a scale of 1-7)

(25)

Types of Data to Collect

• “Demographics”

– Info about the participant, used for grouping or for correlation with other measures

• e.g., handedness; age; first/best language; SAT score

• Note: Gather if it is relevant. Does not have to be self-reported: you can use tests (e.g., Edinburgh Handedness)

• Quantitative data

– What you measure

• e.g., reaction time; number of yawns

• Qualitative data

– Descriptions, observations that are not quantified

(26)

Planning for Data Collection

• What data to gather?

– Depends on the task and any benchmarks

• How to gather the data?

– Interpretive, natural, empirical, predictive??

• What criteria are important?

– Success on the task? Score? Satisfaction?…

• What resources are available?

– Participants, prototype, evaluators, facilities, team knowledge

(27)

Collecting Data

• Capturing the Session

– Observation & Note-taking

– Audio and video recording

– Instrumented user interface

– Software logs

– Think-aloud protocol - can be very helpful

– Critical incident logging - positive & negative

• Post-session activities

– Structured interviews; debriefing

• “What did you like best/least?”; “How would you change..?”

(28)

Observing Users

• Not as easy as you think

• One of the best ways to gather

feedback about your interface

• Watch, listen and learn as a

(29)

Observation

– Cheap, quicker to set up and to …

– … intrusion, but doesn’t eliminate it

– Cameras focused on screen, face & keyboard

(30)

Location

• Observations may be

– In lab - Maybe a specially built

usability lab

• Easier to control

• Can have user complete set of tasks

– In field

• Watch their everyday actions

• More realistic

(31)

Challenge

• In simple observation, you

observe actions but don’t know

what’s going on in their head

• Often utilize some form of

(32)

Verbal Protocol

• One technique:

Think-aloud

– User describes verbally what

s/he is thinking while performing

the tasks

• What they believe is happening

• Why they take an action

• What they are trying to do

• Very widely used, useful technique

• Allows you to understand user’s

(33)

Teams

• Another technique:

Co-discovery learning

(Constructive interaction)

– Join pairs of participants to work

together

– Use think aloud

– Perhaps have one person be

semi-expert (coach) and one be

novice

(34)

Alternative

• What if thinking aloud during

session will be too disruptive?

• Can use

post-event protocol

– User performs session, then watches video and describes what s/he was thinking

– Sometimes difficult to recall

– Opens up door of interpretation

Historical Record

(35)

Capturing a Session

1. Paper & pencil

– Can be slow

– May miss things

– Is definitely cheap and easy

[Sample note sheet: a Time row (10:00, 10:03, 10:08, 10:22) against Task 1, Task 2, Task 3, …]

(36)

Capturing a Session

2. Recording (audio and/or

video)

– Good for talk-aloud

– Hard to tie to interface

– Multiple cameras probably

needed

– Good, rich record of session

– Can be intrusive

(37)

Capturing a Session

3. Software logging

– Modify software to log user

actions

– Can give time-stamped

keypress or mouse event

– Two problems:

• Too low-level, want higher level

events

• Massive amount of data, need
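As an illustration of software logging, here is a minimal sketch of a time-stamped event log. `EventLogger` and its method names are hypothetical and not part of any toolkit. The last call shows one way to address the first problem: log a higher-level event explicitly next to the raw keypress and mouse events.

```python
import json
import time

class EventLogger:
    """Collects time-stamped user-interface events in memory (hypothetical helper)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.events = []

    def log(self, kind, **details):
        # each event records when it happened, what kind it was, and any extra details
        self.events.append({"t": self._clock(), "kind": kind, **details})

    def dump(self):
        """Return one JSON object per line, ready to write to a log file."""
        return "\n".join(json.dumps(e) for e in self.events)

log = EventLogger()
log.log("keypress", key="a")
log.log("mouse", x=120, y=48, button="left")
log.log("task_done", task="reply to Mary")   # a higher-level event, logged explicitly
print(log.dump())
```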

(38)

Subjective Data

• Satisfaction is an important

factor in performance over time

• Learning what people prefer is

valuable data to gather

Methods

• Ways of gathering subjective data

– Questionnaires

– Interviews

(39)

Questionnaires

• Preparation is expensive, but

administration is cheap

• Oral vs. written

– Oral advantages: Can ask follow-up questions

– Oral disadvantages: Costly, time-consuming

• Forms can provide more quantitative

data

• Issues

– Only as good as questions you ask

– Establish purpose of questionnaire

– Don’t ask things that you will not use

– Who is your audience?

(40)

Questionnaire Topic

• Can gather demographic data

and data about the interface

being studied

• Demographic data:

– Age, gender

– Task expertise

– Motivation

(41)

Interface Data

• Can gather data about

– screen

– graphic design

– terminology

– capabilities

– learning

(42)

Closed Format

• Likert Scale

– Typical scale uses 5, 7 or 9 choices

– More choices than that are hard to tell apart

– Doing an odd number gives the neutral choice in the middle

– You may not want to give a neutral option

Characters on screen were:

• Closed format

– Answer restricted to a set of choices

– Typically very quantifiable
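Because closed-format answers are restricted to a fixed set of choices, they are easy to tally. The sketch below summarizes responses to a single 7-point item; the responses and the meaning of the scale ends are made up for illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical responses to one 7-point item (1 = hard to read, 7 = easy to read)
responses = [6, 7, 5, 6, 4, 7, 6, 3, 5, 6]

counts = Counter(responses)
for choice in range(1, 8):
    print(choice, "#" * counts[choice])     # crude text histogram of the answers

middle = median(responses)
share_favorable = sum(r >= 5 for r in responses) / len(responses)
print("median:", middle)
print("share at 5 or above:", share_favorable)
```

The median is used here rather than the mean because Likert choices are ordered categories; treating them as evenly spaced numbers is a further assumption.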

(43)

Other Styles

1 - Very helpful  2 - Ambivalent  3 - Not helpful  0 - Unused

___ Tutorial

___ On-line help

___ Documentation

Which word processing

(44)

Open Format

• Asks for unprompted opinions

• Good for general, subjective information, but difficult to analyze rigorously

• May help with design ideas

– “Can you suggest improvements to this interface?”

Closed Format

• Advantages

– Clarify alternatives

– Easily quantifiable

– Eliminate useless

(45)

Questionnaire Issues

• Question specificity

– “Do you have a computer?”

• Language

– Beware terminology, jargon

• Clarity

– “How effective was the system?” (ambiguous)

• Leading questions

– Can be phrased either positive or negative

• Prestige bias - (British sex survey)

– People answer a certain way because they want you to think that way about them

• Embarrassing questions

– “What did you have the most problem with?”

• Hypothetical questions

• “Halo effect”

– When estimate of one feature affects estimate of another (e.g., intelligence/looks)

(46)

Deployment

• Steps

– Discuss questions among team

– Administer verbally/written to a

few people (pilot). Verbally

query about thoughts on

questions

– Administer final test

– Use computer-based input if

possible

– Have data pre-processed,

(47)

Interviews

• Get user’s viewpoint directly, but certainly a subjective view

• Advantages:

– Can vary level of detail as issue arises

– Good for more exploratory type questions which may lead to helpful, constructive suggestions

• Disadvantages

– Subjective view

– Interviewer(s) can bias the interview

– Problem of inter-rater or inter-experimenter reliability (a stats term meaning agreement)

– User may not appropriately characterize usage
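A simple way to put a number on inter-rater agreement is the share of items that two raters coded the same way. The sketch below uses made-up codes from two interviewers. Raw percent agreement does not correct for agreement that would happen by chance, so treat it as a first check only.

```python
def percent_agreement(rater_a, rater_b):
    """Share of items that two raters coded identically."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must code the same items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical codes assigned by two interviewers to the same eight answers
rater_a = ["positive", "negative", "neutral", "positive",
           "negative", "positive", "neutral", "positive"]
rater_b = ["positive", "negative", "positive", "positive",
           "negative", "neutral", "neutral", "positive"]

agreement = percent_agreement(rater_a, rater_b)
print(f"agreement: {agreement:.0%}")
```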

(48)

Interview Process

• How to be effective

– Plan a set of questions (provides

for some consistency)

– Don’t ask leading questions

• “Did you think the use of an icon there was really good?”

• Can be done in groups

(49)

Data Inspection

• Look at the results

• First look at each participant’s

data

– Were there outliers, people who

fell asleep, anyone who tried to

mess up the study, etc.?

(50)

Inspecting Your Data

• “What happened in this study?”

• Keep in mind the goals and

hypotheses you had at the

beginning

• Questions:

– Overall, how did people do?

– “5 W ’s” (Where, what, why,

(51)

Descriptive Statistics

• For all variables, get a feel for

results:

• Total scores, times, ratings,

etc.

• Minimum, maximum

• Mean, median, ranges, etc.

What is the

– e.g., “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range 18-37 years).”

– e.g., “The median time to complete
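The descriptive statistics listed above can be computed with Python's standard library. A minimal sketch on made-up completion times:

```python
from statistics import mean, median

# Hypothetical completion times in seconds, one per participant
times = [12, 19, 13, 17, 15, 21, 14, 16]

summary = {
    "n": len(times),
    "total": sum(times),
    "min": min(times),
    "max": max(times),
    "range": max(times) - min(times),
    "mean": mean(times),
    "median": median(times),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```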

(52)

Subgroup Stats

• Look at descriptive stats

(means, medians, ranges, etc.)

for any subgroups

– e.g. “The mean error rate for the

mouse-input group was 3.4%.

The mean error rate for the

keyboard group was 5.6%.”

– e.g. “The median completion times (in seconds) for the three groups were: novices: 4.4,
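Subgroup statistics come from grouping the records before summarizing them. In this sketch the individual error rates are invented, chosen only so that the group means match the mouse and keyboard figures quoted above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (input device group, error rate in percent)
records = [("mouse", 3.0), ("mouse", 4.0), ("mouse", 3.2),
           ("keyboard", 5.0), ("keyboard", 6.5), ("keyboard", 5.3)]

by_group = defaultdict(list)
for group, error_rate in records:
    by_group[group].append(error_rate)

group_means = {group: mean(values) for group, values in by_group.items()}
for group, m in group_means.items():
    print(f"{group}: mean error rate {m:.1f}% (n={len(by_group[group])})")
```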

(53)

Plot the Data

(54)

Experimental Results

• How does one know if an

experiment’s results mean

anything or confirm any

beliefs?

• Example: 40 people

participated,

(56)

Inferential (Diagnostic) Stats

• Tests to determine if what you see in the data (e.g., differences in the means) are reliable (replicable), and if they are likely caused by the independent variables, and not due to random effects

– e.g., t-test to compare two means

– e.g., ANOVA (Analysis of Variance) to compare several means

– e.g., test “significance level” of a correlation between two variables
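To illustrate why comparing means alone is not enough, the sketch below computes Student's two-sample t statistic by hand for two made-up experiments. Both have the same group means (15 and 18 seconds) but very different spread. It assumes equal variances in the two groups. Turning t into a p-value needs the t distribution, which a statistics package provides (for example `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic and degrees of freedom (equal variances assumed)."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical completion times (seconds); in both experiments the group means are 15 and 18
tight_a, tight_b = [14, 15, 15, 16], [17, 18, 18, 19]
loose_a, loose_b = [5, 10, 20, 25], [8, 13, 23, 28]

t_tight, df_tight = pooled_t(tight_a, tight_b)
t_loose, df_loose = pooled_t(loose_a, loose_b)
print(f"tight spread: t({df_tight}) = {t_tight:.2f}")
print(f"loose spread: t({df_loose}) = {t_loose:.2f}")
```

The difference in means is identical in both experiments, but the t statistic is much larger in magnitude when the scores within each group are tightly clustered.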

Means Not Always Perfect

[Figure: results of Experiment 1 and Experiment 2, each showing Group 1 and Group 2]

(57)

Inferential Stats and the Data

• Ask diagnostic questions about the

data

Are these really

different? What would that

(58)

Hypothesis Testing

• Recall: We set up a “null

hypothesis”

– e.g., there should be no

difference between the

completion times of the three

groups

– Or, H₀: Time_Novice = Time_Moderate = Time_Expert

(59)

Hypothesis Testing

• “Significance level” (p):

– The probability of getting a result at least as extreme as the one you saw, simply by chance, if the null hypothesis were true

– Often loosely read as the probability that your “real” hypothesis (not the null) is wrong, but that is not what it measures

– The cutoff or threshold level of p (“alpha” level) is often set at 0.05: if the null were true, you would get a result this extreme, just by chance, no more than 5% of the time

– e.g. If your statistical t-test (testing the difference between two means)

returns a t-value of t=4.5, and a
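One way to see what p measures is a permutation test. This swaps in a different technique from the t-test mentioned above. If the null hypothesis is true, the group labels are interchangeable, so we can shuffle them many times and count how often chance alone gives a difference in means at least as large as the observed one. The data and function name are hypothetical.

```python
import random
from statistics import mean

def permutation_p(a, b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # split the shuffled scores into two pseudo-groups of the original sizes
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles

color = [12, 19, 13, 14, 15, 12]   # hypothetical completion times in seconds
bw = [17, 15, 21, 20, 18, 22]
p = permutation_p(color, bw)
verdict = "reject the null at alpha = 0.05" if p < 0.05 else "cannot reject the null"
print("estimated p:", p, "->", verdict)
```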

(60)

Errors

• Errors in analysis do occur

• Main Types:

– Type I/False positive - You

conclude there is a difference,

when in fact there isn’t

– Type II/False negative - You conclude there is no difference when there is

(61)

Drawing Conclusions

• Make your conclusions based on

the descriptive stats, but back them

up with inferential stats

– e.g., “The expert group performed faster than the novice group, t(34) = 4.6, p < .01.”

• Translate the stats into words that

regular people can understand

– e.g., “Thus, those who have computer experience will be able to perform

(62)

Feeding Back Into Design

• Your study was designed to yield information you can use to redesign your interface

• What were the conclusions you reached?

• How can you improve on the design?

• What are quantitative benefits of the redesign?

– e.g., 2 minutes saved per transaction, which means 24% increase in production, or

$45,000,000 per year in increased profit

• What are qualitative, less tangible benefit(s)?

(63)

Usability Specifications

• Quantitative usability goals, used as a guide for knowing when interface is “good enough”

• Should be established as early as possible

– Generally a large part of the

Requirements Specifications at the center of a design contract

– Evaluation is often used to

demonstrate the design meets certain requirements (and so the

designer/developer should get paid)

– Often driven by competition’s usability, “Is it good enough…

(64)

Measurement Process

• “If you can’t measure it,

you can’t manage it”

• Need to keep gathering data on

each iterative evaluation and

refinement

• Compare benchmark task

(65)

What is Included?

• Common usability attributes that

are often captured in usability

specs:

– Initial performance

– Long-term performance

– Learnability

– Retainability

– Advanced feature usage

– First impression

(66)

Assessment Technique

Usability     Measure     Value to                Current     Worst         Planned        Best poss
attribute     instrum.    be meas.                level       perf. level   target level   level

Initial       Benchmk     Length of time to       15 secs     30 secs       20 secs        10 secs
perf          task        successfully add        (manual)
                          appointment on the
                          first trial

First         Quest       -2..2                   ??          0             0.75           1.5
impression
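A specification like this can be checked mechanically once a value has been measured. The sketch below takes the worst, target, and best levels from the table; the function name, the verdict wording, and the measured values are assumptions made for illustration.

```python
def judge(measured, worst, target, best, lower_is_better=True):
    """Classify a measured value against the worst, target, and best levels of a usability spec."""
    sign = 1 if lower_is_better else -1    # flip signs so that smaller always means better
    m, w, t, b = (sign * v for v in (measured, worst, target, best))
    if m > w:
        return "below worst acceptable level"
    if m > t:
        return "acceptable, target not met"
    if m > b:
        return "target met"
    return "best possible level reached"

# Initial performance: seconds to add an appointment on the first trial (lower is better)
initial_verdict = judge(18, worst=30, target=20, best=10)

# First impression: questionnaire score on a -2..2 scale (higher is better)
impression_verdict = judge(0.5, worst=0, target=0.75, best=1.5, lower_is_better=False)

print("initial performance:", initial_verdict)
print("first impression:", impression_verdict)
```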

(67)

Fields

• Measuring Instrument

– Questionnaires, Benchmark tasks

• Value to be measured

– Time to complete task

– Number or percentage of errors

– Percent of task completed in given time

– Ratio of successes to failures

– Number of commands used

– Frequency of help usage

• Target level

– Often established by comparison with

(68)