2.1 RELIABILITY AND VALIDITY
✓ Define and identify the difference between the three types of reliability
✓ Define and identify the difference between the types of validity
✓ Understand the difference between reliability and validity
MEASUREMENT AND ERROR
All measurements can be broken into at least two components – the true value of what is being measured and measurement error:
Measured score = true score + error
X = T + e
REDUCING ERROR
Error is reduced with:
- Many participants – individual differences error
- Many measurements – measurement error
- Many occasions
Averages of scores are more reliable than individual scores.
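The X = T + e idea can be made concrete with a small simulation. The sketch below is not from the notes; the true score, error spread and number of measurements are arbitrary illustrative values. It shows that a single measurement can stray far from the true score, while the average of many measurements lands much closer to it.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 50.0      # T: the true value being measured (arbitrary)
error_sd = 5.0         # spread of the measurement error e (arbitrary)
n_measurements = 100   # "many measurements"

# Each measured score is the true score plus random error: X = T + e
scores = true_score + rng.normal(0.0, error_sd, size=n_measurements)

print(round(scores[0], 1))       # a single score may sit well away from T
print(round(scores.mean(), 1))   # the average of many scores sits close to T
```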
RELIABILITY
The consistency/repeatability of the results of a measurement.
There are three types of reliability:
→ Observers: Inter-observer reliability
→ Observations: Internal (split-half) reliability
→ Occasions: Test-retest reliability
INTER-OBSERVER RELIABILITY
- Degree to which observers agree upon an observation or judgement
- Measure inter-observer reliability with correlations – positive relationship between the scores of each observer
- The higher the correlation between observer judgements, the more reliable the results (see the sketch below)
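A minimal sketch of checking inter-observer reliability with a correlation, assuming two hypothetical observers rate the same eight behaviours (the ratings below are made up for illustration):

```python
import numpy as np

# Hypothetical ratings of the same 8 behaviours by two observers
observer_a = np.array([3, 5, 4, 2, 5, 1, 4, 3])
observer_b = np.array([4, 5, 4, 2, 5, 2, 3, 3])

# Pearson correlation between the two observers' judgements;
# the closer r is to +1, the higher the inter-observer reliability
r = np.corrcoef(observer_a, observer_b)[0, 1]
print(round(r, 2))
```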
INTERNAL RELIABILITY
- Degree to which several items that propose to measure the same general construct produce similar scores, or the extent to which a measure is consistent within itself
- Having more items that consistently measure the construct of interest reduces error
SPLIT-HALF RELIABILITY
- A form of internal consistency reliability
- Measures the extent to which all parts of the test contribute equally to what is being measured
- This is done by comparing the results of one half of a test with the results from the other half. If the two halves of the test provide similar results, this suggests that the test has internal reliability (see the sketch below)
- If there are multiple categories, scores need to be compared across the test within each of the question bank categories
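A minimal split-half sketch, assuming a hypothetical 6-item test taken by 10 participants (the data are simulated so the items share one underlying construct). The Spearman–Brown step is a standard correction for the halved test length and is not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 10 participants x 6 items that all tap one underlying construct
ability = rng.normal(0.0, 1.0, size=(10, 1))
items = ability + rng.normal(0.0, 0.5, size=(10, 6))

# Split the test into two halves (odd vs even items) and total each half
half_1 = items[:, 0::2].sum(axis=1)
half_2 = items[:, 1::2].sum(axis=1)

# Similar scores on the two halves suggest internal (split-half) reliability
r_half = np.corrcoef(half_1, half_2)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```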
TEST-RETEST RELIABILITY
- The reliability of a measure to produce the same results at different points in time or occasions
- Important to show that the test or measure consistently measures the construct we are interested in, provided no other variables have changed
- Practice effects undermine test-retest reliability (e.g., in a visual search task)
REPLICATION
- The reliability of results across experiments
- The more times a result is replicated, the more likely it is that the findings are accurate and not due to error
RELIABILITY: The consistency and repeatability of the results of a measurement
- E.g., the scales consistently tell me that I weigh 55 kgs – they are reliable because they produce the same results consistently
versus
VALIDITY: A measure of how well a measure or experiment measures what it claims to measure
- E.g., if my scales always read 5 kg less than my actual weight, then they are not a valid measure of my weight (though they may be very reliable if they are always exactly 5 kg off)
- Relates to accuracy
INTERNAL VALIDITY
- Focused on whether the research design and evidence allow us to demonstrate a clear cause and effect relationship
- Crucial to make claims about the causal relationship between variables
J.S. Mill proposed three requirements for causality:
1. Covariation
Is there evidence for a relationship between the variables?
2. Temporal sequence
One variable (the cause) occurs before the other (the effect)
3. Eliminate confounds
Explain or rule out other possible explanations
EXTERNAL VALIDITY
- How well a causal relationship holds across different people, settings, treatment variables, measurements and time, i.e. how well the causal relationship can be generalised outside the specifics of the experiment
- E.g., Can experimental findings from animal labs generalise to humans?
- Difficult to obtain high external validity in controlled experimental settings
Population validity: how well the experimental findings can be generalised to the wider population
Ecological validity: how well the experimental findings can be generalised outside of a laboratory setting to the real world
MEASUREMENT VALIDITY
- How well a measure or an operationalised variable corresponds to what it is supposed to measure/represent
- Shows that the measurement procedure measures what it claims to
- A number of methods are used to assess the validity of a measurement
Construct validity: How well operationalised variables (independent and/or dependent) represent the abstract variables of interest = strength of the operational definitions
Content validity: degree to which the items or tasks on a multi-faceted measure accurately sample the target domain
- How well does a measure/task represent all the facets of a construct?
- E.g., IQ tests: 7 questions on a Facebook quiz that ask you about mathematics, general knowledge and logical reasoning… can they really adequately represent a construct like IQ?
- Many constructs are multi-faceted and sometimes multiple measures must be used to achieve adequate content validity
Content validity vs internal reliability
Content validity demonstrates that all of the items on a multiple domain measure accurately measure the construct.
- E.g., extroversion scale: all questions need to accurately measure extroversion and not another construct
Internal reliability relates to whether the items on a multiple domain measure consistently measure the construct.
- E.g., intelligence test: all questions about verbal intelligence should produce consistent scores in the same individual
CRITERION VALIDITY
- How well scores on one measure predict the outcome for another measure (criterion)
- Two types of criterion validity: concurrent (now) and predictive (future)
Concurrent validity compares the scores on two current measures to determine whether they are consistent.
- How well do scores on one measure predict the outcome for another existing measure?
- If the two tests produce similar and consistent results, it can be said that they have concurrent validity
- Can be used to predict the outcome of a current behaviour from a separate measure (e.g., do likes of dog pictures on Facebook predict whether you buy dog products now)
Predictive validity makes predictions about how scores on a measure for a certain construct affect future behaviour.
- If the measurement of a construct accurately predicts future behaviour, then the measurement would have high predictive validity
- E.g., testing the claim that listening to hip-hop leads to violent crime – compare hours spent listening to hip-hop with the number of violent criminal offenses committed later
2.2 THREATS TO VALIDITY
✓ Identify and define confounds
→ Third variable
→ Experimenter bias
→ Participant effects
→ Time effects
✓ Identify and define artefacts
→ Mere measurement effect
→ History effects
→ Selection bias
✓ Understand the difference between confounds and artefacts
VALIDITY REVISION
INTERNAL VALIDITY
→ Strength of causal claim
EXTERNAL VALIDITY
→ Population validity
→ Ecological (environmental) validity
MEASUREMENT VALIDITY
→ Construct validity
→ Content validity
CRITERION VALIDITY
→ Concurrent
→ Predictive
Recall: J.S. Mill's 3 criteria to infer causation:
1. Covariation
2. Temporal sequence
3. Eliminate confounds (alternative explanations) – the third-variable problem
CONFOUNDS
- Extraneous (third and unaccounted for) variable that systematically varies or influences both the independent and the dependent variable
- Factors other than the independent variable that may cause a result
- E.g., there is a positive correlation between coffee drinking and the likelihood of having a heart attack
Can it be concluded that drinking coffee causes heart attacks?
Increased caffeine consumption is observed among those who smoke, experience job-related stress, suffer from sleep deprivation, etc.
The best way to account for these systematic confounds is usually to manipulate the independent variable directly (i.e., run a true experiment) rather than rely on correlational data.
THREATS TO INTERNAL VALIDITY – CONFOUNDS
Control groups
- Want to manipulate only the variable of interest between groups/conditions and to keep everything else constant
- However, it is almost impossible to isolate and remove all other variables
THREATS TO INTERNAL VALIDITY: EXPERIMENTER
Experimenter bias
- A confound which undermines the strength of a causal claim
- May influence the way a dependent variable is scored – the researcher may interpret observations in a manner which supports their predictions
- Due to certain expectations, the experimenter may behave in a way that influences the participants and confounds the results of the experiment
- Not always intentional – previous knowledge and ideas can create tunnel vision in the experimenter
Judge leniency
- More lenient on attractive people (only for burglary)
- Not solved by inter-observer reliability; highlights the importance of blind procedures
DOUBLE BLIND PROCEDURES
- Gold standard for experimental research
- Neither the experimenter nor the participant are aware of the conditions, e.g., who is receiving a particular treatment
- Prevents both unintentional and intentional bias when interpreting data
THREATS TO INTERNAL VALIDITY: PARTICIPANT
- The way a participant behaves can influence the validity of the results
Participant expectancy
- Whereby a researcher’s expectations about the findings of his or her research are inadvertently conveyed to participants and influence their responses
- May arise from participants’ reactions to subtle cues (demand characteristics) unintentionally given by the researcher, for example, through body movements, gestures, or facial expressions
- May threaten the ecological validity of the research
Demand characteristics
- Cues about the purpose of the study that may influence or bias participants' behaviour, for example, by suggesting the outcome or response that the experimenter expects or desires
- May arise when participants want to be viewed in a good light (often occurs in surveys and scales, as it is difficult to hide the purpose of the study with questionnaire research)
Individual differences
- Differences in characteristics between all individual participants, including animals
- When there is a systematic issue with individual differences, there is a threat to internal validity
- Impossible to completely remove
Participant reactance bias
- When a person modifies their behaviour because they are participating in a research study
- Occurs due to participants' concerns about being judged favourably or unfavourably
- A feature of the study suggests the purpose of the study, and this changes the ability to measure the behaviour of interest
- Good subject role: try to produce the response the researcher wants
- Negative subject role: try to act contrary to what the researcher wants
- Apprehensive subject role: overly concerned about their performance
- Faithful subject role: follow instructions and ignore suspicions
THREATS TO INTERNAL VALIDITY: TIME-RELATED CONFOUNDS
Maturation
- The effect of time on participants
- Short term: mood, tiredness, hunger, boredom
- This could be dealt with by counterbalancing the order of tests, controlling for time of day, designing experiments of reasonable length and including breaks in the experimental design
- Long term: age, education, wealth
- Difficult to control; only important for longitudinal studies which take place over many years
- Random assignment and sampling helps to reduce this confound
SOLUTIONS TO PARTICIPANT MEASUREMENT EFFECTS
UNOBTRUSIVE OBSERVATION
- A method of making observations without the knowledge of those being observed
- E.g., hidden microphones or cameras observing behaviour (“the marshmallow test”)/ doing garbage audits to determine consumption
INDIRECT MEASURES
- “Missing Child” Experiment
- As it cannot be determined whether missing child posters work simply by asking people if they pay attention to them, in this experiment missing child posters were placed in a public place with the "missing" child standing nearby, to see how many people would notice or report the child
DECEPTION/CONFEDERATES
- Confederate: an aide of the experimenter who poses as a participant but whose behaviour is rehearsed prior to the experiment
- E.g., Asch’s conformity studies
THREATS TO EXTERNAL VALIDITY – ARTEFACTS
ARTEFACT
- An experimental finding that is not a reflection of the true state of the phenomenon of interest but rather is the consequence of a flawed design or analytic error
- Prevents a researcher from generalising his/her results
- Unlike a confound, an artefact is present in all groups being tested and stays constant across them
Mere measurement effect
- Being aware that someone is observing or measuring your behaviour may change the way you behave
- This is important for external validity as it undermines the ability to generalise our results to the wider population and context
- Similar to participant reactance, except that it affects all subjects in the experiment
- E.g., the Hawthorne factory experiment: researchers were interested in which factors affect productivity and, after testing a range of different conditions, found that all of them improved productivity by 30% – the act of observation alone affected performance
History effects
- The effect of a period of time may make an entire sample biased
- E.g., level of education in Syria: a warzone means limited access to school, shelter and food, ∴ the data are influenced by that moment in time and cannot be generalised to a wider population or to different contexts
Selection bias
- Volunteers for a study having a biased interest in the topic or the outcome of research
- It is important to use a random sample of the population to reduce the likelihood of systematic biases in the data
- A compulsory poll or census reduces sampling bias, produces results and data that are more widely applicable, and is less susceptible to biased groups, but is still susceptible to demand characteristics
Non-response bias
- An issue for experiments that involve voluntary sign-ups or surveys (polls and online surveys)
- A large sample of the population is lost as people do not get involved in studies they do not find interesting
- Limited population means that the results cannot be generalised to a wider population
10.1 – SINGLE SAMPLE T-TEST
T-TEST OR STUDENT'S T-TEST
- Developed in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin ("Student" was his pen name)
- The t-test was a cheap way to monitor the quality of stout
- The work was published in the journal Biometrika in 1908 under the pseudonym "Student" because company policy did not allow Guinness chemists to publish their findings
The t-test tells you how significant the differences between groups are. In other words, it lets you know whether those differences (measured in means) could have happened by chance.
- Does not require any knowledge of the population standard deviation
- Widely used in behavioural sciences (experiments involving a single sample or two samples or conditions)
THE T STATISTIC FOR SINGLE SAMPLE EXPERIMENTS
The t statistic can be used to test hypotheses about a completely unknown population; that is, both μ and σ are unknown, and the only available information about the population comes from the sample.
All that is required for a hypothesis test with t is a sample and a reasonable hypothesis about the population mean.
THE STEPS FOR CALCULATING A SINGLE SAMPLE T-TEST
1. Determine the hypothesised (population) mean
2. Compute the sample mean
3. Compute the variance and standard deviation
4. Compute the standard error of the mean
5. Compute t (e.g., t = 3.00)
6. Determine the degrees of freedom (df) (e.g., df = 8)
7. Determine the critical value at the chosen significance level (e.g., α = 0.05); the result is reported as t(8) = 3.00, p < 0.05
A worked sketch of these steps, using made-up numbers, follows below.
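A minimal sketch of the seven steps, assuming a hypothetical sample of n = 9 scores and a hypothesised mean of 10; the data are invented, and the critical value is looked up with scipy, which is a library choice rather than part of the notes:

```python
import numpy as np
from scipy import stats

sample = np.array([13, 10, 16, 12, 14, 9, 15, 13, 15])  # hypothetical scores
mu0 = 10.0                                 # step 1: hypothesised population mean

m = sample.mean()                          # step 2: sample mean
var = sample.var(ddof=1)                   # step 3: sample variance ...
sd = np.sqrt(var)                          #         ... and standard deviation
sem = sd / np.sqrt(len(sample))            # step 4: standard error of the mean
t = (m - mu0) / sem                        # step 5: t statistic
df = len(sample) - 1                       # step 6: degrees of freedom = n - 1
t_crit = stats.t.ppf(0.975, df)            # step 7: two-tailed critical value, alpha = .05

print(round(t, 2), df, round(t_crit, 2))   # reject H0 if |t| exceeds the critical value
```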
Degrees of Freedom (DF)
If you have a set of scores and their mean is fixed, all but one of the scores are independent and free to vary without breaking any constraints; the final score is then determined and cannot vary. The number of scores free to vary is the degrees of freedom.
Because the sample mean places a restriction on the value of one score in the sample, if the sample size is n, then the number of degrees of freedom is n – 1 (i.e., if you have 10 scores in a sample, the number of degrees of freedom is 9)
E.g., consider a sample of n = 10 scores whose total must equal 69 (so the mean is fixed): 6, 8, 5, 9, 6, 8, 4, 11, 7, X
The first nine scores sum to 6 + 8 + 5 + 9 + 6 + 8 + 4 + 11 + 7 = 64
Thus, X has to be 69 – 64 = 5 and cannot vary
OR, if you have 7 different outfits to wear during the week, on Monday you can wear any of the 7 outfits, on Tuesday any of the 6 remaining outfits etc., until Sunday when you only have ONE outfit remaining.
The greater the value of df for a sample, the better the sample variance represents the population variance.
For t-test statistics, degrees of freedom are given in parentheses: t(8) = 3.00, p < 0.05
HYPOTHESIS TESTS WITH THE T STATISTIC
The hypothesis test with a t statistic follows the same four-step procedure that was used before:
1. State the hypotheses and select a value for α. (Note: The null hypothesis always states a specific value for μ.)
2. Locate the critical region (i.e., set the criteria for the decision).
3. Calculate the test statistic.
4. Make a decision. (Either "reject" or "fail to reject" the null hypothesis.)
ASSUMPTIONS OF THE SINGLE SAMPLE T-TEST
1. The values in the sample must consist of independent observations
- Two observations are independent if there is no consistent, predictable relationship between the first observation and the second
- Two observations are independent if the occurrence of the first observation has no effect on the probability of the second event.
2. The population that is sampled must be normal
- With very small samples, a normal distribution is important
- With very large samples this assumption can be violated without affecting the validity of the hypothesis test
REPORTING THE RESULTS OF A T-TEST
Significant = the null hypothesis has been rejected
Not significant = failure to reject the null hypothesis
In a result reported as t(8) = 3.00, p < 0.05:
- t identifies the test statistic, and 3.00 is its calculated value
- (8) is the degrees of freedom
- p < 0.05 is the significance level (the probability of committing a Type I error)
E.g., “for an unknown population mean: Do infants prefer to look at attractive faces compared to less attractive faces?”
Hypotheses:
- Null hypothesis: M(attractive faces) = 10 sec (thus, μ = 10)
- Alternative hypothesis: M(attractive faces) > 10 sec (thus, μ > 10)
Results:
- Average looking time at the attractive face for all infants: M = 13 sec, S = 3.00
The infants spent an average of M = 13 (out of 20) seconds looking at the attractive faces, with SD = 3.00.
Statistical analysis indicates that the time spent looking at the attractive faces was significantly greater than would be expected if there were no preference, t(8) = 3.00, p< 0.05.
In peer-reviewed journals (and elsewhere), confidence intervals should be reported as well.
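The reported numbers can be reproduced by hand. A small check, assuming n = 9 infants (inferred from df = 8; the notes do not state n directly):

```python
import numpy as np
from scipy import stats

m, sd, n, mu0 = 13.0, 3.0, 9, 10.0   # M = 13, SD = 3, n = 9 (df = 8), mu under H0 = 10

sem = sd / np.sqrt(n)                # 3 / 3 = 1.0
t = (m - mu0) / sem                  # (13 - 10) / 1 = 3.0
df = n - 1                           # 8

# One-tailed p-value for the directional hypothesis (mu > 10)
p = 1 - stats.t.cdf(t, df)
print(t, df, round(p, 3))            # t(8) = 3.00, p < .05
```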
ESTIMATING μ: CONFIDENCE INTERVAL
Often it is more informative to give an interval estimate for μ rather than a single value.
An interval estimate gives us a range of believable values and is called a confidence interval: the boundaries within which we believe the true value of the mean will fall.
- Example: the population mean height is 165 cm and our sample mean is 167 cm; we take the sample mean as the mid-point of an interval (but should the range be 163–171 cm or 166–168 cm?). If we take more samples and obtain the mean of each, the sample means will, on average, equal the population mean; most of the intervals will contain the population mean, but some will not.
- How do we find this interval?
DETERMINING 99% CONFIDENCE INTERVAL FOR POPULATION MEAN
"Who can catch a liar?" Ekman and O'Sullivan (1991)
Participants:
- Law-enforcement personnel, such as members of US Secret Service, Central Intelligence Agency, Federal Bureau of Investigation, National Security Agency, Drug Enforcement Agency, California police and judges
- Psychiatrists, college students, and adults from various other professions
Procedure:
- A videotape showed 10 people who were either lying or telling the truth in describing their feelings
- The participants were asked to identify who was lying and who was telling the truth
Results:
- The Secret Service Agents were significantly more accurate than other participants
- Those who were accurate used different clues and had different skills than those who were inaccurate
Sample mean: M = 64.12
Sample size: n = 34
Standard error of the mean: SEM = s / √n = s / √34 = 2.54
The limits should be constructed such that 99% of the time the true value of the population mean will fall within them. A z-score of ±2.57 corresponds to approximately 49.5% of the area under the normal curve on each side of the mean, and 2 × 49.5% = 99%.
We find the lower and upper interval limits by using the transformed z-score formula:
CI(upper) = M + 2.57 × SEM = 64.12 + 2.57 × 2.54 = 64.12 + 6.53 = 70.65
CI(lower) = M – 2.57 × SEM = 64.12 – 2.57 × 2.54 = 64.12 – 6.53 = 57.59
∴ We can be 99% confident that the interval between 57.59 and 70.65 includes the true ability of Secret Service agents to detect deception.
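The interval above follows directly from M, the SEM and the 99% z-score; a minimal sketch using the values from the example:

```python
# Values taken from the Ekman and O'Sullivan example in the notes
m = 64.12     # sample mean accuracy
sem = 2.54    # standard error of the mean
z = 2.57      # z-score bounding the middle 99% of the normal curve

lower = m - z * sem
upper = m + z * sem
print(round(lower, 2), round(upper, 2))   # approximately 57.59 and 70.65
```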
The width of the confidence intervals is affected by:
Variation in the population of interest – greater variation in the population leads to a wider confidence interval.
Small variation:
- Samples more similar to each other
- The population estimate based on the sample is more accurate
- Small width of the confidence interval
Large variation:
- Samples are more different from each other
- We are less sure that the sample mean is close to the population mean
- Large width of the confidence interval
Sample size – a smaller sample size leads to a wider confidence interval
Small samples:
- vary more from each other
- more sampling error
- less confidence about the estimation
- the width of the CI is larger
Large samples:
- smaller variation between the samples – the samples are more similar to each other
- the effect of sampling error is reduced
- more confidence about the estimation
- the width of the CI is smaller
A short sketch of both effects follows below.
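A small sketch of both effects, using the half-width of a z-based interval (z × s / √n); the spreads and sample sizes are arbitrary illustrative values:

```python
import numpy as np

def ci_half_width(s, n, z=1.96):
    """Half-width of a z-based confidence interval: z * s / sqrt(n)."""
    return z * s / np.sqrt(n)

# Larger samples give a narrower interval (same spread)
print(ci_half_width(s=10, n=25), ci_half_width(s=10, n=100))

# More variation in the population gives a wider interval (same n)
print(ci_half_width(s=5, n=25), ci_half_width(s=20, n=25))
```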
10.2 – INDEPENDENT SAMPLES T-TEST
REPEATED–MEASURES (WITHIN SUBJECT) DESIGNS: RELATED SAMPLES
Two separate scores are obtained from each individual in the sample.
The repeated measures (related samples) hypothesis test allows researchers to evaluate the mean difference between two treatment conditions using the data from a single sample.
- Each individual from a single sample is measured in both of the treatment conditions being compared
- The data consist of two scores for each individual
Examples (a related-samples t-test sketch follows after this list):
- A study investigating effects of a medical therapy: A group of patients could be measured before therapy and then measured again after the therapy
- Effects of alcohol on driving: A group of participants could be tested while simulating driving in a driving simulator when they are sober and then tested again after consuming alcohol
- Effects of fatigue on pilots' decision making in emergencies: A group of pilots could perform a simulated task in a flight simulator and then be tested again after performing a series of cognitively demanding tasks
- Pleasant odours improve students’ reaction times in processing of lexical information: A group of students does a lexical decision experiment exposed to no scent, and then exposed to a floral scent
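A minimal related-samples sketch based on the scent example, assuming hypothetical reaction times for the same six students in both conditions and using scipy's paired t-test (the data and the library choice are assumptions, not part of the notes):

```python
import numpy as np
from scipy import stats

# Hypothetical reaction times (ms) for the same 6 students in each condition
no_scent = np.array([540, 580, 610, 500, 575, 560])
floral   = np.array([520, 565, 600, 495, 550, 545])

# Related-samples (paired) t-test on the difference scores; df = n - 1
t, p = stats.ttest_rel(no_scent, floral)
print(round(t, 2), round(p, 3))
```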
REPEATED–MEASURES (MATCHED-SUBJECTS) DESIGNS: RELATED SAMPLES
The related samples t-test: each individual in one treatment is matched one-to-one with a corresponding individual in the second treatment.
- Subjects in each pair have identical (or nearly identical) scores on the variable that is being used for matching
- Data consist of pairs of scores
- A difference in scores is computed for each matched pair
- Because the matching process can never be perfect, matched-subjects designs are relatively rare
- E.g., a study investigating verbal learning: participants have to be matched in terms of IQ and sex, so a male participant in one sample with an IQ of 120 is matched with a male participant in the other sample with an IQ of 120
REPEATED–MEASURES OR INDEPENDENT–MEASURES DESIGNS
The independent-measures t-test evaluates the mean difference between two populations using data from two separate samples (a sketch follows after the examples below).
- Two distinct populations (such as men vs. women; children vs. adults; dancers vs. scuba-divers; etc.)
- Two different treatment conditions (such as drug vs. no-drug)
Examples:
- Does regular physical therapy help lower back pain?
- A bank wants to know which of two incentive plans will increase the use of its credit cards.
- Do women trust advertisements more than men?
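A minimal independent-measures sketch based on the physical-therapy example, with hypothetical pain ratings for two separate groups (made-up data; scipy's two-sample t-test is used, assuming equal variances as in assumption 3 below):

```python
import numpy as np
from scipy import stats

# Hypothetical pain ratings: physical-therapy group vs control group
therapy = np.array([3, 4, 2, 5, 3, 4, 2, 3])
control = np.array([5, 6, 4, 7, 5, 6, 5, 4])

# Independent-measures t-test; df = (n1 - 1) + (n2 - 1)
t, p = stats.ttest_ind(therapy, control)   # equal variances assumed by default
df = len(therapy) + len(control) - 2
print(round(t, 2), round(p, 3), df)
```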
Assumptions underlying the t-test for an independent measures design:
1. The observations within each sample must be independent
2. The two populations from which the samples are selected must be normal (less critical with large samples)
3. The two populations from which the samples are selected must have equal variances
Independent design: what does t(9) = 2.967, p < 0.05 really mean?
"The results were statistically significant at the .05 level: t(9) = 2.967, p < 0.05"
- The t represents the type of statistical test, which in this case is a t-test.
- The (9) represents the number of degrees of freedom (N – 1 in a single sample test; (N1 – 1) + (N2 – 1) = N1 + N2 – 2 in a two unrelated samples test)
- The 2.967 is the obtained value, or the value which resulted from applying the t-test to the results of the study
- The 0.05 represents the level of significance or Type I error rate
Advantages vs disadvantages of repeated-measures designs
Advantages:
- Number of subjects – requires fewer subjects because the same subjects are measured in both of the treatment conditions (important when there are relatively few subjects available)
- Study changes over time – well suited for studying learning, development or other changes that take place over time
- Individual differences – the primary advantage, because it reduces or eliminates differences such as age, IQ, gender, and personality
Disadvantages:
- Participants' health or mood may change over time
- Time-related factors
- The first treatment may influence the second treatment (such as practice effects) – solution: counterbalancing
STATISTICAL POWER
The probability that a given test will find an effect, assuming that it exists in the population:
- We can calculate the sample size necessary to achieve a certain level of power
- We can calculate the power of a test
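A minimal power-analysis sketch for an independent-samples design, assuming the statsmodels library and an arbitrary medium effect size (d = 0.5); these inputs are illustrative assumptions, not values from the notes:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed for d = 0.5, alpha = .05 and power = .80
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)

# Power achieved with 30 participants per group under the same assumptions
achieved_power = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)

print(round(n_per_group, 1), round(achieved_power, 2))
```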