PSYC3001 Course Notes
Note: This course has a lot of content! The 10-page course summary sheet is at the end of this document, on page 80.
Week 1
Lecture 1
Inference for a comparison between two means
• Quickly moving through reminder of content from PSYC2001
• In this course we will consider data analysis methods for experiments with the following attributes:
o Between-subjects design: participants are randomly assigned to treatments
o Within-subjects design: participants are measured on multiple occasions
o Dependent variable: continuous construct, has interval or ratio scale of measurement properties
• Dependent variable scores are (approximately) normally distributed
• Treatment effects can be expressed as the difference between population means
• In carrying out data analysis, we want to generate confident statistical inference about the existence, direction or magnitude of treatment effects
o Confident inference: produces a specified error rate, e.g. Type I error
A comparing two means example
• Example: J = 2 (note: J refers to number of groups)
• How effective is studying while being distracted by social media?
o Group 1 (treatment condition): 10 students, studying with access to social media
o Group 2 (control condition): 10 students, no access to social media
o Independent variable: study mode factor (two levels: distraction, no distraction)
o Dependent variable: number of correct items out of 24 on quiz
o Yi1 = score for ith participant in group 1
o N = n1 + n2 = 20
• In this course, we will mostly consider experiments where groups have the same number of participants
o The aim of the experiment is to make an inference about the population treatment effect: $\mu_1 - \mu_2$
• Our point estimate from our sample means is 13 - 18 = -5
o But we know that a single point estimate isn't a very good predictor. We should now be doing confidence intervals or hypothesis tests
o So we need to consider sampling variability for each group:
• Calculating the sum of squared deviations for each group, we get 78 for group 1 and 162 for group 2
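o Next, we want to calculate the standard error estimate (SE):
o $SE = \sqrt{s^2 \times \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
• Note: df = n1 + n2 – 2
o $SE = \sqrt{\frac{78 + 162}{18} \times \left(\frac{1}{10} + \frac{1}{10}\right)} = 1.633$
• This SE is our estimate of the standard deviation of the difference between our sample means
o $s^2 = \frac{SS_1 + SS_2}{df}$ is our estimate of the population variance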
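The SE arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the sums of squares and group sizes quoted in the example; the function name pooled_se is made up for illustration.

```python
import math

def pooled_se(ss1, ss2, n1, n2):
    """Standard error of M1 - M2 from the two within-group sums of squares."""
    df = n1 + n2 - 2                   # pooled degrees of freedom
    pooled_var = (ss1 + ss2) / df      # estimate of the common population variance
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

se = pooled_se(78, 162, 10, 10)
print(round(se, 3))  # 1.633
```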
Making our inference
• At this point in the example, we have two options to make our inference:
• Null hypothesis significance test:
o $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \neq 0$
o $t = \frac{M_1 - M_2}{SE}$, which we compare to the critical value based on our t-tables
o Our level of significance refers to the type I error rate that we accept for this experiment
• i.e. alpha is the probability of rejecting H0 if H0 is true
• In our example:
o $t = \frac{-5}{1.633} = -3.062$
o Our critical value (alpha = 0.05 two-tailed, df = 18) is 2.101
• |−3.062| > 2.101, therefore we reject H0
o Because we did a two-tailed test, we can only make an inference about inequality, i.e. they are not the same
• One tailed tests allow us to make an inference about direction
• Remember, we don't accept a null, we only fail to reject it
o We don't have evidence that there is no effect, we just don't have enough evidence that there is one
• Interval estimate of treatment effect:
o Constructs a 100(1 − alpha)% confidence interval for our two-mean difference parameter
o $CI = (\text{point estimate}) \pm SE \times t_{\alpha/2,\, df}$
o Higher level of confidence = wider interval
• In our example:
o $CI = -5 \pm 1.633 \times 2.101 = (-8.43, -1.57)$
o Important to make a substantive inference by referring to the actual things being measured
o ‘We are 95% confident that studying with social media distraction reduces performance on the quiz by at least 1.57 points, and up to 8.43 points, compared to studying with no distraction’
• Noncoverage error: probability that the confidence interval does not contain the parameter
o The noncoverage error rate is equal to alpha, i.e. it takes the same value as the Type I error rate
o NOTE: this is different to Type I error in the sense that it is possible to make a noncoverage error even when H0 is false (whereas Type I error only applies when H0 is true)
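Both inference options in the example use the same three numbers: the difference in sample means, the SE, and the critical t. The sketch below reruns the test and the interval with the values quoted in these notes; the critical value 2.101 is taken from the t-tables as stated above rather than computed.

```python
m1, m2 = 13, 18      # sample means from the example
se = 1.633           # standard error from the example
t_crit = 2.101       # two-tailed critical t, alpha = .05, df = 18 (from t-tables)

diff = m1 - m2                      # point estimate of mu1 - mu2
t_stat = diff / se                  # observed t statistic
reject_h0 = abs(t_stat) > t_crit    # two-tailed decision

margin = t_crit * se                # half-width of the 95% CI
ci = (diff - margin, diff + margin)

print(round(t_stat, 3), reject_h0)        # -3.062 True
print(round(ci[0], 2), round(ci[1], 2))   # -8.43 -1.57
```

The CI excludes zero exactly when the two-tailed test rejects H0, since both decisions compare the difference against the same margin.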
Errors
• Type I error (alpha): probability of falsely rejecting the null hypothesis
o i.e. P(reject H0 | H0 is true)
• Type II error (beta): probability of falsely failing to reject the null hypothesis
o i.e. P(fail to reject H0 | H0 is false)
o Note: if we get the results and make no inference based on them, then we have technically made a Type II error (in that the data failed to allow us to make the right inference)
• Type III error: a directional inference is made in the wrong direction
o $P(M_1 < M_2 \mid H_0 \text{ is false} \cap \mu_1 > \mu_2)$
o Obviously works for swapping < and > directions too
o The probability of a Type III error is usually extremely small (especially if the sample size is large enough)
• The most important error for null hypothesis tests is the Type I error
o Ensuring adequate and appropriate control of Type I error rate is important
o Otherwise, tests will often generate spurious significant outcomes which will render research findings meaningless
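One way to see what "controlling the Type I error rate at alpha" means is to simulate many experiments in which H0 is true and count how often the test rejects. This is a rough sketch, not part of the course material: the populations (standard normal), the number of replications and the seed are all arbitrary assumptions, and the critical value 2.101 is the df = 18 value quoted earlier.

```python
import math
import random

def two_sample_t(g1, g2):
    """Pooled-variance t statistic for two independent groups."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((y - m1) ** 2 for y in g1)
    ss2 = sum((y - m2) ** 2 for y in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

def type1_rate(reps=4000, n=10, t_crit=2.101, seed=1):
    """Proportion of replications that reject H0 when H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Both groups come from the same population, so mu1 - mu2 = 0
        g1 = [rng.gauss(0, 1) for _ in range(n)]
        g2 = [rng.gauss(0, 1) for _ in range(n)]
        if abs(two_sample_t(g1, g2)) > t_crit:
            rejections += 1
    return rejections / reps

rate = type1_rate()
print(rate)  # should land close to the nominal 0.05
```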
Hierarchy of inference strength
• There is an evident hierarchy in the 'strength' or usefulness of the inferences that can be made about treatment effects
1. Confidence interval inference: giving a numerical range with a particular level of confidence
2. Confident direction inference: making an inference about the sign of mu(1) - mu(2) with a level of confidence
3. Confident inequality inference: saying mu(1) does not equal mu(2) with a particular level of confidence
• What does a CI provide that a hypothesis test doesn't?
o Information about effect size:
• If a CI excludes zero, then by extension it implies directional inference AND inequality inference
• If a CI includes zero, then it does not justify directional or inequality inference
• In other words, the CI encompasses the inferences of the other two options
o Information about precision of estimation:
• Suppose we have two CIs: (3.2, 6.2) and (1.2, 8.2)
• Both exclude zero so allow for a directional inference, but the limits for the first interval provide a more precise estimate of the treatment effect than the second interval
• For both of these, a hypothesis test would only provide a confident direction inference with no evident difference between the two in precision
o Possibility of a practical equivalence inference:
• Inference that the absolute magnitude of the parameter is trivially small (i.e. close enough to zero to not be important)
• i.e. related to questions where "the smallest clinically significant TE is a difference of X points"
• A hypothesis test doesn't let you make inferences regarding this
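The hierarchy can be made concrete by writing down which inferences a given interval on mu1 - mu2 supports. The function below is an illustrative sketch only: its name, its output strings and the smallest_important threshold (standing in for "the smallest clinically significant difference") are assumptions, not course definitions.

```python
def inferences_from_ci(lower, upper, smallest_important=None):
    """List the inferences that a CI on mu1 - mu2 supports."""
    results = []
    if lower > 0:
        results.append("direction: mu1 > mu2")      # also implies inequality
    elif upper < 0:
        results.append("direction: mu1 < mu2")      # also implies inequality
    else:
        results.append("no direction or inequality inference")
    results.append(f"magnitude: between {lower} and {upper}")
    # Practical equivalence: the whole interval lies inside (-delta, +delta)
    if (smallest_important is not None
            and -smallest_important < lower
            and upper < smallest_important):
        results.append("practical equivalence")
    return results

print(inferences_from_ci(3.2, 6.2))
print(inferences_from_ci(1.2, 8.2))
print(inferences_from_ci(-0.4, 0.6, smallest_important=1.0))  # hypothetical interval
```

The first two intervals give the same directional inference but different widths, which is the precision information a hypothesis test alone does not report.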
Lecture 2
The Problem of Multiple Comparisons
An example
• For an experiment with J > 2, what happens to the Type I error rate as we make multiple inferences?
• Let's take an example where J = 3
o Participants are randomly assigned to one of three treatment conditions (DV = depression scores)
• Mu1 = population mean depression score for New Treatment (T1)
• Mu2 = population mean depression score for Standard Treatment (T2)
• Mu3 = population mean depression score for Control condition (C3)
o This means there can be three pairwise comparisons: $\mu_1 - \mu_2$, $\mu_1 - \mu_3$, $\mu_2 - \mu_3$
• Now suppose we do three hypothesis tests, (1H0, 2H0, 3H0)
o Also suppose that all 3 null hypotheses are true
o The possible outcomes for this experiment in terms of Type 1 errors:
• 0 type 1 errors
• 1 type 1 error (3 possible ways)
• 2 type 1 errors (3 possible ways)
• 3 type 1 errors
• The type I error rate can be applied to:
o Individual comparisons: known as the per-comparison error rate (PCER)
• i.e. the type 1 error rate for an individual comparison
o The set of comparisons: known as the familywise error rate (FWER)
• i.e. the probability of at least one type I error in a set of inferences
The FWER and maximal comparisons
• So what are these in our example above?
o The PCER should stay at the significance level for that test, i.e. alpha
o The FWER will be $1 - P(\text{no Type I errors}) = 1 - (1 - \alpha)^n$
• Where n is the number of comparisons being made (this formula treats the n tests as independent)
o For our example, n = 3 so FWER = 1 - 0.95^3 = 0.1426
• So the FWER is inflated above the nominal alpha = 0.05
o As J increases, the number of possible pairwise comparisons increases greatly
o Specifically, for J groups there are JC2 possible combinations, i.e. $\frac{J(J-1)}{2}$
• The FWER continues to increase (although at a decelerating rate)
• Controlling FWER is important for a post-hoc analysis, i.e. analysis guided by how data has turned out
o e.g. consider the experimental design for J = 3
o Let's say we want to examine comparisons that show the biggest difference between sample means
• This will involve Mmax - Mmin
o But we don't know which means will end up being max and min until after the analysis - i.e. it's a post-hoc comparison
• The maximal comparison is very different to a fixed comparison such as M1 - M2
o Even though the actual values for M1 and M2 will change over replications, the comparison always involves the same two groups
• So what's the average value of the maximal comparison? Much higher than it is for a fixed comparison, since the largest is always being chosen
o Also, on any replication where there is at least one significant difference (from a set of comparisons), the maximal comparison must be significant – why? Because the maximal comparison will have the largest difference, and significance is determined by the size of the difference – if one difference is significant, then it must be the largest of all differences (i.e. the maximal one)
• This rule works vice versa too - if there is a significant maximal comparison, then there must be at least one significant fixed comparison
o Therefore it is easier to make a Type I error for maximal comparison than for fixed comparison
• i.e. the type 1 error rate for a maximal comparison must be larger than for a fixed, i.e. > 0.05
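The growth of the FWER with J described earlier in this section can be tabulated directly from the two formulas: J(J-1)/2 pairwise comparisons, and 1 - (1 - alpha)^n for n comparisons. A small sketch follows; note that the second formula treats the comparisons as independent, which is only an approximation for pairwise comparisons that share group means. The function names are illustrative.

```python
def n_pairwise(j):
    """Number of pairwise comparisons among J group means."""
    return j * (j - 1) // 2

def fwer_independent(n_comparisons, alpha=0.05):
    """FWER for n independent tests, each run at the given alpha."""
    return 1 - (1 - alpha) ** n_comparisons

for j in (2, 3, 4, 5, 6):
    k = n_pairwise(j)
    print(j, k, round(fwer_independent(k), 4))
# The J = 3 row shows 3 comparisons and an FWER of 0.1426, as in the notes
```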
Controlling the FWER for post-hoc analysis
• What's the error rate for a maximal comparison?
o Well as we noted above, it's the probability of finding at least one significant difference from our set - meaning it must be the FWER
• This applies for other types of post-hoc analyses as well
o The appropriate error rate unit for post-hoc analysis is the FWER
o i.e. procedure should control for the probability of making at least one Type I error in a set
• Therefore it is appropriate to control the FWER at alpha for these measures
• Another important reason for controlling the FWER is statistical dependencies among comparisons
o When J > 2, inferences on all comparisons are not statistically independent
• e.g. consider the estimates M1 - M2, and M1 - M3
o These are going to be positively correlated as they both contain M1
o Therefore, the probability of a Type 1 error for M1 - M2 is positively correlated with the probability of a Type 1 error for M1 - M3
o The more that outcomes are correlated, the more likely that Type I errors will occur in clusters
• In this context, the appropriate error rate unit is the FWER rather than the PCER
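The claim that the maximal comparison carries the familywise rate rather than the per-comparison rate can be checked by simulation. The sketch below is an illustration under stated assumptions, not the course's own procedure: J = 3 groups of n = 10 drawn from one standard normal population (so every null hypothesis is true), a common SE built from the pooled error variance, and an unadjusted critical t of 2.052 (two-tailed, alpha = .05, df = 27). Seed and replication count are arbitrary.

```python
import math
import random
from itertools import combinations

def simulate(reps=4000, j=3, n=10, t_crit=2.052, seed=2):
    rng = random.Random(seed)
    fixed_hits = 0     # replications where the fixed comparison M1 - M2 is significant
    family_hits = 0    # replications where the maximal comparison is significant
    for _ in range(reps):
        groups = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(j)]
        means = [sum(g) / n for g in groups]
        sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
        mse = sse / (j * (n - 1))          # pooled error variance
        se = math.sqrt(mse * 2 / n)        # common SE for every pairwise difference
        t_abs = {pair: abs(means[pair[0]] - means[pair[1]]) / se
                 for pair in combinations(range(j), 2)}
        fixed_hits += t_abs[(0, 1)] > t_crit
        # With a common SE the largest |t| belongs to Mmax - Mmin, so this
        # also counts "at least one significant comparison"
        family_hits += max(t_abs.values()) > t_crit
    return fixed_hits / reps, family_hits / reps

pcer, fwer = simulate()
print(pcer, fwer)  # the fixed comparison stays near 0.05; the maximal one is well above it
```

Because the comparisons are positively correlated, the simulated familywise rate would be expected to fall somewhat below the 0.1426 given by the independence formula.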
The FWER and confidence intervals
• With individual t-based CI procedures, we get the same sort of problem with familywise noncoverage errors
o Remember: noncoverage error = probability that our confidence interval does not contain the population parameter
o A 95% individual t-based CI will control the per-comparison noncoverage error rate at alpha
• BUT it will inflate the familywise noncoverage error rate, i.e. with multiple CIs it becomes more likely that one of them won't include the parameter
• Remember that the noncoverage error rate is different to Type I error in that it is possible to make noncoverage errors when H0 is false
o So noncoverage FWER is more general than hypothesis test FWER, because it doesn't depend on the parameter value of the comparison
Week 2
Lecture 1
The Tukey Method
The range distribution
• For analysis restricted to all comparisons, the Tukey Multiple Comparison Procedure is the way to go
o It provides a test that controls the FWER (and non-coverage FWER) at alpha
• It's based on the range distribution, which is a sampling distribution based on the range of sample means
o Where the range of sample means = Mmax - Mmin
• Our range is influenced by:
o The range of population means ($\mu_{max} - \mu_{min}$)
o Population variance for each population ($\sigma_j^2$)
o Sample size (n)
• It can be shown that:
o When J population variances are assumed to be homogenous
• $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_J^2 = \sigma^2$
o Then across replications, Mmax – Mmin increases as Q increases
• Then we have two options, dependent on if we know the J variances or not
• If we do know the population variance (unlikely):
o For testing $H_0: \mu_{max} - \mu_{min} = 0$
o Reject H0 if the standardised range $\frac{M_{max} - M_{min}}{\sigma / \sqrt{n}}$ exceeds its critical value
o Note: rejecting this null also implies rejection of the homogeneity hypothesis (all population means are the same)
Tukey's with unknown variance
• If we don't know the population variance:
o We use the mean square error as an unbiased estimate for the population variance
o $MS_E = \frac{\text{pooled } SS}{\text{pooled } df} = \frac{SS_1 + SS_2 + \dots + SS_J}{(n_1 - 1) + (n_2 - 1) + \dots + (n_J - 1)} = \frac{SS_E}{J(n-1)}$
o Pooled SS is the sum of each within-group SS, and is called the SSE
o Pooled df is the sum of each within-group degrees of freedom (it's the df associated with the SSE)
• Note sometimes we see the symbol ‘nu’: $\nu = df_E = J(n - 1)$ to denote degrees of freedom
• When all the population means are homogenous, then the test statistic:
o $q = \frac{M_{max} - M_{min}}{\sqrt{MS_E / n}}$
o Has a studentised range distribution
o The critical value for this statistic is $q_c = q_{\alpha, J, \nu}$, and is available in statistical tables
• Note: the numerator of the q statistic is the range of sample means i.e. the maximal comparison
o The denominator of the q-stat is a function of MSE and sample size
• More often, it's convenient to work with the test statistic q*
o $q^* = \frac{M_{max} - M_{min}}{SE_{M_j - M_{j'}}}$
• In comparing q and q*, the numerator is the same
o But the denominator is the SE of a comparison
o $SE_{M_j - M_{j'}} = \sqrt{MS_E \times \frac{2}{n}}$
• We can then substitute and rearrange to find that
o $q^* = \frac{q}{\sqrt{2}}$
• The distribution of q* depends on:
o J, the number of groups
o v = J(n-1), the degrees of freedom
o Alpha, obviously
o Critical values are given in the stat tables handout
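The relationship between MSE, q and q* can be traced through on a small data set. The scores below are hypothetical (they are not course data) and were chosen so the arithmetic is easy to follow by hand: each group has a within-group SS of 10, so SSE = 30, df = 12 and MSE = 2.5.

```python
import math

# Hypothetical scores for J = 3 groups with n = 5 per group (not course data)
groups = [
    [12, 15, 11, 14, 13],
    [18, 17, 20, 16, 19],
    [14, 16, 15, 13, 17],
]

j = len(groups)
n = len(groups[0])
means = [sum(g) / n for g in groups]

# Pooled (error) sum of squares and mean square error
sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
df_error = j * (n - 1)
mse = sse / df_error

range_of_means = max(means) - min(means)     # the maximal comparison
q = range_of_means / math.sqrt(mse / n)      # studentised range statistic
se_diff = math.sqrt(mse * 2 / n)             # SE of a difference between two means
q_star = range_of_means / se_diff            # same numerator, SE as denominator

print(mse, round(q, 3), round(q_star, 3))
print(math.isclose(q_star, q / math.sqrt(2)))  # True
```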
Hypothesis tests with the q-stat
• For an overall test (maximal comparison):
o $H_0: \mu_{max} - \mu_{min} = 0$
• Reject H0 if $q^* = \frac{M_{max} - M_{min}}{SE_{M_j - M_{j'}}} > q^*_{\alpha, J, \nu}$
• For an alpha-level Tukey test on all comparisons:
o $H_0: \mu_j - \mu_{j'} = 0$, for $j, j' = 1, 2, 3, \dots, J$, $j \neq j'$
• Reject H0 if $q^* = \frac{|M_j - M_{j'}|}{SE_{M_j - M_{j'}}} > q^*_{\alpha, J, \nu}$
• Note the above two are effectively the same - both imply the homogeneity hypothesis (mu1 = mu2 = mu3 = …)
• Once we've done a maximal comparison test, what if we want to test each of the remaining comparisons?
o Well, we could recalculate the q*-statistic again for the remaining ones
• Or we can compare each comparison to the Honestly Significant Difference (HSD), which is effectively the smallest difference between two sample means for which we would reject H0:
o $HSD = q^*_{\alpha, J, \nu} \times SE_{M_j - M_{j'}}$
o It's sort of equivalent to converting a critical z-score back into raw-score units
• So if our HSD is 8.5, that means any comparison with a difference > 8.5 (absolute value) is significant
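The HSD shortcut amounts to computing one threshold and comparing every pairwise difference against it. In the sketch below the group means, the SE and the function name are hypothetical, and the critical value 2.67 is an illustrative placeholder: the real value of q* for a given alpha, J and df should be read from the stat tables.

```python
from itertools import combinations

def hsd_decisions(means, se_diff, q_star_crit):
    """Compare every pairwise difference in means against the HSD."""
    hsd = q_star_crit * se_diff
    decisions = {}
    for a, b in combinations(sorted(means), 2):
        diff = means[a] - means[b]
        decisions[(a, b)] = (diff, abs(diff) > hsd)
    return hsd, decisions

# Hypothetical group means and SE; q_star_crit is a placeholder, not a table lookup
means = {"T1": 13.0, "T2": 18.0, "C3": 15.0}
hsd, decisions = hsd_decisions(means, se_diff=1.0, q_star_crit=2.67)

print(hsd)
for pair, (diff, significant) in decisions.items():
    print(pair, diff, significant)
```

With these made-up numbers the T1 vs T2 and C3 vs T2 differences exceed the HSD, while C3 vs T1 does not.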
Tukey confidence intervals
• We can also create a Tukey Simultaneous Confidence Interval (SCI) for comparisons
• $\mu_j - \mu_{j'} \in (M_j - M_{j'}) \pm q^*_c \times SE_{M_j - M_{j'}}$
o Where $SE_{M_j - M_{j'}} = \sqrt{MS_E \times \frac{2}{n}}$
• The term simultaneous indicates that the non-coverage error rate is applied to the set of comparisons and controlled at the familywise level, rather than the individual level
• Hence the confidence interval controls, at alpha, the probability of making one or more non-coverage errors for the set of all comparisons
o But we still use the interval with one individual comparison at a time, e.g. mu4 - mu1
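A Tukey SCI has the same form as an individual t-based CI; only the critical value changes. The sketch below compares the two for one comparison. All inputs are hypothetical: the means and SE are made up, 2.67 is an illustrative placeholder for the tabled q* critical value, and 2.179 is the two-tailed t critical value for alpha = .05, df = 12.

```python
def interval(m_j, m_k, se_diff, crit):
    """Interval for mu_j - mu_k: (M_j - M_k) +/- crit * SE."""
    diff = m_j - m_k
    margin = crit * se_diff
    return diff - margin, diff + margin

# Hypothetical summary values for one comparison
m_j, m_k, se_diff = 18.0, 13.0, 1.0

tukey_lower, tukey_upper = interval(m_j, m_k, se_diff, crit=2.67)   # placeholder q* value
indiv_lower, indiv_upper = interval(m_j, m_k, se_diff, crit=2.179)  # t critical, df = 12

print(round(tukey_lower, 3), round(tukey_upper, 3))  # 2.33 7.67
print(round(indiv_lower, 3), round(indiv_upper, 3))  # 2.821 7.179
```

The simultaneous interval is wider, which is the price of controlling the non-coverage error rate for the whole set of comparisons rather than for one comparison at a time.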
Tukey in SPSS
• See instructions in the slides for using Tukey in SPSS
• SPSS doesn't produce q stats, so we have to refer to p-values or SCI limits
• SPSS can also create homogenous subsets, which groups sample means by those which are not significantly different from each other, but are different from those in other subsets
Properties of the Tukey procedure
• The Tukey procedure requires equal sample size per group
• Tukey HSD procedure controls the FWER at alpha for analyses including inferences on all comparisons
• For planned analysis on some (but not all) comparisons, the Tukey method is still valid but is quite conservative (as it will control for ALL comparisons)
o The FWER for a Tukey test on some comparisons will be lower than alpha
o The PCER for a Tukey test at alpha will be much lower than alpha
• For this reason, Tukey is recommended for an analysis looking at all comparisons
o But not if only some comparisons are of interest, e.g. if a researcher wants to only consider comparisons that compare an active treatment to the control condition, but not active treatments to each other
• The overall Tukey test can be used as the basis for a simultaneous test procedure (STP)
o An STP allows for tests on all comparison hypotheses implied by the homogeneity hypothesis
o i.e. if the homogeneity hypothesis is true, then all comparison null hypotheses are true
• In general, an STP:
o Uses a single critical value for all tests
• It controls the FWER at alpha
o Allows for construction of simultaneous CIs that control the non-coverage error rate for the whole set of comparisons at alpha
o Provides a coherent statistical analysis
• If the H0 for maximal comparison cannot be rejected, then no other comparison H0 can be rejected by the Tukey STP
• Because if the maximal comparison wasn't significant enough, then the other comparisons are smaller so they definitely won't be
• Note: SPSS generates Tukey tests and CI outputs from one-way ANOVA analysis
o This means it invites incoherent analysis, and so you should ignore the other parts it spits out
Lecture 2
Analysis of Variance
The one-way ANOVA model
• The one-way ANOVA model is applied to data from a single factor randomised experiment with J conditions (treatments, groups) and n participants per group
• A single factor randomised experiment is a study with a single independent variable (IV, treatment factor) with J levels and an equal number of participants per condition
o Ps are randomly allocated to levels of IV
• The single-factor fixed effects ANOVA model:
o $Y_{ij} = \mu + \alpha_j + \epsilon_{ij}$
• $Y_{ij}$ is the observed score for subject i in group j
• $\mu$ is a constant
• $\alpha_j$ is an effect parameter referring to treatment j (the IV is assumed to have J fixed levels)
• $\epsilon_{ij}$ is the error component for subject i in group j
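The model says each score is a constant plus a treatment effect plus random error, and that can be illustrated by generating data from it. This is a sketch under assumptions that go beyond the notes above: the parameter values are made up, the errors are drawn from a normal distribution with a common standard deviation, and the effects are constrained to sum to zero (a common identification convention).

```python
import random

def simulate_one_way(mu, effects, n, sigma, seed=3):
    """Generate scores Y_ij = mu + alpha_j + e_ij for each group j."""
    rng = random.Random(seed)
    return [[mu + a + rng.gauss(0, sigma) for _ in range(n)] for a in effects]

mu = 20.0
effects = [-3.0, 1.0, 2.0]      # hypothetical alpha_j values, summing to zero
data = simulate_one_way(mu, effects, n=200, sigma=4.0)

# Estimate the parameters back from the simulated scores
group_means = [sum(g) / len(g) for g in data]
grand_mean = sum(group_means) / len(group_means)       # estimate of mu
effect_estimates = [m - grand_mean for m in group_means]  # estimates of alpha_j

print(round(grand_mean, 2))
print([round(a, 2) for a in effect_estimates])
```

With equal group sizes the estimated effects sum to zero by construction, mirroring the constraint placed on the true effects.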