Two-Way Repeated Measures ANOVA with One-Between and One Within-Subjects Factor in SPSS

(1)

Two-way repeated measures ANOVA with one between-subjects and one within-subjects factor (SPSS version 29)

Mike Crowson, Ph.D.

January 2023

(2)

In this example, we will test whether individuals exposed to different psychotherapy treatments differ in how their anxiety levels change over the course of a brief six-week intervention period. Measures of anxiety were obtained prior to the onset of treatment (time 1), during mid-treatment (time 2), and immediately following the last session (time 3). The elapsed time between adjacent measurement occasions was approximately 2 weeks. In our dataset (see subset below), the between-subjects factor is “tx.group”, which is coded 1=wait-list control, 2=treatment A, 3=treatment B. The ‘anx.t1’, ‘anx.t2’, and ‘anx.t3’ variables are the repeated measurements of anxiety, occurring at times 1, 2, and 3, respectively.

To summarize, the results of our RM ANOVA will help us address the question, ‘Do individuals enrolled in the three treatment conditions exhibit differences in how they change over time with respect to anxiety?’ This question is addressed by the test of the effect of the time X treatment interaction on anxiety. Assuming this question is answered in the affirmative, we will conduct simple effects tests and probe a profile plot to ‘unpack’ the nature of the interaction. Technically, the question of whether individuals, irrespective of treatment condition, change in anxiety over the 6-week period is also tested in the RM ANOVA. However, the central focus of our analysis is on the interaction effect, since a significant interaction will qualify any statements regarding changes in anxiety over time. If the interaction is not significant, then our attention might naturally be directed back to trying to understand why changes were observed over time, irrespective of treatment.

(3)

A little terminology:

The repeated factor in our analysis is time (or measurement occasion). The repeated measure is anxiety, since the same measure is given to participants at the three measurement occasions. The between-subjects factor is treatment group.
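The roles of these variables can be made concrete with a small data sketch. The snippet below is a minimal illustration in Python, not part of the SPSS workflow. The two rows of anxiety scores are invented, and the underscore names (`tx_group`, `anx_t1`, and so on) are stand-ins for the dataset's ‘tx.group’ and ‘anx.t1’ through ‘anx.t3’ variables. It reshapes the wide layout (one row per participant) into a long layout in which the repeated factor becomes an explicit column.

```python
# Hypothetical wide-format rows: one row per participant (values are invented).
wide_rows = [
    {"id": 1, "tx_group": 1, "anx_t1": 101, "anx_t2": 97, "anx_t3": 95},
    {"id": 2, "tx_group": 3, "anx_t1": 96, "anx_t2": 98, "anx_t3": 85},
]


def wide_to_long(rows, measures=("anx_t1", "anx_t2", "anx_t3")):
    """Return one record per participant per measurement occasion."""
    long_rows = []
    for row in rows:
        for time, name in enumerate(measures, start=1):
            long_rows.append({
                "id": row["id"],
                "tx_group": row["tx_group"],  # between-subjects factor
                "time": time,                 # repeated (within-subjects) factor
                "anxiety": row[name],         # repeated measure
            })
    return long_rows


long_rows = wide_to_long(wide_rows)
for record in long_rows:
    print(record)
```

Each participant contributes one value of the between-subjects factor but three values of the repeated measure, one per level of the repeated factor.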

(4)

(5)

(6)

(7)

Under the Estimated Marginal Means menu, click ‘Compare main effects’ and ‘Compare simple main effects’. Selecting ‘Compare simple main effects’ invokes simple effects tests that allow us to test differences between measurement occasions within each treatment group, as well as differences between the three treatment groups at each measurement occasion. Assuming the interaction effect is significant, results generated with this option will allow us to better understand the changes in anxiety that occur within groups and (indirectly) compare differences in change between groups.

The selection of ‘compare main effects’ can be used to generate pairwise tests of mean differences in anxiety over time, irrespective of treatment group membership (which may or may not be of interest to you).

(8)

Plotting the mean anxiety scores by time and by time and treatment group.

(9)

If we are interested in testing pairwise differences in subjects’ average (i.e., averaged across time) level of anxiety, we can select Tukey’s post hoc tests. Quite frankly, I do not see the point in this selection given the focus of this study.

However, I have made the selection below for demonstrative purposes only.

(10)

Descriptive statistics for each group at each time point.

It so happens that we have unequal n’s associated with our treatment groups.

(11)

When performing a repeated measures ANOVA (with or without the specification of an interaction effect), there are two sets of test results that are produced in the output:

Multivariate test results

Univariate test results

(12)

Typically, a researcher will report either the univariate test results or the multivariate test results, but not both. Historically, researchers have tended to rely on the univariate tests when reporting their results. An assumption of the standard univariate test is sphericity (see table below left with row for ‘sphericity assumed’), a specific condition in which the variance-covariance matrix of an orthonormal transformation of the original repeated measures is proportional to an identity matrix (see Stevens, 2010). [A more restrictive condition called ‘compound symmetry’ (CS), where the variances of the repeated measure are equal across measurement occasions and the covariances are likewise equal, is not a requirement for RM ANOVA. For a nice little discussion of the difference between sphericity and CS, see https://www.youtube.com/watch?v=VCxoYW_QHsc] Citing Box (1954), Stevens (2010) notes that a violation of this assumption will result in a positively biased test result, which increases the Type 1 error probability. When this assumption is violated, one strategy is to report univariate tests that adjust the degrees of freedom for that violation, effectively making these tests more conservative (and decreasing Type 1 error probability). An alternative is to use the multivariate test results from your output, which do not make this assumption. As noted above, the historical tendency has been to rely on corrected tests [shown below right] when one has concerns about a violation of sphericity.

[Left table: standard test that assumes sphericity. Right table: corrected test results.]
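One way to build intuition for sphericity is that it is equivalent to the difference scores between every pair of measurement occasions having equal variances in the population. The sketch below is a minimal Python illustration of that idea, ignoring the grouping factor for simplicity. The scores and the `difference_variances` helper are invented for the example and are not taken from our dataset or from SPSS.

```python
from itertools import combinations
from statistics import variance

# Hypothetical scores for four participants at three occasions (invented values).
scores = {
    "t1": [10, 12, 14, 16],
    "t2": [11, 12, 15, 18],
    "t3": [9, 13, 13, 17],
}


def difference_variances(data):
    """Sample variance of the difference scores for every pair of occasions."""
    result = {}
    for a, b in combinations(data, 2):
        diffs = [x - y for x, y in zip(data[a], data[b])]
        result[(a, b)] = variance(diffs)
    return result


diff_vars = difference_variances(scores)
for pair, value in diff_vars.items():
    print(pair, round(value, 3))
```

The more these variances differ from one another, the further the data depart from sphericity. With only four invented cases the numbers themselves mean little; the point is what is being compared.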

(13)

Review of the multivariate test results in SPSS

(14)

Here, we have the multivariate test results for time (the within-subjects factor) and the time X group interaction.

Box’s test is a test of the assumption of equality of variance-covariance matrices of difference scores between groups (Weinfurt, 2000). This is an assumption of the multivariate tests below. If Box’s test is significant, then you have evidence of a violation. [Note: Box’s test is sensitive to multivariate nonnormality, which increases its rejection rate.]

Nevertheless, the multivariate test results below are robust when you have equal or nearly equal n’s in your groups (i.e., largest n/smallest n is less than 1.5). If the ratio of the largest n to the smallest n is larger than that, the multivariate test results can be biased. Depending on the nature of the violation (see Pituch & Stevens, 2016), the analyst might be at greater risk of committing a Type 1 or Type 2 error.

(15)

The main effect of time is statistically significant, Wilks’ lambda=.173, F(2,65)=155.687, p<.001. This effect, however, is qualified by a significant time X group interaction, Wilks’ lambda=.440, F(4,130)=16.480, p<.001.

A significant Box’s test result indicates a violation of homogeneity of covariance matrices, whereas a non-significant result is consistent with the MANOVA assumption.

Here, we see that p<.001, so there is evidence of a violation. Nevertheless, the ratio of the largest n to the smallest n is 26/19 = 1.368 (which is less than the 1.5 threshold suggested by Pituch & Stevens, 2016).
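The sample-size check above is simple arithmetic, sketched here in Python for convenience. Only the largest (26) and smallest (19) group sizes are stated in these slides, so only those two are used. The function name and the `threshold` parameter are illustrative.

```python
def n_ratio_ok(group_sizes, threshold=1.5):
    """Return (ratio, flag): largest n over smallest n, and whether it is below the threshold."""
    ratio = max(group_sizes) / min(group_sizes)
    return ratio, ratio < threshold


# Largest and smallest group sizes reported above.
ratio, nearly_equal = n_ratio_ok([26, 19])
print(round(ratio, 3), nearly_equal)
```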

Keep in mind, the focus of the study is on addressing the question of whether there are differences between groups in the changes observed on anxiety over the three measurement occasions. The statistically significant interaction provides an indication that this is the case.

(16)

Review of the univariate test results in SPSS

(17)

As noted previously, the sphericity assumption is required for the univariate tests that involve the within-subjects factor, i.e., the main effect of time and the time X group interaction (O’Brien & Kaiser, 1985).

One option for assessing this assumption is to look at the results for Mauchly’s Test of Sphericity. The null hypothesis for this test is that the assumption of sphericity is met. Rejection of the null hypothesis leads to a decision that sphericity is not evident in the data.

Thus, statistical significance is regarded as an indication of a violation of the sphericity assumption. Given Mauchly’s test is impacted by multivariate non-normality of your repeated measures and by sample size, it is NOT recommended as the primary basis for evaluating whether the sphericity condition has been met.

(18)

An alternative approach when judging sphericity and deciding which (if any) corrected univariate test to report requires examination of the Epsilon (ε) parameters provided in the table above. Epsilon quantifies the departure of your data from sphericity. Three different versions are provided in the output. According to Stevens (2010), the Greenhouse-Geisser ε tends to underestimate ε (which means it underestimates the degree to which sphericity is met) and the Huynh-Feldt ε overestimates ε (meaning it overestimates the degree to which sphericity is met). [The lower bound estimate is likely an overly conservative estimate of ε. As such, we will not consider it when deciding on which univariate test to report.]

For more information, see https://www.ibm.com/docs/en/spss-statistics/28.0.0?topic=variance-mauchlys-test-sphericity
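To make ε less abstract, the sketch below computes the Greenhouse-Geisser estimate from a covariance matrix of the repeated measures using the double-centering form of Box's formula, ε = (tr D)² / ((k − 1) · Σ d_ij²), where D is the double-centered covariance matrix and k is the number of occasions. This is an illustration in plain Python, not SPSS's own code, and the example matrices are invented. In a design with a between-subjects factor, the input would presumably be the pooled within-group covariance matrix, which is assumed here rather than shown in the output.

```python
def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon for a symmetric k x k covariance matrix."""
    k = len(cov)
    row_means = [sum(row) / k for row in cov]
    grand_mean = sum(row_means) / k
    # Double-center the matrix: remove row means and column means, add back the grand mean.
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_of_squares = sum(value * value for row in centered for value in row)
    return trace ** 2 / ((k - 1) * sum_of_squares)


# Invented example 1: compound symmetry (equal variances, equal covariances).
cs_matrix = [[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]]
# Invented example 2: unequal covariances, so sphericity does not hold exactly.
other_matrix = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]

eps_cs = greenhouse_geisser_epsilon(cs_matrix)
eps_other = greenhouse_geisser_epsilon(other_matrix)
lower_bound = 1 / (len(cs_matrix) - 1)
print(round(eps_cs, 3), round(eps_other, 3), lower_bound)
```

A compound-symmetric matrix yields ε = 1, and any estimate falls between the lower bound 1/(k − 1) and 1. With three occasions, that lower bound is .5.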

According to Verma (2016), if the Greenhouse-Geisser ε is < .75, one should report on the univariate test where the Greenhouse-Geisser correction has been made (see table of “Tests of within-subjects effects”); if the Greenhouse-Geisser ε falls between .75 and 1, then report on the univariate test result where the Huynh-Feldt correction has been made.
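This rule of thumb amounts to a simple threshold check, sketched below in Python. The function name is illustrative, and the ε values passed in are made up rather than read from our output.

```python
def choose_corrected_test(gg_epsilon):
    """Apply the .75 rule of thumb to a Greenhouse-Geisser epsilon estimate."""
    if not 0 < gg_epsilon <= 1:
        raise ValueError("epsilon must lie in (0, 1]")
    if gg_epsilon < 0.75:
        return "Greenhouse-Geisser"
    return "Huynh-Feldt"


# Made-up epsilon values for illustration.
print(choose_corrected_test(0.60))
print(choose_corrected_test(0.90))
```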

(19)

Using the rule of thumb articulated in the previous slide, we see that the Greenhouse-Geisser ε falls between .75 and 1. Thus, we would use the Huynh-Feldt test when interpreting the univariate test results if we choose to use a corrected test under the assumption that sphericity is violated. That said, these values are so close to 1 that it seems reasonable to assume sphericity is met.

Notably, all three test results lead to the same conclusions regarding the main and interaction effects in the model.

(20)

With respect to the sphericity assumed tests:

The main effect of time on anxiety scores is statistically significant, F(2,132)=120.752, p<.001.

This effect was qualified by a significant time X group interaction effect, F(4,132)=20.658, p<.001.
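As a check on the degrees of freedom reported above, the sketch below computes the sphericity-assumed df for a design with one between-subjects and one within-subjects factor. The total sample size of 69 is not stated on these slides. It is inferred here from the between-subjects error df of 66 reported later (N minus 3 groups), so treat it as an assumption. The function name is illustrative.

```python
def mixed_anova_df(n_total, n_groups, n_occasions):
    """Sphericity-assumed (effect df, error df) for each test in the design."""
    between_error = n_total - n_groups
    within_error = between_error * (n_occasions - 1)
    return {
        "time": (n_occasions - 1, within_error),
        "time x group": ((n_groups - 1) * (n_occasions - 1), within_error),
        "group": (n_groups - 1, between_error),
    }


# N = 69 is inferred, not reported directly (see lead-in).
df = mixed_anova_df(n_total=69, n_groups=3, n_occasions=3)
print(df)
```

Under these assumptions the function returns (2, 132) for time and (4, 132) for the interaction, matching the F tests above.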

(21)

The results provided in this table can be useful for (a) describing the nature of change over time on your repeated measure (irrespective of group membership; see occasion portion of output) and (b) judging whether there are between-group differences in observed changes (see interaction portion of the output).

Included in the table are (a) tests of lower-order and higher-order (trend) components associated with a polynomial function (see occasion) and (b) tests of whether there are differences among groups in those change components (occasion x group).

Keep in mind that with only two measurement occasions, the highest order trend that can be tested is a linear trend. With three measurement occasions, the highest-order trend that can be tested is a quadratic trend. With four measurement occasions, the highest-order trend that can be tested is a cubic trend; and so on.

Note: Interpretation of these results will only make sense in a research context where the levels of the repeated factor are ordinal in nature. The repeated factor in the current study is time, with values naturally ordered as 1, 2, and 3.
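The trend components can be illustrated with orthogonal polynomial contrast coefficients. The Python sketch below uses unnormalized integer coefficients for equally spaced levels (SPSS reports normalized versions, so its contrast estimates are scaled differently). It applies them to the estimated marginal means for time reported later in the output (99.177, 97.732, 92.28). The names `POLY_COEFFS` and `trend_contrasts` are illustrative, and the contrast values are descriptive only; they are not the F tests in the table.

```python
# Unnormalized orthogonal polynomial coefficients for equally spaced levels.
POLY_COEFFS = {
    2: {"linear": (-1, 1)},
    3: {"linear": (-1, 0, 1), "quadratic": (1, -2, 1)},
    4: {
        "linear": (-3, -1, 1, 3),
        "quadratic": (1, -1, -1, 1),
        "cubic": (-1, 3, -3, 1),
    },
}


def trend_contrasts(means):
    """Apply each available trend contrast to a sequence of ordered means."""
    coeffs = POLY_COEFFS[len(means)]
    return {
        name: sum(c * m for c, m in zip(weights, means))
        for name, weights in coeffs.items()
    }


# Estimated marginal means for times 1, 2, and 3 (from the output).
time_means = (99.177, 97.732, 92.28)
trends = trend_contrasts(time_means)
for name, value in trends.items():
    print(name, round(value, 3))
```

With k occasions there are k − 1 contrasts, which is why the quadratic is the highest-order trend available here. The negative quadratic value indicates that the time 2 mean sits above the straight line joining the time 1 and time 3 means.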

(22)

It is worth mentioning that generally one would not attempt to interpret the test results in this table if the univariate test results from the previous RM ANOVA were not significant. Of course, if the interaction (which, in our case, was used to address our main question regarding between-group differences in change over time) is significant, we would pay attention to the interaction tests in the output above to describe differences in trending between groups (even if the main effect in the previous output was not significant).

(23)

Although the issue of between-group differences in change over time is of greatest interest to us, I will begin by interpreting the main effects tests for occasion (time) in the output above.

In our case, we had three measurement occasions, so the highest order trend that can be tested is quadratic.

When determining whether changes in means across time (irrespective of group membership) are best described as following a linear or quadratic trajectory, we start by examining the test of the higher-order trend component: in this case, the quadratic component. If this effect is significant (which it is; p<.001), then we would most likely conclude the change trajectory follows a quadratic functional form (irrespective of whether the linear component is significant). If the higher-order term is not significant and the linear term is significant, then we would conclude the change trajectory is linear.

In general, the highest-order term in the polynomial function that is statistically significant should be used to define the nature of change over time (irrespective of whether the lower-order components are significant). [Keep in mind too that I am glossing over more subtle issues in interpretation, such as effect size.]

(24)

The highlighted area above suggests the changes in means on anxiety follow a quadratic functional form. Below, the profile plot helps us to visualize the change trajectory. With a quadratic functional form, the pattern is more ‘bowl-shaped’.

(25)

The next portion of our output can be used to test whether the change trajectory associated with anxiety scores differs between groups.

Looking at the highest-order component (quadratic), we see that the test of the occasion X treatment group interaction is statistically significant, F(2,66)=23.903, p<.001. This suggests that the treatment groups differ in quadratic change.

Here is our profile plot allowing us to visualize the change trajectories in the three groups. The means in treatment group B show the most pronounced quadratic change trajectory, with anxiety being the highest mid-treatment. In treatment group A, we see a small decrease in means between occasions 1 and 2 and a slightly greater decrease between occasions 2 and 3. In the control group, we see a decrease in anxiety from occasion 1 to 2 and a slightly smaller decrease from occasion 2 to 3.
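The pattern described above can be checked numerically. The Python sketch below takes the estimated marginal means for each group at each occasion (as reported in the simple effects output later in these slides). It computes the change between adjacent occasions along with the change in that change, which equals the unnormalized quadratic contrast. The variable and function names are illustrative.

```python
# Estimated marginal means at times 1, 2, 3 for each group (from the output).
group_means = {
    "control": (100.0, 96.0, 94.0),
    "treatment A": (101.5, 99.992, 96.76),
    "treatment B": (96.03, 97.204, 86.08),
}


def describe_change(means):
    """Return the two successive changes and their difference (second difference)."""
    first_change = means[1] - means[0]
    second_change = means[2] - means[1]
    return first_change, second_change, second_change - first_change


changes = {group: describe_change(m) for group, m in group_means.items()}
for group, (c1, c2, curvature) in changes.items():
    print(group, round(c1, 3), round(c2, 3), round(curvature, 3))

# Group with the largest curvature in absolute value.
most_curved = max(changes, key=lambda g: abs(changes[g][2]))
print(most_curved)
```

Treatment B shows a small increase followed by a large decrease, giving it by far the largest curvature in absolute value, which is consistent with the profile plot.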

(26)

I will be straight with you. The results here (given the context of our research) are largely uninformative. The ANOVA in the table below is testing between-group differences in the average of the repeated measurements on anxiety over time, which is something we really are not interested in.

Nevertheless, the test indicates a significant difference in the average level of anxiety between groups, F(2,66)=28.949, p<.001.

(27)

Here, we have Bonferroni-adjusted pairwise contrasts that allow us to test for differences in anxiety between time points. Here we see all pairwise differences on anxiety are statistically significant (p’s≤.016).

The results in the next portion of the output contain marginal means (think of these as adjusted means) and follow-up tests we requested earlier on (under the EMMEANS menu). When reporting on these tests, be sure to reference the marginal means since these are what are used in the computations.

Below, you see marginal means for anxiety at time 1 (mean=99.177), time 2 (mean=97.732), and time 3 (mean=92.28).

(28)

Given our selection earlier under the EMMEANS tab, Bonferroni adjustments were made for multiple tests.
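For readers unfamiliar with the adjustment, a Bonferroni-adjusted p-value is the unadjusted p-value multiplied by the number of comparisons, capped at 1. The Python sketch below shows the arithmetic. The raw p-values are invented for illustration and are not taken from our output.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented unadjusted p-values for three pairwise comparisons.
raw_p = [0.001, 0.02, 0.4]
adjusted_p = bonferroni_adjust(raw_p)
print([round(p, 3) for p in adjusted_p])
```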

The Pairwise Comparisons table contains tests of differences in anxiety (averaged over time) between groups. As discussed earlier, this is not of substantive interest in our analysis.

Nevertheless, for completeness I will note that all pairwise group differences are statistically significant (with p’s ≤ .004).

(29)

These comparisons are based on Tukey’s tests. Again, these results are not of substantive interest in our analysis.

(30)

Here, we have the estimated marginal means on anxiety for each group at each measurement occasion.

To better ‘unpack’ the effect of the interaction between treatment group and time/measurement occasion on anxiety, we can perform simple effects tests. When we made our selections earlier in SPSS, we clicked ‘Compare simple main effects’. As a result, our output now contains these tests (which rely on the marginal means in the table to the left). [Note: the simple effects tests will also utilize a Bonferroni adjustment.]

(31)

In the table below, we have simple effects tests, involving tests of differences between treatment groups at each measurement occasion.

At time 1, the marginal mean for the control group was less than the marginal mean for treatment A (difference is computed as: 100-101.5 = -1.5); however, the difference was not significant. The mean for the control group was greater than the mean for treatment B (difference is computed as: 100-96.03 = 3.97); the difference was significant (p<.001). The mean for treatment A was greater than the mean for treatment B (difference is computed as: 101.5-96.03 = 5.47); the difference was significant (p<.001).

At time 2, the marginal mean for the control group was less than that for treatment A (difference: 96-99.992 = -3.992), with the difference being significant (p=.004). The marginal mean for the control group was less than that for treatment B (difference: 96-97.204 = -1.204), with the difference being non-significant (p=.859). The mean for treatment A was greater than the mean for treatment B (difference: 99.992-97.204 = 2.788); the difference was non-significant (p=.075).

(32)

At time 3, the marginal mean for the control group was less than that for treatment A (difference: 94-96.76 = -2.76), with the difference being significant (p=.013). The marginal mean for the control group was greater than that for treatment B (difference: 94-86.08 = 7.92), with the difference being significant (p<.001). The mean for treatment A was greater than the mean for treatment B (difference: 96.76-86.08 = 10.68); the difference was significant (p<.001).

(33)

The differences we were testing involved these marginal means within measurement occasion.

As reflected in the test results, the pairwise differences in means appeared to vary across measurement occasion.
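The mean differences reported for these between-group simple effects can be reproduced from the marginal means with a few lines of Python. This sketch only redoes the subtraction; it does not compute the significance tests. The names are illustrative, and the means are those reported in the output.

```python
from itertools import combinations

# Estimated marginal means by measurement occasion and group (from the output).
cell_means = {
    1: {"control": 100.0, "treatment A": 101.5, "treatment B": 96.03},
    2: {"control": 96.0, "treatment A": 99.992, "treatment B": 97.204},
    3: {"control": 94.0, "treatment A": 96.76, "treatment B": 86.08},
}


def pairwise_differences(means_by_group):
    """Difference (first minus second) for every pair of groups."""
    return {
        (a, b): means_by_group[a] - means_by_group[b]
        for a, b in combinations(means_by_group, 2)
    }


differences = {time: pairwise_differences(g) for time, g in cell_means.items()}
for time, pairs in differences.items():
    for (a, b), diff in pairs.items():
        print(f"time {time}: {a} - {b} = {diff:.3f}")
```

Laying the differences out this way makes it easy to see that the gap between treatment B and the other two groups is widest at time 3.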

(34)

In this section of our output, the pairwise tests below are simple effects tests of differences in means between measurement occasions for each treatment group.

Within the control group, the difference in anxiety between time 1 and time 2 was significant (p<.001), with anxiety being higher at time 1 (mean difference computed as: 100-96 = 4). The difference in anxiety between time 1 and time 3 was significant (p<.001), with anxiety being higher at time 1 (mean difference: 100-94 = 6). The difference in anxiety between time 2 and time 3 was significant (p=.044), with anxiety being higher at time 2 (mean difference: 96-94 = 2).

Within treatment A, the difference in anxiety between time 1 and time 2 was not significant (p=.348; mean difference: 101.5-99.992 = 1.508). The difference between time 1 and time 3 was significant (p<.001), with anxiety higher at time 1 (mean difference: 101.5-96.76 = 4.74). The difference between time 2 and time 3 was significant (p=.003), with anxiety higher at time 2 than time 3 (mean difference: 99.992-96.76 = 3.232).

(35)

Within treatment B, the difference in anxiety between time 1 and time 2 was not significant (p=.505; mean difference: 96.03-97.204 = -1.174). The difference between time 1 and time 3 was significant (p<.001), with anxiety higher at time 1 (mean difference: 96.03-86.08 = 9.95). The difference between time 2 and time 3 was significant (p<.001), with anxiety higher at time 2 than time 3 (mean difference: 97.204-86.08 = 11.124).

(36)

In terms of reporting, it would be unusual to report on both sets of simple effects tests (as I have done here). My recommendation is to think about which set of test results best coincides with your primary research objective. In our study, we essentially were testing whether change over time in level of anxiety is moderated by treatment condition. We were expecting change to differ depending on which treatment group our participants belonged to. For this reason, I personally would be inclined to report on the latter results (the within-group tests across measurement occasions), since they further highlight differences in change between groups.

(37)

References

Lomax, R.G., & Hahs-Vaughn, D.L. (2012). An introduction to statistical concepts (3rd ed.). New York: Routledge.

O’Brien, R.G., & Kaiser, M.K. (1985). MANOVA method for analyzing repeated measures designs: An extensive primer. Psychological Bulletin, 97, 316-333.

Pituch, K.A., & Stevens, J.P. (2016). Applied multivariate statistics for the social sciences (6th ed.). New York: Routledge.

Stevens, J.P. (2010). Applied multivariate statistics for the social sciences (5th ed.). New York: Routledge.

Verma, J.P. (2016). Repeated measures designs for empirical researchers. John Wiley & Sons.

Weinfurt, K.P. (2000). Repeated measures analysis: ANOVA, MANOVA, and HLM. In L.G. Grimm & P.R. Yarnold (Eds.), Reading and understanding more multivariate statistics (pp. 317-361). Washington, DC: American Psychological Association.