
5.3 If there are more than two samples: ANOVA

5.3.1 One way

statistical test to find if this difference is significant. Report also the confidence interval and effect size.

The null hypothesis here is that all samples belong to the same population (“are not different”), and the alternative hypothesis is that at least one sample is divergent, i.e., does not belong to the same population (“samples are different”).

In terms of p-values:

p-value < 0.05: abandon the null, go to the alternative (at least one sample is divergent)

p-value ≥ 0.05: stay with the null (all samples are not different)

The idea of ANOVA is to compare variances: (1) grand variance within the whole dataset, (2) total variance within samples (subsets in long form or columns in short form) and (3) variance between samples (columns, subsets). Figure 5.6 explains it using the example of multiple apple samples mixed with a divergent tomato sample.

If any sample came from a different population, then the variance between samples should be at least comparable with (or larger than) the variance within samples; in other words, the F-value (or F-ratio), which is the between-sample variance divided by the within-sample variance, should be ≥ 1. To check that inferentially, the F-test is applied. If the p-value is small enough, then at least one sample (subset, column) is divergent.

ANOVA does not reveal which sample is different. This is because variances in ANOVA are pooled. But what if we still need to know that? Then we should apply post hoc tests. It is not required to run them after ANOVA; what is required is to perform them carefully and always apply a p-value adjustment for multiple comparisons. This adjustment typically increases p-values to counteract the accumulation of false positives from multiple tests.
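The accumulation mentioned above can be illustrated with a minimal sketch. It assumes that all pairwise tests are independent and use the same threshold, which is a simplification; the function name fwer() is hypothetical and introduced only for this illustration.

```r
# Chance of at least one false positive among all pairwise comparisons
# of k groups, ASSUMING independent tests at the same alpha (a simplification)
fwer <- function(k, alpha = 0.05) {
  m <- choose(k, 2)      # number of pairwise comparisons
  1 - (1 - alpha)^m      # probability of at least one false positive
}
sapply(c(3, 4, 6), fwer)
```

Under these assumptions, even three groups (three comparisons) raise the chance of a false positive from 5% to about 14%, which is why the adjustment is needed.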

ANOVA and post hoc tests answer different research questions, so it is up to the researcher to decide which to perform and when.

* * *

ANOVA is a parametric method, and this typically goes well with its first assumption, normal distribution of residuals (deviations between observed and expected values). Typically, we check normality of the whole dataset because ANOVA uses pooled data anyway. It is also possible to check normality of residuals directly (see below). Please note that ANOVA tolerates mild deviations from normality, both in data and in residuals. But if the data is clearly not normal, it is recommended to use other methods (see below).

Figure 5.6: Core idea of ANOVA: compare within and between variances.

The second assumption is homogeneity of variance (homoscedasticity), or, simpler, similarity of variances. This is more important and means that sub-samples were collected with similar methods.

The third assumption is more general. It was already described in the first chapter: independence of samples. “Repeated measurements ANOVA” is possible, however, but requires a more specific approach.

All assumptions must be checked before analysis.

* * *

The best way of data organization for the ANOVA is the long form explained above: two variables, one of them contains numerical data, whereas the other describes grouping (in R terminology, it is a factor). Below, we load artificial data which describes three types of hair color, height (in cm) and weight (in kg) of 90 persons:

> hwc <- read.table("data/hwc.txt", h=TRUE)

> str(hwc)

'data.frame': 90 obs. of 3 variables:

$ COLOR : Factor w/ 3 levels "black","blond",..: 1 1 1 ...

$ WEIGHT: int 80 82 79 80 81 79 82 83 78 80 ...

$ HEIGHT: int 166 170 170 171 169 171 169 170 167 166 ...

> boxplot(WEIGHT ~ COLOR, data=hwc, ylab="Weight, kg")

(Note that notches and other “bells and whistles” do not help here because we want to estimate joint differences; raw boxplot is probably the best choice.)
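If the data comes in the short form (one column per sample), it can be converted into the long form used above. The sketch below uses base R stack(); the small data frame and its values are made up purely for illustration.

```r
# Hypothetical short-form data: one column per sample (values are made up)
short <- data.frame(black = c(80, 82, 79),
                    blond = c(75, 77, 76),
                    brown = c(76, 78, 77))
# stack() returns two columns: 'values' (numeric) and 'ind' (factor)
long <- stack(short)
names(long) <- c("WEIGHT", "COLOR")
str(long)
# unstack() goes back to the short form (here groups are of equal size)
unstack(long, WEIGHT ~ COLOR)
```

The resulting long object has the same layout as hwc (without the HEIGHT column) and can be used directly in formula interfaces like WEIGHT ~ COLOR.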

> sapply(hwc[sapply(hwc, is.numeric)], Normality) # shipunov
  WEIGHT   HEIGHT
"NORMAL" "NORMAL"

> tapply(hwc$WEIGHT, hwc$COLOR, var)
   black    blond    brown
8.805747 9.219540 8.896552

(Note the use of double sapply() to check normality only for measurement columns.) It looks like both assumptions are met: variances are at least similar, and variables are normal. Now we run the core ANOVA:

> wc.aov <- aov(WEIGHT ~ COLOR, data=hwc)

> summary(wc.aov)

            Df Sum Sq Mean Sq F value   Pr(>F)
COLOR        2  435.1  217.54   24.24 4.29e-09 ***
Residuals   87  780.7    8.97
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This output is slightly more complicated than output from two-sample tests, but contains similar elements (from most to least important):

1. p-value (expressed as Pr(>F)) and its significance;

2. statistic (F value);

3. degrees of freedom (Df)

All above numbers should go to the report. In addition, there are also:

4. variation within columns (sum of squares, Sum Sq, for Residuals);

5. variation between columns (Sum Sq for COLOR);

6. mean squares, which are the actual variance estimates (Sum Sq divided by Df)

(The grand sum of squares is just a sum of the sums of squares between and within columns.)

If degrees of freedom are already known, it is easy enough to calculate F value and p-value manually, step by step:

> df1 <- 2

> df2 <- 87

> group.size <- 30

> (sq.between <- sum(tapply(hwc$WEIGHT, hwc$COLOR,
+ function(.x) (mean(.x) - mean(hwc$WEIGHT))^2))*group.size)
[1] 435.0889

> (mean.sq.between <- sq.between/df1)
[1] 217.5444

> (sq.within <- sum(tapply(hwc$WEIGHT, hwc$COLOR,
+ function(.x) sum((.x - mean(.x))^2))))
[1] 780.7333

> (mean.sq.within <- sq.within/df2)
[1] 8.973946

> (f.value <- mean.sq.between/mean.sq.within)
[1] 24.24178

> (p.value <- (1 - pf(f.value, df1, df2)))
[1] 4.285683e-09
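The manual steps above rely on equal group sizes (group.size). A possible generalization to unequal groups is sketched below; the helper manual.anova() is hypothetical, written only for this illustration, and it is checked here against the built-in chickwts data where group sizes differ.

```r
# Hypothetical helper: one-way ANOVA "by hand" for any group sizes
manual.anova <- function(y, g) {
  g <- factor(g)
  n.i <- tapply(y, g, length)                  # group sizes
  m.i <- tapply(y, g, mean)                    # group means
  sq.between <- sum(n.i * (m.i - mean(y))^2)   # between-group sum of squares
  sq.within <- sum(tapply(y, g, function(.x) sum((.x - mean(.x))^2)))
  df1 <- nlevels(g) - 1
  df2 <- length(y) - nlevels(g)
  f.value <- (sq.between/df1) / (sq.within/df2)
  c(df1 = df1, df2 = df2, F = f.value,
    p = pf(f.value, df1, df2, lower.tail = FALSE))
}
res <- manual.anova(chickwts$weight, chickwts$feed)
res
# Compare with the built-in procedure
summary(aov(weight ~ feed, data = chickwts))[[1]][1, c("F value", "Pr(>F)")]
```

Using lower.tail = FALSE in pf() instead of subtracting from 1 is a design choice which keeps precision for very small p-values.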

Of course, R calculates all of that automatically, and also takes into account all possible variants of calculations required for data with another structure. Related to the above example is also that to report ANOVA, most researchers list three things: two values for degrees of freedom, F value and, of course, p-value.

All in all, this ANOVA p-value is so small that H₀ should be rejected in favor of the hypothesis that at least one sample is different. Remember, ANOVA does not tell which sample it is, but boxplots (Fig. 5.7) suggest that this might be people with black hair.
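The numbers for the report (degrees of freedom, F value and p-value) can be taken from the ANOVA table programmatically. Below is a minimal sketch on the built-in PlantGrowth data; the format of the report string is just one possible convention, not a required one.

```r
# Extract the numbers needed for a report from the ANOVA table
plants.aov <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(plants.aov)[[1]]    # data frame with Df, Sum Sq, Mean Sq, ...
report <- sprintf("F(%d, %d) = %.2f, p = %.4f",
                  as.integer(tab$Df[1]), as.integer(tab$Df[2]),
                  tab[1, "F value"], tab[1, "Pr(>F)"])
report
```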

* * *

To check the second assumption of ANOVA, that variances should be at least similar (homogeneous), it is sometimes enough to look at the variance of each group with tapply() as above or with aggregate():

> aggregate(hwc[,-1], by=list(COLOR=hwc[, 1]), var)
  COLOR   WEIGHT   HEIGHT
1 black 8.805747 9.154023
2 blond 9.219540 8.837931
3 brown 8.896552 9.288506

Figure 5.7: Is there a weight difference between people with different hair color? (Artificial data.)

But better is to test if variances are equal with, for example, bartlett.test() which has the same formula interface:

> bartlett.test(WEIGHT ~ COLOR, data=hwc)
Bartlett test of homogeneity of variances
data: WEIGHT by COLOR
Bartlett's K-squared = 0.016654, df = 2, p-value = 0.9917

(The null hypothesis of the Bartlett test is the equality of variances.)

An alternative is the nonparametric Fligner-Killeen test:

> fligner.test(WEIGHT ~ COLOR, data=hwc)

Fligner-Killeen test of homogeneity of variances

data: WEIGHT by COLOR

Fligner-Killeen:med chi-squared = 1.1288, df = 2, p-value = 0.5687

(Null is the same as in Bartlett test.)

The first assumption of ANOVA could also be checked here directly:

> Normality(wc.aov$residuals)
[1] "NORMAL"
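Normality() comes from the shipunov package. If it is not at hand, residuals could be checked with base R tools; the sketch below assumes that the Shapiro-Wilk test and a quantile-quantile plot are acceptable substitutes, and uses the built-in PlantGrowth data.

```r
# Base-R check of residual normality, built-in data
plants.aov <- aov(weight ~ group, data = PlantGrowth)
plants.res <- residuals(plants.aov)
sw <- shapiro.test(plants.res)
sw$p.value    # a p-value above the threshold means no evidence against normality
# Visual check: points should lie close to the line
qqnorm(plants.res)
qqline(plants.res)
```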

* * *

Effect size of ANOVA is called η² (eta squared). There are many ways to calculate eta squared, but the simplest is derived from the linear model (see the next sections). It is handy to define η² as a function:

> Eta2 <- function(aov)
+ {
+ summary.lm(aov)$r.squared
+ }

and then use it for results of both classic ANOVA and one-way test (see below):

> (ewc <- Eta2(wc.aov))
[1] 0.3578557

> Mag(ewc) # shipunov
[1] "high"

The second function is an interpreter for η² and similar effect size measures (like the r correlation coefficient or R² from a linear model).

If there is a need to calculate effect sizes for each pair of groups, two-sample effect size measurements like the coefficient of divergence (Lyubishchev’s K) are applicable.
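To see why R² of the linear model works as η² in the one-way case, it can be compared with the ratio of the between-group sum of squares to the total sum of squares. A minimal sketch on the built-in PlantGrowth data:

```r
# Eta squared from sums of squares, compared with R-squared of the model
plants.aov <- aov(weight ~ group, data = PlantGrowth)
ss <- summary(plants.aov)[[1]][, "Sum Sq"]   # between groups, then residuals
eta2.ss <- ss[1] / sum(ss)                   # between / total
eta2.r2 <- summary.lm(plants.aov)$r.squared
c(eta2.ss, eta2.r2)
```

Both numbers should coincide, so in one-way designs either way of calculation could be used.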

* * *

One more example of classic one-way ANOVA comes from the data embedded in R (make boxplot yourself):

> Normality(chickwts$weight)
[1] "NORMAL"

> bartlett.test(weight ~ feed, data=chickwts)
Bartlett test of homogeneity of variances
data: weight by feed
Bartlett's K-squared = 3.2597, df = 5, p-value = 0.66

> boxplot(weight ~ feed, data=chickwts)

> chicks.aov <- aov(weight ~ feed, data=chickwts)

> summary(chicks.aov)

            Df Sum Sq Mean Sq F value   Pr(>F)
feed         5 231129   46226   15.37 5.94e-10 ***
Residuals   65 195556    3009
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> Eta2(chicks.aov)
[1] 0.5416855

> Mag(Eta2(chicks.aov)) # shipunov
[1] "very high"

Consequently, there is a very high difference between weights of chickens on different diets.

* * *

If there is a goal to find the divergent sample(s) statistically, one can use the post hoc pairwise t-test, which takes into account the problem of multiple comparisons described above; this is just a compact way to run many t-tests and adjust the resulting p-values:

> pairwise.t.test(hwc$WEIGHT, hwc$COLOR)

Pairwise comparisons using t tests with pooled SD
data: hwc$WEIGHT and hwc$COLOR

      black   blond
blond 1.7e-08 -
brown 8.4e-07 0.32

P value adjustment method: holm

(This test uses by default the Holm method of p-value correction. Another way is the Bonferroni correction explained below. All available ways of correction are accessible through the p.adjust() function.)
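How the correction changes p-values is easy to see on a toy example. The three raw p-values below are made up for illustration only:

```r
# Three hypothetical raw p-values (made up for illustration)
p.raw <- c(0.01, 0.02, 0.04)
p.holm <- p.adjust(p.raw, method = "holm")        # 0.03 0.04 0.04
p.bonf <- p.adjust(p.raw, method = "bonferroni")  # 0.03 0.06 0.12
rbind(p.raw, p.holm, p.bonf)
p.adjust.methods    # names of all available methods
```

Holm adjustment is never more conservative than Bonferroni, which is probably one of the reasons why it is used as the default.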

Similar to the result of pairwise t-test (but more detailed) is the result of Tukey Honest Significant Differences test (Tukey HSD):

> TukeyHSD(wc.aov)

Tukey multiple comparisons of means
95% family-wise confidence level

Fit: aov(formula = WEIGHT ~ COLOR, data=hwc)

$COLOR
                  diff       lwr       upr     p adj
blond-black -5.0000000 -6.844335 -3.155665 0.0000000
brown-black -4.2333333 -6.077668 -2.388999 0.0000013
brown-blond  0.7666667 -1.077668  2.611001 0.5843745
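The result of TukeyHSD() is a list of matrices, so significant pairs could be extracted without reading the table by eye. The sketch below uses the built-in PlantGrowth data (the hwc file is not assumed to be available); the 0.05 threshold is the conventional one.

```r
plants.aov <- aov(weight ~ group, data = PlantGrowth)
tk <- TukeyHSD(plants.aov)$group    # matrix with columns diff, lwr, upr, p adj
# Pairs whose adjusted p-value is below the threshold
signif.pairs <- rownames(tk)[tk[, "p adj"] < 0.05]
signif.pairs
# Confidence intervals which do not cross zero correspond to significant pairs
plot(TukeyHSD(plants.aov))
```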

Are our groups different also by heights? If yes, are black-haired still different?

Post hoc tests output p-values, so they do not measure the size of differences. If there is a need to calculate group-to-group effect sizes, two-sample effect measures (like Lyubishchev’s K) are generally applicable. To understand pairwise effects, you might want to use the custom function pairwise.Eff(), which is based on double sapply():

> pairwise.Eff(hwc$WEIGHT, hwc$COLOR, eff="cohen.d") # shipunov

      black        blond         brown
black
blond 1.67 (large)
brown 1.42 (large) -0.25 (small)
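For one pair of groups, an effect size of this kind could also be calculated by hand. The sketch below implements Cohen's d with pooled standard deviation; the helper cohen.d.manual() is hypothetical, and it is an assumption that it follows the same convention as pairwise.Eff().

```r
# Hypothetical helper: Cohen's d with pooled standard deviation
cohen.d.manual <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  sd.pooled <- sqrt(((nx - 1)*var(x) + (ny - 1)*var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sd.pooled
}
w <- split(PlantGrowth$weight, PlantGrowth$group)   # list of three samples
d <- cohen.d.manual(w$trt2, w$trt1)
d
```

The sign of d depends only on the order of arguments; the magnitude is what is usually interpreted.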

* * *

Next example is again from the embedded data (make boxplot yourself):

> Normality(PlantGrowth$weight)
[1] "NORMAL"

> bartlett.test(weight ~ group, data=PlantGrowth)
Bartlett test of homogeneity of variances
data: weight by group
Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371

> plants.aov <- aov(weight ~ group, data=PlantGrowth)

> summary(plants.aov)

            Df Sum Sq Mean Sq F value Pr(>F)
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> Eta2(plants.aov)
[1] 0.2641483

> Mag(Eta2(plants.aov)) # shipunov
[1] "high"

> boxplot(weight ~ group, data=PlantGrowth)

> with(PlantGrowth, pairwise.t.test(weight, group))
Pairwise comparisons using t tests with pooled SD
data: weight and group

     ctrl  trt1
trt1 0.194 -
trt2 0.175 0.013

P value adjustment method: holm

As a result, yields of plants from the two treatment conditions are different, but there is no difference between each of them and the control. However, the overall effect size of this experiment is high.

* * *

If variances are not similar, then oneway.test() will replace the simple (one-way) ANOVA:

> hwc2 <- read.table("data/hwc2.txt", h=TRUE)

> boxplot(WEIGHT ~ COLOR, data=hwc2)

> sapply(hwc2[, 2:3], Normality) # shipunov
  WEIGHT   HEIGHT
"NORMAL" "NORMAL"

> tapply(hwc2$WEIGHT, hwc2$COLOR, var)
   black    blond    brown
62.27126 23.45862 31.11379 # suspicious!

> bartlett.test(WEIGHT ~ COLOR, data=hwc2)
Bartlett test of homogeneity of variances
data: WEIGHT by COLOR
Bartlett's K-squared = 7.4914, df = 2, p-value = 0.02362 # bad!

> oneway.test(WEIGHT ~ COLOR, data=hwc2)

One-way analysis of means (not assuming equal variances)
data: WEIGHT and COLOR
F = 7.0153, num df = 2.000, denom df = 56.171, p-value = 0.001907

> (e2 <- Eta2(aov(WEIGHT ~ COLOR, data=hwc2)))
[1] 0.1626432

> Mag(e2)
[1] "medium"

> pairwise.t.test(hwc2$WEIGHT, hwc2$COLOR) # most applicable post hoc

... # check results yourself

(Here we used another data file where variables are normal but group variances are not homogeneous. Please make boxplot and check results of post hoc test yourself.)
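The relation between oneway.test() and the classic ANOVA can be checked directly: with var.equal=TRUE it should reproduce the F value from aov(), whereas the default applies the Welch correction. A minimal sketch on the built-in PlantGrowth data (used here only because it is always available, not because its variances differ):

```r
plants.aov <- aov(weight ~ group, data = PlantGrowth)
f.aov <- summary(plants.aov)[[1]][1, "F value"]
# With var.equal=TRUE, oneway.test() is the classic one-way ANOVA
f.equal <- oneway.test(weight ~ group, data = PlantGrowth,
                       var.equal = TRUE)$statistic
# The default (var.equal=FALSE) does not assume equal variances
welch <- oneway.test(weight ~ group, data = PlantGrowth)
c(aov = f.aov, equal = unname(f.equal), welch = unname(welch$statistic))
welch$parameter    # denominator degrees of freedom are usually fractional
```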

* * *

What if the data is not normal?

The first workaround is to apply some transformation which might convert the data into normal:

> Normality(InsectSprays$count) # shipunov
[1] "NOT NORMAL"

> Normality(sqrt(InsectSprays$count))
[1] "NORMAL"

However, the same transformation could influence variance, so homogeneity should be checked again:

> bartlett.test(sqrt(count) ~ spray, data=InsectSprays)$p.value
[1] 0.5855673 # no evidence against equal variances; if it were small, use one-way test

Frequently, it is better to use the nonparametric ANOVA replacement, the Kruskal-Wallis test:

> hwc3 <- read.table("data/hwc3.txt", h=TRUE)

> boxplot(WEIGHT ~ COLOR, data=hwc3)

> sapply(hwc3[, 2:3], Normality) # shipunov

WEIGHT HEIGHT

"NOT NORMAL" "NOT NORMAL"

> kruskal.test(WEIGHT ~ COLOR, data=hwc3)
Kruskal-Wallis rank sum test
data: WEIGHT by COLOR
Kruskal-Wallis chi-squared = 32.859, df = 2, p-value = 7.325e-08

(Again, another variant of the data file was used; here variables are not even normal. Please make boxplot yourself.)

Effect size of the Kruskal-Wallis test could be calculated with ϵ²:

> Epsilon2 <- function(kw, n) # n is the number of cases
+ {
+ unname(kw$statistic/((n^2 - 1)/(n+1)))
+ }

> kw <- kruskal.test(WEIGHT ~ COLOR, data=hwc3)

> Epsilon2(kw, nrow(hwc3))
[1] 0.3691985

> Mag(Epsilon2(kw, nrow(hwc3))) # shipunov
[1] "high"

The overall effect size is high; it is also well visible on the boxplot (make it yourself):

> boxplot(WEIGHT ~ COLOR, data=hwc3)
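Note that the denominator in the ϵ² formula, (n² - 1)/(n + 1), is algebraically equal to n - 1, so the formula could be simplified. The sketch below checks this on the built-in airquality data; since that data has missing values, n is taken as the number of complete cases, which is an assumption about how cases should be counted.

```r
kw.air <- kruskal.test(Ozone ~ Month, data = airquality)
# Number of cases actually used by the test (rows without missing values)
n <- sum(complete.cases(airquality[, c("Ozone", "Month")]))
eps2.long <- unname(kw.air$statistic / ((n^2 - 1)/(n + 1)))
eps2.short <- unname(kw.air$statistic / (n - 1))    # simplified denominator
c(eps2.long, eps2.short)
```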

To find out which sample is deviated, use a nonparametric post hoc test:

> pairwise.wilcox.test(hwc3$WEIGHT, hwc3$COLOR)
Pairwise comparisons using Wilcoxon rank sum test
data: hwc3$WEIGHT and hwc3$COLOR

      black   blond
blond 1.1e-06 -
brown 1.6e-05 0.056

P value adjustment method: holm
...

(There are multiple warnings about ties. To get rid of them, replace the first argument with jitter(hwc3$WEIGHT). However, since jitter() adds random noise, it is better to be careful and repeat the analysis several times if p-values are close to the threshold like here.)
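Repeating the analysis with jitter() could be automated. The sketch below runs the jittered test 20 times on the built-in PlantGrowth data and collects the adjusted p-value for one pair of groups; the number of repeats and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
p.rep <- replicate(20, {
  res <- pairwise.wilcox.test(jitter(PlantGrowth$weight), PlantGrowth$group)
  res$p.value["trt2", "trt1"]    # adjusted p-value for one pair of groups
})
range(p.rep)
mean(p.rep < 0.05)    # proportion of runs below the threshold
```

If the proportion is far from both 0 and 1, the conclusion depends on the random noise and should be treated with caution.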

Another post hoc test for the nonparametric one-way layout is Dunn’s test. There is a separate dunn.test package:

> library(dunn.test)

> dunn.test(hwc3$WEIGHT, hwc3$COLOR, method="holm", altp=TRUE)
Kruskal-Wallis rank sum test

data: x and group
Kruskal-Wallis chi-squared = 32.8587, df = 2, p-value = 0

Comparison of x by group
(Holm)
Col Mean-|
Row Mean |      black      blond
---------+----------------------
   blond |   5.537736
         |    0.0000*
         |
   brown |   4.051095  -1.486640
         |    0.0001*     0.1371

alpha = 0.05
Reject Ho if p <= alpha

(Output is more advanced but overall results are similar. More post hoc tests like Dunnett’s test exist in the multcomp package.)

It is not necessary to check homogeneity of variance before the Kruskal-Wallis test, but please note that it assumes that distribution shapes are not radically different between samples. If this is not the case, one of the workarounds is to transform the data first, either logarithmically or with square root, or to the ranks³, or in an even more sophisticated way. Another option is to apply permutation tests (see Appendix). As a post hoc test, it is possible to use pairwise.Rro.test() from the shipunov package, which does not assume similarity of distributions.
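The idea of a permutation test mentioned above could be sketched in a few lines: group labels are shuffled many times, and the observed F value is compared with the F values obtained under shuffling. This is only a rough illustration on the built-in PlantGrowth data, not the procedure from the Appendix; the helper f.stat(), the seed and the number of permutations are arbitrary assumptions.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
# Hypothetical helper: F value of one-way ANOVA
f.stat <- function(y, g) summary(aov(y ~ g))[[1]][1, "F value"]
f.obs <- f.stat(PlantGrowth$weight, PlantGrowth$group)
# Shuffle group labels many times and recompute F under the null
n.perm <- 999
f.perm <- replicate(n.perm,
                    f.stat(PlantGrowth$weight, sample(PlantGrowth$group)))
# Proportion of permuted statistics at least as large as the observed one
p.perm <- (sum(f.perm >= f.obs) + 1) / (n.perm + 1)
p.perm
```

Since no distribution is assumed for the statistic, this approach does not depend on normality, but it is still sensitive to how the labels are exchanged.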

* * *

The next figure (Fig. 5.8) contains the Euler diagram which summarizes what was said above about different assumptions and ways of simple ANOVA-like analyses. Please note that there are many more post hoc test procedures than listed, and many of them are implemented in various R packages.

The typical sequence of procedures related with one-way analysis is listed below:

• Check if data structure is suitable (head(), str(), summary()), is it long or short

• Plot (e.g., boxplot(), beanplot())

• Normality, with plot or Normality()-like function

• Homogeneity of variance (homoscedasticity) (with bartlett.test() or fligner.test())

• Core procedure (classic aov(), oneway.test() or kruskal.test())

• Optionally, effect size (η² or ϵ² with appropriate formula)

• Post hoc test, for example TukeyHSD(), pairwise.t.test(), dunn.test() or pairwise.wilcox.test()
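The sequence above could be sketched as one short script. The example below uses the built-in chickwts data and base R functions only; choosing the core procedure automatically by the 0.05 threshold is a simplification made for this illustration, in practice this decision also requires looking at plots.

```r
# Whole sequence on built-in data, base R only
str(chickwts)                                # structure: long form
boxplot(weight ~ feed, data = chickwts)      # plot
chicks.aov <- aov(weight ~ feed, data = chickwts)
p.norm <- shapiro.test(residuals(chicks.aov))$p.value           # normality
p.var <- bartlett.test(weight ~ feed, data = chickwts)$p.value  # homogeneity
# Core procedure, selected by the (simplified) rules
if (p.norm < 0.05) {
  core <- kruskal.test(weight ~ feed, data = chickwts)
} else if (p.var < 0.05) {
  core <- oneway.test(weight ~ feed, data = chickwts)
} else {
  core <- summary(chicks.aov)
}
core
summary.lm(chicks.aov)$r.squared             # effect size (eta squared)
```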

In the open repository, the data file melampyrum.txt contains results of cow-wheat (Melampyrum spp.) measurements in multiple localities. Please find if there is a difference in plant height and leaf length between plants from different localities. Which localities are divergent in each case? To understand the structure of the data, use the companion file melampyrum_c.txt.

³ Like it is implemented in the ARTool package; there it is also possible to use multi-way nonparametric designs.

normal, variances similar: aov(), TukeyHSD()

normal, variances different: oneway.test(), pairwise.t.test()

not normal: kruskal.test(), pairwise.wilcox.test() or dunn.test()

Figure 5.8: Applicability of different ANOVA-like procedures and related post hoc tests. Please read it from bottom to the top.

* * *

All in all, if you have two or more samples represented with measurement data, the following table will help to research differences:

From: Shipunov, Visual statistics (pages 168-181)