
Table tests

From the document: Shipunov, Visual statistics (pages 188-192)

5.4 Is there an association? Analysis of tables

5.4.2 Table tests

[1] 0.1555872

(Note how the expected values are calculated and how they look: the expected (null) values are equal proportions between both rows and columns. June and September have 30 days each, hence the slight differences in values, but not in expected proportions.)

Let us see now whether hair color and eye color from the 3-dimensional embedded HairEyeColor data are associated. First, we can examine associations graphically with assocplot() (Fig. 5.12):

> (HE <- margin.table(HairEyeColor, 1:2))
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
> assocplot(HE)

(Instead of apply() used in the previous example, we employed margin.table() which essentially did the same job.)

The association plot shows several things. The height of each bar reflects the contribution of that cell to the total chi-squared, which allows us, for example, to detect outliers. The area of each rectangle corresponds to the difference between the observed and expected values, so big tall rectangles indicate more association (to understand this better, compare the current plot with assocplot(comf.month)). The color and position of a rectangle show the sign of the difference.
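The quantities behind the bars can also be inspected numerically. The following is a minimal sketch, assuming that the bar heights follow the Pearson residuals returned by chisq.test(); the names res and contrib are introduced here only for illustration:

```r
# Hair-by-eye counts, summed over sex (same table as in the text)
HE <- margin.table(HairEyeColor, 1:2)

# Pearson residuals: (observed - expected) / sqrt(expected)
res <- chisq.test(HE)$residuals
round(res, 2)

# Cell with the largest positive residual (observed well above expected)
which(res == max(res), arr.ind=TRUE)

# Squared residuals are the per-cell contributions to the chi-squared
# statistic; here they are shown as percentages of the total
contrib <- res^2
round(100 * contrib / sum(contrib), 1)
```

Cells with large positive or negative residuals are the ones that stand out most in the association plot.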

Overall, it is likely that there is an association. Now we need to check this hypothesis with a test:

> chisq.test(HE)
Pearson's Chi-squared test
data: HE
X-squared = 138.29, df = 9, p-value < 2.2e-16

The chi-squared test takes as null hypothesis “no pattern”, “no association”. Therefore, in our example, since we reject the null hypothesis, we find that the factors are associated.
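To make the test less of a black box, the statistic can be recomputed by hand. This is only a sketch under the usual independence null, where each expected count is (row total × column total) / grand total; the variable names expected, x2, df and p are illustrative and do not come from the book:

```r
HE <- margin.table(HairEyeColor, 1:2)
n <- sum(HE)

# Expected counts if hair and eye color were independent
expected <- outer(rowSums(HE), colSums(HE)) / n

# Pearson chi-squared statistic and its degrees of freedom
x2 <- sum((HE - expected)^2 / expected)
df <- (nrow(HE) - 1) * (ncol(HE) - 1)

# Upper-tail probability of the chi-squared distribution
p <- pchisq(x2, df, lower.tail=FALSE)
c(X.squared=x2, df=df, p.value=p)
```

The hand-computed statistic and degrees of freedom should agree with those reported by chisq.test(HE).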

And what about survival on the “Titanic”?

> chisq.test(titanic)
Pearson's Chi-squared test
data: titanic
X-squared = 190.4, df = 3, p-value < 2.2e-16

Figure 5.12: Association between hair color and eye color.

Yes (as the reader might remember from the famous movie), survival was associated with being in a particular class.

The general chi-squared test shows only whether asymmetry is present anywhere in the table. This means that if it is significant, then at least one group of passengers differs in survival. Like ANOVA, the test does not show which one. A post hoc, or pairwise, table test is able to show this:

> pairwise.Table2.test(titanic) # shipunov
Pairwise comparisons using Pearson's Chi-squared test
data: titanic
     1st     2nd     3rd
2nd  4.7e-07 -       -
3rd  < 2e-16 8.3e-07 -
Crew < 2e-16 3.9e-08 0.6
P value adjustment method: BH

From the table of p-values, it is apparent that 3rd class passengers and crew members did not differ in survival rates. Note that post hoc tests apply a p-value adjustment for multiple comparisons; practically, it means that because 6 tests were performed simultaneously, the p-values were adjusted upward with some method (here, the Benjamini & Hochberg method is the default).
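The idea of a pairwise table test can be sketched with base R alone. The code below is not the implementation of pairwise.Table2.test() from the shipunov package, and its p-values may differ slightly from the output above. It assumes that titanic is the class-by-survival table built from the embedded Titanic data, since this excerpt does not show how that object was created:

```r
# Assumption: class-by-survival counts from the built-in Titanic array
titanic <- apply(Titanic, c(1, 4), sum)

# All pairs of groups (rows), one chi-squared test per pair
pairs <- combn(rownames(titanic), 2)
raw.p <- apply(pairs, 2, function(.p) chisq.test(titanic[.p, ])$p.value)
names(raw.p) <- apply(pairs, 2, paste, collapse=" vs ")

# Benjamini & Hochberg adjustment for multiple comparisons
adj.p <- p.adjust(raw.p, method="BH")
signif(cbind(raw=raw.p, BH=adj.p), 2)
```

Comparing the two columns shows that the adjustment never makes a p-value smaller, which is why adjusted tests are more conservative than the same tests run in isolation.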

The file seedlings.txt contains results of an experiment examining germination of seeds infected with different types of fungi. In all, three fungi were tested, 20 seeds were tested for each fungus, and therefore with the controls 80 seeds were tested. Do the germination rates of the infected seeds differ?

Let us now examine a more complicated example. A large group of epidemiologists gathered for a party. The next morning, many woke up with symptoms of food poisoning. Because they were epidemiologists, they decided to recall what each of them ate at the banquet, and thus determine the cause of the illness.

The gathered data take the following format:

> tox <- read.table("data/poisoning.txt", h=TRUE)
> head(tox)
  ILL CHEESE CRABDIP CRISPS BREAD CHICKEN RICE CAESAR TOMATO
1   1      1       1      1     2       1    1      1      1
2   2      1       1      1     2       1    2      2      2
3   1      2       2      1     2       1    2      1      2
4   1      1       2      1     1       1    2      1      2
...
  ICECREAM CAKE JUICE WINE COFFEE
1        1    1     1    1      1
2        1    1     1    1      2
3        1    1     2    1      2
4        1    1     2    1      2
...

(We used head() here because the table is really long.)

The first variable (ILL) tells whether the participant got sick or not (1 or 2 respectively); the remaining variables correspond to different foods.
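Numeric codes like these are easier to read once they are given labels. As a small illustration (the vector ill below simply repeats the ILL values of the first four participants printed above, and the name ill.f is hypothetical), the coding can be turned into a labeled factor:

```r
# ILL values of the first four rows shown by head(tox)
ill <- c(1, 2, 1, 1)

# 1 means the participant got sick, 2 means the participant did not
ill.f <- factor(ill, levels=1:2, labels=c("ill", "not ill"))
table(ill.f)
```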

A simple glance at the data will not reveal anything, as the banquet had 45 participants and 13 different foods. Therefore, statistical methods must be used. Since the data are nominal, we will use contingency tables:

> tox.1 <- lapply(tox[,-1], function(.x) table(tox[, 1], .x))
> tox.2 <- array(unlist(tox.1),
+ dim=c(dim(tox.1[[1]]), length(tox.1))) # or simply c(2, 2, 13)
> dimnames(tox.2) <- list(c("ill","not ill"),
+ c("took","didn't take"), names(tox.1))

(First, we ran the ILL variable against every column and made a list of small contingency tables. Second, we converted the list into a 3-dimensional array, just like the Titanic data, and also gave the dimensions sensible names.)
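Because data/poisoning.txt is not reproduced here, the same two steps can be tried on a tiny made-up data frame. Everything in this sketch (the object toy and its columns SOUP and TEA) is hypothetical and serves only to show how the list of tables becomes an array:

```r
# Hypothetical data: 6 participants, 2 foods, same 1/2 coding as in tox
toy <- data.frame(ILL=c(1, 1, 1, 2, 2, 2),
                  SOUP=c(1, 1, 2, 1, 2, 2),
                  TEA=c(1, 2, 1, 2, 1, 2))

# Step 1: cross-tabulate ILL against every food column
toy.1 <- lapply(toy[, -1], function(.x) table(toy[, 1], .x))

# Step 2: stack the 2 x 2 tables into a 2 x 2 x (number of foods) array
toy.2 <- array(unlist(toy.1),
               dim=c(dim(toy.1[[1]]), length(toy.1)))
dimnames(toy.2) <- list(c("ill", "not ill"),
                        c("took", "didn't take"), names(toy.1))

toy.2[, , "SOUP"]
```

This approach relies on every food having both codes (1 and 2) present, so that each small table is exactly 2×2; otherwise the tables would not line up in the array.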

Now our data consist of small contingency tables which are elements of the array:

> tox.2[,,"TOMATO"]
        took didn't take
ill       24           5
not ill    6          10

(Note the two commas, which are needed to tell R that we want the third dimension of the array.)

Now we need a kind of stratified (by type of food) table analysis. Since every element of tox.2 is a 2×2 table, a fourfold plot will visualize these data well (Fig. 5.13):

> fourfoldplot(tox.2, conf.level=0, col=c("yellow","black"))

(In fourfold plots, association corresponds to the difference between two pairs of diagonal sectors. Since we test multiple times, confidence rings are suppressed.)

There are some apparent differences, especially for CAESAR, BREAD and TOMATO. To check their significance, we will first apply the chi-squared test multiple times and check the p-values:

> cbind(apply(tox.2, 3, function(.x) chisq.test(.x)$p.value))
                 [,1]
CHEESE   0.8408996794
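To see what a single entry in such a column of p-values represents, the TOMATO table printed earlier can be tested on its own. This sketch re-enters the four counts by hand (the names tomato and odds.ratio are illustrative); note that for 2×2 tables chisq.test() applies the Yates continuity correction by default:

```r
# Counts copied from the tox.2[,,"TOMATO"] output above
tomato <- matrix(c(24, 6, 5, 10), nrow=2,
                 dimnames=list(c("ill", "not ill"),
                               c("took", "didn't take")))

# The same call that apply() runs for every food
chisq.test(tomato)$p.value

# Odds ratio: odds of being ill among those who took the food,
# relative to those who did not
odds.ratio <- (tomato[1, 1] * tomato[2, 2]) / (tomato[1, 2] * tomato[2, 1])
odds.ratio
```

Because 13 such tests are run at once, each individual p-value should be read with multiple comparisons in mind.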
