

From the document: Shipunov, Visual statistics (pages 134-147)


4.7 Answers to exercises

In the open repository, there is a data file phyllotaxis.txt which contains measurements of phyllotaxis in nature. Variables N.CIRCLES and N.LEAVES are numerator and denominator, respectively. Variable FAMILY is the name of the plant family. Many formulas in this data file belong to the “classic” Fibonacci group (see above), but some do not. Please count proportions of non-classic formulas per family, determine which family deviates most, and check if the proportion of non-classic formulas in this family is statistically different from the average proportion (calculated from the whole data).

Figure 4.11: Phyllotaxis. From left to right: leaves arranged by 1/2, 1/3 and 2/5 formulas of phyllotaxis.

$ data.name: chr "rnorm(100)"

- attr(*, "class")= chr "htest"

Well, the p-value most likely comes from the p.value component, this is easy. Check it:

> set.seed(1683)

> shapiro.test(rnorm(100))$p.value
[1] 0.8424077

This is what we want. Now we can insert it into the body of our function.
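What such a function body could look like is sketched below. This is only a minimal sketch: the name normality_sketch() and the 0.05 threshold are assumptions made for illustration, and the Normality() function from the shipunov package may be implemented differently.

```r
# Hypothetical helper (not the shipunov implementation): label a numeric
# vector by comparing the Shapiro-Wilk p-value with a threshold.
normality_sketch <- function(x, p = 0.05) {
  # shapiro.test() returns a list of class "htest"; take its p.value component
  if (shapiro.test(x)$p.value > p) "NORMAL" else "NOT NORMAL"
}

set.seed(1683)
res_norm <- normality_sketch(rnorm(100))  # sample from a normal distribution
set.seed(1)
res_exp <- normality_sketch(rexp(200))    # strongly skewed sample
res_norm
res_exp
```

Returning a short character label makes the function convenient to use with sapply() over several columns.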

* * *

Answer to the “birch normality” exercise. First, we need to check the data and understand its structure, for example with url.show(). Then we can read it into R, check its variables and apply the Normality() function to all appropriate columns:

> betula <- read.table(

+ "http://ashipunov.info/shipunov/open/betula.txt", h=TRUE)

> Str(betula) # shipunov

'data.frame': 229 obs. of 10 variables:

1 LOC : int 1 1 1 1 1 1 1 1 1 1 ...

2 LEAF.L : int 50 44 45 35 41 53 50 47 52 42 ...

3 LEAF.W : int 37 29 37 26 32 37 40 36 39 40 ...

4 LEAF.MAXW: int 23 20 19 15 18 25 21 21 23 19 ...

5 PUB : int 0 1 0 0 0 0 0 0 1 0 ...

6 PAPILLAE : int 1 1 1 1 1 1 1 1 1 1 ...

7 CATKIN : int 31 25 21 20 24 22 40 25 14 27 ...

8 SCALE.L : num 4 3 4 5.5 5 5 6 5 5 5 ...

9 LOBES * int 0 0 1 0 0 0 0 0 1 0 ...

10 WINGS * int 0 0 0 0 0 0 1 0 0 1 ...

> sapply(betula[, c(2:4, 7:8)], Normality) # shipunov

LEAF.L LEAF.W LEAF.MAXW CATKIN SCALE.L

"NOT NORMAL" "NOT NORMAL" "NOT NORMAL" "NORMAL" "NOT NORMAL"

(Note how only non-categorical columns were selected for the normality check. We used Str() because it helps to check numbers of variables, and shows that two variables, LOBES and WINGS, have missing data. There is no problem in using str() instead.)

Only CATKIN (length of female catkin) is suitable for parametric methods here. This is a frequent case in biological data.

What about the graphical check for normality, a histogram or QQ plot? Yes, it should work, but we would need to repeat it 5 times. However, the lattice package allows us to do it in two steps and fit everything on one trellis plot (Fig. 4.12):

> library(lattice)

> betula.s <- stack(betula[, c(2:4, 7:8)])

> qqmath(~ values | ind, data=betula.s,

+ panel=function(x) {panel.qqmathline(x); panel.qqmath(x)})

(The lattice library requires long data format, where all columns are stacked into one and the data is supplied with an identifier column; this is why we used the stack() function and formula interface.)

There are many trellis plots. Please check the trellis histogram yourself:
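To see what the long format looks like, here is a small sketch with a made-up two-column data frame (the object names wide and long and the numbers are purely illustrative):

```r
# A toy "wide" data frame: one column per measured variable
wide <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))

# stack() puts all values into one column and adds an identifier column
long <- stack(wide)
long
names(long)  # "values" holds the data, "ind" tells which column it came from
```

The resulting columns, values and ind, are exactly the names used in the trellis formula `~ values | ind`.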

> bwtheme <- standard.theme("pdf", color=FALSE)

> histogram(~ values | ind, data=betula.s, par.settings=bwtheme)

(There was also an example of how to apply a grayscale theme to these plots.)

As one can see, SCALE.L could also be accepted as “approximately normal”. Among the others, LEAF.MAXW is the “least normal”.

* * *

Answer to the birch characters variability exercise. To create a function, it is good to start from a prototype:

> CV <- function(x) {}

This prototype does nothing, but on the next step you can improve it, for example, with the fix(CV) command. Then test CV() with some simple argument. If the result is not satisfactory, fix(CV) again. At the end of this process, your function (actually, it “wraps” the CV calculation explained above) might look like:

> CV <- function(x)
+ {
+ 100*sd(x, na.rm=TRUE)/mean(x, na.rm=TRUE)
+ }

Then sapply() could be used to check the variability of each measurement column:

> sapply(betula[, c(2:4, 7:8)], CV)

  LEAF.L   LEAF.W LEAF.MAXW   CATKIN  SCALE.L
17.93473 20.38630  26.08806 24.17354 24.72061

[Figure 4.12 plot: trellis of QQ panels LEAF.L, LEAF.W, LEAF.MAXW, CATKIN and SCALE.L; x-axis qnorm, y-axis values.]

Figure 4.12: Normality QQ trellis plots for the five measurement variables in betula dataset (variables should be read from bottom to top).

As one can see, LEAF.MAXW (location of the maximal leaf width) has the biggest variability. In the shipunov package, there is the CVs() function which implements this and three other measurements of relative variation.
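Testing CV() with a simple argument, as suggested above, could look like the following sketch. The function is repeated here so the example runs on its own; the test vectors are arbitrary.

```r
# Coefficient of variation in percent, same definition as in the text
CV <- function(x) {
  100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# For 1, 2, 3 the standard deviation is 1 and the mean is 2
cv_simple <- CV(c(1, 2, 3))       # 50
# Missing values are dropped thanks to na.rm=TRUE
cv_na <- CV(c(1, 2, 3, NA))       # also 50
cv_simple
cv_na
```

A vector with a known answer makes it easy to see whether the function does what is intended before applying it to real columns.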

* * *

Answer to the question about dact.txt data. The companion file dact_c.txt describes it as a random extract from some plant measurements. From the first chapter, we know that it is just one sequence of numbers. Consequently, scan() would be better than read.table(). First, load and check:

> dact <- scan("data/dact.txt")

Read 48 items

> str(dact)

num [1:48] 88 22 52 31 51 63 32 57 68 27 ...

Now, we can check the normality with our new function:

> Normality(dact) # shipunov
[1] "NOT NORMAL"

Consequently, we must apply to dact only those analyses and characteristics which are robust to non-normality:

> summary(dact)[-4] # no mean

Min. 1st Qu. Median 3rd Qu. Max.

0.00 22.00 33.50 65.25 108.00

> IQR(dact)
[1] 43.25

> mad(dact)
[1] 27.4281

Confidence interval for the median:

> wilcox.test(dact, conf.int=TRUE)$conf.int
[1] 34.49995 53.99997
attr(,"conf.level")
[1] 0.95

Warning messages:

...

(Using the idea that every test output is a list, we extracted the confidence interval from the output directly. Of course, we knew beforehand that the name of the component we need is conf.int; this knowledge could be obtained from the function help (section “Value”). The resulting interval is broad.)

To plot single numeric data, a histogram (Fig. 4.13) is preferable (boxplots are better for comparison between variables):
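Besides the help page, component names can also be looked up in the output object itself. A minimal sketch with simulated data (the vector x below is made up and stands in for any numeric sample):

```r
set.seed(1)
x <- rnorm(30, mean = 5)  # simulated stand-in data, not dact

res <- wilcox.test(x, conf.int = TRUE)

is.list(res)   # test output is a list with class "htest"
names(res)     # lists all components, including "conf.int"

ci <- res$conf.int
ci             # two limits plus the "conf.level" attribute
```

The same approach works for other tests which return objects of class "htest", such as shapiro.test() or prop.test().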

> Histr(dact, xlab="", main="") # shipunov

Similar to the histogram is the stem-and-leaf plot:

> stem(dact)

The decimal point is 1 digit(s) to the right of the |

 0 | 0378925789
 2 | 0224678901122345
 4 | 471127
 6 | 035568257
 8 | 2785
10 | 458

[Figure 4.13 plot: histogram of dact; x-axis from 0 to 120, y-axis Frequency from 0 to 15.]

Figure 4.13: Histogram with overlaid normal distribution curve for dact data.

In addition, here we will calculate skewness and kurtosis, which are based on the third and fourth central moments (Fig. 4.14). Skewness is a measure of how asymmetric the distribution is; kurtosis is a measure of how spiky it is. The normal distribution has both skewness and (excess) kurtosis equal to zero, whereas the “flat” uniform distribution has skewness zero and kurtosis approximately –1.2 (check it yourself).

What about the dact data? From the histogram (Fig. 4.13) and stem-and-leaf plot we can predict positive skewness (asymmetry of the distribution) and negative kurtosis (distribution flatter than normal). To check, one needs to load the e1071 library first:
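The uniform distribution claim can be checked without any extra packages. The sketch below computes both measures directly from central moments; the helper names are hypothetical, and because it uses the plain moment-based definitions, the numbers may differ slightly from what skewness() and kurtosis() of e1071 return with their default settings.

```r
# k-th central moment of a sample
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and excess kurtosis (illustrative helpers)
skew_sketch <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt_sketch <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

set.seed(42)
u <- runif(100000)       # large sample from the "flat" uniform distribution
u_skew <- skew_sketch(u)
u_kurt <- kurt_sketch(u)
u_skew                   # close to 0
u_kurt                   # close to -1.2
```

Subtracting 3 in the kurtosis formula is what makes the normal distribution the zero reference point.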

> library(e1071)

> skewness(dact)
[1] 0.5242118

> kurtosis(dact)
[1] -0.8197875

Figure 4.14: Central moments (left to right, top to bottom): default, different scale, different skewness, different kurtosis.

* * *

Answer to the question about water lilies. First, we need to check the data, load it into R and check the resulting object:

> ny <- read.table(

+ "http://ashipunov.info/shipunov/open/nymphaeaceae.txt",
+ h=TRUE, sep="\t")

> Str(ny) # shipunov

'data.frame': 267 obs. of 5 variables:

1 SPECIES: Factor w/ 2 levels "Nuphar lutea",..: 1 1 1 1 ...

2 SEPALS : int 4 5 5 5 5 5 5 5 5 5 ...

3 PETALS : int 14 10 10 11 11 11 12 12 12 12 ...

4 STAMENS* int 131 104 113 106 111 119 73 91 102 109 ...

5 STIGMAS* int 13 12 12 11 13 15 10 12 12 11 ...

(Function Str() shows column numbers and the presence of NA.)

One of the possible ways to proceed is to examine differences between species by each character, with four paired boxplots. To place them all in one figure (as a 2 × 2 grid), we will employ the for() cycle:

> oldpar <- par(mfrow=c(2, 2))

> for (i in 2:5) boxplot(ny[, i] ~ ny[, 1], main=names(ny)[i])

> par(oldpar)

(Not here, but in many other cases, for() in R is better replaced with commands of the apply() family. The boxplot function accepts “ordinary” arguments, but in this case the formula interface with tilde is much handier.)

Please review this plot yourself.

It is even better, however, to compare scaled characters in one plot. The first variant is to load the lattice library and create a trellis plot similar to Fig. 7.8 or Fig. 7.7:

> library(lattice)

> ny.s <- stack(as.data.frame(scale(ny[ ,2:5])))

> ny.s$SPECIES <- ny$SPECIES

> bwplot(SPECIES ~ values | ind, ny.s, xlab="")

(As usual, trellis plots “want” long form and formula interface.) Please check this plot yourself.

An alternative is the Boxplots() command (Fig. 4.15). It is not a trellis plot, but is designed with a similar goal: to compare many things at once:

> Boxplots(ny[, 2:5], ny[, 1], srt=0, adj=c(.5, 1)) # shipunov

(By default, Boxplots() rotates character labels, but this behavior is not necessary with 4 characters. This plot uses scale(), so the y-axis is, by default, not provided.) Or, with the even more crisp Linechart() (Fig. 4.16):


> Linechart(ny[, 2:5], ny[, 1], se.lwd=2) # shipunov

(Sometimes, IQRs are easier to perceive if you add grid() to the plot. Try it yourself. By the way, if you have just one species, use the Dotchart3() function.)

Evidently (after SEPALS), PETALS and STAMENS make the best species resolution. To obtain numerical values, it is better to check the normality first.

[Figure 4.15 plot: grouped boxplots of SEPALS, PETALS, STAMENS and STIGMAS for Nuphar lutea and Nymphaea candida.]

Figure 4.15: Grouped boxplots with Boxplots() function.

Note that species identity is a natural, internal feature of our data. Therefore, it is theoretically possible that the same character exhibits a normal distribution in one species but not in another. This is why normality should be checked per character per species. This idea is close to the concept of fixed effects, which are so useful in linear models (see next chapters). Fixed effects are opposed to random effects, which are not natural to the objects studied (for example, if we sample only one species of water lilies in the lake two times).

> aggregate(ny[, 3:4], by=list(SPECIES=ny[, 1]), Normality) # shipunov
           SPECIES     PETALS    STAMENS
1     Nuphar lutea NOT NORMAL NOT NORMAL
2 Nymphaea candida NOT NORMAL NOT NORMAL

(Function aggregate() not only applies a function to all elements of its argument, but also splits it on the fly with the by list of factor(s). tapply() is similar, but it works only with one vector. Another variant is to use split() and then apply a reporting function to each part separately.)
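The three variants mentioned in the note can be compared side by side. The sketch below uses the built-in iris data as a stand-in for the water lilies table (an assumption made only to keep the example self-contained) and median() as the reporting function:

```r
# Variant 1: aggregate() handles several columns at once
by_aggregate <- aggregate(iris[, 1:2], by = list(Species = iris$Species), median)

# Variant 2: tapply() works with one vector and one grouping factor
by_tapply <- tapply(iris$Sepal.Length, iris$Species, median)

# Variant 3: split() into a list of parts, then sapply() over the parts
by_split <- sapply(split(iris$Sepal.Length, iris$Species), median)

by_aggregate
by_tapply
by_split
```

All three give the same group medians; the difference is mainly in the shape of the result (data frame versus named vector).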

[Figure 4.16 plot: medians and IQRs of scaled SEPALS, PETALS, STAMENS and STIGMAS for Nuphar lutea and Nymphaea candida; x-axis from −1.0 to 1.5.]

Figure 4.16: Grouped medians and IQRs with Linechart() function.

By the way, the code above is good for learning but in our particular case, the normality check is not required! This is because numbers of petals and stamens are discrete characters and therefore must be treated with nonparametric methods by definition.

Thus, for confidence intervals, we should proceed with nonparametric methods:

> aggregate(ny[, 3:4],
+ by=list(SPECIES=ny[, 1]),
+ function(.x) wilcox.test(.x, conf.int=TRUE)$conf.int)
           SPECIES PETALS.1 PETALS.2 STAMENS.1 STAMENS.2
1     Nuphar lutea 14.49997 14.99996 119.00003 125.50005
2 Nymphaea candida 25.49997 27.00001  73.99995  78.49997

Confidence intervals reflect the possible location of the central value (here, the median). But we still need to report our centers and ranges (a confidence interval is not a range!).

We can use either summary() (try it yourself), or some customized output which, for example, can employ median absolute deviation:

> aggregate(ny[, 3:4], by=list(SPECIES=ny[, 1]), function(.x)
+ paste(median(.x, na.rm=TRUE), mad(.x, na.rm=TRUE), sep="±"))
           SPECIES    PETALS     STAMENS
1     Nuphar lutea 14±1.4826 119±19.2738
2 Nymphaea candida 26±2.9652  77±10.3782

Now we can give the answer like “if there are 12–16 petals and 100–120 stamens, this is likely a yellow water lily, otherwise, if there are 23–29 petals and 66–88 stamens, this is likely a white water lily”.

* * *

Answer to the question about phyllotaxis. First, we need to look at the data file, either with url.show() or in the browser window, and determine its structure. There are four tab-separated columns with headers, and at least the second column contains spaces. Consequently, we need to tell read.table() about both separator and headers and then immediately check the “anatomy” of the new object:

> phx <- read.table(

+ "http://ashipunov.info/shipunov/open/phyllotaxis.txt",
+ h=TRUE, sep="\t")

> str(phx)

'data.frame': 6032 obs. of 4 variables:

$ FAMILY : Factor w/ 11 levels "Anacardiaceae",..: 1 1 1 1 1 ...

$ SPECIES : Factor w/ 45 levels "Alnus barbata",..: 9 9 9 9 9 ...

$ N.CIRCLES: int 2 2 2 2 2 2 2 2 2 2 ...

$ N.LEAVES : int 4 4 5 5 5 5 5 5 5 5 ...

As you see, we have 11 families and therefore 11 proportions to create and analyze:

> phx10 <- sapply(1:10, Phyllotaxis)

> phx.all <- paste(phx$N.CIRCLES, phx$N.LEAVES, sep="/")

> phx.tbl <- table(phx$FAMILY, phx.all %in% phx10)

> Dotchart(sort(phx.tbl[,"FALSE"]/(rowSums(phx.tbl)))) # shipunov

(Here we used the Dotchart() function, which is a modified variant of the classic dotchart() with better defaults and improved margins.)

[Figure 4.17 plot: dotchart with families Elaeagnaceae, Saxifragaceae, Fagaceae, Anacardiaceae, Leguminosae, Rosaceae, Malvaceae, Salicaceae, Ericaceae, Betulaceae and Onagraceae; x-axis from 0.0 to 0.6.]

Figure 4.17: Dotchart shows proportions of non-classic formulas of phyllotaxis.

Here we created the first 10 classic phyllotaxis formulas (ten is enough since higher-order formulas are extremely rare), then made these formulas (classic and non-classic) from data and finally made a table from the logical expression which checks if real-world formulas are present in the artificially made classic sequence. A dotchart (Fig. 4.17) is probably the best way to visualize this table. Evidently, Onagraceae (evening primrose family) has the highest proportion of FALSE’s. Now we need actual proportions and, finally, a proportion test:

> mean.phx.prop <- sum(phx.tbl[, 1])/sum(phx.tbl)

> prop.test(phx.tbl["Onagraceae", 1], sum(phx.tbl["Onagraceae", ]),
+ mean.phx.prop)

1-sample proportions test with continuity correction

data: phx.tbl["Onagraceae", 1] out of sum(phx.tbl["Onagraceae", ]), null probability mean.phx.prop

X-squared = 227.9, df = 1, p-value < 2.2e-16

alternative hypothesis: true p is not equal to 0.2712202
95 percent confidence interval:
0.6961111 0.8221820
sample estimates:
        p
0.7647059

As you see, the proportion of non-classic formulas in Onagraceae (about 76%) is statistically different from the average proportion of 27%.
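For readers without the shipunov package, the classic formulas used in this answer can be generated from Fibonacci numbers. The function below is a hypothetical stand-in written for illustration, not the package's Phyllotaxis(); it assumes that the n-th classic formula is the n-th Fibonacci number over the (n+2)-th, which reproduces 1/2, 1/3 and 2/5 from Fig. 4.11.

```r
# Hypothetical sketch: n-th classic phyllotaxis formula as "numerator/denominator"
phyllotaxis_sketch <- function(n) {
  fib <- numeric(n + 2)
  fib[1:2] <- 1
  # fill the Fibonacci sequence 1, 1, 2, 3, 5, ...
  for (i in 3:(n + 2)) fib[i] <- fib[i - 1] + fib[i - 2]
  paste(fib[n], fib[n + 2], sep = "/")
}

classic10 <- sapply(1:10, phyllotaxis_sketch)
classic10

# The same kind of check as in the text: is a formula in the classic sequence?
is_classic <- c("2/5", "2/4") %in% classic10
is_classic
```

Note that an unreduced formula like 2/4 counts as non-classic in this comparison, because the check is done on character strings.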

* * *

Answer to the exit poll question from the “Foreword”. Here is the way to calculate how many people we might want to ask to be sure that our sample 48% and 52% are “real” (represent the population):

> power.prop.test(p1=0.48, p2=0.52, power=0.8)

Two-sample comparison of proportions power calculation

n = 2451.596
p1 = 0.48
p2 = 0.52
sig.level = 0.05
power = 0.8
alternative = two.sided

NOTE: n is number in *each* group

We need to ask almost 5,000 people!

To calculate this, we used a kind of power test, which is frequently used for planning experiments. We set power=0.8 since it is the typical value of power used in social sciences. The next chapter gives the definition of power (as a statistical term) and some more information about power test output.
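To see where this number comes from, the sample size can also be approximated by hand with the usual normal approximation for comparing two proportions. This is a sketch under that assumption (variable names are illustrative); it is not a description of how power.prop.test() works internally, although in this case the two results agree closely.

```r
p1 <- 0.48
p2 <- 0.52
alpha <- 0.05
power <- 0.8

z_alpha <- qnorm(1 - alpha / 2)  # two-sided significance quantile
z_beta <- qnorm(power)           # quantile for the desired power
p_bar <- (p1 + p2) / 2           # pooled proportion under the null

# Per-group sample size from the normal approximation
n_hand <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
           z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2

n_builtin <- power.prop.test(p1 = p1, p2 = p2, power = power)$n

n_hand
n_builtin
```

The formula shows why so many people are needed: the sample size grows with the inverse square of the difference between proportions, and here this difference is only 0.04.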

Chapter 5
