Shipunov visual statistics

All commands used in the text of this book can be downloaded as one large R script (a collection of text commands) from http://shipunov.info/shipunov/school/biol_. Of course, many statistical methods, including really important ones, are not discussed in this book.

One or two dimensions

The data

Origin of the data
Population and sample
How to obtain the data
What to find in the data

Why do we need the data analysis
What data analysis can do
What data analysis cannot do

Answers to exercises

However, even small samples can be useful, and there are data analysis methods that work with five and even three replications. There are also special methods (power analysis) which allow you to estimate how many objects should be collected (we will give an example in due course).
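As a minimal sketch of what such a power analysis might look like, base R's power.t.test() can estimate how many objects per group a two-sample t-test would need. The effect size, standard deviation and desired power below are assumed values chosen only for illustration; they are not taken from the book.

```r
# Hypothetical design values (assumptions for illustration only)
pw <- power.t.test(delta = 1,         # expected difference between group means
                   sd = 1,            # expected standard deviation
                   sig.level = 0.05,  # accepted Type I error rate
                   power = 0.8)       # desired probability to detect the effect
n.per.group <- ceiling(pw$n)          # round up to whole objects
n.per.group
```

Leaving n unspecified tells power.t.test() to solve for it; supplying n and omitting power would instead estimate the power of an existing design.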

How to process the data

General purpose software

Think also about data visualization in spreadsheets: what if the data do not fit in the window? Or, as another example, what if you need to work with three non-adjacent columns of data at once?

Statistical software

Graphical systems
Statistical environments

In that case, the spreadsheet will begin to hinder the understanding of data instead of helping it. Unfortunately, SAS is often overcomplicated even for the experienced programmer, carries many vestiges of the 1970s (when it was written), is closed source, and is extremely expensive.

The very short history of the S and R

Use, advantages and disadvantages of the R

And (this is rare) if the method is not available, it is possible to write your own commands that implement it. It is also possible to correct errors in the code (everything made by humans contains errors, and R is no exception) in a Wikipedia-like communal way.

How to download and install R

For the beginner in R, it is better to avoid GUIs (graphical front ends), as they hinder the learning process. If you do not use these managers or centers, it is recommended that you update your R regularly, at least once a year.

How to start with R

Launching R
First steps
How to type
How to play with R
Overgrown calculator

If you happened to answer "yes" to the question at the end of the previous session, you may want to remove the unwanted files. When working in R, the previous command can be recalled by pressing the up-arrow key.

R and data

How to enter the data from within R
How to name your objects
How to load the text data
How to load data from Internet
How to use read.table()
How to load binary data
How to load data from clipboard
How to edit data in R
How to save the results
History and scripts

The command read.table() is sophisticated, but it is not smart enough to determine the data structure on the fly. By the way, if you type file.show("data/my) and press Tab, completion will show you whether your file is there, if it really is.
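Because read.table() does not guess the structure, the header, separator and decimal mark usually have to be stated explicitly. The sketch below assumes a small semicolon-separated table with decimal commas; the content is hypothetical and is passed through the text= argument so that no file is needed.

```r
# Hypothetical file content (an assumption, not a file from the book)
txt <- "NAME;LENGTH;WIDTH
a;1,5;2
b;2,5;3"
# Tell read.table() about the header line, the separator and the decimal mark
d <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")
str(d)  # check that the structure was understood as intended
```

With a real file, the first argument would be the file name instead of text=.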

R graphics

Graphical systems
Graphical devices
Graphical options
Interactive graphics

We recommend checking what happens if you supply a three-column data frame (such as the built-in trees data) or a contingency table (such as the built-in Titanic or HairEyeColor data) to plot(). Note that R does this silently, so if a file with the same name already existed, it will be overwritten.

Figure 2.3: Example of the plot with title and legend.

Answers to exercises

To know the available point types, run example(points) and skip several graphs to see the point table; or just look at Figure A.1 in this book (and read the notes on how it was made). To know the structure, (1) we need to look at this file from R with url.show() (or without R, in the web browser), and also (2) look at the accompanying file, eggs_c.txt.

Types of data

Degrees, hours and kilometers: measurement data

Remember that by "normal" we mean data whose distribution allows us to assume that parametric methods are appropriate ways to analyze them. Discrete measurement data is actually more convenient for computers: as you may know, processors are based on 0/1 logic and do not readily understand non-integral floating-point numbers.

Figure 3.1: Histograms of normal and non-normal data.

Grades and t-shirts: ranked data

In particular, we measured the total diameter of the heads (variable HEAD.D) and counted the number of rays ("petals", variable RAYS, see Figure 3.4). If we still want to use parametric methods, we need to obtain measurement data (which usually means a different study design) and also check for normality. With this, we obtain measurement data that can be suitable for parametric analysis methods.

Colors, names and sexes: nominal data

Character vectors
Factors
Logical vectors and binary data

The answer is really simple: because "female" is the first in alphabetical order. R uses this order every time when converting factors to numbers. Sometimes binary data can be ordered (as with presence/absence), sometimes not (as with right or wrong answers). Binary data can be presented either as 0/1 numbers or as a logical vector, which is a sequence of TRUE and FALSE values.
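The alphabetical ordering of factor levels and the two representations of binary data can be sketched as follows. The vectors are hypothetical examples, not data from the book.

```r
# Hypothetical nominal data
sex <- c("male", "female", "male", "male", "female")
sex.f <- factor(sex)
levels(sex.f)               # levels are sorted alphabetically
sex.n <- as.numeric(sex.f)  # numbers follow the order of levels
sex.n

# Binary data as a logical vector and as 0/1 numbers
is.female <- sex == "female"
as.numeric(is.female)       # TRUE becomes 1, FALSE becomes 0
```

If another order is needed, it can be set explicitly with the levels= argument of factor().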

Figure 3.5: This is how plot() plots a factor.

Fractions, counts and ranks: secondary data

Word cloud plots use random numbers, so it is better to run set.seed() immediately before plotting, so that the plot on your computer looks similar to Figure 3.9. For example, it is easy to remove columns from the data frame with a command like trees[, -3]. It is possible to avoid ties by adding small random noise with the jitter() command (examples follow).
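These three operations can be sketched with the built-in trees data and a short vector with ties. The seed value and the vector are arbitrary choices made for illustration.

```r
# Remove the third column with a negative index
trees.2 <- trees[, -3]
names(trees.2)

# Hypothetical data with ties
x <- c(1, 1, 2, 2, 2)
set.seed(11)          # arbitrary seed: makes the random noise reproducible
x.j1 <- jitter(x)
set.seed(11)
x.j2 <- jitter(x)     # same seed, same noise
anyDuplicated(x.j1)   # 0 means no ties are left
```

Without set.seed(), every run of jitter() would return slightly different values.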

Figure 3.7: Barplot of 12 most frequent R commands.

Missing data

The "trick" here was to use names to represent rows. All R objects can carry names together with values. Imagine that we are studying birdhouses and measuring beak lengths in birds found there, but suddenly found a squirrel in one of the boxes. There are many other ways to impute missing data, including more complicated ones based on bootstrap, regression and/or discriminant analysis.
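A minimal sketch of handling missing data in the simplest way, by skipping NA or by replacing it with the sample median, might look like this. The beak lengths are hypothetical numbers, and median imputation is shown only as one simple option, not as the book's recommended method.

```r
# Hypothetical beak lengths; the squirrel's box has no value
beak <- c(10, 12, NA, 11, 14)
mean(beak)                 # NA: missing data propagates
mean(beak, na.rm = TRUE)   # ignore the missing value

# Simple imputation: replace NA with the median of the known values
beak.imp <- beak
beak.imp[is.na(beak.imp)] <- median(beak, na.rm = TRUE)
beak.imp
```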

Outliers, and how to find them

This behavior is useful for identifying errors, such as "O" (the letter O) instead of "0" (zero), but it will lead to problems if the headers are not explicitly defined.
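The sketch below assumes that the behavior in question is the refusal of read.table() to treat a column as numeric when it contains a non-number. The data is hypothetical, with the letter O typed instead of zero in the second row.

```r
# Hypothetical data with a typing error: "1O" contains the letter O
txt <- "ID LENGTH
1 10
2 1O
3 12"
d <- read.table(text = txt, header = TRUE)
is.numeric(d$LENGTH)   # FALSE signals that something is wrong

# Locate the offending row: it does not convert to a number
bad <- which(is.na(suppressWarnings(as.numeric(as.character(d$LENGTH)))))
d[bad, ]
```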

Changing data: basics of transformations

How to tell the kind of data

Data are often transformed to bring them closer to the requirements of parametric methods and to homogenize standard deviations. One kind of transformation can normalize distributions with positive skew (a long right tail), bring relationships between variables closer to linear, and equalize variances. Another kind can normalize negatively skewed (left-tailed) data, bring relationships between variables closer to linear, and equalize variances.
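As an illustration, the sketch below applies a logarithmic transformation, chosen here as an assumed example of a transformation for right-tailed data, to simulated values. The skew() helper is a hypothetical function written for this example; it is not part of base R.

```r
set.seed(42)                  # arbitrary seed for reproducibility
x <- rlnorm(200)              # simulated right-tailed (log-normal) data

# Hypothetical helper: simple moment-based sample skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew.raw <- skew(x)           # clearly positive
skew.log <- skew(log(x))      # much closer to zero after the transformation
c(raw = skew.raw, log = skew.log)
```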

Inside R

Matrices
Lists
Data frames
Overview of data types and modes

Since arrays are vectors, all array elements must be of the same type: numeric, character, or logical. Vectors and arrays can only contain elements of the same type, while lists accept anything, including other lists. Each column of a data frame must contain data of the same type (as with vectors), but the columns themselves can be of different types (as with lists).
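These rules can be checked directly; the objects below are hypothetical examples.

```r
# A vector silently coerces mixed elements to one type
v <- c(1, "a", TRUE)
class(v)               # everything became character

# A list keeps each element as it is, including another list
l <- list(1, "a", TRUE, list(2, "b"))
sapply(l, class)

# A data frame: one type within a column, different types across columns
df <- data.frame(N = 1:3, S = c("a", "b", "c"), L = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)
col.types <- sapply(df, class)
col.types
```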

Figure 3.11: Most important R data objects.

Answers to exercises

However, there is one big problem that is not easy to recognize at first: in many places points overlap each other, and therefore the number of visible data points is much smaller than in the data file. What's worse, we can't tell whether the first and third species are well separated, because we don't see how many data values are located on the "boundary" between them. The function jitter() adds random noise to variables and shifts points, which makes it possible to see what is underneath.

Figure 3.13: Scatterplot which shows density of data points for each species.

One-dimensional data

How to estimate general tendencies

Median is the best
Quartiles and quantiles
Variation

The first method uses attach() and adds the columns of the table to the list of "visible" variables. These two functions sometimes give slightly different results, but this is not important for the research. Figure 4.1 summarizes the most important ways of reporting central tendency and variation, using the same Euler diagram that was used to show the relationship between parametric and non-parametric approaches (Figure 3.2).
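A short sketch of reporting center and variation both ways follows. The salary values are assumed for illustration and include one extreme value, so that the difference between the mean and the median is visible.

```r
# Hypothetical salaries with one outlier
salary <- c(21, 19, 27, 11, 102, 25, 21)

# Parametric way: mean and standard deviation
mean(salary); sd(salary)

# Robust way: median with interquartile range or median absolute deviation
median(salary)
IQR(salary)
mad(salary)
quantile(salary)   # minimum, quartiles and maximum
```

The single extreme value pulls the mean far away from the median, which is why the median is preferred when the distribution is not known to be normal.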

Figure 4.1: How to report center and variation in parametric (smaller circle) and all other cases (bigger circle).

1-dimensional plots

Confidence intervals

This is our null hypothesis, H0, which we want to accept or reject based on the test results. However, what is really important at the moment is the confidence interval: a range in which the true population mean should fall with a given probability (95%).

Wilcoxon signed rank test with continuity correction
data: salary
alternative hypothesis: true location is not equal to 0
95 percent confidence interval:
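Confidence intervals can be extracted directly from the test objects. The sketch below uses assumed salary values (they are not the book's data) and both the parametric and the rank-based test.

```r
# Hypothetical salaries
salary <- c(21, 19, 27, 11, 102, 25, 21)

# Parametric: 95% confidence interval for the mean
ci.t <- t.test(salary)$conf.int
ci.t

# Non-parametric: interval for the location (pseudomedian);
# ties in the data cause warnings, suppressed here
ci.w <- suppressWarnings(wilcox.test(salary, conf.int = TRUE))$conf.int
ci.w
```

Both intervals are reported at the default 95% level; the conf.level= argument changes it.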

Figure 4.8: Bean plot with overall line and median lines (default lines are means).

Normality

If this notation is not comfortable for you, there is a way to get rid of it. Most of the time these three ways of determining normality agree, but it is not surprising if they give different results. A normality check is not a death sentence; it is just an assessment based on probability.
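One of the ways to assess normality is a formal test. The sketch below assumes the Shapiro-Wilk test from base R and applies it to simulated data, so the numbers are illustrative only.

```r
set.seed(1)                   # arbitrary seed for reproducibility
x.norm <- rnorm(100)          # simulated normal data
x.exp <- rexp(100)            # simulated strongly right-tailed data

p.norm <- shapiro.test(x.norm)$p.value
p.exp <- shapiro.test(x.exp)$p.value

# A small p-value suggests deviation from normality
c(normal = p.norm, exponential = p.exp)
```

The test is usually complemented by graphical checks such as a histogram or qqnorm() with qqline().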

How to create your own functions

Here we followed the convention that in anonymous functions, argument names start with a dot. In the first chapter we used the dact.txt data to illustrate the situation where it is really hard to say anything about data without statistical analysis. In the open repository, the file nymphaeaceae.txt contains counts of flower parts taken from two members of the water lily family (Nymphaeaceae), Nuphar lutea (yellow water lily) and Nymphaea candida (white water lily).
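As a minimal sketch of a user-defined function, the coefficient of variation can be written once as a named function and once as an anonymous function with a dot-prefixed argument name. The name CV is a hypothetical choice for this example.

```r
# Named function: coefficient of variation in percent
CV <- function(.x) 100 * sd(.x) / mean(.x)

# Apply it to every column of the built-in trees data
cv.named <- sapply(trees, CV)

# The same with an anonymous function defined in place
cv.anon <- sapply(trees, function(.x) 100 * sd(.x) / mean(.x))

cv.named
```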

How good is the proportion?

Now we have to ask the main statistical question: what is the proportion of women in the whole population (all similar firms)? According to the confidence interval, the actual proportion of people who voted for candidate A lies somewhere between 47% and 100%. In the open repository there is a data file phyllotaxis.txt that contains measurements of phyllotaxis in nature.
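A confidence interval for a proportion can be sketched with base R's binom.test() (exact) or prop.test() (approximate). The counts below, 5 women among 7 employees, are assumed for illustration only.

```r
# Hypothetical sample: 5 women among 7 employees
women <- 5
total <- 7

# Exact binomial confidence interval for the population proportion
ci.exact <- binom.test(women, total)$conf.int
ci.exact

# Approximate interval based on the chi-square statistic;
# small samples trigger a warning, suppressed here
ci.approx <- suppressWarnings(prop.test(women, total))$conf.int
ci.approx
```

With such a small sample the interval is very wide, which is exactly what the interval is meant to reveal.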

Answers to exercises

First we need to check the data and understand its structure, for example with url.show(). From the histogram (Fig. 4.13) and stem-and-leaf, we can predict positive skewness (distribution asymmetry) and negative kurtosis (distribution flatter than normal). Therefore, it is theoretically possible for the same character in one species to exhibit a normal distribution, whereas in another species it does not.

Figure 4.12: Normality QQ trellis plots for the five measurement variables in betula dataset (variables should be read from bottom to top).

Two-dimensional data

What is a statistical test?

Statistical hypotheses
Statistical errors

The null hypothesis is a proposition of the absence of something (for example, a difference between two samples or a relationship between two variables). In fact, the p-value is the probability of observing the same or a greater effect if the null hypothesis is true. The conventional answer puts that threshold at 0.05: the alternative hypothesis is accepted if the p-value is less than 5% (that is, the confidence level is greater than 95%).
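What the 0.05 threshold means can be illustrated with a small simulation, given here as a sketch under assumed settings (1000 repetitions, two samples of 20 values each). Both samples are drawn from the same population, so the null hypothesis is true and every "significant" result is a Type I error.

```r
set.seed(1)   # arbitrary seed for reproducibility

# Repeat the experiment 1000 times with no real difference between samples
p.values <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of false alarms at the conventional threshold
type1.rate <- mean(p.values < 0.05)
type1.rate    # expected to be close to 0.05
```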

Table 5.1: Statistical hypotheses, including illustrations of (b) Type I and (c) Type II errors

Is there a difference? Comparing two samples

Two sample tests
Effect sizes

The data is in the long form: the column extra contains the increase in sleep times (in hours, positive or negative), while the column group indicates the group (type of drug). This is plausible because the level of ozone in the atmosphere is strongly dependent on solar activity, temperature and wind. The data file grades.txt contains the grades of one group of students for the first exam (in the column labeled A1) and the second exam (A2), as well as the grades of another group of students for the first exam (B1).
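With long-form data such as the built-in sleep data set, a two-sample test can be written with a formula. The sketch below also shows the paired variant, on the assumption that the same ten persons received both drugs, which is how the data set is documented in R.

```r
# Long form: one column with values, one with group labels
head(sleep)

# Two independent samples (Welch t-test), formula interface
p.welch <- t.test(extra ~ group, data = sleep)$p.value

# Paired test: the two groups are given as two vectors in the same order
p.paired <- with(sleep, t.test(extra[group == 1], extra[group == 2],
                               paired = TRUE))$p.value

c(welch = p.welch, paired = p.paired)
```

The paired test is more sensitive here because it removes the variation between persons.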

Figure 5.2: How not to interpret p-values (taken from XKCD, https://xkcd.com/

If there are more than two samples: ANOVA

One way
More than one way

The null hypothesis here is that all samples belong to the same population (“are not different”), and the alternative hypothesis is that at least one sample is divergent and does not belong to the same population (“samples are different”). If not, one of the solutions is to first transform the data with a logarithm or square root, or to rank them, or to use an even more sophisticated way. As a post hoc test, it is possible to use pairwise.Rro.test() from the shipunov package, which does not assume similarity of distributions.
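A one-way analysis can be sketched with base R only. The example uses the built-in PlantGrowth data set as an assumed stand-in for the book's data, and pairwise.t.test() as a base R post hoc test instead of the shipunov function named above.

```r
# Parametric one-way ANOVA: weight by treatment group
pg.aov <- aov(weight ~ group, data = PlantGrowth)
p.aov <- summary(pg.aov)[[1]][["Pr(>F)"]][1]

# Rank-based alternative when ANOVA assumptions are doubtful
p.kw <- kruskal.test(weight ~ group, data = PlantGrowth)$p.value

c(anova = p.aov, kruskal = p.kw)

# Post hoc: which pairs of groups differ (Holm correction)
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                p.adjust.method = "holm")
```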

Figure 5.6: Core idea of ANOVA: compare within and between variances.

Is there an association? Analysis of tables

Contingency tables
Table tests

If the data is a "table" object with more than one dimension, the plot() command will draw a mosaic plot by default. The two-sample chi-square (or χ2) test requires either a contingency table or two factors of the same length (from which the table is first calculated). To check its significance, we will first apply the chi-square test several times and check the p-values:
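A minimal sketch of a table test follows. It uses the built-in HairEyeColor data, collapsed over sex, as an assumed example rather than the data discussed in the text.

```r
# Collapse the three-dimensional table to Hair x Eye
hair.eye <- margin.table(HairEyeColor, c(1, 2))
hair.eye

# Chi-square test of association between hair and eye color
res <- chisq.test(hair.eye)
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value
```

Calling plot(hair.eye) on this object would draw the mosaic plot mentioned above.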

CRABDIP 0.9493138514

A simple look at the data will reveal nothing, as the banquet had 45 participants and 13 different dishes.

Figure 5.13: Association between food taken and illness.

CAKE 0.8694796709

Answers to exercises

Two sample tests, effect sizes

Chi-square test works well when the number of cases per cell is more than 5. There is also mcnemar.test(), which is used to compare proportions when they belong to the same objects (paired proportions). If there are more than two groups per case involved, you can optionally run post hoc pairwise tests with the appropriate correction: pairwise.Table2.test().
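The paired case can be sketched with mcnemar.test() on a 2 x 2 table of the same objects assessed twice. The counts below are hypothetical.

```r
# Hypothetical paired data: the same 50 objects assessed before and after
paired <- matrix(c(20, 5, 15, 10), nrow = 2,
                 dimnames = list(Before = c("yes", "no"),
                                 After = c("yes", "no")))
paired

# Only the discordant cells (15 and 5) drive the test
res <- mcnemar.test(paired)
res$statistic
res$p.value
```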