All commands used in the text of this book can be downloaded as one large R script (a collection of text commands) from http://shipunov.info/shipunov/school/biol_. Of course, many statistical methods, including really important ones, are not discussed in this book.
One or two dimensions
The data
- Origin of the data
- Population and sample
- How to obtain the data
- What to find in the data
- Why do we need the data analysis
- What data analysis can do
- What data analysis cannot do
- Answers to exercises
However, even small samples can be useful, and there are data analysis methods that work with five and even three replications. There are also special methods (power analysis) which allow you to estimate how many objects should be collected (we will give an example in due course).
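As a small illustration of power analysis, here is a minimal sketch with base R power.t.test(). The planning values (expected difference of 1 unit, standard deviation of 1 unit, 80% power) are assumptions made for this example only, not values from the book.

```r
# Hypothetical planning values: expected difference of 1 unit, SD of 1 unit
pa <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(pa$n)  # round up to whole objects
n_per_group
```

The result is the approximate number of objects needed in each of two groups under these assumptions; other designs would need other functions.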
How to process the data
General purpose software
Consider data visualization in spreadsheets: what if the data do not fit the window? Another example: what if you need to work with three non-adjacent columns of data at once?
Statistical software
- Graphical systems
- Statistical environments
In that case, the spreadsheet will begin to hinder the understanding of data instead of helping it. Unfortunately, SAS is often overcomplicated even for the experienced programmer, has many "vestiges" of the 1970s (when it was written), is closed source, and is extremely expensive.
The very short history of the S and R
Use, advantages and disadvantages of the R
And (this is rare) if the method is not available, it is possible to write your own commands that implement it. It is also possible to correct errors in the code (everything made by humans contains errors, and R is no exception) in a Wikipedia-like communal way.
How to download and install R
For the beginner in R, it is better to avoid any R GUI, as they hinder the learning process. If you do not use these managers or centers, it is recommended that you update your R regularly, at least once a year.
How to start with R
- Launching R
- First steps
- How to type
- How to play with R
- Overgrown calculator
If you happened to answer "yes" to the question at the end of the previous session, you may want to remove unwanted files. When working in R, the previous command can be recalled by pressing the "up arrow" key.
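One possible way to remove such files from within R is sketched below. We assume here that the question was whether to save the workspace, in which case R leaves the files .RData and .Rhistory in the working directory. To keep the example harmless, it works in a temporary directory with empty stand-in files.

```r
# Work in a temporary directory so that nothing real is deleted
tmp <- tempfile("rsession")
dir.create(tmp)
unwanted <- file.path(tmp, c(".RData", ".Rhistory"))
file.create(unwanted)          # stand-ins for files left by a saved session
unlink(unwanted)               # remove them
left <- file.exists(unwanted)  # both should now be FALSE
left
```

In a real session, the same unlink() call would be applied to the files in your working directory.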
R and data
- How to enter the data from within R
- How to name your objects
- How to load the text data
- How to load data from Internet
- How to use read.table()
- How to load binary data
- How to load data from clipboard
- How to edit data in R
- How to save the results
- History and scripts
Command read.table() is sophisticated, but it is not smart enough to determine the data structure on the fly. By the way, if you type file.show("data/my and press Tab, completion will show you if your file is there (if it really is).
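Because read.table() does not guess the structure, we describe it explicitly. The sketch below reads a tiny made-up table (the column names and values are hypothetical) from a text string instead of a file, telling R that there is a header and that the separator is a semicolon.

```r
# Hypothetical data, supplied as text instead of a file
txt <- "id;height;sex
1;174.5;male
2;162.0;female"
d <- read.table(text = txt, header = TRUE, sep = ";", as.is = TRUE)
str(d)   # check that every column was read with the expected type
```

With a real file, the first argument would be the file name, and the other arguments would stay the same.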
R graphics
- Graphical systems
- Graphical devices
- Graphical options
- Interactive graphics
We recommend checking what will happen if you supply a three-column data frame (such as the built-in trees data) or a contingency table (such as the built-in Titanic or HairEyeColor data) to plot(). Note that R does this silently, so if there was a file with the same name, it will be overwritten.
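Both points can be tried together in a short sketch. We assume a PDF device here and write to a temporary file (so no existing file is at risk); the data sets are the built-in ones.

```r
out <- tempfile(fileext = ".pdf")  # temporary file name, safe to write to
pdf(out)                           # open a file-based graphical device
plot(trees)                        # three-column data frame: scatterplot matrix
plot(Titanic)                      # multi-way contingency table: mosaic plot
plot(HairEyeColor)
dev.off()                          # close the device to finish the file
file.exists(out)
```

If pdf() were given the name of an existing file instead, that file would be replaced without any warning.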
Answers to exercises
To know the available point types, run example(points) and skip several graphs to see the point table; or just look at Figure A.1 in this book (and read the notes on how it was made). To know the structure, (1) we need to look at this file from R with url.show() (or without R, in the web browser), and also (2) look at the accompanying file, eggs_c.txt.
Types of data
Degrees, hours and kilometers: measurement data
Remember that by "normal" we mean the data whose distribution allows us to guess that parametric methods are appropriate ways to analyze them. Discrete measurement data is actually more convenient for computers: as you may know, processors are based on 0/1 logic and do not readily understand non-integral floating-point numbers.
Grades and t-shirts: ranked data
In particular, we measured the total diameter of the heads (variable HEAD.D) and counted the number of rays ("petals", variable RAYS, see Figure 3.4). If we still want to use parametric methods, we need to obtain measurement data (which usually means a different study design) and also check for normality. With this, we obtain measurement data that can be suitable for parametric analysis methods.
Colors, names and sexes: nominal data
- Character vectors
- Factors
- Logical vectors and binary data
The answer is really simple: because "she" is the first in alphabetical order. R uses this order every time it converts factors to numbers. Sometimes binary data can be ordered (as with presence/absence), sometimes not (as with right or wrong answers). Binary data can be presented either as 0/1 numbers or as a logical vector, which is a sequence of TRUE or FALSE values.
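The alphabetical ordering of factor levels, and the two ways to present binary data, can be seen in a minimal sketch. The vector below is hypothetical and uses the labels "female" and "male" for illustration.

```r
sex <- factor(c("male", "female", "male", "female", "female"))
levels(sex)                    # levels are sorted alphabetically
as.numeric(sex)                # numeric codes follow the order of levels
is.female <- sex == "female"   # logical vector of TRUE and FALSE
as.numeric(is.female)          # the same data as 0/1 numbers
sum(is.female)                 # TRUE counts as 1, so this is the number of females
```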
Fractions, counts and ranks: secondary data
Word cloud plots use random numbers, so it is better to run set.seed() immediately before plotting so that the plot on your computer is similar to Figure 3.9. For example, it is easy to remove columns from the data frame with a command like trees[, -3]. It is possible to avoid ties by adding small random noise with the jitter() command (examples follow).
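A short sketch of these commands follows. The vector x is hypothetical; trees is the built-in data set. Note that set.seed() makes the random noise reproducible in the same way as it does for plots.

```r
set.seed(1)                 # make the random noise reproducible
x <- c(1, 1, 2, 2, 2, 3)    # hypothetical values, some of them identical
xj <- jitter(x)             # add small random noise
anyDuplicated(xj)           # 0 means no identical values are left
trees2 <- trees[, -3]       # negative index drops the third column
names(trees2)
```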
Missing data
The "trick" here was to use names to represent rows. (All R objects, together with values, can carry names.) Imagine that we are studying birdhouses and measuring beak lengths in birds found there, but suddenly found a squirrel in one of the boxes. There are many other ways to impute missing data, more complicated ones based on bootstrap, regression and/or discriminant analysis.
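For comparison with the more complicated methods, here is a minimal sketch of perhaps the simplest imputation, replacement with the sample median. The beak lengths are hypothetical, and the choice of the median is our assumption for this example.

```r
beak <- c(12.1, 11.4, NA, 13.0, 12.6)   # hypothetical lengths; NA marks the squirrel box
mean(beak)                     # NA: missing values propagate into the result
mean(beak, na.rm = TRUE)       # skip the missing value
beak.imp <- beak
beak.imp[is.na(beak.imp)] <- median(beak, na.rm = TRUE)  # replace NA with the median
beak.imp
```

Such a replacement keeps the sample size but artificially reduces variation, so it should be used with care.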
Outliers, and how to find them
This behavior is useful for identifying errors, such as "O" (the letter O) instead of "0" (zero), but it will lead to problems if the headers are not explicitly defined.
Changing data: basics of transformations
- How to tell the kind of data
Data are often transformed to bring them closer to the assumptions of parametric methods and to homogenize standard deviations. Some transformations can normalize distributions with a positive skew (right tail), bring relationships between variables closer to linear, and equalize variances. Others can normalize negatively skewed (left-tailed) data, bring relationships between variables closer to linear, and equalize variances.
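The logarithm is one transformation commonly applied to right-skewed data. The sketch below uses a simulated sample (an assumption, not data from the book) and a small hypothetical helper skew() to show how the asymmetry changes.

```r
set.seed(42)
x <- rlnorm(200)   # simulated right-skewed sample
# Simple moment-based skewness (hypothetical helper)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(x)            # clearly positive
skew(log(x))       # much closer to zero after the transformation
```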
Inside R
- Matrices
- Lists
- Data frames
- Overview of data types and modes
Since arrays are vectors, all array elements must be of the same type: numeric, character, or logical. Vectors and arrays can only contain elements of the same type, while lists accept anything, including other lists. Each column of a data frame must contain data of the same type (as with vectors), but the columns themselves can be of different types (as with lists).
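These rules are easy to check directly. In the sketch below all object names and values are made up for the illustration.

```r
v <- c(1, "a", TRUE)         # mixed input is coerced to one common type
class(v)
l <- list(1, "a", TRUE, list(2, "b"))   # a list keeps every element as it is
sapply(l, class)
df <- data.frame(N = 1:3, L = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(df, class)            # each column has its own type
```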
Answers to exercises
However, there is one big problem that is not easy to recognize at first: in many places points overlap each other, and therefore the number of visible data points is much smaller than in the data file. What's worse, we can't tell if the first and third species are well separated or not, because we don't see how many data values are located on the "boundary" between them. Function jitter() adds random noise to variables and shifts points, which makes it possible to see what's underneath.
One-dimensional data
How to estimate general tendencies
- Median is the best
- Quartiles and quantiles
- Variation
The first method uses attach() and adds the columns from the table to the list of "visible" variables. These two functions sometimes give slightly different results, but this is irrelevant to the research. Figure 4.1 summarizes the most important ways of reporting central tendency and variation using the same Euler diagram that was used to show the relationship between parametric and non-parametric approaches (Figure 3.2).
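The most common measures of central tendency and variation are available as base R functions. Here is a short sketch with the built-in trees data (our choice for the illustration).

```r
h <- trees$Height          # built-in data: heights of black cherry trees
median(h)                  # robust center
mean(h)                    # parametric center
quantile(h)                # minimum, quartiles and maximum
fivenum(h)                 # Tukey's five-number summary
IQR(h)                     # interquartile range, robust variation
mad(h)                     # median absolute deviation
sd(h)                      # standard deviation, parametric variation
```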
1-dimensional plots
Confidence intervals
This is our null hypothesis, H0, which we want to accept or reject based on the test results. However, what is really important at the moment is the confidence interval, a range in which the true population mean should fall with given probability (95%).
Wilcoxon signed rank test with continuity correction
data: salary
alternative hypothesis: true location is not equal to 0
95 percent confidence interval:
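A confidence interval of this kind can be requested from wilcox.test() with the conf.int argument. In the sketch below the salary values are hypothetical, and exact = FALSE is our choice to request the normal approximation with continuity correction.

```r
salary <- c(21, 19, 27, 11, 102, 25, 21)   # hypothetical values
w <- wilcox.test(salary, mu = 0, conf.int = TRUE, exact = FALSE)
w$conf.int        # 95% confidence interval for the location
w$estimate        # (pseudo)median, the point estimate of the location
```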
Normality
If this notation is not comfortable for you, there is a way to get rid of it. Most of the time these three ways of determining normality agree, but it is not a surprise if they give different results. Normality check is not a death sentence, it is just an assessment based on probability.
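Here is a minimal sketch of one normality check, the Shapiro-Wilk test, applied to two simulated samples (assumptions of this example, not data from the book). We also assume that the notation in question is the scientific notation of very small p-values, and show one way to print such a number in fixed notation.

```r
set.seed(1)
x.norm <- rnorm(100)   # simulated normal sample
x.exp <- rexp(100)     # simulated, strongly skewed sample
shapiro.test(x.norm)$p.value
p.exp <- shapiro.test(x.exp)$p.value
p.exp                                  # very small, typically shown in scientific notation
format(p.exp, scientific = FALSE)      # the same number in fixed notation
```

A small p-value here suggests that the sample deviates from the normal distribution.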
How to create your own functions
(Here we followed the convention that in anonymous functions, argument names must start with a dot.) In the first chapter we used dact.txt data to illustrate the situation where it is really hard to say anything about data without statistical analysis. In the open repository, the file nymphaeaceae.txt contains counts of flower parts taken from two members of the water lily family (Nymphaeaceae), Nuphar lutea (yellow water lily) and Nymphaea candida (white water lily).
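To illustrate both a named and an anonymous function, here is a minimal sketch. The function CV() (coefficient of variation) is a hypothetical helper written for this example, and the data are the built-in trees.

```r
# Coefficient of variation, in percent (a hypothetical helper)
CV <- function(x) 100 * sd(x) / mean(x)
CV(trees$Height)
# The same calculation as an anonymous function applied to every column;
# the argument name starts with a dot
cvs <- sapply(trees, function(.x) 100 * sd(.x) / mean(.x))
cvs
```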
How good is the proportion?
Now we have to ask the main statistical question: what is the proportion of women in the whole population (all similar firms)? According to the confidence interval, the actual proportion of people who voted for candidate A varies from 47% to 100%. In the open repository there is a data file phyllotaxis.txt that contains measurements of phyllotaxis in nature.
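A confidence interval for a proportion can be obtained, for example, with the exact binomial test. The poll numbers below are hypothetical and serve only to show the calls.

```r
# Hypothetical poll: 8 of 12 respondents voted for candidate A
pt <- binom.test(8, 12)
pt$estimate     # observed proportion
pt$conf.int     # 95% confidence interval for the population proportion
```

With such a small sample the interval is wide, which is exactly why the interval, and not only the observed proportion, should be reported.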
Answers to exercises
First we need to check the data and understand its structure, for example with url.show(). From the histogram (Fig. 4.13) and stem-and-leaf, we can predict positive skewness (distribution asymmetry) and negative kurtosis (distribution flatter than normal). Therefore, it is theoretically possible for the same character in one species to exhibit a normal distribution, whereas in another species it does not.
Two-dimensional data
What is a statistical test?
- Statistical hypotheses
- Statistical errors
The null hypothesis is a proposition of absence of something (for example difference between two samples or relationship between two variables). In fact, p-value is a probability of having the same or greater effect if the null hypothesis is true. The conventional answer puts that threshold at 0.05—the alternative hypothesis is accepted if the p-value is less than 5% (greater than 95% confidence level).
Is there a difference? Comparing two samples
- Two sample tests
- Effect sizes
The data is in the long form: column extra contains the increase in sleep times (in hours, positive or negative), while column group indicates the group (type of drug). This is plausible because the level of ozone in the atmosphere is strongly dependent on solar activity, temperature and wind. In the data file grades.txt are the grades for a certain group of students for the first exam (in the column labeled A1) and the second exam (A2), as well as the grades for another group of students for the first exam (B1).
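With long-form data, a two-sample test is conveniently written with a formula. The sketch below uses the built-in sleep data and, as a simplifying assumption, treats the two groups as independent samples; the effect size is Cohen's d calculated by hand.

```r
tt <- t.test(extra ~ group, data = sleep)   # Welch two-sample t-test
tt$p.value
# Cohen's d as a simple effect size, with pooled standard deviation
m <- tapply(sleep$extra, sleep$group, mean)
s <- tapply(sleep$extra, sleep$group, sd)
d <- unname((m[2] - m[1]) / sqrt((s[1]^2 + s[2]^2) / 2))   # groups are of equal size
d
```

Note how the p-value and the effect size answer different questions: the first is about statistical significance, the second about the magnitude of the difference.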
If there are more than two samples: ANOVA
- One way
- More than one way
The null hypothesis here is that all samples belong to the same population (“are not different”), and the alternative hypothesis is that at least one sample is divergent, and does not belong to the same population (“samples are different”). If not, one of the solutions is to first transform the data with a logarithm or square root, or to rank them, or to use an even more sophisticated approach. As a post-hoc test, it is possible to use pairwise.Rro.test() from the shipunov package, which does not assume similarity of distributions.
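The general workflow can be sketched with base R only. We use the built-in PlantGrowth data (our choice for the illustration), and TukeyHSD() and kruskal.test() stand in here as commonly used base R tools; they are not the shipunov functions mentioned above.

```r
fit <- aov(weight ~ group, data = PlantGrowth)   # built-in data, three groups
summary(fit)                                     # one-way ANOVA table
p.aov <- summary(fit)[[1]][["Pr(>F)"]][1]        # extract the p-value
TukeyHSD(fit)                                    # parametric post hoc comparisons
kw <- kruskal.test(weight ~ group, data = PlantGrowth)   # rank-based alternative
kw$p.value
```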
Is there an association? Analysis of tables
- Contingency tables
- Table tests
(If the data is a "table" object with more than one dimension, the plot() command will draw a mosaic plot by default.) Two-sample chi-square (or χ2) test requires either a contingency table or two factors of the same length (to first calculate the table from them). To check its significance, we will first apply the chi-square test several times and check p-values.
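A minimal sketch of the chi-square test on a contingency table follows. We take the built-in HairEyeColor data and pool the sexes to obtain a two-dimensional table; this choice of data is ours.

```r
he <- margin.table(HairEyeColor, c(1, 2))   # hair by eye color, sexes pooled
ct <- chisq.test(he)
ct$p.value           # very small: hair and eye color are associated
ct$expected          # expected counts under independence
```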
CRABDIP 0.9493138514
A simple look at the data will reveal nothing, as the banquet had 45 participants and 13 different dishes.
CAKE 0.8694796709
Answers to exercises
- Two sample tests, effect sizes
Chi-square test works well when the number of cases per cell is more than 5. There is also mcnemar.test(), which is used to compare proportions when they belong to the same objects (paired proportions). If there are more than two groups per case involved, you can optionally run post hoc pairwise tests with the appropriate correction: pairwise.Table2.test().
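For paired proportions, a minimal sketch of mcnemar.test() is shown below. The table is hypothetical: the same 100 objects are assessed twice, and the counts are made up for the illustration.

```r
# Hypothetical paired data: the same 100 objects assessed before and after
paired <- matrix(c(40, 10, 25, 25), nrow = 2,
  dimnames = list(Before = c("yes", "no"), After = c("yes", "no")))
paired
mt <- mcnemar.test(paired)   # uses only the discordant cells (25 and 10)
mt$p.value
```

Only the objects which changed their state contribute to the test, so the diagonal cells do not influence the result.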