R has become one of the best, if not the best, advanced data analysis programs available. Because of R's price (it is free), extensibility, and growing use in bioinformatics, R was chosen as the software for this book.
The Menu Bar
This can also be done using the help() command, as we will see in the next chapter. "Apropos" pops up a dialog box similar to the help box shown in Figure 2-5, but you need to enter only a partial search term to search the R documents.
The Toolbar
R is actually very extensively documented, and only a fraction of this documentation is directly available through the help menu. However, much of the documentation is technical rather than tutorial, aimed more at the programmer and developer than at the applied user.
Packages
Installing Packages
Select the package of interest and click OK, and R will automatically download the package to your computer and place it in the appropriate directory so that it can be loaded with a command. For packages not available through this menu, download them to a suitable folder on your computer from their source page.
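Packages on CRAN can also be installed from the command line with install.packages(); a minimal sketch, using the gregmisc package (used later in this book) as the example name:

    # Download and install a package from CRAN
    install.packages("gregmisc")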
Loading Packages
Select the package to load, and you should be ready to use the features of that package during the current session. Loaded packages are unloaded when you exit R and are NOT saved when you save your workspace or history (discussed in the next chapter).
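From the command line, loading is done with the library() function; a minimal sketch:

    library(gregmisc)   # load an installed package into the current session
    library()           # with no argument, lists all installed packages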
R vs. S-Plus
Even if you open your previous work in R, you will still need to reload the packages whose functions you are using.
R and Other Technologies
Using the S Language
Function objects perform tasks that manipulate data objects, usually some type of calculation, graphical representation, or other data analysis. Basically, R is about creating data objects and manipulating those data objects using function objects.
Structuring Data With Objects
Function objects are the objects that perform tasks on data objects, and are obtained as part of the R base system, part of an imported package, or can be written from scratch. We'll start working with function objects right away in this chapter, but the next chapter will cover them in more depth, including some basics about writing your own functions.
Types of Data Objects in R
Vectors are probably the most used data objects in R and in this book they are used literally everywhere. A list, unlike a vector, can contain data with different modes under the same variable name and include other data objects.
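As a brief illustration (the object names here are hypothetical):

    # A vector holds values of a single mode
    counts <- c(12, 7, 31)
    # A list can mix modes and contain other data objects
    gene.info <- list(name = "BRCA1", counts = counts, coding = TRUE)
    gene.info$name   # access a list component by name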
Working with Data Objects
Because a data frame is a set of vectors, you can use the vector tricks mentioned above to work with specific elements of each vector within the data frame. To address a specific vector (column) in a data frame, use the "$" operator: the name of the data frame, then "$", then the name of the column.
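For example, with a hypothetical data frame of expression values:

    expr <- data.frame(gene = c("A", "B", "C"), value = c(2.1, 0.7, 3.5))
    expr$value      # the "value" column as a vector
    expr$value[2]   # the second element of that column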
Controlling the Workspace
If calculations are performed on data objects with NA values, the NA value is passed through to the result. If a calculation involving an element of a data object misbehaves and you are not sure whether that element is a missing value, the function is.na() can be used to determine whether the element in question is an NA value.
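A minimal sketch of both behaviors:

    x <- c(4, NA, 7)
    sum(x)                # NA, because the missing value propagates to the result
    is.na(x)              # FALSE TRUE FALSE: identifies the missing element
    sum(x, na.rm = TRUE)  # many functions can be told to ignore NA values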
Editing Data Objects
Saving Your Work
The option to save the workspace can be executed at any time with the save.image() command (also available as "Save Workspace" under the File menu) or at the end of a session, when R asks whether you want to save the workspace. The saved workspace defaults to having no specific name (and will be overwritten the next time you save an unnamed workspace), but you can use save.image(filename) if you want to save different workspaces and give each one a name.
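For example (the file name is hypothetical):

    save.image()                    # saves to the default, unnamed .RData file
    save.image("myAnalysis.RData")  # saves the workspace under a specific name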
Importing Files
The most convenient way to import data into R is to use the read functions, most generally read.table(). A common special case is read.csv(), which reads comma-delimited data, a format in which most spreadsheet programs can save files.
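A minimal sketch, assuming hypothetical file names and a first row of column names:

    mydata <- read.csv("expression.csv", header = TRUE)
    # read.table() handles other delimited formats, such as tab-delimited text
    mydata <- read.table("expression.txt", header = TRUE, sep = "\t")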
Getting Help
Data from most spreadsheet programs can be saved to one or both types of files. Saving spreadsheet data in text format (comma- or tab-delimited) is particularly simple and useful.
Program Help Files
Caution is advised, as some spreadsheet programs may limit the number of data values that can be saved. To use this function, type part of the word you are looking for, in quotation marks, as the function argument.
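Assuming the function being described is apropos(), a minimal sketch:

    apropos("mean")   # lists all objects whose names contain "mean"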
Documents
Note that the online help is particularly useful for describing function parameters, which this book will not discuss in detail, because many functions in R have many optional parameters (for example, read.table() has about 12 different optional parameters). You can download and save these documents, or view them online with a web browser that can display PDF files.
Books
On-line Discussion Forums
The aim of this chapter is for the reader to become familiar with the basic principles of programming in R: the use of correct syntax, the use of logical statements, and the writing of functions. This should equip the reader with the minimal set of programming skills needed to use R effectively.
Variables and Assignment
Syntax
No prior programming knowledge is assumed, although some would be helpful, as many R programming techniques are similar to those of other high-level programming languages.
Arithmetic Operators
Logical and Relational Operators
The condition in parentheses usually uses a relational operator to determine whether the condition is true. When the if statement is used alone and the condition inside the parentheses is false, nothing happens.
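A minimal sketch:

    x <- 7
    if (x > 5) print("x is greater than 5")   # condition TRUE, so this prints
    if (x > 10) print("never reached")        # condition FALSE, so nothing happens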
Looping
Note that x equals 6 at the end because the loop condition is still true at the beginning of the iteration in which x equals 5. For loops can also be nested (one inside another) to initialize matrices or perform similar tasks, as sketched below.
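A sketch of both constructs (the loop bounds and matrix contents are assumptions chosen to match the behavior described above):

    x <- 1
    while (x <= 5) x <- x + 1   # still TRUE at x = 5, so x ends at 6
    x                           # 6

    # Nested for loops initializing a 3 x 3 matrix
    m <- matrix(0, nrow = 3, ncol = 3)
    for (i in 1:3) {
      for (j in 1:3) {
        m[i, j] <- i * j
      }
    }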
Subsetting with Logical Operators
Subsetting with logical operators is based on the following simple fact: each logical statement produces one of the outcomes TRUE or FALSE. If we use the outcomes of a logical statement to subset a vector, only those elements where the outcome is TRUE will be selected.
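For example:

    x <- c(2, 9, 4, 11)
    x > 5       # FALSE TRUE FALSE TRUE
    x[x > 5]    # 9 11: only the elements where the outcome is TRUE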
Functions
An alternative you will also see is the underscore assignment, "functionName_function()", although this usage may be deprecated in the latest version of R. The "do this" task is executed and/or the value is returned to the calling code.
Using Existing Functions
To use features from packages, always remember that you must have the package installed and loaded in order to use its features.
Writing Functions
This book is filled with additional examples of task-oriented functions, and you should become fluent in using them.
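As a minimal sketch of a user-written, task-oriented function (the name and task are hypothetical):

    # Coefficient of variation: standard deviation relative to the mean
    cv <- function(x) {
      sd(x) / mean(x)
    }
    cv(c(4, 8, 6, 10))   # called like any built-in function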
Package Creation
Exploratory Data Analysis
Data Summary Functions in R
Here is a case where the summary function can be very useful as a screening tool for spotting interesting trends in the data. Whether such a trend is statistically significant can then be assessed with formal testing (such as the methods discussed in Chapter 13).
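For example, on a hypothetical vector of measurements:

    scores <- c(2.1, 0.7, 3.5, 1.8, 2.9)
    summary(scores)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.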
Working with Graphics
Note that there is some redundancy between the low-level drawing functions and the arguments of the high-level drawing functions. Note also that most of these work not only with the plot() function but also with other high-level plotting functions.
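A minimal sketch mixing a high-level call with a low-level addition:

    x <- 1:10
    y <- x^2
    plot(x, y, type = "b", col = "blue", xlab = "x", ylab = "y")  # high-level
    abline(h = 50, lty = 2)   # low-level: adds a line to the existing plot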
Saving Graphics
However, graphics code in R can be as complex as you want it to be, and this is just a snapshot of R's full graphics capabilities. With a little practice, you can code R to produce custom publication-quality graphics that effectively illustrate almost any data analysis result.
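For example, a plot can be written directly to a PDF file (the file name is hypothetical):

    pdf("myplot.pdf")   # open a PDF graphics device
    plot(rnorm(100))    # plotting commands now draw into the file
    dev.off()           # close the device to finish writing the file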
Additional Graphics Packages
Probability is a mathematical language and framework that allows us to make statistical statements and analyze data. Understanding probability is key to being able to analyze data to provide meaningful and scientifically valid conclusions.
Two Schools of Thought
Understanding classical (or frequentist) and Bayesian statistics requires knowledge of the theory and methods of probability and probability models. This chapter reviews the basics of probability and serves as a conceptual foundation for much of the material in the subsequent chapters of this section and for various other chapters in this book.
Essentials of Probability Theory
Set Operations
Set Operation 1: Union
Set Operation 2: Intersection
Set Operation 3: Complement
Disjoint Events
The Fundamental Rules of Probability
Rule 1: Probability is always positive
Rule 2: For a given sample space, the sum of probabilities is 1
For each event in a sample space where the outcomes have the same probability of occurring, the probability of an event is equal to the number of outcomes making up the event divided by the total number of outcomes in the sample space. In the 4-nucleotide example, suppose that only P(A) is known and we want to calculate the probability of any other nucleotide (that is, the event that is the complement of A); that probability is simply 1 - P(A).
But the sum of the probabilities of the events in the sample space is still 1; the probabilities of the individual nucleotides simply change to reflect the composition of events in the sample space. In this case, if the outcomes are disjoint events, then the probability of their union can be represented by the sum of the individual probabilities.
Counting
For events that are not disjoint, the calculation of P(A ∪ B) must take into account the overlap (intersection) of the events, and is calculated as P(A ∪ B) = P(A) + P(B) - P(A ∩ B), subtracting the overlapping region (which would otherwise be double-counted). For more than two events that are not disjoint, this calculation becomes more complicated, because care is needed to count each intersection region only once.
The Fundamental Principle of Counting
Suppose there are 4 genetic alleles for protein 1, 2 alleles for protein 2, 9 alleles for protein 3, 11 alleles for protein 4, and 6 alleles for protein 5. The answer is a direct multiplication of the numbers of alleles for all 5 proteins: 4 × 2 × 9 × 11 × 6 = 4752 possible combinations of alleles for this 5-protein complex.
Permutations (Order is Important)
Combinations (Order is Not Important)
Using permutations and combinatorial calculations, probabilities in a sample space where all outcomes are equally likely can be found as the number of outcomes that define a particular event over the total number of possible outcomes in a sample space.
Counting in R
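The basic counting functions are built into R:

    factorial(5)                   # 5! = 120
    choose(10, 3)                  # combinations of 10 items taken 3 at a time: 120
    choose(10, 3) * factorial(3)   # the corresponding permutations: 720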
Introducing the Gamma function
While the gamma function may seem strange at first (it is not usually taught in lower-level math courses), it is worth becoming familiar with. The gamma function appears in several continuous probability distributions (discussed later in this chapter), including the gamma distribution, the beta distribution, and the chi-square distribution.
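R provides the gamma function directly:

    gamma(5)     # 24, since gamma(n) = (n - 1)! for positive integers
    gamma(0.5)   # sqrt(pi), approximately 1.7725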
Modeling Probability
Bayesian statistics make use of these distributions, and they are widely used in bioinformatics applications. The remainder of this chapter is devoted to several concepts important to understanding probability models: random variables, discrete versus continuous variables, probability distributions and densities, parameters, and univariate versus multivariate statistics.
Random Variables
Discrete versus Continuous Random Variables
Probability Distributions and Densities
In R, a simple histogram (shown in Figure 6-9) can be used to model the probability distribution function for this example. For a continuous random variable, the CDF is a smooth curve rather than a step function, and F(x) is the integral of the density from negative infinity (or wherever x is defined) up to the value x.
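A minimal sketch, using simulated discrete data in place of the data behind Figure 6-9:

    x <- rpois(100, lambda = 3)   # simulated discrete data (an assumption)
    hist(x, freq = FALSE)         # relative-frequency histogram of the distribution
    plot(ecdf(x))                 # the empirical CDF, drawn as a step function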
Empirical CDFs and Package stepfun
Parameters
One of those nice features is that the mean of the distribution corresponds to the location parameter (and the standard deviation corresponds to the scale parameter). Shifting the location parameter of the normal distribution shifts the point where the center of the distribution is located.
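This is easy to see by overlaying normal densities with different parameter values (a minimal sketch):

    x <- seq(-6, 6, by = 0.1)
    plot(x, dnorm(x, mean = 0, sd = 1), type = "l")  # standard normal
    lines(x, dnorm(x, mean = 2, sd = 1), lty = 2)    # location shifted to 2
    lines(x, dnorm(x, mean = 0, sd = 2), lty = 3)    # scale (spread) doubled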
Univariate versus Multivariate Statistics
Even before specific models are introduced, you may be wondering how to choose a model to use and how to choose values of parameters. Sometimes the parameters can be determined from the data, as is the case with the mean and standard deviation of normally distributed data.
Univariate Discrete Distributions
The Binomial Distribution
Let's assume that the probability of a child having blue eyes is 0.16 (not empirically verified!) and leave it at that. Note that this alone is not yet a probability distribution function, as it models only the probability of one particular sequence of n = 10 children of whom k have blue eyes.
Note that this distribution is quite skewed towards lower values of the random variable (X=number of children with blue eyes) because the value of p is 0.16. Notice in Figure 7-2 how the larger p shifts the distribution more toward higher values of the random variable.
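A sketch of the blue-eyes example with n = 10 and p = 0.16:

    k <- 0:10
    probs <- dbinom(k, size = 10, prob = 0.16)
    barplot(probs, names.arg = k)       # skewed toward low counts because p is small
    dbinom(2, size = 10, prob = 0.16)   # P(exactly 2 of 10 children have blue eyes)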
The Poisson Distribution
A multivariate version of the binomial, the multinomial distribution, will be introduced in the next chapter. Let's do this over the range x = 0 to 20 for all the values of lambda used in the previous plot, as sketched below.
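A sketch of that plot; the lambda values here are assumptions standing in for those used previously:

    x <- 0:20
    plot(x, dpois(x, lambda = 1), type = "h")
    points(x + 0.2, dpois(x, lambda = 4), type = "h", col = "blue")
    points(x + 0.4, dpois(x, lambda = 10), type = "h", col = "red")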
Univariate Continuous Distributions
Univariate Normal Distribution
Remember that increasing the scale parameter, which is the standard deviation in the case of the normal, increases how spread out the data are. R automatically calculates what percentage of the distribution corresponds to a given cumulative probability region.
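Assuming the calculation being described is the one performed by pnorm(), a minimal sketch:

    pnorm(1.96)            # cumulative probability below 1.96: about 0.975
    pnorm(1) - pnorm(-1)   # about 0.683 of the distribution lies within 1 sd
    qnorm(0.975)           # the inverse: the quantile for a given probability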
The Gamma Family
The mean of the gamma distribution is equal to alpha × beta, and the variance (the standard deviation squared) is equal to alpha × beta². So the distribution of the data can be modeled using a gamma probability density function with shape (alpha) = 5.6 and scale (beta) = 0.6.
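A sketch of that density and its moments:

    x <- seq(0, 10, by = 0.1)
    plot(x, dgamma(x, shape = 5.6, scale = 0.6), type = "l")
    5.6 * 0.6     # mean = alpha * beta = 3.36
    5.6 * 0.6^2   # variance = alpha * beta^2 = 2.016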
The Exponential Distribution
The Chi Square Distribution
The Beta Family
One very important thing to note: the range of x in the equation for the beta probability density is explicitly restricted to between 0 and 1. Unfortunately, the interpretation of the parameters of the beta is not as clear as with the gamma, and with the beta distribution the alpha and beta parameters are sometimes referred to as the "shape 1" and "shape 2" parameters.
For example, if you have data on the percentage of an amino acid in a protein motif (say a leucine chain) and it is not likely to be normally distributed (check with a Q-Q plot), then you should model the data with a beta density function.
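A minimal sketch with assumed shape parameters:

    x <- seq(0, 1, by = 0.01)
    # shape1 = 2 and shape2 = 5 are assumptions for illustration
    plot(x, dbeta(x, shape1 = 2, shape2 = 5), type = "l")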
Other Continuous Distributions
Simulations
Each distribution mentioned in this chapter has a corresponding function in R whose name starts with "r", for random generation. All you need to do to get simulated values is specify the number of simulated values you want and the parameters of the distribution you are simulating from.
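For example, reusing parameter values from earlier in the chapter:

    rnorm(5, mean = 0, sd = 1)            # 5 simulated standard normal values
    rbinom(5, size = 10, prob = 0.16)     # 5 simulated binomial counts
    rgamma(5, shape = 5.6, scale = 0.6)   # 5 simulated gamma values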
Expanded Probability Concepts
The previous two chapters looked at the basic principles of probability and the probability distribution of a single random variable. In this chapter we introduce some further concepts of probability and then broaden our look at probability distributions to include distributions that model two or more random variables.
Conditional Probability
Within this new restricted sample space, we ask for the probability that the pea is heterozygous. P(A|B) is read as "the conditional probability of event A given that event B has occurred" and is defined as P(A|B) = P(A ∩ B) / P(B).
Independence
This form of the definition of independence, P(A ∩ B) = P(A)P(B), is very useful for computing the joint probabilities of independent events. In many cases, in sequence analysis or in the analysis of other events, it is clear that the events are not independent.
Joint and Marginal Probabilities
What if we wanted to know the probability that nucleotide A is at position 1, regardless of the nucleotide at position 2? Here, the joint probability, P(T at P2 ∩ G at P1), is obtained from the cell in Table 8-2 containing the joint probability for nucleotide G at position 1 and T at position 2, and P(G at P1) is the marginal probability of nucleotide G at position 1, obtained from the column total for that column in Table 8-2.
The Law of Total Probability
In the above formula and in Figure 8-2, A is the union of the disjoint (mutually exclusive) sets A ∩ Bi, over all i. Although the law of total probability may seem confusing, we in fact already applied it earlier in this chapter when calculating the marginal probabilities in the adjacent-nucleotide example.
Probability Distributions Involving More than One Random Variable
Joint Distributions of Discrete Random Variables
Although we won't do that here, it's fairly easy to extend joint distributions to more than two variables. We can extend the nucleotide example with the distribution of the nucleotide at position 3 of a sequence.
Marginal Distributions
The probability of any given 3-nucleotide sequence such as ATG would be given by the joint distribution of the random variables X, Y, and Z representing the corresponding probabilities of each nucleotide at each position. Similarly, we could calculate the marginal probability mass function (pmf) of the random variable Y by summing all the values of X, as shown in Table 8-5.
Conditional Distributions
Note that the sum of the probabilities for each marginal distribution adds to 1 (obeying the law that the sum of all probabilities in a sample space adds to 1), which should always be the case (and serves as a good check for determining whether you have done the calculations correctly). Again, note that the sum of the conditional probabilities adds to 1, satisfying the probability law that the sum of the probabilities of the events in a sample space adds to 1.
Joint, Marginal and Conditional Distributions for Continuous Variables
Continuing this calculation for other nucleotides gives the conditional distribution of Y given X = A in Table 8-6. But conceptually, the probability that (X, Y) lies in a region A is equal to the volume under the function f(x, y) over the region A.
Common Multivariable Distributions
The Multinomial Distribution
Among the many applications in bioinformatics, the multinomial model is often used in modeling the joint distribution of the number of observed genotypes. Suppose we are only interested in the marginal probability mass function of the random variable X.
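A minimal sketch, with assumed genotype probabilities for AA, Aa, and aa:

    # P(3 AA, 5 Aa, 2 aa) in a sample of 10, assuming probabilities 0.25, 0.50, 0.25
    dmultinom(c(3, 5, 2), prob = c(0.25, 0.50, 0.25))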
Multivariate Normal Distribution
To do this, we can write a function in R to simulate draws from a standard bivariate normal, as sketched below. If we were to run a simulation of 1000 values from a univariate standard normal distribution, we would get a virtually identical result for each margin.
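A minimal sketch of such a function, using the conditional decomposition of the standard bivariate normal (the correlation rho is an assumed input): if X ~ N(0, 1), then Y given X = x is N(rho × x, 1 - rho²).

    rbvnorm <- function(n, rho) {
      x <- rnorm(n)
      y <- rnorm(n, mean = rho * x, sd = sqrt(1 - rho^2))
      cbind(x, y)
    }
    sim <- rbvnorm(1000, rho = 0.5)
    plot(sim)        # scatterplot of the simulated pairs
    hist(sim[, 1])   # each margin is approximately standard normal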
Dirichlet Distribution
Again there is no need to write a function to perform simulations, because the gregmisc package contains predefined functions for calculating the density or generating random values from the Dirichlet distribution. It is important here to understand the basics of Dirichlet and its mathematical model and how to derive Dirichlet simulations using R.
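A minimal sketch, assuming the gregmisc bundle (whose functions now live in the gtools package) provides rdirichlet():

    library(gtools)   # or library(gregmisc) on older installations
    # 5 simulated probability vectors from a Dirichlet with assumed
    # parameters alpha = (1, 1, 1); each row sums to 1
    rdirichlet(5, alpha = c(1, 1, 1))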
The Essential Difference
Importantly, the Bayesian first uses subjective belief to construct an initial model of the probability distribution. It is this inclusion of subjective belief in the model that distinguishes Bayesians from frequentists.
Why the Bayesian Approach?
Thanks to computationally intensive methods, the Bayesian approach can handle situations involving many variables (parameters). As we will see with Bayes' theorem, the Bayesian approach involves continually updating the model as new data arrive.
Bayes’ Rule
In the first case, we want complete information about the joint probability of two events. P(A ∩ B) is the joint probability of A and B, just found from the corresponding cell in the joint probability table, and is 0.3; Bayes' rule then gives the conditional probability as P(A|B) = P(A ∩ B) / P(B).
Extending Bayes’ Rule to Working with Distributions
In this case, the prior is the Bayesian's subjective belief, formed before the coin is flipped, about the proportion of times the coin will land heads. Before any data are collected, that subjective belief is all we have.
Priors
Conjugate priors have the property that, when combined with a likelihood of the appropriate parametric form, the posterior distribution follows the same parametric form as the prior distribution (the form of the likelihood need not match the shared form of the prior and posterior). There are other conjugate pairs in which a prior combined with a likelihood yields a posterior with the same form as the prior.
Likelihood Functions
In frequentist statistics, the likelihood function is used to find a single value of a parameter, the so-called point estimate. Making such a point estimate is a frequentist rather than a Bayesian practice, but it is worth noting here because the likelihood function is equally important to the Bayesian approach.
Evaluating the Posterior
The interpretation is that the MLE is the most plausible value of the parameter, in the sense that it produces the highest possible probability of the observed data occurring. Regarding the role of the prior, the more vague the prior, the less the prior affects the posterior.
Applications of Bayesian Statistics in Bioinformatics
This chapter begins a series of three chapters devoted to the basics of Markov chains (Chapter 10), the computational algorithms involved in their evaluation for MCMC methods (Chapter 11), and the use of WinBugs in conjunction with R and the CODA package to implement simple MCMC applications of interest in bioinformatics (Chapter 12). Our focus in this chapter is to provide a basic introduction to Markov chains for the purpose of using them in Markov chain Monte Carlo methods to simulate posterior probability distributions for complex data models.
Beginning with the End in Mind
For example, consider what is actually a fairly simple case: a conjugate pair of a multinomial likelihood and a Dirichlet prior, which produces a Dirichlet posterior distribution. But the dilemma is that in bioinformatics and other fields working with high-dimensional data sets, we want the information contained in a posterior distribution that we cannot evaluate directly.
Stochastic Modeling
The idea here is to introduce the reader to the power of this technique so that he or she understands the terminology in the literature, is able to perform simple exploratory applications using R and utility programs, and is encouraged to collaborate with statisticians who can assist with modeling and data analysis.
Stochastic Processes
Markov Processes
The pattern of 5' to 3' nucleotides in a DNA sequence is not necessarily independent and is often patterned so that the nucleotides are dependent on the nucleotide immediately upstream of them. Each nucleotide state {A, T, C, G} depends sequentially on its adjacent and immediately upstream nucleotide, and only that nucleotide.
Classifications of Stochastic Processes
In the case of a discrete state space, there are finitely or countably many values, and the random variables used are discrete random variables. For example, the set of four nucleotides in a DNA sequence is a discrete state space, modeled using random variables that are discrete (since they can take on only four different values).
Random Walks
To avoid confusion with state space (below), it is better to use the term sequence, position, or place for indices representing locations. If there is a continuum of values (any value) in the state space, a continuous random variable is used and the state space is referred to as continuous.
Probability Models Using Markov Chains
Real-world applications of random walks include fractals, dendritic growth (for example, tree roots), models for stock options and portfolios, and gambling game outcomes.
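A minimal sketch of a simple symmetric random walk:

    # +1 or -1 at each step with equal probability
    steps <- sample(c(-1, 1), size = 1000, replace = TRUE)
    walk <- cumsum(steps)
    plot(walk, type = "l", xlab = "step", ylab = "position")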
Matrix Basics
To multiply matrix A by matrix B to create a new matrix C, we multiply the elements of row 1 of A by the corresponding elements of column 1 of B and add the results to get the first entry of C (c11). Then we multiply the elements of row 1 of A by the elements of column 2 of B to get the c12 entry of C, and so on.
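In R, matrix multiplication is done with the %*% operator:

    A <- matrix(1:4, nrow = 2, byrow = TRUE)   # rows (1, 2) and (3, 4)
    B <- matrix(5:8, nrow = 2, byrow = TRUE)   # rows (5, 6) and (7, 8)
    A %*% B   # c11 = 1*5 + 2*7 = 19, c12 = 1*6 + 2*8 = 22, and so on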
A Simple Markov Chain Model
The matrix below gives the transition probabilities for the first move given the initial state. To get the next transition matrix (t = 1 to t = 2), we can use R to multiply the first transition matrix by itself, using the logic described above.
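A sketch of this step, with a hypothetical two-state transition matrix standing in for the one shown in the text:

    # Rows are current states, columns are next states; values are assumptions
    P <- matrix(c(0.9, 0.1,
                  0.3, 0.7), nrow = 2, byrow = TRUE)
    P %*% P   # the two-step transition probabilities (t = 1 to t = 2)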
Modeling A DNA Sequence with a Markov Chain
We can also use this distribution to calculate the expected values of the numerical distribution of nucleotides. You can then perform a statistical analysis to determine whether there is a significant difference in nucleotide composition between coding and non-coding regions.
Characteristics of a Markov Chain
Finiteness
Aperiodicity
Irreducibility
Therefore, we want our systems to be irreducible when they are used to determine a unique posterior distribution.
Ergodicity
Mixing
Reversibility
In the detailed balance equation for a chain with state space S, π(x)P(x, y) = π(y)P(y, x), x is the current state of the system and y is the state at the next step. If the system has more than two states, detailed balance applies to any two states in the system's state space.
The Stationary State
To do this, we will look at the marginal and conditional probability distributions for the parameters of interest. Recall the discussion and graphical illustrations in Chapter 8 that discussed these distributions for the multivariate normal, multinomial, and Dirichlet distributions.
Continuous State Space
A high-dimensional chain is difficult to analyze or interpret graphically, so usually we will be interested in analyzing individual variables and parameters separately. The CODA package, introduced in Chapter 12, contains functionality for performing certain convergence diagnostics.
Non-MCMC Applications of Markov Chains in Bioinformatics
The convergence of Markov chains is in itself a mathematical topic, the details of which are beyond our discussion. However, the basic properties of Markov chains introduced in this chapter are applicable to other types of applications and should serve as a basis for further research.
A Brief Word About Modeling
For example, there are many applications of Markov chains in sequence analysis, and the use of Markov chains for additional applications is growing in all disciplines related to bioinformatics. Many books of interest are listed in the appendix for those interested in further exploring the use of Markov chains in sequence analysis; in particular, Durbin's book is devoted to this topic.
Monte Carlo Simulations of Distributions
It is possible for a user to work in WinBugs without understanding the algorithmic background the program relies on, but it is neither wise nor useful to do so. Therefore, the focus of this chapter is on the theory of why the program works.
Inverse CDF Method
Let's work through an example of the inverse CDF method, simulating an exponential distribution with parameter lambda = 2, as sketched below. The inverse CDF method is also easy to implement when faced with a new distribution that is not available in software.
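A minimal sketch: the exponential CDF is F(x) = 1 - exp(-lambda × x), so its inverse is F⁻¹(u) = -log(1 - u) / lambda.

    lambda <- 2
    u <- runif(10000)           # uniform draws on (0, 1)
    x <- -log(1 - u) / lambda   # transformed draws follow Exponential(2)
    hist(x, freq = FALSE)
    curve(dexp(x, rate = 2), add = TRUE)   # overlay the target density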
Rejection Sampling
Therefore, we can use as the envelope for the beta distribution a multiple of the uniform pdf, with height y = 1.5 over the range x from 0 to 1. Draw a sample x from the distribution underlying the envelope; in our case, we sample from the uniform [0, 1] distribution, as sketched below.
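A sketch assuming the target is a Beta(2, 2) density, which peaks at exactly 1.5, so a constant envelope of height 1.5 over [0, 1] works:

    n <- 10000
    x <- runif(n)                        # 1. sample from the uniform proposal
    u <- runif(n)                        # 2. uniform heights for the accept test
    accept <- u * 1.5 <= dbeta(x, 2, 2)  # 3. accept if under the target density
    draws <- x[accept]
    hist(draws, freq = FALSE)
    curve(dbeta(x, 2, 2), add = TRUE)    # accepted draws match the target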
The Gibbs Sampler
A high dimensional posterior distribution can be decomposed into conditional distributions for each of the parameters of interest. Finally, we will not study the joint posterior distribution of the sample values, but rather marginal distributions of individual parameters since this is what we are primarily interested in.
A Gibbs Example Using Simple Probability
Below is a small program in R that uses the Gibbs sampler to compute the marginal distributions of A and B by sampling from the full conditionals. Converting the sampled values to probabilities and comparing them to the original joint distribution shows that iterative Gibbs sampling from the conditional distributions reproduces the joint distribution.
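A minimal sketch of such a program; the 2 x 2 joint distribution of the binary events A and B is an assumption standing in for the one in the text:

    # Hypothetical joint distribution: rows are states of A, columns states of B
    joint <- matrix(c(0.3, 0.2,
                      0.1, 0.4), nrow = 2, byrow = TRUE)
    n <- 10000
    a <- b <- integer(n)
    a[1] <- 1; b[1] <- 1   # arbitrary starting state
    for (i in 2:n) {
      pA <- joint[, b[i - 1]] / sum(joint[, b[i - 1]])  # P(A | B = b)
      a[i] <- sample(1:2, 1, prob = pA)
      pB <- joint[a[i], ] / sum(joint[a[i], ])          # P(B | A = a)
      b[i] <- sample(1:2, 1, prob = pB)
    }
    table(a, b) / n   # empirical joint frequencies approximate the joint matrix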
A Gibbs Example Using the Bivariate Normal
Notice in Figure 11-8, which shows plots of selected marginal distributions, that all the marginal distributions show a standard normal pattern. When examining individual variables in the Gibbs sampler's output, one essentially obtains results about the marginal distributions of the parameters of interest.
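A minimal sketch, with an assumed correlation rho = 0.6, using the fact that each full conditional of a standard bivariate normal is univariate normal:

    rho <- 0.6
    n <- 5000
    x <- y <- numeric(n)   # starting state (0, 0)
    for (i in 2:n) {
      x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
      y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
    }
    hist(x, freq = FALSE)   # the marginal of X is approximately standard normal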
Generalized Gibbs Sampler
The Metropolis-Hastings Algorithm