R has become one of the best, if not the best, advanced data analysis programs available. Because of R's price (it is free), extensibility, and growing use in bioinformatics, R was chosen as the software for this book.
The Menu Bar
This can also be done using the help() command, as we will see in the next chapter. "Apropos" pops up a dialog box similar to the help box shown in Figure 2-5, but you need to enter only a partial search term to search the R documents.
The Toolbar
R is actually very extensively documented, and only a fraction of this documentation is directly available through the help menu. However, much of the documentation is technical rather than tutorial, aimed more at the programmer and developer than at the applied user.
Packages
Installing Packages
Select the package of interest and click OK, and R will automatically download the package to your computer and place it in the appropriate directory so that it can be loaded with a command. For packages not available through this menu, download them to a suitable folder on your computer from their source page.
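Packages on CRAN can also be installed from the command line with install.packages(); a minimal sketch, using the gregmisc package (used later in this book) as the example name:

    # Download and install a package from CRAN
    install.packages("gregmisc")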
Loading Packages
Select the package to load, and you should be ready to use the features of that package during the current session. Loaded packages are unloaded when you exit R and are NOT saved when you save your workspace or history (discussed in the next chapter).
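From the command line, loading is done with the library() function; a minimal sketch:

    library(gregmisc)   # load an installed package into the current session
    library()           # with no argument, lists all installed packages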
R vs. S-Plus
Even if you open your previous work in R, you will still need to reload the packages whose functions you are using.
R and Other Technologies
Using the S Language
Function objects perform tasks that manipulate data objects, usually some type of calculation, graphical representation, or other data analysis. Basically, R is about creating data objects and manipulating those data objects using function objects.
Structuring Data With Objects
Function objects are the objects that perform tasks on data objects, and are obtained as part of the R base system, part of an imported package, or can be written from scratch. We'll start working with function objects right away in this chapter, but the next chapter will cover them in more depth, including some basics about writing your own functions.
Types of Data Objects in R
Vectors are probably the most used data objects in R and in this book they are used literally everywhere. A list, unlike a vector, can contain data with different modes under the same variable name and include other data objects.
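As a brief illustration (the object names here are hypothetical):

    # A vector holds values of a single mode
    counts <- c(12, 7, 31)
    # A list can mix modes and contain other data objects
    gene.info <- list(name = "BRCA1", counts = counts, coding = TRUE)
    gene.info$name   # access a list component by name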
Working with Data Objects
Because a data frame is a set of vectors, you can use the vector tricks mentioned above to work with specific elements of each vector within the data frame. To address a specific vector (column) in a data frame, use the "$" operator: the name of the data frame, then "$", then the name of the column.
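For example, with a hypothetical data frame of expression values:

    expr <- data.frame(gene = c("A", "B", "C"), value = c(2.1, 0.7, 3.5))
    expr$value      # the "value" column as a vector
    expr$value[2]   # the second element of that column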
Controlling the Workspace
If calculations are performed on data objects with NA values, the NA value is passed through to the result. If a calculation involving an element of a data object misbehaves and you are not sure whether that element is a missing value, the function is.na() can be used to determine whether the element in question is an NA value.
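A minimal sketch of both behaviors:

    x <- c(4, NA, 7)
    sum(x)                # NA, because the missing value propagates to the result
    is.na(x)              # FALSE TRUE FALSE: identifies the missing element
    sum(x, na.rm = TRUE)  # many functions can be told to ignore NA values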
Editing Data Objects
Saving Your Work
The option to save the workspace can be executed at any time with the save.image() command (also available as "Save Workspace" under the File menu) or at the end of a session, when R asks whether you want to save the workspace. The saved workspace defaults to having no specific name (and will be overwritten the next time you save an unnamed workspace), but you can use save.image(filename) if you want to save different workspaces and give each one a name.
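For example (the file name is hypothetical):

    save.image()                    # saves to the default, unnamed .RData file
    save.image("myAnalysis.RData")  # saves the workspace under a specific name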
Importing Files
The most convenient way to import data into R is to use the read functions, most generally read.table(). A common special case is read.csv(), which reads comma-delimited data, a format in which most spreadsheet programs can save files.
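A minimal sketch, assuming hypothetical file names and a first row of column names:

    mydata <- read.csv("expression.csv", header = TRUE)
    # read.table() handles other delimited formats, such as tab-delimited text
    mydata <- read.table("expression.txt", header = TRUE, sep = "\t")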
Getting Help
Data from most spreadsheet programs can be saved to one or both types of files. Saving spreadsheet data in text format (comma- or tab-delimited) is particularly simple and useful.
Program Help Files
Caution is advised, as some spreadsheet programs may limit the number of data values that can be saved. To use this function, type part of the word you are looking for, in quotation marks, as the function argument.
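Assuming the function being described is apropos(), a minimal sketch:

    apropos("mean")   # lists all objects whose names contain "mean"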
Documents
Note that the online help is particularly useful for describing function parameters, which this book will not discuss in detail, because many functions in R have many optional parameters (for example, read.table() has about 12 different optional parameters). You can download and save these documents, or view them online with a web browser that can display PDF files.
Books
On-line Discussion Forums
The aim of this chapter is for the reader to become familiar with the basic principles of programming in R: the use of correct syntax, the use of logical statements, and the writing of functions. This should equip the reader with the minimal set of programming skills needed to use R effectively.
Variables and Assignment
Syntax
No prior programming knowledge is assumed, although some would be helpful, as many R programming techniques are similar to those of other high-level programming languages.
Arithmetic Operators
Logical and Relational Operators
The condition in parentheses usually uses a relational operator to determine whether the condition is true. When the if statement is used alone and the condition inside the parentheses is false, nothing happens.
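A minimal sketch:

    x <- 7
    if (x > 5) print("x is greater than 5")   # condition TRUE, so this prints
    if (x > 10) print("never reached")        # condition FALSE, so nothing happens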
Looping
Note that x equals 6 at the end because the loop condition is still true at the beginning of the iteration in which x equals 5. For loops can also be nested (one inside another) to initialize matrices or perform similar tasks, as sketched below.
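A sketch of both constructs (the loop bounds and matrix contents are assumptions chosen to match the behavior described above):

    x <- 1
    while (x <= 5) x <- x + 1   # still TRUE at x = 5, so x ends at 6
    x                           # 6

    # Nested for loops initializing a 3 x 3 matrix
    m <- matrix(0, nrow = 3, ncol = 3)
    for (i in 1:3) {
      for (j in 1:3) {
        m[i, j] <- i * j
      }
    }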
Subsetting with Logical Operators
Subsetting with logical operators is based on the following simple fact: each logical statement produces one of the outcomes TRUE or FALSE. If we use the outcomes of a logical statement to subset a vector, only those elements where the outcome is TRUE will be selected.
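For example:

    x <- c(2, 9, 4, 11)
    x > 5       # FALSE TRUE FALSE TRUE
    x[x > 5]    # 9 11: only the elements where the outcome is TRUE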
Functions
An alternative you will also see is the underscore assignment, "functionName_function()", although this usage may be deprecated in the latest version of R. The "do this" task is executed and/or the value is returned to the calling code.
Using Existing Functions
To use features from packages, always remember that you must have the package installed and loaded in order to use its features.
Writing Functions
This book is filled with additional examples of task-oriented functions, and you should become fluent in using them.
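As a minimal sketch of a user-written, task-oriented function (the name and task are hypothetical):

    # Coefficient of variation: standard deviation relative to the mean
    cv <- function(x) {
      sd(x) / mean(x)
    }
    cv(c(4, 8, 6, 10))   # called like any built-in function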
Package Creation
Exploratory Data Analysis
Data Summary Functions in R
Here is a case where the summary function can be very useful as a screening tool for spotting interesting trends in the data. Whether such a trend is statistically significant can then be assessed with formal testing (such as the methods discussed in Chapter 13).
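For example, on a hypothetical vector of measurements:

    scores <- c(2.1, 0.7, 3.5, 1.8, 2.9)
    summary(scores)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.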
Working with Graphics
Note that there is some redundancy between the low-level drawing functions and the arguments of the high-level drawing functions. Note also that most of these work not only with the plot() function but also with other high-level plotting functions.
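A minimal sketch mixing a high-level call with a low-level addition:

    x <- 1:10
    y <- x^2
    plot(x, y, type = "b", col = "blue", xlab = "x", ylab = "y")  # high-level
    abline(h = 50, lty = 2)   # low-level: adds a line to the existing plot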
Saving Graphics
However, graphics code in R can be as complex as you want it to be, and this is just a snapshot of R's full graphics capabilities. With a little practice, you can code R to produce custom publication-quality graphics that effectively illustrate almost any data analysis result.
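For example, a plot can be written directly to a PDF file (the file name is hypothetical):

    pdf("myplot.pdf")   # open a PDF graphics device
    plot(rnorm(100))    # plotting commands now draw into the file
    dev.off()           # close the device to finish writing the file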
Additional Graphics Packages
Probability is a mathematical language and framework that allows us to make statistical statements and analyze data. Understanding probability is key to being able to analyze data to provide meaningful and scientifically valid conclusions.
Two Schools of Thought
Understanding classical (or frequentist) and Bayesian statistics requires knowledge of the theory and methods of probability and probability models. This chapter reviews the basics of probability and serves as a conceptual foundation for much of the material in the subsequent chapters of this section and for various other chapters in this book.
Essentials of Probability Theory
Set Operations
Set Operation 1: Union
Set Operation 2: Intersection
Set Operation 3: Complement
Disjoint Events
The Fundamental Rules of Probability
Rule 1: Probability is always positive
Rule 2: For a given sample space, the sum of probabilities is 1
For each event in a sample space where the outcomes have the same probability of occurring, the probability of an event is equal to the number of outcomes making up the event divided by the total number of outcomes in the sample space. In the 4-nucleotide example, suppose that only P(A) is known and we want to calculate the probability of any other nucleotide (that is, the event that is the complement of A); that probability is simply 1 - P(A).
But the sum of the probabilities of the events in the sample space is still 1; the probabilities of the individual nucleotides simply change to reflect the composition of events in the sample space. In this case, if the outcomes are disjoint events, then the probability of their union can be represented by the sum of the individual probabilities.
Counting
For events that are not disjoint, the calculation of P(A ∪ B) must take into account the overlap (intersection) of the events, and is calculated as P(A ∪ B) = P(A) + P(B) - P(A ∩ B), subtracting the overlapping region (which would otherwise be double-counted). For more than two events that are not disjoint, this calculation becomes more complicated, because care is needed to count each intersection region only once.
The Fundamental Principle of Counting
Suppose there are 4 genetic alleles for protein 1, 2 alleles for protein 2, 9 alleles for protein 3, 11 alleles for protein 4, and 6 alleles for protein 5. The answer is a direct multiplication of the numbers of alleles for all 5 proteins: 4 × 2 × 9 × 11 × 6 = 4752 possible combinations of alleles for this 5-protein complex.
Permutations (Order is Important)
Combinations (Order is Not Important)
Using permutations and combinatorial calculations, probabilities in a sample space where all outcomes are equally likely can be found as the number of outcomes that define a particular event over the total number of possible outcomes in a sample space.
Counting in R
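The basic counting functions are built into R:

    factorial(5)                   # 5! = 120
    choose(10, 3)                  # combinations of 10 items taken 3 at a time: 120
    choose(10, 3) * factorial(3)   # the corresponding permutations: 720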
Introducing the Gamma function
While the gamma function may seem strange at first (it is not usually taught in lower-level math courses), it is worth becoming familiar with. The gamma function appears in several continuous probability distributions (discussed later in this chapter), including the gamma distribution, the beta distribution, and the chi-square distribution.
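R provides the gamma function directly:

    gamma(5)     # 24, since gamma(n) = (n - 1)! for positive integers
    gamma(0.5)   # sqrt(pi), approximately 1.7725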
Modeling Probability
Bayesian statistics make use of these distributions, and they are widely used in bioinformatics applications. The remainder of this chapter is devoted to several concepts important to understanding probability models: random variables, discrete versus continuous variables, probability distributions and densities, parameters, and univariate versus multivariate statistics.
Random Variables
Discrete versus Continuous Random Variables
Probability Distributions and Densities
In R, a simple histogram (shown in Figure 6-9) can be used to model the probability distribution function for this example. For a continuous random variable, the CDF is a smooth curve rather than a step function, and F(x) is the integral of the density from negative infinity (or wherever x is defined) up to the value x.
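A minimal sketch, using simulated discrete data in place of the data behind Figure 6-9:

    x <- rpois(100, lambda = 3)   # simulated discrete data (an assumption)
    hist(x, freq = FALSE)         # relative-frequency histogram of the distribution
    plot(ecdf(x))                 # the empirical CDF, drawn as a step function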
Empirical CDFs and Package stepfun
Parameters
One of those nice features is that the mean of the distribution corresponds to the location parameter (and the standard deviation corresponds to the scale parameter). Shifting the location parameter of the normal distribution shifts the point where the center of the distribution is located.
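This is easy to see by overlaying normal densities with different parameter values (a minimal sketch):

    x <- seq(-6, 6, by = 0.1)
    plot(x, dnorm(x, mean = 0, sd = 1), type = "l")  # standard normal
    lines(x, dnorm(x, mean = 2, sd = 1), lty = 2)    # location shifted to 2
    lines(x, dnorm(x, mean = 0, sd = 2), lty = 3)    # scale (spread) doubled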
Univariate versus Multivariate Statistics
Even before specific models are introduced, you may be wondering how to choose a model to use and how to choose values of parameters. Sometimes the parameters can be determined from the data, as is the case with the mean and standard deviation of normally distributed data.
Univariate Discrete Distributions
The Binomial Distribution
Let's assume that the probability of a child having blue eyes is 0.16 (not empirically verified!) and leave it at that. Note that this alone is not yet a probability distribution function, as it models only the probability of one particular sequence of n = 10 children of whom k have blue eyes.
Note that this distribution is quite skewed towards lower values of the random variable (X=number of children with blue eyes) because the value of p is 0.16. Notice in Figure 7-2 how the larger p shifts the distribution more toward higher values of the random variable.
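A sketch of the blue-eyes example with n = 10 and p = 0.16:

    k <- 0:10
    probs <- dbinom(k, size = 10, prob = 0.16)
    barplot(probs, names.arg = k)       # skewed toward low counts because p is small
    dbinom(2, size = 10, prob = 0.16)   # P(exactly 2 of 10 children have blue eyes)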
The Poisson Distribution
A multivariate version of the binomial, the multinomial distribution, will be introduced in the next chapter. Let's do this over the range x = 0 to 20 for all the values of lambda used in the previous plot, as sketched below.
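A sketch of that plot; the lambda values here are assumptions standing in for those used previously:

    x <- 0:20
    plot(x, dpois(x, lambda = 1), type = "h")
    points(x + 0.2, dpois(x, lambda = 4), type = "h", col = "blue")
    points(x + 0.4, dpois(x, lambda = 10), type = "h", col = "red")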
Univariate Continuous Distributions
Univariate Normal Distribution
Remember that increasing the scale parameter, which is the standard deviation in the case of the normal, increases how spread out the data are. R automatically calculates what percentage of the distribution corresponds to a given cumulative probability region.
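Assuming the calculation being described is the one performed by pnorm(), a minimal sketch:

    pnorm(1.96)            # cumulative probability below 1.96: about 0.975
    pnorm(1) - pnorm(-1)   # about 0.683 of the distribution lies within 1 sd
    qnorm(0.975)           # the inverse: the quantile for a given probability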
The Gamma Family
The mean of the gamma distribution is equal to alpha × beta, and the variance (the standard deviation squared) is equal to alpha × beta². So the distribution of the data can be modeled using a gamma probability density function with shape (alpha) = 5.6 and scale (beta) = 0.6.
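A sketch of that density and its moments:

    x <- seq(0, 10, by = 0.1)
    plot(x, dgamma(x, shape = 5.6, scale = 0.6), type = "l")
    5.6 * 0.6     # mean = alpha * beta = 3.36
    5.6 * 0.6^2   # variance = alpha * beta^2 = 2.016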
The Exponential Distribution
The Chi Square Distribution
The Beta Family
One very important thing to note: the range of x in the equation for the beta probability density is explicitly restricted to between 0 and 1. Unfortunately, the interpretation of the parameters of the beta is not as clear as with the gamma, and with the beta distribution the alpha and beta parameters are sometimes referred to as the "shape 1" and "shape 2" parameters.
For example, if you have data on the percentage of an amino acid in a protein motif (say a leucine chain) and it is not likely to be normally distributed (check with a Q-Q plot), then you should model the data with a beta density function.
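A minimal sketch with assumed shape parameters:

    x <- seq(0, 1, by = 0.01)
    # shape1 = 2 and shape2 = 5 are assumptions for illustration
    plot(x, dbeta(x, shape1 = 2, shape2 = 5), type = "l")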
Other Continuous Distributions
Simulations
Each distribution mentioned in this chapter has a corresponding function in R whose name starts with "r", for random generation. All you need to do to get simulated values is specify the number of simulated values you want and the parameters of the distribution you are simulating from.
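For example, reusing parameter values from earlier in the chapter:

    rnorm(5, mean = 0, sd = 1)            # 5 simulated standard normal values
    rbinom(5, size = 10, prob = 0.16)     # 5 simulated binomial counts
    rgamma(5, shape = 5.6, scale = 0.6)   # 5 simulated gamma values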
Expanded Probability Concepts
The previous two chapters looked at the basic principles of probability and the probability distribution of a single random variable. In this chapter we introduce some further concepts of probability and then broaden our look at probability distributions to include distributions that model two or more random variables.
Conditional Probability
Within this new restricted sample space, we ask for the probability that the pea is heterozygous. P(A|B) is read as "the conditional probability of event A given that event B has occurred" and is defined as P(A|B) = P(A ∩ B) / P(B).
Independence
This form of the definition of independence, P(A ∩ B) = P(A)P(B), is very useful for computing the joint probabilities of independent events. In many cases, in sequence analysis or in the analysis of other events, it is clear that the events are not independent.
Joint and Marginal Probabilities
What if we wanted to know the probability that nucleotide A is at position 1, regardless of the nucleotide at position 2? Here, the joint probability, P(T at P2 ∩ G at P1), is obtained from the cell in Table 8-2 containing the joint probability for nucleotide G at position 1 and T at position 2, and P(G at P1) is the marginal probability of nucleotide G at position 1, obtained from the column total for that column in Table 8-2.
The Law of Total Probability
In the above formula and in Figure 8-2, A is the union of the disjoint (mutually exclusive) sets A ∩ Bi, over all i. Although the law of total probability may seem confusing, we in fact already applied it earlier in this chapter when calculating the marginal probabilities in the adjacent-nucleotide example.
Probability Distributions Involving More than One Random Variable
Joint Distributions of Discrete Random Variables
Although we won't do that here, it's fairly easy to extend joint distributions to more than two variables. We can extend the nucleotide example with the distribution of the nucleotide at position 3 of a sequence.
Marginal Distributions
The probability of any given 3-nucleotide sequence such as ATG would be given by the joint distribution of the random variables X, Y, and Z representing the corresponding probabilities of each nucleotide at each position. Similarly, we could calculate the marginal probability mass function (pmf) of the random variable Y by summing all the values of X, as shown in Table 8-5.
Conditional Distributions
Note that the sum of the probabilities for each marginal distribution adds to 1 (obeying the law that the sum of all probabilities in a sample space adds to 1), which should always be the case (and serves as a good check for determining whether you have done the calculations correctly). Again, note that the sum of the conditional probabilities adds to 1, satisfying the probability law that the sum of the probabilities of the events in a sample space adds to 1.
Joint, Marginal and Conditional Distributions for Continuous Variables
Continuing this calculation for other nucleotides gives the conditional distribution of Y given X = A in Table 8-6. But conceptually, the probability that (X, Y) lies in a region A is equal to the volume under the function f(x, y) over the region A.
Common Multivariable Distributions
The Multinomial Distribution
Among the many applications in bioinformatics, the multinomial model is often used in modeling the joint distribution of the number of observed genotypes. Suppose we are only interested in the marginal probability mass function of the random variable X.
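A minimal sketch, with assumed genotype probabilities for AA, Aa, and aa:

    # P(3 AA, 5 Aa, 2 aa) in a sample of 10, assuming probabilities 0.25, 0.50, 0.25
    dmultinom(c(3, 5, 2), prob = c(0.25, 0.50, 0.25))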
Multivariate Normal Distribution
To do this, we can write a function in R to simulate draws from a standard bivariate normal, as sketched below. If we were to run a simulation of 1000 values from a univariate standard normal distribution, we would get a virtually identical result for each margin.
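A minimal sketch of such a function, using the conditional decomposition of the standard bivariate normal (the correlation rho is an assumed input): if X ~ N(0, 1), then Y given X = x is N(rho × x, 1 - rho²).

    rbvnorm <- function(n, rho) {
      x <- rnorm(n)
      y <- rnorm(n, mean = rho * x, sd = sqrt(1 - rho^2))
      cbind(x, y)
    }
    sim <- rbvnorm(1000, rho = 0.5)
    plot(sim)        # scatterplot of the simulated pairs
    hist(sim[, 1])   # each margin is approximately standard normal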
Dirichlet Distribution
Again there is no need to write a function to perform simulations, because the gregmisc package contains predefined functions for calculating the density or generating random values from the Dirichlet distribution. It is important here to understand the basics of Dirichlet and its mathematical model and how to derive Dirichlet simulations using R.
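A minimal sketch, assuming the gregmisc bundle (whose functions now live in the gtools package) provides rdirichlet():

    library(gtools)   # or library(gregmisc) on older installations
    # 5 simulated probability vectors from a Dirichlet with assumed
    # parameters alpha = (1, 1, 1); each row sums to 1
    rdirichlet(5, alpha = c(1, 1, 1))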
The Essential Difference
Importantly, the Bayesian first uses subjective belief to construct an initial model of the probability distribution. It is this inclusion of subjective belief in the model that distinguishes Bayesians from frequentists.
Why the Bayesian Approach?
Thanks to computationally intensive methods, the Bayesian approach can handle situations involving many variables (parameters). As we will see with Bayes' theorem, the Bayesian approach involves continually updating the model as new data arrive.
Bayes’ Rule
In the first case, we want complete information about the joint probability of two events. P(A ∩ B) is the joint probability of A and B, just found from the corresponding cell in the joint probability table, and is 0.3; Bayes' rule then gives the conditional probability as P(A|B) = P(A ∩ B) / P(B).
Extending Bayes’ Rule to Working with Distributions
In this case, the prior is the Bayesian's subjective belief, formed before the coin is flipped, about the proportion of times the coin will land heads. Before any data are collected, that subjective belief is all we have.
Priors
Conjugate priors have the property that, when combined with a likelihood of the appropriate parametric form, the posterior distribution follows the same parametric form as the prior distribution (the form of the likelihood need not match the shared form of the prior and posterior). There are other conjugate pairs in which a prior combined with a likelihood yields a posterior with the same form as the prior.
Likelihood Functions
In frequentist statistics, the likelihood function is used to find a single value of a parameter, the so-called point estimate. Making such a point estimate is a frequentist rather than a Bayesian practice, but it is worth noting here because the likelihood function is equally important to the Bayesian approach.
Evaluating the Posterior
The interpretation is that the MLE is the most plausible value of the parameter, in the sense that it produces the highest possible probability of the observed data occurring. Regarding the role of the prior, the more vague the prior, the less the prior affects the posterior.
Applications of Bayesian Statistics in Bioinformatics
This chapter begins a series of three chapters devoted to the basics of Markov chains (Chapter 10), the computational algorithms involved in their evaluation for MCMC methods (Chapter 11), and the use of WinBugs in conjunction with R and the CODA package to implement simple MCMC applications of interest in bioinformatics (Chapter 12). Our focus in this chapter is to provide a basic introduction to Markov chains for the purpose of using them in Markov chain Monte Carlo methods to simulate posterior probability distributions for complex data models.
Beginning with the End in Mind
For example, consider what is actually a fairly simple case: a conjugate pair of a multinomial likelihood and a Dirichlet prior, which produces a Dirichlet posterior distribution. But the dilemma is that in bioinformatics and other fields working with high-dimensional data sets, we want the information contained in a posterior distribution that we cannot evaluate directly.
Stochastic Modeling
The idea here is to introduce the reader to the power of this technique so that he or she understands the terminology in the literature, is able to perform simple exploratory applications using R and utility programs, and is encouraged to collaborate with statisticians who can assist with modeling and data analysis.
Stochastic Processes
Markov Processes
The pattern of 5' to 3' nucleotides in a DNA sequence is not necessarily independent and is often patterned so that the nucleotides are dependent on the nucleotide immediately upstream of them. Each nucleotide state {A, T, C, G} depends sequentially on its adjacent and immediately upstream nucleotide, and only that nucleotide.
Classifications of Stochastic Processes
In the case of a discrete state space, there are finitely or countably many values, and the random variables used are discrete random variables. For example, the set of four nucleotides in a DNA sequence is a discrete state space, modeled using random variables that are discrete (since they can take on only four different values).
Random Walks
To avoid confusion with state space (below), it is better to use the term sequence, position, or place for indices representing locations. If there is a continuum of values (any value) in the state space, a continuous random variable is used and the state space is referred to as continuous.
Probability Models Using Markov Chains
Real-world applications of random walks include fractals, dendritic growth (for example, tree roots), models for stock options and portfolios, and gambling game outcomes.
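A minimal sketch of a simple symmetric random walk:

    # +1 or -1 at each step with equal probability
    steps <- sample(c(-1, 1), size = 1000, replace = TRUE)
    walk <- cumsum(steps)
    plot(walk, type = "l", xlab = "step", ylab = "position")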
Matrix Basics
To multiply matrix A by matrix B to create a new matrix C, we multiply the elements of row 1 of A by the corresponding elements of column 1 of B and add the results to get the first entry of C (c11). Then we multiply the elements of row 1 of A by the elements of column 2 of B to get the c12 entry of C, and so on.
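In R, matrix multiplication is done with the %*% operator:

    A <- matrix(1:4, nrow = 2, byrow = TRUE)   # rows (1, 2) and (3, 4)
    B <- matrix(5:8, nrow = 2, byrow = TRUE)   # rows (5, 6) and (7, 8)
    A %*% B   # c11 = 1*5 + 2*7 = 19, c12 = 1*6 + 2*8 = 22, and so on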
A Simple Markov Chain Model
The matrix below gives the transition probabilities for the first move given the initial state. To get the next transition matrix (t = 1 to t = 2), we can use R to multiply the first transition matrix by itself, using the logic described above.
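A sketch of this step, with a hypothetical two-state transition matrix standing in for the one shown in the text:

    # Rows are current states, columns are next states; values are assumptions
    P <- matrix(c(0.9, 0.1,
                  0.3, 0.7), nrow = 2, byrow = TRUE)
    P %*% P   # the two-step transition probabilities (t = 1 to t = 2)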
Modeling A DNA Sequence with a Markov Chain
We can also use this distribution to calculate the expected values of the numerical distribution of nucleotides. You can then perform a statistical analysis to determine whether there is a significant difference in nucleotide composition between coding and non-coding regions.
Characteristics of a Markov Chain
Finiteness
Aperiodicity
Irreducibility
Therefore, we want our systems to be irreducible when they are used to determine a unique posterior distribution.
Ergodicity
Mixing
Reversibility
In the detailed balance equation for a chain with state space S, π(x)P(x, y) = π(y)P(y, x), x is the current state of the system and y is the state at the next step. If the system has more than two states, detailed balance applies to any two states in the system's state space.
The Stationary State
To do this, we will look at the marginal and conditional probability distributions for the parameters of interest. Recall the discussion and graphical illustrations in Chapter 8 that discussed these distributions for the multivariate normal, multinomial, and Dirichlet distributions.
Continuous State Space
A high-dimensional chain is difficult to analyze or interpret graphically, so usually we will be interested in analyzing individual variables and parameters separately. The CODA package, introduced in Chapter 12, contains functionality for performing certain convergence diagnostics.
Non-MCMC Applications of Markov Chains in Bioinformatics
The convergence of Markov chains is in itself a mathematical topic, the details of which are beyond our discussion. However, the basic properties of Markov chains introduced in this chapter are applicable to other types of applications and should serve as a basis for further research.
A Brief Word About Modeling
For example, there are many applications of Markov chains in sequence analysis, and the use of Markov chains for additional applications is growing in all disciplines related to bioinformatics. Many books of interest are listed in the appendix for those interested in further exploring the use of Markov chains in sequence analysis; in particular, Durbin's book is devoted to this topic.
Monte Carlo Simulations of Distributions
It is possible for a user to work in WinBugs without understanding the algorithmic background the program relies on, but it is neither wise nor useful to do so. Therefore, the focus of this chapter is on the theory of why the program works.
Inverse CDF Method
Let's work through an example of the inverse CDF method, simulating an exponential distribution with parameter lambda = 2, as sketched below. The inverse CDF method is also easy to implement when faced with a new distribution that is not available in software.
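A minimal sketch: the exponential CDF is F(x) = 1 - exp(-lambda × x), so its inverse is F⁻¹(u) = -log(1 - u) / lambda.

    lambda <- 2
    u <- runif(10000)           # uniform draws on (0, 1)
    x <- -log(1 - u) / lambda   # transformed draws follow Exponential(2)
    hist(x, freq = FALSE)
    curve(dexp(x, rate = 2), add = TRUE)   # overlay the target density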
Rejection Sampling
Therefore, we can use as the envelope for the beta distribution a multiple of the uniform pdf, with height y = 1.5 over the range x from 0 to 1. Draw a sample x from the distribution underlying the envelope; in our case, we sample from the uniform [0, 1] distribution, as sketched below.
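A sketch assuming the target is a Beta(2, 2) density, which peaks at exactly 1.5, so a constant envelope of height 1.5 over [0, 1] works:

    n <- 10000
    x <- runif(n)                        # 1. sample from the uniform proposal
    u <- runif(n)                        # 2. uniform heights for the accept test
    accept <- u * 1.5 <= dbeta(x, 2, 2)  # 3. accept if under the target density
    draws <- x[accept]
    hist(draws, freq = FALSE)
    curve(dbeta(x, 2, 2), add = TRUE)    # accepted draws match the target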
The Gibbs Sampler
A high dimensional posterior distribution can be decomposed into conditional distributions for each of the parameters of interest. Finally, we will not study the joint posterior distribution of the sample values, but rather marginal distributions of individual parameters since this is what we are primarily interested in.
A Gibbs Example Using Simple Probability
Below is a small program in R that uses the Gibbs sampler to compute the marginal distributions of A and B by sampling from the full conditionals. Converting the sampled values to probabilities and comparing them to the original joint distribution shows that iterative Gibbs sampling from the conditional distributions reproduces the joint distribution.
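A minimal sketch of such a program; the 2 x 2 joint distribution of the binary events A and B is an assumption standing in for the one in the text:

    # Hypothetical joint distribution: rows are states of A, columns states of B
    joint <- matrix(c(0.3, 0.2,
                      0.1, 0.4), nrow = 2, byrow = TRUE)
    n <- 10000
    a <- b <- integer(n)
    a[1] <- 1; b[1] <- 1   # arbitrary starting state
    for (i in 2:n) {
      pA <- joint[, b[i - 1]] / sum(joint[, b[i - 1]])  # P(A | B = b)
      a[i] <- sample(1:2, 1, prob = pA)
      pB <- joint[a[i], ] / sum(joint[a[i], ])          # P(B | A = a)
      b[i] <- sample(1:2, 1, prob = pB)
    }
    table(a, b) / n   # empirical joint frequencies approximate the joint matrix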
A Gibbs Example Using the Bivariate Normal
Notice in Figure 11-8, which shows plots of selected marginal distributions, that all the marginal distributions show a standard normal pattern. When examining individual variables in the Gibbs sampler's output, one essentially obtains results about the marginal distributions of the parameters of interest.
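A minimal sketch, with an assumed correlation rho = 0.6, using the fact that each full conditional of a standard bivariate normal is univariate normal:

    rho <- 0.6
    n <- 5000
    x <- y <- numeric(n)   # starting state (0, 0)
    for (i in 2:n) {
      x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
      y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
    }
    hist(x, freq = FALSE)   # the marginal of X is approximately standard normal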
Generalized Gibbs Sampler
The Metropolis-Hastings Algorithm