Overview of R

d=read_data()

clean_d=clean_and_format(d)

d_result=applicable_statistical_routine(clean_d) display_results(d_result)

library(help=circular)

"Hello world"

"Hello World"

d e h l o r w 0.0

0.5 1.0 1.5 2.0 2.5 3.0

Figure 15.1: Plot produced by hello_world.R program.

Github–Local

the_data=read.csv("hello_world.csv") plot(the_data)

install.packages

setBreakpoint

c(1, 2) + c(3, 4)

as.numeric("1") ==1

as.numeric("a")

storage.mode

x = 2 # new vector containing one value x = c(2, 4, 6, 8, 10) # new vector containing five values

# new vector containing the contents of x and two values x = c(x, 12, 14)

y = vector(length=5) # new vector created by function call y = 3:8 # same as c(3, 4, 5, 6, 7, 8)

z = seq(from=3, to=13, by=3) # create a sequence of values

# All elements converted to a common type

z = c(1, 2, "3") # String has the greater conversion precedence

> # create 3-dimensional array of 2 by 4 by 6, initialized to 0

> a3=array(0, c(2, 4, 6))

> matrix(c(1, 2, 3, 4, 5, 6), ncol=2) # default, populate in column order [,1] [,2]

> # specify the number of rows and populate by row order

> matrix(c(1, 2, 3, 4, 5, 6), nrow=2, byrow=TRUE) [,1] [,2] [,3]

> x = matrix(nrow=2, ncol=4) # create a new matrix

> y = c(1, 2, 3)

> z = as.matrix(y) # convert a vector to a matrix

num [1:3] 1 2 3

num [1:3, 1] 1 2 3

> x[-1] # exclude element 1

[1] 11 12 13 14 15 16 17 18 19

> x[12] # there is no 12’th element

> x[12]=100 # there is now

[1] 10 11 12 13 14 15 16 17 18 19 NA 100

> x[c(2,5)] # elements 2 and 5

> y = x[x > 25] # all elements greater than 25

[1] 26 27 28 29

> # The expression x > 25 returns a vector of boolean values

> i = x > 25

[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

> # an element of x is returned if the corresponding index is TRUE

[1] 26 27 28 29

> x = matrix(c(1, 2, 3, 4, 5, 6), ncol=2)

> # x[2, 3] need to be able to handle out-of-bounds subscripts in Sweave...

> x[1, ] [1] 1 4

> x[3, ]=c(0, 9)

> x=cbind(x, c(10, 11, 12)) # add a new column

[,1] [,2] [,3]

> x=rbind(x, c(5, 10, 20)) # add a new row

[,1] [,2] [,3]

[4,] 5 10 20

> x = list(name="Bill", age=25, developer=TRUE)

$name [1] "Bill"

$developer [1] TRUE

> x$name [1] "Bill"

> x = list("Bill", 25, TRUE)

[1] 25 [[3]]

> y = unlist(x) # convert x to a vector, all elements are converted to strings

[1] "Bill" "25" "TRUE"

> x = list(name="Bill", age=25, developer=TRUE)

> x$sex="M" # add a new element

$name [1] "Bill"

$developer [1] TRUE

$sex [1] "M"

> x$age = NULL # remove an existing element

$name [1] "Bill"

$developer [1] TRUE

$sex [1] "M"

> x = data.frame(num=c(1, 2, 3, 4), name=c("a", "b", "c", "d"))

> x num name

> # Middle two values of the column named num

> x$num[2:3]

> # Have to remember that rows are indexed first and also specify x twice

> x[x$num > 2, ] num name

> # Using the subset function removes the possibility of making common typo mistakes

> # No need to remember row/column order and only specify x once

> subset(x, num > 2) num name

exp = expression(x/(y+z))

eval(expr) # evaluate expression using the current values of x, y and z

text(x, y, expression(p == over(1, 1+e^(alpha*x+beta))))

> D(expression(x/(y + z)^2), "z") -(x * (2 * (y + z))/((y + z)^2)^2)

> factor(c("win", "win", "lose", "win", "lose", "lose")) [1] win win lose win lose lose

Levels: lose win

> c(5, 6) + 1 [1] 6 7

> c(1, 2) + c(3, 4) [1] 4 6

> c(7, 8, 9, 10) + c(11, 12) [1] 18 20 20 22

> c(0, 1) < c(1, 0) [1] TRUE FALSE

c(0,1) && c(1,1)

Table 15.1:Roperators listed in precedence order.

> x = c(a = 1, b = 2)

> x = list(a = 1, b = 2)

> str(x[1]) List of 1

> str(x[[1]]) num 1

> x = matrix(1:4, nrow = 2)

> x[1, ] [1] 1 3

> x[1, , drop = FALSE]

iOut-of-bounds handling is also different, but I’m sure readers’ don’t do that sort of thing.

This chapter gives a brief overview of

for developers who are fluent in at least one other computer language. The discussion pays attention to language features that are very different from languages the reader is likely to be familiar with; the focus is on a few language features that can be used to solve most problems.

The

language is defined by its one implementation; available from the

core team.

A language definition,

written in English prose, is gradually being written.

programs tend to be very short, compared to programs in languages such as C

and Java; 100 lines is a long

program. It is assumed that most readers will be casual users of

, whose programs generally follow the pattern:

If your problem cannot be solved using this algorithm, then the most efficient solution may be for you, dear reader, to use the languages and tools you are already familiar with, to preprocess the data so that it can be analysed and processed using

.

is a domain specific language, whose designers have done an excellent job of creating a tool suited to the tasks frequently performed when analysing the kinds of datasets en- countered in statistical analysis. Yes,

is Turing complete, so any algorithm that can be implemented in other programming languages can be implemented in

, but it has been designed to do certain things very well, with no regard to making it suitable for general programming tasks.

As a language the syntax and semantics of

is a lot smaller than many other languages.

However, it has a very large base library, containing over 1,000 functions. Most of the investment needed to become a proficient user of

has to be targeted at learning to how to combine these functions to solve the problem at hand. There are over 10,000+ add-on packages available from the

(Comprehensive

Archive Network).

Help on a specific identifier, if any is available, can be obtained using the

(question mark) unary operator, followed by the identifier. The

unary operator, followed by the identifier, returns a list of names associated with that identifier for which a help page is available.

The call

, lists the functions and objects provided by the pack- age named in the argument.

15.1 Your first R program

Much like Python, Perl and many other interpreted implementations,

can be run in an interactive mode, where code can be typed and immediately executed (with

producing the obvious output).

Your first

program ought to read some data and plot it, not just print

. The following program reads a file containing a single column of values and plots them, to produce figure 15.1:

387

The

function is included in the library that comes bundled with the base system (functions not included in this library have to be loaded using the

function, before they can be referenced; the package containing them may also need to be installed via the

function) and has a variety of optional arguments (arguments can be omitted if the function definition includes default values for the corresponding parameter). Perhaps the most commonly used optional arguments are

(the character used to separate, or delimit, values on a line, defaults to comma) and

(whether the contents of the first line should be treated as column names, default

).

The value returned by

has class

, which might be thought of as a C

type (it contains data only, there are no member functions as such).

The

function attempts to produce a reasonable looking graphic of whatever data is passed, which for character data is a histogram of the number of occurrences. Users of

are not expected to be interested in manipulating low level details, and some effort is needed to get at the numeric values of characters.

There are a wide variety of options to change the appearance of

output; these can be applied on each call to

, or globally for every call (using the

function).

All objects in the current environment can be saved to a file using the

function, and a previously saved environment can be restored using the

function. When quiting an

session (by calling

), the user is given the option of saving the current environment to a file named

; if a file of this name exists in the home directory, when

is started, its contents are automatically loaded.

15.2 Language overview

is a language and an environment. Like Perl, it is defined by how its single implemen- tation behaves, i.e., the software maintained by the

project.

was designed, in the mid-1990s, to be largely compatible with S (a language, which like C, started life in the mid-1970s at Bell Labs). When S was created, Fortran was the dominant engineering language and the Fortran way of doing things had a strong impact on early design decisions, i.e.,

does not have a C view of the world; for instance, it uses a row/column, rather than column/row ordering.

The designers of

have called it a functional language, and it does support a way of do- ing things that is most strongly associated with functional program languages (including making life cumbersome for developers wanting to assign to global variables).

Lateral thinking is often required to code a problem in

, using knowledge of functions contained in the base system, e.g., calling

to map a vector of strings to a unique vector of numbers.

Some data analysts write non-trivial programs in

, which means they have to deal with the testing and debugging issues experienced by users of other languages.

Base library support for debugging includes support for single stepping through a func- tion, via the

function, and setting breakpoints via the

function;

package support includes:

package for unit testing, and the

package for mea- suring code coverage.

15.2.1 Differences between R and widely used languages