• raw: essentially uninterpreted byte values,
• logical: holds one of the values:
TRUE,
FALSE,
Tor
F. The conversion
as.logical(an y_non_zero_value)returns
TRUE,
• character: what many other languages call a string type,
• integer: the only integer type, contains 32 bits (
NAis represented using the most negative value, so this value is not available as an integer; trying to generate this, or any other value outside the representable value of a 32-bit integer, will result in a value having a double type),
• double: the only floating-point type, contains 64 bits. Can exactly represent all 32-bit integers,
• complex: contains a real and imaginary double type,
An object may be reported to have one of these basic types, but it may actually be a vector or array of this type.
More complicated types may be created, such as lists, data frames, etc.
15.5.1 Converting the type (mode) of a value
It is often possible to convert the mode (type) of a value by calling the
as.some_modefunction, where
some_modeis the name of a mode, e.g.,
integer. If a conversion fails,
NAis returned.
Conversion precedence
NULL < raw < logical < integer < real < complex < character < list < expression
15.6 Statements
R
contains the usual language constructs that look like statements, but they can behave like expressions:
• function: defines a function, whose value has to be assigned to an object:
f=function(p1, p2) {return(p1+p2)}
• blocks of code are bracketed using the punctuation pair: { and },
• ; (semicolon) is required to delimit multiple expressions on the same line, but is other- wise optional,
• if: which takes an optional else arm (there is no then keyword, but there is an
ifthene lsefunction),
• for: which has the form
for (i in x), where
xis a vector (such as
1:10),
• while and repeat loops are available,
15.7. DEFINING A FUNCTION 397
• loops may be terminated using the break keyword or the
breakfunction, and may be continued at the next iteration using the next keyword or
nextfunction,
•
returnis a function:
return(1+return(1))returns the value
1,
•
switchis a function.
15.7 Defining a function
> g=1 # a global variable
> f = function(p1, p2) # define a function and assign it to f + {
+ l=g # Value access, check lexical and dynamic scope for g
+ g=2 # Assignment: only check local scope, if no variable exists, create one +
+ m=h # h is dynamically in scope +
+ return(return(1)+1) # return is a function call + }
> h=2 # another global variable
> f(1, 2) [1] 1
> g [1] 1
> h=3 # At global scope, so must be global variable
Argument evaluation is lazy, that is, they are evaluated the first time their value is required.
The
...token (three dots) specifies that a variable number of unknown arguments may be passed.
unk_args=function(...) {
a=list(...) # Convert any arguments passed to a list of values
# Access the list of values in a }
15.8 Commonly used functions
Technically every operation is a function call (so
’+’(1, 2)and
1+2are equivalent), but not all function calls have equivalent operator tokens.
> x = 1:10
> if (any(x > 7)) print("At least one value greater than 7") [1] "At least one value greater than 7"
> if (all(x > 0)) print("All values greater than zero") [1] "All values greater than zero"
> rep(1:2, 3) [1] 1 2 1 2 1 2
•
head/
tailmimics the behavior of the Unix
head/
tailprograms,
•
lengthreturns the number of elements in its vector argument,
•
nrow/
ncolreturn the number of rows/columns in the data frame argument (
NROW/
NCOLgracefully handle vector arguments),
•
orderreturns a vector containing an index in to the argument in the order needed to sort the argument values,
•
strlists the columns in a variable, along with their type and the first few values in each row; it provides a quick way of verifying that columns have the expected type.
•
whichreturns a vector of values containing the index of the argument values that are true,
•
methods: list functions overloaded on the argument name
•
installed.packages: list all installed packages
•
ls: lists variables that exist in the current environment,
•
system.time,
proc.time: cpu time used, and the real, and cpu time of the currently
running
Rprocess.
15.9 Input/Output
Functions are available for reading data having a variety of formats (e.g., comma sep- arated values), from all the common data sources, e.g., files, databases, web pages. In some cases the contents of a compressed file will be automatically uncompressed before reading. In the case of files, all the data contained in the file is often read, and returned as a single object.
Many functions try to automatically deduce the datatype of the data read, e.g., whether it is integer, real, character sequence, etc. Sometimes the datatype selected is not correct, and work has to be done to ensure the data is treated as having the desired type; the
read.csv
function bases its decision on the type of each column, by analysing the first 6, or so, lines of the file.
Some functions in the base system, e.g.,
read.csv, convert columns containing string values to factors, by default; the original intent was, presumably, to reduce the storage needed to hold the data. A column of factors, as a type, does not always behave the same as a column of strings and this default conversion behavior is often a liability. Using the argument
as.is=TRUEprevents values being converted to factors (it is used in all of this book’s example code).
data=read.csv("measurements.csv.gz", as.is=TRUE) # file will be uncompressed data=read.csv("measurements.csv", sep="|", as.is=TRUE) # change separator
data=read.csv("https://github.com/Derek-Jones/ESEUR-code-data/blob/master/benchmark/MSTR10-DIMM.csv.xz", as.is=TRUE)
The first line of the input file is assumed to denote the name of each column, specifying
header=FALSEswitches off this default behavior.
All characters on an input line after, and including, the comment character,
#, are ignored (various options interact with this behavior, including the
comment.charoption which can be used to change the character used).
The
foreignpackage supports the reading (and some writing) of data stored in some of the binary file formats used by other applications, e.g.,
read.spss.
If data is not already in a form that can be easily processed by
R, it may be simpler to convert it using a language or tool that you are already familiar with, rather than using
R. The
Renvironment includes a simple spreadsheet like editor for manual data entry and modifying existing data.
scores = edit(scores) # invoke built-in spreadsheet like editor
There are corresponding
writefunctions for many of the
readfunctions, e.g.,
write.csv
.
The
printfunction performs relatively simple formatted output (the
formatfunction can be used to create more sophisticated formatting, that can then be output); the
catfunction performs relatively little formatting, but is more flexible, and in particular does not terminate its output with a newline; the
sinkfunction can be used to specify an alternative location to write console output.
15.9.1 Graphical output
There are probably more functions supporting graphical output, in
R, than textual output.
Perhaps the most commonly used graphical output function is
plot. This function often does a good job of producing a reasonable graphical representation of the data. Over- loaded versions of this function are often provided by packages, to plot data having a particular class created by the package.
By default, graphical output is sent to the console device; this behavior can be overridden to produce a file having a particular format, e.g.,
pdf,
jpeg,
pngand
pictex. The list of supported output devices varies across the operating systems on which
Rruns.
The behavior of the
plotfunction can be influenced by previous calls to the
parfunction, which set configurable options.
Various packages providing graphical output are available, with the
ggplotpackage prob-
ably being the most commonly used by frequent
Rusers.
15.10. NON-STATISTICAL USES OF R 399
0 500 1500 2500 3500
0.1 0.2 0.3 0.4 0.5 0.6 0.7
File offset
Fraction Unique
Figure 15.2: The unique bytes per window (256 bytes wide) of a pdf file.Github–Local
15.10 Non-statistical uses of R
While the target of
R’s domain specialised functionality is statistical data analysis, there are other application domains where this functionality could be useful (but may not war- rant effort needed to learn
R).
A variety of functions designed for manipulating the rows and columns of delimited data files are available; see
Github–
Rlang/Top500.R.
A technique for spotting whether a file contains compressed data (e.g., a virus hidden in a script by compressing it to look like a jumble of numbers) is to plot the fraction of dis- tinct values appearing in successive, fixed size, blocks; see figure 15.2. Compressed data is likely to contain an approximately uniform distribution of byte values (compression is achieved by reducing apparent information content), your mileage may vary between compression methods.
The following code reads a pdf file, applies a sliding window to the data and then plots the fraction of distinct values in each window (at a given offset).
window_width=256 # if less than 256, divisor has to change in plot call
plot_unique=function(filename) {
t=readBin(filename, what="raw", n=1e7)
# Sliding the window over every point is too much overhead cnt_points=seq(1, length(t)-window_width, 5)
u=sapply(cnt_points, function(X) length(unique(t[X:(X+window_width)]))) plot(u/256, type="l", xlab="Offset", ylab="Fraction Unique", las=1)
return(u) }
dummy=plot_unique("http://www.coding-guidelines.com/R_code/requirements.tgz")