CHAPTER 6. EFFICIENT . . . 6.2. THE APPLY AND OUTER . . .
The cumsum function
To calculate cumulative sums of the elements of a vector, use the function cumsum. For example:
x <- 1:10
y <- cumsum(x)
y
[1] 1 3 6 10 15 21 28 36 45 55
Note that in R the function cumsum treats a matrix as one long vector (taken column by column) and returns a plain vector; to obtain cumulative sums per column, use apply(M, 2, cumsum). Use cumprod for cumulative products, cummin for cumulative minima and cummax for cumulative maxima.
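The related cumulative functions can be tried on a small vector. This is only a sketch: the vector v and the matrix m are made-up illustrations and do not come from the text.

```r
# v is an arbitrary illustrative vector
v <- c(3, 1, 4, 1, 5)
cumsum(v)    # running sums:     3 4 8 9 14
cumprod(v)   # running products: 3 3 12 12 60
cummin(v)    # running minimum:  3 1 1 1 1
cummax(v)    # running maximum:  3 3 4 4 5

# m is an arbitrary 3 x 2 matrix; apply(m, 2, cumsum) runs cumsum on each column
m <- matrix(1:6, ncol = 2)
col.cums <- apply(m, 2, cumsum)   # columns become 1 3 6 and 4 9 15
```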
Matrix multiplication
In R, matrix multiplication is performed by the operator %*%. This can sometimes be used to avoid explicit looping. An m by n matrix A can be multiplied by an n by p matrix B in the following manner:
C <- A %*% B
So element C[i,j] of the matrix C is given by the formula:
$$C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$$
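The formula can be checked numerically by rebuilding C with explicit loops and comparing it with the result of %*%. The matrices below are arbitrary small examples, and the name C.loop is introduced here only for illustration.

```r
# A (2 x 3) and B (3 x 2) are arbitrary illustrative matrices
A <- matrix(1:6, nrow = 2)
B <- matrix(1:6, nrow = 3)
C <- A %*% B

# Rebuild C element by element: C[i, j] is the sum over k of A[i, k] * B[k, j]
C.loop <- matrix(0, nrow = nrow(A), ncol = ncol(B))
for (i in seq_len(nrow(A))) {
  for (j in seq_len(ncol(B))) {
    C.loop[i, j] <- sum(A[i, ] * B[, j])
  }
}
all.equal(C, C.loop)   # TRUE
```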
If we choose the elements of the matrices A and B ‘cleverly’, explicit for-loops can be avoided. For example, suppose we want to calculate the average of each column of a matrix. Multiplying the transpose of the matrix by a vector whose elements are all equal to 1/n, where n is the number of rows, gives the column averages:
A <- matrix(rnorm(1000), ncol=10)
n <- dim(A)[1]
mat.means <- t(A) %*% rep(1/n, n)
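As a quick sanity check, the result can be compared with the built-in function colMeans. The seed below is arbitrary and is set only to make the sketch reproducible.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
A <- matrix(rnorm(1000), ncol = 10)
n <- nrow(A)                         # same value as dim(A)[1]
mat.means <- t(A) %*% rep(1/n, n)    # 10 x 1 matrix holding the column means

# as.vector drops the matrix shape so the two results can be compared directly
all.equal(as.vector(mat.means), colMeans(A))   # TRUE
```

For this particular task colMeans is the more direct choice; the matrix product is shown because the same trick extends to other weighted sums.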
M <- matrix(rnorm(10000), ncol=100)
apply(M, 1, mean)
The first argument of apply is the matrix, the second argument is either a 1 or a 2. If one chooses 1 then the mean of each row will be calculated, if one chooses 2 then the mean will be calculated for each column. The third argument is the function (or the name of the function) that will be applied to the rows or columns.
The function apply can also be used with a function that you have written yourself. Extra arguments to your function must then be passed through the apply function. The following construction calculates, for each row of a matrix, the number of entries that are larger than a threshold d.
tresh <- function(x, d) {
  sum(x > d)
}
M <- matrix(rnorm(10000), ncol=100)
apply(M, 1, tresh, 0.6)
 [1] 24 26 24 26 31 26 30 27 28 29 26 23 33 23 27 23 27 31 22
[20] 28 25 28 30 25 28 32 23 24 27 33 29 25 26 20 31 28 29 31
[39] 37 36 26 23 23 28 26 28 30 25 23 30 20 34 29 32 34 30 29
[58] 30 37 28 22 27 20 30 24 29 21 26 26 31 26 18 26 34 29 20
[77] 18 27 28 33 33 25 21 35 25 33 27 28 20 35 23 31 25 29 20
[96] 30 27 28 21 31
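Because the matrix above is square and random, the output does not reveal which margin was used. A small hand-made matrix makes the difference visible. In this sketch M.small and count.above are hypothetical names, and the values were picked so that the counts are easy to verify by eye.

```r
# M.small is a made-up 2 x 3 matrix; it is filled column by column
M.small <- matrix(c(0.1, 0.9, 0.7, 0.2, 0.8, 0.3), nrow = 2)

# Same idea as tresh: count the entries of x that exceed d
count.above <- function(x, d) sum(x > d)

by.row <- apply(M.small, 1, count.above, 0.6)   # margin 1, one count per row:    2 1
by.col <- apply(M.small, 2, count.above, 0.6)   # margin 2, one count per column: 1 1 1
```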
6.2.2 The lapply and sapply functions
These functions are suitable for performing calculations on the components of a list and, in particular, on the columns of a data frame. If, for instance, you want to find out which columns of the data frame cars are of type numeric then proceed as follows:
lapply(cars, is.numeric)
$Price:
[1] TRUE
$Country:
[1] FALSE
$Reliability:
[1] FALSE
$Mileage:
[1] TRUE ...
...
The function sapply can be used as well:
sapply(car.test.frame, is.numeric)
Price Country Reliability Mileage Type Weight Disp. HP
    T       F           F       T    F      T     T  T
The function sapply can be considered as the ‘simplified’ version of lapply. The function lapply returns a list, whereas sapply returns a vector (if possible). In both cases the first argument is a list (or data frame) and the second argument is the name of a function. Extra arguments that normally are passed to the function should be given as additional arguments of lapply or sapply.
mysummary <- function(x) {
  if (is.numeric(x)) return(mean(x))
  else return(NA)
}
sapply(car.test.frame, mysummary)
   Price Country Reliability  Mileage Type   Weight  Disp.     HP
12615.67      NA          NA 24.58333   NA 2900.833 152.05 122.35
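The difference between the two return types, and the passing of extra arguments, can be seen on a small self-contained example. The data frame df below is hypothetical and merely stands in for the car data.

```r
# df is a made-up data frame, not the car data used in the text
df <- data.frame(price   = c(10, 20, 30),
                 country = c("A", "B", "A"),
                 weight  = c(1.5, 2.5, 3.5))

res.list <- lapply(df, is.numeric)   # a list with one component per column
res.vec  <- sapply(df, is.numeric)   # a named logical vector: TRUE FALSE TRUE

# Extra arguments after the function name (here na.rm) are passed on to mean
num.means <- sapply(df[res.vec], mean, na.rm = TRUE)   # price 20, weight 2.5
```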
Some attention should be paid to the situation where the function called by sapply does not always return output of the same length. For instance, the length of the output vector may depend on a certain calculation:
myf <- function(x) {
  n <- as.integer(sum(x))
  out <- 1:n
  out
}
testdf <- as.data.frame(matrix(runif(25), ncol=5))
sapply(testdf, myf)
$X.1:
[1] 1 2
$X.2:
[1] 1 0
$X.3:
[1] 1 2 3
$X.4:
[1] 1 2
$X.5:
[1] 1
The result will then be an object with a list structure.
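The effect of the output lengths on the shape of the result can be reproduced without random numbers. The sketch below uses made-up input lists and a hypothetical helper first.n. Note also that 1:n counts down to 1 0 when n is 0, which is what the X.2 component above shows; seq_len is used here because it returns an empty vector in that case.

```r
# first.n is a hypothetical variant of myf that uses seq_len instead of 1:n
first.n <- function(x) seq_len(sum(x))

# Both components give a result of length 2, so sapply can build a matrix
equal.len <- sapply(list(a = c(1, 1), b = c(2, 0)), first.n)

# The results have lengths 2 and 3, so sapply falls back to a list
unequal.len <- sapply(list(a = c(1, 1), b = c(1, 2)), first.n)

is.matrix(equal.len)   # TRUE
is.list(unequal.len)   # TRUE
```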
6.2.3 The tapply function
This function is used to run another function on the cells of a so-called ragged array. A ragged array is a pair of vectors of the same length. One of them contains data and the other contains grouping information. The following data vector x and grouping vector y form an example of a ragged array.
x <- rnorm(50)
y <- as.factor(sample(c("A","B","C","D"), size=50, replace=TRUE))
A cell of a ragged array consists of those data points from the data vector that have the same label in the grouping vector. The function tapply applies a function to each cell of a ragged array. Extra arguments, such as trim = 0.3 below, are passed on to that function.
tapply(x, y, mean, trim = 0.3)
A B C D
-0.4492093 -0.1506878 0.4427229 -0.1265299
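With a small hand-made ragged array the cells and their results can be verified by eye. The vectors vals and grp below are made up for this sketch.

```r
# vals holds the data, grp holds the group label of each data point
vals <- c(1, 2, 3, 10, 20, 100)
grp  <- factor(c("A", "A", "A", "B", "B", "C"))

cell.means <- tapply(vals, grp, mean)     # A = 2, B = 15, C = 100
cell.sizes <- tapply(vals, grp, length)   # A = 3, B = 2,  C = 1
```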
Combining lapply and tapply
To calculate the mean per group in every column of a data frame, one can use sapply or lapply in combination with tapply. For example, to calculate the mean per country of every column in the data frame cars, we can use the following code:
mymean <- function(x, y) {
  tapply(x, y, mean)
}
lapply(cars, mymean, cars$Country)
$Price
   France   Germany     Japan Japan/USA    Korea   Mexico    Sweden       USA
15930.000 14447.500 13938.053 10067.571 7857.333 8672.000 18450.000 12543.269
$Country
France Germany Japan Japan/USA Korea Mexico Sweden USA
NA NA NA NA NA NA NA NA
$Reliability
France Germany Japan Japan/USA Korea Mexico Sweden USA
NA NA NA 4.857143 NA 4.000000 3.000000 NA
...
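The NA entries in the output above arise because mean is also applied to columns for which it cannot compute a value. One possible refinement, sketched here on a hypothetical data frame df rather than on the car data, is to select the numeric columns first with sapply and is.numeric.

```r
# df is a made-up data frame standing in for the car data
df <- data.frame(price   = c(10, 20, 30, 40),
                 weight  = c(1, 2, 3, 4),
                 country = c("X", "Y", "X", "Y"))

mymean <- function(x, y) tapply(x, y, mean)

num.cols <- sapply(df, is.numeric)   # TRUE TRUE FALSE

# Group means for the numeric columns only; sapply simplifies the result
# to a matrix with one row per country and one column per variable
group.means <- sapply(df[num.cols], mymean, df$country)
```

Here group.means has rows X and Y and columns price and weight, for example a mean price of 20 for country X.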
6.2.4 The by function
The by function applies a function to parts of a data frame. Let's look at the cars data again and suppose we want to fit the linear regression model Price ~ Weight for each type of car. First we write a small function that fits the model Price ~ Weight for a data frame.
myregr <- function(data) {
  lm(Price ~ Weight, data = data)
}
This function is then passed to the by function:
outreg <- by(cars, cars$Type, FUN=myregr)
outreg
cars$Type: Compact
Call:
lm(formula = Price ~ Weight, data = data)
Coefficients:
(Intercept)       Weight
   2254.765        3.757
---
cars$Type: Large
Call:
lm(formula = Price ~ Weight, data = data)
Coefficients:
(Intercept)       Weight
 17881.2839      -0.5183
...
...
The output object outreg of the by function contains all the separate regressions; it is a so-called ‘by’ object. Individual regression objects can be accessed by treating the ‘by’ object as a list:
outreg[[1]]
Call:
lm(formula = Price ~ Weight, data = data)
Coefficients:
(Intercept)       Weight
   2254.765        3.757
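Because a ‘by’ object can be treated as a list, the fitted models can also be processed together, for instance with sapply. The sketch below uses a hypothetical data frame df in which price is an exact linear function of weight within each group, so the coefficients are known in advance; the names fit and fits are likewise illustrative.

```r
# df is made up: group a follows price = 1 + 2 * weight,
# group b follows price = 11 - weight
df <- data.frame(type   = rep(c("a", "b"), each = 3),
                 weight = c(1, 2, 3, 1, 2, 3),
                 price  = c(3, 5, 7, 10, 9, 8))

fit  <- function(data) lm(price ~ weight, data = data)
fits <- by(df, df$type, FUN = fit)

coef(fits[[1]])                   # group a: intercept 1, slope 2 (up to rounding)
all.coefs <- sapply(fits, coef)   # one column of coefficients per group
```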
6.2.5 The outer function
The function outer computes a generalized outer product of two arrays (vectors). This can be especially useful for evaluating a function on a grid without explicit looping. The function has at least three input arguments: two vectors x and y and the name of a function that takes two or more arguments. This function is evaluated for every combination of the elements of x and y; it must be vectorized, because outer calls it once with all combinations rather than once per pair. Some examples are given by the code below.
x <- 1:3
y <- 1:3
z <- outer(x, y, FUN="-")
z
     [,1] [,2] [,3]
[1,]    0   -1   -2
[2,]    1    0   -1
[3,]    2    1    0
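The same mechanism evaluates a user-written function on a grid. In the sketch below the function f and the grid vectors gx and gy are arbitrary illustrations; f works element by element on whole vectors, which is what outer requires.

```r
# f is a made-up function of two arguments, vectorised over both
f <- function(x, y) x^2 + y

gx <- 1:3
gy <- c(0, 10)

grid.vals <- outer(gx, gy, FUN = f)   # 3 x 2 matrix, entry [i, j] is f(gx[i], gy[j])
mult.tab  <- outer(gx, gx)            # the default FUN is "*": a multiplication table
```

For example, grid.vals[3, 2] equals f(3, 10), which is 19.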
x <- c("A", "B", "C", "D")
y <- 1:9