Data frames - Types of data - Shipunov visual statistics

Types of data

3.8 Inside R

3.8.3 Data frames

With a dollar sign or character vector, the object we obtain by indexing retains its original type, just as with double square brackets. Note that indexing with the dollar sign works only in lists. If you have to index other objects with named elements, use square brackets with character vectors:

> names(w) <- c("Rick", "Amanda", "Peter", "Alex", "Kathryn",
+ "Ben", "George")

> w["Amanda"]

Amanda
    68
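As a minimal sketch of the difference between the two kinds of indexing, the snippet below rebuilds a named vector and tries both. The values of w are an assumption here: they are copied from the weight column of the data frame shown later in this section.

```r
# Assumed values of w, copied from the weight column shown later
w <- c(69, 68, 93, 87, 59, 82, 72)
names(w) <- c("Rick", "Amanda", "Peter", "Alex", "Kathryn",
              "Ben", "George")

by.name <- w["Amanda"]       # character index works on a named vector
by.name2 <- w[["Amanda"]]    # double brackets return the value without its name

# The dollar sign is not defined for atomic vectors,
# so this call raises an error, which is caught here
dollar <- tryCatch(w$Amanda, error = function(e) "error")

# After conversion to a list, the dollar sign becomes usable
w.list <- as.list(w)
from.list <- w.list$Amanda
```

Both by.name and from.list hold the value 68, whereas dollar only records that the call on the plain vector failed.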

* * *

Lists are so important to learn because many functions in R store their output as lists:

> x2.wilcox <- wilcox.test(x.ranks2)

...

> str(x2.wilcox)

List of 7

$ statistic : Named num 36

..- attr(*, "names")= chr "V"

$ parameter : NULL

$ p.value : num 0.0141

...

Therefore, if we want to extract any piece of the output (like the p-value, see more in the next chapters), we need to use the list indexing principles described above:

> x2.wilcox$p.value

[1] 0.0141474
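The same extraction can be tried on any test result. The sketch below uses a hypothetical sample x, because the original x.ranks2 vector is not shown in this section; only the way of accessing the components matters here.

```r
# Hypothetical sample; the original x.ranks2 is not available here
x <- c(1.5, 2.3, -0.7, 3.1, 4.8, 2.9, 5.2)
res <- wilcox.test(x)

is.list(res)     # the result is a list (with class "htest")
names(res)       # names of the components that can be extracted

# Three equivalent ways to pull out the same component
p1 <- res$p.value
p2 <- res[["p.value"]]
p3 <- res[[which(names(res) == "p.value")]]
```

Calling names() on an unfamiliar result is often the quickest way to see which pieces are available before indexing.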

Figure 3.11: Most important R data objects (vectors, matrices, lists, data frames).

Each column of the data frame must contain data of the same type (like in vectors), but columns themselves may be of different types (like in lists). Let us create a data frame from our existing vectors:

> d <- data.frame(weight=w, height=x, size=m.o, sex=sex.f)

> row.names(d) <- c("Rick", "Amanda", "Peter", "Alex", "Kathryn",
+ "Ben", "George")

> d

weight height size sex

Rick 69 174.0 L male

Amanda 68 162.0 S female

Peter 93 188.0 XL male

Alex 87 192.0 XXL male

Kathryn 59 165.0 S female

Ben 82 168.0 M male

George 72 172.5 L male

(It was not absolutely necessary to enter row.names() since our w object could still retain names and they, by rule, will become row names of the whole data frame.) This data frame represents data in short form, with many columns (features). The long form of the same data could, for example, look like:

Rick weight 69

Rick height 174.0

Rick size L

Rick sex male

Amanda weight 68

...

In long form, features are mixed in one column, whereas the other column specifies the feature id. This is really useful when we finally come to two-dimensional data analysis.
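One possible way to build such a long form by hand is sketched below in base R. It uses a reduced, two-row copy of the data, and the column names object, feature and value are chosen only for this illustration.

```r
# Reduced copy of the data frame (two rows, three columns)
d <- data.frame(weight = c(69, 68), height = c(174, 162),
                sex = c("male", "female"),
                row.names = c("Rick", "Amanda"))

# One column must hold mixed features, so everything becomes character
values <- sapply(d, as.character)

long <- data.frame(
  object  = rep(row.names(d), each = ncol(d)),  # Rick Rick Rick Amanda ...
  feature = rep(names(d), times = nrow(d)),     # weight height sex weight ...
  value   = as.vector(t(values))                # values taken row by row
)
```

The price of the long form is visible here: numbers and labels share one column, so the numeric type is lost until the data is split by feature again.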

* * *

Commands row.names() or rownames() specify names of data frame rows (objects).

For data frame columns (variables), use names() or colnames().

Alternatively, especially if objects w, x, m.o, or sex.f are for some reason absent from the workspace, you can type:

> d <- read.table("data/d.txt", h=TRUE)

> d$size <- ordered(d$size, levels=c("S", "M", "L", "XL", "XXL"))

... and then immediately check the structure:

> str(d)

'data.frame': 7 obs. of 4 variables:

$ weight: num 69 68 93 87 59 82 72

$ height: num 174 162 188 192 165 168 172.5

$ size : Ord.factor w/ 5 levels "S"<"M"<"L"<"XL"<..: 3 1 4

$ sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 2

Since the data frame is in fact a list, we may successfully apply all list indexing methods to it. More than that, data frames can also be indexed like two-dimensional matrices:

> d[, 1]

[1] 69 68 93 87 59 82 72

> d[[1]]

[1] 69 68 93 87 59 82 72

> d$weight

[1] 69 68 93 87 59 82 72

> d[, "weight"]

[1] 69 68 93 87 59 82 72

> d[["weight"]]

[1] 69 68 93 87 59 82 72

To be absolutely sure that any two of these methods give the same output, run:

> identical(d$weight, d[, 1])

[1] TRUE
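The list nature of a data frame can also be checked directly. The following sketch works on a reduced, three-row copy of d (an assumption made to keep the example self-contained) and contrasts single and double brackets.

```r
# Reduced copy of the data frame
d <- data.frame(weight = c(69, 68, 93), height = c(174, 162, 188),
                row.names = c("Rick", "Amanda", "Peter"))

is.list(d)             # a data frame is a list of columns
length(d)              # number of list elements, that is, columns

one.col.df <- d[1]     # single brackets, list style: still a data frame
one.col.vec <- d[[1]]  # double brackets: the column itself, a vector
mat.style <- d[, 1]    # matrix style: also a plain vector

class(one.col.df)
class(one.col.vec)
```

Note that d[1] is the odd one out: it keeps the data frame wrapper, which is why it was not listed among the five equivalent calls above.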

To select several columns (all these methods give the same results):

> d[, 2:4] # matrix method

height size sex

Rick 174.0 L male

Amanda 162.0 S female

Peter 188.0 XL male

...

> d[, c("height", "size", "sex")]

height size sex

Rick 174.0 L male

Amanda 162.0 S female

Peter 188.0 XL male

...

> d[2:4] # list method

height size sex

Rick 174.0 L male

Amanda 162.0 S female

Peter 188.0 XL male

...

> subset(d, select=2:4)

height size sex

Rick 174.0 L male

Amanda 162.0 S female

Peter 188.0 XL male

...

George 172.5 L male

> d[, -1] # negative selection

height size sex

Rick 174.0 L male

Amanda 162.0 S female

Peter 188.0 XL male

...

(Three of these methods also work for the rows of this data frame. Try all of them and find which are not applicable. Note also that negative selection works only for numerical vectors; to use several negative values, type something like d[, -(2:4)]. Think why the colon is not enough and you need parentheses here.)
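One way to explore the question about parentheses is simply to look at what each expression produces. The sketch below uses a reduced two-row copy of d; it is an illustration, not the only way to check.

```r
# Reduced copy of the data frame
d <- data.frame(weight = c(69, 68), height = c(174, 162),
                size = c("L", "S"), sex = c("male", "female"))

with.parens <- -(2:4)   # -2 -3 -4: drop columns 2 to 4
no.parens <- -2:4       # parsed as (-2):4, that is, -2 -1 0 1 2 3 4

kept <- names(d[with.parens])   # list-style selection keeps a data frame

# Mixing negative and positive indices is an error
mixed <- tryCatch(d[no.parens], error = function(e) "error")

# The minus sign is not defined for character vectors either
by.name <- tryCatch(d[, -"weight"], error = function(e) "error")
```

To exclude a column by name, a logical expression on names(d), as shown later in this section, is one workable alternative.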

Among all these ways, the most popular are the dollar sign and square brackets (Fig. 3.12).

While the first is shorter, the second is more universal.

Figure 3.12: Two most important ways to select from data frame: df$name (column name) or df[row index, column index].

Selection by column indices is easy and saves space, but it requires remembering these numbers. The Str() command (note the uppercase) could help here: it replaces dollar signs with column numbers (and also indicates the presence of NAs with a star sign, *, and shows row names if they are not default):

> Str(d)

'data.frame': 7 obs. of 4 variables:

1 weight: int 69 68 93 87 59 82 72

2 height: num 174 162 188 192 165 ...

3 size : Ord.factor w/ 5 levels "S"<"M"<"L"<"XL"<..: 3 1 4

4 sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 2

row.names [1:7] "Rick" "Amanda" "Peter" "Alex" "Kathryn" ...
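If Str() is not at hand, similar information can be assembled with base R only. The sketch below is an assumption-level substitute, not a reimplementation of Str(); the helper names col.index, height.pos and has.na are made up for this example.

```r
# Reduced copy of the data frame
d <- data.frame(weight = c(69, 68), height = c(174, 162),
                size = c("L", "S"), sex = c("male", "female"))

# A small lookup table: column number next to column name
col.index <- data.frame(index = seq_along(d), name = names(d))

# Position of one column, found by its name
height.pos <- match("height", names(d))

# Which columns contain at least one NA (none in this toy data)
has.na <- sapply(d, anyNA)
```

With height.pos in hand, d[, height.pos] selects the column by number without the number having been memorized.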

* * *

Now, how to make a subset, that is, select several objects (rows) which have particular features? One way is through logical vectors. Imagine that we are interested only in the values obtained from females:

> d[d$sex=="female", ]

weight height size sex

Amanda 68 162 S female

Kathryn 59 165 S female

(To select only rows, we used the logical expression d$sex=="female" before the comma.) By itself, the above expression returns a logical vector:

> d$sex=="female"

[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE

This is why R selected only the rows which correspond to TRUE: 2nd and 5th rows.

The result is just the same as:

> d[c(2, 5), ]

weight height size sex

Amanda 68 162 S female

Kathryn 59 165 S female
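The link between the logical vector and the row numbers can be made explicit with which(). This is a small sketch on a two-column copy of d, with values taken from the table above.

```r
# Two-column copy of the data frame, values as in the table above
d <- data.frame(weight = c(69, 68, 93, 87, 59, 82, 72),
                sex = c("male", "female", "male", "male",
                        "female", "male", "male"),
                row.names = c("Rick", "Amanda", "Peter", "Alex",
                              "Kathryn", "Ben", "George"))

is.female <- d$sex == "female"   # logical vector, one value per row
rows <- which(is.female)         # positions of the TRUE values

by.logical <- d[is.female, ]     # selection with the logical vector
by.number <- d[rows, ]           # selection with the row numbers

n.female <- sum(is.female)       # TRUE counts as 1, so this counts rows
```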

Logical expressions could be used to select whole rows and/or columns:

> d[, names(d) != "weight"]

height size sex

Rick 174.0 L male

Amanda 162.0 S female

Peter 188.0 XL male

...

It is also possible to apply more complicated logical expressions:

> d[d$size== "M" | d$size== "S", ]

weight height size sex

Amanda 68 162 S female

Kathryn 59 165 S female

Ben 82 168 M male

> d[d$size %in% c("M", "L") & d$sex=="male", ]

weight height size sex

Rick 69 174.0 L male

Ben 82 168.0 M male

George 72 172.5 L male

(The second example shows how to compare with several character values at once.) If the process of selection with square brackets, dollar sign and comma looks too complicated, there is another way, with the subset() command:

> subset(d, sex=="female")

weight height size sex

Amanda 68 162 S female

Kathryn 59 165 S female

However, “classic selection” with [ is preferable (see the more detailed explanation in ?subset).
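Whichever form is used, it helps to know that the two behave differently when the condition contains NA. The sketch below illustrates only this one difference, on hypothetical data in which one value of sex is missing; it is not a summary of the explanation in ?subset.

```r
# Hypothetical data: the sex of the third person is unknown
d <- data.frame(weight = c(69, 68, 93),
                sex = c("male", "female", NA),
                row.names = c("Rick", "Amanda", "Peter"))

cond <- d$sex == "female"                   # FALSE TRUE NA

with.bracket <- d[cond, ]                   # keeps an all-NA row for the NA
with.subset <- subset(d, sex == "female")   # drops rows where the condition is NA
with.which <- d[which(cond), ]              # which() also ignores NA
```

Wrapping the condition in which() is one common way to get the stricter behavior while staying with square brackets.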

* * *

Selection does not only extract part of the data frame, it also allows us to replace existing values:

> d.new <- d

> d.new[, 1] <- round(d.new[, 1] * 2.20462)

> d.new

weight height size sex

Rick 152 174.0 L male

Amanda 150 162.0 S female

Peter 205 188.0 XL male

...

(Now weight is in pounds.)

Partial matching does not work with the replacement, but there is another interesting effect:

> d.new$he <- round(d.new$he * 0.0328084)

> d.new

weight height size sex he

Rick 152 174.0 L male 6

Amanda 150 162.0 S female 5

Peter 205 188.0 XL male 6

...

(A bit mysterious, isn't it? However, the rules are simple. As usual, the expression works from right to left. When we called d.new$he on the right, independent partial matching substituted it with d.new$height and converted centimeters to feet. Then replacement starts. It does not understand partial matching, and therefore d.new$he on the left returns NULL. In that case, the new column (variable) is silently created.

This is because subscripting with $ returns NULL if the subscript is unknown, creating a powerful method to add columns to the existing data frame.)
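The individual steps of this explanation can be checked one by one. The sketch below uses a two-row copy of d.new; the column name xyz is deliberately nonexistent.

```r
# Two-row copy of d.new (weight already in pounds, height in centimeters)
d.new <- data.frame(weight = c(152, 150), height = c(174, 162))

read.partial <- d.new$he   # reading: "he" partially matches "height"
unknown <- d.new$xyz       # no match at all: NULL

# Assignment needs an exact name, so a new column "he" is created
d.new$he <- round(d.new$he * 0.0328084)

col.names <- names(d.new)  # "weight" "height" "he"
```

After the assignment, d.new$he matches the new column exactly, so partial matching to height no longer happens for this abbreviation.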

Another example of “data frame magic” is recycling. A data frame accommodates shorter objects if they evenly fit the data frame after being repeated several times:

> data.frame(a=1:4, b=1:2)

a b

1 1 1

2 2 2

3 3 1

4 4 2
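The word “evenly” matters here. The following sketch contrasts lengths that fit with a length that does not; the error is caught so that the script keeps running.

```r
ok <- data.frame(a = 1:4, b = 1:2)    # b is repeated twice
ok1 <- data.frame(a = 1:4, b = "x")   # a single value fills the whole column

# 3 does not divide 4, so data.frame() refuses to recycle
bad <- tryCatch(data.frame(a = 1:4, b = 1:3),
                error = function(e) conditionMessage(e))
```

Here bad holds the text of the error message instead of a data frame.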

The following table (Table 3.2) provides a summary of R subscripting with “[”:

subscript                 effect
positive numeric vector   selects items with those indices
negative numeric vector   selects all but those indices
character vector          selects items with those names (or dimnames)
logical vector            selects the TRUE (and NA) items
missing                   selects all

Table 3.2: Subscripting with “[”.
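Each row of the table can be tried on a small named vector. The vector x below is made up for this illustration.

```r
x <- c(a = 10, b = 20, c = 30)

pos <- x[c(1, 3)]             # positive numbers: items 1 and 3
neg <- x[-2]                  # negative number: all but item 2
chr <- x[c("a", "c")]         # character: items with those names
lgl <- x[c(TRUE, NA, FALSE)]  # logical: the TRUE item plus an NA for the NA
all.items <- x[]              # missing subscript: everything
```

The first three results are the same two items; the logical one differs because the NA in the subscript yields an NA item in the result.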

* * *

The sort() command does not work for data frames. To sort values in the d data frame, say, first by sex and then by height, we have to use a more complicated operation:

> d[order(d$sex, d$height), ]

weight height size sex

Amanda 68 162.0 S female

Kathryn 59 165.0 S female

Ben 82 168.0 M male

George 72 172.5 L male

Rick 69 174.0 L male

Peter 93 188.0 XL male

Alex 87 192.0 XXL male

The order() command creates a numerical, not logical, vector with the future order of the rows:

> order(d$sex, d$height)

[1] 2 5 6 7 1 3 4
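What order() returns is easier to see on a single short vector. The sketch below uses three heights from the data frame above.

```r
height <- c(174, 162, 188)
names(height) <- c("Rick", "Amanda", "Peter")

ord <- order(height)     # 2 1 3: positions of the values, smallest first
sorted <- height[ord]    # indexing with these positions sorts the vector

# For a plain vector this gives the same values as sort()
same <- identical(unname(sorted), sort(unname(height)))

down <- order(height, decreasing = TRUE)   # 3 1 2: largest first
```

Because the result is a vector of positions, it can be placed before the comma to reorder the rows of a whole data frame, as done above.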

Use order() to arrange the columns of the d data frame in alphabetic order.
