Types of data
3.3 Colors, names and sexes: nominal data
3.3.2 Factors
But plot() could do nothing with the character vector (check it yourself). To plot nominal data, we must first inform R that this vector has to be treated as a factor:
> sex.f <- factor(sex)
> sex.f
[1] male female male male female male male
Levels: female male
Now plot() will “see” what to do. It will invisibly count items and draw a barplot (Fig. 3.5):
> plot(sex.f)
It happened because the character vector was transformed into an object of a type specific to categorical data, a factor with two levels:
> is.factor(sex.f)
[1] TRUE
> is.character(sex.f)
[1] FALSE
> str(sex.f)
Factor w/ 2 levels "female","male": 2 1 2 2 1 2 2
> levels(sex.f)
[1] "female" "male"
> nlevels(sex.f)
[1] 2
Figure 3.5: This is how plot() plots a factor.
In R, many functions (including plot()) prefer factors to character vectors. Some of them convert character vectors into factors automatically, but others do not. Therefore, be careful!
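As a minimal sketch of what plot() does behind the scenes with a factor, the counting step can be made explicit with table(). The sex vector below is retyped from the printed output above, and the name counts is introduced only for this illustration:

```r
# The character vector, retyped from the output shown earlier
sex <- c("male", "female", "male", "male", "female", "male", "male")
sex.f <- factor(sex)

# Count items per level; these counts become the bar heights
counts <- table(sex.f)
counts

# Drawing the counts directly gives the same kind of barplot as plot(sex.f)
barplot(counts)
```

Here table() does the counting explicitly, which is useful whenever the numbers themselves are needed and not only the picture.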
There are some other facts to keep in mind.
First (and most important), factors, unlike character vectors, allow for easy transformation into numbers:
> as.numeric(sex.f)
[1] 2 1 2 2 1 2 2
But why is female 1 and male 2? The answer is simple: “female” comes first in alphabetical order. R uses the order of levels every time factors have to be converted into numbers.
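The dependence on level order can be checked directly. The following sketch is only an illustration (the name sex.m is hypothetical): it sets the levels explicitly so that “male” comes first, and the numeric codes change accordingly.

```r
sex <- c("male", "female", "male", "male", "female", "male", "male")

# Default: levels are sorted alphabetically, so female = 1 and male = 2
sex.f <- factor(sex)
as.numeric(sex.f)

# Explicit level order: now male = 1 and female = 2
sex.m <- factor(sex, levels = c("male", "female"))
levels(sex.m)
as.numeric(sex.m)
```

The data are the same in both cases; only the mapping between labels and numbers differs.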
The reasons for such a transformation become clear in the following example. Suppose we also measured the weights of the employees from the previous example:
> w <- c(69, 68, 93, 87, 59, 82, 72)
We may wish to plot all three variables: height, weight and sex. Here is one possible way (Fig. 3.6):
> plot(x, w, pch=as.numeric(sex.f), col=as.numeric(sex.f),
+ xlab="Height, cm", ylab="Weight, kg")
> legend("topleft", pch=1:2, col=1:2, legend=levels(sex.f))
Figure 3.6: A plot with three variables (horizontal axis: Height, cm; vertical axis: Weight, kg; legend: female, male).
Parameters pch (from “plotting character”) and col (from “color”) define the shape and color of the characters displayed in the plot. Depending on the value of the variable sex, a data point is displayed as a circle or a triangle, and in black or in red. In general, it is enough to use either shape or color to distinguish between levels.
Note that colors were printed from numbers in accordance with the current palette.
To see which numbers mean which colors, type:
> palette()
[1] "black" "red" "green3" "blue" "cyan" "magenta"
[7] "yellow" "gray"
It is possible to change the default palette by calling this function with an argument. For example, palette(rainbow(8)) will replace the default with 8 new “rainbow” colors.
To return, type palette("default"). It is also possible to create your own palette, for example with the function colorRampPalette() (see examples in the next chapters) or with a separate package (like RColorBrewer or cetcolor; the latter allows you to create perceptually uniform palettes).
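The palette commands mentioned above can be tried together in a short sketch. The name white.to.red and the choice of five colors are assumptions made for this illustration only:

```r
# Switch to 8 rainbow colors and check the palette length
palette(rainbow(8))
length(palette())

# A custom palette: colorRampPalette() returns a function that
# interpolates the requested number of colors between the given ones
white.to.red <- colorRampPalette(c("white", "red"))
palette(white.to.red(5))
palette()

# Return to the default palette
palette("default")
palette()
```

After the last command, color numbers used in col= arguments again refer to the default colors.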
How would you color the barplot from Fig. 3.5 in black (female) and red (male)?
If your factor is made from numbers and you want to convert it back into numbers (this task is not rare!), convert it first to a character vector, and only then to numbers:
> (ff <- factor(3:5))
[1] 3 4 5
Levels: 3 4 5
> as.numeric(ff) # incorrect!
[1] 1 2 3
> as.numeric(as.character(ff)) # correct!
[1] 3 4 5
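An equivalent approach, described in the R help page for factor(), converts only the levels to numbers and then indexes them by the factor. A small sketch follows; the factor ff2, with repeated and unsorted values, is invented for this illustration:

```r
ff2 <- factor(c(10, 5, 10, 20))
levels(ff2)                     # levels are sorted numerically, then stored as text

as.numeric(ff2)                 # internal codes, not the original values
as.numeric(as.character(ff2))   # the original values
as.numeric(levels(ff2))[ff2]    # same values; each level is converted only once
```

With only a few elements the two correct variants behave the same; the second one may be somewhat more efficient for long factors with few levels.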
The next important feature of factors is that a subset of a factor retains, by default, the original set of levels, even if some of the levels no longer occur in it. Compare:
> sex.f[5:6]
[1] female male
Levels: female male
> sex.f[6:7]
[1] male male
Levels: female male
There are several ways to exclude the unused levels, e.g., with the droplevels() command, with the drop argument, or by a “back and forth” (factor to character to factor) transformation of the data:
> droplevels(sex.f[6:7])
[1] male male
Levels: male
> sex.f[6:7, drop=TRUE]
[1] male male
Levels: male
> factor(as.character(sex.f[6:7]))
[1] male male
Levels: male
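One practical reason to drop unused levels is that functions such as table() report every level, including empty ones. A minimal sketch, with the name part introduced only for this illustration:

```r
sex.f <- factor(c("male", "female", "male", "male", "female", "male", "male"))
part <- sex.f[6:7]

table(part)                # "female" is still listed, with a count of zero
nlevels(part)

table(droplevels(part))    # only "male" remains
nlevels(droplevels(part))
```

Empty levels like this would also appear as empty bars or empty groups in plots, so it is worth checking nlevels() after subsetting.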
Third, we may order factors. Let us introduce a fourth variable—T-shirt sizes for these seven hypothetical employees:
> m <- c("L", "S", "XL", "XXL", "S", "M", "L")
> m.f <- factor(m)
> m.f
[1] L S XL XXL S M L
Levels: L M S XL XXL
Here levels follow alphabetical order, which is not appropriate because we want S (small) to be the first. Therefore, we must tell R that these data are ordered:
> m.o <- ordered(m.f, levels=c("S", "M", "L", "XL", "XXL"))
> m.o
[1] L S XL XXL S M L
Levels: S < M < L < XL < XXL
(Now R recognizes relationships between sizes, and the m.o variable can be treated as ranked.)
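What “ranked” means in practice can be seen from a few operations that work on ordered factors but not on plain ones. The sketch below builds the same kind of object with factor(..., ordered = TRUE), which is an alternative way of writing the ordered() call above:

```r
m <- c("L", "S", "XL", "XXL", "S", "M", "L")
m.o <- factor(m, levels = c("S", "M", "L", "XL", "XXL"), ordered = TRUE)

m.o < "L"          # which shirts are smaller than L?
max(m.o)           # the largest size present
sort(m.o)          # sorted by size, not alphabetically
as.numeric(m.o)    # ranks: S = 1, M = 2, L = 3, XL = 4, XXL = 5
```

For an unordered factor, comparisons such as < return NA with a warning, and max() stops with an error.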
* * *
In this section, we created quite a few new R objects. One of the skills to develop is to understand which objects are present in your session at the moment. To see them, you might want to list objects:
> ls()
[1] "aa" "bb" "cards" "coordinates" "dice"
...
If you want all objects together with their structure, use the ls.str() command.
There is also a more sophisticated version of object listing, which reports objects in a table:
> Ls() # shipunov
         Name      Mode       Type   Obs Vars      Size
1          aa   numeric     vector     5    1  88 bytes
2          bb   numeric     matrix     3    3 248 bytes
3       cards character     vector    36    1      2 Kb
4 coordinates      list data.frame 10000    2   92.2 Kb
5        dice character     vector    36    1      2 Kb
...
(To use Ls(), install the shipunov package first; see the preface for an explanation.) Ls() is also handy when you start to work with large objects: it helps to clean R memory.
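If the shipunov package is not at hand, a rough equivalent of this memory check can be sketched with base R alone. The environment env and the objects big and small are made up for this illustration, so that nothing in the current session is touched:

```r
# Work in a separate environment to leave the current session untouched
env <- new.env()
assign("big", numeric(1e5), envir = env)
assign("small", 1:10, envir = env)

# Size of every object in bytes, largest first
sizes <- sapply(ls(env), function(n) object.size(get(n, envir = env)))
sort(sizes, decreasing = TRUE)

# Remove the largest object and check what is left
rm(list = names(which.max(sizes)), envir = env)
ls(env)
```

In a real session, the same idea applies with ls() and rm() called without the envir argument.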