IV. Data frames
Data frames are R objects that represent data sets, and they are probably the objects we will deal with most frequently. Each column of a data frame is a vector, typically numeric, character, or logical (or a factor), and all columns must have the same length. Data frames resemble matrices in that they have the same rectangular structure, but their columns may hold different types of data, whereas a matrix is restricted to a single type. A data frame can usually be thought of as a matrix whose rows are the cases (observations) and whose columns are the variables.
A) Creating a data frame:
• read.table reads data from external text files.
• read.csv reads data from external comma-separated (CSV) files.
• data.frame binds together objects of different kinds.
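As a minimal sketch of reading a data frame from an external file, the following writes a small made-up data set to a temporary CSV file and reads it back. The data and the names original, tmp, restored and restored2 are hypothetical and serve only as illustration.

```r
# Hypothetical data, written to a temporary file so the example is self-contained
original <- data.frame(id = 1:3, score = c(2.5, 4.0, 3.5))
tmp <- tempfile(fileext = ".csv")
write.csv(original, tmp, row.names = FALSE)

# read.csv assumes a header row and comma-separated fields by default
restored <- read.csv(tmp)
print(restored)

# read.table is the general form; header and separator are given explicitly
restored2 <- read.table(tmp, header = TRUE, sep = ",")

unlink(tmp)  # remove the temporary file
```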
The syntax for creating a data frame is:
data.frame(data1,data2,…)
Example:
> data.frame(c(3,5,8),4:6,rev(6:8))
> data.frame(1:4,rep(1:2,2),seq(10,3,length=4))
B) Giving names to a data frame:
Both the rows and columns of a data frame can be labelled.
* To name only the columns, use the names( ) function, or give the names when the data frame is defined (as with cbind( ) and rbind( )).
* To name both the rows and the columns, there are two methods, the same as for matrices:
the first uses rownames( ) and colnames( ); the second uses dimnames( ) to set both at once.
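The naming methods listed above can be sketched as follows. The data frames df and df2 and all the labels are made up for illustration.

```r
# A data frame created without names gets uninformative default column names
df <- data.frame(1:3, c("a", "b", "c"))

# Columns only
names(df) <- c("id", "label")

# Rows and columns, one at a time
rownames(df) <- c("r1", "r2", "r3")
colnames(df) <- c("ID", "Label")

# Rows and columns at once: list(row names, column names)
dimnames(df) <- list(c("first", "second", "third"), c("num", "chr"))
print(df)

# Names can also be given when the data frame is defined
df2 <- data.frame(num = 1:3, chr = c("a", "b", "c"))
```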
Example:
> x=1:2
> y=letters[1:2]
> z=1:3
> data.frame(x,y) # appears with names
> data.frame(x,y,z) # error: z has a different length from x and y
NOTE:
Data frames must have variables of the same length.
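A short sketch of what the length rule means in practice. It uses tryCatch( ) only to capture the error so that the script keeps running; the names result and ok are hypothetical. One refinement worth knowing: a shorter atomic vector is recycled when its length divides the longest one evenly.

```r
x <- 1:2
z <- 1:3

# Lengths 2 and 3 cannot be combined: data.frame() signals an error
result <- tryCatch(data.frame(x, z), error = function(e) "error")
print(result)   # "error"

# A vector of length 2 is recycled to match a vector of length 4
ok <- data.frame(a = 1:2, b = 1:4)
dim(ok)         # 4 2
ok$a            # 1 2 1 2
```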
Example:
> d=data.frame(l=letters[4:8],b=c(T,T,F,F,T),n=7:3)
> d # appears with names
> dim(d)
C) Data frame Arithmetic:
Numeric computations can be applied only to the numeric variables of a data frame.
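A minimal sketch of arithmetic on a data frame, using the same d that appears in the other examples of this section. The helper name is_num is hypothetical.

```r
d <- data.frame(l = letters[4:8], b = c(TRUE, TRUE, FALSE, FALSE, TRUE), n = 7:3)

# Arithmetic on a numeric column works element by element
d$n * 2        # 14 12 10 8 6
mean(d$n)      # 5

# Find the numeric columns, then compute only on those
is_num <- sapply(d, is.numeric)
colMeans(d[is_num])

# The column l is not numeric, so arithmetic such as d$l + 1 is not meaningful
```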
D) Data frame Indexing:
Data frames can be accessed in several ways.
• The same tools used for matrix indexing can be used: df[row,col].
• We can also use $ to extract a column as a vector (columns only).
• By column name alone, after first calling attach( ) (columns only).
NOTE:
• When R attaches a data frame, it makes the variables available as copies. If we change one of these variables, the change is made to a copy in the workspace and the data frame itself is not changed.
• After using attach( ), it is good practice to call detach( ).
Example:
> d=data.frame(l=letters[4:8],b=c(T,T,F,F,T),n=7:3)
> d[1,2]
> d[4,3]
> d$l
> d$n
> d$b
> d$n[d$n>5]
> d[1,"n"]
> n # n is not there
> attach(d) # now l,b and n are there
> n
> n[3]
> n[3]=10
> n
> d # d not changed
> detach(d)
> n # n is still there (the changed copy in the workspace)
> l # not there
> d # d is not changed
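Since changes made after attach( ) do not reach the data frame, the sketch below shows one way to change d itself, by assigning through the data frame. It also uses with( ), a base R function not covered in these notes, as a possible alternative to attach( ) for reading columns by name.

```r
d <- data.frame(l = letters[4:8], b = c(TRUE, TRUE, FALSE, FALSE, TRUE), n = 7:3)

# Assigning through the data frame changes d itself
d$n[3] <- 10
d[4, "n"] <- 0
d$n              # 7 6 10 0 3

# with() evaluates an expression using the columns of d, without attaching it
with(d, n[b])    # values of n where b is TRUE: 7 6 3
```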
NOTE:
A data frame is just a particular kind of list in which all the components have the same length.
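A brief check of this statement, using the data frame d from the indexing example:

```r
d <- data.frame(l = letters[4:8], b = c(TRUE, TRUE, FALSE, FALSE, TRUE), n = 7:3)

is.list(d)          # TRUE: a data frame is a list
length(d)           # 3: one component per column
d[["n"]]            # list-style extraction, the same vector as d$n
sapply(d, length)   # every component has length 5
```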
Example: Car information
car.inf=matrix(c(8895,33,2560,7402,33,2345,6319,37,1895),3,3,byrow=T)
country=c("USA","Korea","Japan")
inf=c("Price","Mileage","Weight")
dimnames(car.inf)=list(country,inf)
dimnames(car.inf)
Compute the price of the cars after a 25% discount, then add this variable to the car.inf matrix:
dis=car.inf[,1]-car.inf[,1]*0.25
car.inf=cbind(car.inf,dis)
type=c("sporty","compact","van")
car=cbind(car.inf,type) # a matrix holds one type only, so every value becomes character
car
car.frame=data.frame(car.inf,type) # the numeric columns stay numeric
car.frame
car.frame$millage # NULL: no column has this (misspelled) name
car.frame$Mileage
car.frame["USA",] # select the information for the USA car
To add a row to a data frame:
SA=data.frame(7000,40,2000,4000,"van")
inf=c("Price","Mileage","Weight","dis","type")
dimnames(SA)=list("SA",inf)
d1=rbind(SA,car.frame)
d1
      Price Mileage Weight     dis    type
SA     7000      40   2000 4000.00     van
USA    8895      33   2560 6671.25  sporty
Korea  7402      33   2345 5551.50 compact
Japan  6319      37   1895 4739.25     van
What happens if you use this instead?
d2=rbind(SA=c(7000,40,2000,4000,"van"),car.frame) # note that c( ) coerces all of its values to character
Compare the output of:
d2$dis
d1$dis
Accessing a data frame with a logical vector:
car.frame$Price[dis>5000] # works only because dis also exists as a separate vector in the workspace
car.frame$Price[Weight>2000] # Problem: Weight is not defined outside the data frame
car.frame$Price[car.frame$Weight>2000] # correct: refer to the column through the data frame
car.frame["USA",]
car.frame$Mileage[car.frame[,1]>7000]
car.frame$Mileage[car.frame$Price>7000]
What is car.frame?
car.frame[dis>4000,1:4]
car.frame[dis>5000,1:4]
subset:
Working with data frames can become cumbersome because every column name must be prefixed with the name of the data frame. Some functions make this easier. subset( ) selects the rows of a data frame that meet a condition and, optionally, the columns to keep; it returns the matching subset of a vector or data frame.
new.frame=subset(car.frame,subset=dis>4000,select=1:4)
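A self-contained sketch of subset( ) next to the equivalent bracket indexing. The data frame cars repeats the car figures used above; the names heavy and heavy2 are hypothetical.

```r
cars <- data.frame(Price = c(8895, 7402, 6319),
                   Mileage = c(33, 33, 37),
                   Weight = c(2560, 2345, 1895),
                   row.names = c("USA", "Korea", "Japan"))

# Inside subset(), column names can be used without the cars$ prefix
heavy <- subset(cars, subset = Weight > 2000, select = c(Price, Mileage))
print(heavy)

# The same selection with bracket indexing needs the prefix
heavy2 <- cars[cars$Weight > 2000, c("Price", "Mileage")]
```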
V. List
Lists are very flexible data structures that are used extensively in R. A list allows a programmer to tie together related data that do not have the same structure (different lengths or modes).
Lists are collections of objects, known as components, which may be of different kinds. The components of a list can be vectors, logical values, matrices, character arrays, functions, or other lists, and they can have different lengths and types. An example of a list is the dimnames of a matrix, which holds the row names and the column names.
A) Creating List:
Use the list( ) function.
Example:
> x=1:2;y=letters[1:2];z=1:3
> l=list(x,y,z) # the components have no names
B) Adding names to list:
• A list can have a names attribute, set with the names( ) function. The names of a list refer to its top-level components.
• To define the names in the same command as list( ), use the name=value format.
Example:
> names(l)=c("first","second","third") # or l=list(first=x,second=y,third=z)
> l
$first
[1] 1 2
$second
[1] "a" "b"
$third
[1] 1 2 3
> l=list(x=x,y=y,z=z);l
$x
[1] 1 2
$y
[1] "a" "b"
$z
[1] 1 2 3
C) List Indexing:
Components of a list can be selected in one of two ways:
• The more general method extracts a component by its position in the list, using double square brackets [[ ]]. For example, list2[[2]] selects the second component of the list list2.
• If the components are named, we may select them with the expression list$component or list[["component"]].
The elements inside a component are then selected with single square brackets.
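The difference between single and double brackets is easy to miss, so here is a small sketch using a list g like the one in the example that follows. Single brackets applied to the list itself return a shorter list, while double brackets return the component.

```r
g <- list(number = 1:10, bool = c(TRUE, FALSE), message = c("Hey", "You"))

class(g[2])              # "list": single brackets keep the list wrapper
class(g[[2]])            # "logical": double brackets return the component itself

# Select a component first, then its elements with single brackets
g[["message"]][2]        # "You"
g$number[1:3]            # 1 2 3

# Single brackets can select several components at once
g[c("number", "bool")]   # a list with two components
```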
Example:
g=list(1:10,c(T,F),c("Hey","You"))
# to name all the components of a list
names(g)=c("number","bool","message")
# or, in one step
g=list(number=1:10,bool=c(T,F),message=c("Hey","You"))
g[[2]]
[1] TRUE FALSE
g[["message"]]
[1] "Hey" "You"
g$number
[1] 1 2 3 4 5 6 7 8 9 10
# add another component
g$comment="we assign a new element"
length(g) # number of top-level components
[1] 4
Adding elements to a list can be achieved by
• adding a new component name:
> L1=list(Item1=c(7,2,5,8),Item2=c(T,T,F,T),Item3=c("a","h","m","t"))
> L1$Item4=c("apple","orange","melon","grapes")
# alternative ways
> L1[["Item4"]]=c("apple","orange","melon","grapes")
# OR
> L1[[4]]=c("apple","orange","melon","grapes") #without name
> names(L1)[4]="Item4"
What is names(L1)?
> names(L1)
[1] "Item1" "Item2" "Item3" "Item4"
Example:
Create a list with three components: the first is the car.frame data frame, the second is a vector of the names of the companies that made the cars, and the third is a vector of model numbers of the cars.
company=c("Toyota","Kia","Mercedes")
modl=c(1990,2005,2006,2004)
# carlist=list(car.inf,company,modl)
carlist=list(car.frame,company,modl)
# to name the components of a list
names(carlist)=c("car.information","company","model")
names(carlist)
# access the elements of carlist
carlist[[2]] # gives the company vector
carlist$car.information
carlist$car.information$type
carlist$company[2]
carlist[[3]][2]
apply(carlist[[1]][,1:4],2,max)
# make a list inside a list
m=matrix(1:4,2)
carlist2=list(carlist,m)
names(carlist2)=c("e","p")
carlist2$e$car.information$dis
carlist2[[1]][[1]][,"dis"]
carlist2[[1]][[2]][3]
The lapply( ) and sapply( ) Functions:
We often need to apply one particular function to every element of a vector or a list, much as apply does for the rows or columns of a matrix. In general this could be done by looping through the elements, but R has functions that do it more elegantly:
• lapply: returns the results as a list
• sapply: tries to simplify the results into a vector (or matrix)
# Example 1: lapply
> Sex=c("M","F","M","M","F","M")
> smok=c("Y","N","N","Y","Y","Y")
> l=list(Sex=Sex,Smoking=smok)
> lapply(l,table)
$Sex
F M
2 4
$Smoking
N Y
2 4
# Example 2: sapply
> sapply(l,table)
  Sex Smoking
F   2       2
M   4       4
# the row labels F and M come from the first table (Sex); for Smoking the rows are N and Y
Example: (lapply, sapply)
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the mean of each list element
lapply(x,mean)
# median and quartiles for each list element
lapply(x, quantile, probs = 1:3/4)
# subtract 1 from every element
lapply(x,"-",1)
# a more compact display
sapply(x, mean)
sapply(x, quantile)
• lapply: takes any structure, gives a list of results
• sapply: like lapply, but simplifies the result if possible
• apply: only used for arrays
• tapply: used for ragged arrays: vectors with an indexing specified by one or more factors.
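The four functions summarized above can be compared side by side in a small sketch. All the objects here (m, scores, group, parts) are made up for illustration.

```r
# apply: works over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                # row sums: 9 12
apply(m, 2, sum)                # column sums: 3 7 11

# tapply: applies a function to groups defined by a factor
scores <- c(10, 20, 30, 40)
group <- factor(c("a", "b", "a", "b"))
tapply(scores, group, mean)     # a = 20, b = 30

# lapply returns a list; sapply simplifies it to a named vector here
parts <- list(p = 1:3, q = 4:6)
res_list <- lapply(parts, sum)
res_vec <- sapply(parts, sum)   # p = 6, q = 15
```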
VI. Factor
A factor is a data object used to specify a discrete classification (grouping) of the components of other vectors of the same length.
A factor looks like a character vector but is stored as integer codes together with a set of labels (the levels). In most cases, character data are used to describe the other data and are not used in calculations. To store character data as qualitative variables, the factor data type is used.
Factors are how R handles categorical data. Such data are often available as numeric codes, but should be converted to factors for proper analysis. Then, factors provide compact ways to handle categorical data.
A) Creating a factor:
The function factor is used to encode a vector as a factor.
EXAMPLE:
> x=1:3
> factor(x)
> x+factor(x) # warning: '+' is not meaningful for factors; the result is NA
EXAMPLE:
Suppose we have a sample of 30 tax accountants from all the states and territories of Australia, and their individual state of origin is specified by a character vector, state. We can create a factor by first creating the character vector and then converting it with the factor( ) function:
> state <- c("tas","sa","qld","nsw","nsw","nt","wa","wa","qld","vic","nsw","vic","qld",
+ "qld","sa","tas","sa","nt","wa","vic","qld","nsw","nsw","wa","sa","act","nsw",
+ "vic","vic","act")
> statef <- factor(state)
> statef
Notice that this creates “levels” based on the values of the factor (these are the values of the categorical variable). To find the levels of a factor, the function levels( ) can be used.
> levels( statef)
[1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"
> table(statef)
Note that the levels are sorted alphabetically by default.
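When the alphabetical order is not the one wanted, or when the data arrive as numeric codes, the levels and labels arguments of factor( ) can be used. The following is a minimal sketch; the codes and the labels "low", "medium" and "high" are made up for illustration.

```r
# Hypothetical numeric codes: 1 = low, 2 = medium, 3 = high
codes <- c(2, 1, 3, 2, 2)
f <- factor(codes, levels = 1:3, labels = c("low", "medium", "high"))

levels(f)       # "low" "medium" "high": the order given, not alphabetical
table(f)        # low 1, medium 3, high 1
as.integer(f)   # the underlying integer codes: 2 1 3 2 2

# Without the levels argument, character values are sorted alphabetically
levels(factor(c("medium", "low", "high")))   # "high" "low" "medium"
```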
ANOVA
To encode treatment labels as a factor (for example, two treatments with 10 observations each), use:
> factor(rep(c("tr1","tr2"),c(10,10)))
 [1] tr1 tr1 tr1 tr1 tr1 tr1 tr1 tr1 tr1 tr1 tr2 tr2 tr2 tr2 tr2 tr2 tr2 tr2
[19] tr2 tr2
Levels: tr1 tr2
The function tapply():
To continue the previous example, suppose we have the incomes of the same tax accountants in another vector (in suitably large units of money):
> incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56, 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46, 59, 46, 58, 43)
To calculate the sample mean income for each state, we can now use the special function tapply():
> inc.means <- tapply(incomes, statef, mean)
> inc.means
     act      nsw       nt      qld       sa      tas      vic       wa
44.50000 57.33333 55.50000 53.60000 55.00000 60.50000 56.00000 52.25000
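tapply( ) accepts any summary function, not only mean. The sketch below repeats the state and incomes vectors so that it runs on its own, counts the accountants in each state, and checks one group mean by hand. The names counts and qld_mean are hypothetical.

```r
state <- c("tas","sa","qld","nsw","nsw","nt","wa","wa","qld","vic","nsw","vic","qld",
           "qld","sa","tas","sa","nt","wa","vic","qld","nsw","nsw","wa","sa","act","nsw",
           "vic","vic","act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56, 61, 61,
             61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46, 59, 46, 58, 43)
statef <- factor(state)

# Number of accountants per state
counts <- tapply(incomes, statef, length)
print(counts)

# Largest income per state
tapply(incomes, statef, max)

# A group mean can be checked by hand with logical indexing
qld_mean <- mean(incomes[statef == "qld"])   # 53.6
```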