header = TRUE, dec = ".", sep = ",") # Import the .csv file.
getwd() # Identify the working directory
ls() # List objects
attach(MilkBreedFatProt.df) # Attach the data, for later use
str(MilkBreedFatProt.df) # Identify structure
nrow(MilkBreedFatProt.df) # List the number of rows
ncol(MilkBreedFatProt.df) # List the number of columns
dim(MilkBreedFatProt.df) # Dimensions of the data frame
names(MilkBreedFatProt.df) # Identify names
colnames(MilkBreedFatProt.df) # Show column names
rownames(MilkBreedFatProt.df) # Show row names
head(MilkBreedFatProt.df) # Show the head
tail(MilkBreedFatProt.df) # Show the tail
MilkBreedFatProt.df # Show the entire dataframe
summary(MilkBreedFatProt.df) # Summary statistics
An object called MilkBreedFatProt.df has been created by completing this action. This R-based object is a dataframe and it consists of the data originally included in the file MilkBreedButterfatProtein.csv, a comma-separated .csv file. To avoid possible conflicts, make sure that there are no prior R-based objects called MilkBreedFatProt.df. The prior use of rm(list = ls()) addresses this concern by removing all prior objects in the current R session.
It was only necessary to key the filename for the .csv file and not the full pathname since the R working directory is currently set to the directory and/or subdirectory where this .csv file is located (see the Housekeeping section at the beginning of this lesson).
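The relationship between the working directory and a bare filename can be tried on a throwaway file. The following is only a minimal sketch under stated assumptions: the file ToyMilk.csv, the object Toy.df, and the three data rows are hypothetical stand-ins and are not the contents of the lesson's .csv file.

```r
# Hypothetical stand-in data; NOT the values in the lesson's .csv file.
toy_lines <- c(
  "Subject,Breed,PctButterfat,PctProtein",
  "SH01,1,3.10,3.30",
  "SH02,1,4.00,3.60",
  "SJ01,2,4.90,3.70")

# Write the lines to a temporary file inside R's temporary directory.
toy_file <- file.path(tempdir(), "ToyMilk.csv")
writeLines(toy_lines, toy_file)

# A bare filename is resolved against the working directory, so
# setting the working directory first lets the short name suffice.
old_wd <- setwd(tempdir())            # setwd() returns the prior directory
Toy.df <- read.table(file = "ToyMilk.csv",
  header = TRUE, dec = ".", sep = ",")
setwd(old_wd)                         # Restore the working directory

class(Toy.df)                         # Class of the imported object
dim(Toy.df)                           # 3 rows and 4 columns
```

Be aware that since R 4.0.0 the read.table() function no longer converts text columns to factors by default, so a column such as Subject may be imported as character rather than as a factor, depending on the R version in use.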
R Input
class(MilkBreedFatProt.df) # Class
R Output
[1] "data.frame"
R Input
str(MilkBreedFatProt.df) # Structure
R Output
'data.frame': 42 obs. of 4 variables:
$ Subject : Factor w/ 42 levels "SH01","SH02",..: 1 2 ...
$ Breed : int 1 1 1 1 1 1 1 1 1 1 ...
$ PctButterfat: num 3.03 4.04 3.14 4.08 3.15 ...
$ PctProtein : num 3.28 3.6 3.36 2.98 3.11 ...
R Input
duplicated(MilkBreedFatProt.df$Subject) # Duplicates
# DataFrame$Object notation
R Output
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE
The class and structure for each object seem to be correct and there are no duplicate subjects in the sample. That said, a Code Book will help with future understanding of this dataset, even if the data currently seem simple and obvious.
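Reading 42 separate FALSE values is workable for a small sample, but the same check can be condensed into a single answer. The following is a minimal sketch using a hypothetical vector of IDs (toy_ids), in which one ID is deliberately repeated so that the functions have something to find:

```r
# Hypothetical IDs; the repeat of "SH02" is deliberate.
toy_ids <- c("SH01", "SH02", "SH03", "SH02")

duplicated(toy_ids)           # One logical value for each element
any(duplicated(toy_ids))      # TRUE if at least one ID is repeated
anyDuplicated(toy_ids)        # Index of the first repeat, 0 if none
toy_ids[duplicated(toy_ids)]  # Show the repeated ID(s)
```

For a dataset with no repeated subjects, any(duplicated()) returns FALSE and anyDuplicated() returns 0.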
R Input
#######################################################
# Code Book for MilkBreedFatProt.df                   #
#######################################################
#                                                     #
# Subject      ... Factor (e.g., nominal)             #
#     A unique ID assigned to each cow                #
#                                                     #
# Breed        ... Factor (e.g., nominal)             #
#     1 = Holstein and 2 = Jersey (alpha order)       #
#                                                     #
# PctButterfat ... Numeric (e.g., interval)           #
#     Percent Butterfat that can reach                #
#     5.000000 or more                                #
#                                                     #
# PctProtein   ... Numeric (e.g., interval)           #
#     Percent Protein that can reach                  #
#     4.000000 or more                                #
#######################################################
Although this dataset is fairly simple, this lesson still applies labels to individual object variables and recodes them. The process calls for attention to detail as each action is put into place. Recall that the Code Book shows the data in their desired formats, which often requires some degree of recoding that has not yet occurred.
Once there is agreement that the data were brought into R in correct format, it is usually necessary to organize the data to some degree:
• The object variable Subject is currently viewed as a factor, with each code beginning with S (i.e., Subject) and then either H (i.e., Holstein) or J (i.e., Jersey).
• Integer numeric codes (i.e., 1 and 2) have been used in the original file to identify groups for the factor object variable Breed. A set of simple R-based actions can easily: (1) transform (i.e., recode) the object variable MilkBreedFatProt.df$Breed into a new object variable, (2) change the recoded object variable from its original integer format to enumerated factor format, and (3) apply English text labels in place of the otherwise cryptic numeric codes (i.e., 1 and 2).
• Values for PctButterfat and PctProtein are in decimal format and are treated in R as numeric data.
A transformation (typically called a recode action) is needed and the process, using R-based syntax, follows. Some of the following recode activities may be unnecessary (perhaps redundant), but they are purposely included to provide assurance that each variable is in the desired format, both the original variables as well as the newly created (i.e., enumerated) variables such as Breed.recode:
R Input
MilkBreedFatProt.df$Subject <- as.factor(
MilkBreedFatProt.df$Subject)
MilkBreedFatProt.df$Breed.recode <- factor(
MilkBreedFatProt.df$Breed, labels=c("Holstein", "Jersey"))
# Use factor() and not as.factor().
MilkBreedFatProt.df$PctButterfat <- as.numeric(
MilkBreedFatProt.df$PctButterfat)
MilkBreedFatProt.df$PctProtein <- as.numeric(
MilkBreedFatProt.df$PctProtein)
Use a wide selection of R-based functions to confirm that the dataset is in good order and that all object variables are organized as desired. Pay special attention to the summary() function, comparing its output with the output from when this function was previously applied.
R Input
getwd() # Identify the working directory
ls() # List objects
attach(MilkBreedFatProt.df) # Attach the data, for later use
str(MilkBreedFatProt.df) # Identify structure
nrow(MilkBreedFatProt.df) # List the number of rows
ncol(MilkBreedFatProt.df) # List the number of columns
dim(MilkBreedFatProt.df) # Dimensions of the data frame
names(MilkBreedFatProt.df) # Identify names
colnames(MilkBreedFatProt.df) # Show column names
rownames(MilkBreedFatProt.df) # Show row names
head(MilkBreedFatProt.df) # Show the head
tail(MilkBreedFatProt.df) # Show the tail
MilkBreedFatProt.df # Show the entire dataframe
summary(MilkBreedFatProt.df) # Summary statistics
R Output
Subject Breed PctButterfat PctProtein
SH01 : 1 Min. :1.00 Min. :2.86 Min. :2.91
SH02 : 1 1st Qu.:1.00 1st Qu.:3.56 1st Qu.:3.34
SH03 : 1 Median :2.00 Median :4.54 Median :3.47
SH04 : 1 Mean :1.52 Mean :4.24 Mean :3.48
SH05 : 1 3rd Qu.:2.00 3rd Qu.:4.82 3rd Qu.:3.66
SH06 : 1 Max. :2.00 Max. :5.24 Max. :4.01
(Other):36 NA's :1
Breed.recode
Holstein:20
Jersey :22
The object variable MilkBreedFatProt.df$Breed has been retained in its original format. However, the object variable MilkBreedFatProt.df$Breed.recode was created by putting the object variable MilkBreedFatProt.df$Breed into factor format, in contrast to the original integer-type use of 1 and 2 codes.
Labels were then applied in sequential order for this new object, with Holstein used to represent every occurrence of the code 1 and Jersey used to represent every occurrence of the code 2.
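The pairing of codes and labels can be seen on a small example. The following is a minimal sketch using a hypothetical vector of breed codes (toy_breed), not the lesson's data. It also adds an explicit levels argument, which is an assumption beyond the lesson's own syntax, so that the code-to-label pairing is stated outright rather than implied by sort order. Note that as.factor() accepts only the object to be converted and has no labels argument, which is why factor() is needed here.

```r
# Hypothetical breed codes, not the lesson's data.
toy_breed <- c(2L, 1L, 2L, 1L, 1L)

# as.factor() keeps the codes themselves as the level names.
levels(as.factor(toy_breed))          # "1" "2"

# factor() can attach labels; naming the levels explicitly makes
# the code-to-label pairing visible rather than implied by sorting.
toy_recode <- factor(toy_breed,
  levels = c(1, 2), labels = c("Holstein", "Jersey"))

levels(toy_recode)                    # "Holstein" "Jersey"
table(toy_breed, toy_recode)          # Cross-check codes against labels
summary(toy_breed)                    # Numeric summary of the codes
summary(toy_recode)                   # Counts for each breed
```

The two summary() calls show the same contrast seen in the lesson's output: quartiles and a mean for the integer codes, but counts by breed for the recoded factor.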
Note the formal use of DataFrame$Object notation when working with object variables that are part of a dataframe. Note also how the $ symbol is used to separate the name of the dataframe from the name of the object:
DataFrame$Object.
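As an illustration of this notation, the following minimal sketch builds a hypothetical two-row dataframe (Toy.df, not the lesson's data) and uses the $ symbol both to read an existing object variable and to create a new one:

```r
# Hypothetical two-row dataframe, for illustration only.
Toy.df <- data.frame(Subject = c("SH01", "SJ01"), Breed = c(1L, 2L))

Toy.df$Breed                          # Column by DataFrame$Object
Toy.df[["Breed"]]                     # Same column, by name in [[ ]]

# New object variables are created with the same notation.
Toy.df$Breed.recode <- factor(Toy.df$Breed,
  levels = c(1, 2), labels = c("Holstein", "Jersey"))
names(Toy.df)                         # "Subject" "Breed" "Breed.recode"
```

The explicit notation is also the safer habit when a dataframe has been attached, since attach() places a copy of the dataframe on the search path and later changes to the dataframe are not reflected in that copy until the dataframe is attached again.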