Import Data Into R - Thomas W. MacFarland Jan M. Yates

– In many cases, the addenda either introduce or reinforce concepts, packages, functions, function arguments, etc., in greater detail than what was presented earlier, often using different data, all to give a new perspective.

– There is a great deal of attention to the concept of parametric data and to its converse, nonparametric data, both of which raise issues related to normality. Too many texts give short shrift to data distribution patterns or, worse, ignore them entirely, even though assumptions about distribution are inherent to the correct use of many inferential tests.

– Practice data sets are also part of the addenda, often demonstrating one dataset addressing data with a normal distribution pattern and another dataset addressing data that do not exhibit normal distribution patterns. Students and beginning researchers benefit when reminded that data are not always neat and pretty and multiple approaches to statistical analysis have merit.

• Prepare to Exit, Save, and Later Retrieve this R Session: As a good programming practice, it is best to prepare all R syntax in a separate file, using a text editor. This practice makes it easy to save the prepared syntax for later reuse or as a template for other R-based analyses. It is also a good programming practice to save active R sessions, again for the purpose of facilitating later reuse. Clear instructions are provided on how to execute a graceful exit from the R session and to save the session for later reuse.
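As a minimal sketch of the save-and-retrieve idea, the following uses the base R save() and load() functions with a temporary file. The object name x and the file name MySession.RData are assumptions made only for illustration; the exact exit procedure used in later lessons is not reproduced here.

```r
# Hypothetical object and file name, used only for this sketch.
x <- c(1, 2, 3)
session_file <- file.path(tempdir(), "MySession.RData")

# Save every object currently listed by ls() into one file.
# (At the R prompt, save.image(file = session_file) is the usual shortcut.)
save(list = ls(), file = session_file)

# Imitate a later session: the object is removed, then restored from disk.
rm(x)
load(session_file)
print(x)
```

Saving to a named file, rather than relying on the default .RData file, makes it easier to keep sessions from different analyses apart.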

Again, small and easy-to-follow confidence-building examples are used at the beginning of this text. Greater complexity is gradually introduced until the final lessons in this text. In these last lessons R is used in a fairly robust manner, where large and complex datasets are used to introduce and reinforce skills needed for independent statistical analyses that support research efforts.

on operating system. Open R after it has been downloaded and experiment with the interface, recognizing that it is plain and stark compared to more common Graphical User Interface (GUI) statistical analysis programs.

Use R in an interactive mode, typing syntax such as 2 * 2, or x <- c(1, 2, 3); print(x), or print(rnorm(100, mean=120, sd=6)) at the R prompt. Do not forget to press the Enter key after each line of syntax.

However, for a more meaningful experience, download the many different datasets made available at the publisher’s Web-based resource associated with this text. With R open, and with the datasets downloaded, follow along with the different examples used in this lesson to gain initial experience with R. To promote the best use of time-on-task, it may be better to prepare all R-based syntax using a text editor or some IDE-type middleware software, and then transfer the syntax to R. The ways by which this can be put into place are many and there is no one-and-only-one way to achieve this aim, ranging from a simple copy and paste action to a more complex action where a chunk of syntax in an external editor is blocked and sent to the R session.⁸

For now, observe the R syntax shown in the rest of this lesson, but do not give too much concern about the syntax. The many R-based functions, arguments, etc., that collectively represent R syntax will be explained in future lessons.

Look at the initial Housekeeping syntax, below, to see a sample of what to expect whenever a new R session is started:

• The date() function is used to provide a marker for when the R session is saved and then reviewed again, later.

• R.version.string is used to verify which version of R was used for the session.

• The rm() function is used to remove objects.

• The setwd() function is used to change directories and place the R session in the desired directory, which is F:/R_BiostatisticsIntroduction in this session.

• The getwd() function is used to confirm the current working directory.

• The ls() function is used to list all objects in the current workspace. The files in the working directory are listed with the list.files() function.

• The sessionInfo() function is used to confirm the R version and to gain information on the locale and the attached and loaded packages.

• The search() function is used to review the attached packages and any other attached objects on the search path.

8 The term chunk was purposely selected to provide an advance organizer to terms used with the markdown process, where R syntax (a section or chunk of R code) and narrative text are integrated into one common document, which has become an increasingly popular way of preparing formal reports that include narrative text and R syntax. The markdown process is detailed in the last lesson in this text.

R Input

##############################################################
# Housekeeping                          Use for All Analyses #
##############################################################
date()              # Current system time and date.
Sys.time()          # Current system time and date (redundant).
R.version.string    # R version and version release date.
options(digits=6)   # Print up to 6 significant digits
                    # (the R default is 7).
options(scipen=999) # Suppress scientific notation.
options(width=60)   # Set output width to 60 characters
                    # (the R default is 80).
ls()                # List all objects in the workspace.
rm(list = ls())     # CAUTION: Remove all objects from the
                    # workspace. Files on disk are not
                    # affected. If this action is not
                    # desired, use rm() one-by-one to
                    # remove the objects that are not
                    # needed.
ls.str()            # List all objects, with the structure
                    # of each.
getwd()             # Identify the current working directory.
setwd("F:/R_BiostatisticsIntroduction")
                    # Set to a new working directory.
                    # Note the single forward slash and double
                    # quotes.
                    # This new directory should be the directory
                    # where the data file is located, otherwise
                    # the data file will not be found.
getwd()             # Confirm the working directory.
list.files()        # List files in the working directory.
.libPaths()         # Library pathname.
.Library            # Library pathname.
sessionInfo()       # R version, locale, and packages.
search()            # Attached packages and objects.
searchpaths()       # Paths of attached packages and objects.
###############################################################

When these actions are completed, the current R session should then begin, knowing that all files are in proper order, the working directory is set as desired, etc. These initial actions, known as Housekeeping in this text, provide assurance that the R session will begin correctly. Omitting any of this syntax may result in either problems with output or output that does not match what shows throughout this lesson.⁹,¹⁰
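Because Housekeeping mixes functions that act on objects in the workspace with functions that act on files on disk, the difference is worth seeing once. The sketch below is only an illustration: the names demo_value, demo_dir, and demo_file.csv are hypothetical, and a temporary directory stands in for a real project directory.

```r
# Hypothetical names, used only for this sketch.
demo_value <- 42
demo_dir   <- tempdir()          # Stand-in for a project directory.
demo_file  <- file.path(demo_dir, "demo_file.csv")
writeLines("a,b", demo_file)

# ls() reports objects in the workspace, not files on disk.
print("demo_value" %in% ls())

# list.files() reports files in a directory on disk.
print("demo_file.csv" %in% list.files(demo_dir))

# rm() removes the object; the file on disk is left untouched.
rm(demo_value)
print(file.exists(demo_file))

# Confirm that a directory exists before switching to it.
if (dir.exists(demo_dir)) {
  old_dir <- setwd(demo_dir)     # setwd() returns the previous directory.
  print(getwd())
  setwd(old_dir)                 # Return to the original directory.
}
```

Checking dir.exists() first avoids the error that setwd() raises when the declared directory is missing, which is a common problem when syntax is moved to another computer.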

With this beginning Housekeeping activity completed, the different subsections that follow show how to import data into an active R session. Practice with these many examples to gain exposure and confidence before the later lessons are attempted. Remember, again, that these are all simple datasets and no attempt has been made to apply any analyses or to generate any highly-detailed figures against the data in the main body of this lesson.

1.5.1 Import a .csv File of Comma-Separated Values into R

Follow the syntax shown below to import GenderEndurance.csv, a .csv (comma-separated values) file saved in the declared working directory, which was cited in the Housekeeping section by using the setwd() function. The read.table() function, which is included in the utils package, will be used to direct this activity.

The output of this action will be placed into an object called GenEnd.df.¹¹

R Input

GenEnd.df <- utils::read.table(file="GenderEndurance.csv",
  header=TRUE, dec=".", sep=",")
# Use the utils::read.table() function to import the
# .csv file GenderEndurance.csv into the current R
# session and place the contents into the object
# GenEnd.df, which is a dataframe that: (1) has a
# header row, (2) uses a period for decimals, and (3)
# uses a comma to separate one field from another.
# Note how the utils package is available as one of the
# packages immediately put into use when an R session is
# first started.

9 When writing R syntax, recall that functions are used by R to make things happen. The mean() function is used to determine the mean (i.e., the arithmetic average) of a set of numbers included in an object variable. However, functions are contained in packages. As such, it is common to include the package name along with the function name, such that base::mean() is a more formal (and arguably, better) way to write R-based syntax that results in calculation of the mean, given that the mean() function is included in the base package. Both methods (i.e., Function() and the more formal Package::Function()) are used throughout this text, reflecting how syntax is written by others. Generally, it is common to avoid writing the package name for the packages that are included in the initial R download, but to write the package name when using functions included in external packages, although this is a matter of choice in many cases. Demonstration of these two ways of writing function names is not meant to be confusing but is instead shown in this lesson and later lessons to reflect what should be expected when viewing syntax prepared by others, either colleagues or what may be seen using other resources.

10 Throughout this text, notice how R Input and R Output are placed in different lightly colored boxes. The syntax for all input is included in this text. However, to conserve space and to prevent an overflow of pages, not all output is shown. All output can be generated, however, by using the datasets associated with this text and the input shown throughout. Ideally, it would only be necessary to change the name of the working directory, declared at the local level, to replicate all examples in this and other lessons.

11 Throughout this text, the .df extension is used to reinforce that the object is a dataframe.
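For readers without the publisher's dataset at hand, the utils::read.table() call for a .csv file can be tried on a small stand-in file. The sketch below is an assumption-laden illustration: the file name MiniGenderEndurance.csv and its three rows are invented, and the file is written to a temporary directory. The stringsAsFactors=TRUE argument is added because, from R 4.0.0 onward, text columns are imported as character by default rather than as factors.

```r
# Hypothetical miniature dataset; the values are invented for illustration.
csv_path <- file.path(tempdir(), "MiniGenderEndurance.csv")
writeLines(c("Subject,Gender,Endurance",
             "S01,Female,3.10",
             "S02,Female,2.80",
             "S03,Male,2.95"),
           csv_path)

# Same arguments as the lesson, plus stringsAsFactors = TRUE so that
# text columns become factors in current versions of R.
Mini.df <- utils::read.table(file = csv_path, header = TRUE,
                             dec = ".", sep = ",",
                             stringsAsFactors = TRUE)
str(Mini.df)

# read.csv() is a wrapper around read.table() with header = TRUE and
# sep = "," already set.
MiniCsv.df <- utils::read.csv(csv_path, stringsAsFactors = TRUE)
print(all.equal(Mini.df, MiniCsv.df))
```

Without stringsAsFactors=TRUE, the str() output in a current R session would be expected to show chr columns for the text variables instead of Factor columns.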

With the object GenEnd.df imported into R, it is then necessary to perform a series of quality assurance actions to be sure that everything is correct and that the ﬁle is in proper order and ready for use.¹²

R Input

getwd()               # Identify the working directory
ls()                  # List objects
attach(GenEnd.df)     # Attach the data, for later use
str(GenEnd.df)        # Identify structure
head(GenEnd.df, n=3)  # Show the head, 1st 3 cases
summary(GenEnd.df)    # Summary statistics

R Output

[Selected output is not shown, to save space.]

> str(GenEnd.df) # Identify structure
'data.frame': 20 obs. of 3 variables:
 $ Subject  : Factor w/ 20 levels "S01","S02","S03",..: 1 2
 $ Gender   : Factor w/ 2 levels "Female","Male": 1 1 1 1 1
 $ Endurance: num 3.49 2.97 2.89 2.92 3.42 ...

> summary(GenEnd.df) # Summary statistics
    Subject      Gender     Endurance
 S01    : 1   Female:10   Min.   :2.37
 S02    : 1   Male  :10   1st Qu.:2.91
 S03    : 1               Median :2.95
 S04    : 1               Mean   :3.01
 S05    : 1               3rd Qu.:3.15
 S06    : 1               Max.   :3.49
 (Other):14

12 This lesson is focused on the many ways that data can be brought into an R session. The emphasis, now, is not on functions such as str(), summary(), etc. These functions are detailed completely in later lessons. As such, the text input and text output are presented in a slightly different manner in this lesson than what is seen in all later lessons, so that focus remains on how data are brought into R.
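Visual review of str() and summary() output can be supplemented with checks that stop the session when an expectation is not met. The following is a sketch only: Check.df is a hypothetical stand-in for an imported dataframe, and the expected row count and value range are assumptions that would need to be set for each real dataset.

```r
# Hypothetical stand-in for an imported dataframe.
Check.df <- data.frame(
  Subject   = factor(c("S01", "S02", "S03", "S04")),
  Gender    = factor(c("Female", "Female", "Male", "Male")),
  Endurance = c(3.10, 2.80, 2.95, 3.20))

# stopifnot() halts with an error at the first condition that is not TRUE.
stopifnot(
  nrow(Check.df) == 4,                    # Expected number of cases.
  identical(names(Check.df),
            c("Subject", "Gender", "Endurance")),
  is.numeric(Check.df$Endurance),         # Measured variable is numeric.
  !any(is.na(Check.df)),                  # No missing values.
  anyDuplicated(Check.df$Subject) == 0,   # Each subject appears once.
  all(Check.df$Endurance > 0))            # Assumed plausible range.
print("All quality assurance checks passed.")
```

Checks of this kind are most useful in syntax files that are reused, because a changed or damaged data file is caught at import rather than during analysis.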

The file called GenderEndurance.csv has been imported into R, with the data contained in the object GenEnd.df. The data seem to be in correct order, but a graphic will also serve a quality assurance role, to be certain that everything is in order. Again, do not be overly concerned about the syntax used to generate the graphic and, instead, observe how the plot() function is wrapped around the density() function, to gain a sense of data distribution for the GenEnd.df$Endurance object variable (Fig. 1.1).¹³

R Input

par(ask=TRUE)
plot(density(GenEnd.df$Endurance, na.rm=TRUE),
  main="Endurance of Selected Female and Male Subjects:\nQuality Assurance Density Plot",
  col="red", lwd=5)
# This graphic is a quality assurance density plot of the
# object variable GenEnd.df$Endurance, with arguments
# resulting in a thick (lwd=5) red line.

Figure 1.1: Quality assurance of endurance

13 The $ character is used to fully identify the GenEnd.df$Endurance object variable. In the same way that a formal, if long, naming process is used for packages and functions (e.g., Package::Function), the same concept applies to objects and variables associated with objects. The more formal GenEnd.df$Endurance is a better way to identify the object in question, instead of Endurance alone. Otherwise, there would be confusion if there were two objects in an active R session and each object contained a variable called Endurance. Explicit naming schemes such as Object$Variable and Package::Function have value as quality assurance measures.
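The value of Object$Variable naming can be shown with two dataframes that share a variable name. The dataframes GroupA.df and GroupB.df below are hypothetical and exist only for this sketch. If both were attached with attach(), the bare name Endurance would refer to whichever dataframe was attached most recently, which is the confusion that explicit naming avoids.

```r
# Two hypothetical dataframes that share a variable name.
GroupA.df <- data.frame(Endurance = c(3.1, 3.3, 3.5))
GroupB.df <- data.frame(Endurance = c(2.1, 2.3, 2.5))

# Explicit naming leaves no doubt about which values are used.
mean_a <- base::mean(GroupA.df$Endurance)
mean_b <- base::mean(GroupB.df$Endurance)
print(c(GroupA = mean_a, GroupB = mean_b))

# with() limits the variable lookup to one dataframe for one expression,
# which gives short names without attaching anything.
mean_a_with <- with(GroupA.df, mean(Endurance))
print(mean_a_with)
```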

1.5.2 Import a .txt File of Tab-Separated Values into R

Similar to what was used when importing data found in a .csv file, follow along with the syntax shown below to import a .txt file that consists of data separated by tab characters, not commas. Although the file BreedMilkLb365.txt uses a .txt (i.e., text) file extension, it is not uncommon to see files consisting of tab-separated values with either a .tsv file extension or a .tab file extension, although they are actually text files, regardless of the .tsv and .tab file extensions.

R Input

BreedMilk.df <- utils::read.table(file="BreedMilkLb365.txt",
  header=TRUE, dec=".", sep="\t")
# Use the utils::read.table() function to import the
# .txt file BreedMilkLb365.txt into the current R
# session and place the contents into the object
# BreedMilk.df, which is a dataframe that: (1) has a
# header row, (2) uses a period for decimals, and (3)
# uses a tab to separate one field from another.
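The sep argument is the only part of the tab-separated import that differs from the .csv import, so it helps to see what happens when it is wrong. The sketch below uses an invented file, MiniBreedMilk.txt, written to a temporary directory; its rows are illustrative and are not the publisher's data.

```r
# Hypothetical miniature tab-separated file; values are invented.
tsv_path <- file.path(tempdir(), "MiniBreedMilk.txt")
writeLines(c("Cow\tBreed\tMilkLb365",
             "C01\tJersey\t14500",
             "C02\tJersey\t15400",
             "C03\tHolstein\t19700"),
           tsv_path)

# Correct separator: three columns are recognized.
MiniMilk.df <- utils::read.table(file = tsv_path, header = TRUE,
                                 dec = ".", sep = "\t",
                                 stringsAsFactors = TRUE)
str(MiniMilk.df)

# read.delim() is a wrapper with header = TRUE and sep = "\t" preset.
MiniDelim.df <- utils::read.delim(tsv_path, stringsAsFactors = TRUE)
print(all.equal(MiniMilk.df, MiniDelim.df))

# Wrong separator: each line is read as one field, so the dataframe
# has a single column. No error is raised.
Wrong.df <- utils::read.table(file = tsv_path, header = TRUE, sep = ",")
print(ncol(Wrong.df))
```

Because a wrong separator does not stop the import, a check of the column count with ncol() or str() immediately after import is a sensible habit.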

Once again, use standard quality assurance actions to conﬁrm that the data are correct and acceptable for later use.

R Input

getwd()                  # Identify the working directory
ls()                     # List objects
attach(BreedMilk.df)     # Attach the data, for later use
str(BreedMilk.df)        # Identify structure
head(BreedMilk.df, n=3)  # Show the head, 1st 3 cases
summary(BreedMilk.df)    # Summary statistics

R Output

[Selected output is not shown, to save space.]

> str(BreedMilk.df) # Identify structure
'data.frame': 20 obs. of 3 variables:
 $ Cow      : Factor w/ 20 levels "C01","C02","C03",..: 1 2
 $ Breed    : Factor w/ 2 levels "Holstein","Jersey": 2 2 2
 $ MilkLb365: int 14514 15443 14963 15997 15653 15854 14361

> summary(BreedMilk.df) # Summary statistics
      Cow         Breed      MilkLb365
 C01    : 1   Holstein:10   Min.   :14219
 C02    : 1   Jersey  :10   1st Qu.:15164
 C03    : 1                 Median :16987
 C04    : 1                 Mean   :17449
 C05    : 1                 3rd Qu.:19691
 C06    : 1                 Max.   :20810
 (Other):14

As a final quality assurance review, produce a graphic (specifically, a density plot) of the object variable BreedMilk.df$MilkLb365 to further understand the nature of the data and to review if the data follow expected distribution patterns (Fig. 1.2).

R Input

par(ask=TRUE)
plot(density(BreedMilk.df$MilkLb365, na.rm=TRUE),
  main="Annual Milk Production (Pounds) of Holstein and Jersey Cows: Quality Assurance Density Plot",
  col="red", lwd=5)
# This graphic is a quality assurance density plot of the
# object variable BreedMilk.df$MilkLb365, with arguments
# resulting in a thick (lwd=5) red line.

Figure 1.2: Quality assurance of MilkLb365

1.5.3 Import a .txt File of Fixed-Width Format Values into R

Data in a .txt (i.e., text) file that are placed into fixed-width format are arranged in neatly organized rows and columns, usually with all data for each case showing on one row. In turn, each object variable has a set width, ranging from 1 to N columns wide. When viewing these rows and columns it is important to recall that spaces are used to separate one datum from another; neither commas nor tabs are used to separate the data. It is common to use the term padding when describing fixed-width data, especially if leading zeros are used to align data in columns.

Fixed-width format .txt files were once fairly common but admittedly the comma-separated values (.csv) file format is now far more common. Thorough review and setup are needed to successfully import a fixed-width format file, which is again a reason for more frequent use of .csv files. Even so, most researchers will eventually encounter fixed-width format files so it is useful to gain some degree of experience with this row-by-column, space-delimited file format.¹⁴

The file YearSoilTypeCropRainYieldBushelsPerAcreNoHeader.txt is a fixed-width format .txt file, consisting of 60 cases (i.e., rows) and five object variables (i.e., columns). Carefully review the documentation to see more about the data, what the data represent, and how the data are organized. Give special attention to the utils::read.fwf() function and how it is used to import the data.

R Input

SoilYield.df <- utils::read.fwf(
  "YearSoilTypeCropRainYieldBushelsPerAcreNoHeader.txt",
  header=FALSE,    # There is no header for column names.
  skip=0,          # Skip no lines; read data from the 1st line.
  na.strings=" ",  # Blank spaces in a character object are NA.
  widths=c(-1,4,   # Year (1997 to 2016)
           -1,4,   # Predominant Soil Type (Sand, Silt, Clay)
           -1,4,   # Crop
           -1,6,   # Rain (Dry, Normal, Wet)
           -1,9))  # Yield (Bushels Per Acre, BUperAcre)
# Use the utils::read.fwf() function to import the .txt file
# YearSoilTypeCropRainYieldBushelsPerAcreNoHeader.txt into
# the current R session and place the contents into the
# object SoilYield.df, which is a dataframe that: (1) in
# original format does not have a header row, (2) uses a
# somewhat complicated process to identify those fixed-width
# columns that do not contain data and therefore should be
# skipped and those columns that do contain data and should
# therefore be used to read data:
# Skip the first column (-1) and then read Year data in the
# next four columns.
# Skip the next column (-1) and then read Soil data in the
# next four columns.
# Skip the next column (-1) and then read Crop data in the
# next four columns.
# Skip the next column (-1) and then read Rain data in the
# next six columns.
# Skip the next column (-1) and then read BUperAcre data in
# the next nine columns.

14 Although fixed-width format files are no longer common, they were once a standard format for dataset organization, where simple text editors were used to organize and manipulate data. Among many remaining comparative advantages of fixed-width format is that an incorrectly placed comma in a fixed-width format dataset cannot alter data structure beyond placement of the errant comma. However, when a comma is typed, by mistake, during manual data entry of a .csv text file, errors will likely show throughout, all due to a simple comma out-of-place.
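The skip-then-read pattern of the widths argument to utils::read.fwf() is easier to follow on a file small enough to inspect by eye. The sketch below builds three invented rows with sprintf() so that every field has exactly the declared width. The file name MiniSoilYield.txt and the values are assumptions for illustration only, and the col.names argument is used here as one alternative way to supply column names at import.

```r
# Build hypothetical fixed-width lines: one blank column, then a field of
# width 4, 4, 4, 6, and 9, matching the widths declared below.
fwf_lines <- sprintf(" %4d %-4s %-4s %-6s %9d",
                     c(1997L, 1998L, 1999L),
                     c("Clay", "Sand", "Silt"),
                     "Corn",
                     c("Wet", "Normal", "Dry"),
                     c(180L, 165L, 170L))
print(nchar(fwf_lines))   # Every line has the same number of characters.

fwf_path <- file.path(tempdir(), "MiniSoilYield.txt")
writeLines(fwf_lines, fwf_path)

# Negative widths are skipped; positive widths are read as fields.
MiniSoil.df <- utils::read.fwf(
  fwf_path,
  widths = c(-1, 4, -1, 4, -1, 4, -1, 6, -1, 9),
  header = FALSE,
  col.names = c("Year", "Soil", "Crop", "Rain", "BUperAcre"),
  stringsAsFactors = TRUE)
str(MiniSoil.df)
```

The positive widths sum to 27 and the five skipped columns bring the total to 32, which is the line length reported by nchar(). A mismatch between these totals is a quick sign that the declared widths do not fit the file.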

R Input

names(SoilYield.df) <- c(
  "Year",       # Year
  "Soil",       # Soil
  "Crop",       # Crop
  "Rain",       # Rain
  "BUperAcre")  # Bushels per Acre
# Column names were not included in the original .txt
# dataset. Use the names() function to supply column
# names to the object SoilYield.df.

If there are no output errors with this seemingly complex scheme for declaring columns and data placement, use simple quality assurance actions such as str() and summary() to conﬁrm that the data are correct.

R Input

getwd()                  # Identify the working directory
ls()                     # List objects
attach(SoilYield.df)     # Attach the data, for later use
str(SoilYield.df)        # Identify structure
head(SoilYield.df, n=3)  # Show the head, 1st 3 cases
summary(SoilYield.df)    # Summary statistics

R Output

[Selected output is not shown, to save space.]

> str(SoilYield.df) # Identify structure
'data.frame': 60 obs. of 5 variables:
 $ Year     : int 1997 1998 1999 2000 2001 2002 2003 2004
 $ Soil     : Factor w/ 3 levels "Clay","Sand",..: 1 1 1 1
 $ Crop     : Factor w/ 1 level "Corn": 1 1 1 1 1 1 1 1 1 1
 $ Rain     : Factor w/ 3 levels "Dry ","Normal",..: 3 2
 $ BUperAcre: int 184 166 169 170 175 166 183 191 164 162

> summary(SoilYield.df) # Summary statistics
      Year        Soil      Crop        Rain      BUperAcre
 Min.   :1997   Clay:20   Corn:60   Dry   : 9   Min.   :127
 1st Qu.:2002   Sand:20             Normal:39   1st Qu.:149
 Median :2006   Silt:20             Wet   :12   Median :162
 Mean   :2006                                   Mean   :158
 3rd Qu.:2011                                   3rd Qu.:168
 Max.   :2016                                   Max.   :192
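The str() output lists the first Rain level as "Dry " with trailing padding, a side effect of reading a six-column field that holds a three-letter value. The sketch below, which uses a hypothetical Rain vector rather than the imported data, shows why the padding matters and one way it might be removed with the base trimws() function.

```r
# Hypothetical padded values, as a fixed-width import might return them.
Rain <- factor(c("Dry   ", "Normal", "Wet   ", "Normal"))
print(levels(Rain))

# A comparison against the unpadded label matches nothing, without warning.
matches_before <- sum(Rain == "Dry")
print(matches_before)

# Trim the padding from the level labels; the data are otherwise unchanged.
levels(Rain) <- trimws(levels(Rain))
print(levels(Rain))

matches_after <- sum(Rain == "Dry")
print(matches_after)
```

Whether to trim at this point is a judgment call. The padded labels are harmless for str() and summary(), but they can cause silent mismatches in later subsetting or recoding.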

Output from use of the summary() function provides evidence that all values are within expected ranges. A graphic of corn yields (i.e., SoilYield.df$BUperAcre) will provide further confirmation that the data were imported correctly into R (Fig. 1.3).¹⁵

R Input

par(ask=TRUE)
plot(density(SoilYield.df$BUperAcre, na.rm=TRUE),
  main="Corn Yield (Bushels per Acre) for Different Soils from 1997 to 2016 at a Selected Midwestern\nRegion: Quality Assurance Density Plot",
  col="red", lwd=5)
# This graphic is a quality assurance density plot of the
# object variable SoilYield.df$BUperAcre, with arguments
# resulting in a thick (lwd=5) red line.

15 Again, do not be concerned about completely understanding syntax such as par(ask=TRUE) or lwd=5. This lesson is focused on the processes needed to import data into R. Syntax used to generate figures and descriptive statistics is detailed in later lessons.
