Conduct a Visual Data Check Using Graphics (e.g., Figures)

R Output

    Subject       Breed        PctButterfat    PctProtein
 SH01   : 1   Min.   :1.00   Min.   :2.86   Min.   :2.91
 SH02   : 1   1st Qu.:1.00   1st Qu.:3.56   1st Qu.:3.34
 SH03   : 1   Median :2.00   Median :4.54   Median :3.47
 SH04   : 1   Mean   :1.52   Mean   :4.24   Mean   :3.48
 SH05   : 1   3rd Qu.:2.00   3rd Qu.:4.82   3rd Qu.:3.66
 SH06   : 1   Max.   :2.00   Max.   :5.24   Max.   :4.01
 (Other):36 NA's :1

   Breed.recode
 Holstein:20
 Jersey  :22

The object variable MilkBreedFatProt.df$Breed has been retained in its original format. The object variable MilkBreedFatProt.df$Breed.recode, however, was created by converting MilkBreedFatProt.df$Breed to factor format, in contrast to the original integer-type codes of 1 and 2.

Labels were then applied in sequential order for this new object, with Holstein used to represent every occurrence of the code 1 and Jersey used to represent every occurrence of the code 2.

Note the formal use of DataFrame$Object notation when working with object variables that are part of a dataframe. Note also how the $ symbol is used to separate the name of the dataframe from the name of the object:

DataFrame$Object.
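As a minimal sketch of the recode described above, the following uses a small made-up dataframe to show how integer codes become a labeled factor while the original column is retained. The name Toy.df and its four rows are illustrative assumptions and are not the lesson data.

```r
# Hypothetical toy data; the real MilkBreedFatProt.df is not reproduced here.
Toy.df <- data.frame(
  Subject = c("S01", "S02", "S03", "S04"),
  Breed   = c(1, 2, 2, 1))

# Keep Breed as-is and add a factor copy, with labels applied in
# sequential order: code 1 becomes Holstein, code 2 becomes Jersey.
Toy.df$Breed.recode <- factor(Toy.df$Breed,
  levels = c(1, 2),
  labels = c("Holstein", "Jersey"))

str(Toy.df)                 # Breed stays numeric; Breed.recode is a factor
table(Toy.df$Breed.recode)  # Count of each label
```

Note that both columns are reached with the same DataFrame$Object notation.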

3.4 Conduct a Visual Data Check Using Graphics (e.g., Figures)

dotchart(), qqnorm(), and qqnorm() and an accompanying qqline(). Many arguments are available to embellish these graphical figures, but for now the figures will be prepared in simple format.

The par(ask=TRUE) function and argument are used to freeze the presentation on the screen, one figure at a time. Note how the top line of the figure, under File—Save as, provides a variety of graphical formats to save each figure, presented in the following order: Metafile, Postscript, PDF, PNG, BMP, TIFF, and JPEG.⁶
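Figures can also be saved with R syntax instead of the menu. The sketch below assumes a PDF file is wanted. The file name is illustrative, tempdir() is used only so the example does not write into the working directory, and the two counts are typed in by hand to stand in for the lesson data.

```r
# Hypothetical output location and hand-typed counts, for illustration only.
FigureFile <- file.path(tempdir(), "BreedBarplot.pdf")
ToyCounts  <- c(Holstein = 20, Jersey = 22)

pdf(file = FigureFile, width = 7, height = 5)   # Open a PDF graphics device
barplot(ToyCounts, main = "Breed: Barplot Frequency Distribution",
  col = c("black", "burlywood4"), ylim = c(0, 25))
dev.off()                                       # Close the device; write the file

file.exists(FigureFile)                         # Confirm the file was created
```

Other device functions in base R, such as png(), jpeg(), tiff(), and bmp(), follow the same open, draw, and dev.off() pattern.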

Factor-Type Barchart of Breed.recode

Wrap the barplot() function around the table() function to produce a simple barchart of Breed.recode (Fig.3.1).

R Input

par(ask=TRUE)

barplot(table(MilkBreedFatProt.df$Breed.recode), main="Breed: Barplot Frequency Distribution", col=c("black", "burlywood4"), ylim=c(0,25))

Figure 3.1: Distribution of breed by count—1

Numeric-Type Graphics of PctButterfat and PctProtein

To save space and to also provide a convenient side-by-side view for common figures, look at the way multiple figures are put into one common figure, using the par(mfrow=c()) functions. This technique is especially useful and should be considered not only for exploratory figures but also for final output, when appropriate (Figs.3.2 and 3.3).

⁶ It is also possible to copy and paste each graphical image, or to save a graphical image by using R syntax.

R Input

par(ask=TRUE)

par(mfrow=c(2,4)) # 8 figures into a 2 row by 4 column grid
hist(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Histogram")
plot(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Plot")
plot(density(MilkBreedFatProt.df$PctButterfat,
  na.rm=TRUE), # Required: na.rm=TRUE for missing data
  main="Percent Butterfat: Density Plot")
boxplot(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Box Plot")
stripchart(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Stripchart")
dotchart(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Dotchart")
qqnorm(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Q-Q Plot")
qqnorm(MilkBreedFatProt.df$PctButterfat,
  main="Percent Butterfat: Q-Q Plot\nand Q-Q Line")
qqline(MilkBreedFatProt.df$PctButterfat)

# Common figures for a numeric-type object variable

# The \n characters force a new line.

Figure 3.2: Distribution of percent butterfat—1
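The comment about na.rm=TRUE in the density plot can be checked with a short sketch. The vector ToyPct below is a made-up example with one missing value and is not the lesson data.

```r
# Toy vector with one missing value (illustrative only).
ToyPct <- c(2.86, 3.56, 4.54, NA, 4.82, 5.24)

# Without na.rm=TRUE, density() stops with an error.
Attempt <- try(density(ToyPct), silent = TRUE)
class(Attempt)

# With na.rm=TRUE, the missing value is dropped before estimation.
ToyDensity <- density(ToyPct, na.rm = TRUE)
plot(ToyDensity, main = "Toy Data: Density Plot")

# hist() drops missing values on its own, so no extra argument is needed.
hist(ToyPct, main = "Toy Data: Histogram")
```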

R Input

par(ask=TRUE)

par(mfrow=c(2,4)) # 8 figures into a 2 row by 4 column grid
hist(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Histogram")
plot(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Plot")
plot(density(MilkBreedFatProt.df$PctProtein,
  na.rm=TRUE), # Required: na.rm=TRUE for missing data
  main="Percent Protein: Density Plot")
boxplot(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Box Plot")
stripchart(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Stripchart")
dotchart(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Dotchart")
qqnorm(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Q-Q Plot")
qqnorm(MilkBreedFatProt.df$PctProtein,
  main="Percent Protein: Q-Q Plot\nand Q-Q Line")
qqline(MilkBreedFatProt.df$PctProtein)

# Common figures for a numeric-type object variable

Figure 3.3: Distribution of percent protein—1
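A par(mfrow=c()) setting stays in effect for later figures on the same graphics device. One way to return to a single-figure layout is to keep the previous settings and restore them afterward, as in this sketch. ToyPct is a made-up vector used only for illustration.

```r
ToyPct <- c(2.86, 3.56, 4.54, 4.82, 5.24)   # Illustrative values only

OldPar <- par(mfrow = c(1, 2))   # Set a 1 row by 2 column grid and
                                 # keep the previous setting
hist(ToyPct, main = "Toy Data: Histogram")
boxplot(ToyPct, main = "Toy Data: Box Plot")
par(OldPar)                      # Restore the previous layout

par("mfrow")                     # Confirm the current layout
```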

After these initial figures are reviewed and when there is agreement that data are correct and the general approach for graphics is acceptable, prepare a more embellished figure if desired. Remember to make colors vibrant and use print that is large and dark, whenever possible, to support future public display of the figure.

From among the many packages and functions available to the R community, consider the epiDisplay package and its associated functions for factor-type object variables. Many functions in the epiDisplay package not only prepare visually appealing graphics but also print useful statistics to the screen, such as frequency distributions and percentages for the factor-type object variable in question (Breed.recode for this lesson). These additional statistics are information-rich and explain why specialized functions, such as epiDisplay::tableStack() and epiDisplay::tab1(), demand attention (Fig.3.4).

R Input

install.packages("epiDisplay", dependencies=TRUE)

library(epiDisplay) # Load the epiDisplay package.

help(package=epiDisplay) # Show the information page.

sessionInfo() # Confirm all attached packages.

epiDisplay::tableStack(Breed.recode, dataFrame=MilkBreedFatProt.df, by="none", count=TRUE, decimal=2, percent=c("column", "row"),

frequency=TRUE, name.test=TRUE, total.column=TRUE, test=TRUE)

# Descriptive statistics, only

R Output

              Total
Total         42
Breed.recode
  Holstein    20 (47.62)
  Jersey      22 (52.38)

R Input

par(ask=TRUE)

epiDisplay::tab1(MilkBreedFatProt.df$Breed.recode,
  decimal=2,                  # Use the tab1() function
  sort.group=FALSE,           # from the epiDisplay
  cum.percent=TRUE,           # package to see details
  graph=TRUE,                 # about the selected
  missing=TRUE,               # object variable. (The
  bar.values=c("frequency"),  # 1 of tab1 is the one
  horiz=TRUE,                 # numeric character and
  cex=1.15,                   # it is not the letter
  cex.names=1.15,             # lowercase l).
  cex.lab=1.15, cex.axis=1.15,
  main="Dairy Cow Breeds (Holstein v Jersey) N Values",
  col= c("black", "burlywood4"))

# Prepare a publishable quality barplot of Breed.recode

# and have descriptive statistics printed to the screen.

R Output

MilkBreedFatProt.df$Breed.recode :
         Frequency Percent Cum. percent
Holstein        20   47.62        47.62
Jersey          22   52.38       100.00
  Total         42  100.00       100.00

Figure 3.4: Distribution of breed by count—2
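The frequency, percent, and cumulative percent values printed by epiDisplay::tab1() can be cross-checked in base R. This sketch is not how epiDisplay computes them internally. It rebuilds a stand-in factor from the counts in the output above (20 Holstein and 22 Jersey) and applies table(), prop.table(), and cumsum().

```r
# Stand-in factor rebuilt from the counts shown in the R Output.
Breed.recode <- factor(rep(c("Holstein", "Jersey"), times = c(20, 22)))

Frequency  <- table(Breed.recode)
Percent    <- round(100 * prop.table(Frequency), 2)
CumPercent <- round(100 * cumsum(prop.table(Frequency)), 2)

cbind(Frequency, Percent, "Cum. percent" = CumPercent)
```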

There are also many ways to show relationships between and among the numeric variables PctButterfat and PctProtein, individually and by breakout groups.

From among the many possible selections, the functions found in the ggplot2 package and associated packages are quite popular, and for many researchers they are the first choice for graphical presentations.

Look at the way the figures are prepared individually and then put into one common figure, to make comparisons among the graphical presentations easier.⁷

R Input

install.packages("ggplot2", dependencies=TRUE)

library(ggplot2) # Load the ggplot2 package.

help(package=ggplot2) # Show the information page.

sessionInfo() # Confirm all attached packages.

⁷ The ggplot2 package and supporting packages are used to produce a variety of figures associated with the concept of Beautiful Graphics. These packages are external to what is available when R is first downloaded, and it is necessary to actively download them to take advantage of their specialized functionality.

install.packages("ggthemes", dependencies=TRUE)

library(ggthemes) # Load the ggthemes package.

help(package=ggthemes) # Show the information page.

sessionInfo() # Confirm all attached packages.

install.packages("ggmosaic", dependencies=TRUE)

library(ggmosaic) # Load the ggmosaic package.

help(package=ggmosaic) # Show the information page.

sessionInfo() # Confirm all attached packages.

install.packages("gridExtra", dependencies=TRUE)

library(gridExtra) # Load the gridExtra package.

help(package=gridExtra) # Show the information page.

sessionInfo() # Confirm all attached packages.

# The grid package is part of the base R distribution, so there

# is no need to use install.packages() for it.

library(grid) # Load the grid package.

help(package=grid) # Show the information page.

sessionInfo() # Confirm all attached packages.

install.packages("scales", dependencies=TRUE)

library(scales) # Load the scales package.

help(package=scales) # Show the information page.

sessionInfo() # Confirm all attached packages.

To make the figures bold and vibrant while reducing redundant keying, look at the self-created theme_MacYates(), a theme for use with the ggplot2::ggplot() function. Additional theme settings are valuable for altering font appearance and size, axis and tick mark presentation, title appearance and size, etc. However, these settings require many lines of syntax. Placing that syntax into a reusable, user-created function, here called theme_MacYates(), avoids keying it again for every figure.

R Input

theme_MacYates <- function(base_size=12, base_family="sans"){
  # Note: base_size and base_family are accepted as arguments but
  # are not used below; all sizes are set explicitly.
  # Note: in ggplot2 3.4.0 and later, linewidth replaces size in
  # element_line(); size still works but gives a warning.
  theme(
    plot.title=element_text(face="bold", size=14, hjust=0),
    axis.title.x=element_text(face="bold", size=14, hjust=0.5),
    axis.text.x=element_text(face="bold", size=10),
    axis.title.y=element_text(face="bold", size=14, vjust=1, angle=90),
    axis.text.y=element_text(face="bold", size=12, hjust=1),
    legend.title=element_text(face="bold", size=14),
    legend.text=element_text(face="bold", size=14),
    axis.ticks.x=element_line(size=1.2),
    axis.ticks.y=element_line(size=1.2),
    axis.ticks.length=unit(0.25,"cm"),
    panel.background=element_rect(fill="whitesmoke") )
}
# hjust - horizontal justification; 0 = left edge to 1 = right
# edge, with 0.5 the default
# vjust - vertical justification; 0 = bottom edge to 1 = top
# edge, with 0.5 the default
# angle - rotation; generally 1 to 90 degrees, with 0 the
# default

R Input

class(theme_MacYates)

# Confirm the class of the enumerated theme.

R Output

[1] "function"

Self-created themes are quite flexible. Explore many options for size, bold, etc. See what type of theme is best for project requirements.
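As one option to explore, a self-created theme can start from a complete built-in theme so that the base_size and base_family arguments take effect. The sketch below is an illustrative variant and not the theme used in this lesson. The names theme_Sketch, Toy.df, and ToyHistogram are assumptions made for the example.

```r
library(ggplot2)

# Hypothetical variant: start from the complete theme_grey() so that
# base_size and base_family are applied, then layer a few overrides.
theme_Sketch <- function(base_size = 12, base_family = "sans") {
  theme_grey(base_size = base_size, base_family = base_family) +
    theme(
      plot.title = element_text(face = "bold", hjust = 0),
      axis.title = element_text(face = "bold"),
      panel.background = element_rect(fill = "whitesmoke"))
}

# Toy data, for illustration only.
Toy.df <- data.frame(Pct = c(2.86, 3.56, 4.54, 4.82, 5.24))

ToyHistogram <- ggplot(Toy.df, aes(x = Pct)) +
  geom_histogram(binwidth = 0.5, color = "black", fill = "cornsilk2") +
  ggtitle("Toy Data: Histogram") +
  theme_Sketch(base_size = 16)   # Larger text throughout

print(ToyHistogram)
```

One design point to keep in mind: a complete theme such as theme_Sketch() replaces earlier theme settings when it is added to a figure, so any extra theme() adjustments should come after it. A theme built only from theme(), such as theme_MacYates(), changes only the elements it names.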

Now that the theme theme_MacYates() has been created, prepare a set of figures using the ggplot2::ggplot() function to examine the numeric-type object variables, PctButterfat and PctProtein. Note the use of self-documentation and how naming schemes are descriptive and clearly detail the nature of each graphic (Figs.3.5, 3.6, and 3.7).

R Input

HistogramPctButterfatOverall <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctButterfat)) +

geom_histogram(binwidth=0.30, color="black", lwd=1.25, fill="cornsilk2") +

geom_vline(aes(xintercept=mean(PctButterfat, na.rm=TRUE)), color="darkred", linetype="dashed", size=1.25) +

geom_vline(aes(xintercept=median(PctButterfat, na.rm=TRUE)), color="dodgerblue", linetype="dotted", size=1.25) +

ggtitle(

"Percent Butterfat Produced by Holstein and Jersey Dairy Cows, Mean - Red and Median - Blue") +

scale_x_continuous(name="Percent Butterfat", limits=c(0,6), breaks=seq(0,6,1.0)) +

scale_y_continuous(name="Count", limits=c(0,10), breaks=seq(0,10,1)) +

theme_MacYates()

# Generate the figure, but it will not show until using the

# gridExtra::grid.arrange() function.

# Regarding the scales used for this and other figures, it is

# often best to first generate the figure with no attention

# to the later axis scales, to see what the default shows.

# Then, determine individual needs for presentation and

# practice with scale_x_continuous(), scale_y_continuous(),

# and associated options such as name, limits, and breaks.

# Using name, the axis label can be placed within

# scale_x_continuous() as well as scale_y_continuous().

# Notice how limits is used to set the scale that shows

# on the axis, from minimum and maximum.

# See how breaks is used to determine the placement of

# tick marks.

HistogramPctButterfatBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctButterfat)) +

geom_histogram(binwidth=0.30, color="black", lwd=1.25, fill="cornsilk2") +

facet_grid(. ~ Breed.recode) + ggtitle(

"Percent Butterfat Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey") +

scale_x_continuous(name="Percent Butterfat", limits=c(0,6), breaks=seq(0,6,1.0)) +

scale_y_continuous(name="Count", limits=c(0,10), breaks=seq(0,10,1)) +

theme(strip.text.x=element_text(face="bold", size=12, color="navyblue")) +

theme(strip.background=element_rect(fill="wheat1")) +

theme_MacYates()

# Note how theme(strip.text) and theme(strip.background) are

# placed before theme_MacYates()

par(ask=TRUE)

gridExtra::grid.arrange(

HistogramPctButterfatOverall,

HistogramPctButterfatBreed.recode, ncol=2)

Figure 3.5: Distribution of percent butterfat—2

R Input

BoxplotPctButterfatOverall <-

ggplot2::ggplot(MilkBreedFatProt.df, aes(x=factor(0), y=PctButterfat)) + geom_boxplot() +

# Note: in ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun.

stat_summary(fun.y=mean, geom="point", shape=1, size=12, col="red") +

ggtitle(

"Percent Butterfat Produced by Holstein and Jersey Dairy Cows, Mean - Red Circle") +

xlab("Both Breeds: Holstein and Jersey") + scale_x_discrete(breaks=NULL) +

scale_y_continuous(name="Percent Butterfat", limits=c(3.0,6.0), breaks=seq(3,6,0.5)) + theme_MacYates()

# Note the creation of a dummy variable, x=factor(0).

BoxplotPctButterfatBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=factor(0), y=PctButterfat)) + geom_boxplot() +

facet_grid(. ~ Breed.recode) + ggtitle(

"Percent Butterfat Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey") +

scale_x_discrete(name="Breed", breaks=NULL) + scale_y_continuous(name="Percent Butterfat",

limits=c(3.0,6.0), breaks=seq(3,6,0.5)) +

theme(strip.text.x=element_text(face="bold", size=12, color="navyblue")) +

theme(strip.background=element_rect(fill="wheat1")) + theme_MacYates()

par(ask=TRUE)

gridExtra::grid.arrange(

BoxplotPctButterfatOverall,

BoxplotPctButterfatBreed.recode, ncol=2)

Figure 3.6: Distribution of percent butterfat—3
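One detail of the boxplot syntax deserves care. The summary output shows a PctButterfat minimum of 2.86, which is below the lower limit of 3.0 given to scale_y_continuous(). In ggplot2, values outside scale limits are dropped before statistics are computed, whereas coord_cartesian() only zooms the view. The sketch below uses made-up values (Toy.df) to compare the two approaches and is not a rerun of the lesson figure.

```r
library(ggplot2)

# Toy values with one observation below 3.0 (illustrative only).
Toy.df <- data.frame(Pct = c(2.86, 3.56, 4.54, 4.82, 5.24))

# Scale limits: the 2.86 value falls outside 3 to 6 and is dropped
# (with a warning) before the boxplot statistics are computed.
BoxplotScaleLimits <- ggplot(Toy.df, aes(x = factor(0), y = Pct)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(3, 6))

# Coordinate limits: all values are used; only the view is zoomed.
BoxplotCoordLimits <- ggplot(Toy.df, aes(x = factor(0), y = Pct)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(3, 6))

# Compare the lower whisker computed under each approach.
ScaleWhisker <- suppressWarnings(layer_data(BoxplotScaleLimits)$ymin)
CoordWhisker <- layer_data(BoxplotCoordLimits)$ymin
c(ScaleLimits = ScaleWhisker, CoordLimits = CoordWhisker)
```

If every observation should contribute to the box and whiskers, either widen the limits or use coord_cartesian().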

R Input

DensityCurvePctButterfatOverall <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctButterfat)) +

geom_density(size=1.5, col=("red")) + ggtitle(

"Percent Butterfat Produced by Holstein and Jersey Dairy Cows") +

scale_x_continuous(name="Percent Butterfat", limits=c(0,6),

breaks=seq(0,6,1.0)) +

scale_y_continuous(name="Density", limits=c(0,0.60), breaks=seq(0,1,0.1)) +

theme_MacYates()

# There is no easy way to know the best Y axis scale to use

# with a density curve. It is usually best to generate the

# density curve using default settings first, with no use of

# arguments with scale_y_continuous. Then, experiment to see

# the best values for limits and breaks.

DensityCurvePctButterfatBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctButterfat)) +

geom_density(size=1.5, col=("red")) + facet_grid(. ~ Breed.recode) +

ggtitle(

"Percent Butterfat Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey") +

scale_x_continuous(name="Percent Butterfat", limits=c(0,6), breaks=seq(0,6,1.0)) +

scale_y_continuous(name="Density", limits=c(0,2), breaks=seq(0,2,0.5)) +

theme(strip.text.x=element_text(face="bold", size=12, color="navyblue")) +

theme(strip.background=element_rect(fill="wheat1")) + theme_MacYates()

# Generate the figure, but it will not show until using the

# gridExtra::grid.arrange() function.

par(ask=TRUE)

gridExtra::grid.arrange(

DensityCurvePctButterfatOverall,

DensityCurvePctButterfatBreed.recode, ncol=2)

Use of a histogram, boxplot, and density curve for Percent Butterfat, both overall and side-by-side by breed, should give a reasonable sense of how the data are distributed. As spacing and appearance permit, practice with the gridExtra::grid.arrange() function to help others understand the data. Comparative side-by-side figures are always helpful to gain a full sense of the data (Figs.3.8, 3.9, and 3.10).

Figure 3.7: Distribution of percent butterfat—4

R Input

HistogramPctProteinOverall <-

ggplot2::ggplot(MilkBreedFatProt.df, aes(x=PctProtein)) +

geom_histogram(binwidth=0.30, color="black", lwd=1.25, fill="cornsilk2") +

geom_vline(aes(xintercept=mean(PctProtein, na.rm=TRUE)), color="darkred", linetype="dashed", size=0.75) +

geom_vline(aes(xintercept=median(PctProtein, na.rm=TRUE)), color="dodgerblue", linetype="dotted", size=1.25) + ggtitle(

"Percent Protein Produced by Holstein and Jersey Dairy Cows, Mean - Red and Median - Blue") +

scale_x_continuous(name="Percent Protein", limits=c(0,5), breaks=seq(0,5,1.0)) +

scale_y_continuous(name="Count", limits=c(0,20), breaks=seq(0,20,5)) +

theme_MacYates()

# Generate the figure, but it will not show until using the

# gridExtra::grid.arrange() function.

# Practice with scale_x_continuous and scale_y_continuous to

# determine the best selections for limits and breaks.

# The red dashed line (PctProtein Mean = 3.48) and the blue

# dotted line (PctProtein Median = 3.47) are quite close to

# each other, given the similar values for mean and median.

HistogramPctProteinBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctProtein)) +

geom_histogram(binwidth=0.30, color="black", lwd=1.25, fill="cornsilk2") +

facet_grid(. ~ Breed.recode) + ggtitle(

"Percent Protein Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey") +

scale_x_continuous(name="Percent Protein", limits=c(0,5), breaks=seq(0,5,1.0)) +

scale_y_continuous(name="Count", limits=c(0,20), breaks=seq(0,20,5)) +

theme(strip.text.x=element_text(face="bold", size=12, color="navyblue")) +

theme(strip.background=element_rect(fill="wheat1")) + theme_MacYates()

# Note how theme(strip.text) and theme(strip.background) are

# placed before theme_MacYates()

par(ask=TRUE)

gridExtra::grid.arrange(

HistogramPctProteinOverall,

HistogramPctProteinBreed.recode, ncol=2)

Figure 3.8: Distribution of percent protein—2

R Input

BoxplotPctProteinOverall <-

ggplot2::ggplot(MilkBreedFatProt.df, aes(x=factor(0), y=PctProtein)) + geom_boxplot() +

# Note: in ggplot2 3.3.0 and later, fun.y is deprecated in favor of fun.

stat_summary(fun.y=mean, geom="point", shape=1, size=12,

col="red") +

ggtitle(

"Percent Protein Produced by Holstein and Jersey Dairy Cows, Mean - Red Circle") +

scale_x_discrete(name="Both Breeds: Holstein and Jersey", breaks=NULL) +

scale_y_continuous(name="Percent Protein",

limits=c(3.0,4.25), breaks=seq(3.0,4.25,0.5)) + theme_MacYates()

# Note the creation of a dummy variable, x=factor(0). Give

# attention, also, to scale_x_discrete() and how it is set to

# present the X axis.

BoxplotPctProteinBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=factor(0), y=PctProtein)) + geom_boxplot() +

facet_grid(. ~ Breed.recode) + ggtitle(

"Percent Protein Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey") +

scale_x_discrete(name="Breed", breaks=NULL) + scale_y_continuous(name="Percent Protein",

limits=c(3.0,4.25), breaks=seq(3.0,4.25,0.5)) + theme(strip.text.x=element_text(face="bold", size=12,

color="navyblue")) +

theme(strip.background=element_rect(fill="wheat1")) + theme_MacYates()

par(ask=TRUE)

gridExtra::grid.arrange(

BoxplotPctProteinOverall,

BoxplotPctProteinBreed.recode, ncol=2)

Figure 3.9: Distribution of percent protein—3

R Input

DensityCurvePctProteinOverall <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctProtein)) +

geom_density(size=1.5, col=("red")) + ggtitle(

"Percent Protein Produced by Holstein and Jersey Dairy Cows") +

scale_x_continuous(name="Percent Protein", limits=c(0,6), breaks=seq(0,6,1.0)) +

scale_y_continuous(name="Density", limits=c(0.0,2.5), breaks=seq(0.0,2.5,0.50)) +

theme_MacYates()

# There is no easy way to know the best Y axis scale to use

# with a density curve. It is usually best to generate the

# density curve using default settings first, with no use of

# arguments with scale_y_continuous. Then, experiment to see

# the best values for limits and breaks.

DensityCurvePctProteinBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df,

aes(x=PctProtein)) +

geom_density(size=1.5, col=("red")) + facet_grid(. ~ Breed.recode) +

ggtitle(

"Percent Protein Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey") +

scale_x_continuous(name="Percent Protein", limits=c(0,6), breaks=seq(0,6,1.0)) +

scale_y_continuous(name="Density", limits=c(0.0,2.5), breaks=seq(0.0,2.5,0.50)) +

theme(strip.text.x=element_text(face="bold", size=12, color="navyblue")) +

theme(strip.background=element_rect(fill="wheat1")) + theme_MacYates()

# Generate the figure, but it will not show until using the

# gridExtra::grid.arrange() function.

par(ask=TRUE)

gridExtra::grid.arrange(

DensityCurvePctProteinOverall,

DensityCurvePctProteinBreed.recode, ncol=2)

Figure 3.10: Distribution of percent protein—4

As with Percent Butterfat, side-by-side comparative figures of Percent Protein give a good sense of how the data are distributed, both overall (i.e., both breeds together) and by breed (i.e., Holstein v Jersey). The gridExtra::grid.arrange() function is a useful tool for this aim.

When viewing these figures, remember that the syntax used in this lesson can (and should) be used in future analyses. The modularity of syntax use and reuse is a comparative advantage of R over the trackpointer-driven or mouse-driven approach to statistical analyses that is common to other software packages. In many cases, the syntax for a histogram, boxplot, or density plot used for one set of data can be easily adjusted for a new one: alter the syntax, typically by changing the dataframe name and object names, and then adjust margins and scales as needed.
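The idea of syntax reuse can be taken one step further by wrapping the syntax in a function, so that only the dataframe name, object name, and label change between uses. The sketch below is one possible arrangement in base R. The helper PlotNumericPair() and the dataframe Toy.df are hypothetical names created for this example.

```r
# Hypothetical helper: one histogram and one boxplot for any numeric
# column of any dataframe. Only the arguments change between uses.
PlotNumericPair <- function(DataFrame, ObjectName, Label) {
  Values <- DataFrame[[ObjectName]]
  OldPar <- par(mfrow = c(1, 2))   # Two figures, side by side
  on.exit(par(OldPar))             # Restore the layout on exit
  hist(Values, main = paste(Label, "Histogram", sep = ": "),
    xlab = Label)
  boxplot(Values, main = paste(Label, "Box Plot", sep = ": "))
  invisible(summary(Values))       # Return descriptive statistics quietly
}

# Toy data, for illustration only.
Toy.df <- data.frame(
  PctButterfat = c(2.86, 3.56, 4.54, 4.82, 5.24),
  PctProtein   = c(2.91, 3.34, 3.47, 3.66, 4.01))

PlotNumericPair(Toy.df, "PctButterfat", "Percent Butterfat")
PlotNumericPair(Toy.df, "PctProtein",   "Percent Protein")
```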

As an ending comment to this section on graphics produced using R, never forget that inferential tests (e.g., Student's t-Test for Independent Samples, in this lesson) are needed to make a final judgment as to whether there are statistically significant (p <= 0.05) differences between Holstein dairy cows and Jersey dairy cows regarding percent butterfat and percent protein. The figures provide a fairly good idea (but still only an idea) of general trends and how the data compare to each other, individually and by group breakouts. Of course, even when judgment is finally made, always remember the key concepts of replication and representation. This lesson applies only to two small collections of dairy cows, and the data are by no means presented as being representative of the herd or breeds. Again, there will be more discussion on replication and representation in future lessons.
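For orientation only, the sketch below shows the general form of a Student's t-Test for Independent Samples in base R. The data are simulated: the group means, the standard deviation, and the name Sim.df are assumptions chosen for the example and are not measurements from this lesson.

```r
set.seed(42)   # Reproducible simulated values, not real measurements

# Simulated data with assumed group means and spread (hypothetical).
Sim.df <- data.frame(
  Breed.recode = factor(rep(c("Holstein", "Jersey"), times = c(20, 22))),
  PctButterfat = c(rnorm(20, mean = 3.6, sd = 0.4),
                   rnorm(22, mean = 4.8, sd = 0.4)))

# Student's t-Test for Independent Samples (equal variances assumed).
TTestResult <- t.test(PctButterfat ~ Breed.recode, data = Sim.df,
  var.equal = TRUE)

TTestResult                    # Full printed result
TTestResult$p.value <= 0.05    # Compare against the p <= 0.05 criterion
```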

In document: Thomas W. MacFarland, Jan M. Yates (pages 180-197)