These initial graphics are simple and currently have only a few embellishments.
They serve only as a first guide to general trends in data organization. Embellishments to the graphics will be introduced in later lessons, which demonstrate the many arguments used to add titles, set text and lines in bold and color, and so on.
R Input
table(is.na(CPIIISecLbsGen.df$Lbs))
# Returns TRUE if the indexed value is missing (i.e., NA)
# and FALSE if the indexed value is not missing
R Output
FALSE TRUE
60 1
R Input
table(complete.cases(CPIIISecLbsGen.df$Lbs))
# Returns TRUE if the indexed value is not missing (i.e.,
# not NA) and FALSE if the indexed value is missing
R Output
FALSE TRUE
1 60
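The two counts above mirror each other because, for a single vector, complete.cases() is the logical complement of is.na(). The following is a minimal sketch of that relationship; the vector lbs and its values are hypothetical and are not taken from the lesson's dataset.

```r
# Hypothetical weights (not the lesson's data); one value is missing.
lbs <- c(120, NA, 135, 142)

missing_flag  <- is.na(lbs)           # TRUE where the value is NA
complete_flag <- complete.cases(lbs)  # TRUE where the value is present

# For a single vector, the two results are exact complements.
print(identical(complete_flag, !missing_flag))

# Count and locate the missing values.
print(sum(missing_flag))    # number of NAs
print(which(missing_flag))  # index position(s) of the NAs
```

Summing a logical vector counts its TRUE values, which is a compact alternative to reading the count from table().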
R Input
summary(CPIIISecLbsGen.df$Lbs)
# Descriptive statistics, including NAs if any
R Output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
     99     121     127     132     139     192       1
Although there are many functions associated with descriptive statistics and measures of central tendency, output from the summary() function is often a first choice when judging data organization and quality assurance issues.
Other functions for descriptive statistics and measures of central tendency are demonstrated below. Again, notice how missing values are accommodated, here with the na.rm=TRUE argument.
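To see why missing values need attention, consider the short sketch below. It uses a hypothetical vector (not the lesson's data) to contrast the default behavior of mean() with the result when na.rm=TRUE is supplied.

```r
# Hypothetical weights (not the lesson's data); one value is missing.
lbs <- c(120, NA, 135, 142)

mean_default <- mean(lbs)                # a single NA makes the result NA
mean_na_rm   <- mean(lbs, na.rm = TRUE)  # NA is dropped before averaging

print(mean_default)
print(mean_na_rm)
```

The same pattern applies to sd(), var(), median(), range(), min(), max(), quantile(), and sum(), all of which accept the na.rm argument.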
R Input
mean(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Mean or arithmetic average
R Output
[1] 131.733
R Input
sd(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Standard Deviation
R Output
[1] 17.5894
R Input
var(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Variance
R Output
[1] 309.385
R Input
median(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Median or midpoint
R Output
[1] 127
R Input
install.packages("modes", dependencies=TRUE)
library(modes) # Load the modes package.
help(package=modes) # Show the information page.
sessionInfo() # Confirm all attached packages.
R Input
modes::modes(CPIIISecLbsGen.df$Lbs, type=1)
# Mode, or the most frequent value
# Note how this distribution is bimodal
R Output
[,1] [,2]
Value 114 122
Length 4 4
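If the modes package is not available, the most frequent value(s) can also be found with base R alone. The sketch below is one possible approach, not the method used by the modes package, and the vector lbs is hypothetical.

```r
# Hypothetical weights (not the lesson's data), with two tied modes.
lbs <- c(114, 122, 114, 130, 122, 127, NA)

counts <- table(lbs)  # table() omits NA by default

# Keep every value whose count equals the largest count, so that
# ties (e.g., a bimodal distribution) are all reported.
mode_values <- as.numeric(names(counts)[counts == max(counts)])

print(counts)
print(mode_values)
```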
R Input
range(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Range, minimum and maximum
R Output
[1] 99 192
R Input
min(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Minimum
R Output
[1] 99
R Input
which.min(CPIIISecLbsGen.df$Lbs)
# Location (e.g., index) of the first occurrence of the
# minimum value
R Output
[1] 53
R Input
max(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Maximum
R Output
[1] 192
R Input
which.max(CPIIISecLbsGen.df$Lbs)
# Location (e.g., index) of the first occurrence of the
# maximum value
R Output
[1] 58
R Input
quantile(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Quantiles, or values at: 0%, 25%, 50%, 75%, and 100%
R Output
0% 25% 50% 75% 100%
99.00 121.00 127.00 139.25 192.00
R Input
sum(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Arithmetic sum of all values in a vector
R Output
[1] 7904
R Input
boxplot.stats(CPIIISecLbsGen.df$Lbs)
# Produce values for a vector related to a boxplot:
# lower whisker, lower hinge, median, upper hinge, upper
# whisker, N, and outliers
R Output
$stats
[1] 99.0 121.0 127.0 139.5 165.0
$n
[1] 60
$conf
[1] 123.226 130.774
$out
[1] 168 169 192
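The values listed under $out follow from the hinges in $stats. With the default coef=1.5, boxplot.stats() flags any value lying more than 1.5 times the hinge spread beyond a hinge. The sketch below repeats that arithmetic using the hinges reported above, and then checks the rule against a small hypothetical vector (demo_lbs is not the lesson's data).

```r
# Hinges as reported in the $stats output above.
lower_hinge  <- 121.0
upper_hinge  <- 139.5
hinge_spread <- upper_hinge - lower_hinge           # 18.5

lower_fence <- lower_hinge - 1.5 * hinge_spread     # 93.25
upper_fence <- upper_hinge + 1.5 * hinge_spread     # 167.25
print(c(lower_fence, upper_fence))
# 168, 169, and 192 exceed 167.25, so they are listed in $out.

# Check the same rule on hypothetical values.
demo_lbs   <- c(110, 112, 113, 114, 115, 116, 118, 140)
demo_stats <- boxplot.stats(demo_lbs)
demo_hinge <- demo_stats$stats[c(2, 4)]
demo_lo    <- demo_hinge[1] - 1.5 * diff(demo_hinge)
demo_hi    <- demo_hinge[2] + 1.5 * diff(demo_hinge)
print(demo_lbs[demo_lbs < demo_lo | demo_lbs > demo_hi])
print(demo_stats$out)
```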
R Input
fivenum(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Tukey’s five number summary for a vector: minimum,
# lower-hinge, median, upper-hinge, and maximum
R Output
[1] 99.0 121.0 127.0 139.5 192.0
R Input
IQR(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)
# Interquartile range of a vector (i.e., a measure of
# dispersion that is equal to the difference between the
# upper quartile and the lower quartile)
R Output
[1] 18.25
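Notice that quantile() reports 139.25 at 75% while fivenum() reports an upper hinge of 139.5. The two functions use different definitions, and IQR() follows quantile(). The sketch below illustrates the difference on a hypothetical vector (not the lesson's data).

```r
# Hypothetical weights (not the lesson's data).
lbs <- c(110, 112, 113, 114, 115, 116, 118, 140)

# Quartiles from quantile(), which interpolates between order
# statistics (the default is type = 7).
quartiles <- quantile(lbs, probs = c(0.25, 0.75), na.rm = TRUE)

# Tukey's hinges from fivenum(): elements 2 and 4 of the result.
hinges <- fivenum(lbs, na.rm = TRUE)[c(2, 4)]

iqr_from_quartiles <- unname(quartiles[2] - quartiles[1])
hinge_spread       <- hinges[2] - hinges[1]

print(quartiles)
print(hinges)
print(c(IQR = IQR(lbs, na.rm = TRUE),
        from_quartiles = iqr_from_quartiles,
        hinge_spread = hinge_spread))
```

For large samples the two definitions usually give similar values, but they need not agree exactly.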
Although descriptive statistics and measures of central tendency are needed to understand the data, so far they have been presented only at the overall level, for all values of CPIIISecLbsGen.df$Lbs taken together. That is to say, the above statistics offer no details on breakouts of weight (i.e., Lbs) by either Section or Gender, which are the two factor-type object variables found in the dataset.
Prepare frequency distributions by Section and by Gender, overall. Then, knowing the degree of representation of subjects by these two factor-type object variables, it will be possible to have a better understanding of descriptive statistics for each group, overall (e.g., Section and Gender), and then by group breakouts (e.g., Section, AM and PM; Gender, Female and Male).[7]
R Input
table(CPIIISecLbsGen.df$Lbs, CPIIISecLbsGen.df$Section)
# Value-by-value contingency table (e.g., crosstab) of
# counts for each combination of numeric object variable
# values by factor levels (e.g., Weight (rows) by Section
# (columns))
R Input
table(CPIIISecLbsGen.df$Lbs, CPIIISecLbsGen.df$Gender)
# Value-by-value contingency table (e.g., crosstab) of
# counts for each combination of numeric object variable
# values by factor levels (e.g., Weight (rows) by Gender
# (columns))
[7] The screen print of output from the table() function, both for Section and Gender when applied against Lbs, is quite long and is not shown. A few other screen prints, when overly long, are also excluded in an attempt to keep this lesson a reasonable length.
R Input
table(CPIIISecLbsGen.df$Gender, CPIIISecLbsGen.df$Section, useNA=c("always"))
# Contingency table of cell sums, not individual values
# Observe Row (e.g., Gender) by Column (e.g., Section)
# placement and also note how there are no missing data
# for either Gender or Section (only for Lbs).
R Output
       AM PM <NA>
Female 15 20    0
Male   16 10    0
<NA>    0  0    0
R Input
table(CPIIISecLbsGen.df$Section, CPIIISecLbsGen.df$Gender, useNA=c("always"))
# Contingency table of cell sums, not individual values
# Observe Row (e.g., Section) by Column (e.g., Gender)
# placement and also note how there are no missing data
# for either Section or Gender (only for Lbs).
R Output
Female Male <NA>
AM 15 16 0
PM 20 10 0
<NA> 0 0 0
R Input
prop.table(table(CPIIISecLbsGen.df$Section, CPIIISecLbsGen.df$Gender, useNA=c("always")))
# Proportion of the grand total in each cell, Row by Column;
# all cells together sum to 1 (i.e., 100 percent)
R Output
       Female     Male     <NA>
AM   0.245902 0.262295 0.000000
PM   0.327869 0.163934 0.000000
<NA> 0.000000 0.000000 0.000000
R Input
prop.table(table(CPIIISecLbsGen.df$Gender,
CPIIISecLbsGen.df$Section, useNA=c("always")))
# Proportion of the grand total in each cell, Row by Column;
# all cells together sum to 1 (i.e., 100 percent)
R Output
             AM       PM     <NA>
Female 0.245902 0.327869 0.000000
Male   0.262295 0.163934 0.000000
<NA>   0.000000 0.000000 0.000000
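By default, prop.table() divides each cell by the grand total. Its margin argument can be used to obtain row or column proportions instead. The sketch below rebuilds the Section by Gender counts shown above as a small table (the object name counts is an assumption made for illustration) and shows all three forms.

```r
# Section-by-Gender counts as reported in the output above.
counts <- as.table(matrix(
  c(15, 16,
    20, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(Section = c("AM", "PM"),
                  Gender  = c("Female", "Male"))))

overall_prop <- prop.table(counts)              # cells sum to 1
row_prop     <- prop.table(counts, margin = 1)  # each row sums to 1
col_prop     <- prop.table(counts, margin = 2)  # each column sums to 1

print(round(overall_prop, 6))
print(round(row_prop, 6))
print(round(col_prop, 6))
```

Row proportions answer a question such as "what share of the AM section is Female?", whereas the default answers "what share of all subjects are AM and Female?".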
The xtabs() function and the ftable() function should also be considered for preparation of frequency distributions in the form of contingency tables. The output from these functions is generally the same as what is seen with use of the table() function, but the presentation and placement of row and column headers may show differently.
R Input
xtabs(~Section+Gender, data=CPIIISecLbsGen.df)
R Output
Gender
Section Female Male
AM 15 16
PM 20 10
R Input
ftable(xtabs(~Section+Gender, data=CPIIISecLbsGen.df))
# Common to many uses with R, note how the ftable()
# function is wrapped around the xtabs() function.
R Output
        Gender Female Male
Section
AM                 15   16
PM                 20   10
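Row and column totals are often wanted alongside the cell counts. One option, sketched below, is to pass the xtabs() result to addmargins(). The data frame demo.df is hypothetical: it only reproduces the Section and Gender counts shown above and is not the lesson's dataset.

```r
# Hypothetical data frame with the same Section-by-Gender counts
# as the output above (15, 16, 20, and 10 subjects).
cell_sizes <- c(15, 16, 20, 10)
demo.df <- data.frame(
  Section = rep(c("AM", "AM", "PM", "PM"), times = cell_sizes),
  Gender  = rep(c("Female", "Male", "Female", "Male"), times = cell_sizes))

counts      <- xtabs(~ Section + Gender, data = demo.df)
with_totals <- addmargins(counts)  # appends a Sum row and a Sum column

print(counts)
print(with_totals)
```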
Use of the table() function and similar functions provides frequency distributions of factor-type object variables. Descriptive statistics can also be provided, both overall and, more importantly, by breakout classifications for factor-type object variables. Although there are many R functions from which to choose, the RcmdrMisc::numSummary() function is a good first selection in terms of utility and appearance of presentation.
R Input
install.packages("RcmdrMisc", dependencies=TRUE)
library(RcmdrMisc) # Load the RcmdrMisc package.
help(package=RcmdrMisc) # Show the information page.
sessionInfo() # Confirm all attached packages.
R Input
RcmdrMisc::numSummary(CPIIISecLbsGen.df$Lbs)
# Descriptive statistics overall, no breakout groupings
R Output
    mean      sd   IQR 0% 25% 50%    75% 100%  n NA
 131.733 17.5894 18.25 99 121 127 139.25  192 60  1
The RcmdrMisc::numSummary() function can also be used to display descriptive statistics at the breakout group level for numeric-type object variables:
R Input
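RcmdrMisc::numSummary(CPIIISecLbsGen.df[,c("Lbs")],
groups=CPIIISecLbsGen.df$Section)
# Default printout, breakouts by Section. The grouping
# factor is referenced through the data frame, so the call
# does not depend on the data frame being attached.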
R Output
[Selected output is not shown, to save space.]
      mean      sd  0%    25% 50%   75% 100% data:n data:NA
AM 128.300 13.1520 107 120.25 126 136.5  157     30       1
PM 135.167 20.7864  99 122.25 128 143.5  192     30       0
R Input
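RcmdrMisc::numSummary(CPIIISecLbsGen.df[,c("Lbs")],
groups=CPIIISecLbsGen.df$Gender)
# Default printout, breakouts by Gender. The grouping
# factor is referenced through the data frame, so the call
# does not depend on the data frame being attached.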
R Output
          mean       sd IQR  0% 25% 50% 75% 100% data:n data:NA
Female 123.971  8.27997   9 107 120 124 129  144     35       0
Male   142.600 21.27401  32  99 125 142 157  192     25       1
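Grouped statistics of this kind can be cross-checked with base R. The sketch below uses tapply() on a small hypothetical data frame (demo.df and its values are invented for illustration and are not the lesson's data). It is offered as one possible check, not as a replacement for numSummary().

```r
# Hypothetical data frame (not the lesson's data); one Lbs value
# is missing.
demo.df <- data.frame(
  Section = factor(c("AM", "AM", "AM", "PM", "PM", "PM")),
  Gender  = factor(c("Female", "Male", "Female",
                     "Male", "Female", "Male")),
  Lbs     = c(120, 150, NA, 160, 118, 142))

# Mean weight for each level of one factor.
mean_by_section <- tapply(demo.df$Lbs, demo.df$Section,
                          mean, na.rm = TRUE)

# Mean weight for each Section-by-Gender combination.
mean_by_both <- tapply(demo.df$Lbs,
                       list(demo.df$Section, demo.df$Gender),
                       mean, na.rm = TRUE)

# Count of non-missing weights in each Section.
n_by_section <- tapply(!is.na(demo.df$Lbs), demo.df$Section, sum)

print(mean_by_section)
print(mean_by_both)
print(n_by_section)
```

Any summary function that accepts na.rm, such as sd() or median(), can be substituted for mean() in the same call.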