Descriptive Statistics for Initial Analysis of the Data

These initial graphics are simple and currently have only a few embellishments.

They only serve as a first guide to general trends in data organization. Embellishments to the graphics will be introduced in later lessons by demonstrating the many arguments used to present titles, prepare text and lines in bold and color, etc.

R Input

table(is.na(CPIIISecLbsGen.df$Lbs))

# Returns TRUE if indexed value is missing (i.e., NA) and

# FALSE if indexed value is not missing

R Output

FALSE TRUE

60 1

R Input

table(complete.cases(CPIIISecLbsGen.df$Lbs))

# Returns TRUE if indexed value is not missing (i.e., not NA)

# and FALSE if indexed value is missing

R Output

FALSE TRUE

1 60
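For a single vector, the two calls above give mirror-image results. The following is a minimal sketch of that relationship, assuming a made-up vector (`toy_lbs` is hypothetical and is not part of the dataset):

```r
# Hypothetical toy vector with one missing value (not the book's data)
toy_lbs <- c(120, NA, 135, 142)

missing_flag  <- is.na(toy_lbs)           # TRUE where the value is NA
complete_flag <- complete.cases(toy_lbs)  # TRUE where the value is present

# For a single vector the two flags are logical opposites
identical(complete_flag, !missing_flag)

# Counts of missing and non-missing values
n_missing  <- sum(missing_flag)
n_complete <- sum(complete_flag)
```

Summing a logical vector counts its TRUE values, which is a quick way to obtain the same counts that table() reports.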

R Input

summary(CPIIISecLbsGen.df$Lbs)

# Descriptive statistics, including NAs if any

R Output

Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s

99 121 127 132 139 192 1

Although there are many functions associated with descriptive statistics and measures of central tendency, output from the summary() function is often a first choice when judging data organization and quality assurance issues.

Other functions for descriptive statistics and measures of central tendency are demonstrated below. Again, notice how missing values need to be accommodated.
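To see why the na.rm argument matters, consider this minimal sketch. It assumes a small made-up vector (`toy_lbs` is hypothetical, not the dataset) and shows what happens with and without na.rm=TRUE:

```r
# Hypothetical toy vector with one missing value (not the book's data)
toy_lbs <- c(110, 125, NA, 140)

mean_default <- mean(toy_lbs)                # NA, because one value is missing
mean_clean   <- mean(toy_lbs, na.rm = TRUE)  # missing value is dropped first

# The variance is the square of the standard deviation
sd_clean  <- sd(toy_lbs, na.rm = TRUE)
var_clean <- var(toy_lbs, na.rm = TRUE)
all.equal(sd_clean^2, var_clean)
```

Most base R summary functions behave this way by default, so a single NA is enough to turn the result into NA unless it is explicitly removed.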

R Input

mean(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Mean or arithmetic average

R Output

[1] 131.733

R Input

sd(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Standard Deviation

R Output

[1] 17.5894

R Input

var(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Variance

R Output

[1] 309.385

R Input

median(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Median or midpoint

R Output

[1] 127

R Input

install.packages("modes", dependencies=TRUE)

library(modes) # Load the modes package.

help(package=modes) # Show the information page.

sessionInfo() # Confirm all attached packages.

R Input

modes::modes(CPIIISecLbsGen.df$Lbs, type=1)

# Mode, or the most frequent value

# Note how this distribution is bimodal

R Output

[,1] [,2]

Value 114 122

Length 4 4
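If an add-on package is not available, the most frequent value(s) can also be found with base R alone. This is only a sketch of one possible approach, assuming a made-up vector (`toy_lbs` is hypothetical); it is not the method used by the modes package:

```r
# Hypothetical toy vector with two equally frequent values (not the book's data)
toy_lbs <- c(114, 114, 122, 122, 130, NA)

freq <- table(toy_lbs)  # frequency of each distinct value; NA is dropped by default

# Keep every value whose count equals the highest count
toy_modes <- as.numeric(names(freq)[freq == max(freq)])
toy_modes
```

Because every value tied for the highest count is kept, a bimodal vector returns two values.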

R Input

range(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Range, minimum and maximum

R Output

[1] 99 192

R Input

min(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Minimum

R Output

[1] 99

R Input

which.min(CPIIISecLbsGen.df$Lbs)

# Location (e.g., index) of the first occurrence of the

# minimum value

R Output

[1] 53

R Input

max(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Maximum

R Output

[1] 192

R Input

which.max(CPIIISecLbsGen.df$Lbs)

# Location (e.g., index) of the first occurrence of the

# maximum value

R Output

[1] 58
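The which.min() and which.max() functions skip missing values and report only the first matching position. A minimal sketch, assuming a made-up vector (`toy_lbs` is hypothetical), shows how to use the index and how to find every tied position:

```r
# Hypothetical toy vector with a missing value and a tied minimum
toy_lbs <- c(130, NA, 99, 150, 99)

idx_min <- which.min(toy_lbs)  # position of the first minimum; NA is skipped
min_val <- toy_lbs[idx_min]    # the index retrieves the value itself

# All positions that hold the minimum, not only the first
all_min <- which(toy_lbs == min(toy_lbs, na.rm = TRUE))
```

The same index can be used on a dataframe to display the full row for the subject with the lowest or highest value.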

R Input

quantile(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Quantiles, or values at: 0%, 25%, 50%, 75%, and 100%

R Output

0% 25% 50% 75% 100%

99.00 121.00 127.00 139.25 192.00

R Input

sum(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Arithmetic sum of all values in a vector

R Output

[1] 7904
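The sum ties back to the mean reported earlier: the mean is the sum divided by the number of non-missing values. The sketch below checks this with the figures shown in this lesson (sum of 7904, 60 non-missing values) and then repeats the idea on a made-up vector (`toy_lbs` is hypothetical):

```r
# Figures reported in this lesson
total_lbs <- 7904
n_valid   <- 60
mean_from_sum <- total_lbs / n_valid
round(mean_from_sum, 3)  # 131.733, matching the output of mean()

# Same idea on a hypothetical toy vector
toy_lbs  <- c(110, 125, NA, 140)
toy_mean <- sum(toy_lbs, na.rm = TRUE) / sum(!is.na(toy_lbs))
```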

R Input

boxplot.stats(CPIIISecLbsGen.df$Lbs)

# Produce values for a vector related to a boxplot:

# lower whisker, lower hinge, median, upper hinge, upper

# whisker, N, and outliers

R Output

$stats

[1] 99.0 121.0 127.0 139.5 165.0

$n

[1] 60

$conf

[1] 123.226 130.774

$out

[1] 168 169 192
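The values in $out and $conf can be reproduced by hand from the hinges. The sketch below uses the figures printed above and assumes the default settings of boxplot.stats(), namely a coefficient of 1.5 for the whisker limits and 1.58 for the notch; the variable names are illustrative only:

```r
# Figures taken from the $stats and $n output above
lower_hinge <- 121.0
median_lbs  <- 127.0
upper_hinge <- 139.5
n_valid     <- 60

hinge_spread <- upper_hinge - lower_hinge         # 18.5
coef <- 1.5                                       # default coef of boxplot.stats()
lower_fence <- lower_hinge - coef * hinge_spread  # 93.25
upper_fence <- upper_hinge + coef * hinge_spread  # 167.25

# Every reported outlier lies beyond the upper fence
reported_out <- c(168, 169, 192)
all(reported_out > upper_fence)

# Notch limits around the median, as reported in $conf
notch <- 1.58 * hinge_spread / sqrt(n_valid)
conf_interval <- c(median_lbs - notch, median_lbs + notch)
round(conf_interval, 3)  # 123.226 130.774
```

This also explains why the upper whisker is 165 rather than the maximum of 192: the whisker stops at the largest value that does not exceed the upper fence.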

R Input

fivenum(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Tukey’s five number summary for a vector: minimum,

# lower-hinge, median, upper-hinge, and maximum

R Output

[1] 99.0 121.0 127.0 139.5 192.0

R Input

IQR(CPIIISecLbsGen.df$Lbs, na.rm=TRUE)

# Interquartile range of a vector (i.e., a measure of

# dispersion that is equal to the difference between the

# upper quartile and the lower quartile)

R Output

[1] 18.25
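Note that quantile() and fivenum() do not always agree: the 75% quantile above is 139.25, whereas the upper hinge is 139.5. IQR() follows quantile(), so 139.25 - 121 = 18.25. The sketch below illustrates the difference on a made-up vector (`toy` is hypothetical), assuming the default quantile type:

```r
# Hypothetical toy vector with an even number of values
toy <- 1:6

q       <- quantile(toy)  # default method interpolates between values
hinges  <- fivenum(toy)   # Tukey's hinges: medians of the lower and upper halves
iqr_toy <- IQR(toy)       # difference between the 75% and 25% quantiles

q[["25%"]]  # 2.25
hinges[2]   # 2
iqr_toy     # 2.5
```

For large samples the two approaches usually give nearly identical values, but small differences such as the one seen in this lesson are expected.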

Although descriptive statistics and measures of central tendency are needed to understand the data, they are presented at the singular level (for all values of CPIIISecLbsGen.df$Lbs). That is to say, the above statistics offer no details by breakouts of weight (e.g., Lbs) by either Section or by Gender, which are the two factor-type object variables found in the dataset.

Prepare frequency distributions by Section and by Gender, overall. Then, knowing the degree of representation of subjects by these two factor-type object variables, it will be possible to have a better understanding of descriptive statistics for each group, overall (e.g., Section and Gender), and then by group breakouts (e.g., Section, AM and PM; Gender, Female and Male).⁷

R Input

table(CPIIISecLbsGen.df$Lbs, CPIIISecLbsGen.df$Section)

# Value-by-value contingency table (e.g., crosstab) of

# counts for each combination of numeric object variable

# values vs. factor levels (e.g., Weight (rows) by Section

# (columns))

R Input

table(CPIIISecLbsGen.df$Lbs, CPIIISecLbsGen.df$Gender)

# Value-by-value contingency table (e.g., crosstab) of

# counts for each combination of numeric object variable

# values vs. factor levels (e.g., Weight (rows) by Gender

# (columns))

R Input

table(CPIIISecLbsGen.df$Gender, CPIIISecLbsGen.df$Section, useNA=c("always"))

# Contingency table of cell sums, not individual values

# Observe Row (e.g., Gender) by Column (e.g., Section)

# placement and also note how there are no missing data

# for either Gender or Section (only for Lbs).

⁷ The screen print of output from the table() function, both for Section and Gender when applied against Lbs, is quite long and is not shown. A few other screen prints, when overly long, are also excluded in an attempt to keep this lesson a reasonable length.

R Output

AM PM <NA>

Female 15 20 0

Male 16 10 0

<NA> 0 0 0

R Input

table(CPIIISecLbsGen.df$Section, CPIIISecLbsGen.df$Gender, useNA=c("always"))

# Contingency table of cell sums, not individual values

# Observe Row (e.g., Section) by Column (e.g., Gender)

# placement and also note how there are no missing data

# for either Section or Gender (only for Lbs).

R Output

Female Male <NA>

AM 15 16 0

PM 20 10 0

<NA> 0 0 0

R Input

prop.table(table(CPIIISecLbsGen.df$Section, CPIIISecLbsGen.df$Gender, useNA=c("always")))

# Proportions for each breakout group, Row by Column;

# the cells together sum to 1 (i.e., 100 percent)

R Output

Female Male <NA>

AM 0.245902 0.262295 0.000000

PM 0.327869 0.163934 0.000000

<NA> 0.000000 0.000000 0.000000

R Input

prop.table(table(CPIIISecLbsGen.df$Gender,

CPIIISecLbsGen.df$Section, useNA=c("always")))

# Proportions for each breakout group, Row by Column;

# the cells together sum to 1 (i.e., 100 percent)

R Output

AM PM <NA>

Female 0.245902 0.327869 0.000000

Male 0.262295 0.163934 0.000000

<NA> 0.000000 0.000000 0.000000
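By default, prop.table() divides each cell by the grand total. Row-wise or column-wise proportions are available through the margin argument. The sketch below rebuilds the Section by Gender counts shown in this lesson as a plain matrix (the object names are illustrative only) so that it can be run without the dataset:

```r
# Counts copied from the Section by Gender table in this lesson
counts <- matrix(c(15, 16,
                   20, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Section = c("AM", "PM"),
                                 Gender  = c("Female", "Male")))

cell_prop <- prop.table(counts)              # each cell divided by the grand total (61)
row_prop  <- prop.table(counts, margin = 1)  # each row sums to 1
col_prop  <- prop.table(counts, margin = 2)  # each column sums to 1

round(cell_prop, 6)  # AM/Female is 15/61, or 0.245902
sum(cell_prop)       # 1
rowSums(row_prop)    # 1 for AM and 1 for PM
```

Row proportions answer a different question than cell proportions, for example "what share of the AM section is Female?" rather than "what share of all subjects are AM and Female?".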

The xtabs() function and the ftable() function should also be considered for preparation of frequency distributions in the form of contingency tables. The output from these functions is generally the same as what is seen with use of the table() function, but the presentation and placement of row and column headers may show differently.

R Input

xtabs(~Section+Gender, data=CPIIISecLbsGen.df)

R Output

Gender

Section Female Male

AM 15 16

PM 20 10

R Input

ftable(xtabs(~Section+Gender, data=CPIIISecLbsGen.df))

# Common to many uses with R, note how the ftable()

# function is wrapped around the xtabs() function.

R Output

Gender Female Male

Section

AM 15 16

PM 20 10
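The following minimal sketch confirms that table() and xtabs() produce the same counts and differ only in how the input is specified and how headers are printed. It assumes a small made-up dataframe (`toy.df` is hypothetical and is not the lesson's dataset):

```r
# Hypothetical toy dataframe with two factor-type object variables
toy.df <- data.frame(
  Section = factor(c("AM", "AM", "PM", "PM", "PM")),
  Gender  = factor(c("Female", "Male", "Female", "Female", "Male"))
)

tab_table <- table(toy.df$Section, toy.df$Gender)      # vectors as input
tab_xtabs <- xtabs(~ Section + Gender, data = toy.df)  # formula as input

# Same counts, cell for cell
counts_match <- isTRUE(all.equal(as.vector(tab_table), as.vector(tab_xtabs)))

# ftable() only changes the printed layout
ftable(tab_xtabs)
```

The formula interface of xtabs() keeps the variable names as header labels, which is why its printout is usually easier to read than that of table().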

Use of the table() function and similar functions provides frequency distributions of factor-type object variables. Observe how descriptive statistics can also be provided overall and, more importantly, by breakout classifications of factor-type object variables. Although there are many R functions from which to choose, the RcmdrMisc::numSummary() function is a good first selection in terms of utility and appearance of presentation.

R Input

install.packages("RcmdrMisc", dependencies=TRUE)

library(RcmdrMisc) # Load the RcmdrMisc package.

help(package=RcmdrMisc) # Show the information page.

sessionInfo() # Confirm all attached packages.

R Input

RcmdrMisc::numSummary(CPIIISecLbsGen.df$Lbs)

# Descriptive statistics overall, no breakout groupings

R Output

mean sd IQR 0% 25% 50% 75% 100% n NA

131.733 17.5894 18.25 99 121 127 139.25 192 60 1

The RcmdrMisc::numSummary() function can also be used to display descriptive statistics at the breakout group level for numeric-type object variables:

R Input

RcmdrMisc::numSummary(CPIIISecLbsGen.df[,c("Lbs")],

groups=CPIIISecLbsGen.df$Section) # Default printout, breakouts by Section

R Output

[Selected output is not shown, to save space.]

mean sd 0% 25% 50% 75% 100% data:n data:NA

AM 128.300 13.1520 107 120.25 126 136.5 157 30 1

PM 135.167 20.7864 99 122.25 128 143.5 192 30 0

R Input

RcmdrMisc::numSummary(CPIIISecLbsGen.df[,c("Lbs")],

groups=CPIIISecLbsGen.df$Gender) # Default printout, breakouts by Gender

R Output

mean sd IQR 0% 25% 50% 75% 100% data:n data:NA

Female 123.971 8.27997 9 107 120 124 129 144 35 0

Male 142.600 21.27401 32 99 125 142 157 192 25 1
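Breakout statistics of this kind can also be produced with base R alone. The sketch below is one possible approach using tapply(), assuming a small made-up dataframe (`toy.df` is hypothetical); it is not how numSummary() is implemented. The last lines use the group figures printed above to show that the overall mean is the weighted average of the group means:

```r
# Hypothetical toy dataframe (not the book's data)
toy.df <- data.frame(
  Lbs    = c(120, 124, NA, 140, 150, 160),
  Gender = factor(c("Female", "Female", "Female", "Male", "Male", "Male"))
)

# Apply a function to Lbs separately for each level of Gender
group_mean <- tapply(toy.df$Lbs, toy.df$Gender, mean, na.rm = TRUE)
group_sd   <- tapply(toy.df$Lbs, toy.df$Gender, sd, na.rm = TRUE)
group_n    <- tapply(!is.na(toy.df$Lbs), toy.df$Gender, sum)  # non-missing count
group_na   <- tapply(is.na(toy.df$Lbs), toy.df$Gender, sum)   # missing count

# Group means and counts reported in this lesson (Female, Male)
overall_mean <- (123.971 * 35 + 142.600 * 25) / (35 + 25)
round(overall_mean, 3)  # 131.733, the overall mean reported earlier
```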

2.6 Quality Assurance, Data Distribution, and Tests for

In document: Thomas W. MacFarland, Jan M. Yates (pages 101-111)