Addendum 3: Additional Practice Datasets for Data with Normal Distribution Patterns and Data That Do Not Exhibit Normal Distribution Patterns

2.11.1 Purpose of This Addendum

The purpose of this addendum is to use the R environment to provide additional guidance on data exploration, descriptive statistics, and measures of central tendency. Mastery of these topics is essential for anyone who regularly engages in empirically-based research, either as a consumer or producer of research.

Fortunately, the R environment provides many excellent tools for the core topics associated with this addendum.¹⁰

Note In the front matter to this lesson, the syntax is presented and is then immediately followed in most cases by either a copy of screen output or an accompanying figure. That approach is not used in this addendum. View this addendum as practice, homework-type bonus material. Syntax is presented and descriptive text accompanies the syntax. However, screen output and figures are generally excluded; when they do appear, it is only to offer mid-point guidance that the analyses are proceeding correctly. Either key the syntax in this addendum or copy and paste it into an editor, whichever is feasible and most convenient. And then, to use the common expression, Practice, Practice, Practice!

2.11.2 Background

There are two datasets used in the first part of this addendum. The two datasets are self-generated, each created using R-based tools. The first dataset is created using the stats::rnorm() function and the data follow a generally normal distribution pattern. Then, to offer an interesting contrast, the second dataset is created using the stats::runif() function and the data do not follow a normal distribution pattern. Be sure to closely examine how R-based functions are used against both datasets, but then notice how there are widely different results between the data that exhibit normal distribution and the data that do not exhibit normal distribution.

In an effort to provide consistency and reproducible results, the base::set.seed() function is used at the start of this demonstration. Self-generated datasets, such as the two datasets used in the first part of this addendum, can then be replicated exactly (provided the base::set.seed() function is used correctly) if the data are generated again and/or generated by other researchers.

¹⁰ Syntax is provided throughout this addendum, but of course this syntax is only a suggestion. Experiment and take other approaches to how the data can be analyzed and outcomes presented by using other functions and other arguments. Use this addendum as a confidence-building resource on how R is used with increasingly complex analyses.

There are no inferential analyses associated with this addendum. Therefore, a Null Hypothesis is not provided. All analyses are descriptive (e.g., describe the data) in nature, not inferential (e.g., allow an inference or judgment about differences between groups, association between object variables, etc.).

2.11.3 Import Data in Comma-Separated Values (.csv) File Format and/or Self-Generate the Data Using R-Based Functions

Instead of importing .csv files, the data in the first part of this addendum are self-generated, using R-based tools. After the base::set.seed() function is used, two R-based functions are used to self-generate the two datasets:

• The stats::rnorm() function is used to generate a dataset (e.g., called SBPNormal) that follows a pattern of normal distribution.

• The stats::runif() function is used to generate a dataset (e.g., called SBPNotNormal) that does not follow a pattern of normal distribution.

R Input

# Set the seed
base::set.seed(8)

# The base::set.seed() function is commonly used

# immediately before any attempt to generate random

# numbers. Some numerical value is then used along

# with this function in an effort to set the seed

# and in turn produce a specific sequence of random

# numbers. For this addendum, the number 8 was used

# to set the seed. There is nothing special about

# the number 8 being used to set the seed and the

# sequence of random numbers generated because of

# this selection. It would have been possible to

# generate another set of random numbers by using

# 1234 or any other number to set the seed. Some

# number had to be selected and for this addendum

# the number 8 was used to set the seed. If this

# number were used by others the same set of random

# numbers would be generated, allowing reproduction

# of results in the future and/or by others.
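The effect of the seed can be confirmed with a minimal sketch that assumes nothing beyond base R. The object names FirstDraw, SecondDraw, and ThirdDraw are hypothetical and are used only for illustration. Resetting the seed to the same value should reproduce the same draws, while a different seed should give a different sequence.

```r
# Reset the seed to the same value before each of two draws
base::set.seed(8)
FirstDraw  <- stats::rnorm(5, mean=120, sd=6)
base::set.seed(8)
SecondDraw <- stats::rnorm(5, mean=120, sd=6)

# Use a different seed for a third draw
base::set.seed(1234)
ThirdDraw  <- stats::rnorm(5, mean=120, sd=6)

base::identical(FirstDraw, SecondDraw)  # Same seed, same numbers
base::identical(FirstDraw, ThirdDraw)   # Different seed, different numbers
```

If this sketch is keyed during practice, run base::set.seed(8) again before generating the two datasets so that the random number sequence starts from the intended point.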

R Input

# Data with Normal Distribution Patterns

SBPNormal <- stats::rnorm(100000, mean=120, sd=06)

# Approximately +3 and -3 SDs for SBP with 120 mean,

# when data show normal distribution.

# Use the stats::rnorm() function to generate random

# numbers.

# Create a set of 100,000 random numbers that

# exhibits normal distribution, with mean = 120 and

# standard deviation = 06. To allow for some degree

# of familiarity, this distribution (mean = 120 and

# sd = 06) is equivalent to common metrics for

# Systolic Blood Pressure, SBP.

R Input

# Data That Do Not Exhibit Normal Distribution Patterns
SBPNotNormal <- stats::runif(100000, min=102, max=138)

# Approximately +3 and -3 SDs for SBP with 120 mean,

# when data show normal distribution.

# Use the stats::runif() function to generate random

# numbers.

# Create a set of 100,000 random numbers that does

# not exhibit normal distribution. For this set of

# random numbers, the minimum value is set to 102

# and the maximum value is set to 138, which model

# to a large degree the two extreme values for SBP

# readings (mean = 120 and standard deviation = 06)

# at 3 SDs (standard deviations).
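The arithmetic behind the 102 and 138 limits can be checked with a short sketch. The object names are hypothetical. The last calculation uses the textbook formula for the standard deviation of a uniform distribution, (max - min) / sqrt(12), and shows that the uniform data should be noticeably more dispersed than the normal data even though both are centered on 120.

```r
SBPMean <- 120
SBPSD   <- 6

LowerLimit <- SBPMean - (3 * SBPSD)   # 102
UpperLimit <- SBPMean + (3 * SBPSD)   # 138

# Theoretical standard deviation of a uniform distribution
UniformSD <- (UpperLimit - LowerLimit) / base::sqrt(12)
base::round(UniformSD, 2)             # 10.39
```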

2.11.4 Organize the Data and Display the Code Book

The data for the first part of this addendum consist of two object variables, named SBPNormal and SBPNotNormal. The two object variables are self-generated using R-based functions. They are not imported from an external source.

R Input

##############################################
# Code Book for SBPNormal and SBPNotNormal   #
##############################################
#                                            #
# SBPNormal ...... Numeric                   #
#                  100,000 SBP readings with #
#                  mean = 120 and sd = 06    #
#                                            #
# SBPNotNormal ... Numeric                   #
#                  100,000 SBP readings with #
#                  minimum = 102 and         #
#                  maximum = 138             #
##############################################

Recall that object variables SBPNormal and SBPNotNormal are separate object variables and they are not part of a common dataframe.

Additional R-based functions are now used to be sure that the data are of the expected type, appear correct and ready for use, and are within expected ranges. Follow along with the syntax and reproduce it to obtain equivalent output. Recall that output is only shown occasionally, with the expectation being that these bonus materials are geared toward self-initiated practice.

R Input

base::getwd()                     # Confirm working directory
base::ls()                        # Confirm available objects

utils::str(SBPNormal)             # Identify structure
utils::head(SBPNormal, n=10)      # Show the head, first 10 cases
utils::tail(SBPNormal, n=10)      # Show the tail, last 10 cases
base::summary(SBPNormal)          # Summary statistics

utils::str(SBPNotNormal)          # Identify structure
utils::head(SBPNotNormal, n=10)   # Show the head, first 10 cases
utils::tail(SBPNotNormal, n=10)   # Show the tail, last 10 cases
base::summary(SBPNotNormal)       # Summary statistics

There are many more R-based functions available for further diagnostics about singular object variables and dataframes with multiple object variables, but the functions demonstrated immediately above should be more than sufficient to confirm that the data are in good form.
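The same checks can also be expressed as explicit assertions. The following is only a sketch, and the specific conditions chosen (numeric type, 100,000 values, no missing data, and the uniform limits of 102 and 138) are assumptions drawn from the Code Book rather than a required procedure. The data are regenerated here so that the sketch runs on its own.

```r
base::set.seed(8)
SBPNormal    <- stats::rnorm(100000, mean=120, sd=6)
SBPNotNormal <- stats::runif(100000, min=102, max=138)

# Each condition must be TRUE or an error is raised
base::stopifnot(
  base::is.numeric(SBPNormal),
  base::is.numeric(SBPNotNormal),
  base::length(SBPNormal) == 100000,
  base::length(SBPNotNormal) == 100000,
  !base::anyNA(SBPNormal),
  !base::anyNA(SBPNotNormal),
  base::min(SBPNotNormal) >= 102,
  base::max(SBPNotNormal) <= 138
)
# Nothing is printed when every check passes
```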

2.11.5 Conduct a Visual Data Check Using Graphics (e.g., Figures)

For immediate use as a data check, produce throw-away graphics and avoid all embellishments. This approach, using graphics as a quality assurance measure, provides a convenient way to review the data and have confidence that the data are acceptable for later analyses.

For each numeric object variable, it is a good practice to put a histogram, a density plot, a boxplot, and a Q-Q plot all in the same figure, as shown below.

If desired, or if there are any concerns about the data, it is also useful to use a violin plot and a dot plot to graphically examine numeric data. Each figure provides a different perspective of the data, regardless of whether the figures are ever shared with others.

R Input

par(ask=TRUE)      # Pause
par(mfrow=c(2,2))  # 4 figures - 2 rows by 2 column grid
graphics::hist(SBPNormal)
graphics::plot(stats::density(SBPNormal, na.rm=TRUE))
graphics::boxplot(SBPNormal)
stats::qqnorm(SBPNormal); stats::qqline(SBPNormal)
# Place four separate figures (e.g., histogram, density
# plot, boxplot, and a Q-Q plot with an accompanying Q-Q
# line) into one common figure.
# Note how one function can be wrapped around another
# function. In this example, the graphics::plot() function
# has been wrapped around the stats::density() function.
# The stats::density() function stops with an error if
# missing values are present, unless the na.rm=TRUE
# argument is used.

par(ask=TRUE)      # Pause
par(mfrow=c(2,2))  # 4 figures - 2 rows by 2 column grid
graphics::hist(SBPNotNormal)
graphics::plot(stats::density(SBPNotNormal, na.rm=TRUE))
graphics::boxplot(SBPNotNormal)
stats::qqnorm(SBPNotNormal); stats::qqline(SBPNotNormal)

# Place four separate figures (e.g., histogram, density

# plot, boxplot, and a Q-Q plot with an accompanying Q-Q

# line) into one common figure.
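As a numeric companion to the four-panel figures above, the share of values within 1, 2, and 3 standard deviations of the mean can be compared for the two object variables. This is only a sketch, and ShareWithin() is a hypothetical helper written for this illustration, not a function from any package.

```r
base::set.seed(8)
SBPNormal    <- stats::rnorm(100000, mean=120, sd=6)
SBPNotNormal <- stats::runif(100000, min=102, max=138)

# Hypothetical helper: proportion of values within k standard
# deviations of the mean
ShareWithin <- function(x, k) {
  base::mean(base::abs(x - base::mean(x)) <= k * stats::sd(x))
}

NormalShare    <- base::sapply(1:3, function(k) ShareWithin(SBPNormal, k))
NotNormalShare <- base::sapply(1:3, function(k) ShareWithin(SBPNotNormal, k))

base::round(base::rbind(NormalShare, NotNormalShare), 3)
# The normal data should be near 0.683, 0.954, and 0.997.
# The uniform data should be near 0.577 at 1 SD and reach
# 1.000 by 2 SDs.
```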

These simple black-and-white throw-away graphics provide an initial view of the data, which for this addendum details differences in expected visualization for data that exhibit normality (e.g., SBPNormal) and data that do not exhibit normality (e.g., SBPNotNormal). To allow for more comparative visualization between the two object variables (e.g., SBPNormal and SBPNotNormal), generate four separate figures: one figure for two side-by-side histograms, one figure for two side-by-side density plots, one figure for two side-by-side boxplots, and one figure for two side-by-side Q-Q plots and Q-Q lines. Arrange each figure into a 1 by 2 grid (1 row by 2 columns) and adjust the X axis and the Y axis for a common scale to allow meaningful side-by-side comparisons.

R Input

# Histogram
par(ask=TRUE)      # Pause
par(mfrow=c(1,2))  # 2 figures - 1 row by 2 column grid
graphics::hist(SBPNormal,
  main="SBP - Normal Distribution",
  col="red",           # Add color
  breaks=50,           # Increase granularity of histogram
  font.lab=2,          # Bold labels
  xlim=c(0,200),       # X axis scale
  ylim=c(0,7000))      # Y axis scale
axis(side=1, font=2)   # X axis bold
axis(side=2, font=2)   # Y axis bold
graphics::hist(SBPNotNormal,
  main="SBP - Not Normal Distribution",
  col="red",           # Add color
  breaks=50,           # Increase granularity of histogram
  font.lab=2,          # Bold labels
  xlim=c(0,200),       # X axis scale
  ylim=c(0,7000))      # Y axis scale
axis(side=1, font=2)   # X axis bold
axis(side=2, font=2)   # Y axis bold

# Notice how both histograms have the same X axis scale

# and Y axis scale, allowing meaningful side-by-side

# comparisons.

# Density Plot
par(ask=TRUE)      # Pause
par(mfrow=c(1,2))  # 2 figures - 1 row by 2 column grid
graphics::plot(stats::density(SBPNormal, na.rm=TRUE),
  main="SBP - Normal Distribution",
  col="red",           # Add color
  lwd=5,               # Thick line
  font.lab=2,          # Bold labels
  xlim=c(0,200),       # X axis scale
  ylim=c(0,0.08))      # Y axis scale
axis(side=1, font=2)   # X axis bold
axis(side=2, font=2)   # Y axis bold
graphics::plot(stats::density(SBPNotNormal, na.rm=TRUE),
  main="SBP - Not Normal Distribution",
  col="red",           # Add color
  lwd=5,               # Thick line
  font.lab=2,          # Bold labels
  xlim=c(0,200),       # X axis scale
  ylim=c(0,0.08))      # Y axis scale
axis(side=1, font=2)   # X axis bold
axis(side=2, font=2)   # Y axis bold

# Notice how both density plots have the same X axis

# scale and Y axis scale, allowing meaningful side-by-

# side comparisons.

# Boxplot
par(ask=TRUE)      # Pause
par(mfrow=c(1,2))  # 2 figures - 1 row by 2 column grid
graphics::boxplot(SBPNormal,
  main="SBP - Normal Distribution",
  xlab="Boxplot",      # X axis label
  ylab="SBP",          # Y axis label
  cex.axis=1.15,       # Axis size
  cex.lab=1.15,        # Label size
  col="red",           # Box color
  lwd=2,               # Line thickness
  font.lab=2,          # Bold labels
  font=2,              # Bold font
  ylim=c(0,200))       # Y axis scale
graphics::boxplot(SBPNotNormal,
  main="SBP - Not Normal Distribution",
  xlab="Boxplot",      # X axis label
  ylab="SBP",          # Y axis label
  cex.axis=1.15,       # Axis size
  cex.lab=1.15,        # Label size
  col="red",           # Box color
  lwd=2,               # Line thickness
  font.lab=2,          # Bold labels
  font=2,              # Bold font
  ylim=c(0,200))       # Y axis scale

# Notice how both boxplots have the same Y axis scale,

# allowing meaningful side-by-side comparisons.

# Q-Q Plot
par(ask=TRUE)      # Pause
par(mfrow=c(1,2))  # 2 figures - 1 row by 2 column grid
stats::qqnorm(SBPNormal,
  main="Q-Q Plot (Blue) and Q-Q Line (Red) of SBP - Normal Distribution",
  col="blue", xlim=c(-4,4), ylim=c(0,200),
  font.axis=2, font.lab=2)
stats::qqline(SBPNormal,     # Add a Q-Q Line to the Q-Q Plot
  col="red", lwd=4, lty=2)
stats::qqnorm(SBPNotNormal,
  main="Q-Q Plot (Blue) and Q-Q Line (Red) of SBP - Not Normal Distribution",
  col="blue", xlim=c(-4,4), ylim=c(0,200),
  font.axis=2, font.lab=2)
stats::qqline(SBPNotNormal,  # Add a Q-Q Line to the Q-Q Plot
  col="red", lwd=4, lty=2)

# Notice how both Q-Q plots have the same X axis scale

# and Y axis scale, allowing meaningful side-by-side

# comparisons.
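The axis limits used above are fixed values. As an alternative sketch, a common X axis scale can be taken from the data themselves, and the graphics settings changed by par() can be restored afterward. The object names CommonXlim and OldPar are hypothetical, and the choice of histograms here is only for illustration.

```r
base::set.seed(8)
SBPNormal    <- stats::rnorm(100000, mean=120, sd=6)
SBPNotNormal <- stats::runif(100000, min=102, max=138)

# Common X axis limits derived from both object variables
CommonXlim <- base::range(c(SBPNormal, SBPNotNormal))

OldPar <- graphics::par(mfrow=c(1,2))  # Save the prior settings
graphics::hist(SBPNormal, breaks=50,
  xlim=CommonXlim, ylim=c(0,7000),
  main="SBP - Normal Distribution")
graphics::hist(SBPNotNormal, breaks=50,
  xlim=CommonXlim, ylim=c(0,7000),
  main="SBP - Not Normal Distribution")
graphics::par(OldPar)                  # Restore the prior settings
```

Restoring the settings means that later figures return to one plot per page instead of silently inheriting the 1 by 2 grid.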

This addendum has not yet included categorical data such as: (1) nominal objects, such as headcounts of female or male (gender) subjects, or (2) ordinal objects, such as small, medium, or large (size) rankings of non-interval measurements. The visual representation of categorical data, whether nominal or ordinal, is often shown with bar charts and mosaic plots, as demonstrated in a later part of this addendum.

2.11.6 Descriptive Statistics for Initial Analysis of the Data

Descriptive statistics, as the name suggests, describe the data. There are many R-based functions that serve this purpose, using the packages that are obtained when R is first downloaded and also by using external packages that are downloaded from external sites that host the CRAN package repository:

• Some functions provide the desired statistic, only.

• Some functions provide the desired statistic, at the summary level and also by breakouts of factor-type object variables.

• Some functions not only provide the desired statistic, but they also provide a graphic that reinforces the outcome.

The data for this addendum are fairly simple and for the first part of this addendum the two self-created datasets (e.g., SBPNormal and SBPNotNormal) do not include factor-type object variables. Descriptive statistics by factor-type object variable breakouts are certainly important, and they are demonstrated in later parts of this addendum.

The important issue to recall for descriptive statistics of measured object variables is that the focus is usually on a central measure or average (e.g., mode, median, mean, etc.) and dispersion (e.g., variance, standard deviation, minimum, maximum, etc.). When describing a collection of numbers, it is common to hear reference to the term average, but this statement is incomplete. It is equally important to know something about the dispersion of the numbers to have a more complete understanding of the data.
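A small sketch shows why an average alone is incomplete. The two sets of readings below are made up for this illustration. They share the same mean, but their dispersion differs by a factor of ten.

```r
TightReadings <- c(118, 119, 120, 121, 122)
WideReadings  <- c(100, 110, 120, 130, 140)

base::mean(TightReadings)   # 120
base::mean(WideReadings)    # 120

stats::sd(TightReadings)    # About 1.58
stats::sd(WideReadings)     # About 15.81
```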

Again, most statistics presented immediately below use packages made available when R is first downloaded. However, as needed, external packages are obtained and selected functions from these packages are used to provide either singular or multiple descriptive statistics and measures of central tendency. As a general reminder about the functions presented immediately below:

• Depending on how the R session has been organized in the Housekeeping section, object variables with large Ns often produce output using exponential notation (e.g., e-notation, scientific notation, standard index form). If exponential notation is a problem, consider using the base::format() function or the base::options() function and adjust the arguments to achieve desired output.

• When obtaining external packages made available by CRAN, some packages may be unavailable from specific sites at specific times. Select another CRAN mirror site if a package does not download.
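As a sketch of the first reminder, the following shows two ways to avoid exponential notation. The value of LargeValue is made up for this illustration, and a scipen setting of 999 is only one common choice, not a requirement.

```r
LargeValue <- 120 * 1e12

base::print(LargeValue)
# Under default options this typically prints as 1.2e+14

base::format(LargeValue, scientific=FALSE)
# "120000000000000"
base::format(LargeValue, scientific=FALSE, big.mark=",")
# "120,000,000,000,000"

OldOptions <- base::options(scipen=999)  # Discourage e-notation
base::print(LargeValue)
base::options(OldOptions)                # Restore the prior setting
```

The base::format() function changes a single result, while base::options() changes behavior for the rest of the session until the prior setting is restored.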

R text-based output of calculated statistics should always be correct, but consider some type of redundant quality assurance check just to be sure that output is indeed correct. The only thing worse than a function call that produces no outcome is a function call that produces an outcome that is incorrect because of logic problems or other errors in how the function was used, rather than a fault in the function itself. Quality assurance should be pervasive. Planned redundancy is not a burden if it avoids a later error.
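One possible form of planned redundancy is to calculate the same statistic in two ways and confirm that the results agree. This is only a sketch, and the object names are hypothetical. The data are regenerated here so that the sketch runs on its own.

```r
base::set.seed(8)
SBPNormal <- stats::rnorm(100000, mean=120, sd=6)

MeanByFunction <- base::mean(SBPNormal)
MeanByHand     <- base::sum(SBPNormal) / base::length(SBPNormal)

SDByFunction   <- stats::sd(SBPNormal)
SDByHand       <- base::sqrt(
  base::sum((SBPNormal - MeanByHand)^2) /
  (base::length(SBPNormal) - 1))

base::all.equal(MeanByFunction, MeanByHand)  # TRUE
base::all.equal(SDByFunction, SDByHand)      # TRUE
```

The base::all.equal() function is used instead of == because two calculations of the same quantity can differ by a very small rounding amount.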

As practiced throughout this addendum, package names are generally used along with function names (e.g., Package::Function) to be precise, even if this at first seems overly formal and even for common functions found in packages that are obtained when R is first downloaded. This practice is used in this addendum as a reminder of where these many functions reside. If needed, list the packages that are attached by default when an R session starts by keying:

R Input

base::getOption("defaultPackages")

For all other R-based functions found in external packages, be sure to download the appropriate package.
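Before downloading, it may help to see which external packages are already available. The following sketch only reports what is missing and does not install anything. The list in NeededPackages is an assumption based on a few of the packages named in this addendum, and the object names are hypothetical.

```r
NeededPackages <- c("modes", "psych", "doBy", "tables")

# TRUE for each package that can be loaded, FALSE otherwise
IsInstalled <- base::vapply(NeededPackages, base::requireNamespace,
  FUN.VALUE=base::logical(1), quietly=TRUE)

MissingPackages <- NeededPackages[!IsInstalled]
MissingPackages   # Packages still to be obtained, if any
```

Any package listed in MissingPackages can then be obtained with the install.packages() function, as shown throughout this addendum.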

R Input

# Descriptive Statistics of SBPNormal
base::summary(SBPNormal)
# Summary of the object variable
modes::modes(SBPNormal)
# Mode, or the most frequent value
stats::median(SBPNormal)
# Median, or the mid-point
base::mean(SBPNormal)
# Mean, or the arithmetic average

# The argument na.rm was not used since there are no

# missing data for the object variable SBPNormal. If

# there were missing data then this argument would be

# needed.
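The role of the na.rm argument can be seen with a small sketch using made-up readings. The second half of the sketch finds the mode of a short set of whole-number readings using base R only, which may be a useful fallback if the modes package is not available. All object names are hypothetical.

```r
ReadingsWithNA <- c(118, 120, NA, 122)

base::mean(ReadingsWithNA)                 # NA
base::mean(ReadingsWithNA, na.rm=TRUE)     # 120
stats::median(ReadingsWithNA, na.rm=TRUE)  # 120

# Mode of whole-number readings, using a frequency table
WholeReadings <- c(118, 120, 120, 121, 120, 119)
ReadingCounts <- base::table(WholeReadings)
base::names(ReadingCounts)[ReadingCounts == base::max(ReadingCounts)]
# "120"
```

A frequency table works well for whole numbers, but it is less informative for continuous data such as SBPNormal, where nearly every value occurs only once.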

Observe how functions from three separate packages were used to determine the three separate views toward average: mode, median, and mean. This observation serves as an example of why it is best to either use, or at least consider, a Package::Function naming system to keep current with functions and their use, certainly for functions gained from external packages.

Along with use of the base::mean() function, other R-based functions are available to investigate the concept of mean (e.g., arithmetic average) as a descriptive statistic of central tendency. A few examples immediately below should be more than sufficient to demonstrate the possible value of the trimmed mean, geometric mean, harmonic mean, and Winsorized mean.
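The definitions behind these alternative means can be sketched in base R using a small set of made-up readings with one deliberate outlier. The Winsorized calculation shown here is a simple percentile-based version written for illustration, and its result may differ in detail from the psych::winsor.mean() function. All object names are hypothetical.

```r
Readings <- c(110, 115, 120, 125, 130, 400)  # 400 is an outlier

ArithmeticMean <- base::mean(Readings)            # About 166.67
TrimmedMean    <- base::mean(Readings, trim=0.2)  # 122.5

# Geometric mean: exponent of the mean of the logs
GeometricMean  <- base::exp(base::mean(base::log(Readings)))

# Harmonic mean: reciprocal of the mean of the reciprocals
HarmonicMean   <- 1 / base::mean(1 / Readings)

# Winsorized mean: values beyond the 20th and 80th percentiles
# are replaced with those percentiles before averaging
Limits <- stats::quantile(Readings, probs=c(0.2, 0.8))
WinsorizedMean <- base::mean(
  base::pmin(base::pmax(Readings, Limits[[1]]), Limits[[2]]))
WinsorizedMean                                    # 122.5

base::c(Arithmetic=ArithmeticMean, Trimmed=TrimmedMean,
  Geometric=GeometricMean, Harmonic=HarmonicMean,
  Winsorized=WinsorizedMean)
```

Notice how the single outlier pulls the arithmetic mean well above the other four measures.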

R Input

base::mean(SBPNormal, trim=0.05)

# Trimmed mean, or the arithmetic average after

# removing 5 percent (i.e., trim=0.05) of the

# highest and lowest values

install.packages("psych", dependencies=TRUE)

library(psych) # Load the psych package.

help(package=psych) # Show the information page.

sessionInfo() # Confirm all attached packages.

psych::geometric.mean(SBPNormal)

# Geometric mean, used to limit the impact of

# extreme values

psych::harmonic.mean(SBPNormal)

# Harmonic mean, used to address the impact of

# outliers

psych::winsor.mean(SBPNormal, trim=0.05)

# Winsorized mean, where data at the ends are not

# so much trimmed as they are replaced with

# values that provide a robust estimate of

# central tendency, accommodating the potential

# undue influence of outliers
stats::var(SBPNormal)
# Variance
stats::sd(SBPNormal)
# Standard deviation
base::min(SBPNormal)
# Minimum value
base::which.min(SBPNormal)
# Minimum value location (e.g., row number)
base::max(SBPNormal)
# Maximum value
base::which.max(SBPNormal)
# Maximum value location (e.g., row number)
base::range(SBPNormal)
# Range of values, minimum to maximum
base::length(SBPNormal)
# Number of occurrences (e.g., N, datapoints)
utils::head(base::sort(SBPNormal))
# First few datapoints of SBPNormal, sorted
# The head() function is wrapped around the
# sort() function.
utils::tail(base::sort(SBPNormal))
# Last few datapoints of SBPNormal, sorted
base::sum(SBPNormal)
# Sum of all values
stats::quantile(SBPNormal)
# Quantile scores, 0% 25% 50% 75% 100%
stats::quantile(SBPNormal,
  probs=seq(0, 1, length=11), type=5)

# Use of stats::quantile() function to produce

# deciles, 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%

# 100%

stats::IQR(SBPNormal)

# Interquartile range, or a measure of dispersion

# between the 3rd quartile and the 1st quartile
stats::mad(SBPNormal)
# Median Absolute Deviation (MAD), or the median
# of the absolute deviations from the median,
# scaled by a constant (1.4826 by default)
# (compare the MAD statistic to the sd statistic)
grDevices::boxplot.stats(SBPNormal)
# Boxplot Statistics: Lower-Whisker, Lower-Hinge,
# Median, Upper-Hinge, and Upper-Whisker, N, and
# Outliers
# The boxplot.stats() function is included in the
# grDevices package, which is available when R is
# first downloaded.
# Outliers and descriptive statistics are printed.

stats::fivenum(SBPNormal)

# Tukey’s Five-Number Summary: Minimum,

# Lower-Hinge, Median, Upper-Hinge, and Maximum
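To follow up on the suggestion to compare the MAD statistic to the sd statistic, the sketch below places three measures of dispersion side by side for both object variables. The object name DispersionTable is hypothetical. The expected values in the comments are theoretical approximations, so results from the generated data should be close to them but will not match exactly.

```r
base::set.seed(8)
SBPNormal    <- stats::rnorm(100000, mean=120, sd=6)
SBPNotNormal <- stats::runif(100000, min=102, max=138)

DispersionTable <- base::rbind(
  SBPNormal    = c(SD=stats::sd(SBPNormal),
                   MAD=stats::mad(SBPNormal),
                   IQR=stats::IQR(SBPNormal)),
  SBPNotNormal = c(SD=stats::sd(SBPNotNormal),
                   MAD=stats::mad(SBPNotNormal),
                   IQR=stats::IQR(SBPNotNormal)))

base::round(DispersionTable, 2)
# For the normal data, SD and MAD should both be near 6 and
# the IQR near 8.09. For the uniform data, expect SD near
# 10.39, MAD near 13.34, and IQR near 18.
```

For the normal data the SD and MAD nearly agree, which is what the default scaling constant of the stats::mad() function is designed to achieve. For the uniform data they do not agree, which is one more sign that the data do not follow a normal distribution pattern.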

Many functions print to the screen one and only one measure of descriptive statistics and central tendency. However, there are external R packages that include functions where many different descriptive statistics are generated, all at the same time. From among the many possible selections, the following functions will be used to provide multiple statistics that may provide a broader understanding of the data:

• doBy::descStat()

• tables::tabular()

• pastecs::stat.desc()

• psych::describe()

• furniture::table1()

• RcmdrMisc::numSummary()

• epiDisplay::summ()

As a brief warning, the default screen output for some of these functions is quite verbose, and the screen output may show in horizontal (e.g., wide) format instead of vertical (e.g., long) format, making it difficult to copy and paste the output into an external word-processed document. Experiment with these functions and their many arguments to develop personal preferences, both for content and later manipulation of presentation.
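If vertical (e.g., long) output is preferred, one possible approach is to assemble a small dataframe of statistics using base R. This is only a sketch. DescribeLong() is a hypothetical helper written for this illustration, not a function from any of the packages listed above, and the choice of six statistics is an assumption.

```r
base::set.seed(8)
SBPNormal <- stats::rnorm(100000, mean=120, sd=6)

# Hypothetical helper: one row per statistic
DescribeLong <- function(x) {
  base::data.frame(
    Statistic = c("N", "Mean", "Median", "SD",
                  "Minimum", "Maximum"),
    Value = c(base::length(x), base::mean(x),
              stats::median(x), stats::sd(x),
              base::min(x), base::max(x)))
}

SBPNormalSummary <- DescribeLong(SBPNormal)
base::print(SBPNormalSummary, row.names=FALSE)
```

A dataframe in this form can be copied directly or written to a file for use in a word-processed document.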

R Input

install.packages("doBy", dependencies=TRUE)

library(doBy) # Load the doBy package.

help(package=doBy) # Show the information page.

sessionInfo() # Confirm all attached packages.

doBy::descStat(SBPNormal)

install.packages("tables", dependencies=TRUE)

library(tables) # Load the tables package.

From the document by Thomas W. MacFarland and Jan M. Yates (pages 128-170)