Thomas W. MacFarland and Jan M. Yates

Purpose of This Text

GUI—It is possible to download external R packages that support using R with a graphical user interface (GUI). IDE—It is also possible to download an Integrated Development Environment (IDE) designed specifically for using R.

Development of Biostatistics

Point-and-click menu selection is easy at first and has value as an early learning activity. Ultimately, however, relying only on menu choices is limiting and keeps R from being used to its full potential.

Development of R

How R is Used in This Text

The data follow, or at least approximate, the pattern of a normal distribution (e.g., a bell curve). The statistical tests are demonstrated under the assumption that the data either conform to or at least approximate a normal distribution.

Import Data Into R

Addendum 1: Efficient Programming with R, Project Workflow, and Good Programming Practices (gpp)

Note the use of functions such as date(), ls.str(), getwd(), setwd(), sessionInfo(), and others in the Housekeeping session. The .df extension is not required, but this convention leaves no doubt that the object is a data frame. Notice the consistent use of an Object$Variable naming scheme for objects such as GenEnd.df$Endurance, BreedMilk.df$MilkLb365, SoilYield.df$.
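The housekeeping calls named above can be sketched as follows. This is a minimal illustration, not the text's own script: the contents of GenEnd.df (the Genotype column and all values) are hypothetical and exist only to show the .df suffix and the Object$Variable scheme.

```r
# Housekeeping sketch: record the session context before any analysis.
date()          # current date and time as a character string
getwd()         # current working directory
# setwd("...")  # left commented out; the target folder is site-specific
sessionInfo()   # R version, platform, and attached packages

# Hypothetical data frame; the .df suffix is a naming convention only.
GenEnd.df <- data.frame(
  Genotype  = factor(c("A", "A", "B", "B")),  # hypothetical column
  Endurance = c(12.1, 13.4, 15.2, 14.8)       # hypothetical values
)

ls.str()                   # compact listing of the objects in the workspace
class(GenEnd.df)           # "data.frame"
mean(GenEnd.df$Endurance)  # Object$Variable access to a single column
```

Running these calls at the top of every script makes it easier to reproduce a session later, since the working directory and package versions are written into the output.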

Addendum 2: Preview of Descriptive Statistics and Graphics Using R

Much more discussion of descriptive statistics and measures of central tendency appears in later lessons throughout this text. Together, these features should provide a sense of the data, the average birth weight, the spread of birth weights, etc. However, to get a more complete understanding of the BirthWeightGrams object, focus on the various graphics shown below, all based on the use of the hist() function.
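As a rough sketch of this workflow, the code below summarizes a numeric vector and then draws it with hist(). The BirthWeightGrams values here are simulated (the seed, mean, and standard deviation are assumptions made for illustration) and do not reproduce the text's data.

```r
# Simulated stand-in for the BirthWeightGrams object (values are assumptions).
set.seed(25)
BirthWeightGrams <- round(rnorm(200, mean = 3400, sd = 450))

# Numeric sense of the data: center and spread.
summary(BirthWeightGrams)
sd(BirthWeightGrams)

# Default histogram, then a version with explicit bins and labels.
hist(BirthWeightGrams)
hist(BirthWeightGrams,
     breaks = 20,
     main   = "Birth Weight (Simulated)",
     xlab   = "Grams",
     col    = "grey80")
```

Changing only the breaks argument is often enough to reveal whether an apparent bump in the distribution is real or an artifact of bin width.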

Addendum 3: R and Beautiful Graphics

This addendum uses the ggplot2 package and supporting packages to produce a variety of figures from the data frames currently available. In a later lesson on correlation and association, this type of question will be explored further, and various statistical tests will be used to provide a more definitive answer. Statistical tests will also be used to address issues such as, for this data set, the difference in fuel economy between city driving and highway driving (e.g., City.MPG v Hwy.MPG).

Figure 1.11: ggplot2 demonstration 1—simple to complex

Addendum 4: Research Designs Used in Biostatistics

The Pretest-Posttest Control Group design is a somewhat more ambitious approach, where there are typically two groups, a control group and an experimental group, and both groups are measured before and after the treatment. The Posttest-Only Control Group design also typically uses two groups, a control group and an experimental group, but measurement occurs only after the treatment. Differences between the control group and the experimental group, if any, are attributed to the treatment.

Prepare to Exit, Save, and Later Retrieve This R Session

External Data and/or Data Resources Used in This Lesson

Use the data from these external sources to practice and replicate the results from this lesson. This lesson provides a demonstration of descriptive statistics, measures of central tendency, and graphical presentation of data, which are essential before performing inferential statistical analysis. Initial efforts should be placed on data exploration, and specifically on the use of descriptive statistics and measures of central tendency (e.g., mode, median, mean, and standard deviation).

Background

Quite often, when examining data and comparing data, it is useful to begin with summary statistics. It is common to present these statistics early in the research process to give the reader an overview of the data. Emphasis will be placed on using features found in the R packages obtained when R was first downloaded.

Import Data in Comma-Separated Values (.csv) File Format and/or Self-Generate the Data Using R-Based

All analyses are descriptive (i.e., they describe the data), not inferential (i.e., allowing inferences or judgments about differences between groups, associations between object variables, etc.). The CPIIISecLbsGen.df object will be a data frame, as indicated by the enumerated .df extension to the object name. Note the arguments used with the read.table() function, indicating that there is a header with descriptive variable names (header = TRUE) and that the fields are separated by a comma (the sep argument).
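A minimal sketch of such an import follows. To keep it self-contained, the comma-separated content is supplied inline through the text argument of read.table() instead of a file on disk; the four data rows are hypothetical, and stringsAsFactors = TRUE is an assumption about how the character columns should be stored.

```r
# Hypothetical comma-separated content standing in for the .csv file.
csv_text <- "Section,Lbs,Gender
AM,132,Female
AM,158,Male
PM,121,Female
PM,171,Male"

# header = TRUE: the first line holds the variable names.
# sep = ",":     the fields are separated by commas.
CPIIISecLbsGen.df <- read.table(text = csv_text,
                                header = TRUE,
                                sep = ",",
                                stringsAsFactors = TRUE)

str(CPIIISecLbsGen.df)   # confirm the column types after import
head(CPIIISecLbsGen.df)  # look at the first rows
```

With a real file, the text argument would be replaced by the file name passed as the first argument.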

Organize the Data and Display the Code Book

1] FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH [10] FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH [19] FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH [28] FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH [37] FALSCH SE FALSCH AR FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH [46] FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH [55] FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH FALSCH.
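A logical listing of this shape is what a missing-value check typically prints. Whether base::is.na() produced the listing above is an assumption; the sketch below only shows how such a check reads, using a short hypothetical vector.

```r
# Hypothetical weights with no missing values.
Lbs <- c(132, 158, 121, 171, 149)

is.na(Lbs)       # one logical per element; FALSE means the value is present
anyNA(Lbs)       # single logical: is any value missing?
sum(is.na(Lbs))  # count of missing values
```

For long vectors, the single-value forms are easier to scan than the full element-by-element listing.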

Conduct a Visual Data Check Using Graphics (e.g., Figures)

As an early introduction to the ggplot2::ggplot() function and how facets are used to produce breakout details in graphical format, look at the density of the numeric object variable Lbs for each of the two factor-type object variables (e.g., Section and Gender). Now that the figures (for example, DensityFacetSectionLbs, DensityFacetGenderLbs, BoxplotSectionLbs, and BoxplotGenderLbs) have been generated as separate objects, place these four objects into one common figure, using the gridExtra::grid.arrange() function (Fig. 2.3). Embellishments of the images will be introduced in later lessons by demonstrating the many arguments used to present titles, prepare text and lines in bold and color, etc.
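One way the four named objects might be built and combined is sketched below. The data are simulated, the geoms and facet layout are assumptions, and the plotting code runs only if the ggplot2 and gridExtra packages are installed; the text's own figures may use different arguments.

```r
# Simulated stand-in for CPIIISecLbsGen.df (values are assumptions).
set.seed(57)
CPIIISecLbsGen.df <- data.frame(
  Section = factor(rep(c("AM", "PM"), each = 30)),
  Gender  = factor(rep(c("Female", "Male"), times = 30)),
  Lbs     = round(rnorm(60, mean = 150, sd = 20))
)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gridExtra", quietly = TRUE)) {
  # Density of Lbs, one facet row per level of the factor.
  DensityFacetSectionLbs <-
    ggplot2::ggplot(CPIIISecLbsGen.df, ggplot2::aes(x = Lbs)) +
    ggplot2::geom_density() +
    ggplot2::facet_grid(Section ~ .)
  DensityFacetGenderLbs <-
    ggplot2::ggplot(CPIIISecLbsGen.df, ggplot2::aes(x = Lbs)) +
    ggplot2::geom_density() +
    ggplot2::facet_grid(Gender ~ .)

  # Boxplots of Lbs by each factor.
  BoxplotSectionLbs <-
    ggplot2::ggplot(CPIIISecLbsGen.df, ggplot2::aes(x = Section, y = Lbs)) +
    ggplot2::geom_boxplot()
  BoxplotGenderLbs <-
    ggplot2::ggplot(CPIIISecLbsGen.df, ggplot2::aes(x = Gender, y = Lbs)) +
    ggplot2::geom_boxplot()

  # Place the four separate objects into one common figure (2 x 2 grid).
  gridExtra::grid.arrange(DensityFacetSectionLbs, DensityFacetGenderLbs,
                          BoxplotSectionLbs, BoxplotGenderLbs,
                          ncol = 2)
}
```

Saving each plot as its own object first keeps the individual figures available for reuse, while grid.arrange() handles only the layout.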

Figure 2.1: Multiple visualization of weight

Descriptive Statistics for Initial Analysis of the Data

Although descriptive statistics and measures of central tendency are necessary to understand the data, they are presented at the single level – for all values of CPIIISecLbsGen.df$Lbs.
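To move from the single level to breakouts, the same statistics can be computed by group. The sketch below uses base R only; the six rows are hypothetical and the MeanBySection name is introduced here for illustration.

```r
# Hypothetical rows standing in for CPIIISecLbsGen.df.
CPIIISecLbsGen.df <- data.frame(
  Section = factor(c("AM", "AM", "AM", "PM", "PM", "PM")),
  Gender  = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  Lbs     = c(130, 160, 125, 175, 120, 168)
)

# Single level: all values of Lbs together.
summary(CPIIISecLbsGen.df$Lbs)

# Breakout by one factor: mean of Lbs for each Section.
MeanBySection <- tapply(CPIIISecLbsGen.df$Lbs,
                        CPIIISecLbsGen.df$Section,
                        mean)
MeanBySection

# Breakout by two factors at once.
aggregate(Lbs ~ Section + Gender, data = CPIIISecLbsGen.df, FUN = mean)
```

Swapping mean for sd, median, or length in either call gives the matching breakout for that statistic.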

Quality Assurance, Data Distribution, and Tests for Normality

Is it possible that the weight of AM students follows a normal distribution pattern regardless of the weight distribution pattern of PM students? Is it possible that the weight of PM students follows a normal distribution pattern regardless of the weight distribution pattern of AM students? Overall, the data for CPIIISecLbsGen.df$Lbs does not follow a pattern of normal distribution (overall calculated p-value).
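Questions like these can be examined by running a normality test overall and then separately for each breakout group. The sketch below uses the Shapiro-Wilk test from the stats package; that choice of test is an assumption, since the text above does not name the one it used, and the data are simulated.

```r
# Simulated stand-in data (values are assumptions).
set.seed(67)
CPIIISecLbsGen.df <- data.frame(
  Section = factor(rep(c("AM", "PM"), each = 30)),
  Lbs     = c(rnorm(30, mean = 150, sd = 15),
              rnorm(30, mean = 155, sd = 25))
)

# Overall test: all values of Lbs together.
shapiro.test(CPIIISecLbsGen.df$Lbs)

# Breakout test: one p-value per Section.
ShapiroBySection <- tapply(CPIIISecLbsGen.df$Lbs,
                           CPIIISecLbsGen.df$Section,
                           function(x) shapiro.test(x)$p.value)
ShapiroBySection

# Visual companion: normal Q-Q plot with a reference line.
qqnorm(CPIIISecLbsGen.df$Lbs)
qqline(CPIIISecLbsGen.df$Lbs)
```

A p-value at or below the chosen significance level suggests departure from normality for that group; the overall and per-group results need not agree.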

Figure 2.4: Weight Q-Q plot breakouts by section and gender

Statistical Test(s)

When looking at normality by section, students in the morning section (AM section, calculated p-value) had weights that followed a normal distribution, while students in the afternoon section (PM section, calculated p-value) had weights that did not follow a normal distribution. When looking at normality by gender, both female students (calculated p-value = 0.9367) and male students (calculated p-value = 0.9261) had weights that followed a normal distribution. Looking ahead, should a useful test of difference such as Student's t-test for independent samples be avoided simply because a few students in the afternoon course section may have affected normality?

Summary

The emphasis in this lesson has been on data exploration, descriptive statistics, and measures of central tendency. Of course, no baseline information was provided at the front of this lesson to show whether the sample, drawn from two sections of computer programming courses taught at a specific Florida-based high school, was representative of the students' peers in the high school, county school district, state, etc. Consequently, the statistics and figures in this lesson are useful, but their production should be seen as just the beginning of the research process.

Addendum 1: Specialized External Packages and Functions

Issues related to the overall research process must consider representation, but the research process must also be balanced against the costs of data collection. Some epiDisplay functions provide not only attractive graphics but also useful statistics in text format printed on the screen (Figs. 2.5, 2.6, and 2.7). It also generates text-based descriptive statistics for the screen that add further insight into the data (Figs. 2.8 and 2.9).

Figure 2.5: Section and gender: frequency distribution overall

Addendum 2: Parametric v Nonparametric

Addendum 3: Additional Practice Datasets for Data with Normal Distribution Patterns and Data That

The base::summary() function provides descriptive statistics and measures of central tendency for the numeric object variable SBPGenderRaceEthnic.df$SBP. The base::summary() function also provides insight into the frequency distribution for the two factor-type object variables, SBPGenderRaceEthnic.df$Gender2 and SBPGenderRaceEthnic.df$RaceEthnic2. Create the Gen object, which represents a named number object built from the names and values for SBPGenderRaceEthnic.df$Gender2 (Fig. 2.12).
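A short sketch of this step is given below. The six rows and the RaceEthnic2 labels are hypothetical, and building Gen with summary() on the factor is one plausible way to obtain a named number object; the text may construct it differently.

```r
# Hypothetical rows standing in for SBPGenderRaceEthnic.df.
SBPGenderRaceEthnic.df <- data.frame(
  SBP         = c(118, 124, 131, 109, 142, 127),
  Gender2     = factor(c("Female", "Male", "Female",
                         "Male", "Female", "Male")),
  RaceEthnic2 = factor(c("GroupA", "GroupB", "GroupA",
                         "GroupB", "GroupA", "GroupB"))  # hypothetical labels
)

# Numeric columns get quartiles and the mean; factor columns get counts.
summary(SBPGenderRaceEthnic.df)

# summary() on a factor returns a named integer vector of counts.
Gen <- summary(SBPGenderRaceEthnic.df$Gender2)
Gen
names(Gen)    # the factor levels

# A named vector can be passed directly to barplot().
barplot(Gen, main = "Gender2 Counts (Hypothetical Data)")
```

Because the names travel with the values, the bar labels in the plot come from the factor levels without any extra arguments.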

Consider the numeric object variable SBPGenderRaceEthnic.df$SBP, which has 30 or more divisions when treated as a factor-type object variable using the base::factor() function. A simple barplot of SBPGenderRaceEthnic.df$SBP.factor breakouts using the ggplot2 package is a reasonable way to approach the best use of this newly enumerated object variable.

Figure 2.10: Multiple standard deviations with the same mean

Prepare to Exit, Save, and Later Retrieve This R Session

External Data and/or Data Resources Used in This Lesson

This lesson provides an introduction to inquiries about differences between groups, specifically using Student's t-test for independent samples. Student's t-test is an appropriate test for comparing differences between small samples, usually 30 or fewer subjects. However, Student's t-test for independent samples is also often used with larger samples.
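As a preview, Student's t-test for independent samples can be run with stats::t.test(). The two groups below are simulated (group names, sample sizes, means, and standard deviations are all assumptions), so the resulting statistics illustrate the call and not any finding from this lesson.

```r
# Two simulated independent samples of 15 (values are assumptions).
set.seed(99)
GroupA <- rnorm(15, mean = 3.7, sd = 0.3)
GroupB <- rnorm(15, mean = 4.9, sd = 0.3)

# Student's t-test: var.equal = TRUE pools the two variances.
StudentTTest <- t.test(GroupA, GroupB, var.equal = TRUE)
StudentTTest

# R's default (var.equal = FALSE) is the Welch variant instead.
WelchTTest <- t.test(GroupA, GroupB)

StudentTTest$parameter  # degrees of freedom: n1 + n2 - 2
StudentTTest$p.value    # calculated p-value
```

The distinction matters because calling t.test() without var.equal = TRUE does not run the classic Student's test; it runs Welch's test, which does not assume equal variances.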

Background

Create an object called MilkBreedFatProt.df that is a data frame, as indicated by the enumerated .df extension of the object name. This R-based object is a data frame and it consists of the data originally included in the file MilkBreedButterfatProtein.csv, a comma-separated .csv file. To avoid possible conflicts, make sure there are no previous R-based objects named MilkBreedFatProt.df.

Organize the Data and Display the Code Book

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

A set of simple R-based actions can easily: (1) transform (e.g., recode) the object variable MilkBreedFatProt.df$Breed into a new object variable, (2) change the recoded object variable from the original integer format to factor format, and (3) apply English text labels for the otherwise cryptic numeric codes (e.g., 1 and 2). The object variable MilkBreedFatProt.df$Breed.recode was therefore created by putting the object variable MilkBreedFatProt.df$Breed in factor format, as opposed to the original integer-type use of the codes 1 and 2.
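The three actions can be carried out in a single call to base::factor(), as sketched below. The five rows are hypothetical, and the mapping of 1 to Holstein and 2 to Jersey is an assumption made for illustration; the code book is the authority on which code belongs to which breed.

```r
# Hypothetical rows; Breed is stored as integer codes 1 and 2.
MilkBreedFatProt.df <- data.frame(Breed = c(1L, 2L, 2L, 1L, 2L))

# Recode into a new variable, convert to factor, and attach text labels.
# ASSUMPTION: 1 = Holstein and 2 = Jersey.
MilkBreedFatProt.df$Breed.recode <- factor(MilkBreedFatProt.df$Breed,
                                           levels = c(1, 2),
                                           labels = c("Holstein", "Jersey"))

class(MilkBreedFatProt.df$Breed)         # "integer": original left untouched
class(MilkBreedFatProt.df$Breed.recode)  # "factor"
table(MilkBreedFatProt.df$Breed.recode)  # counts by labeled level
```

Writing the result to a new .recode variable, and leaving the original column in place, makes it possible to check the recoding against the raw codes at any time.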

Conduct a Visual Data Check Using Graphics (e.g., Figures)

HistogramPctButterfatBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df, ... "Percent Butterfat Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey")
HistogramPctProteinBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df, ... "Percent Protein Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey")
BoxplotPctProteinBreed.recode <- ggplot2::ggplot(MilkBreedFatProt.df, ... "Percent Protein Produced by Holstein and Jersey Dairy Cows by Breed: Holstein v Jersey")

Figure 3.1: Distribution of breed by count—1 Numeric-Type Graphics of PctButterfat and PctProtein

Descriptive Statistics for Initial Analysis of the Data

Instead, consider functions from packages that provide a variety of descriptive statistics as one attractive and compact screen printout that allows easy copying and pasting into a word-processed document if needed. This option will generate frequency distribution statistics on the screen, but without an accompanying image. Again, look at the epiDisplay::summ() function as another early choice to fully understand the data, as this function provides full descriptive statistics, descriptive statistics by breakouts, and a dot plot of the data distribution by breakouts.