
How R is Used in This Text


Each biologically-oriented problem addressed in this text is approached in a manner that promotes consistency, modularity, and ease of syntax reuse for later analyses. The structure of how R is used for each problem follows:

• Background: Information is usually provided, at least in part, about the data.

– Description of the Data: Information about the research process, methods, etc., is provided to give context for the data: how the data were obtained, what the data represent, expected minimum and maximum values for the data, etc. There are examples in this text, however, where there is purposely only a limited amount of information about the data, reflecting how some biostatisticians are tasked with black-box analyses, where they are purposely given only limited detail so as to minimize the possible introduction of bias.

– Null Hypothesis (Ho): When the structure of the research process and the resulting data allow, a Null Hypothesis (Ho) is provided to give a sense of the expected analyses (e.g., a Null Hypothesis that addresses difference, a Null Hypothesis that addresses association, etc.) and how to interpret the meaning of resulting analyses.

• Import Data in Comma-Separated Values (.csv) File Format:

With only a few exceptions for demonstration purposes, most data used in this text are prepared in .csv (comma-separated values) file format.5 Various R functions are then used to import the .csv dataset into R,6 as sketched below. The various .csv datasets are made available at the publisher’s Web-based resource associated with this text.
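
To give a sense of this step, the following is a minimal sketch of a .csv import using the base R read.table() function; the file name (lesson01.csv) and the object name (lesson01.df) are hypothetical, chosen here for illustration only.

# Minimal import sketch; file and object names are hypothetical
lesson01.df <- read.table(file = "lesson01.csv",
  header = TRUE,            # First row contains variable names
  sep = ",",                # Fields are separated by commas
  stringsAsFactors = TRUE)  # Treat text fields as factors
head(lesson01.df)           # View the first few rows
str(lesson01.df)            # Review structure and data types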

5 Multiple terms are used along with comma-separated values to signify the .csv file format, including Comma-Separated Values, Comma Separated Values, Comma-separated values, etc. There is some variance in this text in the term used to identify the .csv file format, recognizing these many terms for the same construct.

6 The .csv (comma-separated values) file format is nearly universal for the way data are put into electronic format and then shared with others. The data for each record (e.g., case, subject, etc.) are typically placed on one line (e.g., row), with commas used to separate the fields (e.g., columns). Open a .csv file in a text editor, not a spreadsheet, to see how commas appear as field separators. A .csv file is easily managed and can be opened with most, if not all, text editors or spreadsheets, whether the software is proprietary or freeware. Because of this simplicity, .csv files can be easily shared with others across multiple platforms, operating systems, and data management software. R is by no means restricted to the use of .csv file format datasets, but this is more than likely the most common file format.


• Organize the Data and Display the Code Book: A Code Book is created by the programmer and is used to communicate data organization, both for personal use and to accommodate others with whom the data are shared. Ideally, a Code Book provides an adequate description of all data, including data contents, data layout, data structure, data types (e.g., nominal, ordinal, interval, numeric, logical, text, etc.). It is also common to provide expected minimum and maximum values for numeric data, to give a sense of data that are in-range as opposed to data that are either outliers or incorrect data entries. Ideally, the Code Book should be of sufficient detail so that personal acquaintance and recall are not needed to use the data effectively.7
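
As a hedged illustration of this practice, a Code Book is often embedded directly in the R syntax as a block of comments; the variable names and expected ranges below are hypothetical and serve only to show the general form.

#############################################################
# Code Book (hypothetical variables, for illustration only)
#
# Subject ..... Unique subject identifier (text)
# Gender ...... Factor: Female or Male (nominal)
# Weight.Kg ... Body weight in kilograms (numeric)
#               Expected range: 40 to 150
# SBP.mmHg .... Systolic blood pressure in mmHg (numeric)
#               Expected range: 80 to 200
#############################################################
str(lesson01.df)  # Compare the actual structure to the Code Book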

• Conduct a Visual Data Check Using Graphics (e.g., Figures):

Most analyses in this text are reviewed first by using simple graphics (e.g., figures) to provide a visual data check. Graphics used in this manner serve many purposes, but their main use early in the statistical analysis process is to provide a sense of the data, the range of values, and ultimately a glimpse of direction and possible outcomes. Later, as needed and judged appropriate, the simple graphics are embellished to prepare highly detailed figures, many of which are appropriate for professional publication.
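
A minimal sketch of such a visual data check appears below, again assuming the hypothetical lesson01.df object and Weight.Kg variable introduced earlier.

# Simple throwaway graphics for a first visual data check
par(ask = TRUE)  # Pause between figures so each can be reviewed
hist(lesson01.df$Weight.Kg,
  main = "Visual Data Check: Weight (Kg)",
  xlab = "Weight (Kg)")
boxplot(lesson01.df$Weight.Kg,
  main = "Visual Data Check: Weight (Kg)")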

• Descriptive Statistics for Initial Analysis of the Data: A beginning activity for all analyses involving numeric data (e.g., weight in Kg or Lb, systolic blood pressure in mmHg, low-density lipoproteins in mg/dL), from simple to complex, is to provide a full understanding of the data.

For numeric data, this activity is achieved by preparing simple descriptive statistics that reflect measures of central tendency (e.g., mode, mean, median) and dispersion (e.g., standard deviation, minimum, maximum). For data that involve headcounts, it is common to prepare simple descriptive statistics that reflect frequency distributions (e.g., N and percent of total by breakout groups, Female v Male, Alive v Dead, etc.). Descriptive statistics are, collectively, one of many tools used to gain a complete understanding of the data, an essential task before any attempt is made at more complex statistical analyses.
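
A brief sketch of these descriptive statistics follows, assuming the same hypothetical lesson01.df object with a numeric Weight.Kg variable and a Gender breakout factor.

# Measures of central tendency and dispersion for numeric data
summary(lesson01.df$Weight.Kg)               # Min, quartiles, mean, max
sd(lesson01.df$Weight.Kg, na.rm = TRUE)      # Standard deviation
# Frequency distributions for headcount data
table(lesson01.df$Gender)                    # N by breakout group
prop.table(table(lesson01.df$Gender)) * 100  # Percent of total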


7 It is common to provide sufficient detail in a Code Book so that a dataset can be revisited 1 year or more after last use and the Code Book will provide all prompts needed to quickly understand the data.

• Quality Assurance, Data Distribution, and Tests for Normality:

A variety of quality assurance approaches are used in this text, with the selected process based on need and opportunity. In some cases, when warranted, a precise statistical test is used to provide a sense of quality assurance. In other cases, when the data do not allow such precision, simple descriptive statistics alone are used to provide a gauge of data quality. Graphics, from simple throwaway black-and-white figures to highly detailed embellished figures, are also frequently used to serve the quality assurance process. As an adjunct activity closely linked to the quality assurance process, it is common to examine data for distribution patterns.

Do the data follow, or at least approximate, a normal distribution (e.g., a bell-shaped curve)? Or do the data instead fail to show a discernible pattern of normal distribution? This issue is important for the selection of the many available statistical tests and the underlying assumptions associated with each test. Some statistical tests assume that the data for the variable in question follow a normal distribution; these are tests based on parametric data. If this assumption of normal distribution is not met, then other tests may be more appropriate; these are tests based on nonparametric data. There are more than a few tests for normality, and they are demonstrated in various lessons throughout this text. The key thing to remember about quality assurance is that it should be pervasive throughout the research and statistical analysis process. Quality assurance is never a one-and-done process.
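
As one brief example among the many tests for normality demonstrated in later lessons, the sketch below applies the Shapiro-Wilk test from base R to the hypothetical Weight.Kg variable.

# Shapiro-Wilk test; the Null Hypothesis is that the data
# follow a normal distribution
shapiro.test(lesson01.df$Weight.Kg)
# A p-value <= 0.05 suggests departure from normality and
# points toward a nonparametric approach
qqnorm(lesson01.df$Weight.Kg)  # Visual check against normality
qqline(lesson01.df$Weight.Kg)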

• Statistical Test(s): A few of the leading statistical tests are demonstrated in this text, both tests that assume the data follow a normal distribution (e.g., parametric tests) and tests for data that fail to meet this assumption (e.g., nonparametric tests). The statistical tests are typically demonstrated under the assumption that the data either meet or at least approximate normal distribution. In many cases, however, an addendum is provided that challenges the assumption of normality, and a nonparametric approach is instead used for statistical analysis.
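
A minimal sketch of this parametric-nonparametric pairing follows, assuming the hypothetical lesson01.df object with a two-level Gender factor; the Student's t-Test and its Mann-Whitney-Wilcoxon counterpart are used here only as representative examples.

# Parametric test, assuming normal distribution
t.test(Weight.Kg ~ Gender, data = lesson01.df)
# Nonparametric counterpart, when normality is not assumed
wilcox.test(Weight.Kg ~ Gender, data = lesson01.df)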

• Summary of Outcomes: It is only too common for many students and beginning researchers to have some degree of difficulty understanding output from the many available statistical tests, whether R or some other software package is used. Recognizing this concern, a summary of outcomes for each statistical test is provided as a guide for interpretation as the previously stated Null Hypothesis is addressed.

• Addenda: Each chapter ends with multiple addenda, providing an opportunity to either introduce or reinforce important topics and associated functions that go far beyond the basics of what could otherwise be found in a standard R-based documentation page:

– In many cases, the addenda either introduce or reinforce concepts, packages, functions, function arguments, etc., in greater detail than what was presented earlier, often using different data—all to give a new perspective.

– There is a great deal of attention to the concept of parametric data, as well as to its converse, nonparametric data, and the related issues of normality. Too many texts give short shrift to scrutiny of data distribution patterns or, worse, totally ignore this assumption inherent to the correct use of many inferential tests.

– Practice datasets are also part of the addenda, often demonstrating one dataset with data that follow a normal distribution pattern and another with data that do not. Students and beginning researchers benefit from the reminder that data are not always neat and pretty and that multiple approaches to statistical analysis have merit.

• Prepare to Exit, Save, and Later Retrieve this R Session: As a good programming practice, it is best to prepare all R syntax in a separate file, using a text editor. This practice makes it easy to save and reuse the prepared syntax, either for later use or as a template for future R-based analyses. It is also a good programming practice to save active R sessions, again for the purpose of facilitating later reuse. Clear instructions are provided on how to execute a graceful exit from the R session and to save the session for later reuse.
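
A minimal sketch of a graceful exit follows; the file names are hypothetical, and the exact workflow will vary by operating system and R interface.

save.image(file = "lesson01.RData")      # Save the workspace
savehistory(file = "lesson01.Rhistory")  # Save the command history
q(save = "no")                           # Quit; workspace saved above
# Later, retrieve the saved session:
# load(file = "lesson01.RData")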

Again, small, easy-to-follow, confidence-building examples are used at the beginning of this text. Greater complexity is gradually introduced through the final lessons, where R is used in a fairly robust manner and large, complex datasets are used to introduce and reinforce the skills needed for independent statistical analyses that support research efforts.
