Second Edition

Most of the chapters in the Second Edition are similar in content to the chapters in the First Edition. In the First Edition and again in the Second Edition, the code and data for all the examples and figures in the book are available for download.

The HH Package in R

The R language is an extremely well-developed tool for statistical research and analysis, that is, for exploring and designing new techniques of analysis as well as for carrying out the analysis itself. The implementation of lattice graphics in R's lattice package is particularly powerful for statistical graphics, the output of data analysis through which raw data and results are displayed to the analyst and client.

S-Plus, now called S+

I decided to leave most of the SAS discussions and examples out of the body of the second edition. The now standard terminology introduced by SAS, notably the notation for sum-of-squares "types" described in Section 13.6, is listed and described.

5 Chapters in the Second Edition

Revised Chapters

The older version of HH (Heiberger, 2009), designed for the first edition of this book, continues to work with S+. The examples now center on mosaic graphics, using the vcd package that was not available when the first edition was written.

Revised Appendices

The discussion focuses on the Adverse Effects dotplot, demonstrating how multi-panel graphical displays can replace pages of tabular data. The discussion is based on the work I participated in during my research leave at GSK (Amit et al., 2008).

6 Exercises

Rating scales such as Likert scales and semantic differential scales are very common in marketing research, customer satisfaction studies, psychometrics, opinion polls, population studies, and many other fields.

Acknowledgments: First Edition

Acknowledgments

We are also grateful to Insightful Corp. for providing us with current copies of the S-Plus software for us and our student, and to the many experts who reviewed portions of early drafts of this manuscript.

Author Bios

Introduction and Motivation

Statistics in Context

The statistician is an expert in designing the data collection procedures and in calculating and displaying the results of statistical analyses. The statistician helps the client formulate the question(s) to be answered in a way that leads to sensible data collection and is suitable for statistical analysis.

Examples of Uses of Statistics

Investigation of Salary Discrimination
Measuring Body Fat
Minimizing Film Thickness
Surveys
Bringing Pharmaceutical Products to Market

Our analysis in Chapter 9 demonstrates that essentially all of the body fat information in the 15 other measurements can be captured by just three of these other measurements. We develop a regression model of body fat as a function of these three measurements, and then examine how well these three inexpensive measurements alone can estimate body fat.

The Rest of the Book

Fundamentals
Linear Models
Other Techniques
New Graphical Display Techniques
Appendices on Software
Appendices on Mathematics and Probability
Appendices on Statistical Analysis and Writing

Chapter 5 introduces some of the basic inference techniques used throughout the rest of the book. We show the algebra and pictures for finding the center and spread of the distributions.

Data and Statistics

Types of Data
Data Display and Calculation

Presentation
Rounding

Importing Data

Datasets for This Book
Other Data Sources

Analysis with Missing Values
Data Rearrangement
Tables and Graphs
R Code Files for Statistical Analysis and Data Display (HH)

All of the datasets mentioned in the book are available in the HH package for R. Many of the graphs were created using functions that are included and fully documented in the HH package.

Table 2.1 shows two tables with identical numerical information. The first is legible because it follows both principles; the second is not because it doesn’t.

2.A Appendix: Missing Values in R

The entire row containing the missing values is removed from the analysis and subsequent processing.

Table 2.4 The argument na.strings defines strings "999" and "." to indicate missing values.
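The idea behind na.strings can be sketched outside R as well. The following minimal Python example is an illustration only, not code from the book: it reads a small CSV text, treats the strings "999" and "." as missing, and then drops every row that contains a missing value. The helper name read_with_na and the sample data are hypothetical.

```python
import csv
import io

NA_STRINGS = {"999", "."}  # strings treated as missing, mirroring na.strings


def read_with_na(text, na_strings=NA_STRINGS):
    """Parse CSV text into a list of dicts, mapping missing-value codes to None."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({key: (None if value in na_strings else value)
                     for key, value in record.items()})
    return rows


# Hypothetical data, invented for illustration only.
sample = "x,y\n1,2\n999,4\n5,.\n"
rows = read_with_na(sample)

# Row-wise deletion: keep only the rows with no missing value.
complete = [r for r in rows if None not in r.values()]
print(rows)
print(complete)
```

Values are kept as strings here; a real import step would also convert the non-missing entries to numbers.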

Statistics Concepts

A Brief Introduction to Probability

Another method is to recognize that the "at least one white" event can be divided into three mutually exclusive events: white first and red second; red first and white second; and white on both draws. The probability of "at least one white" is then the sum of the probabilities of the events that make up this division.
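The decomposition can be checked numerically. The sketch below assumes the box of six white and four red balls described in Table 3.1 and assumes two draws without replacement; the function name p_two_draws is hypothetical. It sums the three mutually exclusive events and cross-checks the result against the complement, 1 − P(both red).

```python
from fractions import Fraction

WHITE, RED = 6, 4          # box contents, as in the Table 3.1 setting
TOTAL = WHITE + RED


def p_two_draws(first, second):
    """Probability of a color sequence ("W" or "R") in two draws without replacement."""
    counts = {"W": WHITE, "R": RED}
    p_first = Fraction(counts[first], TOTAL)
    counts[first] -= 1                      # one ball of that color is gone
    p_second = Fraction(counts[second], TOTAL - 1)
    return p_first * p_second


# Three mutually exclusive events that make up "at least one white".
p_at_least_one_white = (p_two_draws("W", "R")
                        + p_two_draws("R", "W")
                        + p_two_draws("W", "W"))

# Cross-check with the complement of "both red".
p_by_complement = 1 - p_two_draws("R", "R")
print(p_at_least_one_white, p_by_complement)
```

Exact fractions are used so that the two routes can be compared for equality rather than up to rounding.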

Random Variables and Probability Distributions

Discrete Versus Continuous Probability Distributions
Displaying Probability Distributions—Discrete Distributions
Displaying Probability Distributions—Continuous Distributions

The A event is "the first ball is white" and the B event is "the second ball is white". But if only those dates with measurable precipitation are considered, Y is continuous, that is, the distribution of (Y | Y > 0) is continuous.

Table 3.1 Probability of intersection events, conditional events, union events in the setting of a box containing six white and four red billiard balls

Concepts That Are Used When Discussing Distributions

Expectation and Variance of Random Variables
Median of Random Variables
Symmetric and Skewed Distributions
Displays of Univariate Data

Histogram
Stem-and-Leaf Display

Approximately 25% of the sample lies within each of the four intervals (the intervals are closed at both ends, so double counting is possible). For example, adding notches to the sides of the box provides information about the variability of the sample median.

Fig. 3.3 Negatively skewed, symmetric, and positively skewed distributions.

Quartiles of F(3,36)

Multivariate Distributions—Covariance and Correlation

The mean or expectation of X is $\mu = (\mu_1, \mu_2, \ldots, \mu_p)'$, the vector of means of the univariate distributions of the $X_i$'s. This is actually one panel from the set of rotated views of the density shown in Figure 3.10.

Fig. 3.8 Bivariate Normal distribution—scatterplot at various correlations. The distributions in the panels are related

Three Probability Distributions

The Binomial Distribution
The Normal Distribution
The (Student’s) t Distribution

If a random sample is drawn with replacement from a population in which a proportion p of the items are successes, then the number of successes in the sample is binomially distributed. Use the location of the critical value $w_c = \bar{x}_c$ on the graph and in the table below the graph to see that the critical value for the α = .05 test is moving away from the null hypothesis value $\mu_0$ as df gets larger.

Fig. 3.11 We show the discrete density for the binomial with n = 15 and p = .4, underlaid with the normal approximation with μ = np = 15 × .4 = 6 and σ =
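The comparison in this figure can be reproduced numerically. The sketch below is a Python illustration rather than the book's code: it tabulates the binomial probabilities for n = 15 and p = .4 next to the mass the approximating normal assigns to each integer. The standard deviation is computed with the standard binomial formula sqrt(np(1 − p)), and the continuity correction (k ± 0.5) is an assumption about how the two are best compared.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 15, 0.4
mu = n * p                       # mean of the binomial, np
sigma = sqrt(n * p * (1 - p))    # standard deviation of the binomial
approx = NormalDist(mu, sigma)   # approximating normal distribution


def binom_pmf(k):
    """Exact binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


for k in range(n + 1):
    # Normal mass on the interval (k - 0.5, k + 0.5), i.e. with continuity correction.
    normal_mass = approx.cdf(k + 0.5) - approx.cdf(k - 0.5)
    print(f"{k:2d}  {binom_pmf(k):.4f}  {normal_mass:.4f}")
```

The two columns agree most closely near the mean and differ most in the tails, which is where the discreteness and skewness of the binomial matter.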

Sampling Distributions

In the n = 4 panel, each mean is the average of 4 observations, and the means lie only in the central half of the panel, with $\sigma^2_{\bar{x}} = 5^2/4$. In the n = 16 panel, each mean is the average of 16 observations, and the means are distributed only in the central quarter of the panel, with $\sigma^2_{\bar{x}} = 5^2/16$.

Fig. 3.15 Each panel shows 10 sets of n observations from the $N(\mu = 100, \sigma^2 = 5^2)$ distribution.
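The shrinking spread of the means follows from the variance of a sample mean, σ²/n. The short Python sketch below evaluates it for the population in this figure; the function names are hypothetical, and n = 1 is included only as a reference point for single observations.

```python
from math import sqrt

SIGMA = 5.0  # population standard deviation of N(mu = 100, sigma^2 = 5^2)


def var_of_mean(sigma, n):
    """Variance of the mean of n independent observations with variance sigma^2."""
    return sigma**2 / n


def sd_of_mean(sigma, n):
    """Standard deviation (standard error) of the sample mean."""
    return sqrt(var_of_mean(sigma, n))


for n in (1, 4, 16):
    print(f"n = {n:2d}: variance = {var_of_mean(SIGMA, n):.4f}, "
          f"sd = {sd_of_mean(SIGMA, n):.2f}")
```

Quadrupling n halves the standard deviation of the mean, which matches the "central half" and "central quarter" descriptions of the panels.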

Estimation

Statistical Models
Point and Interval Estimators
Criteria for Point Estimators
Confidence Interval Estimation
Example—Confidence Interval on the Mean µ of a Population Having Known Standard Deviation
Example—One-Sided Confidence Intervals

Estimation is one of two broad categories of statistical techniques used for this purpose. The sample mean $\bar{x}$ is an unbiased estimator of the population mean μ, and the sample variance $s^2$ is an unbiased estimator of the population variance $\sigma^2$.

Fig. 3.17 Confidence interval plot for the t distribution with $n = 25$, $\bar{x} = 8.5$, $\nu = 24$, $s^2 = 4$, $\alpha = .05$

Hypothesis Testing

By prespecifying α, the procedure maintains better control over Type I error than over Type II error. If the test statistic is on one side of the critical value, H0 is retained; if it is on the other side, H0 is rejected.

Fig. 3.19 One-sided confidence interval plot for the normal distribution with $n = 25$, $\bar{x} = 8.5$, $\sigma^2 = 4$, $\alpha = .05$
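The bounds behind such a plot can be computed directly from the values in the caption. The Python sketch below is an illustration, not the book's code. Because the caption does not say which side of the interval is shown, both the lower and the upper one-sided bound are computed; the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, sigma2, alpha = 25, 8.5, 4.0, 0.05   # values from the figure caption

se = sqrt(sigma2 / n)                          # standard error of the mean
z_alpha = NormalDist().inv_cdf(1 - alpha)      # upper-alpha standard normal quantile

lower_bound = xbar - z_alpha * se   # interval (lower_bound, infinity)
upper_bound = xbar + z_alpha * se   # interval (-infinity, upper_bound)

print(f"se = {se:.3f}, z_alpha = {z_alpha:.3f}")
print(f"lower one-sided bound: {lower_bound:.3f}")
print(f"upper one-sided bound: {upper_bound:.3f}")
```

A one-sided interval puts all of α in one tail, so it uses the quantile for 1 − α rather than the 1 − α/2 quantile of a two-sided interval.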

Examples of Statistical Tests

The figure shows a two-sided rejection region: anything in the deep blue region outside the critical limits ($\bar{x}_{\text{critLeft}} = 31.92$, $\bar{x}_{\text{critRight}} = 32.08$).

Power and Operating Characteristic (O.C.) (Beta) Curves

The figure shows a one-sided rejection region: everything in the deep blue area below the limit $\bar{x}_c = 31.93$. Noncentrality is not an issue for tests using the normal distribution, since the normal does not have a noncentral form.

Fig. 3.22 Test whether the bottle production is within bounds. The figure shows a one-sided rejection region: anything in the deep blue region below the limit $\bar{x}_c = 31.93$

Efficiency

The crosshairs mark the location on the power and Type II error probability curves for a particular value $\mu_a$ of the alternative. On the right, assuming unknown variance, which must therefore be estimated from the data, the distribution under the alternative is a noncentral t distribution.

Sampling

Simple Random Sampling
Stratified Random Sampling
Cluster Random Sampling
Systematic Random Sampling
Standard Errors of Sample Means
Sources of Bias in Samples

But precision is sacrificed because this method prevents a large part of the population from being sampled. Then, by the Central Limit Theorem, an approximate large-sample 100(1−α)% confidence interval for the population mean is of the form.

Exercises

In this chapter, we present several of the types of graphs and plots that we will use throughout. We discuss the visual impact of the graphs and relate them to the tabular presentation of the same material.

What Is a Graph?

The goal of statistical graphics is to show data characteristics such as location, variability, range, shape, correlation, interaction, and clustering. Once we understand the data visually, we usually try to model it using formal algebraic procedures.

Example—Ecological Correlation

Scatterplots

In Figure 4.3 we look at the selling price against the number of bedrooms, the dining room area, and the kitchen area. Now we see very clear upward trends in price with each measure within each of the panels of the figure.

Scatterplot Matrix

We explore this possibility in Figure 4.4, where we show all three plots conditioned on whether the property is a condominium or a house. In the condominium panel in Figure 4.6, we see that the condominium with the largest kitchen and dining room is one of the more expensive properties (but not the most expensive) and has only two bedrooms.

Fig. 4.4 Selling price by number of bedrooms, by dining room area, and by kitchen area for 105 single-family homes, conditioned on whether the property is a condominium or house, in Mount Laurel, New Jersey, from March 1992 through September 1994.

Array of Scatterplots

In this figure, we use two lines of strip labels at the top, one for the pairs of columns that represent the levels of one factor and the other for the levels of the second factor arranged within each of those levels. There are two rows of labels in the top strip and one column of labels in the left strip.

Example—Life Expectancy

Study Objectives
Data Description
Initial Graphs

When we plot the 45° line in Figure 4.9d, we recover much of the correct impression. In Figure 4.9e, we have plotted the least-squares line through the points in addition to the 45° line.

Scatterplot Matrices—Continued

It is usually more informative to produce two or more adjacent sploms, by conditioning on the categorical variables, or to use different plot symbols for the different levels of one of the factors. We have presented what we consider to be the best orientation of the splom in Figure 4.10.

Fig. 4.8 Life Expectancy. In most countries, female life expectancy is longer than male life expectancy.

Televisions, Physicians, and Life Expectancy

The single triangle of a scatterplot matrix can be created in R by suppressing one of the triangles, for example with a call similar to . Older programs sometimes show a very confusing subset of the lower triangle in which the rows and columns of the screen show different sets of variables.

Fig. 4.11 Nonoptimal alternate orientation with rectangular panels for splom. The downhill diagonal is harder to read (see Figure 4.12)


Data Transformations
Life Expectancy Example—Continued
Color Vision
Exercises

Figure 4.16 shows the plots for all three families: the two parameterizations of the Box-Cox power transformations, $T_p(x)$ and $T^*_p(x)$, and the third, poorly parameterized power family $W_p(x)$. More typically, the shape of the plot changes as the powers of both variables change.
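Why the parameterization of a power family matters can be seen in a small numerical sketch. The Python code below is an illustration under stated assumptions: boxcox_scaled uses the usual scaled Box-Cox form, (x^p − 1)/p with log(x) at p = 0, and simple_power is the plain power x^p with log(x) at p = 0. The text here does not establish how these map onto the book's $T_p$, $T^*_p$, and $W_p$, so the function names are hypothetical.

```python
from math import log


def boxcox_scaled(x, p):
    """Scaled Box-Cox transformation: (x**p - 1) / p, with log(x) at p = 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    return log(x) if p == 0 else (x**p - 1) / p


def simple_power(x, p):
    """Plain power x**p, with log(x) at p = 0; order is reversed when p < 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    return log(x) if p == 0 else x**p


xs = [0.5, 1.0, 2.0, 4.0]
for p in (-1, -0.5, 0, 0.5, 1, 2):
    scaled = [round(boxcox_scaled(x, p), 3) for x in xs]
    plain = [round(simple_power(x, p), 3) for x in xs]
    print(f"p = {p:4}: scaled = {scaled}, plain = {plain}")
```

The scaled form is increasing in x for every p and approaches log(x) as p approaches 0, while the plain power reverses the order of the data for negative p. That reversal is one common reason for calling an unadjusted power family poorly parameterized.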

Fig. 4.14 Televisions, physicians, and life expectancy.

4.A Appendix: R Graphics

In the graphs illustrating the lack of homogeneity of variance (Figure 6.6), one of the sets in the Cartesian product is the function of the data that is displayed (observed data, data centered on the median, absolute value of the data centered on the median). In the power-scale graphs (Figure 4.17), the rows are the powers of y and the columns are the powers of x.

Fig. 4.21 The response to treatments A and B was measured at weeks 1, 2, 4, and 8. The boxplots have been positioned at distances illustrating the time difference and with A and B adjacent at each time point.

4.B Appendix: Graphs Used in This Book

In this framework, the superposition of all levels of a factor is considered a level. The graph is designed as a crossover of the means of the response variable at the levels of one factor with itself.

Fig. 4.22 Several ways to plot multiple variables simultaneously. The top row shows the

Introductory Inference

Normal (z) Intervals and Tests

Test of a Hypothesis Concerning the Mean of a Population Having Known Standard Deviation
Confidence Intervals for Unknown Population Proportion p
Tests on an Unknown Population Proportion p
Example—One-Sided Hypothesis Test Concerning a Population Proportion

The test procedure for the second pair of hypotheses is the mirror image of the first pair. $H_0$ is rejected if $\bar{y} < \mu_0 - z_\alpha \sigma/\sqrt{n}$ and retained otherwise. Similar to the two-tailed interval proposed by Agresti and Caffo, Cai (2003) proposes improved one-tailed confidence intervals to bring coverage probabilities closer to 1−α than the conventional intervals.
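The lower-tail rejection rule can be sketched in a few lines of Python. This is an illustration and not the book's code; the function name lower_tail_z_test and the numbers in the usage lines are hypothetical. The function assumes a known population standard deviation σ and tests H0: μ = μ0 against the alternative μ < μ0.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_z_test(ybar, mu0, sigma, n, alpha=0.05):
    """One-sided z test; reject H0 when ybar < mu0 - z_alpha * sigma / sqrt(n)."""
    se = sigma / sqrt(n)                          # standard deviation of ybar
    z_alpha = NormalDist().inv_cdf(1 - alpha)     # upper-alpha normal quantile
    critical_value = mu0 - z_alpha * se
    z_calc = (ybar - mu0) / se
    return {
        "critical_value": critical_value,
        "z_calc": z_calc,
        "p_value": NormalDist().cdf(z_calc),      # lower-tail p-value
        "reject": ybar < critical_value,
    }


# Hypothetical numbers, invented for illustration only.
far_below = lower_tail_z_test(ybar=9.2, mu0=10, sigma=2, n=25)
slightly_below = lower_tail_z_test(ybar=9.5, mu0=10, sigma=2, n=25)
print(far_below)
print(slightly_below)
```

Comparing the sample mean with the critical value is equivalent to comparing z_calc with −z_alpha, so the two forms of the rule always give the same decision.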

Table 5.1 Confidence intervals and tests with known standard deviation σ, where $\sigma_{\bar{y}} = \sigma/\sqrt{n}$

Example—Inference on a Population Mean µ

The confidence interval and tests for μ using the t distribution are similar to those using the normal (Z) distribution (i.e., Table 5.1 is applicable), with $t_{\text{calc}}$ replacing $z_{\text{calc}}$ and $t_\alpha$ replacing $z_\alpha$. Assuming that the standard deviation of the study population is not known, we want to calculate a 95% confidence interval for μ and to test $H_0: \mu = 10$ vs $H_1: \mu \neq 10$.

Confidence Interval on the Variance or Standard Deviation of a Normal Population

Comparisons of Two Populations Based on Independent Samples

Confidence Intervals on the Difference Between Two Population Proportions
Confidence Interval on the Difference Between Two Means
Tests Comparing Two Population Means When the Samples Are Independent
Comparing the Variances of Two Normal Populations

For a confidence interval on a difference of two means assuming the population variances are unknown, there are two cases. If the samples are independent and we assume unequal unknown variances, we use $s_{\Delta\bar{y}} = s_{(\bar{y}_1 - \bar{y}_2)}$ and $t_{\text{calc}}$ as given by Equation (5.14).

Table 5.3 Confidence intervals and tests for two population means. When the samples are independent and we can assume a common unknown variance, use s Δ¯ y = s p '

Paired Data

Example—t-test on Matched Pairs of Means

The data are available as data(teachers); the first column is the error score in the English version of the sentence and the second column is the error score in the Greek version of the sentence. The null hypothesis of equal difficulty corresponds to comparing the sample mean on the transformed scale with 4.123, not 0.

Fig. 5.7 Dotplot of language difficulty scores. The difficulty in learning each of 32 sentences written in English for Greek speakers (marked English) and written in Greek for English speakers (marked Greek) is noted

Sample Size Determination

Sample Size for Estimation
Sample Size for Hypothesis Testing

The sample size formulas here all result from the explicit solution of a given equation. The required sample size for the Agresti and Caffo CI for a single proportion, Equation (5.3), is

Fig. 5.9 Sample size and power for the one-sample, one-sided normal test. This figure illustrates Equation 5.18
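The relationship between sample size and power can be sketched numerically. Equation 5.18 is not reproduced in this text, so the Python code below assumes the standard textbook formula for the one-sample, one-sided normal test, n = σ²(z_α + z_β)²/δ², where δ is the difference to be detected and 1 − β is the desired power. The function name and the numbers in the usage lines are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_sided(sigma, delta, alpha=0.05, beta=0.20):
    """Smallest n giving power 1 - beta to detect a shift delta at level alpha."""
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # quantile controlling the Type I error
    z_beta = z.inv_cdf(1 - beta)     # quantile controlling the Type II error
    return ceil((sigma * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical inputs, invented for illustration only.
n_needed = sample_size_one_sided(sigma=2, delta=1)
n_smaller_shift = sample_size_one_sided(sigma=2, delta=0.5)
print(n_needed, n_smaller_shift)
```

Halving the detectable difference roughly quadruples the required sample size, because δ enters the formula squared.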

Goodness of Fit

Chi-Square Goodness-of-Fit Test
Example—Test of Goodness-of-Fit to a Binomial Distribution

Determining the rule of thumb for "too small" required goodness-of-fit tests of chi-square distributions with appropriate degrees of freedom. The chi-square distribution can be used to conduct goodness-of-fit tests, that is, tests of the form of a distribution.

Fig. 5.11 Plot of the hypothesis test of Table 5.9. The observed value $\chi^2 = 6.8$ shows $p = 0.236$ and is in the middle of the do-not-reject region,

Normal Probability Plots and Quantile Plots

Normal Probability Plots
Example—Comparing t-Distributions

S(y) is the empirical distribution function of the data, the fraction of the data that is less than or equal to y. The t distributions have more and more probability in the tails of the distribution (larger |q|) as the degrees of freedom decrease.
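The definition of S(y) translates directly into code. The Python sketch below is an illustration, not the book's implementation; the function name ecdf and the data values are hypothetical. It returns a function that gives the fraction of the data less than or equal to y.

```python
from bisect import bisect_right


def ecdf(data):
    """Return S, where S(y) is the fraction of the data less than or equal to y."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("data must not be empty")

    def S(y):
        # bisect_right counts how many sorted values are <= y.
        return bisect_right(ordered, y) / n

    return S


# Hypothetical values, invented for illustration only.
S = ecdf([3.1, 1.2, 2.5, 2.5, 4.0])
for y in (0.0, 1.2, 2.5, 3.0, 4.0):
    print(f"S({y}) = {S(y)}")
```

S is a step function that jumps at each observed value, with a jump of k/n where a value is repeated k times.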

Figure 5.13 contrasts the appearance of normal probability plots for the normal distribution and various departures from normality

Kolmogorov–Smirnov Goodness-of-Fit Tests

Example—Kolmogorov–Smirnov Goodness-of-Fit Test

On the left we compare a random sample from the t distribution with 5 df with a null hypothesis distribution of t with 2 df. On the right we compare a random sample from the standard normal distribution with a null hypothesis distribution of t with 2 df.

Fig. 5.17 Kolmogorov–Smirnov plots. Kolmogorov–Smirnov One-Sample Test. On the left we compare a random sample from the t distribution with 5 df to a null hypothesis distribution of t with 2 df

Maximum Likelihood

Maximum Likelihood Estimation
Likelihood Ratio Tests

The value of μ that maximizes this expression is the value of μ that minimizes $\sum_i (y_i - \mu)^2$. The answer, $\hat{\mu} = \bar{y}$, is both the "least squares" and the maximum likelihood estimator of μ. While likelihood ratio tests generally do not have optimal properties, experience has shown that they are often competitive.
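That the sample mean minimizes the sum of squared deviations can be checked numerically. The Python sketch below is an illustration with invented data: it evaluates the sum of squares on a grid of candidate values centered on the sample mean and confirms that the smallest value occurs at the mean. The function name and the data are hypothetical.

```python
from statistics import fmean


def sum_of_squares(y, mu):
    """Sum of squared deviations of the observations y from a candidate mu."""
    return sum((yi - mu) ** 2 for yi in y)


# Hypothetical observations, invented for illustration only.
y = [4.0, 7.0, 5.5, 6.0, 9.5]
mu_hat = fmean(y)   # sample mean, the claimed minimizer

# Candidate values within 2 units of the sample mean, in steps of 0.01.
grid = [mu_hat + d / 100 for d in range(-200, 201)]
best = min(grid, key=lambda mu: sum_of_squares(y, mu))

print(f"sample mean = {mu_hat}, grid minimizer = {best}")
```

The grid search is only a check. The algebraic argument is that the sum of squares is a quadratic in μ whose derivative, −2Σ(yᵢ − μ), is zero exactly at the sample mean.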

Exercises

The mean μ of the Poisson distribution is either known in advance or must be estimated from the data. The Poisson parameter μ is unknown and must be estimated as a weighted average of the possible values y, i.e., .

One-Way Analysis of Variance

Example—Catalyst Data

The F-test in the ANOVA table addresses the null hypothesis that the four catalysts have equal mean concentrations. From the small p-value (p = .0014) we immediately see that these four catalysts do not yield the same average concentrations.

Fig. 6.1 Boxplots Comparing the Concentrations for each Catalyst

Fixed Effects

The total sum of squares is the sum of the treatment and residual sums of squares, and the total degrees of freedom is the sum of the treatment and residual degrees of freedom. R does not print the Total line in its ANOVA tables. The terms of the table are defined by.

Table 6.2 Sample Table to Illustrate Structure of the ANOVA Table

Analysis of Variance of Dependent Variable y

Source      Degrees of Freedom   Sum of Squares   Mean Square   F   p-value
Residual    dfRes                SSRes            MSRes
Total       dfTotal              SSTotal
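The additive structure of the ANOVA table can be verified on a small example. The Python sketch below is an illustration, not the book's code, and it does not use the catalyst data: the three groups are invented, and the function name one_way_anova is hypothetical. It computes the treatment, residual, and total sums of squares and degrees of freedom, and forms the F statistic as the ratio of the treatment mean square to the residual mean square.

```python
from statistics import fmean


def one_way_anova(groups):
    """Sums of squares, degrees of freedom, and F for a one-way layout."""
    all_values = [v for g in groups.values() for v in g]
    grand_mean = fmean(all_values)

    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_trt = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_res = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)

    df_trt = len(groups) - 1
    df_res = len(all_values) - len(groups)
    ms_trt = ss_trt / df_trt
    ms_res = ss_res / df_res
    return {
        "df": {"treatment": df_trt, "residual": df_res, "total": df_trt + df_res},
        "ss": {"treatment": ss_trt, "residual": ss_res, "total": ss_total},
        "ms": {"treatment": ms_trt, "residual": ms_res},
        "F": ms_trt / ms_res,
    }


# Hypothetical data, invented for illustration only.
table = one_way_anova({"A": [1, 2, 3], "B": [2, 3, 4], "C": [5, 6, 7]})
for key, value in table.items():
    print(key, value)
```

The p-value is omitted because it requires the F distribution, which is not in the Python standard library; in R it appears in the printed ANOVA table.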

Multiple Comparisons—Tukey Procedure for Comparing All Pairs of Means