Most of the chapters in the Second Edition are similar in content to the chapters in the First Edition. In the First Edition and again in the Second Edition, the code and data for all the examples and figures in the book are available for download.
The HH Package in R
The R language is an extremely well-developed tool for statistical research and analysis, that is, for exploring and designing new techniques of analysis as well as for the analysis itself. The implementation of lattice graphics in R's lattice package is particularly powerful for statistical graphics, the output of data analysis through which raw data and results are displayed to the analyst and client.
S-Plus, now called S+
I decided to leave most of the SAS discussions and examples out of the body of the second edition. The now standard terminology introduced by SAS, notably the notation for sum-of-squares "types" described in Section 13.6, is listed and described.
5 Chapters in the Second Edition
Revised Chapters
The older version of HH (Heiberger, 2009), designed for the first edition of this book, continues to work with S+. The examples now center on mosaic graphics, using the vcd package that was not available when the first edition was written.
Revised Appendices
The discussion focuses on the Adverse Effects dotplot, demonstrating how multi-panel plots of graphics can replace pages of tabular data. The discussion is based on the work I participated in during my research leave at GSK (Amit et al., 2008).
6 Exercises
Rating scales such as Likert scales and semantic differential scales are very common in marketing research, customer satisfaction studies, psychometrics, opinion polls, population studies, and many other fields.
Acknowledgments: First Edition
Acknowledgments
We are also grateful to Insightful Corp. for providing us with current copies of the S-Plus software for us and our student, and to the many experts who reviewed portions of early drafts of this manuscript.
Author Bios
Introduction and Motivation
Statistics in Context
The statistician is an expert in designing the data collection procedures and in calculating and displaying the results of statistical analyses. One of the statistician's tasks is to help the client formulate the question(s) to be answered in a way that leads to sensible data collection and is suitable for statistical analysis.
Examples of Uses of Statistics
- Investigation of Salary Discrimination
- Measuring Body Fat
- Minimizing Film Thickness
- Surveys
- Bringing Pharmaceutical Products to Market
Our analysis in Chapter 9 demonstrates that essentially all of the body fat information in the 15 other measurements can be captured by just three of these other measurements. We develop a regression model of body fat as a function of these three measurements, and then examine how well these three inexpensive measurements alone can estimate body fat.
The Rest of the Book
- Fundamentals
- Linear Models
- Other Techniques
- New Graphical Display Techniques
- Appendices on Software
- Appendices on Mathematics and Probability
- Appendices on Statistical Analysis and Writing
Chapter 5 introduces some of the basic inference techniques used throughout the rest of the book. We show the algebra and pictures for finding the center and spread of the distributions.
Data and Statistics
- Types of Data
- Data Display and Calculation
- Presentation
- Rounding
- Importing Data
- Datasets for This Book
- Other Data Sources
- Analysis with Missing Values
- Data Rearrangement
- Tables and Graphs
- R Code Files for Statistical Analysis and Data Display (HH)
For use with R, all of the datasets mentioned in the book are available in the HH package for R. Many of the graphs were created using functions included in and fully documented in the HH package.
2.A Appendix: Missing Values in R
The entire row containing the missing values is removed from the analysis and subsequent processing.
Statistics Concepts
A Brief Introduction to Probability
Another method is to recognize that the "at least one white" event can be divided into three mutually exclusive events: white on the first draw and red on the second; red on the first draw and white on the second; and white on both draws. The probability of "at least one white" is then the sum of the probabilities of the events that make up this division.
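The partition argument above can be checked numerically. The following is a minimal sketch in Python (not the book's R code). The urn contents (3 white, 2 red) and drawing without replacement are assumptions made for illustration, since the excerpt does not state them. The function names are hypothetical.

```python
from fractions import Fraction


def p_at_least_one_white(white, red):
    """P(at least one white) in two draws without replacement,
    computed as the sum of three mutually exclusive events."""
    total = white + red
    white_then_red = Fraction(white, total) * Fraction(red, total - 1)
    red_then_white = Fraction(red, total) * Fraction(white, total - 1)
    white_then_white = Fraction(white, total) * Fraction(white - 1, total - 1)
    return white_then_red + red_then_white + white_then_white


def p_via_complement(white, red):
    """Same probability, computed as 1 - P(both draws are red)."""
    total = white + red
    both_red = Fraction(red, total) * Fraction(red - 1, total - 1)
    return 1 - both_red


# Hypothetical urn: 3 white balls and 2 red balls.
by_partition = p_at_least_one_white(3, 2)
by_complement = p_via_complement(3, 2)
print(by_partition, by_complement)
```

Both routes give the same answer, which is the point of the argument: the three events are disjoint and together exhaust "at least one white".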
Random Variables and Probability Distributions
- Discrete Versus Continuous Probability Distributions
- Displaying Probability Distributions—Discrete Distributions
- Displaying Probability Distributions—Continuous Distributions
The A event is "the first ball is white" and the B event is "the second ball is white". But if only those dates with measurable precipitation are considered, Y is continuous, that is, the distribution of (Y | Y > 0) is continuous.
Concepts That Are Used When Discussing Distributions
- Expectation and Variance of Random Variables
- Median of Random Variables
- Symmetric and Skewed Distributions
- Displays of Univariate Data
- Histogram
- Stem-and-Leaf Display
Approximately 25% of the sample lies within each of the four intervals (all final intervals are closed, so double counting is possible). For example, adding notches to the sides of the box provides information about the variability of the sample median.
Quartiles of F(3,36)
Multivariate Distributions—Covariance and Correlation
The mean or expectation of $X$ is $\mu = (\mu_1, \mu_2, \ldots, \mu_p)'$, the vector of means of the univariate distributions of the $X_i$'s. This is actually a panel of the set of rotated images of the density shown in Figure 3.10.
Three Probability Distributions
- The Binomial Distribution
- The Normal Distribution
- The (Student’s) t Distribution
If a random sample is drawn with replacement from a population in which a proportion p of the items are successes, then the number of successes in the sample is binomially distributed. Use the location of the critical value $w_c = \bar{x}$ on the graph and in the table below the graph to see that the critical value for the $\alpha = .05$ test moves closer to the null hypothesis value $\mu_0$ as df gets larger.
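The binomial statement can be made concrete with a short calculation. This is a minimal Python sketch (not code from the HH package). The sample size n = 10 and success proportion p = 0.3 are hypothetical values chosen only for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent draws with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical setting: 10 draws with replacement, 30% successes in the population.
n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

total_probability = sum(probs)                         # should be 1 up to rounding
mean = sum(k * pk for k, pk in enumerate(probs))       # should equal n * p
print(total_probability, mean)
```

Sampling with replacement matters here because it keeps the success probability the same on every draw, which is what the binomial formula assumes.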
Sampling Distributions
In the panel with $n = 4$, each mean is the average of 4 observations, and the means lie only in the central half of the panel, with $\sigma_{\bar{x}}^2 = 5^2/4$. In the panel with $n = 16$, each mean is the average of 16 observations, and the means are distributed only in the central quarter of the panel, with $\sigma_{\bar{x}}^2 = 5^2/16$.
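The narrowing of the sampling distribution can be reproduced by simulation. The sketch below is in Python rather than R and assumes a normal population with standard deviation 5 (matching the $5^2$ in the variances above) and mean 0. The mean, the number of replications, and the seed are arbitrary choices.

```python
import random
import statistics


def sd_of_sample_means(n, sigma=5.0, mu=0.0, reps=5000, seed=1):
    """Simulate `reps` samples of size n and return the standard
    deviation of their sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)


sd4 = sd_of_sample_means(4)    # theory: 5 / sqrt(4)  = 2.5
sd16 = sd_of_sample_means(16)  # theory: 5 / sqrt(16) = 1.25
print(sd4, sd16)
```

Quadrupling the sample size halves the standard deviation of the mean, which is why the means occupy the central half of the panel for n = 4 and the central quarter for n = 16.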
Estimation
- Statistical Models
- Point and Interval Estimators
- Criteria for Point Estimators
- Confidence Interval Estimation
- Example—Confidence Interval on the Mean µ of a Population Having Known Standard Deviation
- Example—One-Sided Confidence Intervals
Estimation is one of two broad categories of statistical techniques used for this purpose. The sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$, and the sample variance $s^2$ is an unbiased estimator of the population variance $\sigma^2$.
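Unbiasedness can be verified exactly on a tiny example by enumerating every possible sample. This Python sketch assumes a hypothetical three-value population and samples of size 2 drawn with replacement. It uses the usual n − 1 divisor for $s^2$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population, each value equally likely on every draw.
population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N


def sample_variance(sample):
    """Sample variance with the n - 1 divisor."""
    n = len(sample)
    xbar = Fraction(sum(sample), n)
    return sum((Fraction(x) - xbar) ** 2 for x in sample) / (n - 1)


# All 9 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
mean_of_xbar = sum(Fraction(sum(s), len(s)) for s in samples) / len(samples)
mean_of_s2 = sum(sample_variance(s) for s in samples) / len(samples)
print(mean_of_xbar, mu, mean_of_s2, sigma2)
```

Averaged over all possible samples, the sample mean equals $\mu$ and the sample variance equals $\sigma^2$, which is what "unbiased" means.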
Hypothesis Testing
By pre-specifying the significance level α, the testing procedure maintains better control over Type I error than over Type II error. If the test statistic is on one side of the critical value, H0 is retained; if it is on the other side, H0 is rejected.
Examples of Statistical Tests
The figure shows a two-sided rejection region: everything in the deep blue area outside the critical limits ($\bar{x}_{\text{critLeft}} = 31.92$, $\bar{x}_{\text{critRight}} = 32.08$).
Power and Operating Characteristic (O.C.) (Beta) Curves
The figure shows a one-sided rejection region: everything in the deep blue area below the limit $\bar{x}_c = 31.93$. Non-centrality is not a problem for tests using the normal distribution, since the normal does not have a non-central form.
Efficiency
The crosshairs mark the location on the power and Type II error probability curves for a particular value μa of the alternative. On the right, assuming unknown variance, which must therefore be estimated from the data, the distribution under the alternative is a noncentral t distribution.
Sampling
- Simple Random Sampling
- Stratified Random Sampling
- Cluster Random Sampling
- Systematic Random Sampling
- Standard Errors of Sample Means
- Sources of Bias in Samples
But precision is sacrificed because this method prevents a large part of the population from being sampled. Then, by the Central Limit Theorem, an approximate large-sample 100(1−α)% confidence interval for the population mean is of the form.
Exercises
In this chapter, we present several of the types of graphs and plots that we will use throughout. We discuss the visual impact of the graphs and relate them to the tabular presentation of the same material.
What Is a Graph?
The goal of statistical graphics is to show data characteristics such as location, variability, range, shape, correlation, interaction, and clustering. Once we understand the data visually, we usually try to model it using formal algebraic procedures.
Example—Ecological Correlation
Scatterplots
In Figure 4.3 we plot the sales price against the number of bedrooms, the dining room, and the kitchen. Now we see very clear upward price trends within each of the panels of the figure.
Scatterplot Matrix
We explore this possibility in Figure 4.4, where we show all three plots according to whether the property is a condominium or a house. In the condominium panel in Figure 4.6, we see that the condominium with the largest kitchen and dining room is one of the more expensive properties (but not the highest) and only has two bedrooms.
Array of Scatterplots
In this figure, we use two rows of top strip labels, one for the pairs of columns that represent the level and the other for the levels arranged within the level. There are two rows of strip labels on the top and one column of strip labels on the left.
Example—Life Expectancy
- Study Objectives
- Data Description
- Initial Graphs
When we plot the 45° line in Figure 4.9d, we get back much of the correct impression. In Figure 4.9e, we have plotted the least-squares line through the points in addition to the 45° line.
Scatterplot Matrices—Continued
It is usually more informative to produce two or more adjacent sploms, by conditioning on the categorical variables, or to use different plot symbols for the different levels of one of the factors. We have presented what we consider to be the best orientation of the splom in Figure 4.10.
Televisions, Physicians, and Life Expectancy
The single triangle of a scatterplot matrix can be created in R by suppressing one of the triangles, for example with a call similar to . Older programs sometimes show a very confusing subset of the lower triangle in which the rows and columns of the screen show different sets of variables.
- Data Transformations
- Life Expectancy Example—Continued
- Color Vision
- Exercises
Figure 4.16 shows the plots for all three families: the two parameterizations of the Box-Cox power transformations $T_p(x)$ and $T^*_p(x)$ and the third, poorly parameterized power family $W_p(x)$. More typically, the shape of the plot changes as the strength of both variables changes.
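As a rough illustration of a power-transformation family, the sketch below implements the widely used scaled form $(x^p - 1)/p$ with the logarithm as the $p = 0$ case. This is an assumption: the excerpt does not give the definitions of $T_p$, $T^*_p$, or $W_p$, so the function here (with the hypothetical name `scaled_power`) should not be read as any one of them. It is written in Python, not R.

```python
from math import log


def scaled_power(x, p):
    """Scaled power transformation (x**p - 1) / p for x > 0,
    with log(x) as the limiting case at p = 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if p == 0:
        return log(x)
    return (x**p - 1) / p


# The scaling makes the family continuous in p: values near p = 0
# approach log(x), so neighboring powers give comparable plots.
near_zero = scaled_power(3.0, 1e-6)
at_zero = scaled_power(3.0, 0)
print(near_zero, at_zero, scaled_power(4.0, 0.5))
```

Continuity in p is one reason a scaled parameterization is easier to compare across powers than the raw power $x^p$.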
4.A Appendix: R Graphics
In the graphs illustrating the lack of homogeneity of variance (Figure 6.6), one of the sets in the Cartesian product is the function of the data that is displayed (observed data, data centered on the median, absolute value of the data centered on the median). In the power-scale graphs (Figure 4.17), the rows are the powers of y and the columns are the powers of x.
4.B Appendix: Graphs Used in This Book
In this framework, the superposition of all levels of a factor is considered a level. The graph is designed as a crossover of the means of the response variable at the levels of one factor with itself.
Introductory Inference
Normal (z) Intervals and Tests
- Test of a Hypothesis Concerning the Mean of a Population Having Known Standard Deviation
- Confidence Intervals for Unknown Population Proportion p
- Tests on an Unknown Population Proportion p
- Example—One-Sided Hypothesis Test Concerning a Population Proportion
The test procedure for the second pair of hypotheses is the mirror image of the first pair. H0 is rejected if $\bar{y} < \mu_0 - z_\alpha \sigma/\sqrt{n}$ and retained otherwise. Similar to the two-tailed interval proposed by Agresti and Caffo, Cai (2003) proposes improved one-tailed confidence intervals to bring coverage probabilities closer to 1−α than the conventional intervals.
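The lower-tail rejection rule $\bar{y} < \mu_0 - z_\alpha \sigma/\sqrt{n}$ can be sketched directly. The Python below is illustrative only (the book's own computations are in R). The function name and the numbers ($\mu_0 = 10$, $\sigma = 2$, $n = 25$, $\alpha = .05$) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_z_test(ybar, mu0, sigma, n, alpha=0.05):
    """One-sided z test of H0: mu >= mu0 against H1: mu < mu0 with known sigma.
    Returns the critical value for ybar and whether H0 is rejected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-alpha normal quantile
    critical = mu0 - z_alpha * sigma / sqrt(n)
    return critical, ybar < critical


# Hypothetical inputs.
critical, reject_low = lower_tail_z_test(ybar=9.2, mu0=10, sigma=2, n=25)
_, reject_high = lower_tail_z_test(ybar=9.5, mu0=10, sigma=2, n=25)
print(round(critical, 3), reject_low, reject_high)
```

The upper-tail test is the mirror image: replace the subtraction with addition and reject when the sample mean exceeds the critical value.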
- Example—Inference on a Population Mean µ
The confidence interval and tests for $\mu$ using the $t$ distribution are similar to those using the normal ($Z$) distribution (i.e., Table 5.1 is applicable), with $t_{\text{calc}}$ replacing $z_{\text{calc}}$ and $t_\alpha$ replacing $z_\alpha$. Assuming that the standard deviation of the study population is not known, we want to calculate a 95% confidence interval for $\mu$ and to test $H_0\colon \mu = 10$ vs $H_1\colon \mu \neq 10$.
Confidence Interval on the Variance or Standard Deviation of a Normal Population
Comparisons of Two Populations Based on Independent Samples
- Confidence Intervals on the Difference Between Two Population Proportions
- Confidence Interval on the Difference Between Two Means
- Tests Comparing Two Population Means When the Samples Are Independent
- Comparing the Variances of Two Normal Populations
For a confidence interval on a difference of two means assuming the population variances are unknown, there are two cases. If the samples are independent and we assume several unknown variances, we use $\Delta\bar{y} = s_{(\bar{y}_1 - \bar{y}_2)}$ and $t_{\text{calc}}$ as given by equation (5.14).
Paired Data
- Example—t-test on Matched Pairs of Means
The data are available as data(teachers); the first column is the error score in the English version of the sentence and the second column is the error score in the Greek version of the sentence. The null hypothesis of equal difficulty corresponds to comparing the sample mean on the transformed scale to 4.123, not 0.
Sample Size Determination
- Sample Size for Estimation
- Sample Size for Hypothesis Testing
The sample size formulas here are all the result of the explicit solution of a given equation. The required sample size for the Agresti and Caffo CI in a single proportion, equation (5.3), is
Goodness of Fit
- Chi-Square Goodness-of-Fit Test
- Example—Test of Goodness-of-Fit to a Binomial Distribution
Determining the rule of thumb for "too small" required goodness-of-fit tests of chi-square distributions with appropriate degrees of freedom. The chi-square distribution can be used to perform tests of goodness-of-fit, i.e., of form.
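The chi-square goodness-of-fit statistic is the sum over categories of (observed − expected)² / expected. The Python sketch below computes it for hypothetical counts. The data and the function name are made up for illustration, and no parameters are estimated from the data, so the degrees of freedom are the number of categories minus one.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 100 observations in 4 categories, equal expected counts.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1   # no parameters estimated from the data
print(statistic, df)
```

The statistic is compared with a chi-square distribution on `df` degrees of freedom. Expected counts that are too small make that approximation unreliable, which is the concern behind the rule of thumb mentioned above.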
Normal Probability Plots and Quantile Plots
- Normal Probability Plots
- Example—Comparing t-Distributions
S(y) is the empirical distribution function of the data, the fraction of the data that is less than or equal to y. The t distributions have more and more probability in the tails of the distribution (larger |q|) with decreasing degrees of freedom.
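The definition of S(y) translates directly into code. This is a minimal Python sketch with a hypothetical data vector. The helper name `ecdf` is not from the book.

```python
from bisect import bisect_right


def ecdf(data):
    """Return S, where S(y) is the fraction of `data` less than or equal to y."""
    sorted_data = sorted(data)
    n = len(sorted_data)

    def S(y):
        # bisect_right counts how many sorted values are <= y.
        return bisect_right(sorted_data, y) / n

    return S


# Hypothetical data.
S = ecdf([3, 1, 4, 1, 5])
print(S(0), S(1), S(3.5), S(5))
```

S is a step function that jumps by 1/n at each data value (by more where values are tied) and equals 1 at and beyond the largest observation.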
Kolmogorov–Smirnov Goodness-of-Fit Tests
- Example—Kolmogorov–Smirnov Goodness-of-Fit Test
On the left we compare a random selection from the t distribution with 5 df with a null hypothesis t distribution with 2 df. On the right we compare a random selection from the standard normal distribution with a null hypothesis t distribution with 2 df.
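The Kolmogorov–Smirnov statistic is the largest vertical distance between the empirical distribution function and the hypothesized distribution function. The Python sketch below uses a standard normal null distribution rather than the t distribution with 2 df discussed above, because the Python standard library has no t distribution. That substitution, the data, and the function name are assumptions made for illustration.

```python
from statistics import NormalDist


def ks_statistic(data, cdf):
    """Largest distance between the empirical distribution function of
    `data` and the hypothesized cumulative distribution function `cdf`."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical distribution function jumps from i/n to (i + 1)/n
        # at x, so both sides of the jump must be checked.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical sample compared with a standard normal null distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
d_sample = ks_statistic(sample, NormalDist().cdf)
d_single = ks_statistic([0.0], NormalDist().cdf)
print(d_sample, d_single)
```

A large value of the statistic is evidence against the hypothesized distribution. Any other cumulative distribution function can be passed in place of the normal one.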
Maximum Likelihood
- Maximum Likelihood Estimation
- Likelihood Ratio Tests
The value of $\mu$ that maximizes this expression is the value of $\mu$ that minimizes $\sum_i (y_i - \mu)^2$. The answer, $\hat{\mu} = \bar{y}$, is both the "least squares" and maximum likelihood estimator of $\mu$. While likelihood ratio tests generally do not have optimal properties, experience has shown that they are often competitive.
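The minimization step can be written out in a few lines. This is a sketch of the standard calculus argument, with $n$ denoting the number of observations $y_1, \ldots, y_n$ (notation assumed here, not taken from the excerpt).

```latex
\begin{align*}
\frac{d}{d\mu} \sum_{i=1}^{n} (y_i - \mu)^2
  &= -2 \sum_{i=1}^{n} (y_i - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}, \\
\frac{d^2}{d\mu^2} \sum_{i=1}^{n} (y_i - \mu)^2
  &= 2n > 0 .
\end{align*}
```

The positive second derivative confirms that $\bar{y}$ is a minimum of the sum of squares and therefore a maximum of the likelihood.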
Exercises
The mean μ of the Poisson distribution is known in advance or must be estimated from the data. The Poisson parameter μ is unknown and must be estimated as a weighted average of the possible values y, i.e., .
One-Way Analysis of Variance
Example—Catalyst Data
The F-test in the ANOVA table addresses the null hypothesis that the four catalysts have equal mean concentrations. From the small p-value (p = .0014) we immediately see that these four catalysts do not yield the same average concentrations.
Fixed Effects
The total sum of squares is the sum of the treatment and residual sums of squares, and the total degrees of freedom is the sum of the treatment and residual degrees of freedom. R does not print the Total line in its ANOVA tables.
Source     df         Sum of Sq   Mean Sq
Residual   df_Res     SS_Res      MS_Res
Total      df_Total   SS_Total
The terms of the table are defined by.
Multiple Comparisons—Tukey Procedure for Comparing All Pairs of Means
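The additivity of the sums of squares and degrees of freedom can be checked numerically. The Python sketch below uses three small hypothetical groups, not the catalyst data, and the function name `one_way_anova` is made up for illustration.

```python
from statistics import fmean


def one_way_anova(groups):
    """Sums of squares, degrees of freedom, and F for a one-way layout."""
    all_values = [y for g in groups for y in g]
    grand_mean = fmean(all_values)
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    ss_treatment = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_residual = sum((y - fmean(g)) ** 2 for g in groups for y in g)
    df_treatment = len(groups) - 1
    df_residual = len(all_values) - len(groups)
    f_value = (ss_treatment / df_treatment) / (ss_residual / df_residual)
    return {
        "ss_treatment": ss_treatment, "ss_residual": ss_residual,
        "ss_total": ss_total, "df_treatment": df_treatment,
        "df_residual": df_residual, "df_total": len(all_values) - 1,
        "F": f_value,
    }


# Hypothetical data: three treatment groups with three observations each.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```

Because the Total line is just the sum of the other two, it carries no extra information, which is consistent with R omitting it from its ANOVA output.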