Second Edition

Nguyễn Gia Hào

Academic year: 2023

Most of the chapters in the Second Edition are similar in content to the chapters in the First Edition. In the First Edition and again in the Second Edition, the code and data for all the examples and figures in the book are available for download.

The HH Package in R

The R language is an extremely well-developed tool for statistical research and analysis, that is, for exploring and designing new analysis techniques as well as for carrying out analyses. The implementation of lattice graphics in R's lattice package is particularly powerful for statistical graphics, the output of data analysis through which raw data and results are displayed to the analyst and client.

S-Plus, now called S+

I decided to leave most of the SAS discussions and examples out of the body of the second edition. The now standard terminology introduced by SAS, notably the notation for sum-of-squares "types" described in Section 13.6, is listed and described.

5 Chapters in the Second Edition

Revised Chapters

The older version of HH (Heiberger, 2009), designed for the first edition of this book, continues to work with S+. The examples now center on mosaic graphics, using the vcd package that was not available when the first edition was written.

Revised Appendices

The discussion focuses on the Adverse Events dotplot, demonstrating how multi-panel graphical displays can replace pages of tabular data. The discussion is based on work I participated in during my research leave at GSK (Amit et al., 2008).

6 Exercises

Rating scales such as Likert scales and semantic differential scales are very common in marketing research, customer satisfaction studies, psychometrics, opinion polls, population studies, and many other fields.

Acknowledgments: First Edition

Acknowledgments

We are also grateful to Insightful Corp. for providing us with current copies of the S-Plus software for us and our student, and to the many experts who reviewed portions of early drafts of this manuscript.

Author Bios

Introduction and Motivation

Statistics in Context

The statistician is an expert in designing the data collection procedures and in calculating and displaying the results of statistical analyses. The statistician helps the client formulate the question(s) to be answered in a way that leads to sensible data collection and is suitable for statistical analysis.

Examples of Uses of Statistics

  • Investigation of Salary Discrimination
  • Measuring Body Fat
  • Minimizing Film Thickness
  • Surveys
  • Bringing Pharmaceutical Products to Market

Our analysis in Chapter 9 demonstrates that essentially all of the body fat information in the 15 other measurements can be captured by just three of these other measurements. We develop a regression model of body fat as a function of these three measurements, and then examine how well these three inexpensive measurements alone can estimate body fat.

The Rest of the Book

  • Fundamentals
  • Linear Models
  • Other Techniques
  • New Graphical Display Techniques
  • Appendices on Software
  • Appendices on Mathematics and Probability
  • Appendices on Statistical Analysis and Writing

Chapter 5 introduces some of the basic inference techniques used throughout the rest of the book. We show the algebra and pictures for finding the center and spread of the distributions.

Data and Statistics

  • Types of Data
  • Data Display and Calculation
    • Presentation
    • Rounding
  • Importing Data
    • Datasets for This Book
    • Other Data sources
  • Analysis with Missing Values
  • Data Rearrangement
  • Tables and Graphs
  • R Code Files for Statistical Analysis and Data Display (HH)

For use with R, all of the datasets mentioned in the book are available in the HH package for R. Many of the graphs were created using functions included in and fully documented in the HH package.

Table 2.1 shows two tables with identical numerical information. The first is legible because it follows both principles; the second is not because it doesn’t.

2.A Appendix: Missing Values in R

The entire row containing the missing values is removed from the analysis and subsequent processing.

Table 2.4 The argument na.strings defines strings "999" and "." to indicate missing values.
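The idea behind na.strings can be illustrated outside R as well. The following is a minimal Python sketch, not the book's code: the file contents, column names, and the helper `read_with_missing` are hypothetical, and only the two missing-value codes "999" and "." come from the caption above. It also shows the row-wise deletion described earlier in this appendix.

```python
import csv
import io

# Hypothetical file contents; "999" and "." are the missing-value codes.
RAW = """id,height,weight
1,170,65
2,999,70
3,182,.
"""

MISSING_CODES = {"999", "."}


def read_with_missing(text, missing_codes):
    """Parse CSV text, replacing any field equal to a missing code with None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({key: (None if value in missing_codes else value)
                     for key, value in record.items()})
    return rows


rows = read_with_missing(RAW, MISSING_CODES)

# Row-wise deletion: keep only the rows that have no missing field.
complete = [r for r in rows if None not in r.values()]

print(rows)
print(len(complete))
```

Matching on the whole field, rather than on a substring, keeps a legitimate value such as 1999 from being treated as missing.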

Statistics Concepts

A Brief Introduction to Probability

Another method is to recognize that the event "at least one white" can be partitioned into three mutually exclusive events: white first and red second; red first and white second; and white on both draws. The probability of "at least one white" is then the sum of the probabilities of the events that make up this partition.
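The partition argument can be checked with exact fractions. This sketch assumes the box of six white and four red balls named in the caption of Table 3.1, and it further assumes that two balls are drawn without replacement; the function name `p_sequence` is illustrative only.

```python
from fractions import Fraction

WHITE, RED = 6, 4          # box composition from the Table 3.1 caption
TOTAL = WHITE + RED


def p_sequence(first, second):
    """Probability of drawing `first` then `second` ("W" or "R"),
    assuming two draws without replacement."""
    counts = {"W": WHITE, "R": RED}
    p_first = Fraction(counts[first], TOTAL)
    counts[first] -= 1                      # the first ball is not returned
    p_second = Fraction(counts[second], TOTAL - 1)
    return p_first * p_second


# Sum over the three mutually exclusive events in the partition.
p_partition = p_sequence("W", "R") + p_sequence("R", "W") + p_sequence("W", "W")

# Cross-check through the complement: 1 - P(both red).
p_complement = 1 - p_sequence("R", "R")

print(p_partition, p_complement)
```

Both routes give the same value, which is the point of the partition argument.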

Random Variables and Probability Distributions

  • Discrete Versus Continuous Probability Distributions
  • Displaying Probability Distributions—Discrete Distributions
  • Displaying Probability Distributions—Continuous Distributions

The event A is "the first ball is white" and the event B is "the second ball is white". But if only those dates with measurable precipitation are considered, Y is continuous; that is, the distribution of (Y | Y > 0) is continuous.

Table 3.1 Probability of intersection events, conditional events, union events in the setting of a box containing six white and four red billiard balls

Concepts That Are Used When Discussing Distributions

  • Expectation and Variance of Random Variables
  • Median of Random Variables
  • Symmetric and Skewed Distributions
  • Displays of Univariate Data
    • Histogram
    • Stem-and-Leaf Display

Approximately 25% of the sample lies within each of the four intervals (all four intervals are closed, so double counting is possible). For example, adding notches to the sides of the box provides information about the variability of the sample median.

Fig. 3.3 Negatively skewed, symmetric, and positively skewed distributions.

Quartiles of F(3,36)

Multivariate Distributions—Covariance and Correlation

The mean or expectation of X is μ = (μ1, μ2, ..., μp)′, the vector of means of the univariate distributions of the Xi's. This is actually one panel from the set of rotated views of the density shown in Figure 3.10.

Fig. 3.8 Bivariate Normal distribution—scatterplot at various correlations. The distributions in the panels are related

Three Probability Distributions

  • The Binomial Distribution
  • The Normal Distribution
  • The (Student’s) t Distribution

If a random sample is drawn with replacement from a population containing a proportion p of successes, then the number of successes in the sample is binomially distributed. Use the location of the critical value w_c = x̄ on the graph and in the table below the graph to see that the critical value for the α = .05 test moves away from the null hypothesis value μ0 as df gets smaller.

Fig. 3.11 We show the discrete density for the binomial with n = 15 and p = .4, underlaid with the normal approximation with μ = np = 15 × .4 = 6 and σ =
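The comparison in this figure can be tabulated numerically. The sketch below uses n = 15 and p = .4 from the caption and takes the standard binomial result σ = √(np(1 − p)) as given; it is an illustration in Python, not the code that drew the figure.

```python
import math
from statistics import NormalDist

n, p = 15, 0.4
mu = n * p                               # 6
sigma = math.sqrt(n * p * (1 - p))       # standard binomial standard deviation
approx = NormalDist(mu, sigma)           # approximating normal distribution


def binom_pmf(k):
    """Exact binomial probability P(X = k) for the n and p above."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


# Discrete density next to the normal density evaluated at the same point.
for k in range(n + 1):
    print(k, round(binom_pmf(k), 4), round(approx.pdf(k), 4))
```

Near the mean the two columns agree closely; the agreement is weaker in the tails.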

Sampling Distributions

In the panel with n = 4, each mean is the average of 4 observations, and the means lie only in the central half of the panel, with σ²_x̄ = 5²/4. In the panel with n = 16, each mean is the average of 16 observations, and the means lie only in the central quarter of the panel, with σ²_x̄ = 5²/16.

Fig. 3.15 Each panel shows 10 sets of n observations from the N(μ = 100, σ² = 5²) distribution.
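The shrinking spread of the sample mean can be reproduced with a short simulation. This is a sketch under stated assumptions: the number of replications and the seed are arbitrary choices, and the function names are illustrative; only μ = 100 and σ = 5 come from the caption.

```python
import math
import random
import statistics

MU, SIGMA = 100, 5   # population parameters from the figure caption


def theoretical_se(n):
    """Standard deviation of the mean of n independent observations."""
    return SIGMA / math.sqrt(n)


def simulated_se(n, reps=5000, seed=1):
    """Empirical standard deviation of `reps` simulated sample means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)


for n in (1, 4, 16):
    print(n, theoretical_se(n), round(simulated_se(n), 3))
```

Quadrupling n halves the standard deviation of the mean, which is why the means occupy the central half of the panel at n = 4 and the central quarter at n = 16.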

Estimation

  • Statistical Models
  • Point and Interval Estimators
  • Criteria for Point Estimators
  • Confidence Interval Estimation
  • Example—Confidence Interval on the Mean µ of a Population Having Known Standard Deviation
  • Example—One-Sided Confidence Intervals

Estimation is one of two broad categories of statistical techniques used for this purpose. The sample mean x̄ is an unbiased estimator of the population mean μ, and the sample variance s² is an unbiased estimator of the population variance.

Fig. 3.17 Confidence interval plot for the t distribution with n = 25, x̄ = 8.5, ν = 24, s² = 4, α = .05
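As a worked illustration of the values in this caption, and assuming the interval shown is the usual two-sided t interval, the arithmetic is roughly as follows. The quantile t_{.025, 24} ≈ 2.064 is taken from standard t tables and is not stated in the excerpt.

```latex
\bar{x} \pm t_{\alpha/2,\,\nu}\,\frac{s}{\sqrt{n}}
  = 8.5 \pm 2.064 \times \frac{2}{\sqrt{25}}
  = 8.5 \pm 0.826
  \quad\Longrightarrow\quad (7.674,\ 9.326)
```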

Hypothesis Testing

By specifying α in advance, the analyst maintains better control over the Type I error than over the Type II error. If the test statistic is on one side of the critical value, H0 is retained; if it is on the other side, H0 is rejected.

Fig. 3.19 One-sided confidence interval plot for the normal distribution with n = 25, x̄ = 8.5, σ² = 4, α = .05

Examples of Statistical Tests

The figure shows a two-sided rejection region: anything in the deep blue region outside the critical limits (x̄_critLeft = 31.92, x̄_critRight = 32.08).

Power and Operating Characteristic (O.C.) (Beta) Curves

The figure shows a one-sided rejection region: anything in the deep blue region below the limit x̄_c = 31.93. Noncentrality is not an issue for tests using the normal distribution, since the normal does not have a noncentral form.

Fig. 3.22 Test whether the bottle production is within bounds. The figure shows a one-sided rejection region: anything in the deep blue region below the limit x̄_c = 31.93

Efficiency

The crosshairs mark the location on the power and Type II error probability curves for a particular value μa of the alternative. On the right, the variance is assumed unknown and must be estimated from the data, so the alternative distribution is a noncentral t distribution.

Sampling

  • Simple Random Sampling
  • Stratified Random Sampling
  • Cluster Random Sampling
  • Systematic Random Sampling
  • Standard Errors of Sample Means
  • Sources of Bias in Samples

But precision is sacrificed because this method prevents a large part of the population from being sampled. Then, by the Central Limit Theorem, an approximate large-sample 100(1−α)% confidence interval for the population mean is of the form.

Exercises

In this chapter, we present several of the types of graphs and plots that we will use throughout. We discuss the visual impact of the graphs and relate them to the tabular presentation of the same material.

What Is a Graph?

The goal of statistical graphics is to show data characteristics such as location, variability, range, shape, correlation, interaction, and clustering. Once we understand the data visually, we usually try to model it using formal algebraic procedures.

Example—Ecological Correlation

Scatterplots

In Figure 4.3 we plot the selling price against the number of bedrooms, the dining room area, and the kitchen area. Now we see clear upward trends in price within each of the panels of the figure.

Scatterplot Matrix

We explore this possibility in Figure 4.4, where we show all three plots according to whether the property is a condominium or a house. In the condominium panel in Figure 4.6, we see that the condominium with the largest kitchen and dining room is one of the more expensive properties (but not the highest) and only has two bedrooms.

Fig. 4.4 Selling price by number of bedrooms, by dining room area, and by kitchen area for 105 single-family homes, conditioned on whether the property is a condominium or house, in Mount Laurel, New Jersey, from March 1992 through September 1994.

Array of Scatterplots

In this figure, we use two lines of top strip labels, one for the pairs of columns that represent the level and the other for the levels arranged within that level. There are two rows of strip labels on the top and one column of strip labels on the left.

Example—Life Expectancy

  • Study Objectives
  • Data Description
  • Initial Graphs

When we add the 45° line in Figure 4.9d, we recover much of the correct impression. In Figure 4.9e, we have plotted the least-squares line through the points in addition to the 45° line.

Scatterplot Matrices—Continued

It is usually more informative to produce two or more adjacent sploms, by conditioning on the categorical variables, or to use different plot symbols for the different levels of one of the factors. We have presented what we consider to be the best orientation of the splom in Figure 4.10.

Fig. 4.8 Life Expectancy. In most countries, female life expectancy is longer than male life expectancy.

Televisions, Physicians, and Life Expectancy

The single triangle of a scatterplot matrix can be created in R by suppressing one of the triangles, for example with a call similar to . Older programs sometimes show a very confusing subset of the lower triangle in which the rows and columns of the screen show different sets of variables.

Fig. 4.11 Nonoptimal alternate orientation with rectangular panels for splom. The downhill diagonal is harder to read (see Figure 4.12)

  • Data Transformations
  • Life Expectancy Example—Continued
  • Color Vision
  • Exercises

Figure 4.16 shows the plots for all three families: the two parameterizations of the Box-Cox power transformations, T_p(x) and T*_p(x), and the third, poorly parameterized power family W_p(x). More typically, the shape of the plot changes as the powers of both variables change.
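The definitions of T_p, T*_p, and W_p are not reproduced in this excerpt. As a hedged illustration of why the parameterization matters, the sketch below compares the common textbook scaled Box-Cox form, (x^p − 1)/p with log(x) at p = 0, with the plain power x^p. Both function names are hypothetical, and neither is claimed to match the book's notation exactly.

```python
import math


def box_cox(x, p):
    """Scaled Box-Cox power transformation (common textbook form, assumed here)."""
    if x <= 0:
        raise ValueError("x must be positive")
    if p == 0:
        return math.log(x)          # limiting case as p approaches 0
    return (x ** p - 1) / p


def simple_power(x, p):
    """Plain power x**p, shown only for contrast."""
    return x ** p


# The scaled form approaches log(x) smoothly as p approaches 0.
print(box_cox(2.0, 1e-6), math.log(2.0))

# For negative p the scaled form still increases with x,
# whereas the plain power reverses the order of the data.
print(box_cox(1.0, -1), box_cox(2.0, -1))
print(simple_power(1.0, -1), simple_power(2.0, -1))
```

A family that preserves the ordering of the data and passes continuously through the logarithm is easier to compare across powers, which is one common reason a plain power family is regarded as poorly parameterized.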

Fig. 4.14 Televisions, physicians, and life expectancy.

4.A Appendix: R Graphics

In the graphs illustrating the lack of homogeneity of variance (Figure 6.6), one of the groups in the Cartesian product is the function of the data represented (observed data, data centered on the median, absolute value of the data centered on the median). In the power-scale graphs (Figure 4.17), the rows are the powers of y and the columns are the powers of x.

Fig. 4.21 The response to treatments A and B was measured at weeks 1, 2, 4, and 8. The boxplots have been positioned at distances illustrating the time difference and with A and B adjacent at each time point.

4.B Appendix: Graphs Used in This Book

In this framework, the superposition of all levels of a factor is considered a level. The graph is designed as a crossover of the means of the response variable at the levels of one factor with itself.

Fig. 4.22 Several ways to plot multiple variables simultaneously. The top row shows the

Introductory Inference

Normal (z) Intervals and Tests

  • Test of a Hypothesis Concerning the Mean of a Population Having Known Standard Deviation
  • Confidence Intervals for Unknown Population Proportion p
  • Tests on an Unknown Population Proportion p
  • Example—One-Sided Hypothesis Test Concerning a Population Proportion

The test procedure for the second pair of hypotheses is the mirror image of the first pair. H0 is rejected if ȳ < μ0 − z_α σ/√n and retained otherwise. Similar to the two-tailed interval proposed by Agresti and Caffo, Cai (2003) proposes improved one-tailed confidence intervals that bring coverage probabilities closer to 1 − α than the conventional intervals.
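The lower-tail rejection rule above can be sketched in a few lines. The numbers in the usage line are hypothetical and chosen only for illustration, and the function name `lower_tail_z_test` is not from the book.

```python
import math
from statistics import NormalDist


def lower_tail_z_test(ybar, mu0, sigma, n, alpha=0.05):
    """One-sided z test against H1: mu < mu0 with known sigma.

    H0 is rejected when ybar < mu0 - z_alpha * sigma / sqrt(n).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)        # upper-alpha quantile
    se = sigma / math.sqrt(n)
    critical = mu0 - z_alpha * se
    z_calc = (ybar - mu0) / se
    return {
        "critical": critical,
        "z_calc": z_calc,
        "p_value": std_normal.cdf(z_calc),         # lower-tail p-value
        "reject": ybar < critical,
    }


# Hypothetical data: mu0 = 50, sigma = 4, n = 16, observed mean 48.
result = lower_tail_z_test(ybar=48.0, mu0=50.0, sigma=4.0, n=16)
print(result)
```

Comparing ȳ with the critical value and comparing the p-value with α are the same decision expressed on two different scales.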

Table 5.1 Confidence intervals and tests with known standard deviation σ, where σ_ȳ = σ
  • Example—Inference on a Population Mean µ

The confidence interval and tests for μ using the t distribution are similar to those using the normal (z) distribution (i.e., Table 5.1 is applicable), with t_calc replacing z_calc and t_α replacing z_α. Assuming that the standard deviation of the study population is not known, we want to calculate a 95% confidence interval for μ and to test H0: μ = 10 vs H1: μ ≠ 10.

Confidence Interval on the Variance or Standard Deviation of a Normal Population

Comparisons of Two Populations Based on Independent Samples

  • Confidence Intervals on the Difference Between Two Population Proportions
  • Confidence Interval on the Difference Between Two Means
  • Tests Comparing Two Population Means When the Samples Are Independent
  • Comparing the Variances of Two Normal Populations

For a confidence interval on a difference of two means when the population variances are unknown, there are two cases. If the samples are independent and we assume different unknown variances, we use s_Δȳ = s_(ȳ1 − ȳ2) and t_calc as given by equation (5.14).

Table 5.3 Confidence intervals and tests for two population means. When the samples are independent and we can assume a common unknown variance, use s_Δȳ = s_p
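The two standard errors for a difference of independent means can be compared directly. The sketch uses the standard textbook formulas (pooled: s_p √(1/n1 + 1/n2); unpooled: √(s1²/n1 + s2²/n2)), which are assumed here rather than copied from the book's equations, and the data are hypothetical.

```python
import math
import statistics


def pooled_se(y1, y2):
    """Standard error of ybar1 - ybar2 assuming a common variance."""
    n1, n2 = len(y1), len(y2)
    s2_pooled = ((n1 - 1) * statistics.variance(y1)
                 + (n2 - 1) * statistics.variance(y2)) / (n1 + n2 - 2)
    return math.sqrt(s2_pooled) * math.sqrt(1 / n1 + 1 / n2)


def unpooled_se(y1, y2):
    """Standard error of ybar1 - ybar2 allowing different variances."""
    return math.sqrt(statistics.variance(y1) / len(y1)
                     + statistics.variance(y2) / len(y2))


# Hypothetical samples with visibly different spreads.
group1 = [10.0, 12.0, 14.0]
group2 = [11.0, 15.0, 19.0, 23.0]

print(round(pooled_se(group1, group2), 4))
print(round(unpooled_se(group1, group2), 4))
```

When the sample variances differ as much as they do here, the two standard errors disagree noticeably, which is why the choice between the two cases matters.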

Paired Data

  • Example—t-test on Matched Pairs of Means

The data are available as data(teachers); the first column is the error score for the English version of the sentence and the second column is the error score for the Greek version of the sentence. The null hypothesis of equal difficulty corresponds to a comparison of the sample means on the transformed scale to 4.123, not 0.

Fig. 5.7 Dotplot of language difficulty scores. The difficulty in learning each of 32 sentences written in English for Greek speakers (marked English) and written in Greek for English speakers (marked Greek) is noted

Sample Size Determination

  • Sample Size for Estimation
  • Sample Size for Hypothesis Testing

The sample size formulas here are all the result of the explicit solution of a given equation. The required sample size for the Agresti and Caffo CI in a single proportion, equation (5.3), is

Fig. 5.9 Sample size and power for the one-sample, one-sided normal test. This figure illustrates Equation 5.18

Goodness of Fit

  • Chi-Square Goodness-of-Fit Test
  • Example—Test of Goodness-of-Fit to a Binomial Distribution

Determining the rule of thumb for "too small" required goodness-of-fit tests of chi-square distributions with appropriate degrees of freedom. The chi-square distribution can be used to perform tests of goodness-of-fit, i.e., of form.

Fig. 5.11 Plot of the hypothesis test of Table 5.9. The observed value χ² = 6.8 shows p = 0.236 and is in the middle of the do-not-reject region,
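The chi-square goodness-of-fit statistic itself is simple to compute. The sketch below uses hypothetical counts and a hypothetical equal-probability null hypothesis; it does not reproduce the data of Table 5.9, and the function name is illustrative.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over the cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in five cells; the null hypothesis assigns
# equal probability to every cell.
observed = [18, 22, 20, 25, 15]
n = sum(observed)
expected = [n / len(observed)] * len(observed)

stat = chi_square_statistic(observed, expected)
df = len(observed) - 1      # no parameters estimated from the data here

print(stat, df)
```

The statistic is then referred to a chi-square distribution with the stated degrees of freedom; one further degree of freedom is lost for each parameter estimated from the data.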

Normal Probability Plots and Quantile Plots

  • Normal Probability Plots
  • Example—Comparing t-Distributions

S(y) is the empirical distribution function of the data, the fraction of the data that are less than or equal to y. The t distributions have more and more probability in the tails (larger |q|) as the degrees of freedom decrease.
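The definition of S(y) translates directly into code. This is a minimal sketch with hypothetical data; the helper name `ecdf` is illustrative.

```python
from bisect import bisect_right


def ecdf(data):
    """Return S, where S(y) is the fraction of the data <= y."""
    ordered = sorted(data)
    n = len(ordered)

    def S(y):
        # bisect_right counts how many sorted values are <= y.
        return bisect_right(ordered, y) / n

    return S


S = ecdf([3.1, 1.2, 4.8, 2.5, 4.8])   # hypothetical sample
print(S(0), S(2.5), S(4.0), S(4.8))
```

S is a step function that jumps by 1/n at each observation, or by a multiple of 1/n at tied values such as 4.8 here.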

Figure 5.13 contrasts the appearance of normal probability plots for the normal distribution and various departures from normality

Kolmogorov–Smirnov Goodness-of-Fit Tests

  • Example—Kolmogorov–Smirnov Goodness-of-Fit Test

On the left we compare a random selection from the t distribution with 5 df with a null hypothesis distribution of t with 2 df. On the right we compare a random selection from the standard normal distribution with a null hypothesis distribution of t with 2 df.

Fig. 5.17 Kolmogorov–Smirnov plots. Kolmogorov–Smirnov One-Sample Test. On the left we compare a random selection from the t distribution with 5 df to a null hypothesis distribution of t with 2 df
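The one-sample Kolmogorov–Smirnov statistic is the largest vertical distance between the empirical distribution function and the null hypothesis distribution function. The sketch below uses the closed-form distribution function of the t distribution with 2 df as the null and a small hypothetical sample; it computes only the statistic, not its p-value, and the function names are illustrative.

```python
import math


def t2_cdf(t):
    """Distribution function of Student's t with 2 df (closed form)."""
    return 0.5 + t / (2 * math.sqrt(t * t + 2))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical and hypothesized CDFs.

    The empirical CDF jumps at each observation, so both sides of
    every jump are checked.
    """
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


sample = [-1.3, -0.4, 0.1, 0.6, 1.9]    # hypothetical observations
print(round(ks_statistic(sample, t2_cdf), 4))
```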

Maximum Likelihood

  • Maximum Likelihood Estimation
  • Likelihood Ratio Tests

The value of μ that maximizes this expression is the value of μ that minimizes Σ(yi − μ)². The answer, μ̂ = ȳ, is both the “least squares” and the maximum likelihood estimator of μ. While likelihood ratio tests generally do not have optimal properties, experience has shown that they are often competitive.
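The minimization step can be written out explicitly. The following is the standard calculus argument, given here as a sketch rather than as the book's own derivation:

```latex
\frac{d}{d\mu}\sum_{i=1}^{n}(y_i-\mu)^2
  = -2\sum_{i=1}^{n}(y_i-\mu)
  = -2\Bigl(\sum_{i=1}^{n}y_i - n\mu\Bigr) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}y_i = \bar{y},
\qquad
\frac{d^2}{d\mu^2}\sum_{i=1}^{n}(y_i-\mu)^2 = 2n > 0 .
```

The positive second derivative confirms that the stationary point is a minimum of the sum of squares, and hence a maximum of the likelihood.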

Exercises

The mean μ of the Poisson distribution is known in advance or must be estimated from the data. The Poisson parameter μ is unknown and must be estimated as a weighted average of the possible values y, i.e., .

One-Way Analysis of Variance

Example—Catalyst Data

The F-test in the ANOVA table addresses the null hypothesis that the four catalysts have equal mean concentrations. From the small p-value (p = .0014) we immediately see that these four catalysts do not yield the same average concentrations.

Fig. 6.1 Boxplots Comparing the Concentrations for each Catalyst

Fixed Effects

The total sum of squares is the sum of the treatment and residual sums of squares, and the total degrees of freedom is the sum of the treatment and residual degrees of freedom. R does not print the Total line in its ANOVA tables. The terms of the table are defined by.

Table 6.2 Sample Table to Illustrate Structure of the ANOVA Table

Analysis of Variance of Dependent Variable y
Source     Degrees of Freedom   Sum of Squares   Mean Square   F   p-value
Residual   df_Res               SS_Res           MS_Res
Total      df_Total             SS_Total
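The additivity of the sums of squares and degrees of freedom can be verified numerically. This is a sketch with hypothetical data for three groups; the function and key names are illustrative and do not follow any particular package.

```python
import statistics


def one_way_anova(groups):
    """Sums of squares, degrees of freedom, mean squares, and F
    for a one-way layout given as {label: [observations]}."""
    all_values = [y for g in groups.values() for y in g]
    grand_mean = statistics.fmean(all_values)

    ss_treatment = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2
                       for g in groups.values())
    ss_residual = sum((y - statistics.fmean(g)) ** 2
                      for g in groups.values() for y in g)
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)

    df_treatment = len(groups) - 1
    df_residual = len(all_values) - len(groups)
    ms_treatment = ss_treatment / df_treatment
    ms_residual = ss_residual / df_residual

    return {
        "ss": (ss_treatment, ss_residual, ss_total),
        "df": (df_treatment, df_residual, len(all_values) - 1),
        "ms": (ms_treatment, ms_residual),
        "F": ms_treatment / ms_residual,
    }


# Hypothetical observations for three treatments.
data = {"A": [4.0, 5.0, 6.0], "B": [7.0, 8.0, 9.0], "C": [1.0, 2.0, 3.0]}
table = one_way_anova(data)
print(table)
```

In the output, the first two entries of each of the "ss" and "df" tuples add up to the third, which is the Total line that R omits.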

Multiple Comparisons—Tukey Procedure for Comparing All Pairs of Means

Figures

Table 2.6 Missing numerical values are displayed as NA. Missing character and factor items are displayed as <NA>.
Fig. 2.1 The dataset abcd is defined in Table 2.6. The plot was drawn with
Fig. 3.1 Mosaic plot corresponding to Table 3.1. The area of each panel is equal to the probability of the event identified in that panel
Fig. 3.2 P(2 < X < 4) equals the area under the density between 2 and 4.
