The Empirical Rule - The Statistics Art and Science of Learning from Data

88 Chapter 2 Exploring Data with Graphs and Numerical Summaries

Section 2.4 Measuring the Variability of Quantitative Data 89

Question to Explore

Can we use the empirical rule to describe the variability from the mean of these data? If so, how?

Empirical rule

Female Student Heights

Example 14

Picture the Scenario

Many human physical characteristics have bell-shaped distributions. Let’s explore height. Question 1 on the student survey in Activity 3 of Chapter 1 asked for the student’s height. Figure 2.13 shows a histogram of the heights from responses to this survey by 261 female students at the University of Georgia. (The data are in the Heights data file on the book’s website. Note that the height of 92 inches was omitted from the analysis.) Table 2.5 presents some descriptive statistics, using MINITAB.

Figure 2.13 Histogram of Female Student Height Data. This summarizes heights, in inches, of 261 female college students. (Horizontal axis: Height (in), 56 to 78; vertical axis: Percent (%), 0 to 16.) Question: How would you describe the shape, center, and variability of the distribution?

Table 2.5 MINITAB Output for Descriptive Statistics of Student Height Data

Variable    N    Mean    Median  StDev  Minimum  Maximum
HEIGHT      261  65.284  65.000  2.953  56.000   77.000

Think It Through

Figure 2.13 has approximately a bell shape. The figure in the margin shows a bell-shaped curve that approximates the histogram. From Table 2.5, the mean and median are close, about 65 inches, which reflects an approximately symmetric distribution. The empirical rule is applicable.

From Table 2.5, the mean is 65.3 inches and the standard deviation (labeled StDev) is 3.0 inches (rounded). By the empirical rule, approximately

• 68% of the observations fall between x̄ − s = 65.3 − 3.0 = 62.3 and x̄ + s = 65.3 + 3.0 = 68.3, that is, within the interval (62.3, 68.3), about 62 to 68 inches.

[Margin figure: histogram of the female student heights with an approximating bell-shaped curve. Horizontal axis: Height (in); vertical axis: Percent (%).]


With a bell-shaped distribution for a large data set, the observations usually extend about 3 standard deviations below the mean and about 3 standard deviations above the mean.

When the distribution is highly skewed, the most extreme observation in one direction may not be nearly that far from the mean. For instance, on a recent exam, the scores ranged between 30 and 100, with median = 88, x̄ = 84, and s = 16. The maximum score of 100 was only 1 standard deviation above the mean (that is, 100 = x̄ + s = 84 + 16). By contrast, the minimum score of 30 was more than 3 standard deviations below the mean. This happened because the distribution of scores was highly skewed to the left. (See the margin figure.)
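One way to make this comparison concrete is to express each extreme score as a number of standard deviations from the mean. The following is a minimal sketch that uses only the summary values quoted above (mean 84, standard deviation 16, minimum 30, maximum 100); the helper name `sd_from_mean` is an illustrative choice, not notation from the text.

```python
def sd_from_mean(value, mean, sd):
    """Signed distance of `value` from `mean`, in units of `sd`."""
    return (value - mean) / sd


mean, sd = 84, 16  # exam summary statistics quoted in the text

max_dist = sd_from_mean(100, mean, sd)  # maximum score
min_dist = sd_from_mean(30, mean, sd)   # minimum score

print(f"Maximum is {max_dist:+.2f} standard deviations from the mean")
print(f"Minimum is {min_dist:+.2f} standard deviations from the mean")
```

The two distances are far from symmetric, which is the pattern a left-skewed distribution produces.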

Remember that the empirical rule is only for bell-shaped distributions. In Chapter 6, we’ll see why the empirical rule works. For a special family of smooth, bell-shaped curves (called the normal distribution), we’ll see how to find the percentage of the distribution in any particular region by knowing only the mean and the standard deviation.

Sample Statistics and Population Parameters

Of the numerical summaries introduced so far, the mean x̄ and the standard deviation s are the most commonly used in practice. We’ll use them frequently in the rest of the text. The formulas that define x̄ and s refer to sample data. They are sample statistics.

We will distinguish between sample statistics and the corresponding parameter values for the population. The population mean is the average of all observations in the population. The population standard deviation describes the variability of the population observations about the population mean. Both of these population parameters are usually unknown. Inferential statistical methods help us make decisions and predictions about the population parameters based on the sample statistics.

In later chapters, to help distinguish between sample statistics and population parameters, we’ll use different notation for the parameter values. Often, Greek letters are used to denote the parameters. For instance, we’ll use μ (the Greek letter, mu) to denote the population mean and σ (the Greek letter lowercase sigma) to denote the population standard deviation.
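Software usually offers separate functions for the sample and population versions of the standard deviation. The sketch below is illustrative only. It assumes that s is defined with the n − 1 divisor, and it borrows the eight exam scores listed in Exercise 2.59 as sample data. Python's standard `statistics` module provides `stdev` (divisor n − 1) and `pstdev` (divisor n); the latter is appropriate only when the data make up the entire population, which is rarely the case.

```python
import statistics

# Exam scores listed in Exercise 2.59, treated here as a sample
scores = [35, 59, 70, 73, 75, 81, 84, 86]

x_bar = statistics.mean(scores)   # sample mean
s = statistics.stdev(scores)      # sample standard deviation (divisor n - 1)

# Divisor n: meaningful only if these scores were the whole population
sd_as_population = statistics.pstdev(scores)

print(f"x-bar = {x_bar:.1f}, s = {s:.1f}")
print(f"population-style standard deviation = {sd_as_population:.1f}")
```

The rounded values of x̄ and s agree with the 70.4 and 16.7 reported in that exercise.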

Caution: Using the Empirical Rule

The empirical rule may approximate the actual percentages falling within 1, 2, and 3 standard deviations of the mean poorly if the data are highly skewed or highly discrete (the variable taking relatively few values). See Exercise 2.56.
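To see how poor the approximation can be, the following sketch works with the marriage counts tabulated in Exercise 2.56 (0, 1, or 2 marriages among 20- to 24-year-old men). It is an illustration under stated assumptions: the standard deviation uses the n − 1 divisor, and the helper `pct_within` is a hypothetical name introduced here.

```python
from math import sqrt

# Number of times married -> count, from the table in Exercise 2.56
counts = {0: 8418, 1: 1594, 2: 10}
n = sum(counts.values())

mean = sum(x * c for x, c in counts.items()) / n
# Sample standard deviation (divisor n - 1), computed from the frequency table
sd = sqrt(sum(c * (x - mean) ** 2 for x, c in counts.items()) / (n - 1))


def pct_within(k):
    """Percentage of observations within k standard deviations of the mean."""
    low, high = mean - k * sd, mean + k * sd
    inside = sum(c for x, c in counts.items() if low <= x <= high)
    return 100 * inside / n


actual = {k: round(pct_within(k), 1) for k in (1, 2, 3)}
print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(actual)
```

Because the variable takes only three values and most observations equal 0, the percentages within 1 and 2 standard deviations are identical and well away from 68% and 95%.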

Recall

From Section 1.2, a population is the total group about whom you want to make conclusions. A sample is a subset of the population for whom you actually have data.

A parameter is a numerical summary of the population, and a statistic is a numerical summary of a sample.

• 95% of the observations fall within x̄ ± 2s, which is 65.3 ± 2(3.0), or (59.3, 71.3), about 59 to 71 inches.

• Almost all observations fall within x̄ ± 3s, or (56.3, 74.3).

Of the 261 observations, by actually counting, we find that

• 187 observations, 72%, fall within (62.3, 68.3).

• 248 observations, 95%, fall within (59.3, 71.3).

• 258 observations, 99%, fall within (56.3, 74.3).

In summary, the percentages predicted by the empirical rule are near the actual ones.
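The arithmetic in this example can be reproduced in a few lines. The sketch below is illustrative: it uses the rounded mean (65.3) and standard deviation (3.0) from Table 2.5 together with the counts reported above, and the function name `empirical_intervals` is an assumption made for this illustration.

```python
def empirical_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3, rounded to one decimal place."""
    return {k: (round(mean - k * sd, 1), round(mean + k * sd, 1)) for k in (1, 2, 3)}


intervals = empirical_intervals(65.3, 3.0)

# What the empirical rule predicts for each interval
predicted = {1: "about 68%", 2: "about 95%", 3: "almost all"}

# Counts of the 261 female heights that actually fall in each interval
observed_counts = {1: 187, 2: 248, 3: 258}
n = 261
observed_pct = {k: round(100 * c / n) for k, c in observed_counts.items()}

for k in (1, 2, 3):
    low, high = intervals[k]
    print(f"within {k} sd: ({low}, {high})  predicted {predicted[k]}, observed {observed_pct[k]}%")
```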

Insight

Because the distribution is close to bell shaped, we can predict simple summaries effectively using only two numbers—the mean and the standard deviation.

We can do the same if we look at the data only for the 117 males, who had a mean of 70.9 and standard deviation of 2.9.

Try Exercise 2.51

Caution

The empirical rule (which applies only to bell-shaped distributions) is not a general interpretation for what the standard deviation measures. The general interpretation is that the standard deviation measures the typical distance of observations from the mean (this holds for all distributions).

[Margin figure: histogram of the exam scores, skewed to the left. Horizontal axis: Score (20 to 100); vertical axis: Frequency.]


2.4 Practicing the Basics

2.46 Traffic violations A company decides to examine the number of points its employees have accumulated in the last two years on a driving point record system. A sample of twelve employees yields the following observations:

0 5 3 4 8 0 4 0 2 3 0 1

a. Find and interpret the range.

b. Find and interpret the standard deviation s.

c. Suppose 2 was recorded incorrectly and is supposed to be 20. Redo parts a and b with the rectified data and describe the effect of this outlier.

2.47 Life expectancy The Human Development Report 2013, published by the United Nations, showed life expectancies by country. For Western Europe, some values reported were Austria 81, Belgium 80, Denmark 80, Finland 81, France 83, Germany 81, Greece 81, Ireland 81, Italy 83, Netherlands 81, Norway 81, Portugal 80, Spain 82, Sweden 82, Switzerland 83.

For Africa, some values reported were Botswana 47, Dem. Rep. Congo 50, Angola 51, Zambia 57, Zimbabwe 58, Malawi 55, Nigeria 52, Rwanda 63, Uganda 59, Kenya 61, Mali 55, South Africa 56, Madagascar 64, Senegal 63, Sudan 62, Ghana 61.

a. Which group (Western Europe or Africa) of life expectancies do you think has the larger standard deviation? Why?

b. Find the standard deviation for each group. Compare them to illustrate that s is larger for the group that shows more variability from the mean.

2.48 Life expectancy including Russia For Russia, the United Nations reported a life expectancy of 70. Suppose we add this observation to the data set for Western Europe in the previous exercise. Would you expect the standard deviation to be larger, or smaller, than the value for the Western European countries alone? Why?

2.49 Shape of home prices? According to the National Association of Home Builders, the median selling price of new homes in the United States in February 2014 was $261,400. Which of the following is the most plausible value for the standard deviation: -$15,000, $1000, $60,000, or $1,000,000? Why? Explain what’s unrealistic about each of the other values.

2.50 Exam standard deviation For an exam given to a class, the students’ scores ranged from 35 to 98, with a mean of 74. Which of the following is the most realistic value for the standard deviation: -10, 0, 3, 12, 63? Clearly explain what’s unrealistic about each of the other values.

2.51 Heights For the sample heights of Georgia college students in Example 14, the males had x̄ = 71 and s = 3, and the females had x̄ = 65 and s = 3.

a. Use the empirical rule to describe the distribution of heights for males.

b. The standard deviation for the overall distribution (combining females and males) was 4. Why would you expect it to be larger than the standard deviations for the separate male and female height distributions? Would you expect the overall distribution still to be unimodal?

2.52 Histograms and standard deviation The figure shows histograms for three samples, each with sample size n = 100.

a. Which sample has the (i) largest and (ii) smallest standard deviation?

b. To which sample(s) is the empirical rule relevant? Why?

[Figure: Histograms and relative sizes of standard deviations. Three histograms, labeled a, b, and c, each with a horizontal axis running from 1 to 9.]

2.53 On-time performance of airlines In 2015, data collected on the monthly on-time arrival rate of major domestic and regional airlines operating between Australian airports shows a roughly bell-shaped distribution for 72 observations with x̄ = 85.93 and s = 3. Use the empirical rule to describe the distribution. (Source: https://bitre.gov.au/publications/ongoing/airline_on_time_monthly.aspx)

2.54 Students’ shoe size Data collected over several years from college students enrolled in a business statistics class regarding their shoe size shows a roughly bell-shaped distribution, with x̄ = 9.91 and s = 2.07.

a. Give an interval within which about 95% of the shoe sizes fall.

b. Identify the shoe size of a student which is three standard deviations above the mean in this sample. Would this be a rather unusual observation? Why?

2.55 Shape of cigarette taxes A recent summary for the distribution of cigarette taxes (in cents) among the 50 states and Washington, D.C., in the United States reported x̄ = 73 and s = 48. Based on these values, do you think that this distribution is bell shaped? If so, why? If not, why not, and what shape would you expect?

2.56 Empirical rule and skewed, highly discrete distribution The table below shows data (from a 2004 Bureau of the Census report) on the number of times 20- to 24-year-old men have been married.

No. Times   Count   Percentage
0           8418    84.0
1           1594    15.9
2           10      0.1
Total       10022   100.0

a. Verify that the mean number of times men have been married is 0.16 and that the standard deviation is 0.37.

b. Find the actual percentages of observations within 1, 2, and 3 standard deviations of the mean. How do these compare to the percentages predicted by the empirical rule?

c. How do you explain the results in part b?

2.57 Time spent using electronic devices A student conducted a survey about the amount of free time spent using electronic devices in a week. Of 350 collected responses, the mode was 9, the median was 14, the mean was 17, and the standard deviation was 11.5. Based on these statistics, what would you surmise about the shape of the distribution? Why?

2.58 Facebook friends A student asked her coworkers, parents, and friends, “How many friends do you have on Facebook?” She summarized her data and reported that the average number of Facebook friends in her sample is 170 with a standard deviation of 90. The distribution had a median of 120 and a mode of 105.

a. Based on these statistics, what would you surmise about the shape of the distribution? Why?

b. Does the empirical rule apply to these data? Why or why not?

2.59 Judging skew using x̄ and s If the largest observation is less than 1 standard deviation above the mean, then the distribution tends to be skewed to the left. If the smallest observation is less than 1 standard deviation below the mean, then the distribution tends to be skewed to the right. A professor examined the results of the first exam given in her statistics class. The scores were

35 59 70 73 75 81 84 86

The mean and standard deviation are 70.4 and 16.7. Using these, determine whether the distribution is either left or right skewed. Construct a dot plot to check.

2.60 Youth unemployment in the EU The Youth Unemployment data file on the book’s website contains 2013 unemployment rates in the 28 EU countries for people between 15 and 24 years of age. (The data are also shown in Exercise 2.63.) Using software,

a. Construct a graph to visualize the distribution of the unemployment rate.

b. Find the mean, median, and standard deviation.

c. Write a short paragraph summarizing the distribution of the youth unemployment rate, interpreting some of the above statistics in context.

2.61 Create data with a given standard deviation Use the Mean versus Median web app on the book’s website to investigate how the standard deviation changes as the data change. When you start the app, you have a blank graph. Under “Options”, you can request to show the standard deviation of the points you create by clicking in the graph.

a. Create 3 observations (by clicking in the graph) that have a mean of about 50 and a standard deviation of about 20. (Clicking on an existing point deletes it.)

b. Create 3 observations (click Refresh to clear the previous points) that have a mean of about 50 and a standard deviation of about 40.

c. Placing 4 values between 0 and 100, what is the largest standard deviation you can get? What are the values that have that standard deviation?

2.5 Using Measures of Position to Describe Variability

The mean and median describe the center of a distribution. The range and the standard deviation describe the variability of the distribution. We’ll now learn about some other ways of describing a distribution using measures of position.

One type of measure of position tells us the point where a certain percentage of the data fall above or fall below that point. The median is an example. It specifies a location such that half the data fall below it and half fall above it. The range uses two other measures of position, the maximum value and the minimum value. Another type of measure of position tells us how far an observation falls from a particular point, such as the number of standard deviations an observation falls from the mean.

Measures of Position: The Quartiles
