The empirical rule tells us that for a bell-shaped distribution, it is unusual for an observation to fall more than 3 standard deviations from the mean. An alternative criterion for identifying potential outliers uses the standard deviation.
An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than 3 standard deviations from the mean.
How do we know the number of standard deviations that an observation falls from the mean? When $\bar{x} = 84$ and $s = 16$, a value of 100 is 1 standard deviation above the mean because $100 - 84 = 16$. Alternatively, $(100 - 84)/16 = 1$.
Section 2.5 Using Measures of Position to Describe Variability 99
Taking the difference between an observation and the mean and dividing by the standard deviation tells us the number of standard deviations that the observation falls from the mean. This number is called the z-score.
z-Score for an Observation
The z-score for an observation is the number of standard deviations that it falls from the mean. A positive z-score indicates the observation is above the mean. A negative z-score indicates the observation is below the mean. For sample data, the z-score is calculated as
$$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}.$$
The z-score allows us to tell quickly how surprising or extreme an observation is. The z-score converts an observation (regardless of the observation’s unit of measurement) to a common scale of measurement, which allows comparisons.
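The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and its argument names are our own choices, not part of the text, and the value 68 is an extra illustrative observation that does not appear in the text.

```python
def z_score(observation: float, mean: float, std_dev: float) -> float:
    """Return the number of standard deviations that `observation` falls from `mean`."""
    if std_dev <= 0:
        # A z-score is undefined when there is no variability.
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / std_dev


# Numbers from the text: mean 84 and standard deviation 16.
print(z_score(100, 84, 16))  # 1.0  (one standard deviation above the mean)

# Hypothetical extra value, not from the text: 68 lies below the mean.
print(z_score(68, 84, 16))   # -1.0 (one standard deviation below the mean)
```

The sign of the result carries the direction (above or below the mean), and its size carries the distance in standard deviation units.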
Example 17: z-scores
Pollution Outliers
Picture the Scenario
Let’s consider air pollution data for the European Union (EU). The Energy-EU data file[6] on the book’s website contains data on per capita carbon dioxide (CO2) emissions, in metric tons, for the 28 nations in the EU. The mean was 7.9 and the standard deviation was 3.6. Figure 2.17 shows a box plot of the data; the histogram is shown in the margin. The maximum of 21.4, representing Luxembourg, is highlighted as a potential outlier.
[6] Source: www.iea.org/stats/index.asp.
[Margin figure: histogram of CO2 emission per capita (metric tons); vertical axis: count.]
[Box plot; axis: CO2 emission per capita (metric tons), 0 to 22.]
Figure 2.17 Box Plot of Carbon Dioxide Emissions for European Union Nations.
Question Can you use this plot to approximate the five-number summary?
Questions to Explore
a. How many standard deviations from the mean was the CO2 value of 21.4 for Luxembourg?
b. The CO2 value for the United States was 16.9. According to the three standard deviation criterion, is the United States an outlier on carbon dioxide emissions relative to the EU?
Think It Through
a. Since $\bar{x} = 7.9$ and $s = 3.6$, the z-score for the observation of 21.4 is
$$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}} = \frac{21.4 - 7.9}{3.6} = \frac{13.5}{3.6} = 3.75.$$
The carbon dioxide emission (per capita) for Luxembourg is 3.75 standard deviations above the mean. By the 3 standard deviation criterion,
100 Chapter 2 Exploring Data with Graphs and Numerical Summaries
this is a potential outlier. Because it’s well removed from the rest of the data, we’d regard it as an actual outlier. However, Luxembourg has only about 500,000 people, so in terms of the amount of pollution, it is not a major polluter in the EU.
b. The z-score for the CO2 value of the United States is $z = (16.9 - 7.9)/3.6 = 2.5$. This value does not flag the United States as an outlier relative to EU nations in terms of per capita emission. (If we considered total emission, the situation would be different.)
Insight
The z-scores of 3.75 and 2.5 are positive. This indicates that the observations are above the mean, because an observation above the mean has a positive z-score. In fact, these large positive z-scores tell us that Luxembourg and the United States have very high per capita CO2 emissions compared to the other nations. The z-score is negative when the observation is below the mean. For instance, France has a CO2 value of 5.6, which is below the mean of 7.9 and has a z-score of -0.64.
Try Exercises 2.75 and 2.76
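The calculations in Example 17 can be checked with a short Python sketch. It uses only the mean, standard deviation, and three CO2 values quoted in the example; the helper names `z_score` and `is_potential_outlier` are hypothetical names chosen for this illustration.

```python
# Summary statistics for EU per capita CO2 emissions, as quoted in Example 17.
MEAN = 7.9
STD_DEV = 3.6

# Observations mentioned in the example (metric tons per capita).
emissions = {"Luxembourg": 21.4, "United States": 16.9, "France": 5.6}


def z_score(observation, mean, std_dev):
    """Number of standard deviations that the observation falls from the mean."""
    return (observation - mean) / std_dev


def is_potential_outlier(z, threshold=3.0):
    """Apply the 3 standard deviation criterion to a z-score."""
    return abs(z) > threshold


results = {}
for nation, value in emissions.items():
    z = z_score(value, MEAN, STD_DEV)
    results[nation] = (round(z, 2), is_potential_outlier(z))
    print(f"{nation}: z = {z:.2f}, potential outlier: {is_potential_outlier(z)}")
```

Only Luxembourg, with a z-score of 3.75, exceeds the threshold of 3. Taking the absolute value in `is_potential_outlier` means that an observation far below the mean would be flagged as well.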
2.5 Practicing the Basics
2.62 Vacation days National Geographic Traveler magazine recently presented data on the annual number of vacation days averaged by residents of eight countries. They reported 42 days for Italy, 37 for France, 35 for Germany, 34 for Brazil, 28 for Britain, 26 for Canada, 25 for Japan, and 13 for the United States.
a. Report the median.
b. By finding the median of the four values below the median, report the first quartile.
c. Find the third quartile.
d. Interpret the values found in parts a–c in the context of these data.
2.63 Youth unemployment In recent years, many European nations have suffered from relatively high youth unemployment. For the 28 EU nations, the table below shows the unemployment rate among 15- to 24-year-olds in 2013, also available as a file on the book’s website. (Source: World Bank.) If you are computing the statistics below by hand, it may be easier to arrange the data in increasing order and write them in 4 rows of 7 values each.
a. Find and interpret the second quartile (= median).
b. Find and interpret the first quartile (Q1).
c. Find and interpret the third quartile (Q3).
d. Will the 10th percentile be around the value of 6 or 16? Explain.
Country          Rate    Country        Rate    Country          Rate
Belgium           8.4    Germany         5.3    Spain            26.4
Bulgaria         13.0    Estonia         8.6    France           10.3
Czech Republic    7.0    Ireland        13.1    Croatia          17.2
Denmark           7.0    Greece         27.3    Italy            12.2
Cyprus           15.9    Netherlands     6.7    Slovakia         14.2
Latvia           11.9    Austria         4.9    Finland           8.2
Lithuania        11.8    Poland         10.3    Sweden            8.0
Luxembourg        5.8    Portugal       16.5    United Kingdom    7.5
Hungary          10.2    Romania         7.3
Malta             6.5    Slovenia       10.1
2.64 On-time performance of airlines In 2015, data collected on the monthly on-time arrival rate of major domestic and regional airlines operating between Australian airports are numerically summarized by $\bar{x} = 85.93$, Q1 = 83.75, median = 85.65, Q3 = 87.75, number of observations = 72.
a. Interpret the quartiles.
b. Would you guess that the distribution is skewed or roughly symmetric? Why?
2.65 Students’ shoe size Data collected over several years from college students enrolled in a business statistics class regarding their shoe size are numerically summarized by $\bar{x} = 9.91$, Q1 = 8, median = 10, Q3 = 11.
a. Interpret the quartiles.
b. Would you guess that the distribution is skewed or roughly symmetric? Why?
2.66 Ways to measure variability The standard deviation, the range, and the interquartile range (IQR) summarize the variability of the data.
a. Why is the standard deviation s usually preferred over the range?
b. Why is the IQR sometimes preferred to s?
c. What is an advantage of s over the IQR?
2.67 Variability of net worth of billionaires Here is the five-number summary for the distribution of the net worth (in billions) of the Top 659 World’s Billionaires according to Forbes magazine.
Minimum = 2.9, Q1 = 3.6, Median = 4.8, Q3 = 8.2, Maximum = 79.2
a. About what proportion of billionaires have wealth (i) greater than $3.6 billion and (ii) greater than $8.2 billion?
b. Between which two values are the middle 50% of the observations found?
c. Find the interquartile range. Interpret it.
d. Based on the summary, do you think that this distribution was bell shaped? If so, why? If not, why not, and what shape would you expect?
2.68 Traffic violations A company decides to examine the number of points its employees have accumulated in the last two years on their driving record point system. A sample of twelve employees yields the following observations:
0 5 3 4 8 0 4 0 2 3 0 1
a. The standard deviation is 2.505. Find and interpret the range.
b. The quartiles are Q1 = 0, median = 2.5, Q3 = 4. Find the interquartile range.
c. Suppose the 2 was incorrectly recorded and is supposed to be 20. The standard deviation is then 5.625 but the quartiles do not change. Redo parts a and b with the correct data and describe the effect of this outlier. Which measure of variability, the range, IQR, or standard deviation, is least affected by the outlier? Why?
2.69 Infant mortality Africa The Human Development Report 2006, published by the United Nations, showed infant mortality rates (number of infant deaths per 1000 live births) by country. For Africa, some of the values reported were:
South Africa 54, Sudan 63, Ghana 68, Madagascar 76, Senegal 78, Zimbabwe 79, Uganda 80, Congo 81, Botswana 84, Kenya 96, Nigeria 101, Malawi 110, Mali 121, Angola 154.
a. Find the first quartile (Q1) and the third quartile (Q3).
b. Find the interquartile range (IQR). Interpret it.
2.70 Infant mortality Europe For Western Europe, the infant mortality rates reported by the Human Development Report 2006 were:
Sweden 3, Finland 3, Spain 3, Belgium 4, Denmark 4, France 4, Germany 4, Greece 4, Italy 4, Norway 4, Portugal 4, Netherlands 5, Switzerland 5, UK 5.
Show that Q1 = Q2 = Q3 = 4. (The quartiles, like the median, are less useful when the data are highly discrete.)
2.71 Computer use During a recent semester at the University of Florida, students having accounts on a mainframe computer had storage space use (in kilobytes) described by the five-number summary, minimum = 4, Q1 = 256, median = 530, Q3 = 1105, and maximum = 320,000.
a. Would you expect this distribution to be symmetric, skewed to the right, or skewed to the left? Explain.
b. Use the 1.5 * IQR criterion to determine whether any potential outliers are present.
2.72 Central Park temperature distribution revisited Exercise 2.26 showed a histogram for the distribution of Central Park annual average temperatures. The box plot for these data is shown here.
a. If this distribution is skewed, would you expect it to be skewed to the right or to the left? Explain.
b. Approximate each component of the five-number summary and interpret.
2.73 Box plot for exam The scores on an exam have mean = 88, standard deviation = 10, minimum = 65, Q1 = 77, median = 85, Q3 = 91, maximum = 100.
Sketch a box plot, labeling which of these values are used in the plot.
2.74 Public transportation Exercise 2.37 described a survey about how many miles per day employees of a company use public transportation. The sample values were:
0 0 4 0 0 0 10 0 6 0
a. Identify the five-number summary and sketch a box plot.
b. Explain why Q1 and the median share the same line in the box.
c. Why does the box plot not have a left whisker?
2.75 Energy statistics The Energy Information Administration records per capita consumption of energy by country. The box plot below shows 2011 per capita energy consumption (in millions of BTUs) for 36 OECD countries, with a mean of 195 and a standard deviation of 120. Iceland had the largest per capita consumption at 665 million BTU.
a. Use the box plot to give approximate values for the five-number summary of energy consumption.
b. Italy had a per capita consumption of 139 million BTU. How many standard deviations from the mean was its consumption?
c. The United States was not included in the data, but its per capita consumption was 334 million BTU. Relative to the distribution for the included OECD nations, the United States is how many standard deviations from the mean?
[Box plot for Exercise 2.75; axis: per capita energy consumption (in millions of BTU), 50 to 650.]
2.76 European Union youth unemployment rates The 2013 unemployment rates of countries in the European Union shown in Exercise 2.63 ranged from 4.9 to 27.3, with Q1 = 7.15, median = 10.15, Q3 = 13.05, a mean of 11.1, and a standard deviation of 5.6.
a. In a box plot, what would be the values at the outer edges of the box, and what would be the values to which the whiskers extend?
b. Which two countries will show up as outliers in the box plot? Why?
c. Greece had the highest unemployment rate of 27.3. Is it an outlier according to the 3 standard deviation criterion? Explain.
d. What unemployment value for a country would have a z-score equal to 0?
2.77 Air pollution Example 17 discussed EU carbon dioxide emissions, which had a mean of 7.9 and standard deviation of 3.6.
a. Finland’s observation was 11.5. Find its z-score and interpret.
b. Sweden’s observation was 5.6. Find its z-score and interpret.
c. The UK’s observation was 7.9. Find the z-score and interpret.
2.78 Height of buildings For the 50 tallest buildings in Washington D.C., the mean was 195.57 ft and the standard deviation was 106.32 ft. The tallest building in this sample had a height of 761 ft (www.skyscrapercenter.com).
a. Find the z-score for the height of 761 ft.
b. What does the positive sign for the z-score represent?
c. Is this observation a potential outlier according to the 3 standard deviation distance criterion? Explain.
2.79 Marathon results The results for the first 20 finishers of the Women’s Marathon in the World Athletics Championships in Beijing in 2015 appear to have a roughly bell-shaped distribution with a mean of 9073 seconds and a standard deviation of 161.4 seconds. Among the first 20 runners to finish the race, the finishing time of the last finalist was 9377 seconds. Is this an unusual result? Answer by providing statistical justification. (Source: https://www.iaaf.org/)
2.80 Florida students again Refer to the FL Student Survey data set on the book’s website and the data on weekly hours of TV watching.
a. Use software to construct a box plot. Interpret the information on the plot and use it to describe the shape of the distribution.
b. Using a criterion for outliers, investigate whether there are any potential outliers.
2.81 Females or males watch more TV? Refer to the previous exercise. Suppose you wanted to compare TV watching by males and females. Construct a side-by-side box plot to do this. Interpret.
2.82 CO2 comparison The vertical side-by-side box plots shown below compare per capita carbon dioxide emissions in 2011 for many Central and South American and European nations. (Data available on the book’s website. Source: www.eia.gov)
a. Give the approximate value of carbon dioxide emissions for the outliers shown. (There are actually two outliers.)
b. What shape would you predict for the distribution in Central and South America? Why?
c. Summarize how the carbon dioxide emissions compare between the two regions.
[Side-by-side box plots for Exercise 2.82: C&S America and Europe; vertical axis: per capita CO2 emission (in metric tons), 0 to 14.]
2.6 Recognizing and Avoiding Misuses of Graphical Summaries
In this chapter, we’ve learned how to describe data. We’ve learned that the nature of the data affects how we can effectively describe them. For example, if a distribution is very highly skewed or has extreme outliers, we’ve seen that some numerical summaries, such as the mean, can be misleading.
To illustrate, during a recent semester at the University of Florida, students with accounts on a mainframe computer had storage space usage (in kilobytes) described by the five-number summary:
minimum = 4, Q1 = 256, median = 530, Q3 = 1105, maximum = 320,000
and by the mean of 1921 and standard deviation of 11,495. The maximum value was an extreme outlier. You can see what a strong effect that outlier had on the mean and standard deviation. For these data, it’s misleading to use these two values to summarize the distribution. To finish this chapter, let’s look at how we also need to be on the lookout for misleading graphical summaries.
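To see how extreme the maximum is, here is a minimal Python sketch that uses only the summary values quoted above. It assumes the usual form of the 1.5 * IQR criterion (an observation is a potential outlier if it falls more than 1.5 * IQR below Q1 or above Q3); the variable names are our own.

```python
# Summary statistics for storage space usage (kilobytes), as quoted in the text.
minimum, q1, median, q3, maximum = 4, 256, 530, 1105, 320_000
mean, std_dev = 1921, 11_495

# 1.5 * IQR criterion (assumed standard form): fences sit 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# z-score of the maximum, for comparison with the 3 standard deviation criterion.
z_max = (maximum - mean) / std_dev

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print(f"maximum beyond upper fence: {maximum > upper_fence}")
print(f"z-score of maximum: {z_max:.2f}")
```

The maximum lies far beyond the upper fence of 2378.5, and its z-score of about 27.67 is large even though the standard deviation was itself inflated by that same observation. The mean of 1921 also exceeds Q3 = 1105, which shows how strongly a single extreme value can pull the mean.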