

2.6 Practicing the Basics

2.85 Enrollment trends Examples 18 and 19 presented graphs showing the total student enrollment at a U.S. university between 2004 and 2012 and the data for STEM major enrollment during that same time period. The data are repeated here.

Enrollment at the U.S. University

Year Total Students STEM Majors

2004 29,404 2,003

2005 29,693 1,906

2006 30,009 1,871

2007 30,912 1,815

2008 31,288 1,856

2009 32,317 1,832

2010 32,941 1,825

2011 33,878 1,897

2012 33,405 1,854

a. Construct a graph for only STEM major enrollments over this period. Describe the trend in these enrollment counts.

b. Find the percentage of enrolled students each year who are STEM majors and construct a time plot of the percentages.

c. Summarize what the graphs constructed in parts a and b tell you that you could not learn from Figures 2.18 and 2.19 in Examples 18 and 19.
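The arithmetic behind part b can be sketched in a few lines of Python. This is only an illustration: the figures are copied from the enrollment table above, while the function name `stem_percentages` and the variable names are assumptions introduced here, not part of the text.

```python
# Enrollment figures copied from the table in Exercise 2.85.
years = [2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
total_students = [29404, 29693, 30009, 30912, 31288, 32317, 32941, 33878, 33405]
stem_majors = [2003, 1906, 1871, 1815, 1856, 1832, 1825, 1897, 1854]


def stem_percentages(totals, stem):
    """Return, for each year, the percentage of enrolled students who are STEM majors."""
    return [100 * s / t for s, t in zip(stem, totals)]


percentages = stem_percentages(total_students, stem_majors)

# Print one line per year; these values are what a time plot would display.
for year, pct in zip(years, percentages):
    print(f"{year}: {pct:.2f}%")
```

Plotting `percentages` against `years` with any plotting tool would give the time plot requested in part b.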

[Figure: Market Share of Scottish Supermarket Chain. Pie chart labels: Tesco (27.2%), ASDA (24%), Safeway (15.2%), Somerfields (6%), William Morrison (percentage not recoverable). Source: Copyright The Scotsman Publications, Ltd.]

2.86 Terrorism and war in Iraq In 2004, a college newspaper reported results of a survey of students taken on campus. One question asked was, “Do you think going to war with Iraq has made Americans safer from terrorism, or not?” The figure shows the way the magazine reported results.

[Figure: bar chart of the responses, with the vertical axis running from 30% to 65%. Yes, safer: 35%; no, not: 63%.]

a. Explain what’s wrong with the way this bar chart was constructed.

b. Explain why you would not see this error made with a pie chart.

2.87 BBC license fee Explain what is wrong with the time plot shown of the annual license fee paid by British subjects for watching BBC programs. (Source: Evening Standard, November 29, 2006.)

2.88 Federal government spending Explain what is wrong with the following pie chart, which depicts the federal government breakdown by category for 2010.

[Figure: Federal Spending for 2010. Pie chart slices: Social Security (20%), Defense (19%), Other (18%), Unemployment and welfare (16%), Medicare (13%), Medicaid (8%), Interest on national debt (5%), Education (1%).]

Source: www.gpoaccess.gov/usbudget/fy10.

2.89 Bad graph Search some publications and find an example of a graph that violates at least one of the principles for constructing good graphs. Summarize what’s wrong with the graph and explain how it could be improved.

Answers to the Chapter Figure Questions

Figure 2.1 It is not always clear exactly what portion or percentage of a circle the slice represents, especially when a pie has many slices. Also, some slices may be representing close to the same percentage of the circle; thus, it is not clear which is larger or smaller.

Figure 2.2 Ordering the bars from largest to smallest immediately shows the (few) categories that are the most frequent. It also shows how rapidly the heights of the bars decrease.

Figure 2.3 The number of dots above a value on the number line represents the frequency of occurrence of that value.

Figure 2.4 There’s no leaf if there’s no observation having that stem.

Figure 2.7 The direction of the longer tail indicates the direction of the skew.

Figure 2.8 The annual average temperatures are tending to increase over time.

Figure 2.9 The mean is pulled in the direction of the longer tail since extreme observations affect the balance point of the distribution. The median is not affected by the size of the observation.

Figure 2.10 The range would change from $10,000 to $65,000.

Figure 2.11 A deviation is positive when the observation falls above the mean and negative when the observation falls below the mean.

Figure 2.12 Since about 95% of the data fall within 2 standard deviations of the mean, about 100% - 95% = 5% fall more than 2 standard deviations from the mean.

Chapter Review


108 Chapter 2 Exploring Data with Graphs and Numerical Summaries

Figure 2.13 The shape is approximately bell shaped. The center appears to be around 66 inches. The heights range from 56 to 77 inches.

Figure 2.14 The second quartile has two quarters (25% and 25%) of the data below it. Therefore, Q2 is the median since 50% of the data falls below the median.

Figure 2.15 The left whisker is drawn only to 50 because the next lowest sodium value, which is 0, is identified as a potential outlier.

Figure 2.16 The quartiles are about Q1 = 69, Q2 = 71, and Q3 = 73 for males and Q1 = 64, Q2 = 65, and Q3 = 67 for females.

Figure 2.17 Yes. The left whisker extends to the minimum (about 3.5), the left side of the box is at Q1 (about 5.5), the line inside the box is at the median (about 7), the right side of the box is at Q3 (about 9), and the outlier identified with a star is at the maximum (about 21).

Figure 2.18 Neither the height nor the area of the human figures accurately represents the frequencies reported.

Figure 2.19 The total enrollment increases until 2011. The STEM enrollment seems to be constant, but it would be better to plot STEM enrollments separately on a smaller scale.

Chapter Summary

This chapter introduced descriptive statistics—graphical and numerical ways of describing data. The characteristics of interest that we measure are called variables, because values on them vary from subject to subject.

• A categorical variable has observations that fall into one of a set of categories.

• A quantitative variable takes numerical values that represent different magnitudes of the variable.

• A quantitative variable is discrete if it has separate possible values, such as the integers 0, 1, 2, . . . for a variable expressed as “the number of . . .” It is continuous if its possible values form an interval.

When we explore a variable, we are interested in its distribution (i.e., how the observations are distributed across the range of possible values). Distributions are described through graphs and tables. For categorical variables, we construct a frequency table showing the (relative) frequencies of observations in each category and mention the categories with the highest frequencies. For quantitative variables, frequency tables are obtained by partitioning the range into intervals. Key features to describe for quantitative variables are the shape (number of distinct mounds, symmetry or skew, outliers), center, and variability. No description of a variable is complete without showing a graphical representation of its distribution through an appropriate graph.
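As a rough illustration of a frequency table for a categorical variable, the sketch below counts categories and converts the counts to proportions. The data and the helper name `frequency_table` are hypothetical and chosen only for this example.

```python
from collections import Counter


def frequency_table(observations):
    """Return (category, count, proportion) rows, most frequent category first."""
    counts = Counter(observations)
    n = len(observations)
    return [(category, count, count / n) for category, count in counts.most_common()]


# Hypothetical categorical data (not taken from the text).
majors = ["history", "chemistry", "history", "English", "history", "chemistry"]

for category, count, proportion in frequency_table(majors):
    print(f"{category:<10} {count:>3} {proportion:.1%}")
```

Listing the rows from most to least frequent puts the modal category first, which is also the ordering a Pareto chart uses.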

Overview of Graphical Methods

• For categorical variables, data are displayed using pie charts and bar graphs. Bar graphs provide more flexibility and make it easier to compare categories having similar percentages.

• For quantitative variables, a histogram is a graph of a frequency table. It displays bars that specify frequencies or relative frequencies (percentages) for possible values or intervals of possible values. The stem-and-leaf plot (a vertical line dividing the final digit, the leaf, from the stem) and dot plot (dots above the number line) show the individual observations. They are useful for small data sets. These three graphs all show shape, such as whether the distribution is approximately bell shaped, skewed to the right (longer tail pointing to the right), or skewed to the left.

• The box plot has a box drawn between the first quartile and third quartile, with a line drawn in the box at the median. It has whiskers that extend to the minimum and maximum values, except for potential outliers. An outlier is an extreme value falling far below or above the bulk of the data.

• A time plot graphically displays observations for a variable measured over time. This plot can visually show trends over time.

Overview of Measures of the Center and of Position

Measures of the center attempt to describe a typical or representative observation.

• The mean is the sum of the observations divided by the number of observations. It is the balance point of the data.

• The median divides the ordered data set into two parts of equal numbers of observations, half below and half above that point. The median is the 50th percentile (second quartile). It is a more representative summary than the mean when the data are highly skewed and is resistant against outliers.

• The lower quarter of the observations fall below the first quartile (Q1), and the upper quarter fall above the third quartile (Q3). These are the 25th percentile and 75th percentile. These quartiles and the median split the data into four equal parts.
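The measures of center and position listed above can be computed as in the following sketch. It assumes the median-of-halves convention for quartiles (Q1 is the median of the lower half, Q3 the median of the upper half), which is only one of several conventions in use; software packages may return slightly different quartiles. The data and the function name `quartiles` are hypothetical.

```python
from statistics import mean, median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations below the median position
    upper = ordered[half + n % 2:]   # observations above the median position
    return median(lower), median(ordered), median(upper)


# Hypothetical right-skewed data (not taken from the text).
values = [1, 2, 2, 3, 4, 5, 20]

print("mean:", mean(values))
print("median:", median(values))
print("quartiles:", quartiles(values))
```

In this small data set the single large value pulls the mean above the median, while the median and quartiles are unchanged by how large that value is.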

Overview of Measures of Variability

Measures of variability describe the variability of the observations.

• The range is the difference between the largest and smallest observations.

• The deviation of an observation x from the mean is $x - \bar{x}$. The variance is an average of the squared deviations. Its square root, the standard deviation, is more useful, describing a typical distance from the mean.

• The empirical rule states that for a bell-shaped distribution:

About 68% of the data fall within 1 standard deviation of the mean, $\bar{x} \pm s$.

About 95% of the data fall within 2 standard deviations, $\bar{x} \pm 2s$.

Nearly all the data fall within 3 standard deviations, $\bar{x} \pm 3s$.

• The interquartile range (IQR) is the difference between the third and first quartiles, which span the middle 50% of the data in a distribution. It is more resistant than the range and standard deviation, being unaffected by extreme observations.

• An observation is a potential outlier if it falls (a) more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3, or (b) more than 3 standard deviations from the mean. The z-score is the number of standard deviations that an observation falls from the mean.
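The two outlier criteria above can be compared in a short sketch. It assumes the median-of-halves convention for Q1 and Q3 and uses hypothetical data; the function names are introduced here for illustration only.

```python
from statistics import mean, median, stdev


def iqr_outliers(data):
    """Flag observations more than 1.5 * IQR below Q1 or above Q3."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in ordered if x < low or x > high]


def three_sd_outliers(data):
    """Flag observations more than 3 sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)  # stdev divides by n - 1
    return [x for x in data if abs(x - m) / s > 3]


# Hypothetical data with one extreme value (not taken from the text).
values = [1, 2, 2, 3, 4, 5, 20]

print("1.5 * IQR criterion:", iqr_outliers(values))
print("3 standard deviation criterion:", three_sd_outliers(values))
```

For this data set the 1.5 * IQR criterion flags the value 20 but the 3 standard deviation criterion flags nothing, because the extreme value itself inflates the standard deviation. This is one way to see why the IQR is described as the more resistant measure.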


Summary of Notation

Mean: $\bar{x} = \dfrac{\sum x}{n}$

Standard deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

$\text{z-score} = \dfrac{\text{observed value} - \text{mean}}{\text{standard deviation}}$

where x denotes the variable, n is the sample size, and Σ indicates to sum.
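The notation for the mean, standard deviation, and z-score translates almost directly into code. The sketch below writes the formulas out by hand instead of calling a library, so each term is visible; the function names and the data are hypothetical.

```python
import math


def sample_mean(xs):
    """Mean: sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Standard deviation: square root of the sum of squared deviations over n - 1."""
    xbar = sample_mean(xs)
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))


def z_score(value, xs):
    """z-score: (observed value - mean) / standard deviation."""
    return (value - sample_mean(xs)) / sample_sd(xs)


# Hypothetical heights in inches (not taken from the text).
heights = [62, 64, 65, 66, 68]

print("mean:", sample_mean(heights))
print("standard deviation:", sample_sd(heights))
print("z-score of 68:", z_score(68, heights))
```

Here the deviations from the mean of 65 are -3, -1, 0, 1, and 3, so the sum of squared deviations is 20 and the standard deviation is the square root of 20 / 4 = 5.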

Chapter Problems

Practicing the Basics

2.90 Categorical or quantitative? Identify each of the following variables as categorical or quantitative.

a. Number of children in family

b. Amount of time in football game before first points scored

c. College major (English, history, chemistry, . . .)

d. Type of music (rock, jazz, classical, folk, other)

2.91 Continuous or discrete? Which of the following variables are continuous, when the measurements are as precise as possible?

a. Age of mother

b. Number of children in a family

c. Cooking time for preparing dinner

d. Latitude and longitude of a city

e. Population size of a city

2.92 Young non-citizens in the U.S. The table shows the number of 18- to 24-year-old noncitizens living in the United States between 2010 and 2012.

Noncitizens aged 18 to 24 in the United States

Region of Birth Number (in Thousands)

Africa 115

Asia 590

Europe 148

Latin America & Caribbean 1,666

Other 49

Total 2,568

Source: www.census.gov

a. Is Region of Birth quantitative or categorical? Show how to summarize results by adding a column of percentages to the table.

b. Which of the following is a sensible numerical summary for these data: Mode (or modal category), mean, median? Explain, and report whichever is/are sensible.

c. How would you order the Region of Birth categories for a Pareto chart? What’s its advantage over the ordinary bar graph?

2.93 Cool in China A recent survey⁷ asked 1200 university students in China to pick the personality trait that most defines a person as “cool.” The possible responses allowed, and the percentage making each, were individualistic and innovative (47%), stylish (13.5%), dynamic and capable (9.5%), easygoing and relaxed (7.5%), other (22.5%).

a. Identify the variable being measured.

b. Classify the variable as categorical or quantitative.

c. Which of the following methods could you use to describe these data: (i) bar chart, (ii) dot plot, (iii) box plot, (iv) median, (v) mean, (vi) mode (or modal category), (vii) IQR, (viii) standard deviation?

2.94 Chad voting problems The 2000 U.S. presidential election had various problems in Florida. One was overvotes: people mistakenly voting for more than one presidential candidate. (There were multiple minor-party candidates.) There were 110,000 overvote ballots, with Al Gore marked on 84,197 and George W. Bush on 37,731. These ballots were disqualified. Was overvoting related to the design of the ballot? The figure shows MINITAB dot plots of the overvote percentages for 65 Florida counties organized by the type of voting machine, namely optical scanning, Votomatic (voters manually punch out chads), or Datavote (voter presses a lever that punches out the chad mechanically), and by the number of columns on the ballot (1 or 2).

a. The overvote was highest (11.6%) in Gadsden County. Identify the number of columns and the method of registering the vote for that county.

b. Of the six ballot type and method combinations, which two seemed to perform best in terms of having relatively low percentages of overvotes?

c. How might these data be summarized further by a bar graph with six bars?

⁷ Source: Public relations firm Hill & Knowlton, as reported by Time magazine.

Overvote Percentages by Number of Columns on Ballot and Method of Voting

Source: A. Agresti and B. Presnell, Statistical Science, vol. 17, pp. 1–5, 2002.



2.95 Number of siblings In a survey administered to members of an entertainment club, the results obtained for the question “How many siblings do you have?” were

No. of siblings 0 1 2 3 4 5 6 7 8+

Count 12 122 228 324 444 561 635 618 59

a. What is the sample size in this study?

b. Which is the most appropriate graph to summarize the data: dot plot, stem-and-leaf plot, or histogram? Why?

c. For these data, construct the appropriate graph and comment on the shape of the distribution.

2.96 Longevity The Animals data set on the book’s website has data on the average longevity (measured in years) for 21 animals.

a. Construct a stem-and-leaf plot of longevity.

b. Construct a histogram of longevity.

c. Summarize what you see in the histogram or stem-and-leaf plot. (Most animals live to be how old? Is the distribution skewed?)

2.97 Newspaper reading Exercise 2.24 gave results for the number of times a week a person reads a daily newspaper for a sample of 36 students at the University of Georgia. The frequency table is shown below.

a. Construct a dot plot of the data.

b. Construct a stem-and-leaf plot of the data. Identify the stems and the leaves.

c. The mean is 3.94. Find the median.

d. Is the distribution skewed to the left, skewed to the right, or symmetric? Explain.

No. Times Frequency

0 2

1 4

2 4

3 8

4 4

5 5

6 2

7 4

8 2

9 1
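As a check on the mean quoted in part c, a mean can be computed directly from a frequency table by weighting each value by its count. The sketch below does this for the table above; the variable names are assumptions made for illustration.

```python
# Frequency table from Exercise 2.97: times per week -> number of students.
frequency = {0: 2, 1: 4, 2: 4, 3: 8, 4: 4, 5: 5, 6: 2, 7: 4, 8: 2, 9: 1}

# Sample size is the sum of the frequencies.
n = sum(frequency.values())

# Weighted sum: each value multiplied by how often it occurs.
weighted_sum = sum(value * count for value, count in frequency.items())
mean_reads = weighted_sum / n

# Expanding the table into individual observations gives the ordered data
# that a dot plot or a median calculation would start from.
observations = [value for value, count in frequency.items() for _ in range(count)]

print("n =", n)
print("mean =", round(mean_reads, 2))
```

The frequencies sum to the 36 students mentioned in the exercise, and the weighted mean rounds to the 3.94 stated in part c.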

2.98 Match the histogram Match each lettered histogram with one of the following descriptions: Skewed to the left, bimodal, symmetric, skewed to the right.

[Figure: four histograms labeled a, b, c, and d, each with a horizontal axis running from 15 to 95.]

2.99 Sandwiches and protein Listed in the table below are the prices of six-inch Subway sandwiches at a particular franchise and the number of grams of protein contained in each sandwich.

Sandwich Cost ($) Protein (g)

BLT $2.99 17

Ham (Black Forest, without cheese) $2.99 18

Oven Roasted Chicken $3.49 23

Roast Beef $3.69 26

Subway Club® $3.89 26

Sweet Onion Chicken Teriyaki $3.89 26

Turkey Breast $3.49 18

Turkey Breast & Ham $3.49 19

Veggie Delite® $2.49 8

Cold Cut Combo $2.99 21

Tuna $3.10 21

a. Construct a stem-and-leaf plot of the protein amounts in the various sandwiches.

b. What is the advantage(s) of using the stem-and-leaf plot instead of a histogram?

c. Summarize your findings from these graphs.

2.100 Sandwiches and cost Refer to the previous exercise. Repeat parts a–c for the cost of the sandwiches. Summarize your findings.

2.101 What shape do you expect? For the following variables, indicate whether you would expect its histogram to be bell shaped, skewed to the right, or skewed to the left. Explain why.

a. Number of times arrested in past year

b. Time needed to complete difficult exam (maximum time is 1 hour)

c. Assessed value of house

d. Age at death

2.102 Sketch plots For each of the following, sketch roughly what you expect a histogram to look like and explain whether the mean or the median would be greater.

a. The selling price of new homes in 2015

b. The number of children ever born per woman age 40 or over

c. The score on an easy exam (mean = 88, standard deviation = 10, maximum = 100, minimum = 50)

d. Number of months in which subject drove a car last year


2.103 Median versus mean sales price of new homes The U.S. Bureau of the Census reported a median sales price of new houses sold in March 2014 of $290,000. Would you expect the mean sales price to have been higher or lower? Explain.

2.104 Household net worth A study reported that in 2007 the mean and median net worth of American families were $556,300 and $120,300, respectively.

a. Is the distribution of net worth for these families likely to be symmetric, skewed to the right, or skewed to the left? Explain.

b. During the Great Recession of 2008, many Americans lost wealth due to the large decline in values of assets such as homes and retirement savings. In 2009, mean and median net worth were reported as $434,782 and $91,304. Why do you think the difference in decline from 2007 to 2009 was larger for the mean than the median?

2.105 Golfers’ gains During the 2010 Professional Golfers Association (PGA) season, 90 golfers earned at least $1 million in tournament prize money. Of those, 5 earned at least $4 million, 11 earned between $3 million and $4 million, 21 earned between $2 million and $3 million, and 53 earned between $1 million and $2 million.

a. Would the data for all 90 golfers be symmetric, skewed to the left, or skewed to the right?

b. Two measures of central tendency of the golfers’ winnings were $2,090,012 and $1,646,853. Which do you think is the mean and which is the median?

2.106 Hiking In a guidebook about interesting hikes to take in national parks, each hike is classified as easy, medium, or hard and by the length of the hike (in miles). Which classification is quantitative and which is categorical?

2.107 Lengths of hikes Refer to the previous exercise.

a. Give an example of five hike lengths such that the mean and median are equal.

b. Give an example of five hike lengths such that the mode is 2, the median is 3, and the mean is larger than the median.

2.108 Central Park monthly temperatures The MINITAB graph below uses dot plots to compare the distributions of the Central Park temperatures from 1869–2010 for the months of January and July.

a. Describe the shape of each of the two distributions.

b. Estimate the balance point for each of the two distributions and compare.

c. Compare the variability of the two distributions. Estimate from the dot plots how the range and standard deviation for the January temperature distribution compare to the July temperature distribution. Are you surprised by your results for these two months?

2.109 What does s equal?

a. For an exam given to a class, the students’ scores ranged from 35 to 98, with a mean of 74. Which of the following is the most realistic value for the standard deviation: -10, 1, 12, 60? Clearly explain what is unrealistic about the other values.

b. The sample mean for a data set equals 80. Which of the following is an impossible value for the standard deviation: 200, 0, -20? Why?

2.110 Female heights According to a recent report from the U.S. National Center for Health Statistics, females between 25 and 34 years of age have a bell-shaped distribution for height, with mean of 65 inches and standard deviation of 3.5 inches.

a. Give an interval within which about 95% of the heights fall.

b. What is the height for a female who is 3 standard deviations below the mean? Would this be a rather unusual height? Why?

2.111 Energy and water consumption In parts a and b, what shape do you expect for the distributions of electricity use and water use in a recent month in Gainesville, Florida? Why? (Data supplied by N. T. Kamhoot, Gainesville Regional Utilities.)

a. Residential electricity used had mean = 780 and standard deviation = 506 kilowatt hours (kWh). The minimum usage was 3 kWh and the maximum was 9390 kWh.

b. Water consumption had mean = 7100 and standard deviation = 6200 (gallons).

2.112 Hurricane damage The histogram shows the distribution of the damage (in billion dollars) of the 30 most costly hurricanes hitting the U.S. mainland between 1900 and 2010. (Numbers are inflation adjusted and in 2010 dollars.) The data are available in the Hurricane file on the book’s website.

[Figure: Hurricane Damage. Histogram with horizontal axis "Damage (in billion US Dollar)" running from 0 to 110 and vertical axis "Count" running from 0 to 12.]