
Associations with Quantitative and Categorical Variables

In this chapter, we’ve learned how to explore an association between categorical variables and between quantitative variables. It’s also possible to mix the variable types or add other variables. For example, with two quantitative variables, we can identify points in the scatterplot according to their values on a relevant categorical variable. This is done by using different symbols or colors on the scatterplot to portray the different categories.

[Figure: two scatterplots of y against x, one with r² = 1 and one with r² = 0.]

Comparing fitted lines

The Gender Difference in Winning Olympic High Jumps

Picture the Scenario

The summer Olympic Games occur every four years, and one of the track and field events is the high jump. Men have competed in the high jump since 1896 and women since 1928. The High Jump data file on the book’s website contains the winning heights (in meters) for each year.8

Questions to Explore

a. How can we display the data on these two quantitative variables (winning height, year) and the categorical variable (gender) graphically?

b. How have the winning heights changed over time? How different are the winning heights for men and women in a typical year?

Think It Through

a. Figure 3.16 shows a scatterplot with x = year and y = winning height.

The data points are displayed with a red circle for men and a blue triangle for women. There were no Olympic Games during World War II, so no observations appear for 1940 or 1944.

Example 11

8 From www.olympic.org/medallists-results.

150 Chapter 3 Association: Contingency, Correlation, and Regression

[Figure 3.16: Winning Height for Olympic High Jump Event. x-axis: Year (1900 to 2020); y-axis: Winning Height (m), about 1.75 to 2.25; legend: Gender (Men, Women).]

Figure 3.16 Scatterplot for the Winning High Jumps (in Meters) in the Olympics.

The red dots represent men and the blue triangles represent women. Question In a typical year, what is the approximate difference between the winning heights for men and for women?

b. The scatterplot shows that for each gender the winning heights have an increasing trend over time. Men have consistently jumped higher than women, between about 0.3 and 0.4 meters in a given year. The women’s winning heights are similar to those for the men about 60 years earlier; for instance, about 2.0 meters in 1990–2012 for women and in 1930–1940 for men.

Insight

We could describe these trends by fitting a regression line to the points for men and a separate regression line to the points for women. Figure 3.16 indicates these lines. The slopes are nearly identical, indicating a similar rate of improvement over the years for both genders. However, note that in recent Olympics the winning heights have leveled off somewhat. We should be cautious in using regression lines to predict future winning heights.

Try Exercise 3.42

3.24 Sketch plots of lines Identify the values of the y-intercept a and the slope b, and sketch the following regression lines, for values of x between 0 and 10.

a. ŷ = 7 + 0.5x b. ŷ = 7 + x c. ŷ = 7 - x d. ŷ = 7

3.25 Sit-ups and the 40-yard dash Is there a relationship between how many sit-ups you can do and how fast you can run 40 yards? The EXCEL output shows the relationship between these variables for a study of female athletes to be discussed in Chapter 12.

3.3 Practicing the Basics

[Figure: Excel scatterplot of time to run 40-yard dash by number of sit-ups. x-axis: Sit-ups (10 to 40); y-axis: 40-yd dash (sec.), 5 to 7.]

Section 3.3 Predicting the Outcome of a Variable 151

b. Your friend says she spends 60 hours on the Internet and 10 hours on email in a week. Find her predicted email use based on the regression equation.

c. Find her residual. Interpret.

3.30 Government debt and population Data used in this exercise were published by www.bloomberg.com for the most government debt per person for 58 countries and their respective population sizes in 2014. When using population size (in millions) as the explanatory variable x, and government debt per person (in dollars) as the response variable y, the regression equation is: predicted government debt per person = 19560.405 - 13.495 × population.

a. Interpret the slope of the regression equation. Is the association positive or negative? Explain what this means.

b. Predict government debt per person at (i) the minimum population size x value of 4 million, (ii) the maximum population size x value of 1367.5 million.

c. For India, government debt per person = $946, and population = 1259.7 million. Find the predicted government debt per person and the residual for India. Interpret the value of this residual.
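As an illustration of the residual calculation asked for in part c, the following minimal sketch plugs the values quoted in the exercise into the fitted equation. The function and variable names are chosen here for illustration only and do not come from the book.

```python
# Fitted equation quoted in Exercise 3.30
# (x = population in millions, y = government debt per person in dollars).
INTERCEPT = 19560.405
SLOPE = -13.495


def predicted_debt(population_millions):
    """Predicted government debt per person (dollars) from the fitted line."""
    return INTERCEPT + SLOPE * population_millions


# Values for India as stated in part c.
india_population = 1259.7
india_observed = 946.0

india_predicted = predicted_debt(india_population)
# A residual is the observed value minus the predicted value.
india_residual = india_observed - india_predicted

print(f"predicted: {india_predicted:.2f}")
print(f"residual:  {india_residual:.2f}")
```

A negative residual means the observed value lies below the regression line, so the equation overpredicts for that observation.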

3.31 Diamond weight and price The weight (in carats) and the price (in millions of dollars) of the 9 most expensive diamonds in the world were collected from www.elitetraveler.com. Let the explanatory variable x = weight and the response variable y = price. The regression equation is ŷ = 109.618 + 0.043x.

a. Princie is a diamond whose weight is 34.65 carats. Use the regression equation to predict its price.

b. The selling price of Princie is $39.3 million. Calculate the residual associated with the diamond and comment on its value in the context of the problem.

c. The correlation coefficient is 0.053. Does it mean that a diamond’s weight is a reliable predictor of its price?

3.32 How much do seat belts help? In 2013, data was collected from the U.S. Department of Transportation and the Insurance Institute for Highway Safety. According to the collected data, the number of deaths per 100,000 individuals in the U.S. would decrease by 24.45 for every 1 percentage point gain in seat belt usage. Let ŷ = predicted number of deaths per 100,000 individuals in 2013 and x = seat belt use rate in a given state.

a. Report the slope b for the equation ŷ = a + bx.

b. If the y-intercept equals 32.42, then predict the number of deaths per 100,000 people in a state if (i) no one wears seat belts, (ii) 74% of people wear seat belts (the value for Montana), (iii) 100% of people wear seat belts.

3.33 Regression between cereal sodium and sugar The following figure shows the result of a regression analysis of the explanatory variable x = sugar and the response variable y = sodium for the breakfast cereal data set discussed in Chapter 2 (the Cereal data file on the book’s website).

a. What criterion is used in finding the line?

b. Can you draw a line that will result in a smaller sum of the squared residuals?

a. The regression equation is ŷ = 6.71 - 0.024x. Find the predicted time in the 40-yard dash for a subject who can do (i) 10 sit-ups, (ii) 40 sit-ups. Based on these times, explain how to sketch the regression line over this scatterplot.

b. Interpret the y-intercept and slope of the equation in part a, in the context of the number of sit-ups and time for the 40-yard dash.

c. Based on the slope in part a, is the correlation positive or negative? Explain.

3.26 Wage bill of Premier League Clubs Data on the Premier League clubs’ wage bills were obtained from www.tsmplug.com. For the response variable y = wage bill in millions of pounds in 2014 and the explanatory variable x = wage bill in millions of pounds in 2013, ŷ = -1.537 + 1.056x.

a. How much do you predict the value of a club’s wage bill to be in 2014 if in 2013 the club had a wage bill of (i) £100 million, (ii) £200 million?

b. Using the results in part a, explain how to interpret the slope.

c. Is the correlation between these variables positive or negative? Why?

d. A Premier League club had a wage bill of £100 million in 2013 and £105 million in 2014. Find the residual and interpret it.

3.27 Rating restaurants Zagat restaurant guides publish ratings of restaurants for many large cities around the world (see www.zagat.com). The review for each restaurant gives a verbal summary as well as a 0- to 30-point rating of the quality of food, décor, service, and the cost of a dinner with one drink and tip. For 31 French restaurants in Boston in 2014, the food quality ratings had a mean of 24.55 and standard deviation of 2.08 points. The cost of a dinner (in U.S. dollars) had a mean of $50.35 and standard deviation of $14.92. The equation that predicts the cost of a dinner using the rating for the quality of food is ŷ = -70 + 4.9x. The correlation between these two variables is 0.68. (Data available in the Zagat_Boston file.)

a. Predict the cost of a dinner in a restaurant that gets the (i) lowest observed food quality rating of 21, (ii) highest observed food quality rating of 28.

b. Interpret the slope in context.

c. Interpret the correlation.

d. Show how the slope can be obtained from the correlation and other information given.

3.28 Predicting cost of meal from rating Refer to the previous exercise. The correlation with the cost of a dinner is 0.68 for food quality rating, 0.69 for service rating, and 0.56 for décor rating. According to the definition of r² as a measure for the reduction in the prediction error, which of these three ratings can be used to make the most accurate predictions for the cost of a dinner: quality of food, service, or décor? Why?

3.29 Internet and email use According to data selected from GSS in 2014, the correlation between y = email hours per week and x = Internet hours per week is 0.33. The regression equation is predicted email hours = 3.54 + 0.25 Internet hours

a. Based on the correlation value, the slope had to be positive. Why?


a. Sketch a scatterplot.

b. From inspection of the scatterplot, state the correlation and the regression line. (Note: You should be able to figure them out without using software or formulas.)

c. Find the mean and standard deviation for each variable.

d. Using part c, find the regression line, using the formulas for the slope and the y-intercept. Interpret the y-intercept and the slope.

3.36 Midterm–final correlation For students who take Statistics 101 at Lake Wobegon College in Minnesota, both the midterm and final exams have mean = 75 and standard deviation = 10. The professor explores using the midterm exam score to predict the final exam score.

The regression equation relating y = final exam score to x = midterm exam score is ŷ = 30 + 0.60x.

a. Find the predicted final exam score for a student who has (i) midterm score = 100, (ii) midterm score = 50.

Note that in each case the predicted final exam score regresses toward the mean of 75. (This is a property of the regression equation that is the origin of its name, as Chapter 12 will explain.)

b. Show that the correlation equals 0.60 and interpret it. (Hint: Use the relation between the slope and correlation.)
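The regression-toward-the-mean property mentioned in part a, and the hint in part b, can be checked numerically. The sketch below uses only the values stated in the exercise (equation ŷ = 30 + 0.60x, both means 75, both standard deviations 10). The function and variable names are illustrative assumptions, not the book's.

```python
def predicted_final(midterm):
    """Predicted final exam score from the equation quoted in Exercise 3.36."""
    return 30 + 0.60 * midterm


MEAN_SCORE = 75  # mean of both the midterm and the final

for midterm in (100, 50, MEAN_SCORE):
    final = predicted_final(midterm)
    # Distance from the mean shrinks by the factor 0.60 when moving
    # from the midterm score to the predicted final score.
    print(f"midterm {midterm:>3}: predicted final {final:.1f} "
          f"(midterm is {midterm - MEAN_SCORE:+d} from the mean, "
          f"prediction is {final - MEAN_SCORE:+.1f})")

# Hint for part b: slope = r * (s_y / s_x), so r = slope * (s_x / s_y).
slope, s_x, s_y = 0.60, 10, 10
r = slope * s_x / s_y
print(f"correlation r = {r:.2f}")
```

Because the two standard deviations are equal, the slope and the correlation coincide here. In general they differ by the factor s_y / s_x.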

3.37 Predict final exam from midterm In an introductory statistics course, x = midterm exam score and y = final exam score. Both have mean = 80 and standard deviation = 10.

The correlation between the exam scores is 0.70.

a. Find the regression equation.

b. Find the predicted final exam score for a student with midterm exam score = 80 and another with midterm exam score = 90.

3.38 NL baseball Example 9 related y = team scoring (per game) and x = team batting average for American League teams. For National League teams in 2010, ŷ = -6.25 + 41.5x. (Data available on the book’s website in the NL team statistics file.)

a. The team batting averages fell between 0.242 and 0.272. Explain how to interpret the slope in context.

b. The standard deviations were 0.00782 for team batting average and 0.3604 for team scoring. The correlation between these variables was 0.900. Show how the correlation and slope of 41.5 relate in terms of these standard deviations.

c. Software reports r² = 0.81. Explain how to interpret this measure.

3.39 Study time and college GPA A graduate teaching assistant (Euijung Ryu) for Introduction to Statistics (STA 2023) at the University of Florida collected data from one of her classes in spring 2007 to investigate the relationship between using the explanatory variable x = study time per week (average number of hours) to predict the response variable y = college GPA. For the 21 females in her class, the correlation was 0.42. For the eight males in her class, the data were as shown in the following table.

a. Create a data file and use it to construct a scatterplot. Interpret.

b. Find and interpret the correlation.

c. Now let’s look at a histogram of the residuals. Explain what the two short bars on the far right of the histogram mean in the context of the problem. Which two brands of cereal do they represent? Can you find them on the scatterplot?

d. In general, how reliable would you say amount of sugar is as a predictor of the amount of sodium?

3.34 Expected time for weight loss In 2014, the statistical summary of a weight loss survey was created and pub- lished on www.statcrunch.com.

a. In this study, it seemed that the desired weight loss (in pounds) was a good predictor of the expected time (in weeks) to achieve the desired weight loss. Do you expect r² to be large or small? Why?

b. For this data, r = 0.607. Interpret r².

c. Show the algebraic relationship between the correlation of 0.607 and the slope of the regression equation b = 0.437, using the fact that the standard deviations are 20.005 for pounds and 14.393 for weeks. (Hint: Recall that b = r(s_y / s_x).)
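The relation in the hint, slope = r × (s_y / s_x), can be verified with a few lines of arithmetic. This is a minimal sketch using the values quoted in Exercise 3.34, with x = desired weight loss in pounds and y = expected time in weeks. The helper name slope_from_correlation is a hypothetical name introduced for illustration.

```python
def slope_from_correlation(r, s_x, s_y):
    """Slope of the regression line: b = r * (s_y / s_x)."""
    return r * s_y / s_x


r = 0.607          # correlation given in part b
s_pounds = 20.005  # standard deviation of x (desired weight loss, pounds)
s_weeks = 14.393   # standard deviation of y (expected time, weeks)

b = slope_from_correlation(r, s_pounds, s_weeks)
r_squared = r ** 2

print(f"b   = {b:.3f}")
print(f"r^2 = {r_squared:.3f}")
```

Rounded to three decimals, the computed slope agrees with the value b = 0.437 stated in the exercise.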

3.35 Advertising and sales Each month, the owner of Fay’s Tanning Salon records in a data file y = monthly total sales receipts and x = amount spent that month on advertising, both in thousands of dollars. For the first three months of operation, the observations are as shown in the table.

Advertising  Sales
0            4
1            6
2            8
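For the three observations in this table, the regression line can be obtained from the means, standard deviations, and correlation using b = r(s_y / s_x) and a = ȳ - b x̄. The following sketch is one way to carry out that calculation with Python's standard library. The variable names are assumptions made for illustration.

```python
from statistics import mean, stdev

# Observations from the table (both in thousands of dollars).
advertising = [0, 1, 2]
sales = [4, 6, 8]

n = len(advertising)
x_bar, y_bar = mean(advertising), mean(sales)
s_x, s_y = stdev(advertising), stdev(sales)  # sample standard deviations

# Correlation: sum of products of z-scores, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(advertising, sales)) / (n - 1)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # y-intercept

print(f"means: x = {x_bar}, y = {y_bar}")
print(f"standard deviations: s_x = {s_x}, s_y = {s_y}")
print(f"r = {r:.2f}, regression line: y-hat = {a:.1f} + {b:.1f}x")
```

The three points fall exactly on a straight line, which is why the correlation and the line can also be read off the scatterplot without formulas.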


a. Do you observe a linear relationship? Is the single regression line, which is ŷ = 1896 - 40.45x, the best way to fit the data? How would you suggest fitting the data?

c. Find and interpret the prediction equation by reporting the predicted GPA for a student who studies (i) 5 hours per week, (ii) 25 hours per week.

Student  Study Time  GPA
1        14          2.8
2        25          3.6
3        15          3.4
4        5           3.0
5        10          3.1
6        12          3.3
7        5           2.7
8        21          3.8
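Statistical software would normally be used for this exercise. As a rough sketch of what such software computes, the code below applies the least squares formulas to the eight observations in the table. The function least_squares is a hypothetical helper written for this illustration, not a routine from the book or from a particular package.

```python
# Study time (hours per week) and GPA for the eight males in the table.
study_hours = [14, 25, 15, 5, 10, 12, 5, 21]
gpa = [2.8, 3.6, 3.4, 3.0, 3.1, 3.3, 2.7, 3.8]


def least_squares(xs, ys):
    """Return (intercept, slope, correlation) for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    correlation = sxy / (sxx * syy) ** 0.5
    return intercept, slope, correlation


intercept, slope, r = least_squares(study_hours, gpa)
print(f"y-hat = {intercept:.2f} + {slope:.4f}x, r = {r:.2f}")

# Predicted GPA at the two study times named in part c.
for hours in (5, 25):
    print(f"{hours} hours per week: predicted GPA {intercept + slope * hours:.2f}")
```

Both 5 and 25 hours lie within the observed range of study times, so these predictions do not extrapolate beyond the data.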

3.40 Oil and GDP An article in the September 16, 2006, issue of The Economist showed a scatterplot for many nations relating the response variable y = annual oil consumption per person (in barrels) and the explanatory variable x = gross domestic product (GDP, per person, in thousands of dollars). The values shown on the plot were approximately as shown in the table.

a. Create a data file and use it to construct a scatterplot. Interpret.

b. Find and interpret the prediction equation.

c. Find and interpret the correlation.

d. Find and interpret the residual for Canada.

Nation    GDP  Oil Consumption
India     3    1
China     8    2
Brazil    9    4
Mexico    10   7
Russia    11   8
S. Korea  20   18
Italy     29   12
France    30   13
Britain   31   11
Germany   31   12
Japan     31   16
Canada    34   26
U.S.      41   26
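To illustrate how a residual such as the one for Canada is obtained from these data, the sketch below fits the least squares line to the approximate values in the table and lists each nation's residual (observed minus predicted). The data structure and variable names are assumptions made for this example.

```python
# Approximate values from the table: nation -> (GDP per person in thousands
# of dollars, annual oil consumption per person in barrels).
data = {
    "India": (3, 1), "China": (8, 2), "Brazil": (9, 4), "Mexico": (10, 7),
    "Russia": (11, 8), "S. Korea": (20, 18), "Italy": (29, 12),
    "France": (30, 13), "Britain": (31, 11), "Germany": (31, 12),
    "Japan": (31, 16), "Canada": (34, 26), "U.S.": (41, 26),
}

gdp = [x for x, _ in data.values()]
oil = [y for _, y in data.values()]
n = len(data)
x_bar, y_bar = sum(gdp) / n, sum(oil) / n

# Least squares slope and intercept, plus the correlation.
sxx = sum((x - x_bar) ** 2 for x in gdp)
syy = sum((y - y_bar) ** 2 for y in oil)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(gdp, oil))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
r = sxy / (sxx * syy) ** 0.5

# Residual = observed y minus predicted y, for every nation.
residuals = {nation: y - (intercept + slope * x)
             for nation, (x, y) in data.items()}

print(f"y-hat = {intercept:.2f} + {slope:.3f}x, r = {r:.2f}")
print(f"Canada residual: {residuals['Canada']:.2f} barrels per person")
```

A positive residual indicates that a nation's observed oil consumption is higher than the line predicts for its GDP.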

[Figure: Scatterplot of Bicycle Price vs Weight. x-axis: Weight (28 to 37); y-axis: Price (0 to 1200); legend: Suspension (FE, FU).]

b. Find separate regression equations for the two suspension types. Summarize your findings.

c. The correlation for all 12 data points is r = -0.32. If the correlations for the full and front-end suspension bikes are found separately, how do you believe the correlations will compare to r = -0.32? Find them and interpret.

d. You see a mountain bike advertised for $700 that weighs 28.5 lb. The type of suspension is not given. Would you predict that this bike has a full or a front-end suspension? Statistically justify your answer.

3.43 Fuel Consumption Most cars are fuel efficient when running at a steady speed of around 40 to 50 mph. A scatterplot relating fuel consumption (measured in mpg) and steady driving speed (measured in mph) for a mid-sized car is shown below. The data are available in the Fuel file on the book’s Web site. (Source: Berry, I. M. (2010). The Effects of Driving Style and Vehicle Performance on the Real-World Fuel Consumption of U.S. Light-Duty Vehicles. Master’s thesis, Massachusetts Institute of Technology, Cambridge, MA.)

a. The correlation equals 0.106. Comment on the use of the correlation coefficient as a measure for the association between fuel consumption and steady driving speed.

b. Comment on the use of the regression equation as a tool for predicting fuel consumption from the velocity of the car.

c. Over what subrange of steady driving speed might fitting a regression equation be appropriate? Why?

[Figure: Fuel Consumption vs. Speed. x-axis: Steady Driving Speed (mph), 5 to 85; y-axis: Fuel Consumption (mpg), 0 to 50.]

3.41 Mountain bikes revisited Is there a relationship between the weight and price of a mountain bike? This question was considered in Exercise 3.21. We will analyze the Mountain Bike data file on the book’s website. (The data also were shown in Exercise 3.21.)

a. Construct a scatterplot. Interpret.

b. Find the regression equation. Interpret the slope in context. Does the y-intercept have contextual meaning?

c. You decide to purchase a mountain bike that weighs 30 pounds. What is the predicted price for the bike?

3.42 Mountain bike and suspension type Refer to the previous exercise. The data file contains price, weight, and type of suspension system (FU = full, FE = front-end in the scatterplot shown).


3.4 Cautions in Analyzing Associations

This chapter has introduced ways to explore associations between variables. When using these methods, you need to be cautious about certain potential pitfalls.