Confounding - The Statistics Art and Science of Learning from Data

166 Chapter 3 Association: Contingency, Correlation, and Regression

Section 3.4 Cautions in Analyzing Associations 167

3.44 Extrapolating murder The SPSS figure shows the data and regression line for the 50 states in Table 3.6 relating x = percentage of single-parent families to y = annual murder rate (number of murders per 100,000 people in the population).

a. The lowest x value was for Utah and the highest was for Mississippi. Using the figure, approximate those x values.

b. Using the regression equation stated above the figure, find the predicted murder rate at x = 0. Why is it not sensible to make a prediction at x = 0 based on these data?

3.4 Practicing the Basics

b. Which do you think is a better prediction for the year 2016—the sample mean of the y values in this plot or the value obtained by plugging 2016 into the fitted regression equation?

c. Would you feel comfortable using the regression line shown to predict the winning long jump for men in the year 2100? Why or why not?

3.46 U.S. average annual temperatures Use the U.S. Temperatures data file on the book’s website. (Source: National Climatic Data Center).

a. Construct a scatterplot, fit a trend line, and interpret the slope.

b. Predict the annual mean U.S. temperature for the year (i) 2016 and (ii) 2500.

c. In which prediction in part b do you have more faith? Why?

3.47 Murder and education Example 13 found the regression line ŷ = -3.1 + 0.33x for all 51 observations on y = murder rate and x = percent with a college education.

a. Show that the predicted murder rates increase from 1.85 to 10.1 as percent with a college education increases from x = 15% to x = 40%, roughly the range of observed x values.

b. When the regression line is fitted only to the 50 states, ŷ = 8.0 - 0.14x. Show that the predicted murder rate decreases from 5.9 to 2.4 as percent with a college education increases from 15% to 40%.

c. D.C. has the highest value for x (38.3) and is an extreme outlier on y (41.8). Is it a regression outlier? Why?

d. What causes results to differ numerically according to whether D.C. is in the data set? Which line is more appropriate as a summary of the relationship? Why?
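As a check on the arithmetic in parts a and b, the two prediction equations can be evaluated at the ends of the observed range. The following is a minimal Python sketch, not part of the original exercise; the function name `predict` and the variable names are illustrative choices, and the coefficients are the ones quoted above.

```python
# Prediction equations quoted in Exercise 3.47
# (y = murder rate, x = percent with a college education).

def predict(intercept, slope, x):
    """Return the predicted value intercept + slope * x."""
    return intercept + slope * x

# (intercept, slope) pairs as stated in the exercise.
LINE_WITH_DC = (-3.1, 0.33)     # all 51 observations
LINE_WITHOUT_DC = (8.0, -0.14)  # 50 states only

predictions = {}
for label, (a, b) in [("with D.C.", LINE_WITH_DC),
                      ("without D.C.", LINE_WITHOUT_DC)]:
    # Evaluate each line at x = 15 and x = 40, rounded to two decimals.
    predictions[label] = [round(predict(a, b, x), 2) for x in (15, 40)]
    print(label, predictions[label])
```

The slope changes sign when the single D.C. observation is dropped, which is the point parts c and d ask you to explain.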

3.48 Murder and poverty For Table 3.6, the regression equation for the 50 states and D.C. relating y = murder rate and x = percent of people who live below the poverty level is ŷ = -4.1 + 0.81x. For D.C., x = 17.4 and y = 41.8.

a. When the observation for D.C. is removed from the data set, ŷ = 0.4 + 0.36x. Does D.C. have much influence on this regression analysis? Explain.

b. If you were to look at a scatterplot, based on the information given, do you think that the poverty value for D.C. would be relatively large or relatively small? Explain.

3.49 TV watching and the birth rate The figure shows recent data on x = the number of televisions per 100 people and y = the birth rate (number of births per 1000 people) for six African and Asian nations. The regression line, ŷ = 29.8 - 0.024x, applies to the data for these six countries. For illustration, another point is added at (81, 15.2), which is the observation for the United States. The regression line for all seven points is ŷ = 31.2 - 0.195x. The figure shows this line and the one without the U.S. observation.

Figure for Exercise 3.44: Single Parent Fitted Line Plot. Murder Rate (y) versus Single Parent (x). Fitted line: Murder = -8.25 + 0.56 Single Parent; R Sq Linear = 0.459.

3.45 Men’s Olympic long jumps The Olympic winning men’s long jump distances (in meters) from 1896 to 2012 and the fitted regression line for predicting them using x = year are displayed in the graph below (data on website).

a. Identify an observation that may influence the fit of the regression line. Why did you identify this observation?

Figure for Exercise 3.45: Men’s Winning Distance, Olympic Long Jump. Meters (y) versus Year (x). Fitted line: ŷ = -20.5 + 0.0145x; r² = 0.77.


3.51 Regression between cereal sodium and sugar Let x = sodium and y = sugar for the breakfast cereal data in the Cereal data file on the book’s website and in Table 2.3 in Chapter 2.

a. Construct a scatterplot. Do any points satisfy the two criteria for a point to be potentially influential on the regression? Explain.

b. Find the regression line and correlation, using all the data points, and then using all except a potentially influential observation. Summarize the influence of that point.

3.52 Gestational period and life expectancy Does the life expectancy of animals depend on the length of their gestational period? The data in the Animals file on the book’s website show observations on the average longevity (in years) and average gestational period (in days) for 21 animals. (Source: Wildlife Conservation Society)

a. Use the scatterplot below to describe the association. Are there any unusual observations?

b. Using software, verify the given numbers for the correlation and slope of the regression line shown on the plot.

c. Are there any outliers? If so, what makes the observations unusual? Would you expect any of them to be influential observations? Why or why not?

d. Find the regression line and the correlation without one of the observations identified in part c. Compare your results to those in part b. Was the observation influential? Explain.

a. Does the U.S. observation appear to be (i) an outlier on x, (ii) an outlier on y, or (iii) a regression outlier relative to the regression line for the other six observations?

b. State the two conditions under which a single point can have a dramatic effect on the slope and show that they apply here.

c. This one point also drastically affects the correlation, which is r = -0.051 without the United States but r = -0.935 with the United States. Explain why you would conclude that the association between birth rate and number of televisions is (i) very weak without the U.S. point and (ii) very strong with the U.S. point.

d. Explain why the U.S. residual for the line fitted using that point is very small. This shows that a point can be influential even if its residual is not large.
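The contrast behind part d can be made concrete by computing the U.S. residual (observed y minus predicted y) under each of the two prediction equations. This is a small Python sketch under the values stated in Exercise 3.49; the helper `residual` and the variable names are illustrative, not from the text.

```python
# U.S. observation and the two prediction equations from Exercise 3.49.
US_TVS, US_BIRTH_RATE = 81, 15.2

def residual(y_observed, intercept, slope, x):
    """Residual = observed y minus the value predicted by the line."""
    return y_observed - (intercept + slope * x)

# Line fitted to the six other nations only.
res_without_us = residual(US_BIRTH_RATE, 29.8, -0.024, US_TVS)
# Line fitted to all seven points, including the United States.
res_with_us = residual(US_BIRTH_RATE, 31.2, -0.195, US_TVS)

print("residual, line without U.S.:", round(res_without_us, 3))
print("residual, line with U.S.:   ", round(res_with_us, 3))
```

The residual is large relative to the line that ignores the U.S. point and close to zero relative to the line that includes it, because the fitted line is pulled toward a point that is far out on x.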

Figure for Exercise 3.49: Regression Equations for Birth Rate and Number of TVs per 100 People. Birth Rate (y) versus Number of TVs (x). Prediction equation without United States: ŷ = 29.8 - 0.024x. Prediction equation with United States: ŷ = 31.2 - 0.195x.

3.50 Looking for outliers Using software, analyze the relationship between x = college education and y = percentage single-parent families, for the data in Table 3.6, which are in the U.S. Statewide Crime data file on the book’s website.

a. Construct a scatterplot. Based on your plot, identify two observations that seem quite different from the others, having a y value relatively large in one case and somewhat small in the other case.

b. Find the regression equation (i) for the entire data set, (ii) deleting only the first of the two outlying observations, and (iii) deleting only the second of the two outlying observations.

c. Is either deleted observation influential on the slope? Summarize the influence.

d. Including D.C., ŷ = 21.2 + 0.089x, whereas deleting that observation, ŷ = 28.1 - 0.206x. Find ŷ for D.C. in the two cases. Does the predicted value for D.C. depend much on which regression equation is used?

Figure for Exercise 3.52: Scatterplot of Average Life Expectancy vs. Gestational Period for 21 Animals. Longevity (years) versus Gestational Period (days), with the points for Cow, Hippopotamus, and Elephant labeled. Fitted line: ŷ = 6.29 + 0.045x; r² = 0.73.

3.53 GPA and hours spent watching TV A study conducted among sophomores at Fairfield University showed a correlation of -0.69 between the number of hours spent watching TV and GPA. It was found that for each additional hour per week spent watching TV, you could expect on average to see a drop of 0.0452, on a 4.0 scale, in your GPA.

a. Is there a causal relationship, whereby watching more TV decreases your GPA?

b. Explain how a student’s intelligence, measured by her/his IQ score, could be a lurking variable that might be responsible for this association, having a common correlation with both GPA and TV-watching hours.


b. Construct a scatterplot between y = crime rate and x = education, labeling each point as rural or urban.

c. Find the overall correlation between crime rate and education for all eight data points. Interpret.

d. Find the correlation between crime rate and education for the (i) urban counties alone and (ii) rural counties alone. Why are these correlations so different from the correlation in part c?

c. Sketch a hypothetical scatterplot (as we did in Figure 3.22 for the example on crime and education), labeling points by three IQ ranges, such that overall there is a negative trend, but the slope would be about 0 when we consider only students in a given IQ range. How would a student’s intelligence play a role in the association between GPA and TV-watching hours?

3.54 Hospital size and length of stay A study shows that there is a positive correlation between x = size of a hospital (measured by its number of beds) and y = median number of days that patients remain in that hospital.

a. Does this mean that you can shorten a hospital stay by choosing a small hospital?

b. Identify a third variable that could be a common cause of x and y. Construct a hypothetical scatterplot (like Figure 3.22 for crime and education), identifying points according to their value on the third variable, to illustrate your argument.

3.55 Does ice cream prevent flu? Statistical studies show that a negative correlation exists between the number of flu cases reported each week throughout the year and the amount of ice cream sold in that particular week. Based on these findings, should physicians prescribe ice cream to patients who have colds and flu or could this conclusion be based on erroneous data and statistically unjustified?

a. Discuss at least one lurking variable that could affect these results.

b. Explain how multiple causes could affect whether an individual catches flu.

3.56 What’s wrong with regression? Explain what’s wrong with the way regression is used in each of the following examples:

a. Winning times in the Boston marathon (at www.bostonmarathon.org) have followed a straight-line decreasing trend from 160 minutes in 1927 (when the race was first run at the Olympic distance of about 26 miles) to 128 minutes in 2014. After fitting a regression line to the winning times, you use the equation to predict that the winning time in the year 2300 will be about 13 minutes.

b. Using data for several cities on x = % of residents with a college education and y = median price of home, you get a strong positive correlation. You conclude that having a college education causes you to be more likely to buy an expensive house.

c. A regression between x = number of years of education and y = annual income for 100 people shows a modest positive trend, except for one person who dropped out after 10th grade but is now a multimillionaire. It’s wrong to ignore any of the data, so we should report all results including this point. For this data, the correlation r = -0.28.

3.57 Education causes crime? The table shows a small data set that has a pattern somewhat like that in Figure 3.22 in Example 14. As in that example, education is measured as the percentage of adult residents who have at least a high school degree. Using software,

a. Construct a data file with columns for education, crime rate, and a rural/urban label.

Urban Counties              Rural Counties
Education   Crime Rate      Education   Crime Rate
   70          140             55           50
   75          120             58           40
   80          110             60           30
   85          105             65           25
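For readers without statistical software at hand, the correlations asked for in parts c and d of Exercise 3.57 can be computed from the definition of r. The sketch below is one possible approach in Python, using only the eight data points in the table; the function `pearson_r` and the list names are illustrative and not taken from the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Data from the table in Exercise 3.57.
urban_education = [70, 75, 80, 85]
urban_crime = [140, 120, 110, 105]
rural_education = [55, 58, 60, 65]
rural_crime = [50, 40, 30, 25]

# Part c: all eight counties pooled together.
r_all = pearson_r(urban_education + rural_education,
                  urban_crime + rural_crime)
# Part d: each type of county on its own.
r_urban = pearson_r(urban_education, urban_crime)
r_rural = pearson_r(rural_education, rural_crime)

print("all counties:", round(r_all, 3))
print("urban only:  ", round(r_urban, 3))
print("rural only:  ", round(r_rural, 3))
```

The pooled correlation is positive while both within-group correlations are negative, so the direction of the association depends on whether the rural/urban label is taken into account.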

3.58 Death penalty and race The table shows results of whether the death penalty was imposed in murder trials in Florida between 1976 and 1987. For instance, the death penalty was given in 53 out of 467 cases in which a white defendant had a white victim.

Death Penalty, by Defendant’s Race and Victim’s Race

                                      Death Penalty
Victim’s Race   Defendant’s Race    Yes     No    Total
White           White                53    414      467
                Black                11     37       48
Black           White                 0     16       16
                Black                 4    139      143

Source: Originally published in Florida Law Review. Michael Radelet and Glenn L. Pierce, Choosing Those Who Will Die: Race and the Death Penalty in Florida, vol. 43, Florida Law Review 1 (1991).

a. First, consider only those cases in which the victim was white. Find the conditional proportions that got the death penalty when the defendant was white and when the defendant was black. Describe the association.

b. Repeat part a for cases in which the victim was black. Interpret.

c. Now add these two tables together to get a summary contingency table that describes the association between the death penalty verdict and defendant’s race, ignoring the information about the victim’s race. Find the conditional proportions. Describe the association and compare to parts a and b.

d. Explain how these data satisfy Simpson’s paradox. How would you explain what is responsible for this result to someone who has not taken a statistics course?

e. In studying the effect of defendant’s race on the death penalty verdict, would you call victim’s race a confounding variable? What does this mean?
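The conditional proportions in parts a through c of Exercise 3.58 can be tabulated directly from the counts in the table. The following Python sketch is one way to organize that calculation; the dictionary layout and the names `counts`, `by_victim`, and `combined` are illustrative assumptions, and the counts are those given in the table.

```python
# counts[(victim_race, defendant_race)] = (death penalty "yes", total cases)
counts = {
    ("white", "white"): (53, 467),
    ("white", "black"): (11, 48),
    ("black", "white"): (0, 16),
    ("black", "black"): (4, 143),
}

# Parts a and b: proportion receiving the death penalty within each
# combination of victim's race and defendant's race.
by_victim = {key: yes / total for key, (yes, total) in counts.items()}

# Part c: collapse over victim's race to compare defendants only.
combined = {}
for defendant in ("white", "black"):
    yes = sum(counts[(victim, defendant)][0] for victim in ("white", "black"))
    total = sum(counts[(victim, defendant)][1] for victim in ("white", "black"))
    combined[defendant] = yes / total

for (victim, defendant), p in by_victim.items():
    print(f"victim {victim}, defendant {defendant}: {p:.3f}")
for defendant, p in combined.items():
    print(f"ignoring victim's race, defendant {defendant}: {p:.3f}")
```

Within each victim's race the proportion is higher for black defendants, yet after collapsing over victim's race it is higher for white defendants. That reversal is what part d asks you to explain.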

3.59 NAEP scores In 2015, eighth-grade math scores on the National Assessment of Educational Progress had a mean of 283.56 in Maryland compared to a mean of 284.37 in Connecticut (Source: http://nces.ed.gov/nationsreportcard/naepdata/dataset.aspx).

a. Identify the response variable and the explanatory variable.

cancer. The study concluded that diabetes is associated with advanced stages of breast cancer in patients and this could be a reason behind higher mortality rates. The researchers suggested looking at the possibility of race/ethnicity being a possible confounder.

a. Explain what the last sentence means and how race/ethnicity could potentially explain the association between diabetes and breast cancer.

b. If race/ethnicity was not measured in the study and the researchers failed to consider its effects, could it be a confounding variable or a lurking variable? Explain the difference between a lurking variable and a confounding variable.

b. The means in Maryland were respectively 274, 284, 285, 291 and 294 for people who reported the number of pages read in school and for homework, respectively as 0–5, 6–10, 11–15, 15–20 and 20 or more. These means were 270, 281, 284, 289 and 293 in Connecticut. Identify the third variable given here. Explain how it is possible for Maryland to have the higher mean for each class, yet for Connecticut to have the higher mean when the data are combined. (This is a case of Simpson’s paradox for a quantitative response.)

3.60 Diabetes and breast cancer In 2015, an article published in the journal Breast Cancer Research and Treatment examined the impact of diabetes on the stages of breast
