
Simpson’s Paradox

Section 3.4 Cautions in Analyzing Associations 163

In interpreting the positive correlation between crime rate and education for Florida counties, we’d be remiss if we failed to recognize that the correlation could be due to a lurking variable. This could happen if we observed those two variables but not urbanization, which would then be a lurking variable. Likewise, if we got excited by the positive correlation between ice cream sales and the number drowned in a month, we’d fail to recognize that the monthly mean temperature is a lurking variable. The difficulty in practice is that often we have no clue what the lurking variables may be.

164 Chapter 3 Association: Contingency, Correlation, and Regression

survival status after 20 years. We find that 139/582, which is 24%, of the smokers died, and 230/732, or 31%, of the nonsmokers died. There was a greater survival rate for the smokers.

Could the age of the woman at the beginning of the study explain the association? Presumably it could if the smokers tended to be younger than the nonsmokers. For this reason, Table 3.8 shows four contingency tables relating smoking status and survival status for these 1,314 women separated into four age groups.

Table 3.8 Smoking Status and 20-Year Survival for Four Age Groups

                                  Age Group

             18–34           35–54           55–64            65+
           Survival?       Survival?       Survival?       Survival?
Smoker    Dead   Alive    Dead   Alive    Dead   Alive    Dead   Alive
Yes          5     174      41     198      51      64      42       7
No           6     213      19     180      40      81     165      28

Questions to Explore

a. Show that the counts in Table 3.8 are consistent with those in Table 3.7.

b. For each age group, find the (marginal) percentage of women who smoked.

c. For each age group, find the (marginal) percentage of women who died.

d. For each age group, compute the conditional percentage of dying for smokers and nonsmokers. To describe the association between smoking and survival status for each age group, find the difference in the percentages.

e. How can you explain the association in Table 3.7, whereby smoking seems to help women live a longer life? How can this association be so different from the one shown in the partial tables of Table 3.8?

Think It Through

a. If you add the counts in the four separate parts of Table 3.8, you’ll get the counts in Table 3.7. For instance, from Table 3.7, we see that 139 of the women who smoked died. From Table 3.8, we get this from 5 + 41 + 51 + 42 = 139, taking the number of smokers who died from each age group. Doing this for each cell, we see that the counts in the two tables are consistent with each other.

b. Of the 398 women in age group 1, 179 (= 45%) were smokers. In age group 2, 55% were smokers. In age group 3, 49%. But in age group 4, only 20% were smokers. So the oldest age group had the smallest percentage of smokers.

c. Of the 398 women in age group 1, 11 (= 3%) died. In age group 2, 14% died and in age group 3, 39%. As we would expect, the highest percentage of deaths over the 20-year period, 86%, occurred in age group 4.

d. Table 3.9 shows the conditional percentages who died, for smokers and nonsmokers in each age group. Figure 3.23 plots them. The difference (in percentage points) is shown in the last row. For each age group, a higher percentage of smokers than nonsmokers died.

e. Could age explain the association? The proportion of smokers is higher in the younger age groups (45% and 55% in age groups 1 and 2, compared with 20% in age group 4). Naturally, these younger women are less likely to die within 20 years (3% and 14% in age groups 1 and 2, compared with 86% in age group 4). Overall, when ignoring age, this leads to a smaller percentage of smokers dying compared to nonsmokers, as observed in Table 3.7. Although the overall association for this table indicates that smokers had higher survival rates than nonsmokers, it merely reflects the fact that younger women were more likely to be smokers and hence less likely to die during the study time frame compared to the nonsmokers, who were predominantly older.

When we looked at the data separately for each age group in Table 3.8, we saw the reverse: Smokers had lower survival rates than nonsmokers.

The analysis using Table 3.7 did not account for age, which strongly influences the association.
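The arithmetic in parts a through c can be checked with a short script. The following is a minimal Python sketch, not part of the textbook: the dictionary layout and the names `table_3_8`, `pooled`, and `marginals` are illustrative assumptions, and the only data used are the counts printed in Table 3.8.

```python
# Counts from Table 3.8: (dead, alive) for each smoking status within each age group.
# The layout and all names here are illustrative assumptions, not the textbook's.
table_3_8 = {
    "18-34": {"smoker": (5, 174), "nonsmoker": (6, 213)},
    "35-54": {"smoker": (41, 198), "nonsmoker": (19, 180)},
    "55-64": {"smoker": (51, 64), "nonsmoker": (40, 81)},
    "65+": {"smoker": (42, 7), "nonsmoker": (165, 28)},
}


def pooled(status):
    """Add the (dead, alive) counts for one smoking status over all age groups."""
    dead = sum(group[status][0] for group in table_3_8.values())
    alive = sum(group[status][1] for group in table_3_8.values())
    return dead, alive


# Part a: pooled counts should reproduce the totals quoted from Table 3.7
# (139 of 582 smokers died, 230 of 732 nonsmokers died).
for status in ("smoker", "nonsmoker"):
    dead, alive = pooled(status)
    print(status, "died:", dead, "of", dead + alive)

# Parts b and c: marginal percentages of smokers and of deaths per age group.
marginals = {}
for age, group in table_3_8.items():
    n_smokers = sum(group["smoker"])
    n_total = n_smokers + sum(group["nonsmoker"])
    n_dead = group["smoker"][0] + group["nonsmoker"][0]
    pct_smokers = round(100 * n_smokers / n_total)
    pct_dead = round(100 * n_dead / n_total)
    marginals[age] = (n_total, pct_smokers, pct_dead)
    print(age, "n =", n_total, "smokers:", pct_smokers, "%", "died:", pct_dead, "%")
```

Keeping the counts keyed by age group means the pooled table is derived from the partial tables instead of being typed in twice, so any mismatch with the quoted totals would show up immediately.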

Insight

Because of the reversal in the association after taking age into account, the researchers did not conclude that smoking is beneficial to your health.

This example illustrates the dramatic influence that a lurking variable can have, which would be unknown to researchers if they fail to include a certain variable in their study. An association can look quite different after adjusting for the effect of a third variable by grouping the data according to its values.
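One way to see where the reversal comes from is to write each overall death rate as a weighted average of the age-specific death rates, with weights equal to the share of that group's women in each age group. The decomposition below is a worked illustration using only the counts in Table 3.8; the weighted-average framing is added here and is not notation from the textbook.

```latex
\[
\underbrace{\frac{139}{582}}_{\text{smokers}}
= \frac{179}{582}\cdot\frac{5}{179}
+ \frac{239}{582}\cdot\frac{41}{239}
+ \frac{115}{582}\cdot\frac{51}{115}
+ \frac{49}{582}\cdot\frac{42}{49}
\approx 0.24
\]
\[
\underbrace{\frac{230}{732}}_{\text{nonsmokers}}
= \frac{219}{732}\cdot\frac{6}{219}
+ \frac{199}{732}\cdot\frac{19}{199}
+ \frac{121}{732}\cdot\frac{40}{121}
+ \frac{193}{732}\cdot\frac{165}{193}
\approx 0.31
\]
```

In each product, the first factor is the weight and the second is the age-specific death rate. The death rate in the 65+ group is about 0.86 for both smokers and nonsmokers, but it receives a weight of 49/582 (about 0.08) among smokers and 193/732 (about 0.26) among nonsmokers. That difference in weights is what pulls the nonsmokers' overall rate above the smokers', even though smokers have the higher rate within every age group.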

Try Exercise 3.58

[Figure: side-by-side bar graph titled "Percentages of Deaths for Smokers and Non-Smokers". Horizontal axis: Smoker (Yes, No) within four panels, Age: 18–34, Age: 35–54, Age: 55–64, Age: 65+. Vertical axis: Percent Dead, from 0 to 100.]

Figure 3.23 Bar Graph Comparing Percentage of Deaths for Smokers and Nonsmokers, by Age. This side-by-side bar graph shows the conditional percentages from Table 3.9.

Table 3.9 Conditional Percentages of Deaths for Smokers and Nonsmokers, by Age.

For instance, for smokers of age 18–34, from Table 3.8 the proportion who died was 5/(5 + 174) = 0.028, or 2.8%.

                        Age Group
Smoker        18–34    35–54    55–64     65+
Yes            2.8%    17.2%    44.3%    85.7%
No             2.7%     9.5%    33.1%    85.5%
Difference     0.1%     7.7%    11.2%     0.2%
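The entries of Table 3.9 can be reproduced from the counts in Table 3.8. The Python sketch below is illustrative only: the names `pct_dead` and `rows` are assumptions, and it assumes the Difference row was obtained by subtracting the already rounded percentages, which is what matches the printed values.

```python
# (dead, alive) counts from Table 3.8, listed in age-group order.
ages = ["18-34", "35-54", "55-64", "65+"]
smokers = [(5, 174), (41, 198), (51, 64), (42, 7)]
nonsmokers = [(6, 213), (19, 180), (40, 81), (165, 28)]


def pct_dead(dead, alive):
    """Conditional percentage who died, rounded to one decimal place."""
    return round(100 * dead / (dead + alive), 1)


rows = []
for age, s, n in zip(ages, smokers, nonsmokers):
    p_yes, p_no = pct_dead(*s), pct_dead(*n)
    # Assumption: the difference is taken between the rounded percentages.
    rows.append((age, p_yes, p_no, round(p_yes - p_no, 1)))

for age, p_yes, p_no, diff in rows:
    print(f"{age:>6}  smokers {p_yes:5.1f}%  nonsmokers {p_no:5.1f}%  difference {diff:4.1f}")

# Pooling over age groups reverses the direction of the association.
overall_yes = pct_dead(sum(d for d, _ in smokers), sum(a for _, a in smokers))
overall_no = pct_dead(sum(d for d, _ in nonsmokers), sum(a for _, a in nonsmokers))
print(f"pooled  smokers {overall_yes:5.1f}%  nonsmokers {overall_no:5.1f}%")
```

Within every age group the difference is positive (smokers died at a higher rate), while the pooled percentages come out lower for smokers, which is the reversal the example describes.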
