
Measuring the Strength of an Association Between Two Categorical Variables

For tables such as Table 3.2 that have two categories for the response and explanatory variables, there are several ways of describing the nature and strength of the association numerically.

Difference of Proportions We have seen from Table 3.2 and Figure 3.2 that 73% of conventionally grown food samples had pesticide residues present, compared to only 23% for organic food samples. The difference between the two percentages is 73% - 23% = 50%. The percentage of conventionally grown food samples with residues present is 50 percentage points higher than the one for organic food samples. This is a substantial difference and indicates a rather strong association between the two variables. When there is no association, as in hypothetical Table 3.3 and Figure 3.3, this difference would be 0. The measure comparing two proportions (or percentages) through their difference is known as the difference of proportions and will be discussed in Chapter 10, Section 10.1, and Chapter 11, Section 11.3.

Ratio of Proportions Another way to measure and describe the strength of the association in Table 3.2 is through the ratio of the two conditional proportions, given by 0.73/0.23 = 3.2. The proportion of food with pesticide residues present is 3.2 times larger for conventionally grown food than for organic food. Again, this indicates a rather strong association. What would that ratio be if the variables were not associated? In that case, the two proportions would be identical (or, in practice, nearly identical), as in Table 3.3. The ratio would then be close to or equal to 1, e.g., 0.4/0.4 = 1 for the hypothetical conditional proportions in Table 3.3. This ratio is often referred to as the risk ratio or the relative risk. It is a popular measure for describing the association between a drug and a control treatment in medical studies.
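The two measures just described can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original text; the variable names are illustrative, and the proportions 0.73 and 0.23 are the rounded values quoted above from Table 3.2.

```python
# Conditional proportions of samples with pesticide residues present,
# as quoted in the text (rounded values from Table 3.2).
p_conventional = 0.73
p_organic = 0.23

# Difference of proportions: expressed in percentage points when multiplied by 100.
difference = p_conventional - p_organic

# Ratio of proportions (also called the risk ratio or relative risk).
ratio = p_conventional / p_organic

print(f"Difference of proportions: {difference:.2f} "
      f"({difference * 100:.0f} percentage points)")
print(f"Ratio of proportions: {ratio:.1f}")

# With no association the two proportions coincide, as in hypothetical Table 3.3,
# so the difference is 0 and the ratio is 1.
p_no_association = 0.4
print(p_no_association - p_no_association, p_no_association / p_no_association)
```

Run as written, this prints a difference of 0.50 (50 percentage points) and a ratio of 3.2, matching the values in the text.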

Caution

Don’t say the percentage of food samples with residues present is 50% higher for conventionally grown food. Say it is 50 percentage points higher. When comparing two percentages through their difference, we express the units of that difference as percentage points.

In Practice Comparing Population and Sample Conditional Proportions

When sampling from a population, even if there is no association between the two variables in the population, you can’t expect the sample conditional proportions to be exactly the same because of ordinary random variation from sample to sample. Later in the text, we’ll present inferential methods to determine if observed sample differences between conditional proportions are large enough to indicate that the variables are associated in the population.
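A small simulation can make this point concrete. The sketch below is an illustration only, under stated assumptions: both groups share the same population proportion of 0.4 (borrowed from the hypothetical Table 3.3), the sample size of 100 per group and the seed are arbitrary choices, and the helper `sample_proportion` is a hypothetical name introduced here.

```python
import random


def sample_proportion(rng, true_p, n):
    """Proportion of successes among n independent draws with success probability true_p."""
    successes = sum(1 for _ in range(n) if rng.random() < true_p)
    return successes / n


rng = random.Random(2024)  # fixed seed (arbitrary) so the run is reproducible
TRUE_P = 0.4               # same population proportion in both groups: no association
N = 100                    # assumed sample size per group

differences = []
for _ in range(5):
    p_group_a = sample_proportion(rng, TRUE_P, N)
    p_group_b = sample_proportion(rng, TRUE_P, N)
    differences.append(p_group_a - p_group_b)
    print(f"group A: {p_group_a:.2f}   group B: {p_group_b:.2f}   "
          f"difference: {p_group_a - p_group_b:+.2f}")
```

Even though the population proportions are identical, the sample proportions will typically differ a little from one repetition to the next. That ordinary variation is what the inferential methods later in the text account for.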

Section 3.1 The Association Between Two Categorical Variables 125

3.1 Practicing the Basics

3.1 Which is the response/explanatory variable? For the following pairs of variables, which more naturally is the response variable and which is the explanatory variable?

a. Carat (= weight) and price of a diamond

b. Dosage (low/medium/high) and severity of adverse event (mild/moderate/strong/serious) of a drug

c. Top speed and construction type (wood or steel) of a roller coaster

d. Type of college (private/public) and graduation rate

3.2 Sales and advertising Each month, the owner of Fay’s Tanning Salon records in a data file the monthly total sales receipts and the amount spent that month on advertising.

a. Identify the two variables.

b. For each variable, indicate whether it is quantitative or categorical.

c. Identify the response variable and the explanatory variable.

3.3 Does higher income make you happy? Every General Social Survey (GSS) includes the question, “Taken all together, would you say that you are very happy, pretty happy, or not too happy?” The table below uses the 2010 survey to cross-tabulate happiness with family income, measured as the response to the question, “Compared with American families in general, would you say that your family income is below average, average, or above average?”

Happiness and Family Income, from General Social Survey

                              Happiness
Income           Not Too Happy   Pretty Happy   Very Happy   Total
Above average               21            213          126     360
Average                     96            506          248     850
Below average              143            347          114     604
Total                      260           1066          488    1814

a. Identify the response variable and the explanatory variable.

b. Construct the conditional proportions on happiness at each level of income. Interpret and summarize the association between these variables.

c. Overall, what proportion of people reported being very happy?

3.4 Diamonds The clarity and cut of a diamond are two of the four C’s of diamond grading. (The other two are color and carat.) For a sample of diamonds, the following table lists the clarity (rated as internally flawless, IF, very very slightly included, VVS, very slightly included, VS, slightly included, SI and included, I) for the two lowest ratings for cut, which are “good” and “fair.” The data for this exercise are in the Diamonds file on the book’s website.

Clarity by Cut

                Clarity
Cut     IF   VVS   VS   SI    I   Total
Good     2     4   16   55    3      80
Fair     1     3    8   30    2      44

a. Find the conditional proportions for the five categories of clarity, given cut.

b. Sketch (or create using software) a side-by-side (or stacked) bar graph that compares the two cuts on clarity. Summarize findings in a paragraph.

c. Based on these data, is there an association between the cuts and clarity? Explain.

3.5 Alcohol and college students The Harvard School of Public Health, in its College Alcohol Study Survey, surveyed college students in about 200 colleges in 1993, 1997, 1999, and 2001. The survey asked students questions about their drinking habits. Binge drinking was defined as five drinks in a row for males and four drinks in a row for females. The table shows results from the 2001 study, cross-tabulating subjects’ gender by whether they have participated in binge drinking.

Binge Drinking by Gender

                   Binge Drinking Status
Gender    Binge Drinker   Non-Binge Drinker    Total
Male              1,908               2,017    3,925
Female            2,854               4,125    6,979
Total             4,762               6,142   10,904

a. Identify the response variable and the explanatory variable.

b. Report the cell counts of subjects who were (i) male and a binge drinker, (ii) female and a binge drinker.

c. Can you compare the counts in part b to answer the question, “Is there a difference between male and female students who binge drink?” Explain.

d. Construct a contingency table that shows the conditional proportions of sampled students who do or do not binge drink, given gender. Interpret.

e. Based on part d, does it seem that there is an association between binge drinking and gender? Explain.

126 Chapter 3 Association: Contingency, Correlation, and Regression

3.6 Effectiveness of government in preventing terrorism In a survey conducted in March 2013 by the National Consortium for the Study of Terrorism and Responses to Terrorism, 1515 adults were asked about the effectiveness of the government in preventing terrorism and whether they believe that it could eventually prevent all major terrorist attacks. 37.06% of the 510 adults who consider the government to be very effective believed that it can eventually prevent all major attacks, while this proportion was 28.36% among those who consider the government somewhat, not too, or not at all effective in preventing terrorism. The other people surveyed considered that terrorists will always find a way.

a. Identify the response variable, the explanatory variable and their categories.

b. Construct a contingency table that shows the counts for the different combinations of categories.

c. Use a contingency table to display the percentages for the categories of the response variable, separately for each category of the explanatory variable.

d. Are the percentages reported in part c conditional? Explain.

e. Sketch a graph that compares the responses for each category of the explanatory variable.

f. Compute the difference and the ratio of proportions. Interpret.

g. Give an example of how the results would show that there is no evidence of association between these variables.

3.7 In person or over the phone According to data obtained from the General Social Survey (GSS) in 2014, 1644 out of 2532 respondents were female and interviewed in person, 551 were male and interviewed in person, 320 were female and interviewed over the phone and 17 were male and interviewed over the phone.

a. Explain how we could regard either variable (gender of respondent, interview type) as a response variable.

b. Display the data as a contingency table, labeling the variables and the categories.

c. Find the conditional proportions that treat interview type as the response variable and gender as the explanatory variable. Interpret.

d. Find the conditional proportions that treat gender as the response variable and interview type as the explanatory variable. Interpret.

e. Find the marginal proportion of respondents who (i) are female, (ii) were interviewed in person.

3.8 Surviving the Titanic Was the motto “Women and Children First” followed on the fateful journey of the Titanic? Refer to the following table on surviving the sinking of the Titanic.

                              Survived
Passenger                    Yes     No
Children & Female Adult      373    161
Male Adult                   338   1329

a. What’s the percentage of children and female adult passengers who survived? What’s the percentage of male adults who survived?

b. Compute the difference in the proportion of children and female adult passengers who survived and male adult passengers who survived. Interpret.

c. Compute the ratio of the proportion between children and female adult passengers and male adult passengers who survived. Interpret.

3.9 Gender gap in party ID In recent election years, political scientists have analyzed whether a gender gap exists in political beliefs and party identification. The table shows data collected from the 2010 General Social Survey on gender and party identification (ID).

Party ID by Gender

                    Party Identification
Gender    Democrat   Independent   Republican   Total
Male           111           155           89     355
Female         237           205           95     537
Total          348           360          184     892

a. Identify the response and explanatory variables.

b. What proportion of sampled individuals is (i) male and Republican, (ii) female and Republican?

c. What proportion of the overall sample is (i) male, (ii) Republican?

d. Are the proportions you computed in part c conditional or marginal proportions?

e. The two bar graphs, one for each gender, display the proportion of individuals identifying with each political party. What are these proportions called? Is there a difference between males and females in the proportions that identify with a particular party? Summarize whatever gender gap you observe.

[Figure: Party ID by Gender. Two bar graphs, one for females and one for males; horizontal axis: Party (Dem, Indep, Rep); vertical axis: Proportion, from 0.0 to 0.5.]

3.10 Use the GSS Go to the GSS website sda.berkeley.edu/GSS, click GSS, with No Weight Variables predefined (SDA 4.0), type SEX for the row variable and HAPPY for the column variable, put a check in the row box only for percentaging in the output options, and click Run the Table.

a. Report the contingency table of counts.

b. Report the conditional proportions to compare the genders on reported happiness.

c. Are females and males similar, or quite different, in their reported happiness? Compute and interpret the difference and ratio of the proportion of being not too happy between the two sexes.

Section 3.2 The Association Between Two Quantitative Variables 127

3.2 The Association Between Two Quantitative Variables

In practice, when we investigate the association between two variables, there are three types of cases:

• Both variables could be categorical, as food type and pesticide status are. In this case, as we have already seen, the data are displayed in a contingency table, and we can explore the association by comparing conditional proportions.

• One variable could be quantitative and one could be categorical, such as when analyzing height and gender or income and race. As we saw in Chapter 2, we can compare the categories (such as females and males) using summaries of center and variability for the quantitative variable (such as the mean and standard deviation of height) and graphics such as side-by-side box plots.

• Both variables could be quantitative. In this case, we analyze how the outcome on the response variable tends to change as the value of the explanatory variable changes. The rest of the chapter considers this case.

In exploring the relationship between two quantitative variables, we’ll use the principles introduced in Chapter 2 for exploring the data of a single variable. We first use graphics to look for an overall pattern. We follow up with numerical summaries and check also for unusual observations that deviate from the overall pattern and may affect results.

Recall

Figure 2.16 in Chapter 2 used side-by-side box plots to compare heights for females and males.

Example 4 Numerical and graphical summaries

Worldwide Internet and Facebook Use

Picture the Scenario

The number of worldwide Internet users and the number of users of social networking sites such as Facebook have grown significantly over the past decade. This growth, though, has not been distributed evenly throughout the world. Countries such as Australia, Sweden, and the Netherlands have achieved an Internet penetration of more than 85%, whereas only 13% of India’s population uses the Internet. The story with Facebook is similar. More than 50% of the populations of countries such as the United States and Australia use Facebook, compared to fewer than 6% of the populations of countries such as China, India, and Russia.

The Internet Use data file on the book’s website contains recent data for 32 countries on Internet penetration, Facebook penetration, broadband subscription percentage, and other variables related to Internet use. In this example, we’ll investigate the relationship between Internet penetration and Facebook penetration. Note that we will often say “use” instead of “penetration” in these two variable names. Table 3.4 displays the values of these two variables for each of the 32 countries.

Table 3.4 Internet and Facebook Penetration Rates For 32 Countries

Country Internet Penetration Facebook Penetration

Argentina 55.8% 48.8%

Australia 82.4% 51.5%

Belgium 82.0% 44.2%


Brazil 49.9% 29.5%

Canada 86.8% 51.9%

Chile 61.4% 55.5%

China 42.3% 0.1%

Colombia 49.0% 36.3%

Egypt 44.1% 15.1%

France 83.0% 39.0%

Germany 84.0% 30.9%

Hong Kong 72.8% 56.4%

India 12.6% 5.1%

Indonesia 15.4% 20.7%

Italy 58.0% 38.1%

Japan 79.1% 13.5%

Malaysia 65.8% 46.5%

Mexico 38.4% 31.8%

Netherlands 93.0% 45.1%

Peru 38.2% 31.2%

Philippines 36.2% 30.9%

Poland 65.0% 25.6%

Russia 53.3% 5.6%

Saudi Arabia 54.0% 20.7%

South Africa 41.0% 12.3%

Spain 72.0% 38.1%

Sweden 94.0% 52.0%

Thailand 26.5% 26.5%

Turkey 45.1% 43.4%

United Kingdom 87.0% 52.1%

United States 81.0% 52.9%

Venezuela 44.1% 32.6%

Source: Data from the World Bank (data.worldbank.org) and www.internetworldstats.com for the year 2012.

Question to Explore

Use numerical and graphical summaries to describe the shape, center, and variability of the distributions of Internet penetration and Facebook penetration.

Think It Through

Using many of the statistics from Chapter 2, we obtain the following numerical summaries to describe center and variability for each variable:


Variable        N   Mean   StDev   Minimum     Q1   Median     Q3   Maximum    IQR
Internet Use   32   59.2    22.4      12.6   43.6     56.9   81.3      94.0   37.7
Facebook Use   32   33.9    16.0       0.0   24.4     34.5   47.1      56.4   22.7
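Summaries like these can be recomputed from the values listed in Table 3.4. The following Python sketch uses only the standard library; the dictionary and function names are illustrative, and the choice of the "inclusive" quartile method is an assumption, since software packages use different quartile conventions. Small discrepancies from the printed summary (for example in the minimum for Facebook use) may reflect rounding in the published table.

```python
import statistics

# (Internet penetration %, Facebook penetration %), transcribed from Table 3.4.
penetration = {
    "Argentina": (55.8, 48.8), "Australia": (82.4, 51.5),
    "Belgium": (82.0, 44.2), "Brazil": (49.9, 29.5),
    "Canada": (86.8, 51.9), "Chile": (61.4, 55.5),
    "China": (42.3, 0.1), "Colombia": (49.0, 36.3),
    "Egypt": (44.1, 15.1), "France": (83.0, 39.0),
    "Germany": (84.0, 30.9), "Hong Kong": (72.8, 56.4),
    "India": (12.6, 5.1), "Indonesia": (15.4, 20.7),
    "Italy": (58.0, 38.1), "Japan": (79.1, 13.5),
    "Malaysia": (65.8, 46.5), "Mexico": (38.4, 31.8),
    "Netherlands": (93.0, 45.1), "Peru": (38.2, 31.2),
    "Philippines": (36.2, 30.9), "Poland": (65.0, 25.6),
    "Russia": (53.3, 5.6), "Saudi Arabia": (54.0, 20.7),
    "South Africa": (41.0, 12.3), "Spain": (72.0, 38.1),
    "Sweden": (94.0, 52.0), "Thailand": (26.5, 26.5),
    "Turkey": (45.1, 43.4), "United Kingdom": (87.0, 52.1),
    "United States": (81.0, 52.9), "Venezuela": (44.1, 32.6),
}

internet = [pair[0] for pair in penetration.values()]
facebook = [pair[1] for pair in penetration.values()]


def summarize(values):
    """Return center and variability summaries for one quantitative variable."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }


internet_summary = summarize(internet)
facebook_summary = summarize(facebook)

for name, summary in [("Internet Use", internet_summary), ("Facebook Use", facebook_summary)]:
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Comparing the printed values with the table above is a useful check that the data were entered correctly.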

Figure 3.4 portrays the distributions using histograms. We observe that the shape for Internet use is bimodal, with one mode around 45% and the other around 85%. We also see that Internet use has a wide range, from a minimum of just over 10% to a maximum of just under 95%. Facebook use ranges between 0% and about 55%, with most nations above 20%.
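The bimodal shape described for Internet use can be seen by counting how many countries fall into each interval. The sketch below is a rough text histogram, assuming bins of width 10 percentage points starting at 0; the binning used for Figure 3.4 itself may differ. The values are the Internet penetration percentages from Table 3.4.

```python
from collections import Counter

# Internet penetration (%) for the 32 countries, in the order listed in Table 3.4.
internet_use = [
    55.8, 82.4, 82.0, 49.9, 86.8, 61.4, 42.3, 49.0,
    44.1, 83.0, 84.0, 72.8, 12.6, 15.4, 58.0, 79.1,
    65.8, 38.4, 93.0, 38.2, 36.2, 65.0, 53.3, 54.0,
    41.0, 72.0, 94.0, 26.5, 45.1, 87.0, 81.0, 44.1,
]

BIN_WIDTH = 10  # assumed bin width, in percentage points

# Map each value to the lower edge of its bin, e.g. 44.1 -> 40.
counts = Counter(int(value // BIN_WIDTH) * BIN_WIDTH for value in internet_use)

# Print one row per bin, with one '#' per country in that bin.
for lower in range(0, 100, BIN_WIDTH):
    upper = lower + BIN_WIDTH
    print(f"{lower:>2} to {upper:<3} {'#' * counts[lower]}")
```

With this binning, the two tallest rows are the 40 to 50 and 80 to 90 intervals, which is consistent with the two modes described above.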

Figure 3.4 Histograms of Internet Use and Facebook Use for the 32 Countries. (Two panels, each showing frequency against percentage: one for Internet use and one for Facebook use.) Question Which nations, if any, might be outliers in terms of Internet use? Facebook use? Which graphical display would more clearly identify potential outliers?

Insight

The histograms portray each variable separately but give no clue about their relationship. Is it true that countries with higher Internet use tend to have higher Facebook use? How can we picture the association between the two variables on a single display? We’ll study how to do that next.

Try Exercise 3.13, part a