Properties of the Correlation - The Statistics Art and Science of Learning from Data

• The correlation r always falls between −1 and +1. The closer the value to 1 in absolute value (see the margin comments), the stronger the linear (straight-line) association as the data points fall nearer to a straight line.

• A positive correlation indicates a positive association, and a negative correlation indicates a negative association.

• The value of the correlation does not depend on the variables’ units. For example, suppose one variable is the income of a subject, in dollars. If we change the observations to units of euros or to units of thousands of dollars, we’ll get the same correlation.

• The correlation does not depend on which variable is treated as the response and which as the explanatory variable.

Recall

The absolute value of a number gives the distance the number falls from zero on the number line. The correlation values of -0.9 and 0.9 both have an absolute value of 0.9. They both represent a stronger association than correlation values of -0.6 and 0.6, for example.

Section 3.2 The Association Between Two Quantitative Variables

Example 7

Finding and interpreting the correlation value

Internet Use and Facebook Use

Picture the Scenario

Example 5 displayed a scatterplot for Internet use and Facebook use for 32 countries, shown again in the margin. We observed a positive association.

[Margin figure: Internet and Facebook Use for 32 Countries. Scatterplot of Facebook Use (%) against Internet Use (%), with the point for Japan labeled.]

Questions to Explore

a. What value does software give for the correlation?

b. How can we interpret the correlation value?

Think It Through

Because the association is positive, we expect to find r > 0. If we input the columns of Internet use and Facebook use into, e.g., MINITAB and request the correlation from the Basic Statistics menu, we get

Correlations: Internet Use, Facebook Use

Pearson correlation of Internet Use and Facebook Use = 0.614.

The correlation of r = 0.614 is positive. This result confirms the positive linear association we observed in the scatterplot. In summary, a country’s extent of Facebook use is moderately associated with its Internet use, with higher Internet use tending to correspond to higher Facebook use.

Insight

We get exactly the same correlation if we treat Facebook use as the explanatory variable and Internet use as the response variable. Also, it doesn’t matter whether we use the proportions or the percentages when computing the correlation. Both sets of units result in r = 0.614.

The identifier Pearson for the correlation in the MINITAB output refers to the British statistician, Karl Pearson. In 1896 he provided the formula used to compute the correlation value from sample data. This formula is shown next.

Try Exercises 3.14 and 3.15

Formula for the Correlation Value

Although software can compute the correlation for us, it helps to understand it if you see its formula. Let z_x denote the z-score for an observation x on the explanatory variable. Remember that z_x represents the number of standard deviations that x falls above or below the overall mean. That is,

$$z_x = \frac{\text{observed value} - \text{mean of all } x}{\text{standard deviation of all } x} = \frac{x - \bar{x}}{s_x},$$

where s_x denotes the standard deviation of the x-values. Similarly, let z_y denote the number of standard deviations that an observation y on the response variable falls above or below the mean of all y. To obtain r, you calculate the product z_x z_y for each observation and then find a typical value (a type of average) of those products.

Recall

From Section 2.5, the z-score for an observation indicates the number of standard deviations and the direction (above or below) that the observation falls from the overall mean.

Figure 3.8 Scatterplot of Internet Use and Facebook Use Divided Into Quadrants at (x̄, ȳ). [Plot: Internet and Facebook Use for 32 Countries, showing Facebook Use (%) against Internet Use (%), with reference lines at x̄ = 59 and ȳ = 34 and the point for Japan labeled.] Of the 32 data points, 25 lie in the upper-right quadrant (above the mean on each variable) or the lower-left quadrant (below the mean on each variable). Question Do the points in these two quadrants make a positive or a negative contribution to the correlation value? (Hint: Is the product of z-scores for these points positive or negative?)

[Margin figure: the same points plotted as z-scores, with axes labeled z-score Internet and z-score Facebook, each running from −2 to 2, and the point for Japan labeled.]

When we plot the z-scores of the variables on the axes, the relative positions of the points don’t change from Figure 3.8. We have merely relabeled the axes with different units.

The points in this plot give exactly the same correlation as the points in Figure 3.8.

Calculating the Correlation r

$$r = \frac{1}{n-1} \sum z_x z_y = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$$

where n is the number of observations (points in the scatterplot), x̄ and ȳ are means, and s_x and s_y are standard deviations for x and y. The sum is taken over all n observations.
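To make the formula concrete, here is a minimal sketch that follows it step by step: compute the z-scores, multiply them pairwise, and divide the sum of the products by n − 1. The five data pairs are hypothetical and chosen only to keep the arithmetic small; they are not from the Internet Use data file.

```python
from statistics import mean, stdev

# Hypothetical data for illustration only (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)  # sample means
s_x, s_y = stdev(xs), stdev(ys)    # sample standard deviations (divisor n - 1)

# z-score of each observation on each variable.
z_xs = [(x - x_bar) / s_x for x in xs]
z_ys = [(y - y_bar) / s_y for y in ys]

# Product of z-scores for each point, then the "typical value" of the products.
products = [z_x * z_y for z_x, z_y in zip(z_xs, z_ys)]
r = sum(products) / (n - 1)

print(f"r = {r:.3f}")
```

For these made-up values the result is about 0.77, a fairly strong positive correlation.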

In Practice Using Technology to Calculate r

Hand calculation of the correlation r is tedious. You should rely on software or a calculator. It’s more important to understand how the correlation describes association in terms of how it reflects the relative numbers of points in the four quadrants.

Chapter 3 Association: Contingency, Correlation, and Regression

For x = Internet use and y = Facebook use, Example 7 found the correlation r = 0.614, using statistical software. To visualize how the formula works, let’s revisit the scatterplot, reproduced in Figure 3.8, with a vertical line at the mean of x and a horizontal line at the mean of y. These lines divide the scatterplot into four quadrants. The summary statistics are

x̄ = 59.2, ȳ = 33.9, s_x = 22.4, s_y = 16.0.

The point for Japan (x = 79, y = 13) has as its z-scores z_x = 0.89, z_y = -1.27.

This point is labeled in Figure 3.8. Since x = 79 is to the right of the mean for x and y = 13 is below the mean of y, it falls in the lower-right quadrant. This makes Japan somewhat atypical in the sense that all but 7 of the 32 countries have points that fall in the upper-right and lower-left quadrants. The product of the z-scores for Japan equals 0.89(-1.27) = -1.13, indicating its negative contribution to r.
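The calculation for Japan can be retraced with a few lines of code. This sketch plugs in the summary statistics exactly as printed above, which are rounded to one decimal place, so the results are only approximate.

```python
# Summary statistics as printed in the text (rounded to one decimal place).
x_bar, y_bar = 59.2, 33.9
s_x, s_y = 22.4, 16.0

# Japan: Internet use 79%, Facebook use 13%.
x_japan, y_japan = 79, 13

z_x = (x_japan - x_bar) / s_x  # positive: to the right of the mean of x
z_y = (y_japan - y_bar) / s_y  # negative: below the mean of y
product = z_x * z_y            # negative: lower-right quadrant

print(f"z_x = {z_x:.2f}, z_y = {z_y:.2f}, product = {product:.2f}")
```

With these rounded inputs the z-scores come out near 0.88 and −1.31, slightly different from the 0.89 and −1.27 quoted above, which were presumably computed from unrounded statistics. The conclusion is the same either way: the product is negative, so Japan pulls the correlation down.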

SUMMARY: Product of z-scores and correlation

• The product of the z-scores for any point in the upper-right quadrant is positive. The product is also positive for each point in the lower-left quadrant. Such points make a positive contribution to the correlation.

• The product of the z-scores for any point in the upper-left and lower-right quadrants is negative. Such points make a negative contribution to the correlation.

The overall correlation reflects the number of points in the various quadrants and how far they fall from the means. For example, if all points fall in the upper-right and lower-left quadrants, the correlation must be positive.
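One way to see this summary at work is to total the z-score products separately for each quadrant. The sketch below does so for six hypothetical points (not from the text); the helper name `quadrant_contributions` is made up for this illustration.

```python
from statistics import mean, stdev

def quadrant_contributions(xs, ys):
    """Total the z-score products separately for each quadrant around (x-bar, y-bar)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    totals = {"upper-right": 0.0, "lower-left": 0.0,
              "upper-left": 0.0, "lower-right": 0.0}
    for x, y in zip(xs, ys):
        z_x, z_y = (x - x_bar) / s_x, (y - y_bar) / s_y
        # A point exactly on a mean line has product 0, so its label does not matter.
        vertical = "upper" if z_y >= 0 else "lower"
        horizontal = "right" if z_x >= 0 else "left"
        totals[f"{vertical}-{horizontal}"] += z_x * z_y
    return totals

# Hypothetical data for illustration only (not from the text).
xs = [10, 20, 30, 40, 50, 60]
ys = [12, 25, 38, 20, 32, 53]

totals = quadrant_contributions(xs, ys)
r = sum(totals.values()) / (len(xs) - 1)

for quadrant, total in totals.items():
    print(f"{quadrant:12s} {total:+.3f}")
print(f"r = {r:.3f}")
```

Here the upper-right and lower-left totals are positive, the other two are negative, and the positive totals dominate, so r is positive.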

Graph Data to See If the Correlation Is Appropriate

The correlation r is an efficient way to summarize the association shown by lots of data points with a single number. But be careful to use it only when it is appropriate. Figure 3.9 illustrates why. It shows a scatterplot in which the data points follow a U-shaped curve. There is an association because as x increases, y first tends to decrease and then tends to increase. For example, this might happen if x = age of person and y = annual medical expenses. Medical expenses tend to be high for newly born and young children, then they tend to be low until the person gets old, when they become high again. However, r = 0 for the data in Figure 3.9.

The correlation is designed for straight-line relationships. For Figure 3.9, r = 0, and it fails to detect the association. The correlation is not valid for describing association when the points cluster around a curve rather than around a straight line.

This figure highlights an important point to remember about any data analysis:

• Always plot the data.

If we merely used software to calculate the correlation for the data in Figure 3.9 without looking at the scatterplot, we might mistakenly conclude that the variables have no association. They do have one (and a fairly strong one), but it is not a straight-line association.
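A small numerical sketch shows how this can happen. The points below are hypothetical and lie exactly on the curve y = x², a perfect U shape, yet the z-score products on the left side cancel those on the right side.

```python
from statistics import mean, stdev

# Hypothetical U-shaped data for illustration only: y = x squared.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # 9, 4, 1, 0, 1, 4, 9

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Product of z-scores for each point; mirror-image points have opposite signs.
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
r = sum(products) / (len(xs) - 1)

for x, y, p in zip(xs, ys, products):
    print(f"x = {x:2d}, y = {y}, product = {p:+.3f}")
print(f"r = {r:.3f}")
```

The correlation is 0 (up to floating-point rounding) even though y is completely determined by x, which is why a scatterplot is needed to catch the association.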

Figure 3.9 The Correlation Poorly Describes the Association When the Relationship Is Curved. For this U-shaped relationship, the correlation is 0 (or close to 0), even though the variables are strongly associated. Question Can you use the formula for r, in terms of how points fall in the quadrants, to reason why the correlation would be close to 0?

In Practice Always Construct a Scatterplot

Always construct a scatterplot to display a relationship between two quantitative variables.

The correlation is only meaningful for describing the direction and strength of an approximate straight-line relationship.

In Words

A quadrant is any of the four regions created when a plane is divided by a horizontal line and a vertical line.


3.2 Practicing the Basics

3.11 Used cars and direction of association For the 100 cars on the lot of a used-car dealership, would you expect a positive association, negative association, or no association between each of the following pairs of variables? Explain why.

a. The age of the car and the number of miles on the odometer

b. The age of the car and the resale value

c. The age of the car and the total amount that has been spent on repairs

d. The weight of the car and the number of miles it travels on a gallon of gas

e. The weight of the car and the number of liters it uses per 100 km.*

*Liters/100 km rather than miles/gallon is a more common measure of fuel efficiency in many countries.

3.12 Broadband and GDP The Internet Use data file on the book’s website contains data on the number of individuals with broadband access and Gross Domestic Product (GDP) for 32 nations. Let x represent GDP (in billions of U.S. dollars) and y = number of broadband users (in millions).

a. The figure below shows a scatterplot. Describe this plot in terms of the association between broadband subscribers and GDP.

b. Give the approximate x- and y-coordinates for the nation that has the highest number of broadband subscribers.

c. Use software to calculate the correlation coefficient between the two variables. What is the sign of the coefficient? Explain what the sign means in the context of the problem.

d. Identify one nation that appears to have fewer broadband subscribers than you might expect, based on that nation’s GDP, and one that appears to have more.

e. If you recalculated the correlation coefficient after changing GDP from U.S. dollars to euros, would the correlation coefficient change? Explain.

[Figure: Broadband Subscribers and GDP for 32 Countries. Scatterplot of broadband subscribers (in millions) against GDP (in billions of $US), with the points for US, CN, and JP labeled.]

3.13 Economic development based on GDP The previous problem discusses GDP, which is a commonly used measure of the overall economic activity of a nation. For this group of nations, the GDP data have a mean of 1909 and a standard deviation of 3136 (in billions of U.S. dollars).

a. The five-number summary of GDP is minimum = 204, Q1 = 378, median = 780, Q3 = 2015, and maximum = 16,245. Sketch a box plot.

b. Based on these statistics and the graph in part a, describe the shape of the distribution of GDP values.

c. The data set also contains per capita GDP, or the overall GDP divided by the nation’s population size. Construct a scatterplot of per capita GDP and GDP and explain why no clear trend emerges.

d. Your friend, Joe, argues that the correlation between the two variables must be 1 since they are both measuring the same thing. In reality, the actual correlation between per capita GDP and GDP is only 0.32. Identify the flaw in Joe’s reasoning.

3.14 Email use and number of children According to data selected from GSS in 2014, the correlation between y = email hours per week and x = ideal number of children is -0.0008.

a. Would you call this association strong or weak? Explain.

b. The correlation between email hours per week and Internet hours per week is 0.33. For this sample, which explanatory variable, ideal number of children or Internet hours per week, seems to have a stronger association with y? Explain.

3.15 Internet use correlations For the 32 nations in the Internet Use data file on the book’s website, consider the following correlations:

Variable 1             Variable 2             Correlation
Internet users         Facebook users         0.293
Internet users         Broadband subscribers  0.974
Internet users         Population             0.834
Facebook users         Broadband subscribers  0.281
Facebook users         Population             0.234
Broadband subscribers  Population             0.704

a. Which pair of variables exhibits the strongest linear relationship?

b. Which pair of variables exhibits the weakest linear relationship?

c. In Example 7, we found the correlation between Internet use and Facebook use (measured in percentages of the population) to be 0.614. Why does the correlation between total number of Internet users and Facebook users differ from that of Internet use and Facebook use?

3.16 Match the scatterplot with r Match the following scatterplots with the correlation values.

1. r = -0.9   2. r = -0.5   3. r = 0   4. r = 0.6


[Figure for Exercise 3.16: four scatterplots, panels (a) through (d), each plotting y against x.]

3.17 What makes r = -1? Consider the data:

x    1    3    5    7    9
y   17   11   10   -1   -7

a. Sketch a scatterplot.

b. If one pair of (x, y) values is removed, the correlation for the remaining four pairs equals -1. Which pair has been removed?

c. If one y value is changed, the correlation for the five pairs equals -1. Identify the y value and how it must be changed for this to happen.

3.18 Gender and Chocolate Preference The following table shows data on gender (coded as 1 = female, 2 = male) and preferred type of chocolate (coded as 1 = white, 2 = milk, 3 = dark) for a sample of 10 students.

Preferred Chocolate Type

Name     Gender  Type
Anna     1       2
Franz    2       3
Hans     2       2
Lisl     1       1
Michael  2       3
Josef    2       3
Eva      1       3
Doris    1       2
Sophie   1       1
Kathi    1       1

The students’ teacher enters the data into software and reports a correlation of 0.640 between gender and type of preferred chocolate. He concludes that there is a moderately strong positive correlation between someone’s gender and chocolate preference. What’s wrong with this analysis?

3.19 r = 0 Provide a data set with five pairs of numeric values for which r > 0, but r = 0 after one of the points is deleted.

3.20 Correlation inappropriate Describe a situation in which it is inappropriate to use the correlation to measure the association between two quantitative variables.

3.21 Which mountain bike to buy? Is there a relationship between the weight of a mountain bike and its price? A lighter bike is often preferred, but do lighter bikes tend to be more expensive? The following table, from the Mountain Bike data file on the book’s website, gives data on price, weight, and type of suspension (FU = full, FE = front end) for 12 brands.

Mountain Bikes

Brand and Model         Price ($)  Weight (lb)  Type
Trek VRX 200            1000       32           FU
Cannondale Super V400   1100       31           FU
GT XCR-4000             940        34           FU
Specialized FSR         1100       30           FU
Trek 6500               700        29           FE
Specialized Rockhop     600        28           FE
Haro Escape A7.1        440        29           FE
Giant Yukon SE          450        29           FE
Mongoose SX 6.5         550        30           FE
Diamondback Sorrento    340        33           FE
Motiv Rockridge         180        34           FE
Huffy Anorak 36789      140        37           FE

Source: Data from Consumer Reports, June 1999.

a. You are shopping for a new bike. You are interested in whether and how weight affects the price. Which variable is the logical choice for the (i) explanatory variable, (ii) response variable?

b. Construct a scatterplot of price and weight. Does the relationship seem to be approximately linear? In what way does it deviate from linearity?

c. Use your software to verify that the correlation equals -0.32. Interpret it in context. Does weight appear to affect the price strongly in a linear manner?

3.22 Prices and protein revisited Is there a relationship between the protein content and the cost of Subway sandwiches? Use software to analyze the data in the following table:

Sandwich                            Cost ($)  Protein (g)
BLT                                 2.99      17
Ham (Black Forest, without cheese)  2.99      18
Oven Roasted Chicken                3.49      23
Roast Beef                          3.69      26
Subway Club®                        3.89      26
Sweet Onion Chicken Teriyaki        3.89      26
Turkey Breast                       3.49      18
Turkey Breast & Ham                 3.49      19
Veggie Delite®                      2.49      8
Cold Cut Combo                      2.99      21
Tuna                                3.10      21

a. Construct a scatterplot to show how protein depends on cost. Is the association positive or negative? Do you notice any unusual observations?

b. What might explain the gap observed in the scatterplot? (Hint: Are vegetables generally high or low in protein relative to meat and poultry products?)

c. Obtain the correlation between cost and protein, r. Interpret this value in context.

3.23 Buchanan vote Refer to Example 6 and the Buchanan and the Butterfly Ballot data file on the book’s website. Let y = Buchanan vote and x = Gore vote.

a. Construct a box plot for each variable. Summarize what you learn.

b. Construct a scatterplot. Identify any unusual points. What can you learn from a scatterplot that you cannot learn from box plots?

c. For the county represented by the most outlying observation, about how many votes would you have expected Buchanan to get if the point followed the same pattern as the rest of the data?

d. Repeat parts a and b using y = Buchanan vote and x = Bush vote.


3.3 Predicting the Outcome of a Variable

We’ve seen how to explore the relationship between two quantitative variables graphically with a scatterplot. When the relationship has a straight-line pattern, the correlation coefficient describes its strength numerically. We can analyze the data further by finding an equation for the straight line that best describes that pattern. This equation can be used to predict the value of the variable designated as the response variable from the value of the variable designated as the explanatory variable.

Recall

The correlation does not require one variable to be designated as response and the other as explanatory.

In Words

The symbol ŷ, which denotes the predicted value of y, is pronounced y-hat.

Regression Line: An Equation for Predicting the Response Outcome

The regression line predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable. Let ŷ denote the predicted value of y. The equation for the regression line has the form

ŷ = a + bx.

In this formula, a denotes the y-intercept and b denotes the slope.

Example 8

Predict an outcome

Height Based on Human Remains

Picture the Scenario

Anthropologists can reconstruct information using partial human remains at burial sites. For instance, after finding a femur (thighbone), they can predict how tall an individual was. They use the regression line, ŷ = 61.4 + 2.4x, where ŷ is the predicted height and x is the length of the femur, both in centimeters.

Questions to Explore

What is the response and what is the explanatory variable? How can we graph the line that depicts how the predicted height depends on the femur length? A femur found at a particular site has a length of 50 cm. What is the predicted height of the person who had that femur?

Think It Through

It is natural here to treat the length of the femur as the explanatory variable to predict the height of a person, the response variable. The formula ŷ = 61.4 + 2.4x has y-intercept 61.4 and slope 2.4. It has the straight-line form ŷ = a + bx with a = 61.4 and b = 2.4.

Figure 3.10 Graph of the Regression Line for x = Femur Length and y = Height of Person. [Plot: the line ŷ = 61.4 + 2.4x for femur lengths from 0 to 50, with Femur Length on the horizontal axis and Height on the vertical axis, the y-intercept a = 61.4 marked, and the points (0, 61.4) and (50, 181.4) labeled.] Question At what point does the line cross the y-axis? How can you interpret the slope of 2.4?

Each number x, when substituted into the formula ŷ = 61.4 + 2.4x, yields a value for ŷ. For simplicity in plotting the line, we start with x = 0, although in practice this would not be an observed femur length. The value x = 0 has ŷ = 61.4 + 2.4(0) = 61.4. This is called the y-intercept and is located 61.4 units up the y-axis at x = 0, at coordinates (0, 61.4). The value x = 50 has ŷ = 61.4 + 2.4(50) = 181.4. When the femur length is 50 cm, the predicted height of the person is 181.4 cm. The coordinates for this point are (50, 181.4). We can plot the line by connecting the points (0, 61.4) and (50, 181.4).

Figure 3.10 plots the straight line for x between 0 and 50. In summary, the predicted height ŷ increases from 61.4 to 181.4 as x increases from 0 to 50.
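The same substitutions can be written as a short function. This is only a sketch: the function name `predict_height` and its parameter names are made up for illustration, while the intercept 61.4 and slope 2.4 are the values given in the example.

```python
def predict_height(femur_cm, intercept=61.4, slope=2.4):
    """Predicted height in cm from femur length in cm, using y-hat = a + b*x."""
    return intercept + slope * femur_cm

# The two points used in the text to draw the line.
for femur_cm in (0, 50):
    print(f"x = {femur_cm:2d} cm -> predicted height = {predict_height(femur_cm):.1f} cm")

# Each additional centimeter of femur adds the slope to the prediction.
increase_per_cm = predict_height(41) - predict_height(40)
print(f"increase per extra cm of femur = {increase_per_cm:.1f} cm")
```

The last line illustrates one reading of the slope: predicted height goes up by 2.4 cm for every 1 cm increase in femur length.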

Insight

A regression line is often called a prediction equation since it predicts the value of the response variable y at any value of x. Sadly, this particular prediction equation had to be applied to bones found in mass graves in Kosovo, to help identify Albanians who had been executed by Serbians in 1998.⁴

Try Exercises 3.25, part a, and 3.26, part a

⁴ “The Forensics of War,” by Sebastian Junger in Vanity Fair, October 1999.

In Practice Notation for the Regression Line

The formula ŷ = a + bx uses slightly different notation from the traditional formula, which is y = mx + b. In that equation, m = the slope (the coefficient of x) and b = the y-intercept.

Regardless of the notation, the interpretations of the y-intercept and slope are the same.

From the document The Statistics Art and Science of Learning from Data (pages 135-141)