Be Cautious of Influential Outliers - The Statistics Art and Science of Learning from Data

One reason to plot the data before you do a correlation or regression analysis is to check for unusual observations. Such an observation can tell you something interesting, as in Examples 6 and 10 about the Buchanan vote in the 2000 U.S. presidential election. Furthermore, a data point that is an outlier on a scatterplot can have a substantial effect on the regression line and correlation, especially with small data sets.

What’s relevant here is not whether an observation is an outlier in its x value, relative to the other x values, or in its y value, relative to the other y values. Instead, we search for observations that are regression outliers, being well removed from the trend that the rest of the data follow. The margin figure shows an observation that is a regression outlier, although it is not an outlier on x alone or on y alone.

[Margin figure: Regression Outlier]

156 Chapter 3 Association: Contingency, Correlation, and Regression

When an observation has a large effect on results of a regression analysis, it is said to be influential. For an observation to be influential, two conditions must hold:

• Its x value is relatively low or high compared to the rest of the data.

• The observation is a regression outlier, falling quite far from the trend that the rest of the data follow.

When both of these happen, the line tends to be pulled toward that data point and away from the trend of the rest of the points, as the margin figure illustrates.
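One way to see why both conditions matter is to look at the standard formula for the least squares slope. The notation below (b for the slope, x̄ and ȳ for the sample means) is assumed here for illustration and is not taken from this section:

```latex
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Each observation enters the numerator through the product of its x deviation and its y deviation. Roughly speaking, an observation whose x value is close to x̄ adds almost nothing to that sum, however unusual its y value is, whereas an observation that is far from x̄ and also far off the trend adds a large term that can tilt the line. This is only a heuristic reading, since adding or removing a point also shifts x̄ and ȳ a little.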

Figure 3.18 shows two regression outliers. The correlation without these two points equals 0.00. The first regression outlier is near the middle of the range of x. It does not have much potential for tilting the line up or down. It has little influence on the slope or the correlation. The correlation changes only to 0.03 when we add it to the data set. The second regression outlier is at the high end of the range of x-values. It is influential. The correlation changes to 0.47 when we add it to the data set.

Figure 3.18 An Observation Is a Regression Outlier If It Is Far Removed from the Trend That the Rest of the Data Follow. [Scatterplot of y vs x, showing the regression line without the outlier and the regression line with the outlier.] The top two points are regression outliers. Not all regression outliers are influential in affecting the correlation or slope. Question: Which regression outlier in this figure is influential?
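The contrast between the two regression outliers can be reproduced with a small numerical sketch. The data below are hypothetical and are not the points plotted in Figure 3.18; the function name pearson_r is our own. The base data are built so that their correlation is 0, and then one regression outlier is added at mid-range x and, separately, one at the high end of x.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (NOT the points in Figure 3.18): y alternates in a pattern
# that is symmetric about the middle of the x range, so the correlation is 0.
base_x = list(range(1, 12))                  # 1, 2, ..., 11
base_y = [6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6]

r_base = pearson_r(base_x, base_y)
# Same unusually large y value (12), placed at two different x positions.
r_middle = pearson_r(base_x + [6], base_y + [12])    # outlier at mid-range x
r_high = pearson_r(base_x + [12], base_y + [12])     # outlier at high x

print(f"without outliers:         r = {r_base:.2f}")
print(f"outlier at mid-range x:   r = {r_middle:.2f}")
print(f"outlier at high x:        r = {r_high:.2f}")
```

In this sketch the mid-range outlier leaves the correlation essentially unchanged, while the same y value placed at the high end of x moves it substantially, which is the pattern the figure describes.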

Example 13: Influential Outliers

Higher Education and Higher Murder Rates

Picture the Scenario

Table 3.6 shows data¹⁰ for the 50 states and the District of Columbia on the following variables:

Violent crime rate: The annual number of murders, forcible rapes, robberies, and aggravated assaults per 100,000 people in the population.

Murder rate: The annual number of murders per 100,000 people in the population.

¹⁰ From Statistical Abstract of the United States, 2003.

Section 3.4 Cautions in Analyzing Associations 157

Poverty: Percentage of the residents with income below the poverty level.

High school: Percentage of the adult residents who have at least a high school education.

College: Percentage of the adult residents who have a college education.

Single parent: Percentage of families headed by a single parent.

The data are in the U.S. Statewide Crime data file on the book’s website. Let’s look at the relationship between y = murder rate and x = college. We’ll look at other variables in the exercises.

Table 3.6 Statewide Data on Several Variables

State                 Violent Crime  Murder Rate  Poverty  High School  College  Single Parent
Alabama                         486          7.4     14.7         77.5     20.4           26.0
Alaska                          567          4.3      8.4         90.4     28.1           23.2
Arizona                         532          7.0     13.5         85.1     24.6           23.5
Arkansas                        445          6.3     15.8         81.7     18.4           24.7
California                      622          6.1     14.0         81.2     27.5           21.8
Colorado                        334          3.1      8.5         89.7     34.6           20.8
Connecticut                     325          2.9      7.7         88.2     31.6           22.9
Delaware                        684          3.2      9.9         86.1     24.0           25.6
District of Columbia           1508         41.8     17.4         83.2     38.3           44.7
Florida                         812          5.6     12.0         84.0     22.8           26.5
Georgia                         505          8.0     12.5         82.6     23.1           25.5
Hawaii                          244          2.9     10.6         87.4     26.3           19.1
Idaho                           253          1.2     13.3         86.2     20.0           17.7
Illinois                        657          7.2     10.5         85.5     27.1           21.9
Indiana                         349          5.8      8.3         84.6     17.1           22.8
Iowa                            266          1.6      7.9         89.7     25.5           19.8
Kansas                          389          6.3     10.5         88.1     27.3           20.2
Kentucky                        295          4.8     12.5         78.7     20.5           23.2
Louisiana                       681         12.5     18.5         80.8     22.5           29.3
Maine                           110          1.2      9.8         89.3     24.1           23.7
Maryland                        787          8.1      7.3         85.7     32.3           24.5
Massachusetts                   476          2.0     10.2         85.1     32.7           22.8
Michigan                        555          6.7     10.2         86.2     23.0           24.5
Minnesota                       281          3.1      7.9         90.8     31.2           19.6
Mississippi                     361          9.0     15.5         80.3     18.7           30.0
Missouri                        490          6.2      9.8         86.6     26.2           24.3
Montana                         241          1.8     16.0         89.6     23.8           21.4
Nebraska                        328          3.7     10.7         90.4     24.6           19.6
Nevada                          524          6.5     10.1         82.8     19.3           24.2
New Hampshire                   175          1.8      7.6         88.1     30.1           20.0
New Jersey                      384          3.4      8.1         87.3     30.1           20.2
New Mexico                      758          7.4     19.3         82.2     23.6           26.6
New York                        554          5.0     14.7         82.5     28.7           26.0
North Carolina                  498          7.0     13.2         79.2     23.2           24.3
North Dakota                     81          0.6     12.8         85.5     22.6           19.1
Ohio                            334          3.7     11.1         87.0     24.6           24.6
Oklahoma                        496          5.3     14.1         86.1     22.5           23.5
Oregon                          351          2.0     12.9         88.1     27.2           22.5
Pennsylvania                    420          4.9      9.8         85.7     24.3           22.8
Rhode Island                    298          4.3     10.2         81.3     26.4           27.4
South Carolina                  805          5.8     12.0         83.0     19.0           27.1
South Dakota                    167          0.9      9.4         91.8     25.7           20.7
Tennessee                       707          7.2     13.4         79.9     22.0           27.9
Texas                           545          5.9     14.9         79.2     23.9           21.5
Utah                            256          1.9      8.1         90.7     26.4           13.6
Vermont                         114          1.5     10.3         90.0     28.8           22.5
Virginia                        282          5.7      8.1         86.6     31.9           22.2
Washington                      370          3.3      9.5         91.8     28.6           22.1
West Virginia                   317          2.5     15.8         77.1     15.3           22.3
Wisconsin                       237          3.2      9.0         86.7     23.8           21.7
Wyoming                         267          2.4     11.1         90.0     20.6           20.8

Questions to Explore

a. Construct the scatterplot between y = murder rate and x = college. Does any observation look like it could be influential in its effect on the regression line?

b. Use software to find the regression line. Check whether the observation identified in part a actually is influential by finding the line again without that observation.

Think It Through

a. Figure 3.19 shows the scatterplot. The observation out by itself is D.C., with x = 38.3 and y = 41.8, which is the largest observation on both these variables. It satisfies both conditions for an observation to be influential: It has a relatively extreme value on the explanatory variable (college), and it is a regression outlier, falling well away from the linear trend of the other points.


b. Using software, the regression line fitted to all 51 observations, including D.C., equals ŷ = -3.1 + 0.33x. The slope is positive, as shown in the first plot in Figure 3.20. You can check that the predicted murder rates increase from 1.9 to 10.1 as the percentage with a college education increases from x = 15% to x = 40%, roughly the range of observed x values. By contrast, when we fit the regression line only to the 50 states, excluding the observation for D.C., ŷ = 8.0 - 0.14x. The slope of -0.14 reflects a negative trend, as shown in the second plot in Figure 3.20. Now, the predicted murder rate decreases from 5.9 to 2.4 as the percentage with a college education increases from 15% to 40%.

Figure 3.19 Scatterplot Relating Murder Rate to Percentage with College Education. [Scatterplot of murder rate (y) against percentage with college education (x); the point for D.C. is labeled.] Question: How would you expect the slope to change if D.C. is excluded from the regression analysis?

Figure 3.20 Scatterplots Relating Murder Rate to Percentage with College Education, With and Without Observation for D.C. [Two fitted line plots of murder rate (y) against percentage with college education (x). With D.C.: Murder rate = -3.06 + 0.333 College. Without the D.C. observation: Murder rate = 8.04 - 0.138 College.] Question: Which line better describes the trend for the 50 states?
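Part b can be checked with a few lines of code. The sketch below is one possible way to do it and not the software used in the text: it types in the college and murder rate columns of Table 3.6, fits the least squares line with and without D.C., and prints the predictions at x = 15 and x = 40. The helper name least_squares and the variable names are our own.

```python
def least_squares(points):
    """Return (intercept, slope) of the least squares line y-hat = a + b*x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# (college, murder rate) pairs typed in from Table 3.6, states in table order.
STATES = [
    (20.4, 7.4), (28.1, 4.3), (24.6, 7.0), (18.4, 6.3), (27.5, 6.1),   # AL AK AZ AR CA
    (34.6, 3.1), (31.6, 2.9), (24.0, 3.2), (22.8, 5.6), (23.1, 8.0),   # CO CT DE FL GA
    (26.3, 2.9), (20.0, 1.2), (27.1, 7.2), (17.1, 5.8), (25.5, 1.6),   # HI ID IL IN IA
    (27.3, 6.3), (20.5, 4.8), (22.5, 12.5), (24.1, 1.2), (32.3, 8.1),  # KS KY LA ME MD
    (32.7, 2.0), (23.0, 6.7), (31.2, 3.1), (18.7, 9.0), (26.2, 6.2),   # MA MI MN MS MO
    (23.8, 1.8), (24.6, 3.7), (19.3, 6.5), (30.1, 1.8), (30.1, 3.4),   # MT NE NV NH NJ
    (23.6, 7.4), (28.7, 5.0), (23.2, 7.0), (22.6, 0.6), (24.6, 3.7),   # NM NY NC ND OH
    (22.5, 5.3), (27.2, 2.0), (24.3, 4.9), (26.4, 4.3), (19.0, 5.8),   # OK OR PA RI SC
    (25.7, 0.9), (22.0, 7.2), (23.9, 5.9), (26.4, 1.9), (28.8, 1.5),   # SD TN TX UT VT
    (31.9, 5.7), (28.6, 3.3), (15.3, 2.5), (23.8, 3.2), (20.6, 2.4),   # VA WA WV WI WY
]
DC = (38.3, 41.8)

a_all, b_all = least_squares(STATES + [DC])   # all 51 observations
a_st, b_st = least_squares(STATES)            # the 50 states only

for label, a, b in [("with D.C.", a_all, b_all), ("without D.C.", a_st, b_st)]:
    print(f"{label}: y-hat = {a:.2f} + ({b:.3f})x; "
          f"predicted at x=15: {a + b * 15:.1f}, at x=40: {a + b * 40:.1f}")
```

If the table values are typed correctly, the two fitted lines should come out close to the equations quoted in the text, with a positive slope when D.C. is included and a negative slope when it is excluded.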


This example shows that the correlation and the regression line are nonresistant: They are prone to distortion by outliers. Investigate any regression outlier. Was the observation recorded incorrectly, or is it merely different from the rest of the data in some way? It is often a good idea to refit the regression line without it to see if it has a large effect, as we did in this last example.
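The advice to refit the line without a suspect observation can be applied to every observation in turn. The following is a minimal sketch of that idea on hypothetical data; the function names are our own, and the data are made up so that five points follow a decreasing trend while a sixth is extreme in x and far above that trend.

```python
def least_squares_slope(points):
    """Slope of the least squares line through (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    return sxy / sxx

def leave_one_out_slopes(points):
    """For each observation, the slope of the line refitted without it."""
    return [least_squares_slope(points[:i] + points[i + 1:])
            for i in range(len(points))]

# Hypothetical data: the first five points lie on y = 10 - x; the last point
# has a high x value and falls far above that trend.
points = [(1, 9), (2, 8), (3, 7), (4, 6), (5, 5), (10, 12)]

full_slope = least_squares_slope(points)
loo_slopes = leave_one_out_slopes(points)

# How much the slope moves when each observation is dropped.
changes = [abs(s - full_slope) for s in loo_slopes]
most_influential = points[changes.index(max(changes))]

print(f"slope with all points: {full_slope:.3f}")
for p, s in zip(points, loo_slopes):
    print(f"  without {p}: slope = {s:.3f}")
print("largest change comes from dropping", most_influential)
```

Dropping the last point restores the decreasing trend of the other five, while dropping any other single point changes the slope much less. A large change of this kind is a prompt to investigate the observation, not a reason to delete it automatically.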
