PART VIII. TESTS OF SIGNIFICANCE
5. THERE ARE TWO REGRESSION LINES
Exercise Set D
1. As part of their training, air force pilots make two practice landings with instructors, and are rated on performance. The instructors discuss the ratings with the pilots after each landing. Statistical analysis shows that pilots who make poor landings the first time tend to do better the second time. Conversely, pilots who make good landings the first time tend to do worse the second time. The conclusion: criticism helps the pilots while praise makes them do worse. As a result, instructors were ordered to criticize all landings, good or bad. Was this warranted by the facts?
Answer yes or no, and explain briefly.6
2. An instructor standardizes her midterm and final each semester so the class average is 50 and the SD is 10 on both tests. The correlation between the tests is around 0.50. One semester, she took all the students who scored below 30 at the midterm, and gave them special tutoring. They all scored above 50 on the final. Can this be explained by the regression effect? Answer yes or no, and explain briefly.
3. In the data set of figures 5 and 6, are the sons of the 61-inch fathers taller on the average than the sons of the 62-inch fathers, or shorter? What is the explanation?
The answers to these exercises are on pp. A62–63.
Example 3. IQ scores are scaled to have an average of about 100, and an SD of about 15, both for men and for women. The correlation between the IQs of husbands and wives is about 0.50. A large study of families found that the men whose IQ was 140 had wives whose IQ averaged 120. Look at the wives in the study whose IQ was 120. Should the average IQ of their husbands be greater than 120? Answer yes or no, and explain briefly.
Solution. No, the average IQ of their husbands will be around 110. See figure 9. The families where the husband has an IQ of 140 are shown in the vertical strip. The average y-coordinate in this strip is 120. The families where the wife has an IQ of 120 are shown in the horizontal strip. This is a completely different set of families. The average x-coordinate for points in the horizontal strip is about 110.
Remember, there are two regression lines. One line is for predicting the wife’s IQ from her husband’s IQ. The other line is for predicting the husband’s IQ from his wife’s.
Figure 9. The two regression lines.
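The arithmetic in example 3 can be checked with a short sketch. The helper name `regression_estimate` and its parameter names are illustrative assumptions, not notation from the text. It applies the regression method: convert the given value to standard units, then move only r SDs in the variable being predicted.

```python
def regression_estimate(value, avg_given, sd_given, avg_pred, sd_pred, r):
    """Regression estimate of one variable from the other (hypothetical helper)."""
    z = (value - avg_given) / sd_given   # given value in standard units
    return avg_pred + r * z * sd_pred    # move only r SDs in the predicted variable

# Figures from example 3: both averages about 100, both SDs about 15, r about 0.50.
wives_of_140_husbands = regression_estimate(140, 100, 15, 100, 15, 0.50)
husbands_of_120_wives = regression_estimate(120, 100, 15, 100, 15, 0.50)

print(round(wives_of_140_husbands))  # 120
print(round(husbands_of_120_wives))  # 110
```

The two calls use different regression lines: the first predicts the wife's IQ from her husband's, the second predicts the husband's IQ from his wife's. That is why starting at 140, going to 120, and coming back does not return to 140.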
Exercise Set E
1. For the men age 18–24 in the HANES5 sample, the ones who were 63 inches tall averaged 138 pounds in weight. True or false, and explain: the ones who weighed 138 pounds must have averaged 63 inches in height.
2. In Pearson’s study, the sons of the 72-inch fathers only averaged 71 inches in height. True or false: if you take the 71-inch sons, their fathers will average about 72 inches in height. Explain briefly.
3. In example 2 (p. 166), the regression method predicted that a student at the 90th percentile on the SAT would only be at the 69th percentile on first-year GPA. True or false, and explain: a student at the 69th percentile on first-year GPA should be at the 90th percentile on the SAT.
The answers to these exercises are on p. A63.
6. REVIEW EXERCISES
Review exercises may cover material from previous chapters.
1. Shown below is a scatter diagram for Math and Verbal SAT scores for graduating seniors at a certain high school. Three areas are shaded. Match the area with the description. (One description will be left over.)
(i) Total score (Math+Verbal) is below 1000.
(ii) Total score (Math+Verbal) is around 1000.
(iii) Math score is about equal to Verbal score.
(iv) Math score is less than Verbal score.
[Three scatter diagrams, labeled A, B, and C. In each, the axes are Math SAT score and Verbal SAT score, both running from 200 to 800.]
2. In a study of the stability of IQ scores, a large group of individuals is tested once at age 18 and again at age 35. The following results are obtained.
age 18: average score ≈ 100, SD ≈ 15
age 35: average score ≈ 100, SD ≈ 15
r ≈ 0.80
(a) Estimate the average score at age 35 for all the individuals who scored 115 at age 18.
(b) Predict the score at age 35 for an individual who scored 115 at age 18.
3. Pearson and Lee obtained the following results in a study of about 1,000 families:
average height of husband ≈ 68 inches, SD ≈ 2.7 inches
average height of wife ≈ 63 inches, SD ≈ 2.5 inches, r ≈ 0.25
Predict the height of a wife when the height of her husband is
(a) 72 inches (b) 64 inches (c) 68 inches (d) unknown
4. In one study, the correlation between the educational level of husbands and wives in a certain town was about 0.50; both averaged 12 years of schooling completed, with an SD of 3 years.7
(a) Predict the educational level of a woman whose husband has completed 18 years of schooling.
(b) Predict the educational level of a man whose wife has completed 15 years of schooling.
(c) Apparently, well-educated men marry women who are less well educated than themselves. But the women marry men with even less education. How is this possible?
5. An investigator measuring various characteristics of a large group of athletes found that the correlation between the weight of an athlete and the amount of weight that athlete could lift was 0.60. True or false, and explain:
(a) On the average, an athlete can lift 60% of his body weight.
(b) If an athlete gains 10 pounds, he can expect to lift an additional 6 pounds.
(c) The more an athlete weighs, on the average the more he can lift.
(d) The more an athlete can lift, on the average the more he weighs.
(e) 60% of an athlete’s lifting ability can be attributed to his weight alone.
6. Three lines are drawn across the scatter diagram below. One is the SD line, one is the regression line for y on x, and one is the regression line for x on y. Which is which? Why? (The “regression line for y on x” is used to predict y from x.)
7. A doctor is in the habit of measuring blood pressures twice. She notices that patients who are unusually high on the first reading tend to have somewhat lower second readings. She concludes that patients are more relaxed on the second reading. A colleague disagrees, pointing out that the patients who are unusually low on the first reading tend to have somewhat higher second readings, suggesting they get more nervous. Which doctor is right? Or perhaps both are wrong? Explain briefly.
8. A large study was made on the blood-pressure problem discussed in the previous exercise. It found that first readings average 130 mm, and second readings average 120 mm; both SDs were about 15 mm. Does this support either doctor’s argument? Or is it the regression effect? Explain.
9. In a large statistics class, the correlation between midterm scores and final scores is found to be nearly 0.50, every term. The scatter diagrams are football-shaped. Predict the percentile rank on the final for a student whose percentile rank on the midterm is
(a) 5% (b) 80% (c) 50% (d) unknown
10. True or false: A student who is at the 40th percentile of first-year GPAs is also likely to be at the 40th percentile of second-year GPAs. Explain briefly.
(The scatter diagram is football-shaped.)
7. SUMMARY
1. Associated with an increase of one SD in x, there is an increase of only r SDs in y, on the average. Plotting these regression estimates gives the regression line for y on x.
2. The graph of averages is often close to a straight line, but may be a little bumpy. The regression line smooths out the bumps. If the graph of averages is a straight line, then it coincides with the regression line. If the graph of averages has a strong non-linear pattern, regression may be inappropriate.
3. The regression line can be used to make predictions for individuals. But if you have to extrapolate far from the data, or to a different group of subjects, be careful.
4. In a typical test-retest situation, the subjects get different scores on the two tests. Take the bottom group on the first test. Some improve on the second test, others do worse. On average, the bottom group shows an improvement. Now, the top group: some do better the second time, others fall back. On average, the top group does worse the second time. This is the regression effect, and it happens whenever the scatter diagram spreads out around the SD line into a football-shaped cloud of points.
5. The regression fallacy consists in thinking that the regression effect must be due to something other than spread around the SD line.
6. There are two regression lines that can be drawn on a scatter diagram. One predicts y from x; the other predicts x from y.
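The regression effect in a test-retest situation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the scores are generated from a normal model with average 50, SD 10, and correlation 0.50 between the two tests, and the cutoffs of 40 and 60 for the bottom and top groups are arbitrary choices. Nothing is done to the subjects between the tests.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

R = 0.50    # assumed correlation between the two tests
N = 20000   # assumed number of simulated subjects

first, second = [], []
for _ in range(N):
    z1 = random.gauss(0, 1)
    # z2 has SD 1 and correlation R with z1
    z2 = R * z1 + (1 - R ** 2) ** 0.5 * random.gauss(0, 1)
    first.append(50 + 10 * z1)   # scores scaled to average 50, SD 10
    second.append(50 + 10 * z2)

def mean(xs):
    return sum(xs) / len(xs)

bottom = [i for i in range(N) if first[i] < 40]  # at least one SD below average
top = [i for i in range(N) if first[i] > 60]     # at least one SD above average

bottom_first = mean([first[i] for i in bottom])
bottom_second = mean([second[i] for i in bottom])
top_first = mean([first[i] for i in top])
top_second = mean([second[i] for i in top])

print(f"bottom group: {bottom_first:.1f} on test 1, {bottom_second:.1f} on test 2")
print(f"top group:    {top_first:.1f} on test 1, {top_second:.1f} on test 2")
```

On average the bottom group moves up toward 50 on the second test and the top group moves down toward 50, even though the model contains no tutoring, criticism, or praise. The change comes only from the spread of the points around the SD line.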
11
The R.M.S. Error for Regression
Such are the formal mathematical consequences of normal correlation. Much biometric material certainly shows a general agreement with the features to be expected on this assumption: although I am not aware that the question has been subjected to any sufficiently critical enquiry. Approximate agreement is perhaps all that is needed to justify the use of the correlation as a quantity descriptive of the population; its efficacy in this respect is undoubted, and it is not improbable that in some cases it affords, in conjunction with the means and variances, a complete description of the simultaneous variation of the variates.
—SIR R. A. FISHER (ENGLAND, 1890–1962)1
1. INTRODUCTION
The regression method can be used to predict y from x. However, actual values differ from predictions. By how much? The object of this section is to measure the overall size of the differences using the r.m.s. error. For example, take the heights and weights of the 471 men age 18–24 in the HANES5 sample (section 1 of chapter 10). The summary statistics:
average height ≈ 70 inches, SD ≈ 3 inches
average weight ≈ 180 pounds, SD ≈ 45 pounds, r ≈ 0.40
To review briefly, given a man’s height, his weight is predicted by the average weight for all the men with that height. The average can be estimated by the regression method. Figure 1 shows the regression line. Person A on the diagram is about 72 inches tall. The regression estimate for average weight at this height is 192 pounds (section 1 of chapter 10). However, A’s actual weight is 456 pounds. The prediction is off, by 264 pounds:
error = actual weight − predicted weight
= 456 lb − 192 lb = 264 lb.
In the diagram, the prediction error is the vertical distance of A above the regression line.
Figure 1. Prediction errors. The error is the distance above (+) or below (−) the regression line. The scatter diagram shows heights and weights for the 471 men age 18–24 in the HANES5 sample.
[Scatter diagram: weight (pounds, 90 to 450) plotted against height (inches, 58 to 82), with the regression line drawn and points A, B, C, D, and E marked.]
Person C on the diagram is 80.5 inches tall and weighs 183 pounds. The regression line predicts his weight as 243 pounds. So there is a prediction error of 183 lb − 243 lb = −60 lb. In the diagram, this error is represented by the vertical distance of C below the regression line.
The distance of a point above (+) or below (−) the regression line is
error=actual−predicted.
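The two prediction errors worked out above, for persons A and C, can be reproduced with a minimal sketch. The function names `predicted_weight` and `prediction_error` are illustrative assumptions; the summary statistics are the rounded ones quoted in this section.

```python
# Summary statistics quoted in the text for the 471 men age 18-24.
AVG_HEIGHT, SD_HEIGHT = 70, 3     # inches
AVG_WEIGHT, SD_WEIGHT = 180, 45   # pounds
R = 0.40

def predicted_weight(height):
    """Regression estimate of weight from height (hypothetical helper)."""
    z = (height - AVG_HEIGHT) / SD_HEIGHT    # height in standard units
    return AVG_WEIGHT + R * z * SD_WEIGHT    # move only R SDs in weight

def prediction_error(height, actual_weight):
    """error = actual - predicted; positive means above the regression line."""
    return actual_weight - predicted_weight(height)

error_a = prediction_error(72, 456)     # person A
error_c = prediction_error(80.5, 183)   # person C

print(round(predicted_weight(72)), round(error_a))    # 192 264
print(round(predicted_weight(80.5)), round(error_c))  # 243 -60
```

The sign carries the direction: A lies above the line, so the error is positive, and C lies below it, so the error is negative.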