PART VIII. TESTS OF SIGNIFICANCE
3. THE REGRESSION METHOD FOR INDIVIDUALS
(a) (b) (c) (d)
x y x y x y x y
1 0 0 0 0 0 0 2
1 6 0 2 1 1 1 3
2 5 1 2 2 4 2 0
3 6 2 4
3 8 3 1
4 2
The answers to these exercises are on pp. A61–62.
Technical note. In general, the regression line fitted to the graph of averages, with each point weighted according to the number of cases it represents, coincides with the regression line fitted to the original scatter diagram. This is exact when points with different x-coordinates are kept separate in the graph of averages; otherwise, it is a good approximation.
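The claim in the technical note can be checked numerically. The sketch below is only an illustration: the five data points are hypothetical, and the helper names fit_line and graph_of_averages are made up for this example. It fits a least-squares line to the raw points, then to the graph of averages with each average weighted by the number of cases it represents, keeping different x-coordinates separate.

```python
from collections import defaultdict


def fit_line(xs, ys, weights=None):
    """Return (slope, intercept) of the least-squares line, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def graph_of_averages(xs, ys):
    """Average the y-values at each distinct x; also return the case counts."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    x_vals = sorted(groups)
    averages = [sum(groups[x]) / len(groups[x]) for x in x_vals]
    counts = [len(groups[x]) for x in x_vals]
    return x_vals, averages, counts


# Hypothetical data, chosen only to illustrate the note.
xs = [1, 1, 2, 3, 3]
ys = [0, 6, 5, 6, 8]

raw_line = fit_line(xs, ys)
x_vals, averages, counts = graph_of_averages(xs, ys)
weighted_line = fit_line(x_vals, averages, weights=counts)

print("line through raw points:       ", raw_line)
print("weighted line through averages:", weighted_line)
```

With these points the two fits agree. Dropping the weights (passing weights=None for the averages) would in general give a different line when the groups have unequal sizes.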
The logic: for all students with an SAT of around 650, the average first-year GPA is about 2.9, by the regression method. That is why we predict a first-year GPA of 2.9 for this individual.
Usually, investigators work out regression estimates from a study, and then extrapolate: they use the estimates on new subjects. In many cases this makes sense, provided the subjects in the survey are representative of the people about whom the inferences are going to be made. But you have to think about the issue each time. The mathematics of the regression method will not protect you. In example 1, the university only has experience with the students it admits. There could be a problem in using the regression procedure on students who are quite different from that group. (Admissions officers typically do extrapolate, from admitted students to students who are denied admission.)
Now, another use for the regression method: to predict percentile ranks. If your percentile rank on a test is 90%, you did very well: only 10% of the class scored higher, the other 90% scored lower. A percentile rank of 25% is not so good: 75% of the class scored higher, the other 25% scored lower (p. 91).
Example 2. (This continues example 1.) Suppose the percentile rank of one student on the SAT is 90%, among the first-year students. Predict his percentile rank on first-year GPA. The scatter diagram is football-shaped. In particular, the SAT scores and GPAs follow the normal curve.
Solution. We are going to use the regression method. This student is above the average on the SAT. By how many SDs? Because SAT scores follow the normal curve, his percentile rank has this information, in disguise (section 5 of chapter 5): in standard units, his score is the value z for which the area under the normal curve to the left of z is 90%.
This student scored 1.3 SDs above average on the SAT. The regression method predicts he will be 0.4 × 1.3 ≈ 0.5 SDs above average on first-year GPA. Finally, this can be translated back into a percentile rank: the area under the normal curve to the left of 0.5 is about 69%.
That is the answer. The percentile rank on first-year GPA is predicted as 69%.
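The three steps of the solution (percentile rank to standard units, multiply by r, standard units back to percentile rank) can be sketched in a few lines. This is only an illustration under the example's assumptions (football-shaped scatter diagram, both variables following the normal curve); the function name predict_percentile is made up here, and NormalDist comes from the Python standard library.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def predict_percentile(x_percentile, r):
    """Regression prediction of a percentile rank on y from one on x."""
    # Step 1: percentile rank on x -> standard units.
    z_x = STANDARD_NORMAL.inv_cdf(x_percentile / 100)
    # Step 2: the regression method, worked in standard units.
    z_y = r * z_x
    # Step 3: standard units on y -> percentile rank.
    return 100 * STANDARD_NORMAL.cdf(z_y)


print(round(predict_percentile(90, 0.4), 1))
print(round(predict_percentile(10, 0.4), 1))
```

Carried out without rounding, the calculation gives a little under 70% for the student at the 90th percentile. The 69% in the text comes from rounding to 1.3 and 0.5 SDs along the way, as one does when reading a normal table.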
In solving this problem, the averages and SDs of the two variables were never used. All that mattered was r. Basically, this is because the whole problem was worked in standard units. The percentile ranks give you the standard units.
The student in example 2 was compared with his class in two different competitions, the SAT and the first-year exams. He did very well on the SAT, scoring at the 90th percentile. But the regression estimate only puts him at the 69th percentile on the first-year exams; still above average, but not as much. On the other hand, for poor students, say at the 10th percentile of the SAT, the regression method predicts an improvement. It will put them at the 31st percentile on the first-year tests. This is still below average, but closer.
To go at this more carefully, take all the people at the 90th percentile on the SAT—good students. Some of them will move up on the first-year tests, some will move down. On the average, however, this group moves down. For comparison, take all the people at the 10th percentile of the SAT—poor students. Again, some will do better on the first-year tests, others worse. On the average, however, this group moves up. That is what the regression method is telling us.
Initially, many people would predict a first-year rank equal to the SAT rank. This is not a good strategy. To see why, imagine that you had to predict a student’s rank in a mathematics class. In the absence of other information, the safest guess is to put her at the median. However, if you knew that this student was very good in physics, you would probably put her well above the median in mathematics. After all, there is a strong correlation between physics and mathematics. On the other hand, if all you knew was her rank in a pottery class, that would not help very much in guessing the mathematics rank. The median looks good: there is not much correlation between pottery and mathematics.
Now, back to the problem of predicting first-year rank from SAT rank. If the two sets of scores are perfectly correlated, first-year rank will be equal to SAT rank. At the other extreme, if the correlation is zero, SAT rank does not help at all in predicting first-year rank. The correlation is somewhere between the two extremes, so we have to predict a rank on the first-year tests somewhere between the SAT rank and the median. The regression method tells us where.
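To see how the predicted rank slides between the median and the SAT rank as the correlation changes, the percentile calculation can be repeated for several values of r. This is a sketch under the same normality assumption as example 2; the helper predicted_rank and the particular r values are chosen here purely for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def predicted_rank(sat_rank, r):
    """Predicted first-year percentile rank, given SAT percentile rank and r."""
    z_sat = STANDARD_NORMAL.inv_cdf(sat_rank / 100)
    return 100 * STANDARD_NORMAL.cdf(r * z_sat)


# r = 0 gives the median; r = 1 gives back the SAT rank itself.
for r in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    high = predicted_rank(90, r)
    low = predicted_rank(10, r)
    print(f"r = {r:.1f}: 90th percentile -> {high:5.1f}, 10th percentile -> {low:5.1f}")
```

For every r strictly between 0 and 1, the prediction for the student at the 90th percentile lies between 50 and 90, and the prediction for the student at the 10th percentile lies between 10 and 50.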
Exercise Set C
1. In a certain class, midterm scores average out to 60 with an SD of 15, as do scores on the final. The correlation between midterm scores and final scores is about 0.50. The scatter diagram is football-shaped. Predict the final score for a student whose midterm score is
(a) 75 (b) 30 (c) 60 (d) unknown
Compare your answers to exercise 1 on p. 161.
2. For the first-year students at a certain university, the correlation between SAT scores and first-year GPA was 0.60. The scatter diagram is football-shaped. Predict the percentile rank on the first-year GPA for a student whose percentile rank on the SAT was
(a) 90% (b) 30% (c) 50% (d) unknown
Compare your answer to (a) with example 2.
3. The scatter diagram below shows the scores on the midterm and final in a certain course. Three lines are drawn across the diagram.
(a) People who have the same percentile rank on both tests are plotted along one of these lines. Which one, and why?
(b) One of these lines would be used to predict final score from midterm score. Which one, and why?
4. The scatter diagram below shows ages of husbands and wives in Tennessee. (Data are from the March 2005 Current Population Survey.)
(a) Why are there no dots in the lower left hand corner of the diagram?
(b) Why does the diagram show vertical and horizontal stripes?
[Scatter diagram. Axes: AGE OF HUSBAND (YEARS) and AGE OF WIFE (YEARS), each running from 0 to 80.]
5. For the men age 18 and over in the HANES5 sample, the correlation between height and weight was 0.41; the SD of height was about 3 inches and the SD of weight was about 42 pounds. The men age 55–64 averaged about half an inch shorter than the men age 18–24. True or false, and explain: since half an inch is 1/6 ≈ 0.17 SDs of height, the men age 55–64 must have averaged about 0.41×0.17×42≈3 pounds lighter than the men age 18–24.
The answers to these exercises are on p. A62.
Technical note. The method discussed in example 2 is for median ranks. To see why, assume normality and r = 0.4. Of students at the 90th percentile on the SAT (relative to their classmates), about half will rank above the 69th percentile on first-year GPA, and half will rank below. The procedure for estimating average ranks is harder.
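A small simulation can illustrate the difference between median and average ranks. The sketch below rests on an assumption that goes beyond the wording of the note: that for students at a given SAT score, first-year GPA in standard units follows the normal curve with average r times the SAT standard units and SD equal to the square root of 1 − r². The function name simulate_gpa_ranks, the sample size, and the seed are arbitrary choices made for this illustration.

```python
import random
from statistics import NormalDist, mean

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def simulate_gpa_ranks(sat_percentile, r, n, seed=0):
    """Simulate GPA percentile ranks for n students at one SAT percentile."""
    rng = random.Random(seed)
    z_sat = STANDARD_NORMAL.inv_cdf(sat_percentile / 100)
    spread = (1 - r * r) ** 0.5  # SD of GPA (standard units) around the line
    ranks = []
    for _ in range(n):
        z_gpa = r * z_sat + spread * rng.gauss(0, 1)
        ranks.append(100 * STANDARD_NORMAL.cdf(z_gpa))
    return ranks


r = 0.4
ranks = simulate_gpa_ranks(90, r, n=20000)

# The rank predicted by the method of example 2 (no rounding).
predicted = 100 * STANDARD_NORMAL.cdf(r * STANDARD_NORMAL.inv_cdf(0.90))
share_above = mean(1 if rank > predicted else 0 for rank in ranks)

print("predicted rank:            ", round(predicted, 1))
print("share ranking above it:    ", round(share_above, 3))
print("average of simulated ranks:", round(mean(ranks), 1))
```

In such a simulation about half the students land above the predicted rank, which is what makes it a median. The average of the simulated ranks comes out lower than the prediction, consistent with the note's remark that estimating average ranks needs a different procedure.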