2. THE CORRELATION COEFFICIENT
Both clouds have the same center and show the same spread, horizontally and vertically. However, the points in the first cloud are tightly clustered around a line: there is a strong linear association between the two variables. In the second cloud, the clustering is much looser. The strength of the association is different in the two diagrams. To measure the association, one more summary statistic is needed: the correlation coefficient. This coefficient is usually abbreviated as r, for no good reason (although there are two r’s in “correlation”).
The correlation coefficient is a measure of linear association, or clustering around a line. The relationship between two variables can be summarized by
• the average of the x-values, the SD of the x-values,
• the average of the y-values, the SD of the y-values,
• the correlation coefficient r.
The formula for computing r will be presented in section 4, but right now we want to focus on the graphical interpretation. Figure 6 shows six scatter diagrams for hypothetical data, each with 50 points. The diagrams were generated by computer. In all six pictures, the average is 3 and the SD is 1 for x and for y.
The computer has printed the value of the correlation coefficient over each diagram. The one at the top left shows a correlation of 0. The cloud is completely formless. As x increases, y shows no tendency to increase or decrease: it just straggles around.
The next scatter diagram has r = 0.40; a linear pattern is beginning to emerge. The next one has r = 0.60, with a stronger linear pattern. And so on, through the last one. The closer r is to 1, the stronger is the linear association between the variables, and the more tightly clustered are the points around a line.
A correlation of 1, which does not appear in the figure, is often referred to as a perfect correlation—all the points lie exactly on a line, so there is a perfect linear relationship between the variables. Correlations are always 1 or less.
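Clouds like those in figure 6 can be imitated on a computer. The sketch below is one possible way to do it, not the method used to draw the figure: the function names `standardize` and `make_cloud`, the use of normally distributed random numbers, and the seed are all assumptions made for illustration. It builds 50 points whose average is 3 and whose SD is 1 in each direction, with a sample correlation equal to a chosen value, and then checks the result with NumPy's `corrcoef`.

```python
import numpy as np

def standardize(v):
    """Rescale v so its average is 0 and its SD (divisor n) is 1."""
    return (v - v.mean()) / v.std()

def make_cloud(r, n=50, seed=0):
    """Hypothetical helper: return x, y with average 3, SD 1,
    and sample correlation r (up to rounding error)."""
    rng = np.random.default_rng(seed)
    x = standardize(rng.normal(size=n))
    noise = rng.normal(size=n)
    # Center the noise and remove its component along x, so that
    # x and the noise have sample correlation 0; then rescale to SD 1.
    noise = noise - noise.mean()
    noise = noise - (noise @ x) / (x @ x) * x
    noise = standardize(noise)
    # Mixing x with uncorrelated noise in these proportions keeps the
    # SD of y at 1 and makes the correlation between x and y equal r.
    y = r * x + np.sqrt(1 - r**2) * noise
    return x + 3, y + 3

for r in (0.0, 0.40, 0.60, 0.95):
    x, y = make_cloud(r)
    print(f"target r = {r:.2f}, computed r = {np.corrcoef(x, y)[0, 1]:.2f}")
```

Plotting `x` against `y` for each value of r should show the progression described above: a formless cloud at 0, tightening toward a line as r approaches 1.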
The correlation between the heights of identical twins is around 0.95.⁴ The lower right scatter diagram in figure 6 has a correlation coefficient of 0.95. A scatter diagram for the twins would look about the same. Identical twins are like each other in height, and their points on a scatter diagram are fairly close to the line y = x. However, such twins do not have exactly the same height. That is what the scatter around the 45-degree line shows.
For another example, in the U.S. in 2005, the correlation between income and education was 0.07 for men age 18–24, rising to 0.43 for men age 55–64.⁵ As the scatter diagrams in figure 6 indicate, the relationship between income and education is stronger for the older men, but it is still quite rough. Weak associations are common in social science studies, 0.3 to 0.7 being the usual range for r in many fields.
A word of warning: r = 0.80 does not mean that 80% of the points are tightly clustered around a line, nor does it indicate twice as much linearity as r = 0.40.
Right now, there is no direct way to interpret the exact numerical value of the correlation coefficient; that will be done in chapters 10 and 11.
Figure 6. The correlation coefficient—six positive values. The diagrams are scaled so that the average equals 3 and the SD equals 1, horizontally and vertically; there are 50 points in each diagram. Clustering is measured by the correlation coefficient.
So far, only positive association has been discussed. Negative association is indicated by a negative sign in the correlation coefficient. Figure 7 shows six more scatter diagrams for hypothetical data, each with 50 points. They are scaled just like figure 6, each variable having an average of 3 and an SD of 1.
A correlation of −0.90, for instance, indicates the same degree of clustering as one of +0.90. With the negative sign, the clustering is around a line which slopes down; with a positive sign, the line slopes up. For women age 25–39 in the U.S. in 2005, the correlation between education and number of children was about −0.2, a weak negative association.⁶ A perfect negative correlation of −1 indicates that all the points lie on a line which slopes down.
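The point that a negative correlation shows the same degree of clustering as the matching positive one can be checked numerically. The sketch below uses made-up data (the seed, slope, and noise level are arbitrary assumptions): it takes a cloud that slopes up, turns it upside down around the average of y, and compares the two correlations. Flipping the cloud leaves the average and SD of y unchanged and only reverses the sign of r.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical cloud of 50 points that slopes up.
x = rng.normal(loc=3, scale=1, size=50)
y = 3 + 0.9 * (x - 3) + rng.normal(scale=0.4, size=50)

# Mirror image of the cloud about the average of y: it now slopes down,
# but the points are exactly as tightly clustered as before.
y_flipped = 2 * y.mean() - y

r_up = np.corrcoef(x, y)[0, 1]
r_down = np.corrcoef(x, y_flipped)[0, 1]

print(f"sloping up:   r = {r_up:.2f}")
print(f"sloping down: r = {r_down:.2f}")
```

The two printed values should agree in size and differ only in sign.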
Correlations are always between −1 and 1, but can take any value in between. A positive correlation means that the cloud slopes up; as one variable increases, so does the other. A negative correlation means that the cloud slopes down; as one variable increases, the other decreases.
In a real data set, both SDs will be positive. As a technical matter, if either SD is zero, there is no good way to define the correlation coefficient.
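To see why a zero SD causes trouble, it helps to look at one standard way of computing r: convert each variable to standard units (subtract the average, divide by the SD) and take the average of the products. The sketch below assumes that definition; the function name `correlation` and the choice to return `None` are illustrative only. If either SD is zero, the conversion to standard units would divide by zero, so no value is returned.

```python
import numpy as np

def correlation(x, y):
    """Average of the products of x and y in standard units.

    Returns None when either SD is zero, since standard units
    cannot be computed in that case.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sd_x, sd_y = x.std(), y.std()  # SDs with divisor n
    if sd_x == 0 or sd_y == 0:
        return None
    x_su = (x - x.mean()) / sd_x  # x in standard units
    y_su = (y - y.mean()) / sd_y  # y in standard units
    return float(np.mean(x_su * y_su))

# All points on a line that slopes up: r is 1, up to rounding error.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))

# y never varies, so its SD is 0 and r is left undefined.
print(correlation([1, 2, 3, 4], [5, 5, 5, 5]))
```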
Exercise Set B
1. (a) Would the correlation between the age of a second-hand car and its price be positive or negative? Why? (Antiques are not included.)
(b) What about the correlation between weight and miles per gallon?
2. For each scatter diagram below:
(a) The average of x is around
1.0 1.5 2.0 2.5 3.0 3.5 4.0
(b) Same, for y.
(c) The SD of x is around
0.25 0.5 1.0 1.5
(d) Same, for y.
(e) Is the correlation positive, negative, or 0?
[Two scatter diagrams, not reproduced here; in each, one axis is marked from 0 to 6 and the other from 0 to 3.]
3. For which of the diagrams in the previous exercise is the correlation closer to 0, forgetting about signs?
Figure 7. The correlation coefficient: six negative values. The diagrams are scaled so the average equals 3 and the SD equals 1, horizontally and vertically; there are 50 points in each diagram. Clustering is measured by the correlation coefficient.
4. In figure 1, is the correlation between the heights of the fathers and sons around −0.3, 0, 0.5, or 0.8?
5. In figure 1, if you took only the fathers who were taller than 6 feet, and their sons, would the correlation between the heights be around −0.3, 0, 0.5, or 0.8?
6. (a) If women always married men who were five years older, the correlation between the ages of husbands and wives would be ______. Choose one of the options below, and explain.
(b) The correlation between the ages of husbands and wives in the U.S. is ______. Choose one option, and explain.
exactly −1    close to −1    close to 0    close to 1    exactly 1
7. Investigators are studying registered students at the University of California. The students fill out questionnaires giving their year of birth, age (in years), age of mother, and so forth. Fill in the blanks, using the options given below, and explain briefly.
(a) The correlation between student’s age and year of birth is ______.
(b) The correlation between student’s age and mother’s age is ______.
−1    nearly −1    somewhat negative    0    somewhat positive    nearly 1    1
8. Investigators take a sample of DINKS (dual-income families, where husband and wife both work, and no kids). The investigators have data on the husband’s income and the wife’s income. By definition,
family income = husband’s income + wife’s income.
The average family income was around $85,000, and 10% of the couples had family income in the range $80,000–$90,000. Fill in the blanks, using the options given below, and explain briefly.
(a) The correlation between wife’s income and family income is ______.
(b) Among couples whose family income is in the range $80,000–$90,000, the correlation between wife’s income and husband’s income is ______.
−1    nearly −1    somewhat negative    0    somewhat positive    nearly 1    1
9. True or false, and explain: if the correlation coefficient is 0.90, then 90% of the points are highly correlated.
The answers to these exercises are on p. A56.