
Review of Correlation


The procedure will create a new variable (e.g., “esteem”), which is the sum of the responses of participants to each of the items on the scale. So if a person gave the following responses to the 10-item scale, he or she would receive a total score of 34, as shown below.

Item                Original Responses    Responses with Recoded Items
1                   3                     3
2                   3                     3
3*                  2                     3
4                   4                     4
5*                  1                     4
6                   4                     4
7                   3                     3
8*                  2                     3
9*                  2                     3
10*                 1                     4
Total Scale Score                         34

* Reverse-scored items

The new variable (e.g., “esteem”) will be added to the SPSS database and will appear on the variable list.
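For readers who want to see the logic of this recode-and-sum step outside of SPSS, here is a minimal Python sketch. It assumes a 4-point response scale, so a reverse-scored item is recoded as 5 minus the original response; the variable name esteem and the item values come from the example above.

```python
# Minimal sketch of the recode-and-sum step (a Python analogue, not the SPSS
# procedure itself). Assumes a 4-point response scale, so reverse scoring is
# 5 - response, which matches the recoded values shown in the table.
responses = {1: 3, 2: 3, 3: 2, 4: 4, 5: 1, 6: 4, 7: 3, 8: 2, 9: 2, 10: 1}
reverse_scored = {3, 5, 8, 9, 10}  # items marked with * in the table

recoded = {
    item: (5 - value if item in reverse_scored else value)
    for item, value in responses.items()
}
esteem = sum(recoded.values())
print(esteem)  # 34, matching the total scale score in the example
```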

In a scatterplot, the values for one variable are plotted along the x-axis and the values for the other are plotted on the y-axis. Figure 8.4 shows the scatterplot for hours of exercise and oxygen consumption (VO2 ml/kg/min) for the 100 participants enrolled in our healthy lifestyle study. In this hypothetical example, participants were asked how many hours per week they spent participating in aerobic activities. Oxygen consumption (VO2) was obtained for each person by having each complete a treadmill test. VO2 is a measure of the amount of oxygen a person consumes per kilogram of body weight per minute. People who are physically fit consume more oxygen per minute during aerobic testing than do those who lead sedentary lifestyles. The scores for each person on both variables are plotted on the scatterplot. Participant A, for example, reported no aerobic activity and had a VO2 of 30; participant B reported twenty-five hours of aerobic activity per week and had a VO2 of 52. Note that as the number of hours of exercise increases, so does VO2. The line through the dots, called a regression line, is the best-fitting line through the points on the graph. The slope of the line reflects the strength of the relationship, and the orientation of the line reflects its sign (positive or negative association). In Figure 8.4, the line slopes upward to the right, an indication of a positive correlation.
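As an illustration of how such a plot is built, the sketch below draws a scatterplot with a least-squares regression line in Python. It uses the ten-person subsample that appears later in Table 8.4 rather than the full 100-participant data set, which is not reproduced in the text.

```python
# Scatterplot with a best-fitting (regression) line, in the spirit of
# Figure 8.4. Data are the ten cases from Table 8.4, not the full sample.
import matplotlib.pyplot as plt
import numpy as np

hours = np.array([0, 4, 5, 6, 8, 10, 10, 12, 20, 25])
vo2 = np.array([37, 33, 35, 38, 32, 45, 46, 42, 40, 52])

slope, intercept = np.polyfit(hours, vo2, 1)  # least-squares line

plt.scatter(hours, vo2)
plt.plot(hours, slope * hours + intercept)  # upward slope: positive correlation
plt.xlabel("Number of Hours of Exercise")
plt.ylabel("VO2 (ml/kg/min)")
plt.show()
```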

In Figure 8.5, we see a scatterplot indicative of a negative correlation between two variables. In this case, the regression line slopes downward from left to right, indicating that as the scores on one variable go up, the scores on the other go down.

Here the variables are number of hours of exercise per week and change in weight or number of pounds gained or lost during a six-month period of time. The scatterplot shows that the more the person exercises, the less weight he or she gains.

Scatterplots can be used to show the sign of the relationship between two variables. However, the strength of the relationship is best represented by a numerical value often referred to as a correlation coefficient. The most commonly used coefficient is the Pearson product-moment correlation coefficient, which is computed using the following formula. In this formula, the Pearson correlation coefficient is represented by the symbol $r_{xy}$; the subscripts x and y are generally used to denote the two variables.

Equation 8.3

$$r_{xy} = \frac{\sum \left(X - \bar{X}\right)\left(Y - \bar{Y}\right)}{\left(N - 1\right) s_x s_y}$$

where

$r_{xy}$ = the correlation between the variables X and Y,
$\Sigma$ = sigma, which indicates to sum what follows,
$X$ = the value of the X variable,
$\bar{X}$ = the mean score on the X variable,
$Y$ = the value of the Y variable,
$\bar{Y}$ = the mean score on the Y variable,
$N$ = the number of observations,
$s_x$ = the standard deviation of the X variable, and
$s_y$ = the standard deviation of the Y variable.
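A direct translation of Equation 8.3 into Python may help make the formula concrete. The function name pearson_r is ours, not a library routine, and the standard deviations use the N − 1 denominator the formula assumes.

```python
# Minimal sketch of Equation 8.3: the sum of deviation products divided by
# (N - 1) times the product of the two standard deviations.
import math

def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return sum_products / ((n - 1) * s_x * s_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0 for a perfectly linear relationship
```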

To show the calculation of a correlation, in Table 8.4 we present the responses of ten participants in the exercise study for number of hours of exercise (variable X) and their VO2s (variable Y). Using the formula given above, we first calculate the deviation from the mean for each of the X and Y scores. We then multiply together each person's x and y deviation values and sum the results for the ten participants. This sum is divided by 9 (that is, N − 1) times the product of the standard deviations for the two distributions (X and Y). The results are shown below.

$$r_{xy} = \frac{305}{(10 - 1)(7.52)(6.32)} = \frac{305}{427.7} = .713$$

FIGURE 8.4. RELATIONSHIP BETWEEN NUMBER OF HOURS EXERCISED AND VO2 FOR 100 PARTICIPANTS.

[Scatterplot: Number of Hours of Exercise (x-axis) versus VO2 ml/kg/min (y-axis); participants A and B are labeled on the plot.]

FIGURE 8.5. RELATIONSHIP BETWEEN NUMBER OF HOURS EXERCISED AND CHANGE IN WEIGHT FOR 100 PARTICIPANTS.

[Scatterplot: Number of Hours of Exercise (x-axis) versus Weight Change in Pounds (y-axis).]

TABLE 8.4. CALCULATION OF THE CORRELATION COEFFICIENT FOR THE ASSOCIATION BETWEEN EXERCISE AND OXYGEN CONSUMPTION.

Person ID   Number of Hours of Exercise   VO2   $(X - \bar{X})(Y - \bar{Y})$   Product
1           0                             37    (−10)(−3)                        30
2           4                             33    (−6)(−7)                         42
3           5                             35    (−5)(−5)                         25
4           6                             38    (−4)(−2)                          8
5           8                             32    (−2)(−8)                         16
6           10                            45    (0)(5)                            0
7           10                            46    (0)(6)                            0
8           12                            42    (2)(2)                            4
9           20                            40    (10)(0)                           0
10          25                            52    (15)(12)                        180
                                                Sum                             305
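The hand calculation in Table 8.4 can be cross-checked with NumPy, which computes the same Pearson coefficient. This is only an illustrative check, not part of the original example; the small difference from .713 arises because the text rounds the standard deviations to 7.52 and 6.32.

```python
# Cross-check of Table 8.4: the deviation products sum to 305 and the
# Pearson correlation is about .71.
import numpy as np

hours = np.array([0, 4, 5, 6, 8, 10, 10, 12, 20, 25])    # variable X
vo2 = np.array([37, 33, 35, 38, 32, 45, 46, 42, 40, 52])  # variable Y

deviation_products = (hours - hours.mean()) * (vo2 - vo2.mean())
print(deviation_products.sum())        # 305.0, the column total in Table 8.4
print(np.corrcoef(hours, vo2)[0, 1])   # roughly .71 (the text's .713 uses
                                       # rounded standard deviations)
```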

Below, we give the SPSS commands for calculating a correlation between two variables.

By convention, Y is the perceived dependent variable, or the presumed outcome, and X is the independent variable, or the presumed cause. In our example, we give the correlation between hours of exercise and VO2 for the participants in the healthy lifestyle study. In this study, hours of exercise is the independent variable because it logically precedes an increase in VO2. The value of the correlation computed using SPSS is .753. This value is slightly different from that obtained from our calculations above (.713). Recall that we used a subsample of 10 participants for our calculations, whereas we included all 100 participants in the SPSS calculations.

Interpretation of the Correlation Coefficient

The value of the Pearson correlation coefficient $r_{xy}$ ranges from –1 to +1. A perfect negative correlation is represented by –1 and a perfect positive correlation by +1.

For a perfect correlation to exist, all of the points must fall exactly on a straight line, so that every unit increase in the X variable is matched by a constant change in the Y variable. The closer the value of $r_{xy}$ is to ±1, the stronger the correlation between the two variables. A value of 0 indicates that no relationship exists between the variables. For example, numbers selected at random for the two distributions would be expected to have no association. The correlation between exercise and VO2 is .753, indicating a strong positive correlation (see Table 8.5). The correlation between number of exercise hours and weight change is –.671, indicating a moderately strong negative correlation. Negative correlations occur when scores that are high on one variable correspond to scores that are low on the other variable and vice versa. Students often confuse the strength of the correlation with the sign (+ or –) of the correlation. It is important to remember that the further the value is from 0, the stronger the correlation, regardless of the sign (Figure 8.6). Thus, correlation coefficients of –.25 and +.25 are equivalent in strength but opposite in sign.

To compute a correlation between two variables:

From the Data Entry Screen:

√ Analyze . . . Correlate . . . √ Bivariate correlations
Highlight and transfer Y variable from L → R column
Highlight and transfer X variable from L → R column

√OK
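For readers working outside SPSS, a rough Python equivalent of this procedure is scipy.stats.pearsonr, which returns the correlation coefficient and its two-tailed p value. The data below are the ten-person subsample from Table 8.4, so the output will differ from the full-sample value of .753.

```python
# Approximate Python counterpart to the SPSS bivariate correlation procedure.
from scipy.stats import pearsonr

hours = [0, 4, 5, 6, 8, 10, 10, 12, 20, 25]
vo2 = [37, 33, 35, 38, 32, 45, 46, 42, 40, 52]

r, p_value = pearsonr(hours, vo2)  # coefficient and two-tailed significance
print(round(r, 3), round(p_value, 3))
```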

FIGURE 8.6. RANGE AND SIGN OF CORRELATION COEFFICIENTS.

[Number line from −1 to +1: strong negative association at −1, no association at 0, strong positive association at +1.]

TABLE 8.5. CORRELATION MATRIX SHOWING RELATIONSHIPS AMONG BENEFIT RATINGS, HOURS OF EXERCISE, VO2, AND WEIGHT CHANGE.

Correlations

                                        Benefits   Hours Exercised   VO2       Weight Change
Benefits          Pearson Correlation   1          .632a             .492a     −.603a
                  Sig. (2-tailed)                  .000              .000       .000
                  N                     100        100               100        100
Hours Exercised   Pearson Correlation   .632a      1                 .753a     −.671a
                  Sig. (2-tailed)       .000                         .000       .000
                  N                     100        100               100        100
VO2               Pearson Correlation   .492a      .753a             1         −.512a
                  Sig. (2-tailed)       .000       .000                         .000
                  N                     100        100               100        100
Weight Change     Pearson Correlation   −.603a     −.671a            −.512a     1
                  Sig. (2-tailed)       .000       .000              .000
                  N                     100        100               100        100

a Correlation is significant at the 0.01 level (2-tailed).

Key: in each cell, (1) the first number is the correlation coefficient, (2) the second is the significance level, and (3) the third is the number in the sample; the diagonal contains each variable's correlation with itself.

Correlation Matrix

Correlations for a set of variables can be displayed in a correlation matrix, such as the one shown in Table 8.5. The correlation matrix shows the strength, sign, and significance of the relationship between each variable and every other variable. The first number given in each cell of the matrix is the correlation coefficient (1). Table 8.5 presents correlations among the variables discussed earlier: benefit of an exercise program, number of hours of exercise per week, VO2, and weight change. The correlation between benefit and number of hours of exercise is .632, suggesting a moderately strong positive relationship, whereas the correlation between number of hours of exercise and VO2 is .753, indicating a strong positive relationship. Note that each relationship between variables is given twice, once above the diagonal and once below. In other words, the correlation table is symmetric because, for example, the correlation between variable X and variable Y is equal to the correlation between variable Y and variable X. The order in which the variables are paired does not affect the correlation between them.

In reports, researchers generally give only the values above or below the diagonal. The 1's along the diagonal represent the correlation of each variable with itself.

The correlation matrix also gives the p values for statistical significance, which are the second number in each cell (2). The correlation coefficient is a descriptive statistic that provides an indication of the strength and sign of a relationship between two variables. This relationship can be tested using a statistical test. The null hypothesis tested is that no relationship exists between the two variables, and the alternative hypothesis is that the relationship does exist. If one is calculating the correlation by hand, as we did in Table 8.4, a statistical table can be used to determine whether the value of the coefficient is significant at the .05 level or the .10 level for a specific sample size. When SPSS is used to calculate correlations, the p value is included in the output. We can see in the correlation matrix that all correlations are significant at the .05 level, meaning that the values of the correlations would occur by chance fewer than 5 out of 100 times when the null hypothesis of no relationship is true. The third number in each cell (3) gives the number of observations used in the calculation of each correlation. In Table 8.5, we see that all 100 observations were used for the calculation of each correlation.

If data were missing on one or more participants on a variable, a number less than 100 would appear in the appropriate cell.

Correlation Issues

Causality. There are a few points to remember about correlation. First, correlation does not imply causality, even though causation can imply correlation. It is impossible to tell the direction or nature of causation from just the correlation between two variables. For example, Figure 8.7 displays five of many possible scenarios of causation (James, Mulaik, & Brett, 1982). In scenario a, X is shown as a cause of Y, whereas in scenario b, Y is a cause of X. In the third situation (c), Z is the cause of both X and Y, which are not directly related to each other. In the final two situations (d and e), X and Y are related to each other and also to Z. Knowing only the correlation between two variables cannot help you determine which of these or other scenarios is at work. A more advanced analytic technique known as structural equation modeling (SEM) examines the correlations among many variables to test causal relations. However, SEM has its limits, and to show causality the researcher would need to conduct a controlled experiment to determine the direction of causality. The correlation coefficient does tell us how things are related (descriptive information) and gives us predictive capabilities (if we know the score on one variable, we can predict the score on the other). When we square the correlation coefficient, the result tells us to what extent the variance in one variable explains the variance in the other.

Sample Size. In terms of the statistical significance of the correlation coefficient, the larger the sample size, the more likely it is for the value of $r_{xy}$ to be statistically significant. A correlation of .10 would not be statistically significant for a sample size of 20, but would be significant for a sample size of 1,500. A simple rule of thumb for determining the significance of a correlation coefficient is to compute

$$\frac{2}{\sqrt{N - 3}}\,,$$

where N is the sample size (Mulaik, 1972). If the magnitude of the correlation (ignoring its sign) is bigger than this quantity, the correlation is significantly different from zero. So for a sample size of twenty, the correlation would need to be at least .50 to be statistically significant, as shown below:

$$\frac{2}{\sqrt{20 - 3}} = \frac{2}{\sqrt{17}} = \frac{2}{4.12} \approx .5$$
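A short sketch of this rule of thumb, with the threshold computed for several sample sizes, appears below; the helper name significance_threshold is ours, not part of any library.

```python
# Rule-of-thumb significance threshold for a correlation: 2 / sqrt(N - 3).
import math

def significance_threshold(n):
    return 2 / math.sqrt(n - 3)

for n in (20, 100, 1500):
    print(n, round(significance_threshold(n), 3))
# N = 20   -> about .49 (roughly the .50 in the worked example)
# N = 100  -> about .20
# N = 1500 -> about .05, so even a correlation of .10 would be significant
```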

In the case of correlations, statistical significance does not say much about practical significance. With a fairly large sample size, we might find that the correlation between drinking coffee and skin cancer is .09. However, with such a small correlation,

FIGURE 8.7. POSSIBLE CAUSAL RELATIONSHIPS BETWEEN TWO VARIABLES.

Source: Permission from Stanley A. Mulaik (personal communication, 2005).

[Path diagrams for the five scenarios (a–e) involving X, Y, and Z that are described in the text.]

it is unlikely that health professionals would tell people to give up coffee to avoid skin cancer or would include this information in a pamphlet on skin protection.

Group Differences. In research, we might find that the association between two variables differs across groups. This difference in association might be apparent when the correlations are analyzed separately for each group. However, when the groups are combined, the differences might be obscured. For example, if we analyzed the sexes separately, we might find a strong positive correlation between eating chocolate and weight gain for women and a small negative correlation for men. When data from both men and women are analyzed together, we might find no relationship between eating chocolate and weight gain for the total sample, leading to a conclusion that is not true for either group. Likewise, we can have negative (or positive) correlations between two variables in two groups, but get a near-zero or slightly positive (or negative) correlation when the groups are combined. In these cases, the correlations may be due not to individual sources of variation but to sources introduced by group membership, and the group effect cancels the relationship due to individual sources.
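The toy numbers below are ours, invented only to illustrate the masking effect just described: within each group the correlation is strong, yet the pooled correlation is weak.

```python
# Illustration with made-up data: strong within-group correlations, weak
# pooled correlation once the groups are combined.
import numpy as np

chocolate_women = np.array([1, 2, 3, 4, 5])
gain_women = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # rises with chocolate
chocolate_men = np.array([6, 7, 8, 9, 10])
gain_men = np.array([3.0, 2.9, 2.8, 2.7, 2.6])          # falls with chocolate

print(np.corrcoef(chocolate_women, gain_women)[0, 1])   # essentially +1.0 within women
print(np.corrcoef(chocolate_men, gain_men)[0, 1])       # essentially -1.0 within men

chocolate_all = np.concatenate([chocolate_women, chocolate_men])
gain_all = np.concatenate([gain_women, gain_men])
print(np.corrcoef(chocolate_all, gain_all)[0, 1])       # about +.22: the pooled
                                                        # value hides both patterns
```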

Restriction of Range. Researchers should also be aware of the possibility of a situation known as restriction of range. If the range of one or both scores involved in a correlation is restricted or reduced, the correlation may be different from a similar correlation based on an unrestricted range of scores. Correlations in the restricted sample may be reduced, showing no correlation, when there is a correlation in the unrestricted population. Likewise, nonzero correlations between variables may occur in restricted populations when there is no correlation between the variables in an unrestricted population. In Figure 8.8, we limit our analysis for our healthy lifestyle study to men and women who report that they exercise less than five hours per week.

With this restricted group of participants, we see that the correlation between number of hours of exercise and VO2 is .224 and is not statistically significant. This correlation leads to the conclusion that there is no association between number of hours of exercise and VO2. However, if we include people in the higher range, who exercised more hours and had higher VO2 levels, as shown earlier in Figure 8.4, a distinct positive correlation exists. Thus, restricting the range of variables may lead to inaccurate conclusions about the association between variables.
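The procedure itself is easy to demonstrate on the ten-person subsample from Table 8.4: keep only the low-exercise cases and recompute the correlation. This is our own illustration, not the book's Figure 8.8 analysis, and with only five cases the restricted estimate is very unstable.

```python
# Restriction of range on the Table 8.4 subsample: recompute r after keeping
# only participants who exercised fewer than ten hours per week.
import numpy as np

hours = np.array([0, 4, 5, 6, 8, 10, 10, 12, 20, 25])
vo2 = np.array([37, 33, 35, 38, 32, 45, 46, 42, 40, 52])

print(np.corrcoef(hours, vo2)[0, 1])          # about .71 for all ten cases

restricted = hours < 10                        # low-exercise participants only
print(np.corrcoef(hours[restricted], vo2[restricted])[0, 1])
# about -.46 on these five cases: restricting the range weakens the
# correlation and, in a sample this tiny, can even flip its sign
```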

Variance

Earlier, we defined variance for a single frequency distribution. We said it was the average squared distance from the mean. We can also calculate the percentage of variance in one distribution that is shared with another distribution. By squaring the correlation coefficient, $r_{xy}$, we find the proportion of variance in one variable accounted for by its relationship with the other variable.

The percentage of shared variance can be represented using Venn diagrams, as shown in Figures 8.9 and 8.10. In Figure 8.9, each circle represents 100 percent of the variance in one variable. The circle on the left might represent the total amount of variance for number of hours of exercise, and the circle on the right might represent the total amount of variance for VO2 scores.

We saw earlier that the correlation coefficient for the relationship between these two variables was .753. This finding indicates that the two variables share a large portion of variance. Figure 8.10 depicts the shared variance, using two intersecting circles.

The area of overlap between the circles denotes the proportion of shared variance. We can calculate the amount of shared variance by squaring the correlation coefficient.

Squaring $r_{xy}$ yields a decimal number that can be converted to a percentage by multiplying it by 100. The value obtained from squaring the correlation coefficient is referred to as the coefficient of determination and can be interpreted as the proportion of shared variance between two variables. Shared variance is usually presented as a percentage ranging from 0 to 100 percent. In the computation below, we show the calculation of the percent of shared variance for number of hours of exercise and VO2.

FIGURE 8.8. RELATIONSHIP BETWEEN NUMBER OF HOURS EXERCISED AND VO2 FOR THOSE WHO EXERCISE FEWER THAN FIVE HOURS PER WEEK.

[Scatterplot: Number of Hours of Exercise (x-axis) versus VO2 ml/kg/min (y-axis); r = .224, p = .134.]

Equation 8.4

$$r_{xy}^2 \times 100 = \text{percent of shared variance}$$

$$.753^2 \times 100 \approx 57\%$$

We can then say that 57 percent of the variance in VO2 is shared with number of hours of exercise, or that 57 percent of the variance in VO2 is explained by exercise.

If two variables are not associated at all, then $r_{xy} = 0$ and $r_{xy}^2$ will be .00 (or 0 percent). That is, there is no variance that the two variables share. Conversely, when two variables are perfectly correlated, $r_{xy} = 1$ and $r_{xy}^2 = 1.0$ (or 100 percent). Thus, 100 percent of the variance in the first variable is explained by the second variable and vice versa.
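The arithmetic of Equation 8.4 is a one-liner; the short sketch below reproduces the 57 percent figure as well as the 45 percent figure used in the next section.

```python
# Coefficient of determination: the squared correlation times 100 is the
# percentage of shared variance.
for r in (0.753, -0.671, 0.0, 1.0):
    print(r, round(r ** 2 * 100))  # 57%, 45%, 0%, 100%
```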

Reporting Results of Correlation Analysis

It might be helpful to select a few correlations for interpretation. We will begin with the correlation between number of hours of exercise and VO2. The correlation coefficient $r_{xy}$ is .75 and p < .001. These results indicate that the variables are positively correlated. That is, as the number of hours a person exercises increases, so does that person's VO2. A correlation coefficient value of .75 is fairly strong. This value indicates that 57 percent of the variance in VO2 is explained by hours of exercise and vice versa.

The p value < .001 indicates that the relationship is statistically significant. That is, given a null hypothesis stating that the population correlation is 0, the probability of obtaining a correlation of the given magnitude (.753) purely by chance is estimated to be less than 1 in 1,000.

As a second example, the correlation between number of hours of exercise and weight change is $r_{xy}$ = –.67 and the p value is < .001. The association is strong and negative: as hours of exercise increase, weight gain decreases. We can say that 45 percent of the variance in number of pounds gained or lost is explained by the number of hours of exercise. Given a null hypothesis stating that the population correlation is 0, the probability of obtaining a correlation of the given magnitude (.671) purely by chance is estimated to be less than 1 in 1,000.

FIGURE 8.9. VENN DIAGRAM SHOWING VARIANCE FOR TWO SEPARATE VARIABLES.

[Two non-overlapping circles, each representing 100 percent of the variance in one variable: Exercise and VO2.]
