CHAPTER 2 Statistics
2.5 Independence and dependence: regression, and correlation
2.5.3 Correlation
When computing a regression, we can use physical considerations to clearly identify independent and dependent variables.
In some cases, however, the outcomes of an experiment can be thought of as being mutually dependent on each other. This dependency is captured in the statistical concept of correlation. Moreover, as we will see later, even if one variable depends on the other, the correlation coefficient allows us to determine the degree to which variations in the dependent variable can be explained as a consequence of variations in the independent variable.
EXAMPLE 16: CORRELATED VARIABLES
Suppose we transfer a small file over a cellular modem ten times, each time measuring the round-trip delay (from a 'ping' done just before transferring the file) and the throughput achieved (by dividing the file size by the transfer time). The round-trip delay may be large because the network interface card may have a low capacity, so that even a small ping packet experiences significant delays. On the other hand, the file transfer throughput may be low because the path delay is large. So, it is not clear which variable ought to be the dependent variable and which variable ought to be the independent variable. Suppose that the measured round-trip delays and throughputs are as shown below:
Throughput (kbps):        46    65    53    38    61    89    59    60    73
Round-trip delay (ms):   940   790   910  1020   540   340   810   720   830
FIGURE 6. Regression and correlation. (a) Scatter plot; (b) best fit: Y on X; (c) best fit: X on Y; (d) both best-fit lines. Axes: round-trip delay (ms) versus throughput (kbps).
Figure 6(a) shows the scatter plot of the two variables. There appears to be an approximately linear decline in the round-trip delay with an increase in throughput. We arbitrarily choose throughput to be the independent variable and do a regression of round-trip delay on it, as shown in Figure 6(b). We see that the best-fit line has a negative slope, as expected.
There is no reason why we could not have chosen the round-trip delay to be the independent variable and have done a similar regression. This is shown in Figure 6(c). Again, we see that as the round-trip delay increases, the throughput decreases, indicating a negative relationship. We also see the best-fit line with a negative slope.
Note that the two regression lines are not the same! In one case, we are trying to minimize the sum of the squared errors in round-trip delay, and in the other, the sum of squared errors in throughput. So, the best-fit lines will, in general, not be the same. This is shown in Figure 6(d), where we show both best-fit lines (one drawn with transposed axes).
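To make this concrete, the short Python sketch below (an illustration we add here, not part of the original example; it uses only the standard library) fits both best-fit lines to the measurements above using the least-squares slope formula. Both slopes come out negative, and the slope of one regression is not, in general, the reciprocal of the other, so the two lines differ.

```python
# Illustrative sketch: fit both regression lines for the Example 16 data.
def least_squares(x, y):
    """Return (a, b) for the best-fit line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b

throughput = [46, 65, 53, 38, 61, 89, 59, 60, 73]         # kbps
delay = [940, 790, 910, 1020, 540, 340, 810, 720, 830]     # ms

a1, b1 = least_squares(throughput, delay)   # regression of delay on throughput
a2, b2 = least_squares(delay, throughput)   # regression of throughput on delay

print(f"delay      = {a1:.1f} + {b1:.2f} * throughput")    # negative slope
print(f"throughput = {a2:.1f} + {b2:.4f} * delay")         # also negative
print(b1, 1 / b2)   # not equal: the two best-fit lines do not coincide
```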
The reason why the two regression lines do not coincide in general is best understood by doing a thought experiment. Suppose two outcomes of an experiment, say X and Y, are completely independent. Then, E(XY) = E(X)E(Y), by the definition of independence. In the context of a single sample, we rewrite this as:
$$E(XY) = \frac{\sum x_i y_i}{n} = E(X)E(Y) = \left(\frac{\sum x_i}{n}\right)\left(\frac{\sum y_i}{n}\right) = \bar{x}\bar{y} \qquad \text{(EQ 36)}$$

Recall that $b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$. We expand the numerator as $\sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y}$. Rewriting $\sum y_i$ as $n\bar{y}$ and $\sum x_i$ as $n\bar{x}$, and using Equation 36, we get
$$b = \frac{\sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y}}{\sum (x_i - \bar{x})^2} = \frac{n\bar{x}\bar{y} - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y}}{\sum (x_i - \bar{x})^2} = 0 \qquad \text{(EQ 37)}$$
so that the regression line has zero slope, i.e., is parallel to the X axis. Symmetrically, the regression of X on Y will be parallel to the Y axis. Therefore, the two regression lines meet at right angles when the outcomes are independent. Recalling that we can interpret b as the expected increment in Y for a unit change in X, b = 0 implies that a unit change in X does not change Y (in expectation), which is consistent with independence.
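This thought experiment is easy to check numerically. The sketch below (illustrative only; the sample size, seed, and distributions are arbitrary choices of ours) generates two independent samples and regresses each on the other; both estimated slopes are close to zero, so the two regression lines are nearly parallel to the axes.

```python
# Illustrative sketch: independent samples give regression slopes near zero.
import random

random.seed(1)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]   # generated independently of x

def slope(u, v):
    """Least-squares slope of v regressed on u."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return (sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
            / sum((ui - u_bar) ** 2 for ui in u))

print(slope(x, y))   # b  ~ 0: the Y-on-X line is almost parallel to the X axis
print(slope(y, x))   # b' ~ 0: the X-on-Y line is almost parallel to the Y axis
```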
On the other hand, if one outcome is perfectly linearly related to the other, then Y = tX. Clearly, $y_i = t x_i$ for every sample point, so that

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x})(t x_i - t\bar{x})}{\sum (x_i - \bar{x})^2} = t$$

Denoting the regression of X on Y by x = a' + b'y, the expression for b' is given by

$$b' = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2} = \frac{\sum (x_i - \bar{x})(t x_i - t\bar{x})}{\sum (t x_i - t\bar{x})^2} = \frac{1}{t}$$

With transposed axes, this line exactly overlaps the best-fit line for the regression of Y on X. In other words, when there is exact linear dependence between the variables, the best-fit regression lines meet at an angle of zero degrees. Thus, we can use the angle between the regression lines as an indication of the degree of linear dependence between the variables.
In practice, the standard measure of dependence, or correlation, is the square root of the product bb'. This quantity, denoted r and also called Pearson's correlation coefficient, is given by
$$r = \sqrt{\left(\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\right)\left(\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2}\right)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\sum (x_i - \bar{x})^2\right)\left(\sum (y_i - \bar{y})^2\right)}} \qquad \text{(EQ 38)}$$
When the regression lines are perpendicular, r = 0, and when the slopes are inverses of each other, so that the regression lines overlap, r = 1. Moreover, when X and Y are perfectly negatively correlated, so that Y = -tX, r = -1. Therefore, we interpret r as the degree of correlation between two variables, ranging from -1 to +1, with its sign indicating the direction of correlation (positive or negative) and its magnitude indicating the degree of correlation.
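A direct transcription of Equation 38 into code may help; the sketch below (the function name and the test data are our own) also confirms that an exact linear relationship gives r = +1 or r = -1, depending on the sign of the slope.

```python
# Illustrative sketch: Pearson's correlation coefficient, per Equation 38.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [3 * xi for xi in xs]))    # exact positive linear relation: 1.0
print(pearson_r(xs, [-3 * xi for xi in xs]))   # exact negative linear relation: -1.0
```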
EXAMPLE 17: CORRELATION COEFFICIENT
Compute the correlation coefficient for the variables in Example 16.
Solution:
We compute the mean throughput as 54.4 kbps and the mean delay as 690 ms. Substituting these values into Equation 38, we find that r = -0.56. This indicates a negative correlation, but not a particularly strong linear relationship.
There are many interpretations of the correlation coefficient.⁵ One particularly insightful interpretation is based on the sum of squares minimized in a linear regression: $S^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$. Substituting for a and b, it is easily shown (see Exercise 14) that
5. See Joseph Lee Rodgers and W. Alan Nicewander, “Thirteen Ways to Look at the Correlation Coefficient,” The American Statistician, Vol. 42, No. 1 (Feb., 1988), pp. 59-66.
$$r^2 = \frac{\sum (y_i - \bar{y})^2 - S^2}{\sum (y_i - \bar{y})^2} \qquad \text{(EQ 39)}$$
That is, r² is the degree to which a regression is able to reduce the sum of squared errors, which we interpret as the degree to which the independent variable explains variations in the dependent variable. When we have a perfect linear dependency between Y and X, the degree of correlation is 1 in absolute value, and the regression line is perfectly aligned with the data, so that it has zero error.
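Equation 39 is easy to verify numerically. The sketch below (with hypothetical data chosen only for illustration) computes r² directly and compares it with the fractional reduction in the sum of squared errors achieved by the regression; the two values agree.

```python
# Illustrative sketch: r^2 equals the fractional reduction in squared error (Equation 39).
from math import sqrt

x = [1, 2, 3, 4, 5, 6]                      # hypothetical sample
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]        # roughly linear in x

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                               # least-squares slope
a = y_bar - b * x_bar                       # least-squares intercept
r = sxy / sqrt(sxx * syy)                   # Equation 38

s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))   # minimized sum of squares
print(r ** 2)             # r squared ...
print((syy - s2) / syy)   # ... equals the fraction of squared error removed by the regression
```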
In computing a correlation coefficient, it is important to remember that it captures only linear dependence. A coefficient of zero does not mean that the variables are independent: they could well be non-linearly dependent. For example, if y² = 1 - x², then for every value of X there are two equal and opposite values of Y, so that the best-fit regression line is the X axis, which leads to a correlation coefficient of 0. But, of course, Y is not independent of X! Therefore, it is important to be cautious in drawing conclusions regarding independence when using the correlation coefficient. For drawing such conclusions, it is best to use the chi-square goodness-of-fit test described earlier.
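This example can be reproduced directly. In the sketch below (the sampling points are our own choice), Y is completely determined by X up to sign, yet the correlation coefficient evaluates to zero because the dependence is not linear.

```python
# Illustrative sketch: a nonlinear dependence with zero correlation coefficient.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

# Sample points satisfying y^2 = 1 - x^2, taking both roots for each x.
x, y = [], []
for xi in (-0.8, -0.4, 0.0, 0.4, 0.8):
    for sign in (+1, -1):
        x.append(xi)
        y.append(sign * sqrt(1 - xi ** 2))

print(pearson_r(x, y))   # ~0, even though x and y always satisfy x**2 + y**2 = 1
```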
Like any statistic, the correlation coefficient r can have an error due to random fluctuations in a sample. It can be shown that if X and Y are jointly normally distributed, then the variable $z = \frac{1}{2}\ln\frac{1+r}{1-r}$ is approximately normally distributed with a mean of $\frac{1}{2}\ln\frac{1+\rho}{1-\rho}$, where $\rho$ is the population correlation coefficient, and a variance of 1/(n-3). This can be used to find the confidence interval around r in which we can expect to find $\rho$.
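As a sketch of how such a confidence interval might be computed (the function name is ours, and 1.96 is the usual multiplier for an approximate 95% interval), we can transform r, add and subtract the half-width in the transformed domain, and transform back:

```python
# Illustrative sketch: an approximate confidence interval for the population correlation.
from math import atanh, tanh, sqrt

def r_confidence_interval(r, n, z_crit=1.96):
    """Interval for the population correlation based on the transform
    z = 0.5 * ln((1 + r) / (1 - r)) = atanh(r), which is approximately
    normal with variance 1/(n - 3); z_crit = 1.96 gives roughly 95%."""
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

print(r_confidence_interval(-0.56, 10))   # interval around the r found in Example 17
```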
A specific form of correlation that is relevant in the analysis of time series is autocorrelation. Consider a series of values of a random variable indexed by discrete time, i.e., X1, X2, ..., Xn. Then, the autocorrelation of this series with lag l is the correlation coefficient between the random variables Xi and Xi-l. If this coefficient is large (close to 1) for a certain value of l, we can infer that the series has variation on the time scale of l. This is often much easier to compute than a full-scale harmonic analysis by means of a Fourier transform.
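A minimal sketch of the lag-l autocorrelation (our own helper: it simply applies Pearson's formula to the series paired with a copy of itself shifted by l) is shown below, using a synthetic series with period 8.

```python
# Illustrative sketch: autocorrelation of a time series at a given lag.
from math import sqrt, sin, pi

def autocorrelation(series, lag):
    """Correlation coefficient between X_i and X_{i-lag}."""
    x = series[lag:]
    y = series[:len(series) - lag]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

series = [sin(2 * pi * i / 8) for i in range(200)]   # synthetic series with period 8
print(autocorrelation(series, 8))   # close to +1: variation on a time scale of 8
print(autocorrelation(series, 4))   # close to -1: half a period apart
```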
Finally, it is important to recognize that correlation is not the same as causality. We must not interpret a correlation coefficient close to 1 or -1 as implying causality. For example, it may be the case that packet losses on a wireless network are positively correlated with the mean frame size. One cannot infer that larger frames are more likely to be dropped. It could be the case, for example, that the network is heavily loaded when it is subjected to video traffic, which uses large frames. The increase in the loss rate could then be due to the load, rather than the frame size. Yet, the correlation between these two quantities would be strong.
To go from correlation to causation, it is necessary to determine the physical causes that underlie the correlation. Otherwise, the unwary researcher may be led to unsupportable and erroneous conclusions.