CHAPTER 2 Statistics
2.5 Independence and dependence: regression, and correlation
2.5.3 Correlation
When computing a regression, we can use physical considerations to clearly identify independent and dependent variables.
In some cases, however, the outcomes of an experiment can be thought of as being mutually dependent on each other. This dependency is captured in the statistical concept of correlation. Moreover, as we will see later, even if one variable depends on the other, the correlation coefficient allows us to determine the degree to which variations in the dependent variable can be explained as a consequence of variations in the independent variable.
EXAMPLE 16: CORRELATED VARIABLES
Suppose we transfer a small file over a cellular modem ten times, each time measuring the round-trip delay (from a 'ping' done just before transferring the file) and the throughput achieved (by dividing the file size by the transfer time). The round-trip delay may be large because the network interface card may have a low capacity, so that even a small ping packet experiences significant delays. On the other hand, the file transfer throughput may be low because the path delay is large. So, it is not clear which variable ought to be the dependent variable and which variable ought to be the independent variable. Suppose that the measured round-trip delays and throughputs are as shown below:
Throughput (kbps):        46    65    53    38    61    89    59    60    73
Round-trip delay (ms):   940   790   910  1020   540   340   810   720   830
FIGURE 6. Regression and correlation. (a) Scatter plot; (b) best fit: Y on X; (c) best fit: X on Y; (d) both best-fit lines. Axes: round-trip delay (ms) versus throughput (kbps).
Figure 6(a) shows the scatter plot of the two variables. There appears to be an approximately linear decline in the round-trip delay with an increase in throughput. We arbitrarily choose throughput to be the independent variable and do a regression of round-trip delay on it, as shown in Figure 6(b). We see that the best-fit line has a negative slope, as expected.
There is no reason why we could not have chosen the round-trip delay to be the independent variable and have done a similar regression. This is shown in Figure 6(c). Again, we see that as the round-trip delay increases, the throughput decreases, indicating a negative relationship. We also see the best-fit line with a negative slope.
Note that the two regression lines are not the same! In one case, we are trying to minimize the sum of the squared errors in round-trip delay, and in the other, the sum of squared errors in throughput. So, the best-fit lines will, in general, not be the same. This is shown in Figure 6(d), where we show both best-fit lines (one drawn with transposed axes).
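To make this concrete, the short Python sketch below (an illustration we add here, not part of the original example; it uses only the standard library) fits both best-fit lines to the measurements above using the least-squares slope formula. Both slopes come out negative, and the slope of one regression is not, in general, the reciprocal of the other, so the two lines differ.

```python
# Illustrative sketch: fit both regression lines for the Example 16 data.
def least_squares(x, y):
    """Return (a, b) for the best-fit line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b

throughput = [46, 65, 53, 38, 61, 89, 59, 60, 73]         # kbps
delay = [940, 790, 910, 1020, 540, 340, 810, 720, 830]     # ms

a1, b1 = least_squares(throughput, delay)   # regression of delay on throughput
a2, b2 = least_squares(delay, throughput)   # regression of throughput on delay

print(f"delay      = {a1:.1f} + {b1:.2f} * throughput")    # negative slope
print(f"throughput = {a2:.1f} + {b2:.4f} * delay")         # also negative
print(b1, 1 / b2)   # not equal: the two best-fit lines do not coincide
```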
The reason why the two regression lines do not coincide in general is best understood by doing a thought experiment. Suppose two outcomes of an experiment, say X and Y, are completely independent. Then, E(XY) = E(X)E(Y), by the definition of independence. In the context of a single sample, we rewrite this as:
$$E(XY) = \frac{\sum x_i y_i}{n} = E(X)E(Y) = \left(\frac{\sum x_i}{n}\right)\left(\frac{\sum y_i}{n}\right) = \bar{x}\bar{y} \qquad \text{(EQ 36)}$$

Recall that $b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$. We expand the numerator as $\sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y}$. Rewriting $\sum y_i$ as $n\bar{y}$ and $\sum x_i$ as $n\bar{x}$, and using Equation 36, we get
$$b = \frac{\sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y}}{\sum (x_i - \bar{x})^2} = \frac{n\bar{x}\bar{y} - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y}}{\sum (x_i - \bar{x})^2} = 0 \qquad \text{(EQ 37)}$$
so that the regression line has zero slope, i.e., is parallel to the X axis. Symmetrically, the regression of X on Y will be parallel to the Y axis. Therefore, the two regression lines meet at right angles when the outcomes are independent. Recalling that we can interpret b as the expected increment in Y for a unit change in X, b = 0 implies that a unit change in X does not change Y (in expectation), which is consistent with independence.
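This thought experiment is easy to check numerically. The sketch below (illustrative only; the sample size, seed, and distributions are arbitrary choices of ours) generates two independent samples and regresses each on the other; both estimated slopes are close to zero, so the two regression lines are nearly parallel to the axes.

```python
# Illustrative sketch: independent samples give regression slopes near zero.
import random

random.seed(1)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]   # generated independently of x

def slope(u, v):
    """Least-squares slope of v regressed on u."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return (sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
            / sum((ui - u_bar) ** 2 for ui in u))

print(slope(x, y))   # b  ~ 0: the Y-on-X line is almost parallel to the X axis
print(slope(y, x))   # b' ~ 0: the X-on-Y line is almost parallel to the Y axis
```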
On the other hand, if one outcome is perfectly linearly related to the other, then Y = tX. Clearly, $y_i = t x_i$ for every sample point, so that

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x})(t x_i - t\bar{x})}{\sum (x_i - \bar{x})^2} = t$$

Denoting the regression of X on Y by x = a' + b'y, the expression for b' is given by

$$b' = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2} = \frac{\sum (x_i - \bar{x})(t x_i - t\bar{x})}{\sum (t x_i - t\bar{x})^2} = \frac{1}{t}$$

With transposed axes, this line exactly overlaps the best-fit line for the regression of Y on X. In other words, when there is exact linear dependence between the variables, the best-fit regression lines meet at an angle of zero degrees. Thus, we can use the angle between the regression lines as an indication of the degree of linear dependence between the variables.
In practice, the standard measure of dependence, or correlation, is the square root of the product bb'. This quantity, denoted r and also called Pearson's correlation coefficient, is given by
$$r = \sqrt{\left(\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\right)\left(\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2}\right)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\sum (x_i - \bar{x})^2\right)\left(\sum (y_i - \bar{y})^2\right)}} \qquad \text{(EQ 38)}$$
When the regression lines are perpendicular, r = 0, and when the slopes are inverses of each other, so that the regression lines overlap, r = 1. Moreover, when X and Y are perfectly negatively correlated, so that Y = -tX, r = -1. Therefore, we interpret r as the degree of correlation between two variables, ranging from -1 to +1, with its sign indicating the direction of correlation (positive or negative) and its magnitude indicating the degree of correlation.
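A direct transcription of Equation 38 into code may help; the sketch below (the function name and the test data are our own) also confirms that an exact linear relationship gives r = +1 or r = -1, depending on the sign of the slope.

```python
# Illustrative sketch: Pearson's correlation coefficient, per Equation 38.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [3 * xi for xi in xs]))    # exact positive linear relation: 1.0
print(pearson_r(xs, [-3 * xi for xi in xs]))   # exact negative linear relation: -1.0
```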
EXAMPLE 17: CORRELATION COEFFICIENT
Compute the correlation coefficient for the variables in Example 16.
Solution:
We compute the mean throughput as 54.4 kbps and the mean delay as 690 ms. Substituting these values into Equation 38, we find that r = -0.56. This indicates a negative correlation, but not a particularly strong linear relationship.
There are many interpretations of the correlation coefficient.⁵ One particularly insightful interpretation is based on the sum of squares minimized in a linear regression: $S^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$. Substituting for a and b, it is easily shown (see Exercise 14) that
5. See Joseph Lee Rodgers and W. Alan Nicewander, “Thirteen Ways to Look at the Correlation Coefficient,” The American Statistician, Vol. 42, No. 1 (Feb., 1988), pp. 59-66.
$$r^2 = \frac{\sum (y_i - \bar{y})^2 - S^2}{\sum (y_i - \bar{y})^2} \qquad \text{(EQ 39)}$$
That is, r² is the degree to which a regression is able to reduce the sum of squared errors, which we interpret as the degree to which the independent variable explains variations in the dependent variable. When we have a perfect linear dependency between Y and X, the degree of correlation is 1 in absolute value, and the regression line is perfectly aligned with the data, so that it has zero error.
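Equation 39 is easy to verify numerically. The sketch below (with hypothetical data chosen only for illustration) computes r² directly and compares it with the fractional reduction in the sum of squared errors achieved by the regression; the two values agree.

```python
# Illustrative sketch: r^2 equals the fractional reduction in squared error (Equation 39).
from math import sqrt

x = [1, 2, 3, 4, 5, 6]                      # hypothetical sample
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]        # roughly linear in x

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx                               # least-squares slope
a = y_bar - b * x_bar                       # least-squares intercept
r = sxy / sqrt(sxx * syy)                   # Equation 38

s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))   # minimized sum of squares
print(r ** 2)             # r squared ...
print((syy - s2) / syy)   # ... equals the fraction of squared error removed by the regression
```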
In computing a correlation coefficient, it is important to remember that it captures only linear dependence. A coefficient of zero does not mean that the variables are independent: they could well be non-linearly dependent. For example, if y² = 1 - x², then for every value of X there are two equal and opposite values of Y, so that the best-fit regression line is the X axis, which leads to a correlation coefficient of 0. But, of course, Y is not independent of X! Therefore, it is important to be cautious in drawing conclusions regarding independence when using the correlation coefficient. For drawing such conclusions, it is best to use the chi-square goodness-of-fit test described earlier.
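This example can be reproduced directly. In the sketch below (the sampling points are our own choice), Y is completely determined by X up to sign, yet the correlation coefficient evaluates to zero because the dependence is not linear.

```python
# Illustrative sketch: a nonlinear dependence with zero correlation coefficient.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

# Sample points satisfying y^2 = 1 - x^2, taking both roots for each x.
x, y = [], []
for xi in (-0.8, -0.4, 0.0, 0.4, 0.8):
    for sign in (+1, -1):
        x.append(xi)
        y.append(sign * sqrt(1 - xi ** 2))

print(pearson_r(x, y))   # ~0, even though x and y always satisfy x**2 + y**2 = 1
```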
Like any statistic, the correlation coefficient r can have an error due to random fluctuations in a sample. It can be shown that if X and Y are jointly normally distributed, then the variable $z = \frac{1}{2}\ln\frac{1+r}{1-r}$ is approximately normally distributed with a mean of $\frac{1}{2}\ln\frac{1+\rho}{1-\rho}$, where $\rho$ is the population correlation coefficient, and a variance of 1/(n-3). This can be used to find the confidence interval around r in which we can expect to find $\rho$.
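As a sketch of how such a confidence interval might be computed (the function name is ours, and 1.96 is the usual multiplier for an approximate 95% interval), we can transform r, add and subtract the half-width in the transformed domain, and transform back:

```python
# Illustrative sketch: an approximate confidence interval for the population correlation.
from math import atanh, tanh, sqrt

def r_confidence_interval(r, n, z_crit=1.96):
    """Interval for the population correlation based on the transform
    z = 0.5 * ln((1 + r) / (1 - r)) = atanh(r), which is approximately
    normal with variance 1/(n - 3); z_crit = 1.96 gives roughly 95%."""
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

print(r_confidence_interval(-0.56, 10))   # interval around the r found in Example 17
```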
A specific form of correlation that is relevant in the analysis of time series is autocorrelation. Consider a series of values of a random variable indexed by discrete time, i.e., X1, X2, ..., Xn. Then, the autocorrelation of this series with lag l is the correlation coefficient between the random variables Xi and Xi-l. If this coefficient is large (close to 1) for a certain value of l, we can infer that the series has variation on the time scale of l. This is often much easier to compute than a full-scale harmonic analysis by means of a Fourier transform.
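A minimal sketch of the lag-l autocorrelation (our own helper: it simply applies Pearson's formula to the series paired with a copy of itself shifted by l) is shown below, using a synthetic series with period 8.

```python
# Illustrative sketch: autocorrelation of a time series at a given lag.
from math import sqrt, sin, pi

def autocorrelation(series, lag):
    """Correlation coefficient between X_i and X_{i-lag}."""
    x = series[lag:]
    y = series[:len(series) - lag]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x)
               * sum((yi - y_bar) ** 2 for yi in y))
    return num / den

series = [sin(2 * pi * i / 8) for i in range(200)]   # synthetic series with period 8
print(autocorrelation(series, 8))   # close to +1: variation on a time scale of 8
print(autocorrelation(series, 4))   # close to -1: half a period apart
```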
Finally, it is important to recognize that correlation is not the same as causality. We must not interpret a correlation coefficient close to 1 or -1 as implying causality. For example, it may be the case that packet losses on a wireless network are positively correlated with the mean frame size. One cannot infer that larger frames are more likely to be dropped. It could be the case, for example, that the network is heavily loaded when it is subjected to video traffic, which uses large frames. The increase in the loss rate could then be due to the load, rather than the frame size. Yet, the correlation between these two quantities would be strong.
To go from correlation to causation, it is necessary to determine the physical causes that underlie the correlation. Otherwise, the unwary researcher may be led to unsupportable and erroneous conclusions.