…variance. It can be shown that if the population is normally distributed, this sampling distribution is the χ2 distribution (discussed in Section 2.4.7 on page 58). However, this confidence interval is rarely derived in practice, and so we will omit the details of this result.

2.4 Testing hypotheses about outcomes of experiments

Consider two universes in each of which a particular hypothesis is either valid or invalid. In each universe, we can expect one of two results from hypothesis testing: “The hypothesis is not rejected,” or “The hypothesis is rejected.” The four possible outcomes of testing are represented in the following table:

State of the universe     Outcome: Reject hypothesis                             Outcome: Do not reject hypothesis
Hypothesis is invalid     Good outcome C00                                       Bad outcome C01 (false negative, or Type II error)
Hypothesis is valid       Bad outcome C10 (false positive, or Type I error)      Good outcome C11

If the hypothesis is invalid and is rejected, then we have a good outcome. The probability of this event is denoted C00.

If the hypothesis is invalid but is not rejected, then we have a bad outcome. The probability of this event is denoted C01.

If the hypothesis is valid and is not rejected, then we have a good outcome. The probability of this event is denoted C11.

If the hypothesis is valid but is rejected, then we have a bad outcome. The probability of this event is denoted C10.

We can use the Cijs to define the following quantities:

Term           Definition          Meaning
Concordance    C11 + C00           The probability of an accurate prediction
Error rate     C10 + C01           The probability of an inaccurate prediction
Sensitivity    C11/(C11 + C01)     Ability to predict correctly, conditional on the hypothesis actually being valid
Specificity    C00/(C10 + C00)     Ability to eliminate a false hypothesis, conditional on the hypothesis actually being invalid

These apply to all types of hypothesis testing. Our goal is to design experiments that maximize good outcomes while minimizing bad outcomes. In certain cases, we may trade off a higher sensitivity for a higher error rate, or a higher specificity for a lower concordance. A common rule is to limit Type I errors to 5%. That is, if the hypothesis is valid, we should not mistakenly reject it more than 5% of the time. Note that, at this 5% significance level, we might reject a valid hypothesis one time in twenty just by chance.
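As a concrete illustration, the following minimal Python sketch computes these quantities from a hypothetical set of outcome probabilities C00, C01, C10, and C11, which are assumed to sum to 1:

# Hypothetical outcome probabilities; the four values are assumed to sum to 1.
def test_quality(c00, c01, c10, c11):
    return {
        "concordance": c11 + c00,          # probability of an accurate prediction
        "error_rate": c10 + c01,           # probability of an inaccurate prediction
        "sensitivity": c11 / (c11 + c01),  # as defined in the table above
        "specificity": c00 / (c10 + c00),  # as defined in the table above
    }

print(test_quality(c00=0.40, c01=0.05, c10=0.10, c11=0.45))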

2.4.3 Formulating a hypothesis

We now return to the problem that an experiment can only result in rejection or non-rejection of a hypothesis. Therefore, we may end up not rejecting an invalid hypothesis. What guidelines should we use in choosing a hypothesis?

The standard technique is to formulate a null hypothesis that we believe is sufficient to explain the data unless statistical evidence strongly indicates otherwise. The null hypothesis should be formulated conservatively, that is, preserving the status quo, where this is applicable. A good way to think about this is in terms of a criminal trial. The judicial system starts with the presumption of innocence. It is up to the prosecution to prove that the defendant is guilty. If the prosecution cannot prove beyond reasonable doubt that the defendant is guilty, then the defendant is released. No doubt, this will let some guilty parties go unpunished. But it is preferable to the alternative, where the defendant is assumed guilty and must prove innocence.

In formulating a null hypothesis, it is necessary to be precise. In the words of Sir R.A. Fisher, the inventor of this approach, the null hypothesis should be “free from vagueness and ambiguity.” Otherwise, it may be impossible to reject it, making our effort fruitless. Moreover, a hypothesis should be about a population parameter, not a sample (unless the sample includes the entire population).

EXAMPLE 7: FORMULATING A NULL HYPOTHESIS

Consider a router that can execute either scheduling algorithm A or scheduling algorithm B. Suppose our goal is to show that scheduling algorithm A is superior to scheduling algorithm B for some metric. An acceptable conservative null hypothesis would be “Scheduling algorithm A and scheduling algorithm B have identical performance.” Given this assumption, we would expect the performance metrics for both scheduling algorithms to be roughly the same (i.e., this is our expectation regarding the state of the world). If our experiments show this to be the case, for example, if the sample means of the performance metrics for both scheduling algorithms were nearly identical, then we would conclude that we do not have sufficient evidence to prove that scheduling algorithm A improved the system, a conservative and scientifically valid decision. In contrast, if the sample mean for algorithm A were much higher than the sample mean for algorithm B (we will quantify this shortly), then the experiment would be inconsistent with our null hypothesis, and we would reject it, giving credence to the belief that scheduling algorithm A was indeed better.3

If we were to invert the null hypothesis, for example, stating it as “Scheduling algorithm A is better than scheduling algorithm B,” then by being unable to reject the null hypothesis, we may come to an unwarranted conclusion. In any case, this hypothesis is imprecise, in that it does not quantify how much better algorithm A is supposed to be than scheduling algorithm B, so it ought to be deprecated on those grounds alone.

„

We represent the null hypothesis using the symbol H0. Alternatives to the null hypothesis are usually labelled H1, H2, and so on. The steps in hypothesis testing depend on whether the outcome of an experiment is being compared with a fixed quantity or whether it is being compared with the outcome of another experiment. In the next sub-section, we consider outcomes that are compared with a fixed quantity, deferring comparison of the outcomes of two experiments to Section 2.4.5 on page 54.

Hypotheses can be ‘two-tailed’ or ‘one-tailed.’ We reject a two-tailed hypothesis if the sample statistic significantly differs from the conjectured corresponding population parameter in absolute value. For example, suppose we hypothesize that the population mean μ is 0. If observations indicate that |x̄| > a, where a is the critical value, then we reject the hypothesis. In contrast, we reject a one-tailed hypothesis if the sample statistic significantly differs from the corresponding population parameter in a pre-specified direction (i.e., is smaller than or larger than the conjectured population parameter), where experimental conditions allow us to rule out deviations in the other direction. An example of a one-tailed hypothesis is μ < 0. If x̄ > a, where a, again, is the critical value, we can reject this hypothesis. Note that we do not consider the ‘other tail,’ that is, the possibility that x̄ < −a (why?).
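For instance, the critical value a differs between the two cases. A minimal sketch, assuming SciPy is available and a test statistic that is approximately standard normal, at the 5% significance level:

from scipy.stats import norm

# Two-tailed test: reject when the statistic exceeds a in absolute value,
# so 2.5% of the probability lies in each tail.
two_tailed_a = norm.ppf(1 - 0.05 / 2)   # about 1.96 standard deviations
# One-tailed test: reject only for large positive deviations, so all 5% lies in one tail.
one_tailed_a = norm.ppf(1 - 0.05)       # about 1.645 standard deviations
print(two_tailed_a, one_tailed_a)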

2.4.4 Comparing an outcome with a fixed quantity

To develop intuition, let us start with a simple example. Suppose that physical considerations lead us to expect that the population mean of a population under study is 0. Assume that there is no particular status quo that we are trying to maintain.

Therefore, a reasonable null hypothesis is:

H0: the population mean is 0

To test this hypothesis, we need to sample the population multiple times and compute the corresponding sample means. If the number of samples is large, the central limit theorem implies that each sample mean will be drawn from a normal distribution. We can use this fact to compute its confidence interval (say, at the 95% level) using the techniques in Section 2.3 on page 48. We then check whether 0 lies within this interval. One of two cases arises:

0 lies in the 95% (99%) confidence interval of the sample mean. In this case, we cannot reject the null hypothesis. This is usually incorrectly interpreted to mean that, with 95% (99%) confidence, the population mean is indeed 0. Of course, all we have shown is that, conditional on the population mean being 0, the outcome of this experiment has a likelihood greater than 95% (99%) (i.e., it is consistent with the null hypothesis).

0 does not lie in the 95% (99%) confidence interval of the sample mean. In this case, we reject the null hypothesis. This is usually interpreted to mean that, with high confidence, the population mean is not 0. Again, all we have shown is that, conditional on the mean being 0, the outcome we saw was rather unlikely, so we have good reason to be suspicious of the null hypothesis.
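A minimal sketch of this procedure, assuming SciPy is available and using a hypothetical set of observations:

import math
from scipy.stats import t

sample = [1.2, -0.3, 0.8, 2.1, 0.5, 1.7, -0.1, 0.9]   # hypothetical observations
n = len(sample)
mean = sum(sample) / n
var_unbiased = sum((x - mean) ** 2 for x in sample) / (n - 1)
std_err = math.sqrt(var_unbiased / n)             # standard deviation of the sample mean
half_width = t.ppf(0.975, df=n - 1) * std_err     # 95% confidence half-interval
lo, hi = mean - half_width, mean + half_width
# Reject H0 (the population mean is 0) only if 0 lies outside the interval.
print((lo, hi), "reject H0" if not (lo <= 0 <= hi) else "cannot reject H0")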

3. In the Bayesian formulation of hypothesis testing, we view the experiment as updating a prior expectation on the state of the world, thus refining our model for the state of the world. Here, we are presenting the classical statistical view on hypothesis testing.


This example is easily generalized. Suppose we want to establish that the population mean is μ0. We compute the sample mean x̄ as before. Then, we test the hypothesis:

H0: (x̄ − μ0) = 0

which can be tested as described above, with identical conclusions being drawn about the results.4

EXAMPLE 8: TESTING FOR A ZERO MEAN

Returning to Example 10, note that the 99% confidence interval for the mean of the sample data, using the t test, was [1.03, 4.72]. Therefore, we can state with 99% confidence (with the caveats stated above) that the mean of the underlying population is not 0.

„

2.4.5 Comparing outcomes from two experiments

Suppose we want to test the hypothesis that two samples are drawn from different populations. To fix ideas, consider the situation where we are comparing two systems—a system currently in use and a system that incorporates a new algorithm—on the basis of a particular performance metric. We assume that we can collect performance metrics from each system multiple times to obtain two samples. If the systems do not differ to a statistically significant degree, then both samples would be drawn from the same underlying population, with, for example, the same population mean, and therefore would have similar statistics, such as the sample mean. However, if the statistics are significantly different, then we infer that the two samples are likely to be drawn from different populations and the new algorithm does indeed affect the performance of the system.

The null hypothesis is the statement:

H0: the two systems are identical

We reject H0 only if it is sufficiently unlikely that the two samples are drawn from the same population; otherwise, we retain the conservative position that the new algorithm does not improve the performance of the system.

Suppose we collect n sets of samples from the first system, labelled A, to get sample means a1, a2, ..., an, and collect m sets of samples from the second system, labelled B, to get sample means b1, b2, ..., bm. Let the means of these means be denoted ā and b̄, with corresponding variances m2(a) and m2(b).

If n=m, then we define an auxiliary random variable C=A-B, which takes values ci = ai-bi. Then, we redefine the hypothesis as:

H0: the population mean of C is zero

This can be easily tested using the approach described in Section 2.4.4 on page 53.

EXAMPLE 9: COMPARING TWO SAMPLES

4. Advanced readers may note that in this section, we have switched from the Fisher to the Neyman-Pearson approach to hypothesis testing. This hybrid approach is widely used in modern scientific practice.


Suppose that you are using simulations to study the effect of buffer size at some network queue on packet loss rate. You would like to see if increasing the buffer size from 5 packets to 100 packets has a significant effect on loss rate. To do so, suppose that you run 10 simulations for each buffer size, resulting in the loss rates shown below:

Loss rate with 5 buffers:     1.20%  2.30%  1.90%  2.40%  3.00%  1.80%  2.10%  3.20%  4.50%  2.20%
Loss rate with 100 buffers:   0.10%  0.60%  1.10%  0.80%  1.20%  0.30%  0.70%  1.90%  0.20%  1.20%

Does the buffer size have a significant influence on the packet loss rate?

Solution:

Note that each loss rate measurement in each simulation is itself a sample mean. Therefore, these loss rates can be assumed to be distributed approximately normally. Denoting by ai the loss rate with a buffer of size 5 packets and by bi the loss rate with a buffer size of 100 packets, we define the auxiliary variable C that takes values ci = ai − bi, that is, (1.2−0.1), (2.3−0.6), ..., (2.2−1.2), so that c is given by:

c = {1.1, 1.7, 0.8, 1.6, 1.8, 1.5, 1.4, 1.3, 4.3, 1.0}

We compute the sample mean as 1.65 and the sample variance m2 as 0.87, so that the unbiased estimator for the population variance is given by (n/(n−1))m2 = 0.97. The variance of the sample mean is then 0.97/n = 0.097, corresponding to a standard deviation of 0.31. Because the number of values is smaller than 20, we use the t distribution with 9 degrees of freedom to compute the confidence interval at the 95% level as 1.65±0.70. This interval does not include 0. Thus, we conclude that the change in the buffer size does significantly affect the loss rate.

„
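The computation in Example 9 can be reproduced with a short script; a sketch, assuming SciPy is available:

import math
from scipy.stats import t

a = [1.2, 2.3, 1.9, 2.4, 3.0, 1.8, 2.1, 3.2, 4.5, 2.2]   # loss rates with 5 buffers (%)
b = [0.1, 0.6, 1.1, 0.8, 1.2, 0.3, 0.7, 1.9, 0.2, 1.2]   # loss rates with 100 buffers (%)
c = [ai - bi for ai, bi in zip(a, b)]                    # paired differences

n = len(c)
mean = sum(c) / n                                                    # 1.65
var_unbiased = sum((x - mean) ** 2 for x in c) / (n - 1)             # about 0.97
half_width = t.ppf(0.975, df=n - 1) * math.sqrt(var_unbiased / n)    # about 0.70
print(mean, half_width, mean - half_width > 0)   # the interval excludes 0, so we reject H0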

If n ≠ m, the situation is somewhat more complex. We first use m2(a) to compute the confidence interval for A’s performance metric around its sample mean ā and similarly use m2(b) to compute the confidence interval for B’s performance metric around its sample mean b̄ (using the normal or t distribution, as appropriate). Now, one of the following two cases holds:

The confidence intervals do not overlap. Recall that with 95% (or 99%) confidence, A’s and B’s population means lie within the computed confidence intervals. If the null hypothesis were true and the population means coincided, then it must be the case that either A’s population mean or B’s population mean lies outside of its computed confidence interval. However, this has a probability lower than 5% (1%). Therefore, we reject the hypothesis.

The confidence intervals overlap. In this case, there is some chance that the samples are drawn from the same population. The next steps depend on whether we can make one of two assumptions: (1) the sample variances are the same, or (2) n and m are both large.

(1) Sample variances are the same. In this case, we define the auxiliary variable s by:

s^2 = \frac{\sum_{i=1}^{n} (a_i - \bar{a})^2 + \sum_{i=1}^{m} (b_i - \bar{b})^2}{m + n - 2}    (EQ 21)

Then, it can be shown that if the two samples are drawn from the same population, the variable c defined by:

c = \frac{\bar{a} - \bar{b}}{s \sqrt{\frac{1}{m} + \frac{1}{n}}}    (EQ 22)

is a standard t variable (i.e., with zero mean and unit variance) with m+n-2 degrees of freedom. Therefore, we can use a t test to determine whether c has a zero mean, using the approach in Section 2.4.4 on page 53.
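A minimal sketch of Equations 21 and 22, assuming SciPy is available and using two small hypothetical samples; for reference, SciPy's scipy.stats.ttest_ind, with its default equal_var=True, computes the same pooled statistic.

import math
from scipy.stats import t

def pooled_t(a, b):
    n, m = len(a), len(b)
    a_mean, b_mean = sum(a) / n, sum(b) / m
    ss_a = sum((x - a_mean) ** 2 for x in a)
    ss_b = sum((x - b_mean) ** 2 for x in b)
    s = math.sqrt((ss_a + ss_b) / (m + n - 2))                # Equation 21
    c = (a_mean - b_mean) / (s * math.sqrt(1 / m + 1 / n))    # Equation 22
    return c, m + n - 2                                       # statistic and degrees of freedom

c, dof = pooled_t([1.2, 2.3, 1.9, 2.4, 3.0], [0.1, 0.6, 1.1, 0.8])   # hypothetical samples
critical = t.ppf(0.975, df=dof)        # two-tailed 5% critical value
print(c, critical, abs(c) > critical)  # True means we reject H0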



EXAMPLE 10: COMPARING TWO SAMPLES WHEN SAMPLE SIZES DIFFER

Continuing with Example 9, assume that we have additional data points for the simulation runs with 5 buffers, as shown below. Can we still claim that the buffer size plays a role in determining the loss rate?

Loss rate with 5 buffers:     0.20%  0.30%  0.90%  1.40%  1.00%  0.80%  1.10%  0.20%  1.50%  0.20%  0.50%  1.20%  0.70%  1.30%  0.90%
Loss rate with 100 buffers:   0.10%  0.60%  1.10%  0.80%  1.20%  0.30%  0.70%  1.90%  0.20%  1.20%

Solution:

Here, m = 15 and n = 10, so we cannot use the approach of Example 9. Instead, we will first compute the means and confidence intervals of both samples to see if the intervals overlap. It is easily found that, at the 95% level, using a t distribution with 14 and 9 degrees of freedom respectively, the confidence intervals are 0.81±0.25 and 0.81±0.40, which overlap. However, the sample variances are not the same, and there is a good chance that the population variances also differ. Nevertheless, for the purpose of this example, we will make the assumption that the population variances are the same. Therefore, we compute s2 using Equation 21 as 2.89, so that s = 1.7. We then use Equation 22 to compute the standard t variate c as 0.0033/(1.7 × (1/15 + 1/10)^(1/2)) = 0.0048. Since this has unit variance, it is easy to see using the t test with 23 degrees of freedom that 0 lies in the confidence interval for c, which implies that, with this data set, buffer size has no statistically significant effect on packet loss rate.

„

(2) Sample variances differ, but m and n are both large. In this case, it can be shown that the variable c defined by:

c = \frac{\bar{a} - \bar{b}}{\sqrt{\frac{\sum_{i=1}^{n} (a_i - \bar{a})^2}{n(n-1)} + \frac{\sum_{i=1}^{m} (b_i - \bar{b})^2}{m(m-1)}}}    (EQ 23)

is a standard normal variable (i.e., with a zero mean and unit variance). Therefore, we can use a standard normal test to determine whether c has a zero mean, using the approach discussed in Section 2.4.4 on page 53.
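A minimal sketch of Equation 23, using hypothetical samples that are far smaller than the large m and n the result assumes; for small samples with unequal variances, SciPy's ttest_ind with equal_var=False applies the related Welch test.

import math
from scipy.stats import norm

def large_sample_statistic(a, b):
    n, m = len(a), len(b)
    a_mean, b_mean = sum(a) / n, sum(b) / m
    var_a_mean = sum((x - a_mean) ** 2 for x in a) / (n * (n - 1))   # variance of the mean of a
    var_b_mean = sum((x - b_mean) ** 2 for x in b) / (m * (m - 1))   # variance of the mean of b
    return (a_mean - b_mean) / math.sqrt(var_a_mean + var_b_mean)    # Equation 23

a = [0.9, 1.1, 1.3, 0.8, 1.0, 1.2, 0.7, 1.1, 0.9, 1.0]   # hypothetical sample from system A
b = [0.6, 0.8, 0.9, 0.7, 0.5, 0.8, 0.7, 0.6, 0.9, 0.8]   # hypothetical sample from system B
c = large_sample_statistic(a, b)
print(c, abs(c) > norm.ppf(0.975))   # True means we reject H0 at the 5% level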

If neither assumption can be made, then it is difficult to draw meaningful comparisons, other than by using non-parametric tests, such as the Mann-Whitney U test, which is beyond the scope of this text.

2.4.6 Testing hypotheses regarding quantities measured on ordinal scales

So far, we have tested hypotheses where a variable takes on real values. We now consider the case where a variable takes on nominal (categorical) values, such as ‘UDP’ or ‘TCP’, or ordinal values, such as ‘bad’, ‘satisfactory’, and ‘good’ (these terms are defined in Section 2.1.2 on page 41). In such cases, hypothesis testing using the techniques described above is meaningless, because a sample cannot be described by a mean, nor can we define real confidence intervals about the mean.

Instead, for such variables, hypotheses are of the form:

H0: the observed values are drawn from the expected distribution

Then, we use a statistical test, such as the Pearson chi-squared test (described below), to reject or not reject the hypothesis.

EXAMPLE 11: HYPOTHESIS FORMULATION WITH NOMINAL SCALES



Suppose that you want to check if the distribution of packet types on a link from your campus to the Internet is similar to that reported in the literature. For instance, suppose that 42% of the bytes originating at the University of Waterloo (UW) during the measurement period can be attributed to P2P applications. Suppose you measure 100 GB of traffic and find that 38 GB can be attributed to P2P applications. Then, a reasonable null hypothesis would be:

H0: the observed traffic on the campus Internet access link is similar to that at UW

„

How should we test hypotheses of this form? A clue comes from the following thought experiment. Suppose we have a possibly biased coin and that we want to determine whether it is biased or not. The null hypothesis is:

H0: P(heads) = P(tails) = 0.5

We assume that we can toss the coin as many times as we want and that the outcome of each toss is independent. Let T denote the outcome ‘Tails’ and H denote the outcome ‘Heads.’ We will represent a set of outcomes such as ‘Nine heads and one tails’ by the notation TH^9. As we saw earlier, if a coin is unbiased, this outcome has the probability C(10, 1)·0.5^9·0.5^1. Any outcome from n coin tosses—such as a heads and n − a tails, represented by H^a T^(n−a)—can be viewed as one sample drawn at random from the set of all possible outcomes when tossing a coin n times. A little thought indicates that the probability of this outcome, given that the probability of heads is p and of tails is q = 1 − p, is given by the binomial distribution C(n, a)·p^a·q^(n−a), which is also the ath term of the expansion of the expression (p + q)^n. As n → ∞, the binomial distribution tends to the normal distribution, so that the probability of each outcome is approximated by the normal distribution.
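These binomial probabilities are easy to check numerically; a minimal sketch, assuming SciPy is available:

from math import comb
from scipy.stats import binom

# Probability of nine heads and one tail in ten tosses of a fair coin (TH^9).
p_nine_heads = comb(10, 1) * 0.5**9 * 0.5**1       # about 0.0098
# The same value from the binomial term C(n, a) * p^a * q^(n - a) with n=10, a=9, p=0.5.
print(p_nine_heads, binom.pmf(9, 10, 0.5))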

Now, consider an experiment where each individual outcome is independent of the others, and where an outcome results in one of k ordinal values, o1, o2, …, ok. Let the expected probability of the ith outcome be pi, so that the expected count for the ith outcome is ei = npi. Suppose we run the experiment n times, and the ith outcome occurs ni times, with Σi ni = n. We can represent any particular outcome by o1^n1 o2^n2 … ok^nk, and this outcome can be viewed as one sample drawn at random from the set of all possible outcomes. The probability of such an outcome is given by the multinomial distribution as:

P(o_1^{n_1} o_2^{n_2} \cdots o_k^{n_k}) = \binom{n}{n_1} \binom{n - n_1}{n_2} \cdots \binom{n - \sum_{i=1}^{k-1} n_i}{n_k} \; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}    (EQ 24)

= \frac{n!}{n_1! \, n_2! \cdots n_k!} \; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}    (EQ 25)

This outcome is one of the terms from the expansion of (p1 + p2 + ... + pk)^n. As with the binomial distribution, we can use the multinomial distribution to test whether any particular outcome, conditional on a null hypothesis about the pi being true, is ‘too unlikely,’ indicating that the null hypothesis should be rejected.
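A minimal sketch of Equation 25, using hypothetical counts and expected probabilities loosely modelled on Example 11:

from math import factorial, prod

def multinomial_probability(counts, probs):
    # Probability of observing the given counts in sum(counts) independent trials,
    # per Equation 25: n! / (n1! n2! ... nk!) * p1^n1 * p2^n2 * ... * pk^nk.
    coeff = factorial(sum(counts))
    for ni in counts:
        coeff //= factorial(ni)
    return coeff * prod(p**ni for p, ni in zip(probs, counts))

# Hypothetical traffic mix: 38 P2P, 40 web, and 22 other observations,
# with expected shares 0.42, 0.38, and 0.20.
print(multinomial_probability([38, 40, 22], [0.42, 0.38, 0.20]))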

In many cases, using the multinomial distribution for testing the validity of a hypothesis can be cumbersome. Instead, we use a standard mathematical result that the variable Xi = (ni − ei)/√ei, for values of ei > 5, closely approximates a standard normal variable with zero mean and unit variance. But we immediately run into a snag: the ni are not independent. For example, if n3 = n, then all the other ni must be zero. Therefore, the Xi are also not independent. However, it can be proved that this set of k dependent variables Xi can be mapped to a set of k−1 independent standard normal variables while keeping the sums of squares of the variables constant. By definition, the sum of squares of k−1 independent standard normal variables follows the χ2 (also written chi-squared and pronounced kai-squared) distribution with k−1 degrees of freedom. Therefore, if the null …
