4.5 Testing the Goodness of Fit of a Probability Distribution

Fig. 4.3. A maximum likelihood example based on the Poisson distribution. The curved line represents the log-likelihood for sample size n = 25, as a function of µ. The maximum is at µ = m, which here is equal to 5. (Axes: true mean density, horizontal; log likelihood, vertical.)
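The shape of the curve in Fig. 4.3 can be explored with a few lines of Python. This is a minimal sketch, not the authors' calculation: the sample of 25 counts is hypothetical (chosen only so that its mean is 5), and the function name `poisson_log_likelihood` is an assumption made for illustration.

```python
import math

def poisson_log_likelihood(counts, mu):
    """Log-likelihood of a Poisson mean mu, given the counts per sample unit."""
    # ln p(x | mu) = x ln(mu) - mu - ln(x!), and lgamma(x + 1) = ln(x!)
    return sum(x * math.log(mu) - mu - math.lgamma(x + 1) for x in counts)

# Hypothetical sample of n = 25 counts with mean m = 5 (not data from the book).
counts = [3, 4, 5, 6, 7] * 5
m = sum(counts) / len(counts)

# Evaluate the log-likelihood on a grid of candidate means: 0.5, 0.6, ..., 10.0.
grid = [k / 10 for k in range(5, 101)]
best_mu = max(grid, key=lambda mu: poisson_log_likelihood(counts, mu))

print(f"sample mean m = {m}; grid value with the largest log-likelihood = {best_mu}")
```

On this grid the largest log-likelihood falls at the sample mean, which is the behaviour the caption describes.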

guide to one of them, the χ² test. In the simplest case, based on n sample units, the parameters of the probability distribution model are determined (usually by maximum likelihood), and the expected frequencies are calculated as np(x_i|θ̂), where i = 1, 2, … , c and x_i = i − 1. Note that here we are dealing with frequency classes x_i, whereas in Equations 4.5 and 4.6, and all of Section 4.3, X_j refers to the jth sample unit, and j = 1, 2, … , n. The np(x_i|θ̂) are used in a comparison with the observed frequencies, f_i, to calculate a statistic X²:

$$X^2 = \sum_{i=1}^{c} \frac{\left(f_i - n\,p(x_i \mid \hat{\theta})\right)^2}{n\,p(x_i \mid \hat{\theta})} \tag{4.8}$$

where c is the number of frequency classes. Note that the capital letter, X, is used to distinguish the expression in Equation 4.8 from the theoretical χ² distribution. If the data truly come from the distribution p(x|θ), with θ = θ̂, X² follows what is called the χ² distribution (with a few provisos, as discussed below), whose properties are well documented. The χ² distribution has one parameter, its degrees of freedom. The number of degrees of freedom, df, is easily calculated as c − 1 − n_p, where c is the number of classes and n_p is the number of fitted parameters (n_p = 1 for the Poisson distribution). If the data do not come from the distribution p(x|θ), or, for some reason or other, θ is not equal to θ̂, X² tends to be large, because it is basically the weighted sum of squared differences between observed (f_i) and expected (np(x_i|θ̂)) frequencies. Therefore, large values of X² indicate that the fit to the distribution p(x|θ) is not good.

How large is large? Because we know the properties of the χ² distribution, we can find the probability of getting a χ² value equal to or greater than the calculated value of X². Suppose that this probability is 0.2. What this means is as follows:

• either the data come from the distribution p(x|θ) with θ = θ̂, but an event with probability of occurrence equal to 0.2 occurred, or

• the data did not come from the distribution p(x|θ) with θ = θ̂.

Our choice depends on how we view the probability 0.2 in connection with our preconceived notion that the data might conform to the distribution p(x|θ). If this preconception is strong, then why not accept that we have just been unlucky and got a result which should happen only one time in five; that is not too unlikely. But what if the probability had been 0.02? Is our preconception so strong that a 1 in 50 chance is just unlucky? Or should we reject the possibility that the data conform to the distribution p(x|θ)? It has become standard usage that a probability equal to 0.05 is a borderline between: (i) accepting that we have just been unlucky and (ii) having grave doubts about our preconceived notion.

The above procedure is an example of testing a hypothesis. Here the hypothesis is that the data conform to the distribution p(x|θ). The data are used to estimate parameters, X² is calculated and a χ² probability (P) is found. The hypothesis is accepted if P is greater than some standard, often 0.05, and rejected otherwise. If there are many data sets, the proportion of data sets where the hypothesis is rejected can be compared with the standard probability that was used. If all the data sets really do conform to the distribution p(x|θ), this proportion should be close to the standard probability itself. Testing statistical hypotheses in this way has become common in many branches of biological science, and some technical terms should be noted. X² is an example of a ‘test statistic’, and the χ² probability obtained from it is called a ‘significance probability’. If the significance probability, P, is less than 0.05, the hypothesis is said to be rejected at the 5% level. This test is often called the χ² goodness-of-fit test.
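As a minimal sketch of Equation 4.8, the following Python computes X² and its degrees of freedom for a fitted Poisson distribution. The observed frequencies and the fitted mean are hypothetical, and the helper names are assumptions made for illustration. Treating the last class as an open-ended tail, so that the class probabilities sum to 1, is a choice made here and is not stated in the text.

```python
import math

def poisson_pmf(x, mu):
    """Poisson probability p(x | mu)."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def expected_frequencies(n, mu, c):
    """Expected frequencies n * p(x_i | mu) for c frequency classes.

    The first c - 1 classes are the counts x = 0, ..., c - 2; the last class
    pools all counts x >= c - 1 (an assumption of this sketch).
    """
    probs = [poisson_pmf(x, mu) for x in range(c - 1)]
    probs.append(1.0 - sum(probs))
    return [n * p for p in probs]

def x2_statistic(observed, expected):
    """Equation 4.8: sum over classes of (f_i - e_i)^2 / e_i."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Hypothetical observed frequencies f_i for counts 0, 1, ..., 8 and "9 or more".
observed = [1, 4, 9, 14, 17, 18, 14, 10, 7, 6]
n = sum(observed)      # number of sample units
mu_hat = 5.0           # hypothetical fitted mean
n_p = 1                # one fitted parameter for the Poisson distribution

expected = expected_frequencies(n, mu_hat, len(observed))
x2 = x2_statistic(observed, expected)
df = len(observed) - 1 - n_p

print(f"X2 = {x2:.2f} on {df} degrees of freedom")
```

The statistic is zero only when every observed frequency equals its expected frequency, and it grows as the two sets of frequencies diverge.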

Use of the χ² test implies that the observed frequencies can be regarded as following a probability model, with the class probabilities given by p(x_i|θ̂). However, a few conditions must be fulfilled before the results can be analysed by the χ² test:

1. None of the expected frequencies, np(x_i|θ̂), should be small. Traditionally, 5 has been regarded as the minimum value, but some statisticians think that 1 is often large enough. If the tail class frequencies are too small, the classes should be grouped until all classes have acceptable expected frequencies. Some arbitrariness is unavoidable here. Note that if there is grouping, the number of degrees of freedom, df, must be recalculated, because the number of classes, c, has changed.³

2. The number of sample units, n, should not be small. This requirement is connected with the first because if n is small many of the expected frequencies will be small, and too much grouping is not good. Another reason for n to be large (generally, at least 100) is that the estimated parameters and the significance test will be more reliable.

These concepts are illustrated in Exhibit 4.1.

³ There are some theoretical difficulties when classes are grouped relating to degrees of freedom (Chernoff and Lehmann, 1954), but in the present context they should be ignorable.
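The grouping described in condition 1 can be sketched as follows. This is one possible rule, not necessarily the one used by the authors: it pools classes inward from each tail until both end classes reach the minimum expected frequency, which is enough when the expected frequencies rise to a single peak and then fall. The function name, the default threshold of 1 and the example frequencies are all assumptions made for illustration.

```python
def group_tail_classes(observed, expected, min_expected=1.0):
    """Pool tail classes until both end classes have expected >= min_expected.

    Returns new (observed, expected) lists; the inputs are left unchanged.
    """
    obs, exp = list(observed), list(expected)

    # Pool the upper tail into its neighbour while it is too small.
    while len(exp) > 1 and exp[-1] < min_expected:
        last_e, last_o = exp.pop(), obs.pop()
        exp[-1] += last_e
        obs[-1] += last_o

    # Pool the lower tail in the same way.
    while len(exp) > 1 and exp[0] < min_expected:
        first_e, first_o = exp.pop(0), obs.pop(0)
        exp[0] += first_e
        obs[0] += first_o

    return obs, exp

# Hypothetical frequencies for six classes.
observed = [1, 0, 3, 6, 1, 0]
expected = [0.25, 0.5, 3.0, 5.0, 0.75, 0.5]

obs_g, exp_g = group_tail_classes(observed, expected, min_expected=1.0)
n_p = 1                          # fitted parameters (1 for the Poisson)
df = len(exp_g) - 1 - n_p        # recalculated after grouping

print(obs_g, exp_g, df)
```

Grouping leaves the totals of the observed and expected frequencies unchanged; only the number of classes, and therefore df, is reduced.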

Exhibit 4.1. Sampling a random pattern produces a Poisson distribution

A random spatial pattern can be generated as in Fig. 4.1a. Sample units can be defined by superimposing regular grid lines, as in Fig. 4.1b. Sample units of different shapes and sizes can be defined by using different grids. Four types of sample unit were created for the random spatial pattern of 500 points shown in Fig. 4.1a (see Fig. 4.4):

• 25 sample units in a 5 × 5 array (five sample units along each of the horizontal and vertical axes)

• 100 sample units in a 10 × 10 array

• 400 sample units in a 20 × 20 array

• 100 sample units in a 20 × 5 array.
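A pattern of this kind, and the four grids, can be imitated with a short simulation. The sketch below is not the procedure used for the exhibit: it assumes that "random" means points placed independently and uniformly over a unit square, the seed is arbitrary, and the helper name `grid_counts` is hypothetical.

```python
import random
import statistics

def grid_counts(points, nx, ny):
    """Count the points falling in each cell of an nx-by-ny grid on the unit square."""
    counts = [0] * (nx * ny)
    for x, y in points:
        col = min(int(x * nx), nx - 1)   # guard against a coordinate of exactly 1.0
        row = min(int(y * ny), ny - 1)
        counts[row * nx + col] += 1
    return counts

rng = random.Random(1)                   # arbitrary seed, for repeatability only
points = [(rng.random(), rng.random()) for _ in range(500)]

results = {}
for nx, ny in [(5, 5), (10, 10), (20, 20), (20, 5)]:
    counts = grid_counts(points, nx, ny)
    mean = sum(counts) / len(counts)
    variance = statistics.variance(counts)   # sample variance, divisor n - 1
    results[(nx, ny)] = (mean, variance, counts)
    print(f"{nx} x {ny}: n = {len(counts)}, mean = {mean:.2f}, variance = {variance:.2f}")
```

The mean per sample unit is fixed at 500/n by construction. The variances depend on the seed, so they will not reproduce the values reported in the exhibit.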

The means and variances were as follows:

Number, n_x, of sample units along the x-axis    5        10         20         20
Number, n_y, of sample units along the y-axis    5        10         20         5
Grid (n_x × n_y)                                 5 × 5    10 × 10    20 × 20    20 × 5
Number of sample units (n = n_x × n_y)           25       100        400        100
Mean per sample unit                             20       5          1.25       5
Calculated variance per sample unit              29.1     5.88       1.41       6.08

Although not exactly equal to the means, the variances were close to the means for all shapes and sizes of sample unit. The frequency distributions were calculated for each type of sample unit, and the Poisson distribution was fitted to each by maximum likelihood (Fig. 4.5). As noted above, the maximum likelihood estimate of µ for the Poisson distribution is always equal to the mean, m. The χ² significance probabilities, P, were calculated using the grouping criterion that the expected frequencies for the tail classes should be at least equal to 1:

Grid (n_x × n_y)                 5 × 5    10 × 10    20 × 20    20 × 5
X²                               19.9     7.8        6.4        9.8
df = c − 1 − n_p (n_p = 1)       14       9          4          9
Significance probability, P      0.13     0.55       0.17       0.36
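The significance probabilities in the table can be recomputed from the tabulated X² and df. The sketch below uses only the Python standard library; the function `chi2_sf` is a hypothetical helper written for this illustration, based on the power series for the regularized lower incomplete gamma function, and a library routine such as `scipy.stats.chi2.sf` should give the same values. Because X² is printed to one decimal place, the recomputed probabilities can be expected to match the printed ones only to within about 0.01.

```python
import math

def chi2_sf(x2, df):
    """Upper-tail probability that a chi-square variable with df degrees of
    freedom is greater than or equal to x2 (adequate for moderate x2 and df)."""
    if x2 <= 0:
        return 1.0
    a, z = df / 2.0, x2 / 2.0
    # Series: P(a, z) = z^a e^(-z) / Gamma(a) * sum_k z^k / (a (a+1) ... (a+k))
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total and k < 10000:
        term *= z / (a + k)
        total += term
        k += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower

# (grid, X2, df, P as printed in the table)
table = [
    ("5 x 5", 19.9, 14, 0.13),
    ("10 x 10", 7.8, 9, 0.55),
    ("20 x 20", 6.4, 4, 0.17),
    ("20 x 5", 9.8, 9, 0.36),
]

recomputed = {grid: chi2_sf(x2, df) for grid, x2, df, _ in table}
for grid, x2, df, printed in table:
    print(f"{grid}: X2 = {x2}, df = {df}, P = {recomputed[grid]:.3f} (printed {printed})")
```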

Because these significance probabilities are all greater than the standard value, P = 0.05, we would be justified in assuming a Poisson distribution in each case.

Despite the non-significant P-value for the 5 × 5 grid, Fig. 4.5a is not convincing: the frequencies do not really look like the theoretical values. This is mainly because n = 25 is too few sample units for a good test. There are people who are reluctant to accept the results of a statistical significance test if the data do not look convincing. It is unwise to lean too heavily on a statistical crutch. Statistical tests are most useful when they summarize something that is not difficult to swallow.

Fig. 4.4. A random spatial pattern of 500 points, with superimposed grids of sample units. (a) 5 × 5 grid, mean = 20, variance = 29.1; (b) 10 × 10 grid, mean = 5, variance = 5.88; (c) 20 × 20 grid, mean = 1.25, variance = 1.41; (d) 5 × 20 grid, mean = 5, variance = 6.08.

Fig. 4.5. Observed frequency distributions for the four grid types shown in Fig. 4.4, with a Poisson distribution fitted to each, with grouping criterion equal to one. (a) 5 × 5; (b) 10 × 10; (c) 20 × 20; (d) 5 × 20.

We know from experience that pests are often not distributed randomly. They tend to be found in aggregated spatial patterns. There are many biological mechanisms that lead to aggregation: the existence of ‘good’ and ‘bad’ resource patches, settling of offspring close to the parent, mate finding and so on. Likewise, there are many ways in which models for such processes can be combined to obtain probability distributions to account for the effect of spatial aggregation.

From the document SAMPLING AND MONITORING IN CROP PROTECTION: The Theoretical Basis for Developing Practical Decision Guides (pages 79-83)