
3.2 Classification versus Estimation

which, when coupled with practical experience and understanding, can indicate a management action. One key piece of information is whether pest density is above or below a level which is likely to cause unacceptable damage (see Chapter 1). This is in principle a matter of classification. Precise knowledge about the true density is unnecessary, except as a guide as to whether the true density is above or below the critical density. Thus, because the information contained in a precise estimate of density can rarely be fully used, density estimation seems to be less appropriate for decision-making than density classification.

But are estimation and classification really so different? In both procedures, samples must be collected, pest organisms are counted, and on many occasions a sample mean is calculated (i.e. an estimate is found). It is only here that a vital difference is detected: in classification, the goal is to find out if pest density is too high, and an estimate (if it is calculated at all) is only useful as an intermediate step towards that goal. At this point, therefore, we must tie down the differences between sampling for estimation and sampling for classification, and note typical criteria for designing good sampling plans for either purpose.

3.2.1 Estimation

Suppose that we want to relate pest density to crop yield, or we want to determine how the presence of natural enemies influences pest density. In circumstances such as these, we need estimates of pest numbers per sample unit or per area. These estimates – whether high, low or moderate – must be precise enough to give useful results after they have been used in mathematical calculations or models (possibly very complex ones). We want to be sure that the results of our calculations will be acceptable to others and not easily brushed aside as, for example, ‘too vague’. We want to be able to use variances and the Central Limit Theorem, perhaps in sophisticated ways, to delimit how close our sample estimates probably are to the true values.

This problem can usually be solved by relying on the representativeness (no unaccountable bias) and reliability (no effect of uncontrolled variables) of the sample protocol (see Section 2.5), and also on the two important statistical results presented in Chapter 2:

1. The variance of the estimate of the mean, m, based on n sample units is equal to σ²/n and is estimated by V/n.

2. For large n, the distribution of m is normal.

Representativeness and reliability ensure that the sample mean, m, is an unbiased estimate of the true pest density, µ. In other words, the mathematical expectation of m is equal to µ. Because of the two statistical results, the standard deviation of the sample mean is σ/√n, and, if n is large enough, the shape of the distribution of m is known (it is close to normal). These can all be combined to show that the distribution of all hypothetical sample means, m, is centred on the true mean density, µ, and the distribution is shaped like a normal distribution with standard deviation equal to σ/√n, which can be estimated by √(V/n). Because all that we have is one value of m, we want to turn this statistical result around to allow ourselves to infer a value for µ.
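To see these two results concretely, here is a minimal simulation sketch in Python (standard library only; the distribution, seed and sample sizes are illustrative choices, not from the text). Counts per sample unit are drawn from a skewed distribution, yet the hypothetical sample means are centred on µ with standard deviation close to σ/√n:

```python
import random
import statistics

random.seed(1)
mu_true = 5.0   # true mean density (illustrative)
n = 50          # sample units per sample
reps = 10_000   # number of hypothetical samples

# Exponential counts are heavily skewed, yet the sample means behave
# normally, as the Central Limit Theorem predicts.
sample_means = [
    statistics.mean(random.expovariate(1 / mu_true) for _ in range(n))
    for _ in range(reps)
]

# For an exponential distribution, sigma equals the mean, so the
# standard deviation of m should be close to sigma / sqrt(n).
print(statistics.mean(sample_means))   # ~5.0 (unbiased)
print(statistics.stdev(sample_means))  # ~5 / sqrt(50) = 0.707
```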

In Chapter 2 we noted some properties of the normal distribution, one of which is that 95% of it lies within two standard deviations of the mean. Retaining our assumption that the distribution of all hypothetical sample means in the above discussion is normal, then 95% of these means are no more than two standard deviations away from the true mean. This is the basis of what are called confidence intervals. We can assume that our single value m is one of the 95% which are within two standard deviations of the true mean µ, and say that µ must therefore be no further than 2σ/√n from the sample mean, m. We indicate that this is an assumption by saying that the range

m − 2σ/√n < µ < m + 2σ/√n    (3.1)

is a 95% confidence interval. We can replace σ/√n by √(V/n), to obtain

m − 2√(V/n) < µ < m + 2√(V/n)    (3.2)

Confidence intervals are useful ways of describing where true means are. For example, the press often quotes survey results in words such as ‘the percentage support for the governing party is between 37% and 43%, with a 1 in 20 chance of error’. This corresponds to a 95% (95/100 = 1 − 1/20) confidence interval:

37% < true support < 43%

Naturally, we are not stuck with a 95% confidence interval. We can choose any other percentage, but then the factor 2 in Equation 3.2 must be changed accordingly. For example, if a 90% confidence interval were used, 2 would be replaced by 1.645. This result can be checked by looking up tables of the normal distribution. The number 2 is an approximation for the 95% interval; a more exact value is 1.96. The general formula for a 100(1 − α)% confidence interval based on the normal distribution is

m − z_{α/2}σ/√n < µ < m + z_{α/2}σ/√n    (3.3)

where z_{α/2} is defined by the following statement: 100(1 − α)% of a standard normal distribution (i.e. mean = 0, variance = 1) is within z_{α/2} of the mean. A 95% interval has α = 0.05, because if you are unlucky and your estimate (m) happens to be among the 5% which are outside the two standard deviation interval around µ, it could equally well be on either side (Fig. 3.1a): α is the overall probability of ‘getting it wrong’; that is to say that the interval does not contain the true mean.

Convention divides this error probability equally on both sides of the interval (hence α/2 and z_{α/2}), but this is not absolutely necessary. In the extreme case you may decide that you want a one-sided interval, with all the uncertainty put on one side of the estimated value m (Fig. 3.1b).
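As an illustration of how z_{α/2} is obtained in practice, the following Python sketch (standard library only; the values of m, V and n are invented for the example) recovers the two-sided and one-sided values quoted above and builds an interval according to Equation 3.3:

```python
from math import sqrt
from statistics import NormalDist

def z_value(alpha: float, one_sided: bool = False) -> float:
    # Quantile of the standard normal for a 100(1 - alpha)% interval.
    tail = alpha if one_sided else alpha / 2
    return NormalDist().inv_cdf(1 - tail)

print(round(z_value(0.05), 3))                  # 1.96  (two-sided 95%)
print(round(z_value(0.10), 3))                  # 1.645 (two-sided 90%)
print(round(z_value(0.05, one_sided=True), 3))  # 1.645 (one-sided 95%)

# Equation 3.3 with invented numbers: m = 10, V = 60, n = 37.
m, V, n = 10.0, 60.0, 37
half = z_value(0.05) * sqrt(V / n)
print(f"95% CI: {m - half:.2f} < mu < {m + half:.2f}")  # 7.50 < mu < 12.50
```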



3.2.1.1 Choosing the sample size, n, to achieve specified precision

Typically, a confidence interval is used after data have been collected, but if a reasonable estimate, V, of σ² is available beforehand, Equation 3.3 can be turned around to indicate how many sample units are required to achieve a desired precision. The half-width of the interval, substituting V for σ², is

w = z_{α/2}√(V/n)    (3.4)

which can be turned into an expression that defines n:

n = V(z_{α/2}/w)²    (3.5)

If we wished the 95% confidence interval (z_{α/2} = 1.96) to have a half-width equal to 2.5 when V = 60, then n = 60(1.96/2.5)² = 36.9, so 37 sample units would be required.¹
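A short Python sketch of this calculation (the function name is ours, not from the text) reproduces the worked example:

```python
from math import ceil
from statistics import NormalDist

def n_for_half_width(V: float, w: float, alpha: float = 0.05) -> int:
    # Equation 3.5: sample units needed so the 100(1 - alpha)% CI
    # has half-width w, given a prior estimate V of the variance.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(V * (z / w) ** 2)

print(n_for_half_width(V=60, w=2.5))  # 37, as in the worked example
```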

The coefficient of variation (see Table 2.2) is often used to indicate precision after data have been collected. It also can be turned around to indicate how many sample units are required to achieve a desired precision, provided again that a reasonable estimate, V, of σ² is available beforehand. The formula for CV, the coefficient of variation of the sample mean, is (from Table 2.2) as follows:

CV = √(V/n)/µ    (3.6)


¹Strictly speaking, when σ is estimated, z based on the normal distribution should be replaced by a number based on another distribution (the t-distribution). However, if the value of n is large enough to satisfy the Central Limit Theorem, the difference can be ignored.

Fig. 3.1. Areas under the standardized normal distribution (mean = 0, variance = 1) indicated by shading. (a) Two-sided interval, −1.960 to 1.960, contains 95% of the total probability; (b) one-sided interval, less than 1.645, contains 95% of the total probability.

so

n = V(1/(CV × µ))²    (3.7)

For example, to obtain a CV equal to 25% when µ = 5 and V = 60, then n = 60(1/(5 × 0.25))² = 38.4, so 39 sample units would be required.
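The corresponding sketch for Equation 3.7 (again, a hypothetical helper with invented inputs) reproduces this example, and also shows the point made below that the required n changes with µ:

```python
from math import ceil

def n_for_cv(V: float, mu: float, cv: float) -> int:
    # Equation 3.7: sample units needed so the CV of the sample mean
    # equals cv, given prior estimates V and mu.
    return ceil(V * (1 / (cv * mu)) ** 2)

print(n_for_cv(V=60, mu=5, cv=0.25))   # 39, as in the worked example
print(n_for_cv(V=60, mu=10, cv=0.25))  # 10: n falls as mu rises
```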

Equations 3.5 and 3.7 are similar, but a critical difference is in how the required precision is specified. In Equations 3.4 and 3.5, w is a constant, but in Equations 3.6 and 3.7 the precision is relative to µ: a fixed CV corresponds to a half-width proportional to µ. In practical terms, this means that n defined by Equation 3.7 changes as µ changes, but n defined by Equation 3.5 does not. The two methods can be made equivalent if w and CV are both specified either as constants or as proportional to µ.

3.2.2 Classification

Clearly, any procedure which gives an estimate of µ can be used for classification. Either of Equations 3.5 or 3.7 can be used to determine n for a sampling plan, data could be collected and the sample mean obtained. This mean would then be compared with a critical density to make the classification:

µ ≤ cd if m ≤ cd, or µ > cd if m > cd    (3.8)

We can measure how good the classification is by turning around the confidence interval (Equation 3.2) and determining whether m is inside or outside the classification interval

cd ± z_{α/2}√(V/n)    (3.9)

If m is outside the classification interval, the (confidence) probability of the classification (Equation 3.8) being correct is 1 − α, but if m is inside the classification interval, the (confidence) probability of the classification being correct is less than 1 − α. The confidence probability, 1 − α, can be regarded as the probability of getting it right. For pest management purposes, we are interested in the ‘correctness’ of the classification. What we want in general terms is that the sample size, n, is adjusted so that m is outside such a classification interval, because a decision may then be made with confidence.
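A sketch of this classification check in Python (the helper and its numbers are illustrative, not from the text):

```python
from math import sqrt
from statistics import NormalDist

def classify(m: float, cd: float, V: float, n: int, alpha: float = 0.05):
    # Classify density against cd and check Equation 3.9: the decision
    # carries confidence of at least 1 - alpha only if m falls outside
    # the classification interval cd +/- z * sqrt(V/n).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(V / n)
    decision = "intervene" if m > cd else "do not intervene"
    confident = abs(m - cd) > half
    return decision, confident

print(classify(m=9.0, cd=6.0, V=60, n=37))  # ('intervene', True)
print(classify(m=6.5, cd=6.0, V=60, n=37))  # ('intervene', False)
```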

For readers with some knowledge of statistics, it is worth noting that the above is linked to hypothesis testing. If m is outside the classification interval (Equation 3.9), then what is called a statement of statistical significance can be made: µ is significantly different from cd, at the 100α% probability level. What we want for classification is to be able to make a statement such as this whenever we take a sample for decision-making. We should like a sampling plan that automatically tries to adjust n accordingly.

The key distinction between estimation and classification lies in the purpose to which the sample information is directed. With estimation, we begin by specifying




how close we wish an estimate to be to a true value. With classification, we begin by specifying how frequently we wish to make correct classifications, or how frequently we will tolerate incorrect classifications. In both cases, sample sizes can be adjusted to achieve the specified objective. Clearly, concepts pertinent to estimation and classification are related, but the purposes to which these concepts are applied are quite different. In this chapter, we begin to show how concentrating on sampling for classification produces its own optimal values for sample size. These optimal values will turn out to be radically different from those for estimation.

In Chapter 2 we described how classification could be done after collecting a fixed number of sample units and, using the normal distribution, we showed how OC functions could be calculated. Fixed sample size plans have certain advantages: simplicity and the relative assurance that a representative sample can be obtained.

However, when classification with respect to a critical density is the objective, intuition suggests that fewer samples are needed to make a correct classification when pest abundance is very different from the critical density than when it is close to it. Reducing the number of samples required to make a correct classification, when the data allow it, reduces sampling costs, which is obviously an important consideration.

Statisticians have proposed ways of cutting short sample collection if the evidence overwhelmingly supports one of the possible classifications. The simplest approach is to take one preliminary sample to get a quick idea of the situation, and then take a second one if more information is needed. This process is called double sampling. It may appear to be an obvious procedure, and it is, but working out the probabilities of decision (the operating characteristic (OC) function) and the corresponding average sample number (ASN) function is not easy.

A simple double sampling plan is to take an initial sample (‘batch’) of n_B sample units, and calculate the sample mean, m, and the ends of a classification interval (Equation 3.9) around the critical density (cd):

L = cd − z_{α/2}√(V/n_B),  U = cd + z_{α/2}√(V/n_B)    (3.10)

If the sample mean is greater than U, a decision is made to intervene, and if it is below L, the decision is not to intervene. If it is between these two values, a second sample of n_B units is taken and the sample mean for all 2n_B units is calculated, and
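A sketch of this plan in Python (the names and counts are invented; the final decision rule after the second batch is our assumption, since the excerpt breaks off before stating it):

```python
from math import sqrt
from statistics import NormalDist, mean

def double_sample(batch1, take_batch2, cd, V, alpha=0.05):
    # batch1: counts from the first n_B sample units.
    # take_batch2: callable returning a further n_B counts if needed.
    n_B = len(batch1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(V / n_B)
    L, U = cd - half, cd + half  # Equation 3.10

    m = mean(batch1)
    if m > U:
        return "intervene", n_B          # evidence already decisive
    if m < L:
        return "do not intervene", n_B

    # Inconclusive: take the second batch and pool all 2 n_B units.
    # Comparing the pooled mean with cd is an assumption here.
    pooled = list(batch1) + list(take_batch2())
    m2 = mean(pooled)
    return ("intervene" if m2 > cd else "do not intervene"), 2 * n_B

# Illustrative use with invented counts:
print(double_sample([4, 7, 5, 9, 6], lambda: [8, 5, 7, 6, 9], cd=6.0, V=60))
```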
