3.7 Outliers
Every experimentalist is familiar with the situation in which one (or possibly more than one) measurement in a set of results appears to differ unexpectedly from the others. In some cases the suspect result may be attributed to a human error. For example, if the following results were given for a titration:

12.12, 12.15, 12.13, 13.14, 12.12 ml

the fourth value is almost certainly due to a slip in writing down the result and should be 12.14. However, even when such clearly erroneous values have been removed or corrected (and other obvious problems such as equipment failure taken into account) suspect data may still occur. (They may also arise in proficiency testing schemes and method performance studies (see Chapter 4) and regression problems (see Chapter 5).) Should such suspect values be retained, come what may, or should methods be found to decide whether or not they should be rejected as outliers? The calculated mean and standard deviation values, which are used to evaluate and compare the precision and accuracy of analytical methods, will obviously depend on whether or not any suspect measurements are rejected, so the occurrence of suspect data, and the ways in which they are treated, must always be described carefully.

The essence of the problem is clear. We saw in Chapter 2 that if the data come from a population with a normal error distribution there is a finite chance that a single value in a set of replicates will be a long way from the mean, even if everything is in order. To reject such a value wrongly might result in the value of the mean being shifted and the standard deviation being too small. On the other hand, if a measurement is a genuine outlier it would be wrong to retain it, otherwise the results would be unnecessarily pessimistic. In broad terms, suspect values can be tackled in three ways. Two of these approaches, median-based statistics and robust statistics, are treated separately in Chapter 6. Here we consider only the application of significance testing, probably still the most common approach to the problem of suspect measurements.

The ISO recommended test for outliers is Grubbs’ test. This test compares the deviation of the suspect value from the sample mean with the standard deviation of the sample. The suspect value is naturally the value that is furthest away from the mean.
In order to use Grubbs’ test for an outlier, i.e. to test the null hypothesis, H0, that all measurements come from the same population, the statistic G is calculated:

G = |suspect value − x̄|/s        (3.7.1)

Note that x̄ and s are calculated with the suspect value included, as H0 presumes that there are no outliers. The test assumes that the population has a normal error distribution.
50 3: Significance tests
The critical values of G for P = 0.05 are given in Table A.5. If the calculated value of G exceeds the critical value, the suspect value is rejected. The values given are for a two-sided test, which is appropriate when it is not known in advance at which extreme of the data range an outlier may occur.
Example 3.7.1
The following values were obtained for the nitrite concentration (mg l⁻¹) in a sample of river water:

0.403, 0.410, 0.401, 0.380

The last measurement is noticeably lower than the others and is thus suspect: should it be rejected?

The four values have x̄ = 0.3985 and s = 0.01292, giving from Eq. (3.7.1):

G = |0.380 − 0.3985|/0.01292 = 1.432

From Table A.5, for sample size 4, the critical value of G (P = 0.05) is 1.481. Since the calculated value of G does not exceed 1.481, the suspect measurement should be retained.
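The Grubbs’ calculation in this example is easy to reproduce; a minimal Python sketch (with the Table A.5 critical value hard-coded, since the table itself is in the Appendix) might look like this:

```python
# Grubbs' test (Eq. 3.7.1) applied to the nitrite data of Example 3.7.1.
# The critical value 1.481 (n = 4, P = 0.05, two-sided) is from Table A.5.
from statistics import mean, stdev

data = [0.403, 0.410, 0.401, 0.380]
x_bar = mean(data)
s = stdev(data)                                    # n - 1 divisor, as in Chapter 2
suspect = max(data, key=lambda x: abs(x - x_bar))  # value furthest from the mean
G = abs(suspect - x_bar) / s
print(suspect, round(G, 3))                        # 0.38 1.432
```

Since 1.432 < 1.481, H0 is retained and the value 0.380 is kept.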
In fact, the suspect value in this data set would have to be considerably lower before it was rejected. Trial and error shows that for the data set

0.403, 0.410, 0.401, b

(where b < 0.401), the value of b would have to be as low as 0.356 before it was rejected using the Grubbs’ test criterion.

Ideally, further measurements should be made when a suspect value occurs, especially if only a few values have been obtained initially: when the sample size is small, one measurement must be very different from the rest before it can safely be rejected. Extra values may make it clearer whether or not the suspect value should be rejected, and will also reduce its effect on the mean and standard deviation if it is retained.
Example 3.7.2
Three further measurements were added to those given in the example above, so that the complete results became:

0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.408

Should the value 0.380 still be retained? The seven values have x̄ = 0.4021 and s = 0.01088. The calculated value of G is now:

G = |0.380 − 0.4021|/0.01088 = 2.031

The critical value of G (P = 0.05) for a sample size 7 is 2.020, so the suspect measurement is now (just) rejected at the 5% significance level.
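Repeating the sketch with the enlarged data set shows the effect of the extra values (full floating-point precision gives G ≈ 2.034; the value 2.031 above comes from using the rounded x̄ and s, and either way G exceeds 2.020):

```python
# Grubbs' test for the seven measurements of Example 3.7.2.
from statistics import mean, stdev

data = [0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.408]
G = abs(0.380 - mean(data)) / stdev(data)
print(round(G, 3), G > 2.020)   # 2.034 True -> 0.380 is (just) rejected
```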
Dixon’s test (sometimes called the Q-test) is another test for outliers, popular because the calculation is so simple. For small samples (size 3 to 7) the test assesses a suspect measurement by comparing the difference between it and the measurement nearest to it in size with the range of the measurements. (For larger samples the form of the test is modified slightly. The texts by Barnett and Lewis and by Ellison et al. listed in the Bibliography at the end of this chapter give further details.)
In order to use Dixon’s test for an outlier, that is to test H0: all measurements come from the same population, the statistic Q is calculated:

Q = |suspect value − nearest value|/(largest value − smallest value)        (3.7.2)

This test again assumes that the population has a normal error distribution.
Example 3.7.3
Apply Dixon’s test to the data from Example 3.7.2. Using Eq. (3.7.2) we have:

Q = |0.380 − 0.400|/(0.413 − 0.380) = 0.606

The critical value of Q (P = 0.05) for a sample size 7 is 0.570. The suspect value 0.380 is thus rejected (as it was using Grubbs’ test).
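Dixon’s Q is even simpler to compute; a sketch for this example (with the Table A.6 critical value 0.570 hard-coded for illustration):

```python
# Dixon's Q-test (Eq. 3.7.2) for the data of Example 3.7.3.
data = sorted([0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.408])
suspect, nearest = data[0], data[1]                # 0.380 is the low extreme
Q = abs(suspect - nearest) / (data[-1] - data[0])  # range = largest - smallest
print(round(Q, 3), Q > 0.570)                      # 0.606 True -> rejected
```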
The critical values of Q for P = 0.05 for a two-sided test are given in Table A.6. If the calculated value of Q exceeds the critical value the suspect value is rejected.

Although taking extra measurements can be helpful, as shown above, it is of course possible that the new values may include further suspect measurements! Moreover, if two suspect values occur, both of them might be at the high end of the measurement range, both at the low end, or one at the high end and one at the low end. This can complicate the use of the tests. Figure 3.1 illustrates in the form of dot-plots two examples of such difficulties in the application of the Dixon test. In Fig. 3.1(a) there are two results (2.9, 3.1), both of which are suspiciously high compared with the mean of the data, yet if Q were calculated uncritically using Eq. (3.7.2) we would obtain a value which is not significant at the P = 0.05 level, suggesting that the suspect values should be retained. Clearly the possible outlier 3.1 has been masked by the other possible outlier, 2.9, giving a low value of Q. A different situation is shown in Fig. 3.1(b), where the two suspect values are at opposite ends of the data set. This gives a large value for the range, so Q is small and again the null hypothesis might be wrongly retained. Handling multiple outliers clearly increases the complexity of outlier testing, though Grubbs’ test has been modified to handle data sets which contain two suspect measurements. These modifications use different formulae for G, depending on whether the suspect measurements occur at the same or opposite ends of the range of values, and two further tables of critical values are thus required. Again, the texts by Barnett and Lewis and by Ellison et al. listed in the Bibliography give further details.
A more fundamental concern over these outlier tests is that they assume that the sample comes from a population with a normal error distribution. A result that seems to be an outlier on the assumption of such a distribution may well not be an outlier if the sample actually comes from (for example) a log-normal distribution (Section 2.3). Therefore outlier tests should not be used if there is a suspicion that the population may not have a normal error distribution. Moreover, there are cases where different outlier tests lead to opposite conclusions. The reader can verify that if the suspect measurement in Examples 3.7.2 and 3.7.3 had been 0.382 rather than 0.380, the Dixon test would have recommended its rejection, while Grubbs’ test would have recommended its retention. Such conclusions must obviously be regarded with extreme caution. These difficulties, along with the complications arising in cases of multiple outliers, explain the increasing use of the methods described in Chapter 6, especially the robust methods. Such approaches either are insensitive to extreme values or at least give them less weight in calculations, so the problem of whether or not to reject outliers is avoided.
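The disagreement just mentioned is easy to verify numerically: with the suspect value changed to 0.382, Dixon’s Q just exceeds its critical value while Grubbs’ G falls below its own. A sketch, with both critical values hard-coded from Tables A.5 and A.6:

```python
# With 0.382 in place of 0.380 the two tests disagree (n = 7, P = 0.05;
# critical values: 2.020 for G from Table A.5, 0.570 for Q from Table A.6).
from statistics import mean, stdev

data = sorted([0.403, 0.410, 0.401, 0.382, 0.400, 0.413, 0.408])
G = abs(data[0] - mean(data)) / stdev(data)     # Grubbs' statistic
Q = (data[1] - data[0]) / (data[-1] - data[0])  # Dixon's Q, low-extreme suspect
print(G > 2.020, Q > 0.570)                     # False True -> tests disagree
```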