large improvement in the quality and acceptability of many analytical results. Some
4.11 Proficiency testing schemes
The quality of analytical measurements is enhanced by two types of testing scheme, in each of which a number of laboratories might participate simultaneously. The more common are proficiency testing (PT) schemes, in which aliquots from homogeneous materials are circulated to a number of laboratories for analysis at regular intervals (every few weeks or months), and the resulting data are reported to a central organiser.
The circulated material is designed to resemble as closely as possible the samples normally submitted for analysis in the relevant field of application, and each laboratory analyses its portion using its own usual method. The results of all the analyses are circulated to all the participants, who thus gain information on how their measurements compare with those of others, how their own measurements improve or deteriorate with time, and how their own measurements compare with an external quality standard. In short, the aim of such schemes is the evaluation of the competence of analytical laboratories.
PT schemes have now been developed for use in a wide range of application fields, including several areas of clinical chemistry, the analysis of water and various types of food and drink, forensic analysis, and so on. Experience shows that in such schemes widely divergent results will arise, even between experienced, well-equipped and well-staffed laboratories. In one of the commonest of clinical analyses, the determination of blood glucose at the mM level, most of the results obtained for a single blood sample approximated to a normal distribution with values between 9.5 and 12.5 mM, in itself a not inconsiderable range. But the complete range of results was from 6.0 to 14.5 mM, i.e. some laboratories obtained values almost 2.5 times those of others. The worrying implications of this discrepancy in clinical diagnosis are obvious. In more difficult areas of analysis the results can be so divergent that there is no real consensus between different laboratories. The importance of PT schemes in highlighting such alarming differences, and in helping to minimise them by encouraging laboratories to compare their performance, is very clear, and they have unquestionably helped to improve the quality of analytical results in many fields.
Here we are concerned only with the statistical evaluation of the design and results of such schemes, and not with the administrative details of their organisation. Of particular importance are the methods of assessing participants’ performance, and the need to ensure that the circulated aliquots are taken from a homogeneous bulk sample.
The recommended method for verifying the homogeneity of the sample involves taking $n \geq 10$ portions of the test material at random, separately homogenising them if necessary, taking two test samples from each portion, and analysing the $2n$ test samples by a method whose standard deviation under repeatability conditions is not more than 30% of the target standard deviation (i.e. the expected reproducibility, see below) of the proficiency test. If the homogeneity is satisfactory, one-way analysis of variance should then show that the between-sample mean square is not significantly greater than the within-sample mean square (see Section 4.3).
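The homogeneity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the duplicate results are invented, the function names are hypothetical, and the critical value of F (3.02 for 9 and 10 degrees of freedom, one-sided, P = 0.05) is assumed to have been read from tables rather than computed.

```python
from statistics import mean

def homogeneity_f(duplicates):
    """Return (between-sample MS, within-sample MS, F) for duplicate results.

    `duplicates` holds one (a, b) pair of results per portion of test material.
    """
    n = len(duplicates)
    portion_means = [mean(pair) for pair in duplicates]
    grand_mean = mean(portion_means)
    # Between-sample mean square: 2 results per portion, n - 1 degrees of freedom.
    ms_between = 2 * sum((m - grand_mean) ** 2 for m in portion_means) / (n - 1)
    # Within-sample mean square: for duplicates this is sum((a - b)^2) / (2n),
    # with n degrees of freedom.
    ms_within = sum((a - b) ** 2 for a, b in duplicates) / (2 * n)
    return ms_between, ms_within, ms_between / ms_within

def is_homogeneous(duplicates, f_critical):
    """True if the between-sample MS is not significantly greater (one-sided)."""
    return homogeneity_f(duplicates)[2] <= f_critical

# Hypothetical duplicate results for n = 10 portions (illustrative only).
duplicates = [
    (10.5, 10.4), (9.6, 9.5), (10.4, 9.9), (9.5, 9.9), (10.0, 9.7),
    (9.6, 10.1), (9.8, 10.4), (9.8, 10.2), (10.8, 10.7), (10.2, 10.0),
]
F_CRIT = 3.02  # assumed tabulated value: F(9, 10), one-sided, P = 0.05

ms_b, ms_w, f_ratio = homogeneity_f(duplicates)
print(f"between MS = {ms_b:.4f}, within MS = {ms_w:.4f}, F = {f_ratio:.2f}")
print("homogeneous" if is_homogeneous(duplicates, F_CRIT) else "not homogeneous")
```

The test is one-sided because only a between-sample mean square that is larger than the within-sample mean square would indicate inhomogeneity.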
The results obtained by the laboratories participating in a PT scheme are most commonly expressed as z-scores, where z is given by (see Section 2.2):
$$z = \frac{x - x_a}{\sigma} \qquad (4.11.1)$$
92 4: The quality of analytical measurements
In this equation the x-value is the result obtained by a single laboratory for a given analysis, $x_a$ is the assigned value for the level of the analyte, and $\sigma$ is the target value for the standard deviation of the test results. The assigned value $x_a$ can be obtained by using a certified reference material (CRM), if one is available and suitable for distribution. If this is not feasible, the relevant ISO standard (which also provides many numerical examples; see Bibliography) recommends three other possible approaches. In order of decreasing rigour these are (i) a reference value obtained from one laboratory, by comparing random samples of the test material against a CRM; (ii) a consensus value obtained by analysis of random samples of the test material by expert laboratories; and (iii) a consensus value obtained from all the participants in a given round of the scheme. This last situation is of interest since, when many laboratories participate in a given PT scheme, there are bound to be a number of suspect results or outliers in an individual test. (Although many PT schemes provide samples and reporting facilities for more than one analyte, experience shows that a laboratory that scores well in one specific analysis does not necessarily score well in others.) This problem is overcome by the use of either the median, which is especially recommended for small data sets ($n < 10$), a robust mean, or the mid-point of the inter-quartile range. All these measures of location avoid or address the effects of dubious results (see Chapter 6). It is also recommended that the uncertainty of the assigned value is reported to participants in the PT scheme. This also may be obtained from the results from expert laboratories: estimating uncertainty is covered in more detail below (Section 4.13).
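As an illustration of approach (iii), the following sketch takes the median of the participants' results as the consensus assigned value and converts each result to a z-score. The results and the target standard deviation are invented for the example, and the function name `z_scores` is hypothetical.

```python
from statistics import mean, median

def z_scores(results, assigned_value, target_sd):
    """z = (x - x_a) / sigma for each laboratory result x."""
    return [(x - assigned_value) / target_sd for x in results]

# Hypothetical results (mM) from seven laboratories; 14.5 is a suspect value.
results = [10.2, 10.9, 11.4, 9.8, 11.0, 14.5, 10.6]
sigma = 0.75  # hypothetical target standard deviation, circulated in advance

x_a = median(results)  # consensus value, little affected by the suspect result
print(f"median = {x_a}, mean = {mean(results):.2f}")
for x, z in zip(results, z_scores(results, x_a, sigma)):
    print(f"x = {x:5.1f}  z = {z:+.2f}")
```

With a small data set such as this one, a single extreme result shifts the mean noticeably but leaves the median unchanged, which is why the median is preferred as the consensus value.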
The target value for the standard deviation, $\sigma$, should be circulated in advance to the PT scheme participants along with a summary of the method by which it has been established. It will vary with analyte concentration, and one approach to estimating it is to use a functional relationship between concentration and standard deviation. The best known relationship is the Horwitz trumpet, dating from 1982, so-called because of its shape. Using many results from collaborative trials, Horwitz showed that the relative standard deviation of a method varied with the concentration, c, expressed as a mass ratio (e.g. mg g$^{-1}$ = 0.001), according to the approximate and empirical equation:
$$\mathrm{RSD} = \pm 2^{(1 - 0.5\log_{10} c)} \qquad (4.11.2)$$
This equation leads to the trumpet-shaped curve shown in Fig. 4.10, which can be used to derive target values of $\sigma$ for any analysis. Such values can also be estimated from prior knowledge of the standard deviations usually achieved in the analysis in question. Another approach uses fitness for purpose criteria: if the results of the analysis, used routinely, require a certain precision for the data to be interpreted properly and usefully, or to fulfil a legislative requirement, that precision provides the largest (worst) acceptable value of $\sigma$. It is poor practice to estimate $\sigma$ from the results of previous rounds of the PT scheme itself, as this would conceal any improvement or deterioration in the quality of the results with time.
The results of a single round of a PT scheme are frequently summarised as shown in Fig. 4.11. If the results follow a normal distribution with mean $x_a$ and standard deviation $\sigma$, the z-scores will be a sample from the standard normal distribution, i.e. a normal distribution with mean zero and variance 1. Thus a laboratory with a $|z|$ value of less than 2 is generally regarded as having performed satisfactorily, a $|z|$ value between 2 and 3 is questionable, and $|z|$ values greater than 3 are unacceptable. Of course even the laboratories with satisfactory scores will strive to improve their values in the subsequent rounds of the PT scheme. In practice it is not uncommon to find 'heavy-tailed' distributions, i.e. more results than expected with $|z| > 2$.
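The interpretation of z-scores can be made concrete with a short sketch. It classifies a score using the limits of 2 and 3 given above, and calculates the proportion of results expected beyond each limit if the z-scores really were standard normal. The function names are hypothetical, and the treatment of scores exactly equal to 2 or 3 (counted here as questionable) is an assumption, since the text does not specify it.

```python
import math

def classify(z):
    """Classify a z-score using the conventional limits of 2 and 3."""
    size = abs(z)
    if size < 2:
        return "satisfactory"
    if size <= 3:  # assumption: boundary values are treated as questionable
        return "questionable"
    return "unacceptable"

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

for z in (0.4, -2.5, 3.6):
    print(f"z = {z:+.1f}: {classify(z)}")

print(f"expected fraction with |z| > 2: {two_sided_tail(2):.4f}")
print(f"expected fraction with |z| > 3: {two_sided_tail(3):.4f}")
```

Under normality only about 4.6% of results should have $|z| > 2$ and about 0.3% should have $|z| > 3$; a round in which noticeably larger fractions exceed these limits is an example of the heavy-tailed behaviour mentioned above.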
Some value has been attached to methods of combining z-scores. For example the results of one laboratory in a single PT scheme over a single year might be combined (though this would mask any improvement or deterioration in performance over the year). If the same analytical method is applied to different concentrations of the same analyte in each round of the same PT scheme, again a composite score might have limited value. Two functions used for this purpose are the re-scaled sum of z-scores (RSZ), and the sum of squared z-scores (SSZ), given by RSZ = $\sum_i z_i / \sqrt{n}$ and SSZ = $\sum_i z_i^2$ respectively. Each of these functions has disadvantages, and the use of combined z-scores is not to be recommended. In particular, combining scores from the results of different analyses is dangerous, as very high and very low z-scores might then cancel out to give a falsely optimistic result.
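The cancellation problem is easily demonstrated. The sketch below assumes the definitions RSZ = $\sum_i z_i / \sqrt{n}$ and SSZ = $\sum_i z_i^2$; the z-scores are invented and the function names are hypothetical.

```python
import math

def rsz(z_values):
    """Re-scaled sum of z-scores: sum(z_i) / sqrt(n)."""
    return sum(z_values) / math.sqrt(len(z_values))

def ssz(z_values):
    """Sum of squared z-scores: sum(z_i ** 2)."""
    return sum(z * z for z in z_values)

# Hypothetical z-scores: two are unacceptable (|z| > 3) but of opposite sign.
scores = [3.5, -3.5, 0.2, -0.2]

print(f"RSZ = {rsz(scores):.2f}")
print(f"SSZ = {ssz(scores):.2f}")
```

Here the RSZ is zero although two of the four individual results are unacceptable. The SSZ does not suffer from cancellation, but it hides the signs of the individual scores and so cannot reveal a consistent bias.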
Figure 4.10 The Horwitz trumpet. [Plot of % relative standard deviation against concentration (c), from 100% through 0.1%, 1 ppm and 1 ppb to 1 ppt, on a log10 c scale.]
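The upper branch of the trumpet can be reproduced numerically. The sketch below assumes that the Horwitz relationship is RSD(%) = $2^{(1 - 0.5\log_{10} c)}$ with c expressed as a mass ratio, and that a target standard deviation is obtained by applying this RSD to the analyte level; the function names are hypothetical.

```python
import math

def horwitz_rsd_percent(c):
    """Predicted relative standard deviation (%) at mass ratio c (0 < c <= 1)."""
    return 2 ** (1 - 0.5 * math.log10(c))

def horwitz_target_sd(level, c):
    """Target standard deviation in the same units as `level`.

    `level` is the analyte level in the reporting units; `c` is the same
    level expressed as a mass ratio (an assumption of this sketch).
    """
    return level * horwitz_rsd_percent(c) / 100

# Concentrations marked on the horizontal axis of the trumpet plot.
for label, c in [("100%", 1.0), ("0.1%", 1e-3), ("1 ppm", 1e-6),
                 ("1 ppb", 1e-9), ("1 ppt", 1e-12)]:
    print(f"{label:>6}: RSD = +/-{horwitz_rsd_percent(c):.1f}%")
```

On this model the predicted RSD doubles for every hundred-fold decrease in concentration, from 2% for a pure material to 16% at the 1 ppm level.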
Figure 4.11 Summary of results of a single PT round. [Plot of z against laboratory.]