2.2 The distribution of repeated measurements
Although the standard deviation gives a measure of the spread of a set of results about the mean value, it does not indicate the shape of the distribution. To illustrate this we need quite a large number of measurements such as those in Table 2.1. This gives the results (to two significant figures) of 50 replicate determinations of the levels of nitrate ion, a potentially harmful contaminant, in a particular water specimen.
These results can be summarised in a frequency table (Table 2.2). This table shows that, in Table 2.1, the value 0.46 μg ml⁻¹ appears once, the value 0.47 μg ml⁻¹ appears three times, and so on. The reader can check that the mean of these results is 0.500 μg ml⁻¹ and the standard deviation is 0.0165 μg ml⁻¹ (both values being given to three significant figures). The distribution of the results can most easily be appreciated by drawing a histogram as in Fig. 2.1. This shows that the distribution of the measurements is roughly symmetrical about the mean, with the measurements clustered towards the centre.
Table 2.2 Frequency table for measurements of nitrate ion concentration

Nitrate ion concentration (μg ml⁻¹)    Frequency
0.46                                    1
0.47                                    3
0.48                                    5
0.49                                   10
0.50                                   10
0.51                                   13
0.52                                    5
0.53                                    3
Figure 2.1 Histogram of the nitrate ion concentration data in Table 2.2 (horizontal axis: nitrate ion concentration, μg ml⁻¹; vertical axis: frequency).
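The mean and standard deviation quoted above can be checked directly from the frequency table. The following is a minimal Python sketch, not part of the original text: the names `frequency` and `summarise` are chosen here purely for illustration, and the standard deviation is computed with the (n − 1) denominator.

```python
import math

# Table 2.2: nitrate ion concentration (ug/ml) -> number of times it occurs
frequency = {0.46: 1, 0.47: 3, 0.48: 5, 0.49: 10,
             0.50: 10, 0.51: 13, 0.52: 5, 0.53: 3}

def summarise(freq):
    """Return (n, mean, sample standard deviation) for a frequency table."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    # Sum of squared deviations from the mean, weighted by frequency
    ss = sum(f * (x - mean) ** 2 for x, f in freq.items())
    sd = math.sqrt(ss / (n - 1))  # (n - 1) in the denominator
    return n, mean, sd

n, mean, sd = summarise(frequency)
print(n, round(mean, 3), round(sd, 4))  # 50 0.5 0.0165
```

Working from the frequency table rather than the 50 individual results gives the same answer, because each distinct value is simply weighted by the number of times it occurs.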
This set of 50 measurements is a sample from the theoretically infinite number of measurements which we could make of the nitrate ion concentration. The set of all possible measurements is called the population. If there are no systematic errors, then the mean of this population, given the symbol μ, is the true value of the nitrate ion concentration we are trying to determine. The mean of the sample, x̄, gives us an estimate of μ. Similarly, the population has a standard deviation, denoted by $\sigma = \sqrt{\sum_i (x_i - \mu)^2 / n}$. The standard deviation, s, of the sample gives us an estimate of σ. Use of Eq. (2.1.2) gives us an unbiased estimate of σ. If n, rather than (n − 1), is used in the denominator of the equation the value of s obtained tends to underestimate σ (see p. 19 above).
The distinction between populations and samples is fundamental in statistics:
the properties of populations have Greek symbols, samples have English symbols.
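The effect of the denominator on the sample standard deviation can be seen numerically. The sketch below is an illustration only, under the assumption that the 50 results are rebuilt from the frequencies in Table 2.2; it uses the Python standard library functions `statistics.stdev` (denominator n − 1) and `statistics.pstdev` (denominator n), and the variable names are hypothetical.

```python
import math
import statistics

# Table 2.2: nitrate ion concentration (ug/ml) -> frequency
frequency = {0.46: 1, 0.47: 3, 0.48: 5, 0.49: 10,
             0.50: 10, 0.51: 13, 0.52: 5, 0.53: 3}

# Expand the table back into the 50 individual results
data = [x for x, f in frequency.items() for _ in range(f)]

s_n_minus_1 = statistics.stdev(data)   # divides by (n - 1)
s_n = statistics.pstdev(data)          # divides by n

print(round(s_n_minus_1, 5), round(s_n, 5))
# The two results always differ by the factor sqrt((n - 1) / n)
print(round(s_n / s_n_minus_1, 5), round(math.sqrt(49 / 50), 5))
```

With n = 50 the two values differ by only about 1%, but the value obtained with n in the denominator is always the smaller of the two, and the difference grows as n falls.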
The nitrate ion concentrations given in Table 2.2 have only certain discrete values, because of the limitations of the method of measurement. In theory a concentration could take any value, so a continuous curve is needed to describe the form of the population from which the sample was taken. The mathematical model usually used is the normal or Gaussian distribution, which is described by the equation:

$$y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \qquad (2.2.1)$$
Figure 2.2 The normal distribution, $y = \exp[-(x - \mu)^2/2\sigma^2]/\sigma\sqrt{2\pi}$. The mean is indicated by μ.
where x is the measured value, and y the frequency with which it occurs. The shape of this distribution is shown in Fig. 2.2. There is no need to remember this complicated formula, but some of its general properties are important. The curve is symmetrical about μ and the greater the value of σ the greater the spread of the curve, as shown in Fig. 2.3. More detailed analysis shows that, whatever the values of μ and σ, the normal distribution has the following properties.
For a normal distribution with mean μ and standard deviation σ:
• approximately 68% of the population values lie within ±1σ of the mean;
• approximately 95% of population values lie within ±2σ of the mean;
• approximately 99.7% of population values lie within ±3σ of the mean.
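These three percentages can be reproduced numerically. For a normal distribution the proportion of values within ±k standard deviations of the mean equals erf(k/√2), where erf is the error function. The Python sketch below uses this identity; the function name `coverage` is an assumption made for this illustration.

```python
import math

def coverage(k):
    """Proportion of a normal population within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on the particular mean or standard deviation, which is why the properties hold for every normal distribution.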
Figure 2.3 Normal distributions with the same mean but different values of the standard deviation (curves labelled SD = σ1 and SD = σ2, with σ1 > σ2).

Figure 2.4 Properties of the normal distribution: (i) approximately 68% of values lie within ±1σ of the mean; (ii) approximately 95% of values lie within ±2σ of the mean; (iii) approximately 99.7% of values lie within ±3σ of the mean.

These properties are illustrated in Fig. 2.4. This would mean that, if the nitrate ion concentrations (in μg ml⁻¹) given in Table 2.2 are normally distributed, then about 68% should lie in the range 0.483–0.517, about 95% in the range 0.467–0.533 and 99.7% in the range 0.450–0.550. In fact 33 out of the 50 results (66%) lie between 0.483 and 0.517, 49 (98%) between 0.467 and 0.533, and all the results between 0.450 and 0.550, so the agreement with theory is fairly good.
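The counts just quoted can be checked against the frequency table. The sketch below is an illustration only: it takes the mean (0.500) and standard deviation (0.0165) as given in the text, and the helper name `count_within` is hypothetical.

```python
# Table 2.2: nitrate ion concentration (ug/ml) -> frequency
frequency = {0.46: 1, 0.47: 3, 0.48: 5, 0.49: 10,
             0.50: 10, 0.51: 13, 0.52: 5, 0.53: 3}

mean, sd = 0.500, 0.0165  # values quoted in the text

def count_within(freq, centre, half_width):
    """Number of results lying within centre +/- half_width."""
    return sum(f for x, f in freq.items() if abs(x - centre) <= half_width)

n = sum(frequency.values())
counts = [count_within(frequency, mean, k * sd) for k in (1, 2, 3)]
for k, c in zip((1, 2, 3), counts):
    print(f"within {k} SD: {c} of {n} ({100 * c / n:.0f}%)")
# within 1 SD: 33 of 50 (66%)
# within 2 SD: 49 of 50 (98%)
# within 3 SD: 50 of 50 (100%)
```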
For a normal distribution with known mean, μ, and standard deviation, σ, the exact proportion of values which lie within any interval can be found from tables, provided that the values are first standardised so as to give z-values. (These are widely used in proficiency testing schemes; see Chapter 4.) This is done by expressing any value of x in terms of its deviation from the mean in units of the standard deviation, σ. That is:

$$\text{Standardised normal variable, } z = \frac{x - \mu}{\sigma} \qquad (2.2.2)$$
Table A.1 (Appendix 2) gives the proportions of values, F(z), that lie below a given value of z. F(z) is called the standard normal cumulative distribution function.
For example the proportion of values below z = 2 is F(2) = 0.9772 and the proportion of values below z = −2 is F(−2) = 0.0228. Thus the exact proportion of measurements lying within two standard deviations of the mean is 0.9772 − 0.0228 = 0.9544.
In practice there is considerable variation in the formatting of tables for calculating proportions from z-values. Some tables give only positive z-values, and the proportions for negative z-values then have to be deduced using considerations of symmetry. F(z) values are also provided by Excel® and Minitab®.
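F(z) can also be evaluated without tables in a general-purpose language. The following Python sketch is one possible illustration, using the `NormalDist` class from the standard library `statistics` module (Python 3.8 or later); the helper names `z_value` and `F` are assumptions made here to mirror the notation of the text.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def z_value(x, mu, sigma):
    """Standardised normal variable, z = (x - mu) / sigma."""
    return (x - mu) / sigma

def F(z):
    """Proportion of values lying below z (standard normal cumulative distribution function)."""
    return standard_normal.cdf(z)

print(round(F(2), 4), round(F(-2), 4))   # 0.9772 0.0228
# Symmetry: the proportion below -z equals the proportion above +z
print(abs(F(-1.3) - (1 - F(1.3))) < 1e-12)
```

Subtracting the unrounded values gives F(2) − F(−2) = 0.9545 to four decimal places; the figure of 0.9544 obtained from four-figure tables differs only through rounding of the tabulated values.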
Although it cannot be proved that replicate values of a single analytical quantity are always normally distributed, there is considerable evidence that this assumption is generally at least approximately true. Moreover we shall see when we come to look at sample means that any departure of a population from normality is not usually important in the context of the statistical tests most frequently used.
The normal distribution is not only applicable to repeated measurements made on the same specimen. It also often fits the distribution of results obtained when the same quantity is measured for different materials from similar sources. For example if we measured the concentration of albumin in blood sera taken from healthy adult humans we would find the results were approximately normally distributed.
Example 2.2.1
If repeated values of a titration are normally distributed with mean 10.15 ml and standard deviation 0.02 ml, find the proportion of measurements which lie between 10.12 ml and 10.20 ml.
Standardising the lower limit of the range gives z = (10.12 − 10.15)/0.02 = −1.5.
From Table A.1, F(−1.5) = 0.0668.
Standardising the upper limit of the range gives z = (10.20 − 10.15)/0.02 = 2.5.
From Table A.1, F(2.5) = 0.9938.
Thus the proportion of values between x = 10.12 and x = 10.20 (corresponding to z = −1.5 to 2.5) is 0.9938 − 0.0668 = 0.927.
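The same calculation can be sketched in Python as a check on the table look-ups. This is an illustration rather than part of the original example: it assumes the standard library `statistics.NormalDist` class, and the variable names are chosen here for readability.

```python
from statistics import NormalDist

# Titration values assumed normal with mean 10.15 ml and SD 0.02 ml
titration = NormalDist(mu=10.15, sigma=0.02)
lower, upper = 10.12, 10.20

# Standardise the limits of the range
z_lower = (lower - titration.mean) / titration.stdev
z_upper = (upper - titration.mean) / titration.stdev

# Proportion between the limits = F(z_upper) - F(z_lower)
proportion = titration.cdf(upper) - titration.cdf(lower)

print(round(z_lower, 2), round(z_upper, 2))  # -1.5 2.5
print(round(proportion, 3))                  # 0.927
```

Here the cumulative distribution function is evaluated on the original scale of x, which is equivalent to standardising first and then looking up F(z).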