

6.2 The median: initial data analysis

In previous chapters we have used the arithmetic mean or average as the ‘measure of central tendency’ or ‘measure of location’ of a set of results. This is logical enough when the (symmetrical) normal distribution is assumed, but in non-parametric statistics, the median is usually used instead. To calculate the median of n observations, we arrange them in ascending order: in the unlikely event that n is very large, this sorting process can be performed very quickly by programs available for most computers.

The median is the value of the $\tfrac{1}{2}(n + 1)$th observation if n is odd, and the average of the $\tfrac{1}{2}n$th and the $(\tfrac{1}{2}n + 1)$th observations if n is even.

Determining the median of a set of experimental results usually requires little or no calculation. Moreover, in many cases it may be a more realistic measure of central tendency than the arithmetic mean.

Example 6.2.1 (below) illustrates one valuable property of the median: it is unaffected by outlying values. Confidence limits (see Chapter 2) for the median can be estimated with the aid of the binomial distribution. This calculation can be performed even when the number of measurements is small, but is not likely to be required in analytical chemistry, where the median is generally used only as a rapid measure of an average. The reader is referred to the Bibliography for further information.
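As a rough illustration of the binomial approach mentioned above, the following sketch computes a distribution-free confidence interval for the median from the order statistics of a sample. The function name `median_confidence_interval` is hypothetical, and the use of the nine treated-food rates from Example 6.2.2 as input is an assumption made purely for illustration; the Bibliography remains the authoritative source for the method.

```python
from math import comb


def median_confidence_interval(values, alpha=0.05):
    """Distribution-free interval for the median (illustrative sketch).

    The interval [x_(k), x_(n+1-k)] covers the median with probability
    1 - 2*P(B <= k-1), where B ~ Binomial(n, 1/2). We pick the largest k
    whose coverage is still at least 1 - alpha.
    """
    xs = sorted(values)
    n = len(xs)

    def cdf(j):
        # P(B <= j) for B ~ Binomial(n, 1/2)
        return sum(comb(n, i) for i in range(j + 1)) / 2 ** n

    k = 0
    while k < n // 2 and 2 * cdf(k) <= alpha:
        k += 1
    if k == 0:
        return None  # sample too small for the requested confidence level
    coverage = 1 - 2 * cdf(k - 1)
    return xs[k - 1], xs[n - k], coverage


# Assumed input: the treated-food reaction rates from Example 6.2.2
treated = [21, 1, 4, 26, 2, 27, 11, 24, 21]
ci = median_confidence_interval(treated)
print(ci)
```

With only four observations, as in Example 6.2.1, the function returns `None`: no pair of order statistics can reach 95% coverage for so small a sample.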

In non-parametric statistics the usual measure of dispersion (replacing the standard deviation) is the interquartile range (IQR). As we have seen, the median divides the sample of measurements into two equal halves; if each of these halves is further divided into two, the points of division are called the upper and lower quartiles. Several different conventions are used in making this calculation (the interested reader should again consult the Bibliography): here we use the method adopted by the Minitab® program. The IQR is not widely used in analytical work, but various statistical tests can be performed on it.

The median and the IQR of a set of measurements are just two of the statistics which feature strongly in initial data analysis (IDA), often also called exploratory data analysis (EDA). This is an aspect of statistics that has grown rapidly in popularity in recent years. One reason for this is, yet again, the ability of modern computers and dedicated software to present data almost instantly in a wide range of graphical formats: as we shall see, such pictorial representations form an important element of IDA. A second reason for the rising importance of IDA is the increasing acceptance of statistics as a practical and pragmatic subject not necessarily restricted to the use of techniques whose theoretical soundness is unquestioned: some IDA methods seem almost crude in their principles, but have nonetheless proved most valuable.

156 6: Non-parametric and robust methods

Example 6.2.1

Determine the mean and the median for the following four titration values:

25.01, 25.04, 25.06, 25.21 ml

It is easy to calculate that the mean of these four observations is 25.08 ml, and that the median – in this case the average of the second and third values, the observations already being in numerical order – is 25.05 ml. The mean is greater than any of the three closely grouped values (25.01, 25.04 and 25.06 ml) and may thus be a less realistic measure of location than the median. Instead of calculating the median we could use the methods of Chapter 3 to test the value 25.21 as a possible outlier, and determine the mean according to the result obtained, but this approach involves extra calculation and assumes that the data come from a normal population.
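The arithmetic in Example 6.2.1 can be checked with a few lines of Python. This is only a minimal sketch: the helper name `median` is our own, and it simply follows the odd/even rule stated earlier in this section.

```python
def median(values):
    """Median by the odd/even rule: middle value, or mean of the two middle values."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                      # the (n + 1)/2-th observation (1-based)
    return (xs[mid - 1] + xs[mid]) / 2      # average of the n/2-th and (n/2 + 1)-th


titrations = [25.01, 25.04, 25.06, 25.21]   # ml, from Example 6.2.1
mean = sum(titrations) / len(titrations)

print(f"mean   = {mean:.2f} ml")
print(f"median = {median(titrations):.2f} ml")
```

Replacing 25.21 by a still larger value changes the mean but leaves the median at 25.05 ml, which is the property the example is meant to bring out.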

The main advantage of IDA methods is their ability to indicate which (if any) further statistical methods are most appropriate to a given data set.

Several simple presentation techniques are obviously useful. We have already used dot-plots to summarise small data sets (see Chapters 1 and 3). These plots help in the visual identification of outliers and other unusual features of the data. Here is a further example illustrating their value.

Example 6.2.2

In an experiment to determine whether Pb²⁺ ions interfered with the enzymatic determination of glucose in various foodstuffs, nine food materials were treated with a 0.1 mM solution of Pb(II), while four other materials (the control group) were left untreated. The rates (arbitrary units) of the enzyme-catalysed reaction were then measured for each food and corrected for the different amounts of glucose known to be present. The results were:

Treated foods 21 1 4 26 2 27 11 24 21

Controls 22 22 32 23

Comment on these data.

Written out in two rows as above, the data do not convey much immediate meaning, and an unthinking analyst might proceed straight away to perform a t-test (Chapter 3), or perhaps one of the non-parametric tests described below, to see if the two sets of results are significantly different. But when the data are presented as two dot-plots, or as a single plot with the two sets of results given separate symbols, it is clear that the results, while interesting, are so inconclusive that little can be deduced from them without further measurements (Fig. 6.1).

The medians of the two sets of data are similar: 21 for the treated foods and 22.5 for the controls. But the range of reaction rates for the Pb(II)-treated materials is enormous, with the results apparently falling into at least two groups: five of the foods seem not to be affected by the lead (perhaps because in these cases Pb(II) is complexed by components other than the enzyme in question), while three others show a large inhibition effect (i.e. the reaction rate is much reduced), and another lies somewhere in between these two extremes. There is the further problem that one of the control group results is distinctly different from the rest, and might be considered as an outlier (see Chapter 3). In these circumstances it seems most unlikely that a conventional significance test will reveal chemically useful information: the use of the simplest IDA method has guided us away from thoughtless and valueless significance testing and (as so often happens) towards more experimental measurements.

Figure 6.1 Dot-plots for Example 6.2.2 (two panels, Treated foods and Controls, each on a scale from 0 to 32).
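For readers without plotting software to hand, a crude text version of the dot-plots for Example 6.2.2 can be produced as follows. This is a sketch only: the function name `dot_plot_row` and the layout (one character per integer rate, a digit showing how many results fall at that rate) are our own choices, and they assume integer data with fewer than ten coincident values.

```python
from statistics import median


def dot_plot_row(values, lo, hi):
    """One text row: a digit giving the count at each integer position, '.' if none."""
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    return "".join(str(c) if c else "." for c in counts)


treated = [21, 1, 4, 26, 2, 27, 11, 24, 21]
controls = [22, 22, 32, 23]

# Print both groups on a common 0-32 scale, with their medians alongside
for label, data in (("Treated foods", treated), ("Controls", controls)):
    print(f"{label:<14}{dot_plot_row(data, 0, 32)}  median = {median(data)}")
```

Even in this rough form the two clusters among the treated foods, and the isolated control value at 32, are visible at a glance.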


Another simple data representation technique, of greater value when rather larger samples are studied, is the box-and-whisker plot. In its normal form such a diagram consists of a rectangle (the box) with two lines (the whiskers) extending from opposite edges of the box, and a further line in the box, crossing it parallel to the same edges. The ends of the whiskers indicate the range of the data, the edges of the box from which the whiskers protrude represent the upper and lower quartiles, and the line crossing the box represents the median of the data (Fig. 6.2).

Figure 6.2 Box-and-whisker plot (labelled features: lowest value, lower quartile, median, upper quartile, highest value).

The box-and-whisker plot, with a numerical scale, is a graphical representation of the five-number summary: the data set is described by its extremes, its lower and upper quartiles, and its median. The plot shows at a glance the spread and the symmetry of the data.

Some computer programs enhance box-and-whisker plots by identifying possible outliers separately, the outliers often being defined as data points which are lower than the lower quartile, or higher than the upper quartile, by more than 1.5 times the interquartile range. The whiskers then extend only to these upper and lower limits, or fences, and outlying data are shown as separate points. (These refinements are not shown in Fig. 6.2.)

Example 6.2.3

The levels of a blood plasma protein in 20 men and 20 women (mg 100 ml⁻¹) were found to be:

Men    3 2 1 4 3 2 9 13 11 3 18 2 4 6 2 1 8 5 1 14

Women  6 5 2 1 7 2 2 11 2 1 1 3 11 3 2 3 2 1 4 8

What information can be gained about any differences between the levels of this protein in men and women?

As in the previous example, the data as presented convey very little, but the use of two box-and-whisker plots or five-number summaries is very revealing.

The five-number summaries are:

         Min.   Lower quartile   Median   Upper quartile   Max.
Men      1      2                3.5      8.75             18
Women    1      2                2.5      5.5              11
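A five-number summary is easily computed once a quartile convention has been fixed. The sketch below assumes that the quartile at fraction p lies at position p(n + 1) in the ordered data, with linear interpolation between neighbouring values; this is our understanding of the convention the text attributes to Minitab, and it should be treated as an assumption. The function names `quantile` and `five_number_summary` are hypothetical. Other programs use other conventions and may give slightly different quartiles.

```python
def quantile(sorted_xs, p):
    """Value at 1-based position p*(n + 1), interpolating linearly between neighbours.

    Assumed convention; other software may place the quartiles differently.
    """
    n = len(sorted_xs)
    pos = min(max(p * (n + 1), 1), n)   # clamp to the ends of the sample
    lower = int(pos)                    # 1-based index of the lower neighbour
    frac = pos - lower
    if lower >= n:
        return sorted_xs[-1]
    return sorted_xs[lower - 1] + frac * (sorted_xs[lower] - sorted_xs[lower - 1])


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    xs = sorted(values)
    return (xs[0], quantile(xs, 0.25), quantile(xs, 0.5), quantile(xs, 0.75), xs[-1])


# Protein levels for the 20 men in Example 6.2.3
men = [3, 2, 1, 4, 3, 2, 9, 13, 11, 3, 18, 2, 4, 6, 2, 1, 8, 5, 1, 14]
summary = five_number_summary(men)
iqr = summary[3] - summary[1]           # interquartile range

print("five-number summary:", summary)
print("IQR:", iqr)
```

For the men's data this gives 1, 2, 3.5, 8.75 and 18, so the interquartile range is 6.75.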


While it is usual for analysts to handle relatively small sets of data there are occasions when a larger set of measurements is to be examined. Examples occur in the areas of clinical and environmental analysis, where in many instances there are large natural variations in analyte levels. Table 6.1 shows, in numerical order, the levels of a pesticide in 30 samples of butter beans. The individual values range from 0.03 to 0.96 mg kg⁻¹. They might be expressed as a histogram. This would show that, for example, there are four values in the range 0–0.095 mg kg⁻¹, four in the range 0.10–0.195 mg kg⁻¹, and so on. But a better IDA method uses a stem-and-leaf diagram, as shown in Fig. 6.3.

The left-hand column of figures – the stem – shows the first significant digit for each measurement, while the remaining figures in each row – the leaves – provide the second significant digit. The length of each row thus corresponds to the length of the bars on the corresponding histogram, but the advantage of the stem-and-leaf diagram is that it retains the value of each measurement. The leaves use only whole numbers, so some indication of the scale used must always be given. In this case a key is used to provide this information. Minitab® provides facilities for stem-and-leaf diagrams.

Table 6.1 Levels of pp-DDT in 30 butter bean specimens (mg kg⁻¹)

0.03 0.05 0.08 0.08 0.10 0.11 0.18 0.19 0.20 0.20

0.22 0.22 0.23 0.29 0.30 0.32 0.34 0.40 0.47 0.48

0.55 0.56 0.58 0.64 0.66 0.78 0.78 0.86 0.89 0.96

0 | 3 5 8 8
1 | 0 1 8 9
2 | 0 0 2 2 3 9
3 | 0 2 4
4 | 0 7 8
5 | 5 6 8
6 | 4 6
7 | 8 8
8 | 6 9
9 | 6

Key: 1|1 = 0.11 mg kg⁻¹

Figure 6.3 Stem-and-leaf diagram for data from Table 6.1.
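A stem-and-leaf diagram of this kind is simple to generate. The following is a minimal sketch using the data of Table 6.1; the function name `stem_and_leaf` and the `scale` parameter (which converts each value to a two-digit whole number) are our own illustrative choices and do not reproduce any particular program's routine.

```python
from collections import defaultdict

# Table 6.1: levels of pp-DDT in 30 butter bean specimens (mg/kg)
levels = [
    0.03, 0.05, 0.08, 0.08, 0.10, 0.11, 0.18, 0.19, 0.20, 0.20,
    0.22, 0.22, 0.23, 0.29, 0.30, 0.32, 0.34, 0.40, 0.47, 0.48,
    0.55, 0.56, 0.58, 0.64, 0.66, 0.78, 0.78, 0.86, 0.89, 0.96,
]


def stem_and_leaf(values, scale=100):
    """Map each stem (first digit) to its ordered list of leaves (second digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        whole = round(v * scale)        # 0.11 -> 11; round() avoids float truncation
        rows[whole // 10].append(whole % 10)
    return rows


rows = stem_and_leaf(levels)
for stem in range(min(rows), max(rows) + 1):
    leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 1 | 1 = 0.11 mg/kg")
```

The number of leaves in each row equals the height of the corresponding histogram bar (four for stem 0, four for stem 1, and so on), while the individual values remain recoverable from the key.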

For Example 6.2.3, it is left as a simple sketching exercise for the reader to show that (a) the distributions are very skewed in both men and women, so statistical methods that assume a normal distribution are not appropriate (as we have seen this is often true when a single measurement is made on a number of different subjects, particularly when the latter are living organisms); (b) the median concentrations for men and women are similar; and (c) the range of values is considerably greater for men than for women. The conclusions suggest that we might apply the Siegel–Tukey test (see below, Section 6.6) to see whether the greater variation in protein levels amongst men is significant.


IDA methods can of course be extended to the area of calibration and other regression techniques: the very crude method of plotting a curved calibration graph suggested at the end of the previous chapter can be regarded as an IDA approach. Many IDA methods are described in the books by Chatfield and by Velleman and Hoaglin listed in the Bibliography at the end of this chapter.