ANALYSIS OF VARIANCE BY RANDOMIZATION
WITH SMALL DATA SETS
LILIANA GONZALEZ1* AND BRYAN F. J. MANLY2
1Finance and Quantitative Analysis Department, University of Otago, PO Box 56, Dunedin, New Zealand
2Department of Mathematics and Statistics, University of Otago, PO Box 56, Dunedin, New Zealand
SUMMARY
A simulation study has been carried out to compare the results from using different randomization methods to assess the significance of the F-statistics for factor effects with analysis of variance. Two-way and three-way designs with and without replication were considered, with the randomization of observations, the restricted randomization of observations, and the randomization of different types of residuals. Data from normal, uniform, exponential, and an empirical distribution were considered. It was found that usually all methods of randomization gave similar results, as did the use of the usual F-distribution tables, and that no method of analysis was clearly superior to the others under all conditions. © 1998 John Wiley & Sons, Ltd.

KEY WORDS: analysis of variance; computer intensive statistics; simulation; F-distribution; randomization inference
1. INTRODUCTION
Randomization methods for testing hypotheses are useful in environmental areas because their justification is easy to understand by government officials and the public at large, and they are applicable with the non-normal distributions that often occur. However, as soon as the required analysis becomes more than something as simple as the comparison of the mean values of several samples the appropriate randomization procedure becomes questionable. There has been controversy with several types of randomization analysis but in this paper we only consider simple factorial designs with analysis of variance. It was in this area that Crowley (1992) in his review of resampling methods remarked that 'contrasting views on interaction terms in factorial ANOVA . . . needs to be resolved'.
In discussing the merits of randomization tests we start with one important premise. This is that the major value of these tests is in situations where the data are grossly non-normal (e.g. with several extremely large values and many tied values), and sample sizes are not large. It is in these situations that more conventional methods are of questionable validity and the results of a randomization test may carry more weight. This premise is important because it tells us that our main concern should be with the performance of alternative methods on grossly non-normal data. Theoretical or simulation studies suggesting that one method is somewhat better than another with normally distributed data are not necessarily of much relevance.
CCC 1180–4009/98/010053–13$17.50 Received 29 December 1995
© 1998 John Wiley & Sons, Ltd. Revised 13 June 1997
ENVIRONMETRICS, VOL. 9, 53–65 (1998)
In this context we address two questions:
(a) Should it be observations or residuals that are permuted and, if it is to be residuals, what type of residuals should be used?
(b) Should restricted randomization be used where possible instead of a completely free randomization?
Our method of approach involves looking at several quite specific situations and examining the performance of alternative methods. Because the situations are specific there is no guarantee that our conclusions apply to analysis of variance in general. Nevertheless, the results are clear enough to suggest that this is the case.
In brief, our conclusions with respect to the testing of factor effects are that there is usually similar performance with analysis of variance based on the usual F-distribution tables, the randomization of observations, or the randomization of different types of residuals with normal or moderately non-normal data. If anything, the use of F-distribution tables tends to give slightly higher power than the other methods except with extremely non-normal data. On the other hand, restricted randomization has guaranteed performance when the null hypothesis is true, but has a tendency to give low power to detect real effects under some conditions. Overall, therefore, it is not possible to say that any method of analysis is clearly superior to the other methods.
2. THE SIMULATED SITUATION
To set the scene for our study, we imagine that three streams are sampled in the four seasons of a year and some environmentally important variable is measured. In Section 3 we assume that one reading only is taken for each stream in each season. The sample design is then suitable for a two-factor analysis of variance without replication, with 3 × 4 = 12 observations. We generated simulated data for analysis in four different ways:
(i) Twelve values were generated from a normal distribution and scaled to have a mean of five and standard deviation of one. These values were then placed in a random order and stream effects and seasonal effects were added to produce one set of data. This process was repeated 1000 times for each of several combinations of stream effects and seasonal effects.
(ii) The same process as described for (i) was used but with the original 12 values chosen from a uniform distribution.
(iii) The same process as described for (i) was used but with the original 12 values chosen from an exponential distribution.
(iv) The same process as described for (i) was used but with the original 12 values chosen from an observed distribution of the ratio of counts of ephemeroptera to counts of oligochaetes (a measure of pollution) in New Zealand streams, as shown in Table I.
Basically, the data generated by methods (i) and (ii) are quite 'well-behaved', with no very extreme values; the data generated by method (iii) are positively skewed with a few relatively large values; and the data generated by method (iv) are like the worst type of data found in real life, with many tied values that were originally zero before scaling to a mean of five and a standard deviation of one, and several relatively large values.
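The scaling step common to methods (i)-(iv) can be sketched as follows. This is a minimal illustration rather than the authors' code, and the function and variable names are our own; method (iv) would resample the empirical ratios of Table I in the same way.

```python
import numpy as np

def scaled_sample(draw, n=12, rng=None):
    """Draw n values and rescale them to mean 5 and standard
    deviation 1, as in data-generation methods (i)-(iv)."""
    rng = np.random.default_rng() if rng is None else rng
    x = draw(rng, n)
    x = (x - x.mean()) / x.std()   # standardize (population sd)
    return 5.0 + x

rng = np.random.default_rng(0)

# Methods (i)-(iii); method (iv) would substitute the Table I ratios.
normal      = scaled_sample(lambda r, n: r.normal(size=n), rng=rng)
uniform     = scaled_sample(lambda r, n: r.uniform(size=n), rng=rng)
exponential = scaled_sample(lambda r, n: r.exponential(size=n), rng=rng)

# Place the 12 values in a random order as a 3 (streams) x 4 (seasons) table.
table = rng.permutation(normal).reshape(3, 4)
```

Stream and seasonal effects are then added to `table` as described in Section 3.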
After generation, the F-values for variation between streams and variation between seasons were obtained for each set of data from a two-factor analysis of variance. The p-values were then calculated by four different methods:

(a) The observations were randomized freely between the 12 factor combinations and the analysis of variance was repeated. This was done 99 times and the p-value for a factor was taken as the proportion of times that the corresponding observed F-value was equalled or exceeded in the set of 100 F-values consisting of the observed one plus the 99 obtained by randomization.
(b) The usual residuals for a two-way analysis of variance (observation − stream mean − season mean + overall mean) were calculated, and these were then freely randomized between the 12 factor combinations. This was repeated 99 times and p-values were determined as for (a). This is the method proposed by ter Braak (1992).
(c) To test the effect of streams a restricted randomization was used, as described by Edgington (1995, Chapter 6). This involves producing alternative sets of data by randomly permuting observations between streams within seasons only, and thereby allowing for possible differences between seasons. Similarly, to test the effect of seasons a restricted randomization allowing observations to move between seasons but not between streams was used. For each effect 99 restricted randomizations were carried out to produce 99 F-values for comparison with the observed F-value for that effect. The p-value for an effect was then the proportion of F-values as large or larger than the observed one in the set consisting of the observed F-value plus the 99 randomized values.
(d) The p-values for effects were determined by computing the probability of F-values as large or larger than those observed from the usual F-distributions.
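As a concrete sketch of methods (a) and (c), the F-statistics for a two-way layout without replication and the corresponding randomization p-values can be computed as follows. This is our own minimal implementation, assuming the data are held in a 3 × 4 NumPy array; the function names are ours, not the authors'.

```python
import numpy as np

def two_way_F(y):
    """F-statistics for rows (streams) and columns (seasons) in a
    two-way layout without replication."""
    a, b = y.shape
    grand = y.mean()
    ssa = b * ((y.mean(axis=1) - grand) ** 2).sum()
    ssb = a * ((y.mean(axis=0) - grand) ** 2).sum()
    sse = ((y - y.mean(axis=1, keepdims=True)
              - y.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    dfa, dfb, dfe = a - 1, b - 1, (a - 1) * (b - 1)
    return (ssa / dfa) / (sse / dfe), (ssb / dfb) / (sse / dfe)

def free_randomization_p(y, n_rand=99, rng=None):
    """Method (a): freely permute observations over all cells; the
    observed F-value is included in the reference set of n_rand + 1."""
    rng = np.random.default_rng() if rng is None else rng
    f1_obs, f2_obs = two_way_F(y)
    flat = y.ravel()
    ge1 = ge2 = 1                  # count the observed value itself
    for _ in range(n_rand):
        f1, f2 = two_way_F(rng.permutation(flat).reshape(y.shape))
        ge1 += f1 >= f1_obs
        ge2 += f2 >= f2_obs
    return ge1 / (n_rand + 1), ge2 / (n_rand + 1)

def restricted_p_streams(y, n_rand=99, rng=None):
    """Method (c): permute only within columns (seasons) when testing
    the stream effect, in the manner of Edgington (1995)."""
    rng = np.random.default_rng() if rng is None else rng
    f1_obs, _ = two_way_F(y)
    ge = 1
    for _ in range(n_rand):
        perm = np.column_stack([rng.permutation(y[:, j])
                                for j in range(y.shape[1])])
        ge += two_way_F(perm)[0] >= f1_obs
    return ge / (n_rand + 1)
```

Method (b) would permute the residuals rather than the observations before recomputing the F-statistics, and method (d) would refer the observed F1 and F2 to F-distributions with (2, 6) and (3, 6) degrees of freedom.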
Method (a) was also used with tests based on the sums of squares of effects as proposed by Manly (1991, Chapter 5). However the relatively poor performance of this approach when compared to the others that were considered means that it is not worth discussing further.
The results of this simulation allow the comparison of the different testing procedures under quite a wide range of conditions in terms of the type of data involved. In Sections 4 and 5 slightly more complicated designs involving two factors with replication and the introduction of a third factor for the position in the stream are also considered. These introduce the problem of testing for interactions between factors.

Table I. Ratios of numbers of ephemeroptera to numbers of oligochaetes in New Zealand streams. The three ratios shown for each stream–position–season combination come from three independent samples taken in the same area at the same time

Stream  Position  Summer           Autumn            Winter          Spring
A       Bottom    0.2, 0.2, 0.0    0.0, 0.0, 0.0     0.0, 0.0, 0.0   0.0, 0.0, 0.0
        Middle    2.6, 0.7, 0.7    0.3, 0.6, 0.4     0.0, 0.0, 0.0   0.0, 0.0, 0.0
        Top       0.7, 8.5, 7.1    0.5, 1.1, 1.4     0.1, 0.1, 0.2   0.2, 0.1, 0.4
B       Bottom    0.0, 0.0, 0.0    0.0, 0.0, 0.0     0.0, 0.0, 0.0   0.0, 0.0, 0.0
        Middle    0.3, 0.2, 0.3    0.0, 0.0, 0.0     0.0, 0.0, 0.0   0.0, 0.0, 0.1
        Top       1.2, 0.7, 0.8    10.7, 19.9, 9.4   2.0, 6.3, 4.8   2.0, 1.6, 1.9
C       Bottom    0.1, 0.2, 0.1    0.0, 0.0, 0.1     0.4, 0.0, 0.3   0.0, 0.0, 0.0
        Middle    1.0, 2.6, 1.9    0.0, 0.0, 0.0     0.1, 0.2, 0.0   0.0, 0.0, 0.0
        Top       7.3, 10.4, 9.5   46.6, 20.3, 24.0  1.2, 0.8, 6.1   0.2, 0.1, 0.1
3. A TWO-FACTOR SAMPLING DESIGN WITHOUT REPLICATION
The two-factor sampling design was simulated as described in the previous section, with data from four distributions (normal, uniform, exponential and empirical). Seven levels for the combined effects of streams and seasons were used. Let A0 stand for no stream effects, A1 stand for low stream effects, A2 stand for high stream effects, B0 stand for no seasonal effects, B1 stand for low seasonal effects, and B2 stand for high seasonal effects. The seven levels are then A0 and B0, A1 and B0, A2 and B0, A0 and B1, A0 and B2, A1 and B1, and A2 and B2, where, for example, A1 and B0 means that there was a low level of stream effects and no seasonal effects. To apply the levels A1, A2, B1 and B2 the 12 data values had constants added to them after being placed in an initial random order as described in the last section. For A1, stream 2 observations were increased by 1.0 and stream 3 observations by 2.0; for A2, stream 2 observations were increased by 2.0 and stream 3 observations by 4.0; for B1, season 2 observations were increased by 0.7, season 3 observations by 1.3, and season 4 observations by 2.0; for B2, season 2 observations were increased by 1.3, season 3 observations by 2.7, and season 4 observations by 4.0.
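The effect levels described above amount to adding fixed increments to the randomly ordered base values. A sketch, with container and function names of our own choosing:

```python
import numpy as np

# Increments from Section 3: stream effects (3 streams) and
# seasonal effects (4 seasons) at each level.
STREAM = {"A0": [0.0, 0.0, 0.0],
          "A1": [0.0, 1.0, 2.0],
          "A2": [0.0, 2.0, 4.0]}
SEASON = {"B0": [0.0, 0.0, 0.0, 0.0],
          "B1": [0.0, 0.7, 1.3, 2.0],
          "B2": [0.0, 1.3, 2.7, 4.0]}

# The seven scenarios simulated.
SCENARIOS = [("A0", "B0"), ("A1", "B0"), ("A2", "B0"), ("A0", "B1"),
             ("A0", "B2"), ("A1", "B1"), ("A2", "B2")]

def apply_effects(base, a, b):
    """Add the stream and season increments to a 3 x 4 table of base values."""
    return (base + np.array(STREAM[a])[:, None]
                 + np.array(SEASON[b])[None, :])
```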
The results of the simulations are summarized in Table II in terms of the percentage of significant results for tests at the 5 per cent level of significance. Cases where the null hypothesis is true and the observed result is significantly different from 5 per cent with a test at the 5 per cent level (i.e. the result is outside the range 3.6 to 6.4 per cent for single values, and outside the range 4.4 to 5.6 per cent for average size) are underlined once. Cases where the null hypothesis was not true and a test gave the fewest significant results (i.e. had relatively low observed power) are underlined twice. The average size is the mean percentage of significant results when the null hypothesis was true. The average power is the mean percentage of significant results when the null hypothesis was not true. It is desirable for this to be large.
Overall the testing procedures have given fairly similar results. However the following tendencies can be noted:

1. Freely randomizing observations usually gave good power and did relatively well in terms of maintaining an average size close to 5 per cent when the null hypothesis was true.
2. Freely randomizing residuals as proposed by ter Braak (1992) did slightly worse than freely randomizing observations with respect to the percentage of significant results when the null hypothesis was true and with respect to power when the null hypothesis was not true.
3. Restricted randomization as proposed by Edgington (1995) gave about 5 per cent significant results when the null hypothesis was true, except with the empirical data, but often had the lowest average power of all the procedures.
4. Use of F-tables for testing gave about the correct proportion of significant results when the null hypothesis was true and had the highest power of all the tests, except with the empirical data.
If anything the simulation results suggest that the use of F-tables is better than any of the randomization methods, except with extremely non-normal data, in which case freely randomizing observations gave the best results. However it should be noted that only 99 randomizations were used with the randomization tests to keep computational times within reasonable limits.
Table II. Results from the simulation experiment with the two-factor sampling design without replication. The tabulated values are the percentages of significant results with tests at the 5 per cent level (F1, test for a stream effect; F2, test for a seasonal effect). Single underlined values are those where the null hypothesis is true but the percentage of significant results is significantly different from 5 per cent (i.e. the result is outside the range 3.6 to 6.4 per cent for single values and outside the range 4.4 to 5.6 per cent for average size), for a test at the 5 per cent level. Double underlined values are those where the test produced the fewest significant results when the null hypothesis was not true. The average size is for the cases where the null hypothesis was true so that 5 per cent is the desired value. The average power is for cases where the null hypothesis was not true so that high values are desirable.
Effects added to data    Observations    ter Braak       Edgington       Use of
                         randomized      method          method          F-tables
                         F1      F2      F1      F2      F1      F2      F1      F2

Normal data
1. None                  4.8     3.9     4.3     2.7     5.7     3.8     4.9     3.0
2. Low for streams       36.6    5.2     35.6    5.1     33.9    5.5     39.4    5.6
3. High for streams      97.1    3.4     98.0    4.2     93.9    4.3     100.0   4.1
4. Low for seasons       4.1     23.3    3.5     23.4    4.2     23.6    4.0     24.9
5. High for seasons      4.9     80.3    4.7     81.4    5.3     81.2    5.0     86.5
6. Low for both          37.5    24.1    36.2    21.4    34.3    22.3    39.1    23.6
7. High for both         97.7    82.1    97.1    81.6    95.1    79.6    100.0   86.1
Average: size            4.4             4.1             4.8             4.4
         power           59.8            59.3            58.0            62.5

Uniform data
1. None                  4.9     5.4     6.2     5.6     4.6     5.0     7.1     5.4
2. Low for streams       36.7    5.2     36.7    4.6     33.9    4.3     41.0    5.2
3. High for streams      97.3    4.9     98.2    5.0     96.3    5.3     100.0   5.5
4. Low for seasons       5.3     22.8    5.3     23.9    3.5     24.6    5.5     28.0
5. High for seasons      6.8     80.8    6.6     79.0    4.9     79.7    7.4     86.1
6. Low for both          37.0    24.1    36.4    22.2    31.6    23.2    39.0    26.0
7. High for both         98.1    80.1    97.0    80.1    95.0    81.5    100.0   85.5
Average: size            5.4             5.6             4.6             6.0

Exponential data
7. High for both         97.9    83.3    98.2    80.3    94.5    80.1    100.0   87.4
Average: size            4.6             4.6             5.0             4.9

Empirical data
7. High for both         42.9    26.6    40.7    23.0    41.7    33.8    38.7    21.1
Average: size            2.3             1.3             0.3             1.2
         power           26.4            18.9            26.5            17.5
The use of more than 99 randomizations can be expected to increase the power of the randomization tests to some extent, and hence reduce the apparent superiority of the use of F-tables in this respect. Some limited extra simulations that were carried out to investigate this show that using 999 rather than 99 randomizations can increase power by up to about 2 per cent. Therefore overall it is fair to say that in the context of this example none of the testing methods has proved to be clearly superior to the others but randomizing observations seems slightly better than the alternatives in terms of maintaining a size close to 5 per cent and having good power.
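The limited resolution of 99 randomizations can be quantified with the usual binomial approximation for the Monte Carlo error of an estimated p-value. This is a back-of-envelope sketch of our own, not a calculation from the paper:

```python
import math

def mc_standard_error(p, n_rand):
    """Approximate Monte Carlo standard error of a p-value estimated
    from n_rand random permutations (binomial approximation)."""
    return math.sqrt(p * (1.0 - p) / n_rand)

se_99 = mc_standard_error(0.05, 99)    # roughly 0.022
se_999 = mc_standard_error(0.05, 999)  # roughly 0.007
```

Near the 5 per cent threshold the noise from 99 permutations is of the order of 2 percentage points, which is consistent with the observation that moving to 999 randomizations recovers up to about 2 per cent of power.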
4. A TWO-FACTOR SAMPLING DESIGN WITH REPLICATION
The second design that we have considered is the same as the first except that replicate observations were considered for each of the 12 stream–season combinations. Situations were simulated with two, four and six replicates, using the same four types of data (normal, uniform, exponential and empirical) as for the design without replicates. However, stream (A) and season (B) effects were halved to take into account to some extent the extra power for detecting effects that was introduced by the replication. In addition, interactions were introduced at a low level (AB1) by adding 0.5 to all observations in stream 3 and season 3, and 1.0 to all observations in stream 3 in season 4, and interactions at a high level (AB2) by adding twice these increments. The values used for the empirical distribution are shown in Table I.
Nine different scenarios were simulated for the sample design: A0B0AB0 (no effects); A1B0AB0 (low stream effects); A2B0AB0 (high stream effects); A0B1AB0 (low seasonal effects); A0B2AB0 (high seasonal effects); A1B1AB0 (low stream and low seasonal effects); A2B2AB0 (high stream and high seasonal effects); A1B1AB1 (low stream effects, low seasonal effects and low interaction); and A2B2AB2 (high stream effects, high seasonal effects and high interaction).
Tests of main effects were conducted using the methods (a) to (d) described in Section 2. In addition, the stream by season interaction was tested by the following four methods.
(e) The observed F-value for the interaction from the analysis of variance was tested by comparison with the distribution consisting of itself plus 99 alternative values obtained by freely randomizing observations. An interaction was then considered to be significant at the 5 per cent level if the observed F-value was one of the five largest in the set of 100.
(f) This was similar to (e) except that the randomized F-values were obtained by freely reallocating residuals calculated as differences between observations and the means from the stream–season combinations to which they belonged, as proposed by ter Braak (1992).
(g) This was similar to (e) except that residuals were calculated as deviations from the analysis of variance model without interactions. That is to say, the residuals were calculated as (observation − stream mean − season mean + overall mean). This approach was suggested by Still and White (1981).
(h) The observed interaction F-value was tested using the F-distribution in the usual way.
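The two kinds of residuals used in methods (f) and (g) differ only in whether the interaction is left in the residuals. For a replicated layout stored as a (streams, seasons, replicates) array they can be sketched as follows (our own code; the function names are ours):

```python
import numpy as np

def cell_mean_residuals(y):
    """Method (f), ter Braak (1992): deviations from the cell
    (stream-season) means, i.e. residuals from the full model."""
    return y - y.mean(axis=2, keepdims=True)

def additive_model_residuals(y):
    """Method (g), Still and White (1981): residuals from the model
    without interaction, computed as
    observation - stream mean - season mean + overall mean."""
    return (y - y.mean(axis=(1, 2), keepdims=True)
              - y.mean(axis=(0, 2), keepdims=True) + y.mean())
```

Note that the method (g) residuals retain any interaction present in the data, whereas the method (f) residuals remove it along with the main effects.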
Because the output from the simulations in terms of percentages of significant results is very extensive, we only present part of this and a summary of the remainder. Table III shows the percentages significant with four replicates. As for Table II, percentages are underlined once when they are significantly different from 5 per cent and the null hypothesis is true, and underlined twice when they are the lowest percentage for the alternative methods of testing the particular effects.
Table III. Results from the simulation experiment with the two-factor sampling design with four replicates. The tabulated values are the percentages of significant results with tests at the 5 per cent level (F1, test for a stream effect; F2, test for a seasonal effect; F12, test for the interaction effect). Single underlined and double underlined values are as defined in Table II. Tests on interactions (e) to (h) are as described in Section 4. The average size and average power are as defined in Table II.
Effects added to data      Observations randomized (e)  ter Braak method (f)   Edgington method (g)   Use of F-tables (h)
                           F1    F2    F12            F1    F2    F12         F1    F2    F12        F1    F2    F12

Normal data
8. Low with interaction    91.9  78.2  9.6            92.1  78.7  10.0        90.6  76.0  9.4        92.8  79.7  9.6
9. High with interaction   100.0 100.0 40.4           100.0 100.0 39.8        100.0 100.0 40.7       100.0 100.0 41.9
Average: size              4.7   5.2                  4.8   5.4               4.6   5.0              4.7   5.2
         power             82.7  25.0                 82.5  24.9              82.5  25.1             83.4  25.8

Uniform data
8. Low with interaction    90.6  78.2  13.3           92.1  79.0  14.3        91.6  77.7  14.0       93.1  80.2  14.1
9. High with interaction   100.0 100.0 37.8           100.0 100.0 37.8        100.0 100.0 38.6       100.0 100.0 39.1
Average: size              4.8   5.2                  4.6   5.4               4.5   5.2              4.6   5.2
         power             82.5  25.6                 82.4  26.1              82.4  26.3             83.3  26.6

Exponential data
5. High for seasons        4.6   99.2  4.9            5.3   99.2  4.3         5.0   99.3  4.9        4.8   99.6  4.1
6. Low for both            65.6  47.8  2.9            64.7  48.2  2.6         65.7  50.2  3.0        66.9  48.8  2.6
7. High for both           100.0 99.2  4.5            99.9  98.8  5.5         100.0 99.1  5.5        100.0 99.6  4.7
8. Low with interaction    92.3  78.2  11.0           91.1  78.3  10.7        91.2  77.0  10.5       93.2  78.9  10.3
9. High with interaction   100.0 100.0 39.9           100.0 100.0 39.6        100.0 100.0 39.9       100.0 100.0 41.0
Average: size              4.8   4.5                  5.0   4.6               5.0   4.9              4.9   4.3
         power             82.7  25.5                 82.6  25.2              83.0  25.2             83.4  25.7

Empirical data
7. High for both           100.0 99.7  3.5            100.0 99.7  4.2         100.0 100.0 4.2        100.0 99.9  3.0
8. Low with interaction    95.0  79.9  8.8            96.2  82.2  10.5        94.3  79.9  10.3       95.0  80.0  7.8
9. High with interaction   100.0 100.0 36.7           100.0 100.0 44.9        100.0 100.0 41.9       100.0 100.0 38.2
Average: size              4.3   4.0                  4.2   4.2               5.0   4.1              3.2   2.7
         power             85.0  22.8                 85.2  27.7              85.3  26.1             83.9  23.0
A summary of all the simulation results is provided in Table IV in terms of average size and power, with the results for the design without replication included for completeness. From these results it is apparent that overall the differences between the randomization methods are relatively small.
5. A THREE-FACTOR SAMPLING DESIGN
The third design that we have considered is where samples are taken at three depths (low, medium and high), in three rivers, for four seasons, with either no replication, two replicates, four replicates, or six replicates for each of the 3 × 3 × 4 = 36 factor combinations. Data for analysis were again generated by starting with a fixed set of values, allocating them in a random order to the factor combinations and then adding factor effects at various levels. The initial fixed sets of values were also chosen as before from normal, uniform, exponential and empirical distributions, and coded to have a mean of five and a standard deviation of one. The empirical distribution was the same one as was used for the two-factor simulations (Table I).
The significance levels for F-statistics from analysis of variance were determined by three methods for this design. The first method involved comparing each F-statistic with the distribution obtained for the same statistic by randomly permuting the original data values and redoing the analysis of variance. In this case the reference distribution consisted of 99 randomized F-statistics plus the original statistic. Significance at the 5 per cent level was then obtained if the F-statistic for the original data was among the largest five of the 100 statistics. The second method used ter Braak's (1992) approach of randomizing residuals instead of observations, but was otherwise similar to the first method. The residuals in this case were from a model without the three-factor interaction when there was no replication, and from a model with the three-factor interaction when there was replication. Finally, the third method involved comparing the F-statistics with the F-distribution in the conventional manner.
Rather than devoting a good deal of space to defining the factor effects that we used and presenting many tables of results, we have just provided an overall summary in Table V. Our conclusions from these three-factor simulations are essentially the same as for the two-factor simulations. For normal or moderately non-normal data the use of F-tables is quite satisfactory, and tends to give slightly higher power than randomization. However with the grossly non-normal empirical data the randomization of observations gave the best performance in terms of power and size.
6. EXAMPLE
As an example we consider the ratios of species numbers in different streams, seasons, and positions in streams that are shown in Table I. The question to be considered is whether there is any evidence that the E/O (ephemeroptera to oligochaete) ratio varied with the stream (A, B, C), the position within the stream (bottom, middle, top), or the season of the year (summer, autumn, winter, spring).
An analysis of variance on the ratios gives the results shown in Table VI. Here the F-ratios all have the 'error' mean square as the denominator, because fixed effects are being assumed. For all the methods of testing, the ratios are very highly significant, corresponding to probability levels of 0.0001 or less when compared with tables. If there had been any discrepancies then our simulations suggest that in this situation the randomization of observations gives the most reliable results.
Table IV. Summary of the simulation results for a two-factor design in terms of the average size and the average power as defined for Table II, with the averages calculated separately for stream and season effects (F1 and F2) and the interaction (F12). Single underlined values are those where the average size for cases where the null hypothesis was true is significantly different from 5 per cent (i.e. the result is outside the range 4.4 to 5.6 per cent for average size, and outside the range 4.9 to 5.1 per cent for overall results). Double underlining is as defined in Table II.

Number of
replicates             Observations randomized  ter Braak method   Edgington method   Use of F-tables
                       F1, F2   F12            F1, F2   F12       F1, F2   F12       F1, F2   F12

Normal data
   Power               57.0     10.8           57.2     11.3      59.9     11.7      58.8     11.6
4  Size                4.7      5.2            4.8      5.4       4.6      5.0       4.7      5.2
   Power               82.7     25.0           82.5     24.9      82.5     25.1      83.4     25.8
6  Size                5.1      5.1            4.7      4.9       5.0      5.1       4.8      5.0
   Power               92.2     39.2           92.1     39.4      92.1     39.3      92.7     40.9

Uniform data
1  Size                5.4      -              5.6      -         4.6      -         6.0      -
   Power               59.6     -              59.2     -         58.2     -         63.2     -
2  Size                4.7      5.2            5.2      5.7       4.6      5.4       5.1      5.3
   Power               57.0     10.1           56.8     10.6      59.4     10.2      58.5     10.8
4  Size                4.8      5.2            4.6      5.4       4.5      5.2       4.6      5.2
   Power               82.5     25.6           82.4     26.1      82.4     26.3      83.3     26.6
6  Size                5.2      5.3            5.2      5.2       4.8      5.1       5.2      5.1
   Power               92.5     40.0           92.5     40.3      92.3     41.4      93.1     42.0

Exponential data
1  Size                4.6      -              4.6      -         5.0      -         4.9      -
   Power               59.8     -              58.8     -         57.1     -         62.2     -
2  Size                5.7      6.2            6.4      7.9       5.2      6.2       5.9      6.9
   Power               56.9     11.0           57.6     13.9      60.5     11.2      58.3     12.1
4  Size                4.8      4.5            5.0      4.6       5.0      4.9       4.9      4.3
   Power               82.7     25.5           82.6     25.2      83.0     25.2      83.4     25.7
6  Size                5.4      4.7            5.2      4.5       5.6      4.8       4.9      4.2
   Power               92.4     38.8           92.5     39.7      92.4     40.1      92.7     40.2

Empirical data
   Power               85.0     22.8           85.2     27.7      85.3     26.1      83.9     23.0
6  Size                3.9      3.7            3.7      3.7       4.8      4.2       2.0      2.0
   Power               93.1     38.7           93.4     45.2      93.3     45.5      91.9     38.6

Over all distributions
   Size                4.7      4.9            4.7      5.1       4.6      4.9       4.4      4.6
   Power               71.0     24.5           70.7     26.0      71.7     25.7      71.5     25.3
Table V. Summary of the results obtained from simulations with the three-factor sampling design. Overall results are given for tests on the main effects of the position in the stream, the stream, and the season, and the three interactions of position × stream, position × season, and stream × season. For each effect, the average size is the mean percentage of significant results when the null hypothesis was true, and the average power is the mean percentage of significant results when the null hypothesis was not true. All tests were at the 5 per cent level. Single underlined values are those where the average size for cases where the null hypothesis was true is significantly different from 5 per cent (i.e. the result is outside the range 4.4 to 5.6 per cent for average size, and outside the range 4.91 to 5.09 per cent for overall results). Double underlining is as defined in Table II.

Number of
replicates  Average for  Observations randomized  ter Braak method       Use of F-tables
                         F1, F2, F3  F12, F13, F23  F1, F2, F3  F12, F13, F23  F1, F2, F3  F12, F13, F23

Normal distribution
            Power        72.3        14.0           72.3        13.8           73.8        14.3

Uniform distribution
            Power        55.7        10.3           55.7        10.1           56.9        10.5
6           Size         4.9         5.3            4.8         5.3            4.8         5.4
            Power        72.8        14.1           72.7        14.1           73.9        14.6

Exponential distribution
            Power        55.7        10.8           55.4        10.5           56.4        10.4
6           Size         5.1         4.8            5.1         4.8            5.0         4.8
            Power        72.4        14.5           72.5        14.8           73.7        14.7

Empirical distribution
            Power        60.4        11.9           58.8        11.5           56.7        9.1
6           Size         4.6         4.6            4.4         4.5            3.9         3.6
            Power        74.9        15.1           74.2        13.8           73.9        12.6

Over all distributions
            Size         5.0         5.0            4.8         4.9            4.7         4.7
            Power        43.1        9.4            42.8        9.2            43.3        9.1
7. DISCUSSION
Whilst it is obviously not possible to draw definitive conclusions from the simulation of a few specific situations, it is clear that all of the randomization methods have given rather similar results for assessing the existence of factor effects except in very extreme situations. Therefore, from a practical point of view it is not possible to say that one method of randomization is much better than another.
One reason for this is that the use of F-statistics has the effect of making all of the randomization distributions rather similar for most data. We discovered early in our study that this does not occur if sums of squares are used instead of F-statistics. Indeed it became very clear that using sums of squares does not give tests with good properties. This explains, for example, why Edgington (1995, p. 133) found somewhat different p-values of 2.8 and 1.4 per cent for testing a factor effect using a restricted randomization and freely randomizing observations with a small artificial two-factor set of data. Using the F-statistic when freely randomizing observations gives a p-value of 3.2 per cent, which is much closer to the 2.8 per cent for restricted randomization. Interestingly, randomizing residuals by ter Braak's (1992) method does not seem sensible for this set of artificial data because the residuals are all either +1 or −1 and the p-value is found to be 8.7 per cent.
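The point about F-statistics versus raw sums of squares can be seen directly: free permutation of the observations leaves the total sum of squares unchanged, so a large factor sum of squares forces a small error sum of squares, and F rescales each factor sum of squares by that error term while the raw sum of squares does not. A small sketch of our own, for a 3 × 4 layout:

```python
import numpy as np

def ssa_and_F(m):
    """Stream sum of squares and its F-statistic for a 3 x 4
    two-way layout without replication."""
    grand = m.mean()
    ssa = 4 * ((m.mean(axis=1) - grand) ** 2).sum()
    sse = ((m - m.mean(axis=1, keepdims=True)
              - m.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    return ssa, (ssa / 2) / (sse / 6)

rng = np.random.default_rng(2)
y = rng.normal(5.0, 1.0, size=(3, 4))

# The total sum of squares is invariant under permutation of the
# observations, so SSA and SSE must trade off against each other.
total = ((y - y.mean()) ** 2).sum()
perms = [rng.permutation(y.ravel()).reshape(3, 4) for _ in range(200)]
assert all(abs(((p - p.mean()) ** 2).sum() - total) < 1e-9 for p in perms)
```

Because of this trade-off, a permutation that inflates the factor sum of squares also deflates the error mean square, so the F-statistic separates extreme rearrangements from typical ones more sharply than the raw sum of squares does.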
REFERENCES

Crowley, P. H. (1992). 'Resampling methods for computation-intensive data analysis in ecology and evolution', Annual Review of Ecology and Systematics, 23, 405–447.
Edgington, E. S. (1995). Randomization Tests, 3rd edn, Marcel Dekker, New York.
Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology, Chapman and Hall, London.
Still, A. W. and White, A. P. (1981). 'The approximate randomization test as an alternative to the F-test in analysis of variance', British Journal of Mathematical and Statistical Psychology, 34, 243–252.
ter Braak, C. J. F. (1992). 'Permutation versus bootstrap significance tests in multiple regression and ANOVA', in Jockel, K. H. (ed.), Bootstrapping and Related Techniques, Springer, Berlin, pp. 79–86.

Table VI. Results for the analysis of variance of ratios of ephemeroptera to oligochaetes using three different methods. The first involves randomizing observations; the second randomizing residuals (ter Braak); and the third the conventional F-test for analysis of variance
Source of variation           d.f.   SS       MS      F       Significance level (%)
                                                              Randomizing    Randomizing   Using
                                                              observations   residuals     F-distribution
Stream                        2      166.3    83.1    11.04   0.02           0.02          0.01
Position                      2      753.4    376.7   50.03   0.02           0.02          0.00
Season                        3      364.3    121.4   16.13   0.02           0.02          0.00
Stream × position             4      313.2    78.3    10.40   0.02           0.02          0.00
Stream × season               6      316.0    52.7    7.00    0.04           0.02          0.00
Position × season             6      724.5    120.8   16.04   0.02           0.02          0.00
Stream × position × season    12     640.5    53.4    7.09    0.02           0.02          0.00
Error                         72     542.1    7.5
Total                         107    3820.3