
STATISTICS IN MEDICINE, VOL. 13, 2455-2463 (1994)

INTERNAL PILOT STUDIES FOR ESTIMATING SAMPLE SIZE*

MARTIN A. BIRKETT AND SIMON J. DAY

Lilly Research Centre Ltd., Erl Wood Manor, Windlesham, Surrey GU20 6PH, U.K.

SUMMARY

We develop the idea of using data from the first 'few' patients entered in a clinical trial to estimate the final trial size needed to have specified power for rejecting H0 in favour of H1 if a real difference exists. When comparing means derived from Normally distributed data, there is no important effect on test size, power or expected trial size, provided that a minimum of about 20 degrees of freedom is used to estimate residual variance. Relative advantages and disadvantages of using larger internal pilot studies are presented. These revolve around crude expectations of the final study size, recruitment rate, duration of follow-up and practical constraints on the ability to prevent the circulation of unblinded randomization codes to investigators and those involved in editing and checking data.

INTRODUCTION

When planning a clinical trial we may obtain a prior estimate of variance to calculate the required sample size. Sometimes this estimate of variance is subsequently found to be quite different from the actual variance obtained at the end of the trial. Why does this happen? Our prior estimate of variance is often obtained from small trials or sometimes pilot studies. These may have different study conditions to the trial that is currently being designed, for example, different patient types, small numbers of centres, different treatment duration and so on. In general, the prior estimate of variance obtained from these data may not be representative of the trial being designed. This potential problem can be overcome by the use of an internal pilot study, which forms an integral part of the trial itself and is not a separate study. The internal pilot consists of the first 'few' patients entered in the trial, the actual number being one of the main unanswered questions for this design. The internal pilot study is used to estimate important parameters, in this case the variance, in order to calculate the required sample size. The study then continues until the number of patients required (obtained by information gained from the internal pilot) have been recruited. All patients, including those from the internal pilot, are used in the final analysis. So, the purpose of the internal pilot study is to obtain a good estimate of the true variance based on actual data from patients entered in the trial itself.

Wittes and Brittain [1] give an excellent description of other uses of internal pilot studies in addition to calculating the required sample size. They also considered the effect of an internal pilot study designed to recalculate the variance in a study with a normally distributed outcome variable. They suggested that use of an internal pilot study has at most a very small adverse effect

* Presented at the International Society for Clinical Biostatistics Thirteenth International Meeting, Copenhagen, Denmark, August 1992

CCC 0277-6715/94/232455-09
© 1994 by John Wiley & Sons, Ltd.


on the significance level and may greatly improve the power of the study. Their approach is to use a prior estimate of variance to calculate a pre-planned sample size, and then to use half of the pre-planned sample size as the internal pilot study. When all patients from the internal pilot have completed the study, a measure of the variance is obtained based on these patients. If the variance obtained from the internal pilot is less than the prior estimate of variance, the study continues as planned using the pre-planned sample size. If the variance obtained from the internal pilot is greater than the prior estimate, the sample size is adjusted using the estimate of variance obtained from the internal pilot. In short, this means that the required sample size can never be less than the pre-planned sample size. The question remains as to what proportion of the pre-planned sample size, which they called π, should be used as the internal pilot.

Gould [2, 3] presented methodology for carrying out interim power evaluations when observations have binomial distributions, and subsequently recalculating the required sample size. The effect on type I and type II error rates, α and β, was investigated using certain fractions of the pre-planned sample size for the interim examination of the data. It was concluded that the variability of the final sample size is inversely related to the number of interim observations (size of internal pilot). That is, if the internal pilot study is small then there is more variability in the final required sample size, a greater likelihood of an unnecessarily large sample size, and inadequate or excessive power.

These recent papers offer the most substantial progress on what might be termed the 're-estimation concept' since Stein's [4] classic, though apparently little used, procedure published in 1945. That procedure provides assurance that the width of the confidence interval for the treatment effect will not exceed a pre-specified size.

In this paper we use the same methodology as Wittes and Brittain but with different numbers of patients for the internal pilot study, not just half (π = 1/2) of the pre-planned sample size as they did. For a number of practically relevant situations, we show by simulation that π is not what matters; rather, α, β and E[N] (the expected value of the recalculated sample size based on the estimate of variance obtained from the internal pilot study) depend on the absolute size of the internal pilot study.

METHODS

We consider a clinical trial to compare two groups, each with unknown means and unknown (though assumed equal) variances. The results should, however, be applicable to comparisons of more than two groups. The true (within-group, between-patient) variance is denoted σ² and the prior estimate, based on literature reviews, previous experience and so on, is denoted S². The difference to be detected, δ, is independent of any information gained from the internal pilot data.

We wish to detect the difference δ as statistically significant at level α with power 1 − β. To do this, we use our initial estimate of variance, S², to determine the sample size N₀ required in each group. Instead of proceeding to recruit N₀ patients, we recruit only n = πN₀ (π < 1) in each group and use these initial data to arrive at a revised estimate of σ², denoted σ̂². Hence we recalculate the required sample size (denote this N₁). The rule of Wittes and Brittain is that the final study size (N) should then be the maximum of N₀ and N₁. Gould also states that N₁ should never be less than N₀, but suggests further that N₀ might be increased only if N₁/N₀ > γ (γ ≥ 1 but otherwise unspecified) and similarly that the maximum value of N should be ωN₀ (ω > 1, with a suggested maximum value of 2). He notes, however, that this can affect the power to reject H0 when in fact H1 is true.
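The re-estimation rule can be sketched as follows. This is our own illustration, not the authors' program, and it uses the standard normal-approximation sample-size formula; the paper itself used the t-distribution, which gives slightly larger sizes (for example, 22 rather than 21 per group when σ² = 1 and δ = 1).

```python
# Illustrative sketch only (not the authors' code): per-group sample size
# from the usual normal-approximation formula
#   N = 2*sigma^2*(z_{1-alpha/2} + z_{power})^2 / delta^2,
# plus the Wittes-Brittain rule N = max(N0, N1).
from statistics import NormalDist

def n_per_group(variance, delta, alpha=0.05, power=0.90):
    z = NormalDist().inv_cdf
    n = 2.0 * variance * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2
    return round(n)  # variance=2, delta=1 gives the paper's 42 per group

def final_n(n0, n1):
    # Wittes-Brittain: the final size is never below the pre-planned N0
    return max(n0, n1)
```

With variance 2 and δ = 1 this reproduces the pre-planned 42 per group used in the simulations below; `final_n` then caps the recalculated size from below.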

We use the same simulation approach as Wittes and Brittain; this is written in Quick Basic Version 4.5 and proceeds as follows:


(i) Specify σ², N₀, n. In all cases, set α = 0.05, 1 − β = 0.90 and δ = 1 (to simulate power) or δ = 0 (to simulate the α-level).

(ii) Select two random Normal deviates (by the Box-Muller [5] method) with means 0 and δ, respectively, and with population variances σ².

(iii) Repeat step (ii) until n 'patients' are recruited per group.

(iv) Calculate the observed variance within each group and pool (average) them to obtain an estimate σ̂².

(v) Calculate N₁ and repeat step (ii) until the appropriate number of simulated patients have accrued.

(vi) Calculate the observed variance, observed difference, t-statistic and p-value.
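As a rough illustration, steps (i)-(vi) for a single simulated trial might look like the following. This is our Python translation, not the original Quick Basic, and the t-statistic here is computed with a normal-approximation sample-size formula.

```python
# Sketch of one simulated trial following steps (i)-(vi); our code, not the
# authors'. Returns the final per-group size N and the t-statistic.
import math
import random
import statistics

def simulate_trial(sigma2, n_pilot, n0, delta=1.0, alpha=0.05, power=0.90):
    sd = math.sqrt(sigma2)
    z = statistics.NormalDist().inv_cdf
    # (ii)-(iii): recruit the internal pilot, n_pilot 'patients' per group
    g1 = [random.gauss(0.0, sd) for _ in range(n_pilot)]
    g2 = [random.gauss(delta, sd) for _ in range(n_pilot)]
    # (iv): pool (average) the two within-group variances
    s2_hat = (statistics.variance(g1) + statistics.variance(g2)) / 2
    # (v): recalculate the size and apply the rule N = max(N0, N1)
    n1 = round(2 * s2_hat * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2)
    n = max(n0, n1)
    g1 += [random.gauss(0.0, sd) for _ in range(n - n_pilot)]
    g2 += [random.gauss(delta, sd) for _ in range(n - n_pilot)]
    # (vi): observed difference and t-statistic on the pooled variance
    sp2 = (statistics.variance(g1) + statistics.variance(g2)) / 2
    t = (statistics.mean(g2) - statistics.mean(g1)) / math.sqrt(2 * sp2 / n)
    return n, t
```

Repeating this many times with δ = 1 estimates the power; with δ = 0 it estimates the α-level.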

Simulations were repeated 47,500 times to determine α with an approximate standard error of 0.001, and 14,400 times to determine 1 − β with an approximate standard error of 0.0025.

A variety of rules were used to determine the final study size. Initially we repeated the method of Wittes and Brittain such that N = max(N₀, N₁). We did not consider restrictions limiting the upper value of N along the lines that Gould did, although we accept that in practice financial or time constraints would most likely be imposed and some upper limit for N set. In such cases, however, we consider the prime concern to be having the information about required sample sizes. With the information available, a decision can be made; without it, we trust to luck.

In all cases, sample sizes and p-values were determined using the t-distribution rather than the Normal, although in practice, this makes very little difference to the numerical results and no difference to our underlying conclusions.

RESULTS

Using the same methodology as Wittes and Brittain but with various values of n (the number of patients per treatment group in the internal pilot) instead of half the pre-planned sample size, we obtain the results in Table I.

When our prior estimate of variance (S²) is equal to the true study variance (σ²), that is, S² = σ² = 2, and n is small (2 patients per treatment group), then the power is greater than our planned 90 per cent, α is about 5 per cent but E[N] is high at 72 patients per treatment group. As n increases and approaches the pre-planned sample size of 42 patients per treatment group, the power decreases slightly but remains above our planned 90 per cent, α remains about 5 per cent and E[N] decreases, but more slowly at large values of n.

When S² > σ² the power is unnecessarily high for all values of n, α is unaffected and E[N] quickly reaches a plateau at the pre-planned sample size of 42 patients per treatment group.

The high power is a consequence of the Wittes and Brittain rule, which states that the re-estimate of sample size can never be less than the pre-planned sample size. We note that α appears consistently less than 5 per cent. In particular, when n = 10, the simulated value of α was 4.7 per cent. We believe this must be due to chance as we cannot justify any argument suggesting α could be less than the nominal 5 per cent. If some mechanism does exist that deflates (rather than inflates) α, the magnitude is very small and causes no practical concern.

Finally, when S² < σ² (the situation which causes most problems in clinical trials), the power falls substantially below the planned 90 per cent at small values of n, α is inflated slightly as n increases and E[N] decreases as n increases.

In conclusion, there is a slight loss in power at small values of n when the prior estimate of variance is too small and the power is too high when the prior estimate of variance is too large.


Table I. Effect of an internal pilot study of size n on α-level, power and expected sample size E[N] as a function of true variance

                                 Size of internal pilot (n)
σ²  Power if S²=σ²             2      3      4      5     10     15     20     30     42

1        99.5      Power    99.6   99.5   99.5   99.5   99.5   99.4   99.5   99.5   99.4
                   α (%)     5.0    4.9    4.8    5.0    4.7    4.8    4.9    4.8    4.9
                   E[N]     49.5   44.4   43.2   42.6   42.0   42.0   42.0   42.0   42.0
                   95%*     91.6   59.7   49.0   44.3   42.0   42.0   42.0   42.0   42.0

2        90.0      Power    93.9   93.3   93.2   92.8   92.5   92.2   92.1   92.3   91.6
                   α (%)     4.9    5.0    4.9    5.0    4.9    5.0    5.0    5.1    4.9
                   E[N]     72.0   59.2   55.2   53.3   49.1   48.0   47.1   46.3   45.7
                   95%*    177.5  118.7  101.3   89.6   71.3   64.4   61.6   57.3   55.1

4        61.0      Power    83.6   84.2   85.5   86.0   87.7   89.3   89.3   89.4   89.9
                   α (%)     4.9    5.2    5.1    5.2    5.3    5.3    5.1    5.3    5.2
                   E[N]    126.8  102.4   95.4   93.6   87.8   86.8   86.2   86.2   85.6
                   95%*    361.9  231.7  195.1  178.3  140.6  126.4  120.9  112.9  108.5

Study design: difference to be detected δ = 1, initial estimate of variance S² = 2, α = 5 per cent, β = 10 per cent, initial pre-planned sample size (N₀) is 42 per group. Number of replications in simulations: power 14,400, α-level 47,500. σ² is the true between-patient, within-group variance.
* 95th percentile of distribution of N

This demonstrates a weakness of the methodology proposed by Wittes and Brittain, since the rule of no reduction from the pre-planned sample size can lead to a study which is larger than necessary. The only effect on α is when the prior estimate of variance is too small. Here α is inflated slightly but we consider this increase minimal and of no practical importance. These results also enable some interesting conclusions as to the appropriate size for an internal pilot. For this we must not only look at the effect on power and α-level but also at E[N]. It can be seen from Table I that, irrespective of whether the prior estimate of variance is less than, greater than or equal to the true study variance, E[N] decreases by only comparatively small amounts after n = 10.

This would seem to indicate that the minimum size for an internal pilot study should be 10 patients per treatment group for a two-group study, or about 20 degrees of freedom in general. This would give reliable estimates of the true study variance without adversely affecting the power and α-level. However, this is only a minimum value. There still remains a problem with unnecessarily high power if S² > σ². This being so, why not grossly underestimate the prior estimate of variance? This would lead to a small pre-planned sample size and thus reduce the chance of having too large a study when the prior estimate of variance is larger than the true study variance. In the extreme, we make no prior sample size calculation but simply state the size of the internal pilot study. We investigated this approach. The methodology used is the same as before, but this time we use the obvious rule that the required sample size, N, can never be less than the size of the internal pilot (n), instead of N₀ as used previously.
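To make the contrast between the two lower-bound rules concrete, here is a small numerical sketch of our own. The numbers are assumptions for illustration: a prior guess S² = 4 that is double the true variance of 2, a pilot of n = 10 per group, and a pilot estimate that happens to recover the true variance.

```python
# Our illustration of the two lower-bound rules; not taken from the paper.
from statistics import NormalDist

def n_req(variance, delta=1.0, alpha=0.05, power=0.90):
    # normal-approximation per-group sample size
    z = NormalDist().inv_cdf
    return round(2 * variance * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2)

n0 = n_req(4.0)                  # pre-planned size from the too-large prior S^2 = 4
n1 = n_req(2.0)                  # recalculated size if the pilot recovers sigma^2 = 2
wittes_brittain_n = max(n0, n1)  # old rule: stuck with the oversized plan
no_prior_n = max(10, n1)         # new rule: bounded below only by the pilot, n = 10
```

Under these assumed numbers the Wittes and Brittain rule retains the 84-per-group plan, while the new rule settles at the 42 per group that the data actually require.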

We examine what happens to the three parameters of interest (power, α-level and E[N]) at various values of n with no prior estimate of either variance or sample size. Results are shown in Table II.

When the true study variance σ² = 1 is known before starting the trial, we require 22 patients per treatment group. At small values of n the power is lower than our required 90 per cent but


Table II. Effect of an internal pilot study of size n on α-level, power and expected sample size E[N]

                                 Size of internal pilot (n)
σ²  Power if S²=σ²             2      3      4      5     10     15     20     30     50

1        99.5      Power    80.6   83.9   85.8   87.9   90.2   91.1   91.9   96.9   99.8
                   α (%)     8.0    6.5    6.3    5.9    5.5    5.3    5.2    5.1    5.0
                   E[N]     35.0   26.7   24.8   24.5   23.1   23.0   23.7   30.1   50.0
                   95%*    106.6   62.6   50.0   46.6   36.0   33.3   31.4   30.0   50.0

2        90.0      Power    78.0   81.9   84.2   85.6   88.9   89.1   90.0   90.5   94.0
                   α (%)     7.1    5.8    5.6    5.3    5.3    5.3    5.1    5.1    5.0
                   E[N]     66.2   51.2   48.0   46.9   44.9   44.4   44.3   44.2   50.6
                   95%*    192.4  119.4   97.8   90.8   70.4   65.3   62.3   57.7   54.0

4        61.0      Power    77.1   80.7   83.4   85.2   87.6   88.6   89.4   90.1   90.1
                   α (%)     6.4    5.3    5.3    5.3    5.1    5.2    5.2    5.3    5.2
                   E[N]    125.5   98.7   93.6   92.1   87.5   86.6   86.2   85.9   85.5
                   95%*    373.7  237.1  195.5  177.0  138.7  127.2  121.3  113.5  105.5

Study design: no prior sample size calculation. Number of replications in simulations: power 14,400, α-level 47,500. S² is the initial estimate of variance; σ² the true variance.
* 95th percentile of distribution of N

increases as n increases; α is inflated at small values of n but quickly approaches 5 per cent as n increases. The expected value of N initially decreases as n increases, but when n approaches or exceeds the study size determined from σ² (here 22) then E[N] increases as a consequence of the rule that the recalculated sample size (N₁) can never be less than the size of the internal pilot (n).

When the true study variance σ² = 2 is known before starting the trial, we require 42 patients per treatment group. Again the power is below our required 90 per cent at small values of n, and α is again inflated at small values of n but quickly approaches 5 per cent as n increases. E[N] again decreases as n increases, but when n approaches or exceeds the study size determined from σ² (here 42) then E[N] again increases as a consequence of the rule above.

When the true study variance σ² = 4 is known before starting the trial, we require 84 patients per treatment group. Similar patterns for the power, α-level and E[N] are observed as with σ² = 1 and σ² = 2. When n > 84 we have an increase in E[N], again for the reasons given previously.

In conclusion, whether the true study variance is small or large in relation to the difference between treatment groups, δ, there is always a loss of power at small values of n. The α-level is inflated at small values of n but quickly approaches 5 per cent as n increases. E[N] again shows similar patterns to those in Table I, in that there is little decrease after n = 10 irrespective of whether the true study variance is small or large in relation to δ. This again indicates that an internal pilot study of 10 patients per treatment group would suffice; this is further supported by the fact that there is little increase in power after n = 10, and α, although inflated at low values of n, is not over-inflated at n = 10 and beyond. One advantage that our new method enjoys is that it avoids the penalty, arising under the Wittes and Brittain rule, of a larger than necessary study when the prior estimate of variance is larger than the true study variance.

A further enhancement would be to incorporate any prior estimate of sample size in a formal Bayesian procedure. If reasonable prior information exists then it would be sensible to summarize it by the mean and variance of a chi-square distribution (that is, using the conjugate prior), whilst possibly expressing only vague knowledge of the mean by an improper prior. The posterior distribution of the variance can then be determined in a fairly straightforward manner (see, for example, Lee [6], pp. 73-76). In the latter part of this section, we have shown that trials can begin with no prior estimate of sample size at all. In these cases it may be appropriate to use a very weak prior distribution or, in the limit, abandon the Bayesian approach altogether and only use the data.

Figure 1. Sample variance of Y-BOCS scores as recruitment progressed in a four-arm multicentre parallel group trial

OBSESSIVE-COMPULSIVE DISORDER

To investigate how reliable the estimate of variance may be with the small n's that we propose, a recently completed multicentre study was used to examine the observed variance as patient enrolment progressed. The study was a randomized, double-blind trial with four parallel treatment groups and approximately 200 patients, comparing treatments for obsessive-compulsive disorder (OCD). The primary efficacy variable was the score on the Yale-Brown Obsessive Compulsive Scale (Y-BOCS) [7, 8] and the study was designed to detect a difference of 8 units between treatment groups assuming a within-group standard deviation of 7 units. The overall study size was set at 200 patients to provide sufficient safety data for regulatory agencies. The size of the study and enrolment period of 18 months make it a good example for illustrating the method.

Figure 1 shows the actual within-group variance observed in the trial as patient enrolment increased, based on an analysis which was unblinded at the treatment group level but blinded to the actual treatment received by each group. If a large treatment effect exists we would observe smaller estimates of variance using unblinded data than using blinded data. Thus we recommend that the estimate of variance is obtained using data that are unblinded at the treatment group level, although in such cases a great deal of care needs to be taken to keep the results confidential. When very few patients have been recruited it is clear that the estimate of variance

can be unreliable (due to sampling error), but as n approaches and exceeds the values that we propose, giving about 20 degrees of freedom, the estimate we obtain seems quite reliable and unbiased.

DISCUSSION

We propose that the use of internal pilot studies to estimate sample size eliminates the guesswork and uncertainty that stems from trying to estimate σ² from external sources. Provided that the internal pilot study supplies about 20 degrees of freedom for estimating the residual variance, we are confident that there will be no important adverse effects on either the power or the size of the test. We do have one outstanding question that has no definitive answer but that may be left to individual choice: given that we have suggested a minimum value of n and have shown the problems that may arise as n increases (and even exceeds N₁), what is the optimum value of n?

Although we argue that n = 10 seems to give satisfactory results, the larger the value of n, the 'better' (that is, more accurate) will be the estimate of σ² and so the more appropriate will be the value of N₁. We have gone to the extreme where pre-study calculations of sample size are not needed, and although we hold this argument from a scientific point of view, in practice it will always be desirable to have some estimate of the final study size. A study with initial suggested size, N₀, of 25 will clearly be reacted to differently than one with N₀ = 1000, whether the setting is the pharmaceutical industry, an academic institution or even a GP practice. Whatever the situation, some reasonable idea of the final study size needs to be thought through. If, then, N₀ were determined as 25, what would be the arguments surrounding the choice of n? Contrast this with the case where N₀ = 1000 and consider the same discussion to decide upon n. We have shown the danger of selecting n potentially too close to N₁, and so our recommendation would be that if N₀ = 25, then n should certainly be near to 10. If, on the other hand, N₀ = 1000 then we see little disadvantage in letting n = 100, say. Clearly the estimate of σ² based on n = 100 will be more accurate than the estimate based on n = 10. Figure 2 shows 50 per cent, 90 per cent and 95 per cent confidence limits for the population variance based on varying degrees of freedom. From this figure and inspection of the percentiles of the distribution of N (Tables I and II) we suggest there is rarely any advantage in obtaining more than 30 to 40 degrees of freedom to determine σ̂².
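The narrowing of those confidence limits with increasing degrees of freedom can be sketched numerically. This is our code, not the paper's; it uses the Wilson-Hilferty approximation to chi-square quantiles so as to stay within the standard library (an exact routine such as scipy.stats.chi2.ppf would normally be preferred).

```python
# Approximate chi-square-based confidence limits for a variance estimated
# with df degrees of freedom; our sketch using the Wilson-Hilferty
# approximation chi2_p(df) ~ df * (1 - 2/(9*df) + z_p * sqrt(2/(9*df)))^3.
import math
from statistics import NormalDist

def chi2_ppf_approx(p, df):
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * math.sqrt(2 / (9 * df))) ** 3

def variance_ci(s2, df, level=0.95):
    a = (1 - level) / 2
    return (df * s2 / chi2_ppf_approx(1 - a, df),
            df * s2 / chi2_ppf_approx(a, df))
```

With s2 = 1 the 95 per cent interval shrinks from roughly (0.59, 2.1) at 20 degrees of freedom to roughly (0.67, 1.6) at 40, consistent with the diminishing returns beyond 30 to 40 degrees of freedom suggested above.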

A further consideration in choosing the size of the pilot, n, is whether or not enrolment would stop whilst the final patient from the internal pilot study is completing the treatment and follow-up period. This becomes more of an issue as the follow-up time per patient increases.

Consider two extremes where the assessment of a patient is immediate (for example, reduction of blood pressure over 24 hours) or where the follow-up needs to be about 8 weeks (for example, in a depression study). If the recruitment rate is m patients per day and the follow-up time t days per patient, then in a fixed size study design (recruiting N₀ patients per arm), the total time for the trial will be 2N₀/m + t days. If we use an internal pilot study, then it will be 2n/m + t days before the last patient's data are available for determining N. During the follow-up time (t) for the last patient, we could potentially recruit tm extra patients, but the risk is that tm may be greater than N − n. If we continue to recruit during the follow-up period, the total time it will take to complete the study will be 2n/m + 2(N − n)/m + t, or 2N/m + t days. If we stop recruiting and await the results of the internal pilot, the total time for the trial will be (2n/m + t) + (2(N − n)/m + t), or 2N/m + 2t (that is, an extra t days). So the decision rests on the risk of recruiting more patients than is necessary if we continue to recruit during the follow-up time, versus the inevitable addition of t days to the study duration if we await the results of the pilot. In all these expressions, a constant, S, may be added as 'statistician analysis time', but if the purpose is only to re-estimate the study variance, we would hope that this is negligible.
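The timing trade-off reduces to simple arithmetic; a sketch in our notation, following the expressions above:

```python
# Total trial duration (days) under the two recruitment policies above:
# m = recruitment rate (patients/day, both groups combined),
# t = follow-up days per patient, n = pilot size and N = final size per group.
def duration_continue(N, n, m, t):
    # keep recruiting while the pilot's last patient is in follow-up
    return 2 * N / m + t

def duration_pause(N, n, m, t):
    # pause after the pilot, wait out its follow-up, then recruit the rest
    return (2 * n / m + t) + (2 * (N - n) / m + t)  # = 2N/m + 2t
```

For example, with assumed values N = 42, n = 10, m = 2 and an 8-week follow-up (t = 56), pausing costs exactly the extra t = 56 days.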


Figure 2. 95, 90 and 50 per cent confidence limits for a population variance of 1 based on various degrees of freedom

The issue of confidentiality is not an insurmountable problem and is common in the context of group sequential methods. Independent data monitoring boards may be helpful; a full discussion of the issues was recently reported by Ellenberg et al. [9]. Other interesting comments can be found in Peace [10]. Such unblindings are considered quite acceptable provided they are done for honest reasons, with suitable control and (it is generally agreed) with a full description of the intent written into the study protocol before the event. Most pharmaceutical companies should have standard operating procedures (for example, Wiles et al. [11]) that will define how data monitoring boards will handle such unblindings. Other institutions may not necessarily have such stringently written procedures, but they should still have a well defined system that will ensure an appropriate level of blinding and confidentiality is maintained.

We feel that each study should be thought through on an individual basis and all of these issues be taken into consideration when thinking of implementing an internal pilot study. There appears to us to be situations where internal pilot studies may not be appropriate, particularly studies with long treatment periods, unless greatly increasing the total duration of the study is acceptable.

REFERENCES

1. Wittes, J. and Brittain, E. 'The role of internal pilot studies in increasing the efficiency of clinical trials', Statistics in Medicine, 9, 65-72 (1990).
2. Gould, A. L. 'Sample sizes required for binomial trials when the true response rates are estimated', Journal of Statistical Planning and Inference, 8, 51-58 (1983).
3. Gould, A. L. 'Interim analyses for monitoring clinical trials that do not materially affect the Type I error rate', Statistics in Medicine, 11, 55-66 (1992).
4. Stein, C. 'A two-sample test for a linear hypothesis whose power is independent of the variance', Annals of Mathematical Statistics, 16, 243-258 (1945).
5. Box, G. E. P. and Muller, M. E. 'A note on the generation of random normal deviates', Annals of Mathematical Statistics, 29, 610-611 (1958).
6. Lee, P. M. Bayesian Statistics: An Introduction, Edward Arnold, London, 1989.
7. Goodman, W. K., Price, L. H., Rasmussen, S. A., Mazure, C., Fleischmann, R. L., Hill, C. L., Heninger, G. R. and Charney, D. S. 'The Yale-Brown Obsessive Compulsive Scale. I. Development, use and reliability', Archives of General Psychiatry, 46, 1006-1011 (1989).
8. Goodman, W. K., Price, L. H., Rasmussen, S. A., Mazure, C., Fleischmann, R. L., Hill, C. L., Heninger, G. R. and Charney, D. S. 'The Yale-Brown Obsessive Compulsive Scale. II. Validity', Archives of General Psychiatry, 46, 1012-1016 (1989).
9. Ellenberg, S., Geller, N., Simon, R. and Yusuf, S. (eds). 'Practical issues in data monitoring of clinical trials', Statistics in Medicine, 12, 415-616 (1993).
10. Peace, K. E. (ed.). Biopharmaceutical Sequential Statistical Applications, Marcel Dekker, New York, 1992.
11. Wiles, A., Atkinson, G., Huson, L., Morse, P. and Struthers, L. 'Good Statistical Practice in clinical trials: Guideline standard operating procedures', DIA Journal, 1994 (in press).
