Stanley Lemeshow, David W. Hosmer Jr., Janelle Klar, and Stephen K. Lwanga

The adequacy of any sample size is assessed according to the scope of the study or study results. The second part of the book provides the theoretical background for sample size determination, covering populations, samples, and their sampling distributions.

Statistical Methods for Sample Size Determination

1 The one-sample problem

Estimating the population proportion

How many children must be included in the sample so that the rate can be estimated to within 5 percentage points of the true value with 99% confidence? Consider the information given in Example 1.1.1, only this time we will determine the sample size needed to estimate the proportion vaccinated in the population to within 10% (not 10 percentage points) of the true value.
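The distinction between absolute precision (percentage points) and relative precision (a percentage of the true value) can be sketched with the usual normal-approximation formulas, n = z²P(1−P)/d² and n = z²(1−P)/(ε²P). This is a minimal sketch: the function names are illustrative, and the anticipated proportion P = 0.80 used in the example call is an assumption, since the value from Example 1.1.1 is not reproduced here.

```python
import math
from statistics import NormalDist


def n_absolute(p, d, confidence):
    """Sample size to estimate a proportion to within d (absolute, e.g. 0.05)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided quantile
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)


def n_relative(p, eps, confidence):
    """Sample size to estimate a proportion to within eps * p (relative)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * (1 - p) / (eps ** 2 * p))


# Assumed anticipated proportion (hypothetical, not taken from the source).
P_ASSUMED = 0.80
print(n_absolute(P_ASSUMED, d=0.05, confidence=0.99))
print(n_relative(P_ASSUMED, eps=0.10, confidence=0.99))
```

When P is unknown, P = 0.5 gives the most conservative (largest) sample size under the absolute-precision formula.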

Hypothesis testing for a single population proportion

Therefore, taking the larger of the two sample size determinations, 233 patients would need to be studied using the new treatment. Therefore, taking the larger of the two sample size determinations, 761 patients would need to be studied in the community.

Fig. 2 Sampling distributions for one-sample hypothesis test

2 The two-sample problem

Estimating the difference between two proportions

For example, we may want to be 95% confident that we will estimate the risk difference to within 2 percentage points of the actual risk difference. Alternatively, we might want to estimate the risk difference to within 10% of the actual difference.
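For the absolute-precision case, a common normal-approximation formula with equal group sizes is n = z²[P1(1−P1) + P2(1−P2)]/d² per group. The sketch below assumes that formula; the function name and the anticipated risks in the example call are hypothetical and do not come from the text.

```python
import math
from statistics import NormalDist


def n_risk_difference(p1, p2, d, confidence):
    """Per-group sample size to estimate P1 - P2 to within d (absolute)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance_sum / d ** 2)


# Hypothetical anticipated risks of 40% and 30%; precision of 2 percentage points.
print(n_risk_difference(0.40, 0.30, d=0.02, confidence=0.95))
```

Such a tight margin requires several thousand subjects per group, which is why the required precision deserves careful thought at the design stage.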

Hypothesis testing for two population proportions

This parameter can be estimated as the average of the two sample proportions from the pilot study. How many adults would need to be followed in each of the two communities to have an 80% probability of detecting such a large drop in the rate if the hypothesis test is performed at the 5% significance level?
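A calculation of this kind can be sketched with the standard normal-approximation formula for comparing two proportions, in which the pooled proportion is the average of P1 and P2. The function name and the rates in the example call are assumptions for illustration only; the rates of the original example are not reproduced in this text.

```python
import math
from statistics import NormalDist


def n_two_proportions(p1, p2, alpha, power, two_sided=False):
    """Per-group sample size for testing H0: P1 = P2 (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_beta = nd.inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled proportion under H0
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical rates: 50% in one community, 40% in the other (one-sided test).
print(n_two_proportions(0.50, 0.40, alpha=0.05, power=0.80))
```

A one-sided test is used in the example call because the question concerns detecting a drop in the rate; passing `two_sided=True` gives the larger two-sided requirement.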

Fig. 4 displays a graphical representation of the two-sample hypothesis testing situation for two proportions.

Estimating the odds ratio with stated precision

Therefore, 408 subjects would be required in each of the case and control groups to ensure, with 95% confidence, that the odds ratio estimate will not underestimate the true OR by more than 25%. If the purpose of the statistical analysis is to test a hypothesis about the odds ratio, sample size calculations can be performed using the methodology described earlier.
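One way to sketch this kind of calculation is the log-scale formula n = z²[1/(P1(1−P1)) + 1/(P2(1−P2))]/[ln(1−ε)]², where P2 is the exposure rate among controls and P1 is derived from the anticipated OR. The inputs OR = 2 and P2 = 0.30 in the example call are assumptions; the excerpt does not state which values were used, although with these inputs the sketch returns the quoted figure of 408.

```python
import math
from statistics import NormalDist


def n_odds_ratio(odds_ratio, p2, eps, confidence):
    """Per-group sample size to estimate the OR to within eps of its true value."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Exposure rate among cases implied by the OR and the control exposure rate.
    p1 = odds_ratio * p2 / (odds_ratio * p2 + (1 - p2))
    variance_term = 1 / (p1 * (1 - p1)) + 1 / (p2 * (1 - p2))
    return math.ceil(z ** 2 * variance_term / math.log(1 - eps) ** 2)


# Assumed inputs (hypothetical): OR = 2, exposure rate among controls = 0.30.
print(n_odds_ratio(2.0, p2=0.30, eps=0.25, confidence=0.95))
```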

Table 1.1 Tabular display of disease/exposure relationship

4 Sample size determination for cohort studies

Confidence interval estimation of the relative risk

This value can be located in Table 11e at the intersection of the column corresponding to RR = 1.75 and the row corresponding to P2* = 0.20.

Hypothesis testing of the population relative risk

The binomial distribution is a statistical distribution that describes the probability of a particular configuration of dichotomous outcomes when the total number is finite (e.g., the number of times a "head" comes up in 7 coin tosses). The power of the test is reflected in how steeply the curve rises to 1.0 in the 0 ≤ P ≤ P0 range.
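The coin-toss illustration can be made concrete with a short sketch of the binomial probability function and its cumulative form. The function names are illustrative, and the cumulative probability is included only because acceptance rules of the "at most d events" type are built from it.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_most(d, n, p):
    """Probability of observing d or fewer events in n trials."""
    return sum(binomial_pmf(k, n, p) for k in range(d + 1))


# Probability of exactly 3 heads in 7 tosses of a fair coin: 35/128.
print(binomial_pmf(3, 7, 0.5))
```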

Fig. 8 Consequences of hypothesis testing in LQAS procedure

6 The incidence rate

Modifications to the study design based on control of the follow-up period will now be presented for the two-sample problem. The appropriate sample size is located in the table, for specified α and β levels, at the intersection of the row representing λ2 and the column representing λ1.

The one-sample problem

The difference between the two population means represents a new parameter, μ1 − μ2. An estimate of this parameter is given by the difference in the sample means, x̄1 − x̄2. The mean of the sampling distribution of x̄1 − x̄2 is μ1 − μ2. Suppose a study is designed to test H0: μ1 = μ2 against Ha: μ1 > μ2. The mean of the sampling distribution of x̄ .

8 Sample size for sample surveys

Simple random and systematic sampling

Using 50 children as a pilot sample to estimate P for use in formula (27), it follows that. Therefore, a simple random sample of 338 children needs to be selected, which means that an additional 288 children need to be studied.

Stratified random sampling

Therefore, using formula (30), which includes correction factors for the finite population within each stratum, we obtain a reduced sample size estimate. Therefore, a greatly reduced estimate of the required sample size was obtained using formula (31), which includes correction factors for the finite population within each stratum.

Table 1.2 Sample size computation by spreadsheet technique

1 The population

Elementary units

Population parameters

2 The sample

Probability and nonprobability sampling

Sampling frames, sampling units and enumeration units

For example, in studying vaccination of children in a city, it is unlikely that an accurate and up-to-date list of all children will be available or can be compiled at reasonable cost before sampling. However, it is conceivable that a list of all households is available or at least can be obtained without great effort or expense.

Sample measurements and summary statistics

If such a list is available, a sample of households can be drawn, and those children living in the sampled households are taken as the elementary units.

Estimates of population characteristics

The variance of the sampling distribution of the estimated parameter θ̂, denoted by Var(θ̂), is defined as. For a sample mean x̄, the mean, variance, and standard error of the sampling distribution are E(x̄).

Table 11.1 Data on immunization status in 30 villages

Two-stage cluster sampling

Here z is the number of standard errors from the mean of the sampling distribution and σp =. This expression is quite inclusive, including Ni, ni, and Pi for each of the m groups selected in the sample. The bias of an estimate θ̂ of a population parameter θ is defined as the difference between the mean, E(θ̂), of the sampling distribution of θ̂ and the true value of the unknown population parameter θ.

Suppose a representative sample of the population is studied and the proportion of the sample individuals with hypertension is 28%.

Table 11.3 Distribution of families in ten hypothetical villages

5 Hypothesis testing

Finally, the power of the test is determined as the probability of correctly rejecting H0 given that H0 is false. As seen in Figures 15 and 16, the probability of making a type II error and the power of the test vary according to the true P; the further the true P is from P0, the smaller the type II error (and the larger the power). If the results showed that fewer than 80% of mothers received "adequate" prenatal care, then some appropriate community action would be initiated.
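How the power grows as the true P moves away from P0 can be sketched for a one-sided test of H0: P = P0 against Ha: P < P0, using the normal approximation. The function name, the sample size n = 200 and the alternative values of P in the example are assumptions chosen for illustration; only P0 = 0.80 echoes the prenatal-care example.

```python
import math
from statistics import NormalDist


def power_lower_tail(p0, p_true, n, alpha):
    """Power of the one-sided test H0: P = p0 vs Ha: P < p0 when P = p_true."""
    nd = NormalDist()
    # Reject H0 when the sample proportion falls below this critical value.
    critical = p0 - nd.inv_cdf(1 - alpha) * math.sqrt(p0 * (1 - p0) / n)
    se_true = math.sqrt(p_true * (1 - p_true) / n)
    return nd.cdf((critical - p_true) / se_true)


# Hypothetical design: n = 200 mothers, 5% significance level.
for p_true in (0.78, 0.75, 0.70):
    print(p_true, round(power_lower_tail(0.80, p_true, n=200, alpha=0.05), 3))
```

The type II error probability at each alternative is simply one minus the power returned by the function.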

Under the null hypothesis and by invoking the central limit theorem, the sampling distribution of the sample proportion is normal with mean P0 and variance P0(1 − P0)/n.

Fig. 14 Summary of the probabilities of the possible outcomes in hypothesis testing

6 Two-sample confidence intervals and hypothesis tests

Therefore, since this falls in the rejection region (ie, -0.18<-0.174), the null hypothesis is rejected in favor of the alternative that the drugs are not equally effective. Individuals in different exposure groups (usually presence and absence of exposure) are then followed until a determination of the outcome characteristic (eg, disease or no disease) is made. This proportion is called prevalence and represents a snapshot of the number of people with this condition at a given point in time in relation to the total number of eligible individuals in the population.

The concept of prevalence differs from the concept of incidence, which is a measure of the number of new cases that appear in a population over a period of time.

Fig. 17 presents the two-sample hypothesis testing situation for proportions.

The relative risk and odds ratio

The outstanding feature of a case-control study is that it allows the estimation of relative risk under certain conditions. In a case-control study, the variable measured is the individual's exposure status. Thus, when the disease is rare, we can approximate the relative risk by the odds ratio, which can be estimated from the case-control study design.
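The rare-disease approximation can be checked numerically. The sketch below computes both measures from the disease risks in the exposed and unexposed groups; the function names and the risks used are hypothetical and chosen so that the relative risk equals 2 in both scenarios.

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Ratio of the disease risks in the exposed and unexposed groups."""
    return risk_exposed / risk_unexposed


def odds_ratio(risk_exposed, risk_unexposed):
    """Ratio of the disease odds in the exposed and unexposed groups."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Rare disease: the OR is very close to the RR.
print(relative_risk(0.002, 0.001), odds_ratio(0.002, 0.001))
# Common disease: the OR overstates the RR.
print(relative_risk(0.40, 0.20), odds_ratio(0.40, 0.20))
```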

Regarding the notation of the 2x2 table in Table 11.4, the parameters can be estimated as follows:

Table 11.3 Tabular display of disease/exposure relationship

The sampling distribution of the odds ratio

If a statistical test of the null hypothesis H0: OR = 1 against the alternative Ha: OR ≠ 1 is to be performed, the usual chi-square (χ²) test based on the 2x2 table can be used. (The previously described test of the equality of two proportions is equivalent to this chi-square test based on the 2x2 table.) Thus, as was the case with the odds ratio, confidence interval estimation of the RR is usually performed by first obtaining a confidence interval for ln(RR) and later exponentiating the confidence limits to obtain a confidence interval for the RR parameter.

Again, the limits of the symmetric confidence interval for ln(RR) must be exponentiated in order to obtain a confidence interval for the relative risk.
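The log-then-exponentiate procedure can be sketched for both measures. The cell labels a, b, c, d (exposed with and without disease, unexposed with and without disease) are an assumed layout and may not match the notation of the source table; the variance expressions are the usual large-sample ones, and the counts in the example are hypothetical.

```python
import math
from statistics import NormalDist


def log_scale_ci(estimate, se_log, confidence=0.95):
    """Symmetric interval on the log scale, exponentiated back."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = math.log(estimate)
    return math.exp(centre - z * se_log), math.exp(centre + z * se_log)


def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """OR with interval; Var(ln OR) is approximately 1/a + 1/b + 1/c + 1/d."""
    or_hat = (a * d) / (b * c)
    lower, upper = log_scale_ci(or_hat, math.sqrt(1/a + 1/b + 1/c + 1/d), confidence)
    return or_hat, lower, upper


def relative_risk_ci(a, b, c, d, confidence=0.95):
    """RR with interval; Var(ln RR) is approx. 1/a - 1/(a+b) + 1/c - 1/(c+d)."""
    rr_hat = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    lower, upper = log_scale_ci(rr_hat, se_log, confidence)
    return rr_hat, lower, upper


# Hypothetical counts: 20/100 exposed and 10/100 unexposed develop the disease.
print(odds_ratio_ci(20, 80, 10, 90))
print(relative_risk_ci(20, 80, 10, 90))
```

The resulting limits are symmetric around the estimate on the log scale but not on the original scale, so the upper limit lies further from the estimate than the lower one.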

Fig. 18 shows the use of ln(OR) and the relationship between the confidence limits as defined on the two scales.

Screening tests for disease prevalence

This expression for estimating the variance of the log of the relative risk is then used in the same way as the expression for the variance of the odds ratio was for constructing confidence intervals. Furthermore, we assume that all n patients were studied in detail after screening and n.1 were determined to actually have the disease while n.2 were determined to be disease-free. This is the percentage of individuals with positive screening tests who actually have the disease.
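The quantity described last, the proportion of screen-positive individuals who truly have the disease, can be computed directly from the four cells of the classification table, along with the other common screening summaries. The function name, argument names and counts below are hypothetical illustrations, not the notation or data of the source.

```python
def screening_summary(true_pos, false_pos, false_neg, true_neg):
    """Basic screening-test measures from a 2x2 classification table."""
    n = true_pos + false_pos + false_neg + true_neg
    diseased = true_pos + false_neg          # total with the disease
    disease_free = false_pos + true_neg      # total without the disease
    return {
        "prevalence": diseased / n,
        "sensitivity": true_pos / diseased,
        "specificity": true_neg / disease_free,
        # Proportion of positive tests that are true cases.
        "predictive_value_positive": true_pos / (true_pos + false_pos),
    }


# Hypothetical screening of 1000 patients, 100 of whom have the disease.
print(screening_summary(true_pos=90, false_pos=50, false_neg=10, true_neg=850))
```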

It is intended to serve as a reference point for the sample size discussions in the manual.

Fig. 19 Classification table of true condition versus decision

Simple random sampling

Unbiasedness is a desirable statistical property because it ensures that the sample values will be correct on average. Use a random process (e.g., a table of random numbers) to generate n numbers between 1 and N that identify the n individuals in the sample. Definition: A simple random sample is one in which each of the NCn possible samples has an equal chance of being selected, i.e., 1/(NCn).
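The selection procedure and the count of possible samples can be sketched as follows, with a pseudo-random number generator standing in for the table of random numbers. The function name, the seed and the values of N and n are illustrative assumptions.

```python
import random
from math import comb


def simple_random_sample(N, n, seed=None):
    """Draw n distinct labels from 1..N, every subset being equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))


N, n = 500, 10  # hypothetical population and sample sizes
print(simple_random_sample(N, n, seed=42))

# Number of distinct samples of size n, and the selection probability of each.
possible_samples = comb(N, n)
print(possible_samples, 1 / possible_samples)
```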

However, because of its several disadvantages, alternatives to simple random sampling are often used in real surveys of human populations.

Systematic sampling

It may be possible to select a systematic sample in situations where a simple random sample cannot be selected. In this situation, a simple random sample is not possible, since N is not known in advance. Under most conditions, simple random sampling formulas for parameter and variance estimates can be used with systematic sampling.
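A minimal sketch of a 1-in-k systematic selection follows. It assumes a random start among the first k units and works on any stream of units, so N does not need to be known in advance; the function name and the example stream of 100 clinic visits are hypothetical.

```python
import random


def systematic_sample(units, k, seed=None):
    """Select every k-th unit after a random start among the first k units."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position 0..k-1 within the first interval
    return [unit for index, unit in enumerate(units) if index % k == start]


# Hypothetical stream: patients numbered in order of arrival at a clinic.
arrivals = range(1, 101)
print(systematic_sample(arrivals, k=10, seed=7))
```

Because only the sampling interval k is fixed beforehand, the achieved sample size depends on how many units eventually pass through the stream.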

This is due to the fact that many of the problems previously mentioned with simple random sampling also apply to systematic sampling, and it is possible to obtain better accuracy at a lower cost with other methods than is possible with systematic sampling.

Stratified sampling

This means that when estimating the population mean, proportion, or total, each sample element is multiplied by the same constant, 1/n, regardless of the stratum to which the element belongs. A stratified random sample can provide greater precision (ie, narrower confidence intervals) than is possible with a simple random sample of the same size. For administrative or logistical reasons, it may be easier to select a stratified sample than a simple random sample.
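The self-weighting property mentioned at the start of this passage can be illustrated under proportional allocation, where each stratum's share of the sample equals its share of the population. In that case the stratum-weighted mean coincides with the plain mean of all sampled values. The function names, strata and data below are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of n to strata in proportion to their sizes."""
    N = sum(stratum_sizes.values())
    return {name: round(n * size / N) for name, size in stratum_sizes.items()}


def stratified_mean(samples, stratum_sizes):
    """Population-weighted mean of the stratum sample means."""
    N = sum(stratum_sizes.values())
    return sum(
        stratum_sizes[name] / N * (sum(values) / len(values))
        for name, values in samples.items()
    )


sizes = {"urban": 600, "rural": 400}  # hypothetical stratum sizes
print(proportional_allocation(sizes, n=10))

# Hypothetical measurements, 6 urban and 4 rural (proportional to 600:400).
samples = {"urban": [1, 2, 3, 4, 5, 6], "rural": [10, 20, 30, 40]}
pooled = [y for values in samples.values() for y in values]
print(stratified_mean(samples, sizes), sum(pooled) / len(pooled))
```

Rounding in `proportional_allocation` can make the allocations sum to slightly more or less than n when the proportions are not whole numbers; a real design would need a rule for resolving that.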

The main disadvantage of stratified sampling, however, is that it is no less expensive than simple random sampling, since detailed frames must be constructed for each stratum before sampling.

Cluster sampling

10% of True Risk with 99% Confidence

25% of True Risk with 90% Confidence