
Short communication

Monte Carlo approaches for determining power and

sample size in low-prevalence applications

Michael S. Williams*, Eric D. Ebel, Bruce A. Wagner

Animal and Plant Health Inspection Service, USDA, 2150 B Centre Avenue, Mail Stop 2E6,

Fort Collins, CO 80526, USA

Received 6 November 2006; received in revised form 27 April 2007; accepted 18 May 2007

Abstract

The prevalence of disease in many populations is often low. For example, the prevalence of tuberculosis,

brucellosis, and bovine spongiform encephalopathy range from 1 per 100,000 to less than 1 per 1,000,000 in

many countries. When an outbreak occurs, epidemiological investigations often require comparing the

prevalence in an exposed population with that of an unexposed population. To determine if the level of

disease in the two populations is significantly different, the epidemiologist must consider the test to be used
and the desired power of the test, and determine the appropriate sample size for both the exposed and unexposed

populations. Commonly available software packages provide estimates of the required sample sizes for this

application. This study shows that these estimated sample sizes can exceed the necessary number of samples

by more than 35% when the prevalence is low. We provide a Monte Carlo-based solution and show that in

low-prevalence applications this approach can lead to reductions in the total sample size of more than

10,000 samples.

Published by Elsevier B.V.

Keywords: Two population; Proportion; Eradication; Animal surveillance

1. Introduction

Consider the problem of determining an adequate sample size to detect a specified difference in

the prevalence of disease in two populations (e.g., Adcock, 1997). In such a study, the null

hypothesis is that no difference in prevalence exists in the populations (i.e., H0: p1 = p2). However,

in many epidemiological investigations, the goal is to determine whether the difference in the

prevalence exceeds a pre-determined threshold. For example, suppose one wishes to determine if

the prevalence in an exposed population is at least five times higher than the prevalence in the


Preventive Veterinary Medicine 82 (2007) 151–158

* Corresponding author. Tel.: +1 970 494 7306; fax: +1 970 494 7174.

E-mail address: michael.s.williams@aphis.usda.gov (M.S. Williams).

0167-5877/$ – see front matter. Published by Elsevier B.V.

doi:10.1016/j.prevetmed.2007.05.015


unexposed population (i.e., H1: p1 ≥ 5p2). If α is the significance level of the test and if p1 and p2 are the
true prevalences, then the power of the test, denoted by 1 − β, is the probability that the null
hypothesis is rejected whenever p1 ≥ 5p2 (i.e., Pr[reject H0 | H1] = 1 − β). A common value for β is
0.20, but lower values should be considered when the cost of not detecting the difference is high in

comparison to the cost of collecting the data.

Casagrande et al. (1978) and Fleiss et al. (1980) provide estimators to determine the sample size

needed to detect a difference between two populations with a specified significance level and

power. Due to the discrete nature of the data andthe relianceofthese estimators on the asymptotic

behavior of the test statistic, a number of different continuity corrections have been suggested to

the original estimator given by Fleiss et al. (1980). Gordon and Watson (1996) summarize the

results of numerous authors and conclude that continuity correction is rarely beneficial.

Commonly available software packages have implemented a number of different

sample size estimators, with examples being EpiInfo (CDC, 2006), the Hmisc library for R and

S+ (Alzola and Harrell, 2006), the sampsi function in Stata (StataCorp, 2003), and the Power

procedure in SAS (O’Brien, 1998).

These sample size estimators have been shown to work well in many applications. However, the

range of prevalences considered in these studies is often orders of magnitude larger than

the prevalence levels encountered in manyanimal surveillance applications, particularly when the

disease has been nearly eradicated from the populations in question. In this study, we consider the

performance of two of these estimators and show that the suggested sample sizes are often very

inaccurate when the prevalence of the disease is low. We propose a Monte Carlo simulator,

combined with a binary search algorithm, to determine the appropriate sample size to achieve a test
with a given power. A simulation study shows that while the Monte Carlo-based solution performs

well, the estimated sample sizes provided by the other two methods can exceed the necessary

number of samples by more than 35% when the prevalence is low. Computer code has been made

available to implement the Monte Carlo-based solution in either R or S+.

2. Review

Consider two large populations where the true proportion of diseased animals is given by p1 and
p2, respectively. From each of the populations, a random sample of size n1 and n2 is drawn and x1
and x2 diseased animals are found. Thus, x1 and x2 are such that X1 | n1, p1 ~ Binomial(n1, p1),
X2 | n2, p2 ~ Binomial(n2, p2), and p̂1 = x1/n1 and p̂2 = x2/n2 are the estimators of p1 and p2.

The statistic used in the test is

\[
z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}},
\]

where var(p̂1 − p̂2) = var(p̂1) + var(p̂2) − 2 cov(p̂1, p̂2) = var(p̂1) + var(p̂2) because the samples
in each population are assumed to be independent.

Under H0, the prevalence of the disease is the same in each population, so p1 = p2 and

\[
z = \frac{\hat{p}_1 - \hat{p}_2}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}}.
\]

The sampling distribution of this statistic is approximately Normal and the distribution of the

test statistic agrees well with a standard Normal distribution when both the sample size and

proportion of diseased animals are high.
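The statistic can be computed directly from the observed counts. The paper's posted code is for R and S+, so the following Python translation is a sketch; the function name, the unpooled variance estimate, and the handling of a zero estimated variance are assumptions made here:

```python
import math

def two_sample_z(x1, n1, x2, n2):
    """z statistic for the difference of two binomial proportions.

    Uses the unpooled variance estimate var(p1_hat) + var(p2_hat),
    following the independence assumption in the text. Returns None
    when the estimated variance is zero (e.g., no positives in either
    sample), a case that matters at very low prevalence.
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    var = p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2
    if var == 0:
        return None
    return (p1_hat - p2_hat) / math.sqrt(var)
```

For example, `two_sample_z(50, 1000, 25, 1000)` gives z ≈ 2.95, while `two_sample_z(0, 100, 0, 100)` returns `None` because both observed proportions are zero.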



For the typical significance level of α = 0.05, the test statistic will fall in the interval (−1.96,
1.96) for roughly 95% of all samples. In other words, if the null hypothesis is true and the sample
size is sufficient for the distribution of z to be approximately Normal, then

\[
P[-z_{\alpha/2} < z < z_{\alpha/2}] \approx 0.95.
\]

The interpretation of failing to reject H0 can be misleading because even though a test fails to
reject the hypothesis that p1 = p2, it does not imply that no difference exists. Rather, the result can
imply that, for the given sample size, the difference in the two populations was too small to be
detected. It can be misleading to use statistical tests without considering their power, where the
power of a statistical test is the probability that H0 will be rejected when the difference between
the two population parameters is p1 − p2.

Assume that the specified significance level is α = 0.05 and a priori it is known that the
appropriate alternative hypothesis is H1: p1 > p2. Then the power of the α-level test for the null
hypothesis H0: p1 = p2 is given by

\[
P[\text{reject } H_0 \mid H_0\ \text{false}] = P[z > z_{\alpha/2}].
\]

The power of the test can be determined for a given p1 − p2 as follows:

\[
\begin{aligned}
P[z > z_{\alpha/2}] &= P\left[\frac{\hat{p}_1 - \hat{p}_2}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}} > z_{\alpha/2}\right]\\
&= P\left[\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}} > z_{\alpha/2} - \frac{p_1 - p_2}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}}\right]\\
&= P\left[z > z_{\alpha/2} - \frac{p_1 - p_2}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}}\right]\\
&\approx 1 - \Phi\left(z_{\alpha/2} - \frac{p_1 - p_2}{\mathrm{var}(\hat{p}_1 - \hat{p}_2)^{1/2}}\right).
\end{aligned}
\]

This result forms the basis for the derivation of the sample size calculation provided by Fleiss
et al. (1980). Casagrande et al. (1978) derive the appropriate sample size for the case where an
equal number of samples is taken from each population. However, in many cases an unequal
sample size is desirable because of factors such as the difference in cost to collect samples
from each population. Let r define the relationship between the sample sizes drawn from each
population. If n1 is the sample size in the first population and n2 = rn1, with r specified in advance,
then Fleiss et al. (1980) give

\[
n_1 = \frac{\left[z_{\alpha/2}\sqrt{(r + 1)\bar{p}(1 - \bar{p})} + z_{\beta}\sqrt{r p_1(1 - p_1) + p_2(1 - p_2)}\right]^2}{r(p_1 - p_2)^2},
\qquad (1)
\]

with p̄ = (p1 + rp2)/(r + 1). This formula is used to determine the approximate sample
sizes in the Hmisc library for R and the Power procedure in SAS using the Pearson's Chi-squared test
option.
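Eq. (1) is straightforward to implement. The sketch below is in Python rather than the paper's R/S+, the quantile defaults are hard-coded for α = 0.05 (two-sided) and power 0.80, and rounding up to the next integer is an assumed convention, so results can differ from the paper's by one sample:

```python
import math

def fleiss_n1(p1, p2, r=1.0, z_alpha2=1.96, z_beta=0.8416):
    """Uncorrected sample size n1 from Eq. (1); n2 = r * n1.

    z_alpha2 and z_beta default to the standard Normal quantiles for
    a two-sided alpha = 0.05 and power = 0.80.
    """
    p_bar = (p1 + r * p2) / (r + 1)
    num = (z_alpha2 * math.sqrt((r + 1) * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(r * p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(num ** 2 / (r * (p1 - p2) ** 2))
```

For the tuberculosis example in Section 3 (p1 = 0.004, p2 = 0.0004, r = 4), this returns a value within one sample of the paper's n1 = 1246.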

Fleiss et al. (1980) and Ury and Fleiss (1980) add the following continuity correction factor:

\[
n_1^{cc} = \frac{n_1}{4}\left[1 + \sqrt{1 + \frac{2(r + 1)}{r n_1 (p_1 - p_2)}}\right]^2.
\qquad (2)
\]

This formula is used to determine the approximate sample sizes in the EpiInfo package. The
utility of this correction factor has been questioned by Gordon and Watson (1996).

The sample sizes (n1, n2) and (n1cc, n2cc) will be referred to as the uncorrected and continuity
corrected sample sizes, respectively.
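The continuity correction of Eq. (2) scales an already-computed uncorrected n1. A minimal Python sketch (the paper supplies R/S+ code, so the language and the round-up convention here are assumptions):

```python
import math

def continuity_corrected_n1(n1, p1, p2, r=1.0):
    """Continuity corrected sample size n1_cc from Eq. (2).

    Takes the uncorrected n1 from Eq. (1) as input; n2_cc = r * n1_cc.
    """
    ratio = 2 * (r + 1) / (r * n1 * (p1 - p2))
    return math.ceil((n1 / 4.0) * (1 + math.sqrt(1 + ratio)) ** 2)
```

Applied to the tuberculosis example (n1 = 1246, p1 = 0.004, p2 = 0.0004, r = 4), this lands within one sample of the paper's n1cc = 1574.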



3. Performance for low-prevalence populations

In the articles relating to the derivation of Eqs. (1) and (2), the prevalence levels considered
typically ranged from p1 = 0.05 to 0.80, and the differences between the prevalences in the two

populations are relatively small. However, in many animal surveillance applications the

proportion of diseased animals in the two populations can differ by orders of magnitude and at

these low-prevalence levels a small number of infected animals can drastically change the

estimated prevalence. The example used in this section relates to the prevalence of tuberculosis in
an exposed and an unexposed population of wild deer. An initial small sample from the exposed
population suggested an apparent prevalence of tuberculosis of four animals per 1000
(p1 = 0.004). It was determined that samples from both populations should be collected so that at
least a 10-fold difference in the prevalence between the two populations could be detected
(p2 = 0.0004). The relatively small size of the geographical area that was thought to be exposed
limited the total number of samples that could be collected, so the relationship chosen for the

sample sizes was r = 4. Using Eqs. (1) and (2), the estimated sample sizes to achieve a power of
0.80 were (n1 = 1246, n2 = 4984) and (n1cc = 1574, n2cc = 6296), respectively.

Given the large discrepancy between the two sample sizes, a Monte Carlo simulation was
performed to estimate the true power of the test for the different sample sizes. The simulator
draws samples of size (n1, n2) and (n1cc, n2cc) from the appropriate binomial distributions and
calculates the z statistic for each sample. This process is repeated 500,000 times to form a Monte
Carlo approximation of the sampling distribution. Using this process, the achieved power for the
two different sample sizes was 0.858 and 0.911, when using the uncorrected and continuity
corrected sample sizes, respectively. The simulator was then used to determine that a sample size
of only (n1mc = 974, n2mc = 3896) was sufficient to achieve a power of 0.80. This constitutes a
reduction of 1360 and 3000 samples when compared to the sample sizes derived from Eqs. (1)
and (2).

Fig. 1 illustrates the large discrepancy between the nominal and achieved power levels. Clearly,
at these low-prevalence levels, the assumption that the distribution of the z statistic approaches
that of a unit Normal is not appropriate. Extensive simulation suggests that the sample size
estimates derived from Eqs. (1) and (2) consistently overestimate the required sample size.
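The Monte Carlo power check described above can be sketched as follows. This is a Python translation of the procedure (the paper's posted code is in R/S+); it uses far fewer than the paper's 500,000 replicates, a simple stdlib binomial sampler, and treats a zero estimated variance as a failure to reject, all of which are implementation assumptions:

```python
import math
import random

def binom(n, p, rng):
    """Draw one Binomial(n, p) variate (simple stdlib sampler)."""
    return sum(rng.random() < p for _ in range(n))

def mc_power(n1, n2, p1, p2, reps=2000, z_crit=1.96, seed=1):
    """Estimate the power of the two-proportion z test by simulation.

    Draws x1 ~ Binomial(n1, p1) and x2 ~ Binomial(n2, p2), forms the
    z statistic with the unpooled variance estimate, and counts how
    often z exceeds the critical value.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        ph1 = binom(n1, p1, rng) / n1
        ph2 = binom(n2, p2, rng) / n2
        var = ph1 * (1 - ph1) / n1 + ph2 * (1 - ph2) / n2
        if var > 0 and (ph1 - ph2) / math.sqrt(var) > z_crit:
            rejections += 1
    return rejections / reps
```

For the Table 1 design point p1 = 0.1, p2 = 0.01 with n1 = n2 = 89, `mc_power(89, 89, 0.1, 0.01)` should land near the reported 0.80, within Monte Carlo error.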

4. A Monte Carlo approach to sample size determination

At the low-prevalence levels encountered in some surveillance applications, the assumption

that the statistic z follows an underlying Normal distribution in repeated samples of equal size is

not tenable. One option for determining the appropriate sample size is to use a Monte Carlo

approach to ‘‘search’’ for a sample size that achieves the desired level of power. While an

exhaustive search for the appropriate sample size is possible, a more efficient approach takes

advantage of the fact that the power of the test increases monotonically with increasing sample

size (Fig. 1). So rather than perform an extensive search over possible sample sizes, a search can be
employed to efficiently find the appropriate sample size to within a user-specified tolerance. A
binary search, which is a technique for finding a particular value in an ordered list by ruling out half

of the data at each step, is an efficient method. The algorithm for finding the appropriate sample

size is as follows:

(1) Choose a tolerance value that describes the acceptable discrepancy between the nominal and

actual power of the test.



(2) Select an upper and lower bound for the sample size. In low-prevalence applications the

uncorrected sample size (i.e., n1) serves as a reasonable upper bound and n1/3 is an acceptable

choice for the lower bound.

(3) Assess the power at the upper and lower bounds of this interval with the Monte Carlo

simulator.

(4) Use the binary search algorithm to choose new upper and lower bounds of an interval that

contains the desired sample size.

(5) Repeat steps 3 and 4 until the desired tolerance level is obtained.
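The steps above can be sketched in Python (an assumption, since the paper's code is R/S+). This version also simplifies step 1 by searching for the smallest n1 whose simulated power reaches the target; a fixed random seed makes the search deterministic:

```python
import math
import random

def mc_power(n1, p1, p2, r=1, reps=1000, z_crit=1.96, seed=7):
    """Simulated power of the one-sided two-proportion z test with n2 = r * n1."""
    rng = random.Random(seed)
    n2 = r * n1
    hits = 0
    for _ in range(reps):
        ph1 = sum(rng.random() < p1 for _ in range(n1)) / n1
        ph2 = sum(rng.random() < p2 for _ in range(n2)) / n2
        var = ph1 * (1 - ph1) / n1 + ph2 * (1 - ph2) / n2
        if var > 0 and (ph1 - ph2) / math.sqrt(var) > z_crit:
            hits += 1
    return hits / reps

def mc_sample_size(p1, p2, r=1, target=0.80, upper=400, **kw):
    """Binary search (steps 2-5) for the smallest n1 whose simulated power
    reaches the target. In practice the uncorrected n1 from Eq. (1) is the
    suggested choice for 'upper'; the default here is arbitrary."""
    lo, hi = max(upper // 3, 1), upper            # step 2: bounds
    while lo < hi:                                # steps 3-5: bisect
        mid = (lo + hi) // 2
        if mc_power(mid, p1, p2, r=r, **kw) >= target:
            hi = mid                              # power reached: move upper bound down
        else:
            lo = mid + 1                          # power short: move lower bound up
    return lo
```

For p1 = 0.1, p2 = 0.01 and r = 1, `mc_sample_size(0.1, 0.01, upper=200)` converges in a handful of iterations to an n1 in the general vicinity of the 89 reported in Table 1, subject to Monte Carlo noise near the target power.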

The sample sizes derived from the search algorithm above will be denoted by (n1mc, n2mc). R
and S+ code to implement the Monte Carlo sample size calculations is available at http://
www.aphis.usda.gov/vs/nahss/resources.htm#software.

5. Simulations

A series of examples illustrate the potential reduction in sampling effort associated with using

the Monte Carlo approach to sample size determination. The goal is to illustrate the factors and

situations where the use of the Monte Carlo approach is most beneficial. A tolerance of 0.00025,

for the discrepancy between the nominal and achieved power of the test, was chosen for this

study. The three factors were:

- The size of the effect that was to be detected. The values chosen were a 2-, 4-, and 10-fold

difference in the prevalence. These will be referred to as the effect size.


Fig. 1. The achieved power of the test as a function of the sample size in the exposed population (n1). The design

prevalences in the exposed and unexposed populations were p1 = 0.004 and p2 = 0.0004, respectively. The vertical lines show

the power achieved by sample sizes in the exposed population dictated by the Monte Carlo, uncorrected, and continuity

correction-based approaches.



Table 1
Summary statistics for the simulation study

Prevalence,  Effect  Allocation,  Achieved power      Per(uc, mc)  Per(cc, mc)  n1mc     D(uc, mc)  D(cc, mc)
p1           size    r            (mc, uc, cc)
0.1           2      1            79.9, 81.1, 84.1     2.3         10.4            423        20         98
0.01          2      1            80.2, 80.9, 83.9     2.1          9.6          4,578       190        974
0.001         2      1            80.1, 81.0, 84.1     2.5         10.0         45,816     2,482     10,326
0.1           4      1            80.2, 82.5, 88.0     5.5         18.6            306        18         70
0.01          4      1            80.1, 83.7, 88.7     8.8         20.6          1,584       296        812
0.001         4      1            80.0, 83.9, 88.7     9.0         20.8         15,847     3,166      8,322
0.1          10      1            80.3, 85.4, 92.1    11.0         26.4             89        22         64
0.01         10      1            80.1, 86.1, 92.0    14.1         28.5            910       298        724
0.001        10      1            80.1, 86.1, 92.0    14.6         28.8          9,100     3,104      7,356
0.1           2      4            79.8, 80.3, 83.6     0.8          9.5            247        10        130
0.01          2      4            80.0, 80.4, 83.4     0.8          9.1          2,648        80      1,305
0.001         2      4            80.0, 80.3, 83.5     0.6          8.9         26,697       590     12,825
0.1           4      4            80.2, 82.3, 87.7     5.9         20.8             80        25        105
0.01          4      4            80.0, 81.5, 86.8     4.1         18.6            863       170        970
0.001         4      4            80.0, 81.6, 86.8     4.2         18.6          8,633     1,945      9,950
0.1          10      4            79.9, 85.4, 91.0    21.3         38.3             37        50        115
0.01         10      4            80.0, 85.8, 91.0    22.1         38.2            386       555      1,210
0.001        10      4            80.0, 85.8, 91.1    22.3         38.5          3,889     5,520     12,080

The differences in the estimated sample sizes necessary to achieve a test with power of 0.8 in low-prevalence applications
are summarized, along with the achieved power and metrics describing the difference in the estimated number of samples
between the Monte Carlo approach and the two alternatives. Per(uc, mc) and Per(cc, mc) give the percent increase in total
sample size of the uncorrected and continuity corrected methods relative to the Monte Carlo method; D(uc, mc) and
D(cc, mc) give the corresponding differences in the total number of samples; n1mc is the Monte Carlo sample size in the
exposed population.


- The proportion of affected animals in each population. Three different prevalence levels were
considered for the exposed population. These were p1 = 0.1, 0.01, and 0.001. The prevalence levels
in the unexposed population were determined by the effect size.
- The ratio, r, which determines the allocation of the sample size to each subpopulation. The values
r = 1 and 4 were used.

For each combination of these three factors, the study determined the sample size using the
Monte Carlo, uncorrected, and continuity corrected approaches and compared the results using a
series of metrics. The first metric for comparison is the percent reduction in total sample size
resulting from the use of the Monte Carlo sample size, which is

\[
\mathrm{Per}(uc, mc) = 100\,\frac{(n_1 + n_2) - (n_1^{mc} + n_2^{mc})}{n_1 + n_2}
\]

and

\[
\mathrm{Per}(cc, mc) = 100\,\frac{(n_1^{cc} + n_2^{cc}) - (n_1^{mc} + n_2^{mc})}{n_1^{cc} + n_2^{cc}}
\]

for the uncorrected and continuity corrected sample sizes, respectively. The achieved power for
each of the methods is also given. The final metric is the total reduction in sample size in
comparison to the uncorrected and continuity corrected methods (i.e., D(uc, mc) =
(n1 + n2) − (n1mc + n2mc) and D(cc, mc) = (n1cc + n2cc) − (n1mc + n2mc)). If the cost of collecting
and testing each sample is known, this metric represents the total potential reduction
associated with using the Monte Carlo-based sample sizes.
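Applied to the tuberculosis example of Section 3, these metrics can be computed directly. A Python sketch (the helper name is illustrative; the sample sizes are those reported in Section 3):

```python
def per_reduction(total_ref, total_mc):
    """Percent reduction in total sample size relative to a reference method."""
    return 100 * (total_ref - total_mc) / total_ref

# Total sample sizes from the tuberculosis example (Section 3):
uc = 1246 + 4984   # uncorrected, Eq. (1)
cc = 1574 + 6296   # continuity corrected, Eq. (2)
mc = 974 + 3896    # Monte Carlo

d_uc = uc - mc     # D(uc, mc) = 1360 samples saved
d_cc = cc - mc     # D(cc, mc) = 3000 samples saved
```

Here `per_reduction(uc, mc)` is about 21.8% and `per_reduction(cc, mc)` about 38.1%, matching the reduction of 1360 and 3000 samples noted in Section 3.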

6. Results

The results are given in Table 1, where there are a number of clear patterns. The first is that the
achieved power for the non-Monte Carlo methods is always greater than
the nominal value of 80%, with the continuity corrected sample sizes overestimating the required
sample size by a substantial amount. The level of the bias is determined by the effect size, with
bias in the power increasing in accordance with the effect size. The allocation (r) and the
prevalence level had little or no effect on the achieved power for the various sample sizes.

In contrast, both the allocation of the sample (r) and prevalence levels significantly influenced

the difference in the estimated sample size provided by the non-Monte Carlo methods. As the

prevalence decreased, the percentage of excess samples increased from 0.7% to as much as

38.5%. The number of excess samples that these methods estimate ranges from as little as 10 to

nearly 13,000 samples.

7. Conclusions

The results of this study suggest that sample size calculations that rely on the assumption of a

Normal distribution often overestimate the required number of samples to achieve a specified

power. This poor performance is due to the failure of the distributional assumptions when the

prevalence of the disease is low. In contrast, the proposed Monte Carlo approach returns sample

sizes such that the achieved power of the test closely matches the nominal value. The examples

also illustrate that the use of Monte Carlo methods can reduce the overall sample size by

hundreds to thousands of samples while still meeting the study objectives.



References

Adcock, C.J., 1997. Sample size determination: a review. Statistician 46, 261–283.

Alzola, C.F., Harrell, F.E., 2006. An introduction to S and the Hmisc and design libraries. http://biostat.mc.vanderbilt.edu/
twiki/bin/view/Main/Hmisc.

Casagrande, J.T., Pike, M.C., Smith, P.G., 1978. An improved simple approximate formula for calculating sample sizes

for comparing binomial distributions. Biometrics 34, 483–486.

Centers for Disease Control, 2006. EpiInfo. Centers for Disease Control and Prevention (CDC). http://www.cdc.gov/epiinfo.

Fleiss, J.L., Tytun, A., Ury, H.K., 1980. A simple approximation for calculating sample sizes for comparing independent

proportions. Biometrics 36, 343–349.

Gordon, I., Watson, R., 1996. The myth of continuity corrected sample size formulae. Biometrics 52, 71–76.

O’Brien, R.G., 1998. Tour of UnifyPow: a SAS module/macro for sample-size analysis. Proceedings of the Twenty-Third

Annual SAS Users Group International Conference, SAS Institute Inc., Cary, NC, pp. 1346–1355. Software and

updates to this article can be found at http://www.bio.ri.ccf.org/UnifyPow.

StataCorp, 2003. Stata Statistical Software: Release 8.0. Stata Corporation, College Station, TX.

Ury, H.K., Fleiss, J.L., 1980. On approximate sample sizes for comparing two independent proportions with the use of

Yates’ correction. Biometrics 36, 347–351.
