Monte Carlo approaches for determining power and
sample size in low-prevalence applications
Michael S. Williams*, Eric D. Ebel, Bruce A. Wagner
Animal and Plant Health Inspection Service, USDA 2150 B Centre Avenue, Mail Stop 2E6,
Fort Collins, CO 80526, USA
Received 6 November 2006; received in revised form 27 April 2007; accepted 18 May 2007
The prevalence of disease in many populations is often low. For example, the prevalence of tuberculosis,
brucellosis, and bovine spongiform encephalopathy ranges from 1 per 100,000 to less than 1 per 1,000,000 in
many countries. When an outbreak occurs, epidemiological investigations often require comparing the
prevalence in an exposed population with that of an unexposed population. To determine if the level of
disease in the two populations is significantly different, the epidemiologist must choose the test to be used
and the desired power of the test, and determine the appropriate sample size for both the exposed and unexposed
populations. Commonly available software packages provide estimates of the required sample sizes for this
comparison, but these estimates can exceed the necessary number of samples
by more than 35% when the prevalence is low. We provide a Monte Carlo-based solution and show that in
low-prevalence applications this approach can lead to substantial reductions in the total sample size.
Published by Elsevier B.V.
Keywords: Two population; Proportion; Eradication; Animal surveillance
the prevalence of disease in two populations (e.g., Adcock, 1997). In such a study, the null
in many epidemiological investigations, the goal is to determine whether the difference in the
prevalence exceeds a pre-determined threshold. For example, suppose one wishes to determine if
the prevalence in an exposed population is at least five times higher than the prevalence in the
Preventive Veterinary Medicine 82 (2007) 151–158
* Corresponding author. Tel.: +1 970 494 7306; fax: +1 970 494 7174.
E-mail address: firstname.lastname@example.org (M.S. Williams).
0167-5877/$ – see front matter. Published by Elsevier B.V.
true prevalences, then the power of the test, denoted by $1 - \beta$, is the probability that the null
hypothesis is rejected whenever $p_1 \geq 5p_2$ (i.e., $\Pr[\text{reject } H_0 \mid H_1] = 1 - \beta$). A common value for $\beta$ is
0.20, but lower values should be considered when the cost of not detecting the difference is high in
comparison to the cost of collecting the data.
needed to detect a difference between two populations with a specified significance level and
power. Due to the discrete nature of the data and the reliance of these estimators on the asymptotic
behavior of the test statistic, a number of different continuity corrections have been suggested to
the original estimator given by Fleiss et al. (1980). Gordon and Watson (1996) summarize the
results of numerous authors and conclude that continuity correction is rarely beneficial.
Commonly available software packages have implemented a number of different
sample size estimators, with examples being EpiInfo (CDC, 2006), the Hmisc library for R and
S+ (Alzola and Harrell, 2006), the sampsi function in Stata (StataCorp, 2003), and the Power
procedure in SAS (O’Brien, 1998).
range of prevalences considered in these studies is often orders of magnitude larger than
the prevalence levels encountered in many animal surveillance applications, particularly when the
disease has been nearly eradicated from the populations in question. In this study, we consider the
performance of two of these estimators and show that the suggested sample sizes are often very
inaccurate when the prevalence of the disease is low. We propose a Monte Carlo simulator,
with a given power. A simulation study shows that while the Monte Carlo-based solution performs
well, the estimated sample sizes provided by the other two methods can exceed the necessary
number of samples by more than 35% when the prevalence is low. Computer code has been made
available to implement the Monte Carlo-based solution in either R or S+.
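$p_2$, respectively. From each of the populations, a random sample of size $n_1$ and $n_2$ is drawn and $x_1$
and $x_2$ diseased animals are found. Thus, $x_1$ and $x_2$ are such that $X_1 \mid n_1, p_1 \sim \mathrm{Binomial}(n_1, p_1)$,
$X_2 \mid n_2, p_2 \sim \mathrm{Binomial}(n_2, p_2)$, and $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$ are the estimators of $p_1$ and $p_2$.
The statistic used in the test is
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\mathrm{var}(\hat{p}_1 - \hat{p}_2)}},$$
where $\mathrm{var}(\hat{p}_1 - \hat{p}_2) = \mathrm{var}(\hat{p}_1) + \mathrm{var}(\hat{p}_2) - 2\,\mathrm{cov}(\hat{p}_1, \hat{p}_2) = \mathrm{var}(\hat{p}_1) + \mathrm{var}(\hat{p}_2)$ because the samples
in each population are assumed to be independent.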
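Under $H_0$, the prevalence of the disease is the same in each population, so $p_1 = p_2$ and the statistic reduces to
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\mathrm{var}(\hat{p}_1 - \hat{p}_2)}}.$$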
The sampling distribution of this statistic is approximately Normal and the distribution of the
test statistic agrees well with a standard Normal distribution when both the sample size and
proportion of diseased animals are high.
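As an illustration of the test statistic under the null hypothesis, the following is a minimal Python sketch. It assumes the usual pooled estimate of the common prevalence in the variance term, which the text does not spell out; the function name `two_proportion_z` and the handling of the zero-variance case are hypothetical choices made for this example, not part of the original study.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sample z statistic for H0: p1 = p2 (pooled variance assumed)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Under H0 both samples estimate one common prevalence, so pool them.
    p_bar = (x1 + x2) / (n1 + n2)
    var_h0 = p_bar * (1.0 - p_bar) * (1.0 / n1 + 1.0 / n2)
    if var_h0 == 0.0:
        # All animals negative (or all positive): z is undefined.
        # Returning 0.0, i.e. no evidence against H0, is an assumption here.
        return 0.0
    return (p1_hat - p2_hat) / math.sqrt(var_h0)


# Example: 12 of 1000 exposed and 3 of 1000 unexposed animals test positive.
z = two_proportion_z(x1=12, n1=1000, x2=3, n2=1000)
print(f"z = {z:.3f}")
```

At low prevalence the counts `x1` and `x2` take only a handful of small integer values, so `z` itself takes only a few distinct values; this is one way to see why the Normal approximation can be poor in that setting.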
For the typical significance level of $\alpha = 0.05$, the test statistic will fall in the interval $(-1.96, 1.96)$
for roughly 95% of all samples. In other words, if the null hypothesis is true and the sample
size is sufficient for the distribution of $z$ to be approximately Normal, then
$$P[-z_{\alpha/2} < z < z_{\alpha/2}] \approx 0.95.$$
The interpretation of failing to reject $H_0$ can be misleading because even though a test fails to
reject the hypothesis that $p_1 = p_2$, it does not imply that no difference exists. Rather the result can
imply that, for the given sample size, the difference in the two populations was too small to be
detected. It can be misleading to use statistical tests without considering their power, where the
power of a statistical test is the probability that $H_0$ will be rejected when the difference between
the two population parameters is $p_1 - p_2$.
Assume that the specified significance level is $\alpha = 0.05$ and a priori it is known that the
appropriate alternative hypothesis is $H_1$: $p_1 > p_2$. Then the power of the $\alpha$-level test for the null
hypothesis $H_0$: $p_1 = p_2$ is given by
$$P[\text{reject } H_0 \mid H_0 \text{ false}] = P[z > z_{\alpha/2}].$$
The power of the test can be determined for a given $p_1 - p_2$ as follows:
This result forms the basis for the derivation of the sample size calculation provided by Fleiss
et al. (1980). Casagrande et al. (1978) derive the appropriate sample size for the case where an
equal number of samples is taken from each population. However, in many cases an unequal
sample size is desirable because of the factors such as the difference in cost to collect samples
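from each population. Let $r$ define the relationship between the sample sizes drawn from each
population, such that $n_2 = r n_1$; then Fleiss et al. (1980) give
$$P[z > z_{\alpha/2}] \simeq 1 - \Phi\left(\frac{z_{\alpha/2}\sqrt{(r+1)\bar{p}(1-\bar{p})} - (p_1 - p_2)\sqrt{r n_1}}{\sqrt{r p_1(1-p_1) + p_2(1-p_2)}}\right),$$
so that the sample sizes achieving power $1 - \beta$ are
$$n_1 = \frac{\left[z_{\alpha/2}\sqrt{(r+1)\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{r p_1(1-p_1) + p_2(1-p_2)}\right]^2}{r(p_1 - p_2)^2}, \qquad n_2 = r n_1, \tag{1}$$
with $\bar{p} = (p_1 + r p_2)/(r+1)$ and $z_{\beta}$ the upper $\beta$ quantile of the standard Normal distribution. This formula is used to determine the approximate sample
sizes in the Hmisc for R and the Power function in SAS using the Pearson’s Chi-squared test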
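Fleiss et al. (1980) and Ury and Fleiss (1980) add the following continuity correction factor
$$n_1^{cc} = \frac{n_1}{4}\left[1 + \sqrt{1 + \frac{2(r+1)}{n_1 r\,|p_1 - p_2|}}\right]^2, \qquad n_2^{cc} = r n_1^{cc}. \tag{2}$$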
This formula is used to determine the approximate sample sizes in the EpiInfo package. The
utility of this correction factor has been questioned by Gordon and Watson (1996).
The sample sizes $(n_1, n_2)$ and $(n_1^{cc}, n_2^{cc})$ will be referred to as the uncorrected and continuity
corrected sample sizes, respectively.
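The uncorrected and continuity corrected sample sizes can be computed with a few lines of code. The sketch below assumes the standard form of the Fleiss et al. (1980) approximation with allocation $n_2 = r n_1$, a two-sided critical value $z_{\alpha/2}$, and the continuity correction of the form $\frac{n_1}{4}[1 + \sqrt{1 + 2(r+1)/(n_1 r |p_1 - p_2|)}]^2$. The function name `fleiss_sample_sizes` is hypothetical, and results are returned unrounded because the rounding convention used by the various packages is not stated in the text.

```python
import math
from statistics import NormalDist


def fleiss_sample_sizes(p1, p2, r, alpha=0.05, power=0.80):
    """Return (n1, n1_cc): uncorrected and continuity corrected n1; n2 = r * n1."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)               # 0.84 for power = 0.80
    p_bar = (p1 + r * p2) / (r + 1.0)                  # pooled prevalence
    root_h0 = math.sqrt((r + 1.0) * p_bar * (1.0 - p_bar))
    root_h1 = math.sqrt(r * p1 * (1.0 - p1) + p2 * (1.0 - p2))
    # Uncorrected sample size for the first (exposed) population.
    n1 = (z_alpha * root_h0 + z_beta * root_h1) ** 2 / (r * (p1 - p2) ** 2)
    # Continuity corrected sample size.
    inner = 1.0 + 2.0 * (r + 1.0) / (n1 * r * abs(p1 - p2))
    n1_cc = n1 / 4.0 * (1.0 + math.sqrt(inner)) ** 2
    return n1, n1_cc


# Design values from the wild deer example in Section 3.
n1, n1_cc = fleiss_sample_sizes(p1=0.004, p2=0.0004, r=4)
print(round(n1), round(4 * round(n1)), round(n1_cc), round(4 * round(n1_cc)))
```

With these inputs the two values should land close to the $n_1 = 1246$ and $n_1^{cc} = 1574$ reported for the deer example in Section 3, which is a useful check that the assumed form of the formulas matches the one used in the paper.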
3. Performance for low-prevalence populations
In the articles relating to the derivation of Eqs. (1) and (2), the prevalence levels considered
typically ranged from $p_1 = 0.05$ to $0.80$ and the differences between the prevalences in the two
populations are relatively small. However, in many animal surveillance applications the
proportion of diseased animals in the two populations can differ by orders of magnitude and at
these low-prevalence levels a small number of infected animals can drastically change the
an exposed and an unexposed population of wild deer. An initial small sample from the exposed
population suggested an apparent prevalence of tuberculosis of four animals per 1000
($p_1 = 0.004$). It was determined that samples from both populations should be collected so that at
least a 10-fold difference in the prevalence between the two populations could be detected
($p_2 = 0.0004$). The relatively small size of the geographical area that was thought to be exposed
limited the total number of samples that could be collected, so the relationship chosen for the
sample sizes was $r = 4$. Using Eqs. (1) and (2), the estimated sample sizes to achieve a power of
0.80 were $(n_1 = 1246, n_2 = 4984)$ and $(n_1^{cc} = 1574, n_2^{cc} = 6296)$, respectively.
Given the large discrepancy between the two sample sizes, a Monte Carlo simulation was
performed to estimate the true power of the test for the different sample sizes. The simulator
draws samples of size $(n_1, n_2)$ and $(n_1^{cc}, n_2^{cc})$ from the appropriate binomial distributions and
calculates the $z$ statistic for each sample. This process is repeated 500,000 times to form a Monte
Carlo approximation of the sampling distribution. Using this process, the achieved power for the
two different sample sizes was 0.858 and 0.911, when using the uncorrected and continuity
corrected sample sizes, respectively. The simulator was then used to determine that a sample size
of only $(n_1^{mc} = 974, n_2^{mc} = 3896)$ was sufficient to achieve a power of 0.80. This constitutes a
reduction of 1360 and 3000 samples when compared to the sample sizes derived from Eqs. (1)
and (2), respectively. At these low-prevalence levels, the assumption that the distribution of the $z$ statistic approaches
that of a unit Normal is not appropriate. Extensive simulation suggests that the sample size
estimates derived from Eqs. (1) and (2) consistently overestimate the required sample size.
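The simulation described in this section can be sketched in Python as follows. This is not the authors' R or S+ code. It assumes a pooled-variance $z$ statistic, rejection when $z > 1.96$, and that a sample with no diseased animals in either population does not reject $H_0$. The helper names `rbinom`, `z_stat` and `mc_power` are hypothetical, and the binomial sampler uses geometric waiting times between successes, which is convenient when the prevalence is very low. The default of 20,000 replicates is smaller than the 500,000 used in the study to keep the run time short.

```python
import math
import random


def rbinom(n, p, rng):
    """Draw one Binomial(n, p) count by summing geometric waiting times."""
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return n
    log_q = math.log1p(-p)
    successes, position = 0, 0
    while True:
        u = 1.0 - rng.random()                    # uniform on (0, 1], avoids log(0)
        position += int(math.log(u) / log_q) + 1  # index of the next diseased animal
        if position > n:
            return successes
        successes += 1


def z_stat(x1, n1, x2, n2):
    """Pooled two-sample z statistic; 0.0 when the variance estimate is zero."""
    p_bar = (x1 + x2) / (n1 + n2)
    var_h0 = p_bar * (1.0 - p_bar) * (1.0 / n1 + 1.0 / n2)
    if var_h0 == 0.0:
        return 0.0
    return (x1 / n1 - x2 / n2) / math.sqrt(var_h0)


def mc_power(n1, n2, p1, p2, reps=20000, z_crit=1.96, seed=1):
    """Monte Carlo estimate of P[z > z_crit] at the design prevalences."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x1 = rbinom(n1, p1, rng)
        x2 = rbinom(n2, p2, rng)
        if z_stat(x1, n1, x2, n2) > z_crit:
            rejections += 1
    return rejections / reps


for n1, n2 in [(1246, 4984), (1574, 6296), (974, 3896)]:
    print(n1, n2, mc_power(n1, n2, p1=0.004, p2=0.0004))
```

Under these assumptions the three estimates should fall near the values of 0.858, 0.911 and 0.80 quoted above, up to Monte Carlo error of a few thousandths at this number of replicates.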
4. A Monte Carlo approach to sample size determination
At the low-prevalence levels encountered in some surveillance applications, the assumption
that the statistic z follows an underlying Normal distribution in repeated samples of equal size is
not tenable. One option for determining the appropriate sample size is to use a Monte Carlo
approach to ‘‘search’’ for a sample size that achieves the desired level of power. While an
exhaustive search for the appropriate sample size is possible, a more efficient approach takes
advantage of the fact that the power of the test increases monotonically with increasing sample
size (Fig. 1). So rather than perform an extensive search for possible sample sizes, a search can be
employed to efficiently find the appropriate sample size to within a user specified tolerance. A
binary search, which is a technique for finding a particular value in an ordered list by ruling out half
of the data at each step, is an efficient method. The algorithm for finding the appropriate sample
size is as follows:
(1) Choose a tolerance value that describes the acceptable discrepancy between the nominal and
actual power of the test.
(2) Select an upper and lower bound for the sample size. In low-prevalence applications the
choice for the lower bound.
(3) Assess the power at the upper and lower bounds of this interval with the Monte Carlo simulator.
(4) Use the binary search algorithm to choose new upper and lower bounds of an interval that
contains the desired sample size.
(5) Repeat steps 3 and 4 until the desired tolerance level is obtained.
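The five steps can be sketched as a bisection over the sample size in the exposed population. The code below is a minimal illustration and not the authors' implementation: `search_sample_size` is a hypothetical name, and `power_at` stands for any function that returns the (estimated) power for a given $n_1$, for example a Monte Carlo estimate with $n_2 = r n_1$. A smooth stand-in curve, `toy_power`, is used here only so that the example runs quickly and deterministically.

```python
import math


def search_sample_size(power_at, lo, hi, target=0.80, tol=0.00025):
    """Bisection for the sample size whose power meets `target`.

    Assumes power_at(n) is non-decreasing in n on [lo, hi].
    """
    # Step 3: assess the power at both bounds.
    if power_at(lo) >= target:
        return lo
    if power_at(hi) < target:
        raise ValueError("upper bound does not reach the target power")
    # Steps 4 and 5: halve the interval until it cannot shrink further
    # or the power at the midpoint is within the tolerance.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        pw = power_at(mid)
        if abs(pw - target) <= tol:
            return mid
        if pw < target:
            lo = mid   # target lies in the upper half
        else:
            hi = mid   # target lies in the lower half
    return hi          # smallest n whose power is at least the target


def toy_power(n1):
    """Illustration only: a smooth, monotone stand-in for a simulated power curve."""
    return 1.0 - math.exp(-n1 / 500.0)


n1 = search_sample_size(toy_power, lo=1, hi=5000)
print(n1, round(toy_power(n1), 4))
```

When `power_at` is a Monte Carlo estimate, its simulation error can make the estimated curve locally non-monotone. Using enough replicates that the standard error is small relative to the tolerance, or fixing the random seed across evaluations, is one way to keep the bisection stable.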
The sample sizes derived from the search algorithm above will be denoted by $(n_1^{mc}, n_2^{mc})$. R
and S+ code to implement the Monte Carlo sample size calculations is available at http://
A series of examples illustrates the potential reduction in sampling effort associated with using
the Monte Carlo approach to sample size determination. The goal is to illustrate the factors and
situations where the use of the Monte Carlo approach is most beneficial. A tolerance of 0.00025,
for the discrepancy between the nominal and achieved power of the test, was chosen for this
study. The three factors were:
• The size of the effect that was to be detected. The values chosen were a 2-, 4-, and 10-fold
difference in the prevalence. These will be referred to as the effect size.
• The proportion of affected animals in each population. Three different prevalence levels were
considered for the exposed population. These were $p_1 = 0.1$, $0.01$, $0.001$. The prevalence levels
in the unexposed population were determined by the effect size.
• The ratio, $r$, determines the allocation of the sample size to each subpopulation. The values
$r = 1$ and $4$ were used.

Fig. 1. The achieved power of the test as a function of the sample size in the exposed population ($n_1$). The design
prevalences in the exposed and unexposed populations were $p_1 = 0.004$ and $p_2 = 0.0004$, respectively. The vertical lines show
the power achieved by sample sizes in the exposed population dictated by the Monte Carlo, uncorrected, and continuity
corrected approaches.

Table 1
Summary statistics for the simulation study. The differences in the estimated sample sizes necessary to achieve a test
with power of 0.8 in low-prevalence applications are summarized, together with the achieved power and metrics
describing the difference in the estimated number of samples using a Monte Carlo approach and two alternatives.

Achieved power (%)
Monte Carlo    Uncorrected    Continuity corrected
79.9           81.1           84.1
80.2           80.9           83.9
80.1           81.0           84.1
80.2           82.5           88.0
80.1           83.7           88.7
80.0           83.9           88.7
80.3           85.4           92.1
80.1           86.1           92.0
80.1           86.1           92.0
79.8           80.3           83.6
80.0           80.4           83.4
80.0           80.3           83.5
80.2           82.3           87.7
80.0           81.5           86.8
80.0           81.6           86.8
79.9           85.4           91.0
80.0           85.8           91.0
80.0           85.8           91.1
For each combination of these three factors, the study determined the sample size using the
Monte Carlo, uncorrected and continuity corrected approaches and compared the results using a
series of metrics. The first metric for comparison is the percent reduction in total sample size
resulting from the use of the Monte Carlo sample size, which is
$$\mathrm{Per}(uc, mc) = 100\,\frac{(n_1 + n_2) - (n_1^{mc} + n_2^{mc})}{n_1 + n_2}$$
and
$$\mathrm{Per}(cc, mc) = 100\,\frac{(n_1^{cc} + n_2^{cc}) - (n_1^{mc} + n_2^{mc})}{n_1^{cc} + n_2^{cc}}$$
for the uncorrected and continuity corrected techniques, respectively. The achieved power for
each of the methods is also given. The final metric is the total reduction in sample size in
comparison to the uncorrected and continuity corrected methods, i.e.,
$$D(uc, mc) = (n_1 + n_2) - (n_1^{mc} + n_2^{mc}) \quad \text{and} \quad D(cc, mc) = (n_1^{cc} + n_2^{cc}) - (n_1^{mc} + n_2^{mc}).$$
If the cost of collecting and testing each sample is known, this metric represents the total potential reduction
associated with using the Monte Carlo-based sample sizes.
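As a worked example of these metrics, the short Python sketch below applies them to the sample sizes from the wild deer example in Section 3. It assumes that the percent reduction is taken relative to the total sample size of the non-Monte Carlo method; the function names `percent_reduction` and `total_reduction` are hypothetical.

```python
def total_reduction(total_other, total_mc):
    """D: samples saved by using the Monte Carlo sample sizes."""
    return total_other - total_mc


def percent_reduction(total_other, total_mc):
    """Per: samples saved as a percentage of the non-Monte Carlo total (assumed base)."""
    return 100.0 * (total_other - total_mc) / total_other


# Sample sizes (n1, n2) from the wild deer example in Section 3.
uncorrected = (1246, 4984)
corrected = (1574, 6296)
monte_carlo = (974, 3896)

for label, sizes in [("uc", uncorrected), ("cc", corrected)]:
    d = total_reduction(sum(sizes), sum(monte_carlo))
    per = percent_reduction(sum(sizes), sum(monte_carlo))
    print(f"D({label},mc) = {d}, Per({label},mc) = {per:.1f}%")
```

The total reductions reproduce the 1360 and 3000 samples quoted in Section 3, which correspond to roughly 22% and 38% of the uncorrected and continuity corrected totals under the assumed base.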
The results are given in Table 1, where there are a number of clear patterns. The first is that the
achieved power for the non-Monte Carlo methods is always greater than
the nominal value of 80%, with the continuity corrected method overestimating the required
sample size by a substantial amount. The level of the bias is determined by the effect size, with
bias in the power increasing in accordance with the effect size. The allocation ($r$) and the
prevalence level had little or no effect on the achieved power for the various sample sizes.
In contrast, both the allocation of the sample (r) and prevalence levels significantly influenced
the difference in the estimated sample size provided by the non-Monte Carlo methods. As the
prevalence decreased, the percentage of excess samples increased from 0.7% to as much as
38.5%. The number of excess samples that these methods estimate ranges from as little as 10 to
nearly 13,000 samples.
The results of this study suggest that sample size calculations that rely on the assumption of a
Normal distribution often overestimate the required number of samples to achieve a specified
power. This poor performance is due to the failure of the distributional assumptions when the
prevalence of the disease is low. In contrast, the proposed Monte Carlo approach returns sample
sizes such that the achieved power of the test closely matches the nominal value. The examples
also illustrate that the use of Monte Carlo methods can reduce the overall sample size by
hundreds to thousands of samples while still meeting the study objectives.
References
Adcock, C.J., 1997. Sample size determination: a review. Statistician 46, 261–283.
Casagrande, J.T., Pike, M.C., Smith, P.G., 1978. An improved simple approximate formula for calculating sample sizes
for comparing binomial distributions. Biometrics 34, 483–486.
Centers for Disease Control, 2006. EpiInfo Center for Disease Control and Prevention (CDC). http://www.cdc.gov/epiinfo.
Fleiss, J.L., Tytun, A., Ury, H.K., 1980. A simple approximation for calculating sample sizes for comparing independent
proportions. Biometrics 36, 343–349.
Gordon, I., Watson, R., 1996. The myth of continuity corrected sample size formulae. Biometrics 52, 71–76.
O’Brien, R.G., 1998. Tour of UnifyPow: a SAS module/macro for sample-size analysis. Proceedings of the Twenty-Third
Annual SAS Users Group International Conference, SAS Institute Inc., Cary, NC, pp. 1346–1355. Software and
updates to this article can be found at http://www.bio.ri.ccf.org/UnifyPow.
StataCorp, 2003. Stata Statistical Software: Release 8.0. Stata Corporation, College Station, TX.
Ury, H.K., Fleiss, J.L., 1980. On approximate sample sizes for comparing two independent proportions with the use of
Yates’ correction. Biometrics 36, 347–351.