
# The RCSI Sample size handbook

## Abstract

This manual gives tables of sample sizes for a wide variety of study designs, including means, proportions, agreement and reliability. Each design is covered in an independent section. It also has a discussion of sample size in multivariable analysis, and in qualitative research. Other sections include design of pilot studies and animal studies. It shows the code in Stata and R used in the calculations, and includes a list of useful resources on the internet.
The RCSI Sample size handbook
A rough guide
May 2018 version
Ronán M Conroy
rconroy@rcsi.ie
Stata command
. power twomeans 0 (.4(.1)1), power(0.9) graph(ydimension(delta) xdimension(N))
[Figure: Effect size for a two-sample means test. Effect size (δ) from .4 to 1
plotted against total sample size (N), with points at N = 46, 54, 68, 88, 120,
172 and 266. t test assuming σ1 = σ2 = σ; H0: µ2 = µ1 versus Ha: µ2 ≠ µ1.
Parameters: α = .05, 1-β = .9, µ1 = 0, σ = 1.]
Sample Size: introduction
How to use this guide
Introduction : sample size and why it’s important
1. Sample size for percentages or proportions
1.2 Sample sizes for studies comparing a prevalence with a hypothesised value
1.3 Sample sizes for studies comparing proportions between two groups
1.4a Sample sizes for population case-control studies
1.4b Sample sizes for matched case-control studies
1.5 Sample size for logistic regression with a continuous predictor variable
1.6 Sample sizes for logistic or Cox regression with multiple predictors
2. Sample sizes and powers for comparing two means where the variable is measured on a continuous scale that is (more or less) normally distributed
2.1 Comparing the means of two groups
2.2 Sample sizes for comparing means in the same people under two conditions
2.3 Calculating sample sizes for comparing two means: a rule of thumb
3. Sample size for correlations or regressions between two variables measured on a numeric scale
4. Sample size for reliability studies
5. Sample size calculation for agreement between two raters using a present/absent rating scale using Cohen’s Kappa
6. Sample size for pilot studies
7. Sample size for animal experiments in which not enough is known to calculate statistical power
8. Sample size for qualitative research
9. Resources for animal experiments
9. Computer and online resources
How to use this guide
This guide has sample size ready-reckoners for many common research
designs. Each section is self-contained. You need only read the section that
applies to you.
If you are new to sample size calculation, read the first section first.
Examples
There are examples in each section, aimed at helping you to describe your
sample size calculation in a research proposal or ethics committee submission.
They are largely non-specialist. If you have useful examples, I welcome
contributions.
Feedback
to improve it. If you spot an error, please let me know.
Support
This guide has slowly percolated around the internet. I’m pleased to handle
queries from staff and students of RCSI and affiliated institutions. However, I
cannot deal with queries from elsewhere. I’m sorry.
Warranty?
This document is provided as a guide. While every attempt has been made to
ensure its accuracy, neither the author nor the Royal College of Surgeons in
Ireland takes any responsibility for errors contained in it.
What’s new
This version May 2018. Updated text with more useful code in Stata and, to a
lesser extent, R; updated web links.
Introduction : sample size and why it’s important
Sample size is an important issue in research. Ethics committees and funding
agencies are aware that if a research project is too small, it risks failing to
find what it set out to detect. Not only does this waste the input of the study
participants (and frequently, in the case of animal research, their lives) but by
producing a false negative result a study may do a disservice to research by
discouraging further exploration of the area.
And, of course, if a study is too large it will waste resources that could have
been spent on something else.
So the ideal sample size is one that collects sufficient data to have a good
chance of measuring what you set out to measure.
Key issues: representativeness and precision
When choosing a sample, there are two important issues:
will the sample be representative of the population, and
will the sample be precise enough.
The first criterion of a good sample is sample representativeness. An
unrepresentative sample will result in biased conclusions, and the bias cannot
be eliminated by taking a larger sample. For this reason, sampling methodology
is the first thing to get right.
The second criterion is sample precision. The larger the sample, the smaller
the margin of uncertainty (confidence interval) around the results. However,
there is another factor that also affects precision: the variability of the thing
being measured. The more something varies from person to person, the
bigger your sample needs to be to achieve the same degree of certainty about
it.
This guide deals with the issue of sample size. Remember, however, that sample
size is of secondary importance to sample representativeness.
Key terms used in this sample size calculation
Precision – what it is, what determines it
Precision is the amount of potential error in a finding. Low-precision studies
have wide margins of error around their findings, while high-precision studies
have narrow margins of error. The degree of precision is partly determined
by the sample size. In some sample size calculations, you will need to begin
by deciding how much precision you require or, equivalently, the degree of
uncertainty you are prepared to tolerate in your findings.
Precision is also determined by the variability in the thing you are
studying. If something has little variation, such as body temperature, then you
will require a smaller sample than for something that varies quite a lot, like
blood pressure.
You can easily imagine variability when it comes to things measured on a
numeric scale. But what about things that are measured on a simple
dichotomous scale – present/absent, true/false for example? Years ago, a
colleague came up with an excellent example. Imagine a crowd of spectators
where the supporters of one team wore white and supporters of the other team
wore black. If one team had 100% support, the crowd would be all one colour –
no variability. The maximum variability would occur where there was 50%
support for each team. This is exactly what happens with dichotomous
variables. The closer the prevalence is to 50%, the higher the variability. At 0%
and 100% there's no variability at all.
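The crowd example can be put into numbers. The following is a minimal sketch in Python (not the Stata or R used for the tables in this guide), assuming the usual variance formula p × (1 − p) for a yes/no characteristic with prevalence p. The function name is made up for illustration.

```python
def binary_variance(p: float) -> float:
    """Person-to-person variance of a yes/no characteristic with prevalence p.

    Assumes the standard formula p * (1 - p) for a 0/1 variable.
    """
    return p * (1.0 - p)


# Variability is zero at 0% and 100% and peaks at 50%.
for pct in (0, 10, 25, 50, 75, 90, 100):
    p = pct / 100
    print(f"prevalence {pct:3d}%  variance {binary_variance(p):.4f}")
```

The printed variances rise from zero to a maximum of 0.25 at 50% and fall back to zero, which is the pattern described above.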
Prevalence
Prevalence is how frequently a characteristic is found in the population you
are studying. Although we speak of prevalences every time we say something
like “ten percent of people” or “a third of new admissions”, we rarely use the
word prevalence for these fractions or percentages. This guide will use
‘prevalence’ as a general term for proportions, fractions and percentages.
Variability
The more variable is the thing we are studying, the more data we will have to
gather in order to achieve a given level of precision. This makes sense
intuitively when we are measuring something on a numeric scale. But it also
applies to other types of measurement too, even to percentages.
Looking at the tables that show sample sizes for different prevalences, you will
see that the required sample size rises as the prevalence approaches 50%. This
is because when 50% of people have a characteristic and 50% do not, that
characteristic has the highest person-to-person variability. As the prevalence
nears zero or 100%, variability decreases, and so the required sample size will
also decrease.
For continuous variables, the standard deviation is used as a measure of
variability. This is sometimes known, or guessable, from previously published
work, and this guide will tell you how to do this. But even if it is unknown, the
guide will show you how to make an informed guess.
Effect size
Many sample size calculations require you to stipulate an effect size. This is
the smallest effect that is clinically significant (as opposed to statistically
significant). Clinical significance is a health research term that is used to mean
“practical significance” or “real life significance”. The task of deciding on the
smallest effect that would be clinically significant requires knowledge of the
purpose of the research and the current state of knowledge and practice.
For example, if you are planning to compare two treatments, you have to
decide how big a difference between two groups should be before it would be
regarded as clinically important. You might define it as the smallest effect that
would be noticed by the person being treated, or the smallest effect that would
alter the management of the patient, or the smallest effect required to change
the person’s diagnosis.
The whole question of what constitutes a clinically significant finding is outside
the scope of statistics. However, you will see from the tables that I have tried to
help out by translating the rather abstract language of effect size into terms of
patient benefit or differences between people.
What effect size isn’t
It is important to realise what effect size is not. Effect size is not the effect that
you think is there. We tend to have high hopes for our theories, and therefore
hope that the treatment or risk factor we are interested in will have a very
important effect. However, in sample size calculation, effect size is always the
smallest effect that would be clinically significant. Not the one that you hope is
there.
Importantly, too, effect size is not what was published by someone else. Again,
this is an estimate of the actual effect size, but research must have adequate
power to detect the smallest clinically significant effect size. Often the early
publications in a field are biased towards larger effect sizes. This is not just
because of publication bias, but also because methodologies will improve and
things will always work less well when they leave the lab for the real world.
Power
Power is the chance that what you are looking for will be detected in your
sample, if it actually exists. No sample, however big, is a guarantee that you
will detect what you are looking for. However, it is foolish to do research
without a reasonable chance that your study will detect it if it exists. And that
“if it exists” is very important. The power of a study is its chance of
detecting an effect of a given size, if an effect of at least that size exists.
Decades ago, studies were often run with 80% power. That is to say, there was
an 80% chance that they would detect the effect if it existed but was the
smallest clinically significant effect. And, therefore, there was a 20% chance –
one in five – that they would fail to detect it and come to a false-negative
conclusion.
A 20% probability of a false-negative conclusion is now regarded as
unacceptable by ethics committees. Why waste the time (and lives) of research
participants on projects that have a built-in 20% chance of failure? The sample
sizes in this guide assume that you want 90% or even 95% power to detect
what you are looking for.
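To get a feel for what the extra power costs, here is a rough sketch in Python. It assumes the common normal-approximation formulas, in which the required sample size is proportional to the square of (z for the significance level + z for the power), and a two-sided 5% significance level. The function name is invented for illustration, and this is not how the tables in this guide were produced.

```python
from statistics import NormalDist


def power_factor(power: float, alpha: float = 0.05) -> float:
    """Multiplier (z_alpha + z_power)^2 that appears in many
    normal-approximation sample size formulas (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2


# Compare 90% and 95% power with the old 80% convention.
baseline = power_factor(0.80)
for power in (0.80, 0.90, 0.95):
    ratio = power_factor(power) / baseline
    print(f"power {power:.0%}: sample size x {ratio:.2f} relative to 80% power")
```

On these assumptions, moving from 80% to 90% power needs roughly a third more participants, and 95% power roughly two thirds more.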
1. Sample size for percentages or proportions
This section gives guidelines for sample sizes for
studies that measure the proportion or percentage of people who have some
characteristic,
and for studies that compare this proportion with either a known population
or with another group.
The characteristic being measured can be a disease, an opinion, a behaviour:
anything that can be measured as present or absent.
Prevalence
Prevalence is the technical term for the proportion of people who have some
feature. You should note that for a prevalence to be measured accurately, the
study sample should be a valid sample. That is, it should not contain any
significant source of bias.
1.1 Sample size for simple prevalence studies
The sample size needed for a prevalence study depends on how precisely you
want to measure the prevalence. (Precision is the amount of error in a
measurement.) The bigger your sample, the less error you are likely to make in
measuring the prevalence, and therefore the better the chance that the
prevalence you find in your sample will be close to the real prevalence in the
population. You can calculate the margin of uncertainty around the findings of
your study using confidence intervals. A confidence interval gives you a
maximum and minimum plausible estimate for the true value you were trying to
measure.
Step 1: decide on an acceptable margin of error
The larger your sample, the less uncertainty you will have about the true
prevalence. However, you do not necessarily need a tiny margin of uncertainty.
For an exploratory study, for example, a margin of error of ±10% might be
perfectly acceptable. A 10% margin of uncertainty can be achieved with a
sample of just under 100 (96 in table 1.1). However, to get to a 5% margin of
error will require a sample of 384 (four times as large).
Step 2: Is your population finite?
Are you sampling a population which has a defined number of members? Such
populations might include all the physiotherapists in private practice in Ireland,
or all the pharmacies in Ireland. If you have a finite population, the sample size
you need can be significantly smaller.
Step 3: Simply read off your required sample size from table 1.1.
Sample Size: studies measuring a percentage or proportion
Table 1.1 Sample sizes for prevalence studies

| Acceptable margin of error | Size of population: Large | 5000 | 2500 | 1000 | 500 | 200 |
|---|---|---|---|---|---|---|
| ±20% | 24 | 24 | 24 | 23 | 23 | 22 |
| ±15% | 43 | 42 | 42 | 41 | 39 | 35 |
| ±10% | 96 | 94 | 93 | 88 | 81 | 65 |
| ±7.5% | 171 | 165 | 160 | 146 | 127 | 92 |
| ±5% | 384 | 357 | 333 | 278 | 217 | 132 |
| ±3% | 1067 | 880 | 748 | 516 | 341 | 169 |

Example 1: Sample size for a study of the prevalence of burnout in students at a large
university
A researcher is interested in carrying out a prevalence study using simple
random sampling from a population of over 11,000 university students. She
would like to estimate the prevalence to within 5% of its true value.
Since the population is large (more than 5,000) she should use the first column
in the table. A sample size of 384 students will allow the study to determine the
prevalence of burnout with a confidence interval of ±5%. Note that if
she wants to increase precision so that her margin of error is just ±3%, she will
have to sample over 1,000 participants. Sample sizes increase rapidly when
very high precision is needed.
Example 2: Sample size for a study of a finite population
A researcher wants to study the prevalence of bullying in registrars and senior
registrars working in Ireland. There are roughly 500 doctors in her population.
She is willing to accept a margin of uncertainty of ±7.5%.
Here, the population is finite, with roughly 500 registrars and senior registrars,
so the sample size will be smaller than she would need for a study of a large
population. A representative sample of 127 will give the study a margin of error
(confidence interval) of ±7.5% in determining the prevalence of bullying in the
workplace, and 341 will narrow that margin of error to ±3%.
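The figures in table 1.1 can be reproduced with a short calculation. The sketch below is in Python rather than Stata or R, and rests on assumptions that are inferred from the table rather than stated in the text: a 95% confidence interval, a worst-case prevalence of 50%, the normal approximation, the standard finite population correction, and rounding to the nearest whole number. The function and parameter names are hypothetical.

```python
from statistics import NormalDist


def prevalence_sample_size(margin, population=None, prevalence=0.5,
                           confidence=0.95):
    """Sample size to estimate a prevalence to within +/- margin.

    margin and prevalence are proportions (0.05 means 5%).
    population=None means a large (effectively infinite) population.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Normal-approximation sample size for a large population.
    n = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    if population is not None:
        # Finite population correction (assumed form).
        n = n / (1 + (n - 1) / population)
    return round(n)


print(prevalence_sample_size(0.05))                   # Example 1, large population
print(prevalence_sample_size(0.075, population=500))  # Example 2, 500 doctors
print(prevalence_sample_size(0.03, population=500))
```

Under these assumptions the three calls give 384, 127 and 341, the figures used in the two examples.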
1.2 Sample sizes for studies comparing a prevalence with a hypothesised value
This section gives guidelines for sample sizes for studies that measure the
proportion or percentage of people who have some characteristic with the
intention of comparing it with a percentage that is already known from
research or hypothesised.
This characteristic can be a disease, an opinion, a behaviour: anything that
can be measured as present or absent. You may want to demonstrate that the
population you are studying has a higher (or lower) prevalence than some other
population that you already know about. For example, you might want to see if
medical students have a lower prevalence of smoking than other third level
students, whose prevalence is already known from previous work.
Effect size
To begin with, you need to ask what is the smallest difference between
the prevalence in the population you are studying and the prevalence in the
reference population that would be considered meaningful in real life terms?
This difference is often called a clinically significant difference in medicine,
to draw attention to the fact that it is the smallest difference that would be
important enough to have practical implications.
The bigger your study, the greater the chance that you will detect such a
difference. And, of course, the smaller the difference that you consider to be
clinically significant, the bigger the study you need to detect it.
Step 1: Effect size: Decide on the smallest difference the study should be
capable of detecting
You will have to decide what is the smallest difference between the group that
you are studying and the general population that would constitute a 'clinically
significant difference' – that is, a difference that would have real-life
implications. If you found a difference of 5%, would that have real-life
implications? If not, would 10%? There is a certain amount of guesswork
involved, and you might do well to see what the norm is in the literature.
For instance, if you were studying burnout in medical students and discovered
that the rate was 5% higher than the rate for the general student population,
would that have important real-life implications? How about if it was 10%
lower? 10% higher? At what point would we decide that burnout in medical
students was a problem that needed to be tackled?
Step 2: Prevalence: How common is the feature that you are studying in
the population?
Sample sizes are bigger when the feature has a prevalence of 50% in the
population. As the prevalence in the population group goes towards 0% or
100%, the sample size requirement falls. If you do not know how common the
feature is, you should use the sample size for a 50% prevalence as being the
worst-case estimate. The required sample size will be no larger than this, no
matter what the prevalence turns out to be.
Step 3: Power: what power do you want to detect a difference between the study
group and the population?
A study with 90% power is 90% likely to discover the difference between
the groups if such a difference exists. And 95% power increases this likelihood
to 95%. So if a study with 95% power fails to detect a difference, the difference
is unlikely to exist. You should aim for 95% power, and certainly accept nothing
less than 90% power. Why run a study that has more than a 10% chance of
failing to detect the very thing it is looking for?
Step 4: Use table 1.2 to get an idea of sample size
Table 1.2 Comparing a sample with a known population
The table gives sample sizes for 90% and 95% power in three situations: when
the population prevalence is 50%, 25% and 10%.
If in doubt about the prevalence, err on the high side.
*Sample Stata code for this column
. power oneproportion .1 (.15(.05).4), test(wald) power(.95)
The bit that says (.15(.05).4) is a neat way of passing Stata a list of values.
This one says “start at .15, increment by .05 and finish at .4”.
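
| Difference between prevalences | Population prevalence 50%: power 90% | Population prevalence 50%: power 95% | Population prevalence 25%: power 90% | Population prevalence 25%: power 95% | Population prevalence 10%: power 90% | Population prevalence 10%: power 95%* |
|---|---|---|---|---|---|---|
| +5% | 1041 | 1287 | 883 | 1092 | 536 | 663 |
| +10% | 253 | 312 | 240 | 296 | 169 | 208 |
| +15% | 107 | 132 | 113 | 139 | 88 | 109 |
| +20% | 56 | 69 | 66 | 81 | 56 | 69 |
| +25% | 32 | 39 | 43 | 52 | 39 | 48 |
| +30% | 19 | 24 | 29 | 36 | 29 | 35 |
| -5% | 1041 | 1287 | 673 | 832 | 13 | 16 |
| -10% | 253 | 312 | 134 | 166 | | |
| -15% | 107 | 132 | 43 | 52 | | |
| -20% | 56 | 69 | 13 | 16 | | |
| -25% | 32 | 39 | | | | |
| -30% | 19 | 24 | | | | |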
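For readers without Stata, the calculation behind the Wald-test command can be sketched in Python. The formula below is a standard normal-approximation formula that uses the variance at the study prevalence and rounds up; it is an assumption on my part that this is what Stata does with test(wald), although it does agree with the entries checked in table 1.2. The function name and arguments are invented for illustration.

```python
from math import ceil
from statistics import NormalDist


def one_proportion_sample_size(p_known, p_study, power=0.90, alpha=0.05):
    """Sample size to compare a study prevalence with a known value.

    Wald-type normal approximation, two-sided test, variance taken at
    the study (alternative) prevalence. Rounds up to a whole participant.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_study * (1 - p_study)
    n = (z_alpha + z_power) ** 2 * variance / (p_study - p_known) ** 2
    return ceil(n)


# Known prevalence 10%, study prevalence 15% to 40%, 95% power
# (the column marked * in table 1.2).
for difference in (5, 10, 15, 20, 25, 30):
    p_study = 0.10 + difference / 100
    n = one_proportion_sample_size(0.10, p_study, power=0.95)
    print(f"+{difference}%: {n}")
```

Changing the first argument to 0.25 or 0.50 and the power to 0.90 gives the other columns for increases in prevalence.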
1.3 Sample sizes for studies comparing proportions between two groups
This section gives guidelines for sample sizes for studies that measure the
proportion or percentage of people who have some characteristic with the
intention of comparing two groups sampled separately, or two subgroups within
the same sample.
This is a common study design in which two groups are compared. In some
cases, the two groups will be obtained by taking samples from two populations.
However, in many cases the two groups may actually be subgroups of the same
sample. If you plan on comparing two groups within the same sample, the
sample size will have to be increased. Instructions for doing this are at the end
of the section.
Step 1: Effect size: Decide on the difference the study should be capable
of detecting
You will have to decide what is the smallest difference between the two groups
that you are studying that would constitute a 'clinically significant difference' –
that is, a difference that would have real-life implications. If you found a
difference of 5%, would that have real-life implications? If not, would 10%?
There is a certain amount of guesswork involved, and you might do well to see
what the norm is in the literature.
Step 2: Prevalence: How common is the feature that you are studying in
the comparison group?
Sample sizes are bigger when the feature has a prevalence of 50% in one of the
groups. As the prevalence in one group goes towards 0% or 100%, the sample
size requirement falls. If you do not know how common the feature is, you
should use the sample size for a 50% prevalence as being the worst-case
estimate. The required sample size will be no larger than this no matter what
the prevalence turns out to be.
Step 3: Power: what power do you want to detect a difference between
the two groups?
A study with 90% power is 90% likely to discover the difference between
the groups if such a difference exists. And 95% power increases this likelihood
to 95%. So if a study with 95% power fails to detect a difference, the difference
is unlikely to exist. You should aim for 95% power, and certainly accept nothing
less than 90% power. Why run a study that has more than a 10% chance of
failing to detect the very thing it is looking for?
Sample Size: comparing proportions between groups
Step 4: Use table 1.3 to get an idea of sample size
The table gives sample sizes for 90% and 95% power in three situations: when
the prevalence in the comparison group is 50%, 25% and 10%. If in doubt, err
on the high side. The table shows the number in each group, so the total
number is twice the figure in the table!
Table 1.3 Numbers needed in each group

| Difference between the groups | Prevalence in one group 50%: power 90%* | Prevalence in one group 50%: power 95% | Prevalence in one group 25%: power 90% | Prevalence in one group 25%: power 95% | Prevalence in one group 10%: power 90% | Prevalence in one group 10%: power 95% |
|---|---|---|---|---|---|---|
| 5% | 2095 | 2590 | 1674 | 2070 | 918 | 1135 |
| 10% | 519 | 641 | 440 | 543 | 266 | 329 |
| 15% | 227 | 280 | 203 | 251 | 133 | 164 |
| 20% | 124 | 153 | 118 | 145 | 82 | 101 |
| 25% | 77 | 95 | 77 | 95 | 57 | 70 |
| 30% | 52 | 63 | 54 | 67 | 42 | 52 |

*Sample Stata command that generated the figures in this column
. power twoproportions .5 (.45(-.05).2), power(.9)
The notation .5 (.45(-.05).2) is a way of telling Stata to generate a list of
values starting with 0·5, decreasing in units of 0·05 and ending with 0·2.
Example: Study investigating the effect of a support programme on smoking quit rates
The investigator is planning a study of the effect of a telephone support line in
improving smoking quit rates in patients post-stroke. She knows that about
25% of smokers will have quit at the end of the first year after discharge. She
feels that the support line would make a clinically important contribution to
management if it improved this to 35%. The programme would not be
justifiable from the cost point of view if the improvement were smaller than this.
So a 10% increase is the smallest effect that would be clinically significant.
From the table she can see that two groups of 440 patients would be needed to
have a 90% power of detecting a difference of at least 10%, and two groups of
543 patients would be needed for 95% power. She writes in her ethics
submission:
Previous studies in the area suggest that as few as 25% of smokers are still not
smoking a year after discharge. The proposed sample size of 500 patients in
each group (intervention and control) will give the study a power to detect a
10% increase in smoking cessation rate that is between 90% and 95%.
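The figures used in this example can be checked without Stata. The sketch below, in Python, uses the textbook normal-approximation formula for comparing two proportions with a chi-squared test (pooled variance under the null hypothesis, separate variances under the alternative), rounding up. That this is the formula behind the Stata command is an assumption, though it agrees with the table entries checked here; the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportions_sample_size(p1, p2, power=0.90, alpha=0.05):
    """Number needed in EACH of two equal groups to detect p1 versus p2.

    Normal approximation to the two-sided chi-squared test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    sd_null = sqrt(2 * p_bar * (1 - p_bar))           # pooled, under H0
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))      # under the alternative
    n = (z_alpha * sd_null + z_power * sd_alt) ** 2 / (p1 - p2) ** 2
    return ceil(n)


# Quit rate improving from 25% to 35%.
print(two_proportions_sample_size(0.25, 0.35, power=0.90))
print(two_proportions_sample_size(0.25, 0.35, power=0.95))
```

These two calls return 440 and 543 per group, matching the figures the investigator read from the table.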
Example: Study comparing risk of hypertension in women who continue to work and
those who stop working during a first pregnancy.
Women in their first pregnancy have roughly a 10% risk of developing
hypertension. The investigator wishes to compare risk in women who stop
working and women who continue. She decides to give the study sufficient
power to have a 90% chance of detecting a doubling of risk associated with
continued working. The sample size, from the table, is two groups of 266
women. She decides to increase this to 300 in each group to account for
drop-outs. She writes in her ethics submission:
Women in their first pregnancy have roughly a 10% risk of developing
hypertension. We propose to recruit 300 women in each group (work cessation
and working). The proposed sample size has a 90% power to detect a twofold
increase in risk, from 10% to 20%.
Comparing subgroups within the same sample
This often happens when the two groups being compared are subgroups of a
larger sample. For example, you might be comparing men and women coronary
patients, knowing that two thirds of patients are men.
A detailed answer is beyond the scope of a ready-reckoner table, because the
final sample size will depend on the relative sizes of the groups being
compared. Roughly, if one group is twice as big as the other, the total sample
size will be 20% higher; if one is three times as big as the other, 30% higher. In
the case of the coronary patients, if two thirds of patients are men, one group
will be twice the size of the other. In this case, you would calculate a total
sample size based on the table and then increase it by 20%.
Stata code
Suppose you are comparing two groups from the same sample. You are
expecting the two groups to make up 20% and 80% of the sample. In this case, the
ratio of the two groups is 80:20 which is 4:1. The Stata code for 90% power
that gives the first column in the table above now reads
. power twoproportions .5 (.45(-.05).2), test(chi2) power(0.9) nratio(4)
You can see that you simply have to specify nratio() to get the appropriate
calculation.
What is 90% or 95% power?
Just because a difference really exists in the population you are studying does
not mean it will appear in every sample you take. Your sample may not show
the difference, even though it is there. To be ethical and value for money, a
research study should have a reasonable chance of detecting the smallest
difference that would be of clinical significance (if this difference actually
exists, of course). If you do a study and fail to find a difference, even though it
exists, you may discourage further research, or delay the discovery of
something useful. For this reason, your study should have a reasonable chance
of finding a difference, if such a difference exists.
A study with 90% power is 90% likely to discover the smallest clinically
significant difference between the groups if such a difference exists. And 95%
power increases this likelihood to 95%. So if a study with 95% power fails to
detect a difference, the difference is unlikely to exist. You should aim for 95%
power, and certainly accept nothing less than 90% power. Why run a study that
has more than a 10% chance of failing to detect the very thing it is looking for?
What if I can only study a certain number of people?
You can use the table to get a rough idea of the sort of difference your study
might be able to detect. Look up the number of people you have available.
Reference
These calculations were carried out using the Stata release 13 power command.
1.4a Sample sizes for population case-control studies
This section gives guidelines for sample sizes for studies that measure the effect
of a risk factor by comparing a sample of people with the disease with a control
sample of disease-free individuals drawn from the same population. The
effect of the risk factor is measured using the odds ratio.
Population case-control studies have the disadvantage that the controls and
cases may differ on variables that will have an effect on disease risk
(confounding variables), so a multivariable analysis will have to be carried out
to adjust for these variables. The sample sizes shown here are inflated by 25%
to allow for the loss of statistical power that will typically result from adjusting
for confounding variables.
If you are controlling for confounding variables by carrying out a matched
case-control study, see section 1.4b.
A case-control study looks for risk factors for a disease or disorder by
recruiting two groups of participants: cases of the disease or disorder, and
controls, who are drawn from the same population as the cases but who did not
develop the disease.
Case-control studies are observational studies. In experimental studies, we
can hold conditions constant so that the only difference between the two
groups we are comparing is that one group was exposed to the risk factor and
the other was not. In observational studies, however, there can be other
differences between those exposed to the risk factor and those not exposed. For
example, if you are looking at the relationship between diarrhoeal disease in
children and household water supply, households with high quality water will
differ in other ways from households with low quality water. They are more
likely to be higher social class, wealthier, and more likely to have better
sanitation. These factors, which are associated with both the disease and the
risk factor, are called confounding factors.
Understanding confounding factors is important in designing and analysing
case-control studies. Confounding factors can distort the apparent relationship
between a risk factor and a disease, so their effects have to be adjusted for
statistically during the analysis. In the diarrhoeal disease example, you might
need to adjust your estimate of the effect of good water quality in the
household for the association between good water quality and presence of a
toilet. Any case-control study must identify and measure potential confounding
factors.
Sample size and adjustment for confounding factors
Allowing for confounding factors in the analysis of case-control studies
increases the required sample size, because the statistical adjustment will
increase the margin of uncertainty around the estimate of the risk factor's odds
ratio. If you don't understand the last bit, don't worry. The important thing is
that you have to gather extra data in a case control study to allow you sufficient
statistical power to adjust for confounding variables. How much extra data
depends on how strongly the confounding factor is associated with the risk
factor and the disease. Cousens and colleagues (see references) recommend
increasing the sample size by 25%, based on simulation studies. The sample
sizes in the tables in this section are inflated by 25% in line with this
recommendation.
Step 1: Prevalence: What is the probable prevalence of the risk factor in
the population?
The prevalence of the risk factor will affect your ability to detect its effect. For
example, if most of the population is exposed to the risk factor, it will be
common in your control group, making it hard to detect its effect. If you are unsure
about the prevalence of the risk factor in the population, err on the extreme
side – that is, if it is rare, use the lowest estimate you have as the basis for
calculations, and if it is common use the highest estimate.
Step 2: Effect Size: What is the smallest odds ratio that would be regarded
as clinically significant?
The odds ratio expresses the impact of the factor on the risk of the disease or
disorder. Usually we are only interested in risk factors that have a sizeable
impact on risk – an odds ratio of 2, for example – but if you are studying a
common, serious condition you might be interested in detecting an odds ratio
as low as 1.5, because even a 50% increase in risk of something common or
serious will be important at the public health level.
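To get a feel for what a given odds ratio means, it can help to translate it into the exposure prevalence you would expect among the cases. The following is a minimal Python sketch (the handbook's own calculations were done in Stata; the function name here is made up for illustration). It assumes the prevalence in the controls is the same as the population prevalence you estimated in Step 1.

```python
def exposure_prevalence_in_cases(control_prevalence: float, odds_ratio: float) -> float:
    """Exposure prevalence among cases implied by an odds ratio.

    Assumes control_prevalence is the prevalence of the risk factor
    among controls (taken here to equal the population prevalence).
    """
    odds_controls = control_prevalence / (1 - control_prevalence)
    odds_cases = odds_ratio * odds_controls
    return odds_cases / (1 + odds_cases)


# With a risk factor present in 20% of controls:
for odds_ratio in (1.5, 2, 3):
    p_cases = exposure_prevalence_in_cases(0.20, odds_ratio)
    print(f"odds ratio {odds_ratio}: {p_cases:.1%} of cases exposed")
```

The smaller the odds ratio, the closer the two prevalences are, which is why the sample sizes in the table climb so steeply as the odds ratio falls towards 1.5.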
Step 3: Power: What statistical power do you want?
With 90% power, you have a 90% chance of being able to detect a clinically
significant odds ratio, if it exists. That still leaves a 10% chance of doing the
study and failing to detect it. With 95% power, you have only a 5% chance of
failing to detect a clinically significant odds ratio, if it exists.
Step 4: Look up the number of cases from table 1.4a
Table 1.4a Number of cases required for a case control study
Note 1: This assumes a study that recruits an equal number of controls.
Note 2: The table has an allowance of 25% extra participants to adjust for
confounding.
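Columns give the smallest odds ratio (OR) that would be clinically significant.

90% power to detect the odds ratio

| Prevalence of the risk factor | OR 1.5 | OR 2 | OR 2.5 | OR 3 | OR 4 | OR 5 |
|---|---|---|---|---|---|---|
| 10% | 1581 | 493 | 264 | 175 | 103 | 73 |
| 20% | 929 | 300 | 165 | 113 | 69 | 50 |
| 30% | 739 | 246 | 140 | 98 | 61 | 46 |
| 40% | 674 | 231 | 134 | 95 | 63 | 49 |
| 50% | 674 | 239 | 141 | 103 | 69 | 55 |
| 60% | 730 | 265 | 161 | 118 | 81 | 65 |
| 70% | 869 | 324 | 200 | 149 | 105 | 85 |
| 80% | 1184 | 453 | 284 | 215 | 154 | 128 |
| 90% | 2186 | 855 | 546 | 416 | 304 | 254 |

95% power to detect the odds ratio

| Prevalence of the risk factor | OR 1.5 | OR 2 | OR 2.5 | OR 3 | OR 4 | OR 5 |
|---|---|---|---|---|---|---|
| 10% | 1988 | 619 | 331 | 220 | 129 | 91 |
| 20% | 1168 | 376 | 208 | 141 | 86 | 64 |
| 30% | 929 | 309 | 175 | 121 | 78 | 59 |
| 40% | 848 | 291 | 169 | 120 | 79 | 61 |
| 50% | 848 | 300 | 178 | 129 | 86 | 69 |
| 60% | 919 | 334 | 203 | 149 | 103 | 83 |
| 70% | 1091 | 408 | 251 | 188 | 131 | 108 |
| 80% | 1489 | 569 | 358 | 270 | 194 | 160 |
| 90% | 2749 | 1075 | 686 | 524 | 383 | 320 |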
Example: An obstetrician is interested in the relationship between manual
work during pregnancy and risk of pre-eclampsia. She does some preliminary
research and finds that about 20% of her patients do manual work during their
pregnancy. She is interested in being able to detect an odds ratio of 3 or more
associated with manual work. Since pre-eclampsia is comparatively rare, she
plans to recruit three controls for each case.
From table 1.4a, she needs 113 patients with pre-eclampsia for 90% power.
Recruiting three controls per case, she can reduce this by a third (multiplying
by 0.67, from table 1.4a1), giving 113 x 0.67 = 75.7 cases (76 in round figures).
However, she will have to recruit three controls per case, giving 228 controls
(76 x 3). Although this makes the study larger overall (304 participants rather
than the 226 needed with a 1:1 case-control ratio), it will be quicker to carry
out, because recruiting the cases will be the slowest part of the study.

Table 1.4a1 Effect of multiple controls per case on sample size

| Number of controls per case | Multiply the number of cases by |
|---|---|
| 2 | 0.75 |
| 3 | 0.67 |
| 4 | 0.63 |
| 5 | 0.60 |

Reference
The calculations in this section were carried out with Stata, using formulas in
Cousens SN, Feachem RG, Kirkwood B, Mertens TE and Smith PG. Case-control
studies of childhood diarrhoea: II Sample size. World Health Organization. CDD/
EDP/88.3 Undated.
case-control-studies-childhood-diarrhoea
1.4b Sample sizes for matched case-control studies
This section gives sample sizes for studies that compare cases of a disease or
disorder with matched controls drawn from the same population.
Introduction
Case-control studies are widely used to establish the strength of the
relationship between a risk factor and a health outcome. Case-control studies
are observational studies. In experimental studies, we can hold conditions
constant so that the only difference between the two groups we are comparing
is that one was exposed to the risk factor and one was not. In observational
studies, however, there can be other differences between those exposed to the
risk factor and those not exposed. For example, if you are looking at the effect
of diet on mild cognitive impairment, you would be aware that the main risk
factor for cognitive impairment is age. Diet also varies with age. Age, then, is a
factor which is associated with both the disease and the risk factor. Such
factors are called confounding factors. Confounding factors can distort the
true relationship between a risk factor and a disease unless we take them into
account in the design or the analysis of our study.
We can deal with the presence of confounding variables in the design of our
study by matching the cases and controls on key confounders. In matched case-
control designs, healthy controls are matched to cases using one or more
variables. In practice, the most efficient matching strategy is to match
on at most two variables. Matching on many variables makes it very difficult
to locate and recruit controls. And although matching on many variables is
intuitively attractive, it doesn’t actually increase statistical efficiency – in fact,
matching on more than three variables actually reduces the power of your
study to detect risk factor relationships. Bland and Altman recommend that “in a
large study with many variables it is easier to take an unmatched control group and
adjust in the analysis for the variables on which we would have matched, using
ordinary regression methods. Matching is particularly useful in small studies,
where we might not have sufficient subjects to adjust for several variables at
once.” (Bland & Altman, 1994).
Matching cases and controls will produce a correlation between the probability
of exposure within each case-control pair. This increases the statistical power
of the study. The sample size will depend on the degree of correlation between
the cases and controls. This is rarely possible to estimate, so these calculations
are based on a case-control correlation of phi=0·2. This is the recommended
action where the correlation is unknown (Dupont 1980).
It is important to note that when you analyse a matched case-control study, you
must incorporate the matching into the analysis using procedures like
conditional logistic regression. Analysing it as an unmatched case-control study
Table 1.4b Number of cases required for a matched case control study

Columns give the smallest odds ratio (OR) that would be clinically significant.

90% power to detect the odds ratio

| Prevalence of the risk factor | OR 1.5 | OR 2 | OR 2.5 | OR 3 | OR 4 | OR 5 |
|---|---|---|---|---|---|---|
| 10% | 1501 | 454 | 236 | 152 | 86 | 59 |
| 20% | 885 | 279 | 150 | 100 | 59 | 43 |
| 30% | 705 | 230 | 128 | 87 | 54 | 40 |
| 40% | 644 | 217 | 124 | 86 | 55 | 41 |
| 50% | 644 | 223 | 130 | 92 | 59 | 45 |
| 60% | 697 | 248 | 147 | 105 | 69 | 54 |
| 70% | 827 | 301 | 181 | 131 | 88 | 68 |
| 80% | 1126 | 418 | 254 | 186 | 126 | 99 |
| 90% | 2072 | 784 | 482 | 355 | 243 | 192 |

95% power to detect the odds ratio

| Prevalence of the risk factor | OR 1.5 | OR 2 | OR 2.5 | OR 3 | OR 4 | OR 5 |
|---|---|---|---|---|---|---|
| 10% | 1851 | 557 | 289 | 185 | 103 | 70 |
| 20% | 1091 | 342 | 184 | 122 | 71 | 51 |
| 30% | 869 | 283 | 156 | 106 | 65 | 47 |
| 40% | 794 | 266 | 151 | 104 | 66 | 49 |
| 50% | 794 | 274 | 158 | 111 | 71 | 54 |
| 60% | 860 | 304 | 179 | 127 | 83 | 64 |
| 70% | 1020 | 369 | 221 | 159 | 105 | 81 |
| 80% | 1389 | 513 | 311 | 226 | 151 | 118 |
| 90% | 2557 | 963 | 589 | 432 | 292 | 229 |

Multiple controls per case
Where there are multiple controls per case, you can get greater statistical
power. If you don’t have enough cases, you could consider this strategy.
Recruiting two controls per case will reduce your case sample size by roughly
25% for the same statistical power, and recruiting three controls per case will
reduce it by roughly a third. However, the total size of your study will increase
because of the extra controls.

Table 1.4b1 Effect of multiple controls per case on sample size

| Number of controls per case | Multiply the number of cases by |
|---|---|
| 2 | 0.75 |
| 3 | 0.67 |
| 4 | 0.63 |
| 5 | 0.60 |
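The multipliers for multiple controls per case can be applied in a couple of lines. The sketch below is in Python and is only an illustration: the function names are made up, and the expression (c + 1) / (2c) is an assumption on my part. It reproduces the tabulated multipliers (0.75, 0.667, 0.625, 0.60) before rounding, but the handbook does not say how its values were derived.

```python
import math


def case_multiplier(controls_per_case: int) -> float:
    """Factor applied to the 1:1 number of cases with c controls per case.

    Assumption: (c + 1) / (2c), which matches the tabulated multipliers.
    """
    c = controls_per_case
    return (c + 1) / (2 * c)


def cases_and_controls(cases_at_one_to_one: int, controls_per_case: int) -> tuple[int, int]:
    """Round the number of cases up, then recruit c controls for each case."""
    cases = math.ceil(cases_at_one_to_one * case_multiplier(controls_per_case))
    return cases, cases * controls_per_case


# Matched design, 20% prevalence, odds ratio 3, 90% power: 100 cases at 1:1.
for c in (2, 3, 4, 5):
    cases, controls = cases_and_controls(100, c)
    print(f"{c} controls per case: multiplier {case_multiplier(c):.3f}, "
          f"{cases} cases, {controls} controls, {cases + controls} in total")
```

Notice that the number of cases falls more and more slowly as controls are added, while the total keeps growing, so there is little to gain beyond four or five controls per case.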
1.5 Sample size for logistic regression with a
continuous predictor variable
This section gives guidelines for sample sizes for studies that measure the effect
of a continuous predictor (for example, body mass index) on the risk of an
endpoint (for example, ankle injury). The data may come from a cross-sectional,
case-control or cohort study.
Introduction
Logistic regression allows you to calculate the effect that a predictor variable
has on the occurrence of an outcome. It can be used with cross-sectional data,
case-control data and longitudinal (cohort) data. The effect of the predictor
variable is measured by the odds ratio. A researcher may be interested, for
example, in the effect that body weight has on the probability of a patient not
having a complete clinical response to a standard 70mg dose of enteric aspirin,
or the effect that depression scores have on the probability that the patient will
Step 1: Variability: Estimate the mean and standard deviation of the
predictor variable
You will probably be able to estimate the mean value quite easily. If you cannot
find an estimate for the standard deviation, you can use the rule of thumb
that the typical range of the variable is four standard deviations. By asking
yourself what an unusually low and an unusually high value would be, you can
work out the typical range. Dividing by four gives a rough standard deviation.
For example, adult weight averages at about 70 kilos, and weights under 50 or
over 100 would be unusual, so the ‘typical range’ is about 50 kilos. This gives
us a ‘guesstimate’ standard deviation of 12.5 kilos (50÷4).
Step 2: Baseline: What is the probability of the outcome at the average
value of the predictor?
A good rule of thumb is that the probability of the outcome at the average value
of the predictor is the same as the probability of the outcome in the whole
sample. So if about 20% of patients have poor adherence to prescribed
treatment, this will do as an estimate of the probability of poor adherence at
the average value of the predictor.
Step 3: Effect size: what is the smallest increase in the probability of the
outcome associated with an increase of one standard deviation of the
predictor that would be clinically significant?
Clinical significance, or real-life significance, means that an effect is important
enough to have real-life consequences. In the case of treatment failure with
aspirin, if the probability of treatment failure increased from 10% at the
| Prevalence at mean value | Prevalence 1 SD higher | Odds ratio | N for 90% power |
|---|---|---|---|
| 5% | 10% | 2.1 | 333 |
| 10% | 15% | 1.6 | 484 |
| 10% | 20% | 2.3 | 172 |
| 20% | 25% | 1.3 | 734 |
| 20% | 30% | 1.7 | 220 |
| 20% | 40% | 2.7 | 98 |
| 20% | 50% | 4.0 | 143 |
| 25% | 30% | 1.3 | 825 |
| 25% | 35% | 1.6 | 238 |
| 25% | 40% | 2.0 | 128 |
| 25% | 50% | 3.0 | 93 |
| 30% | 35% | 1.3 | 889 |
| 30% | 40% | 1.6 | 249 |
| 30% | 50% | 2.3 | 93 |
| 30% | 60% | 3.5 | 106 |
| 40% | 45% | 1.2 | 933 |
| 40% | 50% | 1.5 | 250 |
| 40% | 60% | 2.3 | 87 |
| 40% | 80% | 6.0 | 499 |
| 50% | 55% | 1.2 | 865 |
| 50% | 60% | 1.5 | 225 |
| 50% | 75% | 3.0 | 81 |
| 50% | 80% | 4.0 | 133 |
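The odds ratio column in the table follows directly from the two prevalences. If your own pair of prevalences is not listed, a minimal Python sketch like the one below (function names made up for illustration) gives the odds ratio for a one standard deviation increase in the predictor, so that you can find the nearest row.

```python
def odds(probability: float) -> float:
    """Convert a probability to odds."""
    return probability / (1 - probability)


def odds_ratio_per_sd(prevalence_at_mean: float, prevalence_one_sd_higher: float) -> float:
    """Odds ratio for a one standard deviation increase in the predictor."""
    return odds(prevalence_one_sd_higher) / odds(prevalence_at_mean)


# A few rows of the table: (prevalence at mean, prevalence 1 SD higher)
for p_mean, p_higher in [(0.05, 0.10), (0.10, 0.15), (0.20, 0.50), (0.40, 0.80)]:
    print(f"{p_mean:.0%} -> {p_higher:.0%}: "
          f"odds ratio {odds_ratio_per_sd(p_mean, p_higher):.1f}")
```

Note that the same five-point rise in prevalence corresponds to very different odds ratios depending on where you start, which is why the table is indexed by both prevalences.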
1.6 Sample sizes for logistic or Cox regression with
multiple predictors
This section reviews guidelines on the number of cases required for studies in
which logistic regression or Cox regression are used to measure the effects of
risk factors on the occurrence of an endpoint. Earlier recommendations stated
that you needed ten events (endpoints) per predictor variable. Subsequent
work suggested that this isn’t strictly true, and that 5–9 events per predictor
may yield estimates that are just as good. However, the jury is still out.
The section includes guidelines on designing studies with multiple predictors.
There isn’t a table because the number of potential scenarios is impossibly big.
Introduction
Logistic regression builds a model to estimate the probability of an event
occurring. To use logistic regression, we need data in which each participant’s
status is known: the event of interest has either occurred or has not occurred.
For example, we might be analysing a case-control study of stress fractures in
athletes. Stress fractures are either present (in the cases) or absent (in the
controls). We can use logistic regression to analyse the data.
However, in follow-up studies, we often have data on people who might
experience the event but they have not experienced it yet. For example, in a
cancer follow-up study, some patients have experienced a recurrence of the
disease, while others are still being followed up and are disease free. We
cannot say that those who are disease free will not recur, but we know that
their time to recurrence must be greater than their follow-up time. This kind of
data is called censored data.
In this case, we can use Cox regression (sometimes called a proportional
hazards general linear model, which is what Cox himself called it; you can see
why people refer to it as Cox regression!).
The ten events per predictor rule
There was a very influential paper published in the 1990s by Peduzzi et al
(1996) based on simulation studies which concluded that for logistic regression
you needed ten events (not patients) per predictor variable if you were
calculating a multivariable model.
Example: a researcher wants to look at factors affecting the development of
hypertension in first-time pregnancies. If the researcher has 5 explanatory
variables, they will need to recruit a sample big enough to yield 50 cases of
hypertension. Around 20% of first-time mothers will develop hypertension, so
these 50 cases will be 20% of the required sample. So a total sample of 250 will
be required so that there will be the required 50 cases.
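The arithmetic of the events-per-predictor rule is easy to script. The following Python sketch is only an illustration (the function name and the default of ten events per predictor are choices made here, not a recommendation): it works out the number of events needed and then the total sample that should yield them.

```python
import math


def sample_size_for_events(n_predictors: int, event_rate: float,
                           events_per_predictor: int = 10) -> tuple[int, int]:
    """Return (events needed, total sample needed to yield them)."""
    events = n_predictors * events_per_predictor
    # round() guards against floating-point noise before rounding up
    total = math.ceil(round(events / event_rate, 9))
    return events, total


# Hypertension example: 5 predictors, 20% of first-time mothers affected
events, total = sample_size_for_events(5, 0.20)
print(f"10 events per predictor: {events} events, {total} participants")

# The same study under a relaxed rule of 5 events per predictor
events, total = sample_size_for_events(5, 0.20, events_per_predictor=5)
print(f" 5 events per predictor: {events} events, {total} participants")
```

Because the total is the number of events divided by the event rate, a rare outcome pushes the required sample up very quickly: the same five predictors with a 5% event rate would need 1000 participants under the ten-events rule.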
Vittinghoff, E. & McCulloch, C.E., 2007. Relaxing the rule of ten events per
variable in logistic and Cox regression. American Journal of Epidemiology,
165(6), pp.710–718.
Wynants L, Bouwmeester W, Moons KGM, Moerbeek M, Timmerman D, Van
Huffel S, et al. A simulation study of sample size demonstrated the importance
of the number of events per variable to develop prediction models in clustered
data. Journal of Clinical Epidemiology. Elsevier Inc; 2015 Dec 1;68(12):1406–
14.
2: Sample sizes and powers for comparing two
means where the variable is measured on a
continuous scale that is (more or less) normally
distributed.
This section gives guidelines for sample sizes for studies that measure the
difference between the means of two groups, or that compare the means of the
same group measured under two different conditions (often before and after an
intervention).
2.1 Comparing the means of two groups
Studies frequently compare a group of interest with a control group or
comparison group. If your study involves measuring something on the same
people twice, once under each of two conditions, you need the next section.
Step 1: Effect size: decide on the difference that you want to be able to
detect
The first step in calculating a sample size is to decide on the smallest difference
between the two groups that would be 'clinically significant' or 'scientifically
significant'. For example, a difference in birth weight of 250 grammes between
babies whose mothers smoked and babies whose mothers did not smoke would
certainly be regarded as clinically significant, as it represents the weight gain
of a whole week of gestation. However, a smaller difference – say 75 grammes –
probably would not be.
It is hard to define the smallest difference that would be clinically
significant. An element of guesswork is involved. What is the smallest
reduction in cholesterol that would be regarded as clinically worthwhile? It
may be useful to search the literature and see what other investigators have
done. And bear in mind that an expensive intervention will need to be
associated with quite a large difference before it would be considered
worthwhile.
NB: Effect size should not be based on your hopes or expectations!
Note, however, that the sample size depends on the smallest clinically
significant difference, not on the size of the difference you expect to find. You
may have high hopes, but your obligation as a researcher is to give your study
enough power to detect the smallest difference that would be clinically
significant.
Sample Size: comparing means of two groups
Step 2: Convert the smallest clinically significant difference to standard
deviation units.
Step 2.1. What is the expected mean value for the control or comparison group?
Step 2.2. What is the standard deviation of the control or comparison group?
How to get an approximate standard deviation
If you do not know this exactly, you can get a reasonable guess by identifying
the highest and lowest values that would typically occur. Since most values will
be within ±2 standard deviations of the average, then the highest typical value
(2 standard deviations above average) and lowest typical value (2 below) will
span a range of four standard deviations.
An approximate standard deviation is therefore

$$\text{Approximate SD} = \frac{\text{Highest typical value} - \text{Lowest typical value}}{4}$$

For example: a researcher is measuring fœtal heart rate, to see if mothers who
smoke have babies with slower heart rates. A typical rate is 160 beats per
minute, and normally the rate would not be below 145 or above 175. The
variation in 'typical' heart rates is 175 – 145 = 30 beats. This is about 4 standard
deviations, so the standard deviation is about 7.5 beats per minute. (This
example is real, and the approximate standard deviation is pretty close to the
real one!)
How to get an approximate standard deviation from a published confidence interval
Another potential source of standard deviation information is from published
research. Although the paper may not include a standard deviation, it may
include a confidence interval. The Cochrane Handbook has a useful formula for
converting this to a standard deviation:

$$\text{Standard deviation} = \sqrt{N} \times \frac{\text{Upper CI limit} - \text{Lower CI limit}}{3.92}$$

where N is the number of cases.
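Both ways of getting a rough standard deviation are one-line calculations. Here is a minimal Python sketch of each (function names are made up for illustration). The first example uses the adult weight figures from section 1.5; the confidence interval in the second is a made-up illustration, not data from a real paper, and the 3.92 divisor assumes a 95% confidence interval for the mean of a single group.

```python
import math


def sd_from_typical_range(lowest_typical: float, highest_typical: float) -> float:
    """Rule of thumb: the typical range spans about four standard deviations."""
    return (highest_typical - lowest_typical) / 4


def sd_from_confidence_interval(lower: float, upper: float, n: int) -> float:
    """SD from a 95% CI for a single group mean based on n cases.

    3.92 is 2 x 1.96, the width of a 95% interval in standard errors.
    """
    return math.sqrt(n) * (upper - lower) / 3.92


# Adult weight: values under 50 kg or over 100 kg would be unusual
print(f"SD from range: {sd_from_typical_range(50, 100):.1f}")

# Hypothetical published result: 95% CI of 155 to 165 from 9 cases
print(f"SD from CI:    {sd_from_confidence_interval(155, 165, 9):.2f}")
```

With small samples the published interval will usually have been calculated with a t value larger than 1.96, so the 3.92 divisor will tend to overstate the standard deviation a little. For sample size purposes that errs on the safe side.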
Step 3. What is the smallest difference between the two groups in the
study that would be considered of scientific or clinical importance?
This is the minimum difference which should be detectable by the study. You
will have to decide what is the smallest difference between the two groups that
you are studying that would constitute a 'clinically significant difference' – that
is, a difference that would have real-life implications.
In the case of the fœtal heart rate example, a researcher might decide that a
difference of 5 beats per minute would be clinically significant.
Note again that the study should be designed to have a reasonable chance of
detecting the minimum clinically significant difference, and not the difference
that you think is actually there.
Step 4. Convert the minimum difference to be detected to standard
deviation units by dividing it by the standard deviation

$$\text{Difference in SD units} = \frac{\text{Minimum difference to be detected}}{\text{Standard deviation}}$$

Following our example, the minimum difference is 5 beats, and the standard
deviation is 7.5 beats. The difference to be detected is therefore two thirds of a
standard deviation (0.67).
Step 5: Use table 2.1 to get an idea of the number of participants you need in each group
to detect a difference of this size.
Following the example, the nearest value in the table to 0.67 is 0.7. The
researcher will need two groups of 44 babies each to have a 90% chance of
detecting a difference of 5 beats per minute between smoking and non-smoking
mothers' babies. To have a 95% chance of detecting this difference, the
researcher will need 55 babies in each group.
| Difference to be detected (SD units) | N in each group, 90% power* | N in each group, 95% power | Chance that someone in group 1 will score higher than someone in group 2 |
|---|---|---|---|
| 2 | 7 | 8 | 92% |
| 1.5 | 11 | 13 | 86% |
| 1.4 | 12 | 15 | 84% |
| 1.3 | 14 | 17 | 82% |
| 1.25 | 15 | 18 | 81% |
| 1.2 | 16 | 20 | 80% |
| 1.1 | 19 | 23 | 78% |
| 1 | 23 | 27 | 76% |
| 0.9 | 27 | 34 | 74% |
| 0.8 | 34 | 42 | 71% |
| 0.75 | 39 | 48 | 70% |
| 0.7 | 44 | 55 | 69% |
| 0.6 | 60 | 74 | 66% |
| 0.5 | 86 | 105 | 64% |
| 0.4 | 133 | 164 | 61% |
| 0.3 | 235 | 290 | 58% |
| 0.25 | 338 | 417 | 57% |
| 0.2 | 527 | 651 | 55% |
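If your effect size falls between rows of the table, you can get a close estimate with a short calculation. The Python sketch below is an approximation, not the method used for the table (which came from Stata's power command): it uses the normal-approximation formula and then adds one participant per group as a rough allowance for using the t distribution. That allowance is an assumption made here; for the effect sizes spot-checked below it gives the same numbers as the table. The last function gives the chance that someone in one group scores higher than someone in the other, assuming normal distributions with equal standard deviations.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def n_per_group(effect_size_sd: float, power: float = 0.90, alpha: float = 0.05) -> int:
    """Approximate N per group for a two-sided two-sample comparison of means."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STANDARD_NORMAL.inv_cdf(power)
    n_normal = 2 * (z_alpha + z_beta) ** 2 / effect_size_sd ** 2
    return math.ceil(n_normal + 1)  # +1: rough t-distribution allowance (assumption)


def chance_of_scoring_higher(effect_size_sd: float) -> float:
    """Chance that a random member of one group outscores one from the other."""
    return STANDARD_NORMAL.cdf(effect_size_sd / math.sqrt(2))


for d in (0.7, 1.0):
    print(f"d = {d}: {n_per_group(d)} per group for 90% power, "
          f"{n_per_group(d, power=0.95)} for 95% power, "
          f"chance of scoring higher {chance_of_scoring_higher(d):.0%}")
```

For a final protocol it is still worth confirming the figure with a dedicated power routine in Stata or R, since those work with the t distribution directly.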
something useful. For this reason, your study should have a reasonable chance
of finding a difference, if such a difference exists.
A study with 90% power is 90% likely to discover the difference between
the groups if such a difference exists. And 95% power increases this likelihood
to 95%. So if a study with 95% power fails to detect a difference, a difference
of that size is unlikely to exist. You should aim for 95% power, and certainly
accept nothing less than 90% power. Why run a study that has more than a 10%
chance of failing to detect the very thing it is looking for?
How do I interpret the column that shows the chance that a person in one
group will have a higher score than a person in another group?
Some scales have measuring units that are hard to imagine. We can imagine
fœtal heart rate, which is in beats per minute, but how do you imagine scores
on a depression scale? What constitutes a 'clinically significant' change in
depression score?
One way of thinking of differences between groups is to ask what
proportion of the people in one group have scores that are higher than average
for the other group. For example we could ask what proportion of smoking
mothers have babies with heart rates that are below the average for non-
smoking mothers? Continuing the example, if we decide that a difference of 5
beats per minute is clinically significant (which corresponds to just about 0.7
SD), this means that there is a 69% chance that a non-smoking mother's baby
will have a higher heart rate than a smoking mother's baby. (Of course, if there
is no effect of smoking on heart rate, then the chances are 50% – a smoking
mother's baby is just as likely to have a higher heart rate as a lower heart rate).
This information is useful for planning clinical trials. We might decide
that a new treatment would be superior if 75% of the people would do better on
it. (If it was just the same, then 50% of people would do better and 50% worse.)
This means that the study needs to detect a difference of about 1 standard
deviation (from the table). And the required size is two groups of 27 people for
95% power.
The technical name for this percentage, incidentally, is the Mann-Whitney
statistic. You will also encounter it as the c statistic, Harrell’s c, and even as the
area under the ROC curve.
I have a limited number of potential participants. How can I find out power
for a particular sample size?
You may be limited to a particular sample size because of the limitations of
your data. There may only be 20 patients available, or your project time scale
only allows for collecting data on a certain number of participants. You can use
the table to get a rough idea of the power of your study. For example, with only
20 participants in each group, you have more than 95% power to detect a
difference of 1.25 standard deviations (which only needs two groups of 18) and
slightly less than 90% power to detect a difference of 1 standard deviation (you
would really need 2 groups of 23).
But what if the difference between the groups is bigger than I think?
Sample sizes are calculated to detect the smallest clinically significant
difference. If the difference is greater than this, the study's power to detect it is
higher. For instance, a study of two groups of 44 babies has a 90% power to
detect a difference of 0.7 standard deviations, which corresponded (roughly) to
5 beats per minute, the smallest clinically significant difference. If the real
difference were bigger – say, 7.5 beats per minute (1 standard deviation) – then
the power of the study would actually be 99.6%. (This is just an example, and I
had to calculate this power specifically; it's not in the table.) So if your study
has adequate power to detect the smallest clinically significant difference, it
has more than adequate power to detect bigger differences.
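You can turn the calculation round and estimate the power you get from a fixed number of participants. The Python sketch below uses a normal approximation, so it is a rough guide rather than the exact t-based figure a package such as Stata or R would give, and it runs slightly optimistic when groups are small. The function name is made up for illustration.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def approximate_power(n_per_group: int, effect_size_sd: float, alpha: float = 0.05) -> float:
    """Normal-approximation power for a two-sided two-sample comparison of means."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    # The expected z statistic is the effect size times sqrt(n / 2)
    expected_z = effect_size_sd * math.sqrt(n_per_group / 2)
    return STANDARD_NORMAL.cdf(expected_z - z_alpha)


# Only 20 participants available in each group: power rises with the true difference
for d in (0.7, 1.0, 1.25, 1.5):
    print(f"difference of {d} SD: power about {approximate_power(20, d):.0%}")
```

Running this for a range of effect sizes shows both points made above: a limited sample may still have good power for large differences, and any study has more power for differences bigger than the one it was designed around.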
I intend using a Wilcoxon (Mann Whitney) test because I don't think my
data will be normally distributed
The first important point is that the idea that the data should be normally
distributed before using a t-test, or linear regression, is a myth. It is the
measurement errors that need to be normally distributed. But even more
important, studies with non-normal data have shown that the t-test is extremely
robust to departures from normality (Fagerland, 2012; Fagerland, Sandvik, &
Mowinckel, 2011; Rasch & Teuscher, 2007).
A second persistent misconception is that you cannot use the t-test on small
samples (when pressed, people mutter something about “less than 30” but
aren’t sure). Actually, you can. And the t-test performs well in samples as small
as N=2! (J. de Winter, 2013) Indeed, with very small samples, the
Wilcoxon Mann-Whitney test is unable to detect a significant difference, while
the t-test is (Altman & Bland, 2009).
Relative to a t-test or regression, the Wilcoxon test (also called the Wilcoxon
Mann-Whitney U test) can be less efficient if your data are close to normally
distributed. However, a statistician called Pitman showed that the test was
never less than 86.4% as efficient. So inflating your sample by a factor of 1.16
(1 ÷ 0.864) should give you at least the same power that you would have using
a t-test with normally distributed data. With data with skewed distributions, or
data in which the distributions are different in the two groups, the Wilcoxon
Mann-Whitney test can be more powerful than a t-test, so
My data are on 5-point Likert scales and my supervisor says I cannot use a
t-test because my data are ordinal
Simulation studies comparing the t-test and the Wilcoxon Mann-Whitney test on
items scored on 5-point scales have given heartening results. In most scenarios,
the two tests had a similar power to detect differences between groups. The
false-positive error rate for both tests was near to 5% for most situations, and
never higher than 8% in even the most extreme situations. However, when the
samples differed markedly in the shape of their score distribution, the Wilcoxon
Mann-Whitney test outperformed the t-test (J. C. de Winter & Dodou, 2010).
Methods in Stata and R
The sample sizes were calculated using Stata Release 14, using the power
command. The Mann-Whitney statistic was calculated using the mwstati
command for Stata written by Rich Goldstein, and based on formulas in Colditz
et al (1988).
You can also use the package pwr in R. The R code for the fœtal heart rate
example, where we want to detect a difference of 0.67 standard deviations, is

    > pwr.t.test(n=NULL, d=.67, power=.9, type="two.sample")

         Two-sample t test power calculation

                  n = 47.79517
                  d = 0.67
          sig.level = 0.05
              power = 0.9
        alternative = two.sided

    NOTE: n is number in *each* group
These calculations were carried out using Stata release 12
Altman, D. G., & Bland, J. M. (2009). Parametric v non-parametric methods for
data analysis. BMJ, 338, a3167. doi:10.1136/bmj.a3167
Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests
actually test? The Stata Journal, 12(2), 1–9.
Higgins JPT. Cochrane Handbook for Systematic Reviews of Interventions. The
Cochrane Collaboration; 2011. Available from: www.handbook.cochrane.org.
de Winter, J. (2013). Using the Student’s t-test with extremely small sample
sizes. Practical Assessment, Research & Evaluation, 18(10), 1–12.
de Winter, J. C., & Dodou, D. (2010). Five-point Likert items: t test versus Mann-
Whitney-Wilcoxon. Practical Assessment, Research & Evaluation, 15(11),
1–12.
Fagerland, M. W. (2012). t-tests, non-parametric tests, and large studies: a
paradox of statistical practice? BMC Medical Research Methodology, 12,
78. doi:10.1186/1471-2288-12-78
Fagerland, M. W., Sandvik, L., & Mowinckel, P. (2011). Parametric methods
outperformed non-parametric methods in comparisons of discrete
numerical variables. BMC Medical Research Methodology, 11(1), 44.
doi:10.1186/1471-2288-11-44
Colditz, G. A., Miller, J. N., & Mosteller, F. (1988). Measuring Gain in the
Evaluation of Medical Technology. International Journal of
Technology Assessment, 4, 637–42.
Rasch, D., & Teuscher, F. (2007). How robust are tests for two independent
samples? Journal of Statistical Planning and Inference, 137(8), 2706–
2720.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics
Bulletin, 1(6), 80–83.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances.
The British Journal of Mathematical and Statistical Psychology, 57(Pt 1),
173–181. doi:10.1348/000711004849222
2.2 Sample sizes for comparing means in the same
people under two conditions
One common experimental design is to measure the same thing twice, once
under each of two conditions. This sort of data is often analysed with the
paired t-test. However, the paired t-test doesn't actually use the two values you
measured; it subtracts one from the other and gets the average difference. The
null hypothesis is that this average difference is zero.
So the sample size for paired measurements doesn't involve specifying the
means for each condition but specifying the mean difference.
Step 1: decide on the difference that you want to be able to detect.
The first step in calculating a sample size is to decide on the smallest difference
between the two measurements that would be 'clinically significant' or
'scientifically significant'. For example, if you wanted to see how effective an
exercise programme was in reducing weight in people who were overweight,
you might decide that losing two kilos over the one-month trial period would be
the minimum weight loss that would count as a 'significant' weight loss.
It is often hard to define the smallest difference that would be clinically
significant. An element of guesswork is involved. What is the smallest
reduction in cholesterol that would be regarded as clinically worthwhile? It
may be useful to search the literature and see what other investigators have
done.
Effect size should not be based on your expectations!
Note, however, that the sample size depends on the smallest clinically
significant difference, not on the size of the difference you expect to find.
Step 2: Convert the smallest clinically signiﬁcant dierence to standard
deviation units.
Step 2.1. What is the standard deviation of the dierences?
This is often very hard to ascertain. You may ﬁnd some published data. Even if
you cannot you can get a reasonable guess by identifying the biggest positive
and biggest negative dierences that would typically occur. The biggest
positive dierence is the biggest dierence in the expected direction that
would typically occur. The biggest negative dierence is the biggest dierence
in the opposite direction that would be expected to occur. Since most values
will be within ±2 standard deviations of the average, then the biggest positive
dierence (2 standard deviations above average) and biggest negative (2
below) will span a range of four standard deviations. An approximate standard
deviation is therefore
Sample Size: comparing means of same people measured twice
!44
For example: though we are hoping for at least a two kilo weight loss following
exercise, some people may lose up to ﬁve kilos. However, others might actually
gain as much as a kilo, perhaps because of the eect of exercise on appetite. So
the change in weight can vary from plus ﬁve kilos to minus one, a range of six
kilos. The standard deviation is a quarter of that range: one and a half kilos.
Step 2.2. Convert the minimum dierence to be detected to standard deviation units by
dividing it by the standard deviation
Following our example, the minimum dierence is 2 kilos, and the standard
deviation is 1.5 kilos. The dierence to be detected is therefore one and a third
standard deviations (1.33).
Step 3: Use table 2.2 to get an idea of the number of participants you
need in each group to detect a dierence of this size.
Following the example, the nearest value in the table to 1.33 is 1.3. The
researcher will need to study seven people to have a 90% chance of detecting a
weight loss of 2 kilos following the exercise programme. To have a 95% chance
of detecting this dierence, the researcher will need 8 people.
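The two calculations used in Step 2 are:

$$\text{Approximate SD of differences} = \frac{\text{Biggest typical positive difference} - \text{Biggest typical negative difference}}{4}$$

$$\text{Difference to be detected (SD units)} = \frac{\text{Minimum difference to be detected}}{\text{Standard deviation of the difference}}$$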
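| Difference to be detected (SD units) | N required for 90% power* | N required for 95% power | Percentage of people who will change in the hypothesised direction |
|---|---|---|---|
| 2 | 5 | 6 | 98% |
| 1.5 | 7 | 8 | 93% |
| 1.4 | 8 | 9 | 92% |
| 1.3 | 9 | 10 | 90% |
| 1.25 | 9 | 11 | 89% |
| 1.2 | 10 | 12 | 88% |
| 1.1 | 11 | 13 | 86% |
| 1 | 13 | 16 | 84% |
| 0.9 | 16 | 19 | 82% |
| 0.8 | 19 | 23 | 79% |
| 0.75 | 21 | 26 | 77% |
| 0.7 | 24 | 29 | 76% |
| 0.6 | 32 | 39 | 73% |
| 0.5 | 44 | 54 | 69% |
| 0.4 | 68 | 84 | 66% |
| 0.3 | 119 | 147 | 62% |
| 0.25 | 171 | 210 | 60% |
| 0.2 | 265 | 327 | 58% |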
What is 90% or 95% power?
Just because a difference really exists in the population you are studying does
not mean it will appear in every sample you take. Your sample may not show
the difference, even though it is there. To be ethical and value for money, a
research study should have a reasonable chance of detecting the smallest
difference that would be of clinical significance (if this difference actually
exists, of course). If you do a study and fail to find a difference, even though it
exists, you may discourage further research, or delay the discovery of
something useful. For this reason, your study should have a reasonable chance
of finding a difference, if such a difference exists.
A study with 90% power is 90% likely to discover the difference between
the two measurement conditions if such a difference exists. And 95% power
increases this likelihood to 95%. So if a study with 95% power fails to detect a
difference, the difference is unlikely to exist. You should aim for 95% power,
and certainly accept nothing less than 90% power. Why run a study that has
more than a 10% chance of failing to detect the very thing it is looking for?
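To see how the chosen power feeds into the number of participants, here is a rough Python sketch for the same-people-measured-twice design. It uses a normal approximation with a small-sample correction and assumes a two-sided 5% significance level; both are assumptions of this sketch rather than the calculation behind the table, so results can differ from the table by a participant or so (for a difference of 1 at 95% power it gives 15 where the table has 16). The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_paired_n(difference_sd_units, power=0.90, alpha=0.05):
    """Approximate number of participants for a paired comparison.

    Normal approximation plus a small-sample correction (z^2 / 2);
    an assumption of this sketch, not an exact t-test calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)
    n = ((z_alpha + z_power) / difference_sd_units) ** 2 + z_alpha ** 2 / 2
    return ceil(n)


for d in (1.0, 0.5):
    print(d, approx_paired_n(d, 0.90), approx_paired_n(d, 0.95))
```

Raising the power from 90% to 95% increases the required sample by roughly a fifth to a quarter, which is the price of halving the chance of missing a real difference.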
How do I interpret the column that shows the percentage of people who will change in
the hypothesised direction?
Some scales have measuring units that are hard to imagine. We can imagine
foetal heart rate, which is in beats per minute, but how do you imagine scores
on a depression scale? What constitutes a 'clinically significant' change in
depression score?
One way of thinking of differences between groups is to ask what
proportion of the people will change in the hypothesised direction. For example
we could ask what proportion of depressed patients on an exercise programme
would have to show improved mood scores before we would consider making
the programme a regular feature of the management of depression. If we
decide that we would like to see improvements in at least 75% of patients,
then depression scores have to fall by 0.7 standard deviation units. The sample
size we need is 24 patients for 90% power, 29 for 95% power (the table doesn't
give 75%, so I've used the row for 76%, which is close enough).
The technical name for this percentage, incidentally, is the Mann-Whitney
statistic.
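The link between a difference in standard deviation units and the percentage of people changing in the hypothesised direction can be sketched as follows. The sketch assumes that individual changes are normally distributed, which is an assumption made here for illustration; the function name is hypothetical.

```python
from statistics import NormalDist


def percent_changing(difference_sd_units):
    """Share of people expected to change in the hypothesised direction,
    assuming the individual changes are normally distributed."""
    return 100 * NormalDist().cdf(difference_sd_units)


for d in (0.5, 0.7, 1.0):
    print(d, round(percent_changing(d)))
```

Under that assumption a fall of 0.7 standard deviations corresponds to about 76% of people improving, and a fall of one standard deviation to about 84%.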
I have a limited number of potential participants. How can I find out power for a particular
sample size?
You may be limited to a particular sample size because of the limitations of
your data. There may only be 20 patients available, or your project time scale
only allows for collecting data on a certain number of participants. You can use
the table to get a rough idea of the power of your study. For example, with only
20 participants, you have more than 90% power to detect a difference of 0.75
2.3 Calculating sample sizes for comparing two
means: a rule of thumb
Sample size for comparing two groups
Gerald van Belle gives a good rule of thumb for calculating sample size for
comparing two groups. You do it like this:
1. Calculate the smallest difference between the two groups that would be of
scientific interest.
2. Divide this by the standard deviation to convert it to standard deviation units
(these are the same two steps as before).
3. Square the difference.
4. For 90% power to detect this difference in studies comparing two groups,
the number you need in each group will be

$$n = \frac{21}{(\text{Difference})^2}$$

Round up the answer to the nearest whole number.
5. For 95% power, change the number above the line to 26.
Despite being an approximation, this formula is very accurate.
Studies comparing one mean with a known value
If you are only collecting one sample and comparing their mean to a known
population value, you may also use the formula above. In this case, the formula
for 90% power is

$$n = \frac{11}{(\text{Difference})^2}$$

Round up the answer to the nearest whole number.
For 95% power, replace the number 11 above the line by 13.
See the links page at the end of this guide for the source of these rules of
thumb.
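The rule of thumb is easy to turn into a small helper. The sketch below simply encodes the four numerators given above (21 and 26 for two groups, 11 and 13 for one sample); the function name and its arguments are hypothetical.

```python
from math import ceil


def rule_of_thumb_n(difference_sd_units, power=90, two_groups=True):
    """Sample size from the rule of thumb above.

    Returns the number per group when two_groups is True,
    otherwise the size of a single sample compared with a known value.
    """
    numerators = {
        (True, 90): 21, (True, 95): 26,    # two groups, number per group
        (False, 90): 11, (False, 95): 13,  # one sample vs a known value
    }
    return ceil(numerators[(two_groups, power)] / difference_sd_units ** 2)


print(rule_of_thumb_n(0.5))                    # two groups, 90% power
print(rule_of_thumb_n(0.5, power=95))          # two groups, 95% power
print(rule_of_thumb_n(0.5, two_groups=False))  # one sample, 90% power
```

For a difference of half a standard deviation this gives 84 per group at 90% power, 104 per group at 95% power, and 44 for a single sample at 90% power.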
Sample Size: comparing means: rule of thumb
3. Sample size for correlations or regressions
between two variables measured on a numeric scale
This section gives guidelines for sample sizes for studies that measure the
relationship between two numeric variables. Although these sample sizes are
often based on correlations, they can also be applied to linear regression, and
both types of measure are shown in the table.
Introduction : correlation and regression
Correlations are not widely used in medicine, because they are hard to
interpret. One interpretation of a Pearson correlation (r) can be got by squaring
it: this gives the proportion of variation in one variable that is linked to
variation in another variable. For example, there is a correlation of 0.7 between
illness-related stigma and depression, which means that just about half the
variation in depression (0.49, which is 0.7²) is linked to variation in
illness-related stigma.
Regressions are much more widely used, since they allow us to express the
relationship between two variables in natural units – for example, the effect of
a one-year increase in age on blood pressure. Because regressions are
calculated in natural units, people often cite the proportion of variation shared
between the two variables.
In fact, correlation is just an alternative form of reporting the results of a
regression, so the p-value for a regression will be the same as the p-value for a
Pearson correlation.
Steps in calculating sample size for correlation or regression
Step 1: How much variation in one variable should be linked to variation in
the other variable for the relationship to be clinically important?
This is hard to decide, but it is hard to imagine a correlation being of 'real life'
importance if less than 20% of the variation in one variable is linked to
variation in the other variable.
Step 2: Use the table to look up the corresponding correlation and sample
size
Sample Size: correlation
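| % Shared variation | Correlation | Sample size 90% power* | Sample size 95% power |
|---|---|---|---|
| 10% | 0.32 | 99 | 122 |
| 15% | 0.39 | 65 | 80 |
| 20% | 0.45 | 48 | 59 |
| 25% | 0.5 | 38 | 47 |
| 30% | 0.55 | 31 | 37 |
| 35% | 0.59 | 26 | 32 |
| 40% | 0.63 | 23 | 27 |
| 45% | 0.67 | 19 | 23 |
| 50% | 0.71 | 17 | 20 |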
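For correlations that fall between the rows of the table, a sample size can be approximated with the Fisher z transformation. This is a sketch under stated assumptions (two-sided 5% significance, Fisher's approximation) and is not necessarily the method used to build the table, although it gives figures close to it. The function name is hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def approx_n_for_correlation(r, power=0.90, alpha=0.05):
    """Approximate sample size to detect a correlation of r,
    using the Fisher z transformation (an assumed method)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


for r in (0.45, 0.5):
    shared = round(100 * r ** 2)  # percentage of shared variation
    print(r, shared, approx_n_for_correlation(r),
          approx_n_for_correlation(r, power=0.95))
```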
4. Sample size for reliability studies
This section gives guidelines for sample sizes for studies that measure
Cronbach’s alpha, an index of the reliability – strictly speaking the internal
consistency – of a set of items designed to measure a trait. The topic of scale
development is a complex one, so the section gives guidance on the
methodology of analysis and the interpretation of alpha.
Introduction : An apology
the following carefully.
Cronbach’s alpha
The reliability of a measurement scale is the degree to which all the items
measure the same thing. Reliability is specific: it describes the performance of
a scale in a specific population tested under specific conditions. So it is
important to make sure that scales are reliable when used in realistic
conditions with realistic participants.
In developing a new measurement scale, or showing that a measurement scale
works in a new setting, it is useful to measure its reliability. Reliability is
usually measured using Cronbach's alpha coefficient, which is scaled between
zero and one, with zero meaning that the items in the scale have nothing in
common and one meaning that they are all perfectly correlated. In practice, it
is wildly unlikely that anyone would develop a scale in which all the items were
unrelated, so there is no point in testing whether your reliability is greater than
zero. Instead, you have to specify a minimum value for the reliability
coefficient.
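For readers who want to see what alpha actually computes, here is a minimal Python sketch using the standard formula (the number of items, the item variances and the variance of the total score). The data are made up purely for illustration, and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, all in the same participant order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per person
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical data: three items answered by five participants.
items = [
    [2, 3, 3, 4, 5],
    [1, 3, 2, 4, 4],
    [2, 2, 3, 5, 5],
]
print(round(cronbach_alpha(items), 2))
```

When every item gives identical scores the function returns one, matching the description of perfectly correlated items above.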
A mythology has grown up around the interpretation of Cronbach’s alpha,
based, apparently, on the published work of Nunnally (1978). According to this
myth, Nunnally advocated an alpha of 0·7 as indicating a scale that was
acceptable for use in research. In fact, it’s worth quoting Nunnally’s paper,
which offers a much more nuanced and thoughtful approach to the question:
“What a satisfactory level of reliability is depends on how a measure is being
used. In the early stages of research … one saves time and energy by working
with instruments that have only modest reliability, for which purpose
reliabilities of .70 or higher will suffice… In contrast to the standards in basic
research, in many applied settings a reliability of .80 is not nearly high enough.
In basic research, the concern is with the size of correlations and with the
Sample Size: reliability studies
differences in means for different experimental treatments, for which purposes
a reliability of .80 for the different measures is adequate.”
“In many applied problems, a great deal hinges on the exact score made by a
person on a test… In such instances it is frightening to think that any
measurement error is permitted. Even with a reliability of .90, the standard
error of measurement is almost one-third as large as the standard deviation of
the test scores. In those applied settings where important decisions are made
with respect to specific test scores, a reliability of .90 is the minimum that
should be tolerated, and a reliability of .95 should be considered the desirable
standard.”
This extensive quotation is from Lance, C.E., Butts, M.M. & Michels, L.C., 2006.
The Sources of Four Commonly Reported Cutoff Criteria: What Did They Really
Say? Organizational Research Methods, 9(2), pp.202–220.
So bear in mind, first, that mindlessly setting a desired alpha of 0·7 and citing
Nunnally’s original paper is wrong. He didn’t say anything like that. And,
second, that you need to consider carefully the context of your research in
setting a minimum alpha.
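The remark in the quotation that a reliability of .90 still leaves a standard error of measurement of almost one-third of a standard deviation can be checked with the usual formula, SD × √(1 − reliability). The sketch below assumes that standard formula; the function name is hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement for a scale with the given
    standard deviation and reliability coefficient."""
    return sd * sqrt(1 - reliability)


# Express the error as a fraction of one standard deviation.
for reliability in (0.70, 0.80, 0.90, 0.95):
    print(reliability, round(standard_error_of_measurement(1.0, reliability), 2))
```

At a reliability of .70 the measurement error is more than half a standard deviation, which is why the context of use matters so much when setting a minimum alpha.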
Alpha only applies to unidimensional scales
One of the statistical assumptions underlying alpha is that the scale is
unidimensional. That is to say, that all the items measure the same thing, and
that their failure to correlate perfectly is due to measurement error. So an
important part of scale development is making sure that your items are indeed
unidimensional.
How many cases should a reliability study have?
The standard advice is to have at least 10 participants per item on your scale.
However, this should be regarded as the bare minimum.
There are surprising differences of opinion in the literature, however, on how
small your sample can be. The best current advice is based on simulation
studies where authors have studied the power of samples of various sizes to
detect a given alpha.
Simulation studies indicate that sample size depends on the structure of your
scale. Sample sizes as small as 30 can measure alpha reliably so long as the
scale items have strong inter-correlations.
First step : principal components analysis
Your analysis should begin with a principal components analysis. A principal
components analysis identifies underlying ‘dimensions’ that account for the
variation in a set of items. In the case of reliability, you should only examine the
first principal component. There is a good reason for this: alpha has no
5. Sample size calculation for agreement between
two raters using a present/absent rating scale using
Cohen’s Kappa
This section gives guidelines for sample sizes for studies that use the kappa
coefficient to measure the agreement between two raters who make ratings of
present/absent.
Introduction
Studies looking at the agreement between raters come in many shapes and
sizes. The most basic design is where two raters are asked to rate the presence
or absence of a particular feature or quality. Kappa is a statistic that measures
the degree of agreement over and above the agreement you would expect by
chance. You can see why just measuring percentage agreement is not enough.
If you toss two coins, they will agree 50% of the time just by chance. Likewise,
two raters, each of whom rates a feature as present 50% of the time will agree
with each other by chance 50% of the time.
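The idea of agreement over and above chance can be made concrete with a short Python sketch of Cohen's kappa for two raters and a present/absent rating. The counts are invented for illustration, and the function and argument names are hypothetical.

```python
def cohen_kappa(both_present, only_rater1, only_rater2, both_absent):
    """Cohen's kappa from the four cells of a two-rater present/absent table."""
    n = both_present + only_rater1 + only_rater2 + both_absent
    observed = (both_present + both_absent) / n
    rater1_present = (both_present + only_rater1) / n
    rater2_present = (both_present + only_rater2) / n
    # Agreement expected by chance, given each rater's own prevalence.
    chance = (rater1_present * rater2_present
              + (1 - rater1_present) * (1 - rater2_present))
    return (observed - chance) / (1 - chance)


# Two raters who each say "present" half the time and agree only by chance.
print(cohen_kappa(25, 25, 25, 25))
# The same prevalences, but agreeing on 80 ratings out of 100.
print(round(cohen_kappa(40, 10, 10, 40), 2))
```

The first pair of raters agree 50% of the time yet have a kappa of zero; the second pair agree 80% of the time, which corresponds to a kappa of 0.6.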
When we are studying agreement, we have to choose a null hypothesis.
Normally, the null hypothesis says that the data arose by chance – that there is
no actual relationship between the variables we are studying. However, this
makes no sense at all when we are studying agreement. It would be ridiculous
to set up a scientific study to determine whether the agreement between two
pathologists was better than chance! When two raters rate the same thing, it
would be unusual to find that they didn’t agree any more than you would
expect by chance, even in psychiatry.
So in studies of agreement, we have to set a minimum level of agreement that
we want to outrule in our study. Usually we would like to outrule a level of
agreement that would suggest that there was a significant problem with the
reliability of the rating. So unlike other sample size methods, the researcher
will have to base sample size calculation for kappa on two figures: the value of
kappa to be outruled and the likely true value of kappa. In addition, the
prevalence of the feature will affect sample size.
Estimating sample size for kappa
The sample size will depend on three factors:
Step 1: Prevalence of the feature
What is the approximate prevalence of the feature that is being rated? Sample
sizes will be smallest when there is a 50% prevalence, and will get very large
when the prevalence drops much below 25%.
Sample Size: pilot studies
In the calculations below, we assume that there is no systematic difference
between the raters. In other words, that each rater gives more or less the same
prevalence of the feature. Where you suspect that raters will give different
prevalences, the sample size calculation needs to take this into account, and is
well beyond the scope of this guide. However, the R package I used will
perform the calculation (see below).
Step 2 : Definition of an unacceptably low level of agreement (null value)
It would be astonishing if two raters could not agree any more than you would
expect by chance. So in designing the study we have to stipulate what would be
an unacceptably low level of agreement. This will act as a baseline against
which we can assess the actual level of agreement. Because this is the level of
agreement that we wish to outrule, the value is often called the null value, or
null hypothesis value.
In practice, a kappa of 0.21–0.40 is regarded as a fair level of agreement,
0.41–0.60 as moderate, 0.61–0.80 as substantial and anything above 0·8 as excellent.
That said, these cutpoints have a sort of folkloric status, and the interpretation
of kappa is probably best done in the context of the decision that it supports.
In the tables that follow I will tabulate sample sizes for kappa in cases where
you want to demonstrate that kappa is better than 0·4 (so agreement is better
than ‘fair’), better than 0·5 or 0·6 (better than ‘moderate’) and better than 0·7
and 0·8 (better than ‘substantial’).
Step 3 : Effect size - what is a clinically acceptable level of agreement?
What is the level of agreement that you think should be present if the test is a
reliable test? This value is often called the alternative value or alternative
hypothesis value, in contrast with the null value.
For example, if the test would require substantial agreement between
assessors rather than simply being moderate, then you might set up your
sample size to detect a kappa of 0·75 against a null hypothesis that kappa is
0·6. This would require 199 ratings made by the two raters to achieve 90%
power. However, if you hypothesised that kappa was 0·75, as before, but
wanted to outrule a kappa of 0·5, the required sample size drops to a very
manageable 78.
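| Prevalence of feature | Hypothesised kappa | Kappa to be outruled (null hypothesis kappa) | 90% power | 95% power |
|---|---|---|---|---|
| 0·5 | 0.6 | 0.4 | 156 | 200 |
| 0·5 | 0.7 | 0.5 | 131 | 169 |
| 0·5 | 0.8 | 0.6 | 102 | 133 |
| 0·5 | 0.7 | 0.45 | 87 | 112 |
| 0·5 | 0.8 | 0.55 | 68 | 90 |
| 0·5 | 0.8 | 0.5 | 49 | 65 |
| 0·4 or 0·6 | 0.6 | 0.4 | 162 | 208 |
| 0·4 or 0·6 | 0.7 | 0.5 | 137 | 177 |
| 0·4 or 0·6 | 0.8 | 0.6 | 106 | 139 |
| 0·4 or 0·6 | 0.7 | 0.45 | 90 | 117 |
| 0·4 or 0·6 | 0.8 | 0.55 | 71 | 94 |
| 0·4 or 0·6 | 0.8 | 0.5 | 51 | 68 |
| 0·25 or 0·75 | 0.6 | 0.4 | 207 | 265 |
| 0·25 or 0·75 | 0.7 | 0.5 | 176 | 227 |
| 0·25 or 0·75 | 0.8 | 0.6 | 137 | 180 |
| 0·25 or 0·75 | 0.7 | 0.45 | 116 | 150 |
| 0·25 or 0·75 | 0.8 | 0.55 | 92 | 121 |
| 0·25 or 0·75 | 0.8 | 0.5 | 66 | 87 |
| 0·1 or 0·9 | 0.6 | 0.4 | 427 | 546 |
| 0·1 or 0·9 | 0.7 | 0.5 | 371 | 479 |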
Example
A researcher wishes to study the agreement between family doctors on
whether or not to prescribe an antibiotic for uncomplicated rhinitis. The
prevalence of antibiotic prescribing is about 25%. She would like to show that
the kappa value for agreement is better than 0·5. She hypothesises that the
true kappa might be between 0·7 and 0·8.
Looking at the table, if the true kappa is 0·7, she will need to compare the
doctors’ ratings for 176 patients to have a 90% power to outrule a kappa as low
as 0·5. On the other hand, if the true kappa is 0·75, she would have 90% power
to outrule a kappa as low as 0·45 with a sample of 116.
Limitations of these tables
There are so many potential combinations of prevalence, kappa-to-be-outruled
and hypothesised kappa that these tables can only give an approximate idea of
the numbers involved. And they don’t cover cases where the two raters have
different prevalences (which would indicate systematic disagreement!), or
where there are more than two raters etc. To get precise calculations for a
wide variety of scenarios, I recommend using the R package irr.
Reference
These sample sizes were calculated with the N.cohen.kappa command in the
irr package in R. The command uses a formula published in
Cantor, A. B. (1996) Sample-size calculation for Cohen's kappa. Psychological
Methods, 1, 150-153.
The sample sizes in the table were produced using variations on this command:
N.cohen.kappa(0.1, 0.1, 0.5, 0.8,power=.95)
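Sample sizes for kappa (continued):

| Prevalence of feature | Hypothesised kappa | Kappa to be outruled (null hypothesis kappa) | 90% power | 95% power |
|---|---|---|---|---|
| 0·1 or 0·9 | 0.8 | 0.6 | 292 | 382 |
| 0·1 or 0·9 | 0.7 | 0.45 | 242 | 313 |
| 0·1 or 0·9 | 0.8 | 0.55 | 194 | 255 |
| 0·1 or 0·9 | 0.8 | 0.5 | 139 | 183 |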
6. Sample size for pilot studies
Introduction
The sample size methods used so far presuppose that the investigator has some
kind of knowledge that can be used to make informed guesses about such
things as prevalences, effect sizes etc. However, by their very essence pilot
studies are carried out when the researcher is facing the unknown. Even so,
there are some general principles which can be applied to ensure that enough
data are captured by a pilot study to inform subsequent study design with the
smallest use of resources.
Sample size: the law of diminishing returns
Sample size for pilot studies starts with the observation that each participant
that you recruit into a study yields less information than the last one. This law
of diminishing returns can be used to define a point beyond which recruiting
additional participants will yield minimal improvement in estimating effects.
Calculations by Julious (2005) and Van Belle (2008) both show that in studies
that compare the means of two groups, if you carry on recruitment beyond a
sample size of 12 per group the effect of each additional participant on the
precision is minimal. If your pilot study is purely exploratory and your aim is to
get a preliminary estimate of the difference between two groups, then a sample
size of 12 per group can be justified on the basis of these references.
Sample size to justify carrying out a full study
Sometimes there are cases when the investigator will have a preliminary
estimate of the minimum difference between groups that would constitute a
clinically significant difference. The purpose of the pilot study is to justify
carrying out a full study. For example, before conducting a study of the effects
of a physiotherapy programme on balance in the elderly, the investigators
might be required to do a pilot to show that there were grounds for believing
that such a programme would produce a clinically significant improvement in
balance.
Cocks et al (2013) provide an algorithm for estimating the size of a pilot study
that will give the ‘go-ahead’ to a main study. Their rule of thumb, based on
calculated sample sizes for various scenarios, is to recruit 9% of the
projected final sample, or 20 participants, whichever is the greater, as a pilot. If
there is no difference between the groups, then it is unlikely that the true effect
size is as large as the one specified by the investigators. Note that this
conclusion is based on an 80% confidence interval, not the usual 95%. If you
are using this method, please read Cocks’ paper for further detail and worked
examples.
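The 9%-or-20 rule of thumb described above can be written as a one-line calculation. In this sketch the rounding up of the 9% figure to a whole participant is an assumption made here, and the function name is hypothetical.

```python
def pilot_sample_size(projected_main_study_n):
    """Pilot size: 9% of the projected final sample or 20, whichever is greater.

    Rounding the 9% figure up to a whole participant is an assumption
    of this sketch.
    """
    nine_percent = -(-9 * projected_main_study_n // 100)  # integer ceiling
    return max(20, nine_percent)


for main_n in (100, 223, 500):
    print(main_n, pilot_sample_size(main_n))
```

For projected main studies of up to about 220 participants the floor of 20 applies; above that, the 9% figure takes over.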
7. Sample size for animal experiments in which not
enough is known to calculate statistical power
In animal experiments, the investigator may have no prior literature to turn to.
The potential means and standard deviations of the outcomes are unknown,
and there is no reasonable way of guessing them. In a case like this, sample
size calculations cannot be applied.
The resource equation method
The resource equation method can be used for minimising the number of
animals committed to an exploratory study. It is based on the law of diminishing
returns: each additional animal committed to a study tells us less than the one
before. The aim is to reach the threshold where adding further animals will be
uninformative. It should only be used for pilot studies or proof-of-concept studies.
Applying the resource equation method
1. How many treatment groups will be involved? Call this T.
2. Will the experiment be run in blocks? If so, how many blocks will be used?
Call this B.
A block is a batch of animals that are tested at the same time. Each block may
have a different response because of the particular conditions at the time they
were tested. Incorporating this information into a statistical analysis will
increase statistical power by removing variability between experimental
conditions on different days.
3. Will the results be adjusted for any covariates? If so, how many? Call this C.
Covariates are variables that are measured on a continuous scale, such as the
weight of the animal or the initial size of the tumour. Results can be adjusted
for such variables, which increases statistical power.
4. Combine these three figures:
(T–1) + (B+C–1) = D
5. Add at least 10 and at most 20
The sample size should be at least (D+10) and at most (D+20).
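The five steps above can be sketched as a small Python function. It applies the formula exactly as printed; the function name is hypothetical, and the number of blocks in the usage line is an invented figure for illustration only.

```python
def resource_equation_range(treatments, blocks, covariates):
    """Minimum and maximum sample size from the steps above.

    D = (T - 1) + (B + C - 1), applied as printed in the text;
    the sample size should lie between D + 10 and D + 20.
    """
    d = (treatments - 1) + (blocks + covariates - 1)
    return d + 10, d + 20


# Hypothetical design: 4 treatments, 5 blocks (invented), no covariates.
print(resource_equation_range(4, 5, 0))
```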
Example of the resource equation method
An investigator wishes to examine the effect of a new delivery vehicle for an
anti-inflammatory drug. The experiment will involve four treatments: a control,
a group receiving a saline injection, a group receiving the vehicle alone and a
group receiving the vehicle plus drug. Because of laboratory limitations, only
four animals can be done on one day. The experimenter doesn't plan on
adjusting the results for factors like the weight of the animal.
Sample Size: when nothing is known in advance
8. Sample size for qualitative research
Issues
Qualitative researchers often regard sample size calculations as something that
is only needed for quantitative research. However, qualitative research
protocols typically contain statements like "participants will be recruited until
data saturation occurs". So there is already an appreciation that a certain
number of participants will be "enough participants".
Clearly, it is important when planning (and especially budgeting) a qualitative
research project to know how many participants will be needed. These
guidelines are partly derived from an excellent paper by Morse 1
General guidance
Over-estimate your sample size when writing a proposal and budgeting it. This
gives you some insurance against difficulties in recruitment, participants whose
data is not very useful and other unanticipated snags.
Specific factors affecting sample size
Scope of study and nature of the topic
If the scope of the study is broad, then more participants will be needed to
reach saturation. Indeed, broad topics are more likely to require data from
multiple data sources. Doing justice to a broad topic requires a large
commitment of time and resources, including large amounts of data. Broad
studies should not be undertaken unless they are well-supported and have a
good chance of achieving what they set out to do.
If the study addresses an obvious, clear topic, and the information will be easily
obtained from the participants, then fewer participants will be needed. Topics
that are harder to grasp and formulate are often more important, but require
greater skill and experience from the researcher, and will require more data.
If the study topic is one about which people will have trouble talking (because
it is complex, or embarrassing, or may depend on experiences which not
everyone has) you will need more participants.
Quality of data and sample size
The ability of participants to devote time and thought to the interview, and to
articulate their experiences and perceptions, and to reflect on them, will all
affect the richness of the data. In particular, in some studies, participants may
not be able to devote time to a long interview, or may not be physically or
psychologically capable of taking part in a long interview, resulting in smaller
Sample Size: qualitative research
The table shows numbers of participants and, for each number, shows how rare
a theme, experience or meaning would have to be so that it was unlikely to be
detected by the study.
Table 8.1 Sample size and likelihood of missing something important in
qualitative research
As you can see, if a study of 60 people fails to identify a theme, experience or
issue, that issue is probably rare – present in about one person in 20 or fewer.
However, a study of 15 participants can fail to identify something which is
present in one person in every four! And a study of 8 participants is quite likely
to fail to find out things that affect half of the study population.
Clearly, shadowing (second hand data) can reduce these error rates by getting
participants to talk about others, but this is no substitute for including the
others in the research. Part of this is trying to choose a sample in such a way as
to span the population, but this relies on knowing the factors that make for
diversity in the population – something that may only become clear after the
research is well under way.
However, both expert opinion in the area of qualitative research and the table
above suggest that samples of fewer than 20 participants have to be justified on
the grounds that they are unusually rich in data and representative.
Method
The table was calculated based on Poisson confidence intervals for zero
observed frequencies at the given sample sizes, using Stata Release 14.1.
| Size of study | If you don't find something, the maximum likely prevalence is | That's roughly |
|---|---|---|
| 60 | 6% | 1 person in 20 |
| 40 | 9% | 1 person in 10 |
| 30 | 13% | 1 person in 8 |
| 20 | 18% | 1 person in 6 |
| 15 | 25% | 1 person in 4 |
| 10 | 37% | 1 person in 3 |
| 8 | 46% | 1 person in 2 |
| 5 | 74% | 3 people in 4 |
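For study sizes not listed in the table, the upper limit can be approximated in Python. The sketch assumes the figures are one-sided 97.5% Poisson upper limits for a count of zero, which is not stated in the text; on that assumption it gives values close to the table (for example 12% rather than 13% for a study of 30). The function name is hypothetical.

```python
from math import log


def max_likely_prevalence(sample_size, confidence=0.975):
    """Upper Poisson confidence limit for the prevalence of something
    that was observed zero times in a study of the given size."""
    return -log(1 - confidence) / sample_size


for n in (60, 20, 10):
    print(n, round(100 * max_likely_prevalence(n)))
```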
9. Resources for animal experiments
Festing, Michael FW, and Douglas G. Altman. "Guidelines for the design and
statistical analysis of experiments using laboratory animals." ILAR Journal 43.4
(2002): 244-258. http://ilarjournal.oxfordjournals.org/content/43/4/244.full
This paper appears as part of a collection which you can peruse here:
http://ilarjournal.oxfordjournals.org/content/43/4.toc
Festing, Michael FW. "Design and statistical methods in studies using animal
models of development." ILAR Journal 47.1 (2006): 5-14.
http://ilarjournal.oxfordjournals.org/content/47/1/5.full?sid=6bb505df-77e8-48c3-8b9a-d67bd304deec
Sample Size: Resources on the internet
9. Computer and online resources
Free, highly recommended package: G*Power
http://gpower.hhu.de/
For applications that go beyond the ones described here, including multiple
regression, I can strongly recommend G*Power, which is free and multi-
platform. There is an excellent manual.
Standard statistical packages
Stata also has a powerful set of sample size routines, and there are many user-
written routines to calculate sample sizes for various types of study. Use the
command findit sample size to get a listing of user-written commands that you
can install.
The free professional package R includes sample size calculation (but requires
a bit of learning). I recommend using software called RStudio as an interface
to R. It makes R far easier to learn and use.
And no; SPSS will sell you a sample size package, but it isn't included with
SPSS itself. If you use SPSS, my advice is to use G*Power and save money.
Sample size calculators and Online resources
http://statpages.org/javasta2.html
They make an excellent sample-size calculator application called StatMate
which gets high scores for a simple, intelligent interface and very useful
explanations of the process. It has a tutorial that walks you through.
use, has some power calculations
There is a free Windows power calculation program at Vanderbilt Medical
Center http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize
GPower is a very comprehensive package for both Windows and Mac, available
from http://gpower.hhu.de/
... We believe that the sample is representative of the Spanish working population, primarily due to the stratification conducted [73]. Additionally, the sample's size, surpassing 1000, permits us to accept a margin of error below 3%, considering that the analyzed population is large [74] since it encompasses over 20 million individuals [75]. ...
... passing 1000, permits us to accept a margin of error below 3%, considering that the analyzed population is large [74] since it encompasses over 20 million individuals [75]. ...
Article
Full-text available
This study explores the drivers of employees' attitudes towards home teleworking with Tobit regression and fuzzy-set qualitative comparative analysis (fsQCA). Drawing from technology acceptance models, it derives hypotheses regarding variable relationships and telecommuting perceptions. Data were obtained from a survey with 3104 responses conducted by the Spanish Agency "Centro de Investigaciones Sociológicas" in Spring 2021. The results emphasize the pivotal role of the family-life impact in shaping telecommuting perceptions, alongside factors like location, ICT satisfaction , employer support, and job adaptability. The results from fsQCA reveal an asymmetric influence of input factors on the positive and negative evaluations. Positive perceptions are associated with family-life positivity, firm support, strong ICT, and non-provincial residence, while negative attitudes relate to family-life negativity, lack of employer support, and poor connectivity. The main innovation of this paper lies in the combined use of correlational and configurational methods, enriching insights into employee telecommuting perceptions beyond traditional regression analysis.
... Eleven incomplete forms were excluded due to missing information, resulting in a final sample size of 1067 people. This sample size was determined to be appropriate as the size of the population is unknown, as shown in Table 1 below [22][23]. To ensure random sampling, the sample selection was carried out through various online platforms such as community pages, student portals, and QR Code posters distributed in the streets of Baghdad. ...
Article
In Iraq, walking is primarily used for short trips despite its importance as a sustainable mode of transportation. To improve walking as a means of transportation, it should be considered in transportation planning. This study aims to investigate the factors influencing walking behaviour in Baghdad City from the perspective of road users. An electronic self-designed questionnaire was utilized to gather information, consisting of two parts. The first part includes demographic questions while the second part includes questions related to walking preferences, timing, destinations, and obstacles. The results indicate that 26.242% of respondents prefer walking while 37.113% may walk if the conditions are favourable and 36.645% prefer other modes of transportation. The key factors influencing walking effectiveness are car ownership, trip distance, weather conditions, and availability of safe walking infrastructure.
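The citing passage above treats a sample of 1067 as adequate when the population size is unknown. A minimal sketch of the usual reasoning follows, assuming a simple random sample, a 95% confidence level and the most conservative proportion of 0.5. The function name `margin_of_error` and these defaults are illustrative assumptions, not taken from the cited study.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval for a proportion.

    Assumes a simple random sample from a population large enough that
    no finite population correction is needed.
    """
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for n in (1000, 1067, 3104):
        print(f"n = {n}: margin of error = {100 * margin_of_error(n):.2f} points")
```

Under these assumptions a sample of 1067 gives a margin of about 3 percentage points, a sample of exactly 1000 gives slightly more than 3, and larger samples give less.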
... which is often regarded in the literature as the threshold of already questionable but still tolerable reliability (Blanz, 2015). The minimum sample size for calculating reliabilities is given in the literature as n = 30, since internal consistency does not hold for a scale as such, but for a scale in relation to a sample (Conroy, 2016; Samuels, 2017). The α value therefore depends substantially on the sample and its composition. ...
... Given a desired sample size of 250 due to time and cost constraints, the proportion of women from each community within the total population of women for the selected communities was computed as shown in Table 1. Besides, computing the sample size with the Yamane (1967) sample size formula showed that the selected sample size falls between the acceptable ±5% to ±7.5% margin of error or uncertainty as noted in Conroy [32]. ...
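The Yamane (1967) formula mentioned in the passage above is n = N / (1 + N e²), where N is the population size and e the margin of error. A small sketch of the formula and its inverse is given here. The population figure of 10,000 in the usage lines is a hypothetical value chosen for illustration, since the citing passage does not state one.

```python
import math


def yamane_sample_size(population: int, margin: float) -> int:
    """Yamane (1967): n = N / (1 + N * e**2), rounded up to a whole person."""
    return math.ceil(population / (1 + population * margin ** 2))


def yamane_margin(population: int, sample: int) -> float:
    """Margin of error implied by a given sample size.

    Rearranging n = N / (1 + N * e**2) gives e = sqrt(1/n - 1/N).
    """
    return math.sqrt(1 / sample - 1 / population)


if __name__ == "__main__":
    N = 10_000  # hypothetical population size, not from the cited study
    print(yamane_sample_size(N, 0.05))
    print(round(yamane_margin(N, 250), 4))
```

With this hypothetical population, a sample of 250 implies a margin of a little over 6%, which falls inside the ±5% to ±7.5% band the passage refers to.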
... According to Table 1, the whole survey encompassed a total of 3,014 responses, with 51.7% of the respondents identifying as female and 48.3% as male. We constrained our IJM analysis to the active working population (57.75% of the base sample), and consequently, the final sample had 1,739 answers that in any case, we can consider good sized (Conroy, 2016). ...
Article
Purpose: This paper aims to shed light on the perception of the consequences of implementing home teleworking (TW) for employers and employees amid the pandemic. By doing so, the research analyzes the factors that explain employers' and employees' perceptions of home TW and the symmetry of their impact on its acceptance and rejection. Design/methodology/approach: The analysis uses the survey "Trends in the digital society during SARS-COV-2 crisis in Spain" by the Spanish "Centro de Investigaciones Sociológicas." The explanatory variables were selected and classified using the well-known taxonomy of Baruch and Nicholson (i.e. individual factors, family/home, organizational and job-related). Findings: The global judgment of HTW is positive, but factors such as gender, age, children in care or being an employer nuance that perception. While some factors, such as the attitude of employees toward information communication technologies (ICTs), perceived productivity or the distance from home to work, have a significant link with both positive and negative perceptions of HTW, other factors can only explain either positive or negative perceptions. Likewise, the authors observed that being female and having children in care had a detrimental influence on opinions about HTW. Practical implications: A clearer regulation of TW is needed to prevent imbalances in rights and obligations between companies and employees. The authors also highlight the potentially favorable effects of telecommuting on mitigating depopulation in rural areas. Originality/value: The authors have also measured not only the significance of assessed factors on the overall judgment of HTW for firms and workers but also whether these factors impact acceptance and resistance attitudes toward TW symmetrically.
... With this level of significance and an assumed effect size of 0.25 for ANOVA the statistical power is above 80% [55]. Moreover, in accordance with other authors a sample size of n = 50 is sufficient to conduct t-tests for dependent samples [56] and ANOVAs [57]. ...
... The survey is approximately representative in terms of gender, age, education, and place of residence for the German population aged 18+ (Table A.1), and the final sample size is meaningful in terms of statistical power with an acceptable margin of error of ±5% [20]. For a more detailed description of the sample and data collection, see Kühl et al., 2022 [8]. ...
Article
Understanding people’s resilience to flood disaster is necessary, considering their different levels of vulnerability in terms of material and human losses. This study therefore assesses households’ resilience to flood disaster in Lagos State in order to determine their ability to absorb, adapt to and transform from flood disaster impact. Theoretically, a conceptual framework to unify the dimensions and components of disaster resilience was developed from the General Systems Theory, Parent-Child Relationship Design Pattern, and Resilience Capacity Theory. A descriptive cross-sectional survey based on quantitative research design was employed for this study. Using a multistage sampling technique, six local government areas in the state were selected based on the different flood types founded on historical evidence and were delineated into pluvial, fluvial and coastal zones. Data were thereafter collected through questionnaire administration on 512 sampled household heads from 4093 population on flood-risk streets. Descriptive statistics comprising frequency distribution, cross-tabulation and arithmetic mean were used in analysing the data. The results showed that within each of the zones, the mean indices for the dimensions range from 2.10 to 3.97, indicating their low, moderate and high levels under the different components. In all the zones, the mean indices for the components are within 2.5 and 3.4, indicating a moderate level. This suggests that households’ resilience to flood disaster differed based on the dimensions within each component for each flood zone but their resilience is similar across the zones based on the different components using the proposed unification framework.
Article
Tsetse flies (Glossina spp.) are major vectors of African trypanosomes, causing either Human or Animal African Trypanosomiasis (HAT or AAT). Several approaches have been developed to control the disease, among which is the anti-vector Sterile Insect Technique. Another approach to anti-vector strategies could consist of controlling the fly's vector competence through hitherto unidentified regulatory factors (genes, proteins, biological pathways, etc.). The present work aims to evaluate the protein abundance in the midgut of wild tsetse flies (Glossina palpalis palpalis) naturally infected by Trypanosoma congolense s.l. Infected and non-infected flies were sampled in two HAT/AAT foci in Southern Cameroon. After dissection, the proteomes from the guts of parasite-infected flies were compared to that of uninfected flies to identify quantitative and/or qualitative changes associated with infection. Among the proteins with increased abundance were fructose-1,6-biphosphatase, membrane trafficking proteins, death proteins (or apoptosis proteins) and SERPINs (inhibitor of serine proteases, enzymes considered as trypanosome virulence factors) that displayed the highest increased abundance. The present study, together with previous proteomic and transcriptomic studies on the secretome of trypanosomes from tsetse fly gut extracts, provides data to be explored in further investigations on, for example, mammal host immunisation or on fly vector competence modification via para-transgenic approaches.
Article
I explore the sample size in qualitative research that is required to reach theoretical saturation. I conceptualize a population as consisting of sub-populations that contain different types of information sources that hold a number of codes. Theoretical saturation is reached after all the codes in the population have been observed once in the sample. I delineate three different scenarios to sample information sources: “random chance,” which is based on probability sampling, “minimal information,” which yields at least one new code per sampling step, and “maximum information,” which yields the largest number of new codes per sampling step. Next, I use simulations to assess the minimum sample size for each scenario for systematically varying hypothetical populations. I show that theoretical saturation is more dependent on the mean probability of observing codes than on the number of codes in a population. Moreover, the minimal and maximal information scenarios are significantly more efficient than random chance, but yield fewer repetitions per code to validate the findings. I formulate guidelines for purposive sampling and recommend that researchers follow a minimum information scenario.
Article
Background: The choice of an adequate sample size for a Cox regression analysis is generally based on the rule of thumb derived from simulation studies (Peduzzi et al. (1995)) of a minimum of 10 events per variable (EPV). One simulation study suggested scenarios in which the 10 EPV rule can be relaxed (Vittinghoff and McCulloch (2007)). The effect of a range of binary predictors with varying prevalence, reflecting clinical practice, has not yet been fully investigated. Methods: We conducted an extended resampling study using a large general practice data set, comprising over 2 million anonymized patient records, to examine the EPV requirements for prediction models with low-prevalence binary predictors developed using Cox regression. The performance of the models was then evaluated using an independent external validation data set. We investigated both fully specified models and models derived using variable selection. Results: Our results indicated that an EPV rule of thumb should be data-driven and that EPV > 10 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model. Conclusion: Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy.
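The events-per-variable rule described in this abstract reduces to simple arithmetic. The sketch below assumes the conventional threshold of 10 events per variable as a default. The function names and the 25% event rate in the usage lines are illustrative assumptions, and the abstract itself indicates that a higher EPV may be needed when low-prevalence predictors are present.

```python
import math


def min_events(n_predictors: int, epv: float = 10) -> int:
    """Minimum number of outcome events for a given events-per-variable target."""
    return math.ceil(n_predictors * epv)


def min_sample_size(n_predictors: int, event_rate: float, epv: float = 10) -> int:
    """Total sample needed so that the expected number of events meets the EPV target.

    event_rate is the anticipated proportion of participants who experience
    the event during follow-up (an assumption the planner must supply).
    """
    return math.ceil(min_events(n_predictors, epv) / event_rate)


if __name__ == "__main__":
    # Hypothetical planning scenario: 8 candidate predictors, 25% event rate.
    print(min_events(8))
    print(min_sample_size(8, 0.25))
    print(min_sample_size(8, 0.25, epv=20))  # stricter target
```

The count that matters is events, not participants, so a rare outcome inflates the required total sample considerably.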
Article
This study examines 83 IS qualitative studies in leading IS journals for the following purposes: (a) identifying the extent to which IS qualitative studies employ best practices of justifying sample size; (b) identifying optimal ranges of interviews for various types of qualitative research; and (c) identifying the extent to which cultural factors (such as journal of publication, number of authors, world region) impact sample size of interviews. Little or no rigor for justifying sample size was shown for virtually all of the IS studies in this dataset. Furthermore, the number of interviews conducted for qualitative studies is correlated with cultural factors, implying the subjective nature of sample size in qualitative IS studies. Recommendations are provided for minimally acceptable practices of justifying sample size of interviews in qualitative IS studies.
Technical Report
Whilst it is common statistical advice not to attempt a reliability analysis with a sample size less than 300, a recent simulation study indicates that this is possible in certain circumstances. The most common statistic used in reliability analysis is Cronbach’s alpha, and an often quoted rule of thumb is that a coefficient value above 0.7 is acceptable for psychological constructs. However, some researchers assert that the size of a Cronbach’s alpha coefficient depends upon the number of items in the scale, with scales with more items having higher coefficients. The advantage of carrying out a reliability analysis is that it can enable a researcher to treat a group of variables on the same subject as a single scale variable, reducing the complexity of further analysis and reducing the risk of Type I errors. However, student researchers often find it hard to obtain sample sizes of 300. The purpose of this worksheet is to advise students about how to go about trying to validate a scale with smaller sample sizes.
Article
In this article, I discuss measures of effect size for two-group comparisons where data are not appropriately analyzed by least-squares methods. The Mann–Whitney test calculates a statistic that is a very useful measure of effect size, particularly suited to situations in which differences are measured on scales that either are ordinal or use arbitrary scale units. Both the difference in medians and the median difference between groups are also useful measures of effect size. Copyright 2012 by StataCorp LP.
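One common way to read the Mann–Whitney statistic as an effect size is to divide U by the number of between-group pairs. This gives the proportion of pairs in which an observation from the first group exceeds one from the second, with ties counted as half. The sketch below assumes this is the measure meant. The function name and the sample data are illustrative only.

```python
from typing import Sequence


def probability_of_superiority(group_a: Sequence[float],
                               group_b: Sequence[float]) -> float:
    """Proportion of (a, b) pairs with a > b, counting ties as one half.

    Equivalent to the Mann-Whitney U statistic for group_a divided by
    len(group_a) * len(group_b). A value of 0.5 means no tendency for
    either group to score higher.
    """
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))


if __name__ == "__main__":
    treated = [3, 4, 5]   # made-up ordinal scores
    control = [1, 2, 3]
    print(probability_of_superiority(treated, control))
```

Because it depends only on the ordering of observations, this measure does not change under any order-preserving rescaling of the outcome, which suits ordinal scales or scales with arbitrary units.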
Article
When designing a clinical trial, an appropriate justification for the sample size should be provided in the protocol. However, there are a number of settings when undertaking a pilot trial where there is no prior information to base a sample size on. For such pilot studies the recommendation is a sample size of 12 per group. The justifications for this sample size are based on rationale about feasibility; precision about the mean and variance; and regulatory considerations. The context of the justifications is that future studies will use the information from the pilot in their design.
Article
Purpose: Qualitative researchers have been criticised for not justifying sample size decisions in their research. This short paper addresses the issue of which sample sizes are appropriate and valid within different approaches to qualitative research. Design/methodology/approach: The sparse literature on sample sizes in qualitative research is reviewed and discussed. This examination is informed by the personal experience of the author in terms of assessing, as an editor, reviewer comments as they relate to sample size in qualitative research. Also, the discussion is informed by the author’s own experience of undertaking commercial and academic qualitative research over the last 31 years. Findings: In qualitative research, the determination of sample size is contextual and partially dependent upon the scientific paradigm under which investigation is taking place. For example, qualitative research which is oriented towards positivism, will require larger samples than in-depth qualitative research does, so that a representative picture of the whole population under review can be gained. Nonetheless, the paper also concludes that sample sizes involving one single case can be highly informative and meaningful as demonstrated in examples from management and medical research. Unique examples of research using a single sample or case but involving new areas or findings that are potentially highly relevant, can be worthy of publication. Theoretical saturation can also be useful as a guide in designing qualitative research, with practical research illustrating that samples of 12 may be cases where data saturation occurs among a relatively homogeneous population. Practical implications: Sample sizes as low as one can be justified. Researchers and reviewers may find the discussion in this paper to be a useful guide to determining and critiquing sample size in qualitative research.
Originality/value: Sample size in qualitative research is always mentioned by reviewers of qualitative papers but discussion tends to be simplistic and relatively uninformed. The current paper draws attention to how sample sizes, at both ends of the size continuum, can be justified by researchers. This will also aid reviewers in their making of comments about the appropriateness of sample sizes in qualitative research.