Content uploaded by Ronán Michael Conroy

Author content

All content in this area was uploaded by Ronán Michael Conroy on Apr 17, 2018

Content may be subject to copyright.

The RCSI Sample size handbook

A rough guide

May 2018 version

Ronán M Conroy

rconroy@rcsi.ie


Stata command

. power twomeans 0 (.4(.1)1), power(0.9) graph(ydimension(delta) xdimension(N))

[Figure: Effect size for a two-sample means test. Effect size (δ), from 0.4 to 1, plotted against total sample size (N); the labelled sample sizes are 46, 54, 68, 88, 120, 172 and 266. Parameters: α = .05, 1-β = .9, µ1 = 0, σ = 1. t test assuming σ1 = σ2 = σ. H0: µ2 = µ1 versus Ha: µ2 ≠ µ1.]

Sample Size: introduction


How to use this guide
Introduction: sample size and why it’s important
1. Sample size for percentages or proportions
1.2 Sample sizes for studies comparing a prevalence with a hypothesised value
1.3 Sample sizes for studies comparing proportions between two groups
1.4a Sample sizes for population case-control studies
1.4b Sample sizes for matched case-control studies
1.5 Sample size for logistic regression with a continuous predictor variable
1.6 Sample sizes for logistic or Cox regression with multiple predictors
2. Sample sizes and powers for comparing two means where the variable is measured on a continuous scale that is (more or less) normally distributed
2.1 Comparing the means of two groups
2.2 Sample sizes for comparing means in the same people under two conditions
2.3 Calculating sample sizes for comparing two means: a rule of thumb
3. Sample size for correlations or regressions between two variables measured on a numeric scale
4. Sample size for reliability studies
5. Sample size calculation for agreement between two raters using a present/absent rating scale using Cohen’s Kappa
6. Sample size for pilot studies
7. Sample size for animal experiments in which not enough is known to calculate statistical power
8. Sample size for qualitative research
9. Resources for animal experiments
9. Computer and online resources


How to use this guide

This guide has sample size ready-reckoners for many common research designs. Each section is self-contained. You need only read the section that applies to you.

If you are new to sample size calculation, read the ﬁrst section ﬁrst.

Examples

There are examples in each section, aimed at helping you to describe your

sample size calculation in a research proposal or ethics committee submission.

They are largely non-specialist. If you have useful examples, I welcome

contributions.

Feedback

If you have trouble following this guide, please email me. Your comments help

to improve it. If you spot an error, please let me know.

Support

This guide has slowly percolated around the internet. I’m pleased to handle

queries from staﬀ and students of RCSI and aﬀiliated institutions. However, I

cannot deal with queries from elsewhere. I’m sorry.

Warranty?

This document is provided as a guide. While every attempt has been made to

ensure its accuracy, neither the author nor the Royal College of Surgeons in

Ireland takes any responsibility for errors contained in it.

What’s new

This version May 2018. Updated text with more useful code in Stata and, to a lesser extent, R; updated web links.


Introduction: sample size and why it’s important

Sample size is an important issue in research. Ethics committees and funding agencies are aware that if a research project is too small, it risks failing to find what it set out to detect. Not only does this waste the input of the study participants (and frequently, in the case of animal research, their lives) but by producing a false negative result a study may do a disservice to research by discouraging further exploration of the area.

And, of course, if a study is too large it will waste resources that could have

been spent on something else.

So the ideal sample size is one that collects suﬀicient data to have a good

chance of measuring what you set out to measure.

Key issues: representativeness and precision

When choosing a sample, there are two important issues:

• will the sample be representative of the population, and

• will the sample be precise enough.

The ﬁrst criterion of a good sample is sample representativeness. An

unrepresentative sample will result in biased conclusions, and the bias cannot

be eliminated by taking a larger sample. For this reason, sampling methodology

is the ﬁrst thing to get right.

The second criterion is sample precision. The larger the sample, the smaller the margin of uncertainty (confidence interval) around the results. However, there is another factor that also affects precision: the variability of the thing being measured. The more something varies from person to person, the bigger your sample needs to be to achieve the same degree of certainty about your results.

This guide deals with the issue of sample size. Remember, however, that sample

size is of secondary importance to sample representativeness.

Key terms used in this sample size calculation

Precision – what it is, what determines it

Precision is the amount of potential error in a ﬁnding. Low-precision studies

have wide margins of error around their ﬁndings, while high-precision studies

have narrow margins of error. The degree of precision is partly determined

by the sample size. In some sample size calculations, you will need to begin

by deciding how much precision you require or, equivalently, the degree of

uncertainty you are prepared to tolerate in your ﬁndings.

Precision is also determined by the variability in the thing you are

studying. If something has little variation, such as body temperature, then you


will require a smaller sample than for something that varies quite a lot, like

blood pressure.

You can easily imagine variability when it comes to things measured on a

numeric scale. But what about things that are measured on a simple dichotomous scale – present/absent or true/false, for example? Years ago, a

colleague came up with an excellent example. Imagine a crowd of spectators

where the supporters of one team wore white and supporters of the other team

wore black. If one team had 100% support, the crowd would be all one colour –

no variability. The maximum variability would occur where there was 50%

support for each team. This is exactly what happens with dichotomous

variables. The closer the prevalence is to 50%, the higher the variability. At 0%

and 100% there's no variability at all.

Prevalence

Prevalence is how frequently a characteristic is found in the population you

are studying. Although we speak of prevalences every time we say something

like “ten percent of people” or “a third of new admissions”, we rarely use the

word prevalence for these fractions or percentages. This guide will use

‘prevalence’ as a general term for proportions, fractions and percentages.

Variability

The more variable the thing we are studying, the more data we will have to gather in order to achieve a given level of precision. This makes sense intuitively when we are measuring something on a numeric scale. But it also applies to other types of measurement, even to percentages.

Looking at the tables that show sample sizes for diﬀerent prevalences, you will

see that the required sample size rises as the prevalence approaches 50%. This

is because when 50% of people have a characteristic and 50% do not, that

characteristic has the highest person-to-person variability. As the prevalence

nears zero or 100%, variability decreases, and so the required sample size will

also decrease.
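The link between prevalence and variability can be checked with a few lines of arithmetic. The sketch below assumes the usual measure of variability for a present/absent characteristic, p × (1 − p), where p is the prevalence; the function name is purely illustrative and does not come from this guide.

```python
def dichotomous_variability(prevalence):
    """Variance of a present/absent characteristic with the given prevalence (0 to 1)."""
    return prevalence * (1 - prevalence)


# Variability is zero at 0% and 100% and peaks at 50%.
for prevalence in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    variability = dichotomous_variability(prevalence)
    print(f"prevalence {prevalence:4.0%}   variability {variability:.4f}")
```

Because the required sample size is proportional to this quantity, the sample sizes in the tables are largest near a prevalence of 50%.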

For continuous variables, the standard deviation is used as a measure of

variability. This is sometimes known, or guessable, from previously published

work, and this guide will tell you how to do this. But even if it is unknown, the

guide will show you how to make an informed guess.

Eﬀect size

Many sample size calculations require you to stipulate an eﬀect size. This is

the smallest eﬀect that is clinically signiﬁcant (as opposed to statistically

signiﬁcant). Clinical signiﬁcance is a health research term that is used to mean

“practical signiﬁcance” or “real life signiﬁcance”. The task of deciding on the

smallest eﬀect that would be clinically signiﬁcant requires knowledge of the

purpose of the research and the current state of knowledge and practice.


For example, if you are planning to compare two treatments, you have to

decide how big a diﬀerence between two groups should be before it would be

regarded as clinically important. You might deﬁne it as the smallest eﬀect that

would be noticed by the person being treated, or the smallest eﬀect that would

alter the management of the patient, or the smallest eﬀect required to change

the person’s diagnosis.

The whole question of what constitutes a clinically signiﬁcant ﬁnding is outside

the scope of statistics. However, you will see from the tables that I have tried to

help out by translating the rather abstract language of eﬀect size into terms of

patient beneﬁt or diﬀerences between people.

What eﬀect size isn’t

It is important to realise what eﬀect size is not. Eﬀect size is not the eﬀect that

you think is there. We tend to have high hopes for our theories, and therefore

hope that the treatment or risk factor we are interested in will have a very

important eﬀect. However, in sample size calculation, eﬀect size is always the

smallest eﬀect that would be clinically signiﬁcant. Not the one that you hope is

there.

Importantly, too, eﬀect size is not what was published by someone else. Again,

this is an estimate of the actual eﬀect size, but research must have adequate

power to detect the smallest clinically signiﬁcant eﬀect size. Often the early

publications in a ﬁeld are biased towards larger eﬀect sizes. This is not just

because of publication bias, but also because methodologies will improve and

things will always work less well when they leave the lab for the real world.

Power

Power is the chance that what you are looking for will be detected in your

sample, if it actually exists. No sample, however big, is a guarantee that you

will detect what you are looking for. However, it is foolish to do research

without a reasonable chance that your study will detect it if it exists. And that

“if it exists” is very important. The power of a study is its chance of

detecting an eﬀect of a given size, if an eﬀect of at least that size exists.

Decades ago, studies were often run with 80% power. That is to say, there was

an 80% chance that they would detect the eﬀect if it existed but was the

smallest clinically signiﬁcant eﬀect. And, therefore, there was a 20% chance –

one in ﬁve – that they would fail to detect it and come to a false-negative

conclusion.

A 20% probability of a false-negative conclusion is now regarded as

unacceptable by ethics committees. Why waste the time (and lives) of research

participants on projects that have a built-in 20% chance of failure? The sample

sizes in this guide assume that you want 90% or even 95% power to detect

what you are looking for.


Things that are not sample size calculations

Before going on to cover specific research scenarios, I should mention some things that appear in ethics submissions and grant applications but are not acceptable as sample size calculations.

The following are the most usual oﬀenders:

Everyone else used six animals per group

The legal advisor to RCSI’s research ethics committee has advised us that there is no legal defence that runs “Well, everyone else did it too”. So the fact that

someone else used this sample size does not justify it, whether or not the

research was published. There are grounds for using 6 animals per group, but

they are laid out below under comparing the means of two groups.

We did another study that used 10 patients and it was signiﬁcant/got

published

It’s important, too, to emphasise here the point made above, that sample size

should be set to detect the minimum eﬀect that would be clinically signiﬁcant,

not the eﬀect that someone else found or that the researcher thinks is there.

Small studies are only likely to be published if they ﬁnd something interesting.

So they are likely to be misleading about the potential eﬀect size.

We have limited funding/the student is only available for x weeks

Limited funding and limited time are not excuses for doing bad research. If you

spend your resources on a research project that has no reasonable chance of

being able to answer the research question because it is simply too small, then

you have wasted your limited resources.

This is just a student project

Finally, student projects often lack the time and resources to recruit a sample

big enough to have decent statistical power. Ethics committees understand that

student research is where students learn research methods. So long as the

application is accompanied by a calculation that shows the applicant is aware

of the power of the proposed sample size, and the potential eﬀect that this will

have on the analysis and interpretation of the data, small sample size is not in

itself an obstacle to receiving ethical approval – though it will probably be an

obstacle to publication.


1. Sample size for percentages or proportions

This section gives guidelines for sample sizes for

• studies that measure the proportion or percentage of people who have some characteristic,

• and for studies that compare this proportion with either a known population or with another group.

The characteristic being measured can be a disease, an opinion, a behaviour:

anything that can be measured as present or absent.

Prevalence

Prevalence is the technical term for the proportion of people who have some

feature. You should note that for a prevalence to be measured accurately, the

study sample should be a valid sample. That is, it should not contain any

signiﬁcant source of bias.

1.1 Sample size for simple prevalence studies

The sample size needed for a prevalence study depends on how precisely you

want to measure the prevalence. (Precision is the amount of error in a

measurement.) The bigger your sample, the less error you are likely to make in

measuring the prevalence, and therefore the better the chance that the

prevalence you ﬁnd in your sample will be close to the real prevalence in the

population. You can calculate the margin of uncertainty around the ﬁndings of

your study using conﬁdence intervals. A conﬁdence interval gives you a

maximum and minimum plausible estimate for the true value you were trying to

measure.

Step 1: decide on an acceptable margin of error

The larger your sample, the less uncertainty you will have about the true

prevalence. However, you do not necessarily need a tiny margin of uncertainty.

For an exploratory study, for example, a margin of error of ±10% might be

perfectly acceptable. A 10% margin of uncertainty can be achieved with a sample of only about 100 (96 in table 1.1). However, to get to a 5% margin of error will require a sample of 384 (four times as large).

Step 2: Is your population ﬁnite?

Are you sampling a population which has a deﬁned number of members? Such

populations might include all the physiotherapists in private practice in Ireland,

or all the pharmacies in Ireland. If you have a ﬁnite population, the sample size

you need can be signiﬁcantly smaller.
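The arithmetic behind steps 1 and 2 can be sketched in a few lines. This is a minimal illustration, assuming the standard normal-approximation formula with a worst-case prevalence of 50%, a 95% confidence level, and the usual finite population correction; the function and parameter names are hypothetical and are not taken from the web calculator used for table 1.1.

```python
from statistics import NormalDist


def prevalence_sample_size(margin, population=None, prevalence=0.5, confidence=0.95):
    """Approximate sample size for estimating a prevalence to within +/- margin.

    margin and prevalence are proportions (0.05 means 5%).
    population is the size of a finite population, or None for a large one.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # about 1.96 for 95%
    n = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    if population is not None:
        # Finite population correction: fewer people are needed
        # when the sample is a large fraction of the population.
        n = n / (1 + (n - 1) / population)
    return round(n)


print(prevalence_sample_size(0.10))                    # large population, +/-10%
print(prevalence_sample_size(0.05))                    # large population, +/-5%
print(prevalence_sample_size(0.075, population=500))   # 500 people, +/-7.5%
```

Halving the margin of error quadruples the sample size because the margin is squared in the denominator. Individual results may differ by one from a published table, depending on how intermediate values are rounded.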

Step 3: Simply read off your required sample size from table 1.1.

Sample Size: studies measuring a percentage or proportion


Table 1.1 Sample sizes for prevalence studies

Acceptable          Size of population
margin of error     Large    5000    2500    1000     500     200
±20%                   24      24      24      23      23      22
±15%                   43      42      42      41      39      35
±10%                   96      94      93      88      81      65
±7.5%                 171     165     160     146     127      92
±5%                   384     357     333     278     217     132
±3%                  1067     880     748     516     341     169

Example 1: Sample size for a study of the prevalence of burnout in students at a large university

A researcher is interested in carrying out a prevalence study using simple random sampling from a population of over 11,000 university students. She would like to estimate the prevalence to within 5% of its true value.

Since the population is large (more than 5,000) she should use the first column in the table. A sample size of 384 students will allow the study to determine the prevalence of burnout with a confidence interval of ±5%. Note that if she wants to increase precision so that her margin of error is just ±3%, she will have to sample over 1,000 participants. Sample sizes increase rapidly when very high precision is needed.

Example 2: Sample size for a study of a finite population

A researcher wants to study the prevalence of bullying in registrars and senior registrars working in Ireland. There are roughly 500 doctors in her population. She is willing to accept a margin of uncertainty of ±7.5%.

Here, the population is finite, with roughly 500 registrars and senior registrars, so the sample size will be smaller than she would need for a study of a large population. A representative sample of 127 will give the study a margin of error (confidence interval) of ±7.5% in determining the prevalence of bullying in the workplace, and 341 will narrow that margin of error to ±3%.


Frequently asked questions

Suppose my study involves analysing subgroups, how do I calculate

sample size?

In some cases, you may be interested in percentages or prevalences within

subgroups of your sample. In this case, you should check that the sample size

will have enough power to give you an acceptable margin of error within the

smallest subgroup of interest.

For example, you may be interested in the percentage of mobile phone users

who are worried about the eﬀects of radiation. A sample of 384 will allow you

to measure this percentage with a margin of error of no more than ±5% of its

true value.

However, you are also interested in subgroups, such as men and women, older

and younger people, people with diﬀerent levels of education etc. You reckon

that the smallest subgroup will be older men, who will probably make up only

10% of the sample. This would give you about 38 older men, slightly fewer than the 43 you need for a margin of error of ±15%. If this is not acceptable, you might

increase the overall sample size, use stratiﬁed sampling (where a ﬁxed number

of each subgroup is recruited) or decide not to analyse rarer subgroups.
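The subgroup check above is simple arithmetic and can be sketched as follows. The function names are hypothetical; the figure of 43 participants for a ±15% margin of error is read from table 1.1, and the 10% share for the smallest subgroup is the guess made in the example.

```python
import math


def expected_subgroup_size(total_sample, subgroup_percent):
    """Number of participants expected in a subgroup making up subgroup_percent of the sample."""
    return total_sample * subgroup_percent // 100


def total_sample_needed(subgroup_target, subgroup_percent):
    """Overall sample needed so that the subgroup is expected to reach subgroup_target."""
    return math.ceil(subgroup_target * 100 / subgroup_percent)


print(expected_subgroup_size(384, 10))   # size of a 10% subgroup in a sample of 384
print(total_sample_needed(43, 10))       # overall sample for 43 people in that subgroup
```

These are expected numbers only. With simple random sampling the subgroup actually recruited may be smaller, which is one reason to consider stratified sampling.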

If you want to compare subgroups, however, go to section 1.3

What if I can only survey a ﬁxed number of people?

You can use the table to ﬁnd the approximate margin of error of your study. You

will then have to ask yourself if this margin of error is acceptable. You may

decide not to go ahead with the study because it will not give precise enough

results to be useful.
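If the table does not list your sample size, the margin of error can be estimated directly. The sketch below assumes the same normal-approximation formula, worst-case prevalence of 50% and finite population correction that underlie the usual prevalence calculators; the names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, population=None, prevalence=0.5, confidence=0.95):
    """Approximate margin of error (as a proportion) for a prevalence estimated from n people."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(prevalence * (1 - prevalence) / n)
    if population is not None:
        # Finite population correction for sampling a sizeable share of the population.
        standard_error *= sqrt((population - n) / (population - 1))
    return z * standard_error


print(f"{margin_of_error(100):.1%}")                  # 100 people, large population
print(f"{margin_of_error(127, population=500):.1%}")  # 127 people out of 500
```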

How can I calculate sample size for a diﬀerent margin of error?

All these calculations were done on a simple web page at

http://www.surveysystem.com/sscalc.htm


1.2 Sample sizes for studies comparing a prevalence with a hypothesised value

This section gives guidelines for sample sizes for studies that measure the

proportion or percentage of people who have some characteristic with the

intention of comparing it with a percentage that is already known from

research or hypothesised.

This characteristic can be a disease, an opinion, a behaviour, anything that

can be measured as present or absent. You may want to demonstrate that the

population you are studying has a higher (or lower) prevalence than some other

population that you already know about. For example, you might want to see if

medical students have a lower prevalence of smoking than other third level

students, whose prevalence is already known from previous work.

Eﬀect size

To begin with, you need to ask what is the smallest diﬀerence between

the prevalence in the population you are studying and the prevalence in the

reference population that would be considered meaningful in real life terms?

This diﬀerence is often called a clinically signiﬁcant diﬀerence in medicine,

to draw attention to the fact that it is the smallest diﬀerence that would be

important enough to have practical implications.

The bigger your study, the greater the chance that you will detect such a

diﬀerence. And, of course, the smaller the diﬀerence that you consider to be

clinically signiﬁcant, the bigger the study you need to detect it.

Step 1: Eﬀect size: Decide on the smallest diﬀerence the study should be

capable of detecting

You will have to decide what is the smallest diﬀerence between the group that

you are studying and the general population that would constitute a 'clinically

signiﬁcant diﬀerence' – that is, a diﬀerence that would have real-life

implications. If you found a diﬀerence of 5%, would that have real-life

implications? If not, would 10%? There is a certain amount of guesswork

involved, and you might do well to see what the norm was in the literature.

For instance, if you were studying burnout in medical students and discovered

that the rate was 5% higher than the rate for the general student population,

would that have important real-life implications? How about if it was 10%

lower? 10% higher? At what point would we decide that burnout in medical

students was a problem that needed to be tackled?


Step 2: Prevalence: How common is the feature that you are studying in

the population?

Sample sizes are bigger when the feature has a prevalence of 50% in the

population. As the prevalence in the population group goes towards 0% or

100%, the sample size requirement falls. If you do not know how common the

feature is, you should use the sample size for a 50% prevalence as being the

worst-case estimate. The required sample size will be no larger than this, no

matter what the prevalence turns out to be.

Step 3: what power do you want to detect a diﬀerence between the study

group and the population?

A study with 90% power is 90% likely to discover the diﬀerence between

the groups if such a diﬀerence exists. And 95% power increases this likelihood

to 95%. So if a study with 95% power fails to detect a diﬀerence, the diﬀerence

is unlikely to exist. You should aim for 95% power, and certainly accept nothing

less than 90% power. Why run a study that has more than a 10% chance of

failing to detect the very thing it is looking for?


Step 4: Use table 1.2 to get an idea of sample size

Table 1.2 Comparing a sample with a known population

The table gives sample sizes for 90% and 95% power in three situations: when the population prevalence is 50%, 25% and 10%.

If in doubt about the prevalence, err on the high side.

Difference        Population          Population          Population
between           prevalence 50%      prevalence 25%      prevalence 10%
prevalences       Power               Power               Power
                  90%       95%       90%       95%       90%       95%*
+5%               1041      1287      883       1092      536       663
+10%              253       312       240       296       169       208
+15%              107       132       113       139       88        109
+20%              56        69        66        81        56        69
+25%              32        39        43        52        39        48
+30%              19        24        29        36        29        35
-5%               1041      1287      673       832       13        16
-10%              253       312       134       166
-15%              107       132       43        52
-20%              56        69        13        16
-25%              32        39
-30%              19        24

*Sample Stata code for this column

. power oneproportion .1 (.15(.05).4), test(wald) power(.95)

The bit that says (.15(.05).4) is a neat way of passing Stata a list of values. This one says “start at .15, increment by .05 and finish at .4”.


Example: Study investigating whether depression is more common in elderly people in

nursing homes than in the general elderly population, using a limited number of available

patients.

Depression has a prevalence of roughly 10% in the general elderly population.

There are approximately 70 persons in two nursing homes who will all be invited

to participate in the research. A sample size of 69 would give the study 95%

power to detect a 20% higher prevalence of depression in these participants

compared with the general population.
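The figure used in this example can be checked by hand. The sketch below assumes the normal-approximation (Wald) formula in which the variability is evaluated at the prevalence expected in the study group, with a two-sided 5% significance level. This appears to agree with the test(wald) figures in table 1.2, but it is an independent illustration with hypothetical names, not Stata's own code.

```python
from math import ceil
from statistics import NormalDist


def one_proportion_sample_size(p0, pa, power=0.90, alpha=0.05):
    """Sample size to detect a prevalence pa when the reference prevalence is p0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * pa * (1 - pa) / (pa - p0) ** 2
    return ceil(n)                                   # always round up to whole people


# Depression example: 10% in the general population, 30% in the nursing homes.
print(one_proportion_sample_size(0.10, 0.30, power=0.95))
```

The same function can be used for differences that are not in the table, for example a reference prevalence of 40%.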

Example: Study recruiting patients with low HDL cholesterol levels to see if there is a

higher frequency of an allele suspected of being involved in low HDL. The population

frequency of the allele is known to be 25%

The researchers decide that to be clinically signiﬁcant, the prevalence of the

allele would have to be twice as high in patients with low HDL cholesterol. A

sample of 43 patients will give them a 90% chance of detecting a difference this big or bigger (a prevalence of 50% rather than 25%, a difference of +25% in table 1.2), while 52 patients will give them a 95% chance of detecting it.

Reference

These calculations were carried out with the power command in Stata Version 13.


1.3 Sample sizes for studies comparing proportions between two groups

This section gives guidelines for sample sizes for studies that measure the

proportion or percentage of people who have some characteristic with the

intention of comparing two groups sampled separately, or two subgroups within

the same sample.

This is a common study design in which two groups are compared. In some

cases, the two groups will be obtained by taking samples from two populations.

However, in many cases the two groups may actually be subgroups of the same

sample. If you plan on comparing two groups within the same sample, the

sample size will have to be increased. Instructions for doing this are at the end

of the section.

Step 1: Eﬀect size: Decide on the diﬀerence the study should be capable

of detecting

You will have to decide what is the smallest diﬀerence between the two groups

that you are studying that would constitute a 'clinically signiﬁcant diﬀerence' –

that is, a diﬀerence that would have real-life implications. If you found a

diﬀerence of 5%, would that have real-life implications? If not, would 10%?

There is a certain amount of guesswork involved, and you might do well to see

what the norm is in the literature.

Step 2: Prevalence: How common is the feature that you are studying in

the comparison group?

Sample sizes are bigger when the feature has a prevalence of 50% in one of the

groups. As the prevalence in one group goes towards 0% or 100%, the sample

size requirement falls. If you do not know how common the feature is, you

should use the sample size for a 50% prevalence as being the worst-case

estimate. The required sample size will be no larger than this no matter what

the prevalence turns out to be.

Step 3: Power: what power do you want to detect a diﬀerence between

the two groups?

A study with 90% power is 90% likely to discover the diﬀerence between

the groups if such a diﬀerence exists. And 95% power increases this likelihood

to 95%. So if a study with 95% power fails to detect a diﬀerence, the diﬀerence

is unlikely to exist. You should aim for 95% power, and certainly accept nothing

less than 90% power. Why run a study that has more than a 10% chance of

failing to detect the very thing it is looking for?

Sample Size: comparing proportions between groups


Step 4: Use table 1.3 to get an idea of sample size

The table gives sample sizes for 90% and 95% power in three situations: when the prevalence in the comparison group is 50%, 25% and 10%. If in doubt, err on the high side. The table shows the number in each group, so the total number is twice the figure in the table!

Table 1.3 Numbers needed in each group

Difference        Prevalence in       Prevalence in       Prevalence in
between the       one group 50%       one group 25%       one group 10%
groups            Power               Power               Power
                  90%*      95%       90%       95%       90%       95%
5%                2095      2590      1674      2070      918       1135
10%               519       641       440       543       266       329
15%               227       280       203       251       133       164
20%               124       153       118       145       82        101
25%               77        95        77        95        57        70
30%               52        63        54        67        42        52

*Sample Stata command that generated the figures in this column

. power twoproportions .5 (.45(-.05).2), power(.9)

The notation (.45(-.05).2) is a way of telling Stata to generate a list of values starting with 0·45, decreasing in steps of 0·05 and ending with 0·2. The first value, .5, is the prevalence in the comparison group.

Example: Study investigating the effect of a support programme on smoking quit rates

The investigator is planning a study of the effect of a telephone support line in improving smoking quit rates in patients post-stroke. She knows that about 25% of smokers will have quit at the end of the first year after discharge. She feels that the support line would make a clinically important contribution to management if it improved this to 35%. The programme would not be justifiable from the cost point of view if the improvement were smaller than this. So a 10% increase is the smallest effect that would be clinically significant. From the table she can see that two groups of 440 patients would be needed to have a 90% power of detecting a difference of at least 10%, and two groups of 543 patients would be needed for 95% power. She writes in her ethics submission:

Previous studies in the area suggest that as few as 25% of smokers are still not smoking a year after discharge. The proposed sample size of 500 patients in each group (intervention and control) will give the study a power to detect a 10% increase in smoking cessation rate that is between 90% and 95%.

Example: Study comparing risk of hypertension in women who continue to work and

those who stop working during a ﬁrst pregnancy.

Women in their ﬁrst pregnancy have roughly a 10% risk of developing

hypertension. The investigator wishes to compare risk in women who stop

working and women who continue. She decides to give the study suﬀicient

power to have a 90% chance of detecting a doubling of risk associated with

continued working. The sample size, from the table, is two groups of 266

women. She decides to increase this to 300 in each group to account for drop-

outs. She writes in her ethics submission:

Women in their ﬁrst pregnancy have roughly a 10% risk of developing

hypertension. We propose to recruit 300 women in each group (work cessation

and working). The proposed sample size has a 90% power to detect a twofold

increase in risk, from 10% to 20%.
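The numbers in these examples can be reproduced approximately by hand. The sketch below assumes the standard normal-approximation formula for comparing two independent proportions with equal group sizes and a two-sided 5% significance level (pooled variance under the null hypothesis, separate variances under the alternative). It is offered as an illustration with hypothetical names, not as the exact algorithm used by Stata.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_sample_size(p1, p2, power=0.90, alpha=0.05):
    """Number needed in EACH group to detect a difference between prevalences p1 and p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                                 # pooled prevalence under H0
    null_part = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_part = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((null_part + alt_part) ** 2 / (p1 - p2) ** 2)


print(two_proportion_sample_size(0.25, 0.35, power=0.90))   # smoking cessation example
print(two_proportion_sample_size(0.10, 0.20, power=0.90))   # hypertension example
```

Remember that the result is the number per group, so the total sample is twice this figure, before any allowance for drop-outs.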

Comparing subgroups within the same sample

Often, the two groups being compared are subgroups of a larger sample. For example, you might be comparing male and female coronary patients, knowing that two thirds of patients are men.

A detailed answer is beyond the scope of a ready-reckoner table, because the

ﬁnal sample size will depend on the relative sizes of the groups being

compared. Roughly, if one group is twice as big as the other, the total sample

size will be 20% higher, if one is three times as big as the other, 30% higher. In

the case of the coronary patients, if two thirds of patients are men, one group

will be twice the size of the other. In this case, you would calculate a total

sample size based on the table and then increase it by 20%.

Stata code

Suppose you are comparing two groups from the same sample, and you expect one group to make up 20% of the sample and the other 80%. In this case, the ratio of the two group sizes is 80:20, which is 4:1. The Stata code for 90% power that gives the first column in the table above now reads

power twoproportions .5 (.45(-.05).2), test(chi2) power(0.9) nratio(4)

You can see that you simply have to specify nratio() to get the appropriate

calculation.
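As a further illustration, here is a sketch for the coronary patients example, where men outnumber women two to one. The proportions of 10% and 20% are made up for the illustration. Stata's nratio() is the size of the second group divided by the size of the first, so the order in which you give the proportions matters.

```stata
* Hypothetical proportions: 10% in the smaller group, 20% in the larger group
* nratio(2) makes the second group twice the size of the first
power twoproportions 0.1 0.2, test(chi2) power(0.9) nratio(2)

* The same comparison with equal group sizes, for reference
power twoproportions 0.1 0.2, test(chi2) power(0.9)
```

Comparing the two total sample sizes shows how much the imbalance costs in your own scenario, which is more reliable than a rough percentage inflation.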

Frequently-asked questions

What is 90% or 95% power?

Just because a diﬀerence really exists in the population you are studying does

not mean it will appear in every sample you take. Your sample may not show

the difference, even though it is there. To be ethical and to give value for money, a

research study should have a reasonable chance of detecting the smallest

diﬀerence that would be of clinical signiﬁcance (if this diﬀerence actually

exists, of course). If you do a study and fail to ﬁnd a diﬀerence, even though it

exists, you may discourage further research, or delay the discovery of

something useful. For this reason, your study should have a reasonable chance

of ﬁnding a diﬀerence, if such a diﬀerence exists.

A study with 90% power is 90% likely to discover the smallest clinically

signiﬁcant diﬀerence between the groups if such a diﬀerence exists. And 95%

power increases this likelihood to 95%. So if a study with 95% power fails to

detect a difference, a difference of that size is unlikely to exist. You should aim for 95%

power, and certainly accept nothing less than 90% power. Why run a study that

has more than a 10% chance of failing to detect the very thing it is looking for?

What if I can only study a certain number of people?

You can use the table to get a rough idea of the sort of difference your study

might be able to detect. Look up the number of people you have available.
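If you have Stata, you can also ask it for the detectable difference directly. The sketch below assumes, purely for illustration, that you can recruit 200 people in total (100 per group) and that the risk in the comparison group is 10%. Giving only the first proportion, together with the sample size and the power, asks Stata to solve for the second proportion.

```stata
* Hypothetical scenario: 200 participants in total, 10% risk in the first group
* Stata reports the second-group proportion detectable with 90% power
power twoproportions 0.1, test(chi2) n(200) power(0.9)
```

If the detectable difference turns out to be larger than anything you could plausibly expect, the study is too small to be worth running in its present form.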

Reference

These calculations were carried out using the Stata release 13 power command.

1.4a Sample sizes for population case-control studies

This section gives guidelines for sample sizes for studies that measure the effect

of a risk factor by comparing a sample of people with the disease with a control

sample of disease-free individuals drawn from the same population. The

eﬀect of the risk factor is measured using the odds ratio.

Population case-control studies have the disadvantage that the controls and

cases may diﬀer on variables that will have an eﬀect on disease risk

(confounding variables), so a multivariable analysis will have to be carried out

to adjust for these variables. The sample sizes shown here are inﬂated by 25%

to allow for the loss of statistical power that will typically result from adjusting

for confounding variables.

If you are controlling for confounding variables by carrying out a matched

case-control study, see section 1.4b.

A case-control study looks for risk factors for a disease or disorder by

recruiting two groups of participants: cases of the disease or disorder, and

controls, who are drawn from the same population as the cases but who did not

develop the disease.

Case-control studies are observational studies. In experimental studies, we

can hold conditions constant so that the only diﬀerence between the two

groups we are comparing is that one group was exposed to the risk factor and

the other was not. In observational studies, however, there can be other

diﬀerences between those exposed to the risk factor and those not exposed. For

example, if you are looking at the relationship between diarrhoeal disease in

children and household water supply, households with high quality water will

diﬀer in other ways from households with low quality water. They are more

likely to be higher social class, wealthier, and more likely to have better

sanitation. These factors, which are associated with both the disease and the

risk factor, are called confounding factors.

Understanding confounding factors is important in designing and analysing

case-control studies. Confounding factors can distort the apparent relationship

between a risk factor and a disease, so their eﬀects have to be adjusted for

statistically during the analysis. In the diarrhoeal disease example, you might

need to adjust your estimate of the eﬀect of good water quality in the

household for the association between good water quality and presence of a

toilet. Any case-control study must identify and measure potential confounding

factors.

Sample size and adjustment for confounding factors

Allowing for confounding factors in the analysis of case-control studies

increases the required sample size, because the statistical adjustment will

increase the margin of uncertainty around the estimate of the risk factor's odds

ratio. If you don't understand the last bit, don't worry. The important thing is

that you have to gather extra data in a case control study to allow you suﬀicient

statistical power to adjust for confounding variables. How much extra data

depends on how strongly the confounding factor is associated with the risk

factor and the disease. Cousens and colleagues (see references) recommend

increasing the sample size by 25%, based on simulation studies. The sample

sizes in the tables in this section are inﬂated by 25% in line with this

recommendation.

Step 1: Prevalence: What is the probable prevalence of the risk factor in

your population?

The prevalence of the risk factor will aﬀect your ability to detect its eﬀect. If

most of the population is exposed to the risk factor, it will be common in your

control group, making it hard to detect its eﬀect, for example. If you are unsure

about the prevalence of the risk factor in the population, err on the extreme

side – that is, if it is rare, use the lowest estimate you have as the basis for

calculations, and if it is common use the highest estimate.

Step 2: Eﬀect Size: What is the smallest odds ratio that would be regarded

as clinically signiﬁcant?

The odds ratio expresses the impact of the factor on the risk of the disease or

disorder. Usually we are only interested in risk factors that have a sizeable

impact on risk – an odds ratio of 2, for example – but if you are studying a

common, serious condition you might be interested in detecting an odds ratio

as low as 1.5, because even a 50% increase in risk of something common or

serious will be important at the public health level.

Step 3: Power: What statistical power do you want?

With 90% power, you have a 90% chance of being able to detect a clinically

signiﬁcant odds ratio. That is, though, a 10% chance of doing the study and

failing to detect it. With 95% power, you have only a 5% chance of failing to

detect a clinically signiﬁcant odds ratio, if it exists.

Step 4: Look up the number of cases from table 1.4a

Table 1.4a Number of cases required for a case control study

Note 1: This assumes a study that recruits an equal number of controls.

Note 2: The table has an allowance of 25% extra participants to adjust for confounding.

Prevalence of       Smallest odds ratio that would be clinically significant
the risk factor        1.5       2     2.5       3       4       5

90% Power to detect the odds ratio
10%                   1581     493     264     175     103      73
20%                    929     300     165     113      69      50
30%                    739     246     140      98      61      46
40%                    674     231     134      95      63      49
50%                    674     239     141     103      69      55
60%                    730     265     161     118      81      65
70%                    869     324     200     149     105      85
80%                   1184     453     284     215     154     128
90%                   2186     855     546     416     304     254

95% Power to detect the odds ratio
10%                   1988     619     331     220     129      91
20%                   1168     376     208     141      86      64
30%                    929     309     175     121      78      59
40%                    848     291     169     120      79      61
50%                    848     300     178     129      86      69
60%                    919     334     203     149     103      83
70%                   1091     408     251     188     131     108
80%                   1489     569     358     270     194     160
90%                   2749    1075     686     524     383     320

Example: A study to detect the effect of smoking on insomnia in the elderly.

Step 1 is to estimate how common smoking is in the elderly. The current

population estimate is that about 27% of the elderly smoke.

Step 2 is to specify the minimum odds ratio that would be clinically signiﬁcant.

In this case, we might decide that an odds ratio of 2.5 would be the smallest

one that would be of real importance.

The table gives a sample size of 140 cases and 140 controls for 90% power to

detect an odds ratio of at least 2.5 with a smoking prevalence of 30%. This is

probably close enough to 27% to be taken as it is.

When analysing the data, the eﬀect of smoking may be confounded by the fact

that smoking is more common in men, and insomnia is also more common in

men. So the apparent relationship between insomnia and smoking could be

partly due to the fact that both are associated with male sex. We can adjust the

odds ratio for sex, and for other confounding factors during the analysis.

Although this will reduce the study power, the sample size table has a built-in

allowance of 25% extra to deal with the loss of power due to confounding.

In an ethics submission, you would write

The sample size was calculated in order to have 90% power to detect an odds

ratio of 2.5 or greater associated with smoking, given that the prevalence of

smoking is approximately 30% in the target population. The sample size was

inﬂated by 25% to allow for the calculation of an odds ratio adjusted for

confounding variables such as gender, giving a planned sample size of 140

cases and 140 controls.
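For prevalences or odds ratios that fall between the table entries, a rough calculation can be done with Stata's power command. This is only a sketch: it uses the standard two-proportions approximation rather than the Cousens formulas behind the table, so expect the answer to differ somewhat from the tabulated value. It assumes that the oratio() option is available in your release and that power leaves the per-group sample size in r(N1).

```stata
* Smoking and insomnia: 30% of controls exposed, smallest odds ratio 2.5
* The first proportion is the exposure prevalence in controls
power twoproportions 0.3, oratio(2.5) test(chi2) power(0.9)

* Inflate the unadjusted number of cases by 25% to allow for
* adjustment for confounding variables
display "cases required, with 25% allowance: " ceil(1.25 * r(N1))
```

The same number of controls is needed when you recruit one control per case.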

Frequently-asked questions

I only have 30 cases available to me – what can I do?

Looking at the table, it is clear that you cannot do a lot. You have a 90% chance

of detecting a ten-fold increase in risk associated with a risk factor that is

present in at least 20% of the population and at most 40%. Sample sizes for

case-control studies are generally larger than people think, so it’s a good idea

to look at the table and consider whether you have enough cases to go ahead.

Is there any way I can increase the power of my study by recruiting more

controls?

Yes. If you have a limited number of cases, you can increase the power of your

study by recruiting more controls.

Step 1: Look up the number of cases you need from table 1.4a

Step 2: Use table 1.4a1 to look up an adjustment factor based on the number of

controls per case that you plan on recruiting. Multiply the original number of

cases by the adjustment factor.

Step 3: the number of controls you require is based on this adjusted number.
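The three steps can be sketched in Stata as follows, using the same figures as the worked example that follows. The adjustment factors in the handbook's table agree with the usual approximation (c + 1) / (2c) for c controls per case, which the sketch uses in place of a table lookup; treat that formula as an assumption if you need other ratios.

```stata
* Step 1: cases needed with one control per case
* (table lookup: 20% prevalence, odds ratio 3, 90% power)
local cases_1to1 = 113

* Step 2: adjustment factor for c controls per case
local c = 3
local factor = (`c' + 1) / (2 * `c')
local cases = ceil(`cases_1to1' * `factor')

* Step 3: controls are based on the adjusted number of cases
local controls = `cases' * `c'
display "cases = `cases', controls = `controls'"
```

With these figures the sketch gives 76 cases and 228 controls.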

Example: An obstetrician is interested in the relationship between manual

work during pregnancy and risk of pre-eclampsia. She does some preliminary

research and ﬁnds that about 20% of her patients do manual work during their

pregnancy. She is interested in being able to detect an odds ratio of 3 or more

associated with manual work. Since pre-eclampsia is comparatively rare, she

plans to recruit three controls for each case.

Table 1.4a1 Effect of multiple controls per case on sample size

Number of controls per case     Multiply the number of cases by
2                               0.75
3                               0.67
4                               0.63
5                               0.60

From table 1.4a, she needs 113 patients with pre-eclampsia for 90% power. Recruiting three controls per case, she can reduce this by a third (multiplying by 0.67), giving 113 x 0.67 = 75.7 cases (76 in round figures). However, she will have to recruit three controls per case, giving 228 controls (76 x 3). Although this is pretty close to the size of study she would have had to do with a 1:1 case-control ratio, it will be quicker to carry out, because recruiting the cases will be the slowest part of the study.

Reference

The calculations in this section were carried out with Stata, using formulas in

Cousens SN, Feachem RG, Kirkwood B, Mertens TE and Smith PG. Case-control studies of childhood diarrhoea: II Sample size. World Health Organization. CDD/EDP/88.3 Undated.

A scanned version may be downloaded here: http://www.ircwash.org/resources/case-control-studies-childhood-diarrhoea

1.4b Sample sizes for matched case-control studies

This section gives sample sizes for studies that compare cases of a disease or

disorder with matched controls drawn from the same population.

Introduction

Case-control studies are widely used to establish the strength of the

relationship between a risk factor and a health outcome. Case-control studies

are observational studies. In experimental studies, we can hold conditions

constant so that the only diﬀerence between the two groups we are comparing

is that one was exposed to the risk factor and one was not. In observational

studies, however, there can be other diﬀerences between those exposed to the

risk factor and those not exposed. For example, if you are looking at the eﬀect

of diet on mild cognitive impairment, you would be aware that the main risk

factor for cognitive impairment is age. Diet also varies with age. Age, then, is a

factor which is associated with both the disease and the risk factor. These

factors are called confounding factors. Confounding factors can distort the

true relationship between a risk factor and a disease unless we take them into

account in the design or the analysis of our study.

We can deal with the presence of confounding variables in the design of our

study by matching the cases and controls on key confounders. In matched case-

control designs, healthy controls are matched to cases using one or more

variables. In practice, the most eﬀicient matching strategy is to match

on at most two variables. Matching on many variables makes it very diﬀicult

to locate and recruit controls. And although matching on many variables is

intuitively attractive, it doesn’t actually increase statistical eﬀiciency – in fact,

matching on more than three variables actually reduces the power of your

study to detect risk factor relationships. Bland and Altman note that “in a large

study with many variables it is easier to take an unmatched control group and

adjust in the analysis for the variables on which we would have matched, using

ordinary regression methods. Matching is particularly useful in small studies,

where we might not have suﬀicient subjects to adjust for several variables at

once.” (Bland & Altman, 1994).

Matching cases and controls will produce a correlation between the probability

of exposure within each case-control pair. This increases the statistical power

of the study. The sample size will depend on the degree of correlation between

the cases and controls. This is rarely possible to estimate, so these calculations

are based on a case-control correlation of phi = 0.2. This is the recommended

action where the correlation is unknown (Dupont 1988).

It is important to note that when you analyse a matched case-control study, you

must incorporate the matching into the analysis using procedures like

conditional logistic regression. Analysing it as an unmatched case-control study

biases the estimates of the risk factor effects in the direction of 1. In other

words, calculated risk factor eﬀects will be smaller than they really are.

Sample size calculation

1. What is the prevalence of the risk factor in the controls? The table gives

prevalences from 10% to 90% in steps of 10%. If in doubt, select the estimate

furthest from 50%. For example, if you think that the prevalence is somewhere

between 10% and 20%, estimate sample size based on a 10% prevalence.

2. What is the smallest odds ratio that would be of real life importance

(clinically signiﬁcant)?

3. Look up the sample size for 90% or 95% power in the table.

Table 1.4b Number of cases required for a matched case control study

Prevalence of       Smallest odds ratio that would be clinically significant
the risk factor        1.5       2     2.5       3       4       5

90% Power to detect the odds ratio
10%                   1501     454     236     152      86      59
20%                    885     279     150     100      59      43
30%                    705     230     128      87      54      40
40%                    644     217     124      86      55      41
50%                    644     223     130      92      59      45
60%                    697     248     147     105      69      54
70%                    827     301     181     131      88      68
80%                   1126     418     254     186     126      99
90%                   2072     784     482     355     243     192

95% Power to detect the odds ratio
10%                   1851     557     289     185     103      70
20%                   1091     342     184     122      71      51
30%                    869     283     156     106      65      47
40%                    794     266     151     104      66      49
50%                    794     274     158     111      71      54
60%                    860     304     179     127      83      64
70%                   1020     369     221     159     105      81
80%                   1389     513     311     226     151     118
90%                   2557     963     589     432     292     229

Multiple controls per case

Where there are multiple controls per case, you can get greater statistical power. If you don’t have enough cases, you could consider this strategy. Recruiting two controls per case will reduce your case sample size by roughly 25% for the same statistical power, and recruiting three controls per case will reduce it by roughly a third. However, the total size of your study will increase because of the extra controls.

Table 1.4b1 Effect of multiple controls per case on sample size

Number of controls per case     Multiply the number of cases by
2                               0.75
3                               0.67
4                               0.63
5                               0.60

Example

A researcher wishes to conduct a matched case-control study of the eﬀect of

regular alcohol consumption on risk of falls in older people. She estimates that

20% to 30% of the elderly population consume alcohol regularly. She decides

that an odds ratio of 2.5 would be regarded as clinically signiﬁcant.

She uses the lower estimate of prevalence – 20% – for sample size calculation.

She will require 150 case-control pairs to achieve 90% power. This is a very

large number of falls patients, and she will only have a maximum of 60 patients

available to her, so she realises that she will only reach 90% power to detect an

odds ratio of 4.

Recruiting 60 patients would take a long time, so she considers recruiting two

controls for each patient, which would reduce the number of patients from 60

to 45, though increasing the number of controls to 90.

Should you use a matched design?

Matched designs seem to oﬀer advantages in being able to control for

confounding variables. However, there are two points to be considered. The

ﬁrst is that a matched design will under-estimate the strength of the risk factor

eﬀect if it is analysed without taking the matching into account, so it is

important to use an appropriate statistical technique (such as conditional

logistic regression).

More importantly, matching can make it hard to ﬁnd controls. In many

situations it is probably better to adjust for confounding variables statistically

and use an unmatched case-control design.

However, there are two cases where matching can be beneﬁcial:

1. There are strong, known risk factors that are not of interest. Variables like

age, smoking, diabetes are well-studied and have strong eﬀects on risk of many

diseases. Matching on these variables can greatly increase study power

(Stürmer 2000). However, one-to-one matching may be less eﬀicient than

frequency matching, and the paper by Stürmer et al is a useful read before you

decide on a matching strategy.

2. Matching may be used to control for background variables that are hard to

measure or are unknown. For example, in hospital studies time of admission

may have a considerable eﬀect on patient outcomes – patients admitted when

the hospital is very busy may receive diﬀerent treatment to those admitted

when it is quiet. Matching cases and controls by time of admission can be used

to control for these contextual variables.

References

Bland JM, Altman DG. Statistics notes: Matching. BMJ 1994; 309: 1128.

Stürmer, T., and H. Brenner. "Potential gain in precision and power by matching

on strong risk factors in case-control studies: the example of laryngeal cancer."

Journal of epidemiology and biostatistics 5.2 (2000): 125-131.

These calculations were carried out using the Stata command sampsi_mcc,

written by Adrian Mander, of the MRC Human Nutrition Research, Cambridge,

UK.

A typical command is

sampsi_mcc , p0(.1) power(.9) solve(n) alt(4) phi(.2) m(1)

which sets the prevalence at .1, the power at 90%, the hypothesised odds ratio

(alternative odds ratio) at 4 and asks Stata to solve the problem for N, the

sample size. The command also includes two options that are not actually

needed, since they are the defaults: phi, the correlation between case-control

pairs, is set at 0.2 and the matching (m) is set to one control per case.

The formulas are drawn from

Dupont W.D. (1988) Power calculations for matched case-control studies.

Biometrics 44: 1157-1168.
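As a sketch of how the command might be adapted, here is the alcohol and falls example from this section (20% prevalence, odds ratio of 2.5, 90% power), first with one control per case and then with two. It uses only the options shown in the typical command above and assumes that the user-written sampsi_mcc command is already installed.

```stata
* Assumes the user-written command sampsi_mcc is installed
* One matched control per case
sampsi_mcc , p0(.2) power(.9) solve(n) alt(2.5) phi(.2) m(1)

* Two matched controls per case, same prevalence, odds ratio and power
sampsi_mcc , p0(.2) power(.9) solve(n) alt(2.5) phi(.2) m(2)
```

Comparing the two results shows how far extra controls reduce the number of cases needed in your own scenario.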

1.5 Sample size for logistic regression with a

continuous predictor variable

This section gives guidelines for sample sizes for studies that measure the effect

of a continuous predictor (for example, body mass index) on the risk of an

endpoint (for example ankle injury). The data may come from a cross-sectional,

case-control or cohort study.

Introduction

Logistic regression allows you to calculate the eﬀect that a predictor variable

has on the occurrence of an outcome. It can be used with cross-sectional data,

case-control data and longitudinal (cohort) data. The eﬀect of the predictor

variable is measured by the odds ratio. A researcher may be interested, for

example, in the effect that body weight has on the probability of a patient not

having a complete clinical response to a standard 70mg dose of enteric aspirin,

or the eﬀect that depression scores have on the probability that the patient will

not adhere to prescribed treatment.

Step 1: Variability : Estimate the mean and standard deviation of the

predictor variable

You will probably be able to estimate the mean value quite easily. If you cannot

ﬁnd an estimate for the standard deviation, you can use the rule of thumb

that the typical range of the variable is four standard deviations. By asking

yourself what an unusually low and an unusually high value would be, you can

work out the typical range. Dividing by four gives a rough standard deviation.

For example, adult weight averages at about 70 kilos, and weights under 50 or

over 100 would be unusual, so the ‘typical range’ is about 50 kilos. This gives

us a ‘guesstimate’ standard deviation of 12.5 kilos (50÷4).
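The rule of thumb is simple enough to do in your head, but here it is as a Stata sketch so that you can substitute your own unusually low and unusually high values. The macro names are just labels chosen for this illustration.

```stata
* Rule of thumb: standard deviation is roughly (typical range) / 4
local low  = 50     // an unusually low adult weight, in kilos
local high = 100    // an unusually high adult weight, in kilos
display "approximate SD = " (`high' - `low') / 4
```

For the adult weight example this displays 12.5.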

Step 2: Baseline: What is the probability of the outcome at the average

value of the predictor?

A good rule of thumb is that the probability of the outcome at the average value

of the predictor is the same as the probability of the outcome in the whole

sample. So if about 20% of patients have poor adherence to prescribed

treatment, this will do as an estimate of the probability of poor adherence at

the average value of the predictor.

Step 3: Eﬀect size: what is the smallest increase in the probability of the

outcome associated with an increase of one standard deviation of the

predictor that would be clinically signiﬁcant?

Clinical signiﬁcance, or real-life signiﬁcance, means that an eﬀect is important

enough to have real-life consequences. In the case of treatment failure with

aspirin, if the probability of treatment failure increased from 10% at the

average weight to 25% one standard deviation higher, it would certainly be of

clinical importance. Would an increase from 10% to 20% be clinically

important? Probably. But any smaller increase probably would not. So in this

case, we would set 10% and 20% as the prevalence at the mean and the

smallest increase to be detected one standard deviation higher.
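The two prevalences chosen in this step can be converted into an odds ratio per standard deviation, which is the effect size that logistic regression reports. A minimal sketch using the aspirin figures (10% at the mean, 20% one standard deviation higher); the macro names are made up for the illustration.

```stata
* Prevalence at the mean of the predictor and one SD above it
local p_mean = 0.10
local p_plus = 0.20

* Odds = p / (1 - p); the odds ratio per SD is the ratio of the two odds
display "odds ratio per SD = " %4.2f (`p_plus' / (1 - `p_plus')) / (`p_mean' / (1 - `p_mean'))
```

This works out at 2.25, which appears rounded to 2.3 in table 1.5.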

Step 4. Read oﬀ the required sample size from the table.

Table 1.5 Sample size for logistic regression

Prevalence at     Prevalence        Odds      N for 90%
mean value        1 SD higher       ratio     power

 5%               10%               2.1       333
10%               15%               1.6       484
10%               20%               2.3       172
20%               25%               1.3       734
20%               30%               1.7       220
20%               40%               2.7        98
20%               50%               4.0       143
25%               30%               1.3       825
25%               35%               1.6       238
25%               40%               2.0       128
25%               50%               3.0        93
30%               35%               1.3       889
30%               40%               1.6       249
30%               50%               2.3        93
30%               60%               3.5       106
40%               45%               1.2       933
40%               50%               1.5       250
40%               60%               2.3        87
40%               80%               6.0       499
50%               55%               1.2       865
50%               60%               1.5       225
50%               75%               3.0        81
50%               80%               4.0       133

Example

A researcher wishes to look at the eﬀect of stigma on the risk of depression in

medical patients. Previous research suggests that the prevalence of depression

is about 20%. We can take this as the prevalence at the mean stigma score. The

researcher wishes to be able to detect an increase in prevalence from 20% to 30% at one

standard deviation above the mean value. From table 1.5, she will need 220 patients to have a

90% chance of detecting a relationship this strong.

Reference:

These calculations were carried out using the powerlog command written for

Stata by Philip B. Ender, UCLA Institute for Digital Research and Education.

The command is supported by an online tutorial at the IDRE website: http://www.ats.ucla.edu/stat/stata/dae/logit_power.htm

1.6 Sample sizes for logistic or Cox regression with

multiple predictors

This section reviews guidelines on the number of cases required for studies in

which logistic regression or Cox regression are used to measure the eﬀects of

risk factors on the occurrence of an endpoint. Earlier recommendations stated

that you needed ten events (endpoints) per predictor variable. Subsequent

work suggested that this isn’t strictly true, and that 5–9 events per predictor

may yield estimates that are just as good. However, the jury is still out.

The section includes guidelines on designing studies with multiple predictors.

There isn’t a table because the number of potential scenarios is impossibly big.

Introduction

Logistic regression builds a model to estimate the probability of an event

occurring. To use logistic regression, we need data in which each participant’s

status is known: the event of interest has either occurred or has not occurred.

For example, we might be analysing a case-control study of stress fractures in

athletes. Stress fractures are either present (in the cases) or absent (in the

controls). We can use logistic regression to analyse the data.

However, in follow-up studies, we often have data on people who might

experience the event but they have not experienced it yet. For example, in a

cancer follow-up study, some patients have experienced a recurrence of the

disease, while others are still being followed up and are disease free. We

cannot say that those who are disease free will not recur, but we know that

their time to recurrence must be greater than their follow-up time. This kind of

data is called censored data.

In this case, we can use Cox regression, sometimes called a proportional hazards general linear model, which is what Cox himself called it. (You can see why people refer to it as Cox regression!)

The ten events per predictor rule

There was a very inﬂuential paper published in the 1990s by Peduzzi et al

(1996) based on simulation studies which concluded that for logistic regression

you needed ten events (not patients) per predictor variable if you were

calculating a multivariable model.

Example: a researcher wants to look at factors aﬀecting the development of

hypertension in ﬁrst-time pregnancies. If the researcher has 5 explanatory

variables, they will need to recruit a sample big enough to yield 50 cases of

hypertension. Around 20% of ﬁrst-time mothers will develop hypertension, so

these 50 cases will be 20% of the required sample. So a total sample of 250 will

be required so that there will be the required 50 cases.

More recent research has cast doubt on this rule

More recently, bigger and more comprehensive simulation studies have cast

doubt on this hard-and-fast rule. Vittinghoﬀ and McCulloch (2007), in a very

widely-cited paper, concluded that “problems are fairly frequent with 2–4

events per predictor variable, uncommon with 5–9 events per predictor

variable, and still observed with 10–16 events per predictor variable. Cox

models appear to be slightly more susceptible than logistic. The worst

instances of each problem were not severe with 5–9 events per predictor

variable and usually comparable to those with 10–16 events per predictor

variable.”

In other words, with between 5 and 9 events per predictor variable, their

models performed more or less as well as models with 10-16 events per

variable. As a safe minimum, then, it appears that there should be at least 5

events per predictor variable.

Since then, further simulation studies where prediction models are validated

against new datasets tend to conﬁrm that 10 events per variable is a minimum

requirement (see Wynants 2015) for logistic regression. These studies are

important because they are concerned with the generalisability of ﬁndings.

The importance of the number and type of predictor variables

The second factor that will inﬂuence sample size is the nature of the study.

Where the predictor variables have low prevalence and you intend running a

multivariable model with several predictors, then the number of events per

variable required for Cox regression is of the order of 20. As you might

imagine, increasing the number of predictor variables and decreasing their

prevalence both require increases in the number of events per variable.

Sample size requirements

Based on current research, the sample should have at least 5 events per

predictor variable, ideally 10. Sample sizes will need to be larger than this if

you are performing a multivariate analysis with predictor variables that have

low prevalences. In this case, you may require up to 20 events per variable, and

should probably read the paper by Ogundimu et al (2016).
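The arithmetic behind these requirements can be sketched in Stata. The figures below are planning assumptions taken from the hypertension example earlier in this section (5 predictors, with about 20% of participants expected to experience the event); substitute your own.

```stata
* Planning assumptions (hypothetical): number of predictors and
* percentage of participants expected to experience the event
local predictors = 5
local event_pct  = 20

* Total sample needed for 5, 10 and 20 events per predictor variable (EPV)
foreach epv of numlist 5 10 20 {
    local events = `epv' * `predictors'
    local total  = ceil(100 * `events' / `event_pct')
    display "EPV `epv': `events' events, total sample about `total'"
}
```

At 10 events per variable this reproduces the 50 events and total sample of 250 from the example. The other two lines show how sensitive the total is to the rule you adopt.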

Correlated predictors – a potential source of problems

One consideration needs to be mentioned: correlations between your predictor

variables. If your predictor variables are uncorrelated, the required sample size

will be smaller than if they are correlated. And the stronger the correlation, the

larger the required sample size. Courvoisier (2011) points out that the size of

the eﬀect associated with the predictor and the correlations between the

predictors all affect the statistical power of a study. And Kocak and Onar-Thomas (2012),

using simulation studies, report that the problem is especially signiﬁcant in

small samples.

One solution to the problem is to design the analysis carefully.

1. Choose predictor variables based on theory, not availability. It is better to use

a small set of predictors that test an interesting hypothesis than to have a large

number of predictors that were chosen simply because the data were there.

2. Make sure that predictors don’t overlap. If you put education and social class

into a prediction model, they measure overlapping constructs. The well-

educated tend to have higher social class. Does your hypothesis really state

that the two constructs have diﬀerent eﬀects? Choose one good measure of

each construct rather than having multiple overlapping measures.
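One quick way to check for overlap before fitting a model is to look at the correlations between your candidate predictors. The sketch below is an illustration only: it uses the auto dataset that ships with Stata as a stand-in for study data, and the three variables are chosen simply because they are obviously related.

```stata
* Illustration only: Stata's bundled auto dataset stands in for study data
sysuse auto, clear

* Pairwise correlations between candidate predictors; values close to
* 1 or -1 suggest that the variables measure overlapping constructs
correlate weight length displacement
```

Where two predictors turn out to be strongly correlated, consider keeping only the one that best measures the construct in your hypothesis.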

Frequently asked questions

That’s all very well but I have only 30 patients

That’s health research. I worked on what was, at the time, one of the world’s

largest studies of a rare endocrine disorder. It had 16 patients. We are often

faced with a lack of participants because we are dealing with rare problems or

rare events. In such a case, we do what we can. What this section is warning is

that with rare conditions our statistical power is low. The only strategy in this

case is the one outlined above: keep to a small, theoretically-justiﬁed set of

predictors that have as little overlap as possible. And try and collaborate with

other centres to pool data.

References

Courvoisier, D.S. et al., 2011. Performance of logistic regression modeling:

beyond the number of events per variable, the role of data structure. Journal of

Clinical Epidemiology, 64(9), pp.993–1000.

Kocak M, Onar-Thomas A. A Simulation-Based Evaluation of the Asymptotic

Power Formulas for Cox Models in Small Sample Cases. The American

Statistician. 2012 Aug 1;66(3):173-9.

Ogundimu EO, Altman DG, Collins GS. Adequate sample size for developing

prediction models is not simply related to events per variable. Journal of

Clinical Epidemiology. Elsevier Inc; 2016 Aug 1;76(C):175–82.

Peduzzi, P. et al., 1996. A simulation study of the number of events per variable

in logistic regression analysis. Journal of Clinical Epidemiology, 49(12), pp.

1373–1379.


Vittinghoﬀ, E. & McCulloch, C.E., 2007. Relaxing the rule of ten events per

variable in logistic and Cox regression. American Journal of Epidemiology,

165(6), pp.710–718.

Wynants L, Bouwmeester W, Moons KGM, Moerbeek M, Timmerman D, Van

Huﬀel S, et al. A simulation study of sample size demonstrated the importance

of the number of events per variable to develop prediction models in clustered

data. Journal of Clinical Epidemiology. Elsevier Inc; 2015 Dec 1;68(12):1406–

14.


2: Sample sizes and powers for comparing two

means where the variable is measured on a

continuous scale that is (more or less) normally

distributed.

This section gives guidelines for sample sizes for studies that measure the

diﬀerence between the means of two groups, or that compare the means of the

same group measured under two diﬀerent conditions (often before and after an

intervention).

2.1 Comparing the means of two groups

Studies frequently compare a group of interest with a control group or

comparison group. If your study involved measuring something on the same

people twice, once under each of two conditions, you need the next section.

Step 1: Eﬀect size: decide on the diﬀerence that you want to be able to

detect

The ﬁrst step in calculating a sample size is to decide on the smallest diﬀerence

between the two groups that would be 'clinically signiﬁcant' or 'scientiﬁcally

signiﬁcant'. For example, a diﬀerence in birth weight of 250 grammes between

babies whose mothers smoked and babies whose mothers did not smoke would

certainly be regarded as clinically significant, as it represents the weight gain

of a whole week of gestation. However, a smaller diﬀerence – say 75 grammes –

probably would not be.

It is hard to define the smallest difference that would be clinically significant. An element of guesswork is involved. What is the smallest

reduction in cholesterol that would be regarded as clinically worthwhile? It

may be useful to search the literature and see what other investigators have

done. And bear in mind that an expensive intervention will need to be

associated with quite a large diﬀerence before it would be considered

worthwhile.

NB: Eﬀect size should not be based on your hopes or expectations!

Note, however, that the sample size depends on the smallest clinically

signiﬁcant diﬀerence, not on the size of the diﬀerence you expect to ﬁnd. You

may have high hopes, but your obligation as a researcher is to give your study

enough power to detect the smallest diﬀerence that would be clinically

signiﬁcant.


Step 2: Convert the smallest clinically signiﬁcant diﬀerence to standard

deviation units.

Step 2.1. What is the expected mean value for the control or comparison group?

Step 2.2. What is the standard deviation of the control or comparison group?

How to get an approximate standard deviation

If you do not know this exactly, you can get a reasonable guess by identifying

the highest and lowest values that would typically occur. Since most values will

be within ±2 standard deviations of the average, then the highest typical value

(2 standard deviations above average) and lowest typical value (2 below) will

span a range of four standard deviations.

An approximate standard deviation is therefore

\[
\text{Approximate SD} = \frac{\text{Highest typical value} - \text{Lowest typical value}}{4}
\]

For example: a researcher is measuring fœtal heart rate, to see if mothers who smoke have babies with slower heart rates. A typical rate is 160 beats per minute, and normally the rate would not be below 145 or above 175. The variation in 'typical' heart rates is 175 – 145 = 30 beats. This is about 4 standard deviations, so the standard deviation is about 7.5 beats per minute. (This example is real, and the approximate standard deviation is pretty close to the real one!)

How to get an approximate standard deviation from a published confidence interval

Another potential source of standard deviation information is published research. Although the paper may not include a standard deviation, it may include a confidence interval. The Cochrane Handbook has a useful formula for converting a 95% confidence interval to a standard deviation:

\[
\text{Standard deviation} = \sqrt{N} \times \frac{\text{Upper CI limit} - \text{Lower CI limit}}{3.92}
\]

where N is the number of cases.
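The two rules of thumb above can be sketched in a few lines of Python (not the Stata or R used elsewhere in this guide). The function names are illustrative, the published interval in the second call is made up, and the confidence interval version assumes a 95% interval around a single mean, which is where the 3.92 (2 × 1.96) comes from.

```python
import math


def sd_from_range(lowest_typical, highest_typical):
    """Approximate SD as a quarter of the typical range (about +/-2 SD)."""
    return (highest_typical - lowest_typical) / 4


def sd_from_ci(lower, upper, n):
    """Approximate SD from a 95% confidence interval for a single mean."""
    return math.sqrt(n) * (upper - lower) / 3.92


# Foetal heart rate example: a typical range of 30 beats (145 to 175)
print(sd_from_range(145, 175))             # 7.5

# Hypothetical published result: 95% CI of 157 to 163 based on 25 babies
print(round(sd_from_ci(157, 163, 25), 2))  # 7.65
```

Both functions give rough planning values only; a standard deviation taken directly from comparable data is preferable when one is available.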


Step 3. What is the smallest diﬀerence between the two groups in the

study that would be considered of scientiﬁc or clinical importance?

This is the minimum diﬀerence which should be detectable by the study. You

will have to decide what is the smallest diﬀerence between the two groups that

you are studying that would constitute a 'clinically signiﬁcant diﬀerence' – that

is, a diﬀerence that would have real-life implications.

In the case of the foetal heart rate example, a researcher might decide that a

diﬀerence of 5 beats per minute would be clinically signiﬁcant.

Note again that the study should be designed to have a reasonable chance of

detecting the minimum clinically signiﬁcant diﬀerence, and not the diﬀerence

that you think is actually there.

Step 4. Convert the minimum difference to be detected to standard deviation units by dividing it by the standard deviation

\[
\text{Difference in SD units} = \frac{\text{Minimum difference to be detected}}{\text{Standard deviation}}
\]

Following our example, the minimum difference is 5 beats, and the standard deviation is 7.5 beats. The difference to be detected is therefore two thirds of a standard deviation (0.67).

Step 5: Use table 2.1 to get an idea of the number of participants you need in each group to detect a difference of this size.

Following the example, the nearest value in the table to 0.67 is 0.7. The researcher will need two groups of 44 babies each to have a 90% chance of detecting a difference of 5 beats per minute between smoking and non-smoking mothers' babies. To have a 95% chance of detecting this difference, the researcher will need 55 babies in each group.
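As a rough cross-check on the table lookup, the standardised difference and an approximate sample size can be computed directly. This Python sketch uses the usual normal approximation, n per group = 2(z₁₋α/₂ + z_power)²/d², with a two-sided 5% significance level. That formula is an assumption of the sketch rather than the method behind Table 2.1, and it gives answers slightly smaller than the t-test-based Stata figures.

```python
import math
from statistics import NormalDist


def standardised_difference(min_difference, sd):
    """Express the smallest important difference in SD units."""
    return min_difference / sd


def n_per_group(d, power=0.9, alpha=0.05):
    """Normal-approximation sample size per group for a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


d = standardised_difference(5, 7.5)   # foetal heart rate example
print(round(d, 2))                    # 0.67
print(n_per_group(0.7, power=0.9))    # 43
print(n_per_group(0.7, power=0.95))   # 54
```

The approximation is handy for a quick sanity check; for the final figure, use the table or the Stata and R commands given in this section.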


Table 2.1 Sample size for comparing the means of two groups

Difference to    N in each     N in each     Chance that someone in
be detected      group,        group,        group 1 will score higher
(SD units)       90% power*    95% power     than someone in group 2
2                7             8             92%
1.5              11            13            86%
1.4              12            15            84%
1.3              14            17            82%
1.25             15            18            81%
1.2              16            20            80%
1.1              19            23            78%
1                23            27            76%
0.9              27            34            74%
0.8              34            42            71%
0.75             39            48            70%
0.7              44            55            69%
0.6              60            74            66%
0.5              86            105           64%
0.4              133           164           61%
0.3              235           290           58%
0.25             338           417           57%
0.2              527           651           55%

*Sample Stata code for the first entries in this column:

power twomeans 0 (2 1.5 1.4 1.3 1.25 1.2 1.1 1), power(0.9)

If you intend using the Wilcoxon Mann-Whitney test, multiply the sample size by 1.16.

Frequently-asked questions

What is 90% or 95% power?

Just because a difference really exists in the population you are studying does not mean it will appear in every sample you take. Your sample may not show the difference, even though it is there. To be ethical and value for money, a research study should have a reasonable chance of detecting the smallest difference that would be of clinical significance (if this difference actually exists, of course). If you do a study and fail to find a difference, even though it exists, you may discourage further research, or delay the discovery of something useful. For this reason, your study should have a reasonable chance of finding a difference, if such a difference exists.

A study with 90% power is 90% likely to discover the diﬀerence between

the groups if such a diﬀerence exists. And 95% power increases this likelihood

to 95%. So if a study with 95% power fails to detect a diﬀerence, the diﬀerence

is unlikely to exist. You should aim for 95% power, and certainly accept nothing

less than 90% power. Why run a study that has more than a 10% chance of

failing to detect the very thing it is looking for?

How do I interpret the column that shows the chance that a person in one

group will have a higher score than a person in another group?

Some scales have measuring units that are hard to imagine. We can imagine

fœtal heart rate, which is in beats per minute, but how do you imagine scores

on a depression scale? What constitutes a 'clinically signiﬁcant' change in

depression score?

One way of thinking of differences between groups is to ask how likely it is that a person picked at random from one group will score higher than a person picked at random from the other group. For example we could ask how likely it is that the baby of a non-smoking mother has a higher heart rate than the baby of a smoking mother. Continuing the example, if we decide that a difference of 5

beats per minute is clinically signiﬁcant (which corresponds to just about 0.7

SD), this means that there is a 69% chance that a non-smoking mother's baby

will have a higher heart rate than a smoking mother's baby. (Of course, if there

is no eﬀect of smoking on heart rate, then the chances are 50% – a smoking

mothers' baby is just as likely to have higher heart rate as a lower heart rate).

This information is useful for planning clinical trials. We might decide

that a new treatment would be superior if 75% of the people would do better on

it. (If it was just the same, then 50% of people would do better and 50% worse.)

This means that the study needs to detect a difference of about 1 standard deviation (from the table). And the required size is two groups of 27 people for 95% power.

The technical name for this percentage, incidentally, is the Mann-Whitney

statistic. You will also encounter it as the c statistic, Harrell’s c, and even as the

area under the ROC curve.
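For readers who want to see where this percentage comes from: if both groups are normally distributed with the same standard deviation (an assumption of this sketch), the chance that a randomly chosen person from the higher-scoring group beats a randomly chosen person from the other group is Φ(d/√2), where d is the difference in SD units. A short Python illustration, with an illustrative function name:

```python
import math
from statistics import NormalDist


def prob_higher(d):
    """P(score in group 1 > score in group 2) for two normal groups d SDs apart."""
    return NormalDist().cdf(d / math.sqrt(2))


for d in (0.5, 0.7, 1.0):
    print(d, f"{prob_higher(d):.0%}")
# 0.5 64%
# 0.7 69%
# 1.0 76%
```

With no difference between the groups (d = 0) the function returns 0.5, the coin-toss case described above.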

I have a limited number of potential participants. How can I ﬁnd out power

for a particular sample size?

You may be limited to a particular sample size because of the limitations of

your data. There may only be 20 patients available, or your project time scale

only allows for collecting data on a certain number of participants. You can use

the table to get a rough idea of the power of your study. For example, with only 20 participants in each group, you have more than 95% power to detect a difference of 1.25 standard deviations (which only needs two groups of 18) and slightly less than 90% power to detect a difference of 1 standard deviation (you would really need two groups of 23).
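The table only gives sample sizes for 90% and 95% power. To estimate the power of a fixed sample size, a normal approximation can be used: power ≈ Φ(d·√(n/2) − z₁₋α/₂). The Python sketch below assumes a two-sided 5% test and equal group sizes; because it ignores the small-sample t correction, it slightly overstates power in small studies.

```python
import math
from statistics import NormalDist


def approx_power(n_per_group, d, alpha=0.05):
    """Approximate power of a two-group comparison (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(d * math.sqrt(n_per_group / 2) - z_alpha)


print(round(approx_power(20, 1.25), 2))  # 0.98
print(round(approx_power(20, 1.0), 2))   # 0.89
```

For an exact figure, Stata's power command and R's pwr.t.test can both solve for power when given the sample size instead.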

But what if the diﬀerence between the groups is bigger than I think?

Sample sizes are calculated to detect the smallest clinically signiﬁcant

diﬀerence. If the diﬀerence is greater than this, the study's power to detect it is

higher. For instance, a study of two groups of 44 babies has 90% power to

detect a diﬀerence of 0.7 standard deviations, which corresponded (roughly) to

5 beats per minute, the smallest clinically signiﬁcant diﬀerence. If the real

diﬀerence were bigger – say, 7.5 beats per minute (1 standard deviation) then

the power of the study would actually be 99.6%. (This is just an example, and I

had to calculate this power speciﬁcally; it's not in the table.) So if your study

has adequate power to detect the smallest clinically signiﬁcant diﬀerence, it

has more than adequate power to detect bigger diﬀerences.

I intend using a Wilcoxon (Mann Whitney) test because I don't think my

data will be normally distributed

The ﬁrst important point is that the idea that the data should be normally

distributed before using a t-test, or linear regression, is a myth. It is the

measurement errors that need to be normally distributed. But even more

important, studies with non-normal data have shown that the t-test is extremely

robust to departures from normality (Fagerland, 2012; Fagerland, Sandvik, &

Mowinckel, 2011; Rasch & Teuscher, 2007).

A second persistent misconception is that you cannot use the t-test on small

samples (when pressed, people mutter something about “less than 30” but

aren’t sure). Actually, you can. And the t-test performs well in samples as small as N=2! (J. de Winter, 2013). Indeed, with very small samples, the Wilcoxon Mann-Whitney test is unable to detect a significant difference, while the t-test is (Altman & Bland, 2009).

Relative to a t-test or regression, the Wilcoxon test (also called the Wilcoxon

Mann-Whitney U test) can be less eﬀicient if your data are close to normally

distributed. However, a statistician called Pitman showed that the test was

never less than 86.4% as eﬀicient. So inﬂating your sample by 1.16 should give

you at least the same power that you would have using a t-test with normally

distributed data. With skewed data, or data in which the distributions are different in the two groups, the Wilcoxon Mann-Whitney test can be more powerful than a t-test.
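The 1.16 inflation factor is simply 1/0.864, rounded up. A minimal Python sketch of the adjustment (the function name is illustrative):

```python
import math


def inflate_for_wilcoxon(n_t_test, factor=1.16):
    """Inflate a t-test sample size for use with the Wilcoxon Mann-Whitney test."""
    return math.ceil(n_t_test * factor)


print(round(1 / 0.864, 3))       # 1.157
print(inflate_for_wilcoxon(44))  # 52
```

So a study that needs 44 per group for a t-test would plan for 52 per group if the Wilcoxon Mann-Whitney test is to be used.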

My data are on 5-point Likert scales and my supervisor says I cannot use a

t-test because my data are ordinal

Simulation studies comparing the t-test and the Wilcoxon Mann-Whitney test on

items scored on 5-point scales have given heartening results. In most scenarios,

the two tests had a similar power to detect diﬀerences between groups. The

false-positive error rate for both tests was near to 5% for most situations, and


never higher than 8% in even the most extreme situations. However, when the

samples diﬀered markedly in the shape of their score distribution, the Wilcoxon

Mann-Whitney test outperformed the t-test (J. C. de Winter & Dodou, 2010).

Methods in Stata and R

The sample sizes were calculated using Stata Release 14, using the power command. The Mann-Whitney statistic was calculated using the mwstati command for Stata written by Rich Goldstein, and based on formulas in Colditz et al (1988), listed in the references below.

You can also use the package pwr in R. The R code for the fœtal heart rate example, where we want to detect a difference of 0.67 standard deviations, is

> library(pwr)
> pwr.t.test(n=NULL, d=.67, power=.9, type="two.sample")

     Two-sample t test power calculation

              n = 47.79517
              d = 0.67
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

References and useful reading

These calculations were carried out using Stata release 12

Altman, D. G., & Bland, J. M. (2009). Parametric v non-parametric methods for

data analysis. Bmj, 338(apr02 1), a3167–a3167. doi:10.1136/bmj.a3167

Conroy, R. M. (2012). What hypotheses do “nonparametric” two-group tests

actually test? The Stata Journal, 12(2), 1–9.

Higgins JPT. Cochrane Handbook for Systematic Reviews of Interventions. The

Cochrane Collaboration; 2011. Available from: www.handbook.cochrane.org.

de Winter, J. (2013). Using the Student’s t-test with extremely small sample

sizes. Practical Assessment, Research & Evaluation, 18(10), 1–12.

de Winter, J. C., & Dodou, D. (2010). Five-point Likert items: t test versus Mann-

Whitney-Wilcoxon. Practical Assessment, Research & Evaluation, 15(11),

1–12.

Fagerland, M. W. (2012). t-tests, non-parametric tests, and large studies--a

paradox of statistical practice? BMC Medical Research Methodology, 12,

78. doi:10.1186/1471-2288-12-78

Fagerland, M. W., Sandvik, L., & Mowinckel, P. (2011). Parametric methods

outperformed non-parametric methods in comparisons of discrete


numerical variables. BMC Medical Research Methodology, 11(1), 44. doi:

10.1186/1471-2288-11-44

Colditz, G. A., J. N. Miller, and F. Mosteller. (1988). Measuring Gain in the Evaluation of Medical Technology. International Journal of Technology Assessment. 4, 637-42.

Rasch, D., & Teuscher, F. (2007). How robust are tests for two independent samples? Journal of Statistical Planning and Inference, 137(8), 2706–2720.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics

Bulletin, 1(6), 80–83.

Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances.

The British Journal of Mathematical and Statistical Psychology, 57(Pt 1),

173–181. doi:10.1348/000711004849222


2.2 Sample sizes for comparing means in the same

people under two conditions

One common experimental design is to measure the same thing twice, once

under each of two conditions. This sort of data is often analysed with the

paired t-test. However, the paired t-test doesn't actually use the two values you

measured; it subtracts one from the other and gets the average diﬀerence. The

null hypothesis is that this average diﬀerence is zero.

So the sample size for paired measurements doesn't involve specifying the

means for each condition but specifying the mean diﬀerence.

Step 1: decide on the diﬀerence that you want to be able to detect.

The ﬁrst step in calculating a sample size is to decide on the smallest diﬀerence

between the two measurements that would be 'clinically signiﬁcant' or

'scientiﬁcally signiﬁcant'. For example, if you wanted to see how eﬀective an

exercise programme was in reducing weight in people who were overweight,

you might decide that losing two kilos over the one-month trial period would be

the minimum weight loss that would count as a 'significant' weight loss.

It is often hard to define the smallest difference that would be clinically significant. An element of guesswork is involved. What is the smallest

reduction in cholesterol that would be regarded as clinically worthwhile? It

may be useful to search the literature and see what other investigators have

done.

Eﬀect size should not be based on your expectations!

Note, however, that the sample size depends on the smallest clinically significant difference, not on the size of the difference you expect to find.

Step 2: Convert the smallest clinically signiﬁcant diﬀerence to standard

deviation units.

Step 2.1. What is the standard deviation of the diﬀerences?

This is often very hard to ascertain. You may ﬁnd some published data. Even if

you cannot you can get a reasonable guess by identifying the biggest positive

and biggest negative diﬀerences that would typically occur. The biggest

positive diﬀerence is the biggest diﬀerence in the expected direction that

would typically occur. The biggest negative diﬀerence is the biggest diﬀerence

in the opposite direction that would be expected to occur. Since most values

will be within ±2 standard deviations of the average, then the biggest positive

diﬀerence (2 standard deviations above average) and biggest negative (2

below) will span a range of four standard deviations. An approximate standard deviation is therefore

\[
\text{Approximate SD of differences} = \frac{\text{Biggest typical positive difference} - \text{Biggest typical negative difference}}{4}
\]

For example: though we are hoping for at least a two kilo weight loss following exercise, some people may lose up to five kilos. However, others might actually gain as much as a kilo, perhaps because of the effect of exercise on appetite. So the change in weight can vary from plus five kilos to minus one, a range of six kilos. The standard deviation is a quarter of that range: one and a half kilos.

Step 2.2. Convert the minimum difference to be detected to standard deviation units by dividing it by the standard deviation

\[
\text{Difference in SD units} = \frac{\text{Minimum difference to be detected}}{\text{Standard deviation of the difference}}
\]

Following our example, the minimum difference is 2 kilos, and the standard deviation is 1.5 kilos. The difference to be detected is therefore one and a third standard deviations (1.33).

Step 3: Use table 2.2 to get an idea of the number of participants you need to detect a difference of this size.

Following the example, the nearest value in the table to 1.33 is 1.3. The researcher will need to study 9 people to have a 90% chance of detecting a weight loss of 2 kilos following the exercise programme. To have a 95% chance of detecting this difference, the researcher will need 10 people.
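The weight-loss example can be worked through in a couple of lines of Python. The numbers are those of the example; the function name is illustrative.

```python
def sd_of_differences(biggest_positive, biggest_negative):
    """Approximate SD of the differences as a quarter of their typical range."""
    return (biggest_positive - biggest_negative) / 4


# Weight change ranges from a 5 kg loss (+5) to a 1 kg gain (-1)
sd = sd_of_differences(5, -1)
print(sd)                # 1.5

# Smallest important difference (2 kg) expressed in SD units
print(round(2 / sd, 2))  # 1.33
```

The second figure is the value to look up in the first column of Table 2.2.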


Table 2.2 Sample sizes for comparing means in the same people under two conditions

Sample sizes for studies which compare mean values on the same people measured under two different conditions

Difference to    N required    N required    Percentage of people who
be detected      for 90%       for 95%       will change in the
(SD units)       power*        power         hypothesised direction
2                5             6             98%
1.5              7             8             93%
1.4              8             9             92%
1.3              9             10            90%
1.25             9             11            89%
1.2              10            12            88%
1.1              11            13            86%
1                13            16            84%
0.9              16            19            82%
0.8              19            23            79%
0.75             21            26            77%
0.7              24            29            76%
0.6              32            39            73%
0.5              44            54            69%
0.4              68            84            66%
0.3              119           147           62%
0.25             171           210           60%
0.2              265           327           58%

*Stata code for this column:

power pairedmeans, sddiff(1) altdiff(2 1.5 1.4 1.3 1.25 1.2 1.1 1 0.9 0.8 0.75 0.7 0.6 0.5 0.4 0.3 0.25 0.2) power(0.9)

Note that the Stata code includes a list of values for the alternative hypothesis difference. Note also that you can run this command from the Stata menus and dialogues.


Frequently-asked questions

What is 90% or 95% power?

Just because a diﬀerence really exists in the population you are studying does

not mean it will appear in every sample you take. Your sample may not show

the diﬀerence, even though it is there. To be ethical and value for money, a

research study should have a reasonable chance of detecting the smallest

diﬀerence that would be of clinical signiﬁcance (if this diﬀerence actually

exists, of course). If you do a study and fail to ﬁnd a diﬀerence, even though it

exists, you may discourage further research, or delay the discovery of

something useful. For this reason, your study should have a reasonable chance

of ﬁnding a diﬀerence, if such a diﬀerence exists.

A study with 90% power is 90% likely to discover the diﬀerence between

the two measurement conditions if such a diﬀerence exists. And 95% power

increases this likelihood to 95%. So if a study with 95% power fails to detect a

diﬀerence, the diﬀerence is unlikely to exist. You should aim for 95% power,

and certainly accept nothing less than 90% power. Why run a study that has

more than a 10% chance of failing to detect the very thing it is looking for?

How do I interpret the column that shows the percentage of people who will change in

the hypothesised direction?

Some scales have measuring units that are hard to imagine. We can imagine

foetal heart rate, which is in beats per minute, but how do you imagine scores

on a depression scale? What constitutes a 'clinically signiﬁcant' change in

depression score?

One way of thinking of diﬀerences between groups is to ask what

proportion of the people will change in the hypothesised direction. For example

we could ask what proportion of depressed patients on an exercise programme

would have to show improved mood scores before we would consider making

the programme a regular feature of the management of depression. If we decide that we would like to see improvements in at least 75% of patients, then depression scores have to fall by 0.7 standard deviation units. The sample size we need is 24 patients for 90% power, 29 for 95% power (the table doesn't give 75%, so I've used the row for 76%, which is close enough).

The technical name for this percentage, incidentally, is the Mann-Whitney

statistic.
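If the differences are roughly normally distributed (an assumption of this sketch), the percentage of people expected to change in the hypothesised direction is Φ(d), where d is the mean difference in SD units. A short Python illustration, which agrees with the rows of Table 2.2 checked here:

```python
from statistics import NormalDist


def percent_changing(d):
    """Proportion whose difference lies in the hypothesised direction."""
    return NormalDist().cdf(d)


for d in (0.5, 0.7, 1.0):
    print(d, f"{percent_changing(d):.0%}")
# 0.5 69%
# 0.7 76%
# 1.0 84%
```

Reading the function backwards gives the planning step described above: decide what percentage of people should improve, then find the difference in SD units that corresponds to it.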

I have a limited number of potential participants. How can I ﬁnd out power for a particular

sample size?

You may be limited to a particular sample size because of the limitations of

your data. There may only be 20 patients available, or your project time scale

only allows for collecting data on a certain number of participants. You can use

the table to get a rough idea of the power of your study. For example, with only 20 participants, you have more than 90% power to detect a difference of 0.8 standard deviations (which only needs 19 participants) and slightly less than 90% power to detect a difference of 0.75 standard deviations (you would really need 21 participants).

But what if the diﬀerence is bigger than I think?

Sample sizes are calculated to detect the smallest clinically signiﬁcant

diﬀerence. If the actual diﬀerence is greater than this, the study's power to

detect it is higher.

Reference and methods

These calculations were carried out using Stata release 15 with the power command.

You can also use the pwr package in R. Here is the calculation for a difference of 0.5 standard deviations with 90% power.

> library(pwr)
> pwr.t.test(n=NULL, d=.5, power=.9, type="paired")

     Paired t test power calculation

              n = 43.99548
              d = 0.5
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number of *pairs*


2.3 Calculating sample sizes for comparing two

means: a rule of thumb

Sample size for comparing two groups

Gerald van Belle gives a good rule of thumb for calculating sample size for

comparing two groups. You do it like this:

1. Calculate the smallest diﬀerence between the two groups that would be of

scientiﬁc interest.

2. Divide this by the standard deviation to convert it to standard deviation units

(this is the same two steps as before)

3. Square the diﬀerence

4. For 90% power to detect this difference in studies comparing two groups, the number you need in each group will be

\[
\text{N in each group} = \frac{21}{(\text{Difference})^2}
\]

Round up the answer to the nearest whole number.

5. For 95% power, change the number above the line to 26.

Despite being an approximation, this formula is very accurate.

Studies comparing one mean with a known value

If you are only collecting one sample and comparing their mean to a known population value, you may also use the formula above. In this case, the formula for 90% power is

\[
\text{N} = \frac{11}{(\text{Difference})^2}
\]

Round up the answer to the nearest whole number.

For 95% power, replace the number 11 above the line by 13.

See the links page at the end of this guide for the source of these rules of thumb.
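The rules of thumb translate directly into code. This Python sketch uses the numerators given above (21 and 26 for two groups, 11 and 13 for one sample); the function names are illustrative.

```python
import math


def n_two_groups(d, power=0.9):
    """Rule-of-thumb N in each group; d is the difference in SD units."""
    numerator = {0.9: 21, 0.95: 26}[power]
    return math.ceil(numerator / d ** 2)


def n_one_sample(d, power=0.9):
    """Rule-of-thumb N when comparing one mean with a known value."""
    numerator = {0.9: 11, 0.95: 13}[power]
    return math.ceil(numerator / d ** 2)


print(n_two_groups(0.5))        # 84
print(n_two_groups(0.5, 0.95))  # 104
print(n_one_sample(0.5))        # 44
```

Only 90% and 95% power are supported, because those are the only numerators given in this section.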


3. Sample size for correlations or regressions

between two variables measured on a numeric scale

This section gives guidelines for sample sizes for studies that measure the

relationship between two numeric variables. Although these sample sizes are

often based on correlations, they can also be applied to linear regression, and

both types of measure are shown in the table.

Introduction : correlation and regression

Correlations are not widely used in medicine, because they are hard to

interpret. One interpretation of a Pearson correlation (r) can be got by squaring it: this gives the proportion of variation in one variable that is linked to variation in another variable. For example, there is a correlation of 0.7 between illness-related stigma and depression, which means that just about half the variation in depression (0.49, which is 0.7²) is linked to variation in illness-related stigma.

Regressions are much more widely used, since they allow us to express the

relationship between two variables in natural units – for example, the eﬀect of

a one-year increase in age on blood pressure. Because regressions are

calculated in natural units, people often cite the proportion of variation shared

between the two variables.

In fact, correlation is just an alternative form of reporting the results of a

regression, so the p-value for a regression will be the same as the p-value for a

Pearson correlation.

Steps in calculating sample size for correlation or regression

Step 1: How much variation in one variable should be linked to variation in

the other variable for the relationship to be clinically important?

This is hard to decide, but it is hard to imagine a correlation being of 'real life'

importance if less than 20% of the variation in one variable is linked to

variation in the other variable.

Step 2: Use the table to look up the corresponding correlation and sample

size


% Shared       Correlation    Sample size     Sample size
variation                     90% power*      95% power
10%            0.32           99              122
15%            0.39           65              80
20%            0.45           48              59
25%            0.5            38              47
30%            0.55           31              37
35%            0.59           26              32
40%            0.63           23              27
45%            0.67           19              23
50%            0.71           17              20

*Stata command for this column:

power onecorrelation 0 (0.32 0.39 0.45 0.5 0.55 0.59 0.63 0.67 0.71), power(0.9)

Reference

These calculations were carried out in Stata 15 with the command power.
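The first two columns of the table are linked by r = √(shared variation). The sample sizes can be approximated with Fisher's z transformation, n ≈ ((z₁₋α/₂ + z_power)/atanh(r))² + 3. This Python sketch assumes a two-sided 5% test; it is not necessarily the exact algorithm Stata uses, though it gives the same answers for the rows checked here.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, power=0.9, alpha=0.05):
    """Approximate N needed to detect a correlation r (Fisher z method)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil((z_total / math.atanh(r)) ** 2 + 3)


r = math.sqrt(0.25)                        # 25% shared variation
print(r)                                   # 0.5
print(n_for_correlation(0.5, power=0.9))   # 38
print(n_for_correlation(0.5, power=0.95))  # 47
```

Because the result is rounded up, correlations whose exact answer falls very close to a whole number may differ by one from the Stata figure.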


4. Sample size for reliability studies

This section gives guidelines for sample sizes for studies that measure

Cronbach’s alpha, an index of the reliability – strictly speaking the internal

consistency – of a set of items designed to measure a trait. The topic of scale

development is a complex one, so the section gives guidance on the

methodology of analysis and the interpretation of alpha.

Introduction : An apology

I wish there were a simple answer to this problem, and there isn’t. Please read

the following carefully.

Cronbach’s alpha

The reliability of a measurement scale is the degree to which all the items

measure the same thing. Reliability is speciﬁc: it describes the performance of

a scale in a speciﬁc population tested under speciﬁc conditions. So it is

important to make sure that scales are reliable when used in realistic

conditions with realistic participants.

In developing a new measurement scale, or showing that a measurement scale

works in a new setting, it is useful to measure its reliability. Reliability is

usually measured using Cronbach's alpha coeﬀicient, which is scaled between

zero and one, with zero meaning that the items in the scale have nothing in

common and one meaning that they are all perfectly correlated. In practice, it

is wildly unlikely that anyone would develop a scale in which all the items were

unrelated, so there is no point in testing whether your reliability is greater than

zero. Instead, you have to specify a minimum value for the reliability

coeﬀicient.

Myths about Cronbach’s alpha

A mythology has grown up around the interpretation of Cronbach’s alpha, based, apparently, on the published work of Nunnally (1978). According to this myth, Nunnally advocated an alpha of 0·7 as indicating a scale that was acceptable for use in research. In fact, it’s worth quoting Nunnally’s paper, which offers a much more nuanced and thoughtful approach to the question:

“What a satisfactory level of reliability is depends on how a measure is being

used. In the early stages of research … one saves time and energy by working

with instruments that have only modest reliability, for which purpose

reliabilities of .70 or higher will suﬀice… In contrast to the standards in basic

research, in many applied settings a reliability of .80 is not nearly high enough.

In basic research, the concern is with the size of correlations and with the


diﬀerences in means for diﬀerent experimental treatments, for which purposes

a reliability of .80 for the diﬀerent measures is adequate.”

“In many applied problems, a great deal hinges on the exact score made by a

person on a test… In such instances it is frightening to think that any

measurement error is permitted. Even with a reliability of .90, the standard

error of measurement is almost one-third as large as the standard deviation of

the test scores. In those applied settings where important decisions are made

with respect to speciﬁc test scores, a reliability of .90 is the minimum that

should be tolerated, and a reliability of .95 should be considered the desirable

standard.”

This extensive quotation is from Lance, C.E., Butts, M.M. & Michels, L.C., 2006.

The Sources of Four Commonly Reported Cutoﬀ Criteria: What Did They Really

Say? Organizational Research Methods, 9(2), pp.202–220.
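The "almost one-third" figure in the quotation can be checked with the classical test theory formula for the standard error of measurement. The formula is not given in the text, so treat this as a worked illustration, with r standing for the reliability and s for the standard deviation of the test scores:

```latex
\mathrm{SEM} = s\sqrt{1 - r},
\qquad
r = 0.90 \;\Rightarrow\; \mathrm{SEM} = s\sqrt{0.10} \approx 0.32\,s
```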

So bear in mind, first, that mindlessly setting a desired alpha of 0·7 and citing

Nunnally’s original text is wrong: he didn’t say anything like that. And,

second, that you need to consider carefully the context of your research in

setting a minimum alpha.

Alpha only applies to unidimensional scales

One of the statistical assumptions underlying alpha is that the scale is

unidimensional. That is to say, that all the items measure the same thing, and

that their failure to correlate perfectly is due to measurement error. So an

important part of scale development is making sure that your items are indeed

unidimensional.

How many cases should a reliability study have?

The standard advice is to have at least 10 participants per item on your scale.

However, this should be regarded as the bare minimum.

There are surprising diﬀerences of opinion in the literature, however, on how

small your sample can be. The best current advice is based on simulation

studies where authors have studied the power of samples of various sizes to

detect a given alpha.

Simulation studies indicate that sample size depends on the structure of your

scale. Sample sizes as small as 30 can measure alpha reliably so long as the

scale items have strong inter-correlations.

First step : principal components analysis

Your analysis should begin with a principal components analysis. A principal

components analysis identiﬁes underlying ‘dimensions’ that account for the

variation in a set of items. In the case of reliability, you should only examine the

ﬁrst principal component. There is a good reason for this: alpha has no

interpretation when scales combine items that measure diﬀerent constructs.

The ﬁrst principal component measures the degree to which the items measure

the same construct.

Samuels, summarising the literature, makes these recommendations:

1. Don’t run reliability analysis with fewer than 30 participants.

2. If you have between 30 and 50 participants, remove items that have loadings

of less than 0·4 on the first principal component. A loading this low means that very

little of the variation in the responses to that item is shared with the other

scale items.

3. Rerun the principal components analysis and examine the ﬁrst eigenvalue

(the eigenvalue for the ﬁrst principal component). If this is less than 6, do not

attempt a reliability analysis; the items just don’t show enough homogeneity to

yield a reliable estimate of alpha.

4. Ideally, scale items should have a loading of 0·8 or more on the ﬁrst principal

component. Items between 0·4 and 0·8 need to be considered carefully as

candidates for inclusion.

5. If your sample size is between 50 and 100, then follow the same steps, but if

your eigenvalue falls between 3 and 6, then only perform a reliability analysis if

the sample size is at least 75. See Yurdugül for details of how these ﬁgures are

arrived at.
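The recommendations above can be sketched as a small decision helper. This is only one reading of the rules as summarised here, not code from Samuels or Yurdugül: the function name and return strings are hypothetical, the boundaries (30 up to but not including 50, and 50 to 100 inclusive) are an assumption, and combinations the text does not mention are reported as not covered rather than guessed. The eigenvalue is the first eigenvalue after low-loading items have been removed and the principal components analysis rerun.

```python
def samuels_advice(n_participants, first_eigenvalue):
    """Return 'run', 'do not run' or 'not covered' for a reliability analysis."""
    if n_participants < 30:
        return "do not run"
    if n_participants < 50:
        # Rule 3: with 30 to 50 participants the first eigenvalue must reach 6.
        return "run" if first_eigenvalue >= 6 else "do not run"
    if n_participants <= 100:
        if first_eigenvalue >= 6:
            return "run"
        if first_eigenvalue >= 3:
            # Rule 5: eigenvalues between 3 and 6 need at least 75 participants.
            return "run" if n_participants >= 75 else "do not run"
        return "not covered"  # eigenvalue below 3 is not discussed in the text
    return "not covered"      # samples above 100 are not discussed in the text


print(samuels_advice(40, 6.5))
print(samuels_advice(60, 4.0))
print(samuels_advice(80, 4.0))
```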

References

Lance, C.E., Butts, M.M. & Michels, L.C., 2006. The Sources of Four Commonly

Reported Cutoﬀ Criteria: What Did They Really Say? Organizational Research

Methods, 9(2), pp.202–220.

Samuels, P., 2015. Statistical Methods – Scale reliability analysis with small

samples, Birmingham City University, Centre for Academic Success. DOI: 10.13140/RG.2.1.1495.5364.

https://www.researchgate.net/publication/280936182_Advice_on_Reliability_Analysis_with_Small_Samples

Yurdugül, H., 2008. Minimum sample size for Cronbach's coeﬀicient alpha : a

Monte-Carlo study. Hacettepe University Journal of Education, 35, pp.397–405.

http://www.efdergi.hacettepe.edu.tr/200835HALİL%20YURDUGÜL.pdf

5. Sample size calculation for agreement between two raters using a present/absent rating scale using Cohen’s Kappa

This section gives guidelines for sample sizes for studies that use the kappa

coeﬀicient to measure the agreement between two raters who make ratings of

present/absent.

Introduction

Studies looking at the agreement between raters come in many shapes and

sizes. The most basic design is where two raters are asked to rate the presence

or absence of a particular feature or quality. Kappa is a statistic that measures

the degree of agreement over and above the agreement you would expect by

chance. You can see why just measuring percentage agreement is not enough.

If you toss two coins, they will agree 50% of the time just by chance. Likewise,

two raters, each of whom rates a feature as present 50% of the time, will agree

with each other by chance 50% of the time.
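The idea of agreement over and above chance can be sketched in a few lines of Python. The function name and the counts are hypothetical. The sketch uses the standard definition of kappa: observed agreement minus chance agreement, divided by one minus chance agreement.

```python
def cohen_kappa(both_present, only_rater1, only_rater2, both_absent):
    """Observed agreement, chance agreement and kappa for a present/absent rating."""
    n = both_present + only_rater1 + only_rater2 + both_absent
    observed = (both_present + both_absent) / n
    # Proportion of cases each rater calls 'present'.
    p1 = (both_present + only_rater1) / n
    p2 = (both_present + only_rater2) / n
    # Agreement expected by chance: both say present, or both say absent.
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# Hypothetical ratings of 100 cases; each rater rates the feature present half the time.
observed, expected, kappa = cohen_kappa(40, 10, 10, 40)
print(f"observed {observed:.2f}, expected by chance {expected:.2f}, kappa {kappa:.2f}")
```

With these counts the raters agree on 80% of cases, but half of all cases would be agreed on by chance alone, so kappa is 0·6 rather than 0·8.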

When we are studying agreement, we have to choose a null hypothesis.

Normally, the null hypothesis says that the data arose by chance – that there is

no actual relationship between the variables we are studying. However, this

makes no sense at all when we are studying agreement. It would be ridiculous

to set up a scientiﬁc study to determine whether the agreement between two

pathologists was better than chance! When two raters rate the same thing, it

would be unusual to ﬁnd that they didn’t agree any more than you would

expect by chance, even in psychiatry.

So in studies of agreement, we have to set a minimum level of agreement that

we want to outrule in our study. Usually we would like to outrule a level of

agreement that would suggest that there was a signiﬁcant problem with the

reliability of the rating. So unlike other sample size methods, the researcher

will have to base sample size calculation for kappa on two ﬁgures: the value of

kappa to be outruled and the likely true value of kappa. In addition, the

prevalence of the feature will aﬀect sample size.

Estimating sample size for kappa

The sample size will depend on three factors:

Step 1: Prevalence of the feature

What is the approximate prevalence of the feature that is being rated? Sample

sizes will be smallest when there is a 50% prevalence, and will get very large

when the prevalence drops much below 25%.

In the calculations below, we assume that there is no systematic diﬀerence

between the raters. In other words, that each rater gives more or less the same

prevalence of the feature. Where you suspect that raters will give diﬀerent

prevalences, the sample size calculation needs to take this into account, and is

well beyond the scope of this guide. However, the R package I used will

perform the calculation (see below).

Step 2 : Deﬁnition of an unacceptably low level of agreement (null value)

It would be astonishing if two raters could not agree any more than you would

expect by chance. So in designing the study we have to stipulate what would be

an unacceptably low level of agreement. This will act as a baseline against

which we can assess the actual level of agreement. Because this is the level of

agreement that we wish to outrule, the value is often called the null value, or

null hypothesis value.

In practice, a kappa of 0.21–0.40 is regarded as a fair level of agreement, 0.41–0.60

as moderate, 0.61–0.80 as substantial and anything above 0·8 as excellent.

That said, these cutpoints have a sort of folkloric status, and the interpretation

of kappa is probably best done in the context of the decision that it supports.

In the tables that follow I will tabulate sample sizes for kappa in cases where

you want to demonstrate that kappa is better than 0·4 (so agreement is better

than ‘fair’), better than 0·5 or 0·6 (better than ‘moderate’) and better than 0·7

and 0·8 (better than ‘substantial’).

Step 3 : Eﬀect size - what is a clinically acceptable level of agreement?

What is the level of agreement that you think should be present if the test is a

reliable test? This value is often called the alternative value or alternative

hypothesis value, in contrast with the null value.

For example, if the test would require substantial agreement between

assessors rather than simply being moderate, then you might set up your

sample size to detect a kappa of 0·75 against a null hypothesis that kappa is

0·6. This would require 199 ratings made by the two raters to achieve 90%

power. However, if you hypothesised that kappa was 0·75, as before, but

wanted to outrule a kappa of 0·5, the required sample size drops to a very

manageable 78.

Sample sizes for kappa for two raters

Prevalence     Hypothesised   Kappa to be outruled       90%     95%
of feature     kappa          (null hypothesis kappa)    power   power
0·5            0.6            0.4                        156     200
0·5            0.7            0.5                        131     169
0·5            0.8            0.6                        102     133
0·5            0.7            0.45                       87      112
0·5            0.8            0.55                       68      90
0·5            0.8            0.5                        49      65
0·4 or 0·6     0.6            0.4                        162     208
0·4 or 0·6     0.7            0.5                        137     177
0·4 or 0·6     0.8            0.6                        106     139
0·4 or 0·6     0.7            0.45                       90      117
0·4 or 0·6     0.8            0.55                       71      94
0·4 or 0·6     0.8            0.5                        51      68
0·25 or 0·75   0.6            0.4                        207     265
0·25 or 0·75   0.7            0.5                        176     227
0·25 or 0·75   0.8            0.6                        137     180
0·25 or 0·75   0.7            0.45                       116     150
0·25 or 0·75   0.8            0.55                       92      121
0·25 or 0·75   0.8            0.5                        66      87
0·1 or 0·9     0.6            0.4                        427     546
0·1 or 0·9     0.7            0.5                        371     479

Example

A researcher wishes to study the agreement between family doctors on

whether or not to prescribe an antibiotic for uncomplicated rhinitis. The

prevalence of antibiotic prescribing is about 25%. She would like to show that

the kappa value for agreement is better than 0·5. She hypothesises that the

true kappa might be between 0·7 and 0·8.

Looking at the table, if the true kappa is 0·7, she will need to compare the

doctors’ ratings for 176 patients to have a 90% power to outrule a kappa as low

as 0·5. On the other hand, if she is content to outrule a kappa as low as 0·45,

a true kappa of 0·7 would give her 90% power with a sample of 116.

Limitations of these tables

There are so many potential combinations of prevalence, kappa-to-be-outruled

and hypothesised kappa that these tables can only give an approximate idea of

the numbers involved. And they don’t cover cases where the two raters have

diﬀerent prevalences (which would indicate systematic disagreement!), or

where there are more than two raters etc. To get precise calculations for a

wide variety of scenarios, I recommend using the R package irr.

Reference

These sample sizes were calculated with the N.cohen.kappa command in the

irr package in R. The command uses a formula published in

Cantor, A. B. (1996) Sample-size calculation for Cohen's kappa. Psychological

Methods, 1, 150-153.

The sample sizes in the table were produced using variations on this command:

N.cohen.kappa(0.1, 0.1, 0.5, 0.8, power=.95)
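For readers without R, the calculation can be sketched in Python using only the standard library. This is an independent sketch of the large-sample formula as I understand it from Cantor (1996), not the source code of the irr package. The function names are hypothetical, the argument names k1 and k0 follow my reading of the irr documentation, and the one-sided test at alpha 0·05 and the default power are assumptions. The two kappa values are passed in the same order as in the command above. The sketch has only been checked for the case where both raters give the same prevalence.

```python
from math import ceil, sqrt
from statistics import NormalDist


def kappa_variance_term(rate1, rate2, kappa):
    """Large-sample variance term Q for two raters and a present/absent rating."""
    row = (rate1, 1 - rate1)  # rater 1: proportion rated present, absent
    col = (rate2, 1 - rate2)  # rater 2: proportion rated present, absent
    p_chance = row[0] * col[0] + row[1] * col[1]
    p_agree = kappa * (1 - p_chance) + p_chance
    # Cell probabilities implied by the marginals and the agreement rate.
    p11 = (p_agree + row[0] - col[1]) / 2  # both raters say present
    p22 = p_agree - p11                    # both raters say absent
    p12 = row[0] - p11                     # rater 1 present, rater 2 absent
    p21 = col[0] - p11                     # rater 1 absent, rater 2 present
    diagonal = (
        p11 * ((1 - p_chance) - (col[0] + row[0]) * (1 - p_agree)) ** 2
        + p22 * ((1 - p_chance) - (col[1] + row[1]) * (1 - p_agree)) ** 2
    )
    off_diagonal = (1 - p_agree) ** 2 * (
        p12 * (col[0] + row[1]) ** 2 + p21 * (col[1] + row[0]) ** 2
    )
    correction = (p_agree * p_chance - 2 * p_chance + p_agree) ** 2
    return (diagonal + off_diagonal - correction) / (1 - p_chance) ** 4


def n_cohen_kappa(rate1, rate2, k1, k0, alpha=0.05, power=0.8):
    """Number of rated cases needed; arguments in the order used in the text."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha)  # one-sided test (assumption)
    z_power = normal.inv_cdf(power)
    numerator = (z_alpha * sqrt(kappa_variance_term(rate1, rate2, k0))
                 + z_power * sqrt(kappa_variance_term(rate1, rate2, k1)))
    return ceil((numerator / (k1 - k0)) ** 2)


print(n_cohen_kappa(0.1, 0.1, 0.5, 0.8, power=0.95))
```

Called with the same values as the R command, the sketch returns 183, which agrees with the tabulated sample size for 95% power.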

Sample sizes for kappa for two raters (continued)

Prevalence     Hypothesised   Kappa to be outruled       90%     95%
of feature     kappa          (null hypothesis kappa)    power   power
0·1 or 0·9     0.8            0.6                        292     382
0·1 or 0·9     0.7            0.45                       242     313
0·1 or 0·9     0.8            0.55                       194     255
0·1 or 0·9     0.8            0.5                        139     183

6. Sample size for pilot studies

Introduction

The sample size methods used so far presuppose that the investigator has some

kind of knowledge that can be used to make informed guesses about such

things as prevalences, eﬀect sizes etc. However, by their very essence pilot

studies are carried out when the researcher is facing the unknown. Even so,

there are some general principles which can be applied to ensure that enough

data are captured by a pilot study to inform subsequent study design with the

smallest use of resources.

Sample size: the law of diminishing returns

Sample size for pilot studies starts with the observation that each participant

that you recruit into a study yields less information than the last one. This law

of diminishing returns can be used to deﬁne a point beyond which recruiting

additional participants will yield minimal improvement in estimating effects.

Calculations by Julious (2005) and Van Belle (2008) both show that in studies

that compare the means of two groups, if you carry on recruitment beyond a

sample size of 12 per group the eﬀect of each additional participant on the

precision is minimal. If your pilot study is purely exploratory and your aim is to

get a preliminary estimate of the diﬀerence between two groups, then a sample

size of 12 per group can be justiﬁed on the basis of these references.
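The law of diminishing returns is easy to see numerically. The sketch below is not the calculation used by Julious or Van Belle. It is a simple illustration that assumes a known standard deviation of 1 in both groups, so that the standard error of the difference between two means is the square root of 2/n. The function names are hypothetical.

```python
from math import sqrt


def se_difference(n_per_group, sd=1.0):
    """Standard error of the difference between two group means (known SD assumed)."""
    return sd * sqrt(2 / n_per_group)


def gain_from_one_more(n_per_group):
    """Reduction in the standard error from adding one participant per group."""
    return se_difference(n_per_group) - se_difference(n_per_group + 1)


for n in (4, 8, 12, 16, 20, 24):
    print(f"n per group = {n:2d}   SE = {se_difference(n):.3f}   "
          f"gain from one more = {gain_from_one_more(n):.3f}")
```

The gain from each extra participant shrinks quickly as the groups grow, which is the pattern the rule of thumb relies on.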

Sample size to justify carrying out a full study

Sometimes there are cases when the investigator will have a preliminary

estimate of the minimum diﬀerence between groups that would constitute a

clinically signiﬁcant diﬀerence. The purpose of the pilot study is to justify

carrying out a full study. For example, before conducting a study of the eﬀects

of a physiotherapy programme on balance in the elderly, the investigators

might be required to do a pilot to show that there were grounds for believing

that such a programme would produce a clinically signiﬁcant improvement in

balance.

Cocks et al (2013) provide an algorithm for estimating the size of a pilot study

that will give the ‘go-ahead’ to a main study. Their rule of thumb, based on

calculated sample sizes for various scenarios, is to recruit 9% of the

projected ﬁnal sample, or 20 participants, whichever is the greater, as a pilot. If

there is no diﬀerence between the groups, then it is unlikely that the true eﬀect

size is as large as the one speciﬁed by the investigators. Note that this

conclusion is based on an 80% conﬁdence interval, not the usual 95%. If you

are using this method, please read Cocks’ paper for further detail and worked

examples.

Method

Calculate the sample size from section 2.1.

Use 9% of this sample size or 20 participants, whichever is the greater, as the size of the pilot.

If, when you analyse the pilot study, there is no signiﬁcant diﬀerence between

the groups, it is unlikely that the eﬀect size reaches clinical signiﬁcance.
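The rule of thumb in the method above can be written as a one-line calculation. The function name is hypothetical, and rounding the 9% up to a whole participant is my assumption. The text does not say how to round.

```python
def pilot_sample_size(main_study_n):
    """Pilot size: 9% of the main study sample or 20, whichever is greater."""
    nine_percent = (9 * main_study_n + 99) // 100  # 9% rounded up, integer arithmetic
    return max(nine_percent, 20)


for main_n in (100, 172, 500):
    print(main_n, pilot_sample_size(main_n))
```

For main studies of up to about 220 participants, the floor of 20 is what determines the pilot size.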

References

Cocks, K. & Torgerson, D.J., 2013. Sample size calculations for pilot randomized

trials: a confidence interval approach. Journal of Clinical Epidemiology, 66(2),

pp.197–201.

Julious, S.A., 2005. Sample size of 12 per group rule of thumb for a pilot study.

Pharmaceutical Statistics, 4(4), pp.287–291. Available at:

http://onlinelibrary.wiley.com/doi/10.1002/pst.185/abstract.

van Belle, G., 2008. Sample Size. In Statistical Rules of Thumb. Wiley,

Chichester. pp. 27–51. Download from

http://vanbelle.org/chapters/webchapter2.pdf

7. Sample size for animal experiments in which not enough is known to calculate statistical power

In animal experiments, the investigator may have no prior literature to turn to.

The potential means and standard deviations of the outcomes are unknown,

and there is no reasonable way of guessing them. In a case like this, sample

size calculations cannot be applied.

The resource equation method

The resource equation method can be used for minimising the number of

animals committed to an exploratory study. It is based on the law of diminishing

returns: each additional animal committed to a study tells us less than the one

before it. The aim is to reach the threshold where adding further animals will be uninformative. It

should only be used for pilot studies or proof-of-concept studies.

Applying the resource equation method

1. How many treatment groups will be involved? Call this T.

2. Will the experiment be run in blocks? If so, how many blocks will be used?

Call this B

A block is a batch of animals that are tested at the same time. Each block may

have a diﬀerent response because of the particular conditions at the time they

were tested. Incorporating this information into a statistical analysis will

increase statistical power by removing variability between experimental

conditions on diﬀerent days.

3. Will the results be adjusted for any covariates? If so, how many? Call this C

Covariates are variables that are measured on a continuous scale, such as the

weight of the animal or the initial size of the tumour. Results can be adjusted

for such variables, which increases statistical power.

4. Combine these three ﬁgures:

(T–1) + (B+C–1) = D

5. Add at least 10 and at most 20

The sample size should be at least (D+10) and at most (D+20).

Example of the resource equation method

An investigator wishes to examine the eﬀect of a new delivery vehicle for an

anti-inﬂammatory drug. The experiment will involve four treatments: a control,

a group receiving a saline injection, a group receiving the vehicle alone and a

group receiving the vehicle plus drug. Because of laboratory limitations, only

four animals can be done on one day. The experimenter doesn't plan on

adjusting the results for factors like the weight of the animal.

In this case, T (treatments) is 4 and C (covariates) is zero. So the sample size is

at least 10 + (T–1) which is 10 + 3, which is 13. However, 13 animals will have

to be done in at least 3 batches (assuming that the lab could manage a batch of

ﬁve). This means that the experiment will probably have a minimum of 3

blocks, and more likely four. So, taking the blocks into consideration, the

minimum sample size will be 10 + (T–1) + (B–1), which is 10 + 3 + 3, which is

16 animals.

The experimenter might like to aim for the maximum number of animals, to

reduce the possibility that the experiment will come to a false-negative

conclusion. In this case, 20 + (T–1) suggests 23 animals, which will have to be

done in 6 blocks of four. 20 + (T–1) + (B–1) is 28, which means running 7

blocks of four, which requires another adjustment: an extra animal is needed

because the number of blocks is now 7. The ﬁnal maximum sample size is 29.

As you can see, when you are running an experiment in blocks, the sample size

will depend on the number of blocks, which, in turn, may necessitate a small

adjustment to the sample size.
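The arithmetic in the worked example above can be sketched as a small function. The function name is hypothetical. Treating the block and covariate term as zero when there are no blocks and no covariates is my reading of the example, where the unblocked minimum is given as 10 + (T–1).

```python
def resource_equation_bounds(treatments, blocks=0, covariates=0):
    """Minimum and maximum sample size from the resource equation method."""
    # D = (T - 1) + (B + C - 1); the second term is taken as zero when there
    # are no blocks and no covariates (an assumption based on the example).
    d = (treatments - 1) + max(blocks + covariates - 1, 0)
    return d + 10, d + 20


print(resource_equation_bounds(4))            # four treatments, no blocks
print(resource_equation_bounds(4, blocks=4))  # minimum with four blocks
print(resource_equation_bounds(4, blocks=7))  # maximum with seven blocks
```

Because the number of blocks depends on the sample size, the function may need to be rerun with the new block count, as in the example.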

Why do investigators use groups of 6 animals?

In early-stage research, most of the eﬀects discovered will be dead ends. For

this reason, researchers are only interested in pursuing diﬀerences between

groups that are very large indeed. As can be seen from the table under

“comparing the means of two groups”, two groups of 6 animals will detect a

situation in which the scores of one group are almost entirely distinct from the

scores of the other – there is a 92% chance that an animal in the high-scoring

group will score higher than an animal in the low-scoring group.
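The 92% figure can be illustrated with a short calculation. The sketch assumes that scores in both groups are normally distributed with the same standard deviation, and that the difference between the group means is about two standard deviations. The effect size of 2 is my assumption and is not stated in this section. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def probability_of_superiority(effect_size):
    """Chance that a random member of the higher group outscores one from the lower.

    effect_size is the difference between the group means in SD units.
    """
    return NormalDist().cdf(effect_size / sqrt(2))


print(round(probability_of_superiority(2.0), 2))
```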

“Everyone else used 6” is not a sample size calculation

Researchers should remember that this precludes the power to detect smaller

diﬀerences, and justify their sample sizes based on the statistical power and

the requirement for clinically signiﬁcant eﬀects to be very large. It’s not

enough to say that everyone else used groups of 6.

8. Sample size for qualitative research

Issues

Qualitative researchers often regard sample size calculations as something that

is only needed for quantitative research. However, qualitative research

protocols typically contain statements like "participants will be recruited until

data saturation occurs". So there is already an appreciation that a certain

number of participants will be "enough participants".

Clearly, it is important when planning (and especially budgeting) a qualitative

research project to know how many participants will be needed. These

guidelines are partly derived from an excellent paper by Morse.

General guidance

Over-estimate your sample size when writing a proposal and budgeting it. This

gives you some insurance against diﬀiculties in recruitment, participants whose

data is not very useful and other unanticipated snags.

Speciﬁc factors aﬀecting sample size

Scope of study and nature of the topic

If the scope of the study is broad, then more participants will be needed to

reach saturation. Indeed, broad topics are more likely to require data from

multiple data sources. Doing justice to a broad topic requires a large

commitment of time and resources, including large amounts of data. Broad

studies should not be undertaken unless they are well-supported and have a

good chance of achieving what they set out to do.

If the study addresses an obvious, clear topic, and the information will be easily

obtained from the participants, then fewer participants will be needed. Topics

that are harder to grasp and formulate are often more important, but require

greater skill and experience from the researcher, and will require more data.

If the study topic is one about which people will have trouble talking (because

it is complex, or embarrassing, or may depend on experiences which not

everyone has) you will need more participants.

Quality of data and sample size

The ability of participants to devote time and thought to the interview, and to

articulate their experiences and perceptions, and to reﬂect on them, will all

aﬀect the richness of the data. In particular, in some studies, participants may

not be able to devote time to a long interview, or may not be physically or

psychologically capable of taking part in a long interview, resulting in smaller

amounts of data from each interview. Where interviews are likely to be lower in

information, larger sample sizes are needed.

On the other hand, when participants are being interviewed several times, this

will generate more data, and sample sizes will be smaller.

Variability and sample size

The more variable the experiences, perceptions and meanings of the

participants, the more participants will be needed to achieve the same degree

of saturation.

Shadowed data and sample size

This is a term coined by Morse for situations in which participants talk about

the experiences of others. You might call it 'secondhand data'. Collecting such

data can make interviews more information rich and make better use of each

participant, reducing the total sample size required. In particular, encouraging

participants to compare and contrast their experiences, views and meanings

with those of others can throw important light on variability in the domain you

are studying. However, shadowing is no substitute for collecting ﬁrst hand

data, and may introduce bias.

So how many?

Morse recommends that semi-structured interviews with relatively small

amounts of data per person should have 30 to 60 interviews. On the other

hand, grounded theory research, with two to three unstructured interviews per

person, should need 20 to 30 participants. In either case, the ﬁnal choice of

number should be guided by the other factors above.

A failsafe approach based on failure to detect

One question that a qualitative researcher should think about is this: if

something doesn't emerge in my research (an attitude, an experience etc) then

how common could it be in the population I am researching? Research, to be

valid, must have a reasonable chance of detecting things that are common

enough to matter. Failure to detect something important is a risk in all

research, qualitative and quantitative. While you cannot guarantee that your

research will absolutely detect everything important, you can at least make an

estimate of the likelihood that your sample will fail to include at least one

important topic/view/meaning etc.

The table shows numbers of participants and, for each number, shows how rare

a theme, experience or meaning would have to be so that it was unlikely to be

detected by the study.

Table 8.1 Sample size and likelihood of missing something important in qualitative research

Size of    If you don't find something, the    That's roughly
study      maximum likely prevalence is
60         6%                                  1 person in 20
40         9%                                  1 person in 10
30         13%                                 1 person in 8
20         18%                                 1 person in 6
15         25%                                 1 person in 4
10         37%                                 1 person in 3
8          46%                                 1 person in 2
5          74%                                 3 people in 4

As you can see, if a study of 60 people fails to identify a theme, experience or

issue, that issue is probably rare – present in about one person in 20 or fewer.

However, a study of 15 participants can fail to identify something which is

present in one person in every four! And a study of 8 participants is quite likely

to fail to find out things that affect half of the study population.

Clearly, shadowing (second hand data) can reduce these error rates by getting

participants to talk about others, but this is no substitute for including the

others in the research. Part of this is trying to choose a sample in such a way as

to span the population, but this relies on knowing the factors that make for

diversity in the population – something that may only become clear after the

research is well under way.

However, both expert opinion in the area of qualitative research and the table

above suggest that samples of fewer than 20 participants have to be justified on

the grounds that they are unusually rich in data and representative.

Method

The table was calculated based on Poisson confidence intervals for zero

observed frequencies at the given sample sizes, using Stata Release 14.1.
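The figures in Table 8.1 can be approximated without Stata. The sketch below assumes that the tabulated value is the upper limit of the exact Poisson confidence interval when zero cases are observed, that is, a one-sided 97·5% limit of −ln(0·025) expected cases divided by the sample size. That reading of the Method note is my assumption, and the function name is hypothetical.

```python
from math import log


def max_likely_prevalence(n_participants, confidence=0.975):
    """Upper limit for prevalence when a theme is seen in none of n participants."""
    upper_expected_count = -log(1 - confidence)  # about 3.69 for 97.5%
    return min(1.0, upper_expected_count / n_participants)


for n in (60, 40, 20, 15, 10, 8, 5):
    print(f"{n:2d} participants: {100 * max_likely_prevalence(n):.0f}%")
```

This reproduces the tabulated percentages for the sample sizes shown. For 30 participants the sketch gives about 12%, slightly below the 13% in the table.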

References and further reading

Boddy CR. Sample size for qualitative research. Qualitative Mrkt Res: An Int J.

2016 Sep 12;19(4):426–32.

Marshall B, Cardon P, Poddar A, Fontenot R. Does sample size matter in

qualitative research: a review of qualitative interviews in IS research. Journal of

Computer Information Systems 2013.

Morse JM. Determining Sample Size. Qual Health Res. 2000 Jan 1;10(1):3–5.

Morse JM. Analytic Strategies and Sample Size. Qual Health Res. SAGE

Publications; 2015 Oct;25(10):1317–8.

Thomson SB. Sample Size and Grounded Theory. JOAAG. 2011 Mar 9;5(6):45–52.

van Rijnsoever FJ. (I Can’t Get No) Saturation: A simulation and guidelines for

sample sizes in qualitative research. Derrick GE, editor. PLoS ONE. 2017 Jul

26;12(7):e0181689–17.

9. Resources for animal experiments

Festing, Michael FW, and Douglas G. Altman. "Guidelines for the design and

statistical analysis of experiments using laboratory animals." ILAR Journal 43.4

(2002): 244-258. http://ilarjournal.oxfordjournals.org/content/43/4/244.full

This paper appears as part of a collection which you can peruse here:

http://ilarjournal.oxfordjournals.org/content/43/4.toc

Festing, Michael FW. "Design and statistical methods in studies using animal

models of development." ILAR Journal 47.1 (2006): 5-14.

http://ilarjournal.oxfordjournals.org/content/47/1/5.full?sid=6bb505df-77e8-48c3-8b9a-d67bd304deec

10. Computer and online resources

Free, highly recommended package: G*Power

http://gpower.hhu.de/

For applications that go beyond the ones described here, including multiple

regression, I can strongly recommend G*Power, which is free and multi-

platform. There is an excellent manual.

Standard statistical packages

Stata also has a powerful set of sample size routines, and there are many user-

written routines to calculate sample sizes for various types of study. Use the

command findit sample size to get a listing of user-written commands that you

can install.

The free professional package R includes sample size calculation (but requires

a bit of learning). I recommend using software called RStudio as an interface

to R. It makes R far easier to learn and use.

And no; SPSS will sell you a sample size package, but it isn't included with

SPSS itself. If you use SPSS, my advice is to use G*Power and save money.

Sample size calculators and Online resources

You can look for sample size software to download at

http://statpages.org/javasta2.html

The Graph Pad website has a lot of helpful resources

http://graphpad.com/welcome.htm

They make an excellent sample-size calculator application called StatMate

which gets high scores for a simple, intelligent interface and very useful

explanations of the process. It has a tutorial that walks you through.

http://graphpad.com/scientific-software/statmate/

The OpenEpi website, which you can download to your computer for offline

use, has some power calculations

http://www.openepi.com/Menu/OE_Menu.htm

There is a free Windows power calculation program at Vanderbilt Medical

Center http://biostat.mc.vanderbilt.edu/wiki/Main/PowerSampleSize

G*Power is a very comprehensive package for both Windows and Mac, available

from http://gpower.hhu.de/

Online sample size calculators

WebPower

A splendid site that also oﬀers an R package. It has a very comprehensive suite

of power and sample size calculation methods. It also allows you to create a

user ID so that you can save your work. There is a comprehensive manual.

Recommended.

https://webpower.psychstat.org/wiki/

Manual, which has lots of useful reading, here:

https://webpower.psychstat.org/wiki/_media/grant/webpower_manual_book.pdf

Power and sample size

http://powerandsamplesize.com/

Excellent site with well-designed and validated calculators for a wide variety of

study designs. Recommended.

Sealed Envelope power calculators

Calculations for clinical trials (the company provides support for clinical trials)

including equivalence and non-inferiority trials

https://www.sealedenvelope.com/power/

Simple Interactive Statistical Analysis (SISA)

http://www.quantitativeskills.com/sisa/calculations/sampshlp.htm

Easy-to-use with good explanations but a smaller selection of study designs.

The survey system and Survey Monkey

http://www.surveysystem.com/sscalc.htm

https://www.surveymonkey.com/mp/sample-size-calculator/

Sample sizes for surveys. Survey Monkey has a very readable web page on

sample size considerations.

Harvard sample size calculators

http://hedwig.mgh.harvard.edu/sample_size/size.html

A small selection, but clearly organised by study type.

Rules of thumb

Gerard van Belle's chapter on rules of thumb for sample size calculation can be

downloaded from his website (http://www.vanbelle.org/). It's extracted from his

book.
