
STATISTICS IN MEDICINE

Statist. Med. 2009; 28:1159–1175

Published online 23 January 2009 in Wiley InterScience

(www.interscience.wiley.com) DOI: 10.1002/sim.3531

TUTORIAL IN BIOSTATISTICS

Recommended tests for association in 2×2 tables

Stian Lydersen1,∗,†, Morten W. Fagerland2 and Petter Laake3

1Unit for Applied Clinical Research, Department of Cancer Research and Molecular Medicine,

Norwegian University of Science and Technology, Trondheim, Norway

2Ullevål Department of Research Administration, Oslo University Hospital, Norway

3Department of Biostatistics, University of Oslo, Norway

SUMMARY

The asymptotic Pearson’s chi-squared test and Fisher’s exact test have long been the most used for testing

association in 2×2 tables. Unconditional tests preserve the significance level and generally are more

powerful than Fisher’s exact test for moderate to small samples, but previously were disadvantaged by

being computationally demanding. This disadvantage is now moot, as software to facilitate unconditional

tests has been available for years. Moreover, Fisher’s exact test with mid-p adjustment gives about the

same results as an unconditional test. Consequently, several better tests are available, and the choice of

a test should depend only on its merits for the application involved. Unconditional tests and the mid-p

approach ought to be used more than they now are. The traditional Fisher’s exact test should practically

never be used. Copyright © 2009 John Wiley & Sons, Ltd.

KEY WORDS: 2×2 tables; Fisher’s exact test; Pearson’s chi-squared test; unconditional tests; mid-p-value

1. INTRODUCTION

A 2×2 table is a way of summarizing the observed cross-classification of two dichotomous random

variables. An example shown in Table I comprises the results of a double blind trial of high dose

versus standard dose of epinephrine in children with cardiac arrest [1]. One of the 34 children in the

high dose group survived 24h, while 7 of the 34 children in the standard dose group survived 24h.

Another example is shown in Table II, which comprises a classification of CHRNA4 genotypes

and presence of exfoliation syndrome in the eyes [2].

Association in 2×2 tables traditionally has been tested using the asymptotic Pearson’s chi-squared test for larger samples and Fisher–Irwin’s exact test, usually called Fisher’s exact test,

for smaller samples. But these two tests have lesser-known drawbacks. The asymptotic test may

∗Correspondence to: Stian Lydersen, Unit for Applied Clinical Research, Department of Cancer Research and

Molecular Medicine, NTNU, N-7489 Trondheim, Norway.

†E-mail: stian.lydersen@ntnu.no

Received 20 February 2008

Accepted 8 December 2008


Table I. Treatment of children with cardiac arrest. High dose versus standard dose epinephrine [1].

                     Survival at 24 h
Treatment          Yes     No      Sum
High dose            1     33      34*
Standard dose        7     27      34*
Sum                  8     60      68*

An asterisk * denotes the sums fixed by design.

Table II. Genotype (CHRNA4-CC versus CHRNA4-CT or -TT) and presence of exfoliative syndrome in the eyes (XFS) [2].

                        XFS
                   Yes     No      Sum
CHRNA4-CC            0     16      16
CHRNA4-TC/TT        15     57      72
Sum                 15     73      88*

An asterisk * denotes the sum fixed by design.

not preserve the test size, that is, the actual significance level may be higher than the nominal

significance level. Fisher’s exact test is conservative, that is, other tests generally have higher power

yet still preserve test size. In the examples of Tables I and II, neither of these traditional methods

performs well.

Many significance tests are possible, depending on a variety of choices, including

• Level of conditioning in the sample space (possible tables): Should the p-value be computed

unconditionally or conditionally on one or more of the marginal sums in the observed table?

• Choice of test statistic, such as Pearson’s or Fisher’s.

• Exact or asymptotic calculation of the p-value.

• Further adjustments, such as the mid-p-value.

The number of conceivable tests is large. For example, Martín Andrés and Silva Mato [3] studied

60 asymptotic tests for comparing binomial proportions. Exact analysis of discrete data, including

2×2 tables, is described in a more general framework in a recent book by Hirji [4].

In the present article, we describe the main principles underpinning various tests in 2×2 tables

and recommend when each of them might best be used. Estimation and confidence intervals for

effect size are outside the scope of the article. The common experimental designs underlying 2×2

tables are defined in Section 2. The most common test statistics are defined in Section 3. Various

ways of defining and computing the p-value are described in Section 4. Section 5 summarizes

what is known about the tests. Some readers may wish to skip Sections 3–5 and go straight to

Section 6, where we summarize the definitions and properties of the most common tests, and the

recommended tests. Examples illustrating the tests are given in Section 7. Power and sample size

calculations are briefly covered in Section 8. Our recommendations for choice of tests are given

in Section 9.


2. EXPERIMENTAL DESIGNS AND HYPOTHESES

The counts of a 2×2 table may be summarized as illustrated in Table III. Such a table may result

from different designs or sampling models, as described below.

2.1. Both margins fixed design

The classic example is ‘a lady tasting a cup of tea’ [5]. Fisher’s colleague Muriel Bristol claims

she can taste whether milk or tea was added first to her cup. Four tea first and four milk first cups

are presented to her in randomized order. She is told that there are four of each kind and is asked

to identify which is which. A possible result is given in Table IV. In this design, the row sums

as well as the column sums are fixed beforehand. Such a design is hardly ever used in practice

[6]. But understanding this design is important in understanding the nature of Fisher’s exact test

discussed in Section 3 below.

2.2. One margin fixed design

In clinical trials, one set of marginal sums usually is fixed beforehand, typically the number of

patients in each treatment group, such as in the epinephrine example of Table I. This is also the

basic design in case-control studies in epidemiology. The one margin fixed design is usually used

for comparing two binomial proportions. We assume, without loss of generality, that the fixed

margin comprises the row sums.

2.3. Total number fixed design

In cross-sectional studies of association, usually only the total sum N is fixed beforehand. This is

the case for the exfoliation example in Table II, where only the total number of patients, N =88,

was fixed before genotype determination and eye examination.

Table III. The general counts of a 2×2 table.

                  j
              1       2       Sum
i     1      n11     n12     n1+
      2      n21     n22     n2+
Sum          n+1     n+2     N

Table IV. Fisher’s tea-drinker.

                  Guess poured first
Poured first     Milk    Tea     Sum
Milk               3      1      4*
Tea                1      3      4*
Sum                4*     4*     8*

An asterisk * denotes the sums fixed by design.


We consider the null hypothesis of no association between the variables defining rows and

columns, for example, that the probability of success is independent of treatment. The present

article focuses on the one margin fixed design and the total sum fixed design. The null hypothesis

is formalized in Table V. If a row sum or column sum is zero, the table is uninformative about

association. We assume that the row sums and column sums are nonzero. Most of the tests discussed

are two-tailed, testing whether association is random or not.

3. COMMON TEST STATISTICS

A test statistic is a function of the observations, providing a measure of the observed table’s

compliance with the null hypothesis. Unless otherwise stated, we define a test statistic T in such

a way that it is non-negative and tables with large values agree less with the null hypothesis than

do tables with lower values.

Let n denote the observed table (Table III) with marginal sums n+=(n1+,n2+,n+1,n+2). Under

H0, the estimated expected counts are

$$m_{ij} = \frac{n_{i+}\, n_{+j}}{N} \qquad (1)$$

The most used test statistics are those defined for Pearson’s chi-squared test, the likelihood ratio

(LR) test, and Fisher’s exact test. Pearson’s chi-squared test statistic is

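$$T_{\mathrm{Pe}}(n) = \sum_{i,j} \frac{(n_{ij}-m_{ij})^2}{m_{ij}} = \frac{N (n_{11}n_{22}-n_{12}n_{21})^2}{n_{1+}\, n_{2+}\, n_{+1}\, n_{+2}} \qquad (2)$$

and the LR test statistic is

$$T_{\mathrm{LR}}(n) = -2\log\frac{L_0}{L_1} = 2\sum_{i,j} n_{ij}\log\left(\frac{n_{ij}}{m_{ij}}\right), \quad \text{where a term is 0 if } n_{ij}=0 \qquad (3)$$

The maximum likelihoods of the table under H0 and H1 are L0 and L1. Fisher’s statistic is defined as the conditional probability

$$P(n \mid n_+) = \binom{n_{+1}}{n_{11}} \binom{n_{+2}}{n_{1+}-n_{11}} \bigg/ \binom{N}{n_{1+}} \qquad (4)$$

with small values providing evidence against H0. All the sampling models described previously give the same results (1) to (4).

For testing equality between two proportions (one margin fixed design), another possible test statistic is the normalized difference between the observed proportions

$$z = \frac{\dfrac{n_{11}}{n_{1+}} - \dfrac{n_{21}}{n_{2+}}}{\sqrt{\dfrac{n_{+1}}{N} \cdot \dfrac{n_{+2}}{N} \left(\dfrac{1}{n_{1+}} + \dfrac{1}{n_{2+}}\right)}} \qquad (5)$$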


Table V. Two common experimental designs for 2×2 tables.

One margin fixed
  Fixed sums: n1+, n2+ (and N = n1+ + n2+)
  Unknown parameters:
               Column 1   Column 2   Total
    Row 1      π1         1−π1       1
    Row 2      π2         1−π2       1
  Probability model: Two independent binomials
    $$P(n) = \binom{n_{1+}}{n_{11}} \pi_1^{n_{11}} (1-\pi_1)^{n_{1+}-n_{11}} \binom{n_{2+}}{n_{21}} \pi_2^{n_{21}} (1-\pi_2)^{n_{2+}-n_{21}}$$
  Null hypothesis: π1 = π2

Total sum fixed
  Fixed sums: N
  Unknown parameters:
               Column 1   Column 2   Total
    Row 1      π11        π12        π1+
    Row 2      π21        π22        π2+
    Total      π+1        π+2        1
  Probability model: Multinomial
    $$P(n) = \frac{N!}{n_{11}!\, n_{12}!\, n_{21}!\, n_{22}!}\, \pi_{11}^{n_{11}} \pi_{21}^{n_{21}} \pi_{12}^{n_{12}} \pi_{22}^{n_{22}}$$
  Null hypothesis: πij = πi+ π+j


However, TPe = z², so this statistic is equivalent to Pearson’s chi-squared statistic. It is sometimes denoted as zpooled, as it is based on the two proportions being equal under H0. If this is not the case, the corresponding normalized difference is

$$z_{\mathrm{unpooled}} = \frac{\dfrac{n_{11}}{n_{1+}} - \dfrac{n_{21}}{n_{2+}}}{\sqrt{\dfrac{n_{11}n_{12}}{n_{1+}^3} + \dfrac{n_{21}n_{22}}{n_{2+}^3}}} \qquad (6)$$

The zunpooled statistic is not recommended for testing the null hypothesis of no association [7]. Likewise, the Santner and Snell statistic [8], which is the unnormalized difference in proportions given by the numerator in (5), is not recommended for testing the null hypothesis of no association [7].

There are continuity corrections for some test statistics, aiming to improve the asymptotics in

certain designs. Best known is Yates’ correction for Pearson’s statistic, described in Section 4

below.
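As an illustration of the statistics defined in (1) to (5), the following minimal sketch evaluates them for the epinephrine data of Table I. The function name `table_statistics` and its argument order are hypothetical choices made for this example and do not come from any software cited in the article.

```python
from math import comb, log, sqrt

def table_statistics(n11, n12, n21, n22):
    """Return Pearson, LR, Fisher probability and pooled z for a 2x2 table."""
    r1, r2 = n11 + n12, n21 + n22          # row sums n1+, n2+
    c1, c2 = n11 + n21, n12 + n22          # column sums n+1, n+2
    N = r1 + r2
    cells = [(n11, r1, c1), (n12, r1, c2), (n21, r2, c1), (n22, r2, c2)]
    # Expected counts m_ij = n_i+ * n_+j / N, equation (1); Pearson, equation (2)
    pearson = sum((n - r * c / N) ** 2 / (r * c / N) for n, r, c in cells)
    # LR statistic, equation (3); a term is 0 when n_ij = 0
    lr = 2 * sum(n * log(n / (r * c / N)) for n, r, c in cells if n > 0)
    # Conditional (hypergeometric) probability of the table, equation (4)
    fisher_prob = comb(c1, n11) * comb(c2, r1 - n11) / comb(N, r1)
    # Pooled normalized difference in proportions, equation (5)
    z = (n11 / r1 - n21 / r2) / sqrt((c1 / N) * (c2 / N) * (1 / r1 + 1 / r2))
    return pearson, lr, fisher_prob, z

# Table I: 1 of 34 survived (high dose), 7 of 34 survived (standard dose)
pearson, lr, fisher_prob, z = table_statistics(1, 33, 7, 27)
print(pearson, lr, fisher_prob, z, z ** 2)
```

For these data the Pearson statistic and the squared pooled z should agree, which is the equivalence TPe = z² noted above.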

4. DEFINING AND COMPUTING THE p-VALUE

In general, the p-value is defined as the probability of the test statistic T being equal to or more extreme than its value for the observed table (tobs)

$$\text{p-value} = P(T \ge t_{\mathrm{obs}} \mid H_0) \qquad (7)$$

In general, H0 is rejected if the p-value does not exceed α, the nominal significance level. If both marginal sums are fixed, the p-value (7) can be computed using the probabilities (4). Else, the calculated p-value depends on the design, as well as the value(s) of the unknown parameter(s), or nuisance parameter(s), under H0. In the one margin fixed design, there is one nuisance parameter, π=π1=π2, the common success probability in row 1 and 2. In the total sum fixed design, the row and column probabilities are unknown, so there are two nuisance parameters, π1+ and π+1. A test is said to preserve test size if the actual significance level does not exceed the nominal significance level, for any value of the nuisance parameter(s). If the actual significance level is lower than α, the test is called conservative. If H0 is rejected only if the p-value is less than α, the test may be unnecessarily conservative, since we have discrete data.

4.1. Exact conditional tests

The hindrance of unknown nuisance parameter(s) is overcome when the conditional p-value, given

the marginal sums n+=(n1+,n2+,n+1,n+2), is computed as

$$\text{Conditional p-value} = P(T \ge t_{\mathrm{obs}} \mid n_+, H_0) \qquad (8)$$

Then, H0 is rejected if the conditional p-value does not exceed α. This is done in Fisher’s exact test, which was first proposed by Irwin [9]. The widespread use of conditional tests for 2×2 tables may be ascribed to their independence of the nuisance parameter(s) and to their computational ease. In the present article, by ‘conditional test’ we mean conditional on row and column sums, unless otherwise stated.


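Any conditional test preserves test size, because

$$
\begin{aligned}
P(\text{Reject } H_0 \mid H_0) &= \sum_{n_+} P(\text{Reject } H_0 \mid n_+, H_0)\, P(n_+ \mid H_0) \\
&= \sum_{n_+} P[\, t_{\mathrm{obs}} \text{ is such that } P(T \ge t_{\mathrm{obs}} \mid n_+, H_0) \le \alpha \,]\, P(n_+ \mid H_0) \\
&\le \sum_{n_+} \alpha\, P(n_+ \mid H_0) = \alpha \sum_{n_+} P(n_+ \mid H_0) = \alpha \cdot 1 = \alpha
\end{aligned}
\qquad (9)
$$

A conditional test can be unnecessarily conservative, with actual significance level notably less than α. There are several approaches for reducing this conservatism, using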

• An asymptotic method, like the asymptotic Pearson’s chi-squared test. However, this may

violate test size seriously for small samples.

• A mid-p-value with an exact conditional test. The test size may be violated, but typically not

much [10]. This approach is called quasi-exact by Hirji et al. [11].

• An unconditional test that preserves test size and has high power. We regard this to be the

best approach in small samples.

4.2. Asymptotic tests

The most used asymptotic tests are probably those using Pearson’s chi-squared statistic (2) and

the LR statistic (3), approximating the p-value as

$$\text{asymp p-value} = P(\chi^2_1 \ge t_{\mathrm{obs}}) \qquad (10)$$

where χ²₁ is chi-squared distributed with one degree of freedom. These asymptotic tests can be used for all designs described above.

Yates’ continuity correction for Pearson’s statistic is given by

$$T_{\mathrm{Pe,CC}}(n) = \sum_{i,j} \frac{(|n_{ij}-m_{ij}| - 1/2)^2}{m_{ij}} = \frac{N \left( |n_{11}n_{22}-n_{12}n_{21}| - \dfrac{N}{2} \right)^2}{n_{1+}\, n_{2+}\, n_{+1}\, n_{+2}} \qquad (11)$$

where mij is given by (1).
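The asymptotic p-value (10) and the effect of Yates’ correction (11) can be sketched as follows for the data of Table I. This is an illustrative example only: the function names are hypothetical, the corrected statistic follows (11) literally without any truncation at zero, and the chi-squared tail is obtained from the standard identity P(χ²₁ ≥ t) = erfc(√(t/2)).

```python
from math import erfc, sqrt

def chi2_upper_tail_1df(t):
    """P(chi-squared with 1 df >= t), using chi2_1 = Z^2 for standard normal Z."""
    return erfc(sqrt(t / 2))

def pearson_asymptotic(n11, n12, n21, n22, yates=False):
    """Pearson statistic (2), or (11) if yates=True, with asymptotic p-value (10)."""
    r1, r2 = n11 + n12, n21 + n22
    c1, c2 = n11 + n21, n12 + n22
    N = r1 + r2
    d = abs(n11 * n22 - n12 * n21)
    if yates:
        d = d - N / 2                      # continuity correction, equation (11)
    t = N * d ** 2 / (r1 * r2 * c1 * c2)
    return t, chi2_upper_tail_1df(t)

t_plain, p_plain = pearson_asymptotic(1, 33, 7, 27)              # Table I
t_yates, p_yates = pearson_asymptotic(1, 33, 7, 27, yates=True)
print(t_plain, p_plain)
print(t_yates, p_yates)
```

The corrected statistic is smaller than the uncorrected one for these data, so its p-value is larger, in line with the conservatism attributed to Yates’ correction in Section 5.3.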

4.3. The mid-p-value

The mid-p-value is defined as

$$\text{mid-p-value} = P(T > t_{\mathrm{obs}}) + \tfrac{1}{2} P(T = t_{\mathrm{obs}}) \qquad (12)$$

and the null hypothesis is rejected if mid-p-value ≤ α. The mid-p procedure was proposed by

Lancaster [12] and recently has gained wider acceptance, see [4,13,14] and references therein.

Barnard [15] recommends reporting both the p-value and the mid-p-value, arguing that the p-

value measures the significance when the data are judged alone and the mid-p-value is suited for

combining evidence from several studies. In fact, for a one-sided test with any discrete test statistic,


we have E(p-value|H0) > 1/2 and E(mid-p-value|H0) = 1/2, see, for example, [4,14]. A theoretical justification for mid-p-values in 2×2 contingency tables is provided by Hwang and Yang [16]. A mid-p test does not guarantee preservation of test size. However, a conditional mid-p test, as described above, approximately preserves test size unconditionally [4].

4.4. Unconditional tests

The exact tests considered thus far are conditional on n+=(n1+,n2+,n+1,n+2). Unconditional tests, on the other hand, assume no marginal sums fixed, save those fixed by design. A complication concerning the unconditional p-value is that P(T≥tobs) depends on the unknown nuisance parameter(s) under H0. To ensure that the test size does not exceed α, the p-value is taken [17,18] as

$$\max_{0 \le \pi \le 1} P(T \ge t_{\mathrm{obs}}; \pi \mid H_0) \qquad (13)$$

Equation (13) applies to a one margin fixed design, where the nuisance parameter is the success probability π=π1=π2 under H0. In a multinomial design (total number fixed design), maximization is performed over the two-dimensional area [0,1]×[0,1] for (π+1, π1+). The unconditional tests described here are for the one margin fixed design, unless otherwise specified.
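A minimal sketch of (13) for the one margin fixed design, using Pearson’s statistic (2) as T, might look as follows. It is not the algorithm of any package cited here: the maximization is approximated by a plain grid search over π, and the grid size, the tie tolerance, and the function names are assumptions made for this example.

```python
from math import comb

def pearson_stat(a, b, n1, n2):
    """Pearson statistic (2) for a successes of n1 in row 1 and b of n2 in row 2."""
    c1 = a + b
    c2 = n1 + n2 - c1
    if c1 == 0 or c2 == 0:                 # a zero column sum is uninformative
        return 0.0
    N = n1 + n2
    return N * (a * (n2 - b) - (n1 - a) * b) ** 2 / (n1 * n2 * c1 * c2)

def binom_pmf(n, p):
    """Binomial probabilities for 0..n successes."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

def unconditional_p(a_obs, b_obs, n1, n2, grid=1000):
    """Approximate (13): maximize P(T >= t_obs; pi) over a grid of pi values."""
    t_obs = pearson_stat(a_obs, b_obs, n1, n2)
    # All outcomes at least as extreme as the observed one (tolerance for ties)
    region = [(a, b) for a in range(n1 + 1) for b in range(n2 + 1)
              if pearson_stat(a, b, n1, n2) >= t_obs - 1e-9]
    best = 0.0
    for i in range(grid + 1):
        pi = i / grid
        f1, f2 = binom_pmf(n1, pi), binom_pmf(n2, pi)
        best = max(best, sum(f1[a] * f2[b] for a, b in region))
    return best

p_uncond = unconditional_p(1, 7, 34, 34)   # Table I
print(p_uncond)
```

Because a grid search can only approach the maximum from below, the result is an approximation; for Table I it should land close to the unconditional p-value of 0.0281 reported in Section 5.1.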

The unconditional test described by Suissa and Shuster [19] is of this type, using Pearson’s

statistic. Barnard’s unconditional test [20] uses a more computationally intensive algorithm for

building a rejection region, and is, to our knowledge, not included in any available software.

StatXact provides the Suissa and Shuster test (somewhat misleadingly named Barnard’s test in

StatXact).

Boschloo [21] and McDonald et al. [22] suggested a raised conditional level of significance to

reduce the conservatism in an exact conditional test. That is, reject H0 if the conditional p-value ≤ α′,

where α′ is the highest number such that P(Reject H0) ≤ α for all parameter values under H0.

This is a valid procedure, see, for example, [18]. An equivalent procedure is to use Fisher’s exact

conditional p-value as a test statistic in an unconditional test, with small values providing evidence

against H0. We call this the Fisher–Boschloo statistic. The resulting unconditional p-value for

Boschloo’s test is interpreted in the usual manner and is compared with the significance level ?.

This p-value can be computed using the Berger software [23]. Some authors modify Boschloo’s

test by using the one-sided Fisher’s exact conditional p-value as test statistic in the two-sided test.

This is done in the software by Martín Andrés [24].
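The Fisher–Boschloo construction described above can be sketched in the same way: Fisher’s two-sided conditional p-value is computed for every possible outcome and then used as the test statistic in (13), with small values counted as extreme. The code below is an illustrative grid-search approximation under assumed names and tolerances, not the Berger software [23].

```python
from math import comb

def fisher_p(a, b, n1, n2):
    """Two-sided (probability-based) Fisher conditional p-value given all margins."""
    c1 = a + b
    denom = comb(n1 + n2, c1)
    probs = {x: comb(n1, x) * comb(n2, c1 - x) / denom
             for x in range(max(0, c1 - n2), min(c1, n1) + 1)}
    p_obs = probs[a]
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))

def boschloo_p(a_obs, b_obs, n1, n2, grid=1000):
    """Unconditional p-value (13) with Fisher's conditional p-value as statistic."""
    stat = {(a, b): fisher_p(a, b, n1, n2)
            for a in range(n1 + 1) for b in range(n2 + 1)}
    s_obs = stat[(a_obs, b_obs)]
    # Small statistic values are evidence against H0
    region = [ab for ab, s in stat.items() if s <= s_obs * (1 + 1e-7)]
    best = 0.0
    for i in range(grid + 1):
        pi = i / grid
        f1 = [comb(n1, k) * pi ** k * (1 - pi) ** (n1 - k) for k in range(n1 + 1)]
        f2 = [comb(n2, k) * pi ** k * (1 - pi) ** (n2 - k) for k in range(n2 + 1)]
        best = max(best, sum(f1[a] * f2[b] for a, b in region))
    return best

p_fisher = fisher_p(1, 7, 34, 34)          # Table I, conditional test
p_boschloo = boschloo_p(1, 7, 34, 34)      # Table I, unconditional test
print(p_fisher, p_boschloo)
```

The unconditional p-value obtained this way cannot exceed the conditional p-value it is built from, which is the sense in which the raised-level procedure reduces conservatism.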

It has been pointed out that the p-value defined by (13) is maximized over all values of π, including values highly unlikely in light of the observations (see, for example, Agresti ([25], pp. 95–96)). This drawback is reduced in the Berger and Boos procedure [26], where the unconditional p-value is taken as

$$\max_{\pi \in C_\gamma} P(T \ge t_{\mathrm{obs}}; \pi) + \gamma \qquad (14)$$

where Cγ is a 100(1−γ) per cent confidence interval for π. Here, γ is taken to be very small, such as 0.001. This procedure also preserves the test size. It is fully implemented in [23,27].

For an unconditional test in the one margin fixed design (two binomials), the Berger software [23] allows N ≤ 1000, with optional Berger and Boos correction. Other relevant software packages are [24,27,28]. In the total sum fixed design, the software packages [23,24] allow sample sizes N ≤ 400 and N ≤ 40, respectively.


An approximate unconditional test may be formed by inserting the maximum likelihood estimate of π in (13) instead of maximizing over π. It seems similar to the conditional mid-p test, in terms of occasionally exceeding the nominal significance level [29].

4.5. One-sided and two-sided tests

The test statistics and p-values defined above are generally for two-sided tests. For a one-sided

test, only outcomes in the direction of the one-sided hypothesis are included in computation of

the p-value.

For two-sided exact tests, the two-sided p-value may alternatively be defined as twice the

smallest tail (TST), that is, twice the smallest of the one-sided p-values, instead of a probability-

based p-value (8). In exact tests, this can make the test slightly more conservative. On the other

hand, with the TST one always is on the safe side: When rejecting a two-sided hypothesis at

level α, the TST method implies that the corresponding one-sided hypothesis will be rejected at

level α/2. These issues are further discussed in Hirji ([4], pp. 206–210), Agresti ([25], p. 93), and

Altman ([30], p. 256).

Note that many software programs compute the one-sided p-value only for the smallest tail.
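To make the two definitions of a two-sided exact p-value concrete, the sketch below computes the one-sided tails, the TST p-value, and the probability-based p-value (8) with Fisher’s statistic for the unbalanced data of Table II. The function names are hypothetical, and capping the TST value at 1 is an assumption of this example rather than something stated in the text.

```python
from math import comb

def conditional_probs(n11, n12, n21, n22):
    """Hypergeometric probabilities (4) of every table with the observed margins."""
    r1, c1 = n11 + n12, n11 + n21
    N = n11 + n12 + n21 + n22
    r2 = N - r1
    denom = comb(N, r1)
    return {x: comb(c1, x) * comb(N - c1, r1 - x) / denom
            for x in range(max(0, c1 - r2), min(c1, r1) + 1)}

def two_sided_p_values(n11, n12, n21, n22):
    """Return lower tail, upper tail, TST p-value and probability-based p-value."""
    probs = conditional_probs(n11, n12, n21, n22)
    lower = sum(p for x, p in probs.items() if x <= n11)
    upper = sum(p for x, p in probs.items() if x >= n11)
    tst = min(1.0, 2 * min(lower, upper))  # twice the smallest tail, capped at 1
    p_obs = probs[n11]
    prob_based = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    return lower, upper, tst, prob_based

lower, upper, tst, prob_based = two_sided_p_values(0, 16, 15, 57)  # Table II
print(lower, upper, tst, prob_based)
```

With equal row sums, as in Table I, the two definitions coincide; in an unbalanced table such as Table II they need not.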

5. WHAT IS KNOWN ABOUT THE TESTS

5.1. Choice of test statistic

For conditional p-values in 2×2 tables, common test statistics like Pearson’s chi-squared, LR,

and Fisher’s statistic are equivalent in two cases [31]:

• If the row sums (or column sums) are equal.

• For one-sided tests, regardless of the marginal sums. Hence, the same applies to two-sided

twice the smaller tail tests.

Else, the three test statistics may produce differing results. Power comparisons using conditional

p-values for two binomials indicate some differences, though there is no general ‘loser’ or ‘winner’

among them [10,32]. The statistics of Fisher, and particularly Pearson, tend to be slightly more

powerful than LR. However, the Fisher statistic seems more robust to design and rarely performs

poorly. The same conclusions may be made for r×c tables with one fixed margin [33].

These results for conditional p-values do not extend to unconditional p-values computed by

(13). For example, the results in Table I give unconditional p-value=0.0281 with Pearson’s and

Fisher’s statistics, and unconditional p-value=0.0402 with the LR statistic. For unconditional tests

for the one margin fixed design, Mehrotra et al. [7] compared the test statistics Pearson (zpooled),

zunpooled, Santner and Snell and Fisher–Boschloo. Pearson and Fisher–Boschloo are recommended,

as they have the highest power. Andres et al. [34] compared 15 test statistics, including the original

test by Barnard. Barnard’s test and a simplified version of Barnard’s test have highest power, but

are considered too computer intensive for practical use. Among the others, Pearson’s chi-squared

and Fisher–Boschloo have power nearly as high as the optimal Barnard’s test [34].

For unconditional tests in the total sum fixed design, we are not aware of comparisons of test

statistics. Hence, we have no reason to recommend other test statistics than in the one margin fixed

design.


5.2. p-value or mid-p-value

The conditional p-value test is conservative. The conditional mid-p test is less conservative, but

does not always preserve test size. However, in some cases, the conditional mid-p test preserves

the nominal level for all values of the nuisance parameter, else it tends to slightly violate test

size occasionally [10]. Hirji et al. [11] carried out a comprehensive comparison of Fisher’s exact

test, Fisher exact mid-p, the asymptotic Pearson’s chi-squared and two other asymptotic tests. For

both one- and two-sided tests, and for a wide range of sample sizes, they found that the actual

significance levels of the mid-p tests tend to be closer to the nominal level as compared with the

other tests. Empirical studies show that the performance of a conditional mid-p test resembles that

of an unconditional test ([4], p. 219).

5.3. Asymptotic tests

The original rationale for using asymptotic expressions was their ease of computation. According

to Cochran’s criterion [35], Pearson’s asymptotic chi-squared test is inaccurate in a 2×2 table if

any of the expected counts are less than five (mij<5). Today, computations for other approaches

are readily performed. A conditional mid-p test can easily be calculated and performs somewhat

better than asymptotic tests also with counts slightly above Cochran’s criterion [11]. Anyway, it

should be noted that Yates’ correction (11) for Pearson’s chi-squared test assumes that all the

marginal sums n+=(n1+,n2+,n+1,n+2) are fixed, which makes it a valid approximation to an

exact conditional test. This correction reduces the numerical value of the test statistic, and hence

reduces the power and significance level of the test, making it overly conservative [6]. We agree

with authors who state that Yates’ correction should no longer be used [4,6,25].

5.4. Conditional or unconditional tests

Unconditional tests are generally more powerful than conditional tests, and have been recommended

recently by many authors, see [7] and references therein. One outstanding comparison is that

Fisher–Boschloo’s test is uniformly more powerful than Fisher’s exact test, because its rejection

region always includes that of Fisher’s exact test [21]. This is true for one-sided tests, and hence

also for TST two-sided tests.

Mehrotra et al. [7] studied the Berger and Boos procedure (14) with γ=0.001 for unconditional

tests with Pearson’s and Fisher–Boschloo’s statistics and found that the procedure gives a slight

improvement in test power. Kang and Ahn [36] pointed out that the Berger and Boos procedure is

particularly useful in extremely unbalanced designs, comparing two binomial proportions where

one sample size is, say, 20 times the other sample size. This seems sensible: The situation is almost

like a one sample test for a binomial probability, where you test if the probability in the small

group equals the empirical probability in the large group. To our knowledge, no research has been

conducted to find optimal values of γ. Most authors use 0.001 or 0.0001. An exception is StatXact

8 using the default value 0.000001.

As an illustration, Figure 1 shows the actual significance level obtained in comparing two groups

with fixed row sums n1+=n2+=34, as in the epinephrine study. Fisher’s exact test is far more

conservative than Fisher–Boschloo’s unconditional test, which also by definition preserves test

size. Fisher’s conditional mid-p test performs about as well as Fisher–Boschloo’s unconditional

test. The mid-p test does not always preserve test size, although, in this case, it does. The Pearson’s

asymptotic test is not conservative, but neither does it preserve test size.


[Figure 1: plot of P(Reject H0) (vertical axis, 0 to 0.06) against the common π (horizontal axis, 0 to 1), with curves for Pearson asymp, Fisher-Boschloo uncond, Fisher exact cond mid p, and Fisher exact cond.]

Figure 1. Actual significance level, two binomials (one margin fixed design), row sums 34, α=0.05.

[Figure 2: plot of P(Reject H0) (vertical axis, 0 to 1) against n per row (horizontal axis, 0 to 80), with curves for Pearson asymp, Fisher-Boschloo uncond, Fisher exact cond mid p, and Fisher exact cond.]

Figure 2. Power, two binomials (one margin fixed design), equal row sums, π1=0.03, π2=0.2, α=0.05.

Figure 2 shows an example of test power as a function of sample size for two equal groups with

success probabilities π1=0.03 and π2=0.2, similar to the observed proportions in the epinephrine

study, for a nominal significance level of α=0.05. The conditional test has lowest power. The

mid-p test and the unconditional test have about the same power.
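Exact power for two binomials can be computed by summing the probabilities, under the alternative, of all outcomes that lead to rejection. The sketch below does this for Fisher’s exact conditional test and its mid-p version with 60 observations per row; the function names and the tie tolerance are assumptions made for this illustration.

```python
from math import comb

def fisher_p(a, b, n1, n2, mid_p=False):
    """Fisher's exact conditional p-value (or mid-p-value) for a of n1 versus b of n2."""
    c1 = a + b
    denom = comb(n1 + n2, c1)
    probs = [comb(n1, x) * comb(n2, c1 - x) / denom
             for x in range(max(0, c1 - n2), min(c1, n1) + 1)]
    p_obs = comb(n1, a) * comb(n2, b) / denom
    tol = 1e-7 * p_obs
    smaller = sum(p for p in probs if p < p_obs - tol)      # more extreme tables
    equal = sum(p for p in probs if abs(p - p_obs) <= tol)  # equally extreme tables
    return smaller + (0.5 if mid_p else 1.0) * equal

def power(n, pi1, pi2, alpha=0.05, mid_p=False):
    """P(reject H0) for equal row sums n and success probabilities pi1, pi2."""
    f1 = [comb(n, k) * pi1 ** k * (1 - pi1) ** (n - k) for k in range(n + 1)]
    f2 = [comb(n, k) * pi2 ** k * (1 - pi2) ** (n - k) for k in range(n + 1)]
    return sum(f1[a] * f2[b]
               for a in range(n + 1) for b in range(n + 1)
               if fisher_p(a, b, n, n, mid_p) <= alpha)

power_fisher = power(60, 0.03, 0.2)
power_mid_p = power(60, 0.03, 0.2, mid_p=True)
print(power_fisher, power_mid_p)
```

Since the mid-p-value never exceeds the corresponding p-value, the mid-p test rejects whenever Fisher’s exact test does, so its power is at least as high.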

The conservatism of conditional tests is more pronounced in balanced designs than in unbalanced

designs. This causes the paradox that a conditional test for a balanced design often gains power

when the sample size is reduced by one in one group [37]. A conditional mid-p test, as well as

an unconditional test, usually loses power with such a sample size reduction.


Unconditional tests have been disputed, as by [38], see also page 191 of [39]. Such dispute is

chiefly philosophical. Clearly, unconditional tests are legitimate when the relevant marginals are

not fixed by design [17,18]. Unconditional tests have higher power than conditional tests. This

may lead to markedly lower sample sizes, even in moderately large samples. For example, to test

for a difference in binomial proportions when π1=0.03, π2=0.2, and α=0.05, the sample size

needed for 80 per cent power is 60 per group with Fisher’s exact test and 52 per group with

Fisher–Boschloo’s unconditional test, even though Martín Andrés et al. [40] report the conditional

test to be acceptable in this case based on an average power consideration. In general, there should

be no reason to condition on quantities not fixed by the design.

6. SUMMARY OF MUCH USED TESTS AND RECOMMENDED TESTS

6.1. Pearson’s chi-squared test

This test uses the test statistic (2), with high values providing evidence against the null hypothesis.

It is an asymptotic test that approximates the p-value by the upper tail probability from the chi-squared distribution with one degree of freedom. Yates’ correction (11) for Pearson’s chi-squared

statistic was originally introduced to make it mimic Fisher’s exact test. Pearson’s chi-squared test

with Yates’ correction should no longer be used.

6.2. Fisher’s exact test

An exact p-value is the exact probability of observing a table at least as extreme as the observed

one, under the null hypothesis. However, in 2×2 tables this probability typically depends on

one or more unknown parameters, such as the common success probability in comparing two

binomials in the one margin fixed design. This obstacle vanishes if we condition on the marginals

(observed row and column sums) n+=(n1+,n2+,n+1,n+2), as if these were fixed by design like in

Fisher’s tea drinker example. The conditional probability (4) of a table given the marginals does not

depend on any unknown parameters. The p-value from the resulting Fisher’s exact conditional test equals the probability of the observed table plus the sum of the probabilities (4) of all other tables whose probability is equal to or smaller than that of the observed table. The resulting test preserves test size but is

unnecessarily conservative with lower power than conditional mid-p tests and unconditional tests.

We do not recommend the use of Fisher’s exact test.

6.3. Fisher’s exact mid-p test

In the mid-p version of Fisher’s exact conditional test, only half the probability of the observed

outcome is included in the mid-p-value. The resulting test is less conservative than Fisher’s

exact test, but the test size is not necessarily preserved. Its performance approximates that of an

unconditional test.

6.4. Exact unconditional tests

In an exact unconditional test, the exact p-value is first computed as a function of the unknown

parameter(s). Then, the p-value is taken as the maximum over all possible values, typically from 0

to 1. Recommended test statistics are Pearson’s chi-squared statistic (the Suissa and Shuster test)

or the conditional p-value from Fisher’s exact test (Fisher–Boschloo’s test). An unconditional test


preserves test size and is usually more powerful than Fisher’s exact test. In fact, the one-sided

Fisher–Boschloo test is uniformly more powerful than the one-sided Fisher’s exact test. If the

Berger and Boos correction is used, the p-value is maximized over a confidence interval of values

rather than the whole interval 0 to 1, and this generally improves the test further. We consider

exact unconditional tests to be the gold standard for testing association in 2×2 tables.

7. EXAMPLES

The various methods may be illustrated by the following examples of computing the p-values for

the data in Tables I and II. In these examples, unconditional tests are our prime recommendations.

In addition, we compute the p-values for Fisher’s exact test, traditionally used in tables with small

counts. This is done partly to show how they differ from p-values for unconditional tests, partly

to illustrate how Fisher–Boschloo’s unconditional p-value is computed.

7.1. The epinephrine study (Table I)

In Fisher’s exact test with the given marginal sums, the possible counts n11are 0,1,...,8. The

conditional sample space, which is a one-dimensional subspace of the two-dimensional uncon-

ditional sample space, consists of nine outcomes, as illustrated in Figure 3. The corresponding

nine conditional probabilities are 0.0025, 0.0247, 0.1021, 0.2253, 0.2910, 0.2253, 0.1021, 0.0247,

0.0025. In fact, since the sample sizes are equal, there are only five possible different values of the

test statistic, and hence only five possible conditional probabilities and five possible p-values. The

exact conditional p-value is the probability of the test statistic being equal to (white in Figure 3)

or more extreme than (grey in Figure 3) its value for the observed table (white with asterisk in

Figure 3)

p-value=0.0025+0.0247+0.0247+0.0025=0.0544

[Figure 3: grid of possible outcomes, n21 (vertical axis, 0 to 34) against n11 (horizontal axis, 0 to 34).]

Figure 3. The conditional p-value for the epinephrine example (Table I) computed over the conditional

sample space of the nine possible outcomes given the marginal sums. The observed outcome is marked ∗.

Possible outcomes with less extreme, equal, and more extreme test statistic than the observed are marked

black, white, and grey, respectively.

[Figure 4: grid of possible outcomes, n21 (vertical axis, 0 to 34) against n11 (horizontal axis, 0 to 34).]

Figure 4. The unconditional p-value for the epinephrine example (Table I) in the one margin fixed design,

computed over all 35×35=1225 possible outcomes in the unconditional sample space. The observed

outcome is marked ∗. Possible outcomes with less extreme, equal, and more extreme test statistic than

the observed are marked black, white, and grey, respectively.

[Figure 5: curve of P(Reject H0) (vertical axis, 0 to 0.03) against π (horizontal axis, 0 to 1).]

Figure 5. P(T≥t_obs; π) for the epinephrine example.

Fisher’s exact conditional mid-p-value is

mid-p-value=0.0544−(1/2)(0.0247+0.0247)=0.0297
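The conditional calculations above can be retraced with a few lines of code. The following is a minimal sketch, assuming the observed table has n11=1 survivor in the high dose group, row sums of 34 and 34, and a column sum of 8 (as described for Table I); the function names are illustrative and not taken from any package cited in this tutorial. Tables are ordered by their conditional probability, which for equal group sizes coincides with the other common two-sided orderings.

```python
from math import comb

def hypergeom_pmf(x, n1, n2, s):
    """P(n11 = x) given row sums n1, n2 and first column sum s, under H0."""
    return comb(n1, x) * comb(n2, s - x) / comb(n1 + n2, s)

def fisher_p_and_mid_p(x_obs, n1, n2, s):
    """Two-sided Fisher p-value and mid-p-value, ordering tables by probability."""
    lo, hi = max(0, s - n2), min(n1, s)
    probs = {x: hypergeom_pmf(x, n1, n2, s) for x in range(lo, hi + 1)}
    p_obs = probs[x_obs]
    tol = 1e-12  # guard against floating-point ties
    equal = sum(p for p in probs.values() if abs(p - p_obs) <= tol)
    more_extreme = sum(p for p in probs.values() if p < p_obs - tol)
    # The mid-p-value counts only half of the "equal" probability mass.
    return probs, more_extreme + equal, more_extreme + 0.5 * equal

# Epinephrine example: 1 of 34 versus 7 of 34 survivors (assumed from Table I).
probs, p_value, mid_p = fisher_p_and_mid_p(x_obs=1, n1=34, n2=34, s=8)
print([round(p, 4) for p in probs.values()])
print(round(p_value, 4), round(mid_p, 4))
```

The nine printed probabilities correspond to n11=0,1,...,8, and the two final numbers can be compared with the p-value and mid-p-value given in the text.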

In principle, the unconditional p-value is found by computing the test statistic for all the

(n1+ + 1)(n2+ + 1)=35×35=1225 possible outcomes. The p-value for Fisher–Boschloo’s unconditional

test is the probability of the test statistic being equal to (white in Figure 4) or more extreme than

(grey in Figure 4) its value for the observed table (white with asterisk in Figure 4). This

p-value depends on the value of the common success probability π under H0, as illustrated in

Figure 5. The unconditional p-value is taken as the maximum of this function, which is

\[
\text{unconditional p-value} = \max_{0 \le \pi \le 1} P(T \ge t_{\mathrm{obs}};\, \pi \mid H_0)
= P(T \ge t_{\mathrm{obs}};\, 0.128 \mid H_0)
= P(T \ge t_{\mathrm{obs}};\, 0.872 \mid H_0) = 0.0281
\]

For the Berger and Boos correction, we first compute a confidence interval for the common success

probability under H0. Since we are performing exact tests, we should use an exact confidence

interval. The Clopper–Pearson interval guarantees that the coverage probability is at least 1−γ. It

is given by (π_l, π_h) where

\[
\pi_l = \left[ 1 + \frac{N - n_{+1} + 1}{n_{+1}\, F^{-1}_{2n_{+1},\,2(N-n_{+1}+1)}(\gamma/2)} \right]^{-1},
\qquad
\pi_h = \left[ 1 + \frac{N - n_{+1}}{(n_{+1}+1)\, F^{-1}_{2(n_{+1}+1),\,2(N-n_{+1})}(1-\gamma/2)} \right]^{-1}
\tag{15}
\]

and F^{-1}_{a,b}(c) is the quantile of the Fisher distribution with a and b degrees of freedom, given

by the inverse distribution function of c. With n+1=8 successes out of N=68 trials, a 1−10⁻³

confidence interval becomes (0.0271, 0.2939), and the corresponding p-value is

\[
\text{p-value} = \max_{0.0271 \le \pi \le 0.2939} P(T \ge t_{\mathrm{obs}};\, \pi \mid H_0) + 10^{-3}
= P(T \ge t_{\mathrm{obs}};\, 0.128 \mid H_0) + 10^{-3}
= 0.0281 + 10^{-3} = 0.0291
\]
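The confidence limits used in the correction can be checked numerically. The sketch below is an assumption-laden illustration rather than the method of any cited software: instead of evaluating Fisher distribution quantiles, it solves the binomial tail equations that define the Clopper–Pearson limits by bisection, which needs only the standard library. The function names and the number of bisection steps are choices made here for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def bisect_root(f, lo, hi, iters=100):
    """Root of a monotone function f on [lo, hi] with a sign change."""
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, level_complement):
    """Exact interval with coverage at least 1 - level_complement."""
    half = level_complement / 2
    # Lower limit solves P(X >= x; p) = half; upper solves P(X <= x; p) = half.
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: 1 - binom_cdf(x - 1, n, p) - half, 0.0, 1.0)
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p) - half, 0.0, 1.0)
    return lower, upper

# Epinephrine example: 8 successes out of 68 trials, complement 10^-3.
pi_l, pi_h = clopper_pearson(8, 68, 1e-3)
print(round(pi_l, 4), round(pi_h, 4))
```

With these limits, the Berger and Boos p-value is obtained by restricting the search for the maximum tail probability to the interval between them and adding the complement 10⁻³ to the result, as in the calculation above.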

7.2. The genotype—exfoliation example (Table II)

Only the total sum N is fixed beforehand. p-values conditional on row and column sums are

computed as before, and, using Fisher’s test statistic, are obtained as p-value=0.0629 and

mid-p-value=0.0447. The p-value for an unconditional test using the Fisher–Boschloo statistic,

computed by the Berger software [23], is 0.0514 without and 0.0486 with the Berger–Boos

correction (γ=10⁻³).

In the above examples, we see that the p-values from the recommended unconditional tests

can be substantially lower than those from the traditional Fisher’s exact test. Also, the conditional

mid-p-value is approximately equal to the unconditional p-value, as expected.

8. POWER AND SAMPLE SIZE CALCULATIONS

Tests may be conditional or unconditional. However, power calculations must, by their nature,

always be performed unconditionally on the marginal sums not fixed by design. Exact power or

sample size calculations for the recommended tests cannot always be performed with existing

software. StatXact provides power calculations for conditional as well as unconditional tests. The

software [28] performs power calculations for all recommended tests of the one margin fixed

design, save for unconditional tests with the Berger and Boos correction. Power calculations for

unconditional tests with the Berger and Boos correction can be performed, slightly conservatively,

as power calculations for unconditional tests without the correction. Computing time may be

excessive for moderate to large sample sizes with unconditional tests. Most commercial software packages

that provide sample size calculations provide asymptotic calculations for asymptotic tests. In cases
