
STATISTICS IN MEDICINE

Statist. Med. 17, 1623—1634 (1998)

A SIMPLE METHOD OF SAMPLE SIZE CALCULATION FOR LINEAR AND LOGISTIC REGRESSION

F. Y. HSIEH¹*, DANIEL A. BLOCH² AND MICHAEL D. LARSEN³

¹ CSPCC, Department of Veterans Affairs, Palo Alto Health Care System (151-K), Palo Alto, California 94304, U.S.A.

² Division of Biostatistics, Department of Health Research and Policy, Stanford University, Stanford, California 94305, U.S.A.

³ Department of Statistics, Stanford University, Stanford, California 94305, U.S.A.

SUMMARY

A sample size calculation for logistic regression involves complicated formulae. This paper suggests use of sample size formulae for comparing means or for comparing proportions in order to calculate the required sample size for a simple logistic regression model. One can then adjust the required sample size for a multiple logistic regression model by a variance inflation factor. This method requires no assumption of low response probability in the logistic model as in a previous publication. One can similarly calculate the sample size for linear regression models. This paper also compares the accuracy of some existing sample-size software for logistic regression with computer power simulations. An example illustrates the methods. © 1998 John Wiley & Sons, Ltd.

INTRODUCTION

In a multiple logistic regression analysis, one frequently wishes to test the effect of a specific covariate, possibly in the presence of other covariates, on the binary response variable. Owing to the non-linearity of the model, the sample size calculation for logistic regression is complicated. Whittemore¹ proposed a formula, derived from the information matrix, for small response probabilities. Hsieh² simplified and extended the formula for general situations by using the upper bound of the formula. Appendix I presents a simple closed form, based on an information matrix, to approximate the sample size for both continuous and binary covariates in a simple logistic regression. In a different approach, Self and Mauritsen³ used generalized linear models and the score tests to estimate the sample size through an iterative procedure. These published methods are complicated and may not be more accurate than the conventional sample size formulae for comparing two means or for testing the equality of two proportions. In the next section, we present a simple formula for the approximate sample size required for simple logistic regression by using formulae for calculating sample size for comparing two means or for comparing two proportions. We can then adjust the sample size requirement for a multiple logistic regression by a variance inflation factor. This approach applies to multiple linear regression as well.

* Correspondence to: F. Y. Hsieh, CSPCC, Department of Veterans Affairs, Palo Alto Health Care System (151-K), Palo Alto, California 94304, U.S.A.

Contract/grant sponsor: Department of Veterans Affairs Cooperative Studies Program

Contract/grant sponsor: NIH

Contract/grant number: AR20610

Contract/grant sponsor: National Institute on Drug Abuse

Contract/grant number: Y01-DA-40032-0

CCC 0277–6715/98/141623–12$17.50

Received February 1997

Revised October 1997

SIMPLE LOGISTIC REGRESSION

In a simple logistic regression model, we relate a covariate $X_1$ to the binary response variable $Y$ in a model $\log(P/(1-P)) = \beta_0 + \beta_1 X_1$, where $P = \mathrm{prob}(Y = 1)$. We are interested in testing the null hypothesis $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 = \beta^*$, where $\beta^* \neq 0$, that the covariate is related to the binary response variable. The slope coefficient $\beta_1$ is the change in log odds for an increase of one unit in $X_1$. When the covariate is a continuous variable with a normal distribution, the log odds value $\beta_1$ is zero if and only if the group means, assuming equal variances, between the two response categories are the same. Therefore we may use a sample size formula for a two-sample t-test to calculate the required sample size. For simplicity, we use a normal approximation instead, as the sample size formula (see formula (7) in Appendix I) may be easily changed to include t-tests if required:

$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{P1(1-P1)\,\beta^{*2}} \qquad (1)$$

where $n$ is the required total sample size, $\beta^*$ is the effect size to be tested, P1 is the event rate at the mean of $X$, and $Z_u$ is the upper $u$th percentile of the standard normal distribution.

When the covariate is a binary variable, say $X = 0$ or 1, the log odds value $\beta_1 = 0$ if and only if the two event rates are equal. The sample size formula for the total sample size required for comparing two independent event rates has the following form (see formula (10)):

$$n = \frac{\left\{ Z_{1-\alpha/2}\,[P(1-P)/B]^{1/2} + Z_{1-\beta}\,[P1(1-P1) + P2(1-P2)(1-B)/B]^{1/2} \right\}^2}{(P1-P2)^2\,(1-B)} \qquad (2)$$

where $P\ (= (1-B)P1 + B\,P2)$ is the overall event rate; $B$ is the proportion of the sample with $X = 1$; P1 and P2 are the event rates at $X = 0$ and $X = 1$, respectively. For $B = 0.5$, the required sample size is bounded by the following simple form (see formula (11)):

$$n < \frac{4P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(P1-P2)^2} \qquad (3)$$

Appendix I presents two forms, formulae (12) and (13), that are simpler than formula (2). A later section presents the comparisons of these formulae with computer power simulations.
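Formulae (1) to (3) can be evaluated directly. The following is a minimal Python sketch, not the authors' software. The function names and the default settings (two-sided 5 per cent level and 95 per cent power, as in Table I) are assumptions made for illustration, and `beta_star` is assumed to be the log odds ratio per standard deviation of the covariate.

```python
from math import sqrt
from statistics import NormalDist


def z(u):
    """Standard normal quantile at cumulative probability u (Z_u in the text)."""
    return NormalDist().inv_cdf(u)


def n_continuous(p1, beta_star, alpha=0.05, power=0.95):
    """Formula (1): total sample size for a normally distributed covariate.

    p1 is the event rate at the mean of X; beta_star is the effect size
    (assumed here to be the log odds ratio per standard deviation of X).
    """
    za, zb = z(1 - alpha / 2), z(power)
    return (za + zb) ** 2 / (p1 * (1 - p1) * beta_star ** 2)


def n_binary(p1, p2, b, alpha=0.05, power=0.95):
    """Formula (2): total sample size for a binary covariate.

    p1, p2 are the event rates at X = 0 and X = 1; b is the proportion
    of the sample with X = 1.
    """
    za, zb = z(1 - alpha / 2), z(power)
    p = (1 - b) * p1 + b * p2  # overall event rate P
    numerator = (
        za * sqrt(p * (1 - p) / b)
        + zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2) * (1 - b) / b)
    ) ** 2
    return numerator / ((p1 - p2) ** 2 * (1 - b))


def n_binary_bound(p1, p2, alpha=0.05, power=0.95):
    """Formula (3): upper bound on the total sample size when B = 0.5."""
    za, zb = z(1 - alpha / 2), z(power)
    p = 0.5 * (p1 + p2)  # overall event rate for a balanced design
    return 4 * p * (1 - p) * (za + zb) ** 2 / (p1 - p2) ** 2
```

For the balanced design with P1 = 0.4 and P2 = 0.5, `n_binary(0.4, 0.5, 0.5)` and `n_binary_bound(0.4, 0.5)` come out at about 1280 and 1286 before rounding up, close to the Table I entries for formulae (2) and (3).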

MULTIPLE LOGISTIC REGRESSION

When there is more than one covariate in the model, a hypothesis of interest is the effect of a specific covariate in the presence of other covariates. In terms of log odds parameters, the null hypothesis for multiple logistic regression is $H_0: [\beta_1, \beta_2, \ldots, \beta_p] = [0, \beta_2, \ldots, \beta_p]$ against the alternative $[\beta^*, \beta_2, \ldots, \beta_p]$. Let $b_1$ be the maximum likelihood estimate of $\beta_1$. Whittemore¹ has shown that, for continuous, normal covariates $X$, the variance of $b_1$ in the multivariate setting with $p$ covariates, $\mathrm{var}_p(b_1)$, can be approximated by inflating the variance of $b_1$ obtained from the one parameter model, $\mathrm{var}_1(b_1)$, by multiplying by $1/(1-\rho^2_{1.23\ldots p})$, where $\rho_{1.23\ldots p}$ is the multiple correlation coefficient relating $X_1$ with $X_2, \ldots, X_p$. That is, approximately

$$\mathrm{var}_p(b_1) = \mathrm{var}_1(b_1)/(1-\rho^2_{1.23\ldots p})$$


The squared multiple correlation coefficient $\rho^2_{1.23\ldots p}$, also known as $R^2$, is equal to the proportion of the variance of $X_1$ explained by the regression relationship with $X_2, \ldots, X_p$. The term $1/(1-\rho^2_{1.23\ldots p})$ will be referred to as a variance inflation factor (VIF). The required sample size for the multivariate case can also be approximated from the univariate case by inflating it with the same factor $1/(1-\rho^2_{1.23\ldots p})$. Following the relationship of the variances, we have $n_p = n_1/(1-\rho^2_{1.23\ldots p})$, where $n_p$ and $n_1$ are the sample sizes required for a logistic regression model with $p$ and 1 covariates, respectively. The same VIF seems to work well for binary covariates (see Appendix III).
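The VIF adjustment amounts to a single division. The sketch below illustrates it under the assumption that the squared multiple correlation of the covariate of interest with the other covariates has already been estimated; the function name and the choice to round up to a whole subject are illustrative, not taken from the paper.

```python
from math import ceil


def inflate_sample_size(n1, r_squared):
    """Inflate a one-covariate sample size n1 by the VIF 1 / (1 - R^2).

    r_squared is the squared multiple correlation between the covariate
    of interest and the remaining covariates; it must lie in [0, 1).
    """
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must satisfy 0 <= r_squared < 1")
    return ceil(n1 / (1 - r_squared))
```

With `r_squared = 0` the sample size is unchanged, and with `r_squared = 0.5` it doubles, which shows how quickly collinearity among covariates raises the requirement.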

MULTIPLE LINEAR REGRESSION

For multiple linear regression models, we can easily derive the same VIF for $p$ covariates (see Appendix II). Therefore, we can similarly adjust the sample size for a regression model with $p$ covariates. It is known that in a simple linear regression model, the correlation coefficient $\rho$ and the regression parameter $\beta_1$ have the relationship $\rho = \beta_1 \sigma_X/\sigma_Y$. Hence $\rho = 0$ if and only if $\beta_1 = 0$. When both $X$ and $Y$ are standardized, testing the hypotheses that $\rho = 0$ and that $\beta_1 = 0$ are equivalent and the required sample sizes are the same.

Let $r$ be the estimate of the correlation coefficient between $X$ and $Y$. The sample size formula (see Sokal and Rohlf⁴) for testing $H_0: \rho = 0$ against the alternative $H_1: \rho = r$ is

$$n_1 = (Z_{1-\alpha/2} + Z_{1-\beta})^2 / C(r)^2 + 3$$

where the Fisher's transformation is $C(r) = \tfrac{1}{2}\log((1+r)/(1-r))$. If we add $p-1$ covariates to the regression model, the required sample size for testing $H_0: [\beta_1, \beta_2, \ldots, \beta_p] = [0, \beta_2, \ldots, \beta_p]$ against the alternative $[\beta^*, \beta_2, \ldots, \beta_p]$ is $n_p = n_1/(1-\rho^2_{1.23\ldots p})$, approximately. If we already have $q$ covariates in the model and would like to expand the model to $p\ (> q)$ covariates, then, from Appendix II, $n_p = n_q\,(\mathrm{var}_p(b_1)/\mathrm{var}_q(b_1)) = n_q/(1-\rho^2_{(1\,q+1\ldots p).(23\ldots q)})$, where the partial correlation coefficient $\rho_{(1\,q+1\ldots p).(23\ldots q)}$ measures the linear association between covariates $X_1$ and $X_{q+1}, \ldots, X_p$ when the values of covariates $X_2, \ldots, X_q$ are held fixed.
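The correlation-based formula and its VIF adjustment can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, and the defaults (two-sided 5 per cent level, 95 per cent power) are assumed to match the settings used elsewhere in the paper.

```python
from math import log
from statistics import NormalDist


def fisher_c(r):
    """Fisher's transformation C(r) = 0.5 * log((1 + r) / (1 - r))."""
    return 0.5 * log((1 + r) / (1 - r))


def n_simple_linear(r, alpha=0.05, power=0.95):
    """Sample size n_1 for testing rho = 0 against rho = r."""
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    return (za + zb) ** 2 / fisher_c(r) ** 2 + 3


def n_expanded(n_smaller, rho_squared):
    """Inflate a sample size by 1 / (1 - rho^2).

    Use the squared multiple correlation when moving from 1 to p
    covariates, or the squared partial correlation when moving from
    q to p covariates.
    """
    return n_smaller / (1 - rho_squared)
```

For example, `n_expanded(n_simple_linear(0.3), 0.2)` gives the approximate requirement when the covariate of interest shares 20 per cent of its variance with the added covariates.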

COMPARISON OF SAMPLE-SIZE SOFTWARE

There are at least two computer programs available that use formula (4) (see Appendix I): nQuery from Dr. Janet Elashoff,⁵ and SSIZE⁶ from the first author. One program, EGRET SIZ from SERC,⁷ uses the approach of Self and Mauritsen.³ For logistic regression, the computer programs nQuery and SSIZE provide sample sizes only for continuous covariates while EGRET SIZ only provides estimates for discrete covariates. Both nQuery and EGRET SIZ are commercial software. Note that the sample size calculation for logistic regression is only one of the many features provided by the above three computer programs.

Table I presents sample size examples for a binary covariate using formula (4) and software EGRET SIZ as well as the corresponding sample size for comparing two proportions (without continuity correction, from formulae (2), (3), (12) and (13)), and the results of power simulations. In the table, P1 and P2 are event rates at $X = 0$ and $X = 1$, respectively; $B$ is the proportion of the sample with $X = 1$; OR is the odds ratio of $X = 1$ versus $X = 0$ such that $\mathrm{OR} = P2(1-P1)/(P1(1-P2))$; $P = (1-B)P1 + B\,P2$ is the overall event rate or case fraction. Table I is designed to show the relationship of sample sizes for different study designs. It is known that a balanced design ($B = 0.5$) requires less sample size than an unbalanced design


Table I. Results of sample size calculations for a binary covariate from six different methods, power = 95 per cent, two-sided significance level 5 per cent

Design                                                                Sample size   Power simulation

Balanced design with high event rates
  (4):  P1 = 0.4, P2 = 0.5, B = 0.5                                          1367   96.0 ± 0.63%
  (2):  P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5                                1282   95.4 ± 0.66%
  (3):  P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5                                1287   94.7 ± 0.71%
  (12): P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5                                1287   94.7 ± 0.71%
  (13): P1 = 0.4, P2 = 0.5, B = 0.5                                          1274   94.6 ± 0.71%
  SIZ:  OR = 1.5, case fraction P = 0.45, sampling fraction 50/50            1285   95.9 ± 0.63%

Balanced design with low odds ratio
  (4):  P1 = 0.5, P2 = 0.2, B = 0.5                                           141   96.3 ± 0.60%
  (2):  P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5                                 126   95.0 ± 0.69%
  (3):  P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5                                 131   96.6 ± 0.57%
  (12): P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5                                 131   96.6 ± 0.57%
  (13): P1 = 0.5, P2 = 0.2, B = 0.5                                           119   94.9 ± 0.70%
  SIZ:  OR = 0.25, case fraction P = 0.35, sampling fraction 50/50            129   96.1 ± 0.61%

Balanced design with high odds ratio
  (4):  P1 = 0.2, P2 = 0.5, B = 0.5                                           166   99.0 ± 0.31%
  (2):  P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5                                 126   95.0 ± 0.69%
  (3):  P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5                                 131   96.6 ± 0.57%
  (12): P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5                                 131   96.6 ± 0.57%
  (13): P1 = 0.2, P2 = 0.5, B = 0.5                                           119   92.9 ± 0.81%
  SIZ:  OR = 4.0, case fraction P = 0.35, sampling fraction 50/50             129   95.4 ± 0.66%

Balanced design with high odds ratio
  (4):  P1 = 0.05, P2 = 0.1, B = 0.5                                         1818   98.2 ± 0.42%
  (2):  P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5                              1437   94.4 ± 0.73%
  (3):  P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5                              1443   95.8 ± 0.63%
  (12): P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5                              1443   95.8 ± 0.63%
  (13): P1 = 0.05, P2 = 0.1, B = 0.5                                         1430   94.4 ± 0.73%
  SIZ:  OR = 2.111, case fraction P = 0.075, sampling fraction 50/50         1417   94.5 ± 0.72%

Low prevalence rate
  (4):  P1 = 0.05, P2 = 0.1, B = 0.2                                         2612   97.4 ± 0.50%
  (2):  P = 0.06, P1 = 0.05, P2 = 0.1, B = 0.2                               2186   94.9 ± 0.70%
  (12): P = 0.06, P1 = 0.05, P2 = 0.1, B = 0.2                               1833   91.2 ± 0.90%
  (13): P1 = 0.05, P2 = 0.1, B = 0.2                                         2648   97.4 ± 0.50%
  SIZ:  OR = 2.111, case fraction P = 0.06, sampling fraction 80/20          2070   94.6 ± 0.71%

High prevalence rate
  (4):  P1 = 0.05, P2 = 0.1, B = 0.8                                         3060   98.3 ± 0.41%
  (2):  P = 0.09, P1 = 0.05, P2 = 0.1, B = 0.8                               2257   95.0 ± 0.69%
  (12): P = 0.09, P1 = 0.05, P2 = 0.1, B = 0.8                               2661   97.8 ± 0.46%
  (13): P1 = 0.05, P2 = 0.1, B = 0.8                                         1820   89.5 ± 0.97%
  SIZ:  OR = 2.111, case fraction P = 0.09, sampling fraction 20/80          2347   97.2 ± 0.52%


($B = 0.2$ or $0.8$); a low prevalence rate ($B = 0.2$) requires less sample size than a high prevalence rate ($B = 0.8$); sample size remains the same if the odds ratio is reversed. In addition to the significance level and the power of the test, the values of the following parameters, listed after the sample size methods, are specified in the table:

Formula (4): P1, P2 and B.

Formula (2): tests of proportions: P, P1, P2 and B.

Formula (3): simple form for a balanced design: P, P1 and P2.

Formula (12): simple form for an unbalanced design: P, P1, P2 and B.

Formula (13): simple form for an unbalanced design: P1, P2 and B.

SIZ: OR, sampling fractions $1-B$ and $B$, and overall case fraction P.
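The parameter sets listed above describe the same design in different terms. As an illustration, the sketch below converts the event rates and the proportion B into the odds ratio and overall case fraction used in the SIZ rows of Table I, following the definitions of OR and P given in the text. The function names are hypothetical.

```python
def odds_ratio(p1, p2):
    """Odds ratio of X = 1 versus X = 0: OR = P2(1 - P1) / (P1(1 - P2))."""
    return p2 * (1 - p1) / (p1 * (1 - p2))


def case_fraction(p1, p2, b):
    """Overall event rate P = (1 - B) * P1 + B * P2."""
    return (1 - b) * p1 + b * p2


def p2_from_odds_ratio(p1, or_value):
    """Recover P2 from P1 and the odds ratio by converting odds back to a rate."""
    odds2 = or_value * p1 / (1 - p1)
    return odds2 / (1 + odds2)
```

For P1 = 0.05, P2 = 0.1 and B = 0.2, these give an odds ratio of about 2.111 and a case fraction of 0.06, matching the SIZ row of the low prevalence design.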

The power simulations, obtained from SIZ with 1000 replications, use the likelihood ratio test for the logistic regression model. The simulations show that the sample sizes obtained from testing two proportions (formulae (2) and (3)) have statistical power within one standard deviation of the expected power of 95 per cent. Also, formulae (2) and (3) are more stable than the other four methods. Note that formula (4) calculates the required total number of events based on the event rate corresponding to $X = 0$, then inflates the number of events to obtain the total sample size. Therefore, formula (4) produces a larger sample size if the lower event rate is assigned to P1 instead of P2. Formula (4) tends to overestimate the required sample sizes, especially when the event rates are low (see Table I). Formula (3) is a special case of formula (12) for a balanced design. As shown in Table I, formula (3) gives the same sample sizes as formula (12) when $B = 0.5$, but a slightly larger sample size than formula (2). Since formula (3) is designed for $B = 0.5$, no sample sizes for formula (3) are given for low or high prevalence rate. Formulae (12) and (13) are simpler than formula (2), but lack accuracy when the sample size ratio is not close to 1 (say $> 2$ or $< 0.5$), and should not be used when the accuracy of sample size calculation is important. It is known that a design with low prevalence rate requires less sample size than one with high prevalence rate. In Table I, formula (13) does not show this relationship, which indicates that the formula overestimates the sample size for low prevalence rate and underestimates it for high prevalence rate.
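A power simulation of the kind reported in Table I can be sketched in a few lines. This is not the SIZ program. It is a minimal Python illustration under stated assumptions: group sizes are fixed at n(1 - B) and nB, the seed and function names are arbitrary, and the likelihood ratio test for a single binary covariate is computed as the G statistic of the 2 x 2 table, to which it reduces in that case. The critical value 3.841459 is the 95th percentile of the chi-square distribution with one degree of freedom.

```python
import random
from math import log


def g_statistic(a, b, c, d):
    """Likelihood ratio chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    g = 0.0
    for obs, i, j in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        if obs > 0:  # an empty cell contributes nothing to the statistic
            g += obs * log(obs * n / (rows[i] * cols[j]))
    return 2 * g


def simulate_power(n, p1, p2, b, reps=1000, seed=1, crit=3.841459):
    """Estimate power by simulating events in the X = 0 and X = 1 groups."""
    rng = random.Random(seed)
    n1 = round(n * b)  # subjects with X = 1
    n0 = n - n1        # subjects with X = 0
    rejections = 0
    for _ in range(reps):
        e0 = sum(rng.random() < p1 for _ in range(n0))  # events at X = 0
        e1 = sum(rng.random() < p2 for _ in range(n1))  # events at X = 1
        if g_statistic(e0, n0 - e0, e1, n1 - e1) > crit:
            rejections += 1
    return rejections / reps
```

With 1000 replications the binomial standard error near 95 per cent power is about 0.7 percentage points, which is the order of the ± values shown in Table I, so repeated runs with different seeds should be expected to vary by roughly that amount.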

Table II presents the results for a continuous covariate from sample size programs nQuery and SSIZE. The corresponding sample sizes from a two-sample t-test (formula (6) with Z-values replaced by t-values) and from formula (1) are also listed for comparison. The table specifies the following parameters indicated after the sample size methods:

Formula (1): P1, effect size $= \log(\mathrm{OR}) = \beta^*$.

Two-sample t-test: effect size $= \log(\mathrm{OR})$, sample size ratio $= \mathrm{prob}(Y = 1)/\mathrm{prob}(Y = 0) = (1-P1)/P1$.

nQuery: P1 (event rate at the mean of X), P2 (event rate at one standard deviation above the mean of X).

SSIZE: P1 (event rate at the mean of X), OR (odds ratio at one standard deviation above the mean of X) $= P2(1-P1)/(P1(1-P2))$.

Table II also provides power simulations obtained from 1000 replications generated by assuming

a normally distributed variable X. We used the Wald test in the simulation of the logistic

regression model. The results show that the sample sizes estimated by using the two-sample t-test

formula and formula (1) seem to be more conservative, but still large enough to achieve the
