# A simple method of sample size calculation for linear and logistic regression.

**ABSTRACT** A sample size calculation for logistic regression involves complicated formulae. This paper suggests use of sample size formulae for comparing means or for comparing proportions in order to calculate the required sample size for a simple logistic regression model. One can then adjust the required sample size for a multiple logistic regression model by a variance inflation factor. This method requires no assumption of low response probability in the logistic model as in a previous publication. One can similarly calculate the sample size for linear regression models. This paper also compares the accuracy of some existing sample-size software for logistic regression with computer power simulations. An example illustrates the methods.


STATISTICS IN MEDICINE

Statist. Med. 17, 1623—1634 (1998)

A SIMPLE METHOD OF SAMPLE SIZE CALCULATION FOR LINEAR AND LOGISTIC REGRESSION

F. Y. HSIEH¹*, DANIEL A. BLOCH² AND MICHAEL D. LARSEN³

¹ CSPCC, Department of Veterans Affairs, Palo Alto Health Care System (151-K), Palo Alto, California 94304, U.S.A.

² Division of Biostatistics, Department of Health Research and Policy, Stanford University, Stanford, California 94305, U.S.A.

³ Department of Statistics, Stanford University, Stanford, California 94305, U.S.A.


INTRODUCTION

In a multiple logistic regression analysis, one frequently wishes to test the effect of a specific covariate, possibly in the presence of other covariates, on the binary response variable. Owing to the nature of non-linearity, the sample size calculation for logistic regression is complicated. Whittemore proposed a formula, derived from the information matrix, for small response probabilities. Hsieh simplified and extended the formula for general situations by using the upper bound of the formula. Appendix I presents a simple closed form, based on an information matrix, to approximate the sample size for both continuous and binary covariates in a simple logistic regression. In a different approach, Self and Mauritsen used generalized linear models and the score tests to estimate the sample size through an iterative procedure. These published methods are complicated and may not be more accurate than the conventional sample size formulae for comparing two means or a test of equality of proportions. In the next section, we present a simple formula for the approximate sizes of the sample required for simple logistic regression by using formulae for calculating sample size for comparing two means or for comparing two proportions. We can then adjust the sample size requirement for a multiple logistic regression by a variance inflation factor. This approach applies to multiple linear regression as well.

\* Correspondence to: F. Y. Hsieh, CSPCC, Department of Veterans Affairs, Palo Alto Health Care System (151-K), Palo Alto, California 94304, U.S.A.

Contract/grant sponsor: Department of Veterans Affairs Cooperative Studies Program

Contract/grant sponsor: NIH

Contract/grant number: AR20610

Contract/grant sponsor: National Institute on Drug Abuse

Contract/grant number: Y01-DA-40032-0

CCC 0277–6715/98/141623–12$17.50

© 1998 John Wiley & Sons, Ltd.

Received February 1997

Revised October 1997

SIMPLE LOGISTIC REGRESSION

In a simple logistic regression model, we relate a covariate $X_1$ to the binary response variable $Y$ in a model $\log(P/(1-P)) = \beta_0 + \beta_1 X_1$ where $P = \mathrm{prob}(Y = 1)$. We are interested in testing the null hypothesis $H_0\colon \beta_1 = 0$ against the alternative $H_1\colon \beta_1 = \beta^*$, where $\beta^* \neq 0$, that the covariate is related to the binary response variable. The slope coefficient $\beta_1$ is the change in log odds for an increase of one unit in $X_1$. When the covariate is a continuous variable with a normal distribution, the log odds value $\beta_1$ is zero if and only if the group means, assuming equal variances, between the two response categories are the same. Therefore we may use a sample size formula for a two-sample t-test to calculate the required sample size. For simplicity, we use a normal approximation instead, as the sample size formula (see formula (7) in Appendix I) may be easily changed to include t-tests if required:

$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{P1(1-P1)\,\beta^{*2}} \tag{1}$$

where $n$ is the required total sample size, $\beta^*$ is the effect size to be tested, $P1$ is the event rate at the mean of $X$, and $Z_u$ is the upper $u$th percentile of the standard normal distribution.

When the covariate is a binary variable, say $X = 0$ or 1, the log odds value $\beta_1 = 0$ if and only if the two event rates are equal. The sample size formula for the total sample size required for comparing two independent event rates has the following form (see formula (10)):

$$n = \frac{\left\{Z_{1-\alpha/2}\,[P(1-P)/B]^{1/2} + Z_{1-\beta}\,[P1(1-P1) + P2(1-P2)(1-B)/B]^{1/2}\right\}^2}{(P1-P2)^2(1-B)} \tag{2}$$

where: $P$ $(= (1-B)P1 + B\,P2)$ is the overall event rate; $B$ is the proportion of the sample with $X = 1$; $P1$ and $P2$ are the event rates at $X = 0$ and $X = 1$, respectively. For $B = 0.5$, the required sample size is bounded by the following simple form (see formula (11)):

$$n < \frac{4P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2}{(P1-P2)^2}. \tag{3}$$

Appendix I presents two simpler forms, formulae (12) and (13), than formula (2). A later section presents the comparisons of these formulae with computer power simulations.
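Formulae (1) to (3) are simple enough to evaluate directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the defaults (two-sided significance level 5 per cent, power 95 per cent) are taken from the settings used in Tables I and II.

```python
import math
from statistics import NormalDist


def z(u: float) -> float:
    """Standard normal quantile, e.g. z(0.975) is about 1.96."""
    return NormalDist().inv_cdf(u)


def n_continuous(p1: float, beta_star: float,
                 alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (1): total n for a standardized, normally distributed covariate.

    p1 is the event rate at the mean of X; beta_star is the log odds ratio
    per standard deviation of X.
    """
    zsum = z(1 - alpha / 2) + z(power)
    return zsum ** 2 / (p1 * (1 - p1) * beta_star ** 2)


def n_binary(p1: float, p2: float, b: float,
             alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (2): total n for a binary covariate with prob(X = 1) = b."""
    p = (1 - b) * p1 + b * p2  # overall event rate
    null_term = z(1 - alpha / 2) * math.sqrt(p * (1 - p) / b)
    alt_term = z(power) * math.sqrt(
        p1 * (1 - p1) + p2 * (1 - p2) * (1 - b) / b)
    return (null_term + alt_term) ** 2 / ((p1 - p2) ** 2 * (1 - b))


def n_binary_balanced_bound(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (3): upper bound on formula (2) for a balanced design (b = 0.5)."""
    p = 0.5 * (p1 + p2)
    zsum = z(1 - alpha / 2) + z(power)
    return 4 * p * (1 - p) * zsum ** 2 / (p1 - p2) ** 2


# Illustrative calls using designs that also appear in Tables I and II.
print(math.ceil(n_continuous(0.5, math.log(1.5))))
print(math.ceil(n_binary(0.05, 0.1, 0.2)))
print(math.ceil(n_binary_balanced_bound(0.4, 0.5)))
```

Results from this sketch should land within a unit or two of the tabulated sample sizes; small discrepancies can arise from rounding conventions (for example, rounding each group up separately), which the text does not spell out.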

MULTIPLE LOGISTIC REGRESSION

When there is more than one covariate in the model, a hypothesis of interest is the effect of a specific covariate in the presence of other covariates. In terms of log odds parameters, the null hypothesis for multiple logistic regression is $H_0\colon [\beta_1, \beta_2, \ldots, \beta_p] = [0, \beta_2, \ldots, \beta_p]$ against the alternative $[\beta^*, \beta_2, \ldots, \beta_p]$. Let $b_1$ be the maximum likelihood estimate of $\beta_1$. Whittemore has shown that, for continuous, normal covariates $X$, the variance of $b_1$ in the multivariate setting with $p$ covariates, $\mathrm{var}_p(b_1)$, can be approximated by inflating the variance of $b_1$ obtained from the one parameter model, $\mathrm{var}_1(b_1)$, by multiplying by $1/(1-\rho^2_{1.23\ldots p})$ where $\rho_{1.23\ldots p}$ is the multiple correlation coefficient relating $X_1$ with $X_2, \ldots, X_p$. That is, approximately

$$\mathrm{var}_p(b_1) = \mathrm{var}_1(b_1)/(1-\rho^2_{1.23\ldots p}).$$

The squared multiple correlation coefficient $\rho^2_{1.23\ldots p}$, also known as $R^2$, is equal to the proportion of the variance of $X_1$ explained by the regression relationship with $X_2, \ldots, X_p$. The term $1/(1-\rho^2_{1.23\ldots p})$ will be referred to as a variance inflation factor (VIF). The required sample size for the multivariate case can also be approximated from the univariate case by inflating it with the same factor $1/(1-\rho^2_{1.23\ldots p})$. Following the relationship of the variances, we have $n_p = n_1/(1-\rho^2_{1.23\ldots p})$ where $n_p$ and $n_1$ are the sample sizes required for a logistic regression model with $p$ and 1 covariates, respectively. The same VIF seems to work well for binary covariates (see Appendix III).
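The VIF adjustment is a one-line calculation. A minimal sketch follows; the helper names are hypothetical, and the squared multiple correlation is assumed to come from a regression of the covariate of interest on the other covariates.

```python
import math


def variance_inflation_factor(r_squared: float) -> float:
    """VIF = 1 / (1 - R^2), where R^2 is the squared multiple correlation
    of the covariate of interest with the remaining covariates."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def inflate_sample_size(n_one_covariate: float, r_squared: float) -> int:
    """n_p = n_1 / (1 - R^2), rounded up to a whole subject."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return math.ceil(n_one_covariate / (1.0 - r_squared))


# Hypothetical inputs: 330 subjects for the one-covariate model, R^2 = 0.25.
print(variance_inflation_factor(0.25))
print(inflate_sample_size(330, 0.25))
```

Because the factor grows without bound as $R^2$ approaches 1, a covariate that is nearly collinear with the others can make the adjusted sample size impractically large.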

MULTIPLE LINEAR REGRESSION

For multiple linear regression models, we can easily derive the same VIF for $p$ covariates (see Appendix II). Therefore, we can adjust similarly the sample size for a regression model with $p$ covariates. It is known that in a simple linear regression model, the correlation coefficient $\rho$ and the regression parameter $\beta_1$ have the relationship $\rho = \beta_1\sigma_x/\sigma_y$. Hence $\rho = 0$ if and only if $\beta_1 = 0$. When both $X$ and $Y$ are standardized, testing the hypotheses that $\rho = 0$ and that $\beta_1 = 0$ are equivalent and the required sample sizes are the same.

Let $r$ be the estimate of the correlation coefficient between $X$ and $Y$. The sample size formula (see Sokal and Rohlf) for testing $H_0\colon \rho = 0$ against the alternative $H_1\colon \rho = r$ is

$$n_1 = (Z_{1-\alpha/2} + Z_{1-\beta})^2 / C(r)^2 + 3$$

where the Fisher's transformation $C(r) = \tfrac{1}{2}\log((1+r)/(1-r))$. If we add $p-1$ covariates to the regression model, the required sample size for testing $H_0\colon [\beta_1, \beta_2, \ldots, \beta_p] = [0, \beta_2, \ldots, \beta_p]$ against the alternative $[\beta^*, \beta_2, \ldots, \beta_p]$ is $n_p = n_1/(1-\rho^2_{1.23\ldots p})$, approximately. If we already have $q$ covariates in the model and would like to expand the model to $p$ $(> q)$ covariates, then, from Appendix II, $n_p = n_q\,(\mathrm{var}_p(b_1)/\mathrm{var}_q(b_1)) = n_q/(1-\rho^2_{1(q+1\ldots p)\cdot(23\ldots q)})$ where the partial correlation coefficient $\rho_{1(q+1\ldots p)\cdot(23\ldots q)}$ measures the linear association between covariates $X_1$ and $X_{q+1}, \ldots, X_p$ when the values of covariates $X_2, \ldots, X_q$ are held fixed.
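The correlation-based formula can be sketched as follows. The function names are hypothetical and the value $r = 0.3$ is an arbitrary illustration, not a figure from the paper.

```python
import math
from statistics import NormalDist


def fisher_c(r: float) -> float:
    """Fisher's transformation C(r) = 0.5 * log((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))


def n_for_correlation(r: float, alpha: float = 0.05,
                      power: float = 0.95) -> float:
    """n_1 = (Z_{1-alpha/2} + Z_{1-beta})^2 / C(r)^2 + 3."""
    nd = NormalDist()
    zsum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return zsum ** 2 / fisher_c(r) ** 2 + 3


def n_for_multiple_regression(r: float, r_squared_others: float,
                              alpha: float = 0.05,
                              power: float = 0.95) -> float:
    """n_p = n_1 / (1 - rho^2), the VIF-adjusted size for p covariates."""
    return n_for_correlation(r, alpha, power) / (1 - r_squared_others)


# Arbitrary illustration: detect r = 0.3 with X_1 sharing 20% of its
# variance with the other covariates.
print(math.ceil(n_for_correlation(0.3)))
print(math.ceil(n_for_multiple_regression(0.3, 0.2)))
```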

COMPARISON OF SAMPLE-SIZE SOFTWARE

There are at least two computer programs available that use formula (4) (see Appendix I): nQuery from Dr. Janet Elashoff, and SSIZE from the first author. One program, EGRET SIZ from SERC, uses the approach of Self and Mauritsen. For logistic regression, the computer programs nQuery and SSIZE provide sample sizes only for continuous covariates while EGRET SIZ only provides estimates for discrete covariates. Both nQuery and EGRET SIZ are commercial software. Note that the sample size calculation for logistic regression is only one of the many features provided by the above three computer programs.

Table I presents sample size examples for a binary covariate using formula (4) and software EGRET SIZ as well as the corresponding sample size for comparing two proportions (without continuity correction from formulae (2), (3), (12) and (13)), and the results of power simulations. In the table, P1 and P2 are event rates at $X = 0$ and $X = 1$, respectively; B is the proportion of the sample with $X = 1$; OR is the odds ratio of $X = 1$ versus $X = 0$ such that $\mathrm{OR} = P2(1-P1)/(P1(1-P2))$; $P = (1-B)P1 + B\,P2$ is the overall event rate or case fraction.

Table I is designed to show the relationship of sample sizes for different study designs. It is known that a balanced design (B = 0.5) requires less sample size than an unbalanced design (B = 0.2 or 0.8); a low prevalence rate (B = 0.2) requires less sample size than a high prevalence rate (B = 0.8); sample size remains the same if the odds ratio is reversed. In addition to the significance level and the power of the test, the values of the following parameters, listed after the sample size methods, are specified in the table:

Formula (4): P1, P2 and B.

Formula (2): tests of proportions: P, P1, P2 and B.

Formula (3): simple form for a balanced design: P, P1 and P2.

Formula (12): simple form for an unbalanced design: P, P1, P2 and B.

Formula (13): simple form for an unbalanced design: P1, P2 and B.

SIZ: OR, sampling fractions 1−B and B, and overall case fraction P.

Table I. Results of sample size calculations for a binary covariate from six different methods, power = 95 per cent, two-sided significance level 5 per cent

| Design | Method | Parameters | Sample size | Power simulation (%) |
|---|---|---|---|---|
| Balanced design with high event rates | (4) | P1 = 0.4, P2 = 0.5, B = 0.5 | 1367 | 96.0 ± 0.63 |
| | (2) | P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5 | 1282 | 95.4 ± 0.66 |
| | (3) | P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5 | 1287 | 94.7 ± 0.71 |
| | (12) | P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5 | 1287 | 94.7 ± 0.71 |
| | (13) | P1 = 0.4, P2 = 0.5, B = 0.5 | 1274 | 94.6 ± 0.71 |
| | SIZ | OR = 1.5, case fraction P = 0.45, sampling fraction 50/50 | 1285 | 95.9 ± 0.63 |
| Balanced design with low odds ratio | (4) | P1 = 0.5, P2 = 0.2, B = 0.5 | 141 | 96.3 ± 0.60 |
| | (2) | P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5 | 126 | 95.0 ± 0.69 |
| | (3) | P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5 | 131 | 96.6 ± 0.57 |
| | (12) | P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5 | 131 | 96.6 ± 0.57 |
| | (13) | P1 = 0.5, P2 = 0.2, B = 0.5 | 119 | 94.9 ± 0.70 |
| | SIZ | OR = 0.25, case fraction P = 0.35, sampling fraction 50/50 | 129 | 96.1 ± 0.61 |
| Balanced design with high odds ratio | (4) | P1 = 0.2, P2 = 0.5, B = 0.5 | 166 | 99.0 ± 0.31 |
| | (2) | P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5 | 126 | 95.0 ± 0.69 |
| | (3) | P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5 | 131 | 96.6 ± 0.57 |
| | (12) | P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5 | 131 | 96.6 ± 0.57 |
| | (13) | P1 = 0.2, P2 = 0.5, B = 0.5 | 119 | 92.9 ± 0.81 |
| | SIZ | OR = 4.0, case fraction P = 0.35, sampling fraction 50/50 | 129 | 95.4 ± 0.66 |
| Balanced design with high odds ratio | (4) | P1 = 0.05, P2 = 0.1, B = 0.5 | 1818 | 98.2 ± 0.42 |
| | (2) | P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5 | 1437 | 94.4 ± 0.73 |
| | (3) | P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5 | 1443 | 95.8 ± 0.63 |
| | (12) | P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5 | 1443 | 95.8 ± 0.63 |
| | (13) | P1 = 0.05, P2 = 0.1, B = 0.5 | 1430 | 94.4 ± 0.73 |
| | SIZ | OR = 2.111, case fraction P = 0.075, sampling fraction 50/50 | 1417 | 94.5 ± 0.72 |
| Low prevalence rate | (4) | P1 = 0.05, P2 = 0.1, B = 0.2 | 2612 | 97.4 ± 0.50 |
| | (2) | P = 0.06, P1 = 0.05, P2 = 0.1, B = 0.2 | 2186 | 94.9 ± 0.70 |
| | (12) | P = 0.06, P1 = 0.05, P2 = 0.1, B = 0.2 | 1833 | 91.2 ± 0.90 |
| | (13) | P1 = 0.05, P2 = 0.1, B = 0.2 | 2648 | 97.4 ± 0.50 |
| | SIZ | OR = 2.111, case fraction P = 0.06, sampling fraction 80/20 | 2070 | 94.6 ± 0.71 |
| High prevalence rate | (4) | P1 = 0.05, P2 = 0.1, B = 0.8 | 3060 | 98.3 ± 0.41 |
| | (2) | P = 0.09, P1 = 0.05, P2 = 0.1, B = 0.8 | 2257 | 95.0 ± 0.69 |
| | (12) | P = 0.09, P1 = 0.05, P2 = 0.1, B = 0.8 | 2661 | 97.8 ± 0.46 |
| | (13) | P1 = 0.05, P2 = 0.1, B = 0.8 | 1820 | 89.5 ± 0.97 |
| | SIZ | OR = 2.111, case fraction P = 0.09, sampling fraction 20/80 | 2347 | 97.2 ± 0.52 |

The power simulations, obtained from SIZ with 1000 replications, use the likelihood ratio test for the logistic regression model. The simulations show that the sample sizes obtained from testing two proportions (formulae (2) and (3)) have statistical power within one standard deviation of the expected power of 95 per cent. Also, formulae (2) and (3) are more stable than the other four methods. Note that formula (4) calculates the required total number of events based on the event rate corresponding to X = 0, then inflates the number of events to obtain the total sample size. Therefore, formula (4) produces a larger sample size if the lower event rate is assigned to P1 instead of P2. Formula (4) tends to overestimate the required sample sizes especially when the event rates are low (see Table I). Formula (3) is a special case of formula (12) for a balanced design. As shown in Table I, formula (3) gives the same sample sizes as formula (12) when B = 0.5, but slightly larger sample size than formula (2). Since formula (3) is designed for B = 0.5, no sample sizes for formula (3) are given for low or high prevalence rate. Formulae (12) and (13) are simpler than formula (2), but lack accuracy when the sample size ratio is not close to 1 (say > 2 or < 0.5), and should not be used when the accuracy of sample size calculation is important. It is known that a design with low prevalence rate requires less sample size than high prevalence rate. In Table I, formula (13) does not show this relationship which indicates that the formula overestimates the sample size for low prevalence rate and underestimates high prevalence rate.
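A power check of this kind can be approximated without specialized software. The sketch below is not the SIZ procedure used in the paper; it is an independent Monte Carlo approximation that relies on one fact: for a logistic model with a single binary covariate, the likelihood ratio test of $\beta_1 = 0$ equals the $G^2$ test of independence in the 2×2 table of X by Y. The function names, the fixed group sizes, and the seed are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def lr_statistic(a: int, b: int, c: int, d: int) -> float:
    """G^2 for the 2x2 table [[a, b], [c, d]].

    Rows are X = 0 and X = 1; columns are Y = 1 and Y = 0. Empty cells
    contribute zero to the statistic.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    g2 = 0.0
    for obs, i, j in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        if obs > 0:
            expected = rows[i] * cols[j] / n
            g2 += 2.0 * obs * math.log(obs / expected)
    return g2


def simulate_lr_power(n: int, p1: float, p2: float, b: float,
                      alpha: float = 0.05, reps: int = 1000,
                      seed: int = 1) -> float:
    """Share of simulated studies in which the LR test rejects beta_1 = 0."""
    rng = random.Random(seed)
    n1 = round(n * b)  # subjects with X = 1 (group sizes held fixed)
    n0 = n - n1        # subjects with X = 0
    # The chi-square(1 df) critical value is the squared normal quantile.
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    rejections = 0
    for _ in range(reps):
        events0 = sum(rng.random() < p1 for _ in range(n0))
        events1 = sum(rng.random() < p2 for _ in range(n1))
        g2 = lr_statistic(events0, n0 - events0, events1, n1 - events1)
        if g2 > critical:
            rejections += 1
    return rejections / reps
```

For example, `simulate_lr_power(1282, 0.4, 0.5, 0.5)` estimates the power of the formula (2) sample size for the first design in Table I. With 1000 replications the Monte Carlo standard error near 95 per cent power is roughly 0.7 percentage points, the same order as the ± values reported in the table.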

Table II presents the results for a continuous covariate from sample size programs nQuery and SSIZE. The corresponding sample sizes from a two-sample t-test (formula (6) with Z-values replaced by t-values) and from formula (1) are also listed for comparison. The table specifies the following parameters indicated after the sample size methods:

Formula (1): P1, effect size = log(OR) = β*.

Two-sample t-test: effect size = log(OR), sample size ratio = prob(Y = 0)/prob(Y = 1) = (1−P1)/P1.

nQuery: P1 (event rate at the mean of X), P2 (event rate at one standard deviation above the mean of X).

SSIZE: P1 (event rate at the mean of X), OR (odds ratio at one standard deviation above the mean of X) = P2(1−P1)/(P1(1−P2)).

Table II. Results of sample size calculations for a continuous covariate from four different methods, power = 95 per cent, two-sided significance level 5 per cent

| Design | Method | Parameters | Sample size | Power simulation (%) |
|---|---|---|---|---|
| Balanced design | (1) | P1 = 0.5, effect size β* = 0.405 | 317 | 95.0 ± 0.69 |
| | t-test | effect size = 0.405, sample size ratio = 1 | 320 | 95.5 ± 0.66 |
| | nQuery | P1 = 0.5, P2 = 0.6 | 342 | 96.1 ± 0.61 |
| | SSIZE | P1 = 0.5, OR = 1.5 | 341 | 95.3 ± 0.67 |
| Unbalanced design, high event rates | (1) | P1 = 0.4, effect size β* = 0.405 | 330 | 94.4 ± 0.73 |
| | t-test | effect size = 0.405, sample size ratio = 1.5 | 333 | 94.8 ± 0.70 |
| | nQuery | P1 = 0.4, P2 = 0.5 | 380 | 96.7 ± 0.56 |
| | SSIZE | P1 = 0.4, OR = 1.5 | 379 | 96.7 ± 0.56 |
| Unbalanced design, low event rates | (1) | P1 = 0.1, effect size β* = 0.405 | 880 | 95.5 ± 0.66 |
| | t-test | effect size = 0.405, sample size ratio = 9 | 890 | 96.1 ± 0.61 |
| | nQuery | P1 = 0.1, P2 = 0.143 | 951 | 96.6 ± 0.57 |
| | SSIZE | P1 = 0.1, OR = 1.5 | 950 | 96.6 ± 0.57 |

Table II also provides power simulations obtained from 1000 replications generated by assuming a normally distributed variable X. We used the Wald test in the simulation of the logistic regression model. The results show that the sample sizes estimated by using the two-sample t-test formula and formula (1) seem to be more conservative, but still large enough to achieve the desired power. In other words, Table II seems to indicate that the t-test is a good estimate of sample size which preserves power. Since we used the upper bound of the required sample size in the formulae in both nQuery and SSIZE, both programs provide sample sizes slightly higher than those required. When the odds ratio is fixed, a balanced design (that is, response rate P1 = 0.5) requires less sample size than an unbalanced design (for example, P1 = 0.4 or 0.1). Note that due to the exponential nature of the correction term (see Appendix I), we do not recommend use of either software for logistic regression when the odds ratio is large (say ≥ 3).
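A Wald-test power simulation for a continuous covariate can be sketched in a few dozen lines. This is an illustrative reimplementation under stated assumptions, not the authors' simulation code: the covariate is drawn as standard normal, the model is fitted by a hand-written Newton-Raphson routine, and all names and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist


def wald_z_for_slope(x, y, max_iter=25, tol=1e-8):
    """Fit logit prob(Y = 1) = b0 + b1 * x by Newton-Raphson and return
    the Wald statistic b1 / se(b1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # score for b0
            g1 += (yi - p) * xi    # score for b1
            h00 += w               # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    # var(b1) is the slope element of the inverse information matrix,
    # evaluated at the iterate just before the final (tiny) step.
    return b1 / math.sqrt(h00 / det)


def simulate_wald_power(n, p1, beta_star, alpha=0.05, reps=1000, seed=1):
    """Share of simulated studies in which the Wald test rejects beta_1 = 0."""
    rng = random.Random(seed)
    b0 = math.log(p1 / (1.0 - p1))  # event rate p1 at the mean of X
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(b0 + beta_star * xi)))
             else 0 for xi in x]
        if abs(wald_z_for_slope(x, y)) > critical:
            rejections += 1
    return rejections / reps
```

Calling `simulate_wald_power(317, 0.5, math.log(1.5))` corresponds to the formula (1) row of the balanced design in Table II; the estimate varies with the seed, within Monte Carlo error.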

EXAMPLE

We use a Department of Veterans Affairs Cooperative Study entitled ‘A Psychophysiological Study of Chronic Post-Traumatic Stress Disorder’ to illustrate the preceding sample size calculation for logistic regression with continuous covariates. The study developed and validated a logistic regression model to explore the use of certain psychophysiological measurements for the prognosis of combat-related post-traumatic stress disorder (PTSD). In the study, patients’ four psychophysiological measurements (heart rate, blood pressures, EMG and skin conductance) were recorded while patients were exposed to video tapes containing combat and neutral scenes. Among the psychophysiological variables, the difference of the heart rates obtained while viewing the combat and the neutral tapes (DCNHR) is considered a good predictor of the diagnosis of PTSD. The prevalence rate of PTSD among the Vietnam veterans was assumed to be 20 per cent. Therefore, we assumed a four to one sample size ratio for the non-PTSD versus PTSD groups. The effect size of DCNHR is approximately 0.3, which is the difference of the group means divided by the standard deviation. With a two-sided significance level of 0.05 and a power of 95 per cent, the required sample size based on a two-sample t-test is 905. The squared multiple correlation coefficient of DCNHR versus the other three psychophysiological variables was estimated to be 0.1 and thus the VIF is 1.11. After adjusting for the VIF, a sample size of 1005 was needed for fitting a multiple logistic regression model.
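The arithmetic of this example can be retraced with the normal-approximation form of the two-sample formula (formula (6) in Appendix I). The sketch below uses a hypothetical function name and Z-values rather than t-values, so it gives a slightly smaller figure than the 905 quoted from the t-test.

```python
import math
from statistics import NormalDist


def n_two_sample_normal(effect_size: float, k: float,
                        alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (6) with Z-values: total n for a standardized mean
    difference `effect_size` and a group-size ratio `k`."""
    nd = NormalDist()
    zsum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return zsum ** 2 * (k + 1) ** 2 / k / effect_size ** 2


# PTSD example: effect size 0.3, four non-PTSD subjects per PTSD subject.
n_one = n_two_sample_normal(0.3, 4)
# Adjust for R^2 = 0.1 between DCNHR and the other three measurements.
n_multi = n_one / (1 - 0.1)
print(math.ceil(n_one), math.ceil(n_multi))
```

The normal approximation yields a total a few subjects below the t-test figure, and the VIF-adjusted total is correspondingly a few subjects below 1005; the gap reflects the t-value correction and the rounding of the VIF to 1.11.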

CONCLUSION

The proposed simple methods to calculate sample size for linear and logistic regression models have several advantages. The formulae for the simple methods are well known and do not require specialized software. This paper also provides simple forms of the formulae for easy hand calculation. Compared to more accurate, but more complicated formulae, formulae (1) and (3) have high degrees of accuracy. Computer simulations suggest that the proposed sample size methods for comparing means and for comparing proportions are more accurate than SSIZE, nQuery and EGRET SIZ. This paper recommends against using SSIZE or nQuery when the odds ratio is large (say ≥ 3), and against using Liu and Liang’s formula (13) when the sample size ratio is not close to 1 (say > 2 or < 0.5). This paper derives the variance inflation factor (VIF) for the linear regression model and also shows, through computer simulations, that the same VIF applies to the logistic regression model with binary covariates. The usage of the VIF to expand the sample size calculation from one covariate to more than one covariate appears very useful and can be extended to other multivariate models. In conclusion, this paper presents more accurate and simple formulae for sample size calculation with extensions to multivariate models of various types.

APPENDIX I

In a simple logistic regression model $\log(P/(1-P)) = \beta_0 + \beta_1 X_1$, where $P = \mathrm{prob}(Y = 1)$, the hypothesis $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 = \beta^*$ is of interest. A power of $1-\beta$ and a two-sided significance level $\alpha$ are usually prespecified to calculate the sample size for the hypothesis test. The following sample size formula, used in both SSIZE and nQuery, is a combination of Whittemore's formulae (6) and (16):

$$n = \frac{\left(V(0)^{1/2} Z_{1-\alpha/2} + V(\beta^*)^{1/2} Z_{1-\beta}\right)^2 (1 + 2\,P1\,\delta)}{P1\,\beta^{*2}} \tag{4}$$

where the log odds value $\beta^* = \log(P2(1-P1)/(P1(1-P2)))$, and $Z_{1-\beta}$ and $Z_{1-\alpha/2}$ are standard normal variables with a tail probability of $\beta$ and $\alpha/2$, respectively.

For a continuous covariate, $V(0) = 1$, $V(\beta^*) = \exp(-\beta^{*2}/2)$, and $P1$ and $P2$ are the event rates at the mean of $X$ and one SD above the mean, respectively. The value of $\delta$ for continuous covariates is from Hsieh's formula (3): $\delta = (1 + (1 + \beta^{*2})\exp(5\beta^{*2}/4))\,(1 + \exp(-\beta^{*2}/4))^{-1}$.

For a binary covariate, the overall event rate $P = (1-B)P1 + B\,P2$, where $P1$ and $P2$ are the event rates at $X = 0$ and $X = 1$, respectively; $B$ is the proportion of the sample with $X = 1$, $V(0) = 1/(1-B) + 1/B$, and $V(\beta^*) = 1/(1-B) + 1/(B\exp(\beta^*))$. The value of $\delta$ for binary covariates is from Whittemore's formula (14): $\delta = (V(0)^{1/2} + V(\beta^*)^{1/2} R)/(V(0)^{1/2} + V(\beta^*)^{1/2})$ where $R$ is from Whittemore's formula (15): $R = V(\beta^*)\,B(1-B)\exp(2\beta^*)/(B\exp(\beta^*) + (1-B))^2$. Note that $R = \delta = 1$ when $\beta^* = 0$.
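For a continuous covariate, formula (4) can be evaluated as below. This is a sketch based on the formula as reconstructed from the text, with a hypothetical function name; it is not the SSIZE or nQuery implementation, and those programs may round differently.

```python
import math
from statistics import NormalDist


def n_formula_4_continuous(p1: float, odds_ratio: float,
                           alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (4) for a continuous covariate.

    p1 is the event rate at the mean of X; odds_ratio is the odds ratio at
    one standard deviation above the mean.
    """
    nd = NormalDist()
    beta_star = math.log(odds_ratio)
    b2 = beta_star ** 2
    v0 = 1.0
    v_beta = math.exp(-b2 / 2)
    # Correction term delta (Hsieh's formula (3) as quoted in the text).
    delta = (1 + (1 + b2) * math.exp(5 * b2 / 4)) / (1 + math.exp(-b2 / 4))
    zsum = (math.sqrt(v0) * nd.inv_cdf(1 - alpha / 2)
            + math.sqrt(v_beta) * nd.inv_cdf(power))
    return zsum ** 2 * (1 + 2 * p1 * delta) / (p1 * b2)


for p1 in (0.5, 0.4, 0.1):
    print(p1, round(n_formula_4_continuous(p1, 1.5), 1))
```

The exponential in $\delta$ grows quickly with $\beta^{*2}$, which is one way to see why the formula becomes unreliable for large odds ratios.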

The proposed method is to use a two-sample test instead of a one-sample test for sample size calculation. The popular sample size formula for testing the equality of two independent sample means with equal sample sizes from two normally distributed groups has the familiar form (see Rosner):

$$n = 2(\sigma_1^2 + \sigma_2^2)(Z_{1-\alpha/2} + Z_{1-\beta})^2/\Delta^2 \tag{5}$$

where $n$ is the total sample size and $\Delta$ is the difference of the two group means to be detected; $\sigma_1^2$ and $\sigma_2^2$ are the variances of the two groups. For an unequal-sample-size design with a sample size ratio of $k$, the required total sample size should be inflated by a factor of $(k+1)^2/(4k)$. Assuming equal variances, the test statistic employs the common variance of the two groups and formula (5) reduces to

$$n = \sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2\,[(k+1)^2/k]/\Delta^2. \tag{6}$$

In a simple logistic regression model with a continuous covariate, the sample size ratio is $k = (1-P1)/P1$ where $P1$ is the event rate of the response at $X = 0$. Therefore, $P1$ is also the overall event rate when $X$ is standardized to have mean 0 and variance 1. By replacing the effect size $\Delta/\sigma$ by $\beta^*$, formula (6) becomes

$$n = (Z_{1-\alpha/2} + Z_{1-\beta})^2/[P1(1-P1)\,\beta^{*2}]. \tag{7}$$

As derived by Whittemore, $1 = V(0) \geq V(\beta^*)$, and therefore formula (4) can be bounded by

$$n \leq (Z_{1-\alpha/2} + Z_{1-\beta})^2 (1 + 2\,P1\,\delta)/(P1\,\beta^{*2}). \tag{8}$$

Formula (7) is more general than the formula derived by Whittemore, who assumed that $P1$ is small and therefore $1/(1-P1)$ is negligible. Note that Hsieh's formula (3) implies that one should not use formula (4) when the odds ratio is large (say ≥ 3).

When the covariate is a binary variable, say $X = 0$ or 1, the log odds value $\beta_1 = 0$ if and only if the two event rates are equal. We can calculate the total sample size from the formula for comparing the two independent event rates (see Rosner):

$$n = (1+k)\left\{Z_{1-\alpha/2}\,[P(1-P)(k+1)/k]^{1/2} + Z_{1-\beta}\,[P1(1-P1) + P2(1-P2)/k]^{1/2}\right\}^2/(P1-P2)^2 \tag{9}$$

where: $k = B/(1-B)$ is the sample size ratio; $B$ is the proportion of the sample with $X = 1$; $P = (1-B)P1 + B\,P2$ is the overall event rate; $P1$ and $P2$ are the event rates at $X = 0$ and $X = 1$, under the alternative hypothesis, respectively. By replacing $k$ by $B/(1-B)$, formula (9) becomes

$$n = \frac{\left\{Z_{1-\alpha/2}\,[P(1-P)/B]^{1/2} + Z_{1-\beta}\,[P1(1-P1) + P2(1-P2)(1-B)/B]^{1/2}\right\}^2}{(P1-P2)^2(1-B)}. \tag{10}$$

For a balanced design, $k = 1$ or $B = 0.5$, formula (10) is bounded by

$$n < 4P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2/(P1-P2)^2. \tag{11}$$

For an unbalanced design, similar to (6), we inflate formula (11) by a factor of $1/[4B(1-B)]$ to obtain a simple approximation:

$$n = P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2/[B(1-B)(P1-P2)^2]. \tag{12}$$

In a recent publication, Liu and Liang extended Self and Mauritsen’s method for correlated observations. As a special case, they provided a closed form for a logistic regression model with one binary covariate. Their closed form, without the adjustment of the design effect for correlated observations, is very similar to (12):

$$n = (Z_{1-\alpha/2} + Z_{1-\beta})^2\,[B\,P1(1-P1) + (1-B)\,P2(1-P2)]/[B(1-B)(P1-P2)^2]. \tag{13}$$

Examples and comparisons of these formulae are provided in Table I.
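The two simple forms (12) and (13) can be compared directly on the unbalanced designs of Table I. The sketch below uses hypothetical function names and rounds up to whole subjects.

```python
import math
from statistics import NormalDist


def _zsum_squared(alpha: float, power: float) -> float:
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2


def n_formula_12(p1: float, p2: float, b: float,
                 alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (12): uses the overall event rate P = (1 - b) * p1 + b * p2."""
    p = (1 - b) * p1 + b * p2
    return (p * (1 - p) * _zsum_squared(alpha, power)
            / (b * (1 - b) * (p1 - p2) ** 2))


def n_formula_13(p1: float, p2: float, b: float,
                 alpha: float = 0.05, power: float = 0.95) -> float:
    """Formula (13): the closed form attributed to Liu and Liang."""
    weighted = b * p1 * (1 - p1) + (1 - b) * p2 * (1 - p2)
    return (_zsum_squared(alpha, power) * weighted
            / (b * (1 - b) * (p1 - p2) ** 2))


for b in (0.2, 0.8):
    print(b,
          math.ceil(n_formula_12(0.05, 0.1, b)),
          math.ceil(n_formula_13(0.05, 0.1, b)))
```

The two forms move in opposite directions as B changes from 0.2 to 0.8, which matches the pattern discussed for Table I: formula (13) weights each event-rate variance by the proportion of the opposite group.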

APPENDIX II

Let var?(b?) and var?(b?) equal the variances of the parameter estimate obtained from multiple

linear regression models with p and 1 covariates, respectively. We show that, most often, the ratio

var?(b?)/var?(b?) is bounded by 1/(1!??1.232p). In addition, var?(b?)/var?(b?) is bounded by

1/(1!??(1q#12p))(232q)) where the partial correlation coefficient ?(1q#12p))(232q)measures the

linear association between covariates X?and X???,2,X?when the values of covariates

X?,2,X?are held fixed.

We begin with one covariate in a linear regression model ½"??#??X?#e where the error

term e is distributed as Normal (0,???) and, for simplicity, the sample mean of X?is 0. The

variance of the least squares estimate b?is known to equal

var?(b?)"???/?X??.

Whentherearetwo covariatesX?and X?withsamplemeans0, thevariance-covariancematrixof

the estimates of the parameters is

var?(b?,b?)"???(X?X)??"????

?X??

?X?X?

?X?X?

?X???

??

whereX is thematrix of covariates.Throughthe inverseof the 2?2 X?Xmatrix, we can obtainthe

variance of b?as

var?(b?)"????X??/(?X???X??!(?X?X?)?)

"(???/???)var?(b?)/(1!????).

Thevalue of ???/???, in most cases, is less than 1 and close to 1. Since the additional covariate in the

model also takes away a degree of freedom from the error term, the estimate of the variance ratio

???/???may sometimes slightly exceed 1. The squared multiple correlation coefficient, in this case

the same as the simple correlation coefficient, is ????"(?X?X?)?/(?X???X??).

Whenthere are three covariates,the multiple correlationcoefficient?????can be obtainedfrom

the matrix operation

??????"[?X?X?, ?X?X?]?

"(2?X?X??X?X??X?X?!?X??(?X?X?)?!?X??(?X?X?)?]/??X??[?X???X??

!(?X?X?)?]?.

With three covariates in the regression model, the variance-covariance matrix of the estimates of

the parameters can be obtained from the inverse of the 3?3 X?X matrix through the formula

?X??

?X?X?

?X?X?

?X???

???

?X?X?

?X?X????X??.

SAMPLE SIZE CALCULATION

1631

Statist. Med. 17, 1623—1634 (1998)

? 1998 John Wiley & Sons, Ltd.

Page 10

var?(b?,b?,b?)"???(X?X)??. Therefore

var?(b?)"???[?X???X??!(?X?X?)?]/[?X???X???X??#2?X?X??X?X??X?X?

!?X??(?X?X?)?!?X??(?X?X?)?!?X??(?X?X?)?]

"(???/???)var?(b?)/(1!??????).

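Usually, $\sigma_3^2/\sigma_1^2 \le 1$ and $\mathrm{var}_3(b_1) \le \mathrm{var}_1(b_1)/(1 - \rho_{1.23}^2)$. In a linear regression model with $p$ parameters, $\mathrm{var}_p(b_1, b_2, \ldots, b_p) = \sigma_p^2 (X'X)^{-1}$. By applying a result of Anderson¹² (equation 20) that the reciprocal of the first diagonal element of $(X'X)^{-1}$ equals $\sum X_1^2 (1 - \rho_{1.23\ldots p}^2)$, we obtain

$$\mathrm{var}_p(b_1) = \frac{\sigma_p^2}{\sum X_1^2 \left(1 - \rho_{1.23\ldots p}^2\right)} = \frac{\sigma_p^2}{\sigma_1^2} \cdot \frac{\mathrm{var}_1(b_1)}{1 - \rho_{1.23\ldots p}^2} .$$

Again, in most situations, $\sigma_p^2/\sigma_1^2 \le 1$ and $\mathrm{var}_p(b_1) \le \mathrm{var}_1(b_1)/(1 - \rho_{1.23\ldots p}^2)$. Then, the VIF $= 1/(1 - \rho_{1.23\ldots p}^2)$ is the approximate upper bound of the ratio $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$. The upper bound does not hold in the rare situation when $\sigma_p^2/\sigma_1^2 > 1$, but the approximation is still good enough. When $p$ is not too large, the bound is tight; when $p$ is large and $\rho_{1.23\ldots p}$ is near 1, the bound is inaccurate.

A similar result holds for nested models. We would like to expand the model from the situation of $q$ covariates to $p$ covariates where $p > q$. Then, reasoning as above,

$$\frac{\mathrm{var}_p(b_1)}{\mathrm{var}_q(b_1)} = \frac{\sigma_p^2}{\sigma_q^2} \cdot \frac{1 - \rho_{1.23\ldots q}^2}{1 - \rho_{1.23\ldots p}^2} \le \frac{1 - \rho_{1.23\ldots q}^2}{1 - \rho_{1.23\ldots p}^2} = \frac{1}{1 - \rho_{(1\,q+1\ldots p).(23\ldots q)}^2},$$

where the partial correlation coefficient $\rho_{(1\,q+1\ldots p).(23\ldots q)}$ measures the linear association between covariates $X_1$ and $X_{q+1}, \ldots, X_p$ when the values of covariates $X_2, \ldots, X_q$ are held fixed. The value of the ratio $\sigma_p^2/\sigma_q^2$ should be closer to 1 than $\sigma_p^2/\sigma_1^2$.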
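The general result can be checked numerically. The Python sketch below computes $\rho_{1.23\ldots p}^2$ from the cross-product matrix and confirms that the first diagonal element of $(X'X)^{-1}$ equals $1/[\sum X_1^2 (1 - \rho_{1.23\ldots p}^2)]$, both for the full model and for a nested model with $q < p$ covariates. The simulated covariates, the helper names (`solve`, `cross_products`, `rho_squared`, `inv11`) and the simplification of a common error variance (all $\sigma^2$ ratios equal to 1) are illustrative assumptions, not part of the original paper.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]             # partial pivoting
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def cross_products(cols):
    """Return X'X for a list of centred covariate columns."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]


def rho_squared(cols):
    """Squared multiple correlation of cols[0] with the remaining columns."""
    s = cross_products(cols)
    a = s[0][1:]                    # [sum X1X2, ..., sum X1Xp]
    m = [row[1:] for row in s[1:]]  # cross-products among X2, ..., Xp
    w = solve(m, a)                 # m^{-1} a
    return sum(ai * wi for ai, wi in zip(a, w)) / s[0][0]


def inv11(cols):
    """First diagonal element of (X'X)^{-1}."""
    e1 = [1.0] + [0.0] * (len(cols) - 1)
    return solve(cross_products(cols), e1)[0]


# Hypothetical data: p covariates sharing a common component z.
random.seed(1)
n, p, q = 500, 5, 3
z = [random.gauss(0, 1) for _ in range(n)]
cols = []
for _ in range(p):
    col = [0.5 * zi + random.gauss(0, 1) for zi in z]
    mean = sum(col) / n
    cols.append([v - mean for v in col])

sxx = sum(v * v for v in cols[0])           # sum X1^2
r2_p = rho_squared(cols)                    # rho^2 of X1 with X2..Xp
r2_q = rho_squared(cols[:q])                # rho^2 of X1 with X2..Xq
vif_p = 1.0 / (1.0 - r2_p)                  # VIF for the full model
ratio_pq = inv11(cols) / inv11(cols[:q])    # var_p(b1)/var_q(b1), equal sigma
print("VIF:", vif_p, "nested ratio:", ratio_pq)
```

Under these assumptions `inv11(cols) * sxx` reproduces `vif_p`, and `ratio_pq` reproduces $(1 - \rho_{1.23\ldots q}^2)/(1 - \rho_{1.23\ldots p}^2)$.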

APPENDIX III

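We use simulations to investigate, in a multiple logistic regression model with $p$ independent binary covariates, how well the ratio of the maximum likelihood estimates of the variances $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$ is approximated by $1/(1 - \rho_{1.23\ldots p}^2)$, where the multiple correlation coefficient relating binary covariates $X_1$ with $X_2, \ldots, X_p$ has the same formula as continuous covariates with a normal distribution:

$$\rho_{1.23\ldots p}^2 = \left[\sum X_1X_2,\ \sum X_1X_3,\ \ldots,\ \sum X_1X_p\right] \begin{bmatrix} \sum X_2^2 & \sum X_2X_3 & \cdots & \sum X_2X_p \\ \sum X_2X_3 & \sum X_3^2 & \cdots & \sum X_3X_p \\ \vdots & \vdots & \ddots & \vdots \\ \sum X_2X_p & \sum X_3X_p & \cdots & \sum X_p^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum X_1X_2 \\ \sum X_1X_3 \\ \vdots \\ \sum X_1X_p \end{bmatrix} \Big/ \sum X_1^2 .$$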

The 80 computer simulations each use a sample size of 1000 with eight binary covariates. When all eight covariates are generated independently, the estimated values of $\rho_{1.23\ldots 7}^2$ are near zero. In order that the response variable $Y$ and the covariates $X$'s be somewhat correlated, and the estimates of $\rho_{1.23\ldots 7}^2$ have a broad range of values, say from 0 to 0.7, the generation of the eight covariates requires some special care.


Figure 1. Results of 80 simulations: estimates of $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$ versus $1/(1 - \rho_{1.23\ldots 8}^2)$

Let $U, V_1, V_2, \ldots,$ and $V_8$ be uniform random variates obtained from a generator in SAS.¹³ The response variable $Y$ is Bernoulli with a parameter value 0.5. The eight covariates $X_1, X_2, \ldots,$ and $X_8$ are also Bernoulli with parameters $B_1, B_2, \ldots,$ and $B_8$, which have values 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85 and 0.9, respectively. The response variable $Y$ is generated such that $Y = 1$ when $U > 0.5$ and $Y = 0$ otherwise. In the first simulation, the covariates are generated such that $X_i = 1$ when $0.1U + 0.9V_i > B_i$ and $X_i = 0$ otherwise, for $i = 1, 2, \ldots, 8$. The same process was repeated for the second simulation except for the generation of $X_2$, where the same random value for $X_1$ was used: $X_2 = 1$ when $0.1U + 0.9V_1 > B_2$ and $X_2 = 0$ otherwise. In the third simulation, the same random value for $X_1$ was used for $X_3$. A similar process continued until the completion of the eighth simulation. After finishing the first eight simulations, the whole process was then repeated ten times to obtain a total of 80 simulations.
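The data-generation scheme just described can be sketched in Python as follows. This is only an illustration under stated assumptions: Python's `random` module stands in for the SAS generator, the function name `generate_sample` and the seed are hypothetical, and the sharing of $V_1$ is taken to be cumulative (in the $k$th simulation, $X_1, \ldots, X_k$ all use $V_1$), which is one reading of the text rather than a confirmed detail.

```python
import random

# Bernoulli parameters B1, ..., B8 quoted in the text.
B = [0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]


def generate_sample(n, shared, rng):
    """Generate n observations (Y, [X1, ..., X8]).

    Covariates X1, ..., X_shared reuse the random value V1; the remaining
    covariates use their own V_i. Cumulative sharing is an assumption.
    """
    rows = []
    for _ in range(n):
        u = rng.random()                      # U
        v = [rng.random() for _ in range(8)]  # V1, ..., V8
        y = 1 if u > 0.5 else 0               # Y = 1 when U > 0.5
        x = []
        for i in range(8):
            vi = v[0] if i < shared else v[i]
            # X_i = 1 when 0.1 U + 0.9 V > B_i, and 0 otherwise
            x.append(1 if 0.1 * u + 0.9 * vi > B[i] else 0)
        rows.append((y, x))
    return rows


rng = random.Random(1998)  # arbitrary seed, for reproducibility only
# Eight designs (shared = 1, ..., 8), repeated ten times: 80 simulations.
simulations = [generate_sample(1000, k, rng)
               for _ in range(10) for k in range(1, 9)]
print(len(simulations), "simulated data sets of size", len(simulations[0]))
```

Each element of `simulations` is one data set of 1000 observations. Sharing $V_1$ across more covariates increases their mutual correlation, which is what spreads the estimates of the squared multiple correlation over a broad range.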

In practice, the estimated values of $\mathrm{var}_p(b_1)$ and $\rho_{1.23\ldots p}^2$ (same as $R^2$) can be obtained from SAS PROC LOGISTIC and PROC REG,¹³ respectively. The estimates of $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$ versus $1/(1 - \rho_{1.23\ldots p}^2)$ from the simulations are plotted in Figure 1. The simulation results show that, for binary covariates, the estimates of $1/(1 - \rho_{1.23\ldots p}^2)$ closely approximate the value of the estimates of the ratio $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$. Figure 1 shows that the estimates of $1/(1 - \rho_{1.23\ldots p}^2)$ very slightly underestimate the variance ratio $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$.

ACKNOWLEDGEMENTS

The authors thank Drs. Philip Lavori, Kelvin Lee, a referee and the editor for valuable comments and editorial suggestions which strengthened the content. This work was supported in part by the DVA Cooperative Studies Program of the Veteran Health Administration, NIH grant AR20610 (Multipurpose Arthritis Center) and Y01-DA-40032-0 (National Institute on Drug Abuse), the latter to the VA Cooperative Studies Program.


REFERENCES

1. Whittemore, A. ‘Sample size for logistic regression with small response probability’, Journal of the American Statistical Association, 76, 27–32 (1981).

2. Hsieh, F. Y. ‘Sample size tables for logistic regression’, Statistics in Medicine, 8, 795–802 (1989).

3. Self, S. G. and Mauritsen, R. H. ‘Power/sample size calculations for generalized linear models’, Biometrics, 44, 1, 79–86 (1988).

4. Hosmer, D. W. and Lemeshow, S. Applied Logistic Regression, Wiley, New York, 1989, p. 56.

5. Sokal, R. R. and Rohlf, F. J. Biometry, W. H. Freeman and Company, New York, 1995, p. 578.

6. Elashoff, J. nQuery Advisor Sample Size and Power Determination, Statistical Solutions Ltd., Boston, MA, 1996.

7. Hsieh, F. ‘SSIZE: A sample size program for clinical and epidemiologic studies’, American Statistician, 45, 338 (1991).

8. SERC. EGRET SIZ sample size and power for nonlinear regression models, Statistics and Epidemiology Research Corp., Seattle, WA, 1992.

9. Keane, T. M., Kolb, L. C. and Thomas, R. G. ‘A Psychophysiological Study of Chronic Post-Traumatic Stress Disorder’, Cooperative Study No. 334, Cooperative Studies Program Coordinating Center, VA Medical Center, Palo Alto, California, U.S.A., 1988.

10. Rosner, B. Fundamentals of Biostatistics, 4th edn, PWS-KENT Publishing Company, 1995, p. 283 and 384.

11. Liu, G. and Liang, K. Y. ‘Sample size calculation for studies with correlated observations’, Biometrics, 53, 537–547 (1997).

12. Anderson, T. W. An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958, p. 32.

13. SAS Institute Inc. SAS/STAT User’s Guide, Version 6 (Vol. 1 and 2), Cary, NC, 1990.

F. HSIEH, D. BLOCH AND M. LARSEN

Statist. Med. 17, 1623–1634 (1998)

© 1998 John Wiley & Sons, Ltd.
