A Simple Method of Sample Size Calculation for Linear and Logistic Regression

STATISTICS IN MEDICINE
Statist. Med. 17, 1623–1634 (1998)
A SIMPLE METHOD OF SAMPLE SIZE CALCULATION FOR
LINEAR AND LOGISTIC REGRESSION
F. Y. HSIEH1*, DANIEL A. BLOCH2 AND MICHAEL D. LARSEN3
1 CSPCC, Department of Veterans Affairs, Palo Alto Health Care System (151-K), Palo Alto, California 94304, U.S.A.
2 Division of Biostatistics, Department of Health Research and Policy, Stanford University, Stanford, California 94305, U.S.A.
3 Department of Statistics, Stanford University, Stanford, California 94305, U.S.A.
SUMMARY
A sample size calculation for logistic regression involves complicated formulae. This paper suggests use of
sample size formulae for comparing means or for comparing proportions in order to calculate the required
sample size for a simple logistic regression model. One can then adjust the required sample size for a multiple
logistic regression model by a variance inflation factor. This method requires no assumption of low response
probability in the logistic model as in a previous publication. One can similarly calculate the sample size for
linear regression models. This paper also compares the accuracy of some existing sample-size software for
logistic regression with computer power simulations. An example illustrates the methods. © 1998 John
Wiley & Sons, Ltd.
INTRODUCTION
In a multiple logistic regression analysis, one frequently wishes to test the effect of a specific
covariate, possibly in the presence of other covariates, on the binary response variable. Owing to
the nature of non-linearity, the sample size calculation for logistic regression is complicated.
Whittemore1 proposed a formula, derived from the information matrix, for small response
probabilities. Hsieh2 simplified and extended the formula for general situations by using the
upper bound of the formula. Appendix I presents a simple closed form, based on an information
matrix, to approximate the sample size for both continuous and binary covariates in a simple
logistic regression. In a different approach, Self and Mauritsen3 used generalized linear models
and the score tests to estimate the sample size through an iterative procedure. These published
methods are complicated and may not be more accurate than the conventional sample size
formulae for comparing two means or a test of equality of proportions. In the next section, we
present a simple formula for the approximate sizes of the sample required for simple logistic
regression by using formulae for calculating sample size for comparing two means or for
comparing two proportions. We can then adjust the sample size requirement for a multiple
logistic regression by a variance inflation factor. This approach applies to multiple linear
regression as well.
* Correspondence to: F. Y. Hsieh, CSPCC, Department of Veterans Affairs, Palo Alto Health Care System (151-K), Palo Alto, California 94304, U.S.A.
Contract/grant sponsor: Department of Veterans Affairs Cooperative Studies Program
Contract/grant sponsor: NIH
Contract/grant number: AR20610
Contract/grant sponsor: National Institute on Drug Abuse
Contract/grant number: Y01-DA-40032-0
CCC 0277-6715/98/141623-12$17.50
Received February 1997; Revised October 1997
SIMPLE LOGISTIC REGRESSION
In a simple logistic regression model, we relate a covariate $X_1$ to the binary response variable $Y$ in
a model $\log(P/(1-P)) = \beta_0 + \beta_1 X_1$, where $P = \mathrm{prob}(Y = 1)$. We are interested in testing the
null hypothesis $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 = \beta^*$, where $\beta^* \neq 0$, that the covariate is
related to the binary response variable. The slope coefficient $\beta_1$ is the change in log odds for an
increase of one unit$^{4}$ in $X_1$. When the covariate is a continuous variable with a normal
distribution, the log odds value $\beta_1$ is zero if and only if the group means, assuming equal
variances, between the two response categories are the same. Therefore we may use a sample size
formula for a two-sample t-test to calculate the required sample size. For simplicity, we use
a normal approximation instead, as the sample size formula (see formula (7) in Appendix I) may
be easily changed to include t-tests if required:
$$n = (Z_{1-\alpha/2} + Z_{1-\beta})^2 / [P1(1-P1)\beta^{*2}] \tag{1}$$
where $n$ is the required total sample size, $\beta^*$ is the effect size to be tested, P1 is the event rate at the
mean of $X$, and $Z_u$ is the upper $u$th percentile of the standard normal distribution.
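As a rough illustration of formula (1), the following Python sketch evaluates the total sample size for a continuous covariate. The function name `n_simple_logistic_continuous` and its argument names are illustrative assumptions and do not come from the paper; the normal quantiles are taken from Python's standard library.

```python
from statistics import NormalDist


def n_simple_logistic_continuous(p1, beta_star, alpha=0.05, power=0.95):
    """Total sample size from formula (1), normal approximation.

    p1        -- event rate at the mean of X
    beta_star -- log odds ratio to be detected (effect size)
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = z.inv_cdf(power)           # power = 1 - beta
    return (z_alpha + z_beta) ** 2 / (p1 * (1 - p1) * beta_star ** 2)


if __name__ == "__main__":
    # Settings of Table II: power 95 per cent, two-sided level 5 per cent.
    for p1 in (0.5, 0.4, 0.1):
        print(p1, round(n_simple_logistic_continuous(p1, 0.405)))
```

With an effect size of 0.405 and P1 = 0.5, 0.4 and 0.1, the rounded results are 317, 330 and 880, which agree with the formula (1) rows of Table II.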
When the covariate is a binary variable, say $X = 0$ or 1, the log odds value $\beta_1 = 0$ if and only if
the two event rates are equal. The sample size formula for the total sample size required for
comparing two independent event rates has the following form (see formula (10)):
$$n = \{Z_{1-\alpha/2}[P(1-P)/B]^{1/2} + Z_{1-\beta}[P1(1-P1) + P2(1-P2)(1-B)/B]^{1/2}\}^2 / [(P1-P2)^2(1-B)] \tag{2}$$
where $P$ $(= (1-B)P1 + B\,P2)$ is the overall event rate; $B$ is the proportion of the sample with
$X = 1$; P1 and P2 are the event rates at $X = 0$ and $X = 1$, respectively. For $B = 0.5$, the required
sample size is bounded by the following simple form (see formula (11)):
$$n < 4P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2/(P1-P2)^2. \tag{3}$$
Appendix I presents two simpler forms, formulae (12) and (13), than formula (2). A later section
presents the comparisons of these formulae with computer power simulations.
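Formulae (2) and (3) can be sketched in a few lines of Python. The function names `n_two_proportions` and `n_balanced_bound` are hypothetical, chosen only for this illustration; the code assumes the two-sided normal-approximation test without continuity correction described above.

```python
from math import sqrt
from statistics import NormalDist


def n_two_proportions(p1, p2, b, alpha=0.05, power=0.95):
    """Total sample size from formula (2); b is the proportion with X = 1."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p = (1 - b) * p1 + b * p2  # overall event rate
    null_term = z_alpha * sqrt(p * (1 - p) / b)
    alt_term = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2) * (1 - b) / b)
    return (null_term + alt_term) ** 2 / ((p1 - p2) ** 2 * (1 - b))


def n_balanced_bound(p1, p2, alpha=0.05, power=0.95):
    """Upper bound from formula (3) for a balanced design (B = 0.5)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    p = 0.5 * (p1 + p2)  # overall event rate when B = 0.5
    return 4 * p * (1 - p) * z_sum ** 2 / (p1 - p2) ** 2


if __name__ == "__main__":
    print(n_two_proportions(0.4, 0.5, 0.5))  # about 1280.5
    print(n_balanced_bound(0.4, 0.5))        # about 1286.5
```

For P1 = 0.4, P2 = 0.5 and B = 0.5 the two values lie close to the 1282 and 1287 reported in Table I; the small differences appear to come from rounding conventions, which the paper does not state.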
MULTIPLE LOGISTIC REGRESSION
When there is more than one covariate in the model, a hypothesis of interest is the effect of
a specific covariate in the presence of other covariates. In terms of log odds parameters, the null
hypothesis for multiple logistic regression is $H_0: [\beta_1, \beta_2, \ldots, \beta_p] = [0, \beta_2, \ldots, \beta_p]$ against the
alternative $[\beta^*, \beta_2, \ldots, \beta_p]$. Let $b_1$ be the maximum likelihood estimate of $\beta_1$. Whittemore$^{1}$ has
shown that, for continuous, normal covariates $X$, the variance of $b_1$ in the multivariate setting
with $p$ covariates, $\mathrm{var}_p(b_1)$, can be approximated by inflating the variance of $b_1$ obtained from the
one parameter model, $\mathrm{var}_1(b_1)$, by multiplying by $1/(1-\rho^2_{1.23\ldots p})$, where $\rho_{1.23\ldots p}$ is the multiple
correlation coefficient relating $X_1$ with $X_2, \ldots, X_p$. That is, approximately
$$\mathrm{var}_p(b_1) = \mathrm{var}_1(b_1)/(1-\rho^2_{1.23\ldots p}).$$
The squared multiple correlation coefficient $\rho^2_{1.23\ldots p}$, also known as $R^2$, is equal to the proportion
of the variance of $X_1$ explained by the regression relationship with $X_2, \ldots, X_p$. The term
$1/(1-\rho^2_{1.23\ldots p})$ will be referred to as a variance inflation factor (VIF). The required sample size for
the multivariate case can also be approximated from the univariate case by inflating it with the
same factor $1/(1-\rho^2_{1.23\ldots p})$. Following the relationship of the variances, we have
$n_p = n_1/(1-\rho^2_{1.23\ldots p})$, where $n_p$ and $n_1$ are the sample sizes required for a logistic regression model
with $p$ and 1 covariates, respectively. The same VIF seems to work well for binary covariates (see
Appendix III).
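The VIF adjustment is a single division, as the following minimal sketch shows. The helper names `vif` and `inflate_sample_size` are assumptions made for this illustration, and the numbers in the usage lines (a one-covariate size of 317 and an R-squared of 0.25) are arbitrary inputs rather than values from the paper.

```python
def vif(r_squared):
    """Variance inflation factor 1 / (1 - R^2) for the covariate of interest."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def inflate_sample_size(n_one_covariate, r_squared):
    """Approximate n_p = n_1 / (1 - R^2) for a model with p covariates."""
    return n_one_covariate * vif(r_squared)


if __name__ == "__main__":
    # Hypothetical inputs: n_1 = 317 and R^2 = 0.25 between X1 and the others.
    print(inflate_sample_size(317, 0.25))  # about 422.7
```

Here R-squared is the squared multiple correlation of the covariate of interest with the remaining covariates, not the R-squared of the response model.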
MULTIPLE LINEAR REGRESSION
For multiple linear regression models, we can easily derive the same VIF for $p$ covariates (see
Appendix II). Therefore, we can adjust similarly the sample size for a regression model with
$p$ covariates. It is known that in a simple linear regression model, the correlation coefficient $\rho$ and
the regression parameter $\beta_1$ have the relationship $\rho = \beta_1 \sigma_X/\sigma_Y$. Hence $\rho = 0$ if and only if $\beta_1 = 0$.
When both $X$ and $Y$ are standardized, testing the hypotheses that $\rho = 0$ and that $\beta_1 = 0$ are
equivalent and the required sample sizes are the same.
Let $r$ be the estimate of the correlation coefficient between $X$ and $Y$. The sample size formula
(see Sokal and Rohlf$^{5}$) for testing $H_0: \rho = 0$ against the alternative $H_1: \rho = r$ is
$$n_1 = (Z_{1-\alpha/2} + Z_{1-\beta})^2 / C(r)^2 + 3$$
where the Fisher's transformation $C(r) = \tfrac{1}{2}\log((1+r)/(1-r))$. If we add $p-1$ covariates to the
regression model, the required sample size for testing $H_0: [\beta_1, \beta_2, \ldots, \beta_p] = [0, \beta_2, \ldots, \beta_p]$
against the alternative $[\beta^*, \beta_2, \ldots, \beta_p]$ is $n_p = n_1/(1-\rho^2_{1.23\ldots p})$, approximately. If we already have
$q$ covariates in the model and would like to expand the model to $p$ $(> q)$ covariates, then, from
Appendix II, $n_p = n_q\,(\mathrm{var}_p(b_1)/\mathrm{var}_q(b_1)) = n_q/(1-\rho^2_{(1\,q+1\ldots p)\cdot(23\ldots q)})$, where the partial correlation
coefficient $\rho_{(1\,q+1\ldots p)\cdot(23\ldots q)}$ measures the linear association between covariates $X_1$ and
$X_{q+1}, \ldots, X_p$ when the values of covariates $X_2, \ldots, X_q$ are held fixed.
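The correlation-based calculation for linear regression can be sketched as follows. The function names `fisher_c`, `n_correlation` and `n_multiple_linear` are illustrative assumptions; the example value r = 0.3 is an arbitrary input, not one taken from the paper.

```python
from math import log
from statistics import NormalDist


def fisher_c(r):
    """Fisher's transformation C(r) = 0.5 * log((1 + r) / (1 - r))."""
    return 0.5 * log((1 + r) / (1 - r))


def n_correlation(r, alpha=0.05, power=0.95):
    """n_1 for testing rho = 0 against rho = r in simple linear regression."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum ** 2 / fisher_c(r) ** 2 + 3


def n_multiple_linear(r, rho_squared, alpha=0.05, power=0.95):
    """n_p = n_1 / (1 - rho^2), where rho^2 relates X1 to the other covariates."""
    return n_correlation(r, alpha, power) / (1 - rho_squared)


if __name__ == "__main__":
    print(n_correlation(0.3))           # about 138.6
    print(n_multiple_linear(0.3, 0.1))  # about 154.0
```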
COMPARISON OF SAMPLE-SIZE SOFTWARE
There are at least two computer programs available that use formula (4) (see Appendix I): nQuery
from Dr. Janet Elashoff,6 and SSIZE7 from the first author. One program, EGRET SIZ from
SERC,8 uses the approach of Self and Mauritsen.3 For logistic regression, the computer programs
nQuery and SSIZE provide sample sizes only for continuous covariates while EGRET SIZ only
provides estimates for discrete covariates. Both nQuery and EGRET SIZ are commercial
software. Note that the sample size calculation for logistic regression is only one of the many
features provided by the above three computer programs.
Table I presents sample size examples for a binary covariate using formula (4) and software
EGRET SIZ as well as the corresponding sample size for comparing two proportions (without
continuity correction from formulae (2), (3), (12) and (13)), and the results of power simulations. In
the table, P1 and P2 are event rates at $X = 0$ and $X = 1$, respectively; $B$ is the proportion
of the sample with $X = 1$; OR is the odds ratio of $X = 1$ versus $X = 0$ such that
$\mathrm{OR} = P2(1-P1)/(P1(1-P2))$; $P = (1-B)P1 + B\,P2$ is the overall event rate or case fraction.
Table I is designed to show the relationship of sample sizes for different study designs. It is known
that a balanced design ($B = 0.5$) requires a smaller sample size than an unbalanced design
($B = 0.2$ or 0.8); a low prevalence rate ($B = 0.2$) requires a smaller sample size than a high prevalence
rate ($B = 0.8$); sample size remains the same if the odds ratio is reversed. In addition to the
significance level and the power of the test, the values of the following parameters, listed after the
sample size methods, are specified in the table:
Formula (4): P1, P2 and B.
Formula (2): tests of proportions: P, P1, P2 and B.
Formula (3): simple form for a balanced design: P, P1 and P2.
Formula (12): simple form for an unbalanced design: P, P1, P2 and B.
Formula (13): simple form for an unbalanced design: P1, P2 and B.
SIZ: OR, sampling fractions 1 − B and B, and overall case fraction P.

Table I. Results of sample size calculations for a binary covariate from six different methods,
power = 95 per cent, two-sided significance level 5 per cent

Design                                                          Sample size    Power simulation
Balanced design with high event rates
  (4): P1 = 0.4, P2 = 0.5, B = 0.5                                     1367    96.0 ± 0.63%
  (2): P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5                           1282    95.4 ± 0.66%
  (3): P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5                           1287    94.7 ± 0.71%
  (12): P = 0.45, P1 = 0.4, P2 = 0.5, B = 0.5                          1287    94.7 ± 0.71%
  (13): P1 = 0.4, P2 = 0.5, B = 0.5                                    1274    94.6 ± 0.71%
  SIZ: OR = 1.5, case fraction P = 0.45, sampling fraction 50/50       1285    95.9 ± 0.63%
Balanced design with low odds ratio
  (4): P1 = 0.5, P2 = 0.2, B = 0.5                                      141    96.3 ± 0.60%
  (2): P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5                            126    95.0 ± 0.69%
  (3): P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5                            131    96.6 ± 0.57%
  (12): P = 0.35, P1 = 0.5, P2 = 0.2, B = 0.5                           131    96.6 ± 0.57%
  (13): P1 = 0.5, P2 = 0.2, B = 0.5                                     119    94.9 ± 0.70%
  SIZ: OR = 0.25, case fraction P = 0.35, sampling fraction 50/50       129    96.1 ± 0.61%
Balanced design with high odds ratio
  (4): P1 = 0.2, P2 = 0.5, B = 0.5                                      166    99.0 ± 0.31%
  (2): P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5                            126    95.0 ± 0.69%
  (3): P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5                            131    96.6 ± 0.57%
  (12): P = 0.35, P1 = 0.2, P2 = 0.5, B = 0.5                           131    96.6 ± 0.57%
  (13): P1 = 0.2, P2 = 0.5, B = 0.5                                     119    92.9 ± 0.81%
  SIZ: OR = 4.0, case fraction P = 0.35, sampling fraction 50/50        129    95.4 ± 0.66%
Balanced design with high odds ratio
  (4): P1 = 0.05, P2 = 0.1, B = 0.5                                    1818    98.2 ± 0.42%
  (2): P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5                         1437    94.4 ± 0.73%
  (3): P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5                         1443    95.8 ± 0.63%
  (12): P = 0.075, P1 = 0.05, P2 = 0.1, B = 0.5                        1443    95.8 ± 0.63%
  (13): P1 = 0.05, P2 = 0.1, B = 0.5                                   1430    94.4 ± 0.73%
  SIZ: OR = 2.111, case fraction P = 0.075, sampling fraction 50/50    1417    94.5 ± 0.72%
Low prevalence rate
  (4): P1 = 0.05, P2 = 0.1, B = 0.2                                    2612    97.4 ± 0.50%
  (2): P = 0.06, P1 = 0.05, P2 = 0.1, B = 0.2                          2186    94.9 ± 0.70%
  (12): P = 0.06, P1 = 0.05, P2 = 0.1, B = 0.2                         1833    91.2 ± 0.90%
  (13): P1 = 0.05, P2 = 0.1, B = 0.2                                   2648    97.4 ± 0.50%
  SIZ: OR = 2.111, case fraction P = 0.06, sampling fraction 80/20     2070    94.6 ± 0.71%
High prevalence rate
  (4): P1 = 0.05, P2 = 0.1, B = 0.8                                    3060    98.3 ± 0.41%
  (2): P = 0.09, P1 = 0.05, P2 = 0.1, B = 0.8                          2257    95.0 ± 0.69%
  (12): P = 0.09, P1 = 0.05, P2 = 0.1, B = 0.8                         2661    97.8 ± 0.46%
  (13): P1 = 0.05, P2 = 0.1, B = 0.8                                   1820    89.5 ± 0.97%
  SIZ: OR = 2.111, case fraction P = 0.09, sampling fraction 20/80     2347    97.2 ± 0.52%
The power simulations, obtained from SIZ with 1000 replications, use the likelihood ratio test for
the logistic regression model. The simulations show that the sample sizes obtained from testing
two proportions (formulae (2) and (3)) have statistical power within one standard deviation of the
expected power of 95 per cent. Also, formulae (2) and (3) are more stable than the other four
methods. Note that formula (4) calculates the required total number of events based on the event
rate corresponding to $X = 0$, then inflates the number of events to obtain the total sample size.
Therefore, formula (4) produces a larger sample size if the lower event rate is assigned to P1
instead of P2. Formula (4) tends to overestimate the required sample sizes especially when the
event rates are low (see Table I). Formula (3) is a special case of formula (12) for a balanced design.
As shown in Table I, formula (3) gives the same sample sizes as formula (12) when $B = 0.5$, but
a slightly larger sample size than formula (2). Since formula (3) is designed for $B = 0.5$, no sample
sizes for formula (3) are given for low or high prevalence rate. Formulae (12) and (13) are simpler
than formula (2), but lack accuracy when the sample size ratio is not close to 1 (say > 2 or < 0.5),
and should not be used when the accuracy of sample size calculation is important. It is known
that a design with a low prevalence rate requires a smaller sample size than one with a high prevalence
rate. In Table I, formula (13) does not show this relationship, which indicates that the formula overestimates the
sample size for a low prevalence rate and underestimates it for a high prevalence rate.
Table II presents the results for a continuous covariate from sample size programs nQuery and
SSIZE. The corresponding sample sizes from a two-sample t-test (formula (6) with Z-values
replaced by t-values) and from formula (1) are also listed for comparison. The table specifies the
following parameters indicated after the sample size methods:
Formula (1): P1, effect size = log(OR) = β*.
Two-sample t-test: effect size = log(OR),
sample size ratio = prob(Y = 0)/prob(Y = 1) = (1 − P1)/P1.
nQuery: P1 (event rate at the mean of X),
P2 (event rate at one standard deviation above the mean of X).
SSIZE: P1 (event rate at the mean of X),
OR (odds ratio at one standard deviation above the mean of X)
= P2(1 − P1)/(P1(1 − P2)).
Table II. Results of sample size calculations for a continuous covariate from four different
methods, power = 95 per cent, two-sided significance level 5 per cent

Design                                                          Sample size    Power simulation
Balanced design
  (1): P1 = 0.5, effect size β* = 0.405                                 317    95.0 ± 0.69%
  t-test: effect size = 0.405, sample size ratio = 1                    320    95.5 ± 0.66%
  nQuery: P1 = 0.5, P2 = 0.6                                            342    96.1 ± 0.61%
  SSIZE: P1 = 0.5, OR = 1.5                                             341    95.3 ± 0.67%
Unbalanced design, high event rates
  (1): P1 = 0.4, effect size β* = 0.405                                 330    94.4 ± 0.73%
  t-test: effect size = 0.405, sample size ratio = 1.5                  333    94.8 ± 0.70%
  nQuery: P1 = 0.4, P2 = 0.5                                            380    96.7 ± 0.56%
  SSIZE: P1 = 0.4, OR = 1.5                                             379    96.7 ± 0.56%
Unbalanced design, low event rates
  (1): P1 = 0.1, effect size β* = 0.405                                 880    95.5 ± 0.66%
  t-test: effect size = 0.405, sample size ratio = 9                    890    96.1 ± 0.61%
  nQuery: P1 = 0.1, P2 = 0.143                                          951    96.6 ± 0.57%
  SSIZE: P1 = 0.1, OR = 1.5                                             950    96.6 ± 0.57%

Table II also provides power simulations obtained from 1000 replications generated by assuming
a normally distributed variable X. We used the Wald test in the simulation of the logistic
regression model. The results show that the sample sizes estimated by using the two-sample t-test
formula and formula (1) seem to be more conservative, but still large enough to achieve the
desired power. In other words, Table II seems to indicate that the t-test is a good estimate of
sample size which preserves power. Since we used the upper bound of the required sample size in
the formulae in both nQuery and SSIZE, both programs provide sample sizes slightly higher than
those required. When the odds ratio is fixed, a balanced design (that is, response rate P1 = 0.5)
requires a smaller sample size than an unbalanced design (for example, P1 = 0.4 or 0.1). Note that due
to the exponential nature of the correction term (see Appendix I), we do not recommend use of
either software for logistic regression when the odds ratio is large (say ≥ 3).
EXAMPLE
We use a Department of Veterans Affairs Cooperative Study entitled ‘A Psychophysiological
Study of Chronic Post-Traumatic Stress Disorder’9 to illustrate the preceding sample size
calculation for logistic regression with continuous covariates. The study developed and validated
a logistic regression model to explore the use of certain psychophysiological measurements for the
prognosis of combat-related post-traumatic stress disorder (PTSD). In the study, patients’ four
psychophysiological measurements (heart rate, blood pressures, EMG and skin conductance)
were recorded while patients were exposed to video tapes containing combat and neutral scenes.
Among the psychophysiological variables, the difference of the heart rates obtained while viewing
the combat and the neutral tapes (DCNHR) is considered a good predictor of the diagnosis of
PTSD. The prevalence rate of PTSD among the Vietnam veterans was assumed to be 20 per cent.
Therefore, we assumed a four to one sample size ratio for the non-PTSD versus PTSD groups.
The effect size of DCNHR is approximately 0.3, which is the difference of the group means divided
by the standard deviation. With a two-sided significance level of 0.05 and a power of 95 per cent,
the required sample size based on a two-sample t-test is 905. The squared multiple correlation
coefficient of DCNHR versus the other three psychophysiological variables was estimated to be
0.1 and thus the VIF is 1.11. After adjusting for the VIF, a sample size of 1005 was needed for
fitting a multiple logistic regression model.
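The arithmetic of this example can be retraced with a short sketch. It uses the normal-approximation formula (6) from Appendix I rather than the t-test, so its first result is slightly below the 905 quoted in the text; the function name `n_two_sample_normal` is an assumption made for illustration.

```python
from statistics import NormalDist


def n_two_sample_normal(effect_size, k, alpha=0.05, power=0.95):
    """Total sample size from formula (6), with Delta / sigma = effect_size
    and sample size ratio k between the two response groups."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum ** 2 * (k + 1) ** 2 / k / effect_size ** 2


# Effect size 0.3 and a four-to-one ratio of non-PTSD to PTSD subjects.
n_normal = n_two_sample_normal(0.3, 4)  # normal approximation, about 902.4

# VIF from the squared multiple correlation of 0.1 quoted in the text.
vif = 1 / (1 - 0.1)

# 905 is the t-test-based size reported in the text, not computed here.
n_adjusted = 905 * vif  # about 1005.6

print(n_normal, vif, n_adjusted)
```

The normal approximation gives about 902.4 before adjustment, and inflating the reported 905 by the VIF gives about 1005.6, in line with the figure of 1005 stated above.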
CONCLUSION
The proposed simple methods to calculate sample size for linear and logistic regression models
have several advantages. The formulae for the simple methods are well known and do not require
specialized software. This paper also provides simple forms of the formulae for easy hand
calculation. Compared to more accurate, but more complicated formulae, formulae (1) and (3)
have high degrees of accuracy. Computer simulations suggest that the proposed sample size
methods for comparing means and for comparing proportions are more accurate than SSIZE,
nQuery and EGRET SIZ. This paper suggests not using SSIZE or nQuery when the odds ratio is
large (say ≥ 3), nor Liu and Liang’s formula (13) when the sample size ratio is not close to 1
(say > 2 or < 0.5). This paper derives the variance inflation factor (VIF) for the linear regression
model and also shows, through computer simulations, that the same VIF applies to the logistic
regression model with binary covariates. The usage of the VIF to expand the sample size
calculation from one covariate to more than one covariate appears very useful and can be
extended to other multivariate models. In conclusion, this paper presents more accurate and
simple formulae for sample size calculation with extensions to multivariate models of various
types.
APPENDIX I
In a simple logistic regression model $\log(P/(1-P)) = \beta_0 + \beta_1 X_1$, where $P = \mathrm{prob}(Y = 1)$, the
hypothesis $H_0: \beta_1 = 0$ against $H_1: \beta_1 = \beta^*$ is of interest. A power of $1-\beta$ and a two-sided
significance level $\alpha$ are usually prespecified to calculate the sample size for the hypothesis test. The
following sample size formula, used in both SSIZE and nQuery, is a combination of Whittemore$^{1}$
formulae (6) and (16):
$$n = (V(0)^{1/2} Z_{1-\alpha/2} + V(\beta^*)^{1/2} Z_{1-\beta})^2 (1 + 2\,P1\,\delta)/(P1\,\beta^{*2}) \tag{4}$$
where the log odds value $\beta^* = \log(P2(1-P1)/(P1(1-P2)))$, and $Z_{1-\beta}$ and $Z_{1-\alpha/2}$ are standard
normal variables with a tail probability of $\beta$ and $\alpha/2$, respectively.
For a continuous covariate, $V(0) = 1$, $V(\beta^*) = \exp(-\beta^{*2}/2)$, P1 and P2 are the event rates at
the mean of $X$ and one SD above the mean, respectively. The value of $\delta$ for continuous covariates
is from Hsieh$^{2}$ formula (3): $\delta = (1 + (1+\beta^{*2})\exp(5\beta^{*2}/4))(1 + \exp(-\beta^{*2}/4))^{-1}$.
For a binary covariate, the overall event rate $P = (1-B)P1 + B\,P2$, where P1 and P2 are the
event rates at $X = 0$ and $X = 1$, respectively; $B$ is the proportion of the sample with $X = 1$,
$V(0) = 1/(1-B) + 1/B$, and $V(\beta^*) = 1/(1-B) + 1/(B\exp(\beta^*))$. The value of $\delta$ for binary covariates
is from Whittemore$^{1}$ formula (14): $\delta = (V(0)^{1/2} + V(\beta^*)^{1/2} R)/(V(0)^{1/2} + V(\beta^*)^{1/2})$, where $R$ is
from Whittemore$^{1}$ formula (15): $R = V(\beta^*)B(1-B)\exp(2\beta^*)/(B\exp(\beta^*) + (1-B))^2$. Note that
$R = \delta = 1$ when $\beta^* = 0$.
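The sketch below evaluates formula (4) for both covariate types. It reads the formula as n = (V(0)^(1/2) Z_(1-alpha/2) + V(beta*)^(1/2) Z_(1-beta))^2 (1 + 2 P1 delta)/(P1 beta*^2), with delta and R as defined in this appendix; that reading, and all function names, are assumptions of this illustration rather than code from SSIZE or nQuery.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def _formula4(v0, vb, delta, p1, beta_star, alpha, power):
    """Shared core of formula (4) once V(0), V(beta*) and delta are known."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    core = (sqrt(v0) * z_alpha + sqrt(vb) * z_beta) ** 2
    return core * (1 + 2 * p1 * delta) / (p1 * beta_star ** 2)


def n_formula4_continuous(p1, odds_ratio, alpha=0.05, power=0.95):
    """Continuous covariate: odds_ratio is per one SD above the mean of X."""
    beta_star = log(odds_ratio)
    b2 = beta_star ** 2
    delta = (1 + (1 + b2) * exp(5 * b2 / 4)) / (1 + exp(-b2 / 4))
    return _formula4(1.0, exp(-b2 / 2), delta, p1, beta_star, alpha, power)


def n_formula4_binary(p1, p2, b, alpha=0.05, power=0.95):
    """Binary covariate: p1, p2 are event rates at X = 0, 1; b = prob(X = 1)."""
    beta_star = log(p2 * (1 - p1) / (p1 * (1 - p2)))
    v0 = 1 / (1 - b) + 1 / b
    vb = 1 / (1 - b) + 1 / (b * exp(beta_star))
    r = vb * b * (1 - b) * exp(2 * beta_star) / (b * exp(beta_star) + (1 - b)) ** 2
    delta = (sqrt(v0) + sqrt(vb) * r) / (sqrt(v0) + sqrt(vb))
    return _formula4(v0, vb, delta, p1, beta_star, alpha, power)


if __name__ == "__main__":
    print(n_formula4_binary(0.4, 0.5, 0.5))  # about 1367.5
    print(n_formula4_binary(0.5, 0.2, 0.5))  # about 141.3
    print(n_formula4_continuous(0.5, 1.5))   # about 341.2
```

Under this reading the results are close to the 1367 and 141 listed for formula (4) in Table I and to the 341 listed for SSIZE in Table II, which supports the reconstruction but does not prove it.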
The proposed method is to use a two-sample test instead of a one-sample test for sample
size calculation. The popular sample size formula for testing the equality of two independent
sample means with equal sample sizes from two normally distributed groups has the familiar
form (see Rosner$^{10}$):
$$n = 2(\sigma_1^2 + \sigma_2^2)(Z_{1-\alpha/2} + Z_{1-\beta})^2/\Delta^2 \tag{5}$$
where $n$ is the total sample size and $\Delta$ is the difference of the two group means to be detected;
$\sigma_1^2$ and $\sigma_2^2$ are the variances of the two groups. For an unequal-sample-size design with a sample
size ratio of $k$, the required total sample size should be inflated by a factor of $(k+1)^2/(4k)$.
Assuming equal variances, the test statistic employs the common variance $\sigma^2$ of the two groups and
formula (5) reduces to
$$n = \sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2 [(k+1)^2/k]/\Delta^2. \tag{6}$$
In a simple logistic regression model with a continuous covariate, the sample size ratio is
$k = (1-P1)/P1$, where P1 is the event rate of the response at $X = 0$. Therefore, P1 is also the
overall event rate when $X$ is standardized to have mean 0 and variance 1. By replacing the effect
size $\Delta/\sigma$ by $\beta^*$, formula (6) becomes
$$n = (Z_{1-\alpha/2} + Z_{1-\beta})^2/[P1(1-P1)\beta^{*2}]. \tag{7}$$
As derived by Whittemore,$^{1}$ $V(0) \geq V(\beta^*)$, and therefore formula (4) can be bounded by
$$n \leq (Z_{1-\alpha/2} + Z_{1-\beta})^2 (1 + 2\,P1\,\delta)/(P1\,\beta^{*2}). \tag{8}$$
Formula (7) is more general than the formula derived by Whittemore,$^{1}$ who assumed that P1 is
small and therefore $1/(1-P1)$ is negligible. Note that Hsieh$^{2}$ formula (3) implies that one should
not use formula (4) when the odds ratio is large (say ≥ 3).
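The step from formula (6) to formula (7) can be written out explicitly. The short derivation below is an added illustration that uses only the definitions given in this appendix (the sample size ratio k and the standardized effect size beta* = Delta/sigma).

```latex
\[
k = \frac{1-P1}{P1}
\;\Longrightarrow\;
k + 1 = \frac{1}{P1},
\qquad
\frac{(k+1)^2}{k} = \frac{1}{P1^{2}} \cdot \frac{P1}{1-P1} = \frac{1}{P1\,(1-P1)} ,
\]
\[
n = \frac{\sigma^2 (Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2} \cdot \frac{(k+1)^2}{k}
  = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{P1\,(1-P1)\,(\Delta/\sigma)^2}
  = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{P1\,(1-P1)\,\beta^{*2}} .
\]
```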
When the covariate is a binary variable, say $X = 0$ or 1, the log odds value $\beta_1 = 0$ if and only if
the two event rates are equal. We can calculate the total sample size from the formula for
comparing the two independent event rates (see Rosner$^{10}$):
$$n = (1+k)\{Z_{1-\alpha/2}[P(1-P)(k+1)/k]^{1/2} + Z_{1-\beta}[P1(1-P1) + P2(1-P2)/k]^{1/2}\}^2/(P1-P2)^2 \tag{9}$$
where $k = B/(1-B)$ is the sample size ratio; $B$ is the proportion of the sample with $X = 1$;
$P = (1-B)P1 + B\,P2$ is the overall event rate; P1 and P2 are the event rates at $X = 0$ and $X = 1$,
under the alternative hypothesis, respectively. By replacing $k$ by $B/(1-B)$, formula (9) becomes
$$n = \{Z_{1-\alpha/2}[P(1-P)/B]^{1/2} + Z_{1-\beta}[P1(1-P1) + P2(1-P2)(1-B)/B]^{1/2}\}^2/[(P1-P2)^2(1-B)]. \tag{10}$$
For a balanced design, $k = 1$ or $B = 0.5$, formula (10) is bounded by
$$n < 4P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2/(P1-P2)^2. \tag{11}$$
For an unbalanced design, similar to (6), we inflate formula (11) by a factor of $1/[4B(1-B)]$ to
obtain a simple approximation:
$$n = P(1-P)(Z_{1-\alpha/2} + Z_{1-\beta})^2/[B(1-B)(P1-P2)^2]. \tag{12}$$
In a recent publication, Liu and Liang$^{11}$ extended Self and Mauritsen’s method for correlated
observations. As a special case, they provided a closed form for a logistic regression model with
one binary covariate. Their closed form, without the adjustment of the design effect for correlated
observations, is very similar to (12):
$$n = (Z_{1-\alpha/2} + Z_{1-\beta})^2 [B\,P1(1-P1) + (1-B)P2(1-P2)]/[B(1-B)(P1-P2)^2]. \tag{13}$$
Examples and comparisons of these formulae are provided in Table I.
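The two simple forms (12) and (13) differ only in the variance term of the numerator, which the following sketch makes visible. The function names `n_formula12` and `n_formula13` are hypothetical; rounding up with `ceil` is an assumption that happens to match the unbalanced rows of Table I.

```python
from math import ceil
from statistics import NormalDist


def _z_sum_squared(alpha, power):
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


def n_formula12(p1, p2, b, alpha=0.05, power=0.95):
    """Formula (12): pooled variance P(1 - P) of the overall event rate."""
    p = (1 - b) * p1 + b * p2
    return p * (1 - p) * _z_sum_squared(alpha, power) / (b * (1 - b) * (p1 - p2) ** 2)


def n_formula13(p1, p2, b, alpha=0.05, power=0.95):
    """Formula (13): weighted sum of the two group variances."""
    var_term = b * p1 * (1 - p1) + (1 - b) * p2 * (1 - p2)
    return var_term * _z_sum_squared(alpha, power) / (b * (1 - b) * (p1 - p2) ** 2)


if __name__ == "__main__":
    for b in (0.2, 0.8):  # low and high prevalence designs of Table I
        print(b, ceil(n_formula12(0.05, 0.1, b)), ceil(n_formula13(0.05, 0.1, b)))
```

For P1 = 0.05 and P2 = 0.1 this gives 1833 and 2648 at B = 0.2, and 2661 and 1820 at B = 0.8, the values shown for formulae (12) and (13) in Table I. The opposite ordering of the two formulae across the two designs is the inconsistency discussed in the comparison section.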
APPENDIX II
Let $\mathrm{var}_p(b_1)$ and $\mathrm{var}_1(b_1)$ equal the variances of the parameter estimate obtained from multiple
linear regression models with $p$ and 1 covariates, respectively. We show that, most often, the ratio
$\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$ is bounded by $1/(1-\rho^2_{1.23\ldots p})$. In addition, $\mathrm{var}_p(b_1)/\mathrm{var}_q(b_1)$ is bounded by
$1/(1-\rho^2_{(1\,q+1\ldots p)\cdot(23\ldots q)})$, where the partial correlation coefficient $\rho_{(1\,q+1\ldots p)\cdot(23\ldots q)}$ measures the
linear association between covariates $X_1$ and $X_{q+1}, \ldots, X_p$ when the values of covariates
$X_2, \ldots, X_q$ are held fixed.
We begin with one covariate in a linear regression model $Y = \beta_0 + \beta_1 X_1 + \varepsilon$, where the error
term $\varepsilon$ is distributed as Normal$(0, \sigma_1^2)$ and, for simplicity, the sample mean of $X_1$ is 0. The
variance of the least squares estimate $b_1$ is known to equal
$$\mathrm{var}_1(b_1) = \sigma_1^2/\sum X_1^2.$$
When there are two covariates $X_1$ and $X_2$ with sample means 0, the variance-covariance matrix of
the estimates of the parameters is
$$\mathrm{var}_2(b_1, b_2) = \sigma_2^2 (X'X)^{-1} = \sigma_2^2 \begin{bmatrix} \sum X_1^2 & \sum X_1X_2 \\ \sum X_1X_2 & \sum X_2^2 \end{bmatrix}^{-1}$$
where $X$ is the matrix of covariates. Through the inverse of the $2\times 2$ $X'X$ matrix, we can obtain the
variance of $b_1$ as
$$\mathrm{var}_2(b_1) = \sigma_2^2 \sum X_2^2 \Big/ \Bigl(\sum X_1^2 \sum X_2^2 - (\sum X_1X_2)^2\Bigr) = (\sigma_2^2/\sigma_1^2)\,\mathrm{var}_1(b_1)/(1-\rho_{12}^2).$$
The value of $\sigma_2^2/\sigma_1^2$, in most cases, is less than 1 and close to 1. Since the additional covariate in the
model also takes away a degree of freedom from the error term, the estimate of the variance ratio
$\sigma_2^2/\sigma_1^2$ may sometimes slightly exceed 1. The squared multiple correlation coefficient, in this case
the same as the simple correlation coefficient, is $\rho_{12}^2 = (\sum X_1X_2)^2/(\sum X_1^2 \sum X_2^2)$.
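The two-covariate identity can be checked numerically. The sketch below fixes a common error variance of 1 in both models, so the ratio of the two variances should equal 1/(1 - rho_12^2) exactly; the data values and the function names are arbitrary assumptions chosen for illustration.

```python
def centered(values):
    """Subtract the sample mean so that the covariate has mean 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def variance_ratio_two_covariates(x1, x2):
    """Return (var_2(b1) / var_1(b1), 1 / (1 - rho_12^2)) for error variance 1."""
    x1, x2 = centered(x1), centered(x2)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    var1 = 1.0 / s11                     # sigma^2 / sum(X1^2) with sigma^2 = 1
    var2 = s22 / (s11 * s22 - s12 ** 2)  # (1,1) element of the inverse of X'X
    rho_sq = s12 ** 2 / (s11 * s22)      # squared correlation of X1 and X2
    return var2 / var1, 1.0 / (1.0 - rho_sq)


ratio, vif = variance_ratio_two_covariates([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 8])
print(ratio, vif)  # the two numbers agree up to floating-point error
```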
When there are three covariates, the multiple correlation coefficient $\rho_{1.23}$ can be obtained from
the matrix operation
$$\rho^2_{1.23} = \begin{bmatrix} \sum X_1X_2 & \sum X_1X_3 \end{bmatrix} \begin{bmatrix} \sum X_2^2 & \sum X_2X_3 \\ \sum X_2X_3 & \sum X_3^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum X_1X_2 \\ \sum X_1X_3 \end{bmatrix} \Big/ \sum X_1^2$$
$$= \Bigl[\sum X_2^2 (\sum X_1X_3)^2 + \sum X_3^2 (\sum X_1X_2)^2 - 2\sum X_1X_2 \sum X_2X_3 \sum X_1X_3\Bigr] \Big/ \Bigl\{\sum X_1^2 \bigl[\sum X_2^2 \sum X_3^2 - (\sum X_2X_3)^2\bigr]\Bigr\}.$$
With three covariates in the regression model, the variance-covariance matrix of the estimates of
the parameters can be obtained from the inverse of the $3\times 3$ $X'X$ matrix through the formula
$\mathrm{var}_3(b_1, b_2, b_3) = \sigma_3^2 (X'X)^{-1}$. Therefore
$$\mathrm{var}_3(b_1) = \sigma_3^2 \bigl[\sum X_2^2 \sum X_3^2 - (\sum X_2X_3)^2\bigr] \Big/ \bigl[\sum X_1^2 \sum X_2^2 \sum X_3^2 + 2\sum X_1X_2 \sum X_2X_3 \sum X_1X_3 - \sum X_1^2 (\sum X_2X_3)^2 - \sum X_2^2 (\sum X_1X_3)^2 - \sum X_3^2 (\sum X_1X_2)^2\bigr]$$
$$= (\sigma_3^2/\sigma_1^2)\,\mathrm{var}_1(b_1)/(1-\rho^2_{1.23}).$$
Usually, $\sigma_3^2/\sigma_1^2 \leq 1$ and $\mathrm{var}_3(b_1) \leq \mathrm{var}_1(b_1)/(1-\rho^2_{1.23})$. In a linear regression model with $p$ parameters,
$\mathrm{var}_p(b_1, b_2, \ldots, b_p) = \sigma_p^2 (X'X)^{-1} = \sigma_p^2 \boldsymbol{\Sigma}$. By applying a result of Anderson$^{12}$ (equation 20)
that $\boldsymbol{\Sigma}_{11}^{-1} = \sum X_1^2 (1-\rho^2_{1.23\ldots p})$, we obtain
$$\mathrm{var}_p(b_1) = \sigma_p^2 \boldsymbol{\Sigma}_{11} = \sigma_p^2 \Big/ \Bigl[\sum X_1^2 (1-\rho^2_{1.23\ldots p})\Bigr] = (\sigma_p^2/\sigma_1^2)\,\mathrm{var}_1(b_1)/(1-\rho^2_{1.23\ldots p}).$$
Again, in most situations, $\sigma_p^2/\sigma_1^2 \leq 1$ and $\mathrm{var}_p(b_1) \leq \mathrm{var}_1(b_1)/(1-\rho^2_{1.23\ldots p})$. Then, the
VIF $= 1/(1-\rho^2_{1.23\ldots p})$ is the approximate upper bound of the ratio $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$. The upper
bound does not hold in the rare situation when $\sigma_p^2/\sigma_1^2 > 1$ but the approximation is still good
enough. When $p$ is not too large, the bound is tight; when $p$ is large and $\rho_{1.23\ldots p}$ is near 1, the
bound is inaccurate. A similar result holds for nested models. We would like to expand the model
from the situation of $q$ covariates to $p$ covariates where $p > q$. Then, reasoning as above,
$$\mathrm{var}_p(b_1)/\mathrm{var}_q(b_1) = (\sigma_p^2/\sigma_q^2)(1-\rho^2_{1.23\ldots q})/(1-\rho^2_{1.23\ldots p}) \leq (1-\rho^2_{1.23\ldots q})/(1-\rho^2_{1.23\ldots p}) = 1/(1-\rho^2_{(1\,q+1\ldots p)\cdot(23\ldots q)}),$$
where the partial correlation coefficient $\rho_{(1\,q+1\ldots p)\cdot(23\ldots q)}$ measures the linear association between
covariates $X_1$ and $X_{q+1}, \ldots, X_p$ when the values of covariates $X_2, \ldots, X_q$ are held fixed. The
value of the ratio $\sigma_p^2/\sigma_q^2$ should be closer to 1 than $\sigma_p^2/\sigma_1^2$.
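The final equality in the nested-model bound relies on the usual factorization of the unexplained variance of X1 when further covariates are added. It is stated here as a reminder in the notation of this appendix; the paper itself does not spell out this step.

```latex
\[
1-\rho^2_{1.23\ldots p}
  = \bigl(1-\rho^2_{1.23\ldots q}\bigr)\,
    \bigl(1-\rho^2_{(1\,q+1\ldots p)\cdot(23\ldots q)}\bigr)
\quad\Longrightarrow\quad
\frac{1-\rho^2_{1.23\ldots q}}{1-\rho^2_{1.23\ldots p}}
  = \frac{1}{1-\rho^2_{(1\,q+1\ldots p)\cdot(23\ldots q)}} .
\]
```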
APPENDIX III
We use simulations to investigate, in a multiple logistic regression model with $p$ independent
binary covariates, how well the ratio of the maximum likelihood estimates of the variances
$\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$ is approximated by $1/(1-\rho^2_{1.23\ldots p})$, where the multiple correlation coefficient
relating binary covariates $X_1$ with $X_2, \ldots, X_p$ has the same formula as continuous covariates
with a normal distribution:
$$\rho^2_{1.23\ldots p} = \begin{bmatrix} \sum X_1X_2 & \sum X_1X_3 & \cdots & \sum X_1X_p \end{bmatrix} \begin{bmatrix} \sum X_2^2 & \sum X_2X_3 & \cdots & \sum X_2X_p \\ \sum X_2X_3 & \sum X_3^2 & \cdots & \sum X_3X_p \\ \vdots & \vdots & \ddots & \vdots \\ \sum X_2X_p & \sum X_3X_p & \cdots & \sum X_p^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum X_1X_2 \\ \sum X_1X_3 \\ \vdots \\ \sum X_1X_p \end{bmatrix} \Big/ \sum X_1^2.$$
The 80 computer simulations each use a sample size of 1000 with eight binary covariates. When
all eight covariates are generated independently, the estimated values of $\rho^2_{1.23\ldots 7}$ are near zero. In
order that the response variable $Y$ and the covariates $X$’s be somewhat correlated, and the
estimates of $\rho^2_{1.23\ldots 7}$ have a broad range of values, say from 0 to 0.7, the generation of the eight
covariates requires some special care.
Figure 1. Results of 80 simulations: estimates of $\mathrm{var}_8(b_1)/\mathrm{var}_1(b_1)$ versus $1/(1-\rho^2_{1.23\ldots 8})$
Let $U, V_1, V_2, \ldots,$ and $V_8$ be uniform random variates obtained from a generator in SAS.$^{13}$
The response variable $Y$ is Bernoulli with a parameter value 0.5. The eight covariates $X_1, X_2, \ldots,$
and $X_8$ are also Bernoulli with parameters $B_1, B_2, \ldots,$ and $B_8$ which have values 0.5, 0.6, 0.65, 0.7,
0.75, 0.8, 0.85 and 0.9, respectively. The response variable $Y$ is generated such that $Y = 1$ when
$U > 0.5$ and $Y = 0$, otherwise. In the first simulation, the covariates are generated such that
$X_i = 1$ when $0.1U + 0.9V_i > B_i$ and $X_i = 0$, otherwise, for $i = 1, 2, \ldots, 8$. The same process was
repeated for the second simulation except for the generation of $X_2$ where the same random value
for $X_1$ was used: $X_2 = 1$ when $0.1U + 0.9V_1 > B_2$ and $X_2 = 0$, otherwise. In the third simulation,
the same random value for $X_1$ was used for $X_3$. A similar process continued until the
completion of the eighth simulation. After finishing the first eight simulations, the whole process
was then repeated ten times to obtain a total of 80 simulations.
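The data-generation step can be sketched in Python as follows. This is not the authors' SAS program: the function names, the seed handling and the `shared` argument are assumptions made for illustration. The text does not say whether covariates that shared the random value of X1 in earlier simulations keep doing so in later ones, so the sketch leaves that choice to the caller.

```python
import random

# Bernoulli parameters B_1, ..., B_8 as listed in the text.
B = [0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]


def generate_row(rng, shared=()):
    """Generate one observation (y, [x1, ..., x8]).

    shared -- 0-based indices of covariates that reuse V_1 (the uniform
              variate of X1) instead of their own V_i.
    """
    u = rng.random()
    v = [rng.random() for _ in B]
    y = 1 if u > 0.5 else 0
    x = []
    for i, b_i in enumerate(B):
        v_i = v[0] if i in shared else v[i]
        x.append(1 if 0.1 * u + 0.9 * v_i > b_i else 0)
    return y, x


def generate_sample(n=1000, shared=(), seed=0):
    """Generate n observations with a reproducible random stream."""
    rng = random.Random(seed)
    return [generate_row(rng, shared) for _ in range(n)]


if __name__ == "__main__":
    first = generate_sample(shared=())     # first simulation: separate V_i
    second = generate_sample(shared=(1,))  # second simulation: X2 reuses V_1
    print(sum(x[0] for _, x in first), sum(x[1] for _, x in second))
```

When X2 reuses V_1, the condition for X2 = 1 is stricter than the one for X1 = 1 because B_2 exceeds B_1, so X2 = 1 implies X1 = 1 and the two covariates become strongly correlated, which is what widens the range of the estimated squared multiple correlation.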
In practice, the estimated values of $\mathrm{var}_p(b_1)$ and $\rho^2_{1.23\ldots p}$ (same as $R^2$) can be obtained from SAS
PROC LOGISTIC and PROC REG,¹³ respectively. The estimates of $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$ versus
$1/(1-\rho^2_{1.23\ldots p})$ from the simulations are plotted in Figure 1. The simulation results show that, for
binary covariates, the estimates of $1/(1-\rho^2_{1.23\ldots p})$ closely approximate the value of the estimates
of the ratio $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$. Figure 1 shows that the estimates of $1/(1-\rho^2_{1.23\ldots p})$ very slightly
underestimate the variance ratio $\mathrm{var}_p(b_1)/\mathrm{var}_1(b_1)$.
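As a rough illustration of the quantity on the horizontal axis of Figure 1, the sketch below estimates $\rho^2_{1.23\ldots p}$ as the $R^2$ from an ordinary least-squares regression of $X_1$ on the remaining covariates and converts it to the variance inflation factor $1/(1-R^2)$. It uses NumPy in place of SAS PROC REG; the function names are hypothetical, and the variance ratio itself would still have to come from fitted logistic models, which this sketch does not attempt.

```python
import numpy as np

def r_squared_x1(X):
    """R^2 from regressing column 0 of X on the other columns (with intercept)."""
    x1 = X[:, 0].astype(float)
    design = np.column_stack([np.ones(len(X)), X[:, 1:]])
    coef, *_ = np.linalg.lstsq(design, x1, rcond=None)
    resid = x1 - design @ coef
    ss_res = resid @ resid                      # residual sum of squares
    ss_tot = ((x1 - x1.mean()) ** 2).sum()      # total sum of squares
    return 1.0 - ss_res / ss_tot

def variance_inflation(X):
    """Variance inflation factor 1 / (1 - R^2) for the first covariate."""
    return 1.0 / (1.0 - r_squared_x1(X))

if __name__ == "__main__":
    rng = np.random.default_rng(2024)           # arbitrary seed
    # Illustrative binary covariates: X_2 shares a latent variate with X_1.
    v = rng.uniform(size=(1000, 2))
    X = np.column_stack([v[:, 0] > 0.5, v[:, 0] > 0.6, v[:, 1] > 0.65]).astype(int)
    print(r_squared_x1(X), variance_inflation(X))
```

When $X_1$ is uncorrelated with the other covariates, $R^2$ is near zero and the factor is near 1, so no inflation of the simple-regression sample size is needed.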
ACKNOWLEDGEMENTS
The authors thank Drs. Philip Lavori, Kelvin Lee, a referee and the editor for valuable comments
and editorial suggestions which strengthened the content. This work was supported in part by the
DVA Cooperative Studies Program of the Veteran Health Administration, NIH grant AR20610
(Multipurpose Arthritis Center) and Y01-DA-40032-0 (National Institute on Drug Abuse), the
latter to the VA Cooperative Studies Program.
REFERENCES
1. Whittemore, A. ‘Sample size for logistic regression with small response probability’, Journal of the
American Statistical Association, 76, 27–32 (1981).
2. Hsieh, F. Y. ‘Sample size tables for logistic regression’, Statistics in Medicine, 8, 795–802 (1989).
3. Self, S. G. and Mauritsen, R. H. ‘Power/sample size calculations for generalized linear models’,
Biometrics, 44(1), 79–86 (1988).
4. Hosmer, D. W. and Lemeshow, S. Applied Logistic Regression, Wiley, New York, 1989, p. 56.
5. Sokal, R. R. and Rohlf, F. J. Biometry, W. H. Freeman and Company, New York, 1995, p. 578.
6. Elashoff, J. nQuery Advisor Sample Size and Power Determination, Statistical Solutions Ltd., Boston,
MA, 1996.
7. Hsieh, F. ‘SSIZE: A sample size program for clinical and epidemiologic studies’, American Statistician,
45, 338 (1991).
8. SERC. EGRET SIZ sample size and power for nonlinear regression models, Statistics and Epidemiology
Research Corp., Seattle, WA, 1992.
9. Keane, T. M., Kolb, L. C. and Thomas, R. G. ‘A Psychophysiological Study of Chronic Post-Traumatic
Stress Disorder’, Cooperative Study No. 334, Cooperative Studies Program Coordinating Center, VA
Medical Center, Palo Alto, California, U.S.A., 1988.
10. Rosner, B. Fundamentals of Biostatistics, 4th edn, PWS-KENT Publishing Company, 1995, pp. 283 and
384.
11. Liu, G. and Liang, K. Y. ‘Sample size calculation for studies with correlated observations’, Biometrics,
53, 537–547 (1997).
12. Anderson, T. W. An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958, p. 32.
13. SAS Institute Inc. SAS/STAT User's Guide, Version 6 (Vol. 1 and 2), Cary, NC, 1990.