American Journal of Epidemiology
Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 145, No. 8
Printed in U.S.A.
Flexible Modeling of the Effects of Serum Cholesterol on Coronary Heart
Disease Mortality
Michal Abrahamowicz,1,2 Roxane du Berger,2 and Steven A. Grover1,3
Current understanding of the impact of lipids and other risk factors on coronary heart disease is largely
based on the results of parametric multiple regression analyses of large prospective studies. To assess the
potential impact of the a priori assumption of linearity of continuous risk factors on the results of parametric
analyses, the authors completed a secondary analysis of the Lipid Research Clinics Prevalence and Follow-up
Studies (1972-1987) data using an assumption-free nonparametric modeling approach. The effects of total
serum cholesterol and the ratio of total serum cholesterol to high density lipoprotein cholesterol, adjusted for
common risk factors, were estimated using a smoothing spline method available in the generalized additive
model extension of the multiple logistic regression. The data set included 2,512 men in the random sample of
the Lipid Research Clinics study who did not take lipid-lowering medications. During the median follow-up of
12.6 years, 94 coronary heart disease deaths occurred. The generalized additive model fits the effects of total
serum cholesterol (p < 0.01) and the ratio of total serum cholesterol to high density lipoprotein cholesterol
(p < 0.02) significantly better than the parametric logistic regression. Validation studies confirmed that, among
new observations arising from the same population, generalized additive model estimates predicted outcomes
better than the parametric estimates. Nonlinear effects of both lipid measures were robust and may be
clinically important. The authors conclude that the linearity assumption inherent in parametric models may
result in biased estimates of the effects of total serum cholesterol on coronary heart disease mortality and
recommend that their findings be verified in a nonparametric analysis of data from another large prospective
study. Am J Epidemiol 1997;145:714-29.
coronary disease; lipids; models, statistical; risk factors
Received for publication November 8, 1995, and in final form January 21, 1997.
Abbreviations: AIC, Akaike information criterion; BIC, Bayesian information criterion; BMI, body mass
index; CHD, coronary heart disease; CI, confidence interval; GAM, generalized additive model; HDL
cholesterol, high density lipoprotein cholesterol; LRC, Lipid Research Clinics; TC, total serum cholesterol.
1 Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada.
2 Division of Clinical Epidemiology, Department of Medicine, Montreal General Hospital, Montreal, Quebec,
Canada.
3 Department of Medicine, McGill University, Montreal, Quebec, Canada.
Reprint requests to Dr. Michal Abrahamowicz, Division of Clinical Epidemiology, Montreal General Hospital,
1650 Cedar Avenue, Montreal, Quebec, Canada H3G 1A4.
Coronary heart disease (CHD) is a major cause of mortality and morbidity in Western societies.
Accordingly, designing effective strategies for preventing CHD is among the highest priorities of public
health officials. Serum cholesterol is recognized as a major CHD risk factor (1). Several interventions
and clinical practice guidelines have been developed to reduce the incidence of CHD by changing the
levels of modifiable risk factors, and of serum cholesterol in particular (2-4). In the era of shrinking
resources, the assessment of effectiveness and cost-effectiveness of the proposed prevention strategies
has also received considerable attention (5-8).
Assessing the effect of an intervention based on risk factor modification involves comparing risks
associated with different levels of the risk factor. Current perception of the effects of particular CHD
risk factors relies strongly on large prospective epidemiologic studies such as the Framingham Heart
Study (9) or the Multiple Risk Factor Intervention Trial (10). In these studies, the outcomes of
participants with different baseline values of risk factors are analyzed to estimate the relative and/or
absolute risks associated with particular values. Statistical methods used to analyze results of
prospective studies of CHD mortality/morbidity have evolved over time and have tracked, with some
understandable delay, advancements in theoretical and applied statistics and the availability of
software. While early analyses relied on univariate statistical methods, more recent ones typically use
multiple logistic regression modeling to control for potential
confounders and to estimate independent (adjusted)
effects of particular risk factors (9).
The multiple logistic regression model belongs to the broad family of parametric generalized linear
models that rely on the assumption that the effects of continuous predictors are linear (linearity
assumption). The linearity assumption simplifies both model estimation and interpretation of results
since it allows for summarizing the effects of a continuous predictor by a single parameter (logarithm
of odds ratio in the logistic model). However, the linearity assumption may be too simple to represent
the effects of some risk factors correctly. Linearity dictates that the estimated change in the logit of
risk due to changing the predictor value by a given amount is constant over the entire range of these
values. For example, the linearity assumption requires that the estimated effect of lowering total serum
cholesterol (TC) from 300 to 250 mg/dl on the logit of a CHD death is exactly the same as the effect of
reducing TC from 200 to 150 mg/dl. It should be noted that this assumption is inconsistent with those
analyses of CHD risks in which continuous risk factors have been categorized, implying a threshold
effect (11). More generally, it is possible that the risk increase may be steeper over, for example, the
range of high TC values than over the range of moderate "normal" values. In such cases, the parametric
logistic model will be incorrect and will result in biased estimates of relative and absolute risks.
More specifically, if the linearity assumption is incorrect for a given risk factor, the parametric
logistic regression estimate will underestimate its effect over some range of values and overestimate
the effect over some other range.
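The constant-increment property implied by linearity can be written out explicitly. The following is a sketch in notation assumed here for illustration only (the symbols \beta_0, \beta_{TC}, \beta_j, and x_j are not taken from the original text); it shows that any 50 mg/dl reduction in TC shifts the logit by the same amount, regardless of the starting value:

```latex
\begin{align*}
\operatorname{logit} P(\text{CHD death}) &= \beta_0 + \beta_{TC}\,\mathrm{TC} + \sum_j \beta_j x_j \\
\Delta_{300 \to 250} &= \beta_{TC}\,(250 - 300) = -50\,\beta_{TC} \\
\Delta_{200 \to 150} &= \beta_{TC}\,(150 - 200) = -50\,\beta_{TC} = \Delta_{300 \to 250}
\end{align*}
```

Under a flexible model, the term \beta_{TC}\,\mathrm{TC} would be replaced by a smooth function of TC, so the two differences need not coincide.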
In the last decade, a number of flexible nonparametric extensions of the conventional logistic regression
model have been proposed in the statistical literature (12-15). These nonparametric regression methods
eliminate the restrictive linearity assumption and allow greater flexibility in modeling the data so that
the estimated effects of continuous risk factors may follow an arbitrary continuous smooth function.
Accordingly, the risk of bias is largely reduced as the estimates depend more on empirical data and less
on a priori assumptions. There are several reasons why flexible modeling of cholesterol effects is of
interest. First, the impact of CHD on mortality and morbidity makes it mandatory to provide as accurate
as possible estimates of the effects of particular risk factors, even if it requires additional analytic
and computational efforts. If statistically significant and clinically important departures from the
linearity assumption are found, they may suggest the need to revise the current perception of the
effectiveness and cost-effectiveness of various interventions aimed at the modification of risk factors.
On the other hand, if flexible nonparametric models yield estimates of the cholesterol effect that are
very similar to those obtained from the parametric multiple logistic model, such results will enhance the
scientific value of findings based on the latter model by providing a posteriori empiric validation of
the underlying a priori assumptions.
In this study, we use a nonparametric regression approach to reassess the effects of selected continuous
risk factors on the risk of CHD mortality, with particular focus on two lipid measures: TC and the ratio
of TC to high density lipoprotein cholesterol (TC/HDL cholesterol).
MATERIALS AND METHODS
Data source
We completed a secondary analysis of the public use data provided by the Lipid Research Clinics (LRC)
Program Prevalence and Follow-up Studies. Between 1972 and 1976, 10 North American clinics participated
in the Prevalence Study aimed at estimating the prevalence of dyslipoproteinemias and related factors
(16-18). At the first visit, 60,502 individuals out of 81,926 initially contacted completed a
questionnaire and had their blood tested. A 15 percent "random sample" of the visit 1 participants as
well as all individuals found to have abnormal lipid levels at visit 1 ("nonrandom sample") were then
invited to return for visit 2, which took place on average 3 months later. The nonrandom sample included
all participants taking lipid-lowering medication and those with TC and/or triglyceride values exceeding
the respective age- and sex-specific cutoffs (19). At visit 2, a complete medical history was taken, a
physical examination was carried out, and TC and its fractions were measured. (Details of laboratory
procedures can be found in references 16 and 18.) Participants were then followed prospectively until
death or the end of the LRC Follow-up Study in June 1987 (20). Participants or a reliable source were
contacted to determine participants' vital status. Causes of mortality were established based on death
certificates and hospital records (20).
Our main analyses focus on male participants in the random sample. Individuals who were on lipid-lowering
medications at visit 2 were eliminated from the data set. Following the argument presented by Benfante
et al. (21), their risks may be determined more by their (unknown) previous TC level than by the current
level, likely to be reduced recently due to medication. The resulting data set includes 2,512 individuals
with the median follow-up time of 12.6 years (interquartile range, 1.2 years). Ninety-four CHD deaths
occurred in this data set.
In some descriptive analyses and for the purpose of assessing the generalizability of our results, we use
data on males in the nonrandom sample. After the exclusion of individuals on lipid-lowering medications
as well as of an obvious outlier with the TC value above 1,000 mg/dl, there were 1,992 men in the
nonrandom sample, among whom 102 died of CHD.
Statistical analyses
Simple univariate statistics, t tests for continuous variables, and chi-square tests for binary variables
were used to compare random and nonrandom samples. A significance level of α = 0.05 was used for all
tests.
Main analyses relied on multivariable binary regression models with the outcome defined as a CHD death.
Each multiple regression model included one of the two cholesterol measures, either TC or TC/HDL
cholesterol, as well as a set of a priori selected risk factors: age, systolic blood pressure, body mass
index (BMI), glucose intolerance, current smoking status, history of CHD, and the use of antihypertensive
medication. All risk factors were represented by the values determined at visit 2. In the parametric
multiple logistic regression model, the effect of each continuous risk factor on the logit of a CHD death
was represented by a linear function.
Nonparametric modeling and testing of linearity. Generalized additive models (GAM) methodology (14, 15)
was used for nonparametric modeling of the TC effects, adjusted for other risk factors. Cubic smoothing
splines were used to estimate flexible functions that describe the dependence of the logit of a CHD death
on the TC value. To test the significance of the nonlinear effects, we used the likelihood ratio test
with the approximate chi-square statistic based on the effective number of degrees of freedom,
corresponding to the trace of the smoother matrix (15). A significant result of this test was interpreted
as evidence that the nonparametric smoothing spline estimate fits the data significantly better than the
linear function estimated using the parametric logistic regression model. Simulations reported by Hastie
and Tibshirani (15) show that, under the null hypothesis of a linear effect, the empiric distribution of
the proposed likelihood ratio statistic in its upper tail is very similar to the theoretical chi-square
distribution with the appropriate number of degrees of freedom. This means that the test of the
nonparametric effect is reliable at the conventional significance levels α = 0.05 or α = 0.01. A
parsimonious 2-df smoothing spline model was chosen a priori for the purpose of testing the significance
of nonparametric effects. According to our expectations, this model should provide sufficient flexibility
to ensure that the estimation of the risk function would be reasonably free of bias, while protecting
against the risk of overfitting the data that is present with more complex models (22). Basing
significance testing on an a priori selected model also avoids the problem of inflation of type I error
rate that occurs if the model is selected a posteriori (23, 24).
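The likelihood ratio comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the log-likelihood values and the degrees of freedom below are hypothetical placeholders, the function names `chi2_sf` and `lr_test` are introduced here, and the reference distribution is taken to be chi-square with degrees of freedom equal to the difference in effective degrees of freedom between the spline and linear fits (which may be non-integer).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable; df may be non-integer."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def lr_test(loglik_linear, loglik_spline, df_linear, df_spline):
    """Likelihood ratio statistic, its degrees of freedom, and the approximate p value."""
    stat = 2.0 * (loglik_spline - loglik_linear)
    df_diff = df_spline - df_linear
    return stat, df_diff, chi2_sf(stat, df_diff)


# Hypothetical log-likelihoods (illustration only, not results from the study):
# a linear TC term (1 df) versus a 2-df smoothing spline for TC.
stat, df_diff, p_value = lr_test(
    loglik_linear=-350.0, loglik_spline=-346.5, df_linear=1.0, df_spline=2.0
)
print(f"LR statistic = {stat:.2f}, df = {df_diff:.1f}, p = {p_value:.4f}")
```

Implementing the chi-square tail by hand keeps the sketch free of third-party dependencies; in practice a statistics library would normally supply this function.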
In the above analyses, we rely on the GAM generalization of the binary regression model, which allows us
to estimate absolute risks (probabilities of CHD death associated with different risk profiles; see
Clinical relevance of nonlinear effects), in addition to relative risks. However, to evaluate the
robustness of our conclusions about the shape of the estimated risk function and about the significance
of nonparametric effects, our main analyses described in this subsection were replicated using the GAM
generalization of the Cox proportional hazards model (15). As in binary regression analyses, the effect
of TC or TC/HDL cholesterol was represented by a 2-df smoothing spline and was adjusted for the same risk
factors. Given the very high censoring level (above 95 percent) and little variation in the duration of
follow-up in our data, we a priori expect that the results of the proportional hazards analyses will be
very similar to those of the binary regression. Therefore, all further analyses, described in the
following subsections, are limited to the binary regression approach.
Model selection. A posteriori model selection criteria may be useful for descriptive purposes since they
allow us to determine which model provides an optimal trade-off between the fit to data and model
parsimony. Therefore, we estimated GAM models with different degrees of freedom and used the Akaike
Information Criterion (AIC) (25) as well as the Bayesian Information Criterion (BIC) (26) to find the
best fitting model, corresponding to the minimum of a given criterion value. While AIC is more commonly
used, the BIC criterion has the advantage of accounting for the sample size and has been found to perform
better in other nonparametric analyses (22, 23).
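The two criteria can be illustrated with a short Python sketch. It uses the standard definitions AIC = -2 log L + 2 df and BIC = -2 log L + df ln(n); the log-likelihood values for the three candidate models are hypothetical placeholders, and treating `df` as the degrees of freedom of the spline term alone is an assumption made here for simplicity. Only the sample size (2,512) comes from the text.

```python
import math


def aic(loglik, df):
    """Akaike Information Criterion: penalizes each degree of freedom by 2."""
    return -2.0 * loglik + 2.0 * df


def bic(loglik, df, n):
    """Bayesian Information Criterion: the penalty grows with the log of the sample size."""
    return -2.0 * loglik + df * math.log(n)


n = 2512  # men in the random sample
# Hypothetical log-likelihoods for spline models with 2, 3, and 4 df (illustration only).
candidates = {2: -340.0, 3: -339.2, 4: -338.9}

aic_values = {df: aic(ll, df) for df, ll in candidates.items()}
bic_values = {df: bic(ll, df, n) for df, ll in candidates.items()}

# The preferred model minimizes the criterion.
best_aic = min(aic_values, key=aic_values.get)
best_bic = min(bic_values, key=bic_values.get)
print("AIC:", aic_values, "-> best df:", best_aic)
print("BIC:", bic_values, "-> best df:", best_bic)
```

Because ln(2512) is roughly 7.8, each additional degree of freedom costs almost four times as much under BIC as under AIC, which is why BIC tends to favor the more parsimonious model in samples of this size.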
In the preliminary GAM analyses, the nonparametric modeling was limited to the cholesterol effects. Final
analyses were based on a more complex GAM model in which the effect of each continuous risk factor (age,
systolic pressure, BMI, and TC or TC/HDL cholesterol) was represented by a 2-df smoothing spline. We also
assessed the sensitivity of the estimates with respect to the decision whether to include in the analysis
the participants on antihypertensive medications and/or those with a history of CHD. In additional
exploratory analyses, we used different subsets of the original data set to investigate some post
hoc suggestions as well as to illustrate some properties
of the estimates derived from different models.
Model validation. Results of multivariable modeling of the data on men in the random sample were validated
using two approaches based, respectively, on 1) independent data (nonrandom sample) and 2)
cross-validation of the random sample. In both approaches, we treated the random sample as a "learning
sample" and used the resulting estimated regression coefficients to calculate the estimated probability
of a CHD death for each subject in a "testing sample." Comparison of the estimated probabilities for
individual subjects with the actual outcomes (CHD death or not) for the same subject yielded the
log-likelihood of the data in the testing sample under the model estimated from the learning sample. The
log-likelihood was used as a criterion to assess the goodness of fit of the estimated model to
independent data.
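The validation criterion described above is the Bernoulli log-likelihood of the testing-sample outcomes given the probabilities predicted from the learning sample. A minimal Python sketch follows; the function name `bernoulli_loglik`, the clipping constant `eps`, and the example outcomes and probabilities are assumptions introduced for illustration.

```python
import math


def bernoulli_loglik(outcomes, probs, eps=1e-12):
    """Log-likelihood of binary outcomes (1 = CHD death, 0 = no CHD death)
    under predicted probabilities; values closer to zero indicate better fit."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical testing sample of five subjects with predictions from two models.
outcomes = [0, 0, 1, 0, 1]
probs_model_a = [0.02, 0.05, 0.30, 0.03, 0.20]
probs_model_b = [0.04, 0.04, 0.10, 0.04, 0.10]

ll_a = bernoulli_loglik(outcomes, probs_model_a)
ll_b = bernoulli_loglik(outcomes, probs_model_b)
print(f"model A: {ll_a:.3f}, model B: {ll_b:.3f}")
```

The model with the higher (less negative) log-likelihood assigns more probability to the outcomes that were actually observed in the testing sample.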
The first approach consisted of using the nonrandom sample of men as the testing sample. In this part of
the validation study, we attempted to assess to what extent the estimates obtained with the random sample
of LRC data are generalizable to other similar data sets. Given that at present we do not have access to
detailed data from a different, large, prospective CHD study, we have decided to use the nonrandom sample
of the LRC study for this purpose. However, validation results based on the nonrandom sample should be
interpreted with caution since this sample is composed of individuals selected specifically because they
were perceived to be at a high risk of CHD events (19). Therefore, the individuals in the nonrandom
sample, in addition to having a different distribution of common risk factors recorded in the LRC study,
may differ from the randomly selected individuals with respect to some more subtle clinical
characteristics and/or other unrecorded variables associated with CHD risks.
The purpose of the second part of the validation study was to compare the reproducibility of the
estimates obtained with different models in terms of their ability to predict outcomes in "new cases"
arising from the same population. The approach consisted of a 10-fold cross-validation of the random
sample estimates. The random sample was first randomly divided into 10 subsets of approximately equal
size. Then, a learning sample was constructed as the union of nine subsets, and the tenth subset was used
as a testing sample. The same process of estimation and validation was repeated nine additional times,
each time using a different subset as a testing sample and the nine other subsets as a learning sample.
This procedure ensures that for each individual the outcome is predicted based on the model estimated
independently of the individual's data (i.e., estimated from the nine subsets that do not include this
individual). By summing up the log-likelihood values obtained in each of the 10 subsets, we calculated
the cross-validated log-likelihood of the entire random sample.
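The 10-fold procedure can be sketched as follows. This is an illustrative outline under stated assumptions, not the code used in the study: the helper names are introduced here, and an intercept-only model (predicting the learning-sample death rate for everyone) stands in for the logistic and GAM fits, which would be supplied through the `fit` and `predict` arguments. The counts of 2,512 men and 94 CHD deaths are taken from the text.

```python
import math
import random


def k_fold_indices(n, k=10, seed=0):
    """Randomly split indices 0..n-1 into k subsets of approximately equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_loglik(y, fit, predict, k=10, seed=0):
    """Sum of testing-sample log-likelihoods over the k folds."""
    total = 0.0
    for test in k_fold_indices(len(y), k, seed):
        held_out = set(test)
        learning = [i for i in range(len(y)) if i not in held_out]
        model = fit(learning)  # estimated without the held-out subjects
        for i in test:
            p = min(max(predict(model, i), 1e-12), 1.0 - 1e-12)
            total += y[i] * math.log(p) + (1 - y[i]) * math.log(1.0 - p)
    return total


# Outcome vector with the counts reported for the random sample.
y = [1] * 94 + [0] * (2512 - 94)


def fit_mean(learning):
    """Stand-in model (assumption): the CHD death rate in the learning sample."""
    return sum(y[i] for i in learning) / len(learning)


def predict_mean(model, i):
    """Stand-in prediction: the same probability for every subject."""
    return model


cv_loglik = cross_validated_loglik(y, fit_mean, predict_mean)
print(f"cross-validated log-likelihood: {cv_loglik:.2f}")
```

Competing models would be compared by running the same folds (same `seed`) with different `fit` and `predict` functions and comparing the resulting totals.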
Clinical relevance of nonlinear effects. To assess the clinical importance of the differences between the
parametric and the nonparametric estimates, we compared the results of the logistic regression and GAM
analyses with respect to the estimated effects of some arbitrary changes in the TC level on the
probability of a CHD death for hypothetical patients with prespecified risk profiles.
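Translating a change in TC into a change in the probability of a CHD death requires mapping the logit back to the probability scale. The sketch below does this for a linear-in-logit model; the coefficient `beta_tc` and the two risk-profile offsets are hypothetical values chosen for illustration and are not estimates from the LRC data. Under a GAM, the term `beta_tc * tc` would be replaced by the estimated smooth function of TC.

```python
import math


def expit(x):
    """Inverse logit: converts a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def chd_probability(profile_logit, beta_tc, tc):
    """Probability of CHD death when the logit is profile_logit + beta_tc * tc.

    profile_logit collects the intercept and the contributions of all other
    risk factors for a given hypothetical patient.
    """
    return expit(profile_logit + beta_tc * tc)


beta_tc = 0.008  # hypothetical log odds ratio per 1 mg/dl of TC
profiles = {"lower-risk profile": -6.0, "higher-risk profile": -4.0}  # hypothetical

reductions = {}
for name, profile_logit in profiles.items():
    p_before = chd_probability(profile_logit, beta_tc, 300)
    p_after = chd_probability(profile_logit, beta_tc, 250)
    reductions[name] = p_before - p_after
    print(f"{name}: {p_before:.4f} -> {p_after:.4f} "
          f"(absolute reduction {reductions[name]:.4f})")
```

Although the change in the logit is identical for both profiles, the absolute risk reduction differs, which is why comparisons on the probability scale depend on the prespecified risk profile.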
RESULTS
Baseline characteristics: random versus
nonrandom sample
Table 1 compares the distributions of risk factors among men in random and nonrandom subsets of LRC data.
In most cases, the mean values of continuous risk factors and the prevalence of binary factors in the
nonrandom sample are statistically significantly higher than the corresponding values in the random
sample, except for HDL cholesterol, which is lower, as expected. Although the majority of the differences
are not clinically important, for TC and TC/HDL cholesterol the mean values in the nonrandom sample are
very substantially higher than in the random sample, which reflects the design of the LRC study (19).
Accordingly, the question of homogeneity of the lipid effects across random and nonrandom samples has to
be investigated.
Parametric logistic regression modeling
Table 2 presents the results of the conventional multiple logistic regression analyses of CHD mortality
in the two samples, with the effects of all continuous risk factors assumed to be linear in logit. All
results are from the model that included TC but not TC/HDL cholesterol. The results for TC/HDL
cholesterol are adjusted for all other risk factors except TC. (Results for other risk factors did not
change materially when TC was replaced by TC/HDL cholesterol.) Interestingly, there is no evidence of the
independent effect of BMI on the risk of CHD death in men since in both samples p values are very high
and adjusted odds ratios are very close to 1.0. Post hoc analysis suggested that the unexpected
nonsignificance of the adjusted effect of systolic pressure in the nonrandom sample was likely due to the
near collinearity between systolic pressure and the use of antihypertensive medication in that sample.
This problem did not occur in the random sample, where the adjusted effect of systolic pressure is highly
significant. All other risk factors are significant, and the corresponding odds ratios estimated from
random and nonrandom samples are very similar for most risk factors. However, the estimated effect of
TC/HDL cholesterol is weaker in the nonrandom sample. A one-unit increase in TC/HDL cholesterol
corresponds to the estimated relative risk increase of 33 percent (95 percent confidence interval (CI)
19-48) and 14 percent (95 percent CI 5-23) in the random and nonrandom samples, respectively. Moreover,
the two estimates of the intercept, corresponding to the estimated logit of a CHD death for a specific
covariate pattern (all covariate values equal to zero), are quite different: -10.1 (95 percent CI -12.4
to -7.8) and -7.2 (95 percent CI -9.6 to -4.8) for the random and nonrandom samples, respectively. The
fact that, for a fixed covariate pattern, the estimated risks are substantially higher in the nonrandom
sample indicates that the higher mortality rate (table 2) in this sample is not completely accounted for
by the differences in the distribution of common risk factors. This may suggest that the participants
included in the nonrandom sample have some other high-risk characteristics not taken into consideration
in our analyses.
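The percentage increases reported above can be related to the underlying logistic regression coefficients. The following sketch assumes that the reported "relative risk increase" per unit of TC/HDL cholesterol is the adjusted odds ratio minus one (an interpretation made here, consistent with a logistic model); the function names are introduced for illustration.

```python
import math


def percent_increase_to_log_odds(percent):
    """Log odds ratio per unit implied by a given percentage increase in odds."""
    return math.log(1.0 + percent / 100.0)


def odds_ratio_for_change(beta, delta):
    """Odds ratio for a change of delta units under a linear-in-logit effect."""
    return math.exp(beta * delta)


# Point estimates reported in the text for a one-unit increase in TC/HDL cholesterol.
beta_random = percent_increase_to_log_odds(33)     # random sample
beta_nonrandom = percent_increase_to_log_odds(14)  # nonrandom sample

for label, beta in [("random", beta_random), ("nonrandom", beta_nonrandom)]:
    print(f"{label}: beta = {beta:.3f}, "
          f"OR for +1 unit = {odds_ratio_for_change(beta, 1):.2f}, "
          f"OR for +2 units = {odds_ratio_for_change(beta, 2):.2f}")
```

Under linearity the odds ratio for a two-unit increase is simply the square of the one-unit odds ratio, whatever the starting value of the ratio; this is the restriction that the nonparametric analyses relax.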
Nonparametric effects of lipids
In table 3, we present the results of separate nonparametric multivariable GAM analyses of the adjusted
effects of TC and TC/HDL cholesterol for males in the random sample. The effect of each lipid variable is
adjusted for all other risk factors listed in table 2, and all of the other risk factors are modeled
parametrically. We focus first on the GAM models with 2 df since these parsimonious models were selected
a priori to test the significance of the nonlinear effects of cholesterol. The p values for the
likelihood ratio tests of the nonlinearity of the adjusted effect are below 0.05 for both TC (p < 0.01)
and TC/HDL cholesterol (p < 0.02). This indicates that the 2-df smoothing spline estimates obtained from
the GAM model represent the effects of cholesterol on CHD mortality significantly better than the linear
estimates derived from the conventional logistic regression. The p values for models with higher degrees
of freedom confirm the significance of the nonlinear effects at α = 0.05. However, the p values increase
with the increasing model complexity, suggesting that additional degrees of freedom may not be necessary
to model systematic effects of cholesterol. This is further corroborated by the model selection criteria.
While the differences between the corresponding AIC values assigned to the three models are very small,
the 2-df model is definitely superior with respect to a more adequate sample size-dependent BIC criterion
(table 3).