American Journal of Epidemiology
Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 145, No. 8
Printed in U.S.A.
Flexible Modeling of the Effects of Serum Cholesterol on Coronary Heart
Disease Mortality
Michal Abrahamowicz,1,2 Roxane du Berger,2 and Steven A. Grover1,3
Current understanding of the impact of lipids and other risk factors on coronary heart disease is largely
based on the results of parametric multiple regression analyses of large prospective studies. To assess the
potential impact of the a priori assumption of linearity of continuous risk factors on the results of parametric
analyses, the authors completed a secondary analysis of the Lipid Research Clinics Prevalence and Follow-up
Studies (1972-1987) data using an assumption-free nonparametric modeling approach. The effects of total
serum cholesterol and the ratio of total serum cholesterol to high density lipoprotein cholesterol, adjusted for
common risk factors, were estimated using a smoothing spline method available in the generalized additive
model extension of the multiple logistic regression. The data set included 2,512 men in the random sample of
the Lipid Research Clinics study who did not take lipid-lowering medications. During the median follow-up of
12.6 years, 94 coronary heart disease deaths occurred. The generalized additive model fits the effects of total
serum cholesterol (p < 0.01) and the ratio of total serum cholesterol to high density lipoprotein cholesterol
(p < 0.02) significantly better than the parametric logistic regression. Validation studies confirmed that, among
new observations arising from the same population, generalized additive model estimates predicted outcomes
better than the parametric estimates. Nonlinear effects of both lipid measures were robust and may be
clinically important. The authors conclude that the linearity assumption inherent in parametric models may
result in biased estimates of the effects of total serum cholesterol on coronary heart disease mortality and
recommend that their findings be verified in a nonparametric analysis of data from another large prospective
study. Am J Epidemiol 1997;145:714-29.
coronary disease; lipids; models, statistical; risk factors
Coronary heart disease (CHD) is a major cause of mortality and morbidity in Western societies.
Accordingly, designing effective strategies for preventing CHD is among the highest priorities of
public health officials. Serum cholesterol is recognized as a major CHD risk factor (1). Several
interventions and clinical practice guidelines have been developed to reduce the incidence of CHD by
changing the levels of modifiable risk factors, and of serum cholesterol in particular (2-4). In the
era of shrinking resources, the assessment of effectiveness and cost-effectiveness of the proposed
prevention strategies has also received considerable attention (5-8).
Received for publication November 8, 1995, and in final form January 21, 1997.
Abbreviations: AIC, Akaike information criterion; BIC, Bayesian information criterion; BMI, body mass
index; CHD, coronary heart disease; CI, confidence interval; GAM, generalized additive model; HDL
cholesterol, high density lipoprotein cholesterol; LRC, Lipid Research Clinics; TC, total serum
cholesterol.
1 Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada.
2 Division of Clinical Epidemiology, Department of Medicine, Montreal General Hospital, Montreal,
Quebec, Canada.
3 Department of Medicine, McGill University, Montreal, Quebec, Canada.
Reprint requests to Dr. Michal Abrahamowicz, Division of Clinical Epidemiology, Montreal General
Hospital, 1650 Cedar Avenue, Montreal, Quebec, Canada H3G 1A4.
Assessing the effect of an intervention based on risk factor modification involves comparing risks
associated with different levels of the risk factor. Current perception of the effects of particular
CHD risk factors relies strongly on large prospective epidemiologic studies such as the Framingham
Heart Study (9) or the Multiple Risk Factor Intervention Trial (10). In these studies, the outcomes of
participants with different baseline values of risk factors are analyzed to estimate the relative
and/or absolute risks associated with particular values. Statistical methods used to analyze results
of prospective studies of CHD mortality/morbidity have evolved over time and have tracked, with some
understandable delay, advancements in theoretical and applied statistics and the availability of
software. While early analyses relied on univariate statistical methods, more recent ones typically
use multiple logistic regression modeling to control for potential
confounders and to estimate independent (adjusted)
effects of particular risk factors (9).
The multiple logistic regression model belongs to the broad family of parametric generalized linear
models that rely on the assumption that the effects of continuous predictors are linear (linearity
assumption). The linearity assumption simplifies both model estimation and interpretation of results
since it allows for summarizing the effects of a continuous predictor by a single parameter (logarithm
of odds ratio in the logistic model). However, the linearity assumption may be too simple to represent
the effects of some risk factors correctly. Linearity dictates that the estimated change in the logit
of risk due to changing the predictor value by a given amount is constant over the entire range of
these values. For example, the linearity assumption requires that the estimated effect of lowering
total serum cholesterol (TC) from 300 to 250 mg/dl on the logit of a CHD death is exactly the same as
the effect of reducing TC from 200 to 150 mg/dl. It should be noted that this assumption is
inconsistent with those analyses of CHD risks in which continuous risk factors have been categorized,
implying a threshold effect (11). More generally, it is possible that the risk increase may be steeper
over, for example, the range of high TC values than over the range of moderate "normal" values. In
such cases, the parametric logistic model will be incorrect and will result in biased estimates of
relative and absolute risks. More specifically, if the linearity assumption is incorrect for a given
risk factor, the parametric logistic regression estimate will underestimate its effect over some range
of values and overestimate the effect over some other range.
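The constancy implied by the linearity assumption can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a linear-logit model whose slope is the logarithm of 1.010, the adjusted odds ratio per 1 mg/dl of TC reported for the random sample in table 2, and the function name `logit_change` is introduced here for the example.

```python
import math

# Assumed linear-logit model: logit(p) = b0 + b1 * TC.
# b1 = log(1.010) uses the adjusted odds ratio per 1 mg/dl of TC
# reported for the random sample in table 2; b0 cancels in differences.
B1 = math.log(1.010)

def logit_change(tc_from: float, tc_to: float, b1: float = B1) -> float:
    """Change in the logit of a CHD death when TC moves from tc_from to tc_to."""
    return b1 * (tc_to - tc_from)

high = logit_change(300, 250)  # lowering TC from 300 to 250 mg/dl
low = logit_change(200, 150)   # lowering TC from 200 to 150 mg/dl

# Under linearity both 50 mg/dl reductions shift the logit by the same
# amount, so the implied odds ratio does not depend on the starting level.
print(high == low)
print(math.exp(high))  # odds ratio for a 50 mg/dl reduction
```

A flexible model would allow the two differences to diverge, which is exactly the departure the nonparametric analysis is designed to detect.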
In the last decade, a number of flexible nonparametric extensions of the conventional logistic
regression model have been proposed in the statistical literature (12-15). These nonparametric
regression methods eliminate the restrictive linearity assumption and allow greater flexibility in
modeling the data so that the estimated effects of continuous risk factors may follow an arbitrary
continuous smooth function. Accordingly, the risk of bias is largely reduced as the estimates depend
more on empirical data and less on a priori assumptions. There are several reasons why flexible
modeling of cholesterol effects is of interest. First, the impact of CHD on mortality and morbidity
makes it mandatory to provide as accurate as possible estimates of the effects of particular risk
factors, even if it requires additional analytic and computational efforts. If statistically
significant and clinically important departures from the linearity assumption are found, they may
suggest the need to revise the current perception of the effectiveness and cost-effectiveness of
various interventions aimed at the modification of risk factors. On the other hand, if flexible
nonparametric models yield estimates of the cholesterol effect that are very similar to those obtained
from the parametric multiple logistic model, such results will enhance the scientific value of
findings based on the latter model by providing a posteriori empiric validation of the underlying a
priori assumptions.
In this study, we use a nonparametric regression approach to reassess the effects of selected
continuous risk factors on the risk of CHD mortality, with particular focus on two lipid measures: TC
and the ratio of TC to high density lipoprotein cholesterol (TC/HDL cholesterol).
MATERIALS AND METHODS
Data source
We completed a secondary analysis of the public use data provided by the Lipid Research Clinics (LRC)
Program Prevalence and Follow-up Studies. Between 1972 and 1976, 10 North American clinics
participated in the Prevalence Study aimed at estimating the prevalence of dyslipoproteinemias and
related factors (16-18). At the first visit, 60,502 individuals out of 81,926 initially contacted
completed a questionnaire and had their blood tested. A 15 percent "random sample" of the visit 1
participants as well as all individuals found to have abnormal lipid levels at visit 1 ("nonrandom
sample") were then invited to return for visit 2, which took place on average 3 months later. The
nonrandom sample included all participants taking lipid-lowering medication and those with TC and/or
triglyceride values exceeding the respective age- and sex-specific cutoffs (19). At visit 2, a
complete medical history was taken, a physical examination was carried out, and TC and its fractions
were measured. (Details of laboratory procedures can be found in references 16 and 18.) Participants
were then followed prospectively until death or the end of the LRC Follow-up Study in June 1987 (20).
Participants or a reliable source were contacted to determine participants' vital status. Causes of
mortality were established based on death certificates and hospital records (20).
Our main analyses focus on male participants in the random sample. Individuals who were on
lipid-lowering medications at visit 2 were eliminated from the data set. Following the argument
presented by Benfante et al. (21), their risks may be determined more by their (unknown) previous TC
level than by the current level, likely to be reduced recently due to medication. The resulting data
set includes 2,512 individuals with the median follow-up time of 12.6 years (interquartile range, 1.2
years). Ninety-four CHD deaths occurred in this data set.
In some descriptive analyses and for the purpose of assessing the generalizability of our results, we
use data on males in the nonrandom sample. After the exclusion of individuals on lipid-lowering
medications as well as of an obvious outlier with the TC value above 1,000 mg/dl, there were 1,992 men
in the nonrandom sample, among whom 102 died because of CHD.
Statistical analyses
Simple univariate statistics, t tests for continuous variables, and chi-square tests for binary
variables were used to compare random and nonrandom samples. A significance level of α = 0.05 was used
for all tests.
Main analyses relied on multivariable binary regression models with the outcome defined as a CHD
death. Each multiple regression model included one of the two cholesterol measures, either TC or
TC/HDL cholesterol, as well as a set of a priori selected risk factors: age, systolic blood pressure,
body mass index (BMI), glucose intolerance, current smoking status, history of CHD, and the use of
antihypertensive medication. All risk factors were represented by the values determined at visit 2. In
the parametric multiple logistic regression model, the effect of each continuous risk factor on the
logit of a CHD death was represented by a linear function.
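Written out, the parametric model described above takes the standard logistic form. The notation is introduced here for illustration and is not taken from the original: Y = 1 denotes a CHD death, x_1, ..., x_k are the risk factor values at visit 2, and exp(β_j) is the adjusted odds ratio per unit increase in x_j.

```latex
\operatorname{logit} \Pr(Y = 1 \mid x_1, \dots, x_k)
  = \log \frac{\Pr(Y = 1 \mid x_1, \dots, x_k)}{1 - \Pr(Y = 1 \mid x_1, \dots, x_k)}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
```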
Nonparametric modeling and testing of linearity.
Generalized additive models (GAM) methodology (14, 15) was used for nonparametric modeling of the TC
effects, adjusted for other risk factors. Cubic smoothing splines were used to estimate flexible
functions that describe the dependence of the logit of a CHD death on the TC value. To test the
significance of the nonlinear effects, we used the likelihood ratio test with the approximate
chi-square statistics based on the effective number of degrees of freedom, corresponding to the trace
of the smoother matrix (15). A significant result of this test was interpreted as evidence that the
nonparametric smoothing spline estimate fits the data significantly better than the linear function
estimated using the parametric logistic regression model. Simulations reported by Hastie and
Tibshirani (15) show that, under the null hypothesis of a linear effect, the empiric distribution of
the proposed likelihood ratio statistics in its upper tail is very similar to the theoretical
chi-square distribution with the appropriate number of degrees of freedom. This means that the test of
the nonparametric effect is reliable at the conventional significance levels α = 0.05 or α = 0.01. A
parsimonious 2-df smoothing spline model was chosen a priori for the purpose of testing the
significance of nonparametric effects. According to our expectations, this model should provide
sufficient flexibility to ensure that the estimation of the risk function would be reasonably free of
bias, while protecting against the risk of overfitting the data that is present with more complex
models (22). Basing significance testing on an a priori selected model also avoids the problem of
inflation of type I error rate that occurs if the model is selected a posteriori (23, 24).
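The test described above can be sketched as follows. This is a minimal illustration, not the authors' software: it assumes the test statistic is the difference in deviance (minus twice the log-likelihood) between the linear and the spline model, referred to a chi-square distribution whose degrees of freedom equal the difference in effective degrees of freedom. The function names and the deviance values in the usage lines are hypothetical; only the Python standard library is used, so the chi-square tail is computed from the incomplete gamma function.

```python
import math

def chi2_sf(x: float, df: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable.
    df may be non-integer, as with effective degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_front = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Series for the regularized lower incomplete gamma function.
        term = total = 1.0 / a
        n = a
        while abs(term) > abs(total) * 1e-15:
            n += 1.0
            term *= z / n
            total += term
        return 1.0 - total * math.exp(log_front)
    # Continued fraction (modified Lentz) for the upper incomplete gamma.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(log_front)

def lrt_nonlinearity(dev_linear, dev_spline, df_linear, df_spline):
    """Likelihood ratio test of a spline term against a linear term.
    Returns the statistic and its approximate chi-square p value."""
    stat = dev_linear - dev_spline
    return stat, chi2_sf(stat, df_spline - df_linear)

# Hypothetical deviances, for illustration only (not LRC results).
stat, p = lrt_nonlinearity(650.0, 643.0, df_linear=1.0, df_spline=2.0)
print(stat, p)
```

With a 2-df spline tested against a 1-df linear term, the reference distribution has about 1 degree of freedom, which is why a fairly small drop in deviance can already be significant.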
In the above analyses, we rely on the GAM generalization of the binary regression model, which allows
us to estimate absolute risks (probabilities of CHD death associated with different risk profiles; see
Clinical relevance of nonlinear effects), in addition to relative risks. However, to evaluate the
robustness of our conclusions about the shape of the estimated risk function and about the
significance of nonparametric effects, our main analyses described in this subsection were replicated
using the GAM generalization of the Cox proportional hazards model (15). As in binary regression
analyses, the effect of TC or TC/HDL cholesterol was represented by a 2-df smoothing spline and was
adjusted for the same risk factors. Given the very high censoring level (above 95 percent) and little
variation in the duration of follow-up in our data, we a priori expect that the results of the
proportional hazards analyses will be very similar to those of the binary regression. Therefore, all
further analyses, described in the following subsections, are limited to the binary regression
approach.
Model selection. A posteriori model selection criteria may be useful for descriptive purposes since
they allow us to determine which model provides an optimal tradeoff between the fit to data and model
parsimony. Therefore, we estimated GAM models with different degrees of freedom and used the Akaike
Information Criterion (AIC) (25) as well as the Bayesian Information Criterion (BIC) (26) to find the
best fitting model, corresponding to the minimum of a given criterion value. While AIC is more
commonly used, the BIC criterion has the advantage of accounting for the sample size and has been
found to perform better in other nonparametric analyses (22, 23).
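The selection rule can be illustrated with a short sketch. The `aic` and `bic` functions below use the textbook definitions, which is an assumption: the paper does not spell out how the effective degrees of freedom enter its criteria, so these functions are not claimed to reproduce the published values. The dictionaries hold the criterion values reported for TC in table 3, and `best_df` (a name introduced here) simply returns the degrees of freedom with the smallest criterion value.

```python
import math

def aic(log_lik: float, n_params: float) -> float:
    """Textbook AIC: -2 log L + 2 k (assumed form, see lead-in)."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik: float, n_params: float, n_obs: int) -> float:
    """Textbook BIC: -2 log L + k log n, so the penalty grows with n."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

# Criterion values for TC as reported in table 3, keyed by spline df.
AIC_TC = {2: 638.782, 3: 637.818, 4: 638.366}
BIC_TC = {2: 661.391, 3: 662.780, 4: 665.818}

def best_df(values: dict) -> int:
    """Degrees of freedom that minimize the given criterion."""
    return min(values, key=values.get)

print(best_df(AIC_TC), best_df(BIC_TC))  # prints "3 2"
```

The AIC minimum sits at 3 df by a margin of less than 1 unit, whereas BIC, with its heavier sample size-dependent penalty, favors the 2-df model.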
In the preliminary GAM analyses, the nonparametric modeling was limited to the cholesterol effects.
Final analyses were based on a more complex GAM model in which the effect of each continuous risk
factor (age, systolic pressure, BMI, and TC or TC/HDL cholesterol) was represented by a 2-df smoothing
spline. We also assessed the sensitivity of the estimates with respect to the decision whether to
include in the analysis the participants on antihypertension medications and/or those with a history
of CHD. In additional exploratory analyses, we used different subsets of the original data set to
investigate some post
hoc suggestions as well as to illustrate some properties
of the estimates derived from different models.
Model validation. Results of multivariable modeling of the data on men in the random sample were
validated using two approaches based, respectively, on 1) independent data (nonrandom sample) and 2)
cross-validation of the random sample. In both approaches, we treated the random sample as a "learning
sample" and used the resulting estimated regression coefficients to calculate the estimated
probability of a CHD death for each subject in a "testing sample." Comparison of the estimated
probabilities for individual subjects with the actual outcomes (CHD death or not) for the same subject
yielded the log-likelihood of the data in the testing sample under the model estimated from the
learning sample. The log-likelihood was used as a criterion to assess the goodness of fit of the
estimated model to independent data.
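For a binary outcome, the validation criterion described above is the usual Bernoulli log-likelihood evaluated on the testing sample. In the notation assumed here (not the authors'), y_i equals 1 if subject i died of CHD and 0 otherwise, and p̂_i is the probability of CHD death computed for subject i from the model estimated on the learning sample:

```latex
\ell_{\text{test}}
  = \sum_{i \in \text{testing sample}}
    \bigl[\, y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \,\bigr]
```

Values closer to zero indicate that the model assigned higher probabilities to the outcomes that actually occurred.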
The first approach consisted of using the nonrandom sample of men as the testing sample. In this part
of the validation study, we attempted to assess to what extent the estimates obtained with the random
sample of LRC data are generalizable to other similar data sets. Given that at present we do not have
access to detailed data from a different, large, prospective CHD study, we have decided to use the
nonrandom sample of the LRC study for this purpose. However, validation results based on the nonrandom
sample should be interpreted with caution since this sample is composed of individuals selected
specifically because they were perceived to be at a high risk of CHD events (19). Therefore, the
individuals in the nonrandom sample, in addition to having a different distribution of common risk
factors recorded in the LRC study, may differ from the randomly selected individuals with respect to
some more subtle clinical characteristics and/or other unrecorded variables associated with CHD risks.
The purpose of the second part of the validation study was to compare the reproducibility of the
estimates obtained with different models in terms of their ability to predict outcomes in "new cases"
arising from the same population. The approach consisted of a 10-fold cross-validation of the random
sample estimates. The random sample was first randomly divided into 10 subsets of approximately equal
size. Then, a learning sample was constructed as the union of nine of the subsets, and the tenth
subset was used as a testing sample. The same process of estimation and validation was repeated nine
additional times, each time using a different subset as a testing sample and the nine other subsets as
a learning sample. This procedure ensures that for each individual the outcome is predicted based on
the model estimated independently of the individual's data (i.e., estimated from the nine subsets that
do not include this individual). By summing up the log-likelihood values obtained in each of the 10
subsets, we calculated the cross-validated log-likelihood of the entire random sample.
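The procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the `fit` and `predict` callables stand in for any model (logistic regression or GAM), the intercept-only toy model and the synthetic data exist only to make the example runnable, and probabilities are clipped to avoid taking the logarithm of zero.

```python
import math
import random

def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood of outcomes y under predicted probabilities p."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total += math.log(pi) if yi else math.log(1.0 - pi)
    return total

def cross_validated_loglik(x, y, fit, predict, k=10, seed=0):
    """Sum of testing-sample log-likelihoods over k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)          # random division into subsets
    folds = [idx[j::k] for j in range(k)]     # k subsets of near-equal size
    total = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # The model never sees the subjects it is asked to predict.
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        p = predict(model, [x[i] for i in test_idx])
        total += log_likelihood([y[i] for i in test_idx], p)
    return total

# Toy stand-in model (hypothetical): predicts the learning-sample event rate.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x_test):
    return [model] * len(x_test)

# Synthetic illustration only: 200 subjects, 20 events (not LRC data).
x = [float(i) for i in range(200)]
y = [1 if i % 10 == 0 else 0 for i in range(200)]
cv = cross_validated_loglik(x, y, fit_mean, predict_mean)
print(cv)
```

Running the same function once with a parametric fit and once with a spline-based fit, on identical folds, gives two cross-validated log-likelihoods that can be compared directly.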
Clinical relevance of nonlinear effects. To assess the clinical importance of the differences between
the parametric and the nonparametric estimates, we compared the results of the logistic regression and
GAM analyses with respect to the estimated effects of some arbitrary changes in the TC level on the
probability of a CHD death for hypothetical patients with prespecified risk profiles.
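Translating an estimated logit into a probability uses the inverse logit transform. The sketch below, with hypothetical baseline logits and a hypothetical logit change, shows why absolute risk differences depend on the patient's risk profile even when the change on the logit scale is identical; the function names are introduced here for illustration.

```python
import math

def expit(logit: float) -> float:
    """Inverse logit: probability corresponding to a given logit."""
    return 1.0 / (1.0 + math.exp(-logit))

def risk_change(baseline_logit: float, delta_logit: float) -> float:
    """Change in probability when the logit shifts by delta_logit."""
    return expit(baseline_logit + delta_logit) - expit(baseline_logit)

# Hypothetical values: the same logit reduction applied to a lower-risk
# and a higher-risk profile (not estimates from the LRC data).
for baseline in (-4.0, -2.0):
    print(baseline, risk_change(baseline, -0.5))
```

The higher-risk profile shows the larger absolute reduction in probability, which is why the comparison is made for prespecified risk profiles rather than on the logit scale alone.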
RESULTS
Baseline characteristics: random versus
nonrandom sample
Table 1 compares the distributions of risk factors among men in random and nonrandom subsets of LRC
data. In most cases, the mean values of continuous risk factors and the prevalence of binary factors
in the nonrandom sample are statistically significantly higher than the corresponding values in the
random sample, except for HDL cholesterol, which is lower, as expected. Although the majority of the
differences are not clinically important, for TC and TC/HDL cholesterol the mean values in the
nonrandom sample are very substantially higher than in the random sample, which reflects the design of
the LRC study (19). Accordingly, the question of homogeneity of the lipid effects across random and
nonrandom samples has to be investigated.
Parametric logistic regression modeling
Table 2 presents the results of the conventional multiple logistic regression analyses of CHD
mortality in the two samples, with the effects of all continuous risk factors assumed to be linear in
logit. All results are from the model that included TC but not TC/HDL cholesterol. The results for
TC/HDL cholesterol are adjusted for all other risk factors except TC. (Results for other risk factors
did not change materially when TC was replaced by TC/HDL cholesterol.) Interestingly, there is no
evidence of the independent effect of BMI on the risk of CHD death in men since in both samples p
values are very high and adjusted odds ratios are very close to 1.0. Post hoc analysis suggested that
the unexpected nonsignificance of the adjusted effect of systolic pressure in the nonrandom sample was
likely due to the near collinearity between systolic pressure and the use of antihypertensive
medication in that sample. This problem did not occur in the random sample, where the adjusted effect
of systolic pressure is highly significant. All other risk factors are significant, and the
corresponding odds ratios estimated from random and nonrandom samples are very similar for most risk
factors. However, the estimated effect of TC/HDL cholesterol is weaker in the nonrandom sample. A
one-unit increase in TC/HDL cholesterol corresponds to the estimated relative risk increase of 33
percent (95 percent confidence interval (CI) 19-48) and 14 percent (95 percent CI 5-23) in the random
and nonrandom samples, respectively. Moreover, the two estimates of the intercept, corresponding to
the estimated logit of a CHD death for a specific covariate pattern (all covariate values equal to
zero), are quite different: -10.1 (95 percent CI -12.4 to -7.8) and -7.2 (95 percent CI -9.6 to -4.8)
for the random and nonrandom samples, respectively. The fact that, for a fixed covariate pattern, the
estimated risks are substantially higher in the nonrandom sample indicates that the higher mortality
rate (table 2) in this sample is not completely accounted for by the differences in the distribution
of common risk factors. This may suggest that the participants included in the nonrandom sample have
some other high-risk characteristics not taken into consideration in our analyses.
[TABLE 1. Distributions of risk factors among men in the random and nonrandom samples; table body not recoverable from the source.]
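The percent increases quoted in this paragraph follow directly from the adjusted odds ratios and confidence limits for TC/HDL cholesterol in table 2. The short check below assumes, as the text does, that the odds ratio is read as an approximate relative risk; the helper name `percent_increase` is introduced here.

```python
def percent_increase(odds_ratio: float) -> float:
    """Percent increase in odds implied by an odds ratio per unit change."""
    return (odds_ratio - 1.0) * 100.0

# (estimate, lower limit, upper limit) for TC/HDL cholesterol, from table 2.
reported = {
    "random": (1.326, 1.186, 1.483),
    "nonrandom": (1.138, 1.049, 1.234),
}

for sample, values in reported.items():
    est, lo, hi = (round(percent_increase(v)) for v in values)
    print(f"{sample}: {est}% (95% CI {lo}-{hi}%)")
# random: 33% (95% CI 19-48%)
# nonrandom: 14% (95% CI 5-23%)
```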
Nonparametric effects of lipids
In table 3, we present the results of separate nonparametric multivariable GAM analyses of the
adjusted effects of TC and TC/HDL cholesterol for males in the random sample. The effect of each lipid
variable is adjusted for all other risk factors listed in table 2, and all of the other risk factors
are modeled parametrically. We focus first on the GAM models with 2 df since these parsimonious models
were selected a priori to test the significance of the nonlinear effects of cholesterol. The p values
for the likelihood ratio tests of the nonlinearity of the adjusted effect are below 0.05 for both TC
(p < 0.01) and TC/HDL cholesterol (p < 0.02). This indicates that the 2-df smoothing spline estimates
obtained from the GAM model represent the effects of cholesterol on CHD mortality significantly better
than the linear estimates derived from the conventional logistic regression. The p values for models
with higher degrees of freedom confirm the significance of the nonlinear effects at α = 0.05. However,
the p values increase with the increasing model complexity, suggesting that additional degrees of
freedom may not be necessary to model systematic effects of cholesterol. This is further corroborated
by the model selection criteria. While the differences between the corresponding AIC values assigned
to the three models are very small, the 2-df model is definitely superior with respect to a more
adequate sample size-dependent BIC criterion (table 3).
TABLE 2. Results of conventional multiple logistic regression in males, LRC*, 1972-1987

                      Random sample (n = 2,512)†             Nonrandom sample‡
                      Adjusted    95% confidence             Adjusted    95% confidence
Covariates            odds ratio  interval        p          odds ratio  interval       p
Age (years)           1.065       1.042-1.088     <0.0001    1.059       1.037-1.080    <0.0001
Systolic pressure     1.020       1.008-1.032     0.001      1.006       0.994-1.018    0.328
BMI* (kg/m²)          1.012       0.949-1.081     0.710      0.995       0.937-1.057    0.877
Current smoker        2.205       1.370-3.549     0.001      1.966       1.253-3.085    0.003
Glucose intolerance   3.143       1.494-6.613     0.003      2.900       1.566-5.370    0.001
History of CHD*       6.414       3.566-11.533    <0.0001    5.592       3.220-9.714    <0.0001
TC* (mg/dl)           1.010       1.004-1.016     0.0004     1.009       1.005-1.014    <0.0001
TC/HDL cholesterol*   1.326       1.186-1.483     <0.0001    1.138       1.049-1.234    0.002

* LRC, Lipid Research Clinics Prevalence and Follow-up Studies; BMI, body mass index; CHD, coronary
heart disease; TC, total serum cholesterol; HDL cholesterol, high density lipoprotein cholesterol.
† Number of deaths in random sample = 94 (3.74%).
‡ Number of deaths in nonrandom sample = 102 (5.12%).
TABLE 3. GAM* analyses: testing significance of nonlinear effects of lipids and model selection
criteria in random males, LRC*, 1972-1987

GAM model   TC*                              TC/HDL cholesterol*
(df)        p value   AIC*      BIC*         p value   AIC       BIC
2           0.007     638.782   661.391      0.012     630.404   653.148
3           0.009     637.818   662.780      0.023     630.184   655.379
4           0.015     638.366   665.818      0.040     630.841   658.526

* GAM, generalized additive model; LRC, Lipid Research Clinics Prevalence and Follow-up Studies; TC,
total serum cholesterol; HDL cholesterol, high density lipoprotein cholesterol; AIC, Akaike
information criterion; BIC, Bayesian information criterion.
In GAM, the estimated effect of a continuous risk factor is represented by a flexible function rather
than by a single parameter (logarithm of odds ratio in the logistic model), so that the best way to
appreciate the results is to plot this function. The left panel of figure 1 compares the parametric
logistic regression-based estimate (dashed line) of the effect of TC on the logit of a CHD death with
the corresponding estimates obtained from GAM models with different degrees of freedom. The tick marks
on the abscissa illustrate the sample distribution of TC, indicating that the estimates are well
supported by the data in the range between about 130 and 300 mg/dl. The upper tail of the graph is
truncated because the very sparse data beyond 300 mg/dl produce unstable estimates in that region. All
of the estimates are adjusted for the effects of nonlipid risk factors listed in table 1. Therefore, a
given curve represents the estimated dependence of the logit of a CHD death on TC in a hypothetical
population homogeneous with respect to age, systolic pressure, BMI, smoking status, previous history
of CHD, and the presence/absence of diabetes. A priori assumptions underlying the parametric logistic
regression model restrict this relation to a linear function, which implies that the slope has to be
constant over the entire range of TC. The nonparametric GAM estimates are more flexible so that the
estimated slope may change along that range. The GAM estimates are restricted only in that they have
to exhibit some degree of smoothness that decreases with increasing complexity of the model (i.e.,
with increasing degrees of freedom). The higher the local slope over a given subinterval of abscissa,
the higher the impact of changing TC value in this specific range.
All nonparametric GAM estimates describe a similar pattern of the effects of TC: Over the range of
100-250 mg/dl, the risks increase quickly and approximately linearly, whereas above the level of 250
mg/dl, the risk function levels off. Among GAM estimates, the 2-df estimate (solid curve) appears to
be the most plausible one clinically. The 2-df smoothing spline estimate is monotonically
nondecreasing. By contrast, the estimates with higher degrees of freedom suggest that the risk
decreases substantially with increasing TC beyond the level of 250 mg/dl, which seems counterintuitive
and may reflect sampling error rather than a systematic effect. Moreover, the monotone 2-df estimate
is within the 95 percent confidence intervals for the more complex models (data not shown), which
provides further support for the relatively parsimonious 2-df nonparametric model (27). The 10-df
estimate of the risk function exhibits several bumps and valleys and illustrates the overfitting bias
inherent in models that are more complex than required to represent the relation of interest (22).
Given all the above remarks, the visual assessment of the estimates confirms that the BIC-optimal,
2-df model is unlikely to miss any systematic aspects of the effect of TC in the range from about 130
to 300 mg/dl, where most of our data fall. Obviously, the number of observations beyond this range is
too low to allow stable estimation of this effect in the tails of TC distribution. A somewhat similar
pattern of results emerges in the right panel of figure 1, where the effects of TC/HDL cholesterol are
presented. The estimated risk increases in a linear fashion over the range of values lower than 7.0,
but increasing the TC/HDL cholesterol beyond this level has a less dramatic effect.
FIGURE 1. Comparison of parametric and nonparametric estimates of the effects of total serum
cholesterol (TC) (left panel) and the ratio of TC to high density lipoprotein cholesterol (TC/HDL
cholesterol) (right panel) for males in the random sample of the LRC Prevalence and Follow-up Studies,
1972-1987. The tick marks on the abscissa describe the sample distribution of the respective risk
factor. The logistic regression estimate is denoted by a dashed line. GAM estimates are displayed for
2 (solid curve) as well as 3, 4, and 10 (dotted curves) df. All effects are adjusted for risk factors
listed in table 1. The graphs correspond to an arbitrary vector of nonlipid risk factor values, so
that the logit scale on the vertical axis has a valid unit but an arbitrary zero. Accordingly, the
relative risks between different risk factor values are accurately represented, but absolute risk
levels should not be interpreted. CHD, coronary heart disease.
The analyses reported in this subsection were replicated using the GAM
generalization of the Cox proportional hazards model, with the adjusted
effects of TC or TC/HDL cholesterol represented by a 2-df smoothing
spline. (Because of near collinearity of the estimated effects of the
history of previous CHD and diabetes, only one of these two variables
could be included in these additional analyses. The results obtained with
either variable excluded were almost identical.) Consistent with our
expectations, the results of the GAM proportional hazards modeling were
very similar to those of the binary regression analyses reported above.
The likelihood ratio tests indicated that the inclusion of the
nonparametric effects of either TC or TC/HDL cholesterol significantly
improved the fit of the models to data (at α = 0.05). Moreover, for
either lipid measure, the estimated risk functions were very similar to
the 2-df estimates presented in figure 1. The conclusions regarding the
shape and significance of the relevant effects did not change materially
when the analyses were restricted to a subset of individuals without
previous history of CHD (data not shown).
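As a rough illustration of the likelihood ratio comparison described
above, the following sketch converts a difference in log-likelihood
between a linear and a flexible model into an approximate p value. The
log-likelihood values are hypothetical and are not taken from our
analyses, and treating the comparison of a 2-df spline with a linear term
as a 1-df chi-square test is an assumption of this sketch, since the
reference distribution for smoothing splines is only approximate.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def likelihood_ratio_test(loglik_linear, loglik_flexible):
    """Return (LR statistic, approximate p value) for a 1-df comparison."""
    stat = 2.0 * (loglik_flexible - loglik_linear)
    # A negative statistic can only arise from numerical noise; clamp at 0.
    return stat, chi2_sf_1df(max(stat, 0.0))


# Hypothetical log-likelihoods (assumed for illustration only).
stat, p = likelihood_ratio_test(loglik_linear=-350.0, loglik_flexible=-346.5)
print(f"LR statistic = {stat:.2f}, approximate p = {p:.4f}")
```

With these assumed values, the flexible model would be preferred at
α = 0.05; a smaller gain in log-likelihood would not reach significance.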
Robustness of nonlinear effects
Table 4 and figure 2 present the results of the analysis of the
sensitivity of the conclusions regarding the nonlinear effects of
cholesterol, represented by the 2-df smoothing spline estimates, with
respect to changes in the data set. Table 4 indicates that the
significance of nonlinear effects persists when the sample size is
reduced by eliminating high-risk subgroups (individuals on
antihypertensive medication and/or those with a history of previous CHD).
Moreover, all estimates in figure 2 are quite similar, which shows that
the finding of a decreasing impact of TC on
TABLE 4. Significance of nonlinear effects of lipids in relation to changes in the random males data
set, LRC*, 1972-1987

Participants on      Participants                               p value for departure from
blood pressure-      with previous   No. of         No. of      linearity in 2-df GAM* model
lowering medication  CHD*            participants   CHD
included?            included?                      deaths      TC*      TC/HDL cholesterol*

Yes                  Yes             2,512          94          <0.01    0.012
Yes                  No              2,424          71          0.032    <0.01
No                   Yes             2,355          76          <0.01    0.038
No                   No              2,276          56          <0.01    <0.01

* LRC, Lipid Research Clinics Prevalence and Follow-up Studies; CHD, coronary heart disease; GAM,
generalized additive model; TC, total serum cholesterol; HDL cholesterol, high density lipoprotein cholesterol.
FIGURE 2. Estimated adjusted effects of total serum cholesterol (TC) from different subsets of males in the random sample of the LRC
Prevalence and Follow-up Studies, 1972-1987 (horizontal axis: TC, mg/dl). a, Logistic model, all males (n = 2,512); b to e, 2-df GAM models; b, all males (n = 2,512);
c, all males except those with a history of coronary heart disease (CHD) (n = 2,424); d, all males except those on antihypertensive medication
(n = 2,355); e, all males except those with a history of CHD and/or those on antihypertensive medication (n = 2,276). All effects are adjusted
for relevant risk factors listed in table 1. The relative risks between different risk factor values are accurately represented, but absolute risk
levels should not be interpreted.
CHD mortality in the upper range of TC is robust with
respect to the changes in the data set.
Exploring the bias of linear estimates
Statistical significance and robustness of the nonlinear effects of
cholesterol indicate that the conventional logistic regression-based
analyses provide biased estimates of these effects. Comparison of the
linear estimate with GAM-based spline estimates in figure 1 (left panel)
shows that this bias is due to the inability of the linear model to
account for the decreasing impact of TC above the level of 250 mg/dl.
Therefore, for the LRC data, the conventional linear model seems to
overestimate the effects of changing TC in the upper range, where the
effect is nonlinear. The reason why the strength of association between
TC and the risk of CHD death decreases among individuals with TC values
above 250 mg/dl is not evident and may be partly related to the
limitations of the available data (see Discussion). By contrast, all
nonparametric estimates of TC effects presented in figure 1 (left panel)
look quite linear for TC values lower than 250 mg/dl, suggesting that the
linearity assumption holds for this range of TC values. In spite of this,
the logistic regression estimate is quite different from the GAM
estimates, even for the range of values below 250
mg/dl. More specifically, the slope of the logistic
regression estimate is much lower than the slopes of
nonparametric estimates over this range.
To explore further why the two estimates diverge even over the interval
where both are approximately linear, we reestimated the respective models
using only data for those participants whose TC was below 250 mg/dl. In
figure 3, we compare the estimates obtained with the data truncated at
250 mg/dl with the original estimates based on the entire data set. The
two nonparametric GAM (2-df) estimates are almost identical in the range
of TC values below 200 mg/dl and gradually diverge from each other as
they move closer to the cutoff of 250 mg/dl. This illustrates the
property of the local dependence of the GAM estimates. The estimated
effect of TC below the 250 mg/dl cutoff is only slightly affected by the
inclusion in the analysis of individuals with TC values above 250 mg/dl,
and this impact is "local," i.e., limited to a rather narrow interval
close to the cutoff. The reason why the two GAM estimates are not
identical is that GAM smooths the empirical data (15). When estimating
the effect inside a given narrow interval, say TC ranging from 220 to 230
mg/dl, it "borrows strength" from the data in the close neighborhood of
that interval. Otherwise, the estimate would be too dependent on the few
observations that fall in this interval. This would increase the impact
of the sampling error, resulting in implausible "bumpy" estimates similar
to the 10-df estimates illustrated in figure 1. The degree of smoothing
is controlled by selecting the number of degrees of freedom (15). It
should also be noted that the almost perfect linearity of the 2-df
smoothing spline estimate for the truncated data in figure 3 is
reassuring in that it confirms that GAM does not "force" nonlinearity in
situations where the data do not require it.
By contrast, the two logistic regression-based estimates in figure 3
differ quite substantially from each other over the entire range of TC
from 100 to 250 mg/dl. Indeed, for truncated data, the estimated odds
ratio corresponding to an increase of 1 mg/dl is 1.021 (95 percent CI
1.011-1.032), whereas the estimate obtained from the entire data set is
1.010 (95 percent CI 1.004-1.016). This indicates that the estimated
relative risk increase resulting from an increase of 1 mg/dl in TC is 2
percent when based on truncated data and 1 percent when based on the
entire sample. Thus, inclusion of the individuals with TC above 250 mg/dl
results in a twofold decrease of the parametric estimate of the effect of
TC in the interval below the 250 mg/dl cutoff. Clearly, individuals with
TC above the cutoff provide little information about the TC effects in
the interval below the cutoff. Therefore, among the two parametric
estimates of the TC effects in the lower range, that obtained with data
truncated to this range is more accurate. This leads to a conclusion
that, when
FIGURE 3. Comparison of the estimates from the set of males in the random sample and the truncated subset of males with total serum
cholesterol (TC) ≤250 mg/dl, the LRC Prevalence and Follow-up Studies, 1972-1987 (horizontal axis: TC, mg/dl). The vertical line at 250 mg/dl is the cutoff of TC.
Estimates based on the entire data set are described by linear models (a) and 2-df GAM models (b). Estimates based on truncated data are
described by linear models (c) and 2-df GAM models (d). All effects are adjusted for risk factors listed in table 1. The relative risks between
different risk factor values are accurately represented, but absolute risk levels should not be interpreted. CHD, coronary heart disease.
analyzing the entire data set, formal limitations of the parametric model
(in particular, the global dependence of its estimates) result in a
biased estimate of the effect of TC even for the interval where the
linearity assumption seems to be correct. By contrast, the GAM estimate
of the effect of TC below 250 mg/dl is relatively robust with respect to
the inclusion of individuals with values beyond this range so that the
GAM estimate obtained using all data is quite similar to the parametric
estimate based on truncated data.
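The arithmetic behind the "twofold" difference between the two parametric
estimates can be checked with a few lines of code. The sketch below takes
the two reported odds ratios per 1 mg/dl, converts them to slopes on the
log-odds scale, and rescales them to larger TC differences. The function
name and the chosen increments (10 and 40 mg/dl) are ours and serve only
as an illustration.

```python
import math

# Odds ratios per 1 mg/dl of TC reported in the text.
OR_TRUNCATED = 1.021  # data truncated at 250 mg/dl
OR_FULL = 1.010       # entire random sample


def rescale_odds_ratio(or_per_unit, delta):
    """Odds ratio implied for a change of `delta` units under a linear logit."""
    return math.exp(math.log(or_per_unit) * delta)


# Slopes on the log-odds scale and their ratio.
slope_truncated = math.log(OR_TRUNCATED)
slope_full = math.log(OR_FULL)
slope_ratio = slope_truncated / slope_full

print(f"log-odds slope, truncated data: {slope_truncated:.5f}")
print(f"log-odds slope, entire sample:  {slope_full:.5f}")
print(f"ratio of slopes:                {slope_ratio:.2f}")

# Illustrative increments (assumed, not from the paper).
for delta in (10, 40):
    print(delta,
          round(rescale_odds_ratio(OR_TRUNCATED, delta), 3),
          round(rescale_odds_ratio(OR_FULL, delta), 3))
```

Because the slopes act multiplicatively on the odds, the gap between the
two models widens quickly as the TC difference being compared grows.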
Simultaneous estimation of nonlinear effects of
all continuous risk factors
Table 5 shows the results of testing simultaneously the nonlinearity of
the effects of all continuous risk factors in the random sample, each
represented by a 2-df smoothing spline. In addition to TC and TC/HDL
cholesterol, age also has a significant independent nonlinear effect,
while the result is marginally nonsignificant for BMI. By contrast, the
linearity assumption appears quite adequate in the case of systolic
pressure. Figure 4 compares the GAM estimates (solid curves) with the
respective parametric estimates for the five continuous risk factors. All
of the GAM estimates, with the exception of TC/HDL cholesterol (see the
figure 4 legend for explanation), were obtained simultaneously. Thus,
when adjusting the nonparametric effect of a given continuous predictor
for the effects of other continuous risk factors, all of these effects
are assumed to be represented by the respective (solid) curves in figure
4 rather than by (dotted) lines. The estimates of the effects of TC and
TC/HDL cholesterol did not change materially when other risk factors were
represented by flexible functions (figure 4) rather than by a straight
line (figure 1). This is not surprising given the low correlation between
TC and other risk factors. The GAM estimate for systolic pressure
confirms linearity of its effect. A marked difference between the
parametric and GAM estimates occurs for BMI. The GAM estimate is U-shaped
and suggests that the risk of CHD death increases at both extremes of the
BMI range. By contrast, the parametric estimate is very flat and
completely nonsignificant (table 2). Whereas these results further
support the usefulness of flexible modeling, we have to interpret the
nonlinear effects of age and BMI with some caution since, in contrast to
the nonlinear effects of TC, we did not hypothesize such effects a
priori.

TABLE 5. Simultaneous testing of nonlinearity of adjusted
effects of continuous risk factors in the random males data
set, LRC*, 1972-1987

                         GAM* (df = 2)†
                         test of nonlinear effect
Risk factor              Model 1      Model 2
                         p value      p value

Age                      0.028        0.021
Systolic pressure        0.171        0.203
BMI*                     0.074        0.087
TC*                      0.008
TC/HDL cholesterol*                   0.019

* LRC, Lipid Research Clinics Prevalence and Follow-up
Studies; GAM, generalized additive model; BMI, body mass index;
TC, total serum cholesterol; HDL cholesterol, high density
lipoprotein cholesterol.
† Models include all nonlipid risk factors.
Model validation
In the validation studies, we have used the subset of individuals who
were not taking antihypertensive medications. We had to remove the
participants on those medications from the analysis because of the
difficulty in separating the effect of the medication and the effect of
high systolic pressure in the nonrandom sample, where the two variables
were almost collinear (see Parametric logistic regression modeling). This
would cause problems in the analyses using the nonrandom sample as the
testing sample.
In table 6, the results of validation studies are summarized in terms of
the difference in fit to testing sample data between the GAM models and
the corresponding parametric multiple logistic models. Using the first
validation strategy, we find that the GAM models (estimated from 2,355
medication-free males in the random sample) account slightly less well
for the effects of TC in the nonrandom sample than does the parametric
model. By contrast, the GAM models represent the effects of TC/HDL
cholesterol in the nonrandom sample substantially better than the
parametric model. Thus, we cannot draw firm conclusions about the extent
to which the nonparametric effects estimated from the random sample of
LRC data are generalizable to a different data set. However, as pointed
out in the "Model validation" subsection of the "Statistical analyses"
section, the results of this part of our validation study should be
interpreted with caution. Indeed, as discussed in the section Parametric
logistic regression modeling, individuals in the nonrandom sample appear
to have some high-risk characteristics that are not accounted for by the
common risk factors included in our multivariable models. Thus, to obtain
a more accurate assessment of the generalizability of nonparametric
estimates based on the random sample of LRC data, we will need to repeat
the validation with a data set that, although collected in a different
study, can be considered a representative sample from a population
similar to that screened in the LRC study.
The right side of table 6 shows that in the case of
cross-validation of the data from the random sample
FIGURE 4. Nonlinear effects of continuous risk factors versus logistic estimates based on males in the random sample of the LRC
Prevalence and Follow-up Studies, 1972-1987. All nonlinear effects are represented by 2-df smoothing splines and are estimated
simultaneously. Estimates of the effect of the ratio of total serum cholesterol (TC) to high density lipoprotein cholesterol (TC/HDL cholesterol)
are from model 2 of table 5; all other estimates are from model 1 of table 5. Solid curves are GAM estimates, dashed curves indicate their
95% pointwise confidence intervals, and dotted lines are parametric logistic estimates. All estimates are further adjusted for binary risk
factors: smoking, glucose intolerance, history of coronary heart disease, and use of antihypertensive medication. Tick marks on the abscissas
represent the sample distribution of the relevant risk factor. The relative risks between different risk factor values are accurately represented,
but absolute risk levels should not be interpreted. SBP, systolic blood pressure; BMI, body mass index.
TABLE 6. Validation of GAM* models estimated from random sample, LRC*, 1972-1987

                                                            Validation strategy:
                                                            change in log-likelihood‡
Lipid measure         Nonparametric estimation              Prediction in   Prediction of "new"
in model†             of effect                             nonrandom       observations in
                                                            sample          random sample

TC*                   TC                                    -2.994          4.488
                      TC, age, and BMI*                     -3.956          5.274
TC/HDL cholesterol*   TC/HDL cholesterol                     8.220          1.738
                      TC/HDL cholesterol, age, and BMI       7.507          2.139

* GAM, generalized additive model; LRC, Lipid Research Clinics Prevalence and Follow-up Studies; TC, total
serum cholesterol; BMI, body mass index; HDL cholesterol, high density lipoprotein cholesterol.
† Participants on antihypertensive medication are excluded.
‡ Log-likelihood values of GAM models are compared with those of linear logistic models with the same
predictors. A positive value means a better fit of GAM, and a negative value means a worse fit.
the GAM models performed uniformly better than the parametric logistic
model. This indicates that the GAM estimates may be more reproducible and
may better predict outcomes in "new" cases arising from exactly the same
population. The fact that the improvement in the fit offered by GAM is
more substantial for TC effects than for TC/HDL cholesterol is consistent
with the finding that in the case of TC, nonlinear effects are more
apparent and more significant (figure 1 and table 3).
Finally, the results in table 6 do not provide conclusive evidence that
nonparametric representation of the effects of age and BMI further
improves the predictive ability of the GAM models, in comparison with the
models in which only the cholesterol effects are modeled by smoothing
splines. Replacing parametric estimates of the effects of age and BMI by
nonparametric effects further improves the predictive ability of the GAM
model in the cross-validation of the random sample data but not in the
validation based on the
nonrandom sample. Moreover, in all cases, the difference between the two
GAM models with respect to goodness of fit to data is minor (difference
in log-likelihood of less than 1.0).
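The quantity summarized in table 6, the change in log-likelihood on a
testing sample, can be sketched as follows. The outcomes and predicted
probabilities below are hypothetical and serve only to show how a
positive or negative value arises; in the actual validation, the
predictions come from models estimated on the training data and applied
to observations not used in the estimation.

```python
import math


def bernoulli_loglik(outcomes, probs):
    """Log-likelihood of binary outcomes under the predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probs)
    )


# Hypothetical testing sample: 1 = CHD death, 0 = no CHD death.
deaths = [0, 0, 1, 0, 1, 0]

# Hypothetical predicted probabilities from the two competing models.
p_logistic = [0.10, 0.20, 0.30, 0.10, 0.40, 0.20]
p_gam = [0.05, 0.15, 0.45, 0.10, 0.50, 0.15]

# Positive change: GAM predictions fit the testing sample better.
change = bernoulli_loglik(deaths, p_gam) - bernoulli_loglik(deaths, p_logistic)
print(f"change in log-likelihood (GAM - logistic): {change:.3f}")
```

A model that assigns higher probabilities to the individuals who died and
lower probabilities to those who survived obtains the larger
log-likelihood, regardless of how well it fit the training data.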
Clinical relevance of nonlinear effects
Having established the significance and robustness of the nonlinear
effects of cholesterol on the risk of CHD mortality, we now focus on the
clinical relevance of the difference between the nonparametric and
parametric estimates. The left panel of figure 5 shows how the estimated
probability of a CHD death changes as a function of TC for a hypothetical
individual with an "average risk profile" (healthy nonsmoker with the
mean values of continuous risk factors, as shown in table 1). As
expected, the parametric model overestimates the risk in both tails of
the TC distribution and underestimates it over the middle range, in
comparison with the GAM estimate. (The right panel of figure 5 shows a
similar pattern of differences in the case of TC/HDL cholesterol.)
Accordingly, the two models differ substantially with respect to the
estimated differences in the risk of CHD death between individuals with
different TC levels. The magnitude and the direction of this difference
depend on the specific TC values being compared.
This phenomenon is illustrated in figure 6, which compares the two models
with respect to the estimated differences in the 12-year probability of a
CHD death corresponding to different modifications of lipid levels and
different baseline risk profiles. Regardless of the model used and the
specific modification considered, the estimated absolute difference
increases as we move from an average risk profile A to an elevated-risk
profile B and, finally, to a very high-risk, symptomatic individual C.
This merely reflects the differences in the absolute risk levels between
these hypothetical individuals. Figure 6 also shows that in many cases
the differences between the corresponding GAM (G) and logistic regression
(L) estimates are important in absolute values (occasionally exceeding 5
percent in the absolute probability of a CHD death) and/or in relative
terms (more than twofold differences). Moreover, the sign of the
difference between the two types of estimates depends systematically on
the specific interval over which the risk factor is presumed to change.
In the case of both TC (top panel) and TC/HDL cholesterol (bottom panel),
the logistic model underestimates the impact of the changes closer to the
middle range of the respective distribution (figure 6, left side), while
it seems to overestimate the impact of changes in the upper tail (right
side).
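The dependence of the absolute difference on the baseline risk profile
follows directly from the logistic link and can be sketched in a few
lines. The example below uses the odds ratio of 1.010 per 1 mg/dl
reported above for the entire sample; the three baseline risks are
hypothetical values chosen for illustration and are not the estimated
risks of profiles A, B, and C.

```python
import math


def inv_logit(z):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def absolute_risk_reduction(risk_at_start, slope_per_mgdl, tc_start, tc_end):
    """Drop in probability when TC moves from tc_start to tc_end (linear logit)."""
    new_logit = logit(risk_at_start) + slope_per_mgdl * (tc_end - tc_start)
    return risk_at_start - inv_logit(new_logit)


# Slope implied by the odds ratio of 1.010 per 1 mg/dl (entire sample).
slope = math.log(1.010)

# Hypothetical 12-year risks at TC = 250 mg/dl (assumed, not from the paper).
hypothetical_risks = {"low": 0.03, "elevated": 0.08, "very high": 0.25}

reductions = {
    name: absolute_risk_reduction(risk, slope, tc_start=250, tc_end=180)
    for name, risk in hypothetical_risks.items()
}
for name, value in reductions.items():
    print(f"{name:>9}: absolute risk reduction = {value:.3f}")
```

The same change in TC, and hence the same odds ratio, yields a larger
absolute reduction when the baseline risk is higher, which is the pattern
seen across profiles in figure 6 for either model.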
FIGURE 5. Estimated probabilities of a coronary heart disease (CHD) death as a function of total serum cholesterol (TC) (left panel) and the
ratio of TC to high density lipoprotein cholesterol (TC/HDL cholesterol) (right panel), given the following risk profile: mean values of age,
systolic pressure, and body mass index (table 1), nonsmoker, no glucose intolerance, no history of coronary heart disease, and no
antihypertensive medication. Dotted curves are logistic estimates, and solid curves are GAM estimates with 2 df. Results are based on males
in the random sample of LRC Prevalence and Follow-up Studies, 1972-1987.
FIGURE 6. Comparison of the estimated effects of selected risk factor modifications for hypothetical risk profiles. The three risk profiles are:
A, healthy nonsmoker with mean values of continuous risk factors; B, smoker with elevated values of systolic pressure and BMI
(corresponding to the 75th percentile of the respective distributions in the random sample); and C, nonsmoking diabetic with a history of
previous CHD and mean values of continuous risk factors. The top panel shows the estimated decrease in the probability of a CHD death
during a 12-year period because of lowering total serum cholesterol (TC) from 250 to 180 mg/dl (left top panel) as well as from 300 to 220
mg/dl (right top panel). The bottom panel shows the corresponding effects of decreasing the ratio of TC to high density lipoprotein cholesterol
(TC/HDL cholesterol) from 6 to 3 (left bottom panel) and from 10 to 6 (right bottom panel). The GAM estimates are denoted by G (above barred
boxes), and the logistic model estimates are denoted by L (above white boxes). Results are based on males in the random sample of LRC
Prevalence and Follow-up Studies, 1972-1987.
DISCUSSION
Our reanalysis of the LRC data indicates that the effects of both TC and
TC/HDL cholesterol on the risk of CHD mortality in men diverge
significantly from the linearity assumption that underlies a logistic
regression model. The pattern of nonlinearity of the effects of the lipid
measures persists regardless of whether other risk factors are modeled
parametrically or not and is robust with respect to the exclusion of
high-risk individuals. Comparison between the parametric estimates,
restricted by the linearity assumption, and flexible nonparametric
estimates suggests that the conventional parametric model may induce two
types of bias.
First, the parametric model seems to overestimate the effect of the
baseline TC among men whose TC levels exceed 250 mg/dl. For those men,
the increase in the risk of CHD mortality associated with an increase in
TC level, estimated by the unbiased flexible GAM model, is substantially
lower than that predicted by the logistic model, and the discrepancy is
clinically important. This leveling off of the effect of cholesterol,
measured at baseline, on the risk of CHD death may be partly due to the
limitations of the data available. It is possible that a fraction of the
individuals in whom high levels of lipids were found at the baseline has
undergone some cholesterol-lowering interventions in the subsequent
years. In that case, the information about their TC at the baseline may
be less relevant to assess their average risk level over the 12 years of
follow-up, so that the model using the baseline measurement only will
overestimate the risks for those individuals. If for a fraction of men
with high lipid levels an attempt has been made to lower their TC to an
acceptable level, say 220 mg/dl, then the discrepancy between the current
and baseline TC levels will increase with the increasing baseline value.
This could explain why the strength of the TC effect, as represented by
the slope of the nonparametric estimate of the risk function, gradually
decreases as TC increases beyond 220 mg/dl. Consistent with this
conjecture, Benfante et al. (21) found that among the participants in the
Honolulu Heart Program contacted in 1993, after 25 years of follow-up,
there were weak correlations between their recent values of risk factors
and the corresponding values measured 25 years earlier. In 1993, 11
percent of the participants in that study reported taking
cholesterol-lowering medication, and 54 percent were on a special diet.
The prevalence of medication and diet were, respectively, 100 and 30
percent higher among men who developed cardiovascular disease during the
follow-up and had,
on average, a higher baseline cholesterol level than did the subjects in
the noncardiovascular group. To assess the adequacy of the conjecture
that the leveling off of the GAM estimate of the TC effect in the LRC
study is related to the pattern of subsequent changes in TC and, if
necessary, to eliminate that type of bias by taking into account changes
in the cholesterol level in individual patients, the study design should
include repeated measurements of cholesterol. In the absence of repeated
measurements, we carried out a simple experiment by simulating what the
results would be if the follow-up time were shortened. The motivation was
that if the leveling off is due to subsequent changes in the TC levels,
then the nonlinear effect will be less marked with shorter follow-up
time, presumably associated with less important changes. When the
follow-up time was artificially shortened to 6 years, the nonlinearity of
the effects persisted for both TC and TC/HDL cholesterol (data not
shown). This suggests that nonlinearity cannot be accounted for by the
subsequent changes in cholesterol levels of individual participants.
It is also possible that the decrease in the estimated effect of TC over
the upper tail of its distribution is, at least in part, due to
measurement errors. In accordance with the well-known phenomenon of
"regression to the mean," the highest among observed TC values are likely
to be above the "true" value characterizing a given patient over a
relevant period of time. In all of the analyses reported above, we used
the TC values observed at the second visit because several other risk
factors, adjusted for in those analyses, were measured at the second
visit only. To assess the importance of the impact of the presumed
regression to the mean, we correlated TC values measured at the second
visit (used in the analyses above) with the corresponding first visit
values, obtained on average 3 months earlier. The Pearson correlation
coefficient was high (ρ = 0.82), and the mean values of TC at the two
visits were almost identical (difference = 0.3 mg/dl). However, in the
subset of 28 participants with the highest second visit TC values (i.e.,
values above 306 mg/dl, corresponding to the 99th percentile of the
sample distribution), the second visit mean was higher than the first
visit mean by as much as 25.3 mg/dl. This is consistent with the
regression to the mean presumption. To assess to what extent the
nonlinearity of the GAM estimate based on second visit data may be
accounted for by the regression to the mean, we carried out additional
analyses in which we replaced the second visit TC values with the mean of
the two visits. The range of TC values among men in the random sample
decreased from 353 mg/dl for the second visit data to 277 mg/dl for the
mean of the two visits. The leveling off of the GAM 2-df estimate of the
effect of the mean of two TC measurements was slightly less marked than
in the case of second visit data alone. However, the nonlinear effect
remained significant (p = 0.05). This indicates that whereas regression
to the mean may contribute to the estimated decrease in the effect of TC
in the upper tail of the distribution, there remains a significant
departure from linearity that cannot be explained by this phenomenon.
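The regression to the mean mechanism invoked above can be reproduced with
a small simulation. In the sketch below, each individual has a stable
"true" TC value, and the two visits add independent measurement errors.
All parameters (mean, standard deviations, sample size, seed) are
hypothetical; the variance split was chosen only so that the implied
correlation between visits is close to the observed 0.82, and the
simulated sample is larger than ours so that the averages are stable.

```python
import random


def simulate_visits(n, mean_tc, true_sd, error_sd, seed):
    """Return (first, second) measurements sharing one 'true' value per person."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        true_tc = rng.gauss(mean_tc, true_sd)
        first = true_tc + rng.gauss(0.0, error_sd)
        second = true_tc + rng.gauss(0.0, error_sd)
        pairs.append((first, second))
    return pairs


def mean(values):
    return sum(values) / len(values)


# Hypothetical parameters: 36.2**2 / (36.2**2 + 17.0**2) is about 0.82.
pairs = simulate_visits(n=20000, mean_tc=210.0, true_sd=36.2,
                        error_sd=17.0, seed=1)

overall_diff = mean([second - first for first, second in pairs])

# Select the top 1 percent on the SECOND measurement, as in the text.
top = sorted(pairs, key=lambda pair: pair[1], reverse=True)[: len(pairs) // 100]
top_diff = mean([second - first for first, second in top])

print(f"overall mean difference (second - first): {overall_diff:.2f}")
print(f"top 1 percent mean difference:            {top_diff:.2f}")
```

In such a simulation the overall means of the two visits agree closely,
while the individuals selected for extreme second visit values show a
clearly higher second visit mean, even though no individual's true TC
changed between visits.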
In spite of the abundance of epidemiologic data, the detailed
quantification of the impact of serum cholesterol on CHD mortality may
require further studies. Recently, Law et al. (28) compared the results
of a number of large cohort studies, international comparisons, and
randomized clinical trials addressing this issue. Whereas the authors
conclude that the log-linear model adequately represents the overall
effect of serum cholesterol, their comparison indicates some interesting
differences both between and within particular categories of studies, and
the reasons for these discrepancies are not obvious. Specifically, figure
2 of the article by Law et al. indicates that among 10 large cohort
studies, similar in design to the LRC study, a distinct leveling off of
the risks in the upper tail of the TC distribution has been observed in
three studies. In all three studies, the logarithm of risk is a linear
function of TC over the interval below approximately 240 mg/dl, but the
impact of further increases is minor (28). Thus, in these three studies
the pattern of changes in the risks, along the spectrum of TC values, is
quite similar to that estimated in our nonparametric analyses.
The second type of bias induced by conventional parametric models that
rely on the linearity assumption concerns the effects of TC in the range
between 100 and 220 mg/dl. This may appear somewhat unexpected given that
over this range of low-to-moderate values the linearity assumption seems
to be correct. In fact, even potentially very flexible GAM models with
several degrees of freedom yield estimates of the effect of TC on the
logit of a CHD death that are almost exactly linear below 220 mg/dl
(figure 1). In spite of this, the conventional model considerably
underestimates the slope of the risk function over this interval. This
occurs because the single-slope restriction of the parametric model
induces the global dependence of the estimate. The estimated common
("global") slope represents the average of the local slopes that
correspond to the effects of the risk factor specific to different
intervals within the range of values observed in the sample. If these
local effects are quite different, the global slope overestimates the
true effect over some intervals and underestimates it over others. In the
case of TC effects on CHD mortality, the
GAM estimate of the risk function is linear over the lower range but
levels off above 220 mg/dl. As a result, the global slope of the
constrained parametric model may considerably underestimate the true
slope over the range 100-220 mg/dl. The undesirable implications of the
global dependence are illustrated in figure 3: The estimated parametric
effect of TC over the range of values lower than 250 mg/dl changes
substantially depending on whether or not participants with TC values
above 250 mg/dl are included in the analysis. The fact that GAM estimates
are only locally dependent and, therefore, that their dependence on the
data from outside the relevant range is minimal represents, in our
opinion, a major advantage of nonparametric modeling.
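The contrast between global and local dependence can be made concrete
with a deliberately simplified sketch. It fits ordinary least squares
lines to a noise-free, hypothetical log-odds curve that is linear up to
250 mg/dl and flat above, rather than fitting logistic or GAM models to
binary outcomes, and it uses a crude windowed fit as a stand-in for a
locally dependent smoother. The curve, the slope of 0.02, and the window
limits are all assumptions made for illustration.

```python
def ols_slope(xs, ys):
    """Least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def hypothetical_logit(tc):
    """Assumed curve: slope 0.02 per mg/dl up to 250 mg/dl, flat above."""
    return 0.02 * (min(tc, 250) - 100)


def windowed_slope(xs, ys, low, high):
    """Slope fitted only to points with low <= x <= high (local estimate)."""
    inside = [(x, y) for x, y in zip(xs, ys) if low <= x <= high]
    return ols_slope([x for x, _ in inside], [y for _, y in inside])


tc_all = list(range(100, 351, 5))
logit_all = [hypothetical_logit(tc) for tc in tc_all]
tc_low = [tc for tc in tc_all if tc <= 250]
logit_low = [hypothetical_logit(tc) for tc in tc_low]

# Single "global" slope: changes when values above 250 mg/dl are included.
global_all = ols_slope(tc_all, logit_all)
global_low = ols_slope(tc_low, logit_low)

# Local slope in the 180-220 mg/dl window: unaffected by the upper tail.
local_all = windowed_slope(tc_all, logit_all, 180, 220)
local_low = windowed_slope(tc_low, logit_low, 180, 220)

print(f"global slope, all data:       {global_all:.4f}")
print(f"global slope, truncated data: {global_low:.4f}")
print(f"local slope, all data:        {local_all:.4f}")
print(f"local slope, truncated data:  {local_low:.4f}")
```

In this toy setting the global slope estimated from all data falls below
the slope that holds over the lower range, whereas the local estimate is
the same with or without the observations above the cutoff.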
In summary, our results suggest that by imposing restrictive a priori
assumptions conventional parametric regression models may provide biased
estimates of the effects of serum cholesterol on CHD mortality. Our
finding that nonlinear effects are statistically significant, robust, and
clinically important raises the possibility of revising the current
perception of the role of lipids in CHD. Specifically, the actual
effectiveness of some cholesterol-lowering interventions may be
substantially different from that predicted based on parametric modeling
of epidemiologic data. However, given the potential impact of such a
revision on research, health policy, and clinical practice, it is
mandatory to replicate our findings with data collected in a different
long-term follow-up study of a similar population. In addition, it is
important to ensure that the design of such a study permits the
separation of the effects of the baseline risk factor values from the
effects of their changes during the follow-up.
From the methodological perspective, our study may convey a message of
more general interest. We have proposed a framework for nonparametric
modeling of complex disease processes, and our results illustrate
potential benefits of such analyses. We believe that a nonparametric
regression approach may offer valuable insights in other epidemiologic
studies addressing research questions whose importance makes it mandatory
to maximize the accuracy of the estimates.
ACKNOWLEDGMENTS
Supported by the Dairy Farmers of Canada.
Dr. Michal Abrahamowicz and Dr. Steven A. Grover are Senior Research
Scholars of the Fonds de Recherche sur la Santé du Québec.
The authors thank Dr. Paul Fortin for his helpful comments and Diane
Telmosse for her assistance.
REFERENCES
1. LaRosa JC, Hunninghake D, Bush D, et al. The cholesterol
facts. A summary of the evidence relating dietary fats, serum
cholesterol, and coronary heart disease. A joint statement by
the American Heart Association and the National Heart, Lung,
and Blood Institute. The Task Force on Cholesterol Issues,
American Heart Association. Circulation 1990;81:1721-33.
2. Canadian Consensus Conference on Cholesterol: final report.
The Canadian Consensus Conference on the Prevention of
Heart and Vascular Disease by Altering Serum Cholesterol
and Lipoprotein Risk Factors. Can Med Assoc J 1988;139
(Suppl.):1-8.
3. Dietary guidelines for healthy American adults. A statement
for physicians and health professionals by the Nutrition
Committee, American Heart Association. Circulation 1988;77;
4. Summary of the second report of the National Cholesterol
Education Program (NCEP) Expert Panel on Detection,
Evaluation, and Treatment of High Blood Cholesterol in Adults
(Adult Treatment Panel II). JAMA 1993;269:3015-23.
5. Grover SA, Abrahamowicz M, Joseph L, et al. The benefits of
treating hyperlipidemia to prevent coronary heart disease:
estimating changes in life expectancy and morbidity. JAMA
1992;267:816-22.
6. Tsevat J, Weinstein MC, Williams LW, et al. Expected gains
in life expectancy from various coronary heart disease risk
factor modifications. Circulation 1991;83:1194-201.
7. Weinstein MC, Stason WB. Cost-effectiveness of
interventions to prevent or treat coronary heart disease. Annu Rev
Public Health 1985;6:41-63.
8. Hamilton VH, Racicot FE, Zowall H, et al. The
cost-effectiveness of HMG-CoA reductase inhibitors to prevent
coronary heart disease. Estimating the benefits of increasing
HDL-C. JAMA 1995;273:1032-8.
9. Cupples LA, D'Agostino RB. Some risk factors related to the
annual incidence of cardiovascular disease and death using
pooled repeated biennial measurements: Framingham Heart
Study, 30-year follow-up. In: Kannel WB, Wolf PA, Garrison
RJ, eds. The Framingham Study: an epidemiological
investigation of cardiovascular disease. Section 34. Washington, DC:
US Department of Health and Human Services, 1987. (NIH
publication no. 87-2703).
10. Martin MI, Hulley SB, Browner WS, et al. Serum cholesterol,
blood pressure, and mortality: implications from a cohort of
361,662 men. Lancet 1986;2:9336.
11. Silberberg JS. Estimating the benefits of cholesterol lowering:
are risk factors for coronary heart disease multiplicative?
J Clin Epidemiol 1990;43:8759.
12. Ramsay JO, Abrahamowicz M. Binomial regression with
monotone splines: a psychometric application. J Am Stat
Assoc 1989;84:90615.
13. Durrleman S, Simon R. Flexible regression models with cubic
splines. Stat Med 1989;8:55161.
14. Hastie TJ, Tibshirani RJ. Generalized additive models: some
applications. J Am Stat Assoc 1987;82:37186.
15. Hastie TJ, Tibshirani RJ. Generalized additive models. New
York, NY: Chapman and Hall, 1990.
16. Central Patient Registry and Coordinating Centre for the Lipid
Research Clinics. Reference manual for the Lipid Research
Clinics Prevalence Study. Vol. 1 and 2. Chapel Hill, NC:
University of North Carolina, 1974.
17. The Lipid Research Clinics Program Epidemiology Commit
tee. Plasma lipid distributions in selected North American
populations: the Lipid Research Clinics Program Prevalence
Study. Circulation 1979;60:42739.
18. Heiss G, Tamir I, Davis CE, et al. Lipoproteincholesterol
distributions in selected North American populations: the
Lipid Research Clinics Program Prevalence Study. Circulation
1980;61:30215.
19. Lipid Research Clinics Program. Manual of laboratory oper
ations. Vol. 1. Lipid and lipoprotein analysis. Bethesda, MD:
Am J Epidemiol Vol. 145, No. 8, 1997
by guest on July 14, 2011
aje.oxfordjournals.org
Downloaded from
Page 16
Cholesterol in Coronary Disease: Flexible Modeling 729
National Heart, Lung, and Blood Institute, National Institutes
of Health, 1974. (NIH publication no. 75628).
20. Jacobs DR Jr, Mebane IL, Bangdiwala SL et al. High density
lipoprotein cholesterol as a predictor of cardiovascular disease
mortality in men and women: the followup study of the Lipid
Research Clinics Prevalence Study. Am J Epidemiol 1990;
131:3247.
21. Benfante R, Hwang LJ, Masaki K, et al. To what extent do
cardiovascular risk factor values measured in elderly men
represent their midlife values measured 25 years earlier? A
preliminary report and commentary from the Honolulu Heart
Program. Am J Epidemiol 1994; 140:20616.
22. Abrahamowicz M, Ciampi A. Information theoretic criteria in
nonparametric density estimation. Bias and variance in the
infinite dimensional case. Comput Stat Data Anal 1991;12:
23947.
23. Kooperberg C, Stone CJ, Truong YK. Hazard regression.
J Am Stat Assoc 1995;90:7894.
24. Abrahamowicz M, MacKenzie T, Esdaile JM. Timedepen
dent hazard ratio: modeling and hypothesis testing with appli
cation in lupus nephritis. J Am Stat Assoc 1996;91:14329.
25. Akaike H. A new look at the statistical identification model.
IEEE Trans Auto Control 1974;19:71623.
26. Schwarz G. Estimating the dimension of a model. Ann Stat
1978;6:4614.
27. Abrahamowicz M, Ciampi A. Optimal fit in nonparametric
modelling via computationally intensive inference. In: Momi
rovic K, Mildner V, eds. COMPSTAT 1990 proceedings in
computational statistics. Heidelberg, Germany: Physica
Verlag, 1990:30914.
28. Law MR, Wald NJ, Thompson SG. By how much and how
quickly does reduction in serum cholesterol concentration
lower risk of ischaemic heart disease? BMJ 1994;308:
36772.
Am J Epidemiol Vol. 145, No. 8, 1997
by guest on July 14, 2011
aje.oxfordjournals.org
Downloaded from