Comparison of 19 pre-operative risk stratification
models in open-heart surgery
Johan Nilsson1*, Lars Algotsson2, Peter Höglund3, Carsten Lührs1, and Johan Brandt1
1Department of Cardiothoracic Surgery, Heart and Lung Centre, Lund University Hospital, SE 221 85 Lund, Sweden;
2Department of Cardiothoracic Anesthesiology, Heart and Lung Centre, Lund University Hospital, Lund, Sweden; and
3Competence Centre for Clinical Research, Lund University Hospital, Lund, Sweden
Received 23 August 2005; revised 2 November 2005; accepted 16 December 2005; online publish-ahead-of-print 18 January 2006
See page 768 for the editorial comment on this article (doi:10.1093/eurheartj/ehi792)
Aims To compare 19 risk score algorithms with regard to their validity to predict 30-day and 1-year
mortality after cardiac surgery.
Methods and results Risk factors for patients undergoing heart surgery between 1996 and 2001 at a
single centre were prospectively collected. Receiver operating characteristics (ROC) curves were
used to describe the performance and accuracy. Survival at 1 year and cause of death were obtained
in all cases. The study included 6222 cardiac surgical procedures. Actual mortality was 2.9% at 30
days and 6.1% at 1 year. Discriminatory power for 30-day and 1-year mortality in cardiac surgery was
highest for logistic (0.84 and 0.77) and additive (0.84 and 0.77) European System for Cardiac Operative
Risk Evaluation (EuroSCORE) algorithms, followed by Cleveland Clinic (0.82 and 0.76) and Magovern
(0.82 and 0.76) scoring systems. None of the other 15 risk algorithms had a significantly better discrimi-
natory power than these four. In coronary artery bypass grafting (CABG)-only surgery, EuroSCORE fol-
lowed by New York State (NYS) and Cleveland Clinic risk score showed the highest discriminatory
power for 30-day and 1-year mortality.
Conclusion EuroSCORE, Cleveland Clinic, and Magovern risk algorithms showed superior performance
and accuracy in open-heart surgery, and EuroSCORE, NYS, and Cleveland Clinic in CABG-only surgery.
Although the models were originally designed to predict early mortality, the 1-year mortality prediction
was also reasonably accurate.
Despite technological advancements, open-heart operations
still carry a risk of mortality and morbidity. To aid in the
selection of patients for cardiac surgery, several risk-
scoring systems have been developed during the last
decades. These aim to estimate the risk of peri-operative
death, based on the occurrence of different risk factors.
Operative mortality is also increasingly used as an indicator
of the quality of cardiac surgery.1
To make an accurate comparison between different
institutions or surgeons, mortality data must be adjusted
to the risk profiles of the patients.2,3 Differences between
the available risk algorithms regarding score design and the
patient population on which the score development was
based could influence their accuracy and performance.
Ideally, a risk model should be useful for outcome prediction
at different surgical centres, both at the institutional level
and for individual patients.4 Operative mortality is the
outcome variable most commonly used as a quality indi-
cator, but long-term mortality may be more relevant from
a patient perspective.
A few comparative studies of different risk algorithms
exist.4–8 However, the relative performance of the
risk-scoring systems currently used remains unclear. The
purpose of this study was to compare 19 open-source risk
score algorithms with regard to their validity to predict
30-day and 1-year mortality after cardiac surgery in a
large single-institution patient population.
Study design and patients
The study was approved by the Ethics Committee of the Medical
Faculty at Lund University. Risk factors for all adult patients under-
going heart surgery at the University Hospital of Lund between
January 1996 and February 2001 were prospectively collected
at the Department of Cardiothoracic Surgery. The patient record form contained a total
of 248 variables (80 pre-, 106 intra-, and 62 post-operative) based
on the Society of Thoracic Surgeons (STS)9 patient record form.
The data were stored in a local adult cardiac surgery database.
Data collection and risk-score calculation
From the total of 248 variables, those corresponding to the risk
factors in the different risk models were selected. Thus, a subset
& The European Society of Cardiology 2006. All rights reserved. For Permissions, please e-mail: email@example.com
*Corresponding author. Fax: +46 46 15 86 35.
E-mail address: firstname.lastname@example.org
European Heart Journal (2006) 27, 867–874
of 104 of the pre- and intra-operative variables were imported into
the statistical software package, together with 30-day and 1-year
mortality for the population. Missing values were replaced using
the probability imputation technique10 before the risk score was
calculated. The probability imputation technique substitutes con-
ditional probabilities for missing covariate values when the covariate
is qualitative. The risk score for each algorithm was calculated for
every patient according to the published definitions (Table 1).
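The idea of substituting a probability for a missing qualitative covariate can be sketched as follows. This is a minimal illustration only: the stratifying variable, the field names, and the toy records are assumptions made for the example, and the procedure actually applied in the study is the one described in reference 10.

```python
from collections import defaultdict


def probability_impute(records, factor, stratum):
    """Return copies of `records` in which a missing (None) binary `factor`
    is replaced by the observed proportion of factor == 1 among records
    that share the same value of `stratum`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for rec in records:
        if rec[factor] is not None:
            totals[rec[stratum]] += 1
            positives[rec[stratum]] += rec[factor]

    completed = []
    for rec in records:
        new = dict(rec)  # leave the input records untouched
        if new[factor] is None:
            key = new[stratum]
            if totals[key] == 0:
                raise ValueError(f"no observed values in stratum {key!r}")
            new[factor] = positives[key] / totals[key]
        completed.append(new)
    return completed


# Hypothetical toy data: diabetes status is missing for two patients.
patients = [
    {"sex": "M", "diabetes": 1},
    {"sex": "M", "diabetes": 0},
    {"sex": "M", "diabetes": 0},
    {"sex": "M", "diabetes": None},
    {"sex": "F", "diabetes": 1},
    {"sex": "F", "diabetes": None},
]
completed = probability_impute(patients, factor="diabetes", stratum="sex")
```

A risk score that multiplies a weight by a 0/1 indicator then receives a fractional contribution for the imputed patients instead of being dropped from the analysis.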
The vital status at 1 year after the operation was obtained for all
patients from the Population and Welfare Statistics Sweden,
Statistiska Centralbyrån, Stockholm, Sweden, as were the date and
cause of death.
Means (±SD) were used to describe the continuous variables, and
frequencies were calculated for categorical variables. Score-predicted
operative mortality (death within 30 days of operation)
was calculated using the mean score from the different risk
models, except for the Northern New England algorithm where
the published score-mortality table11 was used. Receiver operating
characteristics (ROC) curves were used to describe the performance
and predictive accuracy for the different algorithms.12 The
discriminatory power, i.e. the c-index, was evaluated by calculating the
areas under ROC curves.13 The areas under curves are presented
with 95% confidence limits. An area of 1.0 under the ROC curve indi-
cates perfect discrimination, whereas an area of 0.50 indicates
complete absence of discrimination. Any intermediate value is a
quantitative measure of the ability of the risk predictor model to
distinguish between survivors and non-survivors.
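The area under the ROC curve can be read as the probability that a randomly chosen non-survivor received a higher risk score than a randomly chosen survivor, with ties counted as one half. The sketch below computes the c-index directly from that definition; the function name and the toy data are illustrative assumptions and do not reproduce the Stata routines used in the study.

```python
def roc_area(scores, died):
    """c-index: fraction of (non-survivor, survivor) pairs in which the
    non-survivor has the higher score; tied pairs count as 0.5."""
    dead = [s for s, d in zip(scores, died) if d]
    alive = [s for s, d in zip(scores, died) if not d]
    if not dead or not alive:
        raise ValueError("need at least one death and one survivor")

    concordant = 0.0
    for x in dead:
        for y in alive:
            if x > y:
                concordant += 1.0
            elif x == y:
                concordant += 0.5
    return concordant / (len(dead) * len(alive))


# Hypothetical toy data: six patients, three deaths.
toy_scores = [1, 2, 3, 4, 5, 6]
toy_died = [0, 0, 1, 0, 1, 1]
toy_area = roc_area(toy_scores, toy_died)
```

A score that ranks every non-survivor above every survivor gives an area of 1.0, and a score that is identical for all patients gives 0.5.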
To compare the areas under the resulting ROC curves (used as an
index for the predicted value), the non-parametric approach
described by DeLong et al.14 was used. The ROC area for each risk
algorithm was systematically compared with the ROC area of the
other 18 algorithms. The number of algorithms with a significantly
larger or smaller ROC area was then computed. The probability
significance level was adjusted for the effect of multiple compari-
sons using Sidak’s method.
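The pairwise comparison can be sketched as below: a plain-Python version of the non-parametric test for two correlated ROC areas measured on the same patients, together with the Sidak-adjusted per-comparison significance level. The function names and the toy data are assumptions for illustration, and the use of 18 comparisons per algorithm is our reading of the text; the study itself performed these steps in Stata.

```python
import math


def _placements(dead, alive):
    """Per-patient components of the ROC area (DeLong's V10 and V01)."""
    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0
    v10 = [sum(psi(x, y) for y in alive) / len(alive) for x in dead]
    v01 = [sum(psi(x, y) for x in dead) / len(dead) for y in alive]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, died):
    """Compare the ROC areas of two scores on the same patients.
    Returns (area_a, area_b, z, two_sided_p)."""
    dead_idx = [i for i, d in enumerate(died) if d]
    alive_idx = [i for i, d in enumerate(died) if not d]
    m, n = len(dead_idx), len(alive_idx)
    if m < 2 or n < 2:
        raise ValueError("need at least two deaths and two survivors")

    v10a, v01a = _placements([scores_a[i] for i in dead_idx],
                             [scores_a[i] for i in alive_idx])
    v10b, v01b = _placements([scores_b[i] for i in dead_idx],
                             [scores_b[i] for i in alive_idx])
    area_a, area_b = sum(v10a) / m, sum(v10b) / m

    # Variance of the difference between the two correlated areas.
    var = ((_cov(v10a, v10a) + _cov(v10b, v10b) - 2 * _cov(v10a, v10b)) / m
           + (_cov(v01a, v01a) + _cov(v01b, v01b) - 2 * _cov(v01a, v01b)) / n)
    if var <= 0:
        raise ValueError("zero variance: the two scores rank patients identically")

    z = (area_a - area_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return area_a, area_b, z, p


def sidak_alpha(alpha, comparisons):
    """Per-comparison level that keeps the family-wise level at `alpha`."""
    return 1 - (1 - alpha) ** (1 / comparisons)


# Hypothetical toy data: eight patients, two competing scores.
toy_died = [0, 0, 0, 0, 1, 1, 1, 1]
toy_a = [1, 2, 3, 5, 4, 6, 7, 8]
toy_b = [2, 1, 6, 3, 4, 5, 8, 7]
area_a, area_b, z, p = delong_test(toy_a, toy_b, toy_died)
threshold = sidak_alpha(0.05, 18)  # each algorithm is compared with 18 others
```

Under this sketch, a difference between two algorithms would be called significant when `p` falls below `threshold` rather than below the unadjusted 0.05.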
Graphs and statistical analyses were performed using the
Intercooled Stata version 9.0 (2005) statistical package (StataCorp
LP, College Station, TX, USA) and GraphPad Prism 4b, 2004 for Mac
OS X, GraphPad Software, Inc., USA.
Between January 1996 and February 2001, 6499 consecutive
heart operations were performed on 6414 patients. During
the period January–March 1998, database service and
upgrade resulted in missing values in 30% of the data
points. All operations (n = 277) from this period were
excluded from the study. Thus, 6153 patients, undergoing
6222 operations, were included in the analysis. In 2% of
the total data points, missing values were replaced using
the probability imputation technique.10 There was accurate
documentation of data including mortality and cause of
death in all cases, and no patient was lost to follow-up.
The average age was 66.3 ± 10.6 years (range 18–95).
The majority of patients were men (72%). A coronary
artery bypass grafting (CABG)-only operation was performed
in 4351 cases (70%), 1340 (22%) cases had a valve procedure
with or without CABG surgery, and 531 (8%) were miscel-
laneous procedures, e.g. post-infarction septal rupture
(37 cases), aortic aneurysm or dissection (209 cases), and
surgery had been performed in 457 cases (7.3%). Seventy-
eight patients (1.3%) were in cardiogenic shock at the
start of the operation and 628 (10%) were operated within
24 h after acceptance for surgery (emergency surgery).
The actual 30-day mortality was 2.9% (n = 180) and the
1-year mortality was 6.1% (n = 377).
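The reported proportions follow directly from the counts, and a confidence interval for the observed mortality can be attached to them. The sketch below uses the Wilson score interval, which is an assumption on our part: the text does not state which interval method was used for the observed mortality.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


OPERATIONS = 6222  # 6499 consecutive operations minus 277 excluded

# Procedure groups as reported in the text.
procedure_counts = {"CABG-only": 4351, "valve with or without CABG": 1340,
                    "miscellaneous": 531}
total_procedures = sum(procedure_counts.values())

for label, deaths in (("30-day", 180), ("1-year", 377)):
    low, high = wilson_interval(deaths, OPERATIONS)
    print(f"{label} mortality: {100 * deaths / OPERATIONS:.1f}% "
          f"(95% CI {100 * low:.1f}-{100 * high:.1f}%)")
```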
[Table caption] Synopsis of original data of 19 risk score algorithms
[Table body not recoverable from the source. Column headings: Region; Year of data; Number of patients. Surviving entries: UK national score (footnote a, reference 5); 13 302 (128); 13 302 (128); 18 814 (33); 12 712 (43).]
Add, additive; log, logistic; mod, modified; NNE, Northern New England; N/A, not available. Cleveland Clinic risk score algorithm is also known as Higgins
score, NNE as American College of Cardiology/American Heart Association (ACC/AHA) score, and Ontario as Provincial Adult Cardiac Care Network (PACCN) score.
a Algorithms developed for CABG-only surgery.
Performance and predictive accuracy for the
The discriminatory power (i.e. the area under the ROC
curve) for 30-day mortality and 1-year mortality was
highest for the logistic (0.84 and 0.77) and additive (0.84
and 0.77) European System for Cardiac Operative Risk
Evaluation (EuroSCORE) algorithms, followed by the
Cleveland Clinic (0.82 and 0.76) and the Magovern (0.82
and 0.76) scoring systems (Figures 1 and 2). None of the
other risk algorithms had a significantly better discrimina-
tory power (larger ROC area) than these four (Figure 3). In
the subanalysis with CABG-only patients, the discriminatory
power of the two EuroSCORE algorithms was highest, followed
by the New York State (NYS) and Cleveland Clinic
risk algorithms (Table 2).
The mortality predictions of the different scoring systems
are shown in Figure 4.
The most common cause of death within 30 days was cardiovascular
disease (n = 163, 91%), followed by cerebrovascular
disease (n = 3, 1.7%), malignant neoplasm (n = 3, 1.7%), and
chronic lower respiratory disease (n = 2, 1.1%). Cardiovascular
disease was also the most common cause of death
within 1 year (n = 280, 74%), followed by malignant
neoplasm (n = 22, 5.8%), cerebrovascular disease (n = 16,
4.2%), chronic lower respiratory disease (n = 10, 2.7%), and
septicaemia (n = 10, 2.7%). For each risk algorithm, the
ROC areas for cardiovascular-related (n = 163) and total
30-day mortality (n = 180) were almost identical (difference
0.005 or less). The discriminatory power for cardiovascular-related
1-year mortality (n = 280) increased by approximately
0.03 for all 19 algorithms compared with the discriminatory
power for total 1-year mortality (n = 377) (logistic EuroSCORE
0.80, additive EuroSCORE 0.80, Cleveland Clinic
0.79, and Magovern 0.78). However, it did not change their
relative order of discriminatory power.

[Figure caption] The ROC curves. The sensitivity of prediction of 30-day mortality vs. 1-specificity for the 19 risk algorithms is plotted. The solid line represents the absence of discrimination. Open-heart surgery (n = 6222).

[Figure caption] The ROC area (diamonds) with 95% confidence intervals (horizontal bars) for 30-day mortality and 1-year mortality. (A) 30-day mortality and (B) 1-year mortality. Open-heart surgery (n = 6222). See Table 1 for abbreviations.

[Figure caption] Comparison of the ROC area for different risk algorithms. For each risk scoring system (left y-axis), the number of risk algorithms with a significantly (P < 0.05) larger (black bar) or smaller (grey bar) ROC area is shown. (A) 30-day mortality and (B) 1-year mortality. Open-heart surgery (n = 6222). See Table 1 for abbreviations.

The purpose of this study was to compare 19 commonly used
cardiac surgical risk scores with regard to their validity in a
large single-institute patient population. The results show
that four of the algorithms had a superior performance
and accuracy to predict 30-day and 1-year mortality,
expressed as discriminatory power, compared with the
other 15 algorithms. Despite the fact that all of the algor-
ithms were designed to predict early mortality, they also
predict 1-year mortality well, especially when the cause of
death was cardiovascular disease.
Most algorithms overestimated the 30-day mortality in
this patient population. The same finding has been reported
in other studies.4,6 Rather than reflecting weaknesses in the
risk score algorithm, these findings are probably explained
by differences in patient mix and temporal periods compared
to the original databases used for development of
the algorithms.6 Prediction of mortality rate in the CABG-only
subgroup was almost perfect using the Northern New
England and NYS algorithms, which are both for use in
CABG surgery and newly developed.
The potential of ROC curves in medical diagnostic testing
was recognized as early as 1960.15 Even if comparison of ROC
curves in a statistically valid fashion to evaluate models
remains controversial, the ROC curve is currently the best
developed statistical tool for describing performance.12
The EuroSCORE model, which had the highest discriminatory
power, has been shown to work well to predict 30-day mortality
in many European countries16 and in the United
States.17 It compared favourably with the STS risk stratification
algorithm7 (which is not open source and was therefore
not included in the present analysis). Recently, it was
demonstrated that EuroSCORE could predict intensive care
unit stay and costs of open-heart surgery.18 The Cleveland
Clinic model has also shown high discrimination to predict
early mortality.8 An important finding in the present study
is that these algorithms could be used also to predict long-
term mortality (1 year), especially for cardiovascular
Earlier studies have compared the performance of different
risk algorithms to predict 30-day mortality,4,6,8 but
have not shown significant differences in performance and
accuracy. This may be explained by smaller patient

[Table caption] ROC area for the five risk algorithms with best performance and accuracy in CABG-only surgery (n = 4351). Column headings: ROC area (95% CI); ROC area (95% CI). Table body not recoverable from the source. Cleveland Clinic risk score algorithm is also known as Higgins score.

[Figure caption] Observed 30-day mortality with 95% confidence intervals (vertical lines) in comparison to score-predicted 30-day mortality (diamonds) with 95% confidence intervals (horizontal bars). (A) All open-heart surgery and (B) CABG-only surgery. Asterisk denotes the predicted mortality calculated from ACC/AHA score mortality table11 specified for CABG-only surgery. See Table 1 for abbreviations.

The predictive accuracy of different risk scoring systems
may be influenced by numerous factors, such as differences
in variable definitions, management of incomplete data
fields, surgical procedure selection criteria, and geographical
differences in patient risk factors. The prevalence of
risk factors in patients referred for heart surgery may also
change over time. Difficulties thus arise when comparisons
of the accuracy and predictive power of large databases
are attempted. However, ROC analysis is a robust technique
for such comparisons. Importantly, the shapes of the ROC
curves were similar among the compared risk models
(Figure 2), making direct comparison possible.12 Murphy-Filkins
et al.19 showed that an increase up to five times of
a low-frequency variable (for example, due to difference
in a variable definition) did not appreciably change the
All surgical procedures were included in the study, irre-
Thus, a patient could participate two or more times in
the analysis. This could be debated, as a dependence of the
data that arises from multiple procedures performed
within a patient may occur. An alternative would be to
include only the first procedure for each patient. A subanaly-
sis using this approach (n ¼ 6153) showed only very small
differences in the ROC area for the different risk algorithms
(on average 0.001). A drawback of excluding patients having
a second procedure during the study period is that some
high-risk cases will be eliminated from the analysis.
Regardless of which method was used, the differences caused by
this dependence were negligible, most likely due to the
small number of patients (1%) who had more than one
The probability imputation technique, used in this study,
has been shown to work well in prognostic factor
studies.20 Another strategy to handle incomplete data is to
exclude the patients with missing values from analysis, but
because missing values are more likely in emergent high-risk
patients, this could result in bias.
Geographical differences in the occurrence of patient risk
factors may have influenced the design of different risk-scoring
systems, but do not seem to influence the present
results. The best-performing risk scores in this study were
developed in two different geographical areas: Europe and
Eight of the included risk algorithms (Cabdeal, NYS,
Northern New England, Magovern,
(modified), UK national score, and Veterans Affairs) were
originally designed to predict early mortality in CABG-only
patients, which also could affect the predictive accuracy.
A subanalysis of CABG-only patients in this material identified
the same two risk-scoring systems with the largest
ROC areas (EuroSCORE additive and logistic), followed by
the NYS and the Cleveland Clinic risk-scoring systems.
The smaller ROC area for the 1-year than for the 30-day
mortality prediction was expected. Risk models originally
designed to predict 30-day mortality will mainly predict
cardiovascular death, which was the most common cause
of early post-operative mortality (91%). At 1 year, the
causes of death will be more diverse and the proportion of
cardiovascular-related death will decrease (74%).
The strength of the present study is that the algorithms
could be compared using a relatively large patient material,
where the patient data were collected on a regular basis in
the daily clinical work. The data were pre-operatively
entered into the database, generally by residents, and not
by the surgeon performing the operation.
During the last decades, several different risk score algor-
ithms for cardiac surgery have been published, but it still
remains difficult to risk stratify individual patients.4,8 One
method to improve risk algorithm development could be to
include more patients with higher risk scores as suggested
by Wyse and Taylor.21
However, we found that the
Cleveland Clinic score, which was developed on 5051
patients, performed almost as well as the EuroSCORE, devel-
oped on 13 302 patients.
Most risk algorithms are based on logistic regression analy-
sis with a priori assumptions of linear relationships. Another
method to improve risk prediction could be to use a more
complex risk model, such as the artificial neural network,
which has the advantage of the capacity to model
complex, non-linear relationships and is relatively robust
and tolerant of missing data.22 There are only a few
studies done in this area, which merits further investigation.
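The difference between an additive score and a logistic model can be illustrated with a short sketch. All risk factors, weights, and coefficients below are hypothetical values chosen for the example; they are not taken from EuroSCORE or from any other published algorithm.

```python
import math

# Hypothetical values for illustration only.
INTERCEPT = -5.0
COEFFICIENTS = {
    "age_over_70": 0.7,
    "emergency": 1.1,
    "previous_cardiac_surgery": 1.0,
}
ADDITIVE_POINTS = {
    "age_over_70": 2,
    "emergency": 3,
    "previous_cardiac_surgery": 3,
}


def logistic_risk(patient):
    """Predicted mortality from a logistic model: the risk factors enter
    linearly on the log-odds scale, then are mapped to a probability."""
    log_odds = INTERCEPT + sum(
        beta * patient.get(name, 0) for name, beta in COEFFICIENTS.items()
    )
    return 1 / (1 + math.exp(-log_odds))


def additive_score(patient):
    """Additive score: integer points are summed for each risk factor present."""
    return sum(
        points * patient.get(name, 0) for name, points in ADDITIVE_POINTS.items()
    )


# Hypothetical patient: older than 70 and operated on as an emergency.
example = {"age_over_70": 1, "emergency": 1}
example_risk = logistic_risk(example)
example_points = additive_score(example)
```

Both forms assume that each risk factor contributes independently and linearly, on the log-odds scale or on the points scale, which is the a priori assumption that a non-linear model such as a neural network would relax.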
Even if a perfect risk prediction algorithm in cardiac
surgery is never achieved, identification of the best-
performing risk algorithms is important. Pre-operative risk
stratification may aid in the selection between cardiac
available, facilitate the planning of hospital resource utili-
zation, and enable accurate comparison between different
institutions or surgeons.
Conflict of interest: none declared.
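[Table caption] Pre-operative general risk factors in 6222 open-heart operations
[Table body not recoverable from the source. Surviving fragments: column heading "or n (%)"; algorithm labels NYS, Northern New England, Parsonnet, Parsonnet (modified), Pons, UK national score; row fragment "(sys >140 mmHg)".]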