Predicting the Outcome of Patients With Subarachnoid Hemorrhage Using Machine Learning Techniques
Paula de Toledo, Pablo M. Rios, Agapito Ledezma, Araceli Sanchis, Jose F. Alen, and Alfonso Lagares
Abstract—Background: Outcome prediction for subarachnoid hemorrhage (SAH) helps guide care and compare global management strategies. Logistic regression models for outcome prediction may be cumbersome to apply in clinical practice. Objective: To use machine learning techniques to build a model of outcome prediction that makes the knowledge discovered from the data explicit and communicable to domain experts. Material and methods: A derivation cohort (n = 441) of nonselected SAH cases was analyzed using different classification algorithms to generate decision trees and decision rules. Algorithms used were C4.5, fast decision tree learner, partial decision trees, repeated incremental pruning to produce error reduction, nearest neighbor with generalization, and ripple down rule learner. Outcome was dichotomized into favorable [Glasgow outcome scale (GOS) = I–II] and poor (GOS = III–V). An independent cohort (n = 193) was used for validation. An exploratory questionnaire was given to potential users (specialist doctors) to gather their opinion on the classifier and its usability in clinical routine. Results: The best classifier was obtained with the C4.5 algorithm. It uses only two attributes [World Federation of Neurological Surgeons (WFNS) and Fisher's scale] and leads to a simple decision tree. The accuracy of the classifier [area under the ROC curve (AUC) = 0.84; confidence interval (CI) = 0.80–0.88] is similar to that obtained by a logistic regression model (AUC = 0.86; CI = 0.83–0.89) derived from the same data, and is considered better fit for clinical use.

Index Terms—Data mining, knowledge discovery in databases, machine learning, prognosis, subarachnoid hemorrhage.
I. INTRODUCTION

Subarachnoid hemorrhage (SAH) is a type of hemorrhagic stroke characterized by the presence of blood in the subarachnoid space, which is occupied by the arteries feeding the brain. The most common cause of SAH is the rupture of a cerebral aneurysm, an abnormal and fragile dilatation of a cerebral artery. Its annual incidence is 10–15 cases per 100 000 inhabitants, and nearly 50% of patients suffering it will have a poor outcome. Brain damage related to this form of stroke is due to a decrease in cerebral blood perfusion leading to cerebral ischemia. Diagnosis is made with a cranial computerized tomography (CT) scan, which shows the extent of the bleeding. Cerebral angiography confirms the presence of an aneurysm in the majority of the patients, although in nearly 20% of the cases the cause is unknown. The aneurysm, if found, should be treated as early as possible, since rebleeding carries a mortality over 50%. Treatment can be performed by endovascular means or surgically, to exclude the aneurysm while preserving the normal cerebral circulation.

This work was supported in part by the Spanish Ministries of Science under Grant TRA2007-67374-C02-02 and Health under Grant FIS PI 070152. The work of A. Lagares and J. F. Alen was supported by the Fundacion Mutua.
P. de Toledo, P. M. Rios, A. Ledezma, and A. Sanchis are with the Control, Learning, and Systems Optimization Group, Universidad Carlos III de Madrid, Madrid 28040, Spain (e-mail: email@example.com).
J. F. Alen and A. Lagares are with the Department of Neurosurgery, Hospital Doce de Octubre, Madrid 28041, Spain.
As in other acute neurological diseases, determining prog-
nosis after SAH is crucial for giving adequate information to
patient’s relatives, guide treatment options, detect subgroups
of patients that could benefit from certain treatments, and com-
pare treatments or global management strategies. Prognostic
information coming just from surgically or endovascularly
treated cases would not be applicable to all patients suffering
this condition, as many patients die before being treated.
Therefore, any model valid for assessing prognosis at diagnosis
in this disease should be obtained from a nonselected series of
patients. Prognostic factors are mainly level of consciousness at admission, quantity of bleeding in the initial CT scan, age, and size of the aneurysm. Level of consciousness is graded with different scales, the World Federation of Neurological Surgeons (WFNS) scale being the most frequently used. It divides patients into five grades according to the severity of consciousness disturbance. Its reliability and interobserver agreement are good, as it is derived from the Glasgow coma scale (GCS), which is a universal scale for consciousness assessment. The amount of bleeding
in the initial CT has been evaluated with different scales,
some assessing the amount of cisternal blood in a qualitative
way (Fisher's scale) and others using a semiquantitative algorithm. The evaluation of the prognostic information
given by these different scales has been done mainly with con-
ventional statistics. Prognostic models have been built mainly
for dichotomized six-month outcome using logistic regression
analysis. Some scales have been built combining factors
coming from these models, including age, Fisher’s scale, and
WFNS. Their accuracy has been tested using the area under the receiver operating characteristic (ROC) curve (AUC), achieving less than 90% accurate prognosis. The results
are difficult to interpret in the clinical setting as they consist
of different combinations of prognostic factors derived from
several scales, combined by scores or coefficients derived from
the regression equation. There is a need for simple, universal,
interpretable, and reliable prognostic tools for SAH patients.
A. Data Mining in Prognosis
Prognosis prediction, as well as the prediction of potential disease onset in healthy patients, is an active area of research in medicine. Prognostic models are primarily used to select appropriate treatments and tests, not only in individual patient management, but also in generating global predictive scenarios, determining study eligibility of patients for new treatments, defining inclusion criteria for clinical trials to control for variation in prognosis, and in cost reimbursement programs.
Statistical techniques such as univariate and multivariate logistic regression have traditionally been used to build such models; more recently, advances in data mining algorithms have led to the adoption of machine learning techniques for prediction problems in clinical medicine. A relevant summary of current approaches can be found in a recent review paper by Bellazzi and Zupan, where different techniques are presented and compared.
The question of whether artificial neural networks (ANNs) or other machine learning techniques can outperform statistical modeling techniques in prediction problems in clinical medicine does not have a simple answer. There are plenty of research works comparing techniques from the two domains, showing that no methodology outperforms the others in all possible scenarios, and that the tools need to be carefully selected depending on the problem faced and the relevant quality criteria. In some cases, machine learning techniques have been shown to match logistic regression in accuracy while outperforming it in calibration. Other authors highlight the ease of use and automation of techniques such as ANNs, while stating that logistic regression is still the gold standard. Furthermore, statistical and machine learning techniques are not necessarily competing strategies, but can also be used together to perform a prediction task.
The most widely used predictive data mining methods, according to a poll conducted in 2006 among researchers in the field, are:
1) those based on decision trees, such as ID3 and C4.5;
2) those based on decision rules;
3) statistical methods, mainly logistic regression; and
4) ANNs, followed by support vector machines, naive Bayesian classifiers, Bayesian networks, and nearest neighbor methods.
Less used methods are ensemble methods (boosting, bagging) and genetic algorithms. In the field of prognosis in clinical medicine the results differ, as logistic regression is still the most widely applied, followed by ANNs. The use of decision trees is growing in recent times. Other methods, such as genetic algorithms, are still scarce but promising. A growing trend is combining different machine learning techniques to achieve improved results.
When comparing different classifiers, the key issues to address are:
1) predictive accuracy;
2) interpretability of the classification models by the domain expert;
3) handling of missing data and noise;
4) ability to work with different types of attributes (categorical, ordinal, continuous);
5) reduction of the number of attributes needed to derive the conclusion;
6) computational cost for both induction and use of the classifier;
7) ability to explain the decisions reached when models are used in decision making; and
8) ability to perform well with unseen cases.
Interpretability of the results being the main selection criterion besides accuracy, it is surprising that there is little research in the field. Harper conducted a survey among the staff of a set of NHS trusts in the south of the U.K., comparing the comprehensibility and ease of use of models based on logistic regression, ANNs, and decision trees, which concluded that the latter have the greatest practical appeal. The interpretability of models obtained from logistic regression can be facilitated by the use of nomograms (Lubsen et al.). Nomograms are a well-established visualization technique consisting of a graphic representation of the statistical model that incorporates several variables to predict a particular endpoint.
In the field of SAH, the classification and regression trees
methodology (CART) has been compared to logistic regression
analysis (n = 885) to predict the outcome of SAH patients.
Results obtained were similar and it was concluded that the sin-
gle best predictor (level of consciousness) was itself as good
as multivariate analysis. CART was also used in a similar condition, intracerebral hemorrhage (n = 347), to develop a
classification tree that stratified the mortality in four risk levels
and outperformed a multivariate logistic regression model in
terms of accuracy (AUC 0.86 vs. 0.81).
The aim of this paper is to use knowledge discovery and
machine learning techniques to build a model for predicting
the outcome of a patient with SAH, using only data gathered
on hospital admission, which makes the knowledge discovered
from the data explicit and communicable to domain experts,
and which is usable in routine practice. To be usable, the model
should use as few predictors as possible, be intuitive to inter-
pret, and have a similar accuracy to techniques currently in use.
The class attribute is the outcome six months after discharge, measured by means of the Glasgow outcome scale (GOS), a five-point scale that is often dichotomized into "favorable outcome" and "poor outcome." The main objective is to predict the dichotomized outcome, but models leading to five and to three (trichotomized) classes are also investigated.
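Concretely, the dichotomization of the class attribute can be sketched as a small mapping function (an illustrative sketch; the function name is ours, and the grade boundary follows the GOS = I–II vs. III–V convention stated above):

```python
def dichotomize_gos(gos: int) -> str:
    """Map a GOS grade to the binary outcome used in this paper.

    Follows the convention above: grades I-II are 'favorable',
    grades III-V are 'poor'.
    """
    if gos not in (1, 2, 3, 4, 5):
        raise ValueError(f"GOS grade must be 1-5, got {gos}")
    return "favorable" if gos <= 2 else "poor"
```

The trichotomized variant would add an intermediate band in the same way.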
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 13, NO. 5, SEPTEMBER 2009
TABLE I: CHARACTERISTICS OF COHORTS USED IN THIS STUDY
II. MATERIALS AND METHODS
A. Data Mining Methodology
The process followed the phases proposed by the Cross-Industry Standard Process for Data Mining (CRISP-DM) Interest Group. These phases are, with minor changes, common to
most data mining methodologies: business understanding, data
understanding, data preparation, modeling, evaluation, and de-
ployment. The knowledge discovery process consists of a series
of evolutionary cycles, covering one or more of those phases,
repeating tasks such as data preparation, feature selection, se-
lection of the data mining technique, generation of classifiers,
and evaluation of the results.
B. Data Sources
We collected data retrospectively from two different data cohorts of SAH patients treated at a single center in Spain. The first cohort (Dataset1) contains information on 441 cases, from 1990 to 2001. The second (Dataset2) was created between 2001 and 2007 and has 192 cases, for which a smaller number of variables (a subset) were recorded. The strategy followed was to use the first cohort to derive the classifier, and the second for external validation.
C. Business and Data Understanding, Data Preparation
Data gathered can be categorized as follows:
1) initial evaluation variables;
2) variables related to diagnostic cranial CT scan;
3) variables related to diagnostic angiography;
4) variables related to the type of treatment and level of con-
sciousness before treatment; and
5) outcome variables, including complications (rebleeding,
ischemia, vasospasm, etc.).
Outcome is measured both at discharge and six months after.
Only data available at the time of diagnosis (groups 1 and 2) are
used for prediction, resulting in a total of 40 attributes.
The data were anonymized before being handed over to the research team, to comply with Spanish national regulations
on personal data. Informed consent had been obtained from all
the patients before including their information in the registry.
The open source tool Weka was used in different phases of the knowledge discovery process. Weka is a collection of state-of-the-art data mining algorithms and data preprocessing methods covering a wide range of tasks, such as attribute selection, clustering, and classification. Weka has been widely used both in medical applications and in bioinformatics.
1) Attribute Selection: Attribute selection is a key factor for
success in the generation of the model. Different subset eval-
uators and search methods were combined. Subset evaluators
used were classifier subset evaluator (assesses the predictive ability of each attribute individually and the degree of redundancy among them, preferring sets of attributes that are highly correlated with the class but have low intercorrelation) and Wrapper (employs cross validation to estimate the accuracy of the learning scheme for each attribute set). Search
methods used were:
1) greedy stepwise (greedy hill climbing without backtracking, stopping when adding or removing an attribute worsens the result of the evaluation as compared to the current best subset);
2) genetic search (using a simple genetic algorithm);
3) exhaustive search (exhaustive search in the attribute space, reporting the best subset); and
4) race search (competitions among attribute subsets, evaluated as a function of the error obtained in cross validation).
2) Classification Algorithms: Among the different machine
learning techniques available, decision trees and decision rules
were preferred to neural networks for their interpretability. Decision trees, also called classification trees, are models made of nodes and branches, where leaf nodes represent classifications and branches correspond to conjunctions of features (values or value ranges) that lead to a classification. The aim in decision tree construction is to split the data into groups that are as homogeneous as possible with respect to the outcome variable (e.g., "favorable outcome," "poor outcome"). The tree construction is achieved by recursively partitioning the dataset into subsets based on the value of a variable. In each iteration, the learning algorithm selects the split that maximizes homogeneity in the resulting subsets. Different measures of homogeneity can
be used, resulting in different tree learning techniques. Deci-
sion rules are similar to decision trees, and can be derived from
the former or produced directly, either from knowledge elicited
from the experts or with machine learning techniques.
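The recursive partitioning step can be made concrete with the entropy criterion used by the ID3/C4.5 family (a minimal sketch; the attribute names and toy data are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Reduction in entropy obtained by splitting `rows` on `attr`."""
    total = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return total - remainder

# Toy data: the outcome is fully determined by "wfns", not by "age".
rows = [{"wfns": "low", "age": "old"}, {"wfns": "low", "age": "young"},
        {"wfns": "high", "age": "old"}, {"wfns": "high", "age": "young"}]
labels = ["favorable", "favorable", "poor", "poor"]
```

In this toy example, splitting on "wfns" yields pure subsets (gain of one bit), while splitting on "age" yields no gain, so the learner would split on "wfns" first. C4.5 normalizes this measure into a gain ratio to avoid favoring many-valued attributes.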
From the broad range of decision trees and decision rules
algorithms available, the following were included in this study
DE TOLEDO et al.: PREDICTING THE OUTCOME OF PATIENTS WITH SUBARACHNOID HEMORRHAGE797
according to their suitability to the problem domain: C4.5, fast
decision tree learner (REPTree), partial decision trees (PART),
repeated incremental pruning to produce error reduction (Rip-
per), nearest neighbor with generalization (NNge), ripple down
rule learner (Ridor), and best-first decision tree learning (BFT).
C4.5, REPTree, and BFT build decision trees whereas the rest
are rule induction algorithms. C4.5 is an improvement over ID3: ID3 produces a decision tree using entropy to determine each tree node, but is not able to work with incomplete data or with numerical attributes; C4.5 adds the concept of gain ratio and admits numerical attributes. REPTree builds a decision tree using information gain or variance reduction, and prunes it with reduced-error pruning. PART (obtaining rules from partial decision trees) differs from most rule induction algorithms, which work in two phases: first, they generate classification rules, and then these are optimized through an improvement process, usually with a high computational cost. The PART algorithm does not need this global optimization step; instead, in each iteration it builds a partial decision tree, takes its best leaf, and transforms it into a rule.
Ripper is a rule induction algorithm working in three phases:
1) building (growing and pruning);
2) optimization; and
3) rule reduction.
It is an improved version of incremental reduced error pruning (IREP). NNge is a nearest neighbor method that generates rules using nonnested generalized exemplars. The Ridor technique is characterized by the generation of a first default rule, then using incremental reduced-error pruning to find the exceptions to this rule with the smallest weighted error rate.
In the second phase, the best exceptions are selected using the
IREP algorithm. BFT uses binary splits for both nominal
and numeric attributes, while for missing values, the method of
“fractional” instances is used.
As it is possible to have a statistically but not yet clinically
valid model and vice versa, evaluation must be conducted in
two directions: laboratory evaluation of the performance of the
model and clinical evaluation to determine whether the model
is satisfactory for clinical purposes.
1) Laboratory Evaluation: Hit ratio and kappa statistics
have been used to compare the different classifiers generated.
Hit ratio is not a proper accuracy score, as it does not penalize
models that are imprecise (for example, by exaggerating the
probability of a dominant class). The kappa statistic corrects the degree of agreement between the classifier's predictions and reality by considering the proportion of predictions that might occur by chance, and is recommended as the statistic of choice to compare classifiers. The receiver operating characteristic (ROC) curve is another widely used tool. In the case of a dichotomized outcome, hit ratio and kappa can be interpreted as is, while ROC curves are more cumbersome to interpret.
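For a dichotomized outcome, both measures can be computed directly from the 2 × 2 confusion matrix (a generic sketch, not tied to this paper's data; the function name is ours):

```python
def hit_ratio_and_kappa(tp, fn, fp, tn):
    """Hit ratio (observed accuracy) and Cohen's kappa for a 2x2 confusion matrix.

    tp/fn/fp/tn are the cell counts for the positive (e.g., "poor outcome")
    and negative (e.g., "favorable outcome") classes.
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement: product of marginal proportions, summed over classes.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_pos + p_neg
    return observed, (observed - expected) / (1 - expected)
```

With balanced counts such as `hit_ratio_and_kappa(40, 10, 10, 40)`, the hit ratio is 0.8 while kappa is only 0.6, illustrating how kappa discounts the agreement expected by chance.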
Tenfold cross validation was used for internal validation.
This well-known validation strategy is based on the partition
of the original sample into ten subsamples, retaining one for
testing and using the remaining nine as training data. The cross-validation process is repeated ten times, so that each subsample is used exactly once for testing, and the results from the ten folds are averaged to produce a single estimation. External validation of the best classifier was
performed with an independent dataset (hold out strategy).
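The tenfold scheme can be sketched generically (an illustrative sketch; `train` and `test` stand in for any learner and scoring function):

```python
def cross_validate(data, k, train, test):
    """k-fold cross validation: each fold is held out once for testing.

    `train(rows)` fits a model; `test(model, rows)` returns a score.
    Returns the average score over the k folds.
    """
    folds = [data[i::k] for i in range(k)]  # simple round-robin partition
    scores = []
    for i, held_out in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train(training)
        scores.append(test(model, held_out))
    return sum(scores) / k
```

In practice, stratified folds (preserving the class proportions in each fold) are preferable for imbalanced outcomes such as the trichotomized GOS.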
2) Clinical Evaluation: To assess the potential usefulness
of the classifier in clinical routine, the results were presented
to six neurosurgeons from five different hospitals in Spain. A
questionnaire with 21 questions (five-point Likert scale) was
prepared by the research team, covering issues related to the
value of the model, its interpretability, and its potential use in clinical routine.
It must be noted that the classifier developed is intended to
be used as a support tool in a Web-based multicentric register
of SAH cases. Therefore, it should be possible to implement it
in a way that can be integrated with such technologies.
III. RESULTS

The experiments were organized in cycles, as described in the methodology. For each cycle, two different sets of experiments were performed: a first battery to select the more
relevant attributes and a second battery to build the classifier
itself. Table II shows the experimental configuration, including
attribute selection, search, and classification algorithms used.
The experiments of the first two cycles resulted in:
1) the variables representing the amount of blood in the ten
cisterns being substituted by a summary score;
2) continuous variables such as age being clustered; and
3) the outcome variable being clustered in two and three
classes (Table II).
The experiments of the third cycle used 27 attributes. Differ-
ent datasets were prepared according to the following:
1) nonaggregated attributes, so that the splitting values are
set by the attribute selection algorithm;
2) attributes clustered as decided by the technical team;
3) attributes clustered as decided by the expert; and
4) only age clustered.
Attribute selection led to 38 datasets, with the number of attributes ranging from 1 to 23.
For the dichotomized problem, kappa values and hit ratio
were very similar for C4.5, PART, REPTree, Ripper, and BFT
TABLE III: VALIDATION FOR THE BEST CLASSIFIER GENERATED BY EACH ALGORITHM
Fig. 1. Final classifier: C4.5 decision tree, dichotomized outcome.
models (Table III). Precision values were similar for the first three, whereas NNge, Ripper, and Ridor had slightly worse results. The best classifier was the one generated by the C4.5 algorithm, which is shown in Fig. 1. It used only two attributes (WFNS
and Fisher’s) and had a lower complexity (six branches, five
leaves) as compared to the others. The attributes selected for
this classifier (WFNS and Fisher’s scale) had been present in
the experiments of all previous cycles, and their selection was
consistent with the relevance assigned to them by the expert.
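A tree of this shape (two attributes, five leaves) translates directly into a handful of nested conditions. The sketch below only illustrates that structure: the thresholds are invented placeholders, not the actual splits of the tree in Fig. 1.

```python
def predict_outcome(wfns: int, fisher: int) -> str:
    """Hypothetical two-attribute decision tree in the style of Fig. 1.

    WFNS and Fisher grades route each case to a leaf; the split points
    used here are invented placeholders for illustration, NOT the
    splits actually learned from the data.
    """
    if wfns <= 3:                    # better level of consciousness
        if fisher <= 2:              # little cisternal blood
            return "favorable"
        return "favorable" if wfns <= 2 else "poor"
    return "poor"                    # WFNS IV-V
```

This compactness, versus a logistic regression score that must be computed from coefficients, is what makes the tree attractive for bedside use.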
The best model generated by the PART algorithm also led to good results, but the number of rules was higher (17) and they were more difficult to interpret according to the domain expert. The attributes selected (WFNS, Fisher's, and number of previous hemorrhages) were also consistent from the clinical point of view.
Regarding the results for the trichotomized problem, the per-
centage of correctly classified instances was only slightly lower
than those obtained for the dichotomized scale (Table III); however, a closer inspection of the kappa statistic, confusion matrix, and precision values for each class showed that the intermediate class (severe disability) was very poorly
classified. As there were very few cases (34) of this class, this
had a small effect in the overall hit ratio but a great impact on
Fig. 2. ROC curve for the final classifier.
the utility of the classifier. Attributes used by the best model
were WFNS and the score summarizing amount of blood in
the cisterns. To handle the imbalance of the dataset, a further
experiment was performed replicating the instances in the in-
termediate class (three times) to increase their overall weight
in the classifier learning process. Results were slightly better
(79% hit ratio, 0.670 kappa), but the number of correctly classified instances in the intermediate class was still only four (out of 34). A further experiment was performed attending
to a request by the expert: generate a trichotomized tree with the
same attributes used by the dichotomized (WFNS and Fisher’s).
Results were 1% hit ratio, 0.476 kappa.
The model chosen is therefore the one created by the C4.5 algorithm, with six branches and five leaves, shown in Fig. 1. The quality values for this
classifier are AUC = 0.841 [0.80–0.88; confidence interval (CI)
95%], hit ratio = 83%, and kappa = 0.625.
B. External Validation
As the attributes selected by the model were present in both datasets, Dataset2 could be used for the external validation of the selected classifier with this independent test set. The
results obtained were AUC = 0.837 (0.78–0.89; 95% CI), 78%
hit ratio, 0.73 sensitivity (for the “poor outcome” class), 0.81
specificity, and 0.55 kappa. The ROC curve is shown in Fig. 2.
External validation was also performed using a random split of the pooled data into training and test sets. The classifier generated was the same (Fig. 1) and the results were only
slightly better: 80% hit ratio, 0.73 sensitivity, 0.86 specificity,
and 0.60 kappa. This indicates that it is possible to generate
a classifier from cases available at a certain point of time that
preserves its predicting ability for future patients.