Predicting the Outcome of Patients
With Subarachnoid Hemorrhage Using Machine
Paula de Toledo, Pablo M. Rios, Agapito Ledezma, Araceli Sanchis, Jose F. Alen, and Alfonso Lagares
Abstract—Background: Outcome prediction for subarachnoid
hemorrhage (SAH) helps guide care and compare global manage-
ment strategies. Logistic regression models for outcome prediction
may be cumbersome to apply in clinical practice. Objective: To use
machine learning techniques to build a model of outcome predic-
tion that makes the knowledge discovered from the data explicit
and communicable to domain experts. Material and methods: A
derivation cohort (n = 441) of nonselected SAH cases was ana-
lyzed using different classification algorithms to generate decision
trees and decision rules. Algorithms used were C4.5, fast decision
tree learner, partial decision trees, repeated incremental pruning
to produce error reduction, nearest neighbor with generalization,
and ripple down rule learner. Outcome was dichotomized into fa-
vorable [Glasgow outcome scale (GOS) = I–II] and poor (GOS =
III–V). An independent cohort (n = 193) was used for validation.
An exploratory questionnaire was given to potential users (special-
ist doctors) to gather their opinion on the classifier and its usability
C4.5 algorithm. It uses only two attributes [World Federation of
Neurological Surgeons (WFNS) and Fisher’s scale] and leads to a
simple decision tree. The accuracy of the classifier [area under the
ROC curve (AUC) = 0.84; confidence interval (CI) = 0.80–0.88]
is similar to that obtained by a logistic regression model (AUC
= 0.86; CI = 0.83–0.89) derived from the same data and is con-
sidered better fit for clinical use.
Index Terms—Data mining, knowledge discovery in databases,
machine learning, prognosis, subarachnoid hemorrhage.
This work was supported in part by the Spanish Ministries of Science under
Grant TRA2007-67374-C02-02 and Health under Grant FIS PI 070152. The
work of A. Lagares and J. F. Alen was supported by the Fundacion Mutua
P. de Toledo, P. M. Rios, A. Ledezma, and A. Sanchis are with the Control,
Learning, and Systems Optimization Group, Universidad Carlos III de Madrid,
Madrid 28040, Spain (e-mail: email@example.com).
J. F. Alen and A. Lagares are with the Department of Neurosurgery, Hospital
Doce de Octubre, Madrid 28041, Spain.
of hemorrhagic stroke characterized by the presence of
blood in the subarachnoid space, occupied by the arteries feed-
ing the brain. The most common cause of SAH is the rupture
of a cerebral aneurysm, an abnormal and fragile dilatation of a
cerebral artery. Its annual incidence is 10–15 cases per 100 000
inhabitants and nearly 50% of patients suffering it will have a
poor outcome. Brain damage related to this form of stroke is
due to a decrease in cerebral blood perfusion leading to cerebral
ischemia. Diagnosis is made with cranial computerized tomog-
raphy (CT) scan that shows the extent of the bleeding. Cerebral
angiography confirms the presence of an aneurysm in the ma-
jority of the patients, although in nearly 20% of the cases the
cause is unknown. The aneurysm, if found, should be treated as
ity over 50%). Treatment could be performed by endovascular
means or surgically to exclude the aneurysm preserving normal
As in other acute neurological diseases, determining prog-
nosis after SAH is crucial for giving adequate information to
patients' relatives, guiding treatment options, detecting subgroups
of patients that could benefit from certain treatments, and com-
paring treatments or global management strategies. Prognostic
information coming just from surgically or endovascularly
treated cases would not be applicable to all patients suffering
this condition, as many patients die before being treated .
Therefore, any model valid for assessing prognosis at diagnosis
in this disease should be obtained from a nonselected series of
patients. Prognostic factors are mainly level of consciousness at
admission, quantity of bleeding in the initial CT-scan, age, size
Neurological Surgeons (WFNS ) being the most frequently
used. It divides patients in five grades according to the severity
of consciousness disturbance. Its reliability and interobserver
from the Glasgow coma scale (GCS ), which is a universal
scale for consciousness assessment. The amount of bleeding
in the initial CT has been evaluated with different scales,
some assessing the amount of cisternal blood in a qualitative
way (Fisher’s scale ) and others using a semiquantitative
algorithm . The evaluation of the prognostic information
given by these different scales has been done mainly with con-
ventional statistics. Prognostic models have been built mainly
for dichotomized six-month outcome using logistic regression
analysis. Some scales have been built combining factors
coming from these models, including age, Fisher’s scale, and
WFNS , . Their accuracy has been tested using the area
under the receiver operating characteristic (ROC) curve
(AUC), achieving less than 90% accurate prognosis. The results
are difficult to interpret in the clinical setting as they consist
of different combinations of prognostic factors derived from
several scales, combined by scores or coefficients derived from
the regression equation. There is a need for simple, universal,
interpretable, and reliable prognostic tools for SAH patients.
A. Data Mining in Prognosis
as well as predicting potential disease onset on healthy patients,
DE TOLEDO et al.: PREDICTING THE OUTCOME OF PATIENTS WITH SUBARACHNOID HEMORRHAGE795
is an active area of research in medicine. Prognostic models are
primarily used to select appropriate treatments – and
tests ,  not only in individual patient management, but
generating global predictive scenarios, determining study eligi-
bility of patients for new treatments, defining inclusion criteria
for clinical trials to control for variation in prognosis, as well as
in cost reimbursement programs.
Statistical techniques such as univariate and multivariate lo-
advances in data mining algorithms, have led to the adoption of
problems in clinical medicine. A relevant summary of current
be found in a recent review paper by Bellazzi and Zupan ,
where different techniques are presented and compared.
The question of whether artificial neural networks (ANNs)
or other machine learning techniques can outperform statis-
tical modeling techniques in prediction problems in clinical
medicine does not have a simple answer. There are plenty
of research works comparing techniques from the two do-
mains , , , showing that there is no methodology
outweighing the others in all possible scenarios, and that the
tools need to be carefully selected depending on the problem
faced and the significant quality criteria. In some cases, ma-
chine learning techniques have been shown to lead to similar
results as logistic regression in accuracy, but outperform in cal-
ibration , . Other authors highlight the ease of use and
automation of techniques such as ANNs, while stating that lo-
gistic regression is still the gold standard . Furthermore,
statistical and machine learning techniques are not necessarily
competing strategies, but can also be used together to perform
a prediction task .
The most widely used predictive data mining methods, ac-
cording to a poll conducted in 2006 among researchers in the
field , are:
1) those based on decision trees such as ID3  and C4.5
2) those based on decision rules;
3) statistical methods, mainly logistic regression; and
4) ANNs, followed by support vector machines, naive
Bayesian classifiers, Bayesian networks, and nearest
Less frequently used methods are ensemble methods (boosting, bag-
ging) and genetic algorithms. In the field of prognosis in clini-
cal medicine the results differ, as logistic regression is still the
most widely applied, followed by ANNs –, . The
use of decision trees , ,  has been growing recently.
Other methods such as genetic algorithms are still scarce, but
promising , . A growing trend is combining different
machine learning techniques to achieve improved results .
When comparing different classifiers , , the key issues
to address are:
1) predictive accuracy;
2) interpretability of the classification models by the domain
3) handling of missing data and noise;
4) ability to work with different types of attributes (categor-
ical, ordinal, continuous);
5) reduction of attributes needed to derive the conclusion;
6) computational cost for both induction and use of the clas-
7) ability to explain the decisions reached when models are
used in decision making; and
8) ability to perform well with unseen cases.
Given that interpretability of the results is the main selection cri-
terion besides accuracy, it is surprising that there is little re-
search in the field. Harper  conducted a survey among the
staff of a set of NHS trusts in the south of the U.K., comparing the
comprehensibility and ease of use of models based on logis-
tic regression, ANNs, and decision trees, and concluded that
the latter have the greatest practical appeal. The in-
terpretability of models obtained from logistic regression can
be facilitated by the use of nomograms (Lubsen et al. ).
Nomograms are a well-established visualization technique con-
sisting of a graphic representation of the statistical model
that incorporates several variables to predict a particular end
In the field of SAH, the classification and regression trees
methodology (CART) has been compared to logistic regression
analysis (n = 885) to predict the outcome of SAH patients .
Results obtained were similar and it was concluded that the sin-
gle best predictor (level of consciousness) was itself as good
as multivariate analysis. CART was also used in a similar con-
dition , intracerebral hemorrhage (n = 347), to develop a
classification tree that stratified the mortality in four risk levels
and outperformed a multivariate logistic regression model in
terms of accuracy (AUC 0.86 vs. 0.81).
The aim of this paper is to use knowledge discovery and
machine learning techniques to build a model for predicting
the outcome of a patient with SAH, using only data gathered
on hospital admission, which makes the knowledge discovered
from the data explicit and communicable to domain experts,
and which is usable in routine practice. To be usable, the model
should use as few predictors as possible, be intuitive to inter-
pret, and have a similar accuracy to techniques currently in use.
The class attribute is the outcome six months after discharge,
measured by means of the Glasgow outcome scale (GOS ),
a five-point scale that is often dichotomized into “favorable out-
come” and “poor outcome.” The main objective is to predict
the dichotomized outcome, but models leading to five and three
(trichotomized) classes are also investigated.
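The dichotomization of the class attribute can be sketched in a few lines. This is a minimal illustration, assuming that GOS grades are coded as the integers 1 to 5 (I to V) and using the cut-off stated in the abstract (I–II favorable, III–V poor); the function name is hypothetical and is not taken from the study.

```python
def dichotomize_gos(grade: int) -> str:
    """Map a GOS grade (1-5, i.e. I-V) to a two-class outcome label."""
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError(f"GOS grade must be 1-5, got {grade!r}")
    # Cut-off taken from the abstract: I-II favorable, III-V poor.
    return "favorable" if grade <= 2 else "poor"
```

A trichotomized mapping would follow the same pattern, but the grade boundaries of the three classes would have to be taken from the study protocol.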
796IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 13, NO. 5, SEPTEMBER 2009
CHARACTERISTICS OF COHORTS USED IN THIS STUDY
II. MATERIALS AND METHODS
A. Data Mining Methodology
by the Cross-Industry Standard Process for Data Mining Inter-
est Group. These phases are, with minor changes, common to
most data mining methodologies: business understanding, data
understanding, data preparation, modeling, evaluation, and de-
ployment. The knowledge discovery process consists of a series
of evolutionary cycles, covering one or more of those phases,
repeating tasks such as data preparation, feature selection, se-
lection of the data mining technique, generation of classifiers,
and evaluation of the results.
B. Data Sources
We collected data retrospectively from two different data co-
Spain. The first cohort (Dataset1) keeps information from 441
cases, from 1990 to 2001. The second (Dataset2) was created
between 2001 and 2007 and has 192 cases, for which a smaller
number of variables (a subset) were recorded. The strategy fol-
the classifier, and the second for external validation.
C. Business and Data Understanding, Data Preparation
Data gathered can be categorized as follows:
1) initial evaluation variables;
2) variables related to diagnostic cranial CT scan;
3) variables related to diagnostic angiography;
4) variables related to the type of treatment and level of con-
sciousness before treatment; and
5) outcome variables, including complications (rebleeding,
ischemia, vasospasm, etc.).
Outcome is measured, both at discharge and six months after.
Only data available at the time of diagnosis (groups 1 and 2) are
used for prediction, resulting in a total of 40 attributes.
The data were anonymized prior to being handed over to the
research team, to comply with the Spanish national regulations
on personal data. Informed consent had been obtained from all
the patients before including their information in the registry.
The open source tool Weka  was used in different phases
of the knowledge discovery process. Weka is a collection of
state-of-the-art data mining algorithms and data preprocessing
methods for a wide range of tasks such as
attribute selection, clustering, and classification. Weka has been
and in bioinformatics .
1) Attribute Selection: Attribute selection is a key factor for
success in the generation of the model. Different subset eval-
uators and search methods were combined. Subset evaluators
used were classifier subset evaluator  (assesses the pre-
dictive ability of each attribute individually and the degree of
redundancy among them, preferring sets of attributes that are
highly correlated with the class but have low intercorrelation)
and Wrapper  (employs cross validation to estimate the ac-
curacy of the learning scheme for each attribute set). Search
methods used were:
1) greedy stepwise  (greedy hill climbing without back-
tracking; stopping when adding or removing an attribute
worsens the results of the evaluation, as compared to the
2) genetic search (using a simple genetic algorithm) ;
3) exhaustive search  (exhaustive search in the attribute
est subset); and
4) race search  (competitions among attribute subsets,
evaluating them as a function of the error obtained in the
2) Classification Algorithms: Among the different machine
learning techniques available, decision trees and decision rules
were preferred to neural networks for their interpretability. De-
cision trees, also called classification trees, are models made
of nodes (leaves) and branches, where nodes represent classi-
fications and branches correspond to conjunctions of features
(values or value ranges) that lead to a classification. The aim in de-
homogeneous groups with respect to the outcome variable (e.g.,
“favorable outcome,” “poor outcome”). The tree construction
is achieved by recursively partitioning the dataset into subsets
based on the value of a variable. In each iteration, the learning
in the resulting subsets. Different measures of homogeneity can
be used, resulting in different tree learning techniques. Deci-
sion rules are similar to decision trees, and can be derived from
the former or produced directly, either from knowledge elicited
from the experts or with machine learning techniques.
From the broad range of decision trees and decision rules
algorithms available, the following were included in this study
according to their suitability to the problem domain: C4.5, fast
decision tree learner (REPTree), partial decision trees (PART),
repeated incremental pruning to produce error reduction (Rip-
per), nearest neighbor with generalization (NNge), ripple down
rule learner (Ridor), and best-first decision tree learning (BFT).
C4.5, REPTree, and BFT build decision trees whereas the rest
are rule induction algorithms. C4.5  is an improvement over
ID3 . ID3 produces a decision tree using entropy to
determine each tree node, but is not able to work either with in-
complete data or with numerical attributes. C4.5 improves on ID3
by including the concept of gain ratio and admitting numerical at-
tributes. REPTree  builds a decision tree by evaluating the
predictor attributes against a quantitative target attribute, using
variance reduction to derive balanced tree splits and minimize
error corrections. PART  (obtaining rules from partial deci-
work in two phases: first, they generate classification rules, and
then, these are optimized through an improvement process, usu-
ally with a high computational cost. PART algorithm does not
to take the best leaf in each iteration and transform it into a rule.
Ripper is a rule induction algorithm working in three phases as
1) building (growing and pruning);
2) optimization; and
3) rule reduction.
It is an improved version of incremental reduced error prun-
ing (IREP) . NNge  is a nearest neighbor method of
generating rules using nonnested generalized exemplars. Ri-
dor  technique is characterized by the generation of a first
default rule, using incremental reduced-error pruning to find
exceptions to this rule with the smallest pondered error rate.
In the second phase, the best exceptions are selected using the
IREP algorithm. BFT  uses binary split for both nominal
and numeric attributes, while for missing values, the method of
“fractional” instances is used.
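The gain ratio criterion mentioned above for C4.5 can be illustrated with a short sketch. It assumes a single nominal attribute and uses the standard textbook definition (information gain divided by split information); it is not Weka's implementation, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain ratio of a nominal attribute: information gain / split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition induced by the attribute.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0
```

Dividing by the split information penalizes attributes that fragment the data into many small subsets, which plain information gain (as in ID3) tends to favor.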
As it is possible to have a statistically but not yet clinically
valid model and vice versa, evaluation must be conducted in
two directions: laboratory evaluation of the performance of the
model and clinical evaluation to determine whether the model
is satisfactory for clinical purpose.
1) Laboratory Evaluation: Hit ratio and kappa statistics
have been used to compare the different classifiers generated.
Hit ratio is not a proper accuracy score, as it does not penalize
models that are imprecise (for example, by exaggerating the
probability of a dominant class). Kappa statistic  corrects
the degree of agreement between the classifier’s predictions and
reality by considering the proportion of predictions that might
occur by chance, and is recommended  as the statistic of
choice to compare classifiers. The receiver operating character-
istics (ROC) curve  is another widely used tool. In the case
as is, while ROC curves are more cumbersome to interpret.
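As an illustration of the two measures, the sketch below computes the hit ratio and the kappa statistic from a confusion matrix. The counts are invented for the example and are not the study's data; the function name is hypothetical.

```python
def hit_ratio_and_kappa(matrix):
    """Return (hit ratio, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] = number of cases of true class i predicted as class j.
    """
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Invented counts, for illustration only (not the study's data).
example = [[40, 10],
           [5, 45]]
```

For these counts the hit ratio is 0.85, the agreement expected by chance is 0.50, and kappa is therefore (0.85 − 0.50)/(1 − 0.50) = 0.70.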
Tenfold cross validation was used for internal validation.
This well-known validation strategy is based on the partition
of the original sample into ten subsamples, retaining one for
testing and using the remaining nine as training data. The cross-
validation process is then repeated ten times with each of the
ten samples, averaging the results from the ten folds to produce
a single estimate. External validation of the best classifier was
performed with an independent dataset (hold out strategy).
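The partitioning step of tenfold cross validation can be sketched as follows. This is a simplified, unstratified version written for illustration; whether the folds in the study were stratified by class is not stated here, and the function name and seed are assumptions.

```python
import random

def k_fold_splits(n_cases, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test in enumerate(folds):
        # Training data: every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 441 cases this yields one test fold of 45 cases and nine of 44; each case is used for testing exactly once, and the ten per-fold results are then averaged.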
2) Clinical Evaluation: To assess the potential usefulness
of the classifier in clinical routine, the results were presented
to six neurosurgeons from five different hospitals in Spain. A
questionnaire with 21 questions (five-point Likert scale) was
prepared by the research team, covering issues related to the
value of the model, interpretability, and potential use in clinical
It must be noted that the classifier developed is intended to
be used as a support tool in a Web-based multicentric register
of SAH cases. Therefore, it should be possible to implement it
in a way that can be integrated with such technologies.
described in the methodology. For each cycle, two different sets
of experiments were performed: a first battery to select the most
relevant attributes and a second battery to build the classifier
itself. Table II shows the experimental configuration, including
attribute selection, search, and classification algorithms used.
The experiments of the first two cycles resulted in:
1) the variables representing the amount of blood in the ten
cisterns being substituted by a summary score;
2) continuous variables such as age being clustered; and
3) the outcome variable being clustered in two and three
classes (Table II).
The experiments of the third cycle used 27 attributes. Differ-
ent datasets were prepared according to the following:
1) nonaggregated attributes, so that the splitting values are
set by the attribute selection algorithm;
2) attributes clustered as decided by the technical team;
3) attributes clustered as decided by the expert; and
4) only age clustered.
Attribute selection led to 38 datasets with attributes ranging
from 1 to 23.
For the dichotomized problem, kappa values and hit ratio
were very similar for C4.5, PART, REPTree, Ripper, and BFT
VALIDATION FOR THE BEST CLASSIFIER GENERATED BY EACH ALGORITHM
Fig. 1. Final classifier: C4.5 decision tree, dichotomized outcome.
models (Table III). Precision values were similar for the first
three whereas NNge, Ripper, and Ridor had slightly worse re-
which is shown in Fig. 1. It used only two attributes (WFNS
and Fisher’s) and had a lower complexity (six branches, five
leaves) as compared to the others. The attributes selected for
this classifier (WFNS and Fisher’s scale) had been present in
the experiments of all previous cycles, and their selection was
consistent with the relevance assigned to them by the expert.
The best model generated by the PART algorithm led
to equally good results, but the number of rules was higher (17) and
more difficult to interpret according to the domain expert. The
attributes selected (WFNS, Fisher’s, and number of previous
hemorrhages) were also consistent from the clinical point of
Regarding the results for the trichotomized problem, the per-
centage of correctly classified instances was only slightly lower
than that obtained for the dichotomized scale (Table III); how-
ever, a closer look at the kappa statistic, con-
fusion matrix, and precision values for each class showed that
the intermediate class (severe disability) was very poorly
classified. As there were very few cases (34) of this class, this
had a small effect on the overall hit ratio but a great impact on
Fig. 2. ROC curve for the final classifier.
the utility of the classifier. Attributes used by the best model
were WFNS and the score summarizing amount of blood in
the cisterns. To handle the imbalance of the dataset, a further
experiment was performed replicating the instances in the in-
termediate class (three times) to increase their overall weight
in the classifier learning process. Results were slightly better
(79% hit ratio, 0.670 kappa), but the number of correctly
classified instances in the intermediate class was still only
four (out of 34). A further experiment was performed attending
to a request by the expert: generate a trichotomized tree with the
same attributes used by the dichotomized (WFNS and Fisher’s).
Results were 1% hit ratio, 0.476 kappa.
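The replication experiment described above can be sketched as a simple oversampling step. The sketch assumes that "three times" means each intermediate-class instance appears three times in total in the training data; the text does not say whether the copies are in addition to the original. Function and parameter names are hypothetical.

```python
def replicate_class(instances, labels, target_label, copies=3):
    """Return new lists in which every instance of target_label appears `copies` times."""
    out_x, out_y = [], []
    for x, y in zip(instances, labels):
        # Assumption: `copies` is the total number of occurrences, not extra copies.
        repeat = copies if y == target_label else 1
        out_x.extend([x] * repeat)
        out_y.extend([y] * repeat)
    return out_x, out_y
```

Replication of this kind should be applied to the training folds only; otherwise copies of the same case can appear in both training and test data and inflate the estimated accuracy.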
The model chosen is therefore the one created by the C4.5
and five leaves, shown in Fig. 1. The quality values for this
classifier are AUC = 0.841 [0.80–0.88; confidence interval (CI)
95%], hit ratio = 83%, and kappa = 0.625.
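For reference, the AUC can be computed directly from its probabilistic interpretation: the probability that a randomly chosen poor-outcome case receives a higher score than a randomly chosen favorable-outcome case. The sketch below is a generic illustration, not the procedure used in the study; for a decision tree, the score of a case could be, for example, the proportion of poor outcomes in the leaf it reaches (an assumption).

```python
def auc(scores, labels, positive=1):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; a tie counts as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because a tree with five leaves produces at most five distinct scores, many pairs are tied, which is why the half-credit rule for ties matters in this setting.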
B. External Validation
As the attributes selected by the model were present in
tion of the selected classifier with this independent test set. The
results obtained were AUC = 0.837 (0.78–0.89; 95% CI), 78%
hit ratio, 0.73 sensitivity (for the “poor outcome” class), 0.81
specificity, and 0.55 kappa. The ROC curve is shown in Fig. 2.
External validation was performed as well using a random
sifier generated was the same (Fig. 1) and the results were only
slightly better: 80% hit ratio, 0.73 sensitivity, 0.86 specificity,
and 0.60 kappa. This indicates that it is possible to generate
a classifier from cases available at a certain point of time that
preserves its predictive ability for future patients.
RESULTS OF MULTIVARIATE LOGISTIC REGRESSION ANALYSIS (B = −0.46;
HOSMER–LEMESHOW GOODNESS OF FIT; χ2= 6.03; DOF = 8; p = 0.65)
C. Statistical Model
The results of the multivariate logistic regression analysis
using factors recorded at admission for dichotomized outcomes
are shown in Table IV. A backward stepwise strategy was used
to build the model. The attributes selected are: WFNS, Fisher’s
0.86 (0.83–0.89, 95% CI). Coefficients from logistic regression
models are difficult to interpret and different strategies have
been used in order to calculate individual probabilities, such
as converting these models into a score or using nomograms.
Such a strategy was used and a nomogram was plotted from the
logistic regression model results (Fig. 3).
1) Clinical Evaluation: Six neurosurgeons responded to the
on machine learning techniques before. They considered the
model simple to understand (4.0 in a five-point Likert scale),
and sound from the clinical point of view (4.0). As compared
to logistic regression models, they found it easier to interpret
(3.7) and reported similar trust on the methodology (3.0). The
fact that the model used only two variables was considered an
advantage (3.8). All respondents agreed that the classifier could
be used in clinical routine (4.3) and that integrating it both into
the hospital information systems (3.7) and in the multicenter
registry (3.8) would be a plus.
D. Deployment: Integration With the Multicenter Register
The Web-based multicentric register was modified to add a
“show prognosis” button, which displays the graphical presen-
tation of the decision tree highlighting the branch that led to
the classification. In order to cope with future changes in the
classifier, the system reads the graph from a standard graph rep-
resentation based on DOT templates. This is the format used
by the open source Weka library , whose graphical imple-
mentation package we modified to work as a Java applet. This
applet is able to accept any decision tree and represent it. As
a drawback of this implementation, we note that it requires the
installation of the Java runtime environment in the user’s com-
puter. The classifier itself is coded in Visual Basic and the time
needed to offer a prediction is negligible (mean time to load the
page <1 s).
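The idea of classifying a case, recording the branch followed, and serializing the tree as DOT text can be sketched as follows. This is a hypothetical illustration in Python, not the Java applet or Visual Basic code of the deployed system; the nested-dictionary tree format is an assumption, and the toy tree uses an invented attribute and invented values, so it is NOT the tree of Fig. 1.

```python
def classify(node, case, path=None):
    """Walk the tree for one case; return (label, list of visited node ids)."""
    path = (path or []) + [node["id"]]
    if "label" in node:  # leaf node
        return node["label"], path
    child = node["children"][case[node["attribute"]]]
    return classify(child, case, path)

def to_dot(node, highlight=(), lines=None):
    """Serialize the tree as DOT text, drawing highlighted nodes in bold."""
    top = lines is None
    if top:
        lines = ["digraph tree {"]
    text = node.get("label", node.get("attribute"))
    style = ", style=bold" if node["id"] in highlight else ""
    lines.append(f'  {node["id"]} [label="{text}"{style}];')
    for value, child in node.get("children", {}).items():
        lines.append(f'  {node["id"]} -> {child["id"]} [label="{value}"];')
        to_dot(child, highlight, lines)
    if top:
        lines.append("}")
        return "\n".join(lines)

# Toy tree with an invented attribute and values; NOT the tree of Fig. 1.
toy = {"id": "n0", "attribute": "grade", "children": {
    "low": {"id": "n1", "label": "favorable"},
    "high": {"id": "n2", "label": "poor"},
}}
label, path = classify(toy, {"grade": "high"})
dot_text = to_dot(toy, highlight=path)
```

Keeping the tree in a data file rather than in program logic is what allows the classifier to be replaced without changing the surrounding information system.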
The C4.5 algorithm leads to the best model using automatic
ical interpretation of the tree was more obscure. The accuracy
of the classifier, expressed in terms of AUC is 0.84 (0.80–0.88;
95% CI), in the range of the results [AUC = 0.86 (0.83–0.89,
95% CI)] obtained with the logistic regression model  (range
0.83–0.86 for the different scales studied). As compared to the
logistic regression model, the decision tree uses one factor less
(both use WFNS and Fisher’s grade, logistic model uses age as
WFNS is known to be the best single predictor of out-
come , . Although age has been repeatedly found to be a
determinant prognostic factor in SAH, when using conventional
with the C4.5 algorithm when the age attribute is added to the
filtered learning data are worse than when it is ignored.
It could have been expected that the classifier would show a
linear progress from “favorable outcome” to “poor outcome.”
Conversely, it can be noticed (Fig. 1) that for Fisher’s grade 3,
results are worse than those for grade 4. This lack of linearity in
the Fisher’s scale has been found by other researchers , 
and resides in the very definition used when assessing Fisher’s
definition. Therefore, the proportion of patients with vasospasm
Fisher’s grade 3 than for grade 4 .
The use of a nonselected series of patients is the main value
of this paper as compared to Germanson , who used data
from patients selected for a randomized trial where the pres-
ence of an aneurysm confirmed by angiography was needed for
inclusion. Many patients with diagnosed SAH die before an-
giography (nearly 10% in our series) and also many patients
with SAH do not harbor an aneurysm (more than 20% in our se-
ries). Therefore, data and prognostic information from selected
cases are not applicable to all SAH patients at diagnosis. In
Germanson’s work, patients are stratified in three levels of risk
for unfavorable outcome, although there is no assessment of the
accuracy of their prediction in terms that allow for comparison
with our results.
The diagnostic capability of the decision tree equals that of
the logistic regression model, while the tree brings about some
advantages that are as follows:
1) a decision tree is more intuitive and simpler to interpret
than a nomogram;
2) it contains a reduced number of rules;
3) it uses one factor less (age); and
4) it is easier to generate.
The experts interviewed agreed on the fact that the decision
tree is easier to interpret than the nomogram and gave to the two
question “the predictions achieved using logistic regression are
more trustworthy than those obtained using machine learning,”
only two of the respondents “moderately agreed,” whereas the
other four either disagreed or were neutral. According to our
Fig. 3. Nomogram summarizing information derived from the logistic regression model (using the derivation cohort), showing probabilities of poor outcome.
exploratory survey, potential applicability of this model in clini-
registry or into the information systems used in clinical practice
machine learning techniques from the experts, but this was not
the case. It must be noted that the survey is very limited (n = 6),
and therefore, these conclusions are only indicative, but as none
of these experts had used machine learning techniques before,
a favorable bias in this group is not foreseen.
outcome, severe disability, death) were useless from the clinical
point of view. The number of cases in the dataset that belong
to the intermediate class is small (34 cases), and therefore, the
variability in data collection is another very well-known source
of error for these models. Current work with statistical methods
is mainly for a dichotomized outcome variable, which is usable
from a clinical point of view, although the greater challenge
for the future stands in achieving a good prediction for inter-
mediate cases. Future work will target the improvement of the
results in this area, working with higher number of cases (likely
coming from the multicentric registry) and with other machine
learning algorithms (for example, combined classifiers such as
boosting and bagging or genetic algorithms). Another field with
potential for improvement is the prediction of complications
(such as rebleeding, hydrocephalus, or vasospasm), the predic-
The results are limited by the size of the training set (441
instances). However, SAH is a relatively rare condition and
building larger databases is not always possible. A further limi-
tation of the results is that they have been derived from patients
from a single hospital, and therefore, their applicability outside
this organization is unknown. Before integrating the classifier
into the multicenter registry, the model should be tested and
improved, if necessary, with data gathered from all hospitals
involved, in order to increase its generalization ability.
The open source toolkit Weka has proved to be a very useful
instrument supporting the data mining process. Furthermore,
the tree coding format used by Weka has been used to integrate
the classifier into the multicenter registry. The use of a standard
language to represent the classifier that can be interpreted by
the information system used in routine and modified without
having to change the information system itself is an interesting
way to facilitate the adoption of decision support tools in clini-
cal practice. An alternative format is the Predictive Model
Markup Language (PMML) , a vendor-independent open standard
that defines an XML-based markup language for the encod-
ing of many predictive data mining models, including decision
trees and logistic regression. A further step would be to config-
ure the prediction tool as a Web service offered to information
systems subscribing to it. The model could learn incrementally
from the new cases introduced in the multicenter reg-
istry and offer updated decision support information online to
electronic healthcare record systems from different healthcare
providers. Researchers in the field of artificial intelligence in
medicine agree that the impact in clinical practice of progno-
sis tools is maximized when these are made accessible through
computer-based systems that are integrated into the clinician’s
the multicenter registry is a step, yet only investigative, toward
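The idea of keeping the classifier in an external representation that the information system merely interprets can be sketched as follows. This is a minimal illustration in Python, not the Weka tree format or PMML: the JSON layout, the `classify` function, the attribute names, and the thresholds are all hypothetical placeholders and do not reproduce the tree derived in this study.

```python
import json

# Hypothetical tree encoding. Attribute names and thresholds are placeholders
# chosen for illustration only; they are NOT the splits reported in the paper.
TREE_JSON = """
{"attribute": "wfns", "threshold": 3,
 "le": {"attribute": "fisher", "threshold": 2,
        "le": "favorable", "gt": "poor"},
 "gt": "poor"}
"""


def classify(node, case):
    """Walk the tree until a leaf (a plain string label) is reached."""
    while isinstance(node, dict):
        value = case[node["attribute"]]
        # Follow the "le" branch when value <= threshold, otherwise "gt".
        branch = "le" if value <= node["threshold"] else "gt"
        node = node[branch]
    return node


# The host system only needs to load the representation and interpret it.
tree = json.loads(TREE_JSON)
print(classify(tree, {"wfns": 1, "fisher": 2}))  # prints "favorable"
```

Because the tree is data rather than code, replacing `TREE_JSON` with a retrained tree would change the predictions without touching `classify`.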
 H. Saveland, J. Hillman, L. Brandt, G. Edner, K. Jakobson, and G. Algers,
study from neurosurgical units in Sweden during a 1-year period,” J.
Neurosurg., vol. 76, pp. 729–734, 1992.
 A. Lagares, P. A. Gomez, R. D. Lobato, J. F. Alen, R. Alday, and J.
Campollo, “Prognostic factors on hospital admission after spontaneous
subarachnoid hemorrhage,” Acta Neurochir. (Wien), vol. 143, pp. 665–
 H. Säveland and L. Brandt, “Which are the major determinants for outcome
study from a strictly unselected series,” Acta Neurol. Scand., vol. 90,
pp. 245–250, 1994.
 C. G. Drake, W. E. Hunt, K. Sano, N. Kassell, G. Teasdale, B. Pertuiset,
and J. C. Devilliers, “Report of the World Federation of Neurological
Surgeons committee on a universal subarachnoid hemorrhage grading
scale,” J. Neurosurg., vol. 68, pp. 985–986, 1988.
ness. A practical scale,” Lancet, vol. 2, no. 7872, pp. 81–84, Jul. 1974.
to subarachnoid hemorrhage visualized by computed tomographic scan-
ning,” Neurosurgery, vol. 6, pp. 1–9, 1980.
 A. Hijdra, P. J. A. M. Brouwers, M. Vermeulen, and J. van Gijn, “Grading
the amount of blood on computed tomograms after subarachnoid hemor-
rhage,” Stroke, vol. 21, pp. 1156–1161, 1990.
 A. Lagares, P. A. Gomez, J. F. Alen, R. D. Lobato, J. J. Rivas, R. Alday,
J. Campollo, and A. G. de la Camara, “A comparison of different grading
scales for predicting outcome after subarachnoid haemorrhage,” Acta
Neurochir. (Wien), vol. 147, no. 1, pp. 5–16, Jan. 2005.
to predict outcome for surgical management of intracranial aneurysms,”
Neurosurgery, vol. 42, pp. 959–970, 1998.
 H. Seker, M. O. Odetayo, D. Petrovic, and R. N. Naguib, “A fuzzy logic
based-method for prognostic decision making in breast and prostate can-
cers,” IEEE Trans. Inf. Technol. Biomed., vol. 7, no. 2, pp. 114–122, Jun.
care,” Annu. Rev. Biomed. Eng., vol. 8, pp. 567–599, 2006.
 G. F. Cooper, V. Abraham, C. F. Aliferis, J. M. Aronis, B. G. Buchanan,
and P. Spirtes, “Predicting dire outcomes of patients with community
acquired pneumonia,” J. Biomed. Inf., vol. 38, no. 5, pp. 347–366, Oct.
 Y. C. Li, L. Liu, W. T. Chiu, and W. S. Jian, “Neural network modeling
for surgical decisions on traumatic brain injury patients,” Int. J. Med. Inf.,
vol. 57, no. 1, pp. 1–9, Jan. 2000.
 B. A. Mobley, E. Schechter, W. E. Moore, P. A. McKee, and J. E. Eichner,
“Neural network predictions of significant coronary artery stenosis in
men,” Artif. Intell. Med., vol. 34, no. 2, pp. 151–161, Jun. 2005.
 M. Buscema, E. Grossi, M. Intraligi, N. Garbagna, A. Andriulli,
and M. Breda, “An optimized experimental protocol based on neuro-
evolutionary algorithms application to the classification of dyspeptic pa-
tients and to the prediction of the effectiveness of their treatment,” Artif.
Intell. Med., vol. 34, no. 3, pp. 279–305, Jul. 2005.
 A. Abu-Hanna and N. de Keizer, “Integrating classification trees with
local logistic regression in intensive care prognosis,” Artif. Intell. Med.,
vol. 29, no. 1/2, pp. 5–23, Sep./Oct. 2003.
 R. Bellazzi and B. Zupan, “Predictive data mining in clinical medicine:
Current issues and guidelines,” Int. J. Med. Inf., vol. 77, no. 2, pp. 81–97,
 P. J. Lucas and A. Abu-Hanna, “Prognostic methods in medicine,” Artif.
Intell. Med., vol. 15, no. 2, pp. 105–119, Feb. 1999.
 D. Delen, G. Walker, and A. Kadam, “Predicting breast cancer surviv-
ability: A comparison of three data mining methods,” Artif. Intell. Med.,
vol. 34, no. 2, pp. 113–127, Jun. 2005.
 G. Clermont, D. C. Angus, S. M. DiRusso, M. Griffin, and W. T. Linde-
Zwirble, “Predicting hospital mortality for patients in the intensive care
unit: A comparison of artificial neural networks with logistic regression
models,” Crit. Care Med., vol. 29, no. 2, pp. 291–296, Feb. 2001.
 F. Jaimes, J. Farbiarz, D. Alvarez, and C. Martinez, “Comparison be-
tween logistic regression and neural networks to predict death in patients
with suspected sepsis in the emergency room,” Crit. Care, vol. 9, no. 2,
pp. R150–R156, Apr. 2005.
and neural networks,” Methods Inf. Med., vol. 45, no. 5, pp. 536–540,
 (2006). KDNuggets Data Mining Methods Poll [Online]. Available:
 R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, no. 1,
pp. 81–106, 1986.
 R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA:
Morgan Kaufmann, 1993.
 Z. H. Zhou and Y. Jiang, “Medical diagnosis with C4.5 rule preceded by
artificial neural network ensemble,” IEEE Trans. Inf. Technol. Biomed.,
vol. 7, no. 1, pp. 37–42, Mar. 2003.
 P. R. Harper, “A review and comparison of classification algorithms for
medical decision making,” Health Policy, vol. 71, no. 3, pp. 315–331,
 J. Lubsen, J. Pool, and E. van der Does, “A practical device for the
application of a diagnostic or prognostic function,” Methods Inf. Med.,
vol. 17, no. 2, pp. 127–129, Apr. 1978.
 T. P. Germanson, G. Lanzino, G. L. Kongable, J. C. Torner, and N. F. Kas-
sell, “Risk classification after aneurysmal subarachnoid hemorrhage,”
Surg. Neurol., vol. 49, no. 2, pp. 155–163, Feb. 1998.
 O. Takahashi, E. F. Cook, T. Nakamura, J. Saito, F. Ikawa, and T. Fukui,
“Risk stratification for in-hospital mortality in spontaneous intracerebral
no. 11, pp. 743–750, Nov. 2006.
 B. Jennett and M. Bond, “Assessment of outcome after severe brain dam-
age,” Lancet, vol. 1, no. 7905, pp. 480–484, Mar. 1975.
 C. Shearer, “The CRISP-DM model: The new blueprint for data mining,”
J. Data Warehousing, vol. 5, no. 4, pp. 13–22, 2000.
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools
and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2005.
 M. H. Ou, G. A. West, M. Lazarescu, and C. Clay, “Dynamic knowledge
Med., vol. 39, no. 1, pp. 79–96, Jan. 2007.
framework for bioinformatics,” Bioinformatics, vol. 23, no. 5, pp. 651–
653, Mar. 2007.
 D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine
Learning, 1st ed. Reading, MA: Addison-Wesley, 1989.
 E. Frank and I. H. Witten, “Generating accurate rule sets without global
optimization,” in Proc. 15th Int. Conf. Mach. Learn.
CA: Morgan Kaufmann, 1998, pp. 144–151.
 W. W. Cohen, “Fast effective rule induction,” in Proc. 12th Int. Conf.
Mach. Learn. (ML 1995), pp. 115–123.
 B. Martin, “Instance-based learning: Nearest neighbor with generalization,”
Master’s thesis, Univ. Waikato, Hamilton, New Zealand, 1995.
 S. Haijian, “Best-first decision tree learning,” Ph.D. dissertation, Univ.
Waikato, Hamilton, New Zealand, 2007.
 J. Cohen, “A coefficient of agreement for nominal scales,” Educ. Psychol.
Meas., vol. 20, no. 1, pp. 37–46, 1960.
 A. Ben-David, “What’s wrong with hit ratio?” IEEE Intell. Syst., vol. 21,
no. 6, pp. 68–70, Nov./Dec. 2006.
 J. A. Hanley, “Receiver operating characteristic (ROC) methodology: The
state of the art,” Crit. Rev. Diagn. Imag., vol. 29, pp. 307–335, 1989.
 G. Lanzino, N. F. Kassell, T. P. Germanson, G. L. Kongable, L. L.
Truskowski, J. C. Torner, and J. A. Jane, “Age and outcome after aneurysmal
subarachnoid hemorrhage: Why do older patients fare worse?” J.
Neurosurg., vol. 85, no. 3, pp. 410–418, Sep. 1996.
 Data Mining Group. (2006). The Predictive Model Markup Language
(PMML) [Online]. Available: www.dmg.org
 M. Stefanelli, “The socio-organizational age of artificial intelligence in
medicine,” Artif. Intell. Med., vol. 23, no. 1, pp. 25–47, Aug. 2001.