A Text Mining Approach to the Prediction of Disease Status from
Clinical Discharge Summaries
HUI YANG, PHD, IRENA SPASIC, PHD, JOHN A. KEANE, GORAN NENADIC, PHD
Abstract

Objective: The authors present a system developed for the Challenge in Natural Language
Processing for Clinical Data—the i2b2 obesity challenge, whose aim was to automatically identify the status of
obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted
of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the
intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted.
Design: The authors assembled a set of resources to lexically and semantically profile the diseases and their
associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which
combined dictionary look-up, rule-based, and machine-learning methods.
Measurements: The methods were applied on a set of 507 previously unseen discharge summaries, and the
predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating
teams was primarily based on the macro-averaged F-measure.
Results: The implemented method achieved the macro-averaged F-measure of 81% for the textual task (which was
the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams—the highest was
66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations.
Conclusions: The performance achieved was in line with the agreement between human annotators, indicating the
potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.
J Am Med Inform Assoc. 2009;16:596–600. DOI 10.1197/jamia.M3096.
Introduction
The objective of the 2008 i2b2 obesity challenge [1] in natural
language processing (NLP) for clinical data was to evaluate
NLP systems on their performance in identifying patient
obesity and associated co-morbidities based on hospital dis-
charge summaries. Fifteen related diseases were considered:
Diabetes mellitus (DM), Hypercholesterolemia, Hypertriglyc-
eridemia, Hypertension (HTN), Atherosclerotic CV disease
(CAD), Heart failure (CHF), Peripheral vascular disease (PVD),
Venous insufficiency, Osteoarthritis (OA), Obstructive sleep
apnea (OSA), Asthma, GERD, Gallstones/Cholecystectomy,
Depression, and Gout. The aim was to label each document
with disease/co-morbidity status, indicating whether:
• a patient was diagnosed with a disease/co-morbidity
(Y—yes, disease present),
• a patient was diagnosed with not having a disease/co-
morbidity (N—no, disease absent),
• it was uncertain whether a patient had a disease/co-
morbidity or not (Q—questionable), or
• a disease/co-morbidity status was not mentioned in the
discharge summary (U—unmentioned).
The challenge consisted of two tasks, textual and intuitive.
The textual task was to identify explicit references to the
diseases in the narrative text. Each hospital report was to be
labeled using one of four possible disease status labels (Y, N,
Q, or U). The intuitive task focused on inferring the disease
status even when the evidence was not explicitly asserted.
Possible intuitive labels were Y, N, and Q for each disease.
The organizers provided a training set with 730 hospital
discharge summaries, manually annotated with both textual and intuitive disease status labels.
We implemented a hybrid approach that combined three
types of features: lexical, terminological and semantic, ex-
ploited by dictionary look-up, rule-based and machine-
learning methods. We assembled a set of resources to
lexically and semantically profile the diseases and their
associated symptoms, treatments, etc. The methods were
applied on a set of 507 previously unseen discharge sum-
maries, and the predictions were evaluated against the
manually prepared gold standard. In the textual task, a macro-
averaged F-measure (81%) for our approach was the highest
achieved in the challenge. In the intuitive task, we achieved the
macro-averaged F-measure of 63%. The micro-averaged F-measure showed an average accuracy of 97% for textual annotation and 96% for intuitive annotation, indicating the potential of text mining techniques to accurately extract the disease status from hospital discharge summaries.

Affiliation of the authors: School of Computer Science, University of Manchester, Manchester, UK; Dr. Yang is currently with the Department of Computing, Open University, UK.

This work was partially supported by the UK BBSRC project "Mining Term Associations from Literature to Support Knowledge Discovery in Biology". Irena Spasic gratefully acknowledges the support of the BBSRC and EPSRC via "The Manchester Centre for Integrative Systems Biology" grant.

Correspondence: Goran Nenadic, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; e-mail: G.Nenadic@manchester.ac.uk.

Received for review: 12/07/08; accepted for publication: 04/07/09.

Methods
The general idea underlying our approach was to identify
sentences that contained evidence to support a judgment for
a given disease, and then to integrate evidence gathered at
the sentence level to make a prediction at the document
level. The system workflow consisted of three major steps:
report pre-processing, textual prediction and intuitive pre-
diction, with the final integration of the textual and intuitive
results (see Fig 1). The prediction steps were applied for each
of the 16 diseases/co-morbidities separately.
The report pre-processing involved basic textual processing of
input discharge narratives. In the textual prediction step,
explicit evidence was identified and combined to derive
textual predictions. The intuitive prediction module focused
on capturing intuitive clues that could associate the report
with the disease. The final intuitive judgments were com-
bined with the textual ones. Figure 2 depicts a detailed
architecture of the system. In the following sections we describe each module and the basic steps performed (for further details see the JAMIA online data supplement).

[Figure 1. The general design of the system.]
[Figure 2. The system architecture.]
Report Pre-processing Module
Input discharge summaries were first split into sections
using a set of flexible lexical matching rules that identified
section titles and classified them into six predefined categories: "Diagnosis", "Past or Present History of Illness",
“Social/Family History”, “Physical or Laboratory Examina-
tion”, “Medication/Disposition”, and “Other”. Section titles
were recognized by matching the most frequent title key-
words collected semi-automatically from the training data-
set. In addition, each section type was assigned a weight
reflecting its predictive capacity for a given disease (see the
Training Data Analyses section). The sections were decom-
posed into sentences using LingPipe [2]. Part-of-Speech (POS)
tagging and shallow parsing were performed using the
GeniaTagger, which is specifically tuned for biomedical text [3].
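To make the section handling concrete, the following minimal sketch (our reconstruction, not the authors' code) illustrates title-based section classification; the keyword lists and the title regular expression are hypothetical stand-ins for the keywords collected semi-automatically from the training data:

```python
import re

# Hypothetical keyword lists; the actual ones were collected
# semi-automatically from the training dataset.
SECTION_KEYWORDS = {
    "Diagnosis": ("DIAGNOSIS",),
    "Past or Present History of Illness": ("HISTORY OF PRESENT ILLNESS", "PAST MEDICAL HISTORY"),
    "Social/Family History": ("SOCIAL HISTORY", "FAMILY HISTORY"),
    "Physical or Laboratory Examination": ("PHYSICAL EXAMINATION", "LABORATORY"),
    "Medication/Disposition": ("MEDICATIONS", "DISPOSITION"),
}

# A candidate section title: an upper-case phrase ending with a colon.
TITLE_LINE = re.compile(r"^\s*([A-Z][A-Z ()/]+):")

def classify_section_title(line):
    """Return the section category for a title line, or None if the
    line is not a section title at all."""
    match = TITLE_LINE.match(line)
    if match is None:
        return None
    title = match.group(1)
    for category, keywords in SECTION_KEYWORDS.items():
        # flexible lexical matching: any keyword contained in the title
        if any(keyword in title for keyword in keywords):
            return category
    return "Other"
```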
Textual Prediction Module
The main objective of this module was to identify sentences
that, given a disease, explicitly mentioned the disease itself
and/or associated clinical terms. We lexically profiled each
disease by collecting (1) its name and synonyms from public
resources including the UMLS [4], (2) disease sub-classes (e.g.,
diabetes type II) and their synonyms, (3) disease superclasses
(e.g., reflux for GERD and arthritis for OA) and their syn-
onyms, and (4) clinical terms closely related to the disease
(e.g., associated symptoms and treatments), imported from
public medical resources or selected from the training data-
set based on their occurrence statistics. All clinical terms
collected were assigned confidence levels taking into ac-
count the quality of the prediction results obtained from the
training dataset (available in the JAMIA online data supplement).
Initially, the sentences that contained any term from the
lexical profile were labeled with Y, and, in the subsequent
steps, the evidence was challenged and potentially reversed
to N, Q, or U based on the context in which they were used.
The sentence-based predictions were then combined at the
document level. The four processing steps in this module are
described briefly below (further details are given in the online supplement).
Step T1: Term matching. To cater for terminological varia-
tion, terms that characterize a disease were matched against
the text approximately, taking into account morphological
variants, and if necessary ignoring word order and tolerat-
ing the distance between the words within a term (e.g., both
“stent placement” and “placement of coronary stent” re-
ferred to the same treatment for CAD).
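As an illustration of such approximate matching, the sketch below (our own, with a crude suffix-stripping normalizer standing in for the unspecified morphological processing) requires all words of a term to occur in the sentence, in any order, within a bounded token window:

```python
def normalize(token):
    """Crude suffix stripping to catch morphological variants; the paper's
    exact normalization is not specified, so this is illustrative only."""
    token = token.lower().strip(".,;:")
    for suffix in ("ies", "ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_matches(term, sentence, max_span=6):
    """Approximate match: every word of the term occurs in the sentence,
    in any order, within a window of max_span tokens."""
    term_tokens = {normalize(t) for t in term.split()}
    sent_tokens = [normalize(t) for t in sentence.split()]
    positions = [i for i, t in enumerate(sent_tokens) if t in term_tokens]
    for start in positions:
        window = {sent_tokens[p] for p in positions if start <= p < start + max_span}
        if term_tokens <= window:
            return True
    return False

# Both phrasings of the CAD treatment example in the text match:
assert term_matches("stent placement", "placement of coronary stent")
```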
Step T2: Sentence filtering. Sentences that did not mention
a disease-related term were filtered out. We also discarded
sentences from the sections deemed less important for the
textual task (namely “Social/Family History” and “Other”),
sentences that potentially referred to family members, and
sentences containing ambiguous disease terms.
Step T3: Sentence labeling. After filtering, the remaining
sentences were initially considered to support the judgment
of disease presence (Y). We then applied a set of lexico-
semantic patterns (see Table 1 for examples) to potentially
re-label them with N, Q, or U judgments, using a pattern
matching algorithm similar to NegEx [5]. The patterns gener-
alized the structure of manually collected examples that
indicated negative, questionable or unmentioned status of
diseases. If any of these patterns was matched successfully,
the disease status was changed using the label associated
with the pattern.
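A minimal sketch of this re-labeling step, with illustrative templates loosely generalizing the Table 1 examples (the actual pattern set was manually curated and considerably richer):

```python
import re

# Illustrative templates in the spirit of Table 1 and NegEx [5]; the
# actual pattern set was assembled manually from the training data.
RELABEL_TEMPLATES = [
    (r"\b(no history of|negative for|denie[sd])\b[^.;]*\b{term}\b", "N"),
    (r"\?\s*{term}\b", "Q"),
    (r"\b(question|possibility) of\b[^.;]*\b{term}\b", "Q"),
    (r"\b(evaluate|assess)\s+for\b[^.;]*\b{term}\b", "Q"),
]

def label_sentence(sentence, disease_terms):
    """A sentence containing a disease term starts as Y and is re-labeled
    to N or Q when a lexico-semantic pattern matches (U is analogous)."""
    for term in disease_terms:
        for template, new_label in RELABEL_TEMPLATES:
            if re.search(template.format(term=re.escape(term)), sentence, re.I):
                return new_label
    return "Y"

assert label_sentence("negative for CHF", ["CHF"]) == "N"
assert label_sentence("?sleep apnea", ["sleep apnea"]) == "Q"
```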
Step T4: Result integration. When a report contained mul-
tiple sentences with conflicting labels associated, we em-
ployed a weighted voting scheme. The score for each disease
status label was obtained by collecting all sentences with the
given label, and adding up the weights associated with the
container sections. The highest-scored label was suggested
as the final annotation, with potential tie cases labeled as Q.
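The voting scheme can be sketched as follows; the section weights shown are hypothetical placeholders for the values estimated from the training data:

```python
from collections import defaultdict

# Hypothetical section weights; the real values were estimated from the
# training data (see the Training Data Analyses section).
SECTION_WEIGHTS = {"Diagnosis": 3.0, "Past or Present History of Illness": 2.0, "Other": 0.5}

def integrate_document(sentence_labels):
    """Weighted vote over (label, section_type) pairs for one document
    and one disease; ties are resolved as Q, as in step T4."""
    scores = defaultdict(float)
    for label, section in sentence_labels:
        scores[label] += SECTION_WEIGHTS.get(section, 1.0)
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else "Q"

print(integrate_document([("Y", "Diagnosis"), ("N", "Other")]))  # -> Y
```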
We submitted the results of two runs for the textual task: in
run 1, all clinical terms from the associated lexical profiles
were used, whereas clinical terms with lower confidence
were excluded in run 2.
Intuitive Prediction Module
The intuitive task focused on the prediction of the disease
status (Y, N, and Q) based on both explicit and implicit
textual assertions. We relied on a combination of term- and
clinical inference rule-matching to extract disease informa-
tion at the sentence level, and a supervised learning method
for disease status classification at the document level. The
module consisted of five steps, described briefly here (fur-
ther details are given in the online supplement).
Step I1: Candidate sentence identification. In the first step,
the system identified potential evidence sentences (labeled Y
initially) by looking for any of the following three evidence
types within the sentences:
a. Terms referring to the disease symptoms (e.g., RCA
occlusion for CAD). The first two intuitive runs (1 and 2)
differed in the predictive capacity of the symptoms used
(all terms vs. most important ones, respectively for runs 1
and 2; see the Training Data Analyses section).
b. Important clinical facts or conditions related to the disease (e.g., weight > 200 lbs; systolic blood pressure > 135). Around 20 manually designed inference rules were used (a sketch of such rules is given after this list).
c. Medications typically used to treat the disease and/or symptoms (appearing within the "Medication/Disposition" sections).
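For illustration, two of the roughly 20 inference rules might be encoded as follows; the thresholds follow the examples above (where the comparison operators were garbled in extraction and are assumed to be ">"), and the rule encoding itself is our assumption:

```python
import re

# Two illustrative inference rules in the spirit of step I1(b).
def weight_over_200_lbs(sentence):
    match = re.search(r"\bweight\b\D{0,20}(\d{2,3})\s*(?:lbs?|pounds)", sentence, re.I)
    return match is not None and int(match.group(1)) > 200

def systolic_bp_over_135(sentence):
    match = re.search(r"\bblood pressure\b\D{0,20}(\d{2,3})\s*/\s*\d{2,3}", sentence, re.I)
    return match is not None and int(match.group(1)) > 135

# A rule that fires yields a Y-labeled evidence sentence for its disease.
INFERENCE_RULES = {
    "Obesity": [weight_over_200_lbs],
    "Hypertension": [systolic_bp_over_135],
}
```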
Step I2: Sentence labeling. This step was analogous to
textual prediction (step T3).
Table 1. Examples of disease-status lexico-semantic patterns (target patterns are in italic)
• No history of coronary artery disease; negative for CHF; denied congestive heart failure; she does not have a history of GERD; no signs or sxms of heart failure
• ?sleep apnea; question of asthma; it is possible that she has sleep apnea; possibility of sleep apnea; gout may be involved in this problem
• No known diagnosis of CAD; assess for CAD was non-diagnostic; equivocal for coronary artery disease; need Cath to assess CAD; evaluate for PVD
• Normal coronaries; clear coronary arteries; gallbladder was normal with no stones; he is a thin, health-appearing black man
• We should also consider further gastroesophageal reflux studies as an outpatient; CAD assessment was not indicated
CHF = congestive heart failure; GERD = gastroesophageal reflux disease; CAD = coronary artery disease; PVD = peripheral vascular disease.
Step I3: Sentence-level result integration. Similarly to tex-
tual predictions (step T4), the integration of sentence-level
predictions was performed when some sentences had dif-
ferent labels attached for the same disease. Three factors
were considered: (a) the confidence level of disease symp-
tom terms found in the sentences; (b) the weight of the
section where the evidence appeared; and (c) the signifi-
cance of the three types of sentence evidence (step I1) for the given disease.
Step I4: Document-level labeling. This was an optional step
with only one run submitted (run 3, see below). We applied a
support vector machine (SVM) classifier to assign disease
labels at the document level. Phrases recognized in the
pre-processing stage by GeniaTagger were mapped to the
UMLS concepts using approximate string matching. Con-
cepts mentioned in a negative context were identified using
a negation module similar to NegEx. The weight assigned to
a feature was calculated as the difference between the
number of positive mentions of the corresponding concept
and their negative mentions. Finding questionable evidence
at a document level was considered infeasible (there were
too few examples for machine learning), so we trained a
binary SVM classifier that differentiated between potential Y
and N labels only.
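A minimal sketch of this classifier, using scikit-learn as an assumed implementation (the paper does not name an SVM library):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def concept_features(doc_concepts):
    """doc_concepts maps a UMLS concept to (positive_mentions,
    negative_mentions); the feature weight is their difference."""
    return {cui: pos - neg for cui, (pos, neg) in doc_concepts.items()}

vectorizer = DictVectorizer()
classifier = LinearSVC()  # binary: Y vs. N only

def train(docs, labels):
    X = vectorizer.fit_transform([concept_features(d) for d in docs])
    classifier.fit(X, labels)

def predict_with_margin(doc):
    """Return the label and the decision margin; a large positive margin
    can serve as the 'highly confident Y' test used in step I5."""
    x = vectorizer.transform([concept_features(doc)])
    return classifier.predict(x)[0], classifier.decision_function(x)[0]
```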
Step I5: Final result integration. Textual Y and N predic-
tions were given high confidence and were recycled as
intuitive predictions (see the section Training Data Analy-
ses). Only Q and U textual judgments were adjusted in cases
where intuitive evidence suggested different labels. More
precisely, when new implicit evidence was established for a
previously assigned textual Q or U judgment, then it was
changed to an intuitive Y, N, or Q label based on the
procedure described in steps I1–I3. If no new sentence-level
implicit evidence was established for a Q or U textual
judgment, then the SVM-based document classification was
taken into account. If the classifier produced a highly
confident Y label, then the final intuitive label for the disease
would be amended to Y. Otherwise, a textual Q judgment
would be kept unchanged, whereas a textual U judgment
would change to N in the final intuitive annotation. This
approach was used to provide the intuitive run 3.
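The resulting decision logic can be summarized in a short sketch (our paraphrase of steps I1–I5, not the authors' code):

```python
def final_intuitive_label(textual, implicit, svm_confident_y):
    """Step I5 decision logic. textual is the textual-task label;
    implicit is the I1-I3 sentence-level result (None if no new implicit
    evidence was found); svm_confident_y is the I4 classifier's verdict."""
    if textual in ("Y", "N"):      # recycled with high confidence
        return textual
    if implicit is not None:       # new implicit evidence takes priority
        return implicit            # one of Y, N, Q
    if svm_confident_y:            # confident document-level Y
        return "Y"
    return "Q" if textual == "Q" else "N"   # textual U defaults to N
```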
Experiments and Results
The training and testing data for the challenge were col-
lected from the Research Patient Data Repository of Partners HealthCare (see Table 2 for the distribution of the annotations provided manually by two experts).
Training Data Analyses
We compared textual and intuitive annotations assigned to
each document-disease pair (see Table 3). Intuitive annota-
tions largely agreed with the textual ones in case of Y and N
labels. Intuitive annotations differed primarily from the textual
Q and U labels. This observation motivated our integration
strategy—the intuitive results “inherited” all textual Y and N
predictions, and only Q and U textual labels were considered
eligible for re-annotation in the intuitive part.
The training data were further analyzed to estimate the
relevance of certain features and their predictive capacity.
We first analyzed the relevance of six section types. Relative
relevance weights were assigned to each section type based
on the ratio between the number of sentences in the given
section type whose labels were consistent with the expert-
generated judgments (at the document level) and the total
number of evidence sentences that supported the correct
annotations. This gave us the relative predictive capacity of the section types to enable inference of the document label.
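In symbols (our notation, not the authors'), the weight of section type $t$ for a given disease is

$$ w_t \;=\; \frac{\bigl|\{\, s \in E : \mathrm{type}(s) = t \,\}\bigr|}{|E|}, $$

where $E$ is the set of evidence sentences whose sentence-level labels were consistent with the expert-generated document-level judgments.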
Similar distributional analyses were performed for other
features (see the online supplement for further details).
Testing Environment and Results
Each of the 28 teams taking part in the challenge was
allowed to submit the results of up to three system runs. The
system performance was measured using a set of three stan-
dard measures: recall (R), precision (P) and F-measure. The
results were micro- and macro-averaged across the status
labels for each of the diseases considered. The overall perfor-
mance was measured in the same way for all diseases taken
together. The participating teams were primarily ranked based
on the macro-averaged F-measure. Hereafter, we only report a
single averaged score for the micro values as the values for
P-Micro, R-Micro and F-Micro were identical.
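For clarity, the two averaging schemes can be reconstructed as follows (standard definitions; the organizers' evaluation script is not reproduced here):

```python
from collections import Counter

def micro_macro_f(gold, predicted, labels=("Y", "N", "Q", "U")):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t, false_pos, false_neg):
        precision = t / (t + false_pos) if t + false_pos else 0.0
        recall = t / (t + false_neg) if t + false_neg else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # With exactly one predicted label per document-disease pair, every
    # false positive is also a false negative, so micro P = R = F equals
    # accuracy -- which is why a single micro value is reported.
    micro = sum(tp.values()) / len(gold)
    return micro, macro
```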
[Table 2. The distribution of annotations in the i2b2 obesity challenge datasets.]
[Table 3. Comparison of textual and intuitive annotations in the training data on the same documents.]
[Table 4. The summary of the evaluation of the two textual runs (micro and macro P, R, and F for runs 1 and 2).]
[Table 5. The summary of the evaluation of the three intuitive runs (micro and macro P, R, and F for runs 1, 2, and 3).]
The results of two textual runs were submitted (see Table 4).
Run 2 improved the results, but only by a small margin. The
macro-averaged F-measure for run 2 was the highest one
achieved in the challenge and was substantially better than
the mean average of all participating teams (81% versus 56%).
Similarly, the micro-averaged F-measure was high (97%),
compared to the mean average calculated for all participat-
ing teams (91%). A detailed analysis of the results is avail-
able in the online supplement.
The results of three runs were submitted for the intuitive
task (see Table 5): run 1 was the best run with the macro-
averaged F-measure of 63% (ranked 7th) and the micro-
averaged F-measure of 96% (ranked 5th overall). A detailed
analysis of the results is available in the online supplement.
Table 6 shows the detailed evaluation of the results for the
individual diseases. In the textual task, the micro-averaged
F-measure ranged from 92% (CAD) to 100% (hypertriglyceridemia),
whereas for the intuitive task it ranged from 89%
(depression) to 99% (OSA). The micro-averaged values were
more consistent across different diseases, whereas there
were substantial differences in the macro-evaluated metrics.
A detailed analysis and full discussion of the results are
available in the online supplement.
Discussion

The system implementing the methodology described
achieved excellent results with an average micro accuracy of
97% for the textual task and 96% for the intuitive task. The
macro-averaged F-measure of 81% for the textual task was
the highest achieved in the challenge, and the macro-
averaged F-measure of 63% (the highest was 66%) for the
intuitive task was ranked 7th out of 28. The macro-averaged
measures showed that prediction of questionable labels was
most challenging, in particular in the intuitive task.
The system’s performance may be improved in several
ways. More work is required to expand the set of clinical
inference rules and match them reliably in textual narra-
tives. Dynamic expansion that correctly maps ambiguous abbreviations to the corresponding medical terms should improve the identification of key clinical findings. Finally, estimating the discriminative power of the medications used to treat specific diseases should further improve the intuitive predictions.
Overall, the performance of our system and most of the
other systems developed for the i2b2 obesity challenge was
comparable to that of a human expert, indicating that text
mining techniques have substantial potential to accurately
and efficiently extract the disease status from hospital dis-
charge summaries. However, more research is required to
investigate if the methodologies used can be easily ported
between different areas of medical practice. For our system,
the infrastructure developed is general enough to be re-used
across the clinical domain. That said, a few details require
knowledge elicitation from domain experts or medical re-
sources, and manual changes to the system (e.g., clinical
inference rules). Still, a major bottleneck faced by medical
text mining systems in general is the provision of the
training data, which need to be analyzed manually and
statistically to identify clues to be exploited in both rule-
based and machine-learning approaches.
References

1. i2b2 obesity challenge. Available at: http://www.i2b2.org/. Accessed: Nov 23, 2008.
2. Carpenter B. Phrasal queries with LingPipe and Lucene: Ad hoc
genomics text retrieval. In: Proceedings of the 13th Annual Text
Retrieval Conference, 2004.
3. Tsuruoka Y, Tateishi Y, Kim J, et al. Developing a robust
part-of-speech tagger for biomedical text. Adv Inform 2005.
4. UMLS Knowledge Base. Available at: http://www.nlm.nih.gov/
research/umls. Accessed: Nov 23, 2008.
5. Chapman W, Bridewell W, Hanbury P, Cooper G, Buchanan B. A
simple algorithm for identifying negated findings and diseases in
discharge summaries. J Biomed Inform 2001;34:301–10.
[Table 6. Disease-based performance of the best textual (run 2) and intuitive (run 1) submissions: micro P, R, F and macro P, R, and F for each disease. OSA = obstructive sleep apnea; GERD = gastroesophageal reflux disease; CHF = congestive heart failure; OA = osteoarthritis; PVD = peripheral vascular disease; CAD = coronary artery disease.]