Five-way Smoking Status Classification Using Text Hot-Spot
Identification and Error-correcting Output Codes
AARON M. COHEN, MD, MS
Abstract
We participated in the i2b2 smoking status classification challenge task. The purpose of this task
was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries.
Our submission included several techniques that we compared and studied, including hot-spot identification,
zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules.
We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-
averaged F1 as the primary performance metric. Our best performing system achieved a micro-F1 of 0.9000 on the
test collection, equivalent to the best performing system submitted to the i2b2 challenge. Hot-spot identification,
zero-vector filtering, classifier weighting, and error-correcting output coding contributed additively to increased
performance, with hot-spot identification having by far the largest positive effect. High performance on automatic
identification of patient smoking status from discharge summaries is achievable with the efficient and
straightforward machine learning techniques studied here.
J Am Med Inform Assoc. 2008;15:32–35. DOI 10.1197/jamia.M2434.
Automated document classification can be a powerful technique
to aid biomedical researchers by reducing the human
effort needed to make repeated decisions categorizing samples
of text. It is useful when each of the samples from a
given source needs to be categorized into one or more of
several pre-defined categories (labels or classes), and there
exists (or can easily be created) a set of previously
categorized documents, known as training data, usually
created by human experts. In this situation, machine learning
algorithms can be used to extract a set of rules or
procedures from the training data and apply these rules to
new, previously unseen documents, accurately predicting
the labels that should be assigned to these documents.
The i2b2 challenge modeled the task of identifying patient
records of interest in terms of smoking status. The challenge
was organized to evaluate automated systems identifying
the smoking status of a patient from a hospital discharge
summary. Smoking status was defined as one of five mutually
exclusive categories: UNKNOWN, NON-SMOKER,
SMOKER (current status unknown), PAST SMOKER, and
CURRENT SMOKER.
Human experts created training and test sets by manually
assigning labels. The task organizers studied the perfor-
mance of the human annotators during the creation of the
test collections. Two human annotators agreed about 81.5%
of the time on which one of the five labels to assign to each
summary. If automated systems can perform at this level,
such systems could be useful in a clinical research setting.
Classifier System Approach
Our approach to this multi-class classification problem
uses a sequence of five steps.
Hot-spot Passage Isolation
We hypothesized that words would have different effects on
the classification depending upon where they occurred in
the discharge summary, and that words occurring near text
describing the smoking status of the individual would be the
most important. Since our basic classifier approach was to
treat features extracted from the text without specific position
information (“bag of words”),1 we needed a simple way
to incorporate this positional information into the algorithm.
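To illustrate why a bag-of-words representation carries no position information, the following minimal Python sketch may help. The regular-expression tokenizer and the two example sentences are assumptions made for illustration; they are not the tokenizer or data used in this work.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Two hypothetical sentences with the same words in different positions.
a = bag_of_words("Patient denies smoking. Father was a smoker.")
b = bag_of_words("Father denies smoking. Patient was a smoker.")

# The two bags are identical: who smokes is lost once position is discarded.
same = (a == b)
```

Because both sentences map to the same feature counts, a classifier working on these features alone cannot tell which words refer to the patient, which is the motivation for restricting features to text near the smoking-status mention.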
We found that there were a small number of patterns in the
training data that indicated that the nearby text pertained to
the patient’s smoking status. We called this text “hot-spots,”
from which we could extract the word-based features. Using
cross-validation we found good performance by simply
taking a window of text up to 100 characters before and after
an identified hot-spot. The hot-spot identified passages were
then isolated and passed on to the next step in the process.
The rest of the discharge summary was ignored. The hot-
spot identifying text patterns that we used are shown in
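The window extraction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the pattern list in `HOT_SPOT` is hypothetical and stands in for the actual hot-spot patterns, and merging overlapping windows is a design choice of this sketch rather than something the text specifies. Only the 100-character window size comes from the description above.

```python
import re

# Hypothetical hot-spot patterns, for illustration only; these are not
# the patterns used in the system described here.
HOT_SPOT = re.compile(r"smok|tobacco|cigar", re.IGNORECASE)
WINDOW = 100  # characters kept before and after each hot-spot

def hot_spot_passages(summary, window=WINDOW):
    """Return the text windows around each hot-spot match, merging overlaps."""
    spans = []
    for m in HOT_SPOT.finditer(summary):
        start = max(0, m.start() - window)
        end = min(len(summary), m.end() + window)
        if spans and start <= spans[-1][1]:
            # Overlaps the previous window: extend it (assumed behavior).
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    # Text outside the windows is dropped entirely.
    return [summary[s:e] for s, e in spans]
```

A summary with no match yields an empty list, so a later step has to decide how to label documents that contain no hot-spot at all.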
Tokenization and Vectorization
Once the hot-spot identified passages were isolated, these were
tokenized into individual words and symbols using the Stan-
dardAnalyzer module available in the Apache Lucene search
engine library (available at http://jakarta.apache.org/
Affiliation of the author: Department of Medical Informatics and
Clinical Epidemiology, School of Medicine, Oregon Health & Sci-
ence University, Portland, OR.
Correspondence: Aaron M. Cohen, MD, MS, Department of Medical
Informatics and Clinical Epidemiology, School of Medicine, Oregon
Health & Science University, 3181 S.W. Sam Jackson Park Road,
Mail Code: BICC, Portland, OR, 97239-3098; e-mail: cohenaa@
Received for review: 03/13/07; accepted for publication: 10/03/07.