Five-way Smoking Status Classification Using Text Hot-Spot Identification and Error-correcting Output Codes

Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, 3181 S.W. Sam Jackson Park Road, Mail Code: BICC, Portland, OR, 97239-3098, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.93). 10/2007; 15(1):32-5. DOI: 10.1197/jamia.M2434
Source: PubMed

ABSTRACT We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-averaged F1 as the primary performance metric. Our best performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error-correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.
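The micro- and macro-averaged F1 metrics named in the abstract can be illustrated with a minimal sketch. It is not the i2b2 organizers' evaluation script: the function name `f1_scores`, the shortened label strings, and the six toy documents are all hypothetical, and the macro average here is taken over the labels that occur in the data, which may differ from the official convention.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: one CURRENT document is mislabeled as PAST.
gold = ["UNKNOWN", "UNKNOWN", "UNKNOWN", "NON-SMOKER", "CURRENT", "PAST"]
pred = ["UNKNOWN", "UNKNOWN", "UNKNOWN", "NON-SMOKER", "PAST", "PAST"]
micro, macro = f1_scores(gold, pred)
print(f"micro-F1 = {micro:.4f}, macro-F1 = {macro:.4f}")
```

When every document receives exactly one label, micro-F1 equals plain accuracy, so it is dominated by frequent classes. Macro-F1 weights each class equally, so a missed rare class lowers it much more.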

  • Source
    • "Many top systems assigned an "unknown" label if they did not find smoking-related information in the document [5] [6] [7] [8]. The majority of the systems used machine learning approaches for the classification [3] [4] [7] [9] [10], some used rule-based methods [11], some employed both of them [5] [6] [8], and others used their own methods [12] [13]."
    ABSTRACT: This paper describes improvements of and extensions to the Mayo Clinic 2006 smoking status classification system. The new system aims at addressing some of the limitations of the previous one. The performance improvements were mainly achieved through remodeling the negation detection for non-smoker, temporal resolution to distinguish a past and current smoker, and improved detection of the smoking status category of unknown. In addition, we introduced a rule-based component for patient-level smoking status assignments in which the individual smoking statuses of all clinical documents for a given patient are aggregated and analyzed to produce the final patient smoking status. The enhanced system builds upon components from Mayo's clinical Text Analysis and Knowledge Extraction System developed within IBM's Unstructured Information Management Architecture framework. This reusability minimized the development effort. The extended system is in use to identify smoking status risk factors for a peripheral artery disease NHGRI study.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2009; 2009:619-23.
  • Source
    • "There is a significant body of literature on classifying patient outcomes from discharge summaries, which might more generally be termed text classification in the NLP community. Examples of recent work in this area include: classifying patients' smoking status [10] [13] [3] [2], adverse event detection [7], and foot examination findings [9]. This goal of labeling discharge summaries with patient outcomes is similar to our task, except that we have a larger number of possible labels (20 predetermined, and an open set of unknown labels) than is typical in this domain (which ranges from binary to five-way outcomes)."
  • Source
    ABSTRACT: Clinical texts constitute a very important source of information for medical studies. This huge collection of data lends itself to the application of the many biomedical text mining tools that help automate its analysis. Tasks such as classification according to predefined categories, clustering, and extraction of specific entities, such as diseases and medications, may be of great importance in aiding doctors in their work. Botero is a Java library developed for the Second i2b2 challenge that performs clinical text classification according to predefined diseases related to the obesity domain.