Five-way Smoking Status Classification Using Text Hot-Spot Identification and Error-correcting Output Codes

Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, 3181 S.W. Sam Jackson Park Road, Mail Code: BICC, Portland, OR, 97239-3098, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 10/2007; 15(1):32-5. DOI: 10.1197/jamia.M2434
Source: PubMed


We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, with micro- and macro-averaged F1 as the primary performance metrics. Our best-performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best-performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error-correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.
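
As an illustration of how these components might fit together, the following is a minimal sketch using scikit-learn. It is not the authors' implementation: the keyword pattern, the sentence window, the UNKNOWN fallback for zero vectors, and the omission of inverse class frequency weighting and post-processing rules are all simplifications made for the example.

```python
# Minimal sketch, not the authors' implementation. The keyword pattern, window
# size, UNKNOWN fallback, and omission of class weighting are assumptions.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

SMOKING_TERMS = re.compile(r"\b(smok\w*|tobacco|cigarette\w*|nicotine)\b", re.IGNORECASE)

def hot_spots(text, window=2):
    """Keep only the sentences surrounding each smoking-related keyword (the hot spots)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    keep = set()
    for i, sentence in enumerate(sentences):
        if SMOKING_TERMS.search(sentence):
            keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    return " ".join(sentences[i] for i in sorted(keep))

def train(texts, labels):
    """Vectorize hot-spot text and train error-correcting output codes over linear SVMs."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([hot_spots(t) for t in texts])
    classifier = OutputCodeClassifier(LinearSVC(), code_size=2.0, random_state=0)
    classifier.fit(X, labels)
    return vectorizer, classifier

def classify(vectorizer, classifier, text):
    """Zero-vector filtering: documents with no smoking-related hot spots skip the classifier."""
    X = vectorizer.transform([hot_spots(text)])
    if X.nnz == 0:
        return "UNKNOWN"  # assumed default class for documents without hot spots
    return classifier.predict(X)[0]
```

In this sketch, hot-spot identification keeps only the sentences surrounding smoking-related keywords, error-correcting output codes come from OutputCodeClassifier over linear SVMs, and a document whose hot-spot vector has no non-zero features never reaches the classifier, which is one plausible reading of zero-vector filtering.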

    • "One remedy to this issue is to use a ''hot-spotting'' technique to extract a few sentences context around a key word or phrase and then treat the task as document classification where the document is very short. This has worked well in i2b2 challenges in the past, for example in [12], where the technique is used for smoking status classification with annotations provided at the document level. "
    ABSTRACT: This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 Challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-score of 91.7%, one percent behind the top-performing system.
    Journal of Biomedical Informatics 07/2015; 15. DOI:10.1016/j.jbi.2015.06.030 · 2.19 Impact Factor
    • "Our intention was to create a classifier that could make accurate classification judgements without using many enriched-engineered features (e.g., named entity recognition, or part of speech tagging). We made an a priori decision to use a SVM classifier with linear kernel (Fan et al., 2008), using default parameter settings, as we have used this for a baseline classifier in previous work (Cohen, 2008; Ambert and Cohen, 2009, 2011; Cohen et al., 2009, 2010c). Our goal in developing the baseline classifier was not to create a highly-optimized classifier tuned to work on neuroscience publications. "
    ABSTRACT: The frequency and volume of newly-published scientific literature is quickly making manual maintenance of publicly-available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing relevant information for a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, in the context of scientific database curation, it falls short. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents having information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications from the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by a factor of 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail, and evaluate its performance against other approaches to sampling in AL.
    Frontiers in Neuroinformatics 12/2013; 7:38. DOI:10.3389/fninf.2013.00038 · 3.26 Impact Factor
    • "There are also efforts to identify relevant text passages from patient records. For example, Cohen (2008) explores a hotspot identification method using specific words of interest. However, patient records offer a significantly different ‘document context’ for summarization research. "
    ABSTRACT: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long-term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents. Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts.
    Bioinformatics 07/2011; 27(13):i120-8. DOI:10.1093/bioinformatics/btr223 · 4.98 Impact Factor
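
The uncertainty-sampling loop described in the Frontiers in Neuroinformatics entry above can be illustrated with a generic pool-based active learning sketch. This is not the Virk system itself; the batch size and the distance-to-hyperplane query criterion are assumptions made for the example.

```python
# Generic pool-based active learning sketch, not the Virk system. The batch size
# and the distance-to-hyperplane query criterion are assumptions for illustration.
import numpy as np
from sklearn.svm import LinearSVC

def active_learning_round(X_labeled, y_labeled, X_pool, batch_size=10):
    """Train a linear SVM on the current labeled set, then return the indices of
    the pool documents closest to the decision boundary for annotation next."""
    classifier = LinearSVC()  # linear kernel, default parameters, as in the baseline above
    classifier.fit(X_labeled, y_labeled)
    margins = np.abs(classifier.decision_function(X_pool))
    if margins.ndim > 1:  # multi-class case: take the smallest per-class margin
        margins = margins.min(axis=1)
    query_indices = np.argsort(margins)[:batch_size]  # most uncertain documents
    return classifier, query_indices
```

Each round, the selected documents would be manually annotated, moved into the labeled set, and the classifier retrained.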
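
The MeSH-term-guided sentence selection described in the Bioinformatics entry above could look roughly like the following. This is a simplified sketch, not the authors' summarizer; the word-overlap scoring, tokenization, and summary length are assumptions made for the example.

```python
# Simplified sketch of MeSH-term-guided sentence selection, not the authors'
# summarizer. The overlap score, tokenization, and summary length are assumptions.
import re

def mesh_summary(full_text, mesh_terms, n_sentences=10):
    """Rank sentences by how many MeSH-term words they contain and return the
    top-ranked sentences in their original document order."""
    term_words = {w.lower() for term in mesh_terms for w in re.findall(r"\w+", term)}
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    scored = [(sum(w.lower() in term_words for w in re.findall(r"\w+", s)), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:n_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

Summaries selected this way could then be scored against baselines with ROUGE-1, ROUGE-2 and ROUGE-SU4, as in the evaluation above.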