Mayo Clinic NLP system for patient smoking status identification.

Biomedical Informatics Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55902, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.93). 10/2007; 15(1):25-8. DOI: 10.1197/jamia.M2437
Source: PubMed

ABSTRACT This article describes our system entry for the 2006 I2B2 contest "Challenges in Natural Language Processing for Clinical Data" for the task of identifying the smoking status of patients. Our system makes the simplifying assumption that patient-level smoking status can be determined by accurately classifying individual sentences from a patient's record. We built the system from reusable text analysis components based on the Unstructured Information Management Architecture and Weka. This reuse of code minimized the development effort specific to our smoking status classifier. We report precision, recall, F-score, and 95% exact confidence intervals for each metric. Recasting the classification task at the sentence level and reusing code from other text analysis projects allowed us to quickly build a classification system that achieves a system F-score of 92.64 on held-out data tests and of 85.57 on the formal evaluation data. Our general medical natural language engine is easily adaptable to a real-world medical informatics application. Limitations of the system for this use case include negation detection and temporal resolution.
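
The metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how its "exact" intervals were computed; the code below assumes the Clopper-Pearson binomial interval, a common reading of that term, and applies it only to precision and recall, since the F-score is not a simple binomial proportion. The function names and the counts in the usage note are hypothetical and are not taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Root of a function f that is decreasing on [lo, hi] with f(lo) > 0 > f(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials (assumed method)."""
    # Lower bound solves P(X >= k | p) = alpha / 2; it is 0 when k == 0.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound solves P(X <= k | p) = alpha / 2; it is 1 when k == n.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

With hypothetical counts, `precision_recall_f(8, 2, 2)` gives the three point estimates, `exact_ci(8, 10)` gives the interval for precision (8 correct out of 10 predicted), and the same call with the denominator `tp + fn` gives the interval for recall.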

  • ABSTRACT: In modern electronic medical records (EMR) much of the clinically important data - signs and symptoms, symptom severity, disease status, etc. - are not provided in structured data fields, but rather are encoded in clinician generated narrative text. Natural language processing (NLP) provides a means of "unlocking" this important data source for applications in clinical decision support, quality assurance, and public health. This chapter provides an overview of representative NLP systems in biomedicine based on a unified architectural view. A general architecture in an NLP system consists of two main components: background knowledge that includes biomedical knowledge resources and a framework that integrates NLP tools to process text. Systems differ in both components, which we will review briefly. Additionally, challenges facing current research efforts in biomedical NLP include the paucity of large, publicly available annotated corpora, although initiatives that facilitate data sharing, system evaluation, and collaborative work between researchers in clinical NLP are starting to emerge.
  • ABSTRACT: The electronic medical record has evolved from a digital representation of individual patient results and documents to information of large scale and complexity. Big Data refers to new technologies providing management and processing capabilities, targeting massive and disparate data sets. For an individual patient, techniques such as Natural Language Processing allow the integration and analysis of textual reports with structured results. For groups of patients, Big Data offers the promise of large-scale analysis of outcomes, patterns, temporal trends, and correlations. The evolution of Big Data analytics moves us from description and reporting to forecasting, predictive modeling, and decision optimization.
    The Journal of ambulatory care management 07/2014; 37(3):206-210. DOI:10.1097/JAC.0000000000000037
  • ABSTRACT: Background: Atherosclerotic vascular disease (AVD), a leading cause of morbidity and mortality, is increasing in prevalence in the developing world. We describe an approach to establish a biorepository linked to medical records with the eventual goal of facilitating discovery of biomarkers for AVD. Methods: The Vascular Disease Biorepository at Mayo Clinic was established to archive DNA, plasma, and serum from patients with suspected AVD. AVD phenotypes, relevant risk factors and comorbid conditions were ascertained by electronic medical record (EMR)-based electronic algorithms that included diagnosis and procedure codes, laboratory data and text searches to ascertain medication use. Results: Up to December 2012, 8800 patients referred for vascular ultrasound examination and non-invasive lower extremity arterial evaluation were approached, of whom 5268 consented. The mean age of the initial 2182 patients recruited was 70.4 ± 11.2 years, 62.6% were men and 97.6% were white. The prevalences of AVD phenotypes were: carotid artery stenosis 48%, abdominal aortic aneurysm 21% and peripheral arterial disease 38%. Positive predictive values for electronic phenotyping algorithms were > 0.90 for cases (and > 0.95 for controls) for each AVD phenotype, using manual review of the EMR as the gold standard. The prevalences of risk factors and comorbidities were as follows: hypertension 78%, diabetes 29%, dyslipidemia 73%, smoking 70%, coronary heart disease 37%, heart failure 12%, cerebrovascular disease 20% and chronic kidney disease 19%. Conclusions: Our study demonstrates the feasibility of establishing a biorepository of plasma, serum and DNA, with relatively rapid annotation of clinical variables using EMR-based algorithms.
    03/2013; 2013(1):82-90. DOI:10.5339/gcsp.2013.10
