Mayo Clinic NLP System for Patient Smoking Status Identification

Biomedical Informatics Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55902, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 10/2007; 15(1):25-8. DOI: 10.1197/jamia.M2437
Source: PubMed


This article describes our system entry for the 2006 I2B2 contest "Challenges in Natural Language Processing for Clinical Data" for the task of identifying the smoking status of patients. Our system makes the simplifying assumption that patient-level smoking status can be determined by accurately classifying individual sentences from a patient's record. We built the system from reusable text analysis components based on the Unstructured Information Management Architecture and Weka. This reuse of code minimized the development effort specific to our smoking status classifier. We report precision, recall, F-score, and 95% exact confidence intervals for each metric. Recasting the classification task at the sentence level and reusing code from other text analysis projects allowed us to quickly build a classification system that achieves a system F-score of 92.64 on held-out data and 85.57 on the formal evaluation data. Our general medical natural language engine is easily adaptable to a real-world medical informatics application. The main limitations for this use case are negation detection and temporal resolution.
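
The sentence-to-patient idea and the reported metrics can be illustrated with a minimal Python sketch. It is not the authors' UIMA/Weka implementation. The label names, the precedence rule in `patient_status`, and the reading of "exact" as the Clopper-Pearson binomial interval are all assumptions made for illustration.

```python
from math import comb

# Hypothetical label set and precedence order; the abstract specifies neither.
PRECEDENCE = ["CURRENT", "PAST", "NON-SMOKER", "UNKNOWN"]


def patient_status(sentence_labels):
    """Roll sentence-level labels up to a single patient-level label."""
    present = set(sentence_labels)
    for label in PRECEDENCE:
        if label in present:
            return label
    return "UNKNOWN"  # no classified sentences at all


def precision_recall_f(gold, predicted, label):
    """One-vs-rest precision, recall, and F-score for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_increasing(func, iterations=60):
    """Bisection on [0, 1] for a function that increases through zero."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(successes, trials, alpha=0.05):
    """Clopper-Pearson interval for a proportion such as precision or recall."""
    k, n = successes, trials
    lower = 0.0 if k == 0 else _root_of_increasing(
        lambda p: (1 - _binom_cdf(k - 1, n, p)) - alpha / 2)
    upper = 1.0 if k == n else _root_of_increasing(
        lambda p: alpha / 2 - _binom_cdf(k, n, p))
    return lower, upper
```

For example, `patient_status(["UNKNOWN", "PAST", "UNKNOWN"])` yields `"PAST"` under the assumed precedence. The interval function applies to simple proportions (precision and recall); how the authors derived an interval for the F-score is not stated in the abstract.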

Available from: Christopher G Chute, Jul 14, 2014
    • "Important study variables, such as smoking status [57], [58], co-morbidities [59], and family disease history [60] are often missing from the coded record and more likely to appear in physician notes. These variables can often be extracted [61], [62] using Natural Language Processing (NLP)."
    ABSTRACT: Results of medical research studies are often contradictory or cannot be reproduced. One reason is that there may not be enough patient subjects available for observation for a long enough time period. Another reason is that patient populations may vary considerably with respect to geographic and demographic boundaries, thus limiting how broadly the results apply. Even when similar patient populations are pooled together from multiple locations, differences in medical treatment and record systems can limit which outcome measures can be commonly analyzed. In total, these differences in medical research settings can lead to differing conclusions or can even prevent some studies from starting. We thus sought to create a patient research system that could aggregate as many patient observations as possible from a large number of hospitals in a uniform way. We call this system the 'Shared Health Research Information Network' (SHRINE), with the following properties: (1) reuse electronic health data from everyday clinical care for research purposes, (2) respect patient privacy and hospital autonomy, (3) aggregate patient populations across many hospitals to achieve statistically significant sample sizes that can be validated independently of a single research setting, (4) harmonize the observation facts recorded at each institution such that queries can be made across many hospitals in parallel, (5) scale to regional and national collaborations. The purpose of this report is to provide open source software for multi-site clinical studies and to report on early uses of this application. At this time SHRINE implementations have been used for multi-site studies of autism co-morbidity, juvenile idiopathic arthritis, peripartum cardiomyopathy, colorectal cancer, diabetes, and others. The wide range of study objectives and growing adoption suggest that SHRINE may be applicable beyond the research uses and participating hospitals named in this report.
    PLoS ONE 03/2013; 8(3):e55811. DOI:10.1371/journal.pone.0055811 · 3.23 Impact Factor
    • "There is a substantial amount of literature on identifying and extracting information from EMRs [12]. Machine-learning methods have been used for different classifications tasks based on electronic medical records such as identification of patients with various conditions [6,13-17], automatic coding [7,18,19], identifying candidates in need of therapy [20], identifying clinical entries of interest [21], and identifying smoking status [22,23]. Schuemie et al. [24] compared several machine-learning methods for identifying patients with liver disorder from free-text medical records. "
    ABSTRACT: Background: Distinguishing cases from non-cases in free-text electronic medical records is an important initial step in observational epidemiological studies, but manual record validation is time-consuming and cumbersome. We compared different approaches to develop an automatic case identification system with high sensitivity to assist manual annotators. Methods: We used four different machine-learning algorithms to build case identification systems for two data sets, one comprising hepatobiliary disease patients, the other acute renal failure patients. To improve the sensitivity of the systems, we varied the imbalance ratio between positive cases and negative cases using under- and over-sampling techniques, and applied cost-sensitive learning with various misclassification costs. Results: For the hepatobiliary data set, we obtained a high sensitivity of 0.95 (on a par with manual annotators, as compared to 0.91 for a baseline classifier) with specificity 0.56. For the acute renal failure data set, sensitivity increased from 0.69 to 0.89, with specificity 0.59. Performance differences between the various machine-learning algorithms were not large. Classifiers performed best when trained on data sets with imbalance ratio below 10. Conclusions: We were able to achieve high sensitivity with moderate specificity for automatic case identification on two data sets of electronic medical records. Such a highly sensitive case identification system can be used as a pre-filter to significantly reduce the burden of manual record validation.
    BMC Medical Informatics and Decision Making 03/2013; 13(1):30. DOI:10.1186/1472-6947-13-30 · 1.83 Impact Factor
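
The under-sampling idea and the two reported metrics in the abstract above can be sketched in a few lines of Python. This is an illustrative sketch only, not the cited study's code: the function names, the fixed random seed, and the choice to define the imbalance ratio as negatives per positive are assumptions.

```python
import random


def undersample_negatives(positives, negatives, max_ratio, seed=0):
    """Randomly keep at most max_ratio negatives per positive example."""
    keep = min(len(negatives), int(max_ratio * len(positives)))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return list(positives), rng.sample(list(negatives), keep)


def sensitivity_specificity(gold, predicted):
    """Sensitivity and specificity for binary labels (1 = case, 0 = non-case)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == 1 and p == 1 for g, p in pairs)
    fn = sum(g == 1 and p == 0 for g, p in pairs)
    tn = sum(g == 0 and p == 0 for g, p in pairs)
    fp = sum(g == 0 and p == 1 for g, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

With 5 positives and 100 negatives, `undersample_negatives(pos, neg, 10)` keeps 50 negatives, giving an imbalance ratio of 10.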
    • "Dyslipidemia was defined as total cholesterol ≥ 220 mg/dL, or high-density lipoprotein cholesterol ≤ 40 mg/dL in men or ≤ 45 mg/dL in women, triglycerides ≥ 200 mg/dL, or the use of lipid-lowering medications. Smoking status was ascertained by NLP as described previously [6] and smokers were defined as either current or past smokers."
    ABSTRACT: Background: Atherosclerotic vascular disease (AVD), a leading cause of morbidity and mortality, is increasing in prevalence in the developing world. We describe an approach to establish a biorepository linked to medical records with the eventual goal of facilitating discovery of biomarkers for AVD. Methods: The Vascular Disease Biorepository at Mayo Clinic was established to archive DNA, plasma, and serum from patients with suspected AVD. AVD phenotypes, relevant risk factors and comorbid conditions were ascertained by electronic medical record (EMR)-based electronic algorithms that included diagnosis and procedure codes, laboratory data and text searches to ascertain medication use. Results: Up to December 2012, 8800 patients referred for vascular ultrasound examination and non-invasive lower extremity arterial evaluation were approached, of whom 5268 consented. The mean age of the initial 2182 patients recruited was 70.4 ± 11.2 years, 62.6% were men and 97.6% were whites. The prevalences of AVD phenotypes were: carotid artery stenosis 48%, abdominal aortic aneurysm 21% and peripheral arterial disease 38%. Positive predictive values for electronic phenotyping algorithms were >0.90 for cases (and >0.95 for controls) for each AVD phenotype, using manual review of the EMR as the gold standard. The prevalences of risk factors and comorbidities were as follows: hypertension 78%, diabetes 29%, dyslipidemia 73%, smoking 70%, coronary heart disease 37%, heart failure 12%, cerebrovascular disease 20% and chronic kidney disease 19%. Conclusions: Our study demonstrates the feasibility of establishing a biorepository of plasma, serum and DNA, with relatively rapid annotation of clinical variables using EMR-based algorithms.
    03/2013; 2013(1):82-90. DOI:10.5339/gcsp.2013.10
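
The rule-based definitions quoted for this study translate directly into code. The sketch below encodes the dyslipidemia thresholds and the smoker definition as stated in the excerpt. The `LipidPanel` type, its field names, the `"M"`/`"F"` sex coding, and the lowercase status strings are hypothetical and are not taken from the study's algorithms.

```python
from dataclasses import dataclass


@dataclass
class LipidPanel:
    """Hypothetical record layout; all lipid values are in mg/dL."""
    total_cholesterol: float
    hdl: float
    triglycerides: float
    sex: str  # assumed coding: "M" or "F"
    on_lipid_lowering_medication: bool = False


def has_dyslipidemia(panel):
    """Apply the thresholds quoted in the excerpt above."""
    hdl_cutoff = 40 if panel.sex == "M" else 45
    return (
        panel.total_cholesterol >= 220
        or panel.hdl <= hdl_cutoff
        or panel.triglycerides >= 200
        or panel.on_lipid_lowering_medication
    )


def is_smoker(nlp_status):
    """Smokers are current or past smokers, per the excerpt."""
    return nlp_status in {"current", "past"}
```

For instance, an HDL of 42 mg/dL meets the criterion for a woman (cutoff 45) but not for a man (cutoff 40) when the other values are below threshold.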