Mayo Clinic NLP System for Patient Smoking Status Identification

Biomedical Informatics Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55902, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.93). 10/2007; 15(1):25-8. DOI: 10.1197/jamia.M2437
Source: PubMed

ABSTRACT This article describes our system entry for the 2006 I2B2 contest "Challenges in Natural Language Processing for Clinical Data" for the task of identifying the smoking status of patients. Our system makes the simplifying assumption that patient-level smoking status determination can be achieved by accurately classifying individual sentences from a patient's record. We created our system with reusable text analysis components built on the Unstructured Information Management Architecture and Weka. This reuse of code minimized the development effort related specifically to our smoking status classifier. We report precision, recall, F-score, and 95% exact confidence intervals for each metric. Recasting the classification task for the sentence level and reusing code from other text analysis projects allowed us to quickly build a classification system that performs with a system F-score of 92.64 based on held-out data tests and of 85.57 on the formal evaluation data. Our general medical natural language engine is easily adaptable to a real-world medical informatics application. Some of the limitations as applied to the use-case are negation detection and temporal resolution.
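The metrics named in the abstract can be illustrated with a minimal sketch. It assumes that "exact" refers to the Clopper-Pearson binomial interval, which the abstract does not state, and the counts are hypothetical, not the paper's results. Precision and recall are simple proportions, so the interval applies to them directly; the F-score is not a plain proportion, and how the authors derived its interval is not described here, so the sketch reports it only as a point value.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(g, target, iters=60):
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials (assumed method)."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def prf(tp, fp, fn):
    """Precision, recall, and balanced F-score from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts for one smoking-status class (not from the paper).
tp, fp, fn = 8, 2, 2
precision, recall, f_score = prf(tp, fp, fn)
print("precision", precision, exact_ci(tp, tp + fp))
print("recall", recall, exact_ci(tp, tp + fn))
print("F-score", f_score)
```

Bisection on the binomial tail keeps the sketch free of third-party dependencies; a statistics library's beta quantile function would give the same bounds more directly.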

Available from: Christopher G Chute, Jul 14, 2014
  • Source
    • "Only the last sentence is used when a document contains multiple sentences with smoking references. Savova et al. [9] present a layered approach that first identifies lexical smoker features from sentences, then systematically applying Support Vector Machine (SVM) classifiers to filter out nonsmokers. By eliminating nonsmokers, Savova treated the remaining classification task as a temporal classification task to identify past and current smokers. "
    ABSTRACT: The ability to connect the dots in structured background knowledge and also across scientific literature has been demonstrated as a critical aspect of knowledge discovery. It is not unreasonable therefore to expect that connecting-the-dots across massive amounts of healthcare data may also lead to new insights that could impact diagnosis, treatment and overall patient care. Of critical importance is the observation that while structured Electronic Medical Records (EMR) are useful sources of health information, it is often the unstructured clinical texts such as progress notes and discharge summaries that contain rich, updated and granular information. Hence, by coupling structured EMR data with data from unstructured clinical texts, more holistic patient records, needed for connecting the dots, can be obtained. Unfortunately, free-text progress notes are fraught with a lack of proper grammatical structure, and contain liberal use of jargon and abbreviations, together with frequent misspellings. While these notes still serve their intended purpose for medical care, automatically extracting semantic information from them is a complex task. Overcoming this complexity could mean that evidence-based support for structured EMR data using unstructured clinical texts, can be provided. In this work therefore, we explore a pattern-based approach for extracting Smoker Semantic Types (SST) from unstructured clinical notes, in order to enable evidence-based resolution of SSTs asserted in structured EMRs using SSTs extracted from unstructured clinical notes. Our findings support the notion that information present in unstructured clinical text can be used to complement structured healthcare data. This is a crucial observation towards creating comprehensive longitudinal patient models for connecting-the-dots and providing better overall patient care.
    IEEE International Conference on Bioinformatics and Biomedicine: The First International Workshop on the Role of Semantic Web in Literature-Based Discovery (SWLBD2012), Philadelphia, Pennsylvania; 10/2012
  • Source
    • "Some records are assigned generic codes by the GP, but nevertheless contain potentially important information in the free text: (a) an example record, consisting of a Read code and its textual description followed by free text; (b) uncoded symptoms and signs in this record (highlighted in bold in the free text). [...] of patients [25]. Recently, more general medical text processing tools have been developed and made available [4] [24]."
    ABSTRACT: The UK General Practice Research Database (GPRD) is a valuable source of information for health services research. It contains coded data supplemented by free text (physicians' notes and letters). However, due to the difficulty of extracting useful information and the cost of anonymisation, this text is seldom utilised in epidemiological research. We annotated the records of 344 women in the year prior to a diagnosis of ovarian cancer and developed a method for automatically detecting mentions of symptoms in text. We estimated the incidence of five commonly presenting symptoms using: (1) coded symptoms, (2) codes augmented by symptoms automatically extracted from text, and (3) a 'gold standard' dataset of codes and text tagged by three clinically trained annotators. The estimates of incidence of each symptom increased by at least 40% when coded information was enhanced using the manually tagged free text. Our automatic method extracted a significant proportion of this extra information. Our straightforward approach should be extremely useful for medical researchers who wish to validate studies based on codes, or to accurately assess symptoms, using information that can be automatically extracted from unanonymised free text.
  • Source
    • "Studies have demonstrated some success in using natural language processing (NLP) systems to identify disease. As they become more reliable and widely available, the use of NLPs is likely to greatly enhance the value of text-based messages (McCowan, Moore, and Fry 2006; McCowan et al. 2007; Pakhomov et al. 2007; Savova et al. 2008). For many diseases, diagnosis and treatment are provided exclusively in the outpatient setting. "
    ABSTRACT: Research on pressing health services and policy issues requires access to complete, accurate, and timely patient and organizational data. This paper describes how administrative and health records (including electronic medical records) can be linked for comparative effectiveness and health services research. We categorize the major agents (i.e., who owns and controls data and who carries out the data linkage) into three areas: (1) individual investigators; (2) government sponsored linked data bases; and (3) public-private partnerships that facilitate linkage of data owned by private organizations. We describe challenges that may be encountered in the linkage process, and the benefits of combining secondary databases with primary qualitative and quantitative sources. We use cancer care research to illustrate our points. To fill the gaps in the existing data infrastructure, additional steps are required to foster collaboration among institutions, researchers, and public and private components of the health care sector. Without such effort, independent researchers, governmental agencies, and nonprofit organizations are likely to continue building upon a fragmented and costly system with limited access. Discussion. Without the development and support for emerging information technologies across multiple health care settings, the potential for data collected for clinical and transactional purposes to benefit the research community and, ultimately, the patient population may go unrealized. The current environment is characterized by budget and technical challenges, but investments in data infrastructure are arguably cost-effective given the need to reform our health care system and to monitor the impact of health reform initiatives.
    Health Services Research 10/2010; 45(5 Pt 2):1468-88. DOI:10.1111/j.1475-6773.2010.01142.x · 2.49 Impact Factor