A Study of Transportability of an Existing Smoking Status Detection Module across Institutions.

Department of Biomedical Informatics.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 01/2012; 2012:577-86.
Source: PubMed


Electronic Medical Records (EMRs) are valuable resources for clinical observational studies. A patient's smoking status is a key factor in many diseases, but it is often embedded in narrative text. Natural language processing (NLP) systems have been developed for this specific task, such as the smoking status detection module in the clinical Text Analysis and Knowledge Extraction System (cTAKES). This study examined the transportability of the cTAKES smoking module to Vanderbilt University Hospital's EMR data. Our evaluation demonstrated that a modest amount of customization is necessary to achieve desirable performance. We modified the system by filtering notes, annotating new data for training the machine learning classifier, and adding rules to the rule-based classifiers. Our results showed that the customized module achieved significantly higher F-measures at all levels of classification (i.e., sentence, document, patient) compared to the direct application of the cTAKES module to the Vanderbilt data.
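The abstract reports F-measures at the sentence, document, and patient levels without giving the formula. As a rough illustration only, the sketch below computes per-class precision, recall, and F-measure from paired gold and predicted labels. The function name `f_measure`, the label strings, and the toy data are hypothetical and are not taken from the study or from cTAKES.

```python
def f_measure(gold, predicted, label, beta=1.0):
    """Return (precision, recall, F) for a single class label.

    gold and predicted are equal-length sequences of labels, one per
    unit being classified (a sentence, a document, or a patient).
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical toy labels, for illustration only.
gold = ["current", "past", "never", "current", "unknown"]
pred = ["current", "current", "never", "past", "unknown"]

for label in ("current", "never"):
    print(label, f_measure(gold, pred, label))
```

The same function could be applied at each level by changing what one item in the label lists represents; how sentence-level predictions are aggregated into document or patient labels is not described in the abstract and is not modeled here.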

  • Source
    • "These features are generated by Latent Dirichlet Allocations (10), which can effectively group similar instances together based on their topics (11–14). "
    ABSTRACT: This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. Database URL:
    Database The Journal of Biological Databases and Curation 01/2014; 2014. DOI:10.1093/database/bau087 · 3.37 Impact Factor
  • Source
    ABSTRACT: OBJECTIVE: To evaluate the validity of, characterize the usage of, and propose potential research applications for International Classification of Diseases, Ninth Revision (ICD-9) tobacco codes in clinical populations. MATERIALS AND METHODS: Using data on cancer cases and cancer-free controls from Vanderbilt's biorepository, BioVU, we evaluated the utility of ICD-9 tobacco use codes to identify ever-smokers in general and high smoking prevalence (lung cancer) clinic populations. We assessed potential biases in documentation, and performed temporal analysis relating transitions between smoking codes to smoking cessation attempts. We also examined the suitability of these codes for use in genetic association analyses. RESULTS: ICD-9 tobacco use codes can identify smokers in a general clinic population (specificity of 1, sensitivity of 0.32), and there is little evidence of documentation bias. Frequency of code transitions between 'current' and 'former' tobacco use was significantly correlated with initial success at smoking cessation (p<0.0001). Finally, code-based smoking status assignment is a comparable covariate to text-based smoking status for genetic association studies. DISCUSSION: Our results support the use of ICD-9 tobacco use codes for identifying smokers in a clinical population. Furthermore, with some limitations, these codes are suitable for adjustment of smoking status in genetic studies utilizing electronic health records. CONCLUSIONS: Researchers should not be deterred by the unavailability of full-text records to determine smoking status if they have ICD-9 code histories.
    Journal of the American Medical Informatics Association 02/2013; 20(4). DOI:10.1136/amiajnl-2012-001557 · 3.50 Impact Factor
  • Source
    ABSTRACT: Genetic association studies have rapidly become a major tool for identifying the genetic basis of common human diseases. The advent of cost-effective genotyping, coupled with large collections of samples linked to clinical outcomes and quantitative traits, now makes it possible to systematically characterize genotype-phenotype relationships in diverse populations and extensive datasets. To capitalize on these advancements, the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) project, as part of the collaborative Population Architecture using Genomics and Epidemiology (PAGE) study, accesses two collections: the National Health and Nutrition Examination Surveys (NHANES) and BioVU, Vanderbilt University's biorepository linked to de-identified electronic medical records. We describe herein the workflows for accessing and using the epidemiologic (NHANES) and clinical (BioVU) collections, where each workflow has been customized to reflect the content and data access limitations of each respective source. We also describe the process by which these data are generated, standardized, and shared for meta-analysis among the PAGE study sites. As a specific example of the use of BioVU, we describe the data mining efforts to define cases and controls for genetic association studies of common cancers in PAGE. Collectively, the efforts described here are a generalized outline for many of the successful approaches that can be used in the era of high-throughput genotype-phenotype associations for moving biomedical discovery forward to new frontiers of data generation and analysis.