Extracting timing and status descriptors for colonoscopy testing from electronic medical records

Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 07/2010; 17(4):383-8. DOI: 10.1136/jamia.2010.004804
Source: PubMed


Colorectal cancer (CRC) screening rates are low despite confirmed benefits. The authors investigated the use of natural language processing (NLP) to identify previous colonoscopy screening in electronic records from a random sample of 200 patients at least 50 years old. The authors developed algorithms to recognize temporal expressions and 'status indicators', such as 'patient refused' or 'test scheduled'. The new methods were added to the existing KnowledgeMap concept identifier system, and the resulting system was used to parse electronic medical records (EMR) to detect completed colonoscopies. Using expert physicians' manual review of EMR notes as the 'gold standard', the system identified timing references with a recall of 0.91 and precision of 0.95, colonoscopy status indicators with a recall of 0.82 and precision of 0.95, and references to actually completed colonoscopies with a recall of 0.93 and precision of 0.95. The system was superior to using colonoscopy billing codes alone. Health services researchers and clinicians may find NLP a useful adjunct to traditional methods to detect CRC screening status. Further investigations must validate extension of NLP approaches for other types of CRC screening applications.
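
To make the idea of 'status indicators' and timing references concrete, here is a minimal sketch of the general technique. It is not the KnowledgeMap implementation, whose rules the abstract does not describe. The cue phrases, the names `annotate` and `precision_recall`, and the labels are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Hypothetical cue phrases (assumptions, not the published rule set).
# Order matters: "refused" and "scheduled" are checked before "completed".
STATUS_PATTERNS = [
    ("refused", re.compile(r"\b(?:refus\w*|declin\w*)\b", re.I)),
    ("scheduled", re.compile(r"\b(?:schedul\w*|referred for)\b", re.I)),
    ("completed", re.compile(r"\b(?:had|underwent|performed|status post)\b", re.I)),
]

# Two simple temporal forms: a four-digit year, or "<n> years/months/weeks ago".
TIME_PATTERN = re.compile(
    r"\b(?:19|20)\d{2}\b|\b\d+\s+(?:year|month|week)s?\s+ago\b", re.I
)

TEST_PATTERN = re.compile(r"\bcolonoscop(?:y|ies)\b", re.I)


@dataclass
class Mention:
    status: str            # "refused", "scheduled", "completed", or "unknown"
    timing: Optional[str]  # matched temporal expression, if any


def annotate(sentence: str) -> Optional[Mention]:
    """Return status and timing for a sentence that mentions colonoscopy."""
    if not TEST_PATTERN.search(sentence):
        return None
    status = "unknown"
    for label, pattern in STATUS_PATTERNS:
        if pattern.search(sentence):
            status = label
            break
    match = TIME_PATTERN.search(sentence)
    return Mention(status, match.group(0) if match else None)


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Score predicted items against a manually reviewed gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, `annotate("Underwent colonoscopy 3 years ago.")` yields a `completed` status with the timing string `3 years ago`. A real system would also need negation handling, section awareness, and a much larger phrase inventory.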



Available from: Joshua C Denny, Oct 01, 2015
    • "Breast neoplasm | Pathology reports | [16] [20] [21]
      Breast neoplasm | PubMed abstracts | [15]
      Cervical neoplasm | PubMed abstracts | [22]
      Colon neoplasm | Pathology reports | [23] [24]
      Colorectal neoplasm | EMR notes | [25] [26] [28] [29]
      Colorectal neoplasm | Pathology reports | [27]
      Colorectal neoplasm | Histopathology reports | [30]
      Colorectal neoplasm | Colonoscopy reports | [5]
      Lung neoplasm | Radiographic reports | [31]
      Lung neoplasm | EMR | [26]
      Lung neoplasm | Pathology reports | [32]
      Ovarian neoplasm | GPRD records | [33]
      Pancreatic neoplasm | PubMed abstracts, EMRs | [34]
      Prostate neoplasm | Clinical records: all available paper, electronic, radiologic, radiation therapy and pathology records | [37]
      Prostate neoplasm | Pathology reports | [21] [36]
      Prostate neoplasm | EMR | [26]
      Skin neoplasm | Pathology reports | [36]
      cancer. In particular, two types of reports are relevant for recording cancer-related information: pathology and imaging reports. "
    ABSTRACT: Purpose: This paper reviews the research literature on text mining (TM) with the aim of finding out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research. Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar. Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
    International Journal of Medical Informatics 09/2014; 83(9). DOI:10.1016/j.ijmedinf.2014.06.009 · 2.00 Impact Factor
    • "can only be answered and interpreted if the relative temporal relations between the events are considered. In general, temporal reasoning has applications in several tasks in the clinical domain such as information extraction [2] [3], question answering [4] [5], patient timeline visualization [6], clinical guideline development [7] [8] and others. Automatic extraction of temporal information can facilitate processing of patient information in the narrative text, and this can contribute to the decision making process in fundamental patient care tasks such as prevention, diagnosis and forecasting the effects of the treatments [9] [10]. "
    ABSTRACT: Clinical records include both coded and free-text fields that interact to reflect complicated patient stories. The information often covers not only the present medical condition and events experienced by the patient, but also refers to relevant events in the past (such as signs, symptoms, tests or treatments). In order to automatically construct a timeline of these events, we first need to extract the temporal relations between pairs of events or time expressions presented in the clinical notes. We designed separate extraction components for different types of temporal relations, utilizing a novel hybrid system that combines machine learning with a graph-based inference mechanism to extract the temporal links. The temporal graph is a directed graph based on parse tree dependencies of the simplified sentences and frequent pattern clues. We generalized the sentences in order to discover patterns that, given the complexities of natural language, might not be directly discoverable in the original sentences. The proposed hybrid system performance reached an F-measure of 0.63, with precision at 0.76 and recall at 0.54 on the 2012 i2b2 Natural Language Processing corpus for the temporal relation (TLink) extraction task, achieving the highest precision and third highest f-measure among participating teams in the TLink track.
    Journal of Biomedical Informatics 11/2013; 46. DOI:10.1016/j.jbi.2013.11.001 · 2.19 Impact Factor
    • "There was a special section focused on CRI papers in the December 2011 supplement issue. Much of the increase can be attributed to publications from awardees of the CTSA, since publication rate is related to funding [38]. JAMIA publications acknowledging CTSA funding rose from three in 2009 [39–41] to four in 2010 [14, 42–44] and 15 in 2011 [15, 17, 19, 36, 45–55]. Some of the articles were not exclusively focused on CRI, but were directly related, covering many different topics that are highly relevant to CRI: data models and terminologies [27, 56–68], natural language processing (NLP) [16, 50, 61, 69–99], surveillance systems [48, 65, 80, 100–110], and privacy technology and policy [33, 111–117]. This 2012 CRI supplement adds 18 new publications to this growing field. "
    ABSTRACT: Clinical research informatics is the rapidly evolving sub-discipline within biomedical informatics that focuses on developing new informatics theories, tools, and solutions to accelerate the full translational continuum: basic research to clinical trials (T1), clinical trials to academic health center practice (T2), diffusion and implementation to community practice (T3), and 'real world' outcomes (T4). We present a conceptual model based on an informatics-enabled clinical research workflow, integration across heterogeneous data sources, and core informatics tools and platforms. We use this conceptual model to highlight 18 new articles in the JAMIA special issue on clinical research informatics.
    Journal of the American Medical Informatics Association 04/2012; 19(e1):e36-e42. DOI:10.1136/amiajnl-2012-000968 · 3.50 Impact Factor