Evaluation of a generalizable approach to clinical information retrieval using the automated retrieval console (ARC)

Massachusetts Veterans Epidemiology Research and Information Center Cooperative Studies Coordinating Center, VA Boston Healthcare System, Jamaica Plain, Massachusetts 02130, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 07/2010; 17(4):375-82. DOI: 10.1136/jamia.2009.001412
Source: PubMed


Reducing custom software development effort is an important goal in information retrieval (IR). This study evaluated a generalizable approach involving no custom software or rules development. The study used documents "consistent with cancer" to evaluate system performance in the domains of colorectal (CRC), prostate (PC), and lung (LC) cancer. Using an end-user-supplied reference set, the Automated Retrieval Console (ARC) iteratively calculated the performance of combinations of natural-language-processing-derived features and supervised classification algorithms. Training and testing used 10-fold cross-validation on three sets of 500 documents each. Performance metrics included recall, precision, and F-measure. Annotation time for five physicians was also measured. The top-performing algorithms had recall, precision, and F-measure values as follows: 0.90, 0.92, and 0.89, respectively, for CRC; 0.97, 0.95, and 0.94 for PC; and 0.76, 0.80, and 0.75 for LC. In all but one case, conditional random fields outperformed maximum-entropy-based classifiers. The algorithms performed well without custom code or rules development, but performance varied by specific application.
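The evaluation above rests on standard IR metrics and 10-fold cross-validation. As a minimal sketch of those two pieces (not ARC's actual implementation, which is not reproduced here), recall, precision, and F-measure can be computed from relevant and retrieved document sets, and each 500-document set split into ten folds:

```python
def retrieval_metrics(relevant, retrieved):
    """Recall, precision, and F-measure for one retrieval run."""
    relevant, retrieved = set(relevant), set(retrieved)
    tp = len(relevant & retrieved)  # true positives: relevant docs that were retrieved
    recall = tp / len(relevant) if relevant else 0.0
    precision = tp / len(retrieved) if retrieved else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f

def k_fold_indices(n, k=10):
    """Split document indices 0..n-1 into k near-equal folds for cross-validation."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds
```

For each fold, a classifier would be trained on the other nine folds and scored on the held-out fold, with the three metrics averaged across folds.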



Available from: Wildon R Farwell, Jan 02, 2014
    • "Table 4 – Examples of cancer-specific ontologies (entries include NCI [60] and SNOMED CT, maintained by the IHTSDO [21] [26] [27] [30] [32] [45]). Abbreviations: AMA, American Medical Association; ICH, International Conference on Harmonisation; IHTSDO, International Health Terminology Standards Development Organisation; NCI, National Cancer Institute; NLM, National Library of Medicine; WHO, World Health Organisation. "
    ABSTRACT: Purpose: This paper reviews the research literature on text mining (TM) with the aim to find out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research. Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar. Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
    Full-text · Article · Sep 2014 · International Journal of Medical Informatics
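The dictionary-lookup NER that the review identifies as the dominant symbolic method can be sketched as a longest-match scan of text against a term lexicon. The lexicon entries and labels below are illustrative assumptions, not drawn from any cited system:

```python
import re

def dictionary_ner(text, lexicon):
    """Naive dictionary-lookup NER: longest-match scan against a term lexicon."""
    hits = []
    # Try longer terms first so "lung cancer" wins over the nested term "cancer".
    for term, label in sorted(lexicon.items(), key=lambda kv: -len(kv[0])):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            span = (m.start(), m.end())
            # Keep a hit only if it does not overlap a previously accepted span.
            if not any(s < span[1] and span[0] < e for s, e in (h[0] for h in hits)):
                hits.append((span, m.group(0), label))
    return sorted(hits)
```

Real systems layer normalization, abbreviation handling, and spelling tolerance on top of this core lookup, which is exactly where the review notes the clinical sublanguage causes trouble.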
    • "The next step of the analysis was supervised classification of the retrieved, filtered affirmative 400-character document snippets as true or false positives using the Automated Retrieval Console developed by D'Avolio [12] [13]. This classifier utilizes the Mayo cTAKES (2) toolset for linguistic feature extraction, the UIMA pipeline architecture, and the MALLET conditional random fields classifier. "
    ABSTRACT: To fulfill the promise of electronic health records to support the study of disease in populations, efficient techniques are required to search large clinical corpora. The authors describe a hybrid system that combines a search engine with a natural language feature extraction and classification system to estimate the annual incidence of suicide attempts and demonstrate an association of adverse childhood experiences with suicide attempt risk in a cohort of 250,000 patients. The methodology replicated a previous finding of a positive association between suicide attempt incidence and a history of childhood abuse, neglect, or family dysfunction, and showed that the association is stronger when multiple adverse events are reported.
    Full-text · Conference Paper · Jan 2014
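The search-then-classify pipeline quoted above first pulls fixed-width snippets around keyword hits and keeps only affirmative (non-negated) mentions before handing them to the classifier. A minimal sketch of that pre-filtering step, where the negation cue list is an assumption of this example rather than the cited system's actual logic:

```python
import re

# Illustrative negation cues; real systems use far richer lexicons (e.g. NegEx).
NEGATION = re.compile(r"\b(no|denies|denied|without|negative for)\b[^.]*$", re.IGNORECASE)

def affirmative_snippets(document, keyword, width=400):
    """Extract fixed-width snippets around keyword hits, dropping negated mentions."""
    snippets = []
    for m in re.finditer(re.escape(keyword), document, re.IGNORECASE):
        start = max(0, m.start() - width // 2)
        snippet = document[start:start + width]
        # Check the text just before the hit (within the same sentence) for a negation cue.
        preceding = document[max(0, m.start() - 60):m.start()]
        if not NEGATION.search(preceding):
            snippets.append(snippet)
    return snippets
```

Snippets surviving this filter would then be labeled true or false positive by the supervised classifier.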
    • "Utilizing controlled terminology could have further increased recall, as previously demonstrated in retrieving liver cysts using iSCOUT [24]. Finally, supervised classification algorithms, previously implemented for information retrieval, were not available in either application [31]. Incorporating these algorithms into information retrieval applications could further enhance precision and recall of these tools. "
    ABSTRACT: Communication of critical results from diagnostic procedures between caregivers is a Joint Commission national patient safety goal. Evaluating critical result communication often requires manual analysis of voluminous data, especially when reviewing unstructured textual results of radiologic findings. Information retrieval (IR) tools can facilitate this process by enabling automated retrieval of radiology reports that cite critical imaging findings. However, IR tools developed for one disease or imaging modality often need substantial reconfiguration before they can be utilized for another disease entity. This paper (1) describes the process of customizing two Natural Language Processing (NLP) and Information Retrieval/Extraction applications - an open-source toolkit, A Nearly New Information Extraction system (ANNIE), and an application developed in-house, Information for Searching Content with an Ontology-Utilizing Toolkit (iSCOUT) - to illustrate the varying levels of customization required for different disease entities; and (2) evaluates each application's performance in identifying and retrieving radiology reports citing critical imaging findings for three distinct diseases: pulmonary nodule, pneumothorax, and pulmonary embolus. Both applications can be utilized for retrieval. iSCOUT and ANNIE had precision values between 0.90 and 0.98 and recall values between 0.79 and 0.94. ANNIE had consistently higher precision but required more customization. Understanding the customizations involved in utilizing NLP applications for various diseases will enable users to select the most suitable tool for specific tasks.
    Full-text · Article · Aug 2012 · The Open Medical Informatics Journal
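The suggestion in the snippet above that controlled terminology could increase recall amounts to expanding query terms with vocabulary synonyms before retrieval. A minimal sketch, where the synonym map is an illustrative assumption rather than an actual controlled vocabulary:

```python
def expand_query(terms, synonyms):
    """Expand query terms with controlled-vocabulary synonyms to improve recall."""
    expanded = set()
    for t in terms:
        key = t.lower()
        expanded.add(key)
        # Add every synonym mapped to this term; missing terms expand to themselves only.
        expanded.update(s.lower() for s in synonyms.get(key, []))
    return expanded
```

Matching reports against the expanded set catches documents that mention a finding only by an alternate name, at some cost to precision.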