A text mining approach to the prediction of disease status from clinical discharge summaries.

School of Computer Science, University of Manchester, Manchester, UK.
Journal of the American Medical Informatics Association (Impact Factor: 3.93). 05/2009; 16(4):596-600. DOI: 10.1197/jamia.M3096
Source: PubMed

ABSTRACT

OBJECTIVE: The authors present a system developed for the Challenge in Natural Language Processing for Clinical Data (the i2b2 obesity challenge), whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted.

DESIGN: The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based, and machine-learning methods.

MEASUREMENTS: The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure.

RESULTS: The implemented method achieved a macro-averaged F-measure of 81% for the textual task (the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams; the highest was 66%). The micro-averaged F-measure indicated an average accuracy of 97% for textual and 96% for intuitive annotations.

CONCLUSIONS: The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.
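The gap between the macro-averaged and micro-averaged F-measures reported above (81% versus 97% on the textual task) is typical when some classes are rare. The sketch below illustrates the two averages under the assumption that macro-averaging is taken over class labels; the function name `f_measures`, the labels, and the ten toy annotations are hypothetical and are not taken from the challenge data.

```python
from collections import Counter


def f_measures(gold, predicted):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical annotations: one rare class ("Q") is missed entirely.
gold = ["N"] * 8 + ["Y", "Q"]
pred = ["N"] * 8 + ["Y", "N"]
macro_f1, micro_f1 = f_measures(gold, pred)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

In this toy case a single error on the rare class pulls the macro average far below the micro average, which is why a ranking based on the macro-averaged F-measure rewards systems that handle infrequent statuses well.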

  • ABSTRACT: This paper presents the architecture of a health data search engine, along with preliminary findings that demonstrate the feasibility of the approach taken. The work is motivated by the need to incorporate information about similar patients into clinical decision making, and by the need to develop a tool that can search for similar patients in health data repositories. Central to the design of the search engine is the use of clustering analysis within health data repositories to ensure that responses to queries consist of data summaries that do not violate the confidentiality of patient records. Recent results concerning the feasibility of this search engine approach are reviewed. These results speak to the relative ease of creating clinically meaningful summaries of patient types, and to the accuracy of predictions made using the summarized data. The paper concludes with a brief discussion of further work required to implement a health data search engine and to demonstrate its effectiveness.
  • ABSTRACT: Identifying patients at risk for acute respiratory distress syndrome (ARDS) before their admission to intensive care is crucial to prevention and treatment. The objective of this study is to determine the performance of an automated algorithm for identifying selected ARDS predisposing conditions at the time of hospital admission. This secondary analysis of a prospective cohort study included 3,005 patients admitted to hospital between January 1 and December 31, 2010. The automated algorithm for five ARDS predisposing conditions (sepsis, pneumonia, aspiration, acute pancreatitis, and shock) was developed through a series of queries applied to institutional electronic medical record databases. The automated algorithm was derived and refined in a derivation cohort of 1,562 patients and subsequently validated in an independent cohort of 1,443 patients. The sensitivity, specificity, and positive and negative predictive values of the automated algorithm for identifying ARDS risk factors were compared with those of two other independent data extraction strategies: manual data extraction and ICD-9 code search. The reference standard was defined as the agreement between the ICD-9 code, automated and manual data extraction. Compared to the reference standard, the automated algorithm had higher sensitivity than manual data extraction for identifying a case of sepsis (95% vs. 56%), aspiration (63% vs. 42%), acute pancreatitis (100% vs. 70%), pneumonia (93% vs. 62%) and shock (77% vs. 41%) with similar specificity except for sepsis and pneumonia (90% vs. 98% for sepsis and 95% vs. 99% for pneumonia). The PPV for identifying these five acute conditions using the automated algorithm ranged from 65% for pneumonia to 91% for acute pancreatitis, whereas the NPV for the automated algorithm ranged from 99% to 100%. Rule-based electronic data extraction can reliably and accurately identify patients at risk of ARDS at the time of hospital admission.
    Applied Clinical Informatics 01/2014; 5(1):58-72. · 0.39 Impact Factor
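
The sensitivity, specificity, PPV and NPV figures quoted in the abstract above all derive from a two-by-two table of algorithm output against a reference standard. As a rough sketch of how the four values relate, the function below computes them from raw counts; the name `diagnostic_metrics` and the example counts are invented for illustration and do not come from the study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute four diagnostic accuracy measures from confusion-matrix counts.

    tp/fn: reference-positive cases the algorithm flagged / missed
    fp/tn: reference-negative cases the algorithm flagged / cleared
    """
    def ratio(numerator, denominator):
        # Undefined when a row or column of the table is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true cases detected
        "specificity": ratio(tn, tn + fp),  # share of non-cases cleared
        "ppv": ratio(tp, tp + fp),          # share of flags that are correct
        "npv": ratio(tn, tn + fn),          # share of clearances that are correct
    }


# Hypothetical counts for one condition in a cohort of 100 patients.
metrics = diagnostic_metrics(tp=19, fp=5, fn=1, tn=75)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With a low-prevalence condition, as in this made-up table, NPV stays close to 100% even when PPV is modest, which matches the pattern of results the abstract describes.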
