A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries

School of Computer Science, University of Manchester, Manchester, UK.
Journal of the American Medical Informatics Association (Impact Factor: 3.5). 05/2009; 16(4):596-600. DOI: 10.1197/jamia.M3096
Source: PubMed


Objective: The authors present a system developed for the Challenge in Natural Language Processing for Clinical Data—the i2b2 obesity challenge, whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted.
Design: The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based, and machine-learning methods.
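As a rough illustration of how dictionary look-up and a rule can be combined for the textual task, the sketch below matches disease terms from a small lexicon and applies one negation rule. It is not the authors' system: the lexicon, the negation cues, the label names ("present", "absent", "unmentioned") and the function textual_status are all hypothetical, and the machine-learning component is omitted.

```python
import re

# Hypothetical mini-lexicon (disease -> surface terms); not the authors' resources.
LEXICON = {
    "obesity": ["obesity", "obese"],
    "hypertension": ["hypertension", "htn", "high blood pressure"],
}

# Hypothetical negation cues: a cue must occur within 40 characters
# before the matched term, inside the same sentence.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b[^.]{0,40}$", re.IGNORECASE)


def textual_status(summary: str) -> dict:
    """Assign each disease a label based on explicit mentions only."""
    status = {disease: "unmentioned" for disease in LEXICON}
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        for disease, terms in LEXICON.items():
            for term in terms:
                # Dictionary look-up: whole-word, case-insensitive match.
                match = re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE)
                if not match:
                    continue
                # Rule: a negation cue shortly before the term flips the label.
                negated = NEGATION.search(sentence[: match.start()])
                # An affirmed mention anywhere in the summary takes precedence.
                if status[disease] != "present":
                    status[disease] = "absent" if negated else "present"
    return status


print(textual_status("The patient is obese. She denies high blood pressure."))
```

A real system would need section detection, richer negation and uncertainty handling, and a much larger lexicon; the point here is only the division of labour between look-up and rules.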
Measurements: The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure.
Results: The implemented method achieved a macro-averaged F-measure of 81% for the textual task (the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams; the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations.
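The gap between the macro- and micro-averaged figures is easier to see with a small worked example. The sketch below computes both averages for single-label predictions, using the standard definitions; the challenge's exact averaging procedure may differ in detail, and the label names, the toy data and the function f_measures are hypothetical.

```python
from collections import Counter


def f_measures(gold, pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so rare classes count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data (hypothetical): one frequent class and two rare ones.
gold = ["unmentioned"] * 8 + ["present", "absent"]
pred = ["unmentioned"] * 8 + ["present", "unmentioned"]
macro, micro = f_measures(gold, pred)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

With one label per item, the micro-averaged F-measure equals plain accuracy, so it is dominated by the frequent class. The macro average weights every class equally, so a single missed rare class pulls it down sharply.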
Conclusions: The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.



Available from: Hui Yang, Dec 19, 2013
    • "Final decisions were made using these two sets of concepts. Yang et al. [11] used lexical patterns to identify sections followed by a lexicon of UMLS concepts and associated confidence scores for obesity and disease status. The 2008 challenge [4] aimed at identification of medication information such as drug name, dosage, frequency, reason for medication, etc. from clinical notes."
    ABSTRACT: The second track of the 2014 i2b2 challenge asked participants to automatically identify risk factors for heart disease among diabetic patients using natural language processing techniques for clinical notes. This paper describes a rule-based system developed using a combination of regular expressions, concepts from the Unified Medical Language System (UMLS), and freely-available resources from the community. With a performance (F1=90.7) that is significantly higher than the median (F1=87.20) and close to the top performing system (F1=92.8), it was the best rule-based system of all the submissions in the challenge. This system was also used to evaluate the utility of different terminologies in the UMLS towards the challenge task. Of the 155 terminologies in the UMLS, 129 (76.78%) have no representation in the corpus. The Consumer Health Vocabulary had very good coverage of relevant concepts and was the most useful terminology for the challenge task. While segmenting notes into sections and lists has a significant impact on the performance, identifying negations and experiencer of the medical event results in negligible gain.
    Full-text · Article · Sep 2015 · Journal of Biomedical Informatics
    • "Data mining and text mining have been applied to healthcare data (Koh and Tan, 2011). For instance, a text mining algorithm was applied to predict disease status from discharge summaries (Yang et al. 2009)."

    Full-text · Technical Report · Jul 2014
    • "Data mining and text mining have been applied to healthcare data (Koh and Tan, 2011). For instance, a text mining algorithm was applied to predict disease status from discharge summaries (Yang et al. 2009)."
    ABSTRACT: This paper presents the architecture of a health data search engine, along with preliminary findings that demonstrate the feasibility of the approach taken. The work is motivated by the need to incorporate information about similar patients into clinical decision making, and by the need to develop a tool that can search for similar patients in health data repositories. Central to the design of the search engine is the use of clustering analysis within health data repositories to ensure that responses to queries consist of data summaries that do not violate the confidentiality of patient records. Recent results concerning the feasibility of this search engine approach are reviewed. These results speak to the relative ease of creating clinically meaningful summaries of patient types, and to the accuracy of predictions made using the summarized data. The paper concludes with a brief discussion of further work required to implement a health data search engine and to demonstrate its effectiveness.
    No preview · Technical Report · Jun 2014