Extracting subject demographic information from abstracts of randomized clinical trial reports

Biomedical Informatics Training Program, Stanford University School of Medicine, USA.
Studies in health technology and informatics 02/2007; 129(Pt 1):550-4. DOI: 10.3233/978-1-58603-774-1-550
Source: PubMed


In order to make more informed healthcare decisions, consumers need information systems that deliver accurate and reliable information about their illnesses and potential treatments. Reports of randomized clinical trials (RCTs) provide reliable medical evidence about the efficacy of treatments. Current methods to access, search for, and retrieve RCTs are keyword-based, time-consuming, and suffer from poor precision. Personalized semantic search and medical evidence summarization aim to solve this problem. The performance of these approaches may improve if they have access to study subject descriptors (e.g. age, gender, and ethnicity), trial sizes, and diseases/symptoms studied. We have developed a novel method to automatically extract such subject demographic information from RCT abstracts. We used text classification augmented with a Hidden Markov Model to identify sentences containing subject demographics, and subsequently these sentences were parsed using Natural Language Processing techniques to extract relevant information. Our results show accuracy levels of 82.5%, 92.5%, and 92.0% for extraction of subject descriptors, trial sizes, and diseases/symptoms descriptors respectively.

1 Follower
18 Reads
  • Source
    • "captures important scientific and clinical investigations in biomedicine. As a result, the knowledge buried in those trial records has shown to be valuable for researchers, clinicians , and the pharmaceutical industry [12] [13] [14]. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Recent progress in high-throughput genomic technologies has shifted pharmacogenomic research from candidate gene pharmacogenetics to clinical pharmacogenomics (PGx). Many clinical related questions may be asked such as 'what drug should be prescribed for a patient with mutant alleles?' Typically, answers to such questions can be found in publications mentioning the relationships of the gene-drug-disease of interest. In this work, we hypothesize that is a comparable source rich in PGx related information. In this regard, we developed a systematic approach to automatically identify PGx relationships between genes, drugs and diseases from trial records in In our evaluation, we found that our extracted relationships overlap significantly with the curated factual knowledge through the literature in a PGx database and that most relationships appear on average 5years earlier in clinical trials than in their corresponding publications, suggesting that clinical trials may be valuable for both validating known and capturing new PGx related information in a more timely manner. Furthermore, two human reviewers judged a portion of computer-generated relationships and found an overall accuracy of 74% for our text-mining approach. This work has practical implications in enriching our existing knowledge on PGx gene-drug-disease relationships as well as suggesting crosslinks between and other PGx knowledge bases.
    Journal of Biomedical Informatics 04/2012; 45(5):870-8. DOI:10.1016/j.jbi.2012.04.005 · 2.19 Impact Factor
  • Source
    • "The system starts by finding sentences likely to contain relevant information. Identifying candidate sentences is a common step when analyzing medical documents [11], [3], [6], [4], [16], significantly reducing the amount of text that must be processed in order to find needed information. This lowers the likelihood of false positives and also enables the use of more time-consuming methods when searching for mentions and quantities. "
    [Show abstract] [Hide abstract]
    ABSTRACT: A central concern in Evidence Based Medicine (EBM) is how to convey research results effectively to practitioners. One important idea is to summarize results by key summary statistics that describe the effectiveness (or lack thereof) of a given intervention, specifically the absolute risk reduction (ARR) and number needed to treat (NNT). Manual summarization is slow and expensive, thus, with the exponential growth of the biomedical research literature, automated solutions are needed. In this paper, we present a novel method for automatically creating EBM-oriented summaries from research abstracts of randomly-controlled trials (RCTs). The system extracts descriptions of the treatment groups and outcomes, as well as various associated quantities, and then calculates summary statistics. Results on a hand-annotated corpus of research abstracts show promising, and potentially useful, results.
    IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2011, Atlanta, GA, USA, 12-15 November, 2011; 11/2011
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Clinical trials are one of the most valuable sources of scientific evidence for improving the practice of medicine. The Trial Bank project aims to improve structured access to trial findings by including formalized trial information into a knowledge base. Manually extracting trial information from published articles is costly, but automated information extraction techniques can assist. The current study highlights a single architecture to extract a wide array of information elements from full-text publications of randomized clinical trials (RCTs). This architecture combines a text classifier with a weak regular expression matcher. We tested this two-stage architecture on 88 RCT reports from 5 leading medical journals, extracting 23 elements of key trial information such as eligibility rules, sample size, intervention, and outcome names. Results prove this to be a promising avenue to help critical appraisers, systematic reviewers, and curators quickly identify key information elements in published RCT articles.
    AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 02/2008;
Show more