Dynamic categorization of clinical research eligibility criteria by hierarchical clustering

Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States.
Journal of Biomedical Informatics (Impact Factor: 2.19). 06/2011; 44(6):927-35. DOI: 10.1016/j.jbi.2011.06.001
Source: PubMed


To semi-automatically induce semantic categories of eligibility criteria from text and to automatically classify eligibility criteria based on their semantic similarity.
The UMLS semantic types and a set of previously developed semantic preference rules were utilized to create an unambiguous semantic feature representation to induce eligibility criteria categories through hierarchical clustering and to train supervised classifiers.
We induced 27 categories and measured their prevalence in 27,278 eligibility criteria from 1,578 clinical trials. We then compared classification performance (precision, recall, and F1-score) between the UMLS-based feature representation and the "bag of words" feature representation across five common classifiers in Weka: J48, Bayesian Network, Naïve Bayesian, Nearest Neighbor, and an instance-based learning classifier.
The UMLS semantic feature representation outperformed the "bag of words" feature representation in 89% of the criteria categories. Using the semantically induced categories, machine-learning classifiers required only 2,000 instances to stabilize classification performance. The J48 classifier yielded the best F1-score, and the Bayesian Network classifier achieved the best learning efficiency.
The UMLS is an effective knowledge source and can enable an efficient feature representation for semi-automated semantic category induction and automatic categorization for clinical research eligibility criteria and possibly other clinical text.
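As a rough illustration of the category-induction idea described above, the following sketch clusters a handful of criteria by the overlap of their semantic types. It is not the authors' pipeline: the Jaccard distance, the average-linkage rule, the 0.6 stopping threshold, and the assignment of semantic types to the five example criteria are all assumptions made for illustration. Only the semantic type names themselves come from the UMLS.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of semantic-type labels."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, features):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(features[i], features[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(features, threshold):
    """Merge the closest pair of clusters until no pair is within `threshold`."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        (a, b), dist = min(
            (
                ((x, y), average_linkage(clusters[x], clusters[y], features))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical criteria, each reduced to a set of UMLS semantic type names.
criteria_features = [
    {"Disease or Syndrome"},                                             # criterion 0
    {"Disease or Syndrome", "Finding"},                                  # criterion 1
    {"Pharmacologic Substance"},                                         # criterion 2
    {"Pharmacologic Substance", "Therapeutic or Preventive Procedure"},  # criterion 3
    {"Age Group"},                                                       # criterion 4
]

clusters = agglomerative_clusters(criteria_features, threshold=0.6)
print(sorted(clusters))
```

Under these assumptions the disease-related criteria and the treatment-related criteria each form a group, while the age criterion stays on its own. In a semi-automated setting such as the one the abstract describes, a human reviewer would still name and refine the resulting groups.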

Available from: Chunhua Weng
    • "Weng et al. [12] and Luo et al. [13] describe in related publications, a semi-automatic approach that allows annotating free text eligibility criteria using semantic representation. In contrast to our expert-driven simplification approach - intended to reduce complexity with a focus on trial feasibility - those methods aim at semi-automatically extracting the complete information out of free text. "
    ABSTRACT: Clinical studies are a necessity for new medications and therapies. Many studies, however, struggle to meet their recruitment numbers in time or fail to meet them at all. With increasing numbers of electronic health records (EHRs) in hospitals, huge databanks emerge that could be utilized to support research. The Innovative Medicines Initiative (IMI) funded project 'Electronic Health Records for Clinical Research' (EHR4CR) created a standardized and homogeneous inventory of data elements to support research by utilizing EHRs. Our aim was to develop a Data Inventory that contains elements required for site feasibility analysis. The Data Inventory was created in an iterative, consensus-driven approach by a group of up to 30 people consisting of pharmaceutical experts and informatics specialists. An initial list was subsequently expanded by data elements of simplified eligibility criteria from clinical trial protocols. Each element was manually reviewed by pharmaceutical experts, and standard definitions were identified and added. To verify their availability, data exports of the source systems at eleven university hospitals throughout Europe were conducted and evaluated. The Data Inventory consists of 75 data elements that, on the one hand, are frequently used in clinical studies and, on the other hand, are available in European EHR systems. Rankings of data elements were created from the results of the data exports. In addition, a sub-list was created with 21 data elements that were separated from the Data Inventory because of their low usage in routine documentation. The data elements in the Data Inventory were identified with the knowledge of domain experts from pharmaceutical companies. Currently, not all information that is frequently used in site feasibility is documented in routine patient care.
    Trials 01/2014; 15(1):18. DOI:10.1186/1745-6215-15-18 · 1.73 Impact Factor
    • "We examined how many and what type of trial eligibility criteria were mapped into the patient characteristics and corresponding data elements of the EMRs to evaluate the completeness of our mapping. We assigned patient characteristics, as mentioned above, to one of the 27 semantic categories defined by Luo et al. [22]. One author, a medical doctor, broke up and assigned the eligibility criteria to one of the semantic categories, and another medical doctor validated the results. "
    ABSTRACT: A number of clinical trials have encountered difficulties enrolling a sufficient number of patients upon initiating the trial. Recently, many screening systems that search clinical data warehouses for patients who are eligible for clinical trials have been developed. We aimed to estimate the number of eligible patients using routine electronic medical records (EMRs) and to predict the difficulty of enrolling sufficient patients prior to beginning a trial. Investigator-initiated clinical trials that were conducted at Kyoto University Hospital between July 2004 and January 2011 were included in this study. We searched the EMRs for eligible patients and calculated the eligible EMR patient index by dividing the number of eligible patients in the EMRs by the target sample size. Additionally, we divided the trial eligibility criteria into corresponding data elements in the EMRs to evaluate the completeness of mapping clinical manifestation in trial eligibility criteria into structured data elements in the EMRs. We evaluated the correlation between the index and the accrual achievement with Spearman's rank correlation coefficient. Thirteen of 19 trials did not achieve their original target sample size. Overall, 55% of the trial eligibility criteria were mapped into data elements in EMRs. The accrual achievement demonstrated a significant positive correlation with the eligible EMR patient index (r = 0.67, 95% confidence interval (CI), 0.42 to 0.92). The receiver operating characteristic analysis revealed an eligible EMR patient index cut-off value of 1.7, with a sensitivity of 69.2% and a specificity of 100.0%. Our study suggests that the eligible EMR patient index remains exploratory but could be a useful component of the feasibility study when planning a clinical trial. 
    Establishing a step to check whether there are likely to be a sufficient number of eligible patients enables sponsors and investigators to concentrate their resources and efforts on more achievable trials.
    Trials 12/2013; 14(1):426. DOI:10.1186/1745-6215-14-426 · 1.73 Impact Factor
    • "Obviously, an investigation regarding the availability of EHR data for the purpose of patient recruitment should derive its data elements of interest from a random set of real world eligibility criteria. But while the contents of these criteria have recently been described by Luo et al. [14], no comparison with the content of a set of real life EHRs is yet available. "
    ABSTRACT: Background: Computerized clinical trial recruitment support is one promising field for the application of routine care data for clinical research. The primary task here is to compare the eligibility criteria defined in trial protocols with patient data contained in the electronic health record (EHR). To avoid the implementation of different patient definitions in multi-site trials, all participating research sites should use similar patient data from the EHR. Knowledge of the EHR data elements which are commonly available from most EHRs is required to be able to define a common set of criteria. The objective of this research is to determine for five tertiary care providers the extent of available data compared with the eligibility criteria of randomly selected clinical trials. Methods: Each participating study site selected three clinical trials at random. All eligibility criteria sentences were broken up into independent patient characteristics, which were then assigned to one of the 27 semantic categories for eligibility criteria developed by Luo et al. We report on the fraction of patient characteristics with corresponding structured data elements in the EHR and on the fraction of patients with available data for these elements. The completeness of EHR data for the purpose of patient recruitment is calculated for each semantic group. Results: 351 eligibility criteria from 15 clinical trials contained 706 patient characteristics. On average, 55% of these characteristics could be documented in the EHR. Clinical data was available for 64% of all patients, if corresponding data elements were available. The total completeness of EHR data for recruitment purposes is 35%. The best performing semantic groups were ‘age’ (89%), ‘gender’ (89%), ‘addictive behaviour’ (74%), ‘disease, symptom and sign’ (64%) and ‘organ or tissue status’ (61%). No data was available for 6 semantic groups.
    Conclusions: There exists a significant gap in structure and content between data documented during patient care and data required for patient eligibility assessment. Nevertheless, EHR data on the patient's age and gender, as well as selected information on the patient's disease, can be complete enough to allow for effective support of the manual screening process through an intelligent preselection of patients and patient data.
    BMC Medical Informatics and Decision Making 03/2013; 13(1):37. DOI:10.1186/1472-6947-13-37 · 1.83 Impact Factor
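The completeness figures in the last abstract above can be checked with a few lines of arithmetic. The sketch below assumes that the reported total completeness is simply the product of the two reported fractions. The abstract does not state this formula, so treat it as an interpretation that happens to be consistent with the published numbers. The function name `overall_completeness` is hypothetical.

```python
def overall_completeness(fraction_documentable, fraction_with_data):
    """Assumed model: share of characteristics that have a structured EHR
    element, times the share of patients with data in those elements."""
    return fraction_documentable * fraction_with_data


documentable = 0.55  # characteristics that could be documented in the EHR
with_data = 0.64     # patients with data, given that an element exists

total = overall_completeness(documentable, with_data)
print(f"{total:.0%}")  # prints 35%
```

The product is 0.352, which rounds to the reported 35%.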