A Pilot Study Using Machine Learning and Domain Knowledge to Facilitate Comparative Effectiveness Review Updating

Southern California Evidence-based Practice Center, RAND Corporation, Santa Monica, CA (SRD, PGS, SH, SJN, AM, KDS).
Medical Decision Making (Impact Factor: 3.24). 09/2012; 33(3). DOI: 10.1177/0272989X12457243
Source: PubMed


BACKGROUND: Comparative effectiveness and systematic reviews require frequent and time-consuming updating. Results of earlier screening should be useful in reducing the effort needed to screen relevant articles.

METHODS: We collected 16,707 PubMed citation classification decisions from 2 comparative effectiveness reviews: interventions to prevent fractures in low bone density (LBD) and off-label uses of atypical antipsychotic drugs (AAP). We used previously written search strategies to guide extraction of a limited number of explanatory variables pertaining to the intervention, outcome, and study design. We empirically derived statistical models (based on a sparse generalized linear model with convex penalties [GLMnet] and a gradient boosting machine [GBM]) that predicted article relevance. We evaluated model sensitivity, positive predictive value (PPV), and screening workload reductions using 11,003 PubMed citations retrieved for the LBD and AAP updates.

RESULTS: GLMnet-based models performed slightly better than GBM-based models. When attempting to maximize sensitivity for all relevant articles, GLMnet-based models achieved high sensitivities (0.99 and 1.0 for AAP and LBD, respectively) while reducing projected screening by 55.4% and 63.2%. The GLMnet-based model yielded sensitivities of 0.921 and 0.905 and PPVs of 0.185 and 0.102 when predicting articles relevant to the AAP and LBD efficacy/effectiveness analyses, respectively (using a threshold of P ≥ 0.02). GLMnet performed better when identifying articles relevant to adverse effects for the AAP review (sensitivity = 0.981) than for the LBD review (0.685).

LIMITATIONS: The system currently requires MEDLINE-indexed articles.

CONCLUSIONS: We evaluated statistical classifiers that used previous classification decisions and explanatory variables derived from MEDLINE indexing terms to predict inclusion decisions. This pilot system reduced the workload associated with screening 2 simulated comparative effectiveness review updates by more than 50% with minimal loss of relevant articles.
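To make the modeling step concrete, here is a minimal sketch, not the authors' code: scikit-learn's elastic-net logistic regression stands in for the paper's GLMnet model, the feature matrices and labels are randomly generated placeholders, and only the P ≥ 0.02 decision threshold is taken from the abstract.

```python
# Minimal sketch (hypothetical data): elastic-net logistic regression as a
# stand-in for GLMnet, predicting citation relevance from binary
# MEDLINE indexing-term features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(1000, 200))   # stand-in indexing-term features
y_train = rng.integers(0, 2, size=1000)          # stand-in prior screening decisions
X_update = rng.integers(0, 2, size=(500, 200))   # citations from a simulated update
y_update = rng.integers(0, 2, size=500)

# Sparse GLM with a convex (elastic-net) penalty, as in glmnet.
model = LogisticRegression(penalty="elasticnet", l1_ratio=0.5,
                           solver="saga", C=1.0, max_iter=5000)
model.fit(X_train, y_train)

# Flag a citation for human screening when P(relevant) >= 0.02, the low
# threshold the abstract reports when maximizing sensitivity.
p = model.predict_proba(X_update)[:, 1]
flagged = p >= 0.02

tp = np.sum(flagged & (y_update == 1))
fn = np.sum(~flagged & (y_update == 1))
sensitivity = tp / max(tp + fn, 1)       # share of relevant articles retained
ppv = tp / max(flagged.sum(), 1)         # precision among flagged articles
workload_reduction = 1 - flagged.mean()  # citations no reviewer has to screen
print(sensitivity, ppv, workload_reduction)
```

With real review data, the threshold trades sensitivity against workload reduction: lowering it keeps more relevant articles at the cost of more manual screening, which is why the authors report results at a deliberately permissive cutoff.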

  • Medical Decision Making 04/2013; 33(3):313-5. DOI:10.1177/0272989X13480564 · 3.24 Impact Factor
  • ABSTRACT: BACKGROUND: Incentives offered by the U.S. government have spurred marked increases in the use of health information technology (IT). PURPOSE: To update previous reviews and examine recent evidence that relates health IT functionalities prescribed in meaningful use regulations to key aspects of health care. DATA SOURCES: English-language articles in PubMed from January 2010 to August 2013. STUDY SELECTION: 236 studies, including pre-post and time-series designs and clinical trials, that related the use of health IT to quality, safety, or efficiency. DATA EXTRACTION: Two independent reviewers extracted data on functionality, study outcomes, and context. DATA SYNTHESIS: Fifty-seven percent of the 236 studies evaluated clinical decision support and computerized provider order entry, whereas other meaningful use functionalities were rarely evaluated. Fifty-six percent of studies reported uniformly positive results, and an additional 21% reported mixed-positive effects. Reporting of context and implementation details was poor, and 61% of studies did not report any contextual details beyond basic information. LIMITATIONS: Potential for publication bias; evaluated health IT systems and outcomes were heterogeneous and incompletely described. CONCLUSIONS: Strong evidence supports the use of clinical decision support and computerized provider order entry. However, insufficient reporting of implementation and context of use makes it impossible to determine why some health IT implementations are successful and others are not. The most important improvement that can be made in health IT evaluations is increased reporting of the effects of implementation and context. PRIMARY FUNDING SOURCE: Office of the National Coordinator.
    Annals of internal medicine 01/2014; 160(1). DOI:10.7326/M13-1531 · 17.81 Impact Factor
  • ABSTRACT: Evidence-based medicine depends on the timely synthesis of research findings. An important source of synthesized evidence resides in systematic reviews. However, a bottleneck in review production involves dual screening of citations with titles and abstracts to find eligible studies. For this research, we tested the effect of various kinds of textual information (features) on the performance of a machine learning classifier. Based on our findings, we propose an automated system to reduce screening burden, as well as offer quality assurance. We built a database of citations from 5 systematic reviews that varied with respect to domain, topic, and sponsor. Consensus judgments regarding eligibility were inferred from published reports. We extracted 5 feature sets from citations: alphabetic, alphanumeric(+), indexing, features mapped to concepts in systematic reviews, and topic models. To simulate a two-person team, we divided the data into random halves. We optimized the parameters of a Bayesian classifier, then trained and tested models on alternate data halves. Overall, we conducted 50 independent tests. All tests of summary performance (mean F3) surpassed the corresponding baseline, P < 0.0001. The ranks for mean F3, precision, and classification error were statistically different across feature sets averaged over reviews; P values for Friedman's test were .045, .002, and .002, respectively. Differences in ranks for mean recall were not statistically significant. Alphanumeric(+) features were associated with the best performance; the mean reduction in screening burden for this feature type ranged from 88% to 98% for the second pass through citations and from 38% to 48% overall. A computer-assisted decision support system based on our methods could substantially reduce the burden of screening citations for systematic review teams and solo reviewers. Additionally, such a system could deliver quality assurance both by confirming concordant decisions and by naming studies associated with discordant decisions for further consideration. (A toy version of such a recall-weighted classifier is sketched after this item.)
    PLoS ONE 01/2014; 9(1):e86277. DOI:10.1371/journal.pone.0086277 · 3.23 Impact Factor
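As promised above, here is a minimal sketch of the kind of classifier and metric this abstract describes, not the study's actual pipeline: a multinomial naive Bayes model over citation text, scored with the recall-weighted F3 measure. The example texts, labels, and token pattern are hypothetical; only the use of a Bayesian classifier and the F3 metric come from the abstract.

```python
# Minimal sketch (toy data): naive Bayes citation screening scored with F3.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import fbeta_score

train_texts = ["randomized trial of drug A for fractures",
               "case report of an adverse event",
               "systematic review of bone density interventions",
               "editorial on screening policy"]
train_labels = [1, 0, 1, 0]   # 1 = eligible for the review, 0 = excluded

test_texts = ["trial of drug B in low bone density",
              "letter to the editor"]
test_labels = [1, 0]

# Alphanumeric token features, loosely analogous to the paper's
# "alphanumeric(+)" feature set (the exact feature pipeline differs).
vec = CountVectorizer(token_pattern=r"[A-Za-z0-9]+")
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
pred = clf.predict(vec.transform(test_texts))

# F3 weights recall 3x as heavily as precision, reflecting that missing an
# eligible study costs more than screening an extra citation.
print(fbeta_score(test_labels, pred, beta=3))
```

The choice of beta = 3 is what makes the metric suit screening: a classifier that flags a few extra citations loses little, while one that drops an eligible study is penalized heavily.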