Environment International 159 (2022) 107025
0160-4120/© 2021 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND IGO license
(http://creativecommons.org/licenses/by-nc-nd/3.0/igo/).
Evaluation of a semi-automated data extraction tool for public health
literature-based reviews: Dextr
Vickie R. Walker a,*, Charles P. Schmitt a, Mary S. Wolfe a, Artur J. Nowak b, Kuba Kulesza b, Ashley R. Williams c, Rob Shin c, Jonathan Cohen c, Dave Burch c, Matthew D. Stout a, Kelly A. Shipkowski a, Andrew A. Rooney a

a Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
b Evidence Prime Inc, Krakow, Poland
c ICF, Research Triangle Park, NC, USA
ARTICLE INFO
Handling Editor: Paul Whaley
Keywords:
Automation
Text mining
Machine learning
Natural language processing
Literature review
Systematic review
Scoping review
Systematic evidence map
ABSTRACT

Introduction: There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields.

Objectives: To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction.

Methods: Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr's performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies.

Results: The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow.

Discussion: Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports. The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields.
1. Introduction
Systematic review methodology is a rigorous approach to literature-based assessments that maximizes transparency and minimizes bias (O'Connor et al. 2019). Three main assessment formats (systematic reviews, scoping reviews, and systematic evidence maps) use these methods in a fit-for-purpose approach depending on the research question and project goals. Systematic reviews follow a pre-defined protocol to identify, select, critically assess, synthesize, and integrate evidence to answer a specific question and reach conclusions.
Abbreviations: SR, Systematic review; SCR, Scoping review; SEM, Systematic evidence map.
* Corresponding author at: NIEHS, P.O. Box 12233, Mail Drop K204, Research Triangle Park, NC 27709, USA. Express mail: 530 Davis Drive, Morrisville, NC
27560, USA.
E-mail address: Walker.Vickie@nih.gov (V.R. Walker).
https://doi.org/10.1016/j.envint.2021.107025
Received 1 July 2021; Received in revised form 7 October 2021; Accepted 3 December 2021
Box 1
A comparison of the literature-assessment steps between scoping reviews/systematic evidence maps and systematic reviews (1).

Problem Formulation and Protocol Development (percent of work time (2): 8%)
SCR/SEM workflow: Define research question and objectives, typically with a broad PECO (3), and all methods before conducting review. Best practice to publish protocol before conducting review for transparency; however, only minor impact on bias. Objectives are often open questions to survey broad topics and identify extent of evidence (i.e., areas that are data rich or data poor / data gaps). Protocol should describe key concepts that will be mapped (e.g., exposures) to support objectives.
SR workflow: Define research question, PECO, and all methods before conducting review to reduce bias. Best practice to publish protocol for transparency. Critical to publish protocol before starting evidence evaluation to reduce bias. Objectives are focused, closed questions (e.g., specific exposure-outcome pairs/hazards).

Identify the Evidence: Identify Literature (percent of work time (2): 7%)
SCR/SEM workflow: Develop search strategy to identify evidence relevant to address the question. Search is biased to address the degree of precision and certainty of the objectives, where a comprehensive search may not be necessary. Searches conducted in one or more major literature databases, or in a stepwise manner, to address objectives. Search terms are generally broad / topic based with lower specificity. Searches retrieve evidence supportive of multiple decisions and scenarios.
SR workflow: Develop comprehensive search strategy to identify all relevant evidence to address the question. Search is biased toward maximum number of sources to ensure identification of all evidence relevant to synthesis. Search includes literature databases, sources of grey literature, and published data. Search terms are highly resolved and specified for key elements of the objectives.

Identify the Evidence: Screen Studies (percent of work time (2): 17%)
SCR/SEM workflow: Screen studies against eligibility criteria from objectives and PECO. Inclusion and exclusion criteria are topic-based and may only address PECO at a high level. Included studies likely to address diverse scenarios.
SR workflow: Screen studies against eligibility criteria from objectives and PECO. Inclusion and exclusion criteria specified in detail for all PECO elements. Assure specific research question is efficiently addressed.

Identify the Evidence: Extract Data (percent of work time (2): 15%)
SCR/SEM workflow: Extract study meta-data and characteristics to address objectives. Flexible approach supports fit-for-purpose maps of varying degrees of comprehensiveness. Optional extraction of study findings and other characteristics depending on objectives.
SR workflow: Complete extraction of meta-data and results to address question. Entities determined by project objectives.

Evaluate the Evidence (percent of work time (2): 9%)
SCR/SEM workflow: Appraisal of studies is optional depending on objectives. Study characteristics relating to quality of study design and conduct, or internal validity, may be extracted. May include stepwise approach (e.g., methods mapped relative to objectives), or quality only assessed for studies addressing key outcomes.
SR workflow: Critical appraisal of included studies is essential to characterizing certainty in bodies of evidence. Performed as assessment of internal validity (risk of bias). May include external validity, sensitivity, other factors.

Summarize and Synthesize Data (percent of work time (2): 5%)
SCR/SEM workflow: SCRs and SEMs have limited or no synthesis and may only include summaries. Primary output shows extent of evidence and key characteristics relative to question and objectives. SCRs provide narrative summaries (limited or no synthesis) of evidence relative to objectives. SEM outputs include evidence map, database, or tables to support and inform decision making on question. Although data may inform multiple decisions, summary may be specific for decision-making context in objectives.
SR workflow: Quantitative synthesis addresses question and objectives where appropriate; qualitative synthesis used if pooling not appropriate. Synthesis supports a specific decision context. May include meta-analysis. Synthesis should address key features separately (e.g., evidence streams, exposures, health effects). Example for environmental health questions would synthesize hazard characterization data.

Integrate Evidence and Report Findings: Integrate Evidence (percent of work time (2): 8%)
SCR/SEM workflow: SCRs and SEMs do not typically include integration or synthesis. SEMs may identify regions of evidence with study characteristics associated with confidence or certainty.
SR workflow: Assessment of confidence or certainty in the results of the synthesis described according to the objectives. Should address certainty of each body of evidence relative to questions or objectives. Includes integration of the evidence base as a whole. Example for environmental health questions would provide detailed certainty of evidence for hazard or risk conclusions from exposure.

Integrate Evidence and Report Findings: Develop Report (percent of work time (2): 12%)
SCR/SEM workflow: All review outputs provided in accessible format. SCRs and SEMs do not typically provide conclusions. Good SEMs are interactive, sortable, and searchable. Outputs support and inform decision making on question. Outputs should inform research and analysis decisions, where data rich areas may support conclusions or data poor areas may serve as areas of uncertainty that could be addressed by research or evidence surveillance.
SR workflow: Report all conclusions in clear language and accessible format with answer to review question. Includes description of certainty of conclusions. Describes limitations in the review and limitations in the evidence base for assessing the question.

Project Management (4) (percent of work time (2): 19%)
SCR/SEM workflow: Oversight of team interactions and workflow performed to complete the review. Develop materials and guidance for steps in the review (screening, data extraction, etc.) and provide training. Manage communication and meetings for workflow, track progress, and address problems. Arrange and conduct pilot testing of review steps and revise approach based on lessons learned. Arrange for workflow integration of new tools, machine learning and AI features. Recruit technical experts and new team members and address conflict of interest. Plan for protocol, data, and document review.
SR workflow: Oversight is the same as SCR and SEM with additional steps for critical appraisal and integration of evidence.

(1) Note: For the purpose of the table, scoping reviews and systematic evidence maps are considered to have the same workflow; adapted from James et al. (2016) and Wolffe et al. (2020).
(2) Estimated percent of work time to complete a systematic review, adapted from Clark et al. (2020).
(3) Research questions for scoping and systematic reviews should be stated in terms of the Population, Exposure, Comparator, and Outcome (PECO) of interest. Scoping reviews and evidence maps sometimes do not include a specific comparator.
(4) Although project management is not typically considered a step in the literature review process, it took nearly 20% of time when considered as a separate function by Clark et al. (2020).
SR = Systematic Review, SCR = Scoping Review, SEM = Systematic Evidence Map.
The best systematic reviews use a comprehensive literature search based on a narrowly focused question to facilitate conclusions. Scoping reviews utilize systematic-review methods to summarize available data on broad topics to identify data-rich and data-poor areas of research and inform evidence-based decisions on further research or analysis. Systematic evidence maps use systematic-review methods to characterize the evidence base for a broad research area to illustrate the extent and types of evidence available via an interactive visual format that may be a stand-alone product or part of a scoping review (Wolffe et al. 2020). All three of these assessment formats are generally time-consuming and resource-intensive to conduct, primarily due to the need to accomplish most steps manually (Marshall et al. 2017), but also driven by the complexity of the data under consideration and the amount of relevant literature to evaluate.

The specific steps in a literature-based assessment depend on the goals and approach used, with five basic steps in most assessments: 1) define the question and methods for the review (i.e., problem formulation and protocol development), 2) identify the evidence, 3) evaluate the evidence, 4) summarize the evidence, and 5) integrate the evidence and report the findings. Box 1 compares these steps across scoping-review/systematic-evidence-map and systematic-review products. Many steps in the literature-based assessment process involve repetitive and rule-based decisions, which lend the steps to automated or semi-automated approaches.
The development and use of automation are steadily advancing in literature analysis, with much of the uptake focused on clinical and medical research. This progress may be related to funding advantages and the relative consistency of medical data and publications, or perhaps may be because clinical research has used systematic-review methodology longer. Although data sources are similar for many literature-based assessments, there are differences and unique aspects of the data relevant for addressing toxicology or environmental health questions versus clinical questions. One important difference is that environmental health assessments require the identification of research from multiple evidence streams (i.e., human, animal, and in vitro exposure studies), which necessitates training tools on publications addressing each evidence stream. In contrast, data that are relevant for addressing clinical and medical review questions come primarily from randomized controlled trials in human subjects. Even within the human data there is greater complexity in toxicology or environmental research, where a range of epidemiological study designs are used for investigating the health effects of environmental chemicals. Moreover, experimental animal studies measure more diverse endpoints and may report more data than clinical studies, resulting in longer and more complicated data extraction. Finally, cell-based assays and in vitro exposure studies provide valuable mechanistic insights for the question at hand, but these assays cover an even more diverse range of endpoints, platforms, technologies, and associated data. While clinical and environmental assessments both focus on health-related outcomes, the requirements for environmentally focused reviews expand beyond those considered in the medical field. Therefore, a tool that meets the needs of environmental health assessments could likely be applied successfully to clinical questions, while the opposite may not be true.
The systematic-review toolbox (http://systematicreviewtools.com) provides a catalogue of over 200 tools that address parts of all five assessment steps as well as associated tasks such as meta-analysis or collaboration (Marshall and Brereton 2015). Within the toolbox, there are multiple resources that support developing and implementing literature assessments using manual processes. For instance, several tools provide web-based forms for review teams to capture objectives, record search strategies, detail quality-assessment checklists, and record manual steps such as extracting evidence and making risk-of-bias or quality-assessment judgements. The availability of tools that support full or partial automation of literature-assessment processes is much more limited, despite recent advances in natural language processing (NLP), machine learning, and artificial intelligence (AI). This is especially true for the process of identifying evidence, where a combination of active learning and linguistic models can successfully predict the relevance of literature based on small samples of manually selected studies (e.g., Brockmeier et al. 2019; Howard et al. 2020; Rathbone et al. 2017; Wallace et al. 2012). These approaches have now been incorporated into several systematic-review tools. In contrast, the development and adoption of automation methods for steps three through five of a systematic review have been limited (Box 1). The risk-of-bias assessment of individual studies is a critical and time-consuming process in assessments that is generally considered to require subject-matter experts to evaluate complex factors in study design and reporting. While a few models exist to predict risk-of-bias ratings for clinical research studies (Marshall et al. 2017; Millard et al. 2016), such methods have not translated into adoption within mainstream systematic-review tools.
Several assessment steps rely on the extraction of identied data
from text, another widely recognized time-consuming process. Recent
developments in NLP, including both general extraction of named en-
tities and relationships (surveyed in Yadav and Bethard 2018) as well as
specic extraction of biomedical terms (e.g., chemicals, genes, and
adverse outcomes (reviewed in Perera et al. 2020)), suggest that
machine-based approaches are sufciently mature for semi-automated
data-extraction approaches. Some elements of human- and machine-
based data extraction are straightforward, including identifying the
species and sex of the experimental animal models. Other elements, such
as identifying the results of experimental assays, questionnaires, or
statistical analyses, are more complex because publications may report
the results from numerous assays and endpoints after multiple expo-
sures, doses, and time periods. Standardization of reporting is also
lacking, such that authors may report the experimental details using
different measurement units, different names for the same chemical, and
other variations in terminology (Wolffe et al. 2020). In addition, this
information may be located within the text of the publication or in a
table, gure, or gure caption.
In 2018, the Division of the National Toxicology Program (DNTP) participated in the National Institute of Standards and Technology Text Analysis Conference (NIST TAC) challenge by hosting the Systematic Review Information Extraction (SRIE) track to investigate the feasibility of developing machine-learning models to identify, extract, and connect data entities routinely extracted from environmental health experimental animal studies. The data entities included 24 fields such as species, exposure, dose level, time of dose, and endpoint. Creating the training and test sets required structured annotation of the various data-extraction entities and labeling them in the text of each article. Developing these datasets required more comprehensive study annotation than a typical data-extraction workflow because the training dataset needed to capture all entities and endpoints in a research publication rather than the subset that might be relevant for a given systematic review question. These training datasets are critical to providing a fixed format that can be automatically processed and interpreted by a computer for training and model development. Overall, the results of the challenge were promising in that model-derived annotation of design features from the methods section of experimental animal studies achieved results in some extraction fields that neared human-level performance, suggesting that computer-assisted data extraction is a viable option for assisting researchers in the labor- and resource-intensive steps of data extraction in the literature-assessment process (Schmitt et al. 2018).
Given the positive outcome of the NIST TAC challenge, we developed Dextr, a web-based tool designed to incorporate NLP data-extraction models (including but not limited to models developed for the NIST TAC Challenge) into annotation and data-extraction workflows to support literature-based assessments. Many potential features were considered as we established the design requirements for Dextr (Table 1), with three design features considered key for our needs. First, and most importantly, was the ability of the tool to make data-extraction predictions automatically, with the user able to manually verify the predicted entity or override and modify the extracted information (i.e., a semi-automated data-extraction approach where automated predictions are verified by the user). Second, we considered the capability to group the extracted entities (e.g., connect the species, strain, and sex of the animal model or the dose, exposure, and outcomes) critical to supporting hierarchical data extraction and greater utility of the extracted data. The third key feature was the ability of the tool to make token-level annotations (i.e., identifying a word, phrase, or specific sequence of characters) that can be used either in a typical data-extraction workflow as part of a literature review or to annotate studies for developing training datasets. The annotation of studies during data extraction has the potential to create training datasets without a separate, directed effort if the tool includes appropriate machine-readable export options. In addition to the three key features identified above, Table 1 describes unique characteristics of environmental health data and key challenges for data extraction considered in developing Dextr. Given the increasing volume of published studies, we believe that semi-automation of the labor-intensive step of data extraction by Dextr has great potential to improve the speed and accuracy of conducting literature-based assessments and reduce the workload and resources required without compromising the rigor and transparency that are critical to systematic-review methodology.

In this paper, we briefly describe the development of Dextr, investigate the tool's usability, and evaluate the tool's impact on performance in terms of recall, precision, and extraction time using this semi-automated approach in DNTP's data-extraction workflow.
2. Methods

2.1. Tool development

The underlying machine-learning model (Nowak and Kunstman 2018) was developed as part of the NIST TAC 2018 SRIE workshop (Schmitt et al. 2018). Briefly, the model is a deep neural network containing more than 31 million trainable parameters. It consists of pre-trained embeddings (Global Vectors for Word Representation: GloVe, and Embeddings from Language Models: ELMo), a bidirectional long short-term memory (LSTM) encoder, and a conditional random field (Nowak and Kunstman 2018). This model was developed and trained only on the methods sections of environmental health studies. Therefore, for this project, the model was similarly restricted to the methods section and performed the sequence-tagging task by producing a tag (denoting a single data-extraction field) for every token from the input in the methods section. The goal for Dextr was to develop a flexible user interface that met the design requirements in Table 1 and leverage this model within a literature-review workflow.
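To make the sequence-tagging output concrete, the sketch below shows one way a methods-section sentence could be represented as one tag per token. The sentence, tag names, and data layout are illustrative assumptions, not Dextr's internal format or the model's actual tag set.

```python
# Illustrative sketch of a per-token tagging result for a methods-section
# sentence. The tag names and layout are assumptions for illustration only.
sentence = "Female B6C3F1 mice were exposed to formaldehyde vapor"
tokens = sentence.split()

# One tag per token, with "O" marking tokens outside any data-extraction field.
predicted_tags = ["Sex", "Strain", "Species", "O", "O", "O", "TestArticle", "O"]

# Pair tokens with tags and collect the non-"O" spans for the extraction form.
tagged = list(zip(tokens, predicted_tags))
extracted = {tag: tok for tok, tag in tagged if tag != "O"}
print(tagged)
print(extracted)  # {'Sex': 'Female', 'Strain': 'B6C3F1', 'Species': 'mice', 'TestArticle': 'formaldehyde'}
```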
The project was conducted according to Agile software development methodology around two principal teams. A development team (AJN and KK) was formed to outline potential features and functions for Dextr and code the tool. An advisory team (AAR, CPS, RS, ARW, MSW, VRW) of experts with backgrounds in public health, literature analysis, and computational methods was then formed to guide the developers. The development team worked sequentially in "sprints" on clearly defined, testable pieces of functionality that could inform further planning and design. Throughout the process, the development team consulted with the advisory team, demonstrated newly added functionality, and presented mock-ups of the user interface illustrating key functions within the tool (e.g., project management screens, data import, extraction interfaces). As part of the Agile development sprints, each task had a test plan to verify its correctness. These test plans were then performed, first by the testers that were part of the development team, and then (if successful) by the advisory team. Members of the advisory team (ARW and RS) oversaw the project schedule and timeline and managed the development team and evaluation study. The development and advisory teams discussed potential refinements, suggested improvements, and agreed upon the approach to be implemented. When all features had been developed, a test version of the tool was produced and tested by the advisory team (ARW and RS). All issues or bugs identified during testing were addressed by the development team. When both teams agreed that the tool met the design requirements, the development team deployed the Minimum Viable Product (MVP) version of the tool to the Quality Assurance (QA) environment in April 2020. Before applying this initial version of the tool (Dextr v1.0-beta1) in daily work, QA testing and basic performance evaluation were conducted on the MVP version to quantify the potential gains in using a semi-automated workflow with Dextr compared to a manual workflow, as described in the following section.
2.2. Evaluation

The aim of the evaluation was to understand how the integration of Dextr, a semi-automated extraction tool employing a machine-learning model, would perform in the DNTP literature-review workflow. Specifically, we sought to understand how the tool would impact data extraction recall, precision, and extraction time compared to a manual workflow. The performance of the underlying machine-learning model was evaluated previously and was not within the scope of this evaluation (Schmitt et al. 2018). Although Dextr enables users to connect extracted entities, there is no difference in this aspect of the workflow between manual and semi-automated approaches. Therefore, the connection feature was rigorously tested and subject to QA procedures, but was not part of the evaluation. Similarly, as we have continued to refine the user interface and develop new capabilities for Dextr, we did not evaluate new features (such as the ability to use controlled vocabularies) that were not expected to negatively impact recall, precision, or extraction time.
Table 1
Dextr design requirements.

Interoperability
Challenge description: Efficiently import and export necessary file types.
Dextr features addressing the challenge: Ability to import various file types (i.e., CSV and RIS); allows bulk upload of PDFs (click and drag); exports as CSV or modified brat file types.

Usability features
Challenge description: User interface that operates with efficient mouse and keystroke options (flexibility for user preferences).
Dextr features addressing the challenge: Selection with mouse click options; project management features; easy-to-follow user interface; ability to modify the data extraction form.

Complex data
Challenge description: Environmental health sciences publications often report multiple experiments with various chemical exposures and doses and evaluate several endpoints (hierarchical data structure and groupings / connections).
Dextr features addressing the challenge: Ability to extract multiple entities including multiple animal models, exposures, and outcomes; ability to connect the metadata at various levels (i.e., dose-exposure-outcome pairings).

Annotations
Challenge description: Capability to annotate studies within a typical data extraction workflow that can be used to develop annotated datasets needed for training or developing new models.
Dextr features addressing the challenge: Token-level (i.e., word, phrase, or specific sequence of characters) annotations recorded for each extraction entity; ability to export annotations in a machine-readable format for model refinement and new model development.

Flexibility
Challenge description: Functionality that emphasizes flexibility for taking advantage of advancements in natural language processing.
Dextr features addressing the challenge: Ability to utilize regular expressions to identify a string of text (i.e., #### mg = dose) or keyword searches without models; ability to add validated models (3rd-party models) to the suite of available models.
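As a concrete illustration of the regular-expression option listed under the flexibility requirement, the sketch below shows how a simple pattern could flag dose-like strings such as "#### mg" in methods text. The pattern, function name, and unit list are assumptions for illustration, not part of Dextr.

```python
import re

# Minimal sketch of a regular-expression rule for dose-like strings, in the
# spirit of the "#### mg = dose" example in Table 1; pattern is illustrative.
DOSE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg/kg|mg/m3|ppm|mg|g)\b",
                          re.IGNORECASE)

def find_dose_candidates(text: str):
    """Return (value, unit) pairs that look like doses, for user verification."""
    return DOSE_PATTERN.findall(text)

print(find_dose_candidates(
    "Mice received 0, 10, or 50 mg/kg formaldehyde; chambers held 2 ppm."))
# [('50', 'mg/kg'), ('2', 'ppm')]
```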
The evaluation study was designed and conducted independently of the development team and consisted of two teams, a manual extraction team (RS, ARW, RB, KS) and a semi-automated extraction team (JR and JS) (Fig. 1). The manual extraction team included two manual extractors (RS and ARW), who read each study and manually extracted each data element, and two QA reviewers (RB and KS), who reviewed the manual data extractions and made any corrections as needed (Manual + QA). The semi-automated team included two QA reviewers (JR and JS) who reviewed the machine-generated data extractions and made any corrections as needed (Model + QA). The manual extractors had prior experience with Dextr related to development discussions and testing tasks. The QA reviewers on both teams had previous and similar levels of experience with conducting data extraction for literature reviews, received a user guide for Dextr, and completed a pilot test in Dextr prior to the evaluation. Reviewers were told to accept correct data, add missing data, and ignore data incorrectly identified by either the extractor or the model. Incorrect data were ignored to minimize any additional time reviewers would spend clicking to reject incorrect suggestions. This approach reflects use of Dextr for literature-based reviews; however, if Dextr is used to construct new training data for model development, then incorrect data would need to be labeled as such.

Extractor time and reviewer time were recorded within Dextr. Prior to beginning either extraction or review, users started a timer on the tool's user interface and paused or stopped the timer (as needed) until they completed their task. The times between start and stop actions were manually checked against a complete event log to identify potential cases of the timer not being started or stopped, and summed for each study to provide a total extraction time.

All the statistical analyses (JC) used in the evaluation were conducted using SAS Version 9.4 (SAS Institute Inc., Cary, NC). The generalized linear mixed model regressions were fitted using the GLIMMIX procedure with maximum likelihood estimation based on the Laplace approximation. Two-sided t tests were used to test the null hypothesis that a given fixed-effect coefficient or a linear combination of the coefficients (such as the estimated difference between the log odds of recall for the manual and semi-automated modes averaged across fields and extractors) was zero. Similarly, one-sided F tests were used to test null hypotheses for interactions and contrasts, i.e., that all the corresponding linear combinations of the coefficients were zero. All statistical tests (two-sided t tests for estimated fixed effects and one-sided F tests for interactions and contrasts) were carried out at the 5% significance level. P-values are shown in the tables. There were no missing data and no removal of potential outliers.
2.3. Pilot evaluation

An initial pilot evaluation was conducted on 10 studies, and the results were used to calculate the sample size for the number of studies to be included in the evaluation. All software requires a general understanding of its functionality and features to navigate the user interface and perform the tasks it is designed for, in this case conducting data extractions. The goals of the pilot were to prepare for evaluating Dextr's performance, not to assess the learning process of new users. Therefore, an extraction guidance document was written so participants would better understand how Dextr worked and to minimize the impact of the learning curve. The guidance was developed and reviewed by the advisory team and the development team prior to sharing it with the extraction team. Extraction team members provided usability feedback after the pilot; however, no changes were implemented to Dextr before the evaluation study. Since no changes were made to the tool based on the pilot, the results from the 10 pilot studies were included in the main evaluation study.

The evaluation sample size was selected using a statistical power analysis based on statistical models fitted to data from the pilot study. These statistical models used similar but less complicated formulations than the final models fitted to the final data. Using the same notation as in the Evaluation Metrics section, the pilot-study statistical model for recall was of the form:

Logit(recall) = intercept + α_i + β·c + θ_s,

where α_i is a fixed factor for the mode, θ_s is a random factor for the study, drawn from a normal distribution with mean zero, and c is the quantitative complexity score (i.e., the calculated score divided by 100). The logit is the log odds. The same model formulation was used for precision. The pilot-study statistical model used for time was of the form:

Log(time) = intercept + α_i + β·c + θ_s + error,

where error is normally distributed with mean zero and is independent of the random factor θ_s.

For several candidate values of K, data for K studies were simulated from each fitted model 100 times each under the alternative hypotheses, and the same statistical model was refitted to the simulated data. 100 simulations were used because the iterative method for fitting the models is computer intensive. For recall, each field was assumed to have 4 gold-standard tags (the average number in the pilot data, rounded to the nearest integer). For precision, each field was assumed to have 3 tags (the average number in the pilot data, rounded to the nearest integer). The simulated complexity scores were equally likely to be any of the 10 pilot-study complexity scores. For the manual mode, the simulated data used the fitted statistical models for the manual mode. For the semi-automated mode, the simulated data for recall and precision used the same model but increased the log odds by a fixed amount, delta. For the semi-automated mode, the simulated data for time used the same model but decreased the geometric mean time by a fixed percentage, perc. The estimated statistical power was the proportion of the simulated models where the difference between the two modes was statistically significant at the 5% significance level.

Based on 100 simulations from the fitted models and using a 5% significance level, we found that a sample of 50 studies would be sufficient to have an estimated statistical power of 100% (95% confidence interval (96.4, 100)%) to detect an increase or decrease of 1 in the log odds of recall, or a decrease of 1 in the log odds of precision; 98% (95% confidence interval (93.0, 99.8)%) to detect an increase of 1 in the log odds of precision; and 97% (95% confidence interval (91.4, 99.4)%) to detect a 20% decrease in median time. The confidence intervals account for the uncertainty due to the fact that only 100 simulations were used. Therefore, a final sample size of at least 50 studies was selected and 51 studies were chosen for the gold-standard dataset.
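A simplified sketch of the simulation-based power calculation is shown below. It is an assumption-laden illustration, not the authors' SAS code: the intercept, slope, and study variance are made-up stand-ins for the pilot estimates, and the refit uses a plain binomial GLM (ignoring the random study effect) via statsmodels.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def simulate_power(n_studies=50, n_sims=100, delta=1.0,
                   intercept=2.0, beta=-1.0, study_sd=0.5,
                   tags_per_field=4, n_fields=5, alpha=0.05):
    """Estimate power to detect a shift of `delta` in the log odds of recall.

    Illustrative parameter values only; each simulated study contributes
    `n_fields` fields with `tags_per_field` gold-standard tags per mode.
    """
    complexity = rng.uniform(0.05, 0.4, size=n_studies)   # stand-in for pilot scores
    significant = 0
    for _ in range(n_sims):
        rows, successes, totals = [], [], []
        for s in range(n_studies):
            u = rng.normal(0.0, study_sd)                  # random study effect
            for mode in (0, 1):                            # 0 = manual, 1 = semi-automated
                logit = intercept + delta * mode + beta * complexity[s] + u
                p = 1.0 / (1.0 + np.exp(-logit))
                n_tags = tags_per_field * n_fields
                successes.append(rng.binomial(n_tags, p))
                totals.append(n_tags)
                rows.append([1.0, mode, complexity[s]])
        y = np.column_stack([successes, np.array(totals) - np.array(successes)])
        fit = sm.GLM(y, np.asarray(rows), family=sm.families.Binomial()).fit()
        if fit.pvalues[1] < alpha:                         # p-value for the mode effect
            significant += 1
    return significant / n_sims

print(simulate_power())
```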
Fig. 1. Evaluation study design with two teams: manual extraction and semi-
automated extraction. The manual-extraction team included two extractors
who completed the primary extraction, followed by two QA reviewers. The
semi-automated extraction team included the algorithm that completed the
primary extraction, followed by two different QA reviewers.
2.3.1. Data-extraction elds
We selected ve data-extraction elds (test article, species, strain,
sex, and endpoint) for the evaluation study from the full list of 24
extraction elds included in the NIST TAC SRIE challenge dataset
(Schmitt et al. 2018). The challenge evaluated models using the F1
metric, which is a harmonic mean of precision (i.e., positive predictive
value) and recall (i.e., sensitivity). The ve data elds represented a mix
of elds with high (species, sex), medium (strain), and low (test article,
endpoint) F1 scores across all models previously evaluated in the
challenge.
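For reference, the F1 metric mentioned above can be computed as in the short sketch below. This is the generic definition; the counts are illustrative and are not values from this study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (not taken from the evaluation):
tp, fp, fn = 80, 10, 20
precision = tp / (tp + fp)   # positive predictive value
recall = tp / (tp + fn)      # sensitivity
print(round(f1_score(precision, recall), 3))  # 0.842
```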
2.3.2. Gold-standard dataset

The gold-standard dataset comprised respiratory endpoints associated with exposure to biocides manually extracted from 51 experimental animal studies (Supplemental File S1). Although in vitro, experimental animal, and epidemiological study designs are of interest, we focused on experimental animal studies for the evaluation of Dextr because the model used had been developed and trained on experimental animal studies. Guidance on respiratory outcomes or endpoints provided to the extractors and QA reviewers is shown in Table 2. The teams were told that the table was not an exhaustive list and were instructed to identify any respiratory effect evaluated. The gold-standard dataset was developed by a separate extractor (PH), not included in either extraction team, who read the papers and extracted information on test article, species, strain, sex, and endpoints. A QA review was performed on the resulting dataset, or gold-standard dataset, by an independent QA reviewer (VW), who was also not on either extraction team.
2.3.3. Evaluation criteria

The results from Dextr were manually assessed by a single grader (RS or ARW) and compared to the gold-standard dataset. In brief, final data from the manual mode (Manual + QA) and from the semi-automated mode (Model + QA) were exported from Dextr into a CSV file. The results of each study by mode (manual or semi-automated) were graded separately. Each extracted data element was compared to the gold standard and marked as either a "true positive" (TP), if it matched an element in the gold standard, or a "false positive" (FP), if it was an additional element not included in the gold standard. Out of 3334 results, 985 were identified as FPs. Fifty-five of these were manually flagged for further investigation for reasons such as a possible gold-standard match, duplicate finding, or human error. Gold-standard data elements that were not included in the Dextr results were marked as a "false negative" (FN). Endpoints in the Dextr exports that were more specific than those in the gold standard were considered a TP. For example, if a Dextr result had an endpoint of "lung myeloid cell distribution" and "lung CD4+ T cell numbers," but the gold-standard endpoint was "lung myeloid cell distribution (B and T cells)," then both Dextr-identified endpoints were considered TPs. Out of 1561 TP results, 540 were exact matches while 1021 were not exact matches for reasons such as plural versus singular, abbreviated or not, order of terms differed, or more detail provided in one source or the other. Of these non-exact matches, 282 did not exactly match due to plural/singular/abbreviation discrepancies while 739 had a slight difference in wording but were still considered a match. The graders consulted a tertiary grader (VW) to make a final decision in cases where a data element was identified in the Dextr results and missed in the gold standard. For QA, four studies were independently graded by a separate grader (either ARW or RS) and compared to the initial grading. Changes to the grading were made based on discussions between the graders. If questions between graders remained, then a tertiary grader (VW) was consulted to provide clarity and a final decision. Duplicate data elements in the Dextr results were graded only once.
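The sketch below illustrates the grading logic in simplified form: extracted elements are compared to the gold standard to count TPs, FPs, and FNs, from which recall and precision follow. The matching rule here is plain case-insensitive string equality, which is an assumption; the actual grading also accepted near matches and more-specific endpoints as described above.

```python
def grade_study(extracted: list[str], gold: list[str]):
    """Count TP/FP/FN for one study and field using exact, case-insensitive matching.

    Simplified assumption: the real grading also accepted near matches (plural
    forms, abbreviations, more-specific endpoints), which this sketch skips.
    """
    gold_set = {g.lower() for g in gold}
    extracted_set = {e.lower() for e in extracted}   # duplicates graded only once
    tp = len(extracted_set & gold_set)
    fp = len(extracted_set - gold_set)
    fn = len(gold_set - extracted_set)
    return tp, fp, fn

tp, fp, fn = grade_study(
    extracted=["rat", "Sprague-Dawley", "lung fibrosis"],
    gold=["Rat", "Sprague-Dawley", "lung fibrosis", "airway resistance"],
)
recall = tp / (tp + fn)       # 3 / 4 = 0.75
precision = tp / (tp + fp)    # 3 / 3 = 1.0
print(tp, fp, fn, recall, precision)
```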
We calculated a complexity score for each study to account for the additional effort an extraction would take based on the number of variations of an experiment. For example, we anticipated that a study with multiple test articles would be more difficult to extract than a study with only one test article. We were unable to find an established method to address complexity, and therefore complexity scores were developed using expert judgment from experienced extractors. Although the complexity scores were designed to address study characteristics overall, the score is based on study characteristics that relate to the specific data-extraction elements for this paper. The number of data-extraction elements by field in the gold-standard dataset was multiplied by the weights shown in Table 3 and summed across the five data fields to calculate the score. The advisory team developed the weights for each field based on judgement related to how complex an extraction task was given multiple test articles, species, strains, sexes, and endpoints. Additional test articles and species were identified as introducing complexity to an extraction, while most studies examined multiple endpoints, which did not dramatically add time to an extraction.
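As a worked example of the scoring rule, the sketch below multiplies per-field element counts by the Table 3 weights and sums them. The element counts shown are hypothetical, and the division by 100 reflects the rescaling described later for the statistical analyses.

```python
# Field weights from Table 3.
WEIGHTS = {"endpoint": 0.5, "sex": 1, "species": 2, "strain": 1, "test article": 2}

def complexity_score(element_counts: dict[str, int]) -> float:
    """Weighted sum of gold-standard element counts across the five fields."""
    return sum(WEIGHTS[field] * n for field, n in element_counts.items())

# Hypothetical study: 20 endpoints, 1 sex, 1 species, 1 strain, 2 test articles.
counts = {"endpoint": 20, "sex": 1, "species": 1, "strain": 1, "test article": 2}
score = complexity_score(counts)   # 0.5*20 + 1 + 2 + 1 + 2*2 = 18
print(score, score / 100)          # rescaled value used in the statistical models
```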
2.4. Usability feedback

After completion of the evaluation study, we asked manual and semi-automated QA reviewers to provide qualitative feedback on their user experience with Dextr. To summarize the assessment of usability across the reviewers, six open-ended user-experience questions were developed, and responses for each question were recorded and compiled (Table S1). Note that the feedback reflects user experience during the pilot and evaluation phases.
2.5. Evaluation metrics

We evaluated the utility of Dextr in DNTP's workflow on three key metrics: recall, precision, and extraction time. The recall rate is the probability, prob(recall), that a gold-standard tag was correctly recalled. The precision rate is the probability, prob(precision), that an identified tag was a gold-standard tag. In the main paper, we compare arithmetic means of the recall rate or precision rate, or the median total extraction time, with predictions from fitted statistical models "unstratified by field" that estimate the rates or medians as a single function of the mode, field, and other explanatory variables.
Table 2
Respiratory outcome examples and key terms (1).

Organ/tissue: nasal cavity and paranasal sinus, nose (including olfactory), larynx, trachea, pharynx, pleura, lung/pulmonary (including bronchi, alveoli), glottis, epiglottis
Signs/symptoms: sneezing/sniffling, nasal congestion, nasal discharge (e.g., rhinorrhea), coughing, increased mucus/sputum/phlegm, breathing abnormalities (e.g., wheezing, shortness of breath, unusual noises when breathing)
Respiratory-related diseases or conditions: fibrosis, asthma, emphysema, chronic obstructive pulmonary disease (COPD), pneumonia, sinusitis, rhinitis, granuloma (or other inflammation)
Lung function measurements: forced expiratory volume (FEV), forced vital capacity (FVC), peak expiratory flow (PEF), expiratory reserve volume (ERV), functional residual capacity (FRC), vital capacity (VC), total lung capacity (TLC), airway resistance, mucociliary clearance

(1) Examples of outcomes or endpoints and key terms in this table were provided in a guidance document for training of extractors and QA reviewers.
Table 3
Assigned weights to data-extraction fields within Dextr (1).

Field | Weight
Endpoint(s) | 0.5
Sex | 1
Species | 2
Strain | 1
Test article | 2

(1) Weights were developed using expert judgment to capture how additional extraction elements introduced complexity into the extraction task.
In the Supplemental Materials, we present alternative models for the recall and precision rates "stratified by field," where for each field the recall or precision rate is separately modeled.
The log odds of recall is dened as logit(recall) =log {prob(recall) /
(1 - prob(recall)) }, where logis the natural logarithm. The tted
statistical model is a version of the model used in Saldanha et al. (2016).
The statistical model assumes that the log odds of recall for a given gold-
standard tag is a function of the mode (i =manual or semi-automated),
complexity score (c), eld (f =endpoint, sex, species, strain, or test
article), primary extractor (p =primary1 or primary2 for manual mode,
NULL for semiautomated mode), quality assurance reviewer (q =q1 or
q2 for the two semi-automated mode reviewers, q3 or q4 for the two
manual mode reviewers), and study (s =different values for each of the
51 studies). For each combination of study, mode, and eld, the recall
outcomes for each of the gold-standard tags are independent and have
the same log odds, giving a binomial distribution for the number of gold-
standard tags correctly recalled. For all these statistical analyses, the
complexity score dened above was divided by 100 to improve model
convergence without changing the underlying model formulation. The
general model used the equation:
Logit(recall) = intercept +
α
i+βc+γf+δp+
ε
q+(
α
γ)if +(
α
β)ic+(βγ)fc+θs.
In this general model,
α
iandγf are xed factors for the mode and eld;
δp,
ε
q, and θs are random factors for the primary extractors, QA re-
viewers, and study, drawn from independent normal distributions with
mean zero; and c is the quantitative complexity score (i.e., the calculated
score divided by 100). The terms (
α
γ)if ,(
α
β)i,and(βγ)f are interaction
terms for mode ×eld, mode ×complexity score, and complexity score
×eld. Thus, the model allows the effect of the eld to vary with the
mode or with the complexity score and allows the effect of the mode to
vary with the complexity score.
We were unable to t this general model to the data due to problems
with extremely high standard errors for the mode ×eld interaction and
some convergence issues, although there were no problems with com-
plete or quasi-complete separation of the logistic regression models. For
example, in the initial model with random factors, the estimated vari-
ance for the QA reviewer was zero, but the corresponding gradient of
minus twice the log-likelihood was over 150 instead of being at most
0.001, the convergence criterion. For the nal model we therefore
removed the mode ×eld interaction and replaced the random factors
for the primary and QA reviewers by xed factors. Replacing the random
factors by xed factors might limit the generalizability of these results to
other potential reviewers. We also removed main effects and in-
teractions that were not statistically signicant at the 5% level. It is
possible that excluding interactions and replacing random factors by
xed factors could have introduced some bias and might limit the
generalizability of the study results.
The nal model was of the form:
Logit(recall) = intercept +
α
i+βc +γf+δp+θs
where the only random factor is the study effect. In particular, this
model does not have an interaction between mode and eld, so the
estimated differences in log odds between modes are the same for every
eld. Additionally, as noted above, to evaluate differences in the mode
effect across different elds we tted alternative models stratied by
eld, and those results are shown in the Supplemental Materials. In
particular, the stratied models show large differences between the
estimated study variances for different elds.
The precision rate is the probability, prob(precision), that an identified tag was a gold-standard tag. The log odds of precision is defined as logit(precision) = log{prob(precision) / (1 - prob(precision))}. The general statistical model for precision used the same formulation as the above model for recall. As before, the final model did not include the mode × field interaction due to extremely high standard errors, and we replaced the random factors for the primary and QA reviewers by fixed factors. After removing non-significant main effects and interactions, the final model (using the same notation) was of the form:

Logit(precision) = intercept + α_i + β·c + γ_f + (αβ)_ic + θ_s,

where the only random factor is the study effect. In particular, this model does not have an interaction between mode and field, so the estimated differences in log odds between modes are the same for every field.

For each study and mode, the total extraction time, including the primary and QA reviews, was recorded. The time taken for each field was not recorded. The general model for time taken assumes that the natural logarithm of the time taken is a function of the mode, complexity score, primary extractor, and QA reviewer. Using the same notation as before, the general model is of the form:

Log(time) = intercept + α_i + β·c + δ_p + ε_q + (αβ)_ic + θ_s + error,

where error is normally distributed with mean zero and is independent of the random factors δ_p, ε_q, and θ_s. The interaction term for mode × complexity score was not statistically significant, and again it was necessary for convergence to replace the random factors for the primary and QA reviewer by fixed factors. The primary extractor effect was not statistically significant at the 5% level. The final model was of the form:

Log(time) = intercept + α_i + β·c + ε_q + θ_s + error.
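For readers who want to reproduce the flavor of these analyses outside SAS, the sketch below fits a simplified version of the final log-time model (a random study intercept, with the QA-reviewer fixed effect omitted to keep the toy design full rank) using Python's statsmodels. It is an approximate analogue of PROC GLIMMIX for the linear case, not the authors' code, and the data, column names, and parameter values are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumed columns and values, not the study data):
# one row per (study, mode) with total extraction time in seconds.
rng = np.random.default_rng(1)
rows = []
for study in range(51):
    complexity = rng.uniform(0.05, 0.4)      # rescaled complexity score
    study_effect = rng.normal(0.0, 0.3)      # random study intercept (theta_s)
    for mode in ("manual", "semi"):
        mu = 6.8 + (-0.7 if mode == "semi" else 0.0) + 1.0 * complexity + study_effect
        rows.append({"study_id": study, "mode": mode, "complexity": complexity,
                     "time_seconds": float(np.exp(mu + rng.normal(0.0, 0.2)))})
df = pd.DataFrame(rows)
df["log_time"] = np.log(df["time_seconds"])

# Simplified time model: log(time) ~ mode + complexity with a random intercept
# per study; the residual plays the role of "error" in the paper's notation.
fit = smf.mixedlm("log_time ~ C(mode) + complexity", data=df,
                  groups=df["study_id"]).fit()
print(fit.summary())
# Exponentiating the mode coefficient gives a ratio of geometric mean times.
```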
3. Results

3.1. Dextr functionality

The first version of the tool that we evaluated in this study fulfills the five design principles outlined at the tool's inception. Specifically, the tool's set-up feature provides interoperability within the existing DNTP workflow, where users can upload .ris and .pdf files and export the extracted data in two forms, as .csv or .zip files. The .zip format allows exported data to be uploaded into brat (an open-source annotation software tool; https://brat.nlplab.org/). The ability to export data in a structure readable by brat allows users to leverage project data for future model development.
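To show what a brat-readable export can look like, the sketch below writes a pair of standoff files (a .txt file with the document text and a .ann file with token-level entity annotations). The entity types, file names, and offsets are illustrative assumptions rather than Dextr's actual export schema, but the .ann line layout (ID, type with character offsets, covered text) follows the brat standoff convention.

```python
from pathlib import Path

text = "Female B6C3F1 mice were exposed to formaldehyde vapor for 13 weeks."

# (entity_type, start_offset, end_offset) into `text`; values are illustrative.
annotations = [
    ("Sex", 0, 6),            # "Female"
    ("Strain", 7, 13),        # "B6C3F1"
    ("Species", 14, 18),      # "mice"
    ("TestArticle", 35, 47),  # "formaldehyde"
]

Path("study1.txt").write_text(text)
with open("study1.ann", "w") as ann:
    for i, (etype, start, end) in enumerate(annotations, start=1):
        covered = text[start:end]
        # brat standoff line: T<id> <TAB> <type> <start> <end> <TAB> <text>
        ann.write(f"T{i}\t{etype} {start} {end}\t{covered}\n")
```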
In terms of usability requirements, Dextr enables users to select text using their mouse or type a phrase into the extraction form. The default extraction form consists of the five extraction fields, all powered by the underlying model to provide predictions. Users can customize the data-extraction form; however, only the fields on which the model was trained are supported by automation.

We designed the default form and the tool to be able to handle relationships between the extraction entities, called "connections." This allows the user to specify a hierarchy between fields (e.g., multiple animal models can be defined, each with a species, strain, and sex). The animal model and endpoints can then be connected to a test article to create a separate experiment within the study, satisfying the third design principle related to handling complex, hierarchical data.
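One way to picture these connections is as a nested data structure, as in the hedged sketch below; the class and field names are assumptions for illustration, not Dextr's data model, and the example values are invented.

```python
from dataclasses import dataclass, field

# Illustrative model of connected extraction entities; names are assumed.
@dataclass
class AnimalModel:
    species: str
    strain: str
    sex: str

@dataclass
class Experiment:
    test_article: str
    animal_model: AnimalModel
    endpoints: list[str] = field(default_factory=list)

# A single study can hold several experiments, each grouping its own
# species/strain/sex, test article, and endpoints.
study = [
    Experiment("formaldehyde", AnimalModel("mouse", "B6C3F1", "female"),
               ["airway resistance", "lung fibrosis"]),
    Experiment("glutaraldehyde", AnimalModel("rat", "F344", "male"),
               ["nasal lesions"]),
]
print(study[0].animal_model.species, study[0].endpoints)
```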
After project set-up, users and team members can begin extracting data elements (manually or semi-automatically) via the "My Tasks" page. Users then claim available PDFs (i.e., select a given PDF as part of the user's tasks) and access the full-text PDF within the tool to facilitate data extraction. The user can highlight the text and associate it with an extraction field. Additionally, if the exact text or phrase is not within the article itself, the user can type the appropriate text into the extraction field. These flexible options for highlighting text to populate the extraction form are useful for data extraction within a typical literature-assessment workflow, or for a more detailed annotation workflow by generating a dataset that may be used by model developers, thus satisfying the fourth design principle related to annotations. In both the manual and semi-automated workflows, the primary extractor or machine predictions are populated in the extraction form before a user accesses the study. The user then has the option to accept, ignore, or reject the extracted data, or add additional data if they are missing from the form.

The last design requirements, the ability for other models to be easily incorporated into the tool and the ability to adapt to new model developments over time, were both addressed in the initial version of Dextr but not tested in the evaluation study.
3.1.1. Usability feedback

Three of the reviewers were available to participate in a feedback discussion of the tool's usability (two of the semi-automated QA reviewers and one of the manual QA reviewers). Two reviewers rated the usability of Dextr a 5 out of 10, while the other reviewer rated it an 8 out of 10. The semi-automated reviewers provided feedback on how the tool could be improved related to automatic page navigation and organization of the user interface. The semi-automated QA reviewers liked how the tool organized the extractions and agreed that the tool helped them stay organized. Once they were comfortable with the tool, they were able to work smoothly and efficiently. Reviewers identified one drawback regarding how the tool handled endpoints. All three of the reviewers found it difficult to keep track of endpoints identified by either the machine or a primary extractor. It was a challenge to find previously reviewed and accepted endpoints as reviewers continued searching for new endpoints in the extraction list. This issue was more noticeable when multiple, similar endpoints had been identified. They suggested that a more organized process for listing and tracking endpoints would improve the tool's usability.
3.1.2. Statistical models for recall, unstratied by eld
A total of 51 toxicological studies were included in the final dataset. The ability of the model to correctly apply the gold-standard tag, or modeled overall recall rate, was 97.0% for the manual mode and 91.8% for the semi-automated mode. The difference in recall rates for the manual mode compared to the semi-automated mode was observed to be statistically significant (p < 0.01) (Table 4). These results are comparable to the arithmetic means of the recall rates across all studies and fields, which were 91.8% for the manual mode and 83.8% for the semi-automated mode. Table 4 also provides the estimate, standard error, and p-value for the difference between the log odds of the two modes. Table 5 shows the estimated log odds and recall probabilities as well as the very similar arithmetic mean recall rates for each field for the manual and semi-automated modes. Note that because there is no interaction term for mode × field, the estimated differences in log odds between the two modes are the same for every field and equal the values in the last row of Table 4. Estimates and standard errors for the fixed effects and random effects related to recall are shown in Table S2.
3.1.3. Statistical models for precision, unstratified by field
The modeled overall precision rate was 95.4% for the manual mode and 96.0% for the semi-automated mode. The precision rate for the semi-automated mode was higher, but the difference was not statistically significant (Table 6). These results can be compared with the arithmetic means of the precision rates across all studies and fields, which were 92.5% for the manual mode and 93.2% for the semi-automated mode. Table 6 gives the estimated log odds and precision probabilities for the manual and semi-automated modes, weighting each field equally, along with their standard errors and p-values.
Table 5
Recall comparison between manual and semi-automated modes for each mode and field, averaged over evaluators, based on the model unstratified by field.¹
Extraction Mode | Field | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Recall Rate
Manual | Endpoint | 1.133 (0.115) | <0.0001 | 0.756 (0.021) | 0.744
Manual | Sex | 4.856 (0.726) | <0.0001 | 0.992 (0.006) | 0.980
Manual | Species | 5.444 (1.014) | <0.0001 | 0.996 (0.004) | 1.000
Manual | Strain | 3.893 (0.477) | <0.0001 | 0.980 (0.009) | 0.980
Manual | Test article | 2.091 (0.180) | <0.0001 | 0.890 (0.018) | 0.883
Semi-automated | Endpoint | 0.068 (0.106) | 0.5259 | 0.517 (0.027) | 0.523
Semi-automated | Sex | 3.791 (0.721) | <0.0001 | 0.978 (0.016) | 0.990
Semi-automated | Species | 4.379 (1.011) | <0.0001 | 0.988 (0.012) | 0.980
Semi-automated | Strain | 2.828 (0.470) | <0.0001 | 0.944 (0.025) | 0.922
Semi-automated | Test article | 1.026 (0.168) | <0.0001 | 0.736 (0.033) | 0.773
¹ Assumes average study complexity scores (0.175).
Table 4
Recall comparison between manual and semi-automated modes when averaged across fields and extractors, based on the model unstratified by field.³
Extraction Mode | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Recall Rate
Manual | 3.483 (0.287) | <0.0001 | 0.970 (0.008) | 0.918
Semi-automated¹ | 2.418 (0.278) | <0.0001 | 0.918 (0.021) | 0.838
Comparison² | 1.065 (0.109) | <0.0001 | |
¹ Dextr predictions confirmed by QA reviewer.
² Comparison between manual and semi-automated extraction modes.
³ Assumes average study complexity scores (0.175).
Table 6
Precision comparison between manual and semi-automated modes when averaged across fields and extractors, based on the model unstratified by field.³
Extraction Mode | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Precision Rate
Manual | 3.040 (0.281) | <0.0001 | 0.954 (0.012) | 0.925
Semi-automated¹ | 3.174 (0.287) | <0.0001 | 0.960 (0.011) | 0.932
Comparison² | 0.134 (0.151) | 0.3765 | |
¹ Dextr predictions confirmed by QA reviewer.
² Comparison between manual and semi-automated extraction modes.
³ Assumes average study complexity scores (0.175).
Table 6 also provides the estimate, standard error, and p-value for the difference between the log odds of the two modes. Table 7 shows the estimated log odds and precision probabilities as well as the very similar arithmetic mean precision rates for each field for the manual and semi-automated modes. Note that because there is no interaction term for mode × field in the final model, the estimated differences in log odds between the two modes are the same for every field and equal the values in the last row of Table 6. Estimates and standard errors for the fixed effects and random effects related to precision are shown in Table S3.
3.1.4. Statistical models for time
The modeled median time was 933 s for the manual mode and 436 s for the semi-automated mode. The median time, which is the exponentiated mean log(time), was significantly lower for the semi-automated mode (p < 0.01). These results can be compared with the arithmetic means of the time across all studies: 971 s for the manual mode and 517 s for the semi-automated mode (Table 8). For each mode, Table 8 gives the estimated means for log(time), the standard errors of the means, and the estimated medians for time. For this model, the median time is the same as the geometric mean time. Table 8 also provides the estimate, standard error, and p-value for the difference between the mean log(time) for the two modes. Estimates and standard errors for the fixed effects and random effects related to total time are shown in Table S4.
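Because the time model was fit on log(time), the modeled median is the exponentiated mean log(time), and the percentage saving follows from the difference in means on the log scale. The short sketch below reproduces that arithmetic from the values in Table 8; it is illustrative only.

```python
import math

mean_log_manual = 6.838  # mean log(time), manual mode (Table 8)
mean_log_semi = 6.079    # mean log(time), semi-automated mode (Table 8)

median_manual = math.exp(mean_log_manual)  # ~933 s
median_semi = math.exp(mean_log_semi)      # ~437 s (436 s reported, due to rounding)
print(round(median_manual), round(median_semi))

# The difference in mean log(time) translates into a multiplicative effect:
reduction = 1 - math.exp(mean_log_semi - mean_log_manual)
print(f"{reduction:.0%} lower predicted median time")  # ~53%
```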
4. Discussion
Data extraction is a time- and resource-intensive step in the literature-assessment process. Machine-learning methods for automating data extraction have been explored to address this challenge; however, the use of machine learning for data extraction has been limited to date, particularly in the field of environmental health sciences. Development and uptake of advanced approaches for extraction lag behind other steps in the review process such as literature screening, where automated screening tools have been established and used more widely. In this paper, we introduced Dextr, a web-based data-extraction tool that pairs machine-learning models that automatically predict data-extraction entities with a user interface that enables manual verification of extracted information (i.e., a semi-automated method). This powerful tool does more than provide a convenient user interface for extracting data; the tool's extraction scheme supports complex data extraction from full-text scientific articles with methods to capture data entities as well as connections between entities. With this advanced approach, Dextr supports hierarchical data extraction by allowing users to identify relationships (e.g., the connections between species, strain, sex, exposure, and endpoints) necessary for efficient data collection and synthesis in literature reviews. When evaluated relative to manual data extraction of environmental health science articles, Dextr's semi-automated extraction performed well, resulting in time savings and comparable performance in both recall and precision.
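As an illustration of what hierarchical, connected extraction means in practice, the sketch below models a study with two experiments, each linking species, strain, sex, test article, doses, and endpoints. The class and field names, and the example values, are hypothetical and are not Dextr's internal schema or export format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experiment:
    species: str
    strain: str
    sex: str
    test_article: str
    doses: List[str]
    endpoints: List[str]

@dataclass
class StudyExtraction:
    study_id: str
    experiments: List[Experiment] = field(default_factory=list)

# One study, multiple experiments: the connections between entities are preserved
# rather than being stored as flat, unlinked lists.
study = StudyExtraction(
    study_id="example-001",
    experiments=[
        Experiment("rat", "Sprague-Dawley", "male", "ozone",
                   ["0.5 ppm", "1.0 ppm"], ["lung weight", "BALF cell counts"]),
        Experiment("mouse", "C57BL/6", "female", "ozone",
                   ["2.0 ppm"], ["airway hyperresponsiveness"]),
    ],
)
print(len(study.experiments))  # 2
```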
O'Connor et al. (2019) provides a framework to describe the degree of independence or "levels of automation" across tools and discusses potential barriers to adoption of automation for use in literature reviews. The degree of automation ranges from tools that improve file management (Level 1), to tools that leverage algorithms to assist with reference prioritization (Level 2), to tools that perform a task automatically but require human supervision to approve the tool's decisions, resulting in a semi-automated workflow (Level 3), to tools that perform a task automatically without human oversight (Level 4). In developing Dextr, we intentionally chose to develop a Level 3 tool because we wanted a workflow that would allow expert judgment in a manual verification step, giving users the flexibility to accommodate entities for which existing models may have an error rate that is too high to achieve the necessary performance. The decision to develop a semi-automated tool also addresses the limited uptake of automation tools (van Altena et al. 2019) and expected barriers to adoption (O'Connor et al. 2019) of automation within the systematic-review community (e.g., providing a user verification option to address mistrust of the automation tool by an end-user, supporting transparency to demonstrate the ability of the tool to perform the task, and providing a verification step similar to manual QA to lessen potential disruption of adding automation to current workflows). The work presented in this paper supports widespread adoption of a semi-automated data extraction approach because Dextr has been tested on complex study designs, in an existing workflow, and provides the user the ability to confirm the machine-predicted values, thereby increasing transparency and demonstrating compatibility with current practices.
While systematic reviews, scoping reviews, and systematic evidence maps have different formats and goals, all literature-based assessments are used to inform evidence-based decisions. Therefore, the testing of new procedures and automated approaches is essential to assess both the impact on workflow and the accuracy of the results.
Table 7
Precision comparison between manual and semi-automated modes for each mode and field, averaged over evaluators, based on the model unstratified by field.¹
Extraction Mode | Field | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Precision Rate
Manual | Endpoint | 1.460 (0.152) | <0.0001 | 0.812 (0.023) | 0.794
Manual | Sex | 5.073 (1.021) | <0.0001 | 0.994 (0.006) | 0.980
Manual | Species | 3.366 (0.494) | <0.0001 | 0.967 (0.016) | 0.990
Manual | Strain | 3.330 (0.490) | <0.0001 | 0.965 (0.016) | 0.978
Manual | Test article | 1.972 (0.224) | <0.0001 | 0.878 (0.024) | 0.883
Semi-automated | Endpoint | 1.594 (0.168) | <0.0001 | 0.831 (0.024) | 0.818
Semi-automated | Sex | 5.207 (1.022) | <0.0001 | 0.995 (0.006) | 1.000
Semi-automated | Species | 3.500 (0.496) | <0.0001 | 0.971 (0.014) | 0.967
Semi-automated | Strain | 3.464 (0.492) | <0.0001 | 0.970 (0.014) | 0.940
Semi-automated | Test article | 2.106 (0.236) | <0.0001 | 0.891 (0.023) | 0.934
¹ Assumes average study complexity scores (0.175).
Table 8
Comparison of predicted mean logarithm and median for total time (seconds) between manual and semi-automated modes, averaged over evaluators.³
Extraction Mode | Mean Log Time Modeled (Standard Error) | P-value of Mean Log Time (Modeled) | Median Time (Modeled) | Arithmetic Mean Time
Manual | 6.838 (0.058) | <0.0001 | 933 | 971
Semi-automated¹ | 6.079 (0.059) | <0.0001 | 436 | 517
Comparison² | 0.760 (0.071) | <0.0001 | |
¹ Dextr predictions confirmed by QA reviewer.
² Comparison between manual and semi-automated extraction modes.
³ Assumes average study complexity scores (0.175).
Given that Dextr was developed to address the time-intensive step of data extraction, its performance was evaluated in terms of recall, precision, and extraction time. Although the precision rates for the manual and semi-automated modes were similar, we found an unexpected and intriguing statistically significant reduction in the recall rate (arithmetic mean recall rate 0.918 for manual and 0.838 for semi-automated).
Recall reects the ability of the data-extraction approach to identify
all relevant instances of an entity, and although 84% recall is good, we
explored potential reasons for this decrease. While the recall for sex,
species,and strainwere comparable, the semi-automated recall rate
was lower for the endpointand test articleelds. We hypothesize
that the large number of endpoints predicted by Dextr may have been
difcult or distracting for the user to sort through compared to manual
identication. This is supported by feedback from the reviewer-usability
questions and is a target for rening the user interface in future versions
of Dextr to avoid this potential distraction by adding search function-
ality to provide a list of predicted endpoints to help extractors system-
atically sort through potential endpoints. The differences in recall by
eld (see Table 5) are also correlated with the recall rates achieved by
the model on the TAC SRIE dataset (Nowak and Kunstman 2018). The
elds were chosen purposefully to observe the impact of the model
performance on the results. While the differences reect the relative
difculty of the elds, we believe that model improvements will lead to
closing the gap between the manual and semi-automated approaches. In
terms of time, Dextr added clear efciencies to our workow, providing
an approximately 50% reduction (53% lower predicted median time and
47% lower average time) in the time required for data extraction. This
nding indicates that Dextr has the potential to provide similar recall
and precision with substantial time-savings and reduced manual work-
load for data extraction by integrating semi-automated extraction and
QA in a single step and replacing the conventional 2-step data-extraction
process (a manual extractor and a manual QC check).
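Conceptually, the recall and precision metrics reported above reduce to set overlap between gold-standard and extracted entities for each field of each study. The sketch below shows that calculation in its simplest exact-match form; the published evaluation compared extractions against a gold-standard dataset using its own grading rules, so this is only a simplified illustration with made-up values.

```python
def recall_precision(gold: set, extracted: set) -> tuple:
    """Exact-match recall and precision for one field of one study."""
    true_positives = len(gold & extracted)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(extracted) if extracted else 1.0
    return recall, precision

# Hypothetical endpoint extractions for one study.
gold = {"lung weight", "BALF cell counts", "airway hyperresponsiveness"}
extracted = {"lung weight", "BALF cell counts", "body weight"}
print(recall_precision(gold, extracted))  # (0.666..., 0.666...)
```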
Although primarily developed as a tool to improve the data-extraction workflow for literature-based reviews, Dextr can also be used to annotate published studies and produce training datasets for future model development. When used as part of a literature review, Dextr captures token-level annotations during the data-extraction workflow; these annotations are part of a machine-readable export that can potentially support model development and refinement. This feature provides an alternative to the current option of a dedicated workflow (i.e., outside of a normal literature review) required to generate training datasets and offers a reduction in the cost of developing them. However, the annotations captured on each study during a literature review may have some limitations, as the topic of the review could direct the extractors towards endpoints of interest rather than capturing all exposures or endpoints in a study. The lack of applicable datasets is a major impediment to model development for literature reviews (Jonnalagadda et al. 2015), and Dextr provides the potential for important advances to the field.
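To make the idea of a token-level, machine-readable export concrete, the sketch below writes a small JSON record with character-offset spans for each annotated entity. This structure is purely illustrative of what such an export might contain and should not be taken as Dextr's actual export format.

```python
import json

# Hypothetical token-level annotation for one sentence of a methods section.
record = {
    "study_id": "example-001",
    "text": "Male Sprague-Dawley rats were exposed to ozone.",
    "annotations": [
        {"field": "sex", "value": "Male", "start": 0, "end": 4},
        {"field": "strain", "value": "Sprague-Dawley", "start": 5, "end": 19},
        {"field": "species", "value": "rats", "start": 20, "end": 24},
        {"field": "test article", "value": "ozone", "start": 41, "end": 46},
    ],
}
print(json.dumps(record, indent=2))
```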
There are several limitations in the evaluation of Dextr that should be noted. First, we used only a single dataset to test performance. The dataset used to evaluate the tool focused on identifying and extracting respiratory health outcomes only. In contrast, the endpoint entity algorithm was not trained with this specification, and the model predicted all potential health outcomes (or endpoints) in each reference, not just the respiratory subset. As noted earlier, the extractors reported in responses to reviewer-feedback questions that non-target endpoints identified by Dextr were a distraction. This limitation could have contributed to the lower recall rate observed because all non-respiratory endpoints had to be reviewed to identify relevant respiratory endpoints. Second, there are limitations associated with the models used, even though the models were not evaluated for this paper. The models currently in Dextr were developed and trained only on the methods section of environmental health animal studies. For this reason, the tool automatically identified and used only the methods section. However, detailed data extraction requires the full text of a reference because entities are commonly identified in the abstract, methods, and results sections. Similarly, information on some endpoints may be available only in tables, which Dextr currently does not process. Third, we evaluated the key performance features of Dextr (recall, precision, and time); however, we acknowledge that other aspects of the tool were beyond the scope of this project and were not tested. For example, the ability of users to establish connections was not directly tested nor a focus for user feedback. Last, this project was intended to develop a user interface designed to incorporate NLP data-extraction models. Evaluation and potential improvement of the models used were outside the scope of the work described in this paper. Therefore, it is likely that our evaluation metrics (e.g., recall for the endpoint field) will improve in conjunction with focused efforts on model improvements.
Dextr was developed to add automation and machine-learning functions to the data-extraction step in DNTP's literature-based assessment workflow. Although developed to address a DNTP need, we believe it is important that the new tool be available to others in the research community and be stable (i.e., have technical support) over 2–5 years.
We are in the process of obtaining Federal Risk and Authorization Management Program (FedRAMP) authorization for the cloud deployment of Dextr, which will be available at https://ntp.niehs.nih.gov/go/Dextr when completed. The current version of Dextr (v1.0-beta1) provides a solid foundation for us to continue to refine and incorporate new features that improve workflow and enable faster and more effective data extraction. Although this publication is paired with the initial release of the tool, we are already working to expand functionality of Dextr, with planned improvements to the user interface, use of controlled vocabularies, and additional data-extraction entities. Testing the tool for data extraction on more diverse datasets is also underway. We are also working to identify existing models and develop new models that can be integrated into Dextr to expand the data-extraction capabilities to other evidence streams (e.g., epidemiological and in vitro studies). Other potential targets include the ability to extract more detailed entities (e.g., results, standard error, confidence interval) and information from tables, figures, and captions of scientific literature. As new features are developed, the design requirements of usability, flexibility, and interoperability will be periodically re-evaluated.
As described in the key design requirements, we considered it critical for Dextr to: 1) make data-extraction predictions automatically with user verification; 2) integrate token-level annotations in the data-extraction workflow; and 3) connect extracted entities to support hierarchical data extraction. This third feature, the connection of data entities, is helpful for efficient data collection and essential to enable effective synthesis in literature reviews. Controlled vocabularies and ontologies provide a hierarchical structure of terms to define conceptual classes and relations needed for knowledge representation for a given domain. Controlled vocabularies provide semantics and terminology to normalize author-reported information and support a conceptual framework when evaluating results (de Almeida Biolchini et al. 2007). Efforts are ongoing to develop field structures in Dextr compatible with integrating ontologies and controlled vocabularies. These efforts include the capability of selecting an ontology or vocabulary at the entity level, with the ability to select multiple vocabularies when setting up the data-extraction form in Dextr. We are also exploring the ability of an ontology to support data extraction for specific domains or questions based on the sorting, aggregating, and association context of terms in the ontology (e.g., identifying only cardiovascular endpoints from a search of environmental exposure references).
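One minimal way to picture attaching controlled vocabularies at the entity level is a configuration that maps each extraction field to one or more vocabularies or ontologies. The structure, keys, and vocabulary choices below are assumptions for illustration only and do not represent Dextr's configuration format.

```python
# Hypothetical extraction-form configuration: each entity field can reference
# one or more controlled vocabularies or ontologies used to normalize terms.
extraction_form = {
    "species": {"vocabularies": ["NCBI Taxonomy"]},
    "strain": {"vocabularies": ["Rat Strain Ontology", "Mouse Genome Informatics"]},
    "endpoint": {"vocabularies": ["Unified Medical Language System"],
                 "filter": "cardiovascular endpoints only"},  # domain-specific subsetting
    "test article": {"vocabularies": ["MeSH", "CAS Registry"]},
}

for field_name, config in extraction_form.items():
    print(field_name, "->", ", ".join(config["vocabularies"]))
```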
5. Conclusions
Dextr is a semi-automated data extraction tool that has been trans-
parently evaluated and shown to improve data extraction by substan-
tially reducing the time required to conduct this step in supporting
environmental health sciences literature-based assessments. Unlike
other data extraction tools, Dextr provides the ability to extract complex
concepts (e.g., multiple experiments with various exposures and doses
within a single study) and properly connect or group the extracted
elements within a study. Furthermore, Dextr limits the work required by researchers to generate training data by incorporating machine-readable annotation exports that are collected as part of the data-extraction workflow within the tool. Dextr was designed to address challenges associated with environmental health sciences literature; however, we are confident that the features and capabilities within the tool are applicable to other fields and would improve the data-extraction process for other domains as well.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the Intramural Research Program
(Contract GS00Q14OADU417, Task Order HHSN273201600015U) at
NIEHS, NIH. DNTP initiated and directed the project, providing guidance on tool requirements to support data extraction for literature analysis as well as the evaluation plan. Robyn Blain, Jo Rochester, and Jennifer
Seed of ICF worked under contract for DNTP and completed Dextr
testing, manual extraction, semi-automated extractions, evaluation re-
sults grading, and statistical analysis, while Pam Hartman of ICF
developed the gold-standard dataset. ICF staff also provided project
management for the software development and tool evaluation. Dextr
programming and development were conducted by Evidence Prime as
subcontractor to ICF. The underlying machine-learning model was also
developed by Evidence Prime. Kelly Shipkowski (now with DNTP)
worked for ICF at the beginning of the project. We appreciate the helpful
comments and input on the draft manuscript provided by Keith Shockley
and Nicole Kleinstreuer.
Competing Financial Interests: AJN and KK are employed by, and AJN is also a shareholder of, Evidence Prime, a software company that plans to commercialize the results of this work. To mitigate any potential conflicts of interest, these authors excluded themselves from activities that could influence the results of the evaluation study. The remaining authors declare that they have no actual or potential competing financial interests.
Appendix A. Supplementary material
Supplementary data to this article can be found online at https://doi.
org/10.1016/j.envint.2021.107025.
References
Brockmeier, A.J., Ju, M., Przybyła, P., Ananiadou, S., 2019. Improving reference
prioritisation with PICO recognition. BMC Med. Inf. Decis. Making 19 (1), 256.
https://doi.org/10.1186/s12911-019-0992-8.
Clark, J., Glasziou, P., Del Mar, C., Bannach-Brown, A., Stehlik, P., Scott, A.M., 2020.
A full systematic review was completed in 2 weeks using automation tools: a case
study. J. Clin. Epidemiol. 121, 81–90. https://doi.org/10.1016/j.
jclinepi.2020.01.008.
de Almeida Biolchini, J.C., Mian, P.G., Natali, A.C.C., Conte, T.U., Travassos, G.H., 2007.
Scientic research ontology to support systematic review in software engineering.
Adv. Eng. Inf. 21 (2), 133151. https://doi.org/10.1016/j.aei.2006.11.006.
Howard, B.E., Phillips, J., Tandon, A., Maharana, A., Elmore, R., Mav, D., Sedykh, A.,
Thayer, K., Merrick, B.A., Walker, V., Rooney, A., Shah, R.R., 2020. SWIFT-Active
Screener: Accelerated document screening through active learning and integrated
recall estimation. Environ. Int. 138, 105623. https://doi.org/10.1016/j.
envint.2020.105623.
James, K.L., Randall, N.P., Haddaway, N.R., 2016. A methodology for systematic
mapping in environmental sciences. Environ. Evid. 5 (1), 7. https://doi.org/
10.1186/s13750-016-0059-6.
Jonnalagadda, S.R., Goyal, P., Huffman, M.D., 2015. Automating data extraction in
systematic reviews: a systematic review. Syst. Rev. 4 (1), 78. https://doi.org/
10.1186/s13643-015-0066-7.
Marshall, C., Brereton, P., 2015. Systematic review toolbox: a catalogue of tools to
support systematic reviews. In: Paper presented at: Proceedings of the 19th
International Conference on Evaluation and Assessment in Software Engineering.
Association for Computing Machinery; Nanjing, China. https://doi.org/10.1145/2745802.2745824.
Marshall, I.J., Kuiper, J., Banner, E., Wallace, B.C., 2017. Automating biomedical
evidence synthesis: RobotReviewer. In: Paper presented at: Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics-System
Demonstrations. Vancouver, Canada. https://dx.doi.org/10.18653/v1/P17-4002.
Millard, L.A.C., Flach, P.A., Higgins, J.P.T., 2016. Machine learning to assist risk-of-bias
assessments in systematic reviews. Int. J. Epidemiol. 45 (1), 266–277. https://doi.
org/10.1093/ije/dyv306.
Nowak, A., Kunstman, P., 2018. Team EP at TAC 2018: Automating data extraction in
systematic reviews of environmental agents. In: Paper presented at: National
Institute of Standards and Technology Text Analysis Conference. Gaithersburg, MD.
O'Connor, A.M., Tsafnat, G., Thomas, J., Glasziou, P., Gilbert, S.B., Hutton, B., 2019.
A question of trust: Can we build an evidence base to gain trust in systematic review
automation technologies? Syst. Rev. 8 (1), 143. https://doi.org/10.1186/s13643-
019-1062-0.
Perera, N., Dehmer, M., Emmert-Streib, F., 2020. Named entity recognition and relation
detection for biomedical information extraction. Front. Cell Dev. Biol. 8, 673.
https://doi.org/10.3389/fcell.2020.00673.
Rathbone, J., Albarqouni, L., Bakhit, M., Beller, E., Byambasuren, O., Hoffmann, T.,
Scott, A.M., Glasziou, P., 2017. Expediting citation screening using PICO-based title-
only screening for identifying studies in scoping searches and rapid reviews. Syst.
Rev. 6 (1), 233. https://doi.org/10.1186/s13643-017-0629-x.
Saldanha, I.J., Schmid, C.H., Lau, J., Dickersin, K., Berlin, J.A., Jap, J., Smith, B.T.,
Carini, S., Chan, W., De Bruijn, B., Wallace, B.C., Hutfless, S.M., Sim, I., Murad, M.H.,
Walsh, S.A., Whamond, E.J., Li, T., 2016. Evaluating Data Abstraction Assistant, a
novel software application for data abstraction during systematic reviews: protocol
for a randomized controlled trial. Syst. Rev. 5 (1) https://doi.org/10.1186/s13643-
016-0373-7.
Schmitt, C., Walker, V., Williams, A., Varghese, A., Ahmad, Y., Rooney, A., Wolfe, M.,
2018. Overview of the TAC 2018 systematic review information extraction track. In:
Paper presented at: National Institute of Standards and Technology Text Analysis
Conference. Gaithersburg, MD.
van Altena, A.J., Spijker, R., Olabarriaga, S.D., 2019. Usage of automation tools in systematic
reviews. Res. Synth. Methods 10 (1), 72–82. https://doi.org/10.1002/jrsm.1335.
Wallace, B.C., Small, K., Brodley, C.E., Lau, J., Trikalinos, T.A., 2012. Deploying an
interactive machine learning system in an Evidence-based Practice Center:
Abstrackr. In: Paper presented at: IHI '12: Proceedings of the 2nd ACM SIGHIT
International Health Informatics Symposium. ACM Press, New York, NY. https://doi.
org/10.1145/2110363.2110464.
Wolffe, T.A.M., Vidler, J., Halsall, C., Hunt, N., Whaley, P., 2020. A survey of systematic
evidence mapping practice and the case for knowledge graphs in environmental
health and toxicology. Toxicol. Sci. 175 (1), 35–49. https://doi.org/10.1093/toxsci/
kfaa025.
Yadav, V., Bethard, S., 2018. A survey on recent advances in named entity recognition
from deep learning models. In: Paper presented at: Proceedings of the 27th
International Conference on Computational Linguistics. Santa Fe, NM.