Environment International 159 (2022) 107025
0160-4120/© 2021 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND IGO license
(http://creativecommons.org/licenses/by-nc-nd/3.0/igo/).
Evaluation of a semi-automated data extraction tool for public health
literature-based reviews: Dextr
Vickie R. Walker a,*, Charles P. Schmitt a, Mary S. Wolfe a, Artur J. Nowak b, Kuba Kulesza b, Ashley R. Williams c, Rob Shin c, Jonathan Cohen c, Dave Burch c, Matthew D. Stout a, Kelly A. Shipkowski a, Andrew A. Rooney a

a Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
b Evidence Prime Inc, Krakow, Poland
c ICF, Research Triangle Park, NC, USA
ARTICLE INFO
Handling Editor: Paul Whaley
Keywords:
Automation
Text mining
Machine learning
Natural language processing
Literature review
Systematic review
Scoping review
Systematic evidence map
ABSTRACT
Introduction: There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields.
Objectives: To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction.
Methods: Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr’s performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies.
Results: The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow.
Discussion: Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports. The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields.
1. Introduction
Systematic review methodology is a rigorous approach to literature-based assessments that maximizes transparency and minimizes bias (O’Connor et al. 2019). Three main assessment formats (systematic reviews, scoping reviews, and systematic evidence maps) use these methods in a fit-for-purpose approach depending on the research question and project goals. Systematic reviews follow a pre-defined protocol to identify, select, critically assess, synthesize, and integrate evidence to answer a specific question and reach conclusions.
Abbreviations: SR, Systematic review; SCR, Scoping review; SEM, Systematic evidence map.
* Corresponding author at: NIEHS, P.O. Box 12233, Mail Drop K2–04, Research Triangle Park, NC 27709, USA. Express mail: 530 Davis Drive, Morrisville, NC
27560, USA.
E-mail address: Walker.Vickie@nih.gov (V.R. Walker).
https://doi.org/10.1016/j.envint.2021.107025
Received 1 July 2021; Received in revised form 7 October 2021; Accepted 3 December 2021
Box 1
A comparison of the literature-assessment steps between scoping reviews/systematic evidence maps and systematic reviews.¹ Each step lists the task description in the scoping review (SCR) or systematic evidence mapping (SEM) workflow, the task description in the systematic review (SR) workflow, and the estimated percent of work time.²

Problem Formulation and Protocol Development (8% of time²)
SCR/SEM workflow: Define research question and objectives, typically with a broad PECO³, and all methods before conducting review.
• Best practice to publish protocol before conducting review for transparency; however, only minor impact on bias
• Objectives are often open questions to survey broad topics and identify extent of evidence (i.e., areas that are data rich or data poor / data gaps)
• Protocol should describe key concepts that will be mapped (e.g., exposures) to support objectives
SR workflow: Define research question, PECO, and all methods before conducting review to reduce bias.
• Best practice to publish protocol for transparency
• Critical to publish protocol before starting evidence evaluation to reduce bias
• Objectives are focused, closed questions (e.g., specific exposure-outcome pairs/hazards)

Identify the Evidence: Identify Literature (7%)
SCR/SEM workflow: Develop search strategy to identify evidence relevant to address the question.
• Search is biased to address the degree of precision and certainty of the objectives where a comprehensive search may not be necessary
• Searches conducted in one or more major literature database, or in a stepwise manner, to address objectives
• Search terms are generally broad / topic based with lower specificity
• Searches retrieve evidence supportive of multiple decisions and scenarios
SR workflow: Develop comprehensive search strategy to identify all relevant evidence to address the question.
• Search is biased toward maximum number of sources to ensure identification of all evidence relevant to synthesis
• Search includes literature databases, sources of grey literature, and published data
• Search terms are highly resolved and specified for key elements of the objectives

Identify the Evidence: Screen Studies (17%)
SCR/SEM workflow: Screen studies against eligibility criteria from objectives and PECO.
• Inclusion and exclusion criteria are topic-based and may only address PECO at a high level
• Included studies likely to address diverse scenarios
SR workflow: Screen studies against eligibility criteria from objectives and PECO.
• Inclusion and exclusion criteria specified in detail for all PECO elements
• Assure specific research question is efficiently addressed

Identify the Evidence: Extract Data (15%)
SCR/SEM workflow: Extract study meta-data and characteristics to address objectives.
• Flexible approach supports fit-for-purpose maps of varying degrees of comprehensiveness
• Optional extraction of study findings and other characteristics depending on objectives
SR workflow: Complete extraction of meta-data and results to address question.
• Entities determined by project objectives

Evaluate the Evidence (9%)
SCR/SEM workflow: Appraisal of studies is optional depending on objectives.
• Study characteristics relating to quality of study design and conduct, or internal validity, may be extracted
• May include stepwise approach (e.g., methods mapped relative to objectives), or quality only assessed for studies addressing key outcomes
SR workflow: Critical appraisal of included studies is essential to characterizing certainty in bodies of evidence.
• Performed as assessment of internal validity (risk of bias)
• May include external validity, sensitivity, other factors

Summarize and Synthesize Data (5%)
SCR/SEM workflow: SCRs and SEMs have limited or no synthesis and may only include summaries.
• Primary output shows extent of evidence and key characteristics relative to question and objectives
• SCRs provide narrative summaries (limited or no synthesis) of evidence relative to objectives
• SEMs output includes evidence map, database, or tables to support and inform decision making on question
• Although data may inform multiple decisions, summary may be specific for decision-making context in objectives
SR workflow: Quantitative synthesis addresses question and objectives where appropriate; qualitative synthesis used if pooling not appropriate.
• Synthesis supports a specific decision context
• May include meta-analysis
• Synthesis should address key features separately (e.g., evidence streams, exposures, health effects)
• Example for environmental health questions would synthesize hazard characterization data

Integrate Evidence and Report Findings: Integrate Evidence (8%)
SCR/SEM workflow: SCRs and SEMs do not typically include integration or synthesis.
• SEMs may identify regions of evidence with study characteristics associated with confidence or certainty
SR workflow: Assessment of confidence or certainty in the results of the synthesis described according to the objectives.
• Should address certainty of each body of evidence relative to questions or objectives
• Includes integration of the evidence base as a whole
• Example for environmental health questions would provide detailed certainty of evidence for hazard or risk conclusions from exposure

Integrate Evidence and Report Findings: Develop Report (12%)
SCR/SEM workflow: All review outputs provided in accessible format.
• SCRs and SEMs do not typically provide conclusions
• Good SEMs are interactive, sortable, and searchable
• Outputs support and inform decision making on question
• Outputs should inform research and analysis decisions, where data rich areas may support conclusions or data poor areas may serve as areas of uncertainty that could be addressed by research or evidence surveillance
SR workflow: Report all conclusions in clear language and accessible format with answer to review question.
• Includes description of certainty of conclusions
• Describes limitations in the review and limitations in the evidence base for assessing the question

Project Management⁴ (19%)
SCR/SEM workflow: Oversight of team interactions and workflow performed to complete the review.
• Develop materials and guidance for steps in the review (screening, data extraction, etc.) and provide training
• Manage communication and meetings for workflow, track progress, and address problems
• Arrange and conduct pilot testing of review steps and revise approach based on lessons learned
• Arrange for workflow integration of new tools, machine learning and AI features
• Recruit technical experts and new team members and address conflict of interest
• Plan for protocol, data, and document review
SR workflow: Oversight is the same as SCR and SEM with additional steps for critical appraisal and integration of evidence.

¹ Note: For the purpose of the table, scoping reviews and systematic evidence maps are considered to have the same workflow; adapted from James et al. (2016) and Wolffe et al. (2020).
² Estimated percent of work time to complete a systematic review adapted from Clark et al. (2020).
³ Research questions for scoping and systematic reviews should be stated in terms of the Population, Exposure, Comparator, and Outcome (PECO) of interest. Scoping reviews and evidence maps sometimes do not include a specific comparator.
⁴ Although project management is not typically considered a step in the literature review process, it took nearly 20% of time when considered as a separate function by Clark et al. (2020).
SR = Systematic Review, SCR = Scoping Review, SEM = Systematic Evidence Map.
The best systematic reviews use a comprehensive literature search based on a
narrowly focused question to facilitate conclusions. Scoping reviews
utilize systematic-review methods to summarize available data on broad
topics to identify data-rich and data-poor areas of research and inform
evidence-based decisions on further research or analysis. Systematic
evidence maps use systematic-review methods to characterize the evi-
dence base for a broad research area to illustrate the extent and types of
evidence available via an interactive visual format that may be a stand-
alone product or part of a scoping review (Wolffe et al. 2020). All three
of these assessment formats are generally time-consuming and resource-
intensive to conduct, primarily due to the need to accomplish most steps
manually (Marshall et al. 2017), but also driven by the complexity of the
data under consideration and amount of relevant literature to evaluate.
The specific steps in a literature-based assessment depend on the
goals and approach used, with five basic steps in most assessments: 1)
define the question and methods for the review (i.e., problem formula-
tion and protocol development), 2) identify the evidence, 3) evaluate the
evidence, 4) summarize the evidence, and 5) integrate the evidence and
report the findings. Box 1 compares these steps across scoping-review/
systematic-evidence-map and systematic-review products. Many steps in
the literature-based assessment process have repetitive and rule-based
decisions, which lend the steps to automated or semi-automated
approaches.
The development and use of automation are steadily advancing in
literature analysis, with much of its uptake focused on clinical and
medical research. This progress may be related to funding advantages
and the relative consistency of medical data and publications, or perhaps
may be because clinical research has used systematic-review method-
ology longer. Although data sources are similar for many literature-
based assessments, there are differences and unique aspects of the
data relevant for addressing toxicology or environmental health ques-
tions versus clinical questions. One important difference is that envi-
ronmental health assessments require the identication of research from
multiple evidence streams (i.e., human, animal, and in vitro exposure
studies), which necessitates training tools on publications addressing
each evidence stream. In contrast, data that are relevant for addressing
clinical and medical review questions come primarily from randomized
controlled trials in human subjects. Even within the human data there is
greater complexity in toxicology or environmental research, where a
range of epidemiological study designs are used for investigating the
health effects of environmental chemicals. Moreover, experimental an-
imal studies measure more diverse endpoints and may report more data
than clinical studies, resulting in longer and more complicated data
extraction. Finally, cell-based assays and in vitro exposure studies pro-
vide valuable mechanistic insights for the question at hand, but these
assays cover an even more diverse range of endpoints, platforms, tech-
nologies, and associated data. While clinical and environmental as-
sessments both focus on health-related outcomes, the requirements for
environmentally focused reviews expand beyond those considered in the
medical field. Therefore, a tool that meets the needs of environmental
health assessments could likely be applied successfully to clinical
questions, while the opposite may not be true.
The systematic-review toolbox (http://systematicreviewtools.com)
provides a catalogue of over 200 tools that address parts of all five
assessment steps as well as associated tasks such as meta-analysis or
collaboration (Marshall and Brereton 2015). Within the toolbox, there
are multiple resources that support developing and implementing
literature assessments using manual processes. For instance, several
tools provide web-based forms for review teams to capture objectives,
record search strategies, detail quality-assessment checklists, and record
manual steps such as extracting evidence and making risk-of-bias or
quality-assessment judgements. The availability of tools that support
full- or partial-automation of literature-assessment processes is much
more limited, despite recent advances in natural language processing
(NLP), machine learning, and artificial intelligence (AI). This is espe-
cially true for the process of identifying evidence, where a combination
of active learning and linguistic models can successfully predict the
relevance of literature based on small samples of manually selected
studies (e.g., Brockmeier et al. 2019; Howard et al. 2020; Rathbone et al.
2017; Wallace et al. 2012). These approaches have now been incorpo-
rated into several systematic-review tools. In contrast, the development
and the adoption of automation methods for steps three through five of a
systematic review have been limited (Box 1). The risk-of-bias assessment
of individual studies is a critical and time-consuming process in assess-
ments that is generally considered to require subject-matter experts to
evaluate complex factors in study design and reporting. While a few
models exist to predict risk-of-bias ratings for clinical research studies
(Marshall et al. 2017; Millard et al. 2016), such methods have not
translated into adoption within mainstream systematic-review tools.
Several assessment steps rely on the extraction of identied data
from text, another widely recognized time-consuming process. Recent
developments in NLP, including both general extraction of named en-
tities and relationships (surveyed in Yadav and Bethard 2018) as well as
specific extraction of biomedical terms (e.g., chemicals, genes, and
adverse outcomes (reviewed in Perera et al. 2020)), suggest that
machine-based approaches are sufficiently mature for semi-automated
data-extraction approaches. Some elements of human- and machine-
based data extraction are straightforward, including identifying the
species and sex of the experimental animal models. Other elements, such
as identifying the results of experimental assays, questionnaires, or
statistical analyses, are more complex because publications may report
the results from numerous assays and endpoints after multiple expo-
sures, doses, and time periods. Standardization of reporting is also
lacking, such that authors may report the experimental details using
different measurement units, different names for the same chemical, and
other variations in terminology (Wolffe et al. 2020). In addition, this
information may be located within the text of the publication or in a
table, figure, or figure caption.
In 2018, the Division of the National Toxicology Program (DNTP)
participated in the National Institute of Standards and Technology Text
Analysis Conference (NIST TAC) challenge by hosting the Systematic
Review Information Extraction (SRIE) track to investigate the feasibility
of developing machine-learning models to identify, extract, and connect
data entities routinely extracted from environmental health experi-
mental animal studies. The data entities included 24 fields such as
species, exposure, dose level, time of dose, and endpoint. Creating the
training and test sets required structured annotation of the various data-
extraction entities and labeling them in the text of each article. Devel-
oping these datasets required more comprehensive study annotation
than a typical data-extraction workflow because the training dataset
needed to capture all entities and endpoints in a research publication
rather than the subset that might be relevant for a given systematic re-
view question. These training datasets are critical to providing a fixed
format that can be automatically processed and interpreted by a com-
puter for training and model development. Overall, the results of the
challenge were promising in that model-derived annotation of design
features from the methods section of experimental animal studies ach-
ieved results in some extraction fields that neared human-level perfor-
mance, suggesting that computer-assisted data extraction is a viable
option for assisting researchers in the labor- and resource-intensive steps
of data extraction in the literature-assessment process (Schmitt et al.
2018).
Given the positive outcome of the NIST TAC challenge, we developed
Dextr, a web-based tool designed to incorporate NLP data-extraction
models (including but not limited to models developed for the NIST
TAC Challenge) into annotation and data-extraction workflows to sup-
port literature-based assessments. Many potential features were
considered as we established the design requirements for Dextr
(Table 1), with three design features considered key for our needs. First,
and most importantly, was the ability of the tool to make data-extraction
predictions automatically, with the user’s ability to manually verify the
predicted entity or override and modify the extracted information (i.e., a
“semi-automated” data-extraction approach where automated pre-
dictions are verified by the user). Second, we considered the capability
to group the extracted entities (e.g., connect the species, strain, and sex
of the animal model or the dose, exposure, and outcomes), critical to
supporting hierarchical data extraction and greater utility of the
extracted data. The third key feature was the ability of the tool to make
token-level annotations (i.e., identifying a word, phrase, or specific
sequence of characters) that can be used in either a typical data-
extraction workflow as part of a literature review or to annotate
studies for developing training datasets. The annotation of studies dur-
ing data extraction has the potential to create training datasets without a
separate, directed effort if the tool includes appropriate machine-
readable export options. In addition to the three key features identi-
fied above, Table 1 describes unique characteristics of environmental
health data and key challenges for data extraction considered in devel-
oping Dextr. Given the increasing volume of published studies, we
believe that semi-automation of the labor-intensive step of data
extraction by Dextr has great potential to improve the speed and accu-
racy of conducting literature-based assessments and reduce the work-
load and resources required without compromising the rigor and
transparency that are critical to systematic-review methodology.
In this paper, we briefly describe the development of Dextr,
investigate the tool’s usability, and evaluate the tool’s impact on per-
formance in terms of recall, precision, and time using this semi-
automated approach in DNTP’s data-extraction workflow.
2. Methods
2.1. Tool development
The underlying machine-learning model (Nowak and Kunstman
2018) was developed as part of the NIST TAC 2018 SRIE workshop
(Schmitt et al. 2018). Briefly, the model is a deep neural network con-
taining more than 31 million trainable parameters. It consists of pre-
trained embeddings (Global Vectors for Word Representation: GloVe
and Embeddings from Language Models: ELMo), a bidirectional long
short-term memory (LSTM) encoder, and a conditional random field
(Nowak and Kunstman 2018). This model was developed and trained
only on the methods sections of environmental health studies. There-
fore, for this project, the model similarly was restricted to the methods
section and performed the sequence-tagging task by producing a tag
(denoting a single data-extraction field) for every token from the input
in the methods section.
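For orientation, the sketch below shows, in simplified form, the kind of token-tagging architecture described above (embeddings feeding a bidirectional LSTM that scores a tag for every token). It is not the authors’ implementation: the pretrained GloVe/ELMo embeddings and the conditional random field output layer are replaced by a plain embedding table and a per-token argmax for brevity, and all dimensions and vocabulary sizes are illustrative.

```python
# Minimal sketch (not the published model): a BiLSTM sequence tagger that
# assigns one data-extraction field tag to every input token.
import torch
import torch.nn as nn

class TokenTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # stand-in for GloVe/ELMo
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               batch_first=True, bidirectional=True)  # BiLSTM encoder
        self.scorer = nn.Linear(2 * hidden_dim, num_tags)   # per-token tag scores

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        h, _ = self.encoder(x)           # (batch, seq_len, 2 * hidden_dim)
        return self.scorer(h)            # unnormalized scores per token and tag

# Toy example: tag a 6-token "methods" snippet with 5 candidate fields + "other".
model = TokenTagger(vocab_size=5000, num_tags=6)
tokens = torch.randint(0, 5000, (1, 6))
predicted_tags = model(tokens).argmax(dim=-1)   # one field tag per token
print(predicted_tags.shape)                     # torch.Size([1, 6])
```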
The goal for Dextr was to develop a flexible user interface that met the design requirements in Table 1 and leverage this
model within a literature-review workflow. The project was conducted
according to Agile software development methodology around two
principal teams. A development team (AJN and KK) was formed to
outline potential features and functions for Dextr and code the tool. An
advisory team (AAR, CPS, RS, ARW, MSW, VRW) of experts with
backgrounds in public health, literature analysis, and computational
methods was then formed to guide the developers. The development
team worked sequentially in “sprints” on clearly defined, testable pieces
of functionality that could inform further planning and design.
Throughout the process, the development team consulted with the
advisory team, demonstrated newly added functionality, and presented
mock-ups of the user interface illustrating key functions within the tool
(e.g., project management screens, data import, extraction interfaces).
As part of the Agile development sprints, each task had a test plan to
verify its correctness. These test plans were then performed, first by the
testers that were part of the development team, and then (if successful)
by the advisory team. Members of the advisory team (ARW and RS)
oversaw the project schedule and timeline and managed the develop-
ment team and evaluation study. The development and advisory teams
discussed potential refinements, suggested improvements, and agreed
upon the approach to be implemented. When all features had been
developed, a test version of the tool was produced and tested by the
advisory team (ARW and RS). All issues or bugs identified during testing
were addressed by the development team. When both teams agreed that
the tool met the design requirements, the development team deployed
the Minimum Viable Product (MVP) version of the tool to the Quality
Assurance (QA) environment in April 2020. Before applying this initial
version of the tool (Dextr v1.0-beta1) in daily work, QA testing and basic
performance evaluation were conducted on the MVP version to quantify
the potential gains in using a semi-automated workflow with Dextr
compared to a manual workflow as described in the following section.
2.2. Evaluation
The aim of the evaluation was to understand how the integration of
Dextr, a semi-automated extraction tool employing a machine-learning
model, would perform in the DNTP literature-review workflow. Specif-
ically, we sought to understand how the tool would impact data
extraction recall, precision, and extraction time compared to a manual
workflow. The performance of the underlying machine-learning model
was evaluated previously and was not within the scope of this evaluation
(Schmitt et al. 2018). Although Dextr enables users to connect extracted
entities, there is no difference in this aspect of the workflow between
manual and semi-automated approaches. Therefore, the connection
feature was rigorously tested and subject to QA procedures, but not part
of the evaluation.
Table 1
Dextr design requirements. Each challenge is described, followed by the Dextr features addressing it.

Interoperability: Efficiently import and export necessary file types.
• Ability to import various file types (i.e., CSV and RIS)
• Allows bulk upload of PDFs (click and drag)
• Exports as CSV or modified brat file types

Usability features: User interface that operates with efficient mouse and key stroke options (flexibility for user preferences).
• Selection with mouse click options
• Project management features
• Easy to follow user interface
• Ability to modify data extraction form

Complex data: Environmental health sciences publications often report multiple experiments with various chemical exposures and doses and evaluate several endpoints (hierarchical data structure and groupings / connections).
• Ability to extract multiple entities including multiple animal models, exposures, and outcomes
• Ability to connect the metadata at various levels (i.e., dose-exposure-outcome pairings)

Annotations: Capability to annotate studies within a typical data extraction workflow that can be used to develop annotated datasets needed for training or developing new models.
• Token-level (i.e., word, phrase, or specific sequence of characters) annotations recorded for each extraction entity
• Ability to export annotations in a machine-readable format for model refinement and new model development

Flexibility: Functionality that emphasizes flexibility for taking advantage of advancements in natural language processing.
• Ability to utilize regular expressions to identify a string of text (i.e., #### mg = dose) or keyword searches without models
• Ability to add validated models (3rd party models) to the suite of available models
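As a hypothetical illustration of the regular-expression option listed under “Flexibility” in Table 1, a simple pattern can pull candidate dose strings from text without any trained model; the pattern and example sentence below are invented for illustration.

```python
# Hypothetical regex-based extraction of dose strings such as "50 mg/kg".
import re

dose_pattern = re.compile(r"\b(\d+(?:\.\d+)?)\s*mg(?:/kg)?\b")
text = "Mice were exposed to 10 mg/kg or 50 mg/kg of the biocide."
print(dose_pattern.findall(text))  # ['10', '50']
```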
Similarly, as we have continued to refine the user interface and develop new capabilities for Dextr, we did not evaluate
new features (such as the ability to use controlled vocabularies) that
were not expected to negatively impact recall, precision, or extraction
time. The evaluation study was designed and conducted independently
of the development team and consisted of two teams, a manual extrac-
tion team (RS, ARW, RB, KS) and a semi-automated extraction team (JR
and JS) (Fig. 1). The manual extraction team included two manual ex-
tractors (RS and ARW), who read each study and manually extracted
each data element, and two QA reviewers (RB and KS), who reviewed
the manual data extractions and made any corrections as needed
(Manual + QA). The semi-automated team included two QA reviewers
(JR and JS) who reviewed the machine-generated data extractions and
made any corrections as needed (Model + QA). The manual extractors
had prior experience with Dextr related to development discussions and
testing tasks. The QA reviewers on both teams had previous and similar
levels of experience with conducting data extraction for literature re-
views, received a user guide for Dextr, and completed a pilot test in
Dextr prior to the evaluation. Reviewers were told to accept correct data,
add missing data, and ignore data incorrectly identified by either the
extractor or the model. Incorrect data were ignored to minimize any
additional time reviewers would spend clicking to reject incorrect sug-
gestions. This approach reflects use of Dextr for literature-based reviews;
however, if Dextr is used to construct new training data for model
development, then incorrect data would need to be labeled as such.
Extractor time and reviewer time were recorded within Dextr. Prior
to either beginning extraction or review, users started a timer on the
tool’s user interface and paused or stopped the timer (as needed) until
they completed their task. The times between start and stop actions were
manually checked against a complete event log to identify potential
cases of the timer not being started or stopped and summed for each
study to provide a total extraction time.
All the statistical analyses (JC) used in the evaluation were con-
ducted using SAS Version 9.4 (SAS Institute Inc., Cary, NC). The
generalized linear mixed model regressions were fitted using the
GLIMMIX procedure with maximum likelihood estimation based on the
Laplace approximation. Two-sided t tests were used to test the null hy-
pothesis that a given fixed effect coefficient or a linear combination of
the coefficients (such as the estimated difference between the log odds of
recall for the manual and semi-automated modes averaged across fields
and extractors) was zero. Similarly, one-sided F tests were used to test
null hypotheses for interactions and contrasts, i.e., that all the corre-
sponding linear combinations of the coefficients were zero. All statistical
tests (two-sided t tests for estimated fixed effects and one-sided F tests
for interactions and contrasts) were carried out at the 5% significance
level. P-values are shown in the tables. There was no missing data and no
removal of potential outliers.
2.3. Pilot evaluation
An initial pilot evaluation was conducted on 10 studies and the re-
sults were used to calculate the sample size for the number of studies to
be included in the evaluation. All software requires a general under-
standing of the functionality and features to navigate the user interface
and perform the tasks it is designed for – in this case to conduct data
extractions. The goals of the pilot were to prepare for evaluating Dextr’s
performance, not to assess the learning process of new users. Therefore,
an extraction guidance document was written so participants would
better understand how Dextr worked and minimize the impact of the
learning curve. The guidance was developed and reviewed by the
advisory team and the development team prior to sharing it with the
extraction team. Extraction team members provided usability feedback
after the pilot; however, no changes were implemented to Dextr before
the evaluation study. Since no changes were made to the tool based on
the pilot, the results from the 10 pilot studies were included in the main
evaluation study.
The evaluation sample size was selected using a statistical power
analysis based on statistical models fitted to data from the pilot study.
These statistical models used similar but less complicated formulations
than the final models fitted to the final data.
Using the same notation as in the Evaluation Metrics section, the
pilot study statistical model for recall was of the form:
Logit(recall) = intercept + α_i + βc + θ_s,
where α_i is a fixed factor for the mode, θ_s is a random factor for the study, drawn from a normal distribution with mean zero, and c is the quantitative complexity score (i.e., the calculated score divided by 100). The logit is the log odds. The same model formulation was used for precision. The pilot study statistical model used for time was of the form:
Log(time) = intercept + α_i + βc + θ_s + error,
where error is normally distributed with mean zero and is independent
of the random factor θ_s. For several candidate values of K, data for K
studies were simulated from each fitted model 100 times each under the
alternative hypotheses, and the same statistical model was refitted to the
simulated data. 100 simulations were used since the iterative method for
fitting the models is computer intensive. For recall, each field was
assumed to have 4 gold standard tags (the average number in the pilot
data, rounded to the nearest integer). For precision, each field was
assumed to have 3 tags (the average number in the pilot data, rounded to
the nearest integer). The simulated complexity scores were equally
likely to be any of the 10 pilot study complexity scores. For the manual
mode, the simulated data used the fitted statistical models for the
manual mode. For the semi-automated mode, the simulated data for
recall and precision used the same model but increased the log odds by a
fixed amount, delta. For the semi-automated mode, the simulated data
for time used the same model but decreased the geometric mean time by
a fixed percentage, perc. The estimated statistical power was the pro-
portion of the simulated models where the difference between the two
modes was statistically significant at the 5% significance level.
Based on 100 simulations from the fitted models and using a 5%
significance level, we found that a sample of 50 studies would be suffi-
cient to have an estimated statistical power of 100% (95% confidence
interval (96.4, 100)%) to detect an increase or decrease of 1 in the log
odds of recall, or a decrease of 1 in the log odds of precision; 98% (95%
confidence interval (93.0, 99.8)%) to detect an increase of 1 in the log
odds of precision; and 97% (95% confidence interval (91.4, 99.4)%) to
detect a 20% decrease in median time. The confidence intervals account
for the uncertainty due to the fact that only 100 simulations were used.
Therefore, a final sample size of at least 50 studies was selected and 51
studies were chosen for the gold-standard dataset.
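The snippet below sketches the flavor of this simulation-based power calculation. It is a simplified stand-in, not the authors’ SAS analysis: it keeps a logit model with a normal study random effect but tests each simulated dataset with a two-proportion z-test rather than refitting the mixed logistic regression, and all parameter values are illustrative.

```python
# Simplified power simulation: simulate recall under a logit model with a study
# random effect, then estimate the proportion of simulations detecting the
# difference between modes at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_power(n_studies=50, n_fields=5, tags_per_field=4,
                   base_logit=2.5, delta=1.0, study_sd=0.5,
                   n_sims=100, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        study_effects = rng.normal(0.0, study_sd, n_studies)
        counts = {}
        for mode, shift in (("manual", 0.0), ("semi", delta)):
            p = 1 / (1 + np.exp(-(base_logit + shift + study_effects)))
            n_tags = n_studies * n_fields * tags_per_field
            recalled = rng.binomial(tags_per_field * n_fields, p).sum()
            counts[mode] = (recalled, n_tags)
        # two-proportion z-test on the overall recall rates of the two modes
        (x1, n1), (x2, n2) = counts["manual"], counts["semi"]
        p_pool = (x1 + x2) / (n1 + n2)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z = (x1 / n1 - x2 / n2) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        hits += p_value < alpha
    return hits / n_sims

print(f"Estimated power: {simulate_power():.2f}")
```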
Fig. 1. Evaluation study design with two teams: manual extraction and semi-
automated extraction. The manual-extraction team included two extractors
who completed the primary extraction, followed by two QA reviewers. The
semi-automated extraction team included the algorithm that completed the
primary extraction, followed by two different QA reviewers.
2.3.1. Data-extraction fields
We selected five data-extraction fields (test article, species, strain,
sex, and endpoint) for the evaluation study from the full list of 24
extraction fields included in the NIST TAC SRIE challenge dataset
(Schmitt et al. 2018). The challenge evaluated models using the F1
metric, which is a harmonic mean of precision (i.e., positive predictive
value) and recall (i.e., sensitivity). The five data fields represented a mix
of fields with high (species, sex), medium (strain), and low (test article,
endpoint) F1 scores across all models previously evaluated in the
challenge.
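For reference, these metrics can be computed from true positive (TP), false positive (FP), and false negative (FN) counts as shown below; the counts in the example are made up.

```python
# Precision (positive predictive value), recall (sensitivity), and F1 from
# TP/FP/FN counts; example counts are illustrative only.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)
```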
2.3.2. Gold-standard dataset
The gold-standard dataset comprised respiratory endpoints associ-
ated with exposure to biocides manually extracted from 51 experimental
animal studies (Supplemental File S1). Although in vitro, experimental
animal, and epidemiological study designs are of interest, we focused on
experimental animal studies for the evaluation of Dextr because the
model used had been developed and trained on experimental animal
studies. Guidance on respiratory outcomes or endpoints provided to the
extractors and QA reviewers is shown in Table 2. The teams were told
that the table was not an exhaustive list and were instructed to identify
any respiratory effect evaluated. The gold-standard dataset was devel-
oped by a separate extractor (PH), not included in either extraction
team, who read the papers and extracted information on test article,
species, strain, sex, and endpoints. A QA review was performed on the
resulting dataset, or gold-standard dataset, by an independent QA
reviewer (VW), who was also not on either extraction team.
2.3.3. Evaluation criteria
The results from Dextr were manually assessed by a single grader (RS
or ARW) and compared to the gold-standard dataset. In brief, final data
from the manual mode (Manual + QA) and from the semi-automated
mode (Model + QA) were exported from Dextr into a CSV file. The re-
sults of each study by mode (manual or semi-automated) were graded
separately. Each extracted data element was compared to the gold
standard and marked as either a “true positive” (TP), if a match with an
element in the gold standard, or a “false positive” (FP), if an additional
element was not included in the gold standard. Out of 3334 results, 985
were identified as FPs. Fifty-five of these were manually flagged for
further investigation for reasons such as a possible gold standard match,
duplicate finding, or human error. Gold-standard data elements that
were not included in the Dextr results were marked as a “false negative”
(FN). Endpoints in the Dextr exports that were more specic than those
in the gold standard were considered a TP. For example, if a Dextr result
had an endpoint of “lung myeloid cell distribution” and “lung CD4+ T
cell numbers,” but the gold-standard endpoint was “lung myeloid cell
distribution (B and T cells),” then both Dextr-identied endpoints were
considered TPs. Out of 1561 TP results, 540 were exact matches while
1021 were not exact matches for reasons such as plural versus singular,
abbreviated or not, order of terms differed, or more detail provided in
one source or the other. Out of these non-exact matches, 282 did not
exactly match due to plural/singular/abbreviation discrepancies while
739 had a slight difference in wording, but were still considered a match.
The graders consulted a tertiary grader (VW) to make a final decision in
cases where a data element was identified in the Dextr results and
missed in the gold standard. For QA, four studies were independently
graded by a separate grader (either ARW or RS) and compared to the
initial grading. Changes to the grading were made based on discussions
between the graders. If questions between graders remained, then a
tertiary grader (VW) was consulted to provide clarity and a final deci-
sion. Duplicate data elements in the Dextr results were graded only once.
We calculated a complexity score for each study to account for the
additional effort an extraction would take based on the number of var-
iations of an experiment. For example, we anticipated a study with
multiple test articles would be more difficult to extract than a study with
only one test article. We were unable to find an established method to
address complexity, and therefore complexity scores were developed
using expert judgment from experienced extractors. Although the
complexity scores were designed to address study characteristics over-
all, the score is based on study characteristics that relate to the specific
data extraction elements for this paper. The number of data-extraction
elements by field in the gold-standard dataset was multiplied by the
weights shown in Table 3 and summed across the five data fields to
calculate the score. The advisory team developed the weights for each
field based on judgement related to how complex an extraction task was
given multiple test articles, species, strains, sexes, and endpoints.
Additional test articles and species were identified as introducing
complexity to an extraction, while multiple endpoints, which most
studies examined, did not dramatically add time to an extraction.
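A hypothetical illustration of the complexity score described above: counts of gold-standard elements per field are weighted by the Table 3 values and summed (in the statistical analyses this raw score was then divided by 100). The example counts below are invented.

```python
# Study complexity score: per-field element counts weighted by Table 3 values.
FIELD_WEIGHTS = {"endpoint": 0.5, "sex": 1, "species": 2, "strain": 1, "test article": 2}

def complexity_score(element_counts):
    """element_counts: mapping of field name -> number of gold-standard elements."""
    return sum(FIELD_WEIGHTS[field] * n for field, n in element_counts.items())

# Invented example: 2 test articles, 1 species, 1 strain, 2 sexes, 12 endpoints.
example = {"test article": 2, "species": 1, "strain": 1, "sex": 2, "endpoint": 12}
print(complexity_score(example))  # 2*2 + 1*2 + 1*1 + 2*1 + 12*0.5 = 15.0
```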
2.4. Usability feedback
After completion of the evaluation study, we asked manual and semi-
automated QA reviewers to provide qualitative feedback on their user
experience with Dextr. To summarize the assessment of usability across
the reviewers, six open-ended user-experience questions were developed
and responses for each question were recorded and compiled (Table S1).
Note that the feedback reflects user experience during the pilot and
evaluation phases.
2.5. Evaluation metrics
We evaluated the utility of Dextr in DNTP’s workow on three key
metrics: recall, precision, and extraction time. The recall rate is the
probability, prob(recall), that a gold-standard tag was correctly recalled.
The precision rate is the probability, prob(precision), that an identified
tag was a gold-standard tag. In the main paper, we compare arithmetic
Table 2
Respiratory outcome examples and key terms.¹

Category — Examples and Key Terms
Organ/tissue: nasal cavity and paranasal sinus, nose (including olfactory), larynx, trachea, pharynx, pleura, lung/pulmonary (including bronchi, alveoli), glottis, epiglottis
Signs/symptoms: sneezing/sniffling, nasal congestion, nasal discharge (e.g., rhinorrhea), coughing, increased mucus/sputum/phlegm, breathing abnormalities (e.g., wheezing, shortness of breath, unusual noises when breathing)
Respiratory-related diseases or conditions: fibrosis, asthma, emphysema, chronic obstructive pulmonary disease (COPD), pneumonia, sinusitis, rhinitis, granuloma (or other inflammation)
Lung function measurements: forced expiratory volume (FEV), forced vital capacity (FVC), peak expiratory flow (PEF), expiratory reserve volume (ERV), functional residual capacity (FRC), vital capacity (VC), total lung capacity (TLC), airway resistance, mucociliary clearance

¹ Examples of outcomes or endpoints and key terms in this table provided in a guidance document for training of extractors and QA reviewers.
Table 3
Assigned weights to data extraction fields within Dextr.¹

Field | Weight
Endpoint(s) | 0.5
Sex | 1
Species | 2
Strain | 1
Test article | 2

¹ Weights were developed using expert judgment to capture how additional extraction elements introduced complexity into the extraction task.
means of the recall rate or precision rate or the median total extraction
time with predictions from fitted statistical models “unstratified by
field” that estimate the rates or medians as a single function of the mode,
field, and other explanatory variables. In the Supplemental Materials,
we present alternative models for the recall and precision rates “strati-
fied by field,” where for each field the recall or precision rate is sepa-
rately modeled.
The log odds of recall is defined as logit(recall) = log {prob(recall) /
(1 - prob(recall))}, where “log” is the natural logarithm. The fitted
statistical model is a version of the model used in Saldanha et al. (2016).
The statistical model assumes that the log odds of recall for a given gold-
standard tag is a function of the mode (i = manual or semi-automated),
complexity score (c), field (f = endpoint, sex, species, strain, or test
article), primary extractor (p = primary1 or primary2 for manual mode,
NULL for semi-automated mode), quality assurance reviewer (q = q1 or
q2 for the two semi-automated mode reviewers, q3 or q4 for the two
manual mode reviewers), and study (s = different values for each of the
51 studies). For each combination of study, mode, and field, the recall
outcomes for each of the gold-standard tags are independent and have
the same log odds, giving a binomial distribution for the number of gold-
standard tags correctly recalled. For all these statistical analyses, the
complexity score defined above was divided by 100 to improve model
convergence without changing the underlying model formulation. The
general model used the equation:
Logit(recall) = intercept + α_i + βc + γ_f + δ_p + ε_q + (αγ)_if + (αβ)_ic + (βγ)_fc + θ_s.
In this general model, α_i and γ_f are fixed factors for the mode and field;
δ_p, ε_q, and θ_s are random factors for the primary extractors, QA re-
viewers, and study, drawn from independent normal distributions with
mean zero; and c is the quantitative complexity score (i.e., the calculated
score divided by 100). The terms (αγ)_if, (αβ)_ic, and (βγ)_fc are interaction
terms for mode × field, mode × complexity score, and complexity score
× field. Thus, the model allows the effect of the field to vary with the
mode or with the complexity score and allows the effect of the mode to
vary with the complexity score.
We were unable to fit this general model to the data due to problems
with extremely high standard errors for the mode × field interaction and
some convergence issues, although there were no problems with com-
plete or quasi-complete separation of the logistic regression models. For
example, in the initial model with random factors, the estimated vari-
ance for the QA reviewer was zero, but the corresponding gradient of
minus twice the log-likelihood was over 150 instead of being at most
0.001, the convergence criterion. For the final model we therefore
removed the mode × field interaction and replaced the random factors
for the primary and QA reviewers by fixed factors. Replacing the random
factors by fixed factors might limit the generalizability of these results to
other potential reviewers. We also removed main effects and in-
teractions that were not statistically significant at the 5% level. It is
possible that excluding interactions and replacing random factors by
fixed factors could have introduced some bias and might limit the
generalizability of the study results.
The final model was of the form:
Logit(recall) = intercept + α_i + βc + γ_f + δ_p + θ_s,
where the only random factor is the study effect. In particular, this
model does not have an interaction between mode and field, so the
estimated differences in log odds between modes are the same for every
field. Additionally, as noted above, to evaluate differences in the mode
effect across different fields we fitted alternative models stratified by
field, and those results are shown in the Supplemental Materials. In
particular, the stratified models show large differences between the
estimated study variances for different fields.
The precision rate is the probability, prob(precision), that an iden-
tified tag was a gold standard tag. The log odds of precision is defined as
logit(precision) = log {prob(precision) / (1 - prob(precision))}. The
general statistical model for precision was the same formulation as the
above model for recall. As before, the final model did not include the
mode × field interaction due to extremely high standard errors, and we
replaced the random factors for the primary and QA reviewers by fixed
factors. After removing non-significant main effects and interactions, the
final model (using the same notation) was of the form:
Logit(precision) = intercept + α_i + βc + γ_f + (αβ)_ic + θ_s,
where the only random factor is the study effect. In particular, this
model does not have an interaction between mode and field, so the
estimated differences in log odds between modes are the same for every
field. For each study and mode, the total extraction time, including the
primary and QA reviews, was recorded. The time taken for each field
was not recorded. The general model for time taken assumes that the
natural logarithm of the time taken is the following function of the
mode, complexity score, primary extractor, and QA reviewer. Using the
same notation as before, the general model is of the form:
Log(time) = intercept + α_i + βc + δ_p + ε_q + (αβ)_ic + θ_s + error,
where error is normally distributed with mean zero and is independent
of the random factors δ_p, ε_q, and θ_s. The interaction term for mode ×
complexity score was not statistically significant, and again it was
necessary for convergence to replace the random factors for the primary
and QA reviewer by fixed factors. The primary extractor effect was not
statistically significant at the 5% level. The final model was of the form:
Log(time) = intercept + α_i + βc + ε_q + θ_s + error.
3. Results
3.1. Dextr functionality
The first version of the tool that we evaluated in this study fulfills the
five design principles outlined at the tool’s inception. Specifically, the
tool’s set-up feature provides interoperability within the existing DNTP
workflow where users can upload .ris and .pdf files and export the
extracted data in two forms, as .csv or .zip files. The .zip format allows
exported data to be uploaded into brat (an open-source annotation
software tool; https://brat.nlplab.org/). The ability to export data in a
structure readable by brat allows users to leverage project data for future
model development.
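For readers unfamiliar with brat, the standoff format stores token-level annotations as entity lines with character offsets alongside the text. The generic example below is illustrative only; Dextr’s modified brat export and its entity names may differ in detail.

```python
# Generic brat-style standoff annotation (entity label + character offsets +
# covered text); entity names and example text are invented for illustration.
text = "Chlorothalonil in male rat lungs"
annotations = "\n".join([
    "T1\tTestArticle 0 14\tChlorothalonil",   # text[0:14]
    "T2\tSex 18 22\tmale",                    # text[18:22]
    "T3\tSpecies 23 26\trat",                 # text[23:26]
])
assert text[0:14] == "Chlorothalonil" and text[18:22] == "male" and text[23:26] == "rat"
print(annotations)
```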
In terms of usability requirements, Dextr enables users to select text
using their mouse or type a phrase into the extraction form. The default
extraction form consists of the five extraction fields, all powered by the
underlying model to provide predictions. Users can customize the data-
extraction form; however, only the fields on which the model was
trained are supported by automation.
We designed the default form and the tool to be able to handle re-
lationships between the extraction entities, called “connections.” This
allows the user to specify a hierarchy between fields (e.g., multiple an-
imal models can be dened, each with a species, strain, and sex). The
animal model and endpoints can then be connected to a test article to
create a separate experiment within the study, satisfying the third design
principle related to handling complex, hierarchical data.
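As a concrete, hypothetical picture of what such connections represent, a single experiment within a study might group an animal model with a test article and its endpoints. The values below are invented, and this is not Dextr’s internal data model.

```python
# Hypothetical representation of connected extraction entities for one
# experiment within a study (illustrative values only).
experiment = {
    "test_article": "chlorothalonil",
    "animal_model": {"species": "rat", "strain": "Sprague-Dawley", "sex": "male"},
    "endpoints": ["airway resistance", "lung inflammation"],
}
print(experiment["animal_model"]["species"])  # "rat"
```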
After project set-up, users and team members can begin extracting
data elements (manually or semi-automatically) via the “My Tasks”
page. Users then claim available pdfs (i.e., select a given pdf as part of
the user’s tasks) and access the full-text pdf within the tool to facilitate
data extraction. The user can highlight the text and associate it with an
extraction eld. Additionally, if the exact text or phrase is not within the
article itself, the user can type the appropriate text into the extraction
eld. These exible options for highlighting text to populate the
extraction form are useful for data extraction within a typical literature-
assessment workflow, or for a more detailed annotation workflow by
generating a dataset that may be used by model developers, thus satis-
fying the fourth design principle related to annotations. In both the
manual and semi-automated workflows, the primary extractor’s entries or the
machine’s predictions are populated in the extraction form before a user accesses
the study. The user then has the option to accept, ignore, or reject the
extracted data or add additional data if it is missing within the form.
The last design requirements, the ability for other models to be easily
incorporated into the tool and the ability to adapt to new model de-
velopments over time, were both addressed in the initial version of Dextr
but not tested in the evaluation study.
3.1.1. Usability feedback
Three of the reviewers were available to participate in a feedback
discussion of the tool’s usability (two of the semi-automated QA re-
viewers and one of the manual QA reviewers). Two reviewers rated the
usability of Dextr a 5 out of 10, while the other reviewer rated it an 8 out
of 10. The semi-automated reviewers provided feedback on how the tool
could be improved related to automatic page navigation and organiza-
tion on the user interface. The semi-automated QA reviewers liked how
the tool organized the extractions and agreed that the tool helped them
stay organized. Once they were comfortable with the tool, they were
able to work smoothly and efficiently. Reviewers identified one draw-
back regarding how the tool handled endpoints. All three of the re-
viewers found it difficult to keep track of endpoints identified by either
the machine or a primary extractor. It was a challenge to find previously
reviewed and accepted endpoints as reviewers continued searching for
new endpoints in the extraction list. This issue was more noticeable
when multiple, similar endpoints had been identified. They suggested
that a more organized process for listing and tracking endpoints would
improve the tool’s usability.
3.1.2. Statistical models for recall, unstratified by field
A total of 51 toxicological studies were included in the final dataset.
The modeled overall recall rate (the estimated probability that a gold-
standard tag was correctly recalled) was 97.0% for the manual mode
and 91.8% for the semi-automated mode. The difference in recall rates
for the manual mode compared to the semi-automated mode was
observed to be statistically significant (p < 0.01) (Table 4). These results
are comparable to the arithmetic means of the recall rates across all
studies and fields, which were 91.8% for the manual mode and 83.8%
for the semi-automated mode. Table 4 also provides the estimate,
standard error, and p-value for the difference between the log odds of
the two modes. Table 5 shows the estimated log odds and recall
probabilities as well as the very similar arithmetic mean recall rates for
each field for the manual and semi-automated modes. Note that because
there is no interaction term for mode × field, the estimated differences
in log odds between the two modes are the same for every field and
equal the values in the last row of Table 4. Estimates and standard errors
for the fixed effects and random effects related to recall are shown in Table S2.
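As a quick arithmetic check, the probabilities reported in Table 4 are simply the inverse-logit transforms of the modeled log odds:

```python
# Inverse-logit transform reproducing the Table 4 recall probabilities.
import math

def inv_logit(log_odds):
    return 1.0 / (1.0 + math.exp(-log_odds))

print(round(inv_logit(3.483), 3))  # 0.970 (manual mode)
print(round(inv_logit(2.418), 3))  # 0.918 (semi-automated mode)
```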
3.1.3. Statistical models for precision, unstratified by field
The modeled overall precision rate was 95.4% for the manual mode
and 96.0% for the semi-automated mode. The precision rate for the
semi-automated mode was higher, but the difference was not statisti-
cally significant (Table 6). These results can be compared with the
arithmetic means of the precision rates across all studies and fields,
which were 92.5% for the manual mode and 93.2% for the semi-
automated mode. Table 6 gives the estimated log odds and precision
probabilities for the manual and semi-automated modes, weighting each
field equally, along with their standard errors and p-values. Table 6 also
provides the estimate, standard error, and p-value for the difference
Table 4
Recall comparison between manual and semi-automated modes when averaged across fields and extractors, based on the model unstratified by field.³

Extraction Mode | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Recall Rate
Manual | 3.483 (0.287) | <0.0001 | 0.970 (0.008) | 0.918
Semi-automated¹ | 2.418 (0.278) | <0.0001 | 0.918 (0.021) | 0.838
Comparison² | −1.065 (0.109) | <0.0001 | – | –

¹ Dextr predictions confirmed by QA reviewer.
² Comparison between manual and semi-automated extraction modes.
³ Assumes average study complexity scores (0.175).

Table 5
Recall comparison between manual and semi-automated modes for each mode and field, averaged over evaluators, based on the model unstratified by field.¹

Extraction Mode | Field | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Recall Rate
Manual | Endpoint | 1.133 (0.115) | <0.0001 | 0.756 (0.021) | 0.744
Manual | Sex | 4.856 (0.726) | <0.0001 | 0.992 (0.006) | 0.980
Manual | Species | 5.444 (1.014) | <0.0001 | 0.996 (0.004) | 1.000
Manual | Strain | 3.893 (0.477) | <0.0001 | 0.980 (0.009) | 0.980
Manual | Test article | 2.091 (0.180) | <0.0001 | 0.890 (0.018) | 0.883
Semi-automated | Endpoint | 0.068 (0.106) | 0.5259 | 0.517 (0.027) | 0.523
Semi-automated | Sex | 3.791 (0.721) | <0.0001 | 0.978 (0.016) | 0.990
Semi-automated | Species | 4.379 (1.011) | <0.0001 | 0.988 (0.012) | 0.980
Semi-automated | Strain | 2.828 (0.470) | <0.0001 | 0.944 (0.025) | 0.922
Semi-automated | Test article | 1.026 (0.168) | <0.0001 | 0.736 (0.033) | 0.773

¹ Assumes average study complexity scores (0.175).
Table 6
Precision comparison between manual and semi-automated modes when averaged across fields and extractors, based on the model unstratified by field.³

Extraction Mode | Log Odds (Standard Error) | P-value of Log Odds | Probability (Standard Error) | Arithmetic Mean Precision Rate
Manual | 3.040 (0.281) | <0.0001 | 0.954 (0.012) | 0.925
Semi-automated¹ | 3.174 (0.287) | <0.0001 | 0.960 (0.011) | 0.932
Comparison² | 0.134 (0.151) | 0.3765 | – | –

¹ Dextr predictions confirmed by QA reviewer.
² Comparison between manual and semi-automated extraction modes.
³ Assumes average study complexity scores (0.175).
V.R. Walker et al.
Environment International 159 (2022) 107025
10
between the log odds of the two modes. Table 7 shows the estimated log
odds and precision probabilities as well as the very similar arithmetic
mean precision rates for each eld for the manual and semi-automated
modes. Note that because there is no interaction term for mode ×eld in
the nal model, the estimated differences in log odds between the two
modes are the same for every eld and equal the values in the last row of
Table 6. Estimates and standard errors for the xed effects and random
effects related to precision are shown in Table S3.
3.1.4. Statistical models for time
The modeled median time was 933 s for the manual mode and 436 s
for the semi-automated mode. The median time, which is the expo-
nentiated mean log(time), was signicantly lower for the semi-
automated mode (p <0.01). These results can be compared with the
arithmetic means of the time across all studies: 971 s for the manual
mode and 517 s for the semi-automated mode (Table 8). For each mode,
Table 8 gives the estimated means for log(time), the standard errors of
the means, and the estimated medians for time. For this model, the
median time is the same as the geometric mean time. Table 8 also pro-
vides the estimate, standard error, and p-value for the difference be-
tween the mean log(time) for the two modes. Estimates and standard
errors for the xed effects and random effects related to total time are
shown in Table S4.
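Because the time model operates on log-transformed durations, the modeled median times in Table 8 are the exponentiated mean log(time) values, and the mode comparison exponentiates to a ratio of medians. The minimal Python sketch below reproduces those back-transformations from the reported estimates; the inputs are taken from Table 8 and any rounding is ours.

```python
import math

# Estimated mean log(time) values from Table 8 (time in seconds on the log scale)
mean_log_time = {"manual": 6.838, "semi_automated": 6.079}

# Back-transform: exp(mean log time) = modeled median (= geometric mean) time
medians = {mode: math.exp(v) for mode, v in mean_log_time.items()}
print({mode: round(m) for mode, m in medians.items()})
# ~933 and ~437 seconds; Table 8 reports 933 and 436, the small gap reflecting
# rounding of the reported coefficients

# The comparison estimate (-0.760) exponentiates to a ratio of median times
ratio = math.exp(-0.760)
print(f"semi-automated median is about {100 * (1 - ratio):.0f}% lower than manual")
# ~53% lower, consistent with the reduction reported in the Discussion
```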
4. Discussion

Data extraction is a time- and resource-intensive step in the literature-assessment process. Machine-learning methods for automating data extraction have been explored to address this challenge; however, the use of machine learning for data extraction has been limited to date, particularly in the field of environmental health sciences. Development and uptake of advanced approaches for extraction lag behind other steps in the review process such as literature screening, where automated screening tools have been established and used more widely. In this paper, we introduced Dextr, a web-based data-extraction tool that pairs machine-learning models that automatically predict data-extraction entities with a user interface that enables manual verification of extracted information (i.e., a semi-automated method). The tool does more than provide a convenient user interface for extracting data; its extraction scheme supports complex data extraction from full-text scientific articles, with methods to capture data entities as well as connections between entities. With this approach, Dextr supports hierarchical data extraction by allowing users to identify relationships (e.g., the connections between species, strain, sex, exposure, and endpoints) necessary for efficient data collection and synthesis in literature reviews. When evaluated relative to manual data extraction of environmental health science articles, Dextr's semi-automated extraction performed well, resulting in time savings and comparable performance in both recall and precision.
O'Connor et al. (2019) provides a framework to describe the degree of independence, or "levels of automation," across tools and discusses potential barriers to adoption of automation for use in literature reviews. The degree of automation ranges from tools that improve file management (Level 1) and tools that leverage algorithms to assist with reference prioritization (Level 2), to tools that perform a task automatically but require human supervision to approve the tool's decisions, resulting in a semi-automated workflow (Level 3), and tools that perform a task automatically without human oversight (Level 4). In developing Dextr, we intentionally chose to build a Level 3 tool because we wanted a workflow in which a manual verification step preserves expert judgment, giving users the flexibility to accommodate entities for which existing models may have an error rate that is too high to achieve the necessary performance. The decision to develop a semi-automated tool also addresses the limited uptake of automation tools (van Altena et al. 2019) and expected barriers to adoption of automation (O'Connor et al. 2019) within the systematic-review community (e.g., providing a user verification option to address end-user mistrust of the automation tool, supporting transparency to demonstrate the ability of the tool to perform the task, and providing a verification step similar to manual QA to lessen the potential disruption of adding automation to current workflows). The work presented in this paper supports widespread adoption of a semi-automated data-extraction approach because Dextr has been tested on complex study designs, in an existing workflow, and provides the user the ability to confirm the machine-predicted values, thereby increasing transparency and demonstrating compatibility with current practices.
While systematic reviews, scoping reviews, and systematic evidence maps have different formats and goals, all literature-based assessments are used to inform evidence-based decisions. Therefore, the testing of new procedures and automated approaches is essential to assess both the impact on workflow and the accuracy of the results. Given that Dextr was developed to address the time-intensive step of data extraction, its
performance was evaluated in terms of recall, precision, and extraction time. Although the precision rates for the manual and semi-automated modes were similar, we found an unexpected and intriguing statistically significant reduction in the recall rate (arithmetic mean recall rate 0.918 for manual and 0.838 for semi-automated). Recall reflects the ability of the data-extraction approach to identify all relevant instances of an entity, and although 84% recall is good, we explored potential reasons for this decrease. While the recall rates for "sex," "species," and "strain" were comparable, the semi-automated recall rate was lower for the "endpoint" and "test article" fields. We hypothesize that the large number of endpoints predicted by Dextr may have been difficult or distracting for the user to sort through compared to manual identification. This hypothesis is supported by feedback from the reviewer-usability questions, and refining the user interface to avoid this potential distraction is a target for future versions of Dextr, for example by adding search functionality over the list of predicted endpoints to help extractors sort through them systematically. The differences in recall by field (see Table 5) are also correlated with the recall rates achieved by the model on the TAC SRIE dataset (Nowak and Kunstman 2018). The fields were chosen purposefully to observe the impact of model performance on the results. While the differences reflect the relative difficulty of the fields, we believe that model improvements will close the gap between the manual and semi-automated approaches. In terms of time, Dextr added clear efficiencies to our workflow, providing an approximately 50% reduction (53% lower predicted median time and 47% lower average time) in the time required for data extraction. This finding indicates that Dextr has the potential to provide similar recall and precision with substantial time savings and reduced manual workload by integrating semi-automated extraction and QA in a single step, replacing the conventional two-step data-extraction process (a manual extraction followed by a manual QC check).
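For readers less familiar with these metrics, recall and precision can be computed directly from counts of correctly and incorrectly extracted entities. The short Python sketch below illustrates the standard definitions used in this kind of evaluation; the counts are invented for illustration and do not come from the study dataset.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of gold-standard entities that the extraction captured."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives: int, false_positives: int) -> float:
    """Share of extracted entities that match the gold standard."""
    return true_positives / (true_positives + false_positives)

# Hypothetical counts for a single field in a single study (illustration only)
tp, fn, fp = 42, 8, 3
print(f"recall = {recall(tp, fn):.2f}")        # 42 / 50 = 0.84
print(f"precision = {precision(tp, fp):.2f}")  # 42 / 45 = 0.93
```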
Although primarily developed as a tool to improve the data-extraction workflow for literature-based reviews, Dextr can also be used to annotate published studies and produce training datasets for future model development. When the tool is used as part of a literature review, Dextr captures token-level annotations during the data-extraction workflow; these annotations are part of a machine-readable export that can potentially support model development and refinement. This feature provides an alternative to the current option of a dedicated workflow (i.e., outside of a normal literature review) required to generate training datasets and offers a reduction in the cost of developing them. However, the annotations captured on each study during a literature review may have some limitations, as the topic of the review could direct the extractors towards endpoints of interest rather than capturing all exposures or endpoints in a study. The lack of applicable datasets is a major impediment to model development for literature reviews (Jonnalagadda et al. 2015), and Dextr provides the potential for important advances to the field.
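To make the idea of a machine-readable, token-level annotation export concrete, the sketch below shows one plausible structure for a single extracted entity, expressed as a Python dictionary. The field names and values are hypothetical illustrations of the concepts described above (token offsets, entity type, user confirmation, and connections between entities), not Dextr's actual export schema.

```python
# Hypothetical example of a token-level annotation record (not Dextr's actual schema)
annotation = {
    "study_id": "example-study-001",            # identifier for the reviewed reference
    "entity_type": "species",                   # e.g., species, strain, sex, test article, endpoint
    "text": "Sprague-Dawley rats",              # the verbatim span confirmed by the extractor
    "section": "methods",
    "token_span": {"start": 118, "end": 121},   # token offsets within the section
    "predicted_by_model": True,                 # whether the value came from a model prediction
    "confirmed_by_user": True,                  # the manual verification step
    "connections": ["experiment-1"],            # links that group entities within an experiment
}

# Records like this, exported for every confirmed entity, could double as training data
print(annotation["entity_type"], "->", annotation["text"])
```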
There are several limitations in the evaluation of Dextr that should be noted. First, we used only a single dataset to test performance. The dataset used to evaluate the tool focused on identifying and extracting respiratory health outcomes only. In contrast, the endpoint entity algorithm was not trained with this specification, and the model predicted all potential health outcomes (or endpoints) in each reference, not just the respiratory subset. As described above, the extractors reported in responses to the reviewer-feedback questions that non-target endpoints identified by Dextr were a distraction. This limitation could have contributed to the lower recall rate observed, because all non-respiratory endpoints had to be reviewed to identify the relevant respiratory endpoints. Second, there are limitations associated with the models used, even though the models themselves were not evaluated for this paper. The models currently in Dextr were developed and trained only on the methods sections of environmental health animal studies; for this reason, the tool automatically identified and used only the methods section. However, detailed data extraction requires the full text of a reference because entities are commonly identified in the abstract, methods, and results sections. Similarly, information on some endpoints may be available only in tables, which Dextr currently does not process. Third, we evaluated the key performance features of Dextr (recall, precision, and time); however, we acknowledge that other aspects of the tool were beyond the scope of this project and were not tested. For example, the ability of users to establish connections was not directly tested nor a focus for user feedback. Last, this project was intended to develop a user interface designed to incorporate NLP data-extraction models; evaluation and potential improvement of the models used were outside the scope of the work described in this paper. Therefore, it is likely that our evaluation metrics (e.g., recall for the endpoint field) will improve in conjunction with focused efforts to improve the models.
Dextr was developed to add automation and machine-learning functions to the data-extraction step in DNTP's literature-based assessment workflow. Although developed to address a DNTP need, we believe it is important that the new tool be available to others in the research community and be stable (i.e., have technical support) over 2–5 years. We are in the process of obtaining Federal Risk and Authorization Management Program (FedRAMP) authorization for the cloud deployment of Dextr, which will be available at https://ntp.niehs.nih.gov/go/Dextr when completed. The current version of Dextr (v1.0-beta1) provides a solid foundation for us to continue to refine and incorporate new features that improve workflow and enable faster and more effective data extraction. Although this publication is paired with the initial release of the tool, we are already working to expand the functionality of Dextr, with planned improvements to the user interface, use of controlled vocabularies, and additional data-extraction entities. Testing the tool for data extraction on more diverse datasets is also underway. We are also working to identify existing models and develop new models that can be integrated into Dextr to expand the data-extraction capabilities to other evidence streams (e.g., epidemiological and in vitro studies). Other potential targets include the ability to extract more detailed entities (e.g., results, standard error, confidence interval) and information from tables, figures, and captions of scientific literature. As new features are developed, the design requirements of usability, flexibility, and interoperability will be periodically re-evaluated.
As described in the key design requirements, we considered it critical for Dextr to: 1) make data-extraction predictions automatically with user verification; 2) integrate token-level annotations in the data-extraction workflow; and 3) connect extracted entities to support hierarchical data extraction. This third feature, the connection of data entities, is helpful for efficient data collection and essential to enable effective synthesis in literature reviews. Controlled vocabularies and ontologies provide a hierarchical structure of terms to define the conceptual classes and relations needed for knowledge representation in a given domain. Controlled vocabularies provide semantics and terminology to normalize author-reported information and support a conceptual framework when evaluating results (de Almeida Biolchini et al. 2007). Efforts are ongoing to develop field structures in Dextr compatible with integrating ontologies and controlled vocabularies. These efforts include the capability of selecting an ontology or vocabulary at the entity level, with the ability to select multiple vocabularies, when setting up the data-extraction form in Dextr. We are also exploring the ability of an ontology to support data extraction for specific domains or questions based on the sorting, aggregating, and association context of terms in the ontology (i.e., identifying only cardiovascular endpoints from a search of environmental exposure references).
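As an illustration of what entity-level vocabulary selection in a data-extraction form could look like, the sketch below defines a hypothetical form configuration as a Python dictionary. The entity names mirror the fields evaluated in this study, but the configuration keys, vocabulary choices, and connection structure are assumptions for illustration rather than Dextr's actual configuration format.

```python
# Hypothetical data-extraction form configuration (illustrative only)
extraction_form = {
    "entities": [
        {"name": "species", "vocabularies": ["NCBI Taxonomy"]},
        {"name": "strain", "vocabularies": []},
        {"name": "sex", "vocabularies": []},
        {"name": "test article", "vocabularies": ["MeSH", "ChEBI"]},
        # Multiple vocabularies could be attached to a single entity, and an
        # ontology could later be used to restrict endpoints to a domain of interest
        {"name": "endpoint", "vocabularies": ["MeSH"], "domain_filter": "cardiovascular"},
    ],
    # Connections group entities that belong to the same experiment within a study
    "connections": [["species", "strain", "sex", "test article", "endpoint"]],
}

for entity in extraction_form["entities"]:
    print(entity["name"], "->", entity["vocabularies"] or "free text")
```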
5. Conclusions

Dextr is a semi-automated data extraction tool that has been transparently evaluated and shown to improve data extraction by substantially reducing the time required to conduct this step in supporting environmental health sciences literature-based assessments. Unlike other data extraction tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study) and properly connect or group the extracted
elements within a study. Furthermore, Dextr limits the work required by researchers to generate training data by incorporating machine-readable annotation exports that are collected as part of the data-extraction workflow within the tool. Dextr was designed to address challenges associated with environmental health sciences literature; however, we are confident that the features and capabilities within the tool are applicable to other fields and would improve the data-extraction process for other domains as well.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the Intramural Research Program
(Contract GS00Q14OADU417, Task Order HHSN273201600015U) at
NIEHS, NIH. DNTP initiated and directed the project, providing guidance on tool requirements to support data extraction for literature analysis as
well as the evaluation plan. Robyn Blain, Jo Rochester, and Jennifer
Seed of ICF worked under contract for DNTP and completed Dextr
testing, manual extraction, semi-automated extractions, evaluation re-
sults grading, and statistical analysis, while Pam Hartman of ICF
developed the gold-standard dataset. ICF staff also provided project
management for the software development and tool evaluation. Dextr
programming and development were conducted by Evidence Prime as
subcontractor to ICF. The underlying machine-learning model was also
developed by Evidence Prime. Kelly Shipkowski (now with DNTP)
worked for ICF at the beginning of the project. We appreciate the helpful
comments and input on the draft manuscript provided by Keith Shockley
and Nicole Kleinstreuer.
Competing Financial Interests: AJN and KK are employed by, and AJN is also a shareholder of, Evidence Prime, a software company that plans to commercialize the results of this work. To mitigate any potential conflicts of interest, these authors excluded themselves from activities that could influence the results of the evaluation study. The remaining authors declare that they have no actual or potential competing financial interests.
Appendix A. Supplementary material
Supplementary data to this article can be found online at https://doi.org/10.1016/j.envint.2021.107025.
References
Brockmeier, A.J., Ju, M., Przybyła, P., Ananiadou, S., 2019. Improving reference
prioritisation with PICO recognition. BMC Med. Inf. Decis. Making 19 (1), 256.
https://doi.org/10.1186/s12911-019-0992-8.
Clark, J., Glasziou, P., Del Mar, C., Bannach-Brown, A., Stehlik, P., Scott, A.M., 2020.
A full systematic review was completed in 2 weeks using automation tools: a case
study. J. Clin. Epidemiol. 121, 81–90. https://doi.org/10.1016/j.
jclinepi.2020.01.008.
de Almeida Biolchini, J.C., Mian, P.G., Natali, A.C.C., Conte, T.U., Travassos, G.H., 2007. Scientific research ontology to support systematic review in software engineering. Adv. Eng. Inf. 21 (2), 133–151. https://doi.org/10.1016/j.aei.2006.11.006.
Howard, B.E., Phillips, J., Tandon, A., Maharana, A., Elmore, R., Mav, D., Sedykh, A.,
Thayer, K., Merrick, B.A., Walker, V., Rooney, A., Shah, R.R., 2020. SWIFT-Active
Screener: Accelerated document screening through active learning and integrated
recall estimation. Environ. Int. 138, 105623. https://doi.org/10.1016/j.
envint.2020.105623.
James, K.L., Randall, N.P., Haddaway, N.R., 2016. A methodology for systematic
mapping in environmental sciences. Environ. Evid. 5 (1), 7. https://doi.org/
10.1186/s13750-016-0059-6.
Jonnalagadda, S.R., Goyal, P., Huffman, M.D., 2015. Automating data extraction in
systematic reviews: a systematic review. Syst. Rev. 4 (1), 78. https://doi.org/
10.1186/s13643-015-0066-7.
Marshall, C., Brereton, P., 2015. Systematic review toolbox: a catalogue of tools to
support systematic reviews. In: Paper presented at: Proceedings of the 19th
International Conference on Evaluation and Assessment in Software Engineering.
Association for Computing Machinery; Nanjing, China. https://doi.org/10.1145/2745802.2745824.
Marshall, I.J., Kuiper, J., Banner, E., Wallace, B.C., 2017. Automating biomedical
evidence synthesis: RobotReviewer. In: Paper presented at: Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics-System
Demonstrations. Vancouver, Canada. https://dx.doi.org/10.18653/v1/P17-4002.
Millard, L.A.C., Flach, P.A., Higgins, J.P.T., 2016. Machine learning to assist risk-of-bias
assessments in systematic reviews. Int. J. Epidemiol. 45 (1), 266–277. https://doi.
org/10.1093/ije/dyv306.
Nowak, A., Kunstman, P., 2018. Team EP at TAC 2018: Automating data extraction in systematic reviews of environmental agents. In: Paper presented at: National Institute of Standards and Technology Text Analysis Conference. Gaithersburg, MD.
O’Connor, A.M., Tsafnat, G., Thomas, J., Glasziou, P., Gilbert, S.B., Hutton, B., 2019.
A question of trust: Can we build an evidence base to gain trust in systematic review
automation technologies? Syst. Rev. 8 (1), 143. https://doi.org/10.1186/s13643-
019-1062-0.
Perera, N., Dehmer, M., Emmert-Streib, F., 2020. Named entity recognition and relation
detection for biomedical information extraction. Front. Cell Dev. Biol. 8, 673.
https://doi.org/10.3389/fcell.2020.00673.
Rathbone, J., Albarqouni, L., Bakhit, M., Beller, E., Byambasuren, O., Hoffmann, T.,
Scott, A.M., Glasziou, P., 2017. Expediting citation screening using PICO-based title-
only screening for identifying studies in scoping searches and rapid reviews. Syst.
Rev. 6 (1), 233. https://doi.org/10.1186/s13643-017-0629-x.
Saldanha, I.J., Schmid, C.H., Lau, J., Dickersin, K., Berlin, J.A., Jap, J., Smith, B.T.,
Carini, S., Chan, W., De Bruijn, B., Wallace, B.C., Hutfless, S.M., Sim, I., Murad, M.H.,
Walsh, S.A., Whamond, E.J., Li, T., 2016. Evaluating Data Abstraction Assistant, a
novel software application for data abstraction during systematic reviews: protocol
for a randomized controlled trial. Syst. Rev. 5 (1) https://doi.org/10.1186/s13643-
016-0373-7.
Schmitt, C., Walker, V., Williams, A., Varghese, A., Ahmad, Y., Rooney, A., Wolfe, M.,
2018. Overview of the TAC 2018 systematic review information extraction track. In:
Paper presented at: National Institute of Standards and Technology Text Analysis
Conference. Gaithersburg, MD.
van Altena, A.J., Spijker, R., Olabarriaga, S.D., 2019. Usage of automation tools in systematic reviews. Res. Synth. Methods 10 (1), 72–82. https://doi.org/10.1002/jrsm.1335.
Wallace, B.C., Small, K., Brodley, C.E., Lau, J., Trikalinos, T.A., 2012. Deploying an
interactive machine learning system in an Evidence-based Practice Center:
Abstrackr. In: Paper presented at: IHI ’12: Proceedings of the 2nd ACM SIGHIT
International Health Informatics Symposium. ACM Press, New York, NY. https://doi.
org/10.1145/2110363.2110464.
Wolffe, T.A.M., Vidler, J., Halsall, C., Hunt, N., Whaley, P., 2020. A survey of systematic
evidence mapping practice and the case for knowledge graphs in environmental
health and toxicology. Toxicol. Sci. 175 (1), 35–49. https://doi.org/10.1093/toxsci/
kfaa025.
Yadav, V., Bethard, S., 2018. A survey on recent advances in named entity recognition
from deep learning models. In: Paper presented at: Proceedings of the 27th
International Conference on Computational Linguistics. Santa Fe, NM.