Active learning for screening prioritization in systematic reviews
A simulation study
Gerbrich Ferdinands Raoul Schram Jonathan de Bruin Ayoub Bagheri
Daniel Oberski Lars Tummers Rens van de Schoot
10 August, 2020
Background

Conducting a systematic review requires great screening effort. Various tools have been proposed to speed up the process of screening thousands of titles and abstracts by engaging in active learning. In such tools, the reviewer interacts with machine learning software to identify relevant publications as early as possible. To gain a comprehensive understanding of active learning models for reducing workload in systematic reviews, the current study provides a methodical overview of such models. Active learning models were evaluated across four different classification techniques (naive Bayes, logistic regression, support vector machines, and random forest) and two different feature extraction strategies (TF-IDF and doc2vec). Moreover, models were evaluated across six systematic review datasets from various research areas to assess generalizability of active learning models across different research contexts.
Methods

Performance of the models was assessed by conducting simulations on six systematic review datasets. We defined desirable model performance as maximizing recall while minimizing the number of publications needed to screen. Model performance was evaluated by recall curves, WSS@95, RRF@10, and ATD.

Results

Within all datasets, the model performance exceeded screening at random order to a great degree. The models reduced the number of publications needed to screen by 91.7% to 63.9%.
Conclusions

Active learning models for screening prioritization show great potential in reducing the workload in systematic reviews. Overall, the naive Bayes + TF-IDF model performed the best.

Systematic review registrations: Not applicable.

Keywords: systematic reviews, active learning, screening prioritization, researcher-in-the-loop, title-and-abstract screening, automation, text mining.
Systematic reviews are a cornerstone of research. A systematic review brings together all available studies relevant to answering a specific research question [ ]. Systematic reviews inform practice and policy [ ] and are key in developing clinical guidelines [ ]. However, systematic reviews are costly: to identify publications relevant to answering the research question, they involve, among other things, the manual screening of thousands of titles and abstracts.
Conducting a systematic review typically requires over a year of work by a team of researchers [ ]. Nevertheless, systematic reviewers are often bound to a limited budget and timeframe. Currently, the demand for systematic reviews far exceeds the available time and resources [ ]. Especially when answering the research question at hand is urgent, it is extremely challenging to provide a review that is both timely and comprehensive.
To ensure a timely review, reducing the workload in systematic reviews is essential. With advances in machine learning (ML), there has been wide interest in tools to reduce the workload in systematic reviews [ ]. Various ML models have been proposed, aiming to predict whether a given publication is relevant or irrelevant to the systematic review. Previous findings suggest that such models can potentially reduce the workload by 30-70% at the cost of losing 5% of relevant publications, in other words, at 95% recall [7].
A well-established approach to increase the efficiency of title and abstract screening is screening prioritization [ ]. In screening prioritization, the ML model presents the reviewer with the publications that are most likely to be relevant first, thereby expediting the process of finding all of the relevant publications. Such an approach allows for substantial time savings in the screening process, as the reviewer can decide to stop screening after a sufficient number of relevant publications have been found [ ]. Moreover, the early retrieval of relevant publications facilitates a faster transition of those publications to the next steps in the review process [8].
Recent studies have demonstrated the effectiveness of screening prioritization by means of active learning models [ ]. With active learning, the ML model can iteratively improve its predictions on unlabeled data by allowing the model to select the records from which it wants to learn [ ]. The model proposes these records to a human annotator who provides the records with labels, which the model then uses to update its predictions. The general assumption is that by letting the model select which records are labeled, the model can achieve higher accuracy more quickly while requiring the human annotator to label as few records as possible [ ]. Active learning has proven to be an efficient strategy in large unlabeled datasets where labels are expensive to obtain [ ]. This makes the screening phase in systematic reviewing an ideal candidate for such models, because labeling a large number of publications is typically very costly.
When active learning is applied in the screening phase, the reviewer screens publications that are suggested by an active learning model. Subsequently, the active learning model learns from the reviewer's decision ('relevant', 'irrelevant') and uses this knowledge to update its predictions and to select the next publication to be screened by the reviewer.
The application of active learning models in systematic reviews has been extensively studied [ ]. While previous studies have evaluated active learning models in many forms and shapes [ ], ready-to-use software tools implementing such models (Abstrackr [ ], Colandr [ ], Rayyan [ ], and RobotAnalyst [ ]) currently use the same classification technique to predict relevance of publications, namely support vector machines (SVM). It was found [ ] that different classification techniques can serve different needs in the retrieval of relevant publications, for example in the desired balance between recall and precision. Therefore, it is essential to evaluate different classification techniques in the context of active learning models. The current study investigates active learning models adopting four classification techniques: naive Bayes (NB), logistic regression (LR), SVM, and random forest (RF). These are widely adopted techniques in text classification [ ] and are fit for software tools to be used in scientific practice due to their relatively short computation time.
Another component that influences model performance is how the textual content of titles and abstracts is represented in a model, called the feature extraction strategy [ ]. One of the more sophisticated feature extraction strategies is doc2vec (D2V), also known as paragraph vectors [ ]. D2V learns continuous distributed vector representations for pieces of text. In distributed text representations, words are assumed to appear in the same context when they are similar in terms of a latent space, the "embedding". A word embedding is simply a vector of scores estimated from a corpus for each word; D2V is an extension of this idea to document embeddings. Embeddings can sometimes outperform simpler feature extraction strategies such as term frequency-inverse document frequency (TF-IDF). They can be trained on large corpora to capture wider semantics and subsequently applied in a specific systematic reviewing application [ ]. Therefore, it is interesting to compare models adopting D2V to models adopting TF-IDF.
Lastly, previous studies have mainly focused on reviews from a single scientific field, such as medicine [ ] or computer science [ ]. To draw conclusions about the general effectiveness of active learning models, it is essential to evaluate models on reviews from varying research contexts [ ]. To our knowledge, Miwa et al. [ ] were the only researchers to make a direct comparison between systematic reviews from different research areas, such as the social and the medical sciences. They found that applying active learning was more difficult for a systematic review from the social sciences due to the different nature of the vocabularies used. Thus, it is of interest to evaluate model performance across different research contexts, namely social science, medical science, and computer science.
Taken together, for a more comprehensive understanding of active learning models in the context of systematic reviewing, a methodical evaluation of such models is required. The current study aims to address this issue by answering the following research questions:

RQ1 What is the performance of active learning models across four classification techniques?

RQ2 What is the performance of active learning models across two feature extraction strategies?

RQ3 Does the performance of active learning models differ across six systematic reviews from four research areas?
The purpose of this paper is to show the usefulness of active learning models for reducing workload in title and abstract screening in systematic reviews. We adopt four different classification techniques (NB, LR, SVM, and RF) and two different feature extraction strategies (TF-IDF and D2V) for the purpose of maximizing the number of identified relevant publications while minimizing the number of publications needed to screen. Models were assessed by conducting a simulation on six systematic review datasets. To assess generalizability of the models across research contexts, datasets containing previous systematic reviews were collected from the fields of medical science [ ], computer science [ ], and social science [ ]. The models, datasets, and simulations are implemented in a pipeline of active learning for screening prioritization, called ASReview. ASReview is a generic open source tool, encouraging fellow researchers to replicate findings from previous studies. To facilitate usability and acceptability of ML-assisted title and abstract screening in the field of systematic reviews, our scripts and data are openly available.
Technical details

What follows is a more detailed account of the active learning models, to clarify the choices made in the design of the current study.
Task description

The screening process of a systematic review starts with all publications obtained in the search. The task is to identify which of these publications are relevant by screening their titles and abstracts. In active learning for screening prioritization, the screening process proceeds as follows. Start with the set of all unlabeled records (titles and abstracts). The reviewer provides a label for a few records (e.g. 5-10), creating a set of labeled records; the label can be either relevant or irrelevant. Then the active learning cycle starts:

1. A classifier is trained on the labeled records
2. The classifier predicts relevancy scores for all unlabeled records
3. Based on the predictions by the classifier, the model selects the record with the highest relevancy score
4. The model requests the reviewer to screen this record
5. The reviewer screens the record and provides a label, relevant or irrelevant
6. The newly labeled record is moved to the training data
7. Back to step 1
8. Repeat this cycle until the reviewer decides to stop [10] or until all records have been labeled
In this active learning cycle, the model incrementally improves its predictions on the remaining unlabeled titles and abstracts. Relevant titles and abstracts are identified as early in the process as possible. A more technical description of the active learning cycle can be found in Additional file 1.
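The cycle above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ASReview implementation: `train_classifier` is a hypothetical stand-in for any classifier-fitting routine, and the reviewer's labeling decisions are simulated by looking up the known labels, exactly as in the simulation study described later.

```python
import random

def simulate_screening(records, labels, train_classifier, seed=42):
    """Simulate certainty-based active learning on a labeled dataset.

    `records` are feature vectors; `labels` are the known relevance
    labels (1 = relevant, 0 = irrelevant) that stand in for the reviewer.
    Returns the order in which records were screened.
    """
    rng = random.Random(seed)
    unlabeled = set(range(len(records)))
    # Initial training set: one relevant and one irrelevant record.
    relevant = [i for i in unlabeled if labels[i] == 1]
    irrelevant = [i for i in unlabeled if labels[i] == 0]
    train = {rng.choice(relevant), rng.choice(irrelevant)}
    unlabeled -= train
    order = sorted(train)
    while unlabeled:
        # Step 1-2: retrain and score all unlabeled records.
        clf = train_classifier([records[i] for i in train],
                               [labels[i] for i in train])
        # Step 3: certainty-based query, highest relevancy score first.
        best = max(unlabeled, key=lambda i: clf(records[i]))
        # Steps 4-6: the simulated reviewer labels it; move to training data.
        order.append(best)
        train.add(best)
        unlabeled.remove(best)
    return order
```

Any scoring classifier can be plugged in; for example, a toy nearest-to-relevant-centroid scorer on one-dimensional features will retrieve the relevant records first.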
This case is an example of pool-based active learning, as the next record to be queried is selected by predicting relevancy for all records in a fixed pool [ ]. Another form of active learning is stream-based active learning, in which the data is regarded as a stream instead of a fixed pool: the model considers one record at a time and then decides whether or not to query it. This approach is preferred when it is expensive or impossible to exhaustively search the data for selecting the next query. A possible application of stream-based active learning is living systematic reviews, as the review is continually updated as new evidence becomes available. For an example, see the study by Wynants et al. [38].
Class imbalance problem

Typically, only a fraction of the publications belong to the relevant class (2.94%, [ ]). To some extent, this fraction is under the control of the researcher through the search criteria: if the researcher narrows the search query, it will generally result in a higher proportion of relevant publications. However, in most applications this practice would yield an unacceptable number of false negatives (erroneously excluded papers) in the querying phase of the review process. For this reason, the querying phase in most practical applications yields a very low percentage of relevant publications. Because there are generally far fewer examples of relevant than irrelevant publications to train on, the class imbalance causes the classifier to miss relevant publications [ ]. Moreover, classifiers can achieve high accuracy but still fail to identify any of the relevant publications [15].
Previous studies have addressed the class imbalance problem by rebalancing the training data in various ways [ ]. To decrease the class imbalance in the training data, we rebalance the training set by a technique we propose to call "dynamic resampling" (DR). DR undersamples the irrelevant publications in the training data, whereas the relevant publications are oversampled such that the size of the training data remains the same. The ratio between relevant and irrelevant publications in the rebalanced training data is not fixed, but dynamically updated: it depends on the number of publications in the available training data, the total number of publications in the dataset, and the ratio between relevant and irrelevant publications in the available training data. Additional file 2 provides a detailed script to perform DR.
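The exact ratio formula used by DR is given in Additional file 2; the sketch below only illustrates the general idea of rebalancing a training set to a chosen target ratio while keeping its size constant. The function name and the fixed `target_ratio` parameter are simplifications for illustration, not the dynamically updated ratio of the actual method.

```python
import random

def rebalance(X, y, target_ratio, seed=0):
    """Rebalance training data toward `target_ratio` relevant records
    while keeping the total number of training records unchanged.
    Relevant records (y == 1) are oversampled with replacement;
    irrelevant records (y == 0) are undersampled without replacement.
    """
    rng = random.Random(seed)
    rel = [i for i, label in enumerate(y) if label == 1]
    irr = [i for i, label in enumerate(y) if label == 0]
    n = len(y)
    n_rel = max(1, round(target_ratio * n))
    n_irr = n - n_rel
    idx = [rng.choice(rel) for _ in range(n_rel)]      # oversample relevant
    idx += rng.sample(irr, min(n_irr, len(irr)))       # undersample irrelevant
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]
```

For a training set of one relevant and nine irrelevant records, a target ratio of 0.5 would return ten records of which five are (repeated) copies of the relevant one.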
Classification techniques

To make relevancy predictions on the unlabeled publications, a classifier is trained on features from the training data. The performance of the following four classifiers is explored:

Support vector machines (SVM) - SVMs separate the data into classes by finding a multidimensional hyperplane [39, 40].

L2-regularized logistic regression (LR) - models the probabilities describing the possible outcomes by a logistic function. The classifier uses regularization, shrinking coefficients of features with small contributions to the solution towards zero.

Naive Bayes (NB) - a supervised learning algorithm often used in text classification. It is based on Bayes' theorem, with the 'naive' assumption that all features are independent given the class value [41].

Random forest (RF) - a supervised learning algorithm in which a large number of decision trees are fit on samples obtained from the original data by sampling both rows (bootstrapped samples) and columns (feature samples). In prediction mode, each tree casts a vote on the class, and the final prediction is the class that received the most votes [42].
Feature extraction

To predict the relevance of a given publication, the classifier uses information from the publications in the dataset, such as their titles and abstracts. However, a model cannot make predictions from the titles and abstracts as they are; their textual content needs to be represented numerically as feature vectors. This process of numerically representing textual content is referred to as 'feature extraction'.
TF-IDF is a specific way of assigning scores to the cells of the "document-term matrix" used in all bag-of-words representations. The rows of the document-term matrix represent the documents (titles and abstracts) and the columns represent all words in the dictionary. Instead of simply counting the number of times each word occurs in a given document, TF-IDF assigns a score to a word relative to the number of documents in which the word occurs. The idea behind weighting words by their rarity is that surprising word choices should make for more discriminative features [ ]. A disadvantage of TF-IDF and other bag-of-words methods is that they do not take the ordering of words into account, thereby ignoring syntax. However, in practice, TF-IDF is often found to be a strong baseline [44].
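As an illustration, a minimal TF-IDF weighting can be computed as follows. This sketch uses raw term counts for TF and the unsmoothed formula idf(t) = log(N / df(t)); production libraries typically add smoothing and vector normalization, so the numbers will differ from, e.g., scikit-learn's output.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a document-term matrix with TF-IDF weights.

    Rows correspond to documents (titles and abstracts), columns to
    the words of the dictionary. A word occurring in every document
    gets idf = log(1) = 0 and thus carries no discriminative weight.
    """
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n = len(docs)
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log(n / df[w]) for w in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        rows.append([tf[w] * idf[w] for w in vocab])
    return vocab, rows
```

Note how a ubiquitous term is zeroed out while a rare term keeps a positive weight, which is exactly the "surprising word choices are discriminative" intuition.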
In recent years, a range of modern methods have been developed that often outperform bag-of-words approaches. Here, we consider doc2vec, an extension of the classic word2vec embedding [ ]. In word embedding models, whether a word did or did not happen to appear in a specific context is predicted by its similarity to that context in a latent space - the "embedding". The context is usually a sliding window across training sentences. For example, if the window "child ate cookies" occurs in the training data, it might be compared with a random 'negative' window that did not occur, such as "child lovely cookies". The tokens "ate" and "cookies" are then assigned scores (vectors) that give a higher inner product with the "child" vector, and a smaller product with "lovely". The word vectors of "ate" and "lovely" are similarly updated. Typically the embedding dimension is a few hundred, i.e. each word vector contains some two hundred scores. Note that if "cookies" previously co-occurred frequently with "spinach", then the above also indirectly makes "ate" more similar to "spinach", even if these two words have not yet been observed in the same context. Thus, the distributed representation learns something of the meaning of these words through their occurrence in similar contexts. D2V performs such a procedure while including a paragraph identifier, allowing for paragraph embeddings - or, in our case, embeddings for titles and abstracts. In short, D2V converts each abstract into a vector of a few hundred scores, which can be used to predict relevancy.
Query strategy

The active learning model can adopt different strategies in selecting the next publication to be screened by the reviewer. A strategy mentioned before is selecting the publication with the highest probability of being relevant; in the active learning literature this is referred to as certainty-based active learning [ ]. Another well-known strategy is uncertainty-based active learning, where the instances presented next are those on which the model's classifications are the least certain, i.e. close to 0.5 probability [ ]. Further strategies include selecting the next instance to optimize various criteria, including model fit (MLI), model change (MMC), parameter estimate accuracy (EVR), and expected (EER) or worst-case (MER) prediction accuracy [ ]. Although uncertainty sampling is not explicitly motivated by the optimization of any particular criterion, intuitively it can be seen as attempting to improve the model's accuracy by reducing uncertainty about its parameter estimates.
Simulation-based comparisons of these methods across different domains have yielded an ambiguous picture of their relative strengths [ ]. What has become clear from such studies is that the features of the task at hand determine the effectiveness of active learning strategies ("no free active lunch"). For example, if a linear classifier is used for a task that also happens to have a Bayes-optimal linear decision boundary, a model-based approach such as Fisher information reduction can be expected to perform well, whereas the same technique can be disastrous when the model is misspecified - a fact that cannot be known in advance. Furthermore, the criteria mentioned above differ from the task of title and abstract screening in systematic reviews: here, the aim is not to obtain an accurate model, but rather to end up with a list of records belonging to the relevant class [ ]. This is the criterion corresponding intuitively to certainty-based sampling. For this reason, we choose certainty-based sampling as the baseline strategy for active learning in systematic reviewing. However, different strategies may outperform this baseline in specific applications.
Simulation study

This section describes the simulation study that was carried out to answer the research questions.

To address RQ1, four models were investigated, combining each classifier with TF-IDF feature extraction:

1. SVM + TF-IDF
2. NB + TF-IDF
3. RF + TF-IDF
4. LR + TF-IDF

To address RQ2, the classifiers were combined with D2V feature extraction, leading to the following three models:

5. SVM + D2V
6. RF + D2V
7. LR + D2V

The combination NB + D2V could not be tested because the multinomial naive Bayes classifier requires a feature matrix with positive values, whereas the D2V feature extraction approach produces a feature matrix that can contain negative values. The performance of the seven models was evaluated by simulating every model on six systematic review datasets, addressing RQ3. Hence, 42 simulations were carried out, representing all model-dataset combinations.
Instead of having a human reviewer label publications manually, the screening process was simulated by retrieving the labels in the data. Each simulation started with an initial training set of one relevant and one irrelevant publication, to represent a challenging scenario in which the reviewer has very little prior knowledge of the publications in the data. The model was retrained each time after a publication had been labeled. A simulation ended after all publications in the dataset had been labeled. To account for sampling variance, every simulation was repeated 15 times. To account for bias due to the content of the initial publications, the initial training set was randomly sampled from the dataset for each of the 15 trials. Although varying over trials, the 15 initial training sets were kept constant for each dataset to allow for a direct comparison of models within datasets. A seed value was set to ensure reproducibility. The simulation study was carried out using the ASReview simulation extension [ ]. For each simulation, hyperparameters were optimized through a Tree of Parzen Estimators (TPE) algorithm [48] to arrive at maximum model performance.
Simulations were carried out in ASReview version 0.9.3 [ ]. Analyses were carried out using R version 3.6.1 [49]. The simulations were carried out on Cartesius, the Dutch national supercomputer.
Datasets

The models were simulated on a convenience sample of six systematic review datasets. The data selection process was driven by two factors. Firstly, datasets were collected from various research areas to assess generalizability of the models across research contexts (RQ3). Secondly, all original data files had to be openly published with a CC-BY license. Datasets are available through ASReview's systematic review datasets repository on GitHub.
The Wilson dataset [ ] - from the field of medicine - is from a review on the effectiveness and safety of treatments of Wilson disease, a rare genetic disorder of copper metabolism [ ]. From the same field, the ACE dataset contains publications on the efficacy of angiotensin-converting enzyme (ACE) inhibitors, a treatment drug for heart disease [ ]. Additionally, the Virus dataset is from a systematic review on studies that performed viral metagenomic next-generation sequencing (mNGS) in farm animals [ ]. From the field of computer science, the Software dataset contains publications from a review on fault prediction in software engineering [ ]. The Nudging dataset [ ] belongs to a systematic review on nudging healthcare professionals [ ], stemming from the social sciences. From the same research area, the PTSD dataset contains publications on studies applying latent trajectory analyses on posttraumatic stress after exposure to traumatic events [ ]. Of these six datasets, ACE and Software have been used for model simulations in previous studies on ML-aided title and abstract screening [11, 32].
Data were preprocessed from their original source into a dataset containing the title and abstract of the publications obtained in the initial search. Duplicates and publications with missing abstracts were removed from the data. Datasets were labeled to indicate which candidate publications were included in the systematic review, thereby denoting relevant publications. All datasets consisted of thousands of candidate publications, of which only a fraction was deemed relevant to the systematic review. For the Virus and the Nudging datasets, this proportion was about 5 percent. For the remaining four datasets, the proportions of relevant publications were centered around 1-2 percent (Table 1).
Evaluating performance

Model performance was assessed by three different measures: Work Saved over Sampling (WSS), Relevant References Found (RRF), and Average Time to Discovery (ATD). WSS indicates the reduction in the number of publications needed to be screened at a given level of recall [ ]. Typically measured at a recall level of 95%, WSS@95 yields an estimate of the amount of work that can be saved at the cost of failing to identify 5% of relevant publications. In the current study, WSS is computed at 95% recall. RRF@10 represents the proportion of relevant publications that are found after screening 10% of all publications.
Both RRF and WSS are sensitive to the position of the cutoff value and the distribution of the data. Moreover, WSS makes assumptions about the acceptable recall level, whereas this level might depend on the research question at hand [ ]. Therefore, we introduce the ATD, the average fraction of publications that needs to be screened to discover the relevant publications (excluding the relevant publications in the initial training set). The ATD is an indicator of performance throughout the entire screening process instead of performance at some arbitrary cutoff value. The ATD is computed by taking the average of the Time to Discovery (TD) of all relevant publications. The TD for a given relevant publication i is computed as the fraction of publications needed to screen to detect i. Additional file 3 provides a detailed script to compute the ATD.
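Given a screening order and the true labels, the three measures can be computed directly from their definitions. The sketch below uses one common operationalization of WSS (work saved relative to screening everything, minus the 1 − recall allowance); the exact scripts used in this study are in the additional files, so treat this as an illustration.

```python
def wss(order, labels, recall_level=0.95):
    """Work Saved over Sampling at `recall_level`: the fraction of
    records that did not need screening, minus the (1 - recall)
    allowance. `order` is the screening order; `labels` the true
    relevance labels (1 = relevant)."""
    n, n_rel = len(order), sum(labels)
    found = screened = 0
    for i in order:
        screened += 1
        found += labels[i]
        if found >= recall_level * n_rel:
            break
    return (1 - screened / n) - (1 - recall_level)

def rrf(order, labels, fraction=0.10):
    """Proportion of relevant records found after screening
    `fraction` of all records."""
    k = round(fraction * len(order))
    return sum(labels[i] for i in order[:k]) / sum(labels)

def atd(order, labels):
    """Average Time to Discovery: mean fraction of records screened
    up to and including each relevant record."""
    n = len(order)
    tds = [(rank + 1) / n for rank, i in enumerate(order) if labels[i]]
    return sum(tds) / len(tds)
```

With a perfect prioritization of ten records containing two relevant ones, WSS@95 is 0.75, RRF@10 is 0.5, and the ATD is 0.15.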
Furthermore, model performance was visualized by plotting recall curves. Plotting recall as a function of the proportion of screened publications offers insight into model performance throughout the entire screening process [ ]. The curves give information in two directions: on the one hand, they display the number of publications that need to be screened to achieve a certain level of recall; on the other hand, they present how many relevant publications are identified after screening a certain proportion of all publications (RRF). For each simulation, the RRF@10, WSS@95, and ATD are reported as means over 15 trials. To indicate the spread of performance within simulations, the means are accompanied by an estimated standard deviation (SD). To compare the overall performance across datasets, median performance is reported for every dataset, accompanied by the Median Absolute Deviation (MAD), indicating variability between models within a certain dataset. Recall curves are plotted for each simulation, representing the average recall over 15 trials ± the standard error of the mean.
Results

This section proceeds as follows: firstly, as an example, the results for the Nudging dataset are discussed in detail to provide a basis for answering the research questions. Secondly, the results are presented for each research question over all datasets.
Evaluation on the Nudging dataset

Figure 1a shows the recall curves for all simulations on the Nudging dataset. As described in the previous section, these curves plot recall as a function of the proportion of publications screened. The curves represent the average recall over 15 trials ± the standard error of the mean in the direction of the y-axis. The x-axis is cut off at 40% since at this point in screening all models had already reached 95% recall. The dashed horizontal lines indicate the RRF@10 values, and the dashed vertical lines the WSS@95 values. The dashed black diagonal line corresponds to the expected recall curve when publications are screened in random order.
The recall curves were used to examine model performance throughout the entire screening process and to make a visual comparison between models within datasets. For example, in Figure 1a, after screening about 30% of the publications, all models had already found 95% of the relevant publications. Moreover, after screening 5%, the green curve - representing the RF + TF-IDF model - splits away from the others and remains the lowest of all curves until about 30% of publications have been screened. Hence, from screening 5 to 30 percent of publications, the RF + TF-IDF model was the slowest in finding the relevant publications. The ordering of the remaining recall curves changes throughout the screening process, but they maintain relatively similar performance at face value.
Figure 1b shows a subset of the recall curves in Figure 1a, namely the curves of the first four models, to allow for a visual comparison across classification techniques adopting the TF-IDF feature extraction strategy. Figure 1c shows recall curves for the remaining three models to compare the models using D2V feature extraction. Figures 1d to 1f compare recall curves for models adopting the TF-IDF feature extraction strategy to recall curves for their D2V counterparts.
It can be seen from Table 2 that in terms of ATD, the best performing models on the Nudging dataset were
SVM + D2V and LR + D2V, both with an ATD of 8.8%. This indicates that the average proportion of
publications needed to screen to find a relevant publication was 8.8% for both models. For the SVM + D2V
model, the standard deviation was 0.33, whereas for the LR + D2V model it was 0.47. This indicates that for
the SVM + D2V model, the ATD values of individual trials were closer to the overall mean compared to
the LR + D2V model, meaning that the SVM + D2V model performed more stably across different initial
training datasets. Median ATD for this dataset was 9.5% with an MAD of 1.05, indicating that for half of
the models, the ATD was within 1.05 percentage points of the median ATD.
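The ATD interpretation given above can be sketched as code. The exact definition is given in Additional file 3; the sketch below assumes that the time to discovery of a relevant publication is the proportion of the dataset screened at the moment it is found, and that ATD averages this over all relevant publications:

```python
def atd(screening_order, labels):
    """Average Time to Discovery: the mean, over all relevant publications,
    of the proportion of publications screened when each one is found."""
    n = len(screening_order)
    times_to_discovery = [
        (position + 1) / n
        for position, idx in enumerate(screening_order)
        if labels[idx] == 1
    ]
    return sum(times_to_discovery) / len(times_to_discovery)

# Toy example: relevant records at indices 1 and 4, found 1st and 2nd.
labels = [0, 1, 0, 0, 1, 0]
print(atd([1, 4, 0, 2, 3, 5], labels))  # (1/6 + 2/6) / 2 = 0.25
```

Lower values are better: an ATD of 8.8% means that, on average, a relevant publication was found after screening 8.8% of the dataset.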
As Table 3 shows, the highest WSS@95 value on the Nudging dataset was achieved by the NB + TF-IDF
model with a mean of 71.7%, meaning that this model reduced the number of publications needed to screen
by 71.7% at the cost of losing 5% of relevant publications. The estimated standard deviation of 1.37 indicates
that in terms of WSS@95, this model performed the most stably across trials. The model with the lowest
WSS@95 value was RF + TF-IDF (x̄ = 64.9%, s = 2.50). Median WSS@95 of these models was 66.9%, with
a MAD of 3.05, indicating that of all datasets, the WSS@95 values varied the most for the models simulated
on the Nudging dataset.
As can be seen from the data in Table 4, LR + D2V was the best performing model in terms of RRF@10,
with a mean of 67.5%, indicating that after screening 10% of publications, on average 67.5% of all relevant
publications had been identified, with a standard deviation of 2.59. The worst performing model was RF +
TF-IDF (x̄ = 53.6%, s = 2.71). Median performance was 62.6%, with an MAD of 3.89, indicating again that
of all datasets, the RRF@10 values were most dispersed for models simulated on the Nudging dataset.
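The WSS@95 and RRF@10 values reported in Tables 3 and 4 follow the usual operationalizations of these metrics; a stdlib-only sketch (details such as rounding may differ from the scripts used in the study):

```python
import math

def wss(screening_order, labels, recall_level=0.95):
    """Work Saved over Sampling: screening effort saved, relative to random
    screening, at the point where `recall_level` of relevant records is found."""
    n = len(screening_order)
    target = math.ceil(recall_level * sum(labels))
    found = 0
    for position, idx in enumerate(screening_order, start=1):
        found += labels[idx]
        if found >= target:
            return (1 - position / n) - (1 - recall_level)
    return 0.0

def rrf(screening_order, labels, proportion=0.10):
    """Relevant References Found after screening `proportion` of the dataset."""
    n_screened = math.ceil(proportion * len(screening_order))
    found = sum(labels[idx] for idx in screening_order[:n_screened])
    return found / sum(labels)

# Toy example: relevant records at indices 1 and 4, ranked first.
labels = [0, 1, 0, 0, 1, 0]
order = [1, 4, 0, 2, 3, 5]
print(round(wss(order, labels), 3))  # both relevant found after 2 of 6: 0.617
print(rrf(order, labels))            # 1 of 2 relevant in the first 10%: 0.5
```

Higher is better for both metrics; WSS@95 discounts the 5% of relevant records that may be lost at the 95% recall level.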
Overall evaluation
Recall curves for the simulations on the five remaining datasets are presented in Figure 2. For the sake of
conciseness, recall curves are only plotted once per dataset, like in Figure 1a for the Nudging dataset. Please
refer to Additional file 4 for figures presenting subsets of recall curves for the remaining datasets, like in
Figure 1b-f.
First of all, as the recall curves far exceed the expected recall under random-order screening for all datasets,
the models were able to detect the relevant publications much faster than when screening publications in a
random order. Even the worst results outperform this reference condition. Across simulations, the ATD
was at most 11.8% (in the Nudging dataset), the WSS@95 at least 63.9% (in the Virus dataset), and
the lowest RRF@10 was 53.6% (in the Nudging dataset). Interestingly, all these values were achieved by the
RF + TF-IDF model.
Similar to the simulations on the Nudging dataset (Figure 1a), the ordering of recall curves changes throughout
the screening process, indicating that some models perform better at the start of the screening phase whereas
other models take the lead later on. Moreover, the ordering of models in the Nudging dataset (Figure 1a) is
not replicated in the remaining five datasets (Figure 2).
RQ1 - Comparison across classification techniques
The first research question was aimed at evaluating the four models adopting either the NB, SVM, LR or
RF classification technique combined with TF-IDF feature extraction. When comparing ATD-values of the
models (Table 2), the NB + TF-IDF model ranked first in the ACE, Virus, and Wilson dataset, shared first
in the PTSD and Software dataset, and second in the Nudging dataset in which the SVM + D2V and LR +
D2V models achieved the lowest ATD value. The RF + TF-IDF model ranked last in all of the datasets except
for the ACE and the Wilson dataset, in which the RF + D2V model achieved the highest ATD-value.
Additionally, in terms of WSS@95 (Table 3) the ranking of models was strikingly similar across all datasets.
In the Nudging, ACE, and Virus dataset, the highest WSS@95 value was always achieved by the NB +
TF-IDF model, followed by LR + TF-IDF, SVM + TF-IDF, and RF + TF-IDF. In the PTSD and the
Software dataset this ranking applied as well, except that two models showed the same WSS@95 value. The
ordering of the models for the Wilson dataset was NB + TF-IDF, RF + TF-IDF, LR + TF-IDF, and SVM +
TF-IDF.
Moreover, in terms of RRF@10 (Table 4) the NB + TF-IDF model achieved the highest RRF@10 value in the
ACE and Virus dataset. Within the PTSD dataset, LR + TF-IDF was the best performing model, for the
Software and Wilson dataset this was SVM + D2V, and for the Nudging dataset LR + D2V performed best.
Taken together, these results show that while all four models perform quite well, the NB + TF-IDF model
demonstrates high performance on all measures across all datasets, whereas the RF + TF-IDF model never
performed best on any of the measures across all datasets.
RQ2 - Comparison across feature extraction techniques
This section is concerned with the question of how models using different feature extraction strategies relate
to each other. The recall curves for the Nudging dataset (Figure 1d-f) show a clear trend of the models
adopting D2V feature extraction outperforming their TF-IDF counterparts. This trend also shows from
the WSS@95 and RRF@10 values indicated by the vertical and horizontal lines in the figure. Likewise, the
ATD values (Table 2) indicate that for the models adopting a particular classification technique, the model
adopting D2V feature extraction always achieved a lower ATD-value than the model adopting TF-IDF feature
extraction.
In contrast, this pattern of models adopting D2V outperforming their TF-IDF counterparts in the Nudging
dataset is not replicated across other datasets. Whether evaluated in terms of recall curves, WSS@95,
RRF@10, or ATD, the findings were mixed. Neither feature extraction strategy showed consistently superior
performance, either within certain datasets or within certain classification techniques.
RQ3 - Comparison across research contexts
First of all, models showed much higher recall curves for some datasets than for others. While performance
on the PTSD (Figure 2a) and Software (Figure 2b) datasets was quite high, performance was much lower
across models for the Nudging (Figure 1a) and Virus (Figure 2d) datasets. The models simulated on the
PTSD and Software datasets also demonstrated high performance in terms of the median ATD, WSS@95,
and RRF@10 values for these models (Tables 2, 3, and 4).
Secondly, variability of between-model performance differed across datasets. For the PTSD (Figure 2a),
Software (Figure 2b), and Virus (Figure 2d) datasets, the recall curves form a tight group, meaning that within
these datasets the models performed similarly. In contrast, for the Nudging (Figure 1a), ACE (Figure 2c),
and Wilson (Figure 2e) datasets, the recall curves are much further apart, indicating that model performance
was more dependent on the adopted classification technique and feature extraction strategy. The MAD values
of the ATD, WSS@95, and RRF@10 confirm that model performance is less spread out within the PTSD,
Software, and Virus datasets than within the Nudging, ACE, and Wilson datasets. Moreover, the curves for
the ACE (Figure 2c) and Wilson (Figure 2e) datasets show a larger standard error of the mean compared to
the other datasets.
Taken together, although model performance is very data-dependent, there does not seem to be a distinction
in performance between the datasets from the biomedical sciences (ACE, Virus, and Wilson) and datasets
from other fields (Nudging, PTSD, and Software).
Discussion

The current study evaluates the performance of active learning models for the purpose of identifying relevant
publications in systematic review datasets. It has been one of the first attempts to examine different
classification strategies and feature extraction strategies in active learning models for systematic reviews.
Moreover, this study has provided a deeper insight into the performance of active learning models across
research contexts.
Active learning-based screening prioritization
All models were able to detect 95% of the relevant publications after screening less than 40% of the total
number of publications, indicating that active learning models can save more than half of the workload in
the screening process. In a previous study, the ACE dataset was used to simulate a model that did not use
active learning, finding a WSS@95 value of 56.61% [ ], whereas the models in the current study achieved
far superior WSS@95 values varying from 68.6% to 82.9% in this dataset. In another study [ ] that did
use active learning, the Software dataset was used for simulation and a WSS@95 value of 91% was reached,
strikingly similar to the values found in the current study, which ranged from 90.5% to 92.3%.
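The screening-prioritization cycle evaluated here (seed the model with labeled records, retrain, screen the top-ranked record, repeat) can be illustrated with a toy loop. The word-overlap scorer and the documents below are invented stand-ins for the study's classifier and feature-extraction models, not the ASReview implementation:

```python
import random

docs = ["mice vaccine trial", "vaccine immune response", "city traffic model",
        "bridge load analysis", "immune cells vaccine", "train schedule data"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = relevant publication

def overlap_scorer(train_idx, train_labels, pool):
    """Score pool documents by word overlap with known-relevant documents."""
    relevant_words = set()
    for i, y in zip(train_idx, train_labels):
        if y == 1:
            relevant_words |= set(docs[i].split())
    return [len(relevant_words & set(docs[i].split())) for i in pool]

def simulate_screening(score_fn, labels, seed=42):
    """Active-learning screening: start from one relevant and one irrelevant
    record, then repeatedly screen the highest-scoring unseen record."""
    rng = random.Random(seed)
    screened = [rng.choice([i for i, y in enumerate(labels) if y == 1]),
                rng.choice([i for i, y in enumerate(labels) if y == 0])]
    pool = [i for i in range(len(labels)) if i not in screened]
    while pool:
        scores = score_fn(screened, [labels[i] for i in screened], pool)
        best = pool[max(range(len(pool)), key=scores.__getitem__)]
        screened.append(best)  # the reviewer labels the top-ranked record
        pool.remove(best)
    return screened

print(simulate_screening(overlap_scorer, labels))
```

Because the relevant toy documents share vocabulary, they surface early in the screening order, which is exactly the prioritization effect the recall curves and WSS values measure.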
Classification techniques
The first research question in this study sought to evaluate models adopting different classification techniques.
The most important finding to emerge from these evaluations was that the NB + TF-IDF model consistently
performed as one of the best models. Our results suggest that while SVM performed fairly well, the LR and
NB classification techniques are good if not superior alternatives to this default classifier in software tools.
Note that LR and NB have long been regarded as strong methods for text classification tasks [53].
Feature extraction strategy
The overall results on models adopting the D2V versus the TF-IDF feature extraction strategy remain
inconclusive. According to our findings, models adopting D2V do not outperform models adopting the
well-established TF-IDF feature extraction strategy. Given these results, preference goes to the TF-IDF
feature extraction technique, as this relatively simple technique leads to a model that is easier to interpret.
Another advantage of this technique is its short computation time.
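The short computation time of TF-IDF follows from its simplicity: a single pass over the corpus suffices. A minimal stdlib-only sketch of a smoothed TF-IDF (real implementations, including the one used in the study's pipeline, differ in normalization details):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each whitespace-tokenized document to a dict of TF-IDF weights.
    Uses smoothed IDF: log((1 + N) / (1 + df)) + 1."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    return [
        {t: (count / len(tokens)) * idf[t]
         for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

vecs = tfidf_vectors(["screening abstracts for screening", "abstracts of trials"])
# "screening" occurs in only one document, so it is weighted more heavily
# there than the shared term "abstracts".
```

In contrast to D2V, every weight can be traced back to a term frequency and a document frequency, which is what makes the resulting model easier to interpret.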
Research contexts
Difficulty of applying active learning is not confined to any particular research area. The suggestion that
active learning is more difficult for datasets from the social sciences than for data from the medical
sciences [ ] does not seem to hold. A possible explanation is that difficulty is instead attributable to factors
more directly related to the systematic review at hand, such as the proportion of relevant publications or the
complexity of the inclusion criteria used to identify relevant publications [ ]. Although the current study did
not investigate the inclusion criteria of systematic reviews, the datasets on which the active learning models
performed worst, Nudging and Virus, were interestingly also the datasets with the highest proportion of
relevant publications, 5.4% and 5.0%, respectively.
Limitations and future research
When applied to systematic reviews, the success of active learning models stands or falls with the
generalizability of model performance across unseen datasets. In our study, it is important to bear in mind
that model hyperparameters were optimized for each model-dataset combination. Thus, the observed results
reflect the maximum model performance for each presented dataset. The question remains whether model
performance generalizes to datasets for which the hyperparameters are not optimized. Further research
should be undertaken to determine the sensitivity of model performance to the hyperparameter values.
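One way to probe such sensitivity, without committing to the TPE optimizer used in the study, is a plain random search over the hyperparameter space, recording the spread of the resulting performance scores. A sketch with a hypothetical `evaluate` hook standing in for a full screening simulation:

```python
import random

def random_search(evaluate, space, n_trials=20, seed=0):
    """Sample hyperparameter configurations at random and return
    (score, config) pairs sorted from best to worst. The spread of
    scores indicates how sensitive performance is to the settings."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        results.append((evaluate(config), config))
    return sorted(results, key=lambda pair: pair[0], reverse=True)

# Hypothetical stand-in: a real `evaluate` would run a full simulation and
# return, e.g., the WSS@95 achieved with the given configuration.
space = {"alpha": [0.1, 1.0, 10.0], "ngram_max": [1, 2]}
results = random_search(lambda cfg: 0.7 - 0.01 * abs(cfg["alpha"] - 1.0), space)
spread = results[0][0] - results[-1][0]  # large spread = high sensitivity
```

A small spread between the best and worst sampled configurations would suggest that performance on unseen datasets is robust to imperfect hyperparameter choices.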
Additionally, while the sample of datasets in the current study is diverse compared to previous studies, the
sample size (n = 6) does not allow for investigating how model performance relates to characteristics of the
data, such as the proportion of relevant publications. To build more confidence in active learning models for
screening publications, it is essential to identify how data characteristics affect model performance. Such a
study requires more data on systematic reviews. Thus, a more thorough study depends on researchers
openly publishing their systematic review datasets.
Moreover, the runtime of simulations varied widely across models, indicating that some models take longer
to retrain after a publication has been labeled than other models. This has important implications for the
practical application of such models, as an efficient model should be able to keep up with the decision-making
speed of the reviewer. Further studies should take into account the retraining time of models.
Conclusions

Overall, the findings confirm the great potential of active learning models to reduce the workload for systematic
reviews. The results shed new light on the performance of different classification techniques, indicating that
the NB classification technique is superior to the widely used SVM. As model performance differs vastly
across datasets, this study raises the question which factors cause models to yield more workload savings for
some systematic review datasets than for others. In order to facilitate the applicability of active learning
models in systematic review practice, it is essential to identify how dataset characteristics relate to model
performance.
List of abbreviations
ATD - Average Time to Discovery
D2V - doc2vec
LR - Logistic regression
MAD - Median Absolute Deviation
ML - Machine Learning
NB - Naive Bayes
PTSD - Post Traumatic Stress Disorder
RF - Random forest
RRF - Relevant References Found
SD - Standard Deviation
SEM - Standard Error of the Mean
SVM - Support vector machines
TF-IDF - Term Frequency - Inverse Document Frequency
TPE - Tree of Parzen Estimators
TD - Time to Discovery
WSS - Work Saved over Sampling
Ethics approval and consent to participate
This study has been approved by the Ethics Committee of the Faculty of Social and Behavioural Sciences of
Utrecht University, filed as an amendment under study 20-104.
Consent for publication
Not applicable.
Availability of data and materials
All data and materials are stored in the GitHub repository for this paper,
evaluating-models-across-research-areas. This repository contains all systematic review datasets used during
this study and their preprocessing scripts, scripts for the hyperparameter optimization, the simulations, the
processing and analysis of the results of the simulations, and for the figures and tables in this paper. The raw
output files of the simulation study are stored on the Open Science Framework.
Competing interests
The authors declare that they have no competing interests.
Funding
This project was funded by the Innovation Fund for IT in Research Projects, Utrecht University, The
Netherlands. Access to the Cartesius supercomputer was granted by SURFsara (ID EINF-156). Neither the
Innovation Fund nor SURFsara had any role in the design of the current study, nor in the data collection,
analysis and interpretation, nor in writing the manuscript.
Authors' contributions
RvdS, RS, JdB and GF designed the study. RS developed the DR balance strategy and ATD metric, and
wrote the programs required for hyperparameter optimization and cloud computation. RS, JdB, and DO
designed the architecture required for the simulation study. GF extracted and analyzed the data and drafted
the manuscript. RvdS, AB, RS, DO, LT, and JdB assisted with writing the paper. LT, DO, AB, and RvdS
provided domain knowledge. All authors read and approved the final manuscript.
Acknowledgements
We are grateful to all researchers who have made great efforts to openly publish the data on their systematic
reviews; special thanks go out to Rosanna Nagtegaal.
Figure 1: Recall curves of different models for the Nudging dataset, indicating how fast the model finds
relevant publications during the process of screening publications. Figure 1a displays curves for all seven
models at once. Figures 1b to 1f display curves for several subsets of those models to allow for a more
detailed inspection of model performance.
Figure 2: Recall curves of all seven models for (a) the PTSD, (b) Software, (c) ACE, (d) Virus, and (e)
Wilson dataset.
Table 1: Statistics on the datasets obtained from six original systematic reviews.
Dataset Candidate publications Relevant publications Proportion relevant (%)
Nudging 1,847 100 5.4
PTSD 5,031 38 0.8
Software 8,896 104 1.2
ACE 2,235 41 1.8
Virus 2,304 114 5.0
Wilson 2,333 23 1.0
Table 2: ATD values (x̄ (s)) for all model-dataset combinations. For every dataset, the best results are in
bold. Median (MAD) is given for all datasets.
Nudging PTSD Software ACE Virus Wilson
SVM + TF-IDF 10.1 (0.18) 2.1 (0.13) 1.9 (0.04) 7.1 (1.15) 8.5 (0.17) 4.0 (0.32)
NB + TF-IDF 9.3 (0.29) 1.7 (0.11) 1.4 (0.03) 4.9 (0.51) 8.2 (0.22) 3.9 (0.35)
RF + TF-IDF 11.7 (0.44) 3.3 (0.26) 2.0 (0.09) 6.8 (0.74) 10.5 (0.42) 5.6 (1.15)
LR + TF-IDF 9.5 (0.19) 1.7 (0.10) 1.4 (0.01) 5.9 (1.17) 8.3 (0.24) 4.3 (0.32)
SVM + D2V 8.8 (0.33) 2.1 (0.15) 1.4 (0.05) 6.1 (0.33) 8.4 (0.21) 4.5 (0.30)
RF + D2V 10.3 (0.87) 3.0 (0.33) 1.6 (0.09) 7.2 (1.26) 9.2 (0.43) 7.2 (1.49)
LR + D2V 8.8 (0.47) 1.9 (0.16) 1.4 (0.04) 5.4 (0.18) 8.3 (0.40) 4.7 (0.30)
median (MAD) 9.5 (1.05) 2.1 (0.48) 1.4 (0.12) 6.1 (1.11) 8.4 (0.18) 4.5 (0.64)
Table 3: WSS@95 values (x̄ (s)) for all model-dataset combinations. For every dataset, the best results are
in bold. Median (MAD) is given for all datasets.
Nudging PTSD Software ACE Virus Wilson
SVM + TF-IDF 66.2 (2.90) 91.0 (0.41) 92.0 (0.10) 75.8 (1.95) 69.7 (0.81) 79.9 (2.09)
NB + TF-IDF 71.7 (1.37) 91.7 (0.27) 92.3 (0.08) 82.9 (0.99) 71.2 (0.62) 83.4 (0.89)
RF + TF-IDF 64.9 (2.50) 84.5 (3.38) 90.5 (0.34) 71.3 (4.03) 63.9 (3.54) 81.6 (3.35)
LR + TF-IDF 66.9 (4.01) 91.7 (0.18) 92.0 (0.10) 81.1 (1.31) 70.3 (0.65) 80.5 (0.65)
SVM + D2V 70.9 (1.68) 90.6 (0.73) 92.0 (0.21) 78.3 (1.92) 70.7 (1.76) 82.7 (1.44)
RF + D2V 66.3 (3.25) 88.2 (3.23) 91.0 (0.55) 68.6 (7.11) 67.2 (3.44) 77.9 (3.43)
LR + D2V 71.6 (1.66) 90.1 (0.63) 91.7 (0.13) 77.4 (1.03) 70.4 (1.34) 84.0 (0.77)
median (MAD) 66.9 (3.05) 90.6 (1.53) 92.0 (0.47) 77.4 (5.51) 70.3 (0.90) 81.6 (2.48)
Table 4: RRF@10 values (x̄ (s)) for all model-dataset combinations. For every dataset, the best results are
in bold. Median (MAD) is given for all datasets.
Nudging PTSD Software ACE Virus Wilson
SVM + TF-IDF 60.2 (3.12) 98.6 (1.40) 99.0 (0.00) 86.2 (5.25) 73.4 (1.62) 90.6 (1.17)
NB + TF-IDF 65.3 (2.61) 99.6 (0.95) 98.2 (0.34) 90.5 (1.40) 73.9 (1.70) 87.3 (2.55)
RF + TF-IDF 53.6 (2.71) 94.8 (1.60) 99.0 (0.00) 82.3 (2.75) 62.1 (3.19) 86.7 (5.82)
LR + TF-IDF 62.1 (2.59) 99.8 (0.70) 99.0 (0.00) 88.5 (5.16) 73.7 (1.48) 89.1 (2.30)
SVM + D2V 67.3 (3.00) 97.8 (1.12) 99.3 (0.44) 84.2 (2.78) 73.6 (2.54) 91.5 (4.16)
RF + D2V 62.6 (5.47) 97.1 (1.90) 99.2 (0.34) 80.8 (5.72) 67.3 (3.19) 75.5 (14.35)
LR + D2V 67.5 (2.59) 98.6 (1.40) 99.0 (0.00) 81.7 (1.81) 70.6 (2.21) 90.6 (5.00)
median (MAD) 62.6 (3.89) 98.6 (1.60) 99.0 (0.00) 84.2 (3.71) 73.4 (0.70) 89.1 (2.70)
Additional files
Additional file 1 — The active learning cycle
additional-file-1-active-learning-cycle.pdf. Description: The active learning cycle for screening prioritization
in systematic reviews.
Additional file 2 — Dynamic Resampling
additional-file-2-DR.pdf. Description: Algorithm describing how to rebalance training data by the Dynamic
Resampling (DR) strategy.
Additional file 3 — Average Time to Discovery
additional-file-3-ATD.pdf. Description: Definition of the Average Time to Discovery (ATD), a metric to
assess the model performance.
Additional file 4 — Recall curves
additional-file-4-recall-curves.pdf. Description: Various subsets of recall curves for the PTSD, Software, ACE,
Virus, and Wilson datasets, like Figure 1 presents curves for the Nudging dataset.
References

PRISMA-P Group, Moher D, Shamseer L, Clarke M, Ghersi D, Liberati A, et al. Preferred Reporting Items
for Systematic Review and Meta-Analysis Protocols (PRISMA-P) 2015 Statement. Syst Rev. 2015;4(1):1.
Available from:
Gough D, Richardson M. Systematic Reviews. In: Advanced Research Methods for Applied Psychology.
Routledge; 2018. p. 75–87.
Chalmers I. The Lethal Consequences of Failing to Make Full Use of All Relevant Evidence about the
Effects of Medical Treatments: The Importance of Systematic Reviews. In: Treating Individuals - from
Randomised Trials to Personalised Medicine. Lancet; 2007. p. 37–58.
Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the Time and Workers Needed to Conduct
Systematic Reviews of Medical Interventions Using Data from the PROSPERO Registry. BMJ Open.
2017;7(2):e012545. Available from:
Lau J. Editorial: Systematic Review Automation Thematic Series. Syst Rev. 2019;8(1):70. Available
Harrison H, Griffin SJ, Kuhn I, Usher-Smith JA. Software Tools to Support Title and Abstract Screening
for Systematic Reviews in Healthcare: An Evaluation. BMC Med Res Methodol. 2020;20(1):7. Available
O’Mara-Eves A, Thomas J, McNaught J, Miwa M, Ananiadou S. Using Text Mining for Study
Identification in Systematic Reviews: A Systematic Review of Current Approaches. Syst Rev. 2015;4(1):5.
Available from:
Cohen AM, Ambert K, McDonagh M. Cross-Topic Learning for Work Prioritization in Systematic
Review Creation and Update. J Am Med Inform Assoc. 2009;16(5):690–704. Available from: https:
Shemilt I, Simon A, Hollands GJ, Marteau TM, Ogilvie D, O’Mara-Eves A, et al. Pinpointing Needles
in Giant Haystacks: Use of Text Mining to Reduce Impractical Screening Workload in Extremely Large
Scoping Reviews. Res Synth Methods. 2014;5(1):31–49. Available from:
Yu Z, Menzies T. FAST2: An Intelligent Assistant for Finding Relevant Papers. Expert Syst Appl.
2019;120:57–71. Available from:
Yu Z, Kraft NA, Menzies T. Finding Better Active Learners for Faster Literature Reviews. Empir Softw
Eng. 2018;23(6):3161–3186. Available from:
Miwa M, Thomas J, O’Mara-Eves A, Ananiadou S. Reducing Systematic Review Workload through
Certainty-Based Screening. J Biomed Inform. 2014;51:242–253. Available from: http://www.sciencedirec
Cormack GV, Grossman MR. Evaluation of Machine-Learning Protocols for Technology-Assisted Review
in Electronic Discovery. In: Proceedings of the 37th International ACM SIGIR Conference on Research
& Development in Information Retrieval. SIGIR ’14. Association for Computing Machinery; 2014. p.
153–162. Available from:
Cormack GV, Grossman MR. Autonomy and Reliability of Continuous Active Learning for Technology-
Assisted Review. CoRR. 2015;abs/1504.06868. Available from:
Wallace BC, Trikalinos TA, Lau J, Brodley C, Schmid CH. Semi-Automated Screening of Biomedical
Citations for Systematic Reviews. BMC Bioinform. 2010;11(1):55. Available from:
Gates A, Johnson C, Hartling L. Technology-Assisted Title and Abstract Screening for Systematic
Reviews: A Retrospective Evaluation of the Abstrackr Machine Learning Tool. Syst Rev. 2018;7(1):45.
Available from:
Settles B. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. 2012;6(1):1–
114. Available from:
Settles B. Active learning literature survey. University of Wisconsin-Madison Department of Computer
Sciences; 2009.
Singh G, Thomas J, Shawe-Taylor J. Improving active learning in systematic reviews. arXiv preprint
arXiv:180109496. 2018;.
Carvallo A, Parra D. Comparing Word Embeddings for Document Screening based on Active Learning.
In: BIRNDL@ SIGIR; 2019. p. 100–107.
Ma Y. Text classification on imbalanced data: Application to Systematic Reviews Automation. University
of Ottawa (Canada). 2007;.
Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an Interactive Machine Learning
System in an Evidence-Based Practice Center: Abstrackr. In: Proceedings of the 2nd ACM SIGHIT
International Health Informatics Symposium. IHI ’12. Association for Computing Machinery; 2012. p.
819–824. Available from:
Cheng SH, Augustin C, Bethel A, Gill D, Anzaroot S, Brun J, et al. Using Machine Learning to Advance
Synthesis and Use of Conservation and Environmental Evidence. Conserv Biol. 2018;32(4):762–764.
Available from:
Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan—a Web and Mobile App for Systematic
Reviews. Syst Rev. 2016;5(1):210. Available from:
Przybyła P, Brockmeier AJ, Kontonatsios G, Pogam MAL, McNaught J, von Elm E, et al. Prioritising
References for Systematic Reviews with RobotAnalyst: A User Study. Res Synth Methods. 2018;9(3):470–
488. Available from:
Kilicoglu H, Demner-Fushman D, Rindflesch TC, Wilczynski NL, Haynes RB. Towards Automatic
Recognition of Scientifically Rigorous Clinical Research Evidence. J Am Med Inform Assn. 2009;16(1):25–
31. Available from:
Aphinyanaphongs Y. Text Categorization Models for High-Quality Article Retrieval in Internal Medicine.
J Am Med Inform Assoc. 2004;12(2):207–216. Available from:
Aggarwal CC, Zhai C. A Survey of Text Classification Algorithms. In: Aggarwal CC, Zhai C, editors.
Mining Text Data. Springer US; 2012. p. 163–222. Available from:
Zhang W, Yoshida T, Tang X. A Comparative Study of TF*IDF, LSI and Multi-Words for Text
Classification. Expert Syst Appl. 2011;38(3):2758–2765. Available from:
Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference
on machine learning; 2014. p. 1188–1196.
Marshall IJ, Johnson BT, Wang Z, Rajasekaran S, Wallace BC. Semi-Automated Evidence Synthesis in
Health Psychology: Current Methods and Future Prospects. Health Psychol Rev. 2020;14(1):145–158.
Available from:
Cohen AM, Hersh WR, Peterson K, Yen PY. Reducing Workload in Systematic Review Preparation
Using Automated Citation Classification. J Am Med Inform Assoc. 2006;13(2):206–219. Available from:
Appenzeller-Herzog C, Mathes T, Heeres MLS, Weiss KH, Houwen RHJ, Ewald H. Comparative
Effectiveness of Common Therapies for Wilson Disease: A Systematic Review and Meta-Analysis of
Controlled Studies. Liver Int. 2019;39(11):2136–2152. Available from:
Kwok KTT, Nieuwenhuijse DF, Phan MVT, Koopmans MPG. Virus Metagenomics in Farm Animals:
A Systematic Review. Viruses. 2020;12(1):107. Available from: https://www.mdp
Nagtegaal R, Tummers L, Noordegraaf M, Bekkers V. Nudging Healthcare Professionals towards
Evidence-Based Medicine: A Systematic Scoping Review. J Behav Public Adm. 2019;2(2).
van de Schoot R, Sijbrandij M, Winter SD, Depaoli S, Vermunt JK. The GRoLTS-Checklist: Guidelines
for Reporting on Latent Trajectory Studies. Struct Equ Model Multidiscip J. 2017;24(3):451–467.
Available from:
van de Schoot R, de Bruin J, Schram R, Zahedi P, de Boer J, Weijdema F, et al. ASReview: Open
Source Software for Efficient and Transparent Active Learning for Systematic Reviews. arXiv preprint
arXiv:200612166. 2020;.
Wynants L, Calster BV, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction Models for Diagnosis
and Prognosis of Covid-19: Systematic Review and Critical Appraisal. BMJ. 2020;369. Available from:
Tong S, Koller D. Support Vector Machine Active Learning with Applications to Text Classification. J
Mach Learn Res. 2001;2:45–66.
Kremer J, Steenstrup Pedersen K, Igel C. Active Learning with Support Vector Machines. WIREs Data
Min Knowl Discov. 2014;4(4):313–326. Available from:
Zhang H. The Optimality of Naive Bayes. In: Proceedings of the Seventeenth International Florida
Artificial Intelligence Research Society Conference, FLAIRS 2004. vol. 2; 2004. p. 562–567.
Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32. Available from:
Ramos J, et al. Using Tf-Idf to Determine Word Relevance in Document Queries. In: Proceedings of the
First Instructional Conference on Machine Learning. vol. 242. Piscataway, NJ; 2003. p. 133–142.
Shahmirzadi O, Lugowski A, Younge K. Text Similarity in Vector Space Models: A Comparative Study.
In: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA); 2019.
p. 659–666. Available from: 10.1109/ICMLA.2019.00120.
Yang Y, Loog M. A Benchmark and Comparison of Active Learning for Logistic Regression. Pattern
Recognition. 2018;83:401–415. Available from:
Fu JH, Lee SL. Certainty-Enhanced Active Learning for Improving Imbalanced Data Classification. In:
2011 IEEE 11th International Conference on Data Mining Workshops. IEEE; 2011. p. 405–412. Available
van de Schoot R, de Bruin J, Schram R, Zahedi P, Kramer B, Ferdinands G, et al. ASReview: Active
Learning for Systematic Reviews. Zenodo; 2020.
Bergstra J, Yamins D, Cox D. Making a Science of Model Search: Hyperparameter Optimization in
Hundreds of Dimensions for Vision Architectures. In: International Conference on Machine Learning;
2013. p. 115–123. Available from:
R Core Team. R Foundation for Statistical Computing, editor. R: A Language and Environment for
Statistical Computing. Vienna, Austria; 2019. Available from:
Appenzeller-Herzog C. Data from Comparative Effectiveness of Common Therapies for Wilson Disease:
A Systematic Review and Meta-analysis of Controlled Studies. Zenodo. 2020;Available from: https:
Hall T, Beecham S, Bowes D, Gray D, Counsell S. A Systematic Literature Review on Fault Prediction
Performance in Software Engineering. IEEE Trans Softw Eng. 2012;38(6):1276–1304.
Nagtegaal R, Tummers L, Noordegraaf M, Bekkers V. Nudging healthcare professionals towards
evidence-based medicine: A systematic scoping review. Harvard Dataverse. 2019;Available from: https:
Mitchell TM. Does Machine Learning Really Work? AI Mag. 1997;18(3):11–11. Available from:
Rathbone J, Hoffmann T, Glasziou P. Faster Title and Abstract Screening? Evaluating Abstrackr, a
Semi-Automated Online Screening Program for Systematic Reviewers. Systematic Reviews. 2015;4(1):80.
Available from:
... Moreover, previous investigations into the performance of various ML tools have been limited in sample size, and the software tools evaluated were not open source, thereby hindering the identification of relevant parameters for these methods (Gates et al., 2019;Robledo et al., 2023). An attempt to address the previous research gaps was presented by Ferdinands et al. (2020). The authors conducted a study that evaluated the performance of NLP algorithms using four classifiers (LR, SVM, NB, RF) and two feature extraction techniques (TF-IDF, D2V) on a set of six labeled datasets from systematic reviews using the software ASReview. ...
... Balancing strategies are employed to mitigate the risk of the learning algorithm over-fitting irrelevant studies. The default method, Dynamic Resampling (DR), rebalances the training set by undersampling irrelevant studies and oversampling relevant records (Ferdinands et al., 2020). Following the selection of the balancing strategy, users can proceed to choose the query strategy. ...
... However, these tools are still evolving, and their performance may vary across different domains and even within the same (Burgard & Bittermann, 2023;Chai et al., 2021;van de Schoot et al., 2021). Additionally, the current literature lacks comprehensive comparative studies on the performance of NLP algorithms when applied to real-world data (Burgard & Bittermann, 2023), and the impact of data characteristics on model performance is poorly understood (Ferdinands et al., 2020). ...
Full-text available
Systematic reviews and meta-analyses are crucial for advancing research; yet, they are time-consuming and resource-demanding. Although machine learning and natural language processing algorithms may reduce this time and these resources, their performance has not been tested in the field of education and educational psychology, and there is a lack of clear information on when researchers should stop the reviewing process. In this study, we conducted a retrospective screening simulation using 27 systematic reviews in education and educational psychology. We evaluated the recall, work saved over sampling, and the estimated time savings of several active learning screening algorithms and heuristic stopping criteria. The results showed on average a 50% (SD = 20%) reduction in screening workload when using active learning algorithms for abstract screening and an estimated time savings of 1.64 days (SD = 1.78). The learning algorithm Random Forests with Sentence Bidirectional Encoder Representations from Transformers outperformed other algorithms—a finding that emphasizes the importance of incorporating semantic and contextual information during feature extraction and modeling in the screening process. Furthermore, we found that 95% of all relevant abstracts within a given dataset can be retrieved using heuristic stopping rules. Specifically, an approach that stops the screening process after consecutively classifying 7% of irrelevant papers yielded the most significant gains in terms of works savings over sampling (M = 41%, SD = 26%). However, the performance of the heuristic stopping criteria depended on the active learning algorithm used, the length and the proportion of relevant papers in a database. Overall, our study provides empirical evidence on the performance of machine learning screening algorithms for abstract screening in systematic reviews in education and educational psychology.
... An automated screening with Automated Systematic Review Software (ASR) was utilised to detect relevant articles. Such machine learning-based active learning models for screening prioritisation (Ferdinands et al., 2020) allow for time and cost savings and facilitate faster retrieval of relevant publications, enabling a time-effective transition to the subsequent steps in the review process while providing a solution to literature potentially missed in the screening phase due to human error or excluded by search algorithms (Odintsova et al., 2019). The active learning model technology utilised for this screening process was ASReview (Van De Schoot et al., 2021). ...
... This software utilised the title and abstract of each electronic article to prioritise the article in terms of relevance. The Naive Bayes and term frequency-inverse document frequency combination were the chosen classification and feature extraction techniques, as these were found to demonstrate high performance on all measures across all datasets (Ferdinands et al., 2020; Odintsova et al., 2019). ...
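As a rough illustration of the Naive Bayes + TF-IDF combination mentioned above, one active-learning iteration can be sketched with scikit-learn. This is not the ASReview implementation; the toy abstracts, labels, and variable names are invented for the example.

```python
# Sketch of one active-learning iteration with Naive Bayes + TF-IDF:
# fit on the labelled records, score the unlabelled pool, and present
# the highest-probability record to the reviewer next (certainty-based
# querying). All texts and labels below are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_texts = [
    "active learning reduces screening workload in systematic reviews",
    "naive bayes classifier ranks abstracts by relevance",
    "a field study of birds in coastal wetlands",
    "economic effects of trade policy on agriculture",
]
labels = [1, 1, 0, 0]  # 1 = relevant, 0 = irrelevant

unlabeled_texts = [
    "screening prioritization with machine learning for reviews",
    "soil composition of arid farmland regions",
]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
X_unlabeled = vectorizer.transform(unlabeled_texts)

model = MultinomialNB().fit(X_labeled, labels)
scores = model.predict_proba(X_unlabeled)[:, 1]  # P(relevant) per record

# certainty-based query: rank the pool so the most promising record comes first
ranking = sorted(range(len(unlabeled_texts)), key=lambda i: -scores[i])
print([unlabeled_texts[i] for i in ranking])
```

In an actual screening session this loop repeats: the reviewer labels the top-ranked record, it moves to the training set, and the model is refit.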
Orientation: Education trends in Africa indicate that key ingredients for effective education are elusive, impacting the teachers who need to remain productive, motivated and healthy in this environment. Research purpose: Using machine learning active learning technology, the study aimed to review current literature related to the factors affecting the capabilities and functionings of secondary school teachers in sub-Saharan Africa (SSA). Motivation for the study: The Capability Approach (CA) provides a framework for studying the sustainable employability (SE) of teachers, including what they require to be able to convert valued opportunities into the needed achievements. Research approach/design and method: A systematic literature review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, using machine learning active learning technology. Eighty-six articles from 14 SSA countries were included for analysis, prioritising articles in the South African context first. Main findings: Analysis identified four groupings of resources that are potentially useful or valuable, creating access or empowerment if utilised effectively, namely knowledge commodities, soft commodities, hard commodities, and organisational commodities. Sub-resources were also identified. Practical/managerial implications: This research would assist policy and decision-makers to focus their interventions in the most effective way to sustain productivity and well-being in the workplace. The resource groupings should be included in a model that focuses on enhancing secondary school teachers’ capabilities to promote their well-being and productivity. Contribution/value-add: This article provides new applied knowledge related to machine learning active learning technology as a methodology, and provides further insight into secondary school teacher employability.
... The software has a simple yet extensible default model: a naive Bayes classifier, TF-IDF feature extraction, a dynamic resampling balance strategy 31 and certainty-based sampling 17,32 for the query strategy. These defaults were chosen on the basis of their consistently high performance in benchmark experiments across several datasets 31 . Moreover, the low computation time of these default settings makes them attractive in applications, given that the software should be able to run locally. ...
... There are several balance strategies that rebalance and reorder the training data. This is necessary because the data are typically extremely imbalanced; we have therefore implemented the following balance strategies: (1) full sampling, which uses all of the labelled records; (2) undersampling the irrelevant records so that the included and excluded records are in some particular ratio (closer to one); and (3) dynamic resampling, a novel method similar to undersampling in that it decreases the imbalance of the training data 31 . However, in dynamic resampling, the number of irrelevant records is decreased, whereas the number of relevant records is increased by duplication such that the total number of records in the training data remains the same. ...
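The dynamic resampling idea described above can be sketched as follows. This is a simplified reconstruction, not the published implementation; the `dynamic_resample` helper and its fixed 1:1 target ratio are illustrative assumptions (the actual strategy adjusts the ratio dynamically).

```python
# Illustrative sketch of dynamic resampling: undersample the irrelevant
# (majority) records and duplicate the relevant (minority) records so the
# total number of training records stays constant while the class
# imbalance shrinks. The 1:1 target ratio here is a simplification.
import random

def dynamic_resample(relevant, irrelevant, seed=0):
    """Rebalance toward 1:1 while keeping len(relevant) + len(irrelevant) records."""
    rng = random.Random(seed)
    total = len(relevant) + len(irrelevant)
    n_each = total // 2
    # undersample the irrelevant class ...
    kept_irrelevant = rng.sample(irrelevant, min(n_each, len(irrelevant)))
    # ... and duplicate the relevant class to fill the remaining slots
    duplicated_relevant = [relevant[i % len(relevant)]
                           for i in range(total - len(kept_irrelevant))]
    return duplicated_relevant + kept_irrelevant

# 2 relevant vs. 98 irrelevant record ids (a typical screening imbalance)
train = dynamic_resample(relevant=[101, 102], irrelevant=list(range(98)))
print(len(train))  # 100: same size, far less imbalanced
```

The appeal over plain undersampling is that no training capacity is thrown away: the records removed from the majority class are replaced by duplicates of the scarce relevant records.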
To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool to accelerate the step of screening titles and abstracts. For many tasks—including but not limited to systematic reviews and meta-analyses—the scientific literature needs to be checked systematically. Scholars and practitioners currently screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that active learning can yield far more efficient reviewing than manual reviewing while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice. It is a challenging task for any research field to screen the literature and determine what needs to be included in a systematic review in a transparent way. A new open source machine learning framework called ASReview, which employs active learning and offers a range of machine learning models, can check the literature efficiently and systemically.
... Ferdinands et al. [19] show that a Naive Bayes classifier with TF-IDF performs better than SVM for their four datasets. ...
Technical Report
By organizing knowledge within a research field, Systematic Reviews (SR) provide valuable leads to steer research. Evidence suggests that SRs have become first-class artifacts in software engineering. However, the tedious manual effort associated with the screening phase of SRs renders these studies a costly and error-prone endeavor. While screening has traditionally been considered not amenable to automation, the advent of generative AI-driven chatbots, backed with large language models is set to disrupt the field. In this report, we propose an approach to leverage these novel technological developments for automating the screening of SRs. We assess the consistency, classification performance, and generalizability of ChatGPT in screening articles for SRs and compare these figures with those of traditional classifiers used in SR automation. Our results indicate that ChatGPT is a viable option to automate the SR processes, but requires careful considerations from developers when integrating ChatGPT into their SR tools.
... For the 63 included records, the Average Time to Discovery (ATD) was computed by taking the average of the time to discovery of all relevant records (Ferdinands et al., 2020). The time to discovery for a given relevant publication was computed as the number of records needed to screen to detect this record. ...
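The ATD computation described above amounts to averaging the 1-based screening positions of the relevant records. A minimal sketch, with an invented label sequence:

```python
# Sketch of Average Time to Discovery (ATD): for each relevant record, the
# time to discovery is the number of records screened up to and including
# that record; ATD is the mean over all relevant records.

def average_time_to_discovery(screening_order):
    """screening_order: labels in the order screened, 1 = relevant, 0 = irrelevant."""
    discovery_times = [i + 1 for i, label in enumerate(screening_order) if label == 1]
    return sum(discovery_times) / len(discovery_times)

order = [0, 1, 0, 0, 1, 0]   # relevant records found at positions 2 and 5
print(average_time_to_discovery(order))  # (2 + 5) / 2 = 3.5
```

Unlike WSS@95 or RRF@10, which evaluate the ranking at a single cut-off, ATD summarizes how early the relevant records surface across the whole screening run.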
Introduction: This study examines the performance of active learning-aided systematic reviews using a deep learning-based model compared to traditional machine learning approaches, and explores the potential benefits of model-switching strategies. Methods: Comprising four parts, the study: 1) analyzes the performance and stability of active learning-aided systematic review; 2) implements a convolutional neural network classifier; 3) compares classifier and feature extractor performance; and 4) investigates the impact of model-switching strategies on review performance. Results: Lighter models perform well in early simulation stages, while other models show increased performance in later stages. Model-switching strategies generally improve performance compared to using the default classification model alone. Discussion: The study's findings support the use of model-switching strategies in active learning-based systematic review workflows. It is advised to begin the review with a light model, such as Naïve Bayes or logistic regression, and switch to a heavier classification model based on a heuristic rule when needed.
... We also used the default settings of the ASReview software for the remaining parameters, including TF-IDF feature extraction and a deterministic query strategy. These default values were chosen based on their consistently high performance and low computation time in benchmark tests on multiple datasets [41]. ...
Land use change detection (LUCD) is a critical technology with applications in various fields, including forest disturbance, cropland changes, and urban expansion. However, the current review articles on LUCD tend to be limited in scope, rendering a comprehensive review challenging due to the vast number of publications. This paper systematically reviewed 3512 articles retrieved from the Web of Science Core database between 1985 and 2022, utilizing a combination of bibliometric analysis and machine learning methods with LUCD as the main focus. The results indicated an exponential increase in the number of LUCD studies, indicating continued growth in this research field. Commonly used methods include classification-based, threshold-based, model-based, and deep learning-based change detection, with research themes encompassing forest logging and vegetation succession, urban landscape dynamics, and biodiversity conservation and management. To build an intelligent change detection system, researchers need to develop a flexible framework that integrates data preprocessing, feature extraction, land use type interpretation, and accuracy evaluation, given the continuous evolution and application of remote sensing data, deep learning, big data, and artificial intelligence.
Background: Systematic reviews provide a structured overview of the available evidence in medical-scientific research. However, due to the increasing medical-scientific research output, it is a time-consuming task to conduct systematic reviews. To accelerate this process, artificial intelligence (AI) can be used in the review process. In this communication paper, we suggest how to conduct a transparent and reliable systematic review using the AI tool ‘ASReview’ in the title and abstract screening. Methods: Use of the AI tool consisted of several steps. First, the tool required training of its algorithm with several prelabelled articles prior to screening. Next, using a researcher-in-the-loop algorithm, the AI tool proposed the article with the highest probability of being relevant. The reviewer then decided on relevancy of each article proposed. This process was continued until the stopping criterion was reached. All articles labelled relevant by the reviewer were screened on full text. Results: Considerations to ensure methodological quality when using AI in systematic reviews included: the choice of whether to use AI, the need of both deduplication and checking for inter-reviewer agreement, how to choose a stopping criterion and the quality of reporting. Using the tool in our review resulted in much time saved: only 23% of the articles were assessed by the reviewer. Conclusion: The AI tool is a promising innovation for the current systematic reviewing practice, as long as it is appropriately used and methodological quality can be assured. PROSPERO registration number: CRD42022283952.
Introduction Smoking and insufficient physical activity (PA), independently but especially in conjunction, often lead to disease and (premature) death. For this reason, there is need for effective smoking cessation and PA-increasing interventions. Identity-related interventions which aim to influence how people view themselves offer promising prospects, but an overview of the existing evidence is needed first. This is the protocol for a scoping review aiming to aggregate the evidence on identity processes and identity-related interventions in the smoking and physical activity domains. Methods The scoping review will be guided by an adaption by Levac et al of the 2005 Arksey and O’Malley methodological framework, the 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses: Extension for Scoping Review (PRISMA-ScR) and the 2017 Joanna Briggs Institute guidelines. It will include scientific publications discussing identity (processes) and/or identity-related interventions in the context of smoking (cessation) and/or physical (in)activity, in individuals aged 12 and over. A systematic search will be carried out in multiple databases (eg, PubMed, Web of Science). Records will be independently screened against prepiloted inclusion/exclusion criteria by two reviewers, using the Active Learning for Systematic Reviews machine learning artificial intelligence and Rayyan QCRI, a screening assistant. A prepiloted charting table will be used to extract data from included full-text articles. Findings will be reported according to the PRISMA-ScR guidelines and include study quality assessment. Ethics and dissemination Ethical approval is not required for scoping reviews. Findings will aid the development of future identity-related interventions targeting smoking and physical inactivity.
Systematic reviews are research synthesis methods increasingly used in educational research to support evidence-based decision making. The conduct of a systematic review is a complex process with several phases, usually supported by software tools. The tools used at the international level are not currently common in Italian educational research. This work describes four software tools used in international educational research and evaluates their general functionality and the usability of their specific features for conducting the systematic review phases of study screening and selection. For this purpose, this study uses two methods: a feature analysis (Kitchenham et al., 1997) and an expert survey (Harrison et al., 2020). The results of both investigation methods agree in considering Covidence and Rayyan the most functional software tools for conducting SRs. Among the four tools, ASReview has the greatest potential for making the process of an SR more efficient.
Background: Systematic reviews are vital to the pursuit of evidence-based medicine within healthcare. Screening titles and abstracts (T&Ab) for inclusion in a systematic review is an intensive, and often collaborative, step. The use of appropriate tools is therefore important. In this study, we identified and evaluated the usability of software tools that support T&Ab screening for systematic reviews within healthcare research. Methods: We identified software tools using three search methods: a web-based search; a search of the online "systematic review toolbox"; and screening of references in existing literature. We included tools that were accessible and available for testing at the time of the study (December 2018), do not require specific computing infrastructure and provide basic screening functionality for systematic reviews. Key properties of each software tool were identified using a feature analysis adapted for this purpose. This analysis included a weighting developed by a group of medical researchers, therefore prioritising the most relevant features. The highest scoring tools from the feature analysis were then included in a user survey, in which we further investigated the suitability of the tools for supporting T&Ab screening amongst systematic reviewers working in medical research. Results: Fifteen tools met our inclusion criteria. They vary significantly in relation to cost, scope and intended user community. Six of the identified tools (Abstrackr, Colandr, Covidence, DRAGON, EPPI-Reviewer and Rayyan) scored higher than 75% in the feature analysis and were included in the user survey. Of these, Covidence and Rayyan were the most popular with the survey respondents. Their usability scored highly across a range of metrics, with all surveyed researchers (n = 6) stating that they would be likely (or very likely) to use these tools in the future. 
Conclusions: Based on this study, we would recommend Covidence and Rayyan to systematic reviewers looking for suitable and easy to use tools to support T&Ab screening within healthcare research. These two tools consistently demonstrated good alignment with user requirements. We acknowledge, however, the role of some of the other tools we considered in providing more specialist features that may be of great importance to many researchers.
Translating medical evidence into practice is difficult. Key challenges in applying evidence-based medicine are information overload and that evidence needs to be used in context by healthcare professionals. Nudging (i.e. softly steering) healthcare professionals towards utilizing evidence-based medicine may be a feasible possibility. This systematic scoping review is the first overview of nudging healthcare professionals in relation to evidence-based medicine. We have investigated a) the distribution of studies on nudging healthcare professionals, b) the nudges tested and behaviors targeted, c) the methodological quality of studies and d) whether the success of nudges is related to context. In terms of distribution, we found a large but scattered field: 100 articles in over 60 different journals, including various types of nudges targeting different behaviors such as hand hygiene or prescribing drugs. Some nudges – especially reminders to deal with information overload – are often applied, while others - such as providing social reference points – are seldom used. The methodological quality is moderate. Success appears to vary in terms of three contextual characteristics: the task, organizational, and occupational contexts. Based on this review, we propose future research directions, particularly related to methods (preregistered research designs to reduce publication bias), nudges (using less-often applied nudges on less studied outcomes), and context (moving beyond one-size-fits-all approaches).
Literature reviews are essential for any researcher trying to keep up to date with the burgeoning software engineering literature. Finding relevant papers can be hard due to the huge amount of candidates provided by search. FAST² is a novel tool for assisting the researchers to find the next promising paper to read. This paper describes FAST² and tests it on four large systematic literature review datasets. We show that FAST² robustly optimizes the human effort to find most (95%) of the relevant software engineering papers while also compensating for the errors made by humans during the review process. The effectiveness of FAST² can be attributed to three key innovations: (1) a novel way of applying external domain knowledge (a simple two or three keyword search) to guide the initial selection of papers—which helps to find relevant research papers faster with less variances; (2) an estimator of the number of remaining relevant papers yet to be found—which helps the reviewer decide when to stop the review; (3) a novel human error correction algorithm—which corrects a majority of human misclassifications (labeling relevant papers as non-relevant or vice versa) without imposing too much extra human effort.
Literature reviews can be time-consuming and tedious to complete. By cataloging and refactoring three state-of-the-art active learning techniques from evidence-based medicine and legal electronic discovery, this paper finds and implements FASTREAD, a faster technique for studying a large corpus of documents, combining and parametrizing the most efficient active learning algorithms. This paper assesses FASTREAD using datasets generated from existing SE literature reviews (Hall, Wahono, Radjenović, Kitchenham et al.). Compared to manual methods, FASTREAD lets researchers find 95% relevant studies after reviewing an order of magnitude fewer papers. Compared to other state-of-the-art automatic methods, FASTREAD reviews 20–50% fewer studies while finding same number of relevant primary studies in a systematic literature review.
Systematic reviews are essential to summarizing the results of different clinical and social science studies. The first step in a systematic review task is to identify all the studies relevant to the review. The task of identifying relevant studies for a given systematic review is usually performed manually and, as a result, involves substantial amounts of expensive human resources. Lately, there have been some attempts to reduce this manual effort using active learning. In this work, we build upon some such existing techniques and validate them by experimenting on a larger and more comprehensive dataset than has been attempted until now. Our experiments provide insights on the use of different feature extraction models for different disciplines. More importantly, we identify that a naive active learning-based screening process is biased in favour of selecting similar documents. We successfully improved the performance of the screening process using a novel active learning algorithm. Additionally, we propose a mechanism to choose the best feature extraction method for a given review.
The evidence base in health psychology is vast and growing rapidly. These factors make it difficult (and sometimes practically impossible) to consider all available evidence when making decisions about the state of knowledge on a given phenomenon (e.g., associations of variables, effects of interventions on particular outcomes). Systematic reviews, meta-analyses, and other rigorous syntheses of the research mitigate this problem by providing concise, actionable summaries of knowledge in a given area of study. Yet, conducting these syntheses has grown increasingly laborious owing to the fast accumulation of new evidence; existing, manual methods for synthesis do not scale well. In this article, we discuss how semi-automation via machine learning and natural language processing methods may help researchers and practitioners to review evidence more efficiently. We outline concrete examples in health psychology, highlighting practical, open-source technologies available now. We indicate the potential of more advanced methods and discuss how to avoid the pitfalls of automated reviews.
Background & aims: Wilson disease (WD) is a rare disorder of copper metabolism. The objective of this systematic review is to determine the comparative effectiveness and safety of common treatments of WD. Methods: We included WD patients of any age or stage and the study drugs D-penicillamine, zinc salts, trientine, and tetrathiomolybdate. The control could be placebo, no treatment, or any other treatment. We included prospective, retrospective, randomized, and non-randomized studies. We searched Medline and Embase via Ovid, the Cochrane Central Register of Controlled Trials, and screened reference lists of included articles. Where possible, we applied random-effects meta-analyses. Results: The 23 included studies reported on 2055 patients and mostly compared D-penicillamine to no treatment, zinc, trientine, or succimer. One study compared tetrathiomolybdate and trientine. Post-decoppering maintenance therapy was addressed in one study only. Eleven of 23 studies were of low quality. When compared to no treatment, D-penicillamine was associated with a lower mortality (odds ratio 0.013; 95% CI 0.0010 to 0.17). When compared to zinc, there was no association with mortality (odds ratio 0.73; 95% CI 0.16 to 3.40) and prevention or amelioration of clinical symptoms (odds ratio 0.84; 95% CI 0.48 to 1.48). Conversely, D-penicillamine may have a greater impact on side effects and treatment discontinuations than zinc. Conclusions: There are some indications that zinc is safer than D-penicillamine therapy while being similarly effective in preventing or reducing hepatic or neurologic WD symptoms. Study quality was low warranting cautious interpretation of our findings.