State of the Art: Reproducibility in Artificial Intelligence
Odd Erik Gundersen and Sigbjørn Kjensmo
Department of Computer Science
Norwegian University of Science and Technology
Abstract

Background: Research results in artificial intelligence (AI) are criticized for not being reproducible. Objective: To quantify the state of reproducibility of empirical AI research using six reproducibility metrics measuring three different degrees of reproducibility. Hypotheses: 1) AI research is not documented well enough to reproduce the reported results. 2) Documentation practices have improved over time. Method: The literature is reviewed, and a set of variables that should be documented to enable reproducibility is grouped into three factors: Experiment, Data and Method. The metrics describe how well the factors have been documented for a paper. A total of 400 research papers from the conference series IJCAI and AAAI have been surveyed using the metrics. Findings: None of the papers document all of the variables. The metrics show that between 20% and 30% of the variables for each factor are documented. One of the metrics shows a statistically significant increase over time, while the others show no change. Interpretation: The reproducibility scores decrease with increased documentation requirements. Improvement over time is found. Conclusion: Both hypotheses are supported.
Although reproducibility is a cornerstone of science, a large
amount of published research results cannot be reproduced.
This is even the case for results published in the most pres-
tigious journals; even the original researchers cannot repro-
duce their own results (Aarts et al. 2016; Begley and El-
lis 2012; Begley and Ioannidis 2014; Prinz et al. 2011).
(Goodman et al. 2016) presents data from Scopus that shows
that the problem with reproducibility spans several scien-
tific fields. According to (Donoho et al. 2009) ”it is impos-
sible to verify most of the computational results presented
at conferences and in papers today”. This was confirmed
by (Collberg and Proebsting 2016). Out of 402 experimen-
tal papers they were able to repeat 32.3% without commu-
nicating with the author, rising to 48.3% with communica-
tion. Papers by authors with industry affiliation showed a
lower rate of reproducibility. They also found that some re-
searchers are not willing to share code and data, while those
that actually share, provide too little to repeat the experi-
ment. Guidelines, best practices and solutions to aid reproducibility point towards open data and open source code as requirements for reproducible research (Sandve et al. 2013; Stodden and Miguez 2014). The increased focus on reproducibility has resulted in an increased adoption of data and code sharing policies for journals (Stodden et al. 2013). Still, proposed solutions for facilitating reproducibility see little adoption due to low ease-of-use and the time required to retroactively fit an experiment to these solutions (Gent and Kotthoff 2014). (Braun and Ong 2014) argues that automation should be possible to a higher degree for machine learning, as everything needed is available on a computer. Despite this, the percentage of research that is reproducible is not higher for machine learning and artificial intelligence (AI) research (Hunold and Träff 2013; Fokkens et al. 2013; Hunold 2015).

Copyright © 2017, Association for the Advancement of Artificial Intelligence. All rights reserved.
The scientific method is based on reproducibility; ”if
other researchers can’t repeat an experiment and get the
same result as the original researchers, then they refute the
hypothesis” (Oates 2006, p. 285). Hence, the inability to
reproduce results affects the trustworthiness of science. To
ensure high trustworthiness of AI and machine learning re-
search measures must be taken to increase its reproducibil-
ity. However, before measures can be taken, the state of re-
producibility in AI research must be documented. The state
of reproducibility can only be documented if a proper frame-
work is built.
Our objective is to quantify the state of reproducibility
of empirical AI research, and our main hypothesis is that
the documentation of AI research is not good enough to re-
produce the reported results. We also investigate a second
hypothesis, which is that documentation practices have im-
proved during recent years. Two predictions were made, one
for each hypothesis. The first prediction is that the current
documentation practices at top AI conferences render most
of the reported research results irreproducible, and the sec-
ond prediction is that a larger portion of the reported re-
search results are reproducible when comparing the latest
installments of conferences to earlier installments. We sur-
veyed research papers from the two top AI conference se-
ries, International Joint Conference on AI (IJCAI) and the
Association for the Advancement of AI (AAAI) to test the
hypotheses. Our contributions are twofold: i) an investiga-
tion of what reproducibility means for AI research and ii) a
quantification of the state of reproducibility of AI research.
Figure 1: By comparing the results of an experiment to the hypotheses and predictions that are being made about the AI
program, we interpret the results and adjust our beliefs about them.
Reproducing Results
We base the survey on a concise definition of reproducibility
and three degrees of reproducibility. These definitions are based on a review of the scientific method and the literature.
The Scientific Method in AI Research
Different strategies for researching information systems and
computing exist (Oates 2006), and these include theory de-
velopment and experiments among others. The scientific
method and reproducibility is closely connected to exper-
iments and empirical studies. We can distinguish between
four different classes of empirical studies: 1) exploratory,
2) assessment, 3) manipulation and 4) observational studies
(Cohen 1995). While exploratory and assessment studies are
conducted to identify and suggest possible hypotheses, ma-
nipulation and observational studies test explicit and precise
hypotheses. Although the scientific method is based on evaluating hypotheses, exploratory and assessment studies are
not mandatory sub-processes of it. However, they may be
conducted in order to formulate the hypotheses.
The targets of study in AI research are AI programs and
their behavior (Cohen 1995). Changes to the AI program’s
structure, the task or the environment can affect the pro-
gram’s behavior. An AI program implements an abstract al-
gorithm or system as a program that can be compiled and
executed. Hence, the AI program is something distinct from
the conceptual idea that it implements, which we will refer to
as an AI method. Experiments should be formulated in such
a way that it is clear whether they test hypotheses about the
AI program or the AI method. Examples of tasks performed
by AI methods include classification, planning, learning, de-
cision making and ranking. The environment of the AI pro-
gram is described by data. Typically, when performing AI
experiments in supervised learning, the available data has to
be divided into a training set, a validation set and a test set
(Russell and Norvig 2009).
According to the scientific method and before performing
an experiment, one should formulate one or more hypothe-
ses about the AI program under investigation and make pre-
dictions about its behavior. The results of the experiments
are interpreted by comparing their results to the hypotheses
and the predictions. Beliefs about the AI program should be
adjusted by this interpretation. The adjusted beliefs can be
used to formulate new hypotheses, so that new experiments
can be conducted. If executed honestly with earnest inter-
pretations of the results, the scientific method updates our
beliefs about an AI program so that they should converge
towards objective truth. Figure 1 illustrates the scientific pro-
cess of AI research as described here.
The Terminology of Reproducibility
While researchers in computer science agree that empiri-
cal results should be reproducible, what is meant by repro-
ducibility is neither clearly defined nor agreed upon. (Stod-
den 2011) distinguishes between replication and reproduc-
tion. Replication is seen as re-running the experiment with
code and data provided by the author, while reproduction is
a broader term ”implying both replication and the regener-
ation of findings with at least some independence from the
[original] code and/or data”. (Drummond 2009) states that
replication, as the weakest form of reproducibility, can only
achieve checks for fraud. Due to the inconsistencies in the
use of the terms replicability and reproducibility, (Goodman
et al. 2016) proposes to extend reproducibility into:

Methods reproducibility: The ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results.

Results reproducibility: The production of corroborating results in a new study, having used the same experimental methods.

Inferential reproducibility: The drawing of qualitatively similar conclusions from either an independent replication of a study or a reanalysis of the original study.
Replication, as used by (Drummond 2009) and (Stodden
2011), is in line with methods reproducibility as proposed by
(Goodman et al. 2016) while reproducibility seems to entail
both results reproducibility and inferential reproducibility.
(Peng 2011) on the other hand suggests that reproducibil-
ity is on a spectrum from publication to full replication.
This view neglects that results produced by AI methods can
be reproduced using different data or different implementa-
tions. Results generated by using other implementations or
other data can lead to new interpretations, which broadens
the beliefs about the AI method, so that generalizations can
be made. Despite the disagreements in terminology, there is
a clear agreement on the fact that the reproducibility of re-
search results is not one thing, but that empirical research
can be assigned to some sort of spectrum, scale or ranking
that is decided based on the level of documentation.
We define reproducibility in the following way:
Definition. Reproducibility in empirical AI research is the
ability of an independent research team to produce the same
results using the same AI method based on the documenta-
tion made by the original research team.
Hence, reproducible research is empirical research that is
documented in such detail by a research team that other re-
searchers can produce the same results using the same AI
method. According to (Sandve et al. 2013), a minimal re-
quirement of reproducibility is that you should at least be
able to reproduce the results yourself. We interpret this as
repeatability and not reproducibility. Our view is that an im-
portant aspect of reproducibility is that the experiment is
conducted independently. We will briefly discuss the three terms AI method, results and independent research team in this section. The next section is devoted to documentation.

An independent research team is one that conducts the experiment by only using the documentation made by the original research team. Enabling others to reproduce the same
results is closely related to trust. Most importantly, other
researchers can be expected to be more objective. They
have no interest in inflating the performance of a method
they have not developed themselves. More practically, they
will not share the same preconceptions and implicit knowl-
edge as the first team reporting the research. Also, other re-
searchers will not share the exact same hardware running the
exact same copies of software. All of this helps controlling
for noise variables related to both the hardware and ancillary
software as well as implicit knowledge and preconceptions.
The distinction between the AI program and the AI
method is important. We must as far as possible remove any
uncertainties to whether other effects than the AI method are
responsible for the results. The concept of using an agent
system for solving some problem is different from the spe-
cific implementation of the agent system. If the results are
dependent on the implementation of the method, the hard-
ware it is running on or the experiment setup, then the char-
acteristics of the AI method do not cause the results.
The results are the output of the experiment, in other
words, the dependent variables of the experiment (Cohen
1995), which typically are captured by performance mea-
sures. The result is the target of the investigation when re-
producing an experiment; we want to ensure that the perfor-
mance of the AI method is the same even if we change the
implementation, the operating system or the hardware that
is being used to conduct the experiment. As long as the re-
sults of the original and the reproduced experiments are the
same, the original experiment is reproducible. What consti-
tutes the same results depends on to which degree the results
are reproduced.
Documenting for Reproducibility
In order to reproduce the results of an experiment, the doc-
umentation must include relevant information, which must
be specified to a certain level of detail. What is relevant
and how detailed the documentation must be are guided by
whether it is possible to reproduce the results of the exper-
iment using this information only. Hence, the color of the
researcher’s jacket is usually not relevant for reproducing
the results. Which operating system is used on the machine when executing the experiment can very well be relevant.

So what exactly is relevant information? The objective
of (Claerbout and Karrenbach 1992; Buckheit and Donoho
1995) was to make it easy to rerun experiments and trace
methods that produced the reported results. For (Claerbout
and Karrenbach 1992), this meant sharing everything on a
CD-ROM, so that anyone could read the research report and
execute the experiments by pushing a button attached to ev-
ery figure. (Buckheit and Donoho 1995) shared Wavelab, a Matlab package that made available all the code needed for reproducing the figures in one of their papers. (Goodman et al.
2016) highlights that ”reporting of all relevant aspects of
scientific design, conduct, measurements, data and analy-
sis” is necessary for all three types of reproducibility. This
is in line with the view of (Stodden 2011), which is that
availability of the computational environment is necessary
for computational reproducibility. (Peng 2011) argues that
a paper alone is not enough, but that linked and executable
code and data is the gold standard. We have grouped the documentation into three categories: 1) method, 2) data and 3) experiment.

Method: The method documentation includes the AI method that the AI program implements as well as a motivation of why the method is used. As the implementation
does not contain the motivation and intended behavior, shar-
ing the implementation of the AI program is not enough. It is
important to give a high-level description of the AI method
that is being tested. This includes what the AI method in-
tends to do, why it is needed and how it works. To decrease
ambiguity, a description of how a method works should con-
tain pseudo code and an explanation of the pseudo code con-
taining descriptions of the parameters and sub-procedures.
Sharing of the AI method is the objective of most research
papers in AI. The problem that is investigated must be spec-
ified, the objective of the research must be clear and so must
the research method being used.
Data: Sharing the data used in the experiment is getting
simpler with open data repositories, such as the UCI Ma-
chine Learning Repository (Lichman 2013). Reproducing
the results fully requires the procedure for how the data set
has been divided into training, validation and test sets and
which samples belong to the different sets. Sharing the val-
idation set might not be necessary when all samples in the
training set are used or might be hard when the method picks
the samples randomly during the experiment. Data sets often
change, so specifying the version is relevant. Finally, in order to compare results, the actual output of the experiment, such as the classes or decisions made, is required.
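The sharing of data splits described above can be sketched as follows. This is an illustrative example and not the authors' published code; the function name, seed, split ratios and version tag are all hypothetical:

```python
import json
import random

def make_documented_split(n_samples, seed=42, train=0.6, val=0.2):
    """Split sample indices into train/validation/test sets and return a
    JSON-serializable record of exactly which sample belongs to which
    set, so the split can be shared alongside the paper."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return {
        "seed": seed,
        "dataset_version": "v1.0",     # hypothetical version tag
        "train": sorted(indices[:n_train]),
        "validation": sorted(indices[n_train:n_train + n_val]),
        "test": sorted(indices[n_train + n_val:]),
    }

split = make_documented_split(10)
print(json.dumps(split))
```

Publishing such a record answers both which samples belong to each set and which dataset version was used.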
Experiment: For others to reproduce the results of an
experiment, the experiment and its setup must be shared.
The experiment contains code as well as an experiment de-
scription. Proper experiment documentation must explain
the purpose of the experiment. The hypotheses that are tested
and the predictions about these must be documented, and so
must the results and the analysis. In order to rule out the
possibilities that the results can be attributed to the hard-
ware or ancillary software, the hardware and ancillary soft-
ware used must be properly specified. The ancillary software
includes, but is not restricted to, the operating system, pro-
gramming environment and programming libraries used for
implementing the experiment. Sharing the experiment code
is not limited to open sourcing the AI program that is in-
vestigated, but sharing of the experiment setup with all in-
dependent variables, such as hyperparameters, as well as the
scripts and environmental settings is required. The exper-
iment setup consists of independent variables that control
the experiment. These variables configure both the ancillary
software and the AI program. Hyperparameters are indepen-
dent variables that configure the AI method and examples
include the number of leaves or depth of a tree and the learn-
ing rate. Documented code increases transparency.
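Capturing the ancillary software and the independent variables of a run can be as simple as the following sketch; the function and field names are our own illustration, not part of the survey:

```python
import json
import platform
import sys

def capture_experiment_setup(hyperparameters):
    """Record ancillary software and the independent variables
    (hyperparameters) that control an experiment run."""
    return {
        "os": platform.platform(),                 # operating system
        "python_version": sys.version.split()[0],  # programming environment
        "hyperparameters": hyperparameters,        # e.g. learning rate, tree depth
    }

setup = capture_experiment_setup({"learning_rate": 0.01, "max_tree_depth": 5})
print(json.dumps(setup, indent=2))
```

Emitting such a record with every experiment costs little and documents exactly the variables this section argues must be shared.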
In conclusion, there are different degrees to how well
an empirical study in AI research can be documented. The
degrees depend on whether the method, the data and the
experiment are documented and how well they are docu-
mented. The gold standard is sharing all of the three groups
of documentation through access to a running virtual ma-
chine in the cloud containing all the data, runnables, docu-
mentation and source code, as this includes the hardware and
software stack as well and not only the software libraries
used for running the experiments which was the case with
the proposed solutions by (Claerbout and Karrenbach 1992;
Buckheit and Donoho 1995). This is not necessarily practi-
cal, as it requires costly infrastructure that has a high mainte-
nance cost. Another practical consideration is related to how
long the infrastructure should and can be guaranteed to run
and produce the same results.
Degrees of Reproducibility
We propose to distinguish between three different degrees of
reproducibility, where an increased degree of reproducibil-
ity conveys an increased generality of the AI method. An
increased generality means that the performance of the AI
method documented in the experiment is not related to one
specific implementation or the data used in the experiment;
the AI method is more general than that. The three degrees
of reproducibility are defined as follows:
R1: Experiment Reproducible The results of an experi-
ment are experiment reproducible when the execution of
the same implementation of an AI method produces the
same results when executed on the same data.
R2: Data Reproducible The results of an experiment are data reproducible when an experiment is conducted that executes an alternative implementation of the AI method that produces the same results when executed on the same data.

R3: Method Reproducible The results of an experiment
are method reproducible when the execution of an al-
ternative implementation of the AI method produces the
same results when executed on different data.
Figure 2: The three degrees of reproducibility are defined by which documentation is used to reproduce the results.

Results that are R1 reproducible require the same software and data used for conducting the experiment and a detailed description of the AI method and experiment. This is
what is called fully reproducible by (Peng 2011) and method
reproducibility by (Goodman et al. 2016). We call it exper-
iment reproducible as everything required to run the exper-
iment is needed to reproduce the results. The results when re-running the experiment should be exactly the same as reported in the original experiment. Any differences can only
be attributed to differences in hardware given that the ancil-
lary software is the same.
Results that are R2 reproducible require only the method
description and the data in order to be reproduced. This re-
moves any noise variables related to implementation and
hardware. The belief that the result is being caused by the
AI method is strengthened. Hence, the generality of the AI
method is increased compared to an AI method that is R1
reproducible. As the results are achieved by running the AI
method on the same data as the original experiment, there is
still a possibility that the performance can only be achieved
using the same data. The results that are produced, the per-
formance, using a different implementation should be the
same if not exactly the same. Differences in results can be at-
tributed to different implementations and hardware, such as
different ways of doing floating point arithmetic. However,
differences in software and hardware could have significant
impact on results because of rounding errors in floating point
arithmetic (Hong et al. 2013).
Results that are R3 reproducible require only the method documentation to be reproduced. If the results are reproduced, all noise variables related to implementation, hardware and data have been removed, and it is safe to assume that the results are caused by the AI method. As the results are produced by using a new implementation on a new data set, the AI method is generalized beyond the data and the implementation used in the original experiment. In order for
a result to be R3 reproducible the results of the experiments
must support the same hypotheses and thus support the same
beliefs. The same interpretations cannot be made unless the
results are statistically significant, so the analysis should be supported by statistical hypothesis testing with a p-value threshold of 0.005 for claiming new discoveries (Johnson 2013; Benjamin et al. 2017).
When it comes to generality of the results, the following is true: R1 < R2 < R3, which means that R1 reproducible results are less general than R2 reproducible results, which in turn are less general than R3 reproducible results. However, when it comes to the documentation required, the following is the case: doc(R3) ⊂ doc(R2) ⊂ doc(R1). The
documentation needed for R3 reproducibility is a subset of
the documentation required for R2 reproducibility and the
documentation required for R2 is a subset of the documen-
tation required for R1 reproducibility. R3 reproducible is the
most general reproducibility degree that also requires the
least amount of information.
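The nesting of documentation requirements can be checked mechanically. A small sketch using Python sets, with the factor names used in this survey:

```python
# Documentation factors required for each degree of reproducibility.
doc_R3 = {"Method"}
doc_R2 = {"Method", "Data"}
doc_R1 = {"Method", "Data", "Experiment"}

# doc(R3) ⊂ doc(R2) ⊂ doc(R1): the most general degree (R3)
# requires the least documentation.
assert doc_R3 < doc_R2 < doc_R1   # proper subset relations

print(sorted(doc_R1 - doc_R3))    # → ['Data', 'Experiment']
```

The printed difference is exactly the extra documentation R1 reproducibility demands beyond R3.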
Current practice of publishers is not to require researchers
to share data and implementation when publishing research
papers. The current practice enables R3 reproducible results
that have the least amount of transparency. For (Goodman et
al. 2016), the goal of transparency is to ease evaluation of
the weight of evidence from studies to facilitate future stud-
ies on actual knowledge gaps and cumulative knowledge,
and reduce time spent exploring blind alleys from poorly
reported research. This means that current practices enable other research teams to reproduce results at the highest reproducibility degree with the least effort from the original research team. The majority of the effort in reproducing results lies with the independent team instead of the original team.
Transparency does not only reduce the effort needed to re-
produce the results, but it also builds trust in them. Hence,
the results that are produced by current practices are the least
trustworthy from a reproducibility point of view, because of
the lack in transparency; the evidence showing that the re-
sults are valid is not published.
Research Method
We have conducted an observational experiment in the form of a survey of research papers in order to generate quantita-
tive data about the state of reproducibility of research results
in AI. The research papers have been reviewed, and a set
of variables have been manually registered. In order to com-
pare results between papers and conferences, we propose six
metrics for deciding whether research results are R1, R2, and
R3 reproducible as well as to which degree they are.
In order to evaluate the two hypotheses, we have surveyed
a total of 400 papers where 100 papers have been selected
from each of the 2013 and 2016 installments of the con-
ference IJCAI and from the 2014 and 2016 installments of
the conference series AAAI. With the exception of 50 papers from IJCAI 2013, all the papers have been selected randomly to avoid any selection biases. Table 1 shows the
number of accepted papers (the population size), the num-
ber of surveyed papers (sample size) and the margin of er-
rors for a confidence level of 95% for the four conferences.
We have computed the margin of error as half the width of
the confidence interval, and for our study the margin of er-
ror is 4.29%. All the data and the code that have been used to calculate the reproducibility scores and generate the figures can be found on GitHub.
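The margins of error in Table 1 can be recomputed under the standard assumptions for a proportion (worst-case p = 0.5, 95% confidence, finite population correction); the following sketch, our own illustration, reproduces the reported values:

```python
import math

def margin_of_error(population, sample, z=1.96):
    """Half-width of the 95% confidence interval for a proportion,
    assuming worst-case p = 0.5, with finite population correction."""
    half_width = z * math.sqrt(0.25 / sample)
    fpc = math.sqrt((population - sample) / (population - 1))
    return half_width * fpc

# Matches Table 1:
print(round(100 * margin_of_error(413, 100), 2))    # IJCAI 2013 → 8.54
print(round(100 * margin_of_error(213, 100), 2))    # AAAI 2014 → 7.15
print(round(100 * margin_of_error(1726, 400), 2))   # Total → 4.3
```

The total of roughly 4.3% agrees with the table; the text's 4.29% differs only in rounding.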
Factors and Variables
We have identified a set of variables that we believe are good
indicators for reproducibility after reviewing the literature.
Table 1: Population size, sample size (with number of em-
pirical studies) and margin of error for a confidence level of
95% for the four conferences and total population.
Conference Population size Sample size MoE
IJCAI 2013 413 100 (71) 8.54%
AAAI 2014 213 100 (85) 7.15%
IJCAI 2016 551 100 (84) 8.87%
AAAI 2016 549 100 (85) 8.87%
Total 1726 400 (325) 4.30%
These variables have been grouped together into the three
factors Method, Data and Experiment. For each surveyed pa-
per, we have registered these variables. In addition, we have
collected some extra variables, which have been grouped to-
gether in Miscellaneous. The following variables have been
registered for the three factors:
Method: How well is the research method documented?
Problem (*): The problem the research seeks to solve.
Objective/Goal (*): The objective of the research.
Research method (*): The research method used.
Research questions (*): The research question asked.
Pseudo code: Method described using pseudo code.
Data: How well is the data set documented?
Training data: Is the training set shared?
Validation data: Is the validation set shared?
Test data: Is the test set shared?
Results: Are the results shared?
Experiment: How well are the implementation and the experiment documented?
Hypothesis (*): The hypothesis being investigated.
Prediction (*): Predictions related to the hypotheses.
Method source code: Is the method open sourced?
Hardware specifications: Hardware used.
Software dependencies: For method or experiment.
Experiment setup: Is the setup including hyperparame-
ters described?
Experiment source code: Is the experiment code open sourced?

Miscellaneous: Different variables that describe the research:

Research type: Experimental (E) or theoretical (T).
Research outcome: Is the paper reporting a positive or a
negative result (positive=1 and negative=0).
Affiliation: The affiliation of the authors. Academia (0),
collaboration (1) or industry (2).
Contribution (*): Contribution of the research.
All variables were registered as true (1) or false (0) unless
otherwise specified. When surveying the papers, we have
looked for explicit mentions of the variables marked with an asterisk (*) above. For example, when reviewing the variable
Figure 3: Percentage of papers documenting each variable for the three factors: a) Method, b) Data and c) Experiment.
Problem, we have looked for an explicit mention of the prob-
lem being solved, such as ”To address this problem, we pro-
pose a novel navigation system ...” (De Weerdt et al. 2013).
The decision to use explicit mentions of the terms, such as
contribution, goal, hypothesis and so on, can be disputed.
However, the reasons for looking for explicit mentions are
both practical and idealistic. Practically, it is easier to review
a substantial amount of papers if the criteria are clear and ob-
jective. If we did not follow this guideline, the registering of
variables would lend itself to subjective assessment rather
than objective, and the results could be disputed based on
how we measured the variables. Our goal was to get results
with a low margin of error, so that we could draw statisti-
cally valid conclusions. In order to survey enough papers,
we had to reduce the time we used on each paper. Explicit
mentions supported this. Idealistically, our attitude is that re-
search documentation should be clear and concise. Explicit
mentions of which problem is being solved, what the goal of
doing the research is, which hypothesis is being tested and
so on are required to remove ambiguity from the text. Less
ambiguous documentation increases the reproducibility of
the research results.
Quantifying Reproducibility
We have defined a set of six metrics to quantify whether an experiment e is R1, R2 or R3 reproducible and to which degree. The metrics measure how well the three factors method, data and experiment are documented. The three metrics R1(e), R2(e) and R3(e) are boolean metrics that can be either true or false:

R1(e) = Method(e) ∧ Data(e) ∧ Exp(e),    (1)
R2(e) = Method(e) ∧ Data(e),    (2)
R3(e) = Method(e),    (3)
where Method(e), Data(e) and Exp(e) are the conjunctions of the truth values of the variables listed under the three factors Method, Data and Experiment in the section Factors and Variables. This means that for Data(e) to be true for an experiment e, the training data set, the validation data set, the test data set and the results must be shared for e. Hence, R1(e) is the strictest requirement while R3(e) is the most relaxed requirement when it comes to the documentation of an experiment e, as R3(e) requires only the variables of the factor Method to be true while R1(e) requires all variables for all three factors to be true.
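Equations (1)-(3) amount to conjunctions over the registered variables. A minimal sketch of how the boolean metrics could be computed (our own illustrative code, not the authors' published implementation; the variable counts follow the factor lists above):

```python
def factor(values):
    """Conjunction of the truth values registered for one factor."""
    return all(values)

def r_metrics(method_vars, data_vars, exp_vars):
    """Boolean reproducibility metrics R1, R2, R3 for one experiment."""
    method, data, exp = factor(method_vars), factor(data_vars), factor(exp_vars)
    return {
        "R1": method and data and exp,  # Eq. (1)
        "R2": method and data,          # Eq. (2)
        "R3": method,                   # Eq. (3)
    }

# A paper that documents the method fully (5 variables) but shares
# no data (4 variables) and no experiment details (7 variables):
print(r_metrics([True] * 5, [False] * 4, [False] * 7))
# → {'R1': False, 'R2': False, 'R3': True}
```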
The three metrics R1(e), R2(e) and R3(e) are boolean metrics, so they will provide information on whether an experiment is R1, R2 or R3 reproducible in a strict sense. They will, however, not provide any information on to which degree experiments are reproducible, unless an experiment meets all the requirements. Therefore, we suggest the three metrics R1D(e), R2D(e) and R3D(e) for measuring to which degree the results of an experiment e are reproducible:

R1D(e) = δ1 Method(e) + δ2 Data(e) + δ3 Exp(e),    (4)
R2D(e) = δ1 Method(e) + δ2 Data(e),    (5)
R3D(e) = Method(e),    (6)
where Method(e), Data(e) and Exp(e) are the weighted
sums of the truth values of the variables listed under the
three factors Method, Data and Experiment. The weights of
the factors are δ1, δ2 and δ3, respectively. This means that
the value of Data(e) for an experiment e is the sum of the
truth values for whether the training, validation and test data
sets as well as the results are shared for e. It is of course also
possible to give different weights to each variable of a fac-
tor. We use a uniform weight for all variables and factors in
our survey, δi = 1. For an experiment e1 that has published
the training data and test data, but not the validation set and
the results, Data(e1) = 0.5. Note that some papers have no
value for the training and validation sets if the experiment
does not require either. For these papers, the corresponding
weight is set to 0.
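A sketch of the degree metrics, assuming the weighted sum is normalized by the total applicable weight so that scores fall in [0, 1] (consistent with the reported values); the variable names are illustrative:

```python
def factor_degree(variables):
    """Fraction of a factor's applicable variables that are documented.

    `variables` maps a variable name to True/False, or to None when
    the variable does not apply (e.g. no training set is needed).
    """
    applicable = [v for v in variables.values() if v is not None]
    if not applicable:
        return None  # the whole factor is not applicable
    return sum(applicable) / len(applicable)

def rxd_metrics(method, data, exp, weights=(1.0, 1.0, 1.0)):
    """Degrees of reproducibility R1D, R2D and R3D for one experiment.

    `weights` holds the factor weights (delta_1, delta_2, delta_3);
    the survey uses uniform weights. Non-applicable factors are
    dropped, mirroring the paper's rule of setting their weight to 0.
    """
    d1, d2, d3 = weights
    m, d, e = factor_degree(method), factor_degree(data), factor_degree(exp)

    def combine(pairs):
        pairs = [(s, w) for s, w in pairs if s is not None]
        total = sum(w for _, w in pairs)
        return sum(s * w for s, w in pairs) / total

    return {
        "R1D": combine([(m, d1), (d, d2), (e, d3)]),
        "R2D": combine([(m, d1), (d, d2)]),
        "R3D": m,
    }

# The paper's example: training and test data shared, but not the
# validation set or the results, gives Data(e) = 0.5.
data = {"training set": True, "validation set": False,
        "test set": True, "results": False}
assert factor_degree(data) == 0.5
```

Under this reading, an experiment that documents every variable of every applicable factor scores 1.0 on all three metrics, recovering the strict boolean case.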
Results and Discussion
Figure 3 shows the percentage of research papers that have
documented the different variables for the three factors.
None of the three factors are documented very well accord-
ing to the survey. As can be seen by analyzing the factor
Method, an explicit description of the motivation behind the
research is not common. Figure 4 (b) shows this as well.
None of the papers
document all five variables, and most of them (90%) docu-
ment two or fewer. This might be because it is assumed that
researchers in the domain are acquainted with the motiva-
tions and problems.

Figure 4: a) Change in the RXD metrics. b), c) and d) show
the number of variables registered for the three factors for
all papers.

Figure 3 (b) shows that few papers provide
the results of the experiment, although, compared to the
other two factors, an encouraging 49% of the research papers
share data as seen from Figure 4 (c). The experiments are not
documented well either, as can be seen in Figures 3 (c) and
4 (d). The variable Experiment setup is given a high score,
which indicates that the experiment setup is documented to
some degree. As we have not actually tried to reproduce the
results, we have not ensured that the experiment setup is doc-
umented in enough detail to run the experiment.
The number of empirical papers is shown in Table 1. For
each conference, between 15% and 29% of the randomly
selected samples are not empirical. In total, 325 papers are
empirical and are considered in the analysis. Table 2 presents the
results of the RXD (R1D, R2D and R3D) metrics. All the
RXD metrics vary between 0.20 and 0.30. This means that
only between a fifth and a third of the variables required for
reproducibility are documented. For all papers, R1D has the
lowest score with 0.24, R2D has a score of 0.25 and R3D has
a score of 0.26. The general trend is that R1D is lower than
the R2D scores, which again are lower than the R3D scores.
This is not surprising, as R1D covers more variables than
R2D, which covers more variables than R3D. However,
given the error, there is little variation among the three re-
producibility metrics.
The RX (R1, R2 and R3) scores were 0.00 for all pa-
pers. No paper had a full score on all variables for the factor
Method, which is required by all three RX metrics. The
three RX metrics are very strict and are not very informative
for a survey such as this. They might have a use though, as
guidelines for reviewers of conferences and journal publica-
tions. The three RXD metrics do not have the same issue as
the RX metrics, as they measure the degree of reproducibil-
ity between 0 and 1.
There is a clear increase in the RXD scores from IJCAI
2013 to IJCAI 2016; see Figure 4 (a). However, the trend is
not as clear for AAAI, as the R2D and R3D scores decrease.
Table 3 shows the combined scores for the earlier years (2013
and 2014, 156 papers) and the combined scores for 2016
(169 papers). The results show that there is a slight, but sta-
tistically significant increase for R1D. The increase for R2D
is not statistically significant, and there is no change for
R3D. This means that only the experiment documentation
has improved with time, and that there is no such evidence
for the documentation of methods and data.

Table 2: The 95% confidence interval for the mean R1D,
R2D and R3D scores, where ε = 1.96·σ_x̄ and σ_x̄ = σ̂/√n.

Conference    R1D ± ε       R2D ± ε       R3D ± ε
IJCAI 2013    0.20 ± 0.02   0.20 ± 0.03   0.24 ± 0.04
AAAI 2014     0.21 ± 0.02   0.26 ± 0.03   0.28 ± 0.04
IJCAI 2016    0.30 ± 0.03   0.30 ± 0.04   0.29 ± 0.04
AAAI 2016     0.23 ± 0.02   0.25 ± 0.04   0.24 ± 0.04
Total         0.24 ± 0.01   0.25 ± 0.02   0.26 ± 0.02

Table 3: The 95% confidence interval for the mean R1D,
R2D and R3D scores when combining the papers from all
four installments of IJCAI and AAAI into two groups ac-
cording to the years they were published. One group con-
tains all papers from 2013 and 2014, and the other group
contains all the papers from 2016.

Years        R1D ± ε       R2D ± ε       R3D ± ε
2013/2014    0.21 ± 0.02   0.23 ± 0.02   0.26 ± 0.03
2016         0.27 ± 0.02   0.27 ± 0.03   0.26 ± 0.03
The survey confirms our prediction that the current docu-
mentation practices at top AI conferences render most of
the reported research results irreproducible, as the R1, R2
and R3 reproducibility metrics show that no papers are fully
reproducible. Only 24% of the variables required for R1D
reproducibility, 25% of the variables required for R2D re-
producibility and 26% of the variables required for R3D re-
producibility are documented. When investigating whether
there is change over time, we do see improvement, which
supports our second hypothesis: the R1, R2, R3, R2D and
R3D metrics indicate no improvement, but there is a statis-
tically significant improvement in the R1D metric. Hence,
overall there is an improvement.
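As an illustration, the confidence intervals of Tables 2 and 3 and a simple significance check between the two year groups can be sketched as follows. The score list is a synthetic stand-in, not the survey data, and the two-sample z-test is our illustrative choice; the paper does not state which test was used:

```python
import math

def mean_ci95(scores):
    """Mean with 95% confidence interval eps = 1.96 * sigma_xbar,
    where sigma_xbar = sigma_hat / sqrt(n) and sigma_hat is the
    sample standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    eps = 1.96 * math.sqrt(var / n)
    return mean, eps

def z_score(mean_a, eps_a, mean_b, eps_b):
    """Two-sample z statistic recovered from reported mean +/- eps,
    assuming approximately normal sampling distributions."""
    se_a, se_b = eps_a / 1.96, eps_b / 1.96
    return (mean_b - mean_a) / math.sqrt(se_a ** 2 + se_b ** 2)

# Synthetic per-paper R1D scores (illustrative only).
mean, eps = mean_ci95([0.20, 0.30, 0.25, 0.25])

# Reported R1D means for 2013/2014 vs. 2016 (Table 3).
z = z_score(0.21, 0.02, 0.27, 0.02)
# |z| > 1.96 indicates a difference significant at the 5% level.
```

With the reported R1D intervals, the intervals do not overlap and the z statistic exceeds 1.96, consistent with the significant increase reported above.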
Acknowledgements
This work has been carried out at the Telenor-NTNU AI Lab,
Norwegian University of Science and Technology, Trond-
heim, Norway.