POINT OF VIEW

Data science for the scientific life cycle

Abstract Data science can be incorporated into every stage of a scientific study. Here we describe how data science can be used to generate hypotheses, to design experiments, to perform experiments, and to analyse data. We also present our vision for how data science techniques will be an integral part of the laboratory of the future.
DAPHNE EZER† AND KIRSTIE WHITAKER†

†These authors contributed equally to this work.

Copyright Ezer and Whitaker. This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Ezer and Whitaker. eLife 2019;8:e43979. DOI: https://doi.org/10.7554/eLife.43979
Introduction
A key tenet of the scientific method is
that we learn from previous work. In
principle we observe something about
the world and generate a hypothesis. We then
design an experiment to test that hypothesis,
set up the experiment, collect the data and ana-
lyse the results. And when we report our results
and interpretation of them in a paper, we make
it possible for other researchers to build on our
work.
In practice, there are impediments at every
step of the process. In particular, our work
depends on published research that often does
not contain all the information required to repro-
duce what was reported. There are too many
possible experimental parameters to test under
our time and budget constraints, so we make
decisions that affect how we interpret the out-
comes of our experiments. As researchers, we
should not be complacent about these
obstacles: rather, we should always look towards
new technologies, such as data science, to help
us improve the quality and efficiency of scientific
research.
Data science could easily be dismissed as a
simple rebranding of "science" – after all, nearly
all scientists analyse data in some form. A more
precise definition of a data scientist is someone
who develops new computational or statistical
analysis techniques that can easily be adapted
to a wide range of scenarios, or who can apply
these techniques to answer a specific scientific
question. While there is no clear dividing line
between data science and statistics, data science
generally involves larger datasets. Moreover,
data scientists often think in terms of training
predictive models that can be applied to other
datasets, rather than limiting the analysis to an
existing dataset.
Data science emerged as a discipline largely
because the internet led to the creation of
incredibly large datasets (such as ImageNet, a
database of 14 million annotated images;
Krizhevsky et al., 2012). The availability of
these datasets enabled researchers to apply a
variety of machine learning algorithms which, in
turn, led to the development of new techniques
for analysing large datasets. One area in which
progress has been rapid is the automated anno-
tation and interpretation of images and texts on
the internet, and these techniques are now
being applied to other data-rich domains,
including genetics and genomics (Libbrecht and
Noble, 2015) and the study of gravitational
waves (Abbott et al., 2016).
It is clear that data science can inform the
analysis of an experiment, either to test a spe-
cific hypothesis or to make sense of large data-
sets that have been collected without a specific
hypothesis in mind. What is less obvious, albeit
equally important, is how these techniques can
improve other aspects of the scientific method,
such as the generation of hypotheses and the
design of experiments.
Data science is an inherently interdisciplinary
approach to science. New experimental techniques have revolutionised biology over the years, from DNA sequencing and microarrays in the past to CRISPR and cryo-EM more recently. Data
science differs in that it is not a single technique,
but rather a framework for solving a whole range
of problems. The potential for data science to
answer questions in a range of different disci-
plines is what excites so many researchers. That
said, however, there are social challenges that
cannot be fixed with a technical solution, and it
is all too easy for expertise to be "lost in transla-
tion" when people from different academic
backgrounds come together.
In October 2018, we brought together statis-
ticians, experimental researchers, and social sci-
entists who study the behaviour of academics in
the lab (and in the wild) at a workshop at the
Alan Turing Institute in London to discuss how
we can harness the power of data science to
make each stage of the scientific life cycle more
efficient and effective. Here we summarise the
key points that emerged from the workshop,
and propose a framework for integrating data
science techniques into every part of the
research process (Figure 1). Statistical methods
can optimise the power of an experiment by
selecting which observations should be col-
lected. Robotics and software pipelines can
automate data collection and analysis, and incor-
porate machine learning analyses to adaptively
update the experimental design based on
incoming data. And the traditional output of
research, a static PDF manuscript, can be
enhanced to include analysis code and well-
documented datasets to make the next iteration
of the cycle faster and more efficient. We also
highlight several of the challenges, both technical and social, that must be overcome to translate theory into practice, and share our vision for the laboratory of the future.

Figure 1. Integrating data science into the scientific life cycle. Data science can be used to generate new hypotheses, optimally design which observations should be collected, automate and provide iterative feedback on this design as data are being observed, reproducibly analyse the information, and share all research outputs in a way that is findable, accessible, interoperable and reusable (FAIR). We propose a virtuous cycle through which experiments can effectively and efficiently "stand on the shoulders" of previous work in order to generate new scientific insights.
Data science for planning
experiments
Hypothesis-driven research usually requires a sci-
entist to change an independent variable and
measure a dependent variable. However, there
are often too many parameters to take account
of. In plant science, for instance, these parame-
ters might include temperature, exposure to
light, access to water and nutrients, humidity
and so on, and the plant might respond to a
change in each of these in a context-dependent
way.
Data scientists interested in designing opti-
mal experiments must find ways of transforming
a scientific question into an optimisation prob-
lem. For instance, let us say that a scientist wants
to fit a regression model of how temperature
and light exposure influence wheat growth. Ini-
tially they might measure the height of the
wheat at a number of combinations of tempera-
ture and light exposure. Then, the scientist could
ask: what other combinations of temperature
and light exposure should I grow the wheat at in
order to improve my ability to predict wheat
growth, considering the cost and time con-
straints of the project?
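The question above can be phrased as an optimisation problem. Below is a minimal, self-contained sketch of greedy D-optimal design for the wheat example: given the combinations already measured, pick the next (temperature, light) setting that most increases the determinant of the information matrix X'X of a linear model. The candidate grid, the measured points and the model y = b0 + b1·temp + b2·light are illustrative assumptions, not taken from the workshop talks.

```python
# Sketch: choosing the next (temperature, light) combination for a wheat-growth
# regression by greedy D-optimal design. All numbers here are invented.
from itertools import product

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def information_matrix(design):
    """X'X for the assumed model y = b0 + b1*temp + b2*light."""
    rows = [(1.0, t, l) for t, l in design]
    return [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

def next_d_optimal_point(design, candidates):
    """Pick the candidate that most increases det(X'X) -- the D-criterion."""
    return max(candidates, key=lambda p: det3(information_matrix(design + [p])))

# Conditions already measured (temperature in C, light in hours per day).
measured = [(10, 8), (10, 16), (20, 8)]
# Feasible settings of the growth chambers.
grid = list(product([10, 15, 20, 25], [8, 12, 16]))

best = next_d_optimal_point(measured, grid)
print(best)
```

On this toy grid the criterion selects the most extreme untested corner of the design space, matching the statistical intuition that, for a linear model, spread-out points are the most informative.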
At the workshop Stefanie Biedermann (Uni-
versity of Southampton) discussed how to trans-
form a wide range of experimental design
questions into optimisation problems. She and
her colleagues have applied these methods to
find optimal ways of selecting parameters for
studies of enzyme kinetics (Dette and Bieder-
mann, 2003) and medical applications
(Tompsett et al., 2018). Other researchers have
used data science to increase the production of
a drug while reducing unwanted by-products
(Overstall et al., 2018). The process iteratively builds on a small number of initial experiments that are conducted with different choices of experimental parameters (such as reagent concentrations, temperature or timing): new experimental parameters are then suggested until the optimal set is identified.
Optimal experimental design can also help
researchers fit parameters of dynamical models,
which can help them develop a mechanistic
understanding of biological systems. Ozgur
Akman (University of Exeter) focuses on dynamic
models that can explain how gene expression
changes over time and where the model param-
eters represent molecular properties, such as
the transcription rates or mRNA degradation
rates. As an example he explained how he had
used this approach to find the optimal parame-
ters for a mathematical model for the circadian
clock (Aitken and Akman, 2013). Akman also
described how it is possible to search for possi-
ble gene regulatory networks that can explain
existing experimental data (Doherty, 2017), and
then select new experiments to help distinguish
between these alternative hypotheses
(Sverchkov and Craven, 2017). For instance,
the algorithm might suggest performing a cer-
tain gene knockout experiment, followed by
RNA-seq, to gain more information about the
network structure.
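The idea of selecting experiments to distinguish hypotheses can be sketched very simply: given two candidate regulatory models, propose the perturbation on which their predictions differ most. The gene names, the two toy models and the plain disagreement score below are invented for illustration; methods such as Sverchkov and Craven (2017) use principled criteria like expected information gain.

```python
# Predicted expression of a target gene (arbitrary units) under each candidate
# knockout experiment, according to two rival network hypotheses (invented).
model_a = {"wild_type": 1.0, "ko_geneA": 0.1, "ko_geneB": 0.9}
model_b = {"wild_type": 1.0, "ko_geneA": 0.9, "ko_geneB": 0.2}

def most_discriminating(preds_a, preds_b):
    """Propose the experiment on which the two hypotheses disagree the most."""
    return max(preds_a, key=lambda exp: abs(preds_a[exp] - preds_b[exp]))

# Both models agree on the wild type, so a knockout is proposed instead.
print(most_discriminating(model_a, model_b))
```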
A clear message from the workshop was that
statisticians need to be involved in the experi-
mental design process as early as possible,
rather than being asked to analyse the data at
the end of a project. Involving statisticians
before data collection makes it more likely the
scientist will be able to answer the research
questions they are interested in. Another clear
message was that the data, software, infrastruc-
ture and the protocols generated during a
research project were just as important as the
results and interpretations that constitute a sci-
entific paper.
Data science for performing
experiments
In order to effectively plan an experiment, it is
necessary to have some preliminary data as a
starting point. Moreover, ensuring that the data
collected during a particular experiment is used
to inform the planning process for future experi-
ments will make the whole process more effi-
cient. For standard molecular biology
experiments, this kind of feedback loop can be
achieved through laboratory automation.
Ross King (University of Manchester) and co-
workers have developed the first robot scientists
– laboratory robots that physically perform
experiments and use machine learning to generate hypotheses, plan experiments, and perform
deductive reasoning to come to scientific conclu-
sions. The first of these robots, Adam, success-
fully identified the yeast genes encoding orphan
enzymes (King et al., 2009), and the second,
Eve, intelligently screened drug targets for
neglected tropical diseases (Williams et al.,
2015). King is convinced that robot scientists improve research productivity, and also help scientists to develop a better understanding of science as a process (King et al., 2018). For
instance, an important step towards building
these robotic scientists was the development of
a formal language for describing scientific dis-
coveries. Humans might enjoy reading about sci-
entific discoveries that are described in English
or some other human language, but such lan-
guages are subject to ambiguity and exaggera-
tion. However, translating the deductive logic of
research projects into the formal languages of
"robotic scientists" should lead to a more pre-
cise description of our scientific conclusions
(Sparkes et al., 2010).
Let us imagine that a research team observe
that plants with a gene knockout are shorter
than wild type plants. Their written report of the
experiment will state that this gene knockout
results in shorter plants. They are likely to leave
unsaid the caveat that this result was only
observed under their experimental set-up and,
therefore, that this may not be the case under
all possible experimental parameters. The
mutant might be taller than a wild type plant
under certain lighting conditions or tempera-
tures that were not tested in the original study.
In the future, researchers may be able to write
their research outcomes in an unambiguous way
so that it is clear that the evidence came from a
very specific experimental set-up. The work
done to communicate these conditions in a com-
puter-readable format will benefit the human
scientists who extend and replicate the original
work.
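As a toy illustration of such an unambiguous, computer-readable report, a conclusion can be recorded together with the exact conditions under which it was observed, so that "knockout X gives shorter plants" cannot be silently over-generalised. The schema and field names below are our own invention, not an existing standard.

```python
# A minimal sketch of recording a claim alongside the scope of its evidence.
# All field names and values are illustrative, not a real reporting format.
import json

result = {
    "claim": "knockout of geneX reduces plant height vs wild type",
    "effect": {"variable": "height_cm", "direction": "decrease"},
    "conditions": {          # the scope of the evidence, stated explicitly
        "temperature_c": 22,
        "light_hours_per_day": 16,
        "medium": "soil",
        "n_replicates": 12,
    },
    "evidence": {"test": "two-sided t-test", "p_value": 0.01},
}
print(json.dumps(result, indent=2))
```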
Even though laboratory automation technol-
ogy has existed for a number of years, it has yet
to be widely incorporated into academic
research environments. Laboratory automation is
full of complex hardware that is difficult to use,
but a few start-ups are beginning to build tools
to help researchers communicate with their labo-
ratory robots more effectively. Vishal Sanchania
(Synthace) discussed how their software tool
Antha enables scientists to easily develop work-
flows for controlling laboratory automation. Fur-
thermore, these workflows can be iterative: that
is, data collected by the laboratory robots can
be used within the workflow to plan the next
experimental procedure (Fell et al., 2018).
One benefit of having robotic platforms per-
form experiments as a service is that researchers
are able to publish their experimental protocols
as executable code, which any other researcher,
from anywhere around the world, can run on
another automated laboratory system, improv-
ing the reproducibility of experiments.
Data science for reproducible data
analysis
As the robot scientists (and their creators) real-
ised, there is a lot more information that must
be captured and shared for another researcher
to reproduce an experiment. It is important that
data collection and its analysis are reproducible.
All too often, there is no way to verify the results
in published papers because the reader does
not have access to the data, nor to the informa-
tion needed to repeat the same, often complex,
analyses (Ioannidis et al., 2014). At our work-
shop Rachael Ainsworth (University of Manches-
ter) highlighted Peng’s description of the
reproducibility spectrum, which ranges from
“publication only” to "full replication" with
linked and executable code and data
(Peng, 2011). Software engineering tools and
techniques that are commonly applied in data
science projects can nudge researchers towards
the full replication end of the spectrum. These
tools include interactive notebooks (Jupyter, R Markdown), version control and collaboration tools (git, GitHub, GitLab), package managers and containers to capture computational environments (Conda, Docker), and workflows to test and continuously integrate updates to the project (Travis CI). See Beaulieu-Jones and Greene, 2017 for an overview of how to repurpose these tools for scientific analyses.
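Two of these habits fit in a few lines of Python: fixing random seeds so that stochastic steps are repeatable, and recording a cryptographic fingerprint of the input data so that readers can confirm they are analysing exactly the same bytes. The file name and subsampling step are placeholders.

```python
# Sketch: two small habits that nudge an analysis toward the "full replication"
# end of Peng's reproducibility spectrum. The data here is a stand-in.
import hashlib
import random

def dataset_fingerprint(raw_bytes):
    """A short, stable identifier for the exact bytes that were analysed."""
    return hashlib.sha256(raw_bytes).hexdigest()[:12]

random.seed(43979)          # fixed seed: the "random" subsample is reproducible
data = bytes(range(10))     # stand-in for open("measurements.csv", "rb").read()
subsample = random.sample(range(10), 3)

print(dataset_fingerprint(data), subsample)
```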
Imaging has always been a critical technology
for cell and developmental biology (Burel et al.,
2015), ever since scientists looked at samples
through a microscope and made drawings of
what they saw. Photography came next, fol-
lowed by digital image capture and analysis.
Sébastien Besson (University of Dundee) presented a candidate for the next technology in
this series, a set of open-source software and
format standards called the Open Microscopy
Environment (OME). This technology has already
supported projects as diverse as the development of a deep learning classifier to identify patients with clinical heart failure (Nirschl et al., 2018) and the generation of ultra-large high-resolution electron microscopy maps in human, mouse and zebrafish tissue (Faas et al., 2012).
The OME project also subscribes to the phi-
losophy that data must be FAIR: findable, acces-
sible, interoperable and reusable
(Wilkinson et al., 2016). It does this as follows:
i) data are made findable by hosting them online
and providing links to the papers the data have
been used in; ii) data are made accessible
through an open API (application programming
interface) and the availability of highly curated
metadata; iii) data are made interoperable via
the Bio-Formats software, which allows more
than 150 proprietary imaging file formats to be
converted into a variety of open formats using a
common vocabulary (Linkert et al., 2010); iv)
data, software and other outputs are made reus-
able under permissive open licences or through
"copyleft" licences which require the user to
release anything they derive from the resource
under the same open licence. (Alternatively, companies can pay for private access through Glencoe Software, which provides a commercially licenced version of the OME suite of tools.)
Each group who upload data to be shared
through OME’s Image Data Resource can
choose their own license for sharing their data,
although they are strongly encouraged to use
the most open of the creative commons licenses
(CC-BY or CC0). When shared in this way, these
resources open up new avenues for replication
and verification studies, methods development,
and exploratory work that leads to the genera-
tion of new hypotheses.
Data science for hypothesis
generation
A hypothesis is essentially an "educated guess"
by a researcher about what they think will hap-
pen when they do an experiment. A new
hypothesis usually comes from theoretical mod-
els or from a desire to extend previously pub-
lished experimental research. However, the
traditional process of hypothesis generation is
limited by the amount of knowledge an individual
researcher can hold in their head and the num-
ber of papers they can read each year, and it is
also susceptible to their personal biases
(van Helden, 2013).
In contrast, machine learning techniques such
as text mining of published abstracts or elec-
tronic health records (Oquendo et al., 2012), or
exploratory meta-analyses of datasets pooled
from laboratories around the world, can be used
for automated, reproducible and transparent
hypothesis generation. For example, teams at
IBM Research have mined 100,000 academic
papers to identify new protein kinases that inter-
act with a protein tumour suppressor (Spangler, 2014) and predicted hospital readmissions from the electronic health records of 5,000 patients (Xiao et al., 2018). However, the
availability of datasets, ideally datasets that are
FAIR, is a prerequisite for automated hypothesis
generation (Hall and Pesenti, 2017).
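The essence of such text mining can be sketched with a toy co-occurrence count: how often is each gene mentioned in the same abstract as a phenotype of interest? The three "abstracts", gene names and phenotype below are invented; real systems operate on hundreds of thousands of papers with far richer natural language processing.

```python
# Toy sketch of literature-scale hypothesis generation via co-occurrence
# counting. The corpus and gene names are invented for illustration.
from collections import Counter
import re

abstracts = [
    "GENE1 expression rises under drought stress in wheat seedlings.",
    "Drought stress tolerance is unaffected by GENE2 knockout.",
    "GENE1 and GENE3 are co-expressed under drought stress conditions.",
]
genes = ["GENE1", "GENE2", "GENE3"]
phenotype = "drought"

cooccurrence = Counter()
for text in abstracts:
    tokens = set(re.findall(r"\w+", text.lower()))
    if phenotype in tokens:
        for g in genes:
            if g.lower() in tokens:
                cooccurrence[g] += 1

# Genes most often mentioned alongside the phenotype: candidate hypotheses.
print(cooccurrence.most_common())
```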
The challenges of translating
theory into practice
When we use data science techniques to design
and analyse experiments, we need to ensure
that the techniques we use are transparent and
interpretable. And when we use robot scientists
to design, perform and analyse experiments, we
need to ensure that science continues to explore
a broad range of scientific questions. Other chal-
lenges include avoiding positive feedback loops
and algorithmic bias, equipping scientists with
the skills they need to thrive in this new multidis-
ciplinary environment, and ensuring that scien-
tists in the global south are not left behind. We
discuss all these points in more detail below.
Interpreting experimental outcomes
When data science is used to make decisions
about how to perform an experiment, we need
to ensure that scientists calibrate their level of
trust appropriately. On one hand, biologists will
need to relinquish some level of control and to
trust the computer program to make important
decisions for them. On the other hand, we must
make sure that scientists do not trust the algo-
rithms so much that they stop thinking critically
about the outcomes of their experiments and
end up misinterpreting their results. We also
want to ensure that scientists remain creative
and open to serendipitous discoveries.
We discussed above the importance of hav-
ing a formal (machine-readable) language that
can be used to describe both scientific ideas and
experimental protocols to robots. However, it is
equally important that the results and conclu-
sions of these experiments are expressed in a
human-understandable format. Ultimately, the
goal of science is not just to be able to predict
natural phenomena, but also to give humans a
deeper insight into the mechanisms driving the
observed processes. Some machine learning
methods, such as deep learning, while excellent
for predicting an outcome, suffer from a lack of
interpretability (Angermueller et al., 2016).
How to balance predictability and interpretabil-
ity for the human reader is an open question in
machine learning.
Positive feedback loops and algorithmic
bias
As with all applications of data science to new
disciplines, there are risks related to algorithmic
bias (Hajian et al., 2016). Recently there have
been some concerns over algorithmic bias
related to face-recognition of criminals – the
face-recognition software was more likely to
report a false-positive of a black face than a
white face due to biases in the dataset that the
software was trained on (Snow, 2017;
Buolamwini and Gebru, 2018). Societal param-
eters shape the input data that is fed into
machine learning models, and if actions are
taken on the basis of their output, these societal
biases will only be amplified.
There are parallel issues with data science for
experimental biology – for instance there are
certain research questions that are popular
within a community through accidents of history.
Many people study model organisms such as
roundworms and fruit flies because early genet-
ics researchers studied them, and now there are
more experimental tools that have been tried
and tested on them – a positive feedback loop
(Stoeger et al., 2018).
We need to be careful to ensure that any
attempt to design experiments has the correct
balance between exploring new research ideas
and exploiting the existing data and experimen-
tal tools available in well-established sub-
disciplines.
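This explore/exploit trade-off is the classic multi-armed bandit problem. A minimal epsilon-greedy sketch: with probability eps choose a research direction at random (exploration), otherwise choose the direction with the best observed payoff so far (exploitation). The directions and payoff values are invented.

```python
# Epsilon-greedy choice over candidate research directions (a toy bandit).
import random
from collections import Counter

def choose_direction(mean_payoff, eps, rng):
    """With probability eps explore a random direction, else exploit the best."""
    if rng.random() < eps:
        return rng.choice(sorted(mean_payoff))      # explore
    return max(mean_payoff, key=mean_payoff.get)    # exploit

# Invented "payoffs": well-established model organisms look more productive.
payoffs = {"fruit_fly": 0.9, "roundworm": 0.8, "non_model_species": 0.3}
rng = random.Random(0)
picks = Counter(choose_direction(payoffs, 0.2, rng) for _ in range(1000))
print(picks)
```

Even a small exploration rate guarantees that under-studied directions keep receiving some attention, rather than the feedback loop concentrating all effort on the current favourites.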
Implementation and training
According to Chris Mellingwood (University of
Edinburgh), some biologists are amphibious and
fluidly move between "wet" laboratories and
"dry" computing (Mellingwood, 2017). How-
ever, many biologists do not know how to code
or do not have the required mathematical back-
ground to be able to reframe their research
questions as data science problems, so it may
be difficult for biologists to find ways of using
these new tools to design experiments in their
own laboratories. They might not even realise
that there is a tool available to help them resolve
the experimental design problems that they
face. Researchers may need specialised training
in order to learn how to interact with data sci-
ence tools in an efficient and effective way.
Reproducible data analysis alone requires an understanding of version control; at least one, if not multiple, programming languages; and techniques such as testing, containerisation and continuous integration. Machine learning and
optimisation algorithms require detailed statisti-
cal knowledge along with the technical expertise
– sometimes including high performance com-
puting skills – to implement them. Requiring all
these skills along with the robotics engineering
expertise to build an automated lab is outside of
the capacity of most researchers trained by the
current system.
Infrastructure and accessibility
Even once a system is built, it needs to be con-
stantly adapted as science progresses. There is a
risk that by the time a platform is developed, it
might be out of date. Sarah Abel (University of
Iceland) discussed how university incentive sys-
tems do not always reward the types of activities
that would be required for incorporating data
science into a laboratory, such as interdisciplin-
ary collaborations or maintenance of long-term
infrastructure.
Furthermore, due to the burden of develop-
ing and maintaining the infrastructure needed
for this new approach to science, some research-
ers may be left behind. Louise Bezuidenhout
(University of Oxford) explained that even
though one of the goals of "data science for
experimental design" is to have open and
"accessible" data available around the world,
scientists in the global south might not have
access to computing resources needed for this
approach (Bezuidenhout et al., 2017). There-
fore, we need to consider how the benefits of
data science and laboratory automation techni-
ques are felt around the world.
Augmentation, not automation
As we discuss the role of data science in the
cycle of research, we need to be aware that
these technologies should be used to augment,
not replace, human researchers. These new tools
will release researchers from the tasks that a
machine can do well, giving them time and
space to work on the tasks that only humans can
do. Humans are able to think "out-of-the-box", while the behaviour of any algorithm will inherently be restricted by its code.
Perhaps the last person you might imagine
supporting the integration of artificial and
human intelligence is Garry Kasparov, chess
grand master. Kasparov lost to the IBM super-
computer Deep Blue in 1997 but more than 20
years later he is optimistic about the potential
for machines to provide insights into how
humans see the world (Kasparov, 2017). An
example that is closer to the life sciences is the
citizen science game BrainDr, in which partici-
pants quickly assess the quality of brain imaging
scans by swiping left or right (Keshavan et al.,
2018). Over time, there were enough ratings to
train an algorithm to automatically assess the
quality of the images. The tool saves researchers
thousands of hours, permits the quality assess-
ment of very large datasets, improves the reli-
ability of the results, and is really fun!
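The underlying recipe, in which many quick human ratings become labels for an automatic classifier, can be sketched with a one-feature logistic model fitted by gradient descent. The "sharpness" feature and the pass fractions below are invented, and real quality-control models use many image features.

```python
# Sketch of the Braindr idea: crowd pass/fail ratings train a quality model.
# The (sharpness, pass-fraction) pairs are invented toy data.
import math

ratings = [(0.1, 0.0), (0.2, 0.1), (0.4, 0.3), (0.6, 0.8), (0.8, 0.9), (0.9, 1.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0
for _ in range(5000):  # plain batch gradient descent on the log-loss
    gw = gb = 0.0
    for x, y in ratings:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    w -= 0.5 * gw
    b -= 0.5 * gb

# Sharper images should now receive a higher predicted pass probability.
print(round(sigmoid(w * 0.9 + b), 2), round(sigmoid(w * 0.1 + b), 2))
```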
So where does this leave the biologist of the
future? Experimentalists can continue to do cre-
ative research by, for example, designing new
protocols that enable them to study new phe-
nomena and measure new variables. There are
also many experimental protocols that are per-
formed so rarely that it would be inefficient to
automate them. However, humans will not need to carry out standard protocols, such as purifying DNA, but they might still need to know how
to perform various specialised tasks, such as dis-
secting specimens. They will also need to con-
stantly update the robotic platform to
incorporate new experimental protocols.
Finally and most crucially, the biologist of the
future will need to provide feedback into the
cycle of research – providing insight into what
hypotheses are interesting to the community,
thinking deeply about how experimental results
fit into the theories proposed by the broader
Box 1. What can you do now?
Planning an experiment
In the lab of the future, we envision that experimental parameters will be chosen in a theoreti-
cally sound way, rather than through ad hoc human decision making. There are already plenty
of tools to help researchers plan their experiments, including tools for selecting optimal time
points for conducting an experiment (Kleyman et al., 2017; Ezer and Keir, 2018), a collection
of R packages that enable optimisation of experimental design (CRAN) and the acebayes pack-
age, which takes prior information about the system as input, and then designs experiments
that are most likely to produce the best outputs (Overstall et al., 2017).
Performing an experiment
In the future, standard molecular biology experiments will be performed by robots, and execut-
able experimental protocols will be published alongside each journal article to ensure repro-
ducibility. Although many labs do not have access to laboratory automation, there are many
associated techniques that will improve the reproducibility of research. For instance, systems
like Protocols.io can help researchers to describe protocols in unambiguous ways that can be
easily understood by other researchers. Sharing laboratory know-how, using tools such as
OpenWetWare, will also enable a tighter feedback loop between performing and planning
experiments.
Analysing experimental data
In a few years, we hope that many more data formats and pipelines will be standardised, repro-
ducible and open-access. For researchers who are most comfortable in a wet-lab environment,
the article "Five selfish reasons to work reproducibly" makes a strong case for learning how to
code (Markowetz, 2015). Jupyter notebooks are an easy way to share data analyses with
embedded text descriptions, executable code snippets, and figures. Workshops run by The
Carpentries are a good way to learn software skills such as the unix shell, version control with
git, and programming in Python or R or domain specific techniques for data wrangling, analysis
and visualisation.
Sharing your work
For anyone keen to know more about archiving their data, preprints, open access and the pre-
registration of studies, we recommend the "Rainbow of Open Science Practices" (Kramer and
Bosman, 2018) and Rachael Ainsworth’s slides from our workshop (Ainsworth, 2018).
Generating new hypotheses
Pooling studies across laboratories and textual analysis of publications will help identify scien-
tific questions worth studying. The beginning of a new cycle of research might start with an
automated search of the literature for similar research (Extance, 2018) with tools such as
Semantic Scholar from the Allen Institute for Artificial Intelligence. Alternatively, you could
search for new data to investigate using Google’s Dataset Search, or more specific resources
from the European Bioinformatics Institute or National Institute of Mental Health Data Archive.
community, and finding innovative connections
across disciplinary boundaries. Essentially, they
will be focused on the varied and interesting
parts of science, rather than the mundane and
repetitive parts.
Acknowledgments
We thank the speakers and attendees at the
Data Science for Experimental Design Work-
shop, and Anneca York and the Events Team at
the Alan Turing Institute.
Daphne Ezer is at the Alan Turing Institute, London,
and in the Department of Statistics, University of
Warwick, Coventry, United Kingdom
dezer@turing.ac.uk
http://orcid.org/0000-0002-1685-6909
Kirstie Whitaker is at the Alan Turing Institute,
London, and in the Department of Psychiatry,
University of Cambridge, Cambridge, United Kingdom
https://orcid.org/0000-0001-8498-4059
Author contributions: Daphne Ezer, Kirstie Whitaker,
Conceptualization, Writing—original draft, Writing—
review and editing
Competing interests: The authors declare that no
competing interests exist.
Published 06 March 2019
Funding
Engineering and Physical Sciences Research Council (EP/S001360/1): Daphne Ezer
Alan Turing Institute (TU/A/000017): Daphne Ezer, Kirstie Whitaker
The funders had no role in study design, data collection
and interpretation, or the decision to submit the work
for publication.
References
Abbott BP, Abbott R, Abbott TD, Abernathy MR,
Acernese F, Ackley K, Adams C, Adams T, Addesso P,
Adhikari RX, Adya VB, Affeldt C, Agathos M,
Agatsuma K, Aggarwal N, Aguiar OD, Aiello L, Ain A,
Ajith P, Allen B, et al. 2016. Observation of
gravitational waves from a binary black hole merger.
Physical Review Letters 116:061102. DOI: https://doi.org/10.1103/PhysRevLett.116.061102,
PMID: 26918975
Ainsworth R. 2018. Reproducibility and open science.
Data Science for Experimental Design (DSED).
DOI: https://doi.org/10.5281/zenodo.1464853
Aitken S, Akman OE. 2013. Nested sampling for
parameter inference in systems biology: application to
an exemplar circadian model. BMC Systems Biology 7:
72. DOI: https://doi.org/10.1186/1752-0509-7-72,
PMID: 23899119
Angermueller C, Pärnamaa T, Parts L, Stegle O. 2016.
Deep learning for computational biology. Molecular
Systems Biology 12:878. DOI: https://doi.org/10.15252/msb.20156651, PMID: 27474269
Beaulieu-Jones B, Greene C. 2017. Reproducibility:
automated. https://elifesciences.org/labs/e623676c/
reproducibility-automated [Accessed February 26,
2019].
Bezuidenhout L, Kelly AH, Leonelli S, Rappert B. 2017.
‘$100 Is Not Much To You’: Open Science and
neglected accessibilities for scientific research in
Africa. Critical Public Health 27:39–49. DOI: https://
doi.org/10.1080/09581596.2016.1252832
Buolamwini J, Gebru T. 2018. Gender shades:
intersectional accuracy disparities in commercial
gender classification (PMLR 81:77-91). http://
proceedings.mlr.press/v81/buolamwini18a/
buolamwini18a.pdf [Accessed February 26, 2019].
Burel JM, Besson S, Blackburn C, Carroll M, Ferguson
RK, Flynn H, Gillen K, Leigh R, Li S, Lindner D, Linkert
M, Moore WJ, Ramalingam B, Rozbicki E, Tarkowska
A, Walczysko P, Allan C, Moore J, Swedlow JR. 2015.
Publishing and sharing multi-dimensional image data
with OMERO. Mammalian Genome 26:441–447.
DOI: https://doi.org/10.1007/s00335-015-9587-6,
PMID: 26223880
Dette H, Biedermann S. 2003. Robust and efficient
designs for the Michaelis–Menten model. Journal of
the American Statistical Association 98:679–686.
DOI: https://doi.org/10.1198/016214503000000585
Doherty K. 2017. Optimisation and landscape analysis
of computational biology models: a case study. In:
Proceedings of the Genetic and Evolutionary
Computation Conference Companion (GECCO ’17)
1644–1651. DOI: https://doi.org/10.1145/3067695.
3084609
Extance A. 2018. How AI technology can tame the
scientific literature. Nature 561:273–274. DOI: https://
doi.org/10.1038/d41586-018-06617-5,
PMID: 30202054
Ezer D, Keir JC. 2018. Selection of time points for
costly experiments: a comparison between human
intuition and computer-aided experimental design.
bioRxiv. DOI: https://doi.org/10.1101/301796
Faas FG, Avramut MC, van den Berg BM, Mommaas
AM, Koster AJ, Ravelli RB. 2012. Virtual nanoscopy:
generation of ultra-large high resolution electron
microscopy maps. Journal of Cell Biology 198:457–
469. DOI: https://doi.org/10.1083/jcb.201201140,
PMID: 22869601
Fell T, Ward S, Gershater M, Watson M, Crane P,
Wiederhold R. 2018. Computer-Aided biology. https://
static1.squarespace.com/static/
5af46322620b851d41f3f64f/t/
5bb1d987e5e5f08a8c7fb24a/1538383791006/
Computer_Aided_Biology_Synthace_10_18.pdf
[Accessed February 26, 2019].
Hajian S, Bonchi F, Castillo C. 2016. Algorithmic bias:
from discrimination discovery to Fairness-Aware data
mining part 1 & 2. Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining. DOI: https://doi.org/10.
1145/2939672.2945386
Hall W, Pesenti J. 2017. Growing the artificial
intelligence industry in the UK. https://www.gov.uk/
government/publications/growing-the-artificial-
intelligence-industry-in-the-uk [Accessed February 26,
2019].
Ioannidis JP, Munafò MR, Fusar-Poli P, Nosek BA,
David SP. 2014. Publication and other reporting biases
in cognitive sciences: detection, prevalence, and
prevention. Trends in Cognitive Sciences 18:235–241.
DOI: https://doi.org/10.1016/j.tics.2014.02.010,
PMID: 24656991
Kasparov G. 2017. Deep Thinking: Where Machine
Intelligence Ends and Human Creativity Begins.
London: John Murray.
Keshavan A, Yeatman J, Rokem A. 2018. Combining
citizen science and deep learning to amplify expertise
in neuroimaging. bioRxiv. DOI: https://doi.org/10.1101/363382
King RD, Rowland J, Aubrey W, Liakata M, Markham
M, Soldatova LN, Whelan KE, Clare A, Young M,
Sparkes A, Oliver SG, Pir P. 2009. The robot scientist
Adam. Computer 42:46–54. DOI: https://doi.org/10.
1109/MC.2009.270
King RD, Schuler Costa V, Mellingwood C, Soldatova
LN. 2018. Automating sciences: philosophical and
social dimensions. IEEE Technology and Society
Magazine 37:40–46. DOI: https://doi.org/10.1109/
MTS.2018.2795097
Kleyman M, Sefer E, Nicola T, Espinoza C, Chhabra D,
Hagood JS, Kaminski N, Ambalavanan N, Bar-Joseph
Z. 2017. Selecting the most appropriate time points to
profile in high-throughput studies. eLife 6:e18541.
DOI: https://doi.org/10.7554/eLife.18541, PMID: 28124972
Kramer B, Bosman J. 2018. Rainbow of open science
practices. Zenodo.DOI: https://doi.org/10.5281/
zenodo.1147025
Krizhevsky A, Sutskever I, Hinton GE. 2012. ImageNet
classification with deep convolutional neural
networks. In: Weinberger KQ, Pereira F, Burges CJC,
Bottou L (Eds). Advances in Neural Information
Processing Systems 25. Curran Associates, Inc. p. 1097–1105.
Libbrecht MW, Noble WS. 2015. Machine learning
applications in genetics and genomics. Nature Reviews
Genetics 16:321–332. DOI: https://doi.org/10.1038/
nrg3920,PMID: 25948244
Linkert M, Rueden CT, Allan C, Burel JM, Moore W,
Patterson A, Loranger B, Moore J, Neves C,
Macdonald D, Tarkowska A, Sticco C, Hill E, Rossner
M, Eliceiri KW, Swedlow JR. 2010. Metadata matters:
access to image data in the real world. Journal of Cell
Biology 189:777–782. DOI: https://doi.org/10.1083/
jcb.201004104,PMID: 20513764
Markowetz F. 2015. Five selfish reasons to work
reproducibly. Genome Biology 16:274. DOI: https://
doi.org/10.1186/s13059-015-0850-7,PMID: 26646147
Mellingwood C. 2017. What about the frogs?:
reflections on ’Community and Identity in the Techno-
Sciences’ workshop. https://blogs.sps.ed.ac.uk/
engineering-life/2017/03/30/what-about-the-frogs-
reflections-on-community-and-identity-in-the-techno-
sciences-workshop/ [Accessed February 26, 2019].
Nirschl JJ, Janowczyk A, Peyster EG, Frank R,
Margulies KB, Feldman MD, Madabhushi A. 2018. A
deep-learning classifier identifies patients with clinical
heart failure using whole-slide images of H&E tissue.
PloS One 13:e0192726. DOI: https://doi.org/10.1371/
journal.pone.0192726,PMID: 29614076
Oquendo MA, Baca-Garcia E, Artés-Rodríguez A,
Perez-Cruz F, Galfalvy HC, Blasco-Fontecilla H,
Madigan D, Duan N. 2012. Machine learning and data
mining: strategies for hypothesis generation.
Molecular Psychiatry 17:956–959. DOI: https://doi.org/10.1038/mp.2011.173, PMID: 22230882
Overstall A, Woods D, Adamou M. 2017. acebayes:
an R package for Bayesian optimal design of
experiments via approximate coordinate exchange.
arXiv. https://arxiv.org/abs/1705.08096
Overstall A, Woods D, Martin KJ. 2018. Bayesian
prediction for physical models with application to the
optimization of the synthesis of pharmaceutical
products using chemical kinetics. Computational
Statistics & Data Analysis. https://eprints.soton.ac.uk/425529/ [Accessed February 26, 2019].
Peng RD. 2011. Reproducible research in
computational science. Science 334:1226–1227.
DOI: https://doi.org/10.1126/science.1213847,
PMID: 22144613
Snow J. 2017. Amazon’s face recognition falsely
matched 28 members of congress with mugshots.
https://www.aclu.org/blog/privacy-technology/
surveillance-technologies/amazons-face-recognition-
falsely-matched-28 [Accessed February 26, 2019].
Spangler S. 2014. Automated hypothesis generation
based on mining scientific literature. In: Proceedings of
the 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. DOI: https://
doi.org/10.1145/2623330.2623667
Sparkes A, Aubrey W, Byrne E, Clare A, Khan MN,
Liakata M, Markham M, Rowland J, Soldatova LN,
Whelan KE, Young M, King RD. 2010. Towards robot
scientists for autonomous scientific discovery.
Automated Experimentation 2:1. DOI: https://doi.org/
10.1186/1759-4499-2-1,PMID: 20119518
Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA.
2018. Large-scale investigation of the reasons why
potentially important genes are ignored. PLOS Biology
16:e2006643. DOI: https://doi.org/10.1371/journal.
pbio.2006643,PMID: 30226837
Sverchkov Y, Craven M. 2017. A review of active
learning approaches to experimental design for
uncovering biological networks. PLOS Computational
Biology 13:e1005466. DOI: https://doi.org/10.1371/
journal.pcbi.1005466,PMID: 28570593
Tompsett DM, Biedermann S, Liu W. 2018.
Simultaneous confidence sets for several effective
doses. Biometrical Journal 60:703–720. DOI: https://
doi.org/10.1002/bimj.201700161,PMID: 29611627
van Helden P. 2013. Data-driven hypotheses. EMBO
Reports 14:104. DOI: https://doi.org/10.1038/embor.
2012.207,PMID: 23258258
Wilkinson MD, Dumontier M, Aalbersberg IJ,
Appleton G, Axton M, Baak A, Blomberg N, Boiten
JW, da Silva Santos LB, Bourne PE, Bouwman J,
Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O,
Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A,
et al. 2016. The FAIR guiding principles for scientific
data management and stewardship. Scientific Data 3:
160018. DOI: https://doi.org/10.1038/sdata.2016.18,
PMID: 26978244
Williams K, Bilsland E, Sparkes A, Aubrey W, Young
M, Soldatova LN, De Grave K, Ramon J, de Clare M,
Sirawaraporn W, Oliver SG, King RD. 2015. Cheaper
faster drug development validated by the
repositioning of drugs against neglected tropical
diseases. Journal of the Royal Society Interface 12:
20141289. DOI: https://doi.org/10.1098/rsif.2014.
1289,PMID: 25652463
Xiao C, Ma T, Dieng AB, Blei DM, Wang F. 2018.
Readmission prediction via deep contextual
embedding of clinical concepts. PLOS ONE 13:
e0195024. DOI: https://doi.org/10.1371/journal.pone.
0195024
... This work is an example of integration of data science, information science and domain knowledge for compiling experiments based on a stated problem and then running an analysis for hypothesis generation. The incorporation of data science in the scientific life cycle has been broadly depicted by other authors [15] to indicate how data science could be used to generate hypotheses, as well as design and analyse experiments. Hypothesis-driven research is deductive and corresponds to the traditional scientific method. ...
... Given the staggering amount of data at our disposal [13], it does not seem wise to keep loading Science with more experiments when we are unable first to digest the ones we collectively own. We can find already articles in the literature pointing out in this direction [15] and additional examples of successful meta-analysis using different sources of information [34]. ...
Article
Information and data science algorithms were combined to predict the outcome of an experiment in chemical engineering. Using the Scientific Method workflow, we started the journey with the formulation of a specific question. At the research stage, the common process of querying and reading articles on scientific databases was substituted by a systematic review with a built-in recursive data mining method. This procedure identifies a specific community of knowledge with the key concepts and experiments that are necessary to address the formulated question. A small subset of relevant articles from a very specific topic among thousands of papers was identified while assuring the loss of the least amount of information through the process. The secondary dataset was bigger than a common individual study. The process revealed the main ideas currently under study and identified optimal synthesis conditions to produce a chemical substance. Once the research step was finished, the experimental information was compiled and prepared for meta-analysis using a supervised learning algorithm. This is a hypothesis generation stage whereby the secondary dataset was transformed into experimental knowledge about a particular chemical reaction. Finally, the predicted sets of optimal conditions to produce the desired chemical compound were validated in the laboratory.
... Hypothesis testing and p-values. First, we begin with the integration of data science into the scientific life cycle [32] and connect this to effective data visualization and storytelling through graphical representations. We travel through time from John Snow's maps of the 1854 Broad Street Pump cholera outbreak [33] to the present day with examples of SARS-CoV-2 graphical summaries of research that were not the most effective in conveying important data and in turn, public health action [34][35][36][37]. ...
Article
Full-text available
Much guidance on statistical training in STEM fields has been focused largely on the undergraduate cohort, with graduate education often being absent from the equation. Training in quantitative methods and reasoning is critical for graduate students in biomedical and science programs to foster reproducible and responsible research practices. We argue that graduate student education should more center around fundamental reasoning and integration skills rather than mainly on listing 1 statistical test method after the other without conveying the bigger context picture or critical argumentation skills that will enable student to improve research integrity through rigorous practice. Herein, we describe the approach we take in a quantitative reasoning course in the R3 program at the Johns Hopkins Bloomberg School of Public Health, with an error-focused lens, based on visualization and communication competencies. Specifically, we take this perspective stemming from the discussed causes of irreproducibility and apply it specifically to the many aspects of good statistical practice in science, ranging from experimental design to data collection and analysis, and conclusions drawn from the data. We also provide tips and guidelines for the implementation and adaptation of our course material to various graduate biomedical and STEM science programs.
... Over the past decade, advancements in scientific fields such as life science and medical science have caused an exponential growth in data generation (Mallappallil et al., 2020;Ezer and Whitaker 2019). Various datasets such as medical imaging data, which is derived from a patient's diagnostics reports (Munn and Jordan, 2011;He et al., 2017), genomics data, which is derived from next-generation sequencing during cancer and other genomics studies (Munn and Jordan, 2011;He et al., 2017), or pharmaceutical datasets that provide biochemical properties of small molecules (Hassan et al., 2006) are not only huge but complex as well. ...
... Data science has important implications in achieving the United Nation's Sustainable Development Goals (SDGs) (18). The SDGs highlight that achieving health and well-being for all requires harnessing data and creating new knowledge and innovations in health and other sectors (19). For example, pattern recognition methodologies and tools allow identifying a segment of the population that might be at high risk for developing chronic and non-communicable diseases. ...
Article
Full-text available
Technological advances now make it possible to generate diverse, complex and varying sizes of data in a wide range of applications from business to engineering to medicine. In the health sciences, in particular, data are being produced at an unprecedented rate across the full spectrum of scientific inquiry spanning basic biology, clinical medicine, public health and health care systems. Leveraging these data can accelerate scientific advances, health discovery and innovations. However, data are just the raw material required to generate new knowledge, not knowledge on its own, as a pile of bricks would not be mistaken for a building. In order to solve complex scientific problems, appropriate methods, tools and technologies must be integrated with domain knowledge expertise to generate and analyze big data. This integrated interdisciplinary approach is what has become to be widely known as data science. Although the discipline of data science has been rapidly evolving over the past couple of decades in resource-rich countries, the situation is bleak in resource-limited settings such as most countries in Africa primarily due to lack of well-trained data scientists. In this paper, we highlight a roadmap for building capacity in health data science in Africa to help spur health discovery and innovation, and propose a sustainable potential solution consisting of three key activities: a graduate-level training, faculty development, and stakeholder engagement. We also outline potential challenges and mitigating strategies.
... Over the past decade, advancements in scientific fields such as life science and medical science have caused an exponential growth in data generation (Mallappallil et al., 2020;Ezer and Whitaker 2019). Various datasets such as medical imaging data, which is derived from a patient's diagnostics reports (Munn and Jordan, 2011;He et al., 2017), genomics data, which is derived from next-generation sequencing during cancer and other genomics studies (Munn and Jordan, 2011;He et al., 2017), or pharmaceutical datasets that provide biochemical properties of small molecules (Hassan et al., 2006) are not only huge but complex as well. ...
Chapter
Bioinformatics and chemoinformatics are rapidly evolving fields that demand sophisticated algorithms and methods to handle complex data such as genomics data, medical imaging, and electronic medical records. These data can be categorized as big data. During the past few decades, computational methods have impacted biomedical research dramatically. Whether it is an initial screening and prediction of a potential drug candidate or finding a cancer-related pathway, the computational method has proven very useful. Therefore bioinformatics algorithms have been adopted widely as an effective alternative to the laborious time-consuming classical methods in biological research. However, the desired technical and programming experience needed to apply these methods is a big hurdle for the research community to include these remarkable methods in their research. In the recent past, Python has emerged as a very attractive programming language in the field of bioinformatics and chemoinformatics because of its fast learning curve and easy integration in various data analysis pipelines. The main focus of this chapter is to introduce Python to an audience that is not very familiar with the programming language and also provide a hands-on tutorial for those who want to adopt this powerful language in their chemoinformatics research.
... Over the past decade, advancements in scientific fields such as life science and medical science have caused an exponential growth in data generation (Mallappallil et al., 2020;Ezer and Whitaker 2019). Various datasets such as medical imaging data, which is derived from a patient's diagnostics reports (Munn and Jordan, 2011;He et al., 2017), genomics data, which is derived from next-generation sequencing during cancer and other genomics studies (Munn and Jordan, 2011;He et al., 2017), or pharmaceutical datasets that provide biochemical properties of small molecules (Hassan et al., 2006) are not only huge but complex as well. ...
Chapter
This chapter emphasizes the important tools of cheminformatics that are proven to be well organized for pharmaceutical data analysis and applications. Cheminformatics acts as an interface between physics, chemistry, biology, mathematics, biochemistry, statistics, and informatics. Cheminformatics refers to solving chemical and synthetic problems effectively by making use of information tools available in the wide space of the web. Each of the tools described in the chapter has been extensively discussed in many research publications but this work gives a complete overview of the major tools and techniques that are able to aid in the study of many important applications such as drug discovery processes and other biochemical applications. It connects molecular modeling with cheminformatics very well. This chapter discusses briefly almost all aspects of molecular modeling and drug designing that utilize cheminformatics tools, ranging from quantitative structure activity relationship analysis, similarity searching to deriving pharmacophores, and shortlisting datasets to forming combinatorial libraries. Cheminformatics aids in designing reactions and possible synthetic routes, structural elucidation of molecules isolated from various biological, and environmental sources or from reaction pathways.
... Over the past decade, advancements in scientific fields such as life science and medical science have caused an exponential growth in data generation (Mallappallil et al., 2020;Ezer and Whitaker 2019). Various datasets such as medical imaging data, which is derived from a patient's diagnostics reports (Munn and Jordan, 2011;He et al., 2017), genomics data, which is derived from next-generation sequencing during cancer and other genomics studies (Munn and Jordan, 2011;He et al., 2017), or pharmaceutical datasets that provide biochemical properties of small molecules (Hassan et al., 2006) are not only huge but complex as well. ...
Chapter
Structure-based drug design (SBDD) techniques are impacting the discovery of new drugs. Computational SBDD strategies are accelerating the drug discovery process and are regularly used in pharmaceutical and biotechnology industries. The advances in genomics, proteomics, and structural biology have led to the identification of many novel drug targets. In this chapter, we focus on the various techniques of SBDD, their principles, applications, and tools. Molecular docking methods and algorithms, high-throughput virtual screening, and de novo ligand design methods are discussed. Biomolecular simulation is crucial in understanding the dynamic behavior of drug–target complexes. The efficient prediction of the pharmacokinetic profile of a compound helps in the prefiltering of compounds not having drug-like properties. SBDD strategies have resulted in the development of many approved drugs.
... Over the past decade, advancements in scientific fields such as life science and medical science have caused an exponential growth in data generation (Mallappallil et al., 2020;Ezer and Whitaker 2019). Various datasets such as medical imaging data, which is derived from a patient's diagnostics reports (Munn and Jordan, 2011;He et al., 2017), genomics data, which is derived from next-generation sequencing during cancer and other genomics studies (Munn and Jordan, 2011;He et al., 2017), or pharmaceutical datasets that provide biochemical properties of small molecules (Hassan et al., 2006) are not only huge but complex as well. ...
Book
Full-text available
Chemoinformatics and Bioinformatics in the Pharmaceutical Sciences brings together two very important fields in pharmaceutical sciences that have been mostly seen as diverging from each other: chemoinformatics and bioinformatics. As developing drugs is an expensive and lengthy process, technology can improve the cost, efficiency and speed at which new drugs can be discovered and tested. This book presents some of the growing advancements of technology in the field of drug development and how the computational approaches explained here can reduce the financial and experimental burden of the drug discovery process. This book will be useful to pharmaceutical science researchers and students who need basic knowledge of computational techniques relevant to their projects. Bioscientists, bioinformaticians, computational scientists, and other stakeholders from industry and academia will also find this book helpful.
Article
Motivation The integration of vast, complex biological data with computational models offers profound insights and predictive accuracy. Yet, such models face challenges: poor generalization and limited labeled data. Results To overcome these difficulties in binary classification tasks, we developed the Method for Optimal Classification by Aggregation (MOCA) algorithm, which addresses the problem of generalization by virtue of being an ensemble learning method and can be used in problems with limited or no labeled data. We developed both an unsupervised (uMOCA) and a supervised (sMOCA) variant of MOCA. For uMOCA, we show how to infer the MOCA weights in an unsupervised way, which are optimal under the assumption of class-conditioned independent classifier predictions. When it is possible to use labels, sMOCA uses empirically computed MOCA weights. We demonstrate the performance of uMOCA and sMOCA using simulated data as well as actual data previously used in Dialogue on Reverse Engineering and Methods (DREAM) challenges. We also propose an application of sMOCA for transfer learning where we use pre-trained computational models from a domain where labeled data are abundant and apply them to a different domain with less abundant labeled data. Availability and implementation GitHub repository, https://github.com/robert-vogel/moca.
Article
Improvements in the number and resolution of Earth- and satellite-based sensors coupled with finer-resolution models have resulted in an explosion in the volume of Earth science data. This data-rich environment is changing the practice of Earth science, extending it beyond discovery and applied science to new realms. This Review highlights recent big data applications in three subdisciplines—hydrology, oceanography, and atmospheric science. We illustrate how big data relate to contemporary challenges in science: replicability and reproducibility and the transition from raw data to information products. Big data provide unprecedented opportunities to enhance our understanding of Earth’s complex patterns and interactions. The emergence of digital twins enables us to learn from the past, understand the current state, and improve the accuracy of future predictions.
Preprint
Full-text available
Motivation The design of an experiment influences both what a researcher can measure, as well as how much confidence can be placed in the results. As such, it is vitally important that experimental design decisions do not systematically bias research outcomes. At the same time, making optimal design decisions can produce results leading to statistically stronger conclusions. Deciding where and when to sample are among the most critical aspects of many experimental designs; for example, we might have to choose the time points at which to measure some quantity in a time series experiment. Choosing times which are too far apart could result in missing short bursts of activity. On the other hand, there may be time points which provide very little information regarding the overall behaviour of the quantity in question. Results In this study, we design a survey to analyse how biologists use previous research outcomes to inform their decisions about which time points to sample in subsequent experiments. We then determine how the choice of time points affects the type of perturbations in gene expression that can be observed. Finally, we present our main result: NITPicker, a computational strategy for selecting optimal time points (or spatial points along a single axis), that eliminates some of the biases caused by human decision-making while maximising information about the shape of the underlying curves, utilising ideas from the field of functional data analysis. Availability NITPicker is available on GIThub ( https://github.com/ezer/NITPicker ).
Article
Full-text available
Quality control in industrial processes is increasingly making use of prior scientific knowledge, often encoded in physical models that require numerical approximation. Statistical prediction, and subsequent optimization, is key to ensuring the process output meets a specification target. However, the numerical expense of approximating the models poses computational challenges to the identification of combinations of the process factors where there is confidence in the quality of the response. Recent work in Bayesian computation and statistical approximation (emulation) of expensive computational models is exploited to develop a novel strategy for optimizing the posterior probability of a process meeting specification. The ensuing methodology is motivated by, and demonstrated on, a chemical synthesis process to manufacture a pharmaceutical product, within which an initial set of substances evolve according to chemical reactions, under certain process conditions, into a series of new substances. One of these substances is a target pharmaceutical product and two are unwanted by-products. The aim is to determine the combinations of process conditions and amounts of initial substances that maximize the probability of obtaining sufficient target pharmaceutical product whilst ensuring unwanted by-products do not exceed a given level. The relationship between the factors and amounts of substances of interest is theoretically described by the solution to a system of ordinary differential equations incorporating temperature dependence. Using data from a small experiment, it is shown how the methodology can approximate the multivariate posterior predictive distribution of the pharmaceutical target and by-products, and therefore identify suitable operating values.
Article
Full-text available
Biomedical research has been previously reported to primarily focus on a minority of all known genes. Here, we demonstrate that these differences in attention can be explained, to a large extent, exclusively from a small set of identifiable chemical, physical, and biological properties of genes. Together with knowledge about homologous genes from model organisms, these features allow us to accurately predict the number of publications on individual human genes, the year of their first report, the levels of funding awarded by the National Institutes of Health (NIH), and the development of drugs against disease-associated genes. By explicitly identifying the reasons for gene-specific bias and performing a meta-analysis of existing computational and experimental knowledge bases, we describe gene-specific strategies for the identification of important but hitherto ignored genes that can open novel directions for future investigation.
Preprint
Full-text available
Research in many fields has become increasingly reliant on large and complex datasets. "Big Data" holds untold promise to rapidly advance science by tackling new questions that cannot be answered with smaller datasets. While powerful, research with Big Data poses unique challenges, as many standard lab protocols rely on experts examining each one of the samples. This is not feasible for large-scale datasets because manual approaches are time-consuming and hence difficult to scale. Meanwhile, automated approaches lack the accuracy of examination by highly trained scientists and this may introduce major errors, sources of noise, and unforeseen biases into these large and complex datasets. Our proposed solution is to 1) start with a small, expertly labelled dataset, 2) amplify labels through web-based tools that engage citizen scientists, and 3) train machine learning on amplified labels to emulate expert decision making. As a proof of concept, we developed a system to quality control a large dataset of three-dimensional magnetic resonance images (MRI) of human brains. An initial dataset of 200 brain images labeled by experts were amplified by citizen scientists to label 722 brains, with over 80,000 ratings done through a simple web interface. A deep learning algorithm was then trained to predict data quality, based on a combination of the citizen scientist labels that accounts for differences in the quality of classification by different citizen scientists. In an ROC analysis (on left out test data), the deep learning network performed as well as a state-of-the-art, specialized algorithm (MRIQC) for quality control of T1-weighted images, each with an area under the curve of 0.99. Finally, as a specific practical application of the method, we explore how brain image quality relates to the replicability of a well established relationship between brain volume and age over development. 
Combining citizen science and deep learning can generalize and scale expert decision making; this is particularly important in emerging disciplines where specialized, automated tools do not already exist.
Article
Objective: Hospital readmission imposes substantial costs every year. Many hospital readmissions are avoidable, and excessive readmissions can also harm patients. Accurate prediction of hospital readmission can effectively help reduce readmission risk. However, the complex relationship between readmission and potential risk factors makes readmission prediction a difficult task. The main goal of this paper is to explore deep learning models to distill such complex relationships and make accurate predictions.
Materials and methods: We propose CONTENT, a deep model that predicts hospital readmissions by learning interpretable patient representations, capturing both local and global contexts from patient Electronic Health Records (EHR) through a hybrid Topic Recurrent Neural Network (TopicRNN) model. The experiment was conducted using the EHR of a real-world Congestive Heart Failure (CHF) cohort of 5,393 patients.
Results: The proposed model outperforms state-of-the-art methods in readmission prediction (e.g. ROC-AUC of 0.6103 ± 0.0130 vs. 0.5998 ± 0.0124 for the second-best method). The derived patient representations were further utilized for patient phenotyping. The learned phenotypes provide a more precise understanding of readmission risks.
Discussion: Embedding both local and global context in patient representations not only improves prediction performance but also yields interpretable insights into readmission risks for heterogeneous chronic clinical conditions.
Conclusion: This is the first model of its kind to integrate the power of conventional deep neural networks and probabilistic generative models for highly interpretable deep patient representation learning. Experimental results and case studies demonstrate the improved performance and interpretability of the model.
Article
Over 26 million people worldwide suffer from heart failure annually. When the cause of heart failure cannot be identified, endomyocardial biopsy (EMB) represents the gold standard for the evaluation of disease. However, manual EMB interpretation has high inter-rater variability. Deep convolutional neural networks (CNNs) have been successfully applied to detect cancer, diabetic retinopathy, and dermatologic lesions from images. In this study, we develop a CNN classifier to detect clinical heart failure from H&E-stained whole-slide images from a total of 209 patients; 104 patients were used for training and the remaining 105 for independent testing. The CNN was able to identify patients with heart failure or severe pathology with a 99% sensitivity and 94% specificity on the test set, outperforming conventional feature-engineering approaches. Importantly, the CNN outperformed two expert pathologists by nearly 20%. Our results suggest that deep learning analytics of EMB can be used to predict cardiac outcome.
Article
As artificially intelligent tools for literature and data exploration evolve, developers seek to automate how hypotheses are generated and validated.
Article
Clark Glymour argued in 2004 that "despite a lack of public fanfare, there is mounting evidence that we are in the midst of ... a revolution - premised on the automation of scientific discovery" [1]. This paper highlights some of the philosophical and sociological dimensions that have been found empirically in work conducted with robot scientists - that is, with autonomous robotic systems for scientific discovery. Robot scientists do not supply definite answers to the discussed questions, but rather provide "proofs of concept" for various ideas. For example, it is not that robot scientists solve the realist/antirealist philosophical debate, but that when working with robot scientists one has to make a philosophical choice - in this case, to assume a realist view of science. There are still few systems for autonomous scientific discovery in existence, and it is too early to generalize and propose new theories. However, being "in the midst of ... a revolution" it is important for the research community to re-examine views pertinent to scientific discovery. This paper highlights how experience with robot scientists could inform discussions in other disciplines, from philosophy of science to computer creativity research.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.