Schumacher's Guide #3: Lifecycle management to standardize omics data
A 10-step lifecycle management system
to improve harmonization of omics data
analysis workflows
Towards data-driven decision making in precision medicine
By Axel Schumacher, PhD (2016)
Recent studies have raised concerns about the reproducibility of research results, particularly
in translational research. Findings that cannot be reproduced pose risks to pharmaceutical
R&D, causing both delays and significantly higher drug development costs, and also
jeopardize public trust in science. Here I discuss a practical solution for improving result
quality, based on a workflow lifecycle model that addresses one of the most important issues,
namely method reproducibility. This article describes how adopting a method lifecycle
management approach throughout the translational research process, to standardize and
automate omics-data processes, enhances data-driven decision making and provides the
scientific insights needed to achieve the promise of precision medicine.
Keywords: Lifecycle Management, Genomics, NGS, Omics, Data management, Data processing, Data Analysis, Workflows.
Introduction
In research, multiple causes contribute to poor or
irreproducible results, and many can ultimately
be traced to an underlying lack of frameworks of
standards and best practices. The situation is particularly
evident in the field of translational research where
pharmaceutical companies are increasingly generating
vast amounts of omics data to develop more targeted
and personalized treatments. The translational research
process is complex with a wide variety of stakeholders
who have to collaborate securely and share data
internally and externally. Much of the data is commonly
generated by different sources, partially in-house and
increasingly externally at clinical research organizations
(CROs), at multiple, geographically dispersed locations.
Companies then have to integrate these disparate and
heterogeneous data sets and draw scientific insights as
well as financially relevant conclusions. Often, individuals
with varying skill sets and resources are needed to
analyze this data. Without proper infrastructure to
ensure transparency of analytical methods, data, data
sources, and the context in which data were generated or
derived1, the reproducibility of results, their quality, and
ultimately trust throughout the translational research
value chain are compromised.
The problem: Irreproducibility
A recent study showed that the cumulative prevalence of
irreproducible preclinical research exceeds 50%,
resulting in approximately US$28B/year spent on
preclinical research that is not reproducible - in the
United States alone1. The economic impact of the
reproducibility problem is significant, in addition to the
fact that irreproducibility also has downstream effects on
the drug development pipeline and significantly
increases attrition rates. A growing literature is
beginning to identify factors that contribute to the
problems of poor reproducibility, among them:
• Managing the increasing amount of omics data challenges current R&D information infrastructures.
• Data collected from different sites and users and produced with varying data analysis workflows may result in data which cannot be integrated and compared.
• Financially critical decisions may be adversely affected by poorly curated genomics datasets that lack good quality control.
• Poor documentation of data provenance may contribute to irreproducible conclusions and lack of compliance with government regulations.
• Compartmentalized knowledge of optimal data analysis methods and workflows is inefficient and can slow down data analysis.
Study design characteristics that may introduce bias, low
statistical power and flexibility in data collection,
analysis, and reporting, are collectively termed
“researcher degrees of freedom”2,3. This freedom has
a significant impact on the datasets produced by life-science
organizations. Already, many types of poorly
curated "omics" datasets populate public and
proprietary repositories and lower the value of those
resources. Data mining is critically dependent on data
quality, an issue that, surprisingly, is still largely ignored
by many scientists, data repositories, and scientific
journals4. Properly curating the vast quantities of
scientific data produced by pharmaceutical companies is
not just important for regulatory submissions, but is also
vital for ongoing pharma research and clinical trials.
Solution: Methods lifecycle management
Traditionally, the focus on improving R&D productivity
has been at a local organizational level. However, as
more and more research activities become larger,
globally dispersed interdisciplinary efforts, companies
need to incorporate method lifecycle thinking into their
research activities. Simply put, companies need to
“operationalize” the end-to-end workflows of data
management as if they were manufacturing processes. The
implementation of workflow lifecycles that manage
omics data means having a systematic R&D approach in
place that begins with predefined objectives and
emphasizes process understanding and workflow
control, based on sound science and quality risk
management. Although standardized data management
does not guarantee success, it remains a necessary
prerequisite for R&D efficiency and high-quality data.
Researchers that are in the position of ‘end users’ with
no direct control over the data generation process,
including the choice of quality control measures, need to
be sure that they can rely on the curated data they find
in their data warehouse.
Without high-quality data, they do not have the basis for
making good decisions, resulting in poor organizational
productivity. Costs are then incurred by time spent on
reversing what data management has failed to
accomplish. In this context, it is not sufficient to use local,
site-specific standards; it is of utmost importance to:
Standardize all omics analyses throughout a multi-
department, multi-site organization.
Nowadays, data comparability is more challenging since
much research is now collaborative. For example, many
research organizations often have their DNA sequence
data generated by service providers with different
standard operating procedures (SOPs). However, only if
all data acquisition steps are identical can data be
compared accurately between studies and sites. This
situation highlights the importance for the
pharmaceutical industry to engage all stakeholders in a
dynamic, collaborative effort to standardize typical
multi-omics processes. As a consequence, each
bioinformatics pipeline should allow for the creation,
execution, and management of workflows that can
automate and standardize analyses by providing means
to record the processing order and parameter settings
for sets of different analysis activities5. Such workflow
management makes it possible to run analyses in a fully
automated manner, meaning that even new users can
start using sophisticated analyses by running a workflow
in which all analysis steps are incorporated (i.e.
assembled by a more experienced user).
In this data management framework, I define workflows
as predefined structures of connected activities that,
when run, automatically perform the set of functions
determined by the assembled activities (e.g. an R-code,
Bash-, Perl-, or Python-script). Most data scientists prefer
working with complex workflows instead of single
activities because scripts are difficult to program and
reuse6. Workflows can be saved and shared, ensuring
standardization and reproducibility within and between
research groups.
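To make this notion concrete, the following minimal sketch (in Python, with purely hypothetical activity names) chains a few toy processing functions into a fixed, reusable workflow; a production system would add provenance logging, error handling, and distributed execution on top of this pattern.

```python
# Minimal sketch of a workflow as a predefined chain of activities.
# Activity names (trim_reads, align_reads) are hypothetical placeholders,
# not references to specific tools.

from typing import Callable, List


class Activity:
    """A single workflow step: a named function plus fixed parameters."""

    def __init__(self, name: str, func: Callable, **params):
        self.name = name
        self.func = func
        self.params = params

    def run(self, data):
        return self.func(data, **self.params)


class Workflow:
    """A predefined, ordered set of activities executed automatically."""

    def __init__(self, name: str, activities: List[Activity]):
        self.name = name
        self.activities = activities

    def run(self, data):
        for activity in self.activities:
            print(f"[{self.name}] running activity: {activity.name}")
            data = activity.run(data)
        return data


# Toy activities standing in for real scripts (R, Bash, Perl, Python, ...).
def trim_reads(reads, min_quality=20):
    return [r for r in reads if r["quality"] >= min_quality]

def align_reads(reads, reference="toy_reference"):
    return [dict(r, aligned_to=reference) for r in reads]


if __name__ == "__main__":
    wf = Workflow("toy_ngs_workflow", [
        Activity("trim", trim_reads, min_quality=20),
        Activity("align", align_reads, reference="toy_reference"),
    ])
    print(wf.run([{"quality": 30}, {"quality": 10}]))
```

Because the processing order and all parameter settings are fixed inside the workflow object, saving and sharing such an object is enough to let another group rerun exactly the same analysis.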
Fig.1: A typical omics workflow with connected activities.
Several types of workflow management software exist to design
scientific workflows such as the open-source projects Galaxy, Kepler
or Taverna7,8. Most programs include a graphical user interface (GUI)
for creating workflows with connected activities. A workflow is data-
centric, and therefore the output from one or several parallel activities
serves as the input for the subsequent task. Depicted is a workflow as used in
Taverna, an open-source and domain-independent workflow
management system.
Workflow Lifecycles. The development of a workflow
lifecycle may vary in its particular specifications but
generally leads to a similar set of steps (see Fig. 2). Such
lifecycles are appropriate for core facilities, CROs,
sequencing centers, and larger pharmaceutical
companies that wish to standardize their analysis
workflows, where the application of harmonized
workflows is required and data provenance is of
paramount concern. The ability to trace workflow
components (for example different versions of a
workflow component) and the associated data is
important for almost all R&D units as they must
increasingly abide by government regulations, especially
if patient data is involved. Traceability refers here to the
complete data lifecycle that includes the data origination
points, e.g. loading of raw sequencing data, and points at
which data undergoes transformations over time (i.e.
workflow activities).
Data provenance can help organizations in several
different ways such as: a.) data loss prevention, b.)
increasing the efficiency of compliance and audits, c.)
performing root-cause analyses, and d.) ultimately better
quality data analytics. Having a systemic workflow lifecycle
in place will create value for the organization in numerous
different ways, and it forms a key strategic asset for
competition and growth.
Here, I propose a workflow lifecycle model that addresses
the most important factors that contribute to poor
reproducibility, including study design characteristics that
may introduce bias, low statistical power, and flexibility in
data collection, analysis, and reporting. The following
sections provide details on each lifecycle step.
Step 1: Workflow Design
In the design phase, the activities are assembled into a
concrete executable workflow, usually by mapping
processing steps to executable scripts and algorithms.
This process is not trivial and can take a significant amount of
time, as it includes precise definitions of the
dependencies of complex omics-data and workflow
components. Due to this complexity, only expert users
should have the responsibility to develop de novo
workflows which can be made available to other, less
experienced researchers to successfully perform their
work. A potential workaround is to use a panel of expert-
selected model workflows that represent, or are
improved versions of, tested, peer-reviewed and
established “best practices”.
Fig.2: The 7+3 steps of omics-workflow lifecycles.
The workflow lifecycle describes each state of the evolution of a
workflow from the initial design to a final result. The common scientific
workflow lifecycle foresees an analysis and learning phase after the
workflow execution9 to refine the initial design and head towards the
final result, for example by changing input parameters or exchanging a
specific algorithm.
When selecting a public or proprietary workflow
repository, the user should always make sure that the
workflows are properly documented with detailed
descriptions on how to use the model workflows and
how their individual activities work. Using such
documented workflows also saves a significant amount of
time for bioinformaticians, as it is increasingly difficult
to keep an overview of newly published scripts and
algorithms. Of course, for many researchers it is very
tempting to try out the latest method and algorithm on
their data, but organizations have to keep in mind that
changing established workflows too often may come at a
high price. In fact, when new bioinformatics tools are
released it is quite common that authors forget about
them (since the paper describing the tool has been
published). Critical software bugs are not corrected and
almost never documented. In consequence, it is not
unusual that researchers spend more time running and
debugging new software tools, such as new R scripts and
algorithms, than analyzing their data. As such, it is
preferable to rely on external experts that provide
thoroughly tested and well-documented activities to be
included in new workflows. Errors or limitations of the
workflows utilized in large pharmaceutical companies
could go undetected with possible negative effects on
R&D efficiency.
Most method lifecycles start with such best practices,
taken from an in-house or public repository (such as the
myExperiment project launched in 2007)10. Almost
always, different algorithms implement equivalent
functions but with different characteristics. Each of them
may be optimal for different input data. Hence, any
workflow design tools should include some form of
optimization of the component used for each step, as
shown in Figure 3, so as to find the best algorithm to
fulfill a given subtask.
In this way, components that have the same input and
output data type can be switched on and off for testing
of their efficiency and utility. In addition to this intra-step
component optimization, reorganization of all the steps
in the workflow, so-called topological optimization, may
also be useful. For example, the performance of the
combination of a normalization step and a filtering step
might differ depending on whether the filtering acts on
the input items before or after the data normalization.
Furthermore, appropriate data sources and hardware
requirements should also be defined to make sure that
compute-intensive scripts will not slow down the overall
workflow.
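As an illustration of these two optimization modes, the sketch below benchmarks two interchangeable toy components that share the same input and output types (component optimization) and compares two orderings of a filter and a normalization step (topology optimization). All functions and thresholds are illustrative and are not taken from any published workflow system.

```python
# Sketch of intra-step component optimization and topology optimization:
# interchangeable activities (same input/output type) are benchmarked,
# and two step orderings are compared. All functions are illustrative.

import statistics
import time


def normalize(values):
    mean = statistics.mean(values)
    return [v - mean for v in values]

def filter_outliers(values, limit=100.0):
    return [v for v in values if abs(v) < limit]

# Two candidate components implementing the same task (same I/O types).
def cluster_naive(values):
    return sorted(values)

def cluster_alternative(values):
    return sorted(set(values))


def benchmark(pipeline, data):
    start = time.perf_counter()
    result = data
    for step in pipeline:
        result = step(result)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    data = [float(x % 150) for x in range(10_000)]

    # Component optimization: swap one activity, keep the rest fixed.
    for component in (cluster_naive, cluster_alternative):
        runtime, _ = benchmark([filter_outliers, normalize, component], data)
        print(f"{component.__name__}: {runtime:.4f} s")

    # Topology optimization: filter before vs. after normalization.
    for pipeline in ([filter_outliers, normalize], [normalize, filter_outliers]):
        runtime, result = benchmark(pipeline, data)
        names = " -> ".join(step.__name__ for step in pipeline)
        print(f"{names}: {runtime:.4f} s, {len(result)} items kept")
```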
Fig. 3: Example of workflow design optimization.
Component optimizations: Activities which process the same task
but with different algorithms may be switched depending on user
preferences or input data. As each algorithm has its own specific
parameter set, parameter optimization should be applied in
combination with component optimization. Topology optimization:
Often, the execution sequence of consecutive activities (such as
filtering or clustering) may be swapped because input and output
formats are identical (adapted from Holl, 2014)11.
Overall, there are several considerations for any
workflow designer to take into account:
• Are the workflow activities suitable for their intended
purpose?
• Is a systematic approach available to study workflow
robustness?
• Is the methodology and objective of the workflow
clearly defined and understood by all stakeholders?
• Are there any peer-reviewed “best practice”
workflows present?
• Is a statistical method in place to analyze the
validation data?
After a workflow is designed, the version number has to
be updated. Versioning assures researchers always have
the "right" workflow and facilitates the easy
identification of updates between versions. The final goal
of the design phase is to achieve a level of automation
and simplification of the workflow so that scientists
across the organization can focus on their research and
are not required to ponder technical details of data
acquisition, component execution or resource allocation.
Step 2: Workflow Validation & Quality Control
Once a “best practice” workflow is finalized, it can be
shared, searched, reused, and manipulated. However,
even peer-reviewed methods may need further in-house
validation. For example, it has been shown that the
concordance of many commonly used variant-calling and
annotation pipelines is disturbingly low12,13. As such,
bioinformaticians have to be very careful in choosing
which workflow activities to include in their R&D
pipelines and how to validate their usefulness. It is
suggested that the traditional approaches to validation
and verification should be integrated into the workflow
lifecycle process rather than being viewed as separate
entities. Systematic testing of workflow components
with real data (mimicking the R&D efforts) can reveal
problems and outcomes that are unexpected and
undetected by the software user. Such errors can have
significant effects in bioinformatics and scientific
computing in general, affecting subsequent research or
clinical/scientific decision making14. Proper validation
becomes especially critical if bioinformatics tools are to
be used in a translational research setting, such as
analysis and interpretation of Whole Exome Sequencing
(WES) or Whole Genome Sequencing (WGS) data.
Research organizations should try to ensure that only
validated algorithms are used and that they are
implemented correctly in the R&D pipeline.
Often, bioinformaticians may want to restart the
workflow after the validation step in case the results do
not meet the required standards. During subsequent
validation cycles, the workflow can be adapted and
tested again. In the end, the computed results must
satisfy the expectation of their intended users.
Sometimes, it is difficult, if not impossible, to define a
gold-standard mechanism to decide if the output of a
workflow is correct, given any possible input. In such
cases, a systematic verification or validation of the
workflow may be achieved by applying a technique called
Metamorphic Testing (MT)14. MT alleviates the problems
associated with the lack of a gold standard by checking
that the results from multiple executions of a workflow
satisfy a set of expected or desirable properties that can
be derived from the software specification or user
expectations.
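To illustrate the idea behind metamorphic testing, rather than any specific published MT framework, the sketch below checks that a toy filtering activity satisfies two metamorphic relations: shuffling the input must not change the result, and duplicating every input read must exactly double the count.

```python
# Illustrative metamorphic tests for a toy workflow activity: instead of a
# gold standard for "correct" output, we check relations that must hold
# between outputs of related runs. The activity and relations are examples.

import random


def count_high_quality_reads(reads, threshold=30):
    """Toy activity under test: count reads at or above a quality threshold."""
    return sum(1 for q in reads if q >= threshold)


def test_permutation_invariance(reads):
    # Relation 1: shuffling the input must not change the result.
    shuffled = reads[:]
    random.shuffle(shuffled)
    assert count_high_quality_reads(reads) == count_high_quality_reads(shuffled)


def test_duplication_doubles_count(reads):
    # Relation 2: duplicating every read must exactly double the count.
    assert count_high_quality_reads(reads * 2) == 2 * count_high_quality_reads(reads)


if __name__ == "__main__":
    test_input = [random.randint(0, 42) for _ in range(1_000)]
    test_permutation_invariance(test_input)
    test_duplication_doubles_count(test_input)
    print("All metamorphic relations satisfied.")
```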
In modern R&D organizations, the volumes of data can
be overwhelming, and automated quality control (QC)
reporting is a necessity. A deeper analysis of QC
processes, including other omics-data, is out of the scope
of this article. NGS is quickly becoming the preferred
method for many genome and transcriptome studies,
including gene expression, transcription factor binding,
genomic structural change, genome methylation and
even de novo genome assemblies, among many others.
Properly conducting quality control protocols at all
stages (e.g. raw data, alignment, and variant calling) and
correctly interpreting the quality control results is crucial
to ensure the best outcome (for a review on QC methods
see Guo et al., 201315). Several sequence artifacts,
including read errors, poor-quality reads, and primer or
adaptor contamination, are quite common when NGS
data are generated and can have a significant
impact on downstream sequence analysis. A good
starting point for assuring the quality of NGS in
laboratory practice is to look at the principles and
guidelines developed by the Next Generation
Sequencing: Standardization of Clinical Testing (Nex-
StoCT) workgroup or the FDA’s Sequencing Quality
Control (SEQC) project16,17.
Initial steps in the quality control process typically
involve assessing the intrinsic quality of the raw reads
using metrics (read statistics) calculated directly from the
raw reads. This should include a summary report of the
overall alignment statistics, for example, showing the
total number of reads per sample, the percentage of
aligned reads, as well as the percentage of primary
aligned reads of the aligned ones. If the data contains
paired-end sequenced reads, the report should provide
the number of paired sequenced reads, as well as the
number of properly-paired reads.
If read pairs map to different chromosomes, indicating a
possible chromosome fusion, their number should be
presented, as well as the number of chromosome-spanning reads. QC
processes as outlined here can be used for three main
purposes: 1.) to ensure the workflow is correctly
implemented against the specification (i.e., verification), 2.) to ensure
the correct specification is used against the desired user
requirement (i.e., validation)14, and 3.) to monitor and
control currently used workflows. In this context, it is
clear that there is a need to make the review of sequence
data for researchers as efficient and scalable as possible.
It is imperative therefore, that tools can not only handle
current requirements for analysis and QC but also scale
to meet the demands of sequencing in the coming years.
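As a rough sketch of what such an automated QC summary could look like, the example below (assuming the pysam library is installed and that "sample.bam" is an existing aligned BAM file; both are assumptions made for this illustration) collects the alignment statistics mentioned above: total reads, aligned and primary-aligned reads, paired and properly paired reads, and pairs whose mates map to different chromosomes.

```python
# Sketch of an automated QC summary over an aligned BAM file, assuming the
# pysam library is available and "sample.bam" exists (both assumptions).

import pysam


def alignment_qc_summary(bam_path):
    stats = {
        "total_reads": 0,
        "aligned_reads": 0,
        "primary_aligned": 0,
        "paired_reads": 0,
        "properly_paired": 0,
        "mate_on_other_chromosome": 0,  # possible fusions / structural events
    }
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            stats["total_reads"] += 1
            if not read.is_unmapped:
                stats["aligned_reads"] += 1
                if not read.is_secondary and not read.is_supplementary:
                    stats["primary_aligned"] += 1
            if read.is_paired:
                stats["paired_reads"] += 1
                if read.is_proper_pair:
                    stats["properly_paired"] += 1
                if (not read.is_unmapped and not read.mate_is_unmapped
                        and read.reference_name != read.next_reference_name):
                    stats["mate_on_other_chromosome"] += 1
    return stats


if __name__ == "__main__":
    summary = alignment_qc_summary("sample.bam")
    for key, value in summary.items():
        print(f"{key}: {value}")
    if summary["total_reads"]:
        pct = 100.0 * summary["aligned_reads"] / summary["total_reads"]
        print(f"percent aligned: {pct:.1f}%")
```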
Step 3: Workflow Approval
The ability to standardize workflows throughout an
organization was discussed earlier as a key element of
any methods reproducibility approach. Standardization
in turn depends on the ability to parametrize, lock and
approve workflows. Why? Because inexperienced users
are sometimes tempted to ”play” with the workflow
parameters, resulting in many different workflow
versions, which makes comparison and reproducibility of
data almost impossible. For standardization to be useful,
the best approach is to automate and lock processes,
which has benefits beyond standardization itself. To achieve control
over workflows, governance policies should be defined
from the start; it has to be clear who needs to review and
approve individual workflows at any time, and when
workflows are mandated for periodic review.
When researchers are faced with clinical data analysis
tools, it is often found that the importance of ease-of-use
and training materials outweighs the number of features and
functionality5. Consequently, the primary goal of
workflow approval is to facilitate the reuse and
reproducibility of scientific experiments across an
organization with minimal effort. One way to achieve this
reproducibility is that the authorized workflow-designer
or the trial manager marks a workflow as approved, a
process that should be saved to an audit log with
appropriate adherence to regulatory requirements. As
part of the approval, the author of the workflow must be
able to designate any number of activities and activity
settings in the workflow as editable by the user. For best
comparability of different data sets, it is advisable to
minimize the number of those parameters as much as
possible. When a user finally opens the approved
workflow to execute it, the user should be offered the
opportunity to edit settings as appropriate to their role,
and then the workflow should run without any
interruption. Any approved and shared workflow should
always include the approver’s name, date, and time of
approval to ensure traceability and to clarify
responsibilities. In case a user copies and modifies a
previously approved workflow, the
system should be able to determine
if the approved workflow has been
altered and remove it from the list of
approved procedures.
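One possible way to detect such alterations, sketched below under the assumption that workflow definitions can be serialized to JSON, is to store a cryptographic fingerprint of the locked definition in the approval record and re-check it before every execution; the field names and workflow structure are illustrative only.

```python
# Sketch: fingerprint an approved workflow definition at approval time and
# re-check it before execution, so any modification invalidates the approval.

import hashlib
import json
from datetime import datetime, timezone


def workflow_fingerprint(definition: dict) -> str:
    """Hash a workflow definition in a key-order-independent way."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approve(definition: dict, approver: str) -> dict:
    """Return an approval record to be written to the audit log."""
    return {
        "workflow_name": definition.get("name", "unnamed"),
        "version": definition.get("version", "0.0.0"),
        "approver": approver,
        "approved_at": datetime.now(timezone.utc).isoformat(),
        "fingerprint": workflow_fingerprint(definition),
    }


def is_still_approved(definition: dict, approval_record: dict) -> bool:
    return workflow_fingerprint(definition) == approval_record["fingerprint"]


if __name__ == "__main__":
    wf = {"name": "variant_calling", "version": "1.2.0",
          "activities": [{"name": "align", "params": {"threads": 8}}]}
    record = approve(wf, approver="workflow.designer@example.org")
    print("approved:", is_still_approved(wf, record))       # True

    wf["activities"][0]["params"]["threads"] = 64             # a user edits a locked setting
    print("still approved:", is_still_approved(wf, record))   # False -> remove from approved list
```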
Using only approved workflows
within a company site has the
additional benefit of protecting the
organization’s IP against loss through negligence
or fraud. While incidents of clear-cut
fraud, fortunately, appear to be
rare, a more pressing concern is the
prevalence of research practices,
often committed unconsciously or
unknowingly, which would serve to
increase the likelihood that a result
is incorrect and potentially very
costly from the business
perspective. Of course,
standardization should include
some degree of flexibility since the
rapid technological evolution the
field is witnessing could make specific recommendations
or best practices short-lived. This freedom should be
granted with caution, as using only approved workflows
can effectively reduce differences in practices by
aligning the company’s global efforts around consensus-
based methods, serving as an efficient solution to
biomarker irreproducibility.
Step 4: Workflow sharing
A study amongst researchers found that although
scientists develop and share pre-designed workflows,
they tend to reuse almost exclusively their own
workflows, instead of adapting (potentially better)
workflows from other researchers18. In practice, this
tendency leads to poor R&D efficiency and ultimately a
reduced return on investment (ROI). To avoid
redundancies, it is essential to be able to
share internal know-how, such as specific bioinformatics
workflows and algorithms, so that not everyone has to
start from scratch again, while keeping in mind role-
based permissions to safeguard content and IP security.
To obtain buy-in to use those workflows and to motivate
active participation in the data quality effort, a sensible
and understandable rationale should be provided to all
stakeholders in a translational research unit (at all
levels).
Workflow repositories are platforms that allow scientific
workflows to be shared collaboratively19. Once a
workflow is approved for placement in a repository, all
contained activities should be immediately available to
all authorized users (but nobody else), who can then
share the designed workflow with other groups (e.g. on
other company sites or externally, such as with CROs).
Depending on the scientific management system used
and on the level of IP protection, workflow
templates or workflow specifications might also be shared,
facilitating distributed collaborations. Experience from
real-life applications in the pharmaceutical industry
emphasizes the importance of such workflow
repositories as many domain-specific researchers do not
have the time or knowledge to do their own scripting and
programming. When complex workflows are shared
within an organization and IP is at risk, the workflow
managers should always ensure that proper access
control is implemented. A strict access control system will have
the option to selectively grant a user access to projects
and project groups and thereby authorize the use of
certain workflows.
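A minimal sketch of such role-based access control is shown below; the roles, project names, and policy (execution requires project membership, editing requires a designer role) are illustrative assumptions rather than a prescription.

```python
# Minimal sketch of role-based access to shared workflows: users are granted
# roles per project, and a workflow may only be used by members of the
# project it belongs to. Roles, projects and names are purely illustrative.

from collections import defaultdict

# role grants: project -> {user: role}
grants = defaultdict(dict)

def grant(project: str, user: str, role: str) -> None:
    grants[project][user] = role

def can_run(user: str, workflow: dict) -> bool:
    """Only members of the workflow's project may execute it."""
    return user in grants[workflow["project"]]

def can_edit(user: str, workflow: dict) -> bool:
    """Only designers of the project may modify (and re-approve) it."""
    return grants[workflow["project"]].get(user) == "designer"


if __name__ == "__main__":
    workflow = {"name": "rnaseq_dge", "project": "oncology_site_A"}
    grant("oncology_site_A", "alice", "designer")
    grant("oncology_site_A", "cro_partner", "analyst")

    print(can_run("cro_partner", workflow))   # True  - may execute the shared workflow
    print(can_edit("cro_partner", workflow))  # False - cannot alter the approved design
    print(can_run("mallory", workflow))       # False - no project membership, no access
```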
When massive amounts of data are shared between
many researchers and sites, it is often necessary to
customize data management functions with the help of
adaptive middleware. An example of such middleware is
the open source software system iRODS that lets system
administrators roll out an extensible data grid without
changing their infrastructure20. Systems such as iRODS
and other middleware are of particular value when
research organizations want to implement additional or
enhanced functionality such as data virtualization,
automation of data operations, a sophisticated metadata
catalog, or data management policy enforcement and
compliance verification (Fig. 4).
Fig. 4: An R&D infrastructure toward precision
medicine. Samples from patients and clinical trial subjects are
increasingly involved in omic-workflows, hence systems used must
ensure analyses are performed to the highest standards of privacy and
compliance with global regulatory frameworks. Depicted is one
possible framework for such a precision medicine platform that could
provide a comprehensive and scalable translational research eco-
system. Using middleware such as iRODS, which is increasingly
adopted by pharmaceutical companies, creates an advanced security-
hardened research data platform for the management, processing, and
analysis of exploratory omics data from patients and pre-clinical
experiments.
One of the main benefits of using data management
middleware is that it enables secure collaboration, so
scientists only need to log-in to their home grid to access
data for their workflows hosted on a remote grid. Patient
identifiable data has to be protected, ideally in a manner
that does not inhibit biomarker discovery or translational
research efforts.
Establishing effective collaborative efforts with secure
data sharing is no simple feat, but organizations can look
to knowledgeable partners that have been successful in
the past to develop such collaborative solutions for the
life-science community.
Step 5: Workflow optimization
Independent of the workflow source, a systemic lifecycle
approach allows the optimization of data analysis
workflows in a partially automated manner, keeping the
effort of integrating new algorithms to a minimum for
the bioinformaticians.
The need for an API. The aim of modern workflow
management systems is to facilitate the integration of
different software and scripts, as well as connections to
external public and proprietary data sources (for an
overview of this topic see de Brevern et al., 2015)21. A
prerequisite for workflow design and optimization is to
have an easy-to-use shell application programming
interface (API) available to enable the quick and easy
integration of external plugin activities (i.e. in-house
tools and algorithms such as mappers, trimming tools or
variant callers). Workflow designers are then able to
assemble methods quickly and efficiently without the
overhead of manually integrating each one, optimizing the
process of method development.
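The sketch below shows one simple way such a shell-level plugin interface could look: an external command-line tool is wrapped as a workflow activity via the standard Python subprocess module. The wrapped command ("wc -l") and the input path are harmless stand-ins for a real mapper, trimmer, or variant caller and its inputs.

```python
# Sketch of wrapping an external command-line tool as a plugin activity via
# a thin shell interface, so the workflow engine can treat in-house scripts
# and third-party tools uniformly. The example command is a stand-in only.

import shlex
import subprocess


class ShellActivity:
    """Wrap a command-line tool as a workflow activity."""

    def __init__(self, name: str, command_template: str):
        self.name = name
        self.command_template = command_template

    def run(self, **kwargs) -> str:
        command = self.command_template.format(**kwargs)
        result = subprocess.run(
            shlex.split(command),
            capture_output=True, text=True, check=True,
        )
        return result.stdout


if __name__ == "__main__":
    # Stand-in for e.g. a mapping command template such as
    # "my_mapper {reference} {fastq} -o {output}".
    line_counter = ShellActivity("count_lines", "wc -l {input_file}")
    print(line_counter.run(input_file="/etc/hosts"))  # example input path
```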
Generally, the optimization of a particular workflow can
be described as the systematic comparison of possible
parameter sets with the goal of finding a not yet known
optimal parameter set. An optimization focuses on the
maximization or minimization of so-called objective
functions (or fitness functions) subject to various
constraints11. In the real-world pharmaceutical
environment this means finding the best possible
solution in a reasonable amount of time, ideally without
altering the workflow complexity.
Trial and error parameter determination is time-
consuming and inefficient. Usually, large data analysis
workflows, as seen in complex NGS experiments, consist
of many connected activities. For each activity, the
workflow designer has the choice of many parameters
and algorithms, which makes it impractical to rely on a
simple trial and error approach to workflow
optimization11. A better method is to apply optimization
algorithms, which sample the search space to find a near
optimal solution in a reasonable amount of computing
time22. Often it is essential to consider several objectives
in an optimization process, for example a trade-off
between sensitivity and specificity or between result
quality and runtime. Also, instead of re-executing an
entire workflow, a more resource-efficient way is to
store intermediate results (snapshots) and then to re-
execute only parts of the workflow with changed
parameters for refinement. If one activity has failed, this
process allows users to go back one step and modify the
specific activity so that researchers do not have to re-
execute the entire workflow23. Also, the optimization
time can be reduced by utilizing only a sub-fraction of a
data set (e.g. the researcher may want to look only at one
chromosome instead of running the workflow with a
complete mammalian genome). Accelerating the
execution time of single activities can also be achieved by
moving time-consuming and parallelizable tasks to high-
performance computing resources. Consequently, if an
activity (e.g. an R-script) can be executed in a parallel
fashion, for example via an API, this resource should
preferably be used.
The most common optimization task is the improvement
of the parameters of certain activities. Typical activities
in NGS workflows contain a variety of parameters. For
example, Cufflinks, a transcriptome assembly and
differential expression analysis tool for RNA-Seq, has over
40 different parameters24; Bowtie, a popular, ultrafast
and memory-efficient tool for aligning sequencing reads
to long reference sequences, has over 100 parameters25.
It is evident that the number of possible parameter
combinations for any non-trivial workflow provides a
considerable challenge for parameter optimization.
Although methods for automated parameter sweeps
exist, the increasing complexity of scientific workflows
(with rising numbers of parameters) makes such sweeps
impractical. Therefore, methods have been proposed
(such as Genetic Algorithms or Particle Swarm
Optimization) that handle complex, non-linear workflow
optimization in a reasonable amount of time by
exploring a much smaller part of the parameter search
space to find good solutions11.
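The following minimal sketch illustrates the general idea with a plain random search (deliberately simpler than a genetic algorithm or particle swarm optimizer): candidate parameter sets are sampled from a search space and scored against a toy objective that stands in for re-running the workflow on a data subset and weighing, for example, sensitivity against specificity and runtime. All parameter names and the scoring function are hypothetical.

```python
# Minimal random-search sketch over a workflow parameter space. The scoring
# function is a toy stand-in for re-running a workflow on a data subset and
# measuring, e.g., a weighted trade-off between sensitivity and specificity.

import random

SEARCH_SPACE = {
    "min_mapping_quality": range(0, 61, 5),
    "min_base_quality": range(10, 41, 2),
    "seed_length": range(15, 32),
}

def evaluate(params: dict) -> float:
    """Toy objective; in practice: execute the workflow on a data subset."""
    sensitivity = 1.0 - abs(params["min_mapping_quality"] - 30) / 60
    specificity = 1.0 - abs(params["min_base_quality"] - 20) / 40
    runtime_penalty = params["seed_length"] / 1000
    return 0.6 * sensitivity + 0.4 * specificity - runtime_penalty

def random_search(n_trials: int = 200, seed: int = 42):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = {k: rng.choice(list(v)) for k, v in SEARCH_SPACE.items()}
        score = evaluate(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

if __name__ == "__main__":
    params, score = random_search()
    print(f"best parameters: {params}  (objective = {score:.3f})")
```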
The overall complexity of the parameter space implies
that improving a scientific workflow result may be
challenging for both life-science researchers and
bioinformaticians. As a consequence, for better
comparability of results, parameters should always be
restricted or completely locked before sharing any
workflow. Domain scientists (often bench scientists)
would typically simply use the default values, as they
often do not know exactly how the programs work and
how parameter changes may influence the result26. This
also means that bioinformaticians need a well-
documented platform to provide improved methods
without much effort. Good documentation will ensure
that if users customize settings they understand the
consequences and impact on results.
Step 6: Workflow execution phase
Any R&D platform should contain an intuitive interface
with pre-packaged features and easily accessible
workflows, making it easy to use for all user levels.
Usually, the software provides an execution
environment by retrieving information about available
data- and computing resources, and from external
software via an API. The workflow components are then
executed in a predefined order. In an ideal case, the user
should have a role-based control over the workflow
components with the possibility to interrupt at any point
and save intermediate results, for example, to avoid
executing compute-intensive components during peak-
hours. Nowadays, most multi-omics data applications
exceed the power of desktop clients or laptops. As such,
each software platform needs the capability to access
and utilize distributed data and computing resources,
including internal High-Performance Computing (HPC),
High-Throughput Computing (HTC) resources, as well as
cloud solutions.
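A bare-bones sketch of such an execution phase is given below: activities run in a predefined order and every intermediate result is checkpointed to disk, so an interrupted run (for example, one paused before a compute-intensive step during peak hours) can resume without repeating earlier activities. File layout and activity names are illustrative.

```python
# Sketch of an execution phase with checkpointing of intermediate results,
# so an interrupted workflow run can resume where it stopped.

import json
from pathlib import Path


def run_with_checkpoints(activities, data, checkpoint_dir="checkpoints"):
    Path(checkpoint_dir).mkdir(exist_ok=True)
    for index, (name, func) in enumerate(activities):
        checkpoint = Path(checkpoint_dir) / f"{index:02d}_{name}.json"
        if checkpoint.exists():
            data = json.loads(checkpoint.read_text())   # resume from saved state
            print(f"skipping {name}, loaded checkpoint")
            continue
        data = func(data)
        checkpoint.write_text(json.dumps(data))
        print(f"finished {name}, checkpoint written")
    return data


if __name__ == "__main__":
    pipeline = [
        ("normalize", lambda xs: [x / max(xs) for x in xs]),
        ("threshold", lambda xs: [x for x in xs if x > 0.5]),
    ]
    print(run_with_checkpoints(pipeline, [1, 5, 9, 10]))
```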
Step 7: Data analysis & learning
Usually, the analysis and learning step is the last phase of
a workflow lifecycle (see Fig. 2), representing the normal
application of the workflow in day-to-day data analysis
routine of ongoing projects. The data analysis phase
involves the examination of the obtained results as well
as their comparison to those of other experiments11. To
ensure successful completion of this phase,
sophisticated statistical data analysis software should be
available that:
• uses a rich statistical toolbox to perform a wide range of data analyses
• includes external algorithms as plugins
• integrates data across technologies and studies from in-house and public data sources
• makes data fully transparent through sophisticated data visualization (e.g. an interactive genome browser)
• contains a comprehensive reporting infrastructure
When working in a pharmaceutical environment, after
successful data analysis, data feeds are typically pushed
into a data warehouse. One example of such an omics
data warehouse is the open-source, community-driven
knowledge management platform tranSMART. Similarly,
database extracts may be pushed or pulled by the
warehouse system. The Extract–Transform–Load (ETL)
process pulls (if necessary) and converts raw data into a
format expected by the warehouse5,27. A few software
platforms directly support the tranSMART platform. For
better comparison of results, data warehouses usually
also contain data from public omics repositories such as
TCGA, CCLE, dbSNP, GEO, the 1000 Genomes Project or
ArrayExpress. However, public-domain databases are
nowadays filled with data that do not contain sufficient
metadata to accurately determine how the omics data
were generated. Therefore, before data are uploaded to a
database or data warehouse, researchers should make
sure that as much standardized metadata
about the experiments as possible is collected and stored, starting
during bio-banking of samples and continuing through
data processing and analysis.
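One lightweight way to enforce this, sketched below with an illustrative set of required fields rather than any official metadata standard, is to validate the metadata record before the ETL step is allowed to push results into the warehouse.

```python
# Sketch of enforcing a minimal, standardized metadata record before omics
# results are pushed to a warehouse (e.g. ahead of a tranSMART ETL step).
# The required fields are illustrative, not an official standard.

REQUIRED_FIELDS = {
    "sample_id", "biobank_accession", "organism", "tissue",
    "assay_type", "platform", "library_prep_protocol",
    "workflow_name", "workflow_version", "reference_genome",
}

def validate_metadata(metadata: dict) -> list:
    """Return the list of missing or empty required fields."""
    return sorted(f for f in REQUIRED_FIELDS if not metadata.get(f))

def push_to_warehouse(record: dict, metadata: dict) -> None:
    missing = validate_metadata(metadata)
    if missing:
        raise ValueError(f"refusing upload, missing metadata: {missing}")
    print(f"uploading {record['name']} with complete metadata")  # placeholder for the ETL call


if __name__ == "__main__":
    metadata = {"sample_id": "S-0042", "organism": "Homo sapiens",
                "assay_type": "RNA-Seq", "platform": "Illumina",
                "workflow_name": "rnaseq_dge", "workflow_version": "1.2.0"}
    try:
        push_to_warehouse({"name": "expression_matrix"}, metadata)
    except ValueError as err:
        print(err)
```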
Fig. 5: Analysis of age distribution of responders and non-responders
in the open-source tranSMART platform.
Step 8: Reporting
The scientific method relies on the ability of scientists to
reproduce and build upon each other’s results. For life-
science organizations, this means that researchers not
only need to have access to the work their colleagues did
in the past, they also need the ability to trace all the steps
of a previous data processing and analysis workflow.
Usually, this is done by exporting experimental details
into easy to understand reports. Reporting capabilities
can have a significant impact on an R&D department.
While many factors are likely at play here, perhaps one
of the most fundamental requirements for
reproducibility holds that the data reported in a study,
especially raw data (e.g. sequence reads) and processing
parameters, can be uniquely identified and obtained,
such that analyses can be reproduced as faithfully as
possible.
In an ideal case, an R&D department may want to control
or monitor specific steps in a research pipeline. In this
case, it is useful to have the ability to automatically
produce a report from any point within a workflow. The
generated report will represent the current state of the
workflow at the point in time when the report activity is
executed. A standard workflow report (e.g. in the form of a
PDF document) may contain several selectable elements
with comprehensive information about the currently
executed workflow, such as:
• Detailed settings of all executed activities
• Properties of the entire workflow
• Computed result values per activity
• Figures of results where appropriate
• A graphical representation of the workflow
• Information about the software version
• Standard meta-information (i.e. author and creation date of the report)
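As a simple illustration, the sketch below writes such a snapshot of the current workflow state to a JSON report; the field names and the structure of the workflow-state object are assumptions made for this example, and a real platform would render the same information into its PDF reporting template.

```python
# Sketch of a report activity that can be dropped anywhere in a workflow to
# snapshot its current state (settings, computed values, software versions)
# into a simple JSON report. Field names and the state object are illustrative.

import json
import platform
from datetime import datetime, timezone


def report_activity(workflow_state: dict, author: str, path: str = "workflow_report.json"):
    report = {
        "report_created": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "python_version": platform.python_version(),
        "workflow": workflow_state["name"],
        "workflow_version": workflow_state["version"],
        "executed_activities": [
            {"name": a["name"], "settings": a["settings"], "result_summary": a.get("result")}
            for a in workflow_state["activities"]
        ],
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return report


if __name__ == "__main__":
    state = {"name": "variant_calling", "version": "1.2.0",
             "activities": [
                 {"name": "align", "settings": {"threads": 8}, "result": "98.2% reads aligned"},
                 {"name": "call_variants", "settings": {"min_qual": 30}, "result": "41,205 variants"},
             ]}
    print(json.dumps(report_activity(state, author="qc.bot@example.org"), indent=2))
```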
Classically, the solution to improving data and process
identifiability needs to be a partnership between all
participants in the scientific process, and deficiencies in
awareness and difficulties coordinating across these
stakeholders are at the root of the problem28. A
nonexistent or “ad hoc” chain-of-custody (CoC) may
suffice for a simple internal investigation of an
experiment, but it is advisable not to take the chance.
Instead, research organizations should protect all results
equally so that, if necessary, they will hold up to
internal auditing or even an FDA investigation.
A sound CoC verifies that employees have not altered
information during data collection, copying,
or analysis. If a company fails to provide a
comprehensive CoC, it may have a difficult time
disproving that somebody might have tampered with the
data, which in turn exposes the organization to compliance
risk.
Step 9: Archiving old workflows
Although non-clinical studies may not always need to
adhere to GxP guidelines, essential workflows and data
produced/derived from those workflows should still be
retained for sufficient periods of time to allow for audit
and inspection by internal and regulatory authorities.
Depending on the importance of the studies and whether
any claims are made based on them, the involved
workflows should be readily available upon request.
Accordingly, authorized users should be able to archive a
workflow, a task that should also be recorded in an audit
log. After archiving, the research environment should
still permit the evaluation of the conduct of the research
and the quality of the data produced. The archived
workflows serve to demonstrate the compliance of the
R&D unit with GxP standards and all applicable
regulatory requirements. However, to avoid
irreproducibility of data, a key requirement for an archiving
solution is that the translational research platform must
prevent users from running archived workflows.
Authorized individuals (such as the main data manager
or workflow author) may still be able to save a copy of
such an archived workflow to a new location, but the
newly saved workflow should no longer be marked as
archived. Of course, unauthorized users should not be
able to save a copy of an archived workflow to another
location. The archived workflows should still be able to
link to the raw data, for example, sequence reads,
metabolomics profiles, or associated records of clinical
findings, necessary for the reconstruction and evaluation
of the project. Since it cannot be ruled out that patients
involved in the research project retract their informed
consent, it is important to ensure that records (suitably
de-identified as necessary) about a patient’s
involvement in the research (i.e. sample- and patient-ID)
are also retained for an appropriate period. It is advisable
that a research organization appoint named individuals
to be responsible for archiving the workflows that are, or
have been, utilized in the R&D pipeline, and that access
to those workflows be restricted to those appointed
individuals.
Step 10: Routine-review of lifecycle workflows
To evaluate if a data processing workflow requires
optimization, regular review should be performed to
ensure that components of the workflow are still up to
date and reflect current best-practices. Systematizing
this process enables the R&D team to accommodate and
manage changes to the workflow. The benefit is a
dramatic improvement in timeliness and productivity. In
the context of securing IP and reproducibility of data, it
is advisable to perform a risk assessment to determine the
impact of changes to existing workflows. Procedures
should be revisited as needed to refine operational
design based on new knowledge.
The limits of lifecycle management:
Deep learning
Traditional methods of data analysis are no longer
enough to handle, let alone take proper advantage of,
the potential that precision medicine holds. Machine
learning techniques provide powerful methods for
analyzing large data sets, such as omics-data, medical
images, and electronic health records (EHRs). Deep
learning methods in particular, in which a machine
learns layered feature representations directly from data,
are outperforming traditional data analysis approaches on
many benchmarks. A key advantage of deep
learning is its ability to perform unsupervised feature
extraction over massive data sets. For example, using
automatic feature extraction and pattern recognition,
software can discover almost all of its domain-specific
knowledge with minimal hand-crafted knowledge given
by the programmer. However, the issues of data
management, model sharing, and lifecycle management
are largely ignored29.
The deep learning modeling lifecycle contains a rich set of
artifacts, such as learned parameters, evaluations and
training logs, and involves frequently conducted tasks, e.g.
understanding model behavior and trying out different
models. Dealing with those artifacts requires a novel model
versioning framework and new methods to store and
explore the developed models, and to share models with
others.
Conclusions
The challenge of increasing reproducibility in
translational research is simply too important and costly
a problem to ignore. Addressing methods reproducibility
is a key element for an organization to ensure the highest
quality of its results and to build trust in its scientific
discoveries. All research units need to ensure that they
build the appropriate data collection and processing
steps into their translational R&D projects from the start,
by incorporating a methods lifecycle concept into the
initial planning stage. Using lifecycle management
as outlined in this paper will deliver various benefits to
the translational research process:
• New information infrastructures with proper data processing and data management enable the R&D team to cope with the increasing amount of omics data.
• Full transparency of workflow processing: know when a method was used, by whom, and with what settings.
• Safe leveraging of the resources of all stakeholders (external laboratories, CROs or collaborators) via the use of standardized methods throughout the translational research value chain, allowing results to be easily compared.
• Detailed documentation of data provenance will result in reproducible results and ensures compliance with government regulations.
• Harmonized “best practice” data analysis workflows speed up data analysis and de-risk bioinformatics R&D operations.
Yet, for any leader in the multi-omics translational
research field, it is of utmost importance to realize that
building a strong analytics pipeline or R&D platform
requires many skills that are usually not all present within
a single company. Developing such a platform, and even
more so maintaining it, is not a trivial task. However,
companies that meet the challenge of sustaining high-
quality data will be richly rewarded with more accurate,
believable, and timely data for their R&D
activities. When properly realized, method lifecycle
management can reduce time-to-market and enhance
competitive advantage.
References
1. Freedman, L. P., Cockburn, I. M. & Simcoe, T. S. The
Economics of Reproducibility in Preclinical Research.
PLOS Biol. 13, e1002165 (2015).
2. Ware, J. J. & Munafò, M. R. Significance chasing in
research practice: Causes, consequences and possible
solutions. Addiction 4–8 (2014). doi:10.1111/add.12673
3. George, B. J. et al. Raising the Bar for Reproducible
Science at the U.S. Environmental Protection Agency
Office of Research and Development. Toxicol. Sci. 145,
16–22 (2015).
4. Mendoza-Parra, M. A. & Gronemeyer, H. Assessing
quality standards for ChIP-seq and related massive
parallel sequencing-generated datasets: When rating
goes beyond avoiding the crisis. Genomics Data 2, 268–
273 (2014).
5. Schumacher, A., Rujan, T. & Hoefkens, J. A collaborative
approach to develop a multi-omics data analytics
platform for translational research. Appl. Transl.
Genomics 4–7 (2014). doi:10.1016/j.atg.2014.09.010
6. Jimenez, R. C. & Corpas, M. Bioinformatics workflows
and web services in systems biology made easy for
experimentalists. Methods Mol. Biol. 1021, 299–310
(2013).
7. de Brevern, A. G., Meyniel, J.-P., Fairhead, C.,
Neuvéglise, C. & Malpertuy, A. Trends in IT Innovation to
Build a Next Generation Bioinformatics Solution to
Manage and Analyse Biological Big Data Produced by
NGS Technologies. Biomed Res. Int. 2015, 904541
(2015).
8. Tiwari, A. & Sekhar, A. K. T. Workflow based framework
for life science informatics. Comput. Biol. Chem. 31, 305–
19 (2007).
9. Gil, Y. et al. Examining the challenges of scientific
workflows. Computer (Long. Beach. Calif). 40, 24–32
(2007).
10. De Roure, D. & Goble, C. myExperiment – A Web 2.0
Virtual Research Environment. Engineering 1–4 (2007).
11. Holl, S. Automated Optimization Methods for Scientific
Workflows in e-Science Infrastructures. (Schriften des
Forschungszentrums Jülich, 2014).
12. McCarthy, D. J. et al. Choice of transcripts and software
has a large effect on variant annotation. Genome Med.
6, 26 (2014).
13. O’Rawe, J. et al. Low concordance of multiple variant-
calling pipelines: practical implications for exome and
genome sequencing. Genome Med. 5, 28 (2013).
14. Giannoulatou, E., Park, S.-H., Humphreys, D. T. & Ho, J.
W. Verification and validation of bioinformatics software
without a gold standard: a case study of BWA and
Bowtie. BMC Bioinformatics 15, S15 (2014).
15. Guo, Y., Ye, F., Sheng, Q., Clark, T. & Samuels, D. C.
Three-stage quality control strategies for DNA re-
sequencing data. Brief. Bioinform. 15, 1–10 (2013).
16. SEQC/MAQC-III Consortium. A comprehensive assessment
of RNA-seq accuracy, reproducibility and information
content by the Sequencing Quality Control Consortium.
Nat. Biotechnol. 32, 903–914 (2014).
17. Gargis, A. S. et al. Assuring the quality of next-generation
sequencing in clinical laboratory practice. Nat.
Biotechnol. 30, 1033–1036 (2012).
18. Starlinger, J., Cohen-Boulakia, S. & Leser, U. in Scientific
and Statistical Database Management (eds. Ailamaki, A.
& Bowers, S.) 7338, 361–378 (Springer Berlin Heidelberg,
2012).
19. Stoyanovich, J., Taskar, B. & Davidson, S. Exploring
repositories of scientific workflows. in Proceedings of the
1st International Workshop on Workflow Approaches to
New Data-centric Science 7 (ACM, 2010).
doi:10.1145/1833398.1833405
20. Chiang, G.-T., Clapham, P., Qi, G., Sale, K. & Coates, G.
Implementing a genomic data management system
using iRODS in the Wellcome Trust Sanger Institute. BMC
Bioinformatics 12, 361 (2011).
21. de Brevern, A. G., Meyniel, J.-P., Fairhead, C., Neuvéglise,
C. & Malpertuy, A. Trends in IT Innovation to Build a
Next Generation Bioinformatics Solution to Manage and
Analyse Biological Big Data Produced by NGS
Technologies. Biomed Res. Int. 2015, 904541 (2015).
22. Colaço, M. & Dulikravich, G. A Survey of Basic
Deterministic, Heuristic and Hybrid Methods for Single-
Objective Optimization and Response Surface
Generation. Therm. Meas. Inverse Tech. 355–405 (2011).
doi:10.1201/b10918-13
23. Sonntag, M., Karastoyanova, D. & Leymann, F. The
Missing Features of Workflow Systems for Scientific
Computations. Proc. 3rd Grid Work. Work. 209–216
(2010).
24. Trapnell, C. et al. Transcript assembly and abundance
estimation from RNA-Seq reveals thousands of new
transcripts and switching among isoforms. Nat.
Biotechnol. 28, 511–515 (2011).
25. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L.
Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol 10, R25
(2009).
26. Kulyk, O., Wassink, I., van der Vet, P., van der Veer, G. &
van Dijk, B. University of Twente 1–24 (2008).
27. Athey, B. D., Braxenthaler, M., Haas, M. & Guo, Y.
tranSMART: An Open Source and Community-Driven
Informatics and Data Sharing Platform for Clinical and
Translational Research. AMIA Jt. Summits Transl. Sci.
Proc. AMIA Summit Transl. Sci. 2013, 6–8 (2013).
28. Vasilevsky, N. A. et al. On the reproducibility of science:
unique identification of research resources in the
biomedical literature. PeerJ 1, e148 (2013).
29. Miao, H., Li, A., Davis, L. S. & Deshpande, A. ModelHub:
Lifecycle Management for Deep Learning. Univ. Maryland
(2016).