RESEARCH ARTICLE
GUIdock: Using Docker Containers with a Common Graphics User Interface to Address the
Reproducibility of Research
Ling-Hong Hung, Daniel Kristiyanto, Sung Bong Lee, Ka Yee Yeung*
Institute of Technology, University of Washington, Tacoma, WA 98402, United States of America
These authors contributed equally to this work.
*kayee@uw.edu
Abstract
Reproducibility is vital in science. For complex computational methods, it is often necessary,
not just to recreate the code, but also the software and hardware environment to reproduce
results. Virtual machines, and container software such as Docker, make it possible to reproduce
the exact environment regardless of the underlying hardware and operating system. However,
workflows that use Graphical User Interfaces (GUIs) remain difficult to replicate on different
host systems as there is no high level graphical software layer common to all platforms. GUIdock
allows for the facile distribution of a systems biology application along with its graphics
environment. Complex graphics based workflows, ubiquitous in systems biology, can now be easily
exported and reproduced on many different platforms. GUIdock uses Docker, an open source project
that provides a container with only the absolutely necessary software dependencies and configures
a common X Windows (X11) graphic interface on Linux, Macintosh and Windows platforms. As proof of
concept, we present a Docker package that contains a Bioconductor application written in R and
C++ called networkBMA for gene network inference. Our package also includes Cytoscape, a
Java-based platform with a graphical user interface for visualizing and analyzing gene networks,
and the CyNetworkBMA app, a Cytoscape app that allows the use of networkBMA via the user-friendly
Cytoscape interface.
PLOS ONE | DOI:10.1371/journal.pone.0152686 April 5, 2016
OPEN ACCESS
Citation: Hung L-H, Kristiyanto D, Lee SB, Yeung KY (2016) GUIdock: Using Docker Containers with
a Common Graphics User Interface to Address the Reproducibility of Research. PLoS ONE 11(4):
e0152686. doi:10.1371/journal.pone.0152686
Editor: Lennart Martens, UGent / VIB, BELGIUM
Received: September 22, 2015
Accepted: March 17, 2016
Published: April 5, 2016
Copyright: © 2016 Hung et al. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and
reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All relevant data are within the paper and its Supporting
Information files.
Funding: Yeung and Hung are supported by National Institutes of Health U54 HL127624. Microsoft
Azure for Research Award provided computational resources for all authors. Kristiyanto gratefully
acknowledges sponsorship from the Fulbright scholarship 2014–2016 and from the University of
Washington in the form of full tuition waivers. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Introduction
Reproducibility is a vital feature in science [1,2]. Recent articles in the June 26 issue of
Science discussed how rarely published results can be reproduced across different disciplines
[2,3]. Nosek and colleagues proposed guidelines consisting of eight standards and three levels to
promote transparency, openness and reproducibility in scientific publications [1]. These
guidelines progress from level 0 to level 3 and become increasingly stringent for each standard
(see proposed standards and references in the Supplementary Material of Nosek et al. [1] for
details). Computational method development and data analyses have become integral to many
disciplines, such as biomedical research. For data analyses and software implementations, level 2
of the Nosek et al. guidelines requires that the code must be posted to a trusted repository. In
level 3, they propose the additional requirement that the reported analyses be reproduced
independently before publication.
These suggestions overlook the fact that modern biomedical workflows and pipelines con-
sist of multiple applications and libraries, each with their own set of software dependencies.
Hence, suites such as Bioconductor [4], BioPython [5], and BioPerl [6] where the user is
assured that the dependencies for the components are properly installed have become increas-
ingly popular. The obvious drawback to this approach is that one is limited to the components
included in the suite. In addition, reproducing workflows that use interactive graphics remains
problematic as each operating system uses its own graphical environment. Our solution to
this problem is GUIdock, which allows for replication of the application, graphics and software
environments that produced the analytic results reported in scientific publications.
GUIdock uses Docker (https://www.docker.com/), an open source project that incorporates a
lightweight Linux wrapper (container) to ensure application portability and infrastructure
flexibility. On a Linux host, Docker uses the host system. On Mac OS and Windows systems, a
single Docker container consists of a Virtual Machine (VM) containing the guest software and
its Linux environment. Containers differ from traditional VMs in that the resources of the
operating system (OS) and not the hardware are shared transparently (virtualized). Multiple
containers share a single OS kernel, saving considerable resources. Docker also supports
Dockerfiles that contain the instructions to build a Docker Image from scratch or from another
Docker Image. Images can be downloaded from repositories using git (https://git-scm.com/) or
bundled with the Dockerfile to form packages. Docker provides an easy, modular method to build,
distribute and replicate complex pipelines and workflows across multiple platforms.
Although Docker provides a container with the original computational environment, the
host system, where the container software is executed, is responsible for rendering graphics.
GUIdock configures an additional X Windows software layer that allows for consistent graph-
ics on a variety of host platforms. Thus a complex bioinformatics pipeline with GUI compo-
nents originally running on a Linux machine can be replicated and tested on a Windows or
Mac OS machine. This greatly facilitates the reproduction of scientific results arising from real-
world workflows. Fig 1 shows an overview of GUIdock.
Related Work
In this work, we showcase GUIdock, a method for deploying containers with a graphical user
interface. As proof of concept, our GUIdock package includes our previously published software
tools for gene network inference [7–11]. A gene network can be represented by a graph,
in which nodes are genes and edges capture relationships between genes. There are many
applications for these computationally derived gene networks, such as systems approaches to
identify disease genes [12]. Therefore, many computational methods and software tools have been
developed to infer gene networks from genome-wide data and subsequently to formulate
hypotheses from these computationally derived networks. Excellent review articles have been
written to cover these advances in methods and tools, for example, [12–14].
Inference of gene networks using Bayesian Model Averaging (BMA). In regression-based
network inference algorithms, we aim to search for candidate regulators (i.e. parent
nodes) for each target gene. In other words, we model the target gene's expression levels as the
response variable (y), and the candidate regulators' expression levels as independent variables
(x's) in a regression framework. This problem is then reduced to a variable selection problem
such that the goal is to identify variables (parent nodes) that can be used to predict the
expression levels of the target gene. Many regression-based gene network inference methods have
been developed, such as [15–17].
Competing Interests: The authors have declared
that no competing interests exist.
BMA is an ensemble method that accounts for model uncertainty by averaging over the
predictions of multiple models [18,19]. In the context of gene network inference, a model is a
set of candidate regulators. We previously showed the effectiveness of using variants of BMA as
multivariate variable selection methods in the context of time series gene expression data [7–9].
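For reference, the regression and averaging steps described above can be written in the standard BMA form. The notation below is generic textbook notation assumed here for illustration ($D$ for the data, $M_k$ for the $k$-th candidate set of regulators, $\Delta$ for a quantity of interest, $x_r$ for the expression level of candidate regulator $r$); it is not necessarily the notation or the exact model used in the authors' earlier work or in networkBMA.

```latex
\begin{align}
y &= \beta_0 + \sum_{r \in M_k} \beta_r x_r + \varepsilon
  && \text{(regression under model } M_k\text{)} \\
P(\Delta \mid D) &= \sum_{k=1}^{K} P(\Delta \mid M_k, D)\, P(M_k \mid D)
  && \text{(model-averaged prediction)} \\
P(M_k \mid D) &= \frac{P(D \mid M_k)\, P(M_k)}{\sum_{l=1}^{K} P(D \mid M_l)\, P(M_l)}
  && \text{(posterior model probability)} \\
P(\beta_r \neq 0 \mid D) &= \sum_{k \,:\, r \in M_k} P(M_k \mid D)
  && \text{(posterior inclusion probability of regulator } r\text{)}
\end{align}
```

Under this reading, the posterior probability attached to an edge from regulator $r$ to the target gene is the total posterior weight of the models that include $r$.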
Fig 1. Overview of GUIdock.
doi:10.1371/journal.pone.0152686.g001
GUIdock: Using Docker Containers with a Common Graphics User Interface
PLOS ONE | DOI:10.1371/journal.pone.0152686 April 5, 2016 3/14
networkBMA. We implemented these BMA network inference methods in R and C++.
Our networkBMA package [10] is publicly available from the Bioconductor repository
(http://bioconductor.org/packages/release/bioc/html/networkBMA.html). The main function
networkBMA takes gene expression data as one of the input arguments, allows the specification
of prior probabilities to guide the search of optimal parent nodes for each target gene,
and outputs an edge list consisting of edges in the form of (parent, child, posterior
probability) relationships. The user can also specify a posterior probability threshold to
filter out edges below the given threshold.
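The thresholding step can be illustrated with a short Python sketch. This is not the networkBMA R interface: the `Edge` type, the `filter_edges` helper, the gene names and the probabilities are all hypothetical, and treating the threshold as inclusive is an assumption.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    """One row of an edge list: regulator, target, and edge posterior probability."""
    parent: str
    child: str
    posterior: float


def filter_edges(edges: List[Edge], threshold: float) -> List[Edge]:
    """Return the edges whose posterior probability is at least `threshold`.

    Whether the real package treats the boundary as inclusive is an assumption here.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [edge for edge in edges if edge.posterior >= threshold]


# Made-up values for illustration only; these are not results from the paper.
example_edges = [
    Edge("geneA", "geneB", 0.95),
    Edge("geneA", "geneC", 0.40),
    Edge("geneB", "geneC", 0.75),
]
kept = filter_edges(example_edges, threshold=0.5)
```

With these toy values, `kept` retains the two edges at or above 0.5 and drops the `geneA` to `geneC` edge.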
In addition to gene network inference, the networkBMA package also features functions for
the assessment of gene networks. Specifically, the contabs.netwBMA function compares
the edges in the inferred network to a given set of known regulatory relationships using a
contingency table approach, and the scores function computes assessment statistics
corresponding to the contingency table, including sensitivity, precision, specificity, recall,
etc. There are also functions to plot and compute the area under the receiver operating
characteristic (ROC) and precision-recall (PR) curves.
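The contingency table approach can be sketched in Python as follows. The functions below are hypothetical stand-ins written for illustration, not the R functions contabs.netwBMA and scores, and they assume that the inferred and known edges are both drawn from an explicit set of candidate edges so that true negatives can be counted.

```python
from typing import Dict, Set, Tuple

EdgePair = Tuple[str, str]  # (parent, child)


def contingency_table(inferred: Set[EdgePair], known: Set[EdgePair],
                      candidates: Set[EdgePair]) -> Dict[str, int]:
    """Count true/false positives and negatives over the candidate edges."""
    return {
        "TP": len(inferred & known),               # inferred and known
        "FP": len(inferred - known),               # inferred but not known
        "FN": len(known - inferred),               # known but missed
        "TN": len(candidates - inferred - known),  # neither inferred nor known
    }


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def assessment_scores(table: Dict[str, int]) -> Dict[str, float]:
    """Derive summary statistics from the contingency table."""
    tp, fp, fn, tn = table["TP"], table["FP"], table["FN"], table["TN"]
    return {
        "sensitivity": _ratio(tp, tp + fn),  # also called recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }


# Toy three-gene example with made-up edges (not data from the paper).
genes = ["g1", "g2", "g3"]
candidates = {(p, c) for p in genes for c in genes if p != c}
inferred = {("g1", "g2"), ("g2", "g3"), ("g3", "g1")}
known = {("g1", "g2"), ("g1", "g3")}
table = contingency_table(inferred, known, candidates)
stats = assessment_scores(table)
```

In this sketch sensitivity and recall are the same quantity, TP / (TP + FN), so only one of the two names is reported.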
Cytoscape. Visualization and analyses of networks are integral to systems biology
research. Cytoscape is a well-established Java-based stand-alone application for analyzing and
visualizing networks [20–22]. Cytoscape offers a user-friendly graphical user interface (GUI)
for visualizing a given network, and provides an app store at http://apps.cytoscape.org/ from
which apps for various systems biology applications can be downloaded [23]. In addition,
software developers can submit new apps to be made available from the Cytoscape App Store, such
as [11,24–26]. As an example, cyREST is a RESTful API module for Cytoscape [24], and the
Ideker Lab has demonstrated the use of Cytoscape, cyREST and Docker in a meeting focused
on visualizing biological data (VIZBI 2015 [27]).
CyNetworkBMA app. CyNetworkBMA [11] is available on the Cytoscape App Store at
http://apps.cytoscape.org/apps/cynetworkbma. It is an easy-to-use tool that integrates our net-
workBMA Bioconductor package into Cytoscape, thus allowing the user to directly visualize
the resulting gene networks using the Cytoscape utilities. In particular, CyNetworkBMA uses
Rserve to integrate with R over a binary protocol on top of TCP/IP [28]. This means Cytoscape
and R run in separate processes, potentially on different machines and platforms.
While the CyNetworkBMA app adds an easy-to-use graphical interface and visualization
utilities to the BMA-based gene network inference methods implemented in the net-
workBMA Bioconductor package, there are many steps involved in installing CyNet-
workBMA. In addition to Cytoscape and R, CyNetworkBMA depends on multiple R and
Bioconductor packages, including networkBMA for network inference and assessment,
igraph [29] for algorithms used in removing potential cycles from networks, and Rserve for
exposing R services over TCP/IP.
Docker for bioinformatics applications. Docker is an emerging platform that is gaining
traction in the scientific community [30]. In particular, Rocker is a project containing pre-built
Docker images and Dockerfiles to run R using Docker containers [31], and the Bioconductor
project has deposited Docker Images in Docker Hub and source Dockerfiles in GitHub [32]. As
another example, the BioDocker and BioBoxes projects have GitHub repositories for
pre-configured containers with bioinformatics tools [33]. As of September 2015, there are about
15 containers in the BioDocker repository, and most of these containers are built for proteomics
and mass spectrometry data analyses. In addition, the Genouest group in France has begun hosting
a repository for Docker containers called Bioshadock [34]. Galaxy has an easy-to-use interface
and provides a browser-based infrastructure for workflow management. Users can build a
workflow in Galaxy and export the workflow to Docker [35,36].
Our Contributions
Docker containers have largely been used for non-graphical applications. Containers have not
been used to distribute the normal GUI based workflows that users are accustomed to. This is
due to the difference in high level APIs used for the different windowing environments by the
major operating systems. GUIdock addresses this deficiency by configuring a common X11
windowing interface on Linux, Mac OS and Windows. A GUI workflow using X Windows can
now be duplicated on most platforms.
We demonstrate the feasibility of using Docker for applications with a GUI, and hence
containers that support software tools and data analytic pipelines with a graphical user
interface. In particular, we illustrate the use of containers for systems biology applications,
including Bioconductor packages written in R and C++, and Cytoscape, a stand-alone Java-based
application with a graphical user interface. The Docker package we present is a proof-of-concept
example that containers can enhance the reproducibility of analytic results produced using
applications with graphical interfaces. Our Docker image and Dockerfile are publicly available at
https://github.com/WebDataScience/GUIdock.
Prior to GUIdock, the workaround when graphics interaction is required has been to
distribute web APIs that provide a consistent GUI. An example is cyREST [24], which is also
available as a Docker package. cyREST provides a RESTful API to Cytoscape and uses a JavaScript
library (cytoscape.js) to render graphics on the different host systems. However, some knowledge
of programming is needed to use the API and the result is dependent on the browser and operating
system. In contrast, our GUIdock package exports the native Cytoscape GUI into a more consistent
X Windows environment. A user simply double-clicks on an icon and the application pops up
and is used exactly as it would be when run in its native environment. Fig 2 compares GUIdock
to virtual machines and Fig 3 shows the software components added by GUIdock.
Materials and Methods
A Docker container is essentially a barebones wrapper providing a Linux environment for a set
of user applications. Linux distributions use X Windows (X11) as their native GUI. The chal-
lenge is to produce the same GUI on hosts that run other operating systems. The solution
adopted by GUIdock is to pass the container X Windows commands to a host X Window emu-
lator which renders the GUI. No additional software needs to be included within the container.
Everything is done by scripts that are easily modified to install, configure and run any Docker-
file or Docker Image. GUIdock is a truly portable approach that is independent of the contents
of the container. Fig 3 summarizes the relationship between various layers, including host
operating system, Docker engine, X11 and software applications included in GUIdock.
Building the CyNetworkBMA GUIdock package
Docker images are reproducible, and it is possible to build a new image from a previously
established image. Since we have chosen gene network inference as our proof-of-concept example,
the Bioconductor Base Image is chosen as the starting point. This image contains R and basic
Bioconductor packages. The Dockerfile starts with this image. We then added all the other
tools needed to run the CyNetworkBMA app to our Docker image, including R packages such
as Rserve [28], igraph [29], BMA [37], the Bioconductor package networkBMA [10], and
additional software like Java and Cytoscape. We also configured our Docker image to run Rserve in
the background before launching Cytoscape.
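The start-up order described above (Rserve first, then Cytoscape) can be sketched as a small Python launcher. This is an illustration only: the text does not show how the GUIdock image implements this step, the Cytoscape install path is hypothetical, and the fixed pause before starting Cytoscape is an assumed simplification. `R CMD Rserve` and `cytoscape.sh` are the usual command-line entry points for Rserve and for Cytoscape on Linux.

```python
import subprocess
import time
from typing import Callable, List

# Assumed commands; the real GUIdock image may use different paths or options.
RSERVE_CMD: List[str] = ["R", "CMD", "Rserve"]
CYTOSCAPE_CMD: List[str] = ["/opt/cytoscape/cytoscape.sh"]  # hypothetical install path


def launch(start_background: Callable = subprocess.Popen,
           run_foreground: Callable = subprocess.call,
           settle_seconds: float = 2.0,
           sleep: Callable[[float], None] = time.sleep) -> int:
    """Start Rserve without waiting for it, pause briefly, then run Cytoscape.

    Returns whatever `run_foreground` returns (the exit status of Cytoscape
    when the default `subprocess.call` is used).
    """
    start_background(RSERVE_CMD)          # does not block
    sleep(settle_seconds)                 # crude wait for Rserve to come up
    return run_foreground(CYTOSCAPE_CMD)  # blocks until the GUI exits
```

Calling `launch()` with its defaults inside the container would start both programs; the runner and sleep arguments are injectable so the ordering can be checked without R or Cytoscape installed.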
CyNetworkBMA uses the Cytoscape GUI for user interaction. After the Docker package is
created, additional steps are needed to forward the GUI information and render the graphics
on the local host. These steps depend on the operating system that the local host is
running and are automated by GUIdock. For a step-by-step guide on how to deploy a GUIdock
package, please refer to the user manual uploaded as S1 Text and to the next section.
GUIdock: on Linux operating systems
The simplest case is when containers are deployed on Linux systems which use X Windows
natively. Since the containers use the same OS as the host, a virtual machine is unnecessary.
Instead, the host OS and resources are used and only the necessary supporting libraries and
binaries are included in the container. The guest software in the container can also export GUI
information and let the host X Windows system render the GUI. These facilities are already pro-
vided by Docker. GUIdock uses a simple configurable bash script to automate the installation.
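On a Linux host, a common way to let a container draw on the host X server is to pass the host DISPLAY value and mount the X11 socket directory into the container. The Python sketch below only assembles such a `docker run` command line; it is an assumption about how this step might look, not a transcription of the GUIdock bash script, and the image name is a placeholder.

```python
import os
from typing import List, Mapping

X11_SOCKET_DIR = "/tmp/.X11-unix"  # conventional location of local X11 sockets


def linux_docker_run_args(image: str,
                          env: Mapping[str, str] = os.environ) -> List[str]:
    """Build a `docker run` argument list that shares the host X11 display."""
    display = env.get("DISPLAY", ":0")  # fall back to the first local display
    return [
        "docker", "run", "--rm",
        "-e", f"DISPLAY={display}",                  # tell X clients where to draw
        "-v", f"{X11_SOCKET_DIR}:{X11_SOCKET_DIR}",  # expose the host X socket
        image,
    ]


# "example/guidock-image" is a placeholder, not the published image name.
args = linux_docker_run_args("example/guidock-image", {"DISPLAY": ":1"})
```

The resulting list can be passed to `subprocess.run`. Depending on the X server's access control settings, the host may also need to authorize connections from the container.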
GUIdock: on Mac OS
Mac OS, like Linux, is a form of UNIX but differs sufficiently that the host OS is not used and a
VM is required to encapsulate a guest Linux OS in the container. The creators of Docker have
provided Docker Machine, which uses VirtualBox to create the necessary VM. However, the
guest OS cannot directly export GUI commands to the host Mac OS as support for X Windows
was dropped in OS X 10.8 (Mountain Lion). Therefore, we use XQuartz [38], an open source
project, to provide X11 support on the host. We use socat [39] to bind XQuartz services to an
open port and make XQuartz reachable by the Docker container. Socat is a command line
based utility that establishes two bidirectional byte streams and transfers data between them.
The DISPLAY environment variable in the Docker container is also set to the IP address of the
local host running OS X. In this way, the internal X Windows commands are exported to
XQuartz, which renders the graphics on the host computer. GUIdock provides a set of bash
scripts to automate the entire installation, configuration and run process.
Fig 2. A comparison of the architecture of virtual machines and Docker software containers.
Virtual machines are denoted by cyan boxes and software containers are denoted by green boxes. The
left stack is a Type-2 virtual machine (VM) which uses a hypervisor to emulate the guest OS. The
application software, dependencies, and the guest OS are all contained inside the VM. A separate
VM, dependencies and guest OS are required for each application stack that is to be deployed. The
middle stack depicts Docker container software on a Linux host. Docker uses the host Linux system
and packages the application and dependencies into modular containers. No VM is necessary and the
OS resources for the two application stacks are shared between different containers. The right
stack depicts Docker on a non-Linux system. Because Docker requires Linux, a lightweight VM with a
mini-Linux Guest OS is necessary to run Docker and encapsulate the software containers. This still
has the advantage that only a single VM and Guest Linux system is required regardless of the
number of containers.
doi:10.1371/journal.pone.0152686.g002
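The Mac OS configuration described in this section, with socat relaying a TCP port to XQuartz and DISPLAY pointing at the host IP address, can be sketched in Python as follows. The exact socat invocation used by the GUIdock scripts is not given in the text, so the address options below are an assumption based on common usage; the socket path and IP address in the example are hypothetical.

```python
from typing import List

X11_BASE_PORT = 6000  # by convention, X11 display N listens on TCP port 6000 + N


def socat_args(display_number: int, xquartz_socket: str) -> List[str]:
    """Build a socat command that relays a TCP port to the XQuartz UNIX socket."""
    port = X11_BASE_PORT + display_number
    return [
        "socat",
        f"TCP-LISTEN:{port},reuseaddr,fork",  # accept connections from the container
        # Literal quotes keep socat from splitting a socket path that contains ':'.
        f'UNIX-CLIENT:"{xquartz_socket}"',
    ]


def container_display(host_ip: str, display_number: int = 0) -> str:
    """DISPLAY value to set inside the container, e.g. '192.168.99.1:0'."""
    return f"{host_ip}:{display_number}"


# Hypothetical values: the socket path is normally the host's own $DISPLAY.
relay = socat_args(0, "/private/tmp/example-launchd/org.macosforge.xquartz:0")
display = container_display("192.168.99.1")
```

Display 0 maps to TCP port 6000, so the container reaches XQuartz by connecting to the host IP on that port.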
GUIdock: on Microsoft Windows operating systems
The Microsoft Windows operating systems are very different from Linux and use several
different proprietary APIs to implement their native GUI. The Windows version of Docker Machine,
using VirtualBox, provides a VM with Linux for the container. The current Docker
Toolbox also provides Kitematic, a GUI-based manager to deploy containers, but it does not render
the GUI from software within containers. We use a lightweight application, MobaXterm
[40], for this purpose. Although MobaXterm is proprietary, a full-featured free version is
available for download at http://mobaxterm.mobatek.net/download.html. MobaXterm provides X
Windows support and supports ssh (secure shell) tunneling. Ssh is a widely used UNIX-based
command interface and protocol for securely accessing a remote computer. Using MobaXterm,
we set up X11 forwarding using ssh to connect the Docker container with a MobaXterm
terminal. The GUI commands pass through the ssh tunnel to the MobaXterm X Windows emulator,
which renders the GUI on the host system. GUIdock provides a double-clickable script to
initialize and configure the environment and run the Docker container.
Fig 3. Software components added by GUIdock. Our GUIdock package ensures that the X11 libraries
are present in the Linux OS that runs Docker. In the case of a Linux host, this is the host OS.
For Windows and Mac OS, this is a guest OS inside a VM. An additional X Windows emulation layer is
configured for Windows and Mac OS which allows for GUI commands to be exported from the container
software and rendered on the host.
doi:10.1371/journal.pone.0152686.g003
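The ssh-based X11 forwarding used on Windows can be illustrated by assembling an OpenSSH command line in Python. This sketch is an assumption about the general shape of such a command, not the contents of the GUIdock script; the user name, host address and remote command are hypothetical. The `-X` and `-Y` options are the standard OpenSSH switches for untrusted and trusted X11 forwarding.

```python
from typing import List, Sequence


def ssh_x11_args(user: str, host: str, remote_command: Sequence[str],
                 port: int = 22, trusted: bool = False) -> List[str]:
    """Build an ssh command that forwards X11 traffic back to the local X server."""
    forwarding_flag = "-Y" if trusted else "-X"  # -Y skips X11 security extension limits
    return [
        "ssh", forwarding_flag,
        "-p", str(port),
        f"{user}@{host}",
        *remote_command,  # program whose GUI should appear on the local display
    ]


# Hypothetical login and command, for illustration only.
cmd = ssh_x11_args("docker", "192.168.99.100", ["xclock"])
```

With forwarding enabled, the ssh server sets DISPLAY for the remote session, so X clients started through it draw on the local X server (here, the one provided by MobaXterm).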
Availability and requirements
Project name: GUIdock
Project home page: https://github.com/WebDataScience/GUIdock
Contents available for download: Docker Images, Dockerfiles, installation scripts and execution
scripts.
Operating system(s): Linux, Mac OS X, Microsoft Windows. Specifically, we tested GUIdock on:
Linux: Fedora, Ubuntu 15.04
Mac OS X: 10.9, 10.10
Microsoft Windows: 7, 8.1, 10
Demo video (S1 Video): https://youtu.be/k1WkIx0EENo
Results
We deployed the CyNetworkBMA GUIdock package and applied it to three different datasets
of biological relevance: RNAseq data across human cancer cell lines [41], yeast time series gene
expression data [7] and DREAM4 simulated time series data [42]. Note that our demos cover
static (non-time series) RNAseq gene expression data in human, time series microarray data in
a simple model organism (yeast), and simulated time series data.
We show that we get identical results after deploying the package on Linux, Mac OS and
Windows. We added these test data and results to the GUIdock image. We encourage the read-
ers to download our image and reproduce the results shown in Fig 4.
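One way to check programmatically that two platforms produced the same network is to compare order-independent digests of the exported edge lists. The Python sketch below is an assumption about how such a check could be done, not the comparison procedure used in this work; the helper name, the rounding to six decimals and the toy edge values are all hypothetical.

```python
import hashlib
from typing import Iterable, Tuple

EdgeRow = Tuple[str, str, float]  # (parent, child, posterior probability)


def edge_list_digest(edges: Iterable[EdgeRow], decimals: int = 6) -> str:
    """Return a SHA-256 digest that ignores row order and tiny float differences."""
    canonical = sorted(
        (parent, child, f"{prob:.{decimals}f}") for parent, child, prob in edges
    )
    payload = "\n".join("\t".join(row) for row in canonical)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Toy edge lists with made-up values, standing in for exports from two hosts.
linux_result = [("g1", "g2", 0.9), ("g2", "g3", 0.5)]
mac_result = [("g2", "g3", 0.5), ("g1", "g2", 0.9)]
same_network = edge_list_digest(linux_result) == edge_list_digest(mac_result)
```

Here `same_network` is true because the two lists contain the same edges in a different order; any change to an edge or to its rounded probability changes the digest.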
Scenario 1: human cancer RNAseq data
Klijn et al. generated extensive RNAseq gene expression data across 675 frequently used
human cancer cell lines [41]. We downloaded the variance stabilized version of the normalized
RNAseq data produced by the DESeq Bioconductor package [43] from
http://research-pub.gene.com/KlijnEtAl2014/. We then extracted a subset of 84 genes that belong
to 21 cancer-related pathways that are known to be functionally altered in cancer (see
Supplementary Table 12 in Klijn et al. [41]). This is a steady-state (non-time series) dataset.
We applied the ScanBMA [9] gene network inference algorithm as implemented in the CyNetworkBMA
app from within the GUIdock container.
Fig 4 shows the identical results generated by the GUIdock package when installed on
computers running the Linux, Mac OS X and Windows operating systems respectively, after
applying CyNetworkBMA to the same Klijn et al. cancer cell line RNAseq data. We demonstrate the
reproducibility of analytical results when GUIdock is deployed on local hosts running different
operating systems. Fig 5 shows a zoomed-in sub-graph of the same network. From Fig 5, we
observe inferred edges among nodes (CDKN2A, CDKN2B, CCNE1, CCND1) that are part of
the cell cycle pathway as indicated in Supplementary Table 12 in Klijn et al. [41]. Similarly, we
also observe inferred edges among nodes (ZRSR2, U2AF1, U2AF2, SRSF2, and SF3A1) that
belong to the splicing pathway.
Scenario 2: Yeast time series microarray data
We also deployed the GUIdock package and applied CyNetworkBMA to a subset of the yeast
time series data [7]. Yeung et al. profiled the response of 97 yeast segregants over six time
points subjected to rapamycin perturbation using microarrays. We extracted a 100-gene subset
and applied CyNetworkBMA using all default settings. Similar to the previous subsection, we
deployed GUIdock on both Windows and Mac OS. See S1 Text for screenshots.
Scenario 3: DREAM 4 simulated data
The DREAM4 simulated time series data consist of 100 genes over 21 time points [42]. We
deployed the GUIdock package on Linux, Mac and Windows. See S1 Text for screenshots.
Fig 4. Screen shot of gene networks generated by GUIdock on (a) Linux, (b) Mac OS, (c) Microsoft
Windows using the human cancer RNAseq data from Klijn et al. Our goal is to demonstrate the
reproducibility of analytical results when GUIdock is deployed on computers running different
operating systems.
doi:10.1371/journal.pone.0152686.g004
Discussion
In academic bioinformatics research, data is usually collected and analyzed using a suite of
multiple software tools, implemented by different developers using different languages. Most
workflows are not completely command-line based and have a GUI component. Gathering
and compiling the software components is not a trivial task and is not always sufficient to
reproduce identical results as reported in the scientific literature. GUIdock uses Docker con-
tainers to replicate and distribute the original workflow.
We used gene network inference as a proof-of-concept example. Distribution of the CyNet-
workBMA app was non-trivial because the user is required to install and set up dependencies
such as Rserve, even though the app itself was written in portable java and Cytoscape runs
natively on different platforms. In our figures and demo videos (S1 and S2 Videos), we show
that by using GUIdock, the user only has to run a provided script to replicate the original envi-
ronment and reproduce the results.
Fig 5. A zoomed-in version of the screen shot of gene network generated by GUIdock using the
human cancer RNAseq data from Klijn et al. Nodes (CDKN2A, CDKN2B, CCNE1, CCND1) that are part of
the cell cycle pathway are highlighted in blue. Nodes (ZRSR2, U2AF1, U2AF2, SRSF2, and SF3A1)
that belong to the splicing pathway are highlighted in green.
doi:10.1371/journal.pone.0152686.g005
Bioinformaticians often have to deal with big data that are stored in the cloud. The modular
Docker container repository paradigm is particularly well suited for building cloud-based
applications. In the future, we plan to extend GUIdock to work with Docker containers deployed in
the cloud, and we have done some preliminary work on configuring Microsoft Azure instances and
on working with multiple containers. We also plan to study the performance impact of adding the
X11 layer to GUIdock in various operating systems.
The primary purpose of this manuscript is to provide a proof of concept for using the
GUIdock methodology to reproduce bioinformatics results. The choice of software was primarily
driven by robustness and the ease of installation and use by the end user. Another obvious
application for container methods is the construction and deployment of custom pipelines
from individual modules rather than from downloading an entire application suite. In this
case, performance becomes a consideration and we may need to reconsider some design
choices such as the X emulator used and the use of an ssh tunnel to transfer X Windows
information.
To summarize, we have developed GUIdock, a new container based workflow to deploy
GUI based pipelines. We have demonstrated the effectiveness of GUIdock by using it to deploy
our Cytoscape based CyNetworkBMA app on Linux, Mac OS and Windows host systems. We
have provided scripts to automate the installation and deployment of GUIdock packages. We
anticipate that GUIdock will be an important step in solving the problem of testing and repro-
ducing scientific results that come from ever-increasingly complicated multi-component
software.
Supporting Information
S1 Text. User manual for GUIdock. In this user manual, we describe the use of our scripts to
install and run GUIdock on Linux, Mac OS and Windows. In addition, we provide a step-by-
step guide to document each step in the installation and deployment process. We also included
additional screen shots for the demos described in the Results section.
(PDF)
S1 Video. Demonstration of GUIdock on Linux, Mac OS and Windows. In this video, we
ran GUIdock on the same sample dataset across Linux, Mac OS and Windows. We demonstrate
that identical gene networks were derived in each operating system. We chose a 9-gene
subset of the human cancer RNAseq data from Klijn et al. [41]. Among these 9 genes,
(CDKN2A, CDKN2B, CCNE1, CCND1) belong to the cell cycle pathway and (ZRSR2, U2AF1,
U2AF2, SRSF2, SF3A1) belong to the splicing pathway as indicated in Supplementary Table 12
in Klijn et al. [41]. This video is also available on YouTube at
https://www.youtube.com/watch?v=k1WkIx0EENo.
(MOV)
S2 Video. Installation of GUIdock on Linux. In this video, we showed the steps involved in
installing GUIdock on computers running Linux. This video (no voice) is available on
YouTube at https://www.youtube.com/watch?v=HOtI1Eg2J1Q. A version with audio is available
on YouTube at
https://www.youtube.com/watch?v=-CrfhxNuMgc&index=2&list=PLczI6k_oOIdbZQTMTMRcD9QmfWCXAuLsd.
(MOV)
S3 Video. Installation of GUIdock on Mac OS. In this video, we showed the steps involved in
installing GUIdock on computers running Mac OS. This video is also available on YouTube at
https://www.youtube.com/watch?v=4Qg0fCDOxhY.
(MOV)
S4 Video. Installation of GUIdock on Windows. In this video, we showed the steps involved
in installing GUIdock on computers running Windows. This video is also available on
YouTube at https://www.youtube.com/watch?v=cA7HVCB064I.
(MOV)
Acknowledgments
We thank Sina Khankhajeh and Migao Wu for testing the Docker package. We also thank
Chris Fraley, Adrian Raftery and William Chad Young for their contributions to the
networkBMA package, and Maciej Fronczuk for his contributions to the CyNetworkBMA app.
Author Contributions
Conceived and designed the experiments: KYY. Performed the experiments: LHH DK SBL.
Analyzed the data: DK SBL. Contributed reagents/materials/analysis tools: LHH DK SBL.
Wrote the paper: LHH DK SBL KYY. Developed the Docker package: LHH DK SBL. Tested
the Docker package: LHH DK SBL. Developed GUIdock workflow: SBL DK LHH. GUIdock
scripts: LHH DK. Created supporting information: DK SBL LHH.
References
1. Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, et al. Promoting an open
research culture: Author guidelines for journals could help to promote transparency, openness and
reproducibility. Science. 2015; 348(6242):1422-1425. doi: 10.1126/science.aab2374
2. Buck S. Solving reproducibility. Science. 2015; 348:1403. doi: 10.1126/science.aac8041 PMID:
26113692
3. Kaiser J. The cancer test. Science. 2015; 348:1411-1413. doi: 10.1126/science.348.6242.1411 PMID:
26113698
4. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open
software development for computational biology and bioinformatics. Genome Biology. 2004; 5(10):R80.
doi: 10.1186/gb-2004-5-10-r80 PMID: 15461798
5. Biopython. Available from: http://biopython.org/wiki/Main_Page.
6. BioPerl. Available from: http://www.bioperl.org/wiki/Main_Page.
7. Yeung KY, Dombek KM, Lo K, Mittler JE, Zhu J, Schadt EE, et al. Construction of regulatory networks
using expression time-series data of a genotyped population. Proceedings of the National Academy of
Sciences. 2011; 108(48):19436-19441. doi: 10.1073/pnas.1116442108
8. Lo K, Raftery A, Dombek K, Zhu J, Schadt E, Bumgarner R, et al. Integrating external biological
knowledge in the construction of regulatory networks from time-series expression data. BMC Systems
Biology. 2012; 6(1):101. doi: 10.1186/1752-0509-6-101 PMID: 22898396
9. Young WC, Raftery AE, Yeung KY. Fast Bayesian inference for gene regulatory networks using
ScanBMA. BMC Systems Biology. 2014; 8(1):47. doi: 10.1186/1752-0509-8-47 PMID: 24742092
10. Yeung KY, Fraley C, Young WC, Bumgarner R, Raftery AE. Bayesian Model Averaging methods and R
package for gene network construction. In: Big Data Analytic Technology For Bioinformatics and Health
Informatics (KDDBHI), workshop at the 20th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining (KDD); 2014.
11. Fronczuk M, Raftery AE, Yeung KY. CyNetworkBMA: a Cytoscape app for inferring gene regulatory
networks. Under revision.
12. Chuang HY, Hofree M, Ideker T. A decade of systems biology. Annual Review of Cell and
Developmental Biology. 2010; 26:721-744. doi: 10.1146/annurev-cellbio-100109-104122 PMID: 20604711
13. Novere NL. Quantitative and logic modelling of molecular and gene networks. Nature Reviews
Genetics. 2015; 16:146-158. doi: 10.1038/nrg3885 PMID: 25645874
14. Emmert-Streib F, Glazko GV, Altay G, de Matos Simoes R. Statistical inference and reverse
engineering of gene regulatory networks from observational expression data. Frontiers in Genetics.
2012; 3:8. doi: 10.3389/fgene.2012.00008 PMID: 22408642
15. Zhang SQ, Ching WK, Tsing NK, Leung HY, Guo D. A new multiple regression approach for the
construction of genetic regulatory networks. Artificial Intelligence in Medicine. 2010; 48:153-160.
doi: 10.1016/j.artmed.2009.11.001 PMID: 19963359
16. Charbonnier C, Chiquet J, Ambroise C. Weighted-LASSO for structured network inference from time
course data. Statistical Applications in Genetics and Molecular Biology. 2010; 9:15.
doi: 10.2202/1544-6115.1519
17. Liu LZ, Wu FX, Zhang WJ. A group LASSO-based method for robustly inferring gene regulatory
networks from multiple time-course datasets. BMC Systems Biology. 2014; 8 (Suppl 3):S1.
doi: 10.1186/1752-0509-8-S3-S1
18. Raftery AE. Bayesian model selection in social research (with Discussion). Sociological Methodology;
25:111-196. doi: 10.2307/271063
19. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: A tutorial (with
Discussion). Statistical Science; 14:382-401.
20. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software
environment for integrated models of biomolecular interaction networks. Genome Research. 2003;
13(11):2498-2504. doi: 10.1101/gr.1239303 PMID: 14597658
21. Christmas R, Avila-Campillo I, Bolouri H, Schwikowski B, Anderson M, Kelley R, et al. Cytoscape: A
Software Environment for Integrated Models of Biomolecular Interaction Networks; 2005. Available
from: http://educationbook.aacrjournals.org/cgi/content/full/2005/1/12.
22. Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, Workman C, et al. Integration of biological
networks and gene expression data using Cytoscape. Nature Protocols. 2007; p. 2366-2382.
doi: 10.1038/nprot.2007.324 PMID: 17947979
23. Pico AR, Bader GD, Demchak B, Pla OG, Hull T, Longabaugh W, et al. The Cytoscape app article
collection. F1000 Research. 2014; 3:138. doi: 10.12688/f1000research.4642.1 PMID: 25580224
24. Ono K, Muetze T, Kolishovski G, Shannon P, Demchak B. CyREST: Turbocharging Cytoscape Access
for External Tools via a RESTful API. F1000 Research. 2015; 4:478.
doi: 10.12688/f1000research.6767.1 PMID: 26672762
25. Cumbo F, Paci P, Santoni D, Paola LD, Giuliani A. GIANT: A Cytoscape Plugin for Modular Networks.
PLoS ONE; 9:e105001. doi: 10.1371/journal.pone.0105001 PMID: 25275465
26. Kutmon M, Kelder T, Mandaviya P, Evelo CTA, Coort SL. CyTargetLinker: A Cytoscape App to
Integrate Regulatory Interactions in Network Analysis. PLoS ONE; 8:e82160.
doi: 10.1371/journal.pone.0082160 PMID: 24340000
27. Ono K. VIZBI 2015 Tutorial: Cytoscape, iPython, Docker, and reproducible workflow. Available from:
https://github.com/idekerlab/cyREST/wiki/VIZBI-2015-Tutorial.
28. Urbanek S. A Fast Way to Provide R Functionality to Applications. In: Proceedings of DSC; 2003. p. 2.
29. Csardi G, Nepusz T. The igraph Software Package for Complex Network Research. InterJournal. 2006;
Complex Systems:1695.
30. Boettiger C. An introduction to Docker for reproducible research, with examples from the R
environment. ACM SIGOPS Operating Systems Review, Special Issue on Repeatability and Sharing of
Experimental Artifacts. 2015; 49(1):71-79. doi: 10.1145/2723872.2723882
31. Introducing Rocker: Docker for R. Available from: http://dirk.eddelbuettel.com/blog/2014/10/23/.
32. Docker containers for Bioconductor. Available from: https://www.bioconductor.org/help/docker/.
33. Bio Docker: Docker for Bioinformatics. Available from: http://biodocker.org/.
34. Moreews F, Sallou O, Ménager H, Le Bras Y, Monjeaud C, Blanchet C, et al. BioShaDock: a community
driven bioinformatics shared Docker-based tools registry. F1000Research. 2015; 4:1443.
doi: 10.12688/f1000research.7536.1 PMID: 26913191
35. Goecks J, Nekrutenko A, Taylor J, et al. Galaxy: a comprehensive approach for supporting accessible,
reproducible, and transparent computational research in the life sciences. Genome Biol. 2010; 11(8):
R86. doi: 10.1186/gb-2010-11-8-r86 PMID: 20738864
36. Aranguren ME, Wilkinson MD. Enhanced reproducibility of SADI web service workflows with Galaxy
and Docker. GigaScience. 2015; 4(1):19. doi: 10.1186/s13742-015-0092-3
37. BMA: Bayesian Model Averaging. Package for Bayesian model averaging for linear models,
generalizable linear models and survival models (cox regression). Available from:
https://cran.r-project.org/web/packages/BMA/index.html.
38. XQuartz: A version of the X Window System that runs on OS X. Available from:
http://xquartz.macosforge.org/landing/.
39. socat: Multipurpose relay (SOcket CAT). Available from:
http://www.dest-unreach.org/socat/doc/socat.html.
40. MobaXterm: Enhanced terminal for Windows with X11 server, tabbed SSH client, network tools and
much more. Available from: http://mobaxterm.mobatek.net/.
41. Klijn C, Durinck S, Stawiski EW, Haverty PM, Jiang Z, Liu H, et al. A comprehensive transcriptional
portrait of human cancer cell lines. Nature Biotechnology. 2015; 33:306-312. doi: 10.1038/nbt.3080
PMID: 25485619
42. Marbach D, Schaffter T, Mattiussi C, Floreano D. Generating Realistic In Silico Gene Networks for
Performance Assessment of Reverse Engineering Methods. Journal of Computational Biology. 2009;
16(2):229-239. doi: 10.1089/cmb.2008.09TT PMID: 19183003
43. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;
11:R106. doi: 10.1186/gb-2010-11-10-r106 PMID: 20979621

Supplementary resources (2)

... Though raw metabolomics data can be uploaded and accessed through online databases such as MetaboLights [24] or metabolomics workbench [25], details of data analysis are not always transparent, and reduce the ability to fully reproduce the reported findings [26]. Data analysis software with a graphic user interface (GUI) can be easy to use and document, but is also restricted to only defined operations [27]. An open source data processing script can represent every step of the data analysis while still being flexible [28], but researchers need to adopt specific software within an integrated development environment (IDE), which also reduces reproducibility due to the lack of experience with certain software [29]. ...
... Therefore, we used SRM samples that are commercially available and commonly used in metabolomics workflows, and made the raw data accessible online for future potential research purposes. In order to provide full transparency on the data analysis, we choose a command line based script within a graphic user interface to make sure every step is recorded and reproducible by other researchers [27]. A docker image, xcmsrocker was created based on Rocker image [32], which pre-installs most of the R-based metabolomics and NTA data analysis software. ...
Article
Full-text available
Unknown features in untargeted metabolomics and non-targeted analysis (NTA) are identified using fragment ions from MS/MS spectra to predict the structures of the unknown compounds. The precursor ion selected for fragmentation is commonly performed using data dependent acquisition (DDA) strategies or following statistical analysis using targeted MS/MS approaches. However, the selected precursor ions from DDA only cover a biased subset of the peaks or features found in full scan data. In addition, different statistical analysis can select different precursor ions for MS/MS analysis, which make the post-hoc validation of ions selected following a secondary analysis impossible for precursor ions selected by the original statistical method. Here we propose an automated, exhaustive, statistical model-free workflow: paired mass distance-dependent analysis (PMDDA), for reproducible untargeted mass spectrometry MS2 fragment ion collection of unknown compounds found in MS1 full scan. Our workflow first removes redundant peaks from MS1 data and then exports a list of precursor ions for pseudo-targeted MS/MS analysis on independent peaks. This workflow provides comprehensive coverage of MS2 collection on unknown compounds found in full scan analysis using a “one peak for one compound” workflow without a priori redundant peak information. We compared pseudo-spectra formation and the number of MS2 spectra linked to MS1 data using the PMDDA workflow to that obtained using CAMERA and RAMclustR algorithms. More annotated compounds, molecular networks, and unique MS/MS spectra were found using PMDDA compared with CAMERA and RAMClustR. In addition, PMDDA can generate a preferred ion list for iterative DDA to enhance coverage of compounds when instruments support such functions. Finally, compounds with signals in both positive and negative modes can be identified by the PMDDA workflow, to further reduce redundancies. 
The whole workflow is fully reproducible as a docker image xcmsrocker with both the original data and the data processing template. Graphical Abstract
... Though raw metabolomics data can be uploaded and accessed through online databases such as MetaboLights (Haug et al., 2020) or Metabolomics Workbenchs (https://www.metabolomicsworkbench.org/), details of data analysis are not as transparent as data sharing, and reduce the ability to fully reproduce the reported findings (Goodman et al., 2016). Data analysis software with a graphic user interface (GUI) can be easy to use and document, but is also restricted to only defined operations (Hung et al., 2016). An open source data process script can represent every step of the data analysis while still being flexible, (Gandrud, 2013) but researchers need to adopt specific software within an integrated development environment (IDE), which also reduces reproducibility due to the lack of experience with certain software (Boettiger, 2015). ...
... Therefore, we used SRM samples that are commercially available and commonly used in metabolomics workflows, and made the raw data accessible online for future potential research purposes. In order to provide full transparency on the data analysis, we choose a command line based script within a graphic user interface to make sure every step is recorded and reproducible by other researchers (Hung et al., 2016). A docker image, xcmsrocker was created based on Rocker image (Boettiger and Eddelbuettel, 2017), which pre-installs most of the R-based metabolomics and NTA data analysis software. ...
Preprint
Full-text available
Motivation Unknown features in untargeted metabolomics and non-targeted analysis (NTA) are identified using fragment ions from MS/MS spectra to predict the structures of the unknown compounds. The precursor ion selected for fragmentation is commonly performed using data dependent acquisition (DDA) strategies or following statistical analysis using targeted MS/MS approaches. However, the selected precursor ions from DDA only cover a biased subset of the peaks or features found in full scan data. In addition, different statistical analysis can select different precursor ions for MS/MS analysis, which make the post-hoc validation of ions selected by new statistical methods impossible for precursor ions selected by the original statistical method. By removing redundant peaks and performing pseudo-targeted MS/MS analysis on independent peaks, we can comprehensively cover unknown compounds found in full scan analysis using a “one peak for one compound” workflow without a priori redundant peak information. Here we propose an reproducible, automated, exhaustive, statistical model-free workflow: paired mass distance-dependent analysis (PMDDA), for untargeted mass spectrometry identification of unknown compounds found in MS1 full scan. Results More annotated compounds/molecular networks/spectrum were found using PMDDA compared with CAMERA and RAMClustR. Meanwhile, PMDDA can generate the preferred ions list for iterative DDA to cover more compounds when instruments support such functions. Availability and implementation The whole workflow is fully reproducible as a docker image xcmsrocker with both the original data and the data processing template. https://hub.docker.com/r/yufree/xcmsrocker A related R package is developed and released online: https://github.com/yufree/rmwf. R script, data files and links of GNPS annotation results including MS1 peaks list and MS2 MGF files were provided in supplementary information.
... The major technical challenge to porting desktop based image analyses applications to the cloud is to support the same graphical interface and display on the cloud that one would see on a laptop or desktop. Bwb supports two methodologies for accomplishing this using software containers 20,21 . We combine both methods in Bwb to allow the user to export graphics from a container that functions both on a local laptop and on a remote cloud server. ...
Article
Full-text available
Modern biomedical image analyses workflows contain multiple computational processing tasks giving rise to problems in reproducibility. In addition, image datasets can span both spatial and temporal dimensions, with additional channels for fluorescence and other data, resulting in datasets that are too large to be processed locally on a laptop. For omics analyses, software containers have been shown to enhance reproducibility, facilitate installation and provide access to scalable computational resources on the cloud. However, most image analyses contain steps that are graphical and interactive, features that are not supported by most omics execution engines. We present the containerized and cloud-enabled Biodepot-workflow-builder platform that supports graphics from software containers and has been extended for image analyses. We demonstrate the potential of our modular approach with multi-step workflows that incorporate the popular and open-source Fiji suite for image processing. One of our examples integrates fully interactive ImageJ macros with Jupyter notebooks. Our second example illustrates how the complicated cloud setup of an computationally intensive process such as stitching 3D digital pathology datasets using BigStitcher can be automated and simplified. In both examples, users can leverage a form-based graphical interface to execute multi-step workflows with a single click, using the provided sample data and preset input parameters. Alternatively, users can interactively modify the image processing steps in the workflow, apply the workflows to their own data, change the input parameters and macros. By providing interactive graphics support to software containers, our modular platform supports reproducible image analysis workflows, simplified access to cloud resources for analysis of large datasets, and integration across different applications such as Jupyter.
... Containerized software or code can be run with dependencies installed within the container, which is isolated from packages or dependencies already installed in the host system. Nowadays, both console-based software and software with graphical user interface (GUI) can be containerized [122,123], and the software container supports both Linuxand Windows-based applications [124]. Some commonly used software containerization tools are Docker and Singularity [125,126], but Singularity has better support towards high-performance computing [127]. ...
Article
Full-text available
Clinical metabolomics emerged as a novel approach for biomarker discovery with the translational potential to guide next-generation therapeutics and precision health interventions. However, reproducibility in clinical research employing metabolomics data is challenging. Checklists are a helpful tool for promoting reproducible research. Existing checklists that promote reproducible metabolomics research primarily focused on metadata and may not be sufficient to ensure reproducible metabolomics data processing. This paper provides a checklist including actions that need to be taken by researchers to make computational steps reproducible for clinical metabolomics studies. We developed an eight-item checklist that includes criteria related to reusable data sharing and reproducible computational workflow development. We also provided recommended tools and resources to complete each item, as well as a GitHub project template to guide the process. The checklist is concise and easy to follow. Studies that follow this checklist and use recommended resources may facilitate other researchers to reproduce metabolomics results easily and efficiently.
... The package is distributed as open-source software 2 under a GPL3 licence. It is available for Linux and MacOS systems, as well as a Docker image [18], which can be deployed on Windows. It can be used as a Python library and being integrated in third-party applications, or used directly from the command line and called from bash scripts. ...
Preprint
Full-text available
We introduce Shennong, a Python toolbox and command-line utility for speech features extraction. It implements a wide range of well-established state of art algorithms including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks, pitch estimators as well as speaker normalization methods and post-processing algorithms. Shennong is an open source, easy-to-use, reliable and extensible framework. The use of Python makes the integration to others speech modeling and machine learning tools easy. It aims to replace or complement several heterogeneous software, such as Kaldi or Praat. After describing the Shennong software architecture, its core components and implemented algorithms, this paper illustrates its use on three applications: a comparison of speech features performances on a phones discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training and a comparison of pitch estimation algorithms under various noise conditions.
... 13 Containers have seen an increased uptake in the life sciences, both for delivering software tools and for facilitating data analysis in various ways. [14][15][16][17][18] When running more than just a few containers, an orchestration system is needed to coordinate and manage their execution and handle issues related to e.g. load balancing, health checks and scaling. ...
Article
Full-text available
Containers are gaining popularity in life science research as they provide a solution for encompassing dependencies of provisioned tools, simplify software installations for end users and offer a form of isolation between processes. Scientific workflows are ideal for chaining containers into data analysis pipelines to aid in creating reproducible analyses. In this article, we review a number of approaches to using containers as implemented in the workflow tools Nextflow, Galaxy, Pachyderm, Argo, Kubeflow, Luigi and SciPipe, when deployed in cloud environments. A particular focus is placed on the workflow tool’s interaction with the Kubernetes container orchestration framework.
... In the past few years several toolboxes have been released in an effort to address such challenges with using Galaxy [14][15][16][17][18][19]. Yet, these toolkits are often designed to analyse only one specific dimension of transcriptome diversity, and/or not fully automated and require some prior knowledge of R command line script [20]. ...
Article
Full-text available
Background As the number of RNA-seq datasets that become available to explore transcriptome diversity increases, so does the need for easy-to-use comprehensive computational workflows. Many available tools facilitate analyses of one of the two major mechanisms of transcriptome diversity, namely, differential expression of isoforms due to alternative splicing, while the second major mechanism—RNA editing due to post-transcriptional changes of individual nucleotides—remains under-appreciated. Both these mechanisms play an essential role in physiological and diseases processes, including cancer and neurological disorders. However, elucidation of RNA editing events at transcriptome-wide level requires increasingly complex computational tools, in turn resulting in a steep entrance barrier for labs who are interested in high-throughput variant calling applications on a large scale but lack the manpower and/or computational expertise. Results Here we present an easy-to-use, fully automated, computational pipeline (Automated Isoform Diversity Detector, AIDD) that contains open source tools for various tasks needed to map transcriptome diversity, including RNA editing events. To facilitate reproducibility and avoid system dependencies, the pipeline is contained within a pre-configured VirtualBox environment. The analytical tasks and format conversions are accomplished via a set of automated scripts that enable the user to go from a set of raw data, such as fastq files, to publication-ready results and figures in one step. A publicly available dataset of Zika virus-infected neural progenitor cells is used to illustrate AIDD’s capabilities. Conclusions AIDD pipeline offers a user-friendly interface for comprehensive and reproducible RNA-seq analyses. Among unique features of AIDD are its ability to infer RNA editing patterns, including ADAR editing, and inclusion of Guttman scale patterns for time series analysis of such editing landscapes. 
AIDD-based results show importance of diversity of ADAR isoforms, key RNA editing enzymes linked with the innate immune system and viral infections. These findings offer insights into the potential role of ADAR editing dysregulation in the disease mechanisms, including those of congenital Zika syndrome. Because of its automated all-inclusive features, AIDD pipeline enables even a novice user to easily explore common mechanisms of transcriptome diversity, including RNA editing landscapes.
Preprint
Full-text available
Unknown features in untargeted metabolomics and non-targeted analysis (NTA) are identified using fragment ions from MS/MS spectra to predict the structures of the unknown compounds. The precursor ion selected for fragmentation is commonly performed using data dependent acquisition (DDA) strategies or following statistical analysis using targeted MS/MS approaches. However, the selected precursor ions from DDA only cover a biased subset of the peaks or features found in full scan data. In addition, different statistical analysis can select different precursor ions for MS/MS analysis, which make the post-hoc validation of ions selected by new statistical methods impossible for precursor ions selected by the original statistical method. Here we propose an automated, exhaustive, statistical model-free workflow: paired mass distance-dependent analysis (PMDDA), for untargeted mass spectrometry identification of unknown compounds. By removing redundant peaks and performing pseudo-targeted MS/MS analysis on independent peaks, we can comprehensively cover unknown compounds found in full scan analysis using a “one peak for one compound” workflow without a priori redundant peak information. We show that compared to DDA, PMDDA is more comprehensive and robust against samples' matrix effects. Further, more compounds were identified by database annotation using PMDDA compared with CAMERA and RAMClustR. Finally, compounds with signals in both positive and negative modes can be identified by the PMDDA workflow, to further reduce redundancies. The whole workflow is fully reproducible as a docker image xcmsrocker with both the original data and the data processing template.
Preprint
p>Unknown features in untargeted metabolomics and non-targeted analysis (NTA) are identified using fragment ions from MS/MS spectra to predict the structures of the unknown compounds. The precursor ion selected for fragmentation is commonly performed using data dependent acquisition (DDA) strategies or following statistical analysis using targeted MS/MS approaches. However, the selected precursor ions from DDA only cover a biased subset of the peaks or features found in full scan data. In addition, different statistical analysis can select different precursor ions for MS/MS analysis, which make the post-hoc validation of ions selected by new statistical methods impossible for precursor ions selected by the original statistical method. Here we propose an automated, exhaustive, statistical model-free workflow: paired mass distance-dependent analysis (PMDDA), for untargeted mass spectrometry identification of unknown compounds. By removing redundant peaks and performing pseudo-targeted MS/MS analysis on independent peaks, we can comprehensively cover unknown compounds found in full scan analysis using a “one peak for one compound” workflow without a priori redundant peak information. We show that compared to DDA, PMDDA is more comprehensive and robust against samples' matrix effects. Further, more compounds were identified by database annotation using PMDDA compared with CAMERA and RAMClustR. Finally, compounds with signals in both positive and negative modes can be identified by the PMDDA workflow, to further reduce redundancies. The whole workflow is fully reproducible as a docker image xcmsrocker with both the original data and the data processing template. </p
Article
Full-text available
Linux container technologies, as represented by Docker, provide an alternative to complex and time-consuming installation processes needed for scientific software. The ease of deployment and the process isolation they enable, as well as the reproducibility they permit across environments and versions, are among the qualities that make them interesting candidates for the construction of bioinformatic infrastructures, at any scale from single workstations to high throughput computing architectures. The Docker Hub is a public registry which can be used to distribute bioinformatic software as Docker images. However, its lack of curation and its genericity make it difficult for a bioinformatics user to find the most appropriate images needed. BioShaDock is a bioinformatics-focused Docker registry, which provides a local and fully controlled environment to build and publish bioinformatic software as portable Docker images. It provides a number of improvements over the base Docker registry on authentication and permissions management, that enable its integration in existing bioinformatic infrastructures such as computing platforms. The metadata associated with the registered images are domain-centric, including for instance concepts defined in the EDAM ontology, a shared and structured vocabulary of commonly used terms in bioinformatics. The registry also includes user defined tags to facilitate its discovery, as well as a link to the tool description in the ELIXIR registry if it already exists. If it does not, the BioShaDock registry will synchronize with the registry to create a new description in the Elixir registry, based on the BioShaDock entry metadata. This link will help users get more information on the tool such as its EDAM operations, input and output types. This allows integration with the ELIXIR Tools and Data Services Registry, thus providing the appropriate visibility of such images to the bioinformatics community.
Article
Full-text available
Background Semantic Web technologies have been widely applied in the life sciences, for example by data providers such as OpenLifeData and through web services frameworks such as SADI. The recently reported OpenLifeData2SADI project offers access to the vast OpenLifeData data store through SADI services. Findings This article describes how to merge data retrieved from OpenLifeData2SADI with other SADI services using the Galaxy bioinformatics analysis platform, thus making this semantic data more amenable to complex analyses. This is demonstrated using a working example, which is made distributable and reproducible through a Docker image that includes SADI tools, along with the data and workflows that constitute the demonstration. Conclusions The combination of Galaxy and Docker offers a solution for faithfully reproducing and sharing complex data retrieval and analysis workflows based on the SADI Semantic web service design patterns.
Background: Inference of gene networks from expression data is an important problem in computational biology. Many algorithms have been proposed for solving the problem efficiently. However, many of the available implementations are programming libraries that require users to write code, which limits their accessibility. Results: We have developed a tool called CyNetworkBMA for inferring gene networks from expression data that integrates with Cytoscape. Our application offers a graphical user interface for networkBMA, an efficient implementation of Bayesian Model Averaging methods for network construction. The client-server architecture of CyNetworkBMA makes it possible to distribute or centralize computation depending on user needs. Conclusions: CyNetworkBMA is an easy-to-use tool that makes network inference accessible to non-programmers through seamless integration with Cytoscape. CyNetworkBMA is available on the Cytoscape App Store at http://apps.cytoscape.org/apps/cynetworkbma. Electronic supplementary material: The online version of this article (doi:10.1186/s13029-015-0043-5) contains supplementary material, which is available to authorized users.
As bioinformatic workflows become increasingly complex and involve multiple specialized tools, so does the difficulty of reliably reproducing those workflows. Cytoscape is a critical workflow component for executing network visualization, analysis, and publishing tasks, but it can be operated only manually via a point-and-click user interface. Consequently, Cytoscape-oriented tasks are laborious and often error-prone, especially with multistep protocols involving many networks. In this paper, we present the new cyREST Cytoscape app and accompanying harmonization libraries. Together, they improve workflow reproducibility and researcher productivity by enabling popular languages (e.g., Python, R, JavaScript, and C#) and tools (e.g., IPython/Jupyter Notebook and RStudio) to directly define and query networks, and to perform network analysis, layouts, and renderings. We describe cyREST’s API and overall construction, and present Python- and R-based examples that illustrate how Cytoscape can be integrated into large-scale data analysis pipelines. cyREST is available in the Cytoscape app store (http://apps.cytoscape.org), where it has been downloaded over 1900 times since its release in late 2014.
Transparency, openness, and reproducibility are readily recognized as vital features of science (1, 2). When asked, most scientists embrace these features as disciplinary norms and values (3). Therefore, one might expect that these valued features would be routine in daily practice. Yet, a growing body of evidence suggests that this is not the case (4–6).
Behaviours of complex biomolecular systems are often irreducible to the elementary properties of their individual components. Explanatory and predictive mathematical models are therefore useful for fully understanding and precisely engineering cellular functions. The development and analyses of these models require their adaptation to the problems that need to be solved and the type and amount of available genetic or molecular data. Quantitative and logic modelling are among the main methods currently used to model molecular and gene networks. Each approach comes with inherent advantages and weaknesses. Recent developments show that hybrid approaches will become essential for further progress in synthetic biology and in the development of virtual organisms.
In the fall of 2013, emails arrived in the inboxes of dozens of scientists informing them that their work had been chosen for scrutiny by a project aiming to replicate 50 high-impact cancer biology papers. The Reproducibility Project: Cancer Biology, an ambitious, open-science effort to test whether key findings in top journals can be reproduced by independent labs, has stirred concerns in the community. Almost every scientist targeted by the project who spoke with Science agrees that studies in cancer biology, as in many other fields, too often turn out to be irreproducible. But few feel comfortable with this particular effort, which plans to announce its findings in coming months. Leaders of the project say it will ultimately benefit the field by gauging the extent of the reproducibility problem in cancer biology.