Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds

BMC Bioinformatics 08/2012; 13(1):200. DOI: 10.1186/1471-2105-13-200
SOFTWARE Open Access
Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds

Sebastian Schönherr (1,2), Lukas Forer (1,2), Hansi Weißensteiner (1,2), Florian Kronenberg (1), Günther Specht (2) and Anita Kloss-Brandstätter (1*)
Abstract
Background: The MapReduce framework enables scalable processing and analysis of large datasets by distributing the computational load across connected computer nodes, referred to as a cluster. In Bioinformatics, MapReduce has already been applied to various scenarios such as mapping next generation sequencing data to a reference genome, finding SNPs in short read data or matching strings in genotype files. Nevertheless, tasks like installing and maintaining MapReduce on a cluster system, importing data into its distributed file system or executing MapReduce programs require advanced knowledge in computer science and can thus prevent scientists from using currently available and useful software solutions.

Results: Here we present Cloudgene, a freely available platform that improves the usability of MapReduce programs in Bioinformatics by providing a graphical user interface for the execution of programs, the import and export of data and the reproducibility of workflows on in-house (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to provide a standardized graphical execution environment for currently available and future MapReduce programs, which can all be integrated via its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all times and data transfer times are minimized.

Conclusions: Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs. The platform lets developers focus on the actual implementation task and offers scientists an environment that hides the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g. Cloud BioLinux, RStudio) in public clouds. Currently, five different bioinformatic MapReduce programs and two systems are integrated and have been successfully deployed. Cloudgene is freely available at http://cloudgene.uibk.ac.at.
Background
Computer science is becoming increasingly important in today's genetic research. The accelerated progress in molecular biological technologies places increasing demands on adequate software solutions. This is especially true for next generation sequencing (NGS), where costs are falling faster than those for computer hardware [1]. As a consequence, the accompanying growth of data results in longer execution times of currently available programs and requires new strategies to process data efficiently. The MapReduce framework [2], and especially its open-source implementation Hadoop [3], has become more and more popular for processing and analyzing terabytes of data: mapping NGS data to the human genome [4], calculating differential gene expression in RNA-seq datasets [5] or even simpler but time-intensive tasks like matching strings in large genotype files^1 are already successfully implemented scenarios. With MapReduce, a computation is distributed and executed in parallel over all computer nodes
* Correspondence: anita.kloss@i-med.ac.at
Equal contributors
1 Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria
Full list of author information is available at the end of the article
© 2012 Schönherr et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
Schönherr et al. BMC Bioinformatics 2012, 13:200
http://www.biomedcentral.com/1471-2105/13/200
in a cluster, allowing nodes to be added or removed on demand (scale-out principle). The developer is responsible for writing the corresponding map and reduce tasks, while the framework itself takes over parallelization, fault tolerance of hardware and software, replication and I/O scheduling. Unfortunately, small to medium-sized genetic research institutes can often hardly afford to acquire and maintain their own computer clusters. An alternative is public cloud computing, which offers the possibility to rent computer hardware on demand from different providers such as Amazon's Elastic Compute Cloud (http://aws.amazon.com/ec2/).
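The division of labor just described can be illustrated with a single-process sketch of the MapReduce data flow. This is purely illustrative and not Cloudgene or Hadoop code; a real Hadoop job distributes the same three phases (map, shuffle, reduce) across the cluster nodes:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process sketch of the MapReduce data flow:
    map -> shuffle (group values by key) -> reduce."""
    intermediate = defaultdict(list)
    # Map phase: each input record yields (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)  # shuffle: group by key
    # Reduce phase: one reduce call per distinct key.
    return {key: reduce_fn(key, values)
            for key, values in intermediate.items()}

# The classic word-count example.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

counts = run_mapreduce(["a b a", "b c"], wc_map, wc_reduce)
# counts == {"a": 2, "b": 2, "c": 1}
```

Because map calls are independent of each other, the framework can run them on any node; only the shuffle requires moving data between nodes.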
Working with cluster architectures requires a background in computer science both for setting up a cluster infrastructure and for executing MapReduce programs, for which a graphical user interface (GUI) is often lacking altogether. To improve usability, different programs [4-6] have been developed with a focus on simplified execution. This constitutes a major improvement for scientists, with the downside that a new GUI has to be implemented for every future MapReduce program. Additionally, concatenating different programs into a pipeline remains cumbersome.
In this paper we present Cloudgene, a platform to integrate available MapReduce programs via manifest files and to facilitate the use of on-demand cluster architectures in cloud environments. Cloudgene's biggest advantage lies in simplifying the import and export of data, the execution and monitoring of MapReduce programs on in-house (private clouds) or rented clusters (public clouds), and the reproducibility of an analysis or analysis pipeline.
Implementation
Overall design
In order to be usable in public and private clouds, Cloudgene consists of two independent modules, Cloudgene-Cluster and Cloudgene-MapRed. Cloudgene-Cluster enables scientists to instantiate a cluster on a public cloud, currently applied to Amazon's EC2. The end user is guided through the configuration process via graphical wizards, specifying all necessary cluster information including the complete hardware specification, security credentials and SSH keys. Cloudgene-MapRed can be seen as an additional layer between Apache Hadoop and the end user; it defines a user-friendly way to execute and monitor MapReduce programs and provides a standardized import/export interface for large datasets. Cloudgene-MapRed supports the execution of Hadoop jar files (written in Java) and the Hadoop Streaming mode (programs written in any other programming language), and allows the concatenation of programs into pipelines. One central idea behind Cloudgene is to integrate available and future programs with little effort: therefore, Cloudgene specifies a manifest file (i.e. a configuration file) for every program, which defines the graphical wizards to launch a public cluster or MapReduce jobs (see section Plug-in interface). Figure 1 summarizes how these two modules collaborate to execute programs depending on the specified cluster environment.
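To make the Streaming mode concrete, the following is a minimal word-count sketch of what a Hadoop Streaming program looks like: a mapper script and a reducer script that read lines and emit tab-separated key/value lines. This is an illustration of the general Streaming contract, not code taken from Cloudgene; in a real job, Hadoop pipes data through these scripts via stdin/stdout and sorts the mapper output by key before the reduce phase:

```python
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit one tab-separated 'word<TAB>1' line per word.
    In a real job, Hadoop pipes a split of the input to this script."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: Hadoop delivers mapper output sorted by key,
    so equal keys are adjacent and can be aggregated with groupby."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# Single-process dry run of the streaming pipeline: map -> sort -> reduce.
result = list(reducer(sorted(mapper(["to be or", "not to be"]))))
# result == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

Since the contract is just lines on stdin/stdout, the same pair of scripts works unchanged whether launched by hand, by Hadoop Streaming, or through a front end such as Cloudgene-MapRed.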
Figure 1 The use of Cloudgene in public or private clouds. When using Cloudgene in a public cloud (blue path), the first module
Cloudgene-Cluster launches a cluster in the cloud (step 1) and installs all necessary data for the specific scenario including the second module
Cloudgene-MapRed (step 2). When finished, the user is able to communicate with the cluster and to execute and monitor jobs from plugged in
MapReduce programs (step 3). In addition, Cloudgene-MapRed can also be used stand-alone on an in-house cluster (private cloud), with the
precondition that a running Hadoop Cluster is available (red path, only step 3 is necessary).
Architecture and technologies
Both modules are based on a client-server architecture: the client is designed as a web application utilizing the JavaScript framework Sencha Ext JS (http://www.sencha.com). On the server side, all necessary resources are implemented in Java using the RESTful web framework Restlet (http://www.restlet.org/) [7]. The communication between client and server is achieved through asynchronous HTTP requests (AJAX) with JSON (http://json.org) as the interchange format. Cloudgene is multi-user capable and encrypts the transmission between server and client with HTTPS (Hypertext Transfer Protocol Secure). To integrate new programs and describe all properties of a program or program pipeline, the YAML (http://www.yaml.org) format is used for the manifest file. All required metadata is stored in H2 (http://www.h2database.com), an embedded Java SQL database. The Apache Whirr [8] project is used to launch a cluster on Amazon EC2, to combine nodes into a working MapReduce cluster and to define its hardware environment. Figure 2 summarizes the overall architecture.
Cloudgene-Cluster
After a successful login to Cloudgene-Cluster, the main window provides the possibility to create or shut down a public cluster and to get an overview of all previously started nodes (Figure 3). When launching a new cluster, a wizard is shown: in a first step the cloud provider, cluster name, the program to install, the number of nodes and an available instance type (i.e. the hardware specification of a node) need to be selected (Figure 3A). Subsequently, the cloud security credentials have to be entered and an SSH key has to be chosen or uploaded (Figure 3B). For user convenience, security credentials need only be entered once per session (until log-out) and can additionally be stored encrypted in the H2 database. Storing SSH keys is especially useful for advanced users who want to log in to a node via an SSH console. In addition, an S3 bucket can be predefined for an automatic transfer of MapReduce results. Within minutes a ready-to-use cluster is created, in which all necessary software is installed and all parameters are set. As a final step, Cloudgene-Cluster installs Cloudgene-MapRed on the launched cluster and returns the web address for accessing it. Cloudgene-Cluster provides the possibility to download SSH keys, to access the log of all actions performed during cluster setup, to add new users and to log out from the system.
Cloudgene-MapRed
The main window of Cloudgene-MapRed (Figure 4) is structured as follows: the toolbar on top contains buttons for program (job) submission, data import and program installation. Additionally, buttons for changing the account details (security credentials, general information and the S3 export location for results) and for detailed cluster information (e.g. number of nodes, MapReduce configuration) are provided. All currently running and finalized jobs, including name, progress, execution time and state, are displayed in the upper panel. For running jobs, the progress of the map and reduce phases is displayed separately. The lower panel displays the job-specific information including input/output parameters, S3 export location, job arguments, execution time and results. The export location is created automatically using the naming convention S3bucket/jobname/timestamp. Moreover, the detail view contains a link to the logfile in case of errors.

Before launching a new job, data needs to be imported into the distributed file system (Hadoop Distributed File System), whereby the data source has to be selected. Currently, Cloudgene supports data import from FTP, HTTP, Amazon S3 buckets or direct file uploads. A job can be submitted by specifying the previously imported data and the program-specific parameters. After launching a program, the process can be monitored and all jobs including results are viewable and downloadable. As all data on a cluster in a public cloud are lost on shutdown, Cloudgene automatically exports all results, log data and, if specified, also the imported datasets in parallel to an S3 bucket.
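The naming convention for export locations mentioned above (S3bucket/jobname/timestamp) can be sketched as a small helper. This is an illustration of the stated convention only; the exact timestamp format Cloudgene uses is an assumption:

```python
from datetime import datetime, timezone

def s3_export_location(bucket, job_name, when=None):
    """Build an export path following the convention described in the
    text: S3bucket/jobname/timestamp. The timestamp format here is an
    assumption; Cloudgene's actual format may differ."""
    when = when or datetime.now(timezone.utc)
    return f"{bucket}/{job_name}/{when.strftime('%Y%m%d-%H%M%S')}"

path = s3_export_location("my-bucket", "cloudburst",
                          datetime(2012, 8, 13, 12, 0, 0))
# path == "my-bucket/cloudburst/20120813-120000"
```

Including a timestamp in the path keeps repeated runs of the same job from overwriting each other's results in the bucket.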
Figure 2 Cloudgene system architecture. Cloudgene consists of
two independent modules, Cloudgene-Cluster and Cloudgene-
MapRed. Both implement a client-server architecture using open
source technologies. The client is implemented in JavaScript utilizing
Sencha Ext JS and communicates with the Restlet server via a
secured connection (HTTPS). The program parameters are read out
from its manifest file written in YAML. Both modules are username/
password secured and store all required metadata in the relational
SQL database H2.
Plug-in interface
To integrate new programs into Cloudgene, a simply structured YAML manifest file has to be specified, including a section for each of Cloudgene-Cluster and Cloudgene-MapRed. This manifest file needs to be written only once and can either be provided to other scientists by the developer or be written by any person who is familiar with the execution of a MapReduce program. The manifest file starts with a block containing general program information (e.g. name, author, description, webpage). In the Cloudgene-Cluster section, the file system image, available instance types, firewall settings, services (e.g. MapReduce), installation scripts (additional software to install) and other program-dependent parameters are specified. The Cloudgene-MapRed section contains all necessary information that characterizes a MapReduce program, including input and output parameters and Cloudgene's step functionality (i.e. job pipelining). At start-up, Cloudgene loads all necessary information from the manifest file and generates the program-specific wizards. Figure 5 shows the integration of CloudBurst into Cloudgene-MapRed. To simplify the integration process for end users, all currently tested MapReduce programs including working manifest files, a detailed description of the available parameters and instance types, and a tutorial on how to set up EC2 security credentials can be found on our website. Furthermore, with its integrated web repository, Cloudgene-MapRed provides a mechanism to install currently available programs directly via the web interface.
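The two-section structure described above can be sketched roughly as follows. All key names in this sketch are illustrative assumptions, not the exact Cloudgene schema; the working manifest files on the project website are authoritative:

```yaml
# Hypothetical sketch only -- key names are illustrative assumptions,
# not the exact Cloudgene manifest schema.
name: MyTool
author: Jane Doe
description: Example MapReduce program integration
website: http://example.org/mytool

cluster:                      # read by Cloudgene-Cluster (public clouds)
  image: ami-00000000         # placeholder file system image id
  instances: m1.small, m1.large
  services: hadoop
  install: ./setup-mytool.sh  # additional software installed at start-up

mapred:                       # read by Cloudgene-MapRed
  jar: mytool.jar
  inputs:
    - id: input
      type: hdfs-folder
  outputs:
    - id: output
      type: hdfs-folder
```

The input/output parameter declarations are what allow Cloudgene to generate the graphical submission wizard for the program.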
Results
Cloudgene's overall aim is to simplify the process of executing MapReduce programs, including all required steps, on private and public clouds (Figure 6). In the following section, we show the versatility and advantages of Cloudgene on different case scenarios. Table 1 summarizes all currently integrated programs.
Figure 3 Screenshot of Cloudgene-Cluster. Cloudgene-Cluster allows launching a cluster and setting up the cloud environment. The main window displays all cluster configurations and indicates their current status. Subfigures (A) and (B) show the necessary wizard steps in which the program itself and its parameters can be selected. All needed software is installed automatically and the end user receives a URL of the cluster namenode where Cloudgene-MapRed has been installed.
Integrating existing MapReduce programs
As mentioned above, several programs (CloudBurst [4], Myrna [5] or Crossbow [6]) already exist that implement a MapReduce approach to process data. To demonstrate the benefit of Cloudgene, we integrated these programs by writing appropriate manifest files, including sections for Cloudgene-Cluster (public clouds) and Cloudgene-MapRed (private and public clouds). In the case of CloudBurst, a MapReduce job can now be executed graphically, and our benchmark tests show that Cloudgene scales in time and is competitive with Amazon's Elastic MapReduce platform [9] with regard to cluster setup and program execution time (see Table 2 for a detailed comparison). For Myrna and Crossbow, a web interface has already been made available by the authors using Hadoop's streaming mode. Nevertheless, by integrating these programs into Cloudgene, users still benefit from (1) a standardized way to import/export data, (2) a system which keeps track of all previously executed workflows including the complete configuration setup (input/output parameters, execution times, results) and (3) the possibility to concatenate different MapReduce jobs into pipelines. Here, Cloudgene's pipeline functionality (specified as steps in the manifest file) has been used to execute several computation steps of Crossbow and Myrna. This is achieved by defining the output directory of step x (e.g. step 1: pre-processing) as the new input directory of step x+1 (e.g. step 2: alignment) in the manifest file. Even if the newly created workflow consists of several steps in the manifest file, the user can start it as one job. Additional file 1 includes the corresponding manifest files showing the step functionality in detail.
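In manifest terms, such a two-step pipeline could look roughly like the following sketch. The key names and `$`-variables here are illustrative assumptions, not the exact Cloudgene notation; the manifest files in Additional file 1 show the real syntax:

```yaml
# Hypothetical sketch -- key names and variables are illustrative
# assumptions, not the exact Cloudgene step notation.
mapred:
  steps:
    - name: Pre-processing
      jar: preprocess.jar
      params: -input $input -output $tmp/step1
    - name: Alignment
      jar: align.jar
      # the output directory of step 1 becomes the input of step 2
      params: -input $tmp/step1 -output $output
```

Because the chaining lives entirely in the manifest, neither program needs to know it is part of a pipeline, and the user still submits a single job.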
Integrating novel MapReduce programs
For a proof of concept, we implemented a MapReduce pipeline for FastQ pre-processing to check the quality of large NGS datasets, similar to the FastQC tool^2, in which statistics like the sequence quality, base quality, length distribution and sequence duplication levels are calculated. Unlike FastQC, which is executed on a single computer and where, due to memory requirements, only a
Figure 4 Screenshot of Cloudgene-MapRed. Cloudgene-MapRed runs on the namenode of a Hadoop cluster and enables a simplified
execution of MapReduce jobs via a web interface. All input/output parameters can be easily set through wizards. Cloudgene-MapRed shows the
currently executed jobs, visualizes the progress of MapReduce jobs, receives feedback and allows viewing and downloading results.
sequence subset is used for the quality control computations, our implementation has the advantage that the complete set of sequences is included in the statistical analysis. After processing all sequences, a consecutive program transforms the results into meaningful plots, which can finally be downloaded as PDF files. Again, this shows the step functionality of Cloudgene-MapRed, in which different programs can be connected into a pipeline (i.e. step 1: calculating statistics, step 2: generating plots). Moreover, FastQ input files can be imported from public Amazon S3 buckets (e.g. raw data from the 1000 Genomes public S3 bucket), which is especially useful in combination with Amazon EC2 since data transfer from S3 to EC2 nodes is optimized (see Table 2 for measurements of import times).
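The kind of per-record statistic such a pre-processing step computes can be sketched as follows. This is an illustrative sketch, not the authors' implementation; it assumes 4-line FastQ records and Phred+33 quality encoding, and in a real MapReduce job this logic would run inside the map phase over the complete dataset:

```python
def fastq_base_quality_stats(fastq_lines, phred_offset=33):
    """Illustrative sketch (not the authors' code): per-record read
    length and mean Phred base quality over a FastQ dataset.
    Assumes 4-line FastQ records and Phred+33 quality encoding."""
    records = [line.rstrip("\n") for line in fastq_lines]
    stats = []
    for i in range(0, len(records), 4):
        header, seq, _plus, qual = records[i:i + 4]
        mean_q = sum(ord(c) - phred_offset for c in qual) / len(qual)
        stats.append((header[1:], len(seq), mean_q))
    return stats

example = [
    "@read1", "ACGT", "+", "IIII",  # 'I' encodes Phred quality 40
    "@read2", "AC",   "+", "!!",    # '!' encodes Phred quality 0
]
stats = fastq_base_quality_stats(example)
# stats == [("read1", 4, 40.0), ("read2", 2, 0.0)]
```

Since each record is processed independently, the work parallelizes naturally across mappers, which is what lets the complete dataset (rather than a subset) be analyzed.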
A further scenario for a simple but time-intensive task is the filtering and extraction of certain rows from a large SNP genotype file, as typically used in genome-wide association studies, with a file size of several gigabytes. Again, this case scenario was implemented as a MapReduce job by the authors of this paper and has been integrated into Cloudgene.
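The core of such a row-filtering task can be sketched in a few lines. This is an illustration of the general technique, not the authors' code; the column layout (SNP id in the first whitespace-separated column) is an assumption:

```python
def filter_genotype_rows(lines, wanted_snps):
    """Illustrative sketch (not the authors' implementation) of the
    row-filtering task: keep only lines whose first column (assumed to
    be the SNP id) is in wanted_snps. In a MapReduce job this test
    would run in the map phase, line by line, across the cluster."""
    wanted = set(wanted_snps)
    for line in lines:
        snp_id = line.split(None, 1)[0]
        if snp_id in wanted:
            yield line

rows = [
    "rs123 A G 0.12",
    "rs456 C T 0.40",
    "rs789 G G 0.05",
]
hits = list(filter_genotype_rows(rows, {"rs123", "rs789"}))
# hits == ["rs123 A G 0.12", "rs789 G G 0.05"]
```

The task is trivially parallel, which is why even this simple filter benefits from MapReduce once the genotype file reaches several gigabytes.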
All integrated and introduced programs are available
for download on our webpage or can be installed from
the Cloudgene repository directly.
Launching a stand-alone image
Different file system images (called AMIs in Amazon terminology) exist to provide a convenient way of setting up systems in the cloud: Cloud BioLinux [10] is a suitable file system image for Bioinformatics, including a wide range of biological software, programming libraries as well as data, and is therefore an excellent basis for bioinformatic computations in the cloud. RStudio [11] is a development environment for R, available as an EC2
Figure 5 A standard YAML manifest file. The manifest file contains a metadata block (name, description, category and website) and a logical block for each of the two Cloudgene modules, including all parameters that are necessary to execute the program and that are, from a user's point of view, often hard to decide on. This file has to be written once and is then provided to scientists. Cloudgene checks at every start-up for new programs and adapts the web interface dynamically to the specific scenario. Here, the submit form for CloudBurst has been generated from the information in the manifest file.
Figure 6 Comparison of approaches. All necessary steps to create a cluster and run a job can be executed and monitored via Cloudgene, yielding a significant simplification compared to a traditional approach. No command line is needed and complicated tasks are hidden from the end user at all times.
image [12], and allows the usage of all R programming tools via a web interface. Both systems have been successfully integrated into Cloudgene, enabling scientists to launch them via Cloudgene-Cluster. Cloud BioLinux has further been used as the underlying image for the integration of Myrna and Crossbow into Cloudgene, since most of the required software is already installed and the cluster installation process is therefore simplified.
Launching a web application
Besides the mentioned MapReduce scenarios, Cloudgene can also be used to host a user-defined web application on a public cloud. We demonstrated this with HaploGrep [13], a tool to determine mitochondrial DNA haplogroups from mtDNA profiles. Especially for large input data, HaploGrep requires a lot of main memory, making a public cloud node with sufficient main memory an adequate choice. For this purpose, a simple shell script that starts the HaploGrep web server was integrated into the setup process. This shell script can be defined in the manifest file of Cloudgene-Cluster and is executed automatically on each cluster node.
Discussion
Although MapReduce enables a scalable way to process and analyze data, the execution of programs and the overall setup of cluster architectures still include non-trivial tasks, which hampers the spread of MapReduce programs in Bioinformatics. We therefore developed and implemented Cloudgene, which provides scientists with a graphical execution platform and a standardized way to manage large-scale bioinformatic projects.
Strengths and limitations
The usage of Cloudgene has several strengths: (1) programs can be executed via one centralized platform, thereby standardizing the import/export of data, the execution and monitoring of MapReduce jobs, and the reproducibility of programs or newly defined program pipelines; (2) scientists can decide flexibly in which environment (public or private cluster) a program should be executed, depending on the case scenario, to guarantee an appropriate level of data security and to reduce data transfer times; (3) an Amazon EC2 cluster can be launched via the Cloudgene web interface, thus simplifying the overall setup and making Cloudgene available in
Table 1 Currently integrated programs

MapReduce programs:
- CloudBurst [4]: Highly sensitive short read mapping with MapReduce.
- Myrna [5]: A cloud computing tool for calculating differential gene expression in large RNA-seq datasets.
- Crossbow [6]: A scalable software pipeline for whole genome re-sequencing analysis.
- FastQ-Preprocessing^4: Quality control for high throughput sequence data in FastQ format.
- SNPFinder^4: Filters and extracts certain SNPs from genome-wide association study datasets.
- Hadoop-Examples [3]: Several Hadoop example applications including Sort and Grep.

System images:
- Cloud BioLinux [10]: An AWS-EC2 image that includes a wide range of biological software, programming libraries as well as data sets.
- RStudio [11]: An AWS-EC2 image that enables the usage of all R programming tools via a web interface.

Web applications:
- HaploGrep [13]: A web application to determine mitochondrial DNA haplogroups.
Table 2 Wall times: Cloudgene compared with Amazon Elastic MapReduce executing CloudBurst with input data from s3n://elasticmapreduce/samples/cloudburst

Wall times:
EC2 nodes | Instance type | Cloudgene | Amazon EMR
2 + 1 | m1.small | 21 min | 25 min
4 + 1 | m1.small | 17 min | 18 min
8 + 1 | m1.small | 12 min | 12 min

Data import times:
Type | Source | Data volume | Time
FTP | ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read | ~18 GByte | 28 min
S3 bucket | s3n://1000genomes/data/HG00096/sequence_read | ~18 GByte | 11 min
a public cloud; (4) new MapReduce programs can be integrated via Cloudgene's plug-in interface without changing the program code at all.

However, Cloudgene has limitations as well: (1) since the main focus of our approach lies on the simplified execution of MapReduce jobs, programs using other paradigms (e.g. MPI or iterative processing) are currently not supported; (2) Cloudgene allows the concatenation of jobs into simple pipelines, with the limitation that pipelines have to be executed from start to end; it currently does not allow specific pipeline steps to be executed in isolation (e.g. a restart at step 3); (3) since the number of included cluster nodes must be set at start, the launched cluster architectures are currently static and cannot be changed during runtime.
Comparison with similar software packages
To date, several approaches exist to improve the usability of currently available bioinformatic solutions. Systems such as Galaxy [14,15], GenePattern [16], Ergatis [17], Mobyle [18] and Taverna [19,20] try to facilitate the creation, execution and maintenance of workflows in a fast and user-friendly way. In contrast to these existing workflow platforms, Cloudgene's primary focus lies on the usability of MapReduce jobs in public and private clouds for bioinformatic applications.
Galaxy CloudMan [21] is a similar approach to Cloudgene-Cluster and supports users in setting up cloud clusters on Amazon EC2. It works in combination with Bio-Linux (http://nebc.nerc.ac.uk/tools/bio-linux) and configures the Oracle Grid Engine [22] as well as Galaxy at start-up. Unfortunately, CloudMan does not support MapReduce by default, and therefore a graphical execution and monitoring of jobs is not possible.
Another mentionable and useful system is Amazon Elastic MapReduce (EMR) [9], which provides the opportunity to create job flows including custom jars, streaming or Hive/Pig programs. Since everything is located directly on Amazon, a highly optimized version of Hadoop MapReduce in combination with Amazon S3 is provided and can be operated through a comprehensive user interface. Nevertheless, Amazon Elastic MapReduce can only be used in combination with Amazon EC2, which sometimes prevents research institutes from using it due to data security rules or the enormous amount of data to transfer^3 from their own institutional cloud. In contrast, Cloudgene allows launching MapReduce jobs on both public and private clouds, thereby enabling the user to define the location of the data. Since Cloudgene does not utilize EMR for its job execution, the additional financial costs for EMR can be saved. Table 2 summarizes the comparison of Cloudgene and EMR and shows that Cloudgene is competitive regarding cluster set-up, job execution and data transfer.
CloVR [23] is a virtual image that provides several analysis pipelines for use on a personal computer as well as in the cloud. It utilizes the Grid Engine (http://gridengine.org) for job scheduling; a future integration of Hadoop MapReduce is planned. Eoulsan [24] is a modular framework which enables the setup of cloud computing clusters and automates the analysis of high throughput sequencing data. A modular plug-in system allows the integration of available algorithms. Eoulsan uses EMR for the execution of its MapReduce jobs and has to be operated on the command line. Both systems improve the usage of programs for scientists, but are not focused on a graphical execution of jobs on public and private clusters.
Future work
The success of a platform like Cloudgene goes hand in hand with the number of involved users and scenarios. Therefore, our short-term focus will be on extending Cloudgene with new case scenarios, hopefully motivating users to integrate their own MapReduce programs or systems. One of the biggest advantages of public clouds is the opportunity to rent as many computer nodes, and thus as much computational power, as needed. Thus, the next version of Cloudgene is conceived to provide functions for adding and removing computer nodes during runtime. Furthermore, a simple user interface for Hadoop is not only useful for end users but also for developers: it supports them during the prototyping and testing of novel MapReduce programs by highlighting performance bottlenecks. Thus, we plan to implement time measurements of the map, reduce and shuffle phases and to visualize them in an intuitive chart. Additionally, Hadoop plans to support alternate programming paradigms in its future versions, which is particularly important for applications where custom frameworks outperform MapReduce by an order of magnitude (e.g. iterative applications like K-Means).
Conclusions
We presented Cloudgene, a platfo rm that allows scien-
tists to set up a user-defined cluster in the cloud and to
execute or monitor MapReduce jobs via a dynamically
created web interface on the cluster. Cloudgenes aim is
to integrate existing and future MapReduce programs
via a manifest file into one centralized platform. Cloud-
gene supports users without deeper background in com-
puter science and improves the usability of currently
available MapReduce programs in the field of Bioinfor-
matics. We think that this appro ach improves the
utilization of programs and the reproducibility of results.
Additionally, we showed in different scenarios how an integration can be accomplished without adding overhead to the computation, thereby improving both the development and the usability of a program.
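As a purely hypothetical sketch of such a manifest-based integration (the field names below are illustrative and do not reproduce Cloudgene's actual manifest schema), an existing, unmodified MapReduce program could be described declaratively and then rendered as a web form:

```yaml
# Hypothetical manifest sketch; all field names are illustrative only.
name: example-snp-caller
version: 1.0
mapred:
  jar: snp-caller.jar        # existing MapReduce program, used as-is
  params:
    - id: input
      type: hdfs-folder      # data imported via the web interface
    - id: reference
      type: local-file
    - id: output
      type: hdfs-folder      # results exported after the job finishes
```

The point of such a declarative description is that the platform, not the program author, handles data import/export and the user interface, which is why integration adds no computational overhead.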
Availability and requirements
Project name: Cloudgene
Project home page: http://cloudgene.uibk.ac.at
Operating System: Cloudgene-Cluster (platform-independent), Cloudgene-MapRed (GNU/Linux)
Programming language: Java, JavaScript
Other requirements: Java 1.6, AWS-EC2 Account for
public clouds, Hadoop MapReduce for private clouds
License: GNU GPL v3
Any restrictions to use by non-academics: None
Endnotes
1. See http://cloudgene.uibk.ac.at/usecases.
2. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
3. Amazon Web Services (AWS) provides the possibility to ship data on hard drives.
4. In-house implementation.
Additional file
Additional file 1: Supplementary Material to Cloudgene: A graphical
execution platform for MapReduce programs on private and public
clouds.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LF and SS initialized the project and developed Cloudgene. LF, SS, HW and
AK-B were responsible for designing Cloudgene. AK-B, GS and FK supervised
the project. LF, SS, HW and AK-B drafted the manuscript. All authors read
and approved the final manuscript.
Acknowledgements
SS was supported by a scholarship from the University of Innsbruck (Doctoral
grant for young researchers, MIP10/2009/3). HW was supported by a
scholarship from the Autonomous Province of Bozen/Bolzano (South Tyrol).
The project was supported by an Amazon Research Grant, the grant Aktion
D. Swarovski and by the Österreichische Nationalbank (Grant 13059) as well
as the Sequencing and Genotyping Core Facility of the Innsbruck Medical
University. We thank the open source and free software community as well
as the Apache Whirr Mailing list, especially Tom White and Andrei Savu for
their great assistance.
Author details
1 Division of Genetic Epidemiology; Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria. 2 Department of Database and Information Systems; Institute of Computer Science, University of Innsbruck, Innsbruck, Austria.
Received: 15 May 2012 Accepted: 1 August 2012
Published: 13 August 2012
References
1. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome
Sequencing Program. http://www.genome.gov/sequencingcosts.
2. Dean J, Ghemawat S: MapReduce: Simplified data processing on large clusters. Commun ACM 2008, 51(1):107–113.
3. Apache Hadoop; http://hadoop.apache.org.
4. Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009, 25(11):1363–1369.
5. Langmead B, Hansen KD, Leek JT: Cloud-scale RNA-sequencing differential
expression analysis with Myrna. Genome Biol 2010, 11(8):R83.
6. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL: Searching for SNPs with
cloud computing. Genome Biol 2009, 10(11):R134.
7. Restlet; http://www.restlet.org/.
8. Apache Whirr; http://whirr.apache.org/.
9. Amazon Elastic MapReduce; http://aws.amazon.com/elasticmapreduce/.
10. Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K: Cloud
BioLinux: pre-configured and on-demand bioinformatics computing for
the genomics community. BMC Bioinformatics 2012, 13(1):42.
11. RStudio; http://www.rstudio.org.
12. RStudio AMI; http://www.louisaslett.com/RStudio_AMI.
13. Kloss-Brandstätter A, Pacher D, Schönherr S, Weissensteiner H, Binna R,
Specht G, Kronenberg F: HaploGrep: a fast and reliable algorithm for
automatic classification of mitochondrial DNA haplogroups. Hum Mutat
2011, 32(1):25–32.
14. Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational
research in the life sciences. Genome Biol 2010, 11(8):R86.
15. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M,
Nekrutenko A, Taylor J: Galaxy: a web-based genome analysis tool for
experimentalists. Curr Protoc Mol Biol 2010, Chapter 19:Unit 19.10.1–21.
16. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nat Genet 2006, 38(5):500–501.
17. Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D,
Sundaram JP, Felix V, et al: Ergatis: a web interface and scalable software
system for bioinformatics workflows. Bioinformatics 2010, 26(12):1488–1492.
18. Neron B, Menager H, Maufrais C, Joly N, Maupetit J, Letort S, Carrere S,
Tuffery P, Letondal C: Mobyle: a new full web bioinformatics framework.
Bioinformatics 2009, 25(22):3005–3011.
19. Hull D, Wolstencroft K, Stevens R, Goble C, Pocock MR, Li P, Oinn T:
Taverna: a tool for building and running workflows of services. Nucleic
Acids Res 2006, 34(Web Server issue):W729–W732.
20. Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K,
Pocock MR, Wipat A, et al: Taverna: a tool for the composition and enactment
of bioinformatics workflows. Bioinformatics 2004, 20(17):3045–3054.
21. Afgan E, Baker D, Coraor N, Chapman B, Nekrutenko A, Taylor J: Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics 2010, 11(Suppl 12):S4.
22. Oracle Grid Engine; http://www.oracle.com/technetwork/oem/
grid-engine-166852.html.
23. Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, Riley DR, Arze C,
White JR, White O, Fricke WF: CloVR: A virtual machine for automated and
portable sequence analysis from the desktop using cloud computing.
BMC Bioinformatics 2011, 12(1):356.
24. Jourdren L, Bernard M, Dillies MA, Le Crom S: Eoulsan: A Cloud
Computing-Based Framework Facilitating High Throughput Sequencing
Analyses. Bioinformatics 2012, 28(11):1542–1543.
doi:10.1186/1471-2105-13-200
Cite this article as: Schönherr et al.: Cloudgene: A graphical execution
platform for MapReduce programs on private and public clouds. BMC
Bioinformatics 2012 13:200.