Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds

BMC Bioinformatics (Impact Factor: 2.58). 08/2012; 13(1):200. DOI: 10.1186/1471-2105-13-200
Source: PubMed


The MapReduce framework enables scalable processing and analysis of large datasets by distributing the computational load across connected computer nodes, referred to as a cluster. In Bioinformatics, MapReduce has already been adopted for various use cases such as mapping next-generation sequencing data to a reference genome, finding SNPs in short read data, or matching strings in genotype files. Nevertheless, tasks such as installing and maintaining MapReduce on a cluster system, importing data into its distributed file system, or executing MapReduce programs require advanced knowledge of computer science and could thus prevent scientists from using currently available and useful software solutions.

Here we present Cloudgene, a freely available platform that improves the usability of MapReduce programs in Bioinformatics by providing a graphical user interface for executing programs, importing and exporting data, and reproducing workflows on in-house clusters (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to build a standardized graphical execution environment for currently available and future MapReduce programs, all of which can be integrated through its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all times, which also minimizes data transfer times.

Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs. The platform lets developers focus on the actual implementation task and gives scientists an environment that hides the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also be used to launch predefined systems (e.g. Cloud BioLinux, RStudio) in public clouds. Currently, five different bioinformatics programs using MapReduce and two systems are integrated and have been successfully deployed. Cloudgene is freely available at



Available from: Anita Kloss-Brandstätter
    • "To simplify their use and incorporate them into the process automation mechanisms, Seal tools have been integrated into Galaxy, thus allowing their usage as workflow components. Incidentally, the toolbox has also been independently integrated into other high-level workflow tools such as Cloudgene [21]."
    ABSTRACT: The number of domains affected by the big data phenomenon is constantly increasing, both in science and industry, with high-throughput DNA sequencers being among the most massive data producers. Building analysis frameworks that can keep up with such a high production rate, however, is only part of the problem: current challenges include dealing with articulated data repositories where objects are connected by multiple relationships, managing complex processing pipelines where each step depends on a large number of configuration parameters, and ensuring reproducibility, error control and usability by non-technical staff. Here we describe an automated infrastructure built to address the above issues in the context of the analysis of the data produced by the CRS4 next-generation sequencing facility. The system integrates open source tools, either written by us or publicly available, into a framework that can handle the whole data transformation process, from raw sequencer output to primary analysis results.
    Full-text · Conference Paper · Jul 2014
    • "To help in this regard, we have previously constructed a software system called CloudMan [18], which makes it possible to easily procure and configure a functional data analysis platform on a cloud infrastructure. The procured platform delivers a scalable cluster-in-the-cloud and a data analysis environment preconfigured with a number of applications. "
    ABSTRACT: The ever-increasing data production and availability in the field of bioinformatics demands a paradigm shift towards the utilization of novel solutions for efficient data storage and processing, such as the MapReduce data parallel programming model and the corresponding Apache Hadoop framework. Despite the evident potential of this model and existence of already available algorithms and applications, especially for batch processing of large data sets as in the Next Generation Sequencing analysis, bioinformatics MapReduce applications are yet to become widely adopted in the bioinformatics data analysis. We identify two prerequisites for their adaptation and utilization: (1) the ability to compose complex workflows from multiple bioinformatics MapReduce tools that will abstract technical details of how those tools are combined and executed allowing bioinformatics domain experts to focus on the analysis, and (2) the availability of accessible and flexible computing infrastructure for this type of data processing. This paper presents integration of two existing systems: Cloudgene, a bioinformatics MapReduce workflow framework, and CloudMan, a cloud manager for delivering application execution environments. Together, they enable delivery of bioinformatics MapReduce applications in the Cloud.
    Full-text · Conference Paper · May 2014
    • "Motivated by the potential scalability and throughput offered by Hadoop, there are an increasing number of Hadoop-based tools for processing sequencing data (Taylor, 2010), ranging from quality control (Robinson et al., 2011) and alignment (Langmead et al., 2009; Pireddu et al., 2011) to SNP calling (Langmead et al., 2009), variant annotation (O’Connor et al., 2010) and structural variant detection (Whelan et al., 2013), including general purpose workflow management (Schönherr et al., 2012). Note the recent publication of independent and complementary work in (Nordberg et al., 2013)."
    ABSTRACT: Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts. Availability and Implementation: Available under the open source MIT license at andre.schumacher@yahoo.com Supplementary information: Supplementary data are available at Bioinformatics online.
    Full-text · Article · Oct 2013 · Bioinformatics