Cloud Computing and the DNA Data Race

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.
Nature Biotechnology (Impact Factor: 39.08). 07/2010; 28(7):691-3. DOI: 10.1038/nbt0710-691
Source: PubMed
  • Source
    ABSTRACT: In the rapidly evolving domain of next-generation sequencing and bioinformatics analysis, data generation is increasing at a concomitant rate. The burden of processing large amounts of sequencing data has emphasised the need to allocate sufficient computing resources to complete analyses in the shortest possible time with manageable and predictable costs. A novel method for predicting time to completion for a popular bioinformatics software package (QIIME) was developed using key variables characteristic of the input data that were assumed to impact processing time. Multiple linear regression models were developed to determine run time for two denoising algorithms and a general bioinformatics pipeline. The models accurately predicted clock time for denoising sequences from a naturally assembled community dataset, but not from an artificial community. Speedup and efficiency tests for AmpliconNoise also highlighted that caution is needed when allocating resources for parallel processing of data. Accurate modelling of computational processing time using easily measurable predictors can assist NGS analysts in determining resource requirements for bioinformatics software and pipelines. Whilst demonstrated on a specific group of scripts, the methodology can be extended to encompass other packages running on multiple architectures, either in parallel or sequentially.
  • Source
    ABSTRACT: The rapid growth and storage of biomedical data has enabled many opportunities for predictive modeling and the improvement of healthcare processes. On the other hand, analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud-based system that integrates a metalearning framework for ranking and selecting the best predictive algorithms for the data at hand with open source big data technologies for analysis of biomedical data.
    The Scientific World Journal 04/2014; Volume 2014 (2014)(Article ID 859279):10 pages. DOI:10.1155/2014/859279 · 1.73 Impact Factor
  • Source
    ABSTRACT: We propose a parallel algorithm that solves the best k-mismatches alignment problem against a genomic reference using the “one sequence/multiple processes” paradigm and distributed memory. Our proposal is designed to take advantage of a computing cluster using MPI (Message Passing Interface) for communication. Our solution distributes the reference among different nodes and each sequence is processed concurrently by different nodes. When a (putative) best solution is found, the successful process propagates the information to other nodes, reducing search space and saving computation time. The distributed algorithm was developed in C++ and optimized for the PLX and FERMI supercomputers, but it is compatible with every OpenMPI-based cluster. It was included in the ERNE (Extended Randomized Numerical alignEr) package, whose aim is to provide an all-inclusive set of tools for short reads alignment and cleaning. ERNE is free software, distributed under the Open Source License (GPL V3) and can be downloaded at: The algorithm described in this work is implemented in the ERNE-PMAP and ERNE-PBS5 programs, the former designed to align DNA and RNA sequences, while the latter is optimized for bisulphite-treated sequences.
    2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP2014), Turin (Italy); 02/2014
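
The best k-mismatches idea in the last abstract above can be illustrated with a minimal, single-process sketch. This is not the ERNE implementation and uses no MPI: the function names, the `chunks` parameter, and the brute-force Hamming scan are all assumptions made for illustration. The reference is split into chunks that stand in for nodes, and a shared bound plays the role of the propagated best solution, so that later chunks can abandon candidates early.

```python
from typing import Optional, Tuple


def hamming_bounded(a: str, b: str, limit: int) -> Optional[int]:
    """Count mismatches between equal-length strings; give up once they exceed limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return None  # candidate cannot beat the current bound
    return mismatches


def best_k_mismatch(read: str, reference: str, k: int,
                    chunks: int = 2) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best hit with at most k mismatches.

    Hypothetical sketch: each chunk of start positions stands in for one node,
    and `bound` stands in for the best score shared between nodes.
    """
    n, m = len(reference), len(read)
    if m == 0 or m > n or chunks < 1:
        return None
    starts = n - m + 1
    step = -(-starts // chunks)  # ceiling division
    best: Optional[Tuple[int, int]] = None
    bound = k  # shrinks whenever a better alignment is found
    for lo in range(0, starts, step):
        for pos in range(lo, min(lo + step, starts)):
            d = hamming_bounded(read, reference[pos:pos + m], bound)
            if d is None:
                continue
            best = (pos, d)
            if d == 0:
                return best  # an exact match cannot be improved
            bound = d - 1  # only strictly better hits are of interest now
    return best


if __name__ == "__main__":
    print(best_k_mismatch("TTGA", "ACGTACGTTTGACA", k=1))
```

Tightening the bound after each hit is what shrinks the search space here; in a distributed setting the same value would have to be communicated between processes, which this sketch does not attempt.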

