Cloud Computing and the DNA Data Race

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.
Nature Biotechnology (Impact Factor: 41.51). 07/2010; 28(7):691-3. DOI: 10.1038/nbt0710-691
Source: PubMed
13 Reads
  • Source
    • "As NGS technology develops, the challenges faced by both bioinformaticians and users relate specifically to the competency of the software tools and the performance of the hardware to handle increasingly larger and more complex datasets. The lack of appropriate hardware infrastructure is the greatest contributing factor to the bioinformatics bottleneck and the rise in virtual environments, parallelised code and super-computing facilities is testament to an understanding of the need for continual development and innovation in NGS data handling and management [8]. However, these structural and programmatic facilitators are not without their drawbacks."
    ABSTRACT: In the rapidly evolving domain of next generation sequencing and bioinformatics analysis, data generation is one aspect that is increasing at a concomitant rate. The burden associated with processing large amounts of sequencing data has emphasised the need to allocate sufficient computing resources to complete analyses in the shortest possible time with manageable and predictable costs. A novel method for predicting time to completion for a popular bioinformatics software (QIIME) was developed using key variables characteristic of the input data assumed to impact processing time. Multiple Linear Regression models were developed to determine run time for two denoising algorithms and a general bioinformatics pipeline. The models were able to accurately predict clock time for denoising sequences from a naturally assembled community dataset, but not an artificial community. Speedup and efficiency tests for AmpliconNoise also highlighted that caution was needed when allocating resources for parallel processing of data. Accurate modelling of computational processing time using easily measurable predictors can assist NGS analysts in determining resource requirements for bioinformatics software and pipelines. Whilst demonstrated on a specific group of scripts, the methodology can be extended to encompass other packages running on multiple architectures, either in parallel or sequentially.
  • Source
    • "Their system, called XBase, is doing various data mining tasks like classification of heart valvular disease, detecting association rules, diagnosis assistance, and treatment recommendation. As Schatz et al. [25] stated, sequencing of DNA chain is improving at a rate of about 5-fold per year, while computer performance is doubling only every 18 or 24 months. Therefore, addressing the issue of designing data analysis arises as a question. "
    ABSTRACT: Rapid growth and storage of biomedical data enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other side analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud based system that integrates metalearning framework for ranking and selection of best predictive algorithms for data at hand and open source big data technologies for analysis of biomedical data.
    The Scientific World Journal 04/2014; Volume 2014 (2014)(Article ID 859279):10 pages. DOI:10.1155/2014/859279 · 1.73 Impact Factor
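    The growth rates quoted in the snippet above (sequencing throughput rising about 5-fold per year, computer performance doubling every 18 or 24 months) can be compared with a short calculation. This is a minimal sketch that assumes both rates stay constant and compound annually; the function names and the chosen year horizons are illustrative and do not come from any of the cited papers.

```python
# Rates taken from the quoted snippet; constant compounding is an assumption.
SEQ_FOLD_PER_YEAR = 5.0          # sequencing throughput growth per year
CPU_DOUBLING_MONTHS = (18, 24)   # performance doubling periods quoted


def annual_factor_from_doubling(months: float) -> float:
    """Convert a doubling period in months into a per-year growth factor."""
    return 2.0 ** (12.0 / months)


def gap_after(years: float, doubling_months: float) -> float:
    """Ratio of sequencing growth to compute growth after `years` years."""
    seq_growth = SEQ_FOLD_PER_YEAR ** years
    cpu_growth = annual_factor_from_doubling(doubling_months) ** years
    return seq_growth / cpu_growth


if __name__ == "__main__":
    for months in CPU_DOUBLING_MONTHS:
        per_year = annual_factor_from_doubling(months)
        print(f"doubling every {months} months = {per_year:.2f}x per year")
        for years in (1, 3, 5):
            gap = gap_after(years, months)
            print(f"  after {years} year(s): data outpaces compute by {gap:.1f}x")
```

    Under these assumptions the gap itself grows exponentially, which is the motivation the snippet gives for rethinking how the data analysis is designed.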
  • Source
    • "The NGS sequencer technology has improved, since 2005, at a very fast rate: every year the throughput of the sequencers increased by a 5-fold factor [3], [4]. Such a high rate of data production imposes the need to reduce the time required to perform the alignment phase (the bottleneck in any resequencing or otherwise analysing project) without sacrificing accuracy."
    ABSTRACT: We propose a parallel algorithm that solves the best k-mismatches alignment problem against a genomic reference using the “one sequence/multiple processes” paradigm and distributed memory. Our proposal is designed to take advantage of a computing cluster using MPI (Message Passing Interface) for communication. Our solution distributes the reference among different nodes and each sequence is processed concurrently by different nodes. When a (putative) best solution is found, the successful process propagates the information to other nodes, reducing search space and saving computation time. The distributed algorithm was developed in C++ and optimized for the PLX and FERMI supercomputers, but it is compatible with every OpenMPI-based cluster. It was included in the ERNE (Extended Randomized Numerical alignEr) package, whose aim is to provide an all-inclusive set of tools for short reads alignment and cleaning. ERNE is free software, distributed under the Open Source License (GPL V3) and can be downloaded at: The algorithm described in this work is implemented in the ERNE-PMAP and ERNE-PBS5 programs, the former designed to align DNA and RNA sequences, while the latter is optimized for bisulphite-treated sequences.
    2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP2014), Turin (Italy); 02/2014