Cloud Computing and the DNA Data Race

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.
Nature Biotechnology (Impact Factor: 41.51). 07/2010; 28(7):691-3. DOI: 10.1038/nbt0710-691
Source: PubMed

Cited by:

  • Source
    • "genomics/), and other commercial vendors are also emerging to help manage the deluge of data using commercial cloud platforms and are likely to play an increasingly important role in genomics in the future. The major technical reason this model will become more widespread is that at large scales, it is overwhelmingly more efficient to upload code segments, measured in kilobytes to megabytes , rather than to download entire large collections, measured in petabytes or beyond (Schatz et al. 2010 "
    ABSTRACT: The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than 1 billion genomes, bringing even deeper insight into human genetics as well as the genetics of millions of other species on the planet. This great potential for medicine and biology, though, will only be realized through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. In this perspective, I aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world.
    Genome Research 10/2015; 25(10):1417-1422. DOI: 10.1101/gr.191684.115 · 14.63 Impact Factor
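    The asymmetry the excerpt describes is easy to quantify. The Python sketch below compares the two transfer times; the 1 Gbit/s link speed and the exact payload sizes are illustrative assumptions, not figures from the cited papers.

      # Back-of-the-envelope comparison: shipping code to the data versus
      # downloading the data. The 1 Gbit/s link and payload sizes are
      # assumed figures for illustration only.
      LINK_GBPS = 1.0                  # assumed sustained network throughput
      BYTES_PER_GBIT = 1e9 / 8         # bytes moved per gigabit

      def transfer_seconds(num_bytes, gbps=LINK_GBPS):
          """Time to move num_bytes over a gbps link, ignoring latency."""
          return num_bytes / (gbps * BYTES_PER_GBIT)

      code_upload = transfer_seconds(5e6)     # ~5 MB of analysis code
      data_download = transfer_seconds(1e15)  # ~1 PB sequence collection

      print(f"Upload 5 MB of code:   {code_upload:.2f} s")
      print(f"Download 1 PB of data: {data_download / 86400:.0f} days")

    At these rates the upload finishes in a fraction of a second while the download takes roughly three months, which is the excerpt's argument for moving computation to the data rather than the reverse.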
  • Source
    • "As NGS technology develops , the challenges faced by both bioinformaticians and users relate specifically to the competency of the software tools and the performance of the hardware to handle increasingly larger and more complex datasets. The lack of appropriate hardware infrastructure is the greatest contributing factor to the bioinformatics bottleneck and the rise in virtual environments, parallelised code and super-computing facilities is testament to an understanding of the need for continual development and innovation in NGS data handling and management [8]. However, these structural and programmatic facilitators are not without their drawbacks. "
    ABSTRACT: In the rapidly evolving domain of next generation sequencing and bioinformatics analysis, data generation is one aspect that is increasing at a concomitant rate. The burden associated with processing large amounts of sequencing data has emphasised the need to allocate sufficient computing resources to complete analyses in the shortest possible time with manageable and predictable costs. A novel method for predicting time to completion for a popular bioinformatics software package (QIIME) was developed using key variables, characteristic of the input data, that are assumed to impact processing time. Multiple Linear Regression models were developed to determine run time for two denoising algorithms and a general bioinformatics pipeline. The models were able to accurately predict clock time for denoising sequences from a naturally assembled community dataset, but not from an artificial community. Speedup and efficiency tests for AmpliconNoise also highlighted that caution is needed when allocating resources for parallel processing of data. Accurate modelling of computational processing time using easily measurable predictors can assist NGS analysts in determining resource requirements for bioinformatics software and pipelines. Whilst demonstrated on a specific group of scripts, the methodology can be extended to encompass other packages running on multiple architectures, either in parallel or sequentially.
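    The abstract does not list its exact predictor variables, so the features below (sequence count, mean read length, sample count) and the timings are hypothetical stand-ins; this is a minimal sketch of the multiple-linear-regression idea using scikit-learn, not the paper's actual model.

      # Minimal sketch of run-time prediction via multiple linear regression.
      # Features and training values are hypothetical stand-ins, not data
      # from the paper.
      import numpy as np
      from sklearn.linear_model import LinearRegression

      # Each row: [num_sequences, mean_read_length, num_samples]
      X = np.array([
          [10_000, 250, 10],
          [50_000, 250, 20],
          [100_000, 150, 40],
          [250_000, 300, 80],
      ])
      y = np.array([120.0, 540.0, 960.0, 3100.0])  # observed wall-clock seconds

      model = LinearRegression().fit(X, y)

      # Estimate run time for a new dataset before reserving compute resources.
      new_job = np.array([[75_000, 250, 30]])
      print(f"Predicted run time: {model.predict(new_job)[0]:.0f} s")

    Fitting on logged runs of one's own pipeline and validating against held-out runs, as the paper does with natural versus artificial communities, shows whether the chosen predictors actually explain the variance in clock time.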
  • Source
    • "Their system, called XBase, is doing various data mining tasks like classification of heart valvular disease, detecting association rules, diagnosis assistance, and treatment recommendation. As Schatz et al. [25] stated, sequencing of DNA chain is improving at a rate of about 5-fold per year, while computer performance is doubling only every 18 or 24 months. Therefore, addressing the issue of designing data analysis arises as a question. "
    ABSTRACT: Rapid growth and storage of biomedical data have enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other hand, analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud-based system that integrates a metalearning framework, for ranking and selection of the best predictive algorithms for the data at hand, with open-source big data technologies for the analysis of biomedical data.
    The Scientific World Journal 04/2014; 2014 (Article ID 859279): 10 pages. DOI: 10.1155/2014/859279 · 1.73 Impact Factor
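    The growth rates quoted in the excerpt compound quickly. A short calculation under those stated rates (sequencing improving 5-fold per year, compute doubling every 18 months) shows how fast the gap widens; the time horizons chosen are arbitrary.

      # Gap between sequencing growth (~5x/year) and compute growth
      # (doubling every 18 months, i.e. ~1.59x/year), per the excerpt.
      SEQ_GROWTH = 5.0               # sequencing improvement per year
      CPU_GROWTH = 2.0 ** (12 / 18)  # doubling every 18 months

      for years in (1, 2, 5):
          gap = (SEQ_GROWTH / CPU_GROWTH) ** years
          print(f"After {years} year(s): sequencing ahead by {gap:,.0f}x")

    Within five years the data outgrow a fixed cluster by more than two orders of magnitude, which is precisely the pressure toward elastic cloud resources that the original article describes.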