Cloud Computing and the DNA Data Race

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA.
Nature Biotechnology (Impact Factor: 41.51). 07/2010; 28(7):691-3. DOI: 10.1038/nbt0710-691
Source: PubMed

  • Source
    • "Advances in next generation sequencing technologies have enabled rapid generation of newly sequenced genomes at a rate that can no longer be handled by a single-core non-distributed computing system in a feasible manner[1,2]. The large volume of sequencing data that are now available has created profound challenges in data transfer and analysis[3]. High throughput computing on supercomputers was recently introduced to meet these challenges[4,5]. "
    ABSTRACT: Here we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resulting functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at PSAT stands apart from other sequence-based genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.
    Article · Dec 2016 · BMC Bioinformatics
  • Source
    • "genomics/), and other commercial vendors are also emerging to help manage the deluge of data using commercial cloud platforms and are likely to play an increasingly important role in genomics in the future. The major technical reason this model will become more widespread is that at large scales, it is overwhelmingly more efficient to upload code segments, measured in kilobytes to megabytes, rather than to download entire large collections, measured in petabytes or beyond (Schatz et al. 2010 "
    ABSTRACT: The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than 1 billion genomes, bringing even deeper insight into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential for medicine and biology, though, will only be achieved through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. In this perspective, I aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world.
    Article · Oct 2015 · Genome Research
  • Source
    • "As NGS technology develops, the challenges faced by both bioinformaticians and users relate specifically to the competency of the software tools and the performance of the hardware to handle increasingly larger and more complex datasets. The lack of appropriate hardware infrastructure is the greatest contributing factor to the bioinformatics bottleneck and the rise in virtual environments, parallelised code and super-computing facilities is testament to an understanding of the need for continual development and innovation in NGS data handling and management [8]. However, these structural and programmatic facilitators are not without their drawbacks. "
    ABSTRACT: In the rapidly evolving domain of next generation sequencing and bioinformatics analysis, data generation is one aspect that is increasing at a concomitant rate. The burden associated with processing large amounts of sequencing data has emphasised the need to allocate sufficient computing resources to complete analyses in the shortest possible time with manageable and predictable costs. A novel method for predicting time to completion for a popular bioinformatics software (QIIME), was developed using key variables characteristic of the input data assumed to impact processing time. Multiple Linear Regression models were developed to determine run time for two denoising algorithms and a general bioinformatics pipeline. The models were able to accurately predict clock time for denoising sequences from a naturally assembled community dataset, but not an artificial community. Speedup and efficiency tests for AmpliconNoise also highlighted that caution was needed when allocating resources for parallel processing of data. Accurate modelling of computational processing time using easily measurable predictors can assist NGS analysts in determining resource requirements for bioinformatics software and pipelines. Whilst demonstrated on a specific group of scripts, the methodology can be extended to encompass other packages running on multiple architectures, either in parallel or sequentially.
    Full-text · Article · Mar 2015