Cloud CPFP: A Shotgun Proteomics Data Analysis Pipeline Using Cloud and High Performance Computing.

Journal of Proteome Research (Impact Factor: 5.06). 10/2012; DOI: 10.1021/pr300694b
Source: PubMed

ABSTRACT We have extended the functionality of the Central Proteomics Facilities Pipeline (CPFP) to allow use of remote cloud and high performance computing (HPC) resources for shotgun proteomics data processing. CPFP has been modified to include modular local and remote scheduling for data processing jobs. The pipeline can now be run on a single PC or server, a local cluster, a remote HPC cluster, and/or the Amazon Web Services (AWS) cloud. We provide public images that allow easy deployment of CPFP in its entirety in the AWS cloud. This significantly reduces the effort necessary to use the software, and allows proteomics laboratories to pay for compute time ad hoc, rather than obtaining and maintaining expensive local server clusters. Alternatively, the Amazon cloud can be used to increase the throughput of a local installation of CPFP as necessary. We demonstrate that cloud CPFP allows users to process data at higher speed than local installations but with similar cost and lower staff requirements. In addition to the computational improvements, the web interface to CPFP is simplified, and other functionalities are enhanced. The software is under active development at two leading institutions and continues to be released under an open-source license at

  • ABSTRACT: Bacteria use diverse mechanisms to kill, manipulate, and compete with other cells. The recently discovered type VI secretion system (T6SS) is widespread in bacterial pathogens and used to deliver virulence effector proteins into target cells. Using comparative proteomics, we identified two previously unidentified T6SS effectors that contained a conserved motif. Bioinformatic analyses revealed that this N-terminal motif, named MIX (marker for type six effectors), is found in numerous polymorphic bacterial proteins that are primarily located in the T6SS genome neighborhood. We demonstrate that several MIX-containing proteins are T6SS effectors and that they are not required for T6SS activity. Thus, we propose that MIX-containing proteins are T6SS effectors. Our findings allow for the identification of numerous uncharacterized T6SS effectors that will undoubtedly lead to the discovery of new biological mechanisms.
    Proceedings of the National Academy of Sciences 06/2014; DOI:10.1073/pnas.1406110111 · 9.81 Impact Factor
  • ABSTRACT: Democratization of genomics technologies has enabled the rapid determination of genotypes. More recently, the democratization of comprehensive proteomics technologies is enabling the determination of the cellular phenotype and the molecular events that define its dynamic state. Core proteomic technologies include mass spectrometry to define protein sequence, protein:protein interactions, and protein post-translational modifications. Key enabling technologies for proteomics are bioinformatic pipelines to identify, quantitate, and summarize these events. The Trans-Proteomic Pipeline (TPP) is a robust open-source standardized data processing pipeline for large-scale reproducible quantitative mass spectrometry proteomics. It supports all major operating systems and instrument vendors via open data formats. Here we provide a review of the overall proteomics workflow supported by the TPP, its major tools, and how it can be used in its various modes from desktop to cloud computing. We describe new features for the TPP, including data visualization functionality. We conclude by describing some common perils that affect the analysis of tandem mass spectrometry datasets, as well as some major upcoming features.
    PROTEOMICS - CLINICAL APPLICATIONS 01/2015; DOI:10.1002/prca.201400164 · 1.81 Impact Factor
  • ABSTRACT: Cloud computing, where scalable, on-demand compute cycles and storage are available as a service, has the potential to accelerate mass spectrometry-based proteomics research by providing simple, expandable and affordable large-scale computing to all laboratories regardless of location or information technology expertise. We present new cloud computing functionality for the Trans-Proteomic Pipeline, a free and open-source suite of tools for the processing and analysis of tandem mass spectrometry datasets. Enabled with Amazon Web Services cloud computing, the Trans-Proteomic Pipeline now accesses large scale computing resources, limited only by the available Amazon Web Services infrastructure, for all users. The Trans-Proteomic Pipeline runs in an environment fully hosted on Amazon Web Services, where all software and data reside on cloud resources to tackle large search studies. In addition, it can also be run on a local computer with computationally intensive tasks launched onto the Amazon Elastic Compute Cloud service to greatly decrease analysis times. We describe the new Trans-Proteomic Pipeline cloud service components, compare the relative performance and costs of various Elastic Compute Cloud service instance types, and present on-line tutorials that enable users to learn how to deploy cloud computing technology rapidly with the Trans-Proteomic Pipeline. We provide tools for estimating the necessary computing resources and costs given the scale of a job and demonstrate the use of cloud enabled Trans-Proteomic Pipeline by running over 1100 tandem mass spectrometry files through four proteomic search engines in 9 hours and at a very low cost.
    Molecular & Cellular Proteomics 11/2014; DOI:10.1074/mcp.O114.043380 · 7.25 Impact Factor