Cloud CPFP: A Shotgun Proteomics Data Analysis Pipeline Using Cloud and High Performance Computing

Journal of Proteome Research (Impact Factor: 4.25). 10/2012; 11(12). DOI: 10.1021/pr300694b
Source: PubMed


We have extended the functionality of the Central Proteomics Facilities Pipeline (CPFP) to allow use of remote cloud and high performance computing (HPC) resources for shotgun proteomics data processing. CPFP has been modified to include modular local and remote scheduling for data processing jobs. The pipeline can now be run on a single PC or server, a local cluster, a remote HPC cluster, and/or the Amazon Web Services (AWS) cloud. We provide public images that allow easy deployment of CPFP in its entirety in the AWS cloud. This significantly reduces the effort necessary to use the software, and allows proteomics laboratories to pay for compute time ad hoc, rather than obtaining and maintaining expensive local server clusters. Alternatively, the Amazon cloud can be used to increase the throughput of a local installation of CPFP as necessary. We demonstrate that cloud CPFP allows users to process data at higher speed than local installations but with similar cost and lower staff requirements. In addition to the computational improvements, the web interface to CPFP is simplified, and other functionalities are enhanced. The software is under active development at two leading institutions and continues to be released under an open-source license at

  • "The mass spectrometer acquired up to 12 MS/MS spectra (Orbitrap) or 20 MS/MS spectra (Q Exactive) for each full spectrum acquired. Raw MS data files were converted to a peak list format and analyzed using the central proteomics facilities pipeline (CPFP), version 2.0.3 (Trudgian and Mirzaei, 2012). Peptide identification was performed using X!Tandem (Craig and Beavis, 2004) and the open MS search algorithm (OMSSA) (Geer et al., 2004) against a custom sequence database."
    ABSTRACT: Cholera toxin (CT) enters and intoxicates host cells after binding cell surface receptors using its B subunit (CTB). The ganglioside (glycolipid) GM1 is thought to be the sole CT receptor; however, the mechanism by which CTB binding to GM1 mediates internalization of CT remains enigmatic. Here we report that CTB binds cell surface glycoproteins. Relative contributions of gangliosides and glycoproteins to CTB binding depend on cell type, and CTB binds primarily to glycoproteins in colonic epithelial cell lines. Using a metabolically incorporated photocrosslinking sugar, we identified one CTB-binding glycoprotein and demonstrated that the glycan portion of the molecule, not the protein, provides the CTB interaction motif. We further show that fucosylated structures promote CTB entry into a colonic epithelial cell line and subsequent host cell intoxication. CTB-binding fucosylated glycoproteins are present in normal human intestinal epithelia and could play a role in cholera.
    Full-text · Article · Oct 2015 · eLife Sciences
  • "Within this environment, a cloud service has now been implemented that infers domains from the FASTA sequences of all proteins identified in the experiment at hand and maps peptide quantitation values into the corresponding functional domains. Recently, the growth of MS/MS data has motivated the proteomics community to seek cloud computing tools to enable small laboratories to analyze complex datasets [9] [10]."
    ABSTRACT: Mass-spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. Here we describe a new module integrated into PatternLab for Proteomics that allows the pinpointing of differentially expressed domains. This is accomplished by inferring functional domains through our cloud service, using HMMER3 and Pfam remotely, and then mapping the quantitation values into domains for downstream analysis. In all, spotting which functional domains are changing when comparing biological states serves as a complementary approach to facilitate the understanding of a system's biology. We exemplify the new module's use by reanalyzing a previously published MudPIT dataset of Cryptococcus gattii cultivated under iron-depleted and replete conditions. We show how the differential analysis of functional domains can facilitate the interpretation of proteomic data by providing further valuable insight.
    Full-text · Article · Jun 2013 · Journal of Proteomics
  • ABSTRACT: Modern-day proteomics generates ever more complex data, causing the requirements on the storage and processing of such data to outgrow the capacity of most desktop computers. To cope with the increased computational demands, distributed architectures have gained substantial popularity in recent years. In this review, we provide an overview of the current techniques for distributed computing, along with examples of how the techniques are currently being employed in the field of proteomics. We thus underline the benefits of distributed computing in proteomics, while also pointing out the potential issues and pitfalls involved.
    No preview · Article · Mar 2014 · Proteomics