Cloud CPFP: A Shotgun Proteomics Data Analysis Pipeline Using Cloud and High Performance Computing

Journal of Proteome Research (Impact Factor: 4.25). 10/2012; 11(12). DOI: 10.1021/pr300694b
Source: PubMed


We have extended the functionality of the Central Proteomics Facilities Pipeline (CPFP) to allow use of remote cloud and high-performance computing (HPC) resources for shotgun proteomics data processing. CPFP has been modified to include modular local and remote scheduling for data processing jobs. The pipeline can now be run on a single PC or server, a local cluster, a remote HPC cluster, and/or the Amazon Web Services (AWS) cloud. We provide public images that allow easy deployment of CPFP in its entirety in the AWS cloud. This significantly reduces the effort necessary to use the software and allows proteomics laboratories to pay for compute time ad hoc, rather than obtaining and maintaining expensive local server clusters. Alternatively, the Amazon cloud can be used to increase the throughput of a local installation of CPFP as necessary. We demonstrate that Cloud CPFP allows users to process data at higher speed than local installations, but with similar cost and lower staff requirements. In addition to the computational improvements, the web interface to CPFP is simplified, and other functionality is enhanced. The software is under active development at two leading institutions and continues to be released under an open-source license at
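As a hedged illustration of the deployment model described above, the sketch below launches an instance from a public machine image on Amazon EC2 using the boto3 library. The AMI ID, instance type, and key-pair name are placeholders, not identifiers published with CPFP; the actual public images and recommended instance sizes should be taken from the CPFP documentation.

```python
# Hypothetical sketch: launching a pipeline head-node instance from a public
# AMI with boto3. All identifiers below are placeholders, not CPFP values.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: public pipeline AMI ID
    InstanceType="c5.4xlarge",         # placeholder: size to expected search load
    KeyName="my-keypair",              # placeholder: existing EC2 key pair
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "cpfp-head-node"}],
    }],
)

print("Launched instance:", instances[0].id)
```

Once such an instance is running, the pipeline's web interface would presumably be reached over the instance's public address; additional worker instances could be started on demand to scale search throughput.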

    • "Within this environment, a cloud service has now been implemented that infers domains from the FASTA sequences of all proteins identified in the experiment at hand and maps peptide quantitation values into the corresponding functional domains. Recently, the growth of MS/MS data has motivated the proteomics community to seek cloud computing tools to enable small laboratories to analyze complex datasets [9] [10]. "
    ABSTRACT: Mass-spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. Here we describe a new module integrated into PatternLab for Proteomics that allows the pinpointing of differentially expressed domains. This is accomplished by inferring functional domains through our cloud service, using HMMER3 and Pfam remotely, and then mapping the quantitation values into domains for downstream analysis. In all, spotting which functional domains are changing when comparing biological states serves as a complementary approach to facilitate the understanding of a system's biology. We exemplify the new module's use by reanalyzing a previously published MudPIT dataset of Cryptococcus gattii cultivated under iron-depleted and replete conditions. We show how the differential analysis of functional domains can facilitate the interpretation of proteomic data by providing further valuable insight.
    Journal of Proteomics 06/2013; 89. DOI: 10.1016/j.jprot.2013.06.013 · 3.89 Impact Factor
  • ABSTRACT: Modern-day proteomics generates ever more complex data, causing the requirements for storing and processing such data to outgrow the capacity of most desktop computers. To cope with the increased computational demands, distributed architectures have gained substantial popularity in recent years. In this review, we provide an overview of current techniques for distributed computing, along with examples of how these techniques are currently employed in the field of proteomics. We thus underline the benefits of distributed computing in proteomics, while also pointing out the potential issues and pitfalls involved.
    Proteomics 03/2014; 14(4-5). DOI: 10.1002/pmic.201300288 · 3.81 Impact Factor
  • ABSTRACT: Bottom-up proteomics largely relies on tryptic peptides for protein identification and quantification. Tryptic digestion often provides limited coverage of a protein's sequence due to issues such as peptide length, ionization efficiency, and posttranslational modification (PTM) colocalization. Unfortunately, a region of interest in a protein, e.g. due to proximity to an active site or the presence of important PTMs, may not be covered by tryptic peptides. Detection limits, quantification accuracy, and isoform differentiation can also be improved with greater sequence coverage. Selected Reaction Monitoring (SRM) would also greatly benefit from being able to identify additional targetable sequences. In an attempt to improve protein sequence coverage and to target regions of proteins that do not generate useful tryptic peptides, we deployed a multi-protease strategy on the HeLa proteome. First, we used seven commercially available enzymes in single, double, and triple enzyme combinations; a total of 48 digests were performed. 5223 proteins were detected by analyzing the unfractionated cell lysate digest directly, with 42% mean sequence coverage. Additional strong anion exchange (SAX) fractionation of the most complementary digests permitted identification of over 3000 more proteins, with improved mean sequence coverage. We then constructed a web application ( that allows the community to examine a target protein or protein isoform in order to discover the enzyme or combination of enzymes that would yield peptides spanning a certain region of interest in the sequence. Finally, we examined the utility of non-tryptic digests for SRM. From our SAX data we were able to identify three or more proteotypic SRM candidates within a single digest for 6056 genes. Surprisingly, in 25% of these cases the digest producing the most observable proteotypic peptides was neither trypsin nor Lys-C. SRM analysis of Asp-N vs. tryptic peptides for eight proteins determined that Asp-N yielded higher signal in 5 of 8 cases.
    Molecular & Cellular Proteomics 04/2014; 13(6). DOI: 10.1074/mcp.M113.035170 · 6.56 Impact Factor
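The multi-protease study above reports mean sequence coverage per digest. As a minimal, hypothetical sketch (not code from that study), percent sequence coverage can be computed from a protein sequence and a set of identified peptides as follows:

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Percent of residues covered by at least one identified peptide.

    Illustrative assumptions: exact substring matching, no handling of
    isobaric residues or modified peptides.
    """
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein) if protein else 0.0


# Toy example: two peptides covering part of a short sequence
print(round(sequence_coverage(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    ["AYIAKQR", "LEERLGL"]), 1))
```

Comparing this figure across digests of the same sample is one straightforward way to judge how complementary different enzyme combinations are.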