Deconvolution and database search of complex tandem mass spectra of intact proteins: a combinatorial approach.

Department of Computer Science and Engineering, University of California, San Diego, California 92093, USA.
Molecular & Cellular Proteomics (Impact Factor: 7.25). 12/2010; 9(12):2772-82. DOI: 10.1074/mcp.M110.002766
Source: PubMed

ABSTRACT: Top-down proteomics studies intact proteins, enabling new opportunities for analyzing post-translational modifications. Because tandem mass spectra of intact proteins are very complex, spectral deconvolution (grouping peaks into isotopomer envelopes) is a key initial stage for their interpretation. In such spectra, isotopomer envelopes of different protein fragments span overlapping regions on the m/z axis and even share spectral peaks. This raises both pattern recognition and combinatorial challenges for spectral deconvolution. We present MS-Deconv, a combinatorial algorithm for spectral deconvolution. The algorithm first generates a large set of candidate isotopomer envelopes for a spectrum, then represents the spectrum as a graph, and finally selects its highest scoring subset of envelopes as a heaviest path in the graph. In contrast with other approaches, the algorithm scores sets of envelopes rather than individual envelopes. We demonstrate that MS-Deconv improves on Thrash and Xtract in both the number of correctly recovered monoisotopic masses and speed. We applied MS-Deconv to a large set of top-down spectra from Yersinia rohdei (with a still unsequenced genome) and further matched them against the protein database of the related, sequenced bacterium Yersinia enterocolitica. MS-Deconv is available at
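
The "heaviest path" selection step described in the abstract can be illustrated with a deliberately simplified sketch. This is not MS-Deconv's actual graph construction, which the abstract does not specify and which allows envelopes to share peaks. Here, as assumptions made only for illustration, each candidate envelope is reduced to an m/z interval with a positive score, an edge joins two envelopes only when the first ends before the second starts, and the hypothetical names `Envelope` and `heaviest_path` are invented for this example. Under those assumptions, the best-scoring set of mutually compatible envelopes is the heaviest path in a directed acyclic graph, found by dynamic programming:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical candidate envelope: an m/z interval plus a score."""
    mz_start: float
    mz_end: float
    score: float


def heaviest_path(envelopes):
    """Return (total_score, chosen_envelopes) for the heaviest path.

    Nodes are envelopes sorted by starting m/z. An edge i -> j exists
    when envelope i ends strictly before envelope j starts, so every
    path is a set of non-overlapping envelopes (a toy assumption).
    """
    order = sorted(envelopes, key=lambda e: e.mz_start)
    n = len(order)
    if n == 0:
        return 0.0, []

    best = [e.score for e in order]  # weight of heaviest path ending at node i
    prev = [None] * n                # predecessor on that path

    for j in range(n):
        for i in range(j):
            compatible = order[i].mz_end < order[j].mz_start
            if compatible and best[i] + order[j].score > best[j]:
                best[j] = best[i] + order[j].score
                prev[j] = i

    # Trace back from the node where the heaviest path ends.
    node = max(range(n), key=best.__getitem__)
    total = best[node]
    path = []
    while node is not None:
        path.append(order[node])
        node = prev[node]
    path.reverse()
    return total, path


if __name__ == "__main__":
    candidates = [
        Envelope(500.0, 501.0, 3.0),
        Envelope(500.5, 502.0, 5.0),
        Envelope(501.2, 503.0, 4.0),
        Envelope(502.5, 504.0, 1.0),
    ]
    total, chosen = heaviest_path(candidates)
    print(total, chosen)
```

In this toy input the single best envelope (score 5.0) is not part of the best set, because two compatible envelopes with scores 3.0 and 4.0 together outweigh it. That is the sense in which scoring sets of envelopes can differ from scoring envelopes individually.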

  • ABSTRACT: Background: Proteomics research is enabled by high-throughput technologies, but our ability to identify the expressed proteome is limited in small samples. The coverage and consistency of proteome expression are critical problems in proteomics. Here, we propose pathway analysis and a combination of microproteomics and transcriptomics analyses to improve mass-spectrometry protein identification from small samples. Results: Multiple proteomics runs using the MCF-7 cell line detected 4,957 expressed proteins. About 80% of expressed proteins were present in MCF-7 transcript data; highly expressed transcripts are more likely to have expressed proteins. Approximately 1,000 proteins were detected in each run of the small-sample proteomics. These proteins were mapped to gene symbols and compared with gene sets representing canonical pathways; more than 4,000 genes were extracted from the enriched gene sets. The identified canonical pathways largely overlapped between individual runs. Of the identified pathways, 182 were shared among three individual small-sample runs. Conclusions: Current technologies enable us to directly detect 10% of expressed proteomes from a small sample comprising as few as 50 cells. We used knowledge-based approaches to elucidate the missing proteome, which can be verified by targeted proteomics. This knowledge-based approach includes pathway analysis and the combination of gene expression and protein expression data for target prioritization. Genes present in both the enriched gene sets (canonical pathways collection) and the small-sample proteomics data correspond to approximately 50% of expressed proteomes in larger-sample proteomics data. In addition, 90% of targets from canonical pathways were estimated to be expressed. The comparison of proteomics and transcriptomics data suggests that highly expressed transcripts have a high probability of protein expression. However, approximately 10% of expressed proteins could not be matched with the expressed transcripts.
    BMC Genomics 12/2014; 2014(15). DOI:10.1186/1471-2164-15-S9-S1 · 4.04 Impact Factor
  • ABSTRACT: The isobaric labeling technique coupled with high-resolution mass spectrometry has been widely employed in proteomic workflows requiring relative quantification. Each high-resolution tandem mass spectrum (MS/MS) can be used not only to quantify the peptide from different samples by reporter ions, but also to identify the peptide it was derived from. Since the ions related to isobaric labeling may act as noise in database searching, the MS/MS spectrum should be preprocessed before peptide/protein identification. In this paper, we demonstrate that MS/MS spectra contain many high-frequency, high-abundance isobaric-related ions, and that combining the removal of isobaric-related ions with deisotoping and deconvolution in the MS/MS preprocessing procedure significantly improves peptide/protein identification sensitivity. The user-friendly software package TurboRaw2MGF (v2.0) has been implemented for converting raw TIC data files to mascot generic format files and can be downloaded for free from as part of the software suite ProteomicsTools. The data have been deposited to the ProteomeXchange with identifier PXD000994. Copyright © 2014, The American Society for Biochemistry and Molecular Biology.
    Molecular & Cellular Proteomics 11/2014; 14(2). DOI:10.1074/mcp.O114.041376 · 7.25 Impact Factor
  • ABSTRACT: Top-down mass spectrometry plays an important role in intact protein identification and characterization. Top-down mass spectra are more complex than bottom-up mass spectra because they often contain many isotopomer envelopes from highly charged ions, which may overlap with one another. As a result, spectral deconvolution, which converts a complex top-down mass spectrum into a monoisotopic mass list, is a key step in top-down spectral interpretation. In this paper, we propose a new scoring function, L-score, for evaluating isotopomer envelopes. By combining L-score with MS-Deconv, a new software tool, MS-Deconv+, was developed for top-down spectral deconvolution. Experimental results showed that MS-Deconv+ outperformed existing software tools in top-down spectral deconvolution. L-score shows high discriminative ability in identification of isotopomer envelopes. Using L-score, MS-Deconv+ reports many correct monoisotopic masses missed by other software tools, which are valuable for proteoform identification and characterization.