Improved validation of peptide MS/MS assignments using spectral intensity prediction.

Department of Computer Science and Engineering, University of Colorado at Denver and Health Sciences Center, Denver, Colorado 80217-3364, USA.
Molecular & Cellular Proteomics (Impact Factor: 7.25). 02/2007; 6(1):1-17. DOI: 10.1074/mcp.M600320-MCP200
Source: PubMed

ABSTRACT A major limitation in identifying peptides from complex mixtures by shotgun proteomics is the ability of search programs to accurately assign peptide sequences using mass spectrometric fragmentation spectra (MS/MS spectra). Manual analysis is used to assess borderline identifications; however, it is error-prone and time-consuming, and criteria for acceptance or rejection are not well defined. Here we report a Manual Analysis Emulator (MAE) program that evaluates results from search programs by implementing two commonly used criteria: 1) consistency of fragment ion intensities with predicted gas phase chemistry and 2) whether a high proportion of the ion intensity (proportion of ion current (PIC)) in the MS/MS spectra can be derived from the peptide sequence. To evaluate chemical plausibility, MAE utilizes similarity (Sim) scoring against theoretical spectra simulated by MassAnalyzer software (Zhang, Z. (2004) Prediction of low-energy collision-induced dissociation spectra of peptides. Anal. Chem. 76, 3908-3922) using known gas phase chemical mechanisms. The results show that Sim scores provide significantly greater discrimination between correct and incorrect search results than achieved by Sequest XCorr scoring or Mascot Mowse scoring, allowing reliable automated validation of borderline cases. To evaluate PIC, MAE simplifies the DTA text files summarizing the MS/MS spectra and applies heuristic rules to classify the fragment ions. MAE output also provides data mining functions, which are illustrated by using PIC to identify spectral chimeras, where two or more peptide ions were sequenced together, as well as cases where fragmentation chemistry is not well predicted.
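The two criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not give MAE's formulas, so everything here is an assumption: PIC is approximated as the fraction of total intensity lying within a hypothetical m/z tolerance of any theoretical fragment, and Sim is approximated by a cosine (normalized dot product) between binned spectra, standing in for whatever scoring MAE applies against MassAnalyzer output. The function names, the tolerance, and the bin width are illustrative only.

```python
from math import sqrt


def proportion_of_ion_current(peaks, theoretical_mz, tol=0.5):
    """Fraction of total intensity explained by theoretical fragment m/z values.

    peaks: list of (mz, intensity) pairs from an observed MS/MS spectrum.
    theoretical_mz: m/z values derived from the candidate peptide sequence.
    tol: matching tolerance in m/z units (an assumed value, not MAE's).
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    explained = sum(
        intensity
        for mz, intensity in peaks
        if any(abs(mz - t) <= tol for t in theoretical_mz)
    )
    return explained / total


def _binned(peaks, bin_width):
    """Sum intensities into fixed-width m/z bins keyed by bin index."""
    bins = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def similarity(observed, predicted, bin_width=1.0):
    """Cosine similarity between two binned spectra (a stand-in for Sim)."""
    a = _binned(observed, bin_width)
    b = _binned(predicted, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


observed = [(100.0, 50.0), (200.0, 30.0), (300.0, 20.0)]
pic = proportion_of_ion_current(observed, theoretical_mz=[100.2, 200.1])
sim_same = similarity(observed, observed)
sim_disjoint = similarity(observed, [(400.0, 10.0)])
```

In this toy spectrum the peaks at 100 and 200 match a theoretical fragment and the peak at 300 does not. A low PIC for a high-scoring match is the kind of signal the abstract associates with chimeric spectra.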

  • ABSTRACT: Proteomics addresses the important goal of determining the chemistry and composition of proteins in biological samples. Mass-spectrometry-based strategies have been highly successful in identifying and profiling proteins in complex mixtures; however, although depth of sampling continues to improve, a general recognition exists that no study has yet achieved complete protein coverage in any tissue, cell type, subcellular component, or fluid. The development of new approaches for comprehensively surveying highly complex protein mixtures, distinguishing protein isoforms, quantifying changes in protein abundance between different samples, and mapping post-translational modifications are areas of active research. These will be needed to achieve the "systems-wide" protein profiling goals of defining molecular responses to cell perturbations and obtaining biomarker information for disease detection, prognosis, and responses to therapy. We review recent progress in approaching these problems and present examples of successful applications and the outlook for the future.
  • ABSTRACT: In shotgun proteomics, database search algorithms rely on fragmentation models to predict fragment ions that should be observed for a given peptide sequence. The most widely used strategy (Naive model) is oversimplified, cleaving all peptide bonds with equal probability to produce fragments of all charges below that of the precursor ion. More accurate models, based on fragmentation simulation, are too computationally intensive for on-the-fly use in database search algorithms. We have created an ordinal-regression-based model called Basophile that takes fragment size and basic residue distribution into account when determining charge retention during CID/higher-energy collision-induced dissociation (HCD) of charged peptides. This model improves the accuracy of predictions by reducing the number of unnecessary fragments that are routinely predicted for highly charged precursors. Basophile increased the identification rates by 26% (on average) over the Naive model when analyzing triply charged precursors from ion trap data. Basophile achieves simplicity and speed by solving the prediction problem with an ordinal regression equation, which can be incorporated into any database search software for shotgun proteomic identification.
    Genomics, Proteomics & Bioinformatics 03/2013
  • ABSTRACT: Mass spectrometry provides a high-throughput approach to identify proteins in biological samples. A key step in the analysis of mass spectrometry data is to identify the peptide sequence that, most probably, gave rise to each observed spectrum. This is often tackled using a database search: each observed spectrum is compared against a large number of theoretical "expected" spectra predicted from candidate peptide sequences in a database, and the best match is identified using some heuristic scoring criterion. Here we provide a more principled, likelihood-based, scoring criterion for this problem. Specifically, we introduce a probabilistic model that allows one to assess, for each theoretical spectrum, the probability that it would produce the observed spectrum. This probabilistic model takes account of peak locations and intensities, in both observed and theoretical spectra, which enables incorporation of detailed knowledge of chemical plausibility in peptide identification. Besides placing peptide scoring on a sounder theoretical footing, the likelihood-based score also has important practical benefits: it provides natural measures for assessing the uncertainty of each identification, and in comparisons on benchmark data it produced more accurate peptide identifications than other methods, including SEQUEST. Although we focus here on peptide identification, our scoring rule could easily be integrated into any downstream analyses that require peptide-spectrum match scores.
    The Annals of Applied Statistics 01/2013; 6(4). · 1.69 Impact Factor
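The Naive fragmentation model described in the Basophile abstract above (every peptide bond cleaved, fragments emitted at every charge below the precursor charge) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from any search engine: only b and y ions are generated, monoisotopic masses are used, no modifications are handled, and a singly charged precursor is assumed to still yield charge-1 fragments. The name `naive_fragments` is hypothetical.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def naive_fragments(peptide, precursor_charge):
    """List (label, charge, mz) for b/y ions at every bond and allowed charge.

    Allowed charges run from 1 to precursor_charge - 1; a floor of 1 is an
    assumption made here so that singly charged precursors yield fragments.
    """
    max_charge = max(1, precursor_charge - 1)
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(peptide)
    fragments = []
    for i in range(1, n):  # one cleavage site per peptide bond
        b_mass = sum(masses[:i])            # N-terminal fragment
        y_mass = sum(masses[i:]) + WATER    # C-terminal fragment keeps H2O
        for z in range(1, max_charge + 1):
            fragments.append((f"b{i}", z, (b_mass + z * PROTON) / z))
            fragments.append((f"y{n - i}", z, (y_mass + z * PROTON) / z))
    return fragments


triply_charged = naive_fragments("PEPTIDE", 3)
dipeptide = naive_fragments("GA", 2)
```

For a peptide of length n and precursor charge z, this enumerates 2(n - 1)(z - 1) candidate fragments when z is at least 2. That growth with charge is the over-prediction for highly charged precursors that the abstract says Basophile reduces.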