RAId_DbS: Peptide Identification using Database Searches with Realistic Statistics

National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.
Biology Direct (Impact Factor: 4.04). 02/2007; 2:25. DOI: 10.1186/1745-6150-2-25
Source: PubMed

ABSTRACT

BACKGROUND: The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.

RESULTS: Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the P-value reported for a peptide hit under a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.
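As a rough illustration of what an E-value means in this setting, the sketch below uses the common convention that the E-value is the per-candidate P-value multiplied by the number of candidate peptides scored against the spectrum, and that the chance of at least one random hit follows from a Poisson assumption. The function names are hypothetical, and this is not RAId_DbS's own derivation, which the abstract describes only at a high level.

```python
import math


def e_value(p_value: float, num_candidates: int) -> float:
    """Expected number of random candidates scoring at least this well.

    Assumption (not taken from the abstract): E = P * N, where N is the
    number of database peptides scored against the spectrum.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if num_candidates < 0:
        raise ValueError("num_candidates must be non-negative")
    return p_value * num_candidates


def database_p_value(e: float) -> float:
    """Probability of at least one random hit, assuming Poisson hit counts."""
    if e < 0.0:
        raise ValueError("e must be non-negative")
    # 1 - exp(-e), computed with expm1 for accuracy when e is small.
    return -math.expm1(-e)
```

Under these assumptions, a hit with a per-candidate P-value of 1e-6 among 50,000 scored peptides has an E-value of about 0.05, and for small E-values the database P-value is nearly equal to the E-value.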


Available from: Aleksey Y Ogurtsov, Apr 25, 2014
  • ABSTRACT: The past 15 years have seen significant progress in LC-MS/MS peptide sequencing, including the advent of successful de novo and database search methods; however, analysis of glycopeptide and, more generally, glycoconjugate spectra remains a much more open problem, and much annotation is still performed manually. This is partly because glycans, unlike peptides, need not be linear chains, and are instead described by trees. In this paper we introduce SweetSEQer, an extremely simple open source tool for identifying potential glycopeptide MS/MS spectra. We evaluate SweetSEQer on manually curated glycoconjugate spectra and on negative controls, and demonstrate high-quality filtering that can be easily improved for specific applications. We also demonstrate a high overlap between peaks annotated by experts and peaks annotated by SweetSEQer, as well as demonstrate inferred glycan graphs consistent with canonical glycan tree motifs.
    Molecular & Cellular Proteomics 02/2013; DOI:10.1074/mcp.O112.025940 · 7.25 Impact Factor
  • ABSTRACT: Due to its high specificity, trypsin is the enzyme of choice in shotgun proteomics. Nonetheless, several publications do report the identification of semi-tryptic and non-tryptic peptides. Many of these peptides are conjectured to be signaling peptides or to have formed during sample preparation. It is known that only a small fraction of tandem mass spectra from a trypsin-digested protein mixture can be confidently matched to tryptic peptides. Leaving aside other possibilities such as post-translational modifications and single amino acid polymorphisms, this suggests that many unidentified spectra originate from semi-tryptic and non-tryptic peptides. Including them in database searches, however, may not improve overall peptide identification, because expanding the search space can reduce sensitivity. To circumvent this issue for E-value-based search methods, we have designed a scheme that categorizes qualified peptides (i.e., peptides whose molecular weight differences from the parent ion are within a specified error tolerance) into three tiers: tryptic, semi-tryptic and non-tryptic. This classification allows peptides belonging to different tiers to have different Bonferroni correction factors. Our results show that this scheme can significantly improve retrieval performance when compared to search strategies that assign equal Bonferroni correction factors to all qualified peptides.
    Journal of Proteome Research 05/2013; 12(6). DOI:10.1021/pr301139y · 5.00 Impact Factor
  • ABSTRACT: Motivation: Assigning statistical significance accurately has become increasingly important as metadata of many types, often assembled in hierarchies, are constructed and combined for further biological analyses. Statistical inaccuracy of metadata at any level may propagate to downstream analyses, undermining the validity of the scientific conclusions drawn. From the perspective of mass-spectrometry-based proteomics, even though accurate statistics for peptide identification can now be achieved, accurate protein-level statistics remain challenging. Results: We have constructed a protein ID method that combines the peptide evidence for a candidate protein based on a rigorous formula derived earlier; in this formula the database P-value of every peptide is weighted, prior to the final combination, according to the number of proteins it maps to. We have also shown that this protein ID method provides accurate protein-level E-values, eliminating the need for empirical post-processing methods for type-I error control. Using a known protein mixture, we find that this protein ID method, when combined with the Soric formula, yields accurate values for the proportion of false discoveries. In terms of retrieval efficacy, the results from our method are comparable with other methods tested. Availability: The source code, implemented in C++ on a Linux system, is available for download at
    Bioinformatics 07/2014; 31(5). DOI:10.1093/bioinformatics/btu717 · 4.62 Impact Factor
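The three-tier idea in the Journal of Proteome Research abstract above (tryptic, semi-tryptic and non-tryptic peptides receiving different Bonferroni correction factors) can be illustrated with a generic weighted Bonferroni correction. This is a minimal sketch under stated assumptions, not the authors' scheme: the budget fractions, the peptide counts and the function names are all hypothetical. Each tier is given a fraction of the total error budget, and a peptide's correction factor is the number of qualified peptides in its tier divided by that fraction.

```python
import math


def correction_factor(tier, tier_counts, tier_budget):
    """Weighted Bonferroni factor for one tier (illustrative assumption).

    tier_counts: number of qualified peptides in each tier.
    tier_budget: fraction of the error budget given to each tier; the
    fractions must sum to 1 so that the overall expected number of
    random hits below a cutoff stays bounded by that cutoff.
    """
    if not math.isclose(sum(tier_budget.values()), 1.0):
        raise ValueError("tier budget fractions must sum to 1")
    if tier_budget[tier] <= 0.0:
        raise ValueError("each tier needs a positive budget fraction")
    return tier_counts[tier] / tier_budget[tier]


def tiered_e_value(p_value, tier, tier_counts, tier_budget):
    """E-value of a peptide hit using its tier's correction factor."""
    return p_value * correction_factor(tier, tier_counts, tier_budget)


# Hypothetical numbers of qualified peptides for one spectrum.
counts = {"tryptic": 100, "semi-tryptic": 2000, "non-tryptic": 50000}
total = sum(counts.values())

# Budget proportional to tier size: every tier gets the same factor,
# equal to the total count (the ordinary, unweighted Bonferroni case).
proportional = {tier: n / total for tier, n in counts.items()}

# Hypothetical budget that favors tryptic peptides.
favor_tryptic = {"tryptic": 0.5, "semi-tryptic": 0.3, "non-tryptic": 0.2}
```

With the proportional budget every peptide is multiplied by the total number of qualified peptides, which corresponds to the equal-factor strategy the abstract compares against. With the budget that favors tryptic peptides, a tryptic hit is penalized far less than under the equal-factor strategy, while a non-tryptic hit is penalized more.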