RAId_DbS: Peptide Identification using Database Searches with Realistic Statistics

National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.
Biology Direct (Impact Factor: 4.66). 02/2007; 2(1):25. DOI: 10.1186/1745-6150-2-25
Source: PubMed


The key to mass-spectrometry-based proteomics is peptide identification. A major challenge in peptide identification is to obtain realistic E-values when assigning statistical significance to candidate peptides.

Using a simple scoring scheme, we propose a database search method with theoretically characterized statistics. Taking into account possible skewness in the random variable distribution and the effect of finite sampling, we provide a theoretical derivation for the tail of the score distribution. For every experimental spectrum examined, we collect the scores of peptides in the database, and find good agreement between the collected score statistics and our theoretical distribution. Using Student's t-tests, we quantify the degree of agreement between the theoretical distribution and the score statistics collected. The t-tests may be used to measure the reliability of reported statistics. When combined with the reported P-value for a peptide hit using a score distribution model, this new measure prevents exaggerated statistics. Another feature of RAId_DbS is its capability of detecting multiple co-eluted peptides. The peptide identification performance and statistical accuracy of RAId_DbS are assessed and compared with several other search tools. The executables and data related to RAId_DbS are freely available upon request.

Available from: Aleksey Y Ogurtsov, Apr 25, 2014
  • Source
    • "Kim et al. (2009) address the issue of spectrum specificity by calculating a generating function and infer the probability of a correct spectrum identification based on all matching peptides. RAId_DbS (Alves et al., 2007) uses a score in the form of a weighted sum of logarithmic intensities and applies an extension of the Central Limit Theorem to assign statistical significance to the matches. However, the approach based on fitting specific parametric models cannot be generalized to other platforms. "
    ABSTRACT: Although many methods and statistical approaches have been developed for protein identification by mass spectrometry, the problem of accurate assessment of statistical significance of protein identifications remains an open question. The main issues are as follows: (i) statistical significance of inferring peptides from experimental mass spectra must be platform independent and spectrum specific and (ii) individual spectrum matches at the peptide level must be combined into a single statistical measure at the protein level. We present a method and software to assign statistical significance to protein identifications from search engines for mass spectrometric data. The approach is based on the asymptotic theory of order statistics. The parameters of the asymptotic distributions of identification scores are estimated for each spectrum individually. The method relies on new unbiased estimators for the parameters of the extreme value distribution. The estimated parameters are used to assign a spectrum-specific P-value to each peptide-spectrum match. The protein-level confidence measure combines P-values of peptide-to-spectrum matches. Conclusion: We extensively tested the method using triplicate mouse and yeast high-throughput proteomic experiments. The proposed statistical approach improves the sensitivity of protein identifications without compromising specificity. While the method was primarily designed to work with Mascot, it is platform-independent and is applicable to any search engine which outputs a single score for a peptide-spectrum match. We demonstrate this by testing the method in conjunction with X!Tandem. The software is available for download at [URL not captured]. Supplementary data are available at Bioinformatics online.
    Bioinformatics 02/2011; 27(8):1128-34. DOI:10.1093/bioinformatics/btr089 · 4.98 Impact Factor
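The idea of a spectrum-specific P-value in the abstract above can be illustrated with a minimal sketch. It assumes the candidate scores for one spectrum follow a Gumbel (extreme value) distribution and uses ordinary method-of-moments estimates, not the unbiased estimators the cited paper introduces. The function names `fit_gumbel_moments` and `gumbel_p_value` are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location (mu) and scale (beta) from one spectrum's scores.

    Method of moments (an assumption, not the cited paper's estimator):
    mean = mu + gamma * beta, variance = (pi * beta)^2 / 6.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """Upper-tail probability P(X >= score) = 1 - exp(-exp(-z))."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))


# Hypothetical scores of all candidate peptides for a single spectrum.
candidate_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, beta = fit_gumbel_moments(candidate_scores)
p_top_hit = gumbel_p_value(9.0, mu, beta)
```

Because the parameters are refitted for every spectrum, the same raw score can receive different P-values for different spectra, which is what makes the measure spectrum specific.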
  • Source
    • "When calculating the E-value of a peptide with correct N-terminal cleavage, RAId_DbS multiplies the peptide's P-value by Cc. However, the E-value of a peptide with incorrect N-terminal cleavage will be obtained by multiplying the peptide's P-value by Cc + Cinc [13]. In line with the Bonferroni correction that is rooted in the Bonferroni inequality [17], our approach avoids overstating the significance of a hit from a larger effective database (the pool of peptides regardless of whether the N-terminal cleavage is correct) versus a hit from a smaller effective database (the pool of peptides with correct N-terminal cleavage only). "
    ABSTRACT: Existing scientific literature is a rich source of biological information such as disease markers. Integration of this information with data analysis may help researchers to identify possible controversies and to form useful hypotheses for further validations. In the context of proteomics studies, the era of individualized proteomics may be approached through consideration of amino acid substitutions/modifications as well as information from disease studies. Integration of such information with peptide searches facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies. We have integrated, from various sources, annotated single amino acid polymorphisms, post-translational modifications, and their documented disease associations (if they exist) into one enhanced database per organism. We have also augmented our peptide identification software RAId_DbS to take this information into account while analyzing a tandem mass spectrum. In principle, one may choose to respect or ignore the correlation of amino acid polymorphisms/modifications within each protein. The former leads to targeted searches and avoids scoring of unnecessary polymorphism/modification combinations; the latter explores possible polymorphisms in a controlled fashion. To facilitate new discoveries, RAId_DbS also allows users to conduct searches permitting novel polymorphisms as well as to search a knowledge database created by the users. We have finished constructing enhanced databases for 17 organisms. The web link to RAId_DbS and the enhanced databases is [URL not captured]. The relevant databases and binaries of RAId_DbS for Linux, Windows, and Mac OS X are available for download from the same web page.
    BMC Genomics 11/2008; 9(1):505. DOI:10.1186/1471-2164-9-505 · 3.99 Impact Factor
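The E-value rule quoted in the excerpt above can be sketched in a few lines. This assumes Cc and Cinc denote the numbers of candidate peptides with correct and incorrect N-terminal cleavage, which the excerpt implies but does not define; the function and parameter names are hypothetical.

```python
def e_value(p_value, n_correct, n_incorrect, n_terminal_correct):
    """Bonferroni-style E-value: P-value times the effective database size.

    n_correct   -- assumed meaning of Cc (peptides with correct N-terminal cleavage)
    n_incorrect -- assumed meaning of Cinc (peptides with incorrect N-terminal cleavage)
    """
    if n_terminal_correct:
        effective_db_size = n_correct
    else:
        # A hit with incorrect cleavage competes against the larger pool.
        effective_db_size = n_correct + n_incorrect
    return p_value * effective_db_size


# Same P-value, different effective database sizes (illustrative numbers).
e_correct = e_value(1e-6, 1000, 4000, n_terminal_correct=True)
e_incorrect = e_value(1e-6, 1000, 4000, n_terminal_correct=False)
```

With these illustrative counts, the hit with incorrect cleavage gets an E-value five times larger for the same P-value, so it is not credited with the significance it would have in the smaller pool.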
  • Source
    • "OMSSA [35](v2.0), RAId_DbS [36]. OMSSA, RAId_DbS and X!Tandem were installed for evaluation on the Biowulf cluster, a Linux parallel processing system with ≈ 3700 processors, of the National Institutes of Health (NIH). "
    ABSTRACT: Background: Current experimental techniques, especially those applying liquid chromatography mass spectrometry, have made high-throughput proteomic studies possible. The increase in throughput, however, also raises concerns about the accuracy of identification or quantification. Most experimental procedures select, in a given MS scan, only a few of the most intense parent ions, each to be fragmented (MS2) separately; most other minor co-eluted peptides that have similar chromatographic retention times are ignored and their information is lost. Results: We have computationally investigated the possibility of enhancing the information retrieval during a given LC/MS experiment by selecting the two or three most intense parent ions for simultaneous fragmentation. A set of spectra is created by superimposing a number of MS2 spectra, each of which can be identified with high confidence by all search methods tested, to mimic the spectra of co-eluted peptides. The generated convoluted spectra were used to evaluate the capability of several database search methods (SEQUEST, Mascot, X!Tandem, OMSSA, and RAId_DbS) in identifying true peptides from superimposed spectra of co-eluted peptides. We show that, using these simulated spectra, all the database search methods eventually gain in the number of true peptides identified by using the compound spectra of co-eluted peptides. Open peer review: Reviewed by Vlad Petyuk (nominated by Arcady Mushegian), King Jordan and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.
    Biology Direct 08/2008; 3(1):27. DOI:10.1186/1745-6150-3-27 · 4.66 Impact Factor
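The superposition of MS2 spectra described in the abstract above might look roughly like the following sketch. The abstract does not give the actual procedure, so the peak representation as (m/z, intensity) pairs, the fixed bin width, and the function name `superimpose` are all assumptions made for illustration.

```python
from collections import defaultdict


def superimpose(spectra, bin_width=0.5):
    """Merge several peak lists into one compound spectrum.

    Each spectrum is a list of (mz, intensity) pairs. Peaks that fall into
    the same m/z bin have their intensities summed (an assumed merging rule).
    """
    binned = defaultdict(float)
    for spectrum in spectra:
        for mz, intensity in spectrum:
            binned[round(mz / bin_width)] += intensity
    # Report each bin at its nominal m/z, in ascending order.
    return [(index * bin_width, total) for index, total in sorted(binned.items())]


# Two hypothetical MS2 peak lists standing in for co-eluted peptides.
spectrum_a = [(100.0, 10.0), (200.0, 5.0)]
spectrum_b = [(100.1, 3.0), (300.0, 7.0)]
compound = superimpose([spectrum_a, spectrum_b])
```

A compound spectrum built this way keeps the fragment peaks of every contributing peptide, which is the property the study relies on when testing whether a search tool can recover more than one true peptide per spectrum.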