The Standard Protein Mix Database: A Diverse Data Set To Assist in the Production of Improved Peptide and Protein Identification Software Tools

Institute for Systems Biology, Seattle, Washington 98103, USA.
Journal of Proteome Research (Impact Factor: 4.25). 02/2008; 7(1):96-103. DOI: 10.1021/pr070244j
Source: PubMed


Tandem mass spectrometry (MS/MS) is frequently used to identify peptides and proteins. Typical proteomic experiments rely on algorithms such as SEQUEST and MASCOT to compare thousands of tandem mass spectra against the theoretical fragment ion spectra of peptides in a database. The probabilities that these spectrum-to-sequence assignments are correct can be determined by statistical software such as PeptideProphet or estimated from searches against reverse or decoy databases. However, many of the software applications that assign probabilities to spectrum-to-sequence matches were developed using training data sets from 3D ion-trap mass spectrometers. Given the variety of mass spectrometers that have become commercially available over the last 5 years, we sought to generate a reference data set covering multiple instrumentation platforms to facilitate both the refinement of existing computational approaches and the development of novel software tools. We analyzed the proteolytic peptides in a mixture of tryptic digests of 18 proteins, named the "ISB standard protein mix", using 8 different mass spectrometers. These include linear and 3D ion traps, two quadrupole time-of-flight platforms (qq-TOF), and two MALDI-TOF-TOF platforms. The resulting data set, which has been named the Standard Protein Mix Database, consists of over 1.1 million spectra from more than 150 replicate runs on the mass spectrometers. The data were inspected for quality of separation and searched using SEQUEST. All data, including the native raw instrument and mzXML formats and the PeptideProphet validated peptide assignments, are available at
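
The decoy-database estimate mentioned in the abstract can be illustrated with a minimal sketch. It assumes a search against a concatenated target and decoy database of equal size, so that the number of decoy hits above a score threshold approximates the number of incorrect target hits. The function name `decoy_fdr`, the `(score, is_decoy)` tuple layout, and the scores are hypothetical and are not taken from the paper or from SEQUEST output.

```python
from typing import Iterable, Tuple


def decoy_fdr(psms: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate the false discovery rate among target matches at a score cutoff.

    Each item in `psms` is (score, is_decoy). Assumption: target and decoy
    databases are the same size, so FDR is approximated as decoys / targets.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score < threshold:
            continue  # match does not pass the cutoff
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        return 0.0  # no accepted target matches, nothing to estimate
    return min(1.0, decoys / targets)


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    example = [(4.1, False), (3.8, False), (3.5, True), (3.2, False),
               (2.9, False), (2.5, True), (2.0, False), (1.8, True)]
    for cutoff in (3.6, 3.0, 1.0):
        print(cutoff, round(decoy_fdr(example, cutoff), 3))
```

Raising the cutoff trades sensitivity for a lower estimated error rate. A reference mixture of known proteins allows such estimates to be checked against the true composition of the sample.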



Available from: Parag Mallick,
  • Source
    • "To validate our alignment, the peptide identifications of the MS2 spectra, determined by SEQUEST are used as ground truths. Details about sample preparation, mass spectrometry setting and searching parameters can be found in (Klimek et al., 2007). In our evaluation, we selected one analysis generated from high resolution instruments per "Mix" (LTQ-FT for Mix 1, QSTAR for Mix 2, QTOF for Mix 3 and Mix 4, LTQ-Orbitrap for Mix 7) as the test datasets."
    ABSTRACT: Liquid chromatography coupled to mass spectrometry (LC-MS) is the dominant technological platform for proteomics. An LC-MS analysis of a complex biological sample can be visualized as a "map" of which the positional coordinates are the mass-to-charge ratio (m/z) and chromatographic retention time (RT) of the chemical species profiled. Label-free quantitative-proteomics requires the alignment and comparison of multiple LC-MS maps to ascertain the reproducibility of experiments or reveal proteome changes under different conditions. The main challenge in this task lies in correcting inevitable RT shifts. Similar, but not identical, LC instruments and settings can cause peptides to elute at very different times, and sometimes in a different order, violating the assumptions of many state-of-the-art alignment tools. To meet this challenge, we developed LWBMatch, a new algorithm based on weighted bipartite matching. Unlike existing tools, which search for accurate warping functions to correct RT shifts, we directly seek a peak-to-peak mapping by maximizing a global similarity function between two LC-MS maps. For alignment tasks with large RT shifts (over 500 seconds), an approximate warping function is determined by locally weighted scatterplot smoothing of potential matched features, detected using a novel voting scheme based on co-elution. For validation, we defined the ground-truth for alignment success based on MS/MS identifications from sequence searching. We showed that our method outperforms several existing tools in terms of precision and recall, and is capable of aligning maps from different instruments and settings. Available at
    Bioinformatics 07/2013; 29(19). DOI:10.1093/bioinformatics/btt435 · 4.98 Impact Factor
  • Source
    • "Each vertex of the network contains a conditional probability table given the values of its parent vertices. The probability tables are trained by using the large-scale Seattle dataset [20]."
    ABSTRACT: Background: Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach first infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then puts these sequences into a database search to see if a close peptide match can be found. However, the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, with these methods the integrated de novo sequencing makes a limited contribution to the scoring model, which is still largely based on database searching. Results: We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
    BMC Bioinformatics 01/2013; 14(2). DOI:10.1186/1471-2105-14-S2-S24 · 2.58 Impact Factor
  • Source
    • "Database search for peptide identification was done with Sequest [13], and the statistical analysis of identification results was done with PeptideProphet [18]. Notice that some possible contaminants are considered in the datasets [20], and the summary of the two datasets is given in Table 1."
    ABSTRACT: Background: Protein inference is an important computational step in proteomics. There exists a natural nested relationship between protein inference and peptide identification, but these two steps are usually performed separately in existing methods. We believe that both peptide identification and protein inference can be improved by exploring this nested relationship. Results: In this study, a feedback framework is proposed to process peptide identification reports from search engines, and an iterative method is implemented to exemplify the processing of Sequest peptide identification reports according to the framework. The iterative method is verified on two datasets with known validity of proteins and peptides, and compared with ProteinProphet and PeptideProphet. The results show that the iterative method not only infers more true positive and fewer false positive proteins than ProteinProphet, but also identifies more true positive and fewer false positive peptides than PeptideProphet. Conclusions: The proposed iterative method implemented according to the feedback framework can unify and improve the results of peptide identification and protein inference.
    Proteome Science 11/2012; 10(1):68. DOI:10.1186/1477-5956-10-68 · 1.73 Impact Factor