Performance evaluation of existing de novo sequencing algorithms

Department of Mathematical Cybernetics, Lomonosov Moscow State University, Moskva, Moscow, Russia
Journal of Proteome Research (Impact Factor: 5). 11/2006; 5(11):3018-28. DOI: 10.1021/pr060222h
Source: PubMed

ABSTRACT Two methods have been developed for protein identification from tandem mass spectra: database searching and de novo sequencing. De novo sequencing identifies peptides directly from tandem mass spectra. Among the many proposed algorithms, we evaluated the performance of five de novo sequencing algorithms: AUDENS, Lutefisk, NovoHMM, PepNovo, and PEAKS. Our evaluation methods are based on the calculation of relative sequence distance (RSD), algorithm sensitivity, and spectrum quality. We found that de novo sequencing algorithms perform differently on QSTAR and LCQ mass spectrometer data but, in general, perform better on QSTAR data than on LCQ data. For the QSTAR data, the performance order of the five algorithms is PEAKS > Lutefisk, PepNovo > AUDENS, NovoHMM. The performance of PEAKS, Lutefisk, and PepNovo depends strongly on spectrum quality and improves as spectrum quality increases. AUDENS and NovoHMM, however, are not sensitive to spectrum quality. Compared with the other four algorithms, PEAKS has the best sensitivity and also the best performance over the entire range of spectrum quality. For the LCQ data, the performance order is NovoHMM > PepNovo, PEAKS > Lutefisk > AUDENS. NovoHMM has the best sensitivity, and its performance is the best over the entire range of spectrum quality. However, the overall performance of NovoHMM is not significantly different from that of PEAKS and PepNovo. AUDENS does not perform well on either QSTAR or LCQ data.
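
The abstract names relative sequence distance (RSD) as an evaluation metric but does not define it. As a minimal sketch only, the following assumes RSD is the edit (Levenshtein) distance between a de novo prediction and the reference peptide, divided by the reference length. That definition and the function names `edit_distance` and `relative_sequence_distance` are illustrative assumptions, not the paper's stated method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, residue_a in enumerate(a, start=1):
        curr = [i]
        for j, residue_b in enumerate(b, start=1):
            substitution_cost = 0 if residue_a == residue_b else 1
            curr.append(min(
                prev[j] + 1,                      # deletion from a
                curr[j - 1] + 1,                  # insertion into a
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def relative_sequence_distance(predicted: str, reference: str) -> float:
    """ASSUMED definition: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(predicted, reference) / len(reference)


if __name__ == "__main__":
    # Hypothetical peptides, chosen only to exercise the function.
    print(relative_sequence_distance("PEPTLDE", "PEPTIDE"))
```

Under this assumed definition, a value of 0 means the prediction matches the reference exactly, and larger values mean more residues differ.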

Available from: Irina Fedulova, Jul 01, 2015
  • Source
    ABSTRACT: We report the successful de novo sequencing of hemoglobin using a mass spectrometry-based approach combined with automatic data processing and manual validation for nine North American species with currently unsequenced genomes. The complete α and β chains of all nine mammalian hemoglobin samples used in this study were successfully sequenced. These sequences will be appended to the existing database of all known hemoglobins, which will be used to identify the mammalian host species that provided the last blood meal for the tick vector of Lyme disease, Ixodes scapularis.
    Biological Chemistry 02/2012; 393(3):195-201. DOI:10.1515/hsz-2011-0196 · 2.69 Impact Factor
  • Source
    ABSTRACT: Proteomics is an essential source of information about biological systems because it generates knowledge about the concentrations, interactions, functions, and catalytic activities of proteins, which are the major structural and functional determinants of cells. In the last few years, significant technology development has taken place at the level of both data analysis software and mass spectrometry hardware. Conceptual progress in proteomics has made possible the analysis of entire proteomes at unprecedented density and accuracy. New concepts have emerged that comprise quantitative analyses of full proteomes, database-independent protein identification strategies, targeted quantitative proteomics approaches with proteotypic peptides, and the systematic analysis of an increasing number of posttranslational modifications at high temporal and spatial resolution. Although plant proteomics is making progress, there are still several analytical challenges that await experimental and conceptual solutions. In this review, I highlight the current status of plant proteomics and put it into the context of the aforementioned conceptual progress in the field, illustrate some of the plant-specific challenges, and present my view on the great opportunities for plant systems biology offered by proteomics.
    Mass Spectrometry Reviews 01/2009; 28(1):93-120. DOI:10.1002/mas.20183 · 8.05 Impact Factor
  • Source
    ABSTRACT: We present and evaluate a strategy for the mass spectrometric identification of proteins from organisms for which no genome sequence information is available; the strategy incorporates cross-species information from sequenced organisms. The presented method combines spectrum quality scoring, de novo sequencing, and error-tolerant BLAST searches and is designed to decrease input data complexity. Spectral quality scoring reduces the number of investigated mass spectra without a loss of information. Stringent quality-based selection and the combination of different de novo sequencing methods substantially increase the catalog of significant peptide alignments. The de novo sequences passing a reliability filter are subsequently submitted to error-tolerant BLAST searches, and MS-BLAST hits are validated by a sampling technique. With the described workflow, we identified up to 20% more groups of homologous proteins in proteome analyses of organisms whose genome is not sequenced than by state-of-the-art database searches in an Arabidopsis thaliana database. We consider the novel data analysis workflow an excellent screening method to identify those proteins that evade detection in proteomics experiments as a result of database constraints.
    PROTEOMICS 12/2007; 7(23):4245-54. DOI:10.1002/pmic.200700474 · 3.97 Impact Factor