Performance Evaluation of Existing De Novo Sequencing Algorithms

Department of Mathematical Cybernetics, Lomonosov Moscow State University, Moskva, Moscow, Russia
Journal of Proteome Research (Impact Factor: 4.25). 11/2006; 5(11):3018-28. DOI: 10.1021/pr060222h
Source: PubMed


Two methods have been developed for protein identification from tandem mass spectra: database searching and de novo sequencing. De novo sequencing identifies peptides directly from tandem mass spectra. Among the many proposed algorithms, we evaluated the performance of five de novo sequencing algorithms: AUDENS, Lutefisk, NovoHMM, PepNovo, and PEAKS. Our evaluation methods are based on the calculation of relative sequence distance (RSD), algorithm sensitivity, and spectrum quality. We found that the de novo sequencing algorithms perform differently on QSTAR and LCQ mass spectrometer data but, in general, perform better on QSTAR data than on LCQ data. For the QSTAR data, the performance order of the five algorithms is PEAKS > Lutefisk, PepNovo > AUDENS, NovoHMM. The performance of PEAKS, Lutefisk, and PepNovo depends strongly on spectrum quality and improves as spectrum quality increases. However, AUDENS and NovoHMM are not sensitive to spectrum quality. Compared with the other four algorithms, PEAKS has the best sensitivity and also the best performance over the entire range of spectrum quality. For the LCQ data, the performance order is NovoHMM > PepNovo, PEAKS > Lutefisk > AUDENS. NovoHMM has the best sensitivity, and its performance is the best over the entire range of spectrum quality. However, the overall performance of NovoHMM is not significantly different from that of PEAKS and PepNovo. AUDENS does not perform well on either QSTAR or LCQ data.
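The abstract names relative sequence distance (RSD) as an evaluation measure but does not define it. Purely as an illustration, the sketch below assumes RSD is the Levenshtein edit distance between a predicted peptide and the reference peptide, divided by the reference length, with isoleucine and leucine treated as identical because they have the same mass. The function names and the normalization are assumptions made for this example and may differ from the definition used in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (rolling one-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def relative_sequence_distance(predicted: str, reference: str) -> float:
    """Assumed RSD: edit distance divided by the reference length.

    I and L are merged before comparison because they are isobaric and
    cannot be told apart by mass alone. This is an illustrative choice.
    """
    if not reference:
        raise ValueError("reference peptide must be non-empty")
    p = predicted.upper().replace("I", "L")
    r = reference.upper().replace("I", "L")
    return edit_distance(p, r) / len(r)
```

Under this assumed definition, a value of 0 means the prediction matches the reference (up to I/L ambiguity), and larger values mean more residues would have to change to recover the reference.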

    • "With the advancement of mass spectrometry technology and appearance of novel computational methods, de novo algorithm has been greatly improved. However, it is still not comparable with database-searching for common protein identification and needs manual checks by proteomics experts, which is time consuming and of low-throughput (Pevtsov et al., 2006; Kim et al., 2009a). In this study, we exploited the predominance of de novo peptide sequencing to identify SAV-peptides, and used mature database-searching strategy to monitor false discovery. "
    ABSTRACT: The detection of single amino-acid variants (SAVs) usually depends on single-nucleotide polymorphisms (SNPs) database. Here, we describe a novel method that discovers SAVs at proteome level independent of SNPs data. Using mass spectrometry-based de novo sequencing algorithm, peptide-candidates are identified and compared with theoretical protein database to generate SAVs under pairing strategy, which is followed by database re-searching to control false discovery rate. In human brain tissues, we can confidently identify known and novel protein variants with diverse origins. Combined with DNA/RNA sequencing, we verify SAVs derived from DNA mutations, RNA alternative splicing, and unknown post-transcriptional mechanisms. Furthermore, quantitative analysis in human brain tissues reveals several tissue-specific differential expressions of SAVs. This approach provides a novel access to high-throughput detection of protein variants, which may offer the potential for clinical biomarker discovery and mechanistic research.
    Journal of Molecular Cell Biology 07/2014; 6(5). DOI:10.1093/jmcb/mju031 · 6.77 Impact Factor
    • "A possible alternative is using de novo sequencing, in which amino acid sequences are deduced directly from fragmentation spectra, without the need for a protein DB, followed by BLAST search for identification of candidate homologous proteins [24], [25]. However, manual inspection of spectra is often required due to the error-prone nature of de novo sequencing, and very high quality data are necessary for achieving reliable results [26]. "
    ABSTRACT: Metaproteomics enables the investigation of the protein repertoire expressed by complex microbial communities. However, to unleash its full potential, refinements in bioinformatic approaches for data analysis are still needed. In this context, sequence databases selection represents a major challenge. This work assessed the impact of different databases in metaproteomic investigations by using a mock microbial mixture including nine diverse bacterial and eukaryotic species, which was subjected to shotgun metaproteomic analysis. Then, both the microbial mixture and the single microorganisms were subjected to next generation sequencing to obtain experimental metagenomic- and genomic-derived databases, which were used along with public databases (namely, NCBI, UniProtKB/SwissProt and UniProtKB/TrEMBL, parsed at different taxonomic levels) to analyze the metaproteomic dataset. First, a quantitative comparison in terms of number and overlap of peptide identifications was carried out among all databases. As a result, only 35% of peptides were common to all database classes; moreover, genus/species-specific databases provided up to 17% more identifications compared to databases with generic taxonomy, while the metagenomic database enabled a slight increment in respect to public databases. Then, database behavior in terms of false discovery rate and peptide degeneracy was critically evaluated. Public databases with generic taxonomy exhibited a markedly different trend compared to the counterparts. Finally, the reliability of taxonomic attribution according to the lowest common ancestor approach (using MEGAN and Unipept software) was assessed. The level of misassignments varied among the different databases, and specific thresholds based on the number of taxon-specific peptides were established to minimize false positives. 
    This study confirms that database selection has a significant impact in metaproteomics, and provides critical indications for improving depth and reliability of metaproteomic results. Specifically, the use of iterative searches and of suitable filters for taxonomic assignments is proposed with the aim of increasing coverage and trustworthiness of metaproteomic data.
    PLoS ONE 12/2013; 8(12):e82981. DOI:10.1371/journal.pone.0082981 · 3.23 Impact Factor
    • "Most approaches rely on identification of sequences that exist in databases (NCBI, SwissProt, etc.), while others focus on de novo sequencing. For this, several software packages have been designed, such as PEAKS (Ma et al., 2003), NovoHMM (Fischer et al., 2005), and Lutefisk (Taylor and Johnson, 1997) and implemented with success in real-life applications (Pevtsov et al., 2006; Pitzer et al., 2007). In this work, we used PepNovo (by the Pevzner group from University of California San Diego, CA, USA) (Frank and Pevzner, 2005), an open-source de novo sequencing algorithm and validated the results manually, as described in detail below. "
    ABSTRACT: We report the successful de novo sequencing of hemoglobin using a mass spectrometry-based approach combined with automatic data processing and manual validation for nine North American species with currently unsequenced genomes. The complete α and β chain of all nine mammalian hemoglobin samples used in this study were successfully sequenced. These sequences will be appended to the existing database containing all known hemoglobins to be used for identification of the mammalian host species that provided the last blood meal for the tick vector of Lyme disease, Ixodes scapularis.
    Biological Chemistry 02/2012; 393(3):195-201. DOI:10.1515/hsz-2011-0196 · 3.27 Impact Factor