PRIDE Cluster: Building a consensus of proteomics data

Article in Nature Methods 10(2):95-6 · February 2013
DOI: 10.1038/nmeth.2343 · Source: PubMed
PRIDE Cluster: building the consensus of proteomics data
Johannes Griss*, Joseph M. Foster, Henning Hermjakob, and Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge,
CB10 1SD, UK.
To the editor: The amount of mass spectrometry (MS) proteomics data in public
repositories is growing rapidly [1], but its (re-)use to increase the reliability of newly performed
experiments is still limited. Two of the major obstacles are the high heterogeneity of the data
present in repositories and the inflation of false positive identifications when combining
datasets. Here we present ‘PRIDE Cluster’: a novel method to detect reliable
identifications in heterogeneous MS proteomics experiments. It is used to highlight reliable
peptide identifications in the PRIDE database [2] ( and generate
constantly updated, reliable spectral libraries based on these identifications.
The current state of the art for estimating the false discovery rate of proteomics experiments is
the target-decoy strategy [3]. While this approach can estimate the overall error rate of an
experiment, it cannot estimate the reliability of individual peptide identifications when
combining independent experiments. Search engines’ identification scores are based on
different statistical models and are therefore not suited to assessing identification reliabilities
beyond a single experiment [4].
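As a rough illustration of the target-decoy idea, the following sketch estimates the error rate of one experiment from the number of decoy hits that pass a score threshold. It assumes the simple decoys/targets estimator (one common variant; the letter does not state which formula was used), and the function name, score scale and example values are hypothetical.

```python
from typing import List, Tuple

# Each peptide-spectrum match (PSM) is (score, is_decoy).
# Assumption: higher scores are better; the scores below are made up.
PSM = Tuple[float, bool]


def estimate_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    Uses decoys / targets, which relies on incorrect matches being evenly
    distributed between the target and the decoy database.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so there is nothing to be wrong about
    return decoys / targets


example_psms = [(50.0, False), (45.0, False), (40.0, True),
                (35.0, False), (30.0, False), (20.0, True)]
fdr_at_30 = estimate_fdr(example_psms, 30.0)  # 1 decoy / 4 targets
fdr_at_45 = estimate_fdr(example_psms, 45.0)  # 0 decoys / 2 targets
```

The estimate is a property of the whole thresholded list, which is why it says nothing about any single identification once lists from independent experiments are merged.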
‘PRIDE Cluster’ uses spectral clustering to detect reliable identifications in highly
heterogeneous data, taking advantage of the wealth of data present in PRIDE. It uses a
modified version of the MS-Cluster algorithm [5], which we refined to increase the clustering
quality (Supplementary Note 1). In our opinion, this was necessary to make the approach
applicable to highly heterogeneous data, where the original algorithm may lead to inaccurate
results (Supplementary Fig. 1). The new algorithm is freely available as a Java Application
Programming Interface at (Supplementary
Note 2).
We tested our clustering algorithm using three large, highly heterogeneous datasets, which
we searched against a target-decoy database (Supplementary Protocol and Supplementary
Note 3). The proportion of clusters that contained spectra identified as multiple different
peptides proved to be too dataset-dependent for a reliable assessment of clustering quality [5]
(Supplementary Fig. 2). Instead, we assessed it by looking at the precursor ion m/z range of
spectra that were clustered together and found that the algorithm was robust for every test
dataset (Supplementary Note 4). We observed that larger clusters contained more spectra
identified as the same peptide (Supplementary Fig. 3) and that classical search engines
identified their consensus spectra more reliably (Supplementary Protocol and Supplementary
Figs. 4 and 5). This suggested that the larger a (clustered) repository like PRIDE becomes,
the more distinct reliable and unreliable clusters get.
*Corresponding author. European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD,
Cambridge, UK. Tel: +44 (0) 1223 492686.
Author contributions: J. Griss designed and implemented the algorithm, ran the experiments, performed the analysis, and developed
the ‘PRIDE Cluster’ application. J. M. Foster contributed to the development of the algorithm and the data analysis. J. Griss and J. A.
Vizcaíno wrote the manuscript. H. Hermjakob and J. A. Vizcaíno supervised the project. All authors discussed, commented and
contributed to the final version of the manuscript.
Europe PMC Funders Group Author Manuscript
Nat Methods. Author manuscript; available in PMC 2013 May 30.
Published in final edited form as: Nat Methods. 2013 February; 10(2): 95–96. doi:10.1038/nmeth.2343.
The relative proportion of spectra within a cluster identified as the same peptide, called
‘ratio’ (Supplementary Fig. 6), was the most predictive attribute to distinguish between
target and decoy peptide identifications (Fig. 1a and Supplementary Note 5). Furthermore,
the numbers of target and decoy identifications with low ‘ratios’ were nearly identical, whereas
almost exclusively target identifications had high ‘ratios’ (Supplementary Fig. 7). This is in line
with the target-decoy strategy, which assumes that incorrect identifications are evenly
distributed among the target and the decoy database [3]. It therefore seemed reasonable to conclude that
identifications with low ‘ratios’ represented the random incorrect identifications of low
quality spectra.
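The ‘ratio’ defined above can be computed directly from the peptide assignments of the spectra in one cluster. The sketch below is only an illustration of that definition; the function name and the example sequences are hypothetical, and no reliability cut-off is implied because the thresholds are given only in the supplementary material.

```python
from collections import Counter
from typing import Dict, List


def peptide_ratios(cluster_peptides: List[str]) -> Dict[str, float]:
    """Return, for each peptide sequence, the fraction of the cluster's
    spectra that were identified as that sequence (its 'ratio')."""
    counts = Counter(cluster_peptides)
    total = len(cluster_peptides)
    # An empty cluster yields an empty dict, so no division by zero occurs.
    return {peptide: n / total for peptide, n in counts.items()}


# Hypothetical cluster: nine spectra agree on one sequence, one disagrees.
cluster = ["LVNELTEFAK"] * 9 + ["KAFETLENVL"]
ratios = peptide_ratios(cluster)
dominant_ratio = ratios["LVNELTEFAK"]   # 9 of 10 spectra
minority_ratio = ratios["KAFETLENVL"]   # 1 of 10 spectra
```

In this toy cluster the dominant sequence has a ‘ratio’ of 0.9 and the lone dissenting identification a ‘ratio’ of 0.1, mirroring the high-versus-low pattern described for target and decoy identifications.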
Using our clustering algorithm, we built a new resource called ‘PRIDE Cluster’ in which we
clustered all identified public spectra available in PRIDE (Supplementary Fig. 8). ‘PRIDE
Cluster’ contained 20,666,123 identified spectra from 9,040 experiments and 40 species
(June 2012, Supplementary Protocol). The clustering algorithm robustly organized the data
into 3,152,393 clusters, a more than fivefold reduction compared to the original number of
spectra (Fig. 1b), taking approximately 601 central processing unit days. Of the
clusters, 5.7 % contained spectra from multiple species, which were identified either as
contaminants or as conserved sequences (Supplementary Fig. 9).
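The figures quoted above can be checked with a few lines of arithmetic. The input numbers are taken from the text; the derived quantities (average CPU time per spectrum, approximate count of multi-species clusters) are back-of-the-envelope values computed here and are not reported in the letter.

```python
# Values quoted in the text (June 2012 build).
spectra = 20_666_123
clusters = 3_152_393
cpu_days = 601
multi_species_fraction = 0.057

# Data reduction achieved by clustering (spectra per cluster on average).
reduction = spectra / clusters

# Derived, not reported: average CPU seconds spent per clustered spectrum.
seconds_per_spectrum = cpu_days * 24 * 3600 / spectra

# Derived, not reported: approximate number of multi-species clusters.
multi_species_clusters = round(multi_species_fraction * clusters)

print(f"reduction: {reduction:.2f}-fold")
print(f"CPU time per spectrum: {seconds_per_spectrum:.2f} s")
print(f"multi-species clusters: ~{multi_species_clusters:,}")
```

The reduction comes out at roughly 6.6-fold, consistent with the stated "more than fivefold".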
‘PRIDE Cluster’ currently provides two main methods to access its data: 1) retrieve all
clusters that contain a given peptide identification and 2) retrieve all clusters with a
consensus spectrum similar to a query spectrum (Fig. 1c). Researchers can now easily check
whether spectra similar to their own experimental ones have already been identified in PRIDE. As a
key functionality, reliable peptide identifications based on the results from ‘PRIDE Cluster’
(Supplementary Note 5) are now highlighted in the classical PRIDE web interface (http:// Furthermore, ‘PRIDE Cluster’ provides a simple method to correct
inaccurate annotations in the PRIDE database (Supplementary Note 6). The consensus
spectra of all reliable clusters can be downloaded as spectral libraries (
pride/cluster/libraries, Supplementary Table 1), which performed comparably to the
corresponding ones from the National Institute of Standards and Technology
(Supplementary Note 7).
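The legend of Figure 1 names the normalized dot product as the similarity score used when searching for clusters by consensus spectrum. A minimal sketch of that score is given here under stated assumptions: peaks are binned into fixed 1 m/z-wide bins and raw intensities are used without any transformation. Both choices, and all names, are illustrative and not taken from the ‘PRIDE Cluster’ implementation.

```python
import math
from typing import Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # list of (m/z, intensity) peaks


def bin_spectrum(peaks: Spectrum, bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins (assumed width: 1.0)."""
    binned: Dict[int, float] = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def normalized_dot_product(a: Spectrum, b: Spectrum,
                           bin_width: float = 1.0) -> float:
    """Cosine of the angle between two binned spectra: 1.0 for identical
    peak patterns, 0.0 when no bins are shared."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical peak lists.
query = [(175.1, 30.0), (288.2, 100.0), (401.3, 55.0)]
scaled = [(mz, 2.0 * inten) for mz, inten in query]   # same pattern, brighter
unrelated = [(120.4, 80.0), (333.7, 40.0)]            # no shared bins

same_pattern_score = normalized_dot_product(query, scaled)
unrelated_score = normalized_dot_product(query, unrelated)
```

Because the score is normalized, a spectrum and a uniformly brighter copy of it score the same as two identical spectra, so the comparison depends on the peak pattern rather than on absolute signal.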
‘PRIDE Cluster’ is the first step to introduce stringent quality control in a highly
heterogeneous MS proteomics repository and represents a constantly updated, reliable
consensus of published proteomics data.
Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
Acknowledgments
We want to acknowledge all other members of the PRIDE team, who contributed to some aspects of this work in
terms of technical implementation and discussion of the results. Additional thanks go to Matrix Science for
providing a Mascot license. Special thanks go to all data submitters of the PRIDE database, whose data are the
foundation of the work presented here. J. Griss is supported by the Wellcome Trust [grant number WT085949MA].
J. A. Vizcaíno is supported by the EU FP7 grants LipidomicNet [grant number 202272] and ProteomeXchange
[grant number 260558]. J. M. Foster is supported by a Biotechnology and Biological Sciences Research Council
CASE studentship, also funded by Philips.
References
1. Csordas A, et al. PRIDE: quality control in a proteomics data repository. Database (Oxford). 2012; 2012:bas004.
2. Vizcaíno JA, et al. The Proteomics Identifications database: 2010 update. Nucleic Acids Res. 2010; 38:D736–742. [PubMed: 19906717]
3. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007; 4:207–214. [PubMed: 17327847]
4. Gupta N, Bandeira N, Keich U, Pevzner PA. Target-decoy approach and false discovery rate: when things may go wrong. J Am Soc Mass Spectrom. 2011; 22:1111–1120. [PubMed: 21953092]
5. Frank AM, et al. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat Methods. 2011; 8:587–591. [PubMed: 21572408]
Figure 1. A peptide identification’s ‘ratio’ was sufficient to identify reliable identifications from
the clustering results of the PRIDE repository
(a) Distribution of ‘ratios’ (relative number of spectra within a cluster that were identified
as this sequence) between target and decoy sequences. Target sequences predominantly had
‘ratios’ close to one. The considerably less common (incorrect) target sequences with low
‘ratios’ were considered as outliers in the boxplots. Decoy identifications were clearly
distinguished through lower ‘ratios’ (detailed distributions in Supplementary Fig. 7). (b) The
m/z value range of spectra within one cluster from the ‘PRIDE Cluster’ instance.
The algorithm performed as well as in the test datasets: 91 % of the clusters contained
spectra within 1 m/z unit or less. (c) ‘Result view’ when searching for clusters with a
consensus spectrum similar to the entered one. It presents all clusters that have consensus
spectra similar to the entered one, including the similarity score (normalized dot product).
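The quality measure behind panel (b), the spread of precursor m/z values inside each cluster, can be sketched as follows. The cluster contents are invented for illustration, and the 1 m/z cut-off simply mirrors the figure legend; the function names are hypothetical.

```python
from typing import Dict, List


def precursor_mz_range(precursor_mzs: List[float]) -> float:
    """Width of the precursor m/z window spanned by one cluster's spectra."""
    return max(precursor_mzs) - min(precursor_mzs)


def fraction_within(clusters: Dict[str, List[float]],
                    max_range: float = 1.0) -> float:
    """Fraction of clusters whose precursor m/z range is <= `max_range`."""
    if not clusters:
        return 0.0
    tight = sum(1 for mzs in clusters.values()
                if precursor_mz_range(mzs) <= max_range)
    return tight / len(clusters)


# Hypothetical clusters: cluster id -> precursor m/z of each member spectrum.
example_clusters = {
    "c1": [500.2, 500.4, 500.3],
    "c2": [612.8, 613.1],
    "c3": [700.0, 702.5],   # suspiciously wide: spectra 2.5 m/z apart
    "c4": [455.7],          # a single spectrum has a range of zero
}
tight_fraction = fraction_within(example_clusters)
```

A wide range flags clusters that mix spectra from different precursors, which makes this check usable without relying on the peptide identifications themselves.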
  • ... MS-GF+ or X! Tandem as the search engine. We note that it is possible to further improve the speed of HAPiID by incorporating spectral clustering using our new algorithm msCrush [55], just like MetaLab [33] which adopts PRIDE Cluster [56] for spectra clustering. ...
    Background: A few recent large efforts significantly expanded the collection of human-associated bacterial genomes, which now contains thousands of entities including reference complete/draft genomes and metagenome assembled genomes (MAGs). These genomes provide useful resource for studying the functionality of the human-associated microbiome and their relationship with human health and diseases. One application of these genomes is to provide a universal reference for database search in metaproteomic studies, when matched metagenomic/metatranscriptomic data are unavailable. However, a greater collection of reference genomes may not necessarily result in better peptide/protein identification because the increase of search space often leads to fewer spectrum-peptide matches, not to mention the drastic increase of computation time. Methods: Here, we present a new approach that uses two steps to optimize the use of the reference genomes and MAGs as the universal reference for human gut metaproteomic MS/MS data analysis. The first step is to use only the High Abundance Proteins (HAPs) (i.e., ribosomal proteins and elongation factors) for metaproteomic MS/MS database search and, based on the identification results, to derive the taxonomic composition of the underlying microbial community. The second step is to expand the search database by including all proteins from identified abundant species. We call our approach HAPiID (HAPs guided metaproteomics IDentification). Results: We tested our approach using human gut metaproteomic datasets from a previous study and compared it to the state-of-the-art reference database search method MetaPro-IQ for metaproteomic identification in studying human gut microbiota. Our results show that our two-steps method not only performed significantly faster but also was able to identify more peptides. 
We further demonstrated the application of HAPiID to revealing protein profiles of individual human-associated bacterial species, one or a few species at a time, using metaproteomic data. Conclusions: The HAP guided profiling approach presents a novel effective way for constructing target database for metaproteomic data analysis. The HAPiID pipeline built upon this approach provides a universal tool for analyzing human gut-associated metaproteomic data.
  • ... MSFragger, ANN-SoLo, TagGraph, among others (180-182) have solved the high computational cost of these methods. Additionally, in our opinion, a promising approach for targeting interesting spectra would be the use of clustering of MS/MS spectra (183-185). This approach can be used to select those spectra that remain unidentified and that are commonly found across MS runs in different samples. ...
    The science that investigates the ensembles of all peptides associated to human leukocyte antigen (HLA) molecules is termed "immunopeptidomics" and is typically driven by mass spectrometry (MS) technologies. Recent advances in MS technologies, neoantigen discovery and cancer immunotherapy have catalyzed the launch of the Human Immunopeptidome Project (HIPP) with the goal of providing a complete map of the human immunopeptidome and making the technology so robust that it will be available in every clinic. Here, we provide a long-term perspective of the field and we use this framework to explore how we think the completion of the HIPP will truly impact the society in the future. In this context, we introduce the concept of immunopeptidome-wide association studies (IWAS). We highlight the importance of large cohort studies for the future and how applying quantitative immunopeptidomics at population scale may provide a new look at individual predisposition to common immune diseases as well as responsiveness to vaccines and immunotherapies. Through this vision, we aim to provide a fresh view of the field to stimulate new discussions within the community, and present what we see as the key challenges for the future for unlocking the full potential of immunopeptidomics in this era of precision medicine.
  • ... It is possible to combine our approaches with those that are recently developed, including the utilization of spectral clustering to speed up the search as shown in [57]. We used the MS-GF+ search engine [48], which is one of the fastest MS/MS search engines. ...
    Matching metagenomic and/or metatranscriptomic data, currently often under-utilized, can be useful reference for metaproteomic tandem mass spectra (MS/MS) data analysis. Here we developed a software pipeline for identification of peptides and proteins from metaproteomic MS/MS data using proteins derived from matching metagenomic (and metatranscriptomic) data as the search database, based on two novel approaches Graph2Pro (published) and Var2Pep (new). Graph2Pro retains and utilizes uncertainties of metagenome assembly for reference-based MS/MS data analysis. Var2Pep considers the variations found in metagenomic/metatranscriptomic sequencing reads that are not retained in the assemblies (contigs). The new software pipeline provides one stop application of both tools, and it supports the use of metagenome assembly from commonly used assemblers including MegaHit and metaSPAdes. When tested on two collections of multi-omic microbiome datasets, our pipeline significantly improved the identification rate of the metaproteomic MS/MS spectra by about two folds, comparing to conventional contig- or read-based approaches (the Var2Pep alone identified 5.6% to 24.1% more unique peptides, depending on the dataset). We also showed that identified variant peptides are important for functional profiling of microbiomes. All results suggested that it is important to take into consideration of the assembly uncertainties and genomic variants to facilitate metaproteomic MS/MS data interpretation.
  • ... 6 Originally, we developed our spectra-cluster algorithm to process repository sized data sets, such as the data sets submitted to the PRIDE Archive. 3,7 There, it allowed us to identify correctly and incorrectly identified spectra, as well as millions of consistently unidentified spectra. In short, the spectra-cluster algorithm is a greedy clustering algorithm merging the first spectra that pass the set threshold. ...
    Label-free quantification has become a common-practice in many mass spectrometry-based proteomics experiments. In recent years, we and others have shown that spectral clustering can considerably improve the analysis of (primarily large-scale) proteomics datasets. We show that spectral clustering can also be used to infer additional peptide-spectrum matches and improve the quality of label-free quantitative proteomics data in datasets containing only tens of MS runs. We analysed four well-known public benchmark datasets that represent different experimental settings using spectral counting and peak intensity based label-free quantification. In both approaches, the additionally inferred peptide-spectrum matches through our spectra-cluster algorithm improved the detectability of low abundant proteins while increasing the accuracy of the derived quantitative data, without increasing the datasets’ noise. Additionally, we developed a Proteome Discoverer node for our spectra-cluster algorithm which allows anyone to re-build our proposed pipeline using the free version of Proteome Discoverer.
  • ... Matches-between-runs (MBR) has proven to be an effective technique to propagate MS2 information to the MS1 features [5,34,1] and clustering of MS2 spectra from large repositories has allowed us to zoom in on frequently unidentified spectra for peptide identification [11]. Unsupervised clustering of MS2 spectra also significantly reduces the number of MS2 spectra that need to be searched [8,10,29]. This allows more computationally expensive searches to be conducted such as partial digestions, variable modification searches, open modification searches or even de novo searches. ...
    In shotgun proteomics, the amount of information that can be extracted from label-free quantification experiments is typically limited by the identification rate and the noise level of the quantitative signals. This generally causes a low sensitivity in differential expression analysis on protein level. Here, we propose a quantification-first approach that reverses the classical identification-first workflow. Specifically, we introduce a method, Quandenser, that applies unsupervised clustering on both MS1 and MS2 level to summarize all analytes of interest without assigning identities. This prevents valuable information from being discarded prematurely in the identification process and allows us to spend more effort on the identification process due to the data reduction achieved by clustering. Applying this methodology to a dataset of partially known composition, we could now employ open modification and de novo searches to identify multiple analytes of interest that would have gone unnoticed in traditional pipelines. Furthermore, Quandenser reports error rates on feature level which we integrated into our probabilistic protein quantification method, Triqler, to propagate error probabilities from feature level all the way to protein level. Quandenser/Triqler outperformed the state-of-the-art method MaxQuant/Perseus, consistently reporting more differentially abundant proteins, even in a clinical dataset where none were discovered previously. Compellingly, in all three clinical datasets investigated, the differentially abundant proteins showed enrichment for functional annotation terms.
  • Article
    Characterization of complex biological systems based on high-throughput protein quantification through mass spectrometry commonly involves differential expression analysis between replicate samples originating from different experimental conditions. Here we present Proteomics INTegrator (PINT), a new user-friendly web-based platform-independent system to store, visualize, and query proteomic results. PINT provides an extremely flexible query interface that allows advanced Boolean algebra-based data filtering of many different proteomics features such as confidence values, abundance levels or ratios, dataset overlaps, sample characteristics, as well as UniprotKB annotations, which are transparently incorporated into the system. In addition, PINT allows developers to incorporate data visualization and analysis tools, such as PSEA-Quant and Reactome pathways analysis, for dataset enrichment analysis. PINT serves as a centralized hub for large scale proteomics data and as a platform for data analysis, facilitating interpretation of proteomics results and expediting biologically relevant conclusions.
  • Preprint
    Despite an explosion of data in public repositories, peptide mass spectra are usually analyzed by each laboratory in isolation, treating each experiment as if it has no relationship to any others. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Others have jointly analyzed many mass spectra, often using clustering. However, mass spectra are not necessarily best summarized as clusters, and although new spectra can be added to existing clusters, clustering methods previously applied to mass spectra do not allow new clusters to be defined without completely re-clustering. As an alternative, we propose to train a deep neural network, called "GLEAMS," to learn an embedding of spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another. We demonstrate empirically the utility of this learned embedding by propagating annotations from labeled to unlabeled spectra. We further use GLEAMS to detect groups of unidentified, proximal spectra representing the same peptide, and we show how to use these spectral communities to reveal misidentified spectra and to characterize frequently observed but consistently unidentified molecular species. We provide a software implementation of our approach, along with a tool to quickly embed additional spectra using a pre-trained model, to facilitate large-scale analyses.
  • Article
    Open modification searching (OMS) is a powerful search strategy that identifies peptides carrying any type of modification by allowing a modified spectrum to match against its unmodified variant by using a very wide precursor mass window. A drawback of this strategy, however, is that it leads to a large increase in search time. Although performing an open search can be done using existing spectral library search engines by simply setting a wide precursor mass window, none of these tools have been optimized for OMS, leading to excessive runtimes and suboptimal identification results. Here we present the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. This approach is combined with a cascade search strategy to maximize the number of identified unmodified and modified spectra while strictly controlling the false discovery rate, as well as a shifted dot product score to sensitively match modified spectra to their unmodified counterparts. ANN-SoLo achieves state-of-the-art performance in terms of speed and the number of identifications. On a previously published human cell line data set, ANN-SoLo confidently identifies more spectra than SpectraST or MSFragger and achieves a speedup of an order of magnitude compared to SpectraST. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at
  • Article
    In this Viewpoint article, we discuss current and future applications of spectral clustering in the context of mass spectrometry‐based proteomics approaches. First of all, we are introducing the main algorithms and tools that can currently be used to perform spectral clustering. In addition, we explain its main applications and their use in current computational proteomics workflows, including the generation of spectral libraries and spectral archives. Finally, we discuss possible future directions for spectral clustering, including its potential use to achieve a deeper coverage of the proteome and the discovery of novel post‐translational modifications and single amino acid variants.
  • Article
    The PRoteomics IDEntifications (PRIDE) database is a large public proteomics data repository, containing over 270 million mass spectra (by November 2011). PRIDE is an archival database, providing the proteomics data supporting specific scientific publications in a computationally accessible manner. While PRIDE faces rapid increases in data deposition size as well as number of depositions, the major challenge is to ensure a high quality of data depositions in the context of highly diverse proteomics work flows and data representations. Here, we describe the PRIDE curation pipeline and its practical application in quality control of complex data depositions. Database URL:
  • Article
    The target-decoy approach (TDA) has done the field of proteomics a great service by filling in the need to estimate the false discovery rates (FDR) of peptide identifications. While TDA is often viewed as a universal solution to the problem of FDR evaluation, we argue that the time has come to critically re-examine TDA and to acknowledge not only its merits but also its demerits. We demonstrate that some popular MS/MS search tools are not TDA-compliant and that it is easy to develop a non-TDA compliant tool that outperforms all TDA-compliant tools. Since the distinction between TDA-compliant and non-TDA compliant tools remains elusive, we are concerned about a possible proliferation of non-TDA-compliant tools in the future (developed with the best intentions). We are also concerned that estimation of the FDR by TDA awkwardly depends on a virtual coin toss and argue that it is important to take the coin toss factor out of our estimation of the FDR. Since computing FDR via TDA suffers from various restrictions, we argue that TDA is not needed when accurate p-values of individual Peptide-Spectrum Matches are available.
  • Article
    Tandem mass spectrometry (MS/MS) experiments yield multiple, nearly identical spectra of the same peptide in various laboratories, but proteomics researchers typically do not leverage the unidentified spectra produced in other labs to decode spectra they generate. We propose a spectral archives approach that clusters MS/MS datasets, representing similar spectra by a single consensus spectrum. Spectral archives extend spectral libraries by analyzing both identified and unidentified spectra in the same way and maintaining information about peptide spectra that are common across species and conditions. Thus archives offer both traditional library spectrum similarity-based search capabilities along with new ways to analyze the data. By developing a clustering tool, MS-Cluster, we generated a spectral archive from ∼1.18 billion spectra that greatly exceeds the size of existing spectral repositories. We advocate that publicly available data should be organized into spectral archives rather than be analyzed as disparate datasets, as is mostly the case today.
  • Article
    The Proteomics Identifications database (PRIDE, at the European Bioinformatics Institute has become one of the main repositories of mass spectrometry-derived proteomics data. For the last 2 years, PRIDE data holdings have grown substantially, comprising 60 different species, more than 2.5 million protein identifications, 11.5 million peptides and over 50 million spectra by September 2009. We here describe several new and improved features in PRIDE, including the revised submission process, which now includes direct submission of fragment ion annotations. Correspondingly, it is now possible to visualize spectrum fragmentation annotations on tandem mass spectra, a key feature for compliance with journal data submission requirements. We also describe recent developments in the PRIDE BioMart interface, which now allows integrative queries that can join PRIDE data to a growing number of biological resources such as Reactome, Ensembl, InterPro and UniProt. This ability to perform extremely powerful across-domain queries will certainly be a cornerstone of future bioinformatics analyses. Finally, we highlight the importance of data sharing in the proteomics field, and the corresponding integration of PRIDE with other databases in the ProteomExchange consortium.
  • Article
    Liquid chromatography and tandem mass spectrometry (LC-MS/MS) has become the preferred method for conducting large-scale surveys of proteomes. Automated interpretation of tandem mass spectrometry (MS/MS) spectra can be problematic, however, for a variety of reasons. As most sequence search engines return results even for 'unmatchable' spectra, proteome researchers must devise ways to distinguish correct from incorrect peptide identifications. The target-decoy search strategy represents a straightforward and effective way to manage this effort. Despite the apparent simplicity of this method, some controversy surrounds its successful application. Here we clarify our preferred methodology by addressing four issues based on observed decoy hit frequencies: (i) the major assumptions made with this database search strategy are reasonable; (ii) concatenated target-decoy database searches are preferable to separate target and decoy database searches; (iii) the theoretical error associated with target-decoy false positive (FP) rate measurements can be estimated; and (iv) alternate methods for constructing decoy databases are similarly effective once certain considerations are taken into account.