PRIDE Cluster: building the consensus of proteomics data
Johannes Griss*, Joseph M. Foster, Henning Hermjakob, and Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge,
CB10 1SD, UK.
To the editor: The amount of mass spectrometry (MS) proteomics data in public
repositories is growing rapidly1, but its (re-)use to increase the reliability of newly performed
experiments is still limited. Two major obstacles are the high heterogeneity of the data
present in repositories and the inflation of false-positive identifications when datasets are
combined. Here we present ‘PRIDE Cluster’, a novel method for recognizing reliable
peptide identifications in heterogeneous MS proteomics experiments. It is used to highlight reliable
peptide identifications in the PRIDE database2 (http://www.ebi.ac.uk/pride) and to generate
constantly updated, reliable spectral libraries based on these identifications.
The current state of the art for estimating the false discovery rate of proteomics experiments is
the target-decoy strategy3. While this approach can estimate the overall error rate of an
experiment, it cannot estimate the reliability of individual peptide identifications when
combining independent experiments. Search engines’ identification scores are based on
different statistical models and are therefore not suited to assess identification reliabilities
beyond a single experiment4.
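To make the target-decoy idea concrete, the following Python sketch estimates the false discovery rate at a score threshold as the ratio of decoy to target matches passing that threshold. This is one common form of the estimator and is assumed here for illustration only; the function name `estimate_fdr` and the (score, is_decoy) input layout are hypothetical and are not taken from PRIDE or from any search engine.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among matches scoring at or above `threshold`.

    psms: iterable of (score, is_decoy) pairs (hypothetical layout).
    Assumes higher scores are better and that the decoy database is the
    same size as the target database.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score < threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        # No target hits pass: report 1.0 if only decoys pass, else 0.0.
        return 1.0 if decoys else 0.0
    # Decoy hits approximate the number of false target hits.
    return min(1.0, decoys / targets)


example_psms = [(10, False), (9, False), (8, True), (7, False), (6, True)]
fdr_at_7 = estimate_fdr(example_psms, 7)
fdr_at_9 = estimate_fdr(example_psms, 9)
```

The estimate describes the whole set of matches above the threshold. It does not indicate whether any single match is correct, which is the limitation described above.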
‘PRIDE Cluster’ uses spectral clustering to recognize reliable identifications in highly
heterogeneous data, taking advantage of the wealth of data present in PRIDE. It uses a
modified version of the MS-Cluster algorithm5, which we refined to increase the clustering
quality (Supplementary Note 1). In our opinion, this was necessary to make the approach
applicable to highly heterogeneous data, where the original algorithm may lead to inaccurate
results (Supplementary Fig. 1). The new algorithm is freely available as a Java Application
Programming Interface at http://pride-spectra-clustering.googlecode.com (Supplementary
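The general idea of spectral clustering can be sketched as follows. This is not the PRIDE Cluster or MS-Cluster implementation; it is a simplified greedy scheme that assumes binned peak lists, a normalized dot product as the similarity measure, and illustrative values for the bin width, similarity threshold and precursor tolerance. All function names and parameters are hypothetical.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def normalized_dot_product(a, b):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(spectra, threshold=0.7, precursor_tolerance=2.0):
    """Assign each spectrum to the most similar existing cluster, or start a new one.

    spectra: list of (precursor_mz, peaks), where peaks is a list of
    (mz, intensity) pairs. Returns a list of clusters as lists of indices.
    """
    clusters = []
    for index, (precursor_mz, peaks) in enumerate(spectra):
        binned = bin_spectrum(peaks)
        best_cluster, best_similarity = None, threshold
        for cluster in clusters:
            # Only compare spectra with a similar precursor m/z.
            if abs(cluster["precursor"] - precursor_mz) > precursor_tolerance:
                continue
            similarity = normalized_dot_product(cluster["consensus"], binned)
            if similarity >= best_similarity:
                best_cluster, best_similarity = cluster, similarity
        if best_cluster is None:
            clusters.append(
                {"precursor": precursor_mz, "consensus": dict(binned), "members": [index]}
            )
        else:
            # Update the consensus spectrum by summing binned intensities.
            for key, value in binned.items():
                best_cluster["consensus"][key] = best_cluster["consensus"].get(key, 0.0) + value
            best_cluster["members"].append(index)
    return [cluster["members"] for cluster in clusters]


example_spectra = [
    (500.0, [(100.0, 10.0), (200.0, 5.0)]),
    (500.5, [(100.2, 9.0), (200.1, 6.0)]),
    (800.0, [(150.0, 3.0)]),
    (500.2, [(300.0, 7.0)]),
]
example_clusters = greedy_cluster(example_spectra)
```

In this toy input the first two spectra share their main peaks and a similar precursor m/z, so they fall into one cluster, while the other two each start a cluster of their own.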
We tested our clustering algorithm using three large, highly heterogeneous datasets, which
we searched against a target-decoy database (Supplementary Protocol and Supplementary
Note 3). The proportion of clusters that contained spectra identified as multiple different
peptides proved to be too dataset-dependent for a reliable assessment of clustering quality5
(Supplementary Fig. 2). Instead, we assessed it by looking at the precursor ion
spectra that were clustered together and found that the algorithm was robust for every test
dataset (Supplementary Note 4). We observed that larger clusters contained more spectra
identified as the same peptide (Supplementary Fig. 3) and that classical search engines
identified their consensus spectra more reliably (Supplementary Protocol and Supplementary
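The two cluster-level measures discussed above can be sketched in Python, assuming each cluster is represented as a list of peptide identifications with None standing for an unidentified spectrum. The function names are hypothetical, and the exact definitions used in the Supplementary material may differ.

```python
from collections import Counter


def max_peptide_ratio(peptides):
    """Fraction of identified spectra in a cluster that share the most common peptide."""
    identified = [peptide for peptide in peptides if peptide is not None]
    if not identified:
        return 0.0
    most_common_count = Counter(identified).most_common(1)[0][1]
    return most_common_count / len(identified)


def mixed_cluster_fraction(clusters):
    """Proportion of clusters whose spectra were identified as more than one peptide."""
    if not clusters:
        return 0.0
    mixed = sum(
        1
        for cluster in clusters
        if len({peptide for peptide in cluster if peptide is not None}) > 1
    )
    return mixed / len(clusters)


example_clusters = [
    ["PEPTIDEK", "PEPTIDEK", "PEPTLDEK", None],
    ["AAAK", "AAAK"],
    [None],
]
ratio_first = max_peptide_ratio(example_clusters[0])
mixed_fraction = mixed_cluster_fraction(example_clusters)
```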
*Corresponding author. firstname.lastname@example.org. European Bioinformatics Institute, Wellcome Trust Genome Campus,
Cambridge CB10 1SD, UK. Tel: +44 (0) 1223 492686.
Author contributions: J. Griss designed and implemented the algorithm, ran the experiments, performed the analysis, and developed
the ‘PRIDE Cluster’ application. J. M. Foster contributed to the development of the algorithm and the data analysis. J. Griss and J. A.
Vizcaíno wrote the manuscript. H. Hermjakob and J. A. Vizcaíno supervised the project. All authors discussed, commented on and
contributed to the final version of the manuscript.
Europe PMC Funders Group
Nat Methods. Author manuscript; available in PMC 2013 May 30.
Published in final edited form as:
Nat Methods. 2013 February; 10(2): 95–96. doi:10.1038/nmeth.2343.