Yasset Perez-Riverol

EMBL-EBI | EBI · Proteomics Services

PhD in Biochemistry, Havana (Cuba); BSc in Software Engineering
Team Coordinator (EMBL-EBI) - omics projects, including BioContainers, PRIDE Archive, PRIDE Peptidome, quantms.

About

Publications: 207
Reads: 80,674
Citations: 20,310
Citations since 2017
130 Research Items
18,318 Citations
[Chart: citations per year, 2017-2023; vertical axis 0 to 3,000]
Introduction
I lead the development of the Omics Discovery Index, an open-source platform that facilitates access to and dissemination of omics datasets from multiple omics studies. I also lead BioContainers, a community for bioinformatics software containers. Since 2019, as Proteomics Team Coordinator, I have led the development of PRIDE Archive, PRIDE Cluster, and proteogenomics resources.
Additional affiliations
January 2019 - present
EMBL-EBI
Position
  • Team Coordinator PRIDE Resources
March 2017 - present
EMBL-EBI
Position
  • Project Manager
March 2016 - April 2017
EMBL-EBI
Position
  • Software Engineer
Education
January 2011 - January 2014
University of Havana
Field of study
  • Biochemistry

Publications (207)
Preprint
Full-text available
We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs, and other non-canonical transcripts, such as those produced by alterna...
Article
Testing for significant differences in quantities at the protein level is a common goal of many LFQ-based mass spectrometry proteomics experiments. Starting from a table of protein and/or peptide quantities from a given proteomics quantification software, many tools and R packages exist to perform the final tasks of imputation, summarization, norma...
Preprint
Full-text available
Mass spectrometry has become an indispensable tool in the life sciences. The new major version 3 of the computational framework OpenMS provides significant advancements regarding open, scalable, and reproducible high-throughput workflows for proteomics, metabolomics, and oligonucleotide mass spectrometry. OpenMS makes analyses from emerging fields...
Article
Full-text available
Metaproteomics research using mass spectrometry data has emerged as a powerful strategy to understand the mechanisms underlying microbiome dynamics and the interaction of microbiomes with their immediate environment. Recent advances in sample preparation, data acquisition, and bioinformatics workflows have greatly contributed to progress in this fi...
Article
Full-text available
Relative and absolute intensity-based protein quantification across cell lines, tissue atlases and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity and correlation with RNA expression. Most studies prov...
Preprint
Full-text available
Here we provide a curated, large scale, label free mass spectrometry-based proteomics data set derived from HeLa cell lines for general purpose machine learning and analysis. Data access and filtering is a tedious task, which takes up considerable amounts of time for researchers. Therefore we provide machine based metadata for easy selection and ov...
Preprint
Full-text available
Public proteomics data is rapidly increasing, creating a computational challenge for large-scale reanalysis. Here, we introduce quantms, an open-source cloud-based pipeline for massively parallel proteomics data analysis. We used quantms to reanalyze 56 of the largest datasets, comprising 26801 instrument files from 9502 human samples, to quantify...
Preprint
Full-text available
Sharing data and resources has revolutionized life sciences, particularly in proteomics, where public data has enabled researchers to reanalyze and reinterpret data in novel ways. However, the lack of comprehensive metadata remains a significant challenge to unlocking the full potential of publicly shared data. In response, the Sample and Data Rela...
Preprint
Full-text available
Relative and absolute intensity-based protein quantification across cell lines, tissue atlases, and tumour datasets is increasingly available in public datasets. These atlases enable researchers to explore fundamental biological questions, such as protein existence, expression location, quantity, and correlation with RNA expression. Most studies pr...
Article
Full-text available
Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Since predi...
Article
The Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) has been successfully developing guidelines, data formats, and controlled vocabularies (CVs) for the proteomics community and other fields supported by mass spectrometry since its inception 20 years ago. Here we describe the general operation of the PSI, including its lead...
Article
Full-text available
Introduction: The creation of ProteomeXchange data workflows in 2012 transformed the field of proteomics, consisting of the standardization of data submission and dissemination, and enabling the widespread reanalysis of public MS proteomics data worldwide. ProteomeXchange has triggered a growing trend toward public dissemination of proteomics data...
Article
Full-text available
Mass spectrometry (MS) is by far the most used experimental approach in high-throughput proteomics. The ProteomeXchange (PX) consortium of proteomics resources (http://www.proteomexchange.org) was originally set up to standardize data submission and dissemination of public MS proteomics data. It is now 10 years since the initial data workflow was i...
Preprint
Full-text available
The Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) has been successfully developing guidelines, data formats, and controlled vocabularies (CVs) for the proteomics community and other fields supported by mass spectrometry since its inception twenty years ago. Here we describe the general operation of the PSI, including its...
Preprint
Full-text available
Dataset acquisition and curation are often the hardest and most time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based LC-IM-MS datasets, due to the high-throughput data structure with high levels of noise and complexity between raw and machine learning-ready formats. While predictive proteomics is a field...
Preprint
Full-text available
Testing for significant differences in quantities on protein level is a common goal of many LFQ-based mass spectrometry proteomics experiments. Starting from a table of protein and/or peptide quantities from a fixed proteomics quantification software, there exists a multitude of tools and R packages to perform the final tasks of imputation, summari...
Article
Full-text available
Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all...
Article
Full-text available
Spectrum clustering is a powerful strategy to minimize redundant mass spectra by grouping them based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several al...
Article
Full-text available
In the last decade, a revolution in liquid chromatography-mass spectrometry (LC-MS) based proteomics has unfolded with the introduction of dozens of novel instruments that incorporate additional data dimensions through innovative acquisition methodologies, in turn inspiring specialized data analysis pipelines. Simultaneously, a growing number of pr...
Article
Full-text available
It is important for the proteomics community to have a standardized manner to represent all possible variations of a protein or peptide primary sequence, including natural, chemically induced, and artifactual modifications. The Human Proteome Organization Proteomics Standards Initiative in collaboration with several members of the Consortium for To...
Preprint
Full-text available
Spectrum clustering is a powerful strategy to minimize redundant mass spectral data by grouping highly similar mass spectra corresponding to repeatedly measured analytes. Based on spectrum similarity, near-identical spectra are grouped in clusters, after which each cluster can be represented by its so-called consensus spectrum for downstream proces...
Article
Full-text available
We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs, and other non-canonical transcripts, such as those produced by alterna...
Article
Full-text available
Plasma analysis by mass spectrometry-based proteomics remains a challenge due to its large dynamic range of 10 orders of magnitude. We created a methodology for protein identification known as Wise MS Transfer (WiMT). Melanoma plasma samples from biobank archives were directly analyzed using simple sample preparation. WiMT is based on MS1 features...
Presentation
Full-text available
Article
Full-text available
MaxDIA is a software platform for analyzing data-independent acquisition (DIA) proteomics data within the MaxQuant software environment. Using spectral libraries, MaxDIA achieves deep proteome coverage with substantially better coefficients of variation in protein quantification than other software. MaxDIA is equipped with accurate false discovery...
Article
Full-text available
The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics...
Preprint
Full-text available
In the last decade, a revolution in liquid chromatography-mass spectrometry (LC-MS) based proteomics has unfolded with the introduction of dozens of novel instruments that incorporate additional data dimensions through innovative acquisition methodologies, in turn inspiring specialized data analysis pipelines. Simultaneously, a growing number of pr...
Article
Full-text available
The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources an...
Preprint
Full-text available
Phosphoproteomics methods are commonly employed in labs to identify and quantify the sites of phosphorylation on proteins. In recent years, various software tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified, or to estimate the global false localisation rate (FLR) withi...
Presentation
Full-text available
The PRIDE ecosystem comprises multiple services, including PRIDE Archive, PRIDE Peptidome, and PRIDE Spectra Archive, along with a set of libraries and pipelines.
Preprint
Full-text available
There is the need to represent in a standard manner all the possible variations of a protein or peptide primary sequence, including both artefactual and post-translational modifications of peptides and proteins. With that overall aim, here, the Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) has developed a notation, called...
Article
Full-text available
Numerous genetic studies have established a role for rare genomic variants in Congenital Heart Disease (CHD) at the copy number variation (CNV) and de novo variant (DNV) level. To identify novel haploinsufficient CHD disease genes, we performed an integrative analysis of CNVs and DNVs identified in probands with CHD including cases with sporadic th...
Article
Full-text available
Mass spectra provide the ultimate evidence to support the findings of mass spectrometry proteomics studies in publications, and it is therefore crucial to be able to trace the conclusions back to the spectra. The Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to any mass spectrum contained in datas...
Preprint
Full-text available
The amount of public proteomics data is increasing at an extraordinary rate. Hundreds of datasets are submitted each month to ProteomeXchange repositories, representing many types of proteomics studies, focusing on different aspects such as quantitative experiments, post-translational modifications, protein-protein interactions, or subcellular loca...
Article
Here, we present the Universal Spectrum Explorer (USE), a web-based tool based on IPSA for cross-resource (peptide) spectrum visualization and comparison (https://www.proteomicsdb.org/use/). Mass spectra under investigation can be either provided manually by the user (table format) or automatically retrieved from online repositories supporting acce...
Article
Full-text available
Using 11 proteomics datasets, mostly available through the PRIDE database, we assembled a reference expression map for 191 cancer cell lines and 246 clinical tumour samples, across 13 lineages. We found unique peptides identified only in tumour samples despite a much higher coverage in cell lines. These were mainly mapped to proteins related to reg...
Article
The European Bioinformatics Community for Mass Spectrometry (EuBIC-MS; eubic-ms.org) was founded in 2014 to unite European computational mass spectrometry researchers and proteomics bioinformaticians working in academia and industry. EuBIC-MS maintains educational resources (proteomics-academy.org) and organises workshops at national and internatio...
Article
BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize software containers including the metadata, versions, licenses, and software dependencies. BioContainers supports multiple packaging and cont...
Article
Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms while comparing theoretical or experimental spectra. The performance of the spectral similarity calculation plays an important role in these tools and algorithms especially in the analysis of large-scale datasets. Recently, deep lea...
Preprint
Full-text available
Motivation: Omics Discovery Index (OmicsDI, www.omicsdi.org) is an integrated and open-source platform to facilitate the discovery and dissemination of omics datasets metadata. It provides a unique infrastructure to integrate datasets coming from multiple omics studies, including at present proteomics, genomics, transcriptomics, metabolomics, and sy...
Conference Paper
Full-text available
Dear colleagues worldwide, we are glad to invite you to MOL2NET-07, International Conference on Multidisciplinary Sciences, ISSN: 2624-5078, MDPI SciForum, Basel, Switzerland, 2021. MOL2NET is an international conference formed by an association of several inter-university transatlantic workshops or sessions. These workshops are chaired by one North...
Preprint
Full-text available
Mass spectra provide the ultimate evidence for supporting the findings of mass spectrometry (MS) proteomics studies in publications, and it is therefore crucial to be able to trace the conclusions back to the spectra. The Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to any mass spectrum contained...
Article
Full-text available
The 2020 European Bioinformatics Community for Mass Spectrometry (EuBIC-MS) Developers’ meeting was held from January 13th to January 17th 2020 in Nyborg, Denmark. Among the participants were scientists as well as developers working in the field of computational mass spectrometry (MS) and proteomics. The 4-day program was split between introductory...
Article
Full-text available
MassIVE.quant is a repository infrastructure and data resource for reproducible quantitative mass spectrometry–based proteomics, which is compatible with all mass spectrometry data acquisition types and computational analysis tools. A branch structure enables MassIVE.quant to systematically store raw experimental data, metadata of the experimental...
Preprint
Full-text available
Here we present the Universal Spectrum Explorer (USE), a web-based tool based on IPSA for cross-resource (peptide) spectrum visualization and comparison (https://www.proteomicsdb.org/use/). Mass spectra under investigation can either be provided manually by the user (table format), or automatically retrieved from online repositories supporting acce...
Article
The experimental design metadata is a cornerstone of biomedical research, especially for data scientists; and it is paramount in the context of data repositories. For every proteomics dataset we should capture at least three levels of metadata: (i) dataset description and experimental protocols, (ii) data files, and (iii) the sample to data files r...
Preprint
Full-text available
BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize the software containers including the metadata, versions, licenses, and/or software dependencies. BioContainers supports multiple packaging a...
Preprint
Full-text available
Cuban science and technology are known for important achievements, particularly in human healthcare and biotechnology. During the second half of the 20th century, the country developed a system of scientific institutions to address and solve major economic, cultural, social and health problems. However, the economic crisis faced by the island during th...
Preprint
Full-text available
Congenital Heart Disease (CHD) affects approximately 7-9 children per 1000 live births. Numerous genetic studies have established a role for rare genomic variants at the copy number variation (CNV) and single nucleotide variant level. In particular, the role of de novo mutations (DNM) has been highlighted in syndromic and non-syndromic CHD. To ident...
Article
Full-text available
The Omics Discovery Index is an open source platform that can be used to access, discover and disseminate omics datasets. OmicsDI integrates proteomics, genomics, metabolomics, models and transcriptomics datasets. Using an efficient indexing system, OmicsDI integrates different biological entities including genes, transcripts, proteins, metabolites...
Preprint
Full-text available
Single-cell RNA-Seq (scRNA-Seq) data analysis requires expertise in command-line tools, programming languages and scaling on compute infrastructure. As scRNA-Seq becomes widespread, computational pipelines need to be more accessible, simpler and scalable. We introduce an interactive analysis environment for scRNA-Seq, based on Galaxy, with ~70 func...
Chapter
Bioinformatics software development has become a cornerstone in modern biology research. Large-scale quantitative biology studies have created a demand for more complex workflows and data analysis pipelines. Challenges in reproducing bioinformatics analyses are compounded by the fact that the programs themselves are difficult to install on computer...
Presentation
Full-text available
Sample metadata and experimental design are the cornerstone of modern biomedical research. In proteomics, the sample to data files information is missing for all ProteomeXchange datasets. While originally the sample metadata was modelled in the mzML files, none of the instrument providers and open source tools include the sample information in the mzM...
Preprint
Full-text available
The Omics Discovery Index is an open source platform that can be used to access, discover and disseminate omics datasets. OmicsDI integrates proteomics, genomics, metabolomics, models and transcriptomics datasets. Using an efficient indexing system, OmicsDI integrates different biological entities including genes, transcripts, proteins, metabolites...
Poster
Full-text available
More and more proteomics datasets are becoming available in public repositories. The knowledge embedded in these datasets can be used to improve peptide identification workflows. Spectral library searching provides a straightforward method to boost identification rates using previously identified spectra. Alternatively, machine learning methods can...
Preprint
Full-text available
Motivation Spectrum clustering has been used to enhance proteomics data analysis: some originally unidentified spectra can potentially be identified and individual peptides can be evaluated to find potential mis-identifications by using clusters of identified spectra. The Phoenix Enhancer provides an infrastructure to analyze tandem mass spectra an...
Article
Full-text available
The field of computational proteomics is approaching the big data age, driven both by a continuous growth in the number of samples analysed per experiment, as well as by the growing amount of data obtained in each analytical run. In order to process these large amounts of data, it is increasingly necessary to use elastic compute resources such as L...
Article
Full-text available
The ProteomeXchange (PX) consortium of proteomics resources (http://www.proteomexchange.org) has standardized data submission and dissemination of mass spectrometry proteomics data worldwide since 2012. In this paper, we describe the main developments since the previous update manuscript was published in Nucleic Acids Research in 2017. Since then,...
Article
Full-text available
The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming increasingly complex and convoluted, involving multiple algorithms and tools. A wide variety of methods and software tools have been developed f...
Article
Full-text available
Shotgun proteomics based on peptide fractionation via liquid chromatography has become the common procedure for proteomic studies, although in the very beginning of the field, protein separation via electrophoresis was the main tool. Nonetheless, during the last two decades, the electrophoretic techniques for peptide mixtures fractionation have evo...
Article
Full-text available
In the advanced stages, malignant melanoma (MM) has a very poor prognosis. Due to tremendous efforts in cancer research over the last 10 years, and the introduction of novel therapies such as targeted therapies and immunomodulators, the rather dark horizon of the median survival has dramatically changed from under 1 year to several years. With the...
Article
Full-text available
Publishing databases in the Resource Description Framework (RDF) model is becoming widely accepted to maximize the syntactic and semantic interoperability of open data in life sciences. Here we report advancements made in the 6th and 7th annual BioHackathons which were held in Tokyo and Miyagi respectively. This review consists of two major section...
Conference Paper
Full-text available
USA-Europe Data Analysis Training School (USEDAT) is a Multi-center Trans-Atlantic initiative offering hands-on training focused on both Introduction to Experimental Data Recording (NMR, MS, IR, 2DGE, EEG, etc.) and/or posterior Computational Data Analysis (Machine Learning, Complex Networks, etc.). We emphasize applications for Cheminfor...
Article
Full-text available
The amount of omics data in the public domain is increasing every year. Modern science has become a data-intensive discipline. Innovative solutions for data management, data sharing, and for discoverin