
Thesaurus: quantifying phosphopeptide positional isomers


Abstract and Figures

Proteins can be phosphorylated at neighboring sites resulting in different functional states, and studying the regulation of these sites has been challenging. Here we present Thesaurus, a search engine that detects and quantifies phosphopeptide positional isomers from parallel reaction monitoring and data-independent acquisition mass spectrometry experiments. We apply Thesaurus to analyze phosphorylation events in the PI3K/AKT signaling pathway and show neighboring sites with distinct regulation.
Thesaurus algorithmic workflow to search for positional isomers from DIA data. For each phosphopeptide in a spectrum library (for example MQSLSLNK from PDAP1), Thesaurus (a) determines the combinatorial list of potential positional isomers and (b) generates synthetic library spectra for positional isomers that are missing from the library. (c) Then for each isomer, Thesaurus calculates primary scores at each retention time point using all fragment ions in the library spectrum. (d) Starting with the highest scoring isomer/retention time pair, (e) Thesaurus calculates a pairwise p-value for detecting that isomer at that retention time versus every other potential isomer using only site-specific fragment ions. The localization p-value is the least significant of those pairwise comparisons. (f) Afterwards, Thesaurus checks the phosphopeptide detection to make sure a sufficient number of fragment ions follow the peak shape described by the site-specific ions. (g) Isomers are reserved for FDR analysis if they pass user-specified localization p-value and IonCount score thresholds. However, if no isomer passes these thresholds for a given phosphopeptide, the isomer with the lowest p-value is reserved. Thesaurus iterates to the next highest scoring isomer/retention time pair, and repeats steps e–g until all the potential positional isomers are considered. After every phosphopeptide is considered, the detected positional isomers are processed with Percolator, and localization p-values for passing isomers are also independently Benjamini-Hochberg FDR corrected.
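Steps (a) and (e) of the workflow caption can be illustrated with a small Python sketch. It is not the Thesaurus implementation: the function names are hypothetical, only S, T and Y are assumed to be acceptor residues, and only unmodified b- and y-type ions are considered. Ions are compared by the number of phosphates they carry, which is enough to tell which ions distinguish two isomers without computing masses.

```python
from itertools import combinations

PHOSPHO_ACCEPTORS = "STY"  # assumption: only S, T and Y can carry the phosphate


def positional_isomers(sequence, n_phospho):
    """Step (a): every placement of n_phospho phosphates on acceptor residues.

    Each isomer is a tuple of 0-based residue indices.
    """
    sites = [i for i, aa in enumerate(sequence) if aa in PHOSPHO_ACCEPTORS]
    return list(combinations(sites, n_phospho))


def site_specific_ions(sequence, isomer, other):
    """Names of b/y ions that carry a different number of phosphates
    in the two isomers, and can therefore tell them apart."""
    n = len(sequence)
    ions = []
    for k in range(1, n):
        # b_k covers residues 0 .. k-1
        if sum(i < k for i in isomer) != sum(i < k for i in other):
            ions.append(f"b{k}")
    for k in range(1, n):
        # y_k covers residues n-k .. n-1
        if sum(i >= n - k for i in isomer) != sum(i >= n - k for i in other):
            ions.append(f"y{k}")
    return ions


def localization_p_value(pairwise_p_values):
    """Step (e): the least significant (largest) pairwise p-value."""
    return max(pairwise_p_values)


isomers = positional_isomers("MQSLSLNK", 1)
print(isomers)  # [(2,), (4,)]
print(site_specific_ions("MQSLSLNK", isomers[0], isomers[1]))  # ['b3', 'b4', 'y4', 'y5']
```

For the singly phosphorylated example peptide MQSLSLNK, the two candidate isomers differ only in the four ions that span exactly one of the two serines, which is why site-specific evidence is sparse compared with the full fragment series.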
Brief Communication
https://doi.org/10.1038/s41592-019-0498-4
Department of Genome Sciences, University of Washington, Seattle, WA, USA. *e-mail: jvillen@uw.edu
Hundreds of thousands of amino acids in thousands of proteins are estimated to be actively phosphorylated in every human cell (ref. 1). Many proteins are phosphorylated at neighboring sites (ref. 2) and over half of sites in multi-phosphorylated proteins are within four amino acids of each other (ref. 3). Several well-studied proteins make use of neighboring phosphorylation sites to act as switches (MAPK (ref. 4) and CDC4 (ref. 5)), timers (PER (ref. 6)) or as negative inhibition toggles (IRS1 (ref. 7)) but global analysis of these phosphorylation clusters has remained impractical. Tandem mass spectrometry (MS/MS) of tryptic peptides is a key tool in discovering and quantifying sites of protein phosphorylation. Typical phosphoproteomic workflows use data-dependent acquisition (DDA) to collect MS/MS spectra based on precursor m/z as peptides chromatographically elute. Site localization software tools such as Ascore (ref. 8) assign the most likely phosphorylation position for each peptide using site-specific fragment ions. To increase the number of distinct peptides that are sampled, DDA dynamically excludes peptides of the same m/z from being sampled repeatedly within a narrow elution time. However, phosphopeptides that exist as multiple positional isomers are difficult to sample and assign using DDA because they have the same mass, similar retention times and share many fragment ions.
Parallel reaction monitoring (PRM) and data-independent acquisition (DIA) are alternative approaches that systematically collect MS/MS spectra across the chromatographic elution profile of peptides, improving quantitative reproducibility. While PRM methods target specific peptide precursors (ref. 9), DIA methods acquire MS/MS spectra systematically across the m/z space (ref. 10). These methods are free of both intensity biases during data collection and active exclusion of previously sequenced precursors, making it possible to detect closely eluting positional isomers. Despite the strengths of these methods, assigning phosphorylation to a specific amino acid remains difficult. Recently, Rosenberger et al. (ref. 11) reported on IPF, a peptide-centric tool that uses OpenSwath (ref. 12) to determine the most likely positional isomer from fragment ions in a peak. An alternative spectrum-centric approach, PIQED (ref. 13), deconvolves DIA data with DIA-Umpire (ref. 14) to enable site localization tools originally designed for DDA. Finally, Specter (ref. 15) deconvolves DIA signals using linear combinations of spectra in libraries, and in some instances can resolve positional isomers. A limitation of IPF and PIQED is that they compete potential positional isomers with similar retention times against each other and only the best scoring isomer is reported. On the other hand, Specter was not designed for phosphopeptide localization and lacks site localization statistics. Here, we extend these approaches and present a new DIA and PRM search engine named Thesaurus, which is designed to specifically look for positional isomers.
Thesaurus detects phosphopeptides with EncyclopeDIA and a spectrum library (ref. 16), and using the detections as retention time anchors, iteratively finds new positional isomers that share many of the same fragment ions but differ in their phosphorylation site-specific ions (Methods, see Supplementary Fig. 1). Thesaurus can detect multiple co-eluting positional isomers because it calculates localization probabilities directly using an interference distribution, rather than by competing isomers against each other. For each phosphopeptide, Thesaurus determines every possible positional isomer and extracts corresponding site-specific fragment ion signals. Each ion has a unique frequency of interference across the experiment, and this frequency is highest with low m/z ions (Supplementary Fig. 2). Thesaurus uses this frequency to calculate a background distribution for each run and precursor isolation window, since these distributions depend on peptide mass and various acquisition settings. Localization P values are calculated as the probability that all site-specific ions were observable by chance in this background distribution and false discovery rate (FDR) corrected using the Benjamini–Hochberg method. Thesaurus detects positional isomers absent from the spectrum library by generating synthetic spectra with shifted fragment ions. Thesaurus quantifies positional isomers even if their precursor signals are convolved, using site-specific ions to determine peak boundaries and including additional fragment ions that fit that shape.
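The Benjamini–Hochberg correction mentioned above is a standard procedure, and a minimal Python sketch of it may help make the FDR step concrete. This is the textbook step-up adjustment, not code taken from Thesaurus; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), and
    the adjusted values are forced to be monotone by taking a running
    minimum from the largest p-value downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical localization P values for four detected isomers
localization_p = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(localization_p)
passing = [i for i, q in enumerate(adjusted) if q <= 0.03]
print(passing)  # indices of isomers that pass a 3% threshold
```

With these example values the adjusted P values are approximately 0.02, 0.04, 0.04 and 0.02, so only the first and last isomers would pass a 3% threshold.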
We validated Thesaurus using a synthetic phosphopeptide DIA dataset described previously (ref. 11; Supplementary Fig. 3), and found that it produced both more detections and more accurate error estimates than IPF and PIQED. In addition to correctly localizing 240 synthetic phosphopeptides, Thesaurus was also able to identify and flag 11 products of a gas-phase phosphate rearrangement (Supplementary Fig. 4). We further demonstrated Thesaurus’ performance with phosphopeptides derived from serum-stimulated HeLa cells. Previously, we reported a human phosphopeptide library based on nearly a thousand DDA experiments (ref. 17). Here we used a subset of this library containing 82,029 phosphopeptides, where 44% of phosphopeptides are phosphorylated at multiple positions (Supplementary Fig. 5). Thesaurus was able to detect an average of 10,780 phosphopeptides across four technical replicates (Supplementary Dataset 1), corresponding to an average of 6,288 confidently localized positional isomers (Supplementary Fig. 6a). We found that within phosphopeptides containing multiple acceptor sites, approximately 13% were phosphorylated at multiple positions (Supplementary Fig. 6b).
Brian C. Searle, Robert T. Lawrence, Michael J. MacCoss and Judit Villén*
NATURE METHODS | VOL 16 | AUGUST 2019 | 703–706 | www.nature.com/naturemethods
... Early applications of DIA in phosphoproteomics focused on targeted analyses, such as insulin signaling networks [83], histone modifications [84,85], and plasma proteins [86]. Later, DIA-based targeted phosphopeptide quantification demonstrated high accuracy compared to the selective reaction monitoring method, as well as the capability to differentiate and quantify positionally isomeric phosphopeptides [87]. ...
Protein phosphorylation introduces post‐genomic diversity to proteins, which plays a crucial role in various cellular activities. Elucidation of system‐wide signaling cascades requires high‐performance tools for precise identification and quantification of dynamics of site‐specific phosphorylation events. Recent advances in phosphoproteomic technologies have enabled the comprehensive mapping of the dynamic phosphoproteomic landscape, which has opened new avenues for exploring cell type‐specific functional networks underlying cellular functions and clinical phenotypes. Here, we provide an overview of the basics and challenges of phosphoproteomics, as well as the technological evolution and current state‐of‐the‐art global and quantitative phosphoproteomics methodologies. With a specific focus on highly sensitive platforms, we summarize recent trends and innovations in miniaturized sample preparation strategies for micro‐to‐nanoscale and single‐cell profiling, data‐independent acquisition mass spectrometry (DIA‐MS) for enhanced coverage, and quantitative phosphoproteomic pipelines for deep mapping of cell and disease biology. Each aspect of phosphoproteomic analysis presents unique challenges and opportunities for improvement and innovation. We specifically highlight evolving phosphoproteomic technologies that enable deep profiling from low‐input samples. Finally, we discuss the persistent challenges in phosphoproteomic technologies, including the feasibility of nanoscale and single‐cell phosphoproteomics, as well as future outlooks for biomedical applications.
Liquid chromatography-based mass spectrometry (LC-MS) is widely used for proteoform identification, characterization, and quantitation. Bottom-up proteomics analyzes enzymatically digested peptides, while top-down proteomics examines intact proteoforms, enabling comprehensive identification of proteoforms with post-translational modifications (PTMs), genetic mutations, and alternative splicing. In MS data, due to the occurrence of different isotopes, proteins with the same chemical composition and charge state produce a group of peaks with different mass-to-charge ratios (m/z), called an isotopic envelope. A top-down mass spectrum often contains hundreds of high-charge state envelopes, some of which are overlapping. Consequently, analyzing top-down MS data presents computational challenges due to the complexity of top-down spectra. This dissertation introduces three new software tools EnvCNN, TopFD, and TopDIA for enhancing proteoform identification, characterization, and quantification in top-down MS data analysis. EnvCNN is a deep-learning model for evaluating isotopic envelopes of proteoforms and their fragments. This model aims to improve the accuracy of reporting fragments, thus increasing the number of identified proteoforms and improving the reliability of proteoform identification and characterization. TopFD is a software tool for proteoform feature detection, grouping all peaks of a proteoform in an LC-MS map into a single feature. TopFD outperforms other existing tools in the accuracy and reproducibility of feature detection, thereby improving proteoform identification and quantification. TopDIA is the first software tool for proteoform identification by top-down data-independent acquisition MS (TD-DIA-MS). 
Unlike conventional top-down data-dependent acquisition MS (TD-DDA-MS), which relies on intensity-based proteoform selection to generate fragment mass spectra, TD-DIA-MS fragments all proteoforms within predefined isolation windows, generating fragment mass spectra for every proteoform. TopDIA processes TD-DIA-MS data to generate demultiplexed pseudo spectra, which are searched against a protein database for proteoform identification, leading to a significant increase in the number of identified proteoforms compared with TD-DDA-MS. In summary, these new software tools help advance proteomics research by increasing the accuracy and comprehensiveness of proteoform analysis by top-down MS.
Top-down mass spectrometry is widely used for proteoform identification, characterization, and quantification owing to its ability to analyze intact proteoforms. In the past decade, top-down proteomics has been dominated by top-down data-dependent acquisition mass spectrometry (TD-DDA-MS), and top-down data-independent acquisition mass spectrometry (TD-DIA-MS) has not been well studied. While TD-DIA-MS produces complex multiplexed tandem mass spectrometry (MS/MS) spectra, which are challenging to confidently identify, it selects more precursor ions for MS/MS analysis and has the potential to increase proteoform identifications compared with TD-DDA-MS. Here we present TopDIA, the first software tool for proteoform identification by TD-DIA-MS. It generates demultiplexed pseudo MS/MS spectra from TD-DIA-MS data and then searches the pseudo MS/MS spectra against a protein sequence database for proteoform identification. We compared the performance of TD-DDA-MS and TD-DIA-MS using Escherichia coli K-12 MG1655 cells and demonstrated that TD-DIA-MS with TopDIA increased proteoform and protein identifications compared with TD-DDA-MS.
Aberrant signaling pathway activity is a hallmark of tumorigenesis and progression, which has guided targeted inhibitor design for over 30 years. Yet, adaptive resistance mechanisms, induced by rapid, context-specific signaling network rewiring, continue to challenge therapeutic efficacy. Leveraging progress in proteomic technologies and network-based methodologies, we introduce Virtual Enrichment-based Signaling Protein-activity Analysis (VESPA)—an algorithm designed to elucidate mechanisms of cell response and adaptation to drug perturbations—and use it to analyze 7-point phosphoproteomic time series from colorectal cancer cells treated with clinically-relevant inhibitors and control media. Interrogating tumor-specific enzyme/substrate interactions accurately infers kinase and phosphatase activity, based on their substrate phosphorylation state, effectively accounting for signal crosstalk and sparse phosphoproteome coverage. The analysis elucidates time-dependent signaling pathway response to each drug perturbation and, more importantly, cell adaptive response and rewiring, experimentally confirmed by CRISPR knock-out assays, suggesting broad applicability to cancer and other diseases.
Data-independent acquisition (DIA) mass spectrometry (MS) has emerged as a powerful technology for high-throughput, accurate, and reproducible quantitative proteomics. This review provides a comprehensive overview of recent advances in both the experimental and computational methods for DIA proteomics, from data acquisition schemes to analysis strategies and software tools. DIA acquisition schemes are categorized based on the design of precursor isolation windows, highlighting wide-window, overlapping-window, narrow-window, scanning quadrupole-based, and parallel accumulation-serial fragmentation–enhanced DIA methods. For DIA data analysis, major strategies are classified into spectrum reconstruction, sequence-based search, library-based search, de novo sequencing, and sequencing-independent approaches. A wide array of software tools implementing these strategies are reviewed, with details on their overall workflows and scoring approaches at different steps. The generation and optimization of spectral libraries, which are critical resources for DIA analysis, are also discussed. Publicly available benchmark datasets covering global proteomics and phosphoproteomics are summarized to facilitate performance evaluation of various software tools and analysis workflows. Continued advances and synergistic developments of versatile components in DIA workflows are expected to further enhance the power of DIA-based proteomics.
Due to their oftentimes ambiguous nature, phosphopeptide positional isomers can present challenges in bottom‐up mass spectrometry‐based workflows as search engine scores alone are often not enough to confidently distinguish them. Additional scoring algorithms can remedy this by providing confidence metrics in addition to these search results, reducing ambiguity. Here we describe challenges to interpreting phosphoproteomics data and review several different approaches to determine sites of phosphorylation for both data‐dependent and data‐independent acquisition‐based workflows. Finally, we discuss open questions regarding neutral losses, gas‐phase rearrangement, and false localization rate estimation experienced by both types of acquisition workflows and best practices for managing ambiguity in phosphosite determination.
This paper completes the overall design of a linguistic interactive terminology database based on the characteristics of second language acquisition and terminology and completes the construction of the terminology database by combining a goodness-of-fit detection algorithm based on terminology eigenvalue extraction. The efficiency of terminology information recognition is analyzed and compared with the terminology conversion rate of the eigenvalue goodness-of-fit algorithm using a long short-term memory (LSTM) neural network learning model to optimize the performance of the terminology database. The metric approach's classifier performance evaluation metrics are used to compare the accuracy and recall of the two algorithms accurately. The results show that the accuracy of the fitted superiority classifier with the application of word eigenvalue embedding compared to the LSTM classifier for the classification of electric power terms is improved by about 11% in all categories, and the average accuracy of the classifier exceeds 76.5%.
Data independent acquisition (DIA) mass spectrometry is a powerful technique that is improving the reproducibility and throughput of proteomics studies. Here, we introduce an experimental workflow that uses this technique to construct chromatogram libraries that capture fragment ion chromatographic peak shape and retention time for every detectable peptide in a proteomics experiment. These coordinates calibrate protein databases or spectrum libraries to a specific mass spectrometer and chromatography setup, facilitating DIA-only pipelines and the reuse of global resource libraries. We also present EncyclopeDIA, a software tool for generating and searching chromatogram libraries, and demonstrate the performance of our workflow by quantifying proteins in human and yeast cells. We find that by exploiting calibrated retention time and fragmentation specificity in chromatogram libraries, EncyclopeDIA can detect 20–25% more peptides from DIA experiments than with data dependent acquisition-based spectrum libraries alone.
Mass spectrometry with data-independent acquisition (DIA) is a promising method to improve the comprehensiveness and reproducibility of targeted and discovery proteomics, in theory by systematically measuring all peptide precursors in a biological sample. However, the analytical challenges involved in discriminating between peptides with similar sequences in convoluted spectra have limited its applicability in important cases, such as the detection of single-nucleotide polymorphisms (SNPs) and alternative site localizations in phosphoproteomics data. We report Specter (https://github.com/rpeckner-broad/Specter), an open-source software tool that uses linear algebra to deconvolute DIA mixture spectra directly through comparison to a spectral library, thus circumventing the problems associated with typical fragment-correlation-based approaches. We validate the sensitivity of Specter and its performance relative to that of other methods, and show that Specter is able to successfully analyze cases involving highly similar peptides that are typically challenging for DIA analysis methods.
Localization of phosphorylation sites in peptide sequences is a challenging problem in large-scale phospho-proteomics analysis. The intense neutral loss peaks and co-existence of multiple serine/threonine and/or tyrosine residues are limiting factors for objectively scoring site patterns across thousands of peptides. Various computational approaches for phosphorylation site localization have been proposed including Ascore, Mascot Delta score (MD-Score), and ProteinProspector, yet few address direct estimation of the false localization rate (FLR) in each experiment. Here we propose LuciPHOr, a modified target-decoy based approach that uses mass accuracy and peak intensities for site localization scoring and FLR estimation. Accurate estimation of FLR is a difficult task at the individual site level because the degree of uncertainty in localization varies significantly across different peptides. LuciPHOr carries out simultaneous localization on all candidate sites in each peptide and estimates the FLR based on the target-decoy framework, where decoy phospho-peptides generated by placing artificial phosphorylation(s) on non-candidate residues compete with the non-decoy phospho-peptides. LuciPHOr also reports approximate site-level confidence scores for all candidate sites as a means to localize additional sites from multi-phosphorylated peptides in which localization can be partially achieved. Unlike the existing tools, LuciPHOr is compatible with any search engine output processed through the Trans-Proteomic Pipeline. We evaluated the performance of LuciPHOr in terms of sensitivity and accuracy of FLR estimates using two synthetic phospho-peptide libraries and a phospho-proteomic dataset generated from complex mouse brain samples.
Selected reaction monitoring (SRM) on a triple quadrupole (QqQ) mass spectrometer is currently experiencing a renaissance within the proteomics community for its, as yet, unparalleled ability to characterize and quantify a set of proteins reproducibly, completely, and with high sensitivity. Given the immense benefit that high resolution and accurate mass (HR/AM) instruments have brought to the discovery proteomics field, we wondered if highly accurate mass measurement capabilities could be leveraged to provide benefits in the targeted proteomics domain as well. Here, we propose a new targeted proteomics paradigm centered on the use of next generation, quadrupole-equipped HR/AM instruments: parallel reaction monitoring (PRM). In PRM, the third quadrupole of a QqQ is substituted with a HR/AM mass analyzer to permit the parallel detection of all target product ions in one, concerted high resolution mass analysis. We detail the analytical performance of the PRM method, using a quadrupole-equipped bench-top Orbitrap MS, and draw a performance comparison to SRM in terms of run-to-run reproducibility, dynamic range, and measurement accuracy. In addition to requiring minimal upfront method development and facilitating automated data analysis, PRM yielded quantitative data over a wider dynamic range than SRM in the presence of a yeast background matrix due to PRM's high selectivity in the mass-to-charge domain. With achievable linearity over the quantifiable dynamic range found to be statistically equal between the two methods, our investigation suggests that PRM will be a promising new addition to the quantitative proteomics toolbox.
Consistent detection and quantification of protein post-translational modifications (PTMs) across sample cohorts is a prerequisite for functional analysis of biological processes. Data-independent acquisition (DIA) is a bottom-up mass spectrometry approach that provides complete information on precursor and fragment ions. However, owing to the convoluted structure of DIA data sets, confident, systematic identification and quantification of peptidoforms has remained challenging. Here, we present inference of peptidoforms (IPF), a fully automated algorithm that uses spectral libraries to query, validate and quantify peptidoforms in DIA data sets. The method was developed on data acquired by the DIA method SWATH-MS and benchmarked using a synthetic phosphopeptide reference data set and phosphopeptide-enriched samples. IPF reduced false site-localization by more than sevenfold compared with previous approaches, while recovering 85.4% of the true signals. Using IPF, we quantified peptidoforms in DIA data acquired from >200 samples of blood plasma of a human twin cohort and assessed the contribution of heritable, environmental and longitudinal effects on their PTMs.
Systematic approaches to studying cellular signaling require phosphoproteomic techniques that reproducibly measure the same phosphopeptides across multiple replicates, conditions, and time points. Here we present a method to mine information from large-scale, heterogeneous phosphoproteomics data sets to rapidly generate robust targeted mass spectrometry (MS) assays. We demonstrate the performance of our method by interrogating the IGF-1/AKT signaling pathway, showing that even rarely observed phosphorylation events can be consistently detected and precisely quantified.
As a result of recent improvements in mass spectrometry (MS), there is increased interest in data-independent acquisition (DIA) strategies in which all peptides are systematically fragmented using wide mass-isolation windows ('multiplex fragmentation'). DIA-Umpire (http://diaumpire.sourceforge.net/), a comprehensive computational workflow and open-source software for DIA data, detects precursor and fragment chromatographic features and assembles them into pseudo-tandem MS spectra. These spectra can be identified with conventional database-searching and protein-inference tools, allowing sensitive, untargeted analysis of DIA data without the need for a spectral library. Quantification is done with both precursor- and fragment-ion intensities. Furthermore, DIA-Umpire enables targeted extraction of quantitative information based on peptides initially identified in only a subset of the samples, resulting in more consistent quantification across multiple samples. We demonstrated the performance of the method with control samples of varying complexity and publicly available glycoproteomics and affinity purification-MS data.
Discovering novel post-translational modifications (PTMs) to proteins and detecting specific modification sites on proteins is one of the last frontiers of proteomics. At present, hunting for post-translational modifications remains challenging in widely practiced shotgun proteomics workflows, due to the typically low abundance of modified peptides and the greatly inflated search space as more potential mass shifts are considered by the search engines. Moreover, most popular search methods require that the user specifies the modification(s) to search for; therefore, unexpected and novel PTMs will not be detected. Here, a new algorithm is proposed to apply spectral library searching to the problem of open modification searches, namely, hunting for PTMs without prior knowledge of what PTMs are in the sample. The proposed tier-wise scoring method intelligently looks for unexpected PTMs by allowing mass-shifted peak matches, but only when the number of matches found is deemed statistically significant. This allows the search engine to search for unexpected modifications while maintaining its ability to identify unmodified peptides effectively at the same time. The utility of the method is demonstrated using three different datasets, in which the numbers of spectrum identifications to both unmodified and modified peptides were substantially increased relative to a regular spectral library search, as well as to another open modification spectral search method, pMatch.