Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification

Poster presented at ISMB 2021 in the COSI track CompMS.
Eric Bach¹, Simon Rogers², John Williamson², and Juho Rousu¹
¹Department of Computer Science, School of Science, Aalto University, Espoo, Finland; ²School of Computing Science, University of Glasgow, Glasgow, UK
1. Small Molecule Identification in Untargeted Metabolomics
Challenge in untargeted metabolomics studies: identification of the small
molecules present in a biological sample
LC-MS², a widely used analysis platform: liquid chromatography (LC) coupled with
tandem mass spectrometry (MS²) (Fig. 1)
In practice: many MS features will have missing MS² information
Using the LC retention time (RT) information:
LC RT can aid small molecule identification [5, 7]
Challenges in leveraging the RT information:
RTs are LC-system specific
RT databases are typically limited in size and molecule coverage
Mapping RTs between LC-systems requires overlapping RT databases
Fig. 1: LC-MS² analysis pipeline and resulting data used as input for our framework.
(Panels: biological sample containing multiple unknown small molecules; LC-MS spectrum
with MS features and MS², over m/z and retention time [min]; resulting set of
(MS², RT, candidate set)-tuples forming the input to our framework.)
2. Key Elements of our Framework to combine MS and RT Information
Probabilistic model with efficient inference to jointly use MS and RT information
MS-score agnostic: the user defines how the MS information is used.
No reference RTs of the target LC-system required
Output: ranking of molecular structures from user-defined candidate sets.
Using the observed retention orders:
Exploitation of all pairwise observed retention orders in an LC-MS² dataset
(Fig. 1, middle)
Comparison of observed and predicted retention orders to up- and down-vote
molecular candidates
3. LC-MS² Experiment Data: Input and Output of our Framework
Pre-processed LC-MS² data with N features (Fig. 1): D = {(x_i, t_i, C_i)}_{i=1}^N
x_i: MS information; MS², or MS¹ (precursor m/z) if no fragmentation is available
t_i: measured RT of feature i
C_i: molecular candidate set, e.g. extracted from PubChem using the exact mass
Precomputed MS scoring assumed:
MS²: CSI:FingerID [3], MetFrag [5] or IOKR [2] scores
MS¹: e.g. deviation between candidate and precursor mass, or isotope pattern score
Score for each molecular candidate m_ir ∈ C_i of MS feature i
Scores integrate MS and RT information and can be used for ranking
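The input tuples above can be sketched as a small data structure. This is a hypothetical record layout for illustration only; the field names and values are invented, not taken from the framework's implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of one input record per MS feature:
# its measured retention time t_i and a precomputed MS score
# for each molecular candidate in its candidate set C_i.
@dataclass
class MSFeature:
    rt: float                    # measured retention time t_i (minutes)
    ms_scores: dict[str, float]  # candidate identifier -> precomputed MS score
    # the candidate set C_i is the key set of ms_scores

# Toy dataset D with N = 2 features and invented candidates/scores.
dataset = [
    MSFeature(rt=2.4, ms_scores={"mol_a": 0.9, "mol_b": 0.4}),
    MSFeature(rt=5.1, ms_scores={"mol_c": 0.7, "mol_d": 0.6, "mol_e": 0.1}),
]
```

The MS scores themselves would come from an external scorer (e.g. CSI:FingerID, MetFrag or IOKR), which the framework treats as a black box.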
4. Our Probabilistic Framework to integrate MS and RT Information
Let G = (V, E) be a complete graph with a node i ∈ V for each MS feature and an
edge (i, j) ∈ E for each feature pair (Fig. 2)
A discrete random variable z_i ∈ Z_i = {1, ..., n_i} is associated with each node (n_i = |C_i|)
Candidate assignment for the complete data: z = {z_i | i ∈ V} ∈ Z_1 × ... × Z_N = Z
Intuitively: the random variable z_i denotes the candidate m_ir ∈ C_i assigned to feature i.
Pairwise Markov random field as probabilistic model [4]:

    p(z) ∝ ∏_{i ∈ V} ψ_i(z_i) · ∏_{(i,j) ∈ E} ψ_ij(z_i, z_j)

Potential functions: ψ_i(z_i) encodes the MS score and ψ_ij(z_i, z_j) the match of
observed and predicted retention order
Molecular candidates ranked based on max-marginals [4] (Fig. 2):

    p_max(z_i = r) = max_{z' ∈ Z : z'_i = r} p(z')    (1)

Intuitively: the maximum probability of a candidate assignment z with z_i = r.
Fig. 2: MRF probability distribution, inference of max-marginal probabilities for all
candidate molecules and features, and the resulting per-feature candidate ranking
(from high to low max-marginal probability); e.g. MS feature i = 3 and candidate r = 4 (m_34).
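For intuition, the max-marginals of a tiny pairwise MRF can be computed by brute-force enumeration. This toy sketch, with invented potentials, only illustrates the max-marginal computation described above; it is not the poster's efficient inference:

```python
import itertools

def max_marginals(node_pot, edge_pot):
    """Brute-force max-marginals: for each node i and state r, the maximum of
    prod_i psi_i(z_i) * prod_{(i,j)} psi_ij(z_i, z_j) over all assignments z
    with z_i = r. Only feasible for toy problems."""
    domains = [range(len(p)) for p in node_pot]
    mm = [[0.0] * len(p) for p in node_pot]
    for z in itertools.product(*domains):
        score = 1.0
        for i, zi in enumerate(z):
            score *= node_pot[i][zi]          # node potentials (MS scores)
        for (i, j), pot in edge_pot.items():
            score *= pot[z[i]][z[j]]          # edge potentials (order match)
        for i, zi in enumerate(z):
            mm[i][zi] = max(mm[i][zi], score)
    return mm

# Toy: two features with two candidates each (all numbers invented).
node_pot = [[0.9, 0.1], [0.2, 0.8]]
edge_pot = {(0, 1): [[0.9, 0.1], [0.1, 0.9]]}
mm = max_marginals(node_pot, edge_pot)
# Candidates are ranked per feature by their (unnormalised) max-marginal.
```

Note that although the second feature's MS score favours its second candidate (0.8 vs. 0.2), the edge potential's preference for a consistent assignment flips its max-marginal ranking towards the first candidate, mirroring the up- and down-voting of candidates described in Section 2.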
5. Exploiting Observed Retention Orders via the Edge Potentials ψ_ij
Edge potential ψ_ij : Z_i × Z_j → R_{>0}, with σ being the sigmoid function:

    ψ_ij(z_i = r, z_j = s) = σ( sign(t_i − t_j) · (⟨w, φ(m_ir)⟩ − ⟨w, φ(m_js)⟩) )

where sign(t_i − t_j) captures the observed retention order and
⟨w, φ(m_ir)⟩ − ⟨w, φ(m_js)⟩ the predicted retention order
Intuitively: matching observed and predicted retention orders receive high scores.
Retention order prediction using a Ranking Support Vector Machine (RankSVM) model w [1]
Candidate molecules m_ir are represented using non-linear features φ
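A minimal sketch of evaluating such an edge potential, assuming the RankSVM projections ⟨w, φ(m)⟩ are already available as plain numbers (the values below are invented):

```python
import math

def edge_potential(t_i, t_j, proj_i, proj_j):
    """Sketch of psi_ij: a sigmoid of the agreement between the observed
    retention order sign(t_i - t_j) and the predicted order score
    <w, phi(m_ir)> - <w, phi(m_js)> (passed in as proj_i, proj_j)."""
    sign = (t_i > t_j) - (t_i < t_j)       # sign(t_i - t_j)
    x = sign * (proj_i - proj_j)
    return 1.0 / (1.0 + math.exp(-x))      # sigmoid sigma(x)

# Observed and predicted orders agree -> potential above 0.5:
agree = edge_potential(t_i=2.0, t_j=5.0, proj_i=-1.3, proj_j=0.7)
# Orders disagree -> potential below 0.5 (candidates down-voted):
clash = edge_potential(t_i=2.0, t_j=5.0, proj_i=0.7, proj_j=-1.3)
```

Because σ(x) + σ(−x) = 1, swapping the predicted order exactly mirrors the potential around 0.5, so consistent candidate pairs are up-voted by the same margin that inconsistent ones are down-voted.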
6. Feasible Inference through Approximation using Tree Ensembles
Marginal inference (Eq. (1)) is intractable in practice due to the exponential size of Z
For tree-like G, exact inference is feasible [4]
We average the max-marginals over a set of trees T = {T_t}_{t=1}^L sampled from G:

    p_max(z_i = r | T) = (1/L) ∑_{t=1}^L p_max(z_i = r | T_t)
Fig. 3: An ensemble of random spanning trees sampled from G approximates the MRF;
the averaged max-marginals are used for candidate ranking.
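The tree-ensemble averaging can be sketched as follows. Both the spanning-tree sampler and the brute-force tree inference below are simplifications for toy problem sizes (real implementations would use max-product message passing on each tree); they are not the poster's actual implementation:

```python
import itertools
import random

def random_spanning_tree(n, rng):
    # Shuffle the nodes and attach each node to a random earlier node.
    # A simple sketch; not necessarily the exact sampler used in the poster.
    order = list(range(n))
    rng.shuffle(order)
    return [(order[rng.randrange(k)], order[k]) for k in range(1, n)]

def tree_max_marginals(node_pot, edges, edge_pot):
    # Exact max-marginals by enumeration (fine for toy candidate sets).
    domains = [range(len(p)) for p in node_pot]
    mm = [[0.0] * len(p) for p in node_pot]
    for z in itertools.product(*domains):
        s = 1.0
        for i, zi in enumerate(z):
            s *= node_pot[i][zi]
        for i, j in edges:                 # only the sampled tree's edges
            s *= edge_pot(i, j, z[i], z[j])
        for i, zi in enumerate(z):
            mm[i][zi] = max(mm[i][zi], s)
    return mm

def averaged_max_marginals(node_pot, edge_pot, n_trees=8, seed=0):
    # p_max(z_i = r | T) = (1/L) * sum_t p_max(z_i = r | T_t)
    rng = random.Random(seed)
    n = len(node_pot)
    avg = [[0.0] * len(p) for p in node_pot]
    for _ in range(n_trees):
        edges = random_spanning_tree(n, rng)
        mm = tree_max_marginals(node_pot, edges, edge_pot)
        for i, row in enumerate(mm):
            for r, v in enumerate(row):
                avg[i][r] += v / n_trees
    return avg

# Toy: 3 features, 2 candidates each, and a neutral edge potential.
node_pot = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]
avg = averaged_max_marginals(node_pot, lambda i, j, zi, zj: 1.0)
```

Each tree keeps all N nodes but only N − 1 of the complete graph's edges, which is what makes exact inference per tree feasible; averaging over several sampled trees recovers information from edges missing in any single tree.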
7. Experiments and Results
Evaluation datasets: CASMI 2016 [6] and the EA subset from MassBank used by [5]
681 (MS², RT)-tuples with 310 candidates each (median statistic)
Datasets cover two different LC columns and flow gradients
Sampling of LC-MS² data with 50 to 100 MS features from the tuples
RankSVM training data: 1248 RTs from PredRet [7] and the CASMI 2016 training set
No evaluation-set molecule appears in the RankSVM training set
Performance measure: top-k accuracy, the percentage of correct molecular
candidates at rank ≤ k
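Top-k accuracy can be computed directly from the rank of the correct candidate per MS feature; the ranks below are illustrative, not the poster's data:

```python
def top_k_accuracy(ranks, k):
    """Percentage of MS features whose correct candidate is ranked within
    the top k. `ranks` holds the 1-based rank of the correct candidate."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

# Toy example: correct candidate ranked 1st, 3rd, 12th and 2nd
# for four features.
ranks = [1, 3, 12, 2]
# top_k_accuracy(ranks, 1) -> 25.0, top_k_accuracy(ranks, 5) -> 75.0
```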
Experiment 1: Comparison to MetFrag + LogP (RT Proxy) Prediction
MetFrag relaunched [5]: prediction of LogP values for candidates, a linear model
mapping measured RTs to LogPs, candidate re-ranking based on the LogP deviation

Method                       Top-1  Top-5  Top-10  Top-20
MS² + RT (our)                21.3   52.9    64.0    74.3
MS² + RT (MetFrag & LogP)     20.5   49.1    61.2    72.6
Only MS² (baseline)           16.7   49.5    60.4    70.6
Experiment 2: Performance with different MS²-Scoring Methods
MetFrag (in-silico fragmenter scores) and IOKR [2] as MS²-scoring methods

MS²-Scorer  Method               Top-1  Top-5  Top-10  Top-20
MetFrag     MS² + RT (our)        21.3   52.9    64.0    74.3
MetFrag     Only MS² (baseline)   16.7   49.5    60.4    70.6
IOKR        MS² + RT (our)        26.7   52.1    62.5    70.3
IOKR        Only MS² (baseline)   25.1   49.5    60.3    67.6
Experiment 3: Missing MS² Spectra
Simulating missing MS² information: varying the fraction of missing MS² spectra
from 0% to 100%
If only MS¹ is available: use the mass deviation between precursor and candidate molecule
Result: integrating RT information improves accuracy by roughly +4 percentage points
[1] E. Bach, S. Szedmak, C. Brouard, S. Böcker, and J. Rousu. Liquid-chromatography retention order prediction for metabolite identification. Bioinformatics, 2018.
[2] C. Brouard, H. Shen, K. Dührkop, F. d'Alché-Buc, S. Böcker, and J. Rousu. Fast metabolite identification with Input Output Kernel Regression. Bioinformatics, 2016.
[3] K. Dührkop, M. Fleischauer, M. Ludwig, A. A. Aksenov, A. V. Melnik, M. Meusel, P. C. Dorrestein, J. Rousu, and S. Böcker. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods, 2019.
[4] D. J. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2005.
[5] C. Ruttkies, E. L. Schymanski, S. Wolf, J. Hollender, and S. Neumann. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. Journal of Cheminformatics, 2016.
[6] E. L. Schymanski, C. Ruttkies, M. Krauss, C. Brouard, T. Kind, K. Dührkop, F. Allen, A. Vaniya, D. Verdegem, S. Böcker, J. Rousu, H. Shen, H. Tsugawa, T. Sajed, O. Fiehn, B. Ghesquière, and S. Neumann. Critical assessment of small molecule identification 2016: automated methods. Journal of Cheminformatics, 2017.
[7] J. Stanstrup, S. Neumann, and U. Vrhovsek. PredRet: prediction of retention time by direct mapping between multiple chromatographic systems. Analytical Chemistry, 2015.
Acknowledgements: This work has been supported by the Academy of Finland grant 310107 (MACOME); and the Aalto Science-IT infrastructure. SR and JHW were supported by EPSRC (EP/R018634/1) and the Scottish Informatics and Computing Science Alliance (SICSA) distinguished visiting fellow scheme.