All content in this area was uploaded by Eric Bach on Oct 20, 2021
Probabilistic framework for integration of mass spectrum and retention time
information in small molecule identification
Eric Bach 1,B, Simon Rogers 2, John Williamson 2, and Juho Rousu 1
1 Department of Computer Science, School of Science, Aalto University, Espoo, Finland; 2 School of Computing Science, University of Glasgow, Glasgow, UK
1. Small Molecule Identification in Untargeted Metabolomics
•Challenge in untargeted metabolomics studies: Identification of the small
molecules present in a biological sample
•LC-MS2 is a widely used analysis platform: liquid chromatography (LC) coupled with
tandem mass spectrometry (MS2) (Fig. 1)
•In practice: Many MS features will have missing MS2 information
Using the LC Retention Time (RT) Information:
•LC RT can aid small molecule identification [5, 7]
•Challenges in leveraging the RT information:
◦RTs are LC-system specific
◦RT databases typically limited in size and molecule coverage
◦Mapping of RTs between LC-systems requires RT database overlap
[Figure 1 panels: biological sample (containing multiple unknown small molecules); LC-MS2; LC-MS spectrum with five MS features, one of them with missing MS2 (axes: retention time [min], mass per charge [m/z]); MS2 spectra (axes: [m/z], intensity); set of (MS2, RT, candidate set)-tuples and observed retention orders as input to our framework]
Fig. 1: LC-MS2 analysis pipeline and resulting data used as input for our framework.
2. Key Elements of our Framework to combine MS and RT Information
•Probabilistic model with efficient inference to jointly use MS and RT
information
•MS score agnostic: The user defines how the MS information is used.
•No reference RTs of the target LC-system required
•Output: Ranking of molecular structures from user-defined candidate sets.
Using the observed retention orders:
•Exploitation of all pairwise observed retention orders in an LC-MS2 dataset
(Fig. 1, middle)
•Comparison of observed and predicted retention orders to up- and down-vote
molecular candidates
3. LC-MS2 Experiment Data: Input and Output of our Framework
Input:
•Pre-processed LC-MS2 data with $N$ features (Fig. 1): $\mathcal{D} = \{(x_i, t_i, \mathcal{C}_i)\}_{i=1}^{N}$
◦$x_i$: MS information; MS2, or MS1 (precursor m/z) if no fragmentation is available
◦$t_i$: Measured RT of feature $i$
◦$\mathcal{C}_i$: Molecular candidate set, e.g. extracted from PubChem using exact mass
search
•Precomputed MS scoring assumed:
◦MS2: CSI:FingerID [3], MetFrag [5] or IOKR [2] scores
◦MS1: e.g., deviation between candidate and precursor mass, or isotope pattern score
Output:
•Score for each molecular candidate $m_{ir} \in \mathcal{C}_i$ of each MS feature $i$
•Scores integrate MS and RT information and can be used for ranking
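The input described above could be held in memory as follows. This is a minimal sketch, not the authors' data format: the class `MSFeature`, its field names, and the helper `observed_retention_orders` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MSFeature:
    """One (x_i, t_i, C_i) tuple; all names here are illustrative assumptions."""
    rt: float                # t_i: measured retention time (minutes)
    candidates: List[str]    # C_i: candidate identifiers, e.g. from PubChem
    ms_scores: List[float]   # precomputed MS score, one per candidate
    has_ms2: bool = True     # False if only MS1 (precursor m/z) is available

    def __post_init__(self) -> None:
        if len(self.candidates) != len(self.ms_scores):
            raise ValueError("one MS score per candidate is required")


def observed_retention_orders(features: List[MSFeature]) -> List[Tuple[int, int, int]]:
    """Return (i, j, sign(t_i - t_j)) for every feature pair i < j."""
    orders = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            diff = features[i].rt - features[j].rt
            orders.append((i, j, (diff > 0) - (diff < 0)))
    return orders


# Toy dataset with made-up values.
dataset = [
    MSFeature(rt=1.0, candidates=["A", "B"], ms_scores=[0.7, 0.3]),
    MSFeature(rt=3.5, candidates=["C"], ms_scores=[1.0], has_ms2=False),
    MSFeature(rt=3.5, candidates=["D", "E"], ms_scores=[0.4, 0.6]),
]
orders = observed_retention_orders(dataset)
```

With $N$ features this yields $N(N-1)/2$ pairwise observations, matching the complete graph used by the model.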
4. Our Probabilistic Framework to integrate MS and RT Information
•Let $G = (V, E)$ be a complete graph with a node $i \in V$ for each MS feature and an
edge $(i, j) \in E$ for each feature pair (Fig. 2)
•Discrete random variable $z_i \in \mathcal{Z}_i = \{1, \ldots, n_i\}$ associated with each node ($n_i = |\mathcal{C}_i|$)
•Candidate assignment for the complete data: $\mathbf{z} = \{z_i \mid i \in V\} \in \mathcal{Z}_1 \times \ldots \times \mathcal{Z}_N = \mathcal{Z}$
•Intuitively: Random variable $z_i$ denotes the candidate $m_{ir} \in \mathcal{C}_i$ assigned to feature $i$.
•Pairwise Markov Random Field as probabilistic model [4]:
$$p(\mathbf{z}) = \frac{1}{Z} \prod_{i \in V} \psi_i(z_i) \prod_{(i,j) \in E} \psi_{ij}(z_i, z_j)$$
•Potential functions: $\psi_i(z_i)$ is the MS score, and $\psi_{ij}(z_i, z_j)$ measures the match of observed and predicted
retention order
•Molecular candidates ranked based on max-marginals [4] (Fig. 2):
$$p_{\max}(z_i = r) = \max_{\{\mathbf{z}' \in \mathcal{Z} \,\mid\, z'_i = r\}} p(\mathbf{z}') \qquad (1)$$
•Intuitively: Maximum probability of a candidate assignment $\mathbf{z}$ with $z_i = r$.
[Figure 2 panels: Markov Random Field; inference of max-marginal probabilities (for all candidate molecules and features); ranked candidates of MS features (from high to low marginal probability)]
Fig. 2: MRF probability distribution and candidate ranking, e.g. MS feature $i = 3$ and candidate 4 ($m_{34}$).
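To make Eq. (1) concrete, the max-marginals of a very small MRF can be computed by enumerating every assignment. The sketch below is only a didactic illustration under the assumption of strictly positive potentials; the function name `brute_force_max_marginals` and the toy numbers are hypothetical, and the enumeration is exponential in the number of features.

```python
import itertools

import numpy as np


def brute_force_max_marginals(node_pot, edge_pot):
    """
    node_pot: list of 1-D arrays, node_pot[i][r] = psi_i(z_i = r)
    edge_pot: dict {(i, j): 2-D array}, edge_pot[(i, j)][r, s] = psi_ij(r, s)
    Returns a list of arrays with p_max(z_i = r) for every node i.
    """
    sizes = [len(p) for p in node_pot]
    max_marg = [np.zeros(n) for n in sizes]
    partition = 0.0  # normalization constant Z
    for z in itertools.product(*(range(n) for n in sizes)):
        score = 1.0
        for i, r in enumerate(z):
            score *= node_pot[i][r]
        for (i, j), psi in edge_pot.items():
            score *= psi[z[i], z[j]]
        partition += score
        # Each assignment is a candidate maximizer for every z_i = r it contains.
        for i, r in enumerate(z):
            max_marg[i][r] = max(max_marg[i][r], score)
    return [m / partition for m in max_marg]


# Toy example: two features with two candidates each, uninformative edge.
node_pot = [np.array([1.0, 2.0]), np.array([1.0, 1.0])]
edge_pot = {(0, 1): np.ones((2, 2))}
p_max = brute_force_max_marginals(node_pot, edge_pot)
```

In this toy case the unnormalized scores are 1, 1, 2, 2 (so $Z = 6$), giving max-marginals $[1/6, 1/3]$ for the first feature and $[1/3, 1/3]$ for the second.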
5. Exploiting Observed Retention Orders via the Edge Potentials ψij
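•Edge potential $\psi_{ij} : \mathcal{Z}_i \times \mathcal{Z}_j \to \mathbb{R}_{>0}$, with $\sigma$ being the sigmoid function:
$$\psi_{ij}(z_i = r, z_j = s) = \sigma\Big( \underbrace{\operatorname{sign}(t_i - t_j)}_{\text{observed retention order}} \cdot \underbrace{\big( \langle w, \phi(m_{ir}) \rangle - \langle w, \phi(m_{js}) \rangle \big)}_{\text{predicted retention order}} \Big)$$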
•Intuitively: Matching observed and predicted retention orders receive high scores.
•Retention order prediction using a Ranking Support Vector Machine (RankSVM) with weight vector $w$ [1]
•Candidate molecules $m_{ir}$ are represented using non-linear features $\phi$
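The edge potential can be evaluated for all candidate pairs of two features at once. The following is a minimal sketch that assumes the RankSVM preference values $\langle w, \phi(m_{ir}) \rangle$ have already been computed for every candidate; the function names and the numbers are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def edge_potential(t_i, t_j, pref_i, pref_j):
    """
    t_i, t_j: measured retention times of features i and j
    pref_i:   array with <w, phi(m_ir)> for each candidate r of feature i
    pref_j:   array with <w, phi(m_js)> for each candidate s of feature j
    Returns the matrix psi_ij with shape (n_i, n_j).
    """
    pref_i = np.asarray(pref_i, dtype=float)
    pref_j = np.asarray(pref_j, dtype=float)
    predicted = pref_i[:, None] - pref_j[None, :]  # predicted retention order
    observed = np.sign(t_i - t_j)                  # observed retention order
    return sigmoid(observed * predicted)


# Feature i elutes after feature j (made-up values).
psi = edge_potential(t_i=5.0, t_j=2.0, pref_i=[2.0, 0.0], pref_j=[1.0])
```

Here the candidate pair whose predicted order agrees with the observed one (`psi[0, 0]`) receives a value above 0.5, and the disagreeing pair (`psi[1, 0]`) a value below 0.5. Swapping the roles of the two features yields the transposed matrix.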
6. Feasible Inference through Approximation using Tree Ensembles
•Marginal inference (Eq. (1)) intractable in practice due to exponential size of $\mathcal{Z}$
•For tree-like $G$, exact inference is feasible [4]
•We average the max-marginals for a set of trees $\mathcal{T} = \{T_t\}_{t=1}^{L}$ sampled from $G$:
$$\bar{p}_{\max}(z_i = r \mid \mathcal{T}) = \frac{1}{L} \sum_{t=1}^{L} p_{\max}(z_i = r \mid T_t)$$
[Figure 3 panels: random spanning tree ensemble used to approximate the MRF; averaged max-marginals used for candidate ranking]
Fig. 3: Random spanning tree sample and averaged max-marginals.
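The tree-ensemble approximation can be sketched with a two-pass max-product algorithm per tree. The code below is an illustration under several assumptions that are not taken from the poster: trees are sampled by attaching each node to a random earlier node (not uniform over all spanning trees), potentials are strictly positive, and each node's max-marginals are normalized to sum to one before averaging. All function names are hypothetical.

```python
import numpy as np


def random_spanning_tree(n_nodes, rng):
    """Edges (parent, child) of a random tree; parents always appear before children."""
    order = rng.permutation(n_nodes)
    return [(int(order[rng.integers(0, k)]), int(order[k])) for k in range(1, n_nodes)]


def tree_max_marginals(node_pot, edge_pot_fn, edges):
    """
    node_pot:    list of 1-D arrays with psi_i
    edge_pot_fn: edge_pot_fn(p, c) returns psi_pc with shape (n_p, n_c)
    edges:       (parent, child) pairs ordered so that parents come first
    """
    belief_up = [np.array(p, dtype=float) for p in node_pot]
    up = {}  # up[c]: message from child c to its parent
    # Upward pass: reversed order guarantees children are finished first.
    for p, c in reversed(edges):
        psi = edge_pot_fn(p, c)
        up[c] = (psi * belief_up[c][None, :]).max(axis=1)
        belief_up[p] = belief_up[p] * up[c]
    # Downward pass: send each child the parent's belief without the child's own message.
    full = [b.copy() for b in belief_up]
    for p, c in edges:
        psi = edge_pot_fn(p, c)
        without_c = full[p] / up[c]
        down = (psi * without_c[:, None]).max(axis=0)
        full[c] = belief_up[c] * down
    return [b / b.sum() for b in full]  # assumed per-node normalization


def averaged_max_marginals(node_pot, edge_pot_fn, n_trees=16, seed=0):
    rng = np.random.default_rng(seed)
    avg = [np.zeros(len(p)) for p in node_pot]
    for _ in range(n_trees):
        edges = random_spanning_tree(len(node_pot), rng)
        for a, m in zip(avg, tree_max_marginals(node_pot, edge_pot_fn, edges)):
            a += m
    return [a / n_trees for a in avg]
```

Each tree costs time linear in the number of features and quadratic in the candidate set sizes, in contrast to the exponential cost of enumerating all assignments. For long chains of small potentials, a log-domain variant would be numerically safer.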
7. Experiments and Results
•Evaluation datasets: CASMI 2016 [6], EA subset from MassBank used by [5]
◦681 (MS2, RT)-tuples, each with a median of 310 candidates
◦Datasets cover two different LC columns and flow gradients
◦Sampling of LC-MS2 data with 50 to 100 MS features from the tuples
•RankSVM training data: 1248 RTs from PredRet [7] and CASMI 2016 training
◦No evaluation set molecule in RankSVM training set
•Performance measure: Top-$k$ accuracy, percentage of correct molecular
candidates at rank ≤ $k$
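The top-$k$ accuracy defined above can be computed as in the following sketch. The tie handling (the correct candidate takes the best rank among equally scored candidates) is an assumption made here and may differ from the evaluation used for the reported numbers; the function name and the toy scores are hypothetical.

```python
import numpy as np


def top_k_accuracy(scores_per_feature, correct_idx, k):
    """
    scores_per_feature: one score array per MS feature (higher is better)
    correct_idx:        index of the correct candidate for each feature
    Returns the percentage of features whose correct candidate has rank <= k.
    """
    hits = 0
    for scores, c in zip(scores_per_feature, correct_idx):
        scores = np.asarray(scores, dtype=float)
        # Rank = 1 + number of candidates with a strictly higher score.
        rank = 1 + int(np.sum(scores > scores[c]))
        hits += rank <= k
    return 100.0 * hits / len(correct_idx)


# Two features with made-up scores; the correct candidates have ranks 2 and 3.
scores = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
correct = [2, 0]
acc = [top_k_accuracy(scores, correct, k) for k in (1, 2, 3)]
```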
Experiment 1: Comparison to MetFrag + LogP (RT Proxy) Prediction
◦MetFrag relaunched [5]: Prediction of LogP values for candidates, linear model
mapping measured RTs to LogPs, candidate re-ranking based on LogP deviation
Method (top-k accuracy in %)    Top-1   Top-5   Top-10   Top-20
MS2 + RT (ours)                 21.3    52.9    64.0     74.3
MS2 + RT (MetFrag & LogP)       20.5    49.1    61.2     72.6
Only MS2 (baseline)             16.7    49.5    60.4     70.6
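The LogP-based comparison method in Experiment 1 can be illustrated roughly as follows. This sketch only shows the idea of a linear RT-to-LogP mapping and the resulting per-candidate deviation; it is not MetFrag's implementation, the function names are hypothetical, and how the deviation is combined with the MS2 score is left out.

```python
import numpy as np


def fit_rt_to_logp(ref_rts, ref_logps):
    """Least-squares line logP = slope * RT + intercept from reference pairs."""
    slope, intercept = np.polyfit(ref_rts, ref_logps, deg=1)
    return slope, intercept


def logp_deviations(rt, candidate_logps, slope, intercept):
    """Absolute deviation between each candidate's LogP and the LogP expected at this RT."""
    expected = slope * rt + intercept
    return np.abs(np.asarray(candidate_logps, dtype=float) - expected)


# Made-up reference data lying exactly on logP = RT - 1.
slope, intercept = fit_rt_to_logp([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
dev = logp_deviations(rt=2.5, candidate_logps=[1.4, 3.0, 0.0],
                      slope=slope, intercept=intercept)
best = int(np.argmin(dev))  # candidate with the smallest LogP deviation
```

Unlike this mapping, the retention-order approach of the framework needs no reference RTs measured on the target LC-system.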
Experiment 2: Performance with different MS2-Scoring Methods
◦MetFrag (in-silico fragmenter scores) and IOKR [2] as MS2-scoring methods
MS2-Scorer   Method (top-k accuracy in %)   Top-1   Top-5   Top-10   Top-20
MetFrag      MS2 + RT (ours)                21.3    52.9    64.0     74.3
MetFrag      Only MS2 (baseline)            16.7    49.5    60.4     70.6
IOKR         MS2 + RT (ours)                26.7    52.1    62.5     70.3
IOKR         Only MS2 (baseline)            25.1    49.5    60.3     67.6
Experiment 3: Missing MS2 Spectra
◦Simulating missing MS2information: Varying from 0% to 100% MS2
◦If only MS1: Use mass deviation between precursor and candidate molecule
[Plot for Experiment 3; annotations: +15%p, ~ +4%p, ~ +4%p, +9%p]
References
[1] E. Bach, S. Szedmak, C. Brouard, S. Böcker, and J. Rousu. Liquid-chromatography retention order prediction for metabolite identification. Bioinformatics, 2018.
[2] C. Brouard, H. Shen, K. Dührkop, F. d’Alché-Buc, S. Böcker, and J. Rousu. Fast metabolite identification with Input Output Kernel Regression. Bioinformatics, 2016.
[3] K. Dührkop, M. Fleischauer, M. Ludwig, A. A. Aksenov, A. V. Melnik, M. Meusel, P. C. Dorrestein, J. Rousu, and S. Böcker. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods, 2019.
[4] D. J. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2005.
[5] C. Ruttkies, E. L. Schymanski, S. Wolf, J. Hollender, and S. Neumann. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. Journal of Cheminformatics, 2016.
[6] E. L. Schymanski, C. Ruttkies, M. Krauss, C. Brouard, T. Kind, K. Dührkop, F. Allen, A. Vaniya, D. Verdegem, S. Böcker, J. Rousu, H. Shen, H. Tsugawa, T. Sajed, O. Fiehn, B. Ghesquière, and S. Neumann. Critical assessment of small molecule identification 2016: automated methods. Journal of Cheminformatics, 2017.
[7] J. Stanstrup, S. Neumann, and U. Vrhovsek. PredRet: Prediction of retention time by direct mapping between multiple chromatographic systems. Analytical Chemistry, 2015.
Acknowledgements: This work has been supported by the Academy of Finland grant 310107 (MACOME); and the Aalto Science-IT infrastructure. SR and JHW were supported by EPSRC (EP/R018634/1) and the Scottish Informatics and Computing Science Alliance (SICSA) distinguished visiting fellow scheme.
Contact (B): eric.bach@aalto.fi