Figure - uploaded by Arthur Leroy
Prediction curves (blue) with associated 95% credible intervals (grey) from GP regression (top), Magma (middle) and MagmaClust (bottom). The dashed lines represent the mean parameters from the mean process estimates. Observed data points are in black, testing data points are in red. Background points are the observations from the training data set, coloured according to individuals (middle) or clusters (bottom).


Source publication
Article
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of mult...


Citations

... An increasing number of applications feature functional data collected at a discrete, random set of design points, also known as the random design framework. Examples include sports science (Leroy et al., 2023; Warmenhoven, 2024), oceanography (Acar-Denizli et al., 2018; Yarger et al., 2022), medicine (Sørensen et al., 2013), and spatial data (Burbano-Moreno and Mayrink, 2024). In these applications, a fundamental task in the functional data analysis pipeline is the approximation of integrals of functions of the sample paths (also called trajectories). ...
Preprint
The computation of integrals is a fundamental task in the analysis of functional data, which are typically considered as random elements in a space of squared integrable functions. Borrowing ideas from recent advances in the Monte Carlo integration literature, we propose effective unbiased estimation and inference procedures for integrals of uni- and multivariate random functions. Several applications to key problems in functional data analysis (FDA) involving random design points are studied and illustrated. In the absence of noise, the proposed estimates converge faster than the sample mean and the usual algorithms for numerical integration. Moreover, the proposed estimator facilitates effective inference by generally providing better coverage with shorter confidence and prediction intervals, in both noisy and noiseless setups.
... For both models, the same GP kernel and initial hyperparameters were specified. Moreover, the number K of clusters was set to 3, using a model selection procedure based on the maximization of a variational Bayesian information criterion calculated for different values of K [21]. Once the 2 models were trained, the 2 associated predictions were made on the mean GPs obtained from the training step of the MAGMACLUST algorithm. ...
Article
Purpose: To analyze the evolution of the world ranking in swimming over the last 10 years, with particular attention to the effects of COVID-19 on the different levels of participating athletes. Methods: The top 200 world-ranked entries in all swimming events (50-m pool) were collected from 2013 to 2022. A mathematical model (Gaussian model) was proposed to evaluate the ranking progression for different performance levels (clusters) according to distance, stroke, and gender. The model was applied both with and without the COVID season data. Results: Overall results indicated a general progression in world rankings over the last 10 years, except for the COVID season and the post-Olympic year(s), with peak results in the 2021 postpandemic (Olympic) year. The gender gap in World Aquatics points scoring has widened in favor of males since 2017, reaching 1.5% in 2022. The top 200 positions of world rankings were grouped into 3 different clusters defined by the 23.3%, 66.5%, and 100% of ranked male swimmers (or 31.5%, 72.5%, and 100% for females) and with average World Aquatics scores of 910 (12), 858 (10), and 816 (11) points (907 [13], 847 [11], and 802 [12] for females). The Gaussian model showed a gap averaging ∼21 to ∼36 points between performance curves with or without COVID season data, with larger gaps for female rankings and cluster-3 swimmers. Conclusions: These results suggest that, given the lower relative performance of female swimmers in the different clusters of world rankings, female events may provide an opportunity to enter international-level swimming.
... DOMINO dramatically improves on the MAGMA algorithm. A natural next step is to conduct a similar study on MAGMAClust [Leroy et al., 2023], a generalisation of MAGMA which learns cluster-specific means and infers clusters whilst learning the common means. ...
Preprint
We consider time-series forecasting problems where data is scarce, difficult to gather, or induces a prohibitive computational cost. As a first attempt, we focus on short-term electricity consumption in France, which is of strategic importance for energy suppliers and public stakeholders. The complexity of this problem and the many levels of geospatial granularity motivate the use of an ensemble of Gaussian Processes (GPs). Whilst GPs are remarkable predictors, they are computationally expensive to train, which calls for a frugal few-shot learning approach. By taking into account performance on GPs trained on a dataset and designing a random walk on these, we mitigate the training cost of our entire Bayesian decision-making procedure. We introduce our algorithm called Domino (ranDOM walk on gaussIaN prOcesses) and present numerical experiments to support its merits.
... The separately trained ML models are more likely to overfit these smaller subcohorts (64). An alternative to defining subcohorts a priori is to first use data-driven approaches to subtype the data and then derive subtype-specific predictive models (65, 66, 67). For example, Drysdale et al. (68) first identified 4 clusters of individuals with depression symptoms and then trained separate ML models to predict treatment response to repetitive transcranial magnetic stimulation for each subtype. ...
Article
Despite the advantage of neuroimaging-based machine learning (ML) models as pivotal tools for investigating brain-behavior relationships in neuropsychiatric studies, these data-driven predictive approaches have yet to yield substantial, clinically actionable insights for mental health care. A notable impediment lies in the inadequate accommodation of most ML research to the natural heterogeneity within large samples. Although commonly thought of as individual-level analyses, many ML algorithms are unimodal and homogeneous and thus incapable of capturing the potentially heterogeneous relationships between biology and psychopathology. We review the current landscape of computational research targeting population heterogeneity and argue that there is a need to expand from brain subtyping and behavioral phenotyping to analyses that focus on heterogeneity at the relational level. To this end, we review and suggest several existing ML models with the capacity to discern how external environmental and sociodemographic factors moderate the brain-behavior mapping function in a data-driven fashion. These heterogeneous ML models hold promise for enhancing the discovery of individualized brain-behavior associations and advancing precision psychiatry.
... Future enhancements for our models include incorporating additional features as covariates, adopting more advanced preprocessing techniques, considering error measurements in a heteroskedastic model and refining the kernel to make it more specific to the task at hand. Another potential next step would be to use MagmaClustR [12], a variant of the MAGMA algorithm that may reduce prediction uncertainty by performing clustering on profiles. These enhancements could establish GPs as highly effective models for spatial atmosphere analysis. ...
Conference Paper
In the field of spatial aeronomy, atmospheric profile datasets often contain partial data. Probabilistic models, particularly Gaussian processes (GPs), offer promising solutions for filling these data gaps. However, traditional GP algorithms encounter challenges when handling multiple sequences simultaneously, both in terms of performance and computational complexity. Recently, an algorithm named MAGMA was introduced to address these issues. This paper evaluates MAGMA’s performance using the SOIR Venus atmosphere dataset, marking the first application of MAGMA to atmospheric profiles. Results indicate that MAGMA represents a significant advancement towards the efficient application of GPs for extrapolating atmospheric profiles.
Article
Investigating the relationship, particularly the lead–lag effect, between time series is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, due to technical reasons, the time points at which observations are made are not at uniform intervals. Secondly, some lead–lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead–lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead–lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pairwise time series when considering their strength of lead–lag effects with external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions, compared to other leading methods.