Figure - available from: Nature
Robustness across datasets and performance comparison with tuned ensembles
a, A comparison of modified datasets. We can see that TabPFN is not more vulnerable to the modifications compared with baselines. We also see that TabPFN reproduces the accuracy of CatBoost (default) with only half the training samples provided. Here we normalize scores per dataset (sharing one normalization across all modifications of one experiment) to avoid negative outliers. b, We split the test datasets by data characteristics and analyse the performance per subgroup. c, Classification performance. Left, the win rate of TabPFN (PHE) against AutoGluon (with one tie excluded); right, the ROC AUC score over time for tuning each method, with the first marker representing the default configuration for the non-ensembling methods. d, Regression performance presented as in c but using the RMSE metric. Intervals represent the 95% confidence interval and Wilcoxon P refers to the two-sided Wilcoxon signed-rank test P value⁵⁴.
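The per-dataset normalization and the win rate with ties excluded, both mentioned in the caption, can be illustrated with a small sketch. The caption does not state the normalization formula, so min-max scaling is an assumption here, and the function names `normalize_per_dataset` and `win_rate` are hypothetical, not taken from the source publication.

```python
from typing import Dict, List, Tuple


def normalize_per_dataset(
    scores: Dict[str, Dict[str, List[float]]]
) -> Dict[str, Dict[str, List[float]]]:
    """Min-max scale scores within each dataset.

    scores[dataset][method] is a list with one score per modification.
    A single (min, max) pair is shared by all methods and all
    modifications of a dataset. Min-max scaling is an assumption;
    the caption does not specify the formula.
    """
    normalized: Dict[str, Dict[str, List[float]]] = {}
    for dataset, by_method in scores.items():
        all_values = [v for values in by_method.values() for v in values]
        lo, hi = min(all_values), max(all_values)
        span = hi - lo
        normalized[dataset] = {
            method: [(v - lo) / span if span > 0 else 0.0 for v in values]
            for method, values in by_method.items()
        }
    return normalized


def win_rate(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Return (share of decided comparisons won by a, number of ties).

    a and b hold paired per-dataset scores where higher is better.
    Ties are excluded from the denominator.
    """
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    ties = len(a) - wins - losses
    decided = wins + losses
    rate = wins / decided if decided else float("nan")
    return rate, ties
```

For an error metric such as RMSE, where lower is better, the two arguments of `win_rate` would be swapped. The two-sided Wilcoxon signed-rank test on the same paired scores is available as `scipy.stats.wilcoxon(a, b, alternative="two-sided")`.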


Source publication
Article
Full-text available
Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science¹,². The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical ri...

Citations

... Recent advancements in machine learning (ML) and artificial intelligence (AI) underscore the immense potential of data and computational power [20,49,63]. Exceptionally large language models (LLMs), trained on large-scale datasets, have demonstrated remarkable effectiveness in a wide range of learning tasks [11,47,69]. ...
Preprint
The rapid advancement of multimodal large language models (LLMs) has opened new frontiers in artificial intelligence, enabling the integration of diverse large-scale data types such as text, images, and spatial information. In this paper, we explore the potential of multimodal LLMs (MLLM) for geospatial artificial intelligence (GeoAI), a field that leverages spatial data to address challenges in domains including Geospatial Semantics, Health Geography, Urban Geography, Urban Perception, and Remote Sensing. We propose an MLLM (OmniGeo) tailored to geospatial applications, capable of processing and analyzing heterogeneous data sources, including satellite imagery, geospatial metadata, and textual descriptions. By combining the strengths of natural language understanding and spatial reasoning, our model enhances the ability of instruction following and the accuracy of GeoAI systems. Results demonstrate that our model outperforms task-specific models and existing LLMs on diverse geospatial tasks, effectively addressing the multimodal nature of the data while achieving competitive results on the zero-shot geospatial tasks. Our code will be released after publication.
... In our theoretical model, each token represents a datapoint containing input features and the target label. Remarkably, TabPFN (Hollmann et al., 2023, 2025) shows that, using this simple encoding and pretraining the model with sufficiently rich data priors, transformers can accomplish state-of-the-art classification on tabular datasets via in-context learning. Test-time training. ...
... We demonstrate that test-time training (TTT) can significantly reduce the context length required for a given performance on TabPFN v2 (Hollmann et al., 2025). Specifically, we evaluate the TabPFN v2 model on samples from a single dataset with different context lengths. ...
Preprint
Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.
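The single-gradient-step update rule described in this abstract can be sketched on a toy problem. This is only an illustration under stated assumptions: a plain linear predictor stands in for the linear transformer analysed in the preprint, the loss is half the mean squared error over the in-context demonstrations, and the names `predict` and `ttt_single_step` are hypothetical.

```python
from typing import List


def predict(w: List[float], x: List[float]) -> float:
    """Linear prediction w . x (a stand-in for the model's output)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def ttt_single_step(
    w: List[float], X: List[List[float]], y: List[float], lr: float
) -> List[float]:
    """Take one gradient step on the in-context demonstrations (X, y).

    The loss is 0.5 * mean squared error over the demonstrations.
    The returned weights are used only for this test prompt; the
    pretrained weights w are left unchanged.
    """
    n = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = predict(w, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij / n
    return [wj - lr * gj for wj, gj in zip(w, grad)]


# Toy usage: adapt to two demonstrations, then answer a query point.
w_pretrained = [0.0, 0.0]
demos_X = [[1.0, 0.0], [0.0, 1.0]]
demos_y = [2.0, 4.0]
w_adapted = ttt_single_step(w_pretrained, demos_X, demos_y, lr=1.0)
query_prediction = predict(w_adapted, [1.0, 1.0])
```

The step size `lr` and the toy data are arbitrary choices for illustration; the preprint's theory concerns how such a step interacts with the pretraining distribution, which this sketch does not model.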
... Especially with the latest generation of AI, including foundation models [42], multimodal learning and time-series forecasting architectures [43,44] hold significant promise for also improving CGM-based prediction accuracy, personalization and clinical decision support. Foundation models, such as large-scale deep learning networks pre-trained on diverse health datasets, can potentially leverage vast amounts of glucose dynamics data to enhance predictive accuracy. ...
Article
Full-text available
Continuous glucose monitoring (CGM) and flash glucose monitoring (FGM) systems have revolutionized diabetes management by delivering real-time, dynamic insights into blood glucose levels. This article provides a concise overview of the evolution of CGM technology, highlights emerging innovations in the field and explores current and potential future applications (including insulin management, early diagnostics, predictive modeling, diabetes education and integration into automated insulin delivery (AID) systems) of CGM in healthcare.
... Autoregressive sequence modeling has recently gained significant attention as a powerful approach for modeling exchangeable sequences Y_{1:∞} [22,23,31,30,20]. Unlike conventional Bayesian modeling, which requires specifying a prior over an unobserved latent variable θ and a likelihood for the observed data (often a challenging task), autoregressive sequence modeling builds on De Finetti's predictive view of Bayesian inference [6,7,8,9]. ...
... Viewing future observations as the sole source of epistemic uncertainty in θ for exchangeable sequences [2,4,12], autoregressive sequence modeling enables direct prediction of the observables Y_{1:∞}, offering a conceptually elegant and practical alternative to conventional Bayesian approaches. Transformers have emerged as the dominant architecture for autoregressive sequence modeling [20,22,17,18,15,29], owing to their remarkable performance in natural language and vision applications [5,10]. As Transformers are increasingly employed to meta-learn probabilistic models for large-scale tabular datasets [32,20], they offer a unique opportunity to move beyond traditional prediction tasks or merely replicating supervised algorithms, an area that has been the primary focus so far. ...
... Transformers have emerged as the dominant architecture for autoregressive sequence modeling [20,22,17,18,15,29], owing to their remarkable performance in natural language and vision applications [5,10]. As Transformers are increasingly employed to meta-learn probabilistic models for large-scale tabular datasets [32,20], they offer a unique opportunity to move beyond traditional prediction tasks or merely replicating supervised algorithms, an area that has been the primary focus so far. Instead, following De Finetti's perspective, when meta-trained on tabular datasets, these models can effectively quantify epistemic uncertainty, which powers decision-making and active exploration across diverse domains, including recommendation systems, adaptive experimentation, and active learning [31]. ...
Preprint
Full-text available
Autoregressive models have emerged as a powerful framework for modeling exchangeable sequences - i.i.d. observations when conditioned on some latent factor - enabling direct modeling of uncertainty from missing data (rather than a latent). Motivated by the critical role posterior inference plays as a subroutine in decision-making (e.g., active learning, bandits), we study the inferential and architectural inductive biases that are most effective for exchangeable sequence modeling. For the inference stage, we highlight a fundamental limitation of the prevalent single-step generation approach: inability to distinguish between epistemic and aleatoric uncertainty. Instead, a long line of works in Bayesian statistics advocates for multi-step autoregressive generation; we demonstrate this "correct approach" enables superior uncertainty quantification that translates into better performance on downstream decision-making tasks. This naturally leads to the next question: which architectures are best suited for multi-step inference? We identify a subtle yet important gap between recently proposed Transformer architectures for exchangeable sequences (Muller et al., 2022; Nguyen & Grover, 2022; Ye & Namkoong, 2024), and prove that they in fact cannot guarantee exchangeability despite introducing significant computational overhead. We illustrate our findings using controlled synthetic settings, demonstrating how custom architectures can significantly underperform standard causal masks, underscoring the need for new architectural innovations.
... To cope with these difficulties, current approaches often rely on strong assumptions, such as having prior knowledge of the causal graph and imposing parametric assumptions on the underlying causal model (see Section 5 for more detail). For related causal tasks, there have been various proposals to tackle these challenges by extracting causal information from datasets via neural networks (Lorch et al., 2022; Scetbon et al., 2024; Sauter et al., 2024b; Annadani et al., 2025; Hollmann et al., 2025). Results from these approaches suggest that we can effectively amortize over datasets coming from different causal models for various downstream tasks. ...
... We model our encoder based on the extension of nonparametric encoders (Kossen et al., 2021) described in (Lorch et al., 2022). As it has been shown for various causal tasks, this encoder architecture can successfully encode a dataset into a vector containing relevant information about the underlying causal model, such as causal structure (Lorch et al., 2022), topological ordering, informative interventions (Annadani et al., 2025), or more abstract prediction tasks (Hollmann et al., 2025). In our setup, we employ this architecture to encode information necessary to predict interventional distributions. ...
Preprint
Full-text available
Predicting the distribution of outcomes under hypothetical interventions is crucial in domains like healthcare, economics, and policy-making. Current methods often rely on strong assumptions, such as known causal graphs or parametric models, and lack amortization across problem instances, limiting their practicality. We propose a novel transformer-based conditional variational autoencoder architecture, named ACTIVA, that extends causal transformer encoders to predict causal effects as mixtures of Gaussians. Our method requires no causal graph and predicts interventional distributions given only observational data and a queried intervention. By amortizing over many simulated instances, it enables zero-shot generalization to novel datasets without retraining. Experiments demonstrate accurate predictions for synthetic and semi-synthetic data, showcasing the effectiveness of our graph-free, amortized causal inference approach.
... Therefore, additional learning algorithms or different feature selection strategies can readily be integrated into the pipeline of Section 3.1 to further extend the analysis. While neural networks were not included in this initial benchmarking due to their relatively poor performances in larger tabular benchmarking studies (Shmuel et al., 2024), recent advancements in neural networks designed for tabular data such as TabPFN (Hollmann et al., 2025) have demonstrated promising results. First successful applications of TabPFN in DSM can be found by Barkov et al. (2024) but its application has to be further evaluated. ...
Preprint
Full-text available
Digital soil mapping (DSM) relies on a broad pool of statistical methods, yet determining the optimal method for a given context remains challenging and contentious. Benchmarking studies on multiple datasets are needed to reveal strengths and limitations of commonly used methods. Existing DSM studies usually rely on a single dataset with restricted access, leading to incomplete and potentially misleading conclusions. To address these issues, we introduce an open-access dataset collection called Precision Liming Soil Datasets (LimeSoDa). LimeSoDa consists of 31 field- and farm-scale datasets from various countries. Each dataset has three target soil properties: (1) soil organic matter or soil organic carbon, (2) clay content and (3) pH, alongside a set of features. Features are dataset-specific and were obtained by optical spectroscopy, proximal- and remote soil sensing. All datasets were aligned to a tabular format and are ready-to-use for modeling. We demonstrated the use of LimeSoDa for benchmarking by comparing the predictive performance of four learning algorithms across all datasets. This comparison included multiple linear regression (MLR), support vector regression (SVR), categorical boosting (CatBoost) and random forest (RF). The results showed that although no single algorithm was universally superior, certain algorithms performed better in specific contexts. MLR and SVR performed better on high-dimensional spectral datasets, likely due to better compatibility with principal components. In contrast, CatBoost and RF exhibited considerably better performances when applied to datasets with a moderate number (< 20) of features. These benchmarking results illustrate that the performance of a method is highly context-dependent. LimeSoDa therefore provides an important resource for improving the development and evaluation of statistical methods in DSM.
... More recently, Nam et al. [38] introduced Self-generated Tasks from UNlabeled Tables (STUNT), leveraging self-generated few-shot tasks for tabular learning, though its reliance on large unlabeled datasets may limit practical applicability. Additionally, Hollmann et al. [25] proposed the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model pre-trained on millions of synthetic datasets. ...
... Baselines We compare our proposed method with several baselines: logistic regression (LR), 2-layer MLP with ReLU activation and 100 hidden units (MLP), random forest (RF), XGBoost (XGB) [9], CatBoost [14], TabPFN [25] and FeatLLM [21]. ...
Preprint
Full-text available
Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains. However, effectively leveraging their vast knowledge for training smaller downstream models remains an open challenge, especially in domains like tabular data learning, where simpler models are often preferred due to interpretability and efficiency. In this paper, we introduce a novel yet straightforward method for incorporating LLM-generated global task feature attributions into the training process of smaller networks. Specifically, we propose an attribution-matching regularization term that aligns the training dynamics of the smaller model with the insights provided by the LLM. By doing so, our approach yields superior performance in few-shot learning scenarios. Notably, our method requires only black-box API access to the LLM, making it easy to integrate into existing training pipelines with minimal computational overhead. Furthermore, we demonstrate how this method can be used to address common issues in real-world datasets, such as skewness and bias. By integrating high-level knowledge from LLMs, our approach improves generalization, even when training data is limited or imbalanced. We validate its effectiveness through extensive experiments across multiple tasks, demonstrating improved learning efficiency and model robustness.
... The use of AutoML is generally an effective but time- and resource-consuming process in searching for the best pipelines. Therefore, recent studies are focusing on developing the so-called Green AutoML approaches (Tornede et al., 2023) allowing one to consider the carbon footprint of the hyperparameter optimization process or proposing methods that are competitive in performance with robust AutoML frameworks, but require only a few seconds to find the best model due to improved meta-learning, for instance, TabPFN (Hollmann et al., 2025). ...
Article
Full-text available
Current vegetation indices and biophysical parameters derived from optical satellite data for forest monitoring are widely used in various applications but can be limited by atmospheric effects like clouds. Synthetic aperture radar (SAR) data can offer insightful and systematic forest monitoring with complete time series due to signal penetration through clouds and day and night image acquisitions. This study explores the use of SAR data, combined with ancillary data and machine learning (ML), to estimate forest parameters typically derived from optical satellites. It investigates whether SAR signals provide sufficient information for the accurate estimation of these parameters, focusing on two spectral vegetation indices (Normalized Difference Vegetation Index – NDVI and Enhanced Vegetation Index – EVI) and two biophysical parameters (Leaf Area Index – LAI and Fraction of Absorbed Photosynthetically Active Radiation – FAPAR) in healthy and disturbed temperate forests in Czechia and Central Europe in 2021. Vegetation metrics derived from Sentinel-2 multispectral data were used to evaluate the results. A paired multi-modal time-series dataset was created using Google Earth Engine (GEE), including temporally and spatially aligned Sentinel-1, Sentinel-2, DEM-based features and meteorological variables, along with a forest type class. The inclusion of DEM-based auxiliary features and additional meteorological information improved the results. In the comparison of ML models, the traditional ML algorithms, Random Forest Regressor and Extreme Gradient Boosting (XGB) slightly outperformed the Automatic Machine Learning (AutoML) approach, auto-sklearn, for all forest parameters, achieving high accuracies (R² between 70% and 86%) and low errors (0.055–0.29 of mean absolute error). XGB was the most computationally efficient. Moreover, SAR-based estimations over Central Europe achieved comparable results to those obtained in testing within Czechia, demonstrating their transferability for large-scale modeling. A key advantage of the SAR-based vegetation metrics is the ability to detect abrupt forest changes with sub-weekly temporal accuracy, providing up to 240 measurements per year at a 20 m resolution.
... Furthermore, some of the aforementioned tabular transformers [149,134], when pre-trained with a diverse range of tabular datasets, have been shown to exhibit better performance via finetuning than training from scratch [164]. Following these findings, various large tabular models pretrained on synthetic tabular data have been proposed for classification, which yield great promise for application on edge applications [57,13]. By using on-the-fly sketches to summarize unbounded streaming data, one can feed this information into the pre-trained foundational model for efficient processing [88]. ...
Preprint
This literature review explores continual learning methods for on-device training in the context of neural networks (NNs) and decision trees (DTs) for classification tasks on smart environments. We highlight key constraints, such as data architecture (batch vs. stream) and network capacity (cloud vs. edge), which impact TinyML algorithm design, due to the uncontrolled natural arrival of data streams. The survey details the challenges of deploying deep learners on resource-constrained edge devices, including catastrophic forgetting, data inefficiency, and the difficulty of handling IoT tabular data in open-world settings. While decision trees are more memory-efficient for on-device training, they are limited in expressiveness, requiring dynamic adaptations, like pruning and meta-learning, to handle complex patterns and concept drifts. We emphasize the importance of multi-criteria performance evaluation tailored to edge applications, which assess both output-based and internal representation metrics. The key challenge lies in integrating these building blocks into autonomous online systems, taking into account stability-plasticity trade-offs, forward-backward transfer, and model convergence.
... Here we show results from re-running the previous analysis on TabPFN-v2 [Hollmann et al., 2025]. The results for the 1d binary classification and 2d multiclass classification experiments show surprisingly worse behavior from TabPFN-v2. ...
Preprint
Full-text available
TabPFN [Hollmann et al., 2023], a Transformer model pretrained to perform in-context learning on fresh tabular classification problems, was presented at the last ICLR conference. To better understand its behavior, we treat it as a black-box function approximator generator and observe its generated function approximations on a varied selection of training datasets. Exploring its learned inductive biases in this manner, we observe behavior that is at turns either brilliant or baffling. We conclude this post with thoughts on how these results might inform the development, evaluation, and application of prior-data fitted networks (PFNs) in the future.