Carl Edward Rasmussen’s research while affiliated with the University of Cambridge and other places


Publications (145)


Figure 1: Negative log marginal likelihood surface as a function of two hyperparameters, σ_f^2 and ℓ, for a squared exponential kernel and a 1d function. The red cross indicates the true hyperparameters. Hyperparameters selected via gradient-based optimisation are sensitive to the initialisation due to the long ridge of almost identical height at hyperparameter values not concordant with the ground truth.
Figure 2: Top: 1d regression with (left) SGPR, (middle) SGPR + HMC, (right) joint HMC [Hensman et al., 2015b]. Bottom: (left) samples from the mixture posterior; (middle) length-scale distribution under SGPR + HMC and ML-II (note that the data are generated from a parametric function, so there is no ground-truth lengthscale); (right) noise standard deviation distribution from SGPR + HMC and ML-II.
Figure 3: Negative test log likelihoods with standard error of the mean across 10 splits, with 80% of the data reserved for training. Our method is SGPR + HMC.
Figure 5: Negative test log likelihoods with standard error of the mean, with 80% of the data reserved for training. Our method is SGPR + HMC.
Table: Existing literature on fully Bayesian inference in GPs, sparse GPs and generic likelihoods.
Sparse Gaussian Process Hyperparameters: Optimize or Integrate?
  • Preprint
  • File available

November 2022 · 123 Reads · Wessel P. Bruinsma · David R. Burt · Carl E. Rasmussen

The kernel function and its hyperparameters are the central model selection choice in a Gaussian process (Rasmussen and Williams, 2006). Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GP paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of Titsias (2009). This work is closely related to Hensman et al. (2015b) but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in the literature, including stochastic variational GPs (SVGPs), and provide an extensive computational analysis.
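To make the proposed scheme concrete, below is a minimal sketch of sampling kernel hyperparameters against Titsias' (2009) collapsed variational bound. This is not the authors' implementation: it substitutes a dependency-free random-walk Metropolis sampler for the HMC used in the paper, assumes a flat prior over log-hyperparameters, and keeps the inducing inputs fixed.

```python
# Sketch: MCMC over GP hyperparameters using the Titsias collapsed bound as
# the (unnormalised) log posterior. Metropolis stands in for HMC here.
import numpy as np

def rbf(x1, x2, log_sf2, log_ell):
    d = x1[:, None] - x2[None, :]
    return np.exp(log_sf2) * np.exp(-0.5 * d**2 / np.exp(2 * log_ell))

def collapsed_elbo(theta, x, y, z):
    """Titsias' collapsed lower bound; theta = (log sf2, log ell, log sn2)."""
    log_sf2, log_ell, log_sn2 = theta
    sn2 = np.exp(log_sn2)
    Knm = rbf(x, z, log_sf2, log_ell)
    Kmm = rbf(z, z, log_sf2, log_ell) + 1e-8 * np.eye(len(z))
    Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)
    S = Qnn + sn2 * np.eye(len(x))                 # Q_nn + sigma^2 I
    _, logdet = np.linalg.slogdet(S)
    quad = y @ np.linalg.solve(S, y)
    trace_term = (np.exp(log_sf2) * len(x) - np.trace(Qnn)) / (2 * sn2)
    return -0.5 * (logdet + quad + len(x) * np.log(2 * np.pi)) - trace_term

def metropolis(x, y, z, n_samples=2000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)                            # start at unit hyperparameters
    logp = collapsed_elbo(theta, x, y, z)          # flat prior: posterior ∝ bound
    samples = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(3)
        logp_prop = collapsed_elbo(prop, x, y, z)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        samples.append(theta.copy())
    return np.array(samples)

x = np.linspace(0, 5, 60)
y = np.sin(2 * x) + 0.1 * np.random.default_rng(1).standard_normal(60)
z = np.linspace(0, 5, 10)                          # fixed inducing inputs
samples = metropolis(x, y, z)
print(samples[-500:].mean(axis=0))                 # rough posterior means
```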


Figure 4: We illustrate how spatial resolution affects approximation accuracy and computational cost. Increasing the spatial resolution leads to using fewer inducing points, with the number dropping rapidly in low dimensions and more slowly in higher dimensions. On the other hand, the approximation error, as measured by the Wasserstein-2 distance between the approximate and exact posteriors, increases with the spatial resolution. Computational costs follow the opposite relationship: finer spatial resolutions are less stable and more expensive to compute, as evidenced by condition numbers and the number of iterations needed for conjugate gradients to converge.
Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees

October 2022 · 51 Reads · Alexander Terenin · David R. Burt · [...] · Hong Ge

As Gaussian processes mature, they are increasingly being deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. We derive sufficient and, in certain cases, necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We evaluate the proposed techniques on a number of examples, showing that, in geospatial settings, sparse approximations with guaranteed numerical stability often perform comparably to those without.
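The paper's cover-tree construction is not reproduced here, but the invariant it enforces, a minimum separation between inducing points, is easy to illustrate. Below is a hedged sketch using a naive greedy filter (an assumption for illustration, not the paper's algorithm) showing how larger separations yield fewer inducing points and a better-conditioned K_mm.

```python
# Sketch: enforce a minimum pairwise separation among candidate inducing
# points, which controls the conditioning of K_mm.
import numpy as np

def min_separation_subset(X, sep):
    """Greedily keep points at pairwise distance >= sep."""
    chosen = []
    for x in X:
        if all(np.linalg.norm(x - c) >= sep for c in chosen):
            chosen.append(x)
    return np.array(chosen)

def rbf(A, B, ell=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
Z = min_separation_subset(X, sep=0.15)

# Larger sep -> fewer inducing points and a smaller condition number of K_mm.
print(len(Z), np.linalg.cond(rbf(Z, Z) + 1e-12 * np.eye(len(Z))))
```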


Kernel Identification Through Transformers

June 2021 · 49 Reads

Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models, as the chosen kernel determines both the inductive biases and prior support of functions under the GP prior. This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models. Drawing inspiration from recent progress in deep learning, we introduce a novel approach named KITT: Kernel Identification Through Transformers. KITT exploits a transformer-based architecture to generate kernel recommendations in under 0.1 seconds, which is several orders of magnitude faster than conventional kernel search algorithms. We train our model using synthetic data generated from priors over a vocabulary of known kernels. By exploiting the nature of the self-attention mechanism, KITT is able to process datasets with inputs of arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong performance over a diverse collection of regression benchmarks.
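The transformer itself is beyond a short sketch, but the training-data generation the abstract describes, sampling functions from GP priors over a vocabulary of known kernels and labelling each draw with its kernel, can be illustrated as follows. The two-kernel vocabulary and all parameter ranges below are assumptions for illustration only.

```python
# Sketch: KITT-style synthetic training data, where each example is a draw
# from a GP prior labelled by the kernel that generated it.
import numpy as np

def k_rbf(x, ell=0.5):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * d**2 / ell**2)

def k_periodic(x, ell=0.5, p=1.0):
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell**2)

VOCAB = {"rbf": k_rbf, "periodic": k_periodic}   # assumed toy vocabulary

def sample_task(rng, n=64):
    name = rng.choice(list(VOCAB))
    x = np.sort(rng.uniform(-2, 2, n))
    K = VOCAB[name](x) + 1e-6 * np.eye(n)
    y = np.linalg.cholesky(K) @ rng.standard_normal(n)
    return (x, y), name                          # (inputs, prior draw), label

rng = np.random.default_rng(0)
(x, y), label = sample_task(rng)
print(label, x.shape, y.shape)
```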


The Promises and Pitfalls of Deep Kernel Learning

February 2021 · 24 Reads · 1 Citation

Deep kernel learning and related techniques promise to combine the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes. One crucial aspect of these models is an expectation that, because they are treated as Gaussian process models optimized using the marginal likelihood, they are protected from overfitting. However, we identify pathological behavior, including overfitting, on a simple toy example. We explore this pathology, explaining its origins and considering how it applies to real datasets. Through careful experimentation on UCI datasets, CIFAR-10, and the UTKFace dataset, we find that the overfitting from overparameterized deep kernel learning, in which the model is "somewhat Bayesian", can in certain scenarios be worse than that from not being Bayesian at all. However, we find that a fully Bayesian treatment of deep kernel learning can rectify this overfitting and obtain the desired performance improvements over standard neural networks and Gaussian processes.
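As a rough illustration of the model class under discussion, the sketch below builds a deep kernel k(x, x') = k_RBF(g(x), g(x')) with a small feature map g, and evaluates the negative log marginal likelihood that ML-II would minimise over all of g's weights. The architecture and sizes are assumptions; the point is that the set of "hyperparameters" grows with the network, which is where the overfitting pathology enters.

```python
# Sketch: a deep kernel and the ML-II objective over its network weights.
import numpy as np

def g(X, W1, W2):
    return np.tanh(X @ W1) @ W2                  # small 2-layer feature map

def deep_kernel(X, W1, W2, ell=1.0):
    F = g(X, W1, W2)
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def neg_log_marginal(X, y, W1, W2, sn2=0.05):
    K = deep_kernel(X, W1, W2) + sn2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (logdet + y @ np.linalg.solve(K, y))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)
W1, W2 = rng.standard_normal((1, 8)), rng.standard_normal((8, 2))
print(neg_log_marginal(X, y, W1, W2))            # minimised over W1, W2 in DKL
```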



Lazily Adapted Constant Kinky Inference for nonparametric regression and model-reference adaptive control

December 2020 · 47 Reads · 43 Citations · Automatica

Techniques known as Nonlinear Set Membership prediction or Lipschitz Interpolation are approaches to supervised machine learning that utilise presupposed Lipschitz properties to perform inference over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori, they offer convergence guarantees, as well as bounds around the predictions. Considering a more general setting that builds on Lipschitz continuity, we propose a method for estimating the Lipschitz constant online from function value observations that are possibly corrupted by bounded noise. Utilising this as a data-dependent hyper-parameter gives rise to a nonparametric machine learning method, for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function on compact support in the limit of increasingly dense data, up to a worst-case error that can be bounded by the level of observational error. We also consider applications of our nonparametric regression method to learning-based control. For a class of discrete-time settings, we establish convergence guarantees on the closed-loop tracking error of our online learning-based controllers. To provide evidence that our method can be beneficial not only in theory but also in practice, we apply it in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics scenarios and performance metrics, our approach outperforms recently proposed alternatives based on Gaussian processes and RBF neural networks.
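A minimal sketch of the idea follows: the standard Lipschitz-interpolation prediction rule combined with an online estimate of the Lipschitz constant, in the spirit of LACKI rather than a faithful reproduction of it. The noise bound ebar is an assumed hyper-parameter.

```python
# Sketch: Lipschitz interpolation with an online Lipschitz-constant estimate.
import numpy as np

class LipschitzInterpolator:
    def __init__(self, ebar=0.0):
        self.X, self.y, self.L, self.ebar = [], [], 0.0, ebar

    def observe(self, x, y):
        # Update L with the steepest noise-adjusted slope seen so far.
        for xi, yi in zip(self.X, self.y):
            d = np.linalg.norm(x - xi)
            if d > 0:
                self.L = max(self.L, (abs(y - yi) - 2 * self.ebar) / d)
        self.X.append(np.asarray(x))
        self.y.append(y)

    def predict(self, x):
        d = np.array([np.linalg.norm(x - xi) for xi in self.X])
        y = np.array(self.y)
        upper = np.min(y + self.ebar + self.L * d)   # ceiling consistent with data
        lower = np.max(y - self.ebar - self.L * d)   # floor consistent with data
        return 0.5 * (upper + lower)

m = LipschitzInterpolator(ebar=0.01)
for x in np.linspace(0, 1, 50):
    m.observe(np.array([x]), np.sin(4 * x))
print(m.predict(np.array([0.37])), np.sin(4 * 0.37))
```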


Marginalised Gaussian Processes with Nested Sampling

October 2020 · 90 Reads

Gaussian process (GP) models form a rich distribution over functions with inductive biases controlled by a kernel function. Learning occurs through the optimisation of kernel hyperparameters using the marginal likelihood as the objective. This classical approach, known as Type-II maximum likelihood (ML-II), yields point estimates of the hyperparameters and continues to be the default method for training GPs. However, this approach risks underestimating predictive uncertainty and is prone to overfitting, especially when there are many hyperparameters. Furthermore, gradient-based optimisation makes ML-II point estimates highly susceptible to the presence of local minima. This work presents an alternative learning procedure where the hyperparameters of the kernel function are marginalised using Nested Sampling (NS), a technique that is well suited to sample from complex, multi-modal distributions. We focus on regression tasks with the spectral mixture (SM) class of kernels and find that a principled approach to quantifying model uncertainty leads to substantial gains in predictive performance across a range of synthetic and benchmark data sets. In this context, nested sampling is also found to offer a speed advantage over Hamiltonian Monte Carlo (HMC), widely considered to be the gold standard in MCMC-based inference.
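Whatever sampler produces the hyperparameter posterior (the paper uses nested sampling; no sampler is shown here), the resulting predictive is a mixture over samples. The sketch below, with stand-in posterior draws and weights, shows the marginalisation via the law of total variance.

```python
# Sketch: marginalising GP predictions over weighted hyperparameter samples,
# e.g. as produced by a nested sampler (sampler itself omitted).
import numpy as np

def rbf(A, B, sf2, ell):
    d = A[:, None] - B[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

def gp_predict(x, y, xs, sf2, ell, sn2):
    K = rbf(x, x, sf2, ell) + sn2 * np.eye(len(x))
    Ks = rbf(xs, x, sf2, ell)
    mu = Ks @ np.linalg.solve(K, y)
    var = sf2 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T)) + sn2
    return mu, var

def marginal_predict(x, y, xs, samples, weights):
    # Mixture mean/variance over hyperparameter samples (sf2, ell, sn2).
    mus, vars_ = zip(*(gp_predict(x, y, xs, *s) for s in samples))
    mus, vars_, w = np.array(mus), np.array(vars_), np.asarray(weights)[:, None]
    mean = (w * mus).sum(0)
    var = (w * (vars_ + mus**2)).sum(0) - mean**2   # law of total variance
    return mean, var

x = np.linspace(0, 3, 40)
y = np.sin(2 * x)
samples = [(1.0, 0.5, 0.01), (0.8, 0.7, 0.02)]      # stand-in posterior draws
mean, var = marginal_predict(x, y, np.array([1.5]), samples, [0.6, 0.4])
print(mean, var)
```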


Figure 1: Architecture of the neural network where shading depicts external inputs or data.
Figure 3: Summary of the toy problem results, showing the synthetic observations and model predictions for a single month 5 years out of sample. These are shown alongside the model weights recovered by the BayNNE for each model, the model bias, and aleatoric noise.
Ensembling geophysical models with Bayesian Neural Networks

October 2020 · 109 Reads

Ensembles of geophysical models improve projection accuracy and express uncertainties. We develop a novel data-driven ensembling strategy for combining geophysical models using Bayesian Neural Networks, which infers spatiotemporally varying model weights and bias while accounting for heteroscedastic uncertainties in the observations. This produces more accurate and uncertainty-aware projections without sacrificing interpretability. Applied to the prediction of total column ozone from an ensemble of 15 chemistry-climate models, we find that the Bayesian neural network ensemble (BayNNE) outperforms existing ensembling methods, achieving a 49.4% reduction in RMSE for temporal extrapolation, and a 67.4% reduction in RMSE for polar data voids, compared to a weighted mean. Uncertainty is also well-characterized, with 90.6% of the data points in our extrapolation validation dataset lying within 2 standard deviations and 98.5% within 3 standard deviations.
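A hedged sketch of the combination rule the abstract describes: a network maps space-time inputs to softmax weights over ensemble members, an additive bias, and a heteroscedastic noise level. The architecture and the untrained random weights below are assumptions; the Bayesian training procedure is omitted.

```python
# Sketch: BayNNE-style combination of geophysical model predictions.
import numpy as np

def combine(features, model_preds, W):
    """features: (n, d) space-time inputs; model_preds: (n, M) member outputs."""
    h = np.tanh(features @ W['h'])
    logits = h @ W['w']                           # (n, M) unnormalised weights
    alpha = np.exp(logits)
    alpha /= alpha.sum(1, keepdims=True)          # softmax model weights
    bias = h @ W['b']                             # (n,) additive model bias
    log_sigma = h @ W['s']                        # (n,) heteroscedastic noise
    mean = (alpha * model_preds).sum(1) + bias
    return mean, np.exp(log_sigma)

rng = np.random.default_rng(0)
n, d, M, H = 5, 4, 15, 16                         # M = 15 chemistry-climate models
W = {'h': rng.standard_normal((d, H)), 'w': rng.standard_normal((H, M)),
     'b': rng.standard_normal(H), 's': rng.standard_normal(H)}
mean, sigma = combine(rng.standard_normal((n, d)), rng.standard_normal((n, M)), W)
print(mean.shape, sigma.shape)
```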


Ensembling geophysical models with Bayesian Neural Networks

September 2020 · 157 Reads · 10 Citations

Ensembles of geophysical models improve projection accuracy and express uncertainties. We develop a novel data-driven ensembling strategy for combining geophysical models using Bayesian Neural Networks, which infers spatiotemporally varying model weights and bias while accounting for heteroscedastic uncertainties in the observations. This produces more accurate and uncertainty-aware projections without sacrificing interpretability. Applied to the prediction of total column ozone from an ensemble of 15 chemistry-climate models, we find that the Bayesian neural network ensemble (BayNNE) outperforms existing ensembling methods, achieving a 49.4% reduction in RMSE for temporal extrapolation, and a 67.4% reduction in RMSE for polar data voids, compared to a weighted mean. Uncertainty is also well-characterized, with 90.6% of the data points in our extrapolation validation dataset lying within 2 standard deviations and 98.5% within 3 standard deviations.


Bayesian Machine Learning for the Prognosis of Combustion Instabilities From Noise

September 2020 · 42 Reads · 3 Citations

Experiments are performed on a turbulent swirling flame placed inside a vertical tube whose fundamental acoustic mode becomes unstable at higher powers and equivalence ratios. The power, equivalence ratio, fuel composition and boundary condition of this tube are varied and, at each operating point, the combustion noise is recorded. In addition, short acoustic pulses at the fundamental frequency are supplied to the tube with a loudspeaker and the decay rates of subsequent acoustic oscillations are measured. This quantifies the linear stability of the system at every operating point. Using this data for training, we show that it is possible for a Bayesian ensemble of neural networks to predict the decay rate from a 300 millisecond sample of the (un-pulsed) combustion noise and therefore forecast impending thermoacoustic instabilities. We also show that it is possible to recover the equivalence ratio and power of the flame from these noise snippets, confirming our hypothesis that combustion noise indeed provides a fingerprint of the combustor’s internal state. Furthermore, the Bayesian nature of our algorithm enables principled estimates of uncertainty in our predictions, a reassuring feature that prevents it from making overconfident extrapolations. We use the techniques of permutation importance and integrated gradients to understand which features in the combustion noise spectra are crucial for accurate predictions and how they might influence the prediction. This study serves as a first step towards establishing interpretable and Bayesian machine learning techniques as tools to discover informative relationships in combustor data and thereby build trustworthy, robust and reliable combustion diagnostics.
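As a toy illustration of the "Bayesian ensemble of neural networks" prediction step, the sketch below averages an ensemble of small networks and uses the spread as the uncertainty estimate. The feature dimension, architecture, and random untrained weights are assumptions; the anchored Bayesian training the paper relies on is omitted.

```python
# Sketch: ensemble prediction with spread-based uncertainty.
import numpy as np

def net(x, W1, W2):
    return np.tanh(x @ W1) @ W2                  # one small regression network

rng = np.random.default_rng(0)
ensemble = [(rng.standard_normal((8, 16)), rng.standard_normal((16, 1)))
            for _ in range(20)]                  # 20 independently initialised nets

x = rng.standard_normal((1, 8))                  # stand-in noise-spectrum features
preds = np.array([net(x, W1, W2)[0, 0] for W1, W2 in ensemble])
print("decay-rate estimate:", preds.mean(), "+/-", preds.std())
```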


Citations (82)


... The settings for α and Λ are the same as before: for SoR and DTC, α = 0 and Λ = σ_n^2 I; for FITC, α = 1 and Λ_FITC = diag(D_ff) + σ_n^2 I. The PITC approximation does not correspond exactly to a GP, making predictions slightly more complex [54]. ...

Reference:

Online variational Gaussian process for time series data
Approximation Methods for Gaussian Process Regression
  • Citing Chapter
  • August 2007

... Such actions could be, for example, the motor torque commands of a robot, and the sensory data could come from on-robot or external sensors such as cameras, sonar, thermostats, and water pressure sensors. Reinforcement learning (RL) has been one of the main avenues of research towards learning visuomotor policy for manipulation [36], [37], [38]. More recently, deep reinforcement learning (DRL) uses deep neural networks to parameterize a policy that can be optimized under a reinforcement learning algorithm [39], [40], [41], [42]. ...

Learning to Control a Low-Cost Manipulator Using Data-Efficient Reinforcement Learning
  • Citing Chapter
  • June 2012

... However, in many real-world dialogue scenarios, transition probabilities or rewarding transitions are not known in advance, and dialogue agents usually require extensive exploration of the environment (Li et al., 2016; Kwan et al., 2022). Some of these explorations are valid and some are not, and the more invalid explorations there are, the less effective the learning of dialogue policies becomes (Wu et al., 2019; Wu and Rasmussen, 2021). In many dialogues, a long dialogue segment is an invalid exploration due to the effect of dead-ends. ...

Clipping Loops for Sample-Efficient Dialogue Policy Optimisation
  • Citing Conference Paper
  • January 2021

... Non-parametric models relax that assumption, which is arguably crucial to detect when data are outside the domain of training data ('out-of-domain') and for avoiding extreme overconfidence, i.e., 'silent catastrophic failure' . In future work, non-parametric models, for example Gaussian Processes, capable of measuring uncertainties about 'out-of-domain' data, should also be explored [44][45][46]55 . ...

The Promises and Pitfalls of Deep Kernel Learning
  • Citing Preprint
  • February 2021

... This continuous evolution makes it difficult to identify the critical point of the system parameter or the onset of S_TI, which is crucial for designing safe operating regimes of such complex spatiotemporal systems. Further, whether the onset of S_TI can be forewarned has been a vital research question pursued in recent decades (Gopalakrishnan et al 2016, Gotoda et al 2014, Kobayashi et al 2019, Sengupta et al 2020). ...

Bayesian Machine Learning for the Prognosis of Combustion Instabilities From Noise
  • Citing Conference Paper
  • September 2020

... • Bayesian Neural Networks (BNN) ([9], [20], [19], and [23]). BNNs incorporate uncertainty directly into the model by treating weights as distributions rather than fixed values, providing a probabilistic framework that naturally captures epistemic uncertainty. ...

Ensembling geophysical models with Bayesian Neural Networks

... 2. An enhanced NPL method is developed based on the Lazily Adapted Constant Kinky Inference (LACKI) scheme to formulate a search-based map between the sampled dataset and the RDMV's complex nonlinear dynamics function [24]. Compared with the existing method [18], a learning rule for formulating the sampled dataset is introduced to enhance its real-time performance. ...

Lazily Adapted Constant Kinky Inference for nonparametric regression and model-reference adaptive control
  • Citing Article
  • December 2020

Automatica

... However, in many real-world dialogue scenarios, transition probabilities or rewarding transitions are not known in advance, and dialogue agents usually require extensive exploration of the environment (Li et al., 2016; Kwan et al., 2022). Some of these explorations are valid and some are not, and the more invalid explorations there are, the less effective the learning of dialogue policies becomes (Wu et al., 2019; Wu and Rasmussen, 2021). In many dialogues, a long dialogue segment is an invalid exploration due to the effect of dead-ends. ...

Improving Sample-Efficiency in Reinforcement Learning for Dialogue Systems by Using Trainable-Action-Mask
  • Citing Conference Paper
  • May 2020

... Apart from the LSTM model, Wu et al. [37] demonstrated that the encoder of the Transformer architecture also offers strong long-term prediction capabilities, especially for large datasets, in capturing nonlinear flame responses. Additionally, Sengupta et al. [38,39] utilized a Bayesian neural network model to predict the amplitude of dynamic pressure time series based on physical parameters such as pressure, temperature, heat release rate, and future flow control signals. To address the issue of insufficient data for ML, Xu et al. [40] used generative adversarial networks (GANs) to generate synthetic data, facilitating the training of the prediction model. ...

Bayesian Machine Learning for the Prognosis of Combustion Instabilities From Noise

Journal of Engineering for Gas Turbines and Power

... The point estimates, which are the result of the previously described approaches, cannot capture uncertainty, which plays a key role in guiding the exploration in BO. Therefore, another approach is frequently proposed in the literature, namely the Fully Bayesian Approach or Fully Bayesian Gaussian Process Regression [32]. The main idea is to compute an integrated acquisition function â, which is marginalized over all possible values of the hyperparameters ...

Approximate Inference for Fully Bayesian Gaussian Process Regression
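For clarity, the integrated acquisition function this snippet refers to is conventionally written as below (a standard formulation; the cited work's exact notation may differ):

```latex
\hat{a}(\mathbf{x})
  = \int a(\mathbf{x}; \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, \mathrm{d}\boldsymbol{\theta}
  \approx \frac{1}{S} \sum_{s=1}^{S} a\big(\mathbf{x}; \boldsymbol{\theta}^{(s)}\big),
  \qquad \boldsymbol{\theta}^{(s)} \sim p(\boldsymbol{\theta} \mid \mathcal{D}).
```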