Preprint

Transformation Models for Flexible Posteriors in Variational Bayes


Abstract

The main challenge in Bayesian models is to determine the posterior of the model parameters. Even in models with only one or a few parameters, the analytical posterior can be determined only in special settings. In Bayesian neural networks, variational inference is widely used to approximate difficult-to-compute posteriors by variational distributions. Usually, Gaussians are used as variational distributions (Gaussian-VI), which limits the quality of the approximation because of their limited flexibility. Transformation models, on the other hand, are flexible enough to fit any distribution. Here we present transformation-model-based variational inference (TM-VI) and demonstrate that it accurately approximates complex posteriors in models with a single parameter and that it also works in a mean-field fashion for multi-parameter models such as neural networks.
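To make the construction concrete, the following is a minimal, hypothetical sketch of TM-VI for a single parameter w (our own notation and toy setup, not the authors' code): a standard-normal variable z is pushed through a monotone Bernstein-polynomial transformation g to define the variational density q(w) via the change-of-variables formula, and the evidence lower bound (ELBO) is maximized with stochastic gradients.

```python
# Hypothetical TM-VI sketch for one parameter w. Assumptions made only for
# illustration: Bernstein order M = 10, a toy Gaussian likelihood, a N(0, 10) prior.
import torch

torch.manual_seed(0)
data = 1.0 + 0.5 * torch.randn(50)                  # toy observations
M = 10
theta = torch.zeros(M + 1, requires_grad=True)      # unconstrained Bernstein coefficients

def transform(z):
    """Monotone map w = g(z): squash z into (0, 1), then apply a Bernstein
    polynomial whose coefficients are made increasing via cumulative softplus."""
    x = torch.sigmoid(z).unsqueeze(-1)                               # shape (S, 1)
    k = torch.arange(M + 1, dtype=z.dtype)
    log_binom = (torch.lgamma(torch.tensor(M + 1.0))
                 - torch.lgamma(k + 1.0) - torch.lgamma(M - k + 1.0))
    basis = torch.exp(log_binom) * x ** k * (1.0 - x) ** (M - k)     # shape (S, M + 1)
    coef = torch.cumsum(
        torch.cat([theta[:1], torch.nn.functional.softplus(theta[1:])]), dim=0)
    return basis @ coef                                              # shape (S,)

def neg_elbo(n_samples=64):
    z = torch.randn(n_samples, requires_grad=True)
    w = transform(z)
    dwdz, = torch.autograd.grad(w.sum(), z, create_graph=True)       # elementwise dg/dz
    log_q = (torch.distributions.Normal(0.0, 1.0).log_prob(z)
             - torch.log(dwdz.abs() + 1e-12))                        # change of variables
    log_lik = torch.distributions.Normal(w.unsqueeze(-1), 0.5).log_prob(data).sum(-1)
    log_prior = torch.distributions.Normal(0.0, 10.0).log_prob(w)
    return -(log_lik + log_prior - log_q).mean()

opt = torch.optim.Adam([theta], lr=0.01)
for _ in range(2000):
    opt.zero_grad()
    neg_elbo().backward()
    opt.step()
```

For multi-parameter models, the mean-field variant described in the abstract would repeat this construction with one independent transformation per parameter.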

References
Article
Full-text available
Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review current state-of-the-art literature, and identify open questions and promising future directions.
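For readers new to the idea, a tiny illustration of the two operations the survey emphasizes, exact sampling and exact density evaluation, is given below; a single affine map stands in for a learned flow, and PyTorch's built-in distribution utilities do the bookkeeping.

```python
# One-step "flow": sampling pushes base samples through the transform, and
# log_prob applies the change-of-variables formula (base log-density minus
# the log|det Jacobian| of the transform).
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import AffineTransform

base = Normal(0.0, 1.0)
flow = TransformedDistribution(base, [AffineTransform(loc=2.0, scale=0.5)])

x = flow.sample((5,))       # exact, cheap sampling
logp = flow.log_prob(x)     # exact, cheap density evaluation
print(x, logp)
```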
Article
Full-text available
Background: Sum scores of ordinal outcomes are common in randomized clinical trials. The approaches routinely employed for assessing treatment effects, such as t-tests or Wilcoxon tests, are not particularly powerful in detecting changes in relevant parameters or lack the ability to incorporate baseline information. Hence, tailored statistical methods are needed for the analysis of ordinal outcomes in clinical research. Methods: We propose baseline-adjusted proportional odds logistic regression models to overcome previous limitations in the analysis of ordinal outcomes in randomized clinical trials. For the validation of our method, we focus on common ordinal sum score outcomes of neurological clinical trials such as the upper extremity motor score, the spinal cord independence measure, and the self-care subscore of the latter. We compare the statistical power of our models to other conventional approaches in a large simulation study of two-arm randomized clinical trials based on data from the European Multicenter Study about Spinal Cord Injury (EMSCI, ClinicalTrials.gov Identifier: NCT01571531). We also use the new method as an alternative analysis of the historical Sygen® clinical trial. Results: The simulation study of all postulated trial settings demonstrated that the statistical power of the novel method was greater than that of conventional methods. Baseline adjustments were more suited for the analysis of the upper extremity motor score compared to the spinal cord independence measure and its self-care subscore. Conclusions: The proposed baseline-adjusted proportional odds models allow the global treatment effect to be directly interpreted. This clear interpretation, the superior statistical power compared to the conventional analysis approaches, and the availability of open-source software support the application of this novel method for the analysis of ordinal outcomes of future clinical trials.
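As a sketch of the model class being validated (cut points and coefficients below are hypothetical, chosen purely for illustration), a baseline-adjusted proportional odds model specifies P(Y ≤ k | treatment, baseline) = logit⁻¹(α_k − η) with a single linear predictor η shared across all cut points:

```python
# Hypothetical baseline-adjusted proportional odds model for a 4-category ordinal outcome.
import numpy as np

alpha = np.array([-1.0, 0.5, 2.0])       # ordered cut points alpha_1 < alpha_2 < alpha_3
beta_treat, beta_base = 0.8, 0.05        # treatment effect and baseline adjustment (made up)

def category_probs(treat, baseline):
    eta = beta_treat * treat + beta_base * baseline
    cum = 1.0 / (1.0 + np.exp(-(alpha - eta)))            # P(Y <= k), k = 1, 2, 3
    return np.diff(np.concatenate(([0.0], cum, [1.0])))   # P(Y = k), k = 1, ..., 4

print(category_probs(treat=1, baseline=10.0))
```

The single coefficient beta_treat is what gives the global treatment effect its direct interpretation as a common odds ratio across all cut points.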
Article
Full-text available
We propose and study properties of maximum likelihood estimators in the class of conditional transformation models. Based on a suitable explicit parameterisation of the unconditional or conditional transformation function, we establish a cascade of increasingly complex transformation models that can be estimated, compared and analysed in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable can be set up and estimated in the same theoretical and computational framework simply by choosing an appropriate transformation function and parameterisation thereof. The ability to evaluate the distribution function directly allows us to estimate models based on the exact full likelihood, especially in the presence of random censoring or truncation. For discrete and continuous responses, we establish the asymptotic normality of the proposed estimators. A reference software implementation of maximum likelihood-based estimation for conditional transformation models allowing the same flexibility as the theory developed here was employed to illustrate the wide range of possible applications.
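The computational point emphasized above, that evaluating the distribution function directly lets censored observations enter the exact likelihood, can be sketched as follows (a toy linear transformation h(y) = a·y + b with a standard-normal reference distribution; this is our illustration, not the reference software implementation mentioned in the abstract):

```python
# Exact likelihood contributions in a transformation model F_Y(y) = F_Z(h(y)).
import numpy as np
from scipy.stats import norm

a, b = 1.2, -0.5      # toy parameters of h(y) = a * y + b (a > 0 keeps h monotone)

def loglik_exact(y):
    """Continuous, uncensored response: log f_Z(h(y)) + log h'(y)."""
    return norm.logpdf(a * y + b) + np.log(a)

def loglik_censored(lo, hi):
    """Interval- or right-censored response: log[F_Z(h(hi)) - F_Z(h(lo))]."""
    return np.log(norm.cdf(a * hi + b) - norm.cdf(a * lo + b))

print(loglik_exact(0.7), loglik_censored(0.7, np.inf))
```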
Article
Full-text available
This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simplified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally, we return to the examples and demonstrate how variational algorithms can be formulated in each case.
Article
Full-text available
The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances — including the key problems of computing marginals and modes of probability distributions — are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms — among them sum-product, cluster variational methods, expectation-propagation, mean field methods, max-product and linear programming relaxation, as well as conic programming relaxations — can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.
Article
We propose a novel Bayesian model framework for discrete ordinal and count data based on conditional transformations of the responses. The conditional transformation function is estimated from the data in conjunction with an a priori chosen reference distribution. For count responses, the resulting transformation model is novel in the sense that it is a Bayesian fully parametric yet distribution-free approach that can additionally account for excess zeros with additive transformation function specifications. For ordinal categorical responses, our cumulative link transformation model allows the inclusion of linear and non-linear covariate effects that can additionally be made category-specific, resulting in (non-)proportional odds or hazards models and more, depending on the choice of the reference distribution. Inference is conducted by a generic modular Markov chain Monte Carlo algorithm where multivariate Gaussian priors enforce specific properties such as smoothness on the functional effects. To illustrate the versatility of Bayesian discrete conditional transformation models, applications to counts of patent citations in the presence of excess zeros and to forest health categories in a discrete partial proportional odds model are presented.
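A minimal numerical sketch of the discrete case (transformation values and the standard-normal reference distribution below are our choices for illustration): the probability mass of each count follows from differencing the reference CDF at the transformed grid points.

```python
# Discrete (count) transformation model: P(Y = y) = F_Z(h(y)) - F_Z(h(y - 1)),
# with F_Z(h(-1)) := 0 and h increasing on the count grid.
import numpy as np
from scipy.stats import norm

h = np.array([-1.5, -0.4, 0.3, 0.9, 1.6])            # toy values h(0), ..., h(4)
pmf = np.diff(np.concatenate(([0.0], norm.cdf(h))))  # P(Y = 0), ..., P(Y = 4)
print(pmf, 1.0 - norm.cdf(h[-1]))                    # remaining mass for Y > 4
```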
Article
Outcomes with a natural order commonly occur in prediction problems and often the available input data are a mixture of complex data like images and tabular predictors. Deep Learning (DL) models are state-of-the-art for image classification tasks but frequently treat ordinal outcomes as unordered and lack interpretability. In contrast, classical ordinal regression models consider the outcome’s order and yield interpretable predictor effects but are limited to tabular data. We present ordinal neural network transformation models (ontrams), which unite DL with classical ordinal regression approaches. ontrams are a special case of transformation models and trade off flexibility and interpretability by additively decomposing the transformation function into terms for image and tabular data using jointly trained neural networks. The performance of the most flexible ontram is by definition equivalent to that of a standard multi-class DL model trained with cross-entropy while being faster in training when facing ordinal outcomes. Lastly, we discuss how to interpret model components for both tabular and image data on two publicly available datasets.
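A hedged sketch of the additive decomposition described above (module names, sizes, and the standard-logistic reference distribution are our assumptions, not the published ontram code): one network produces ordered intercepts from the image, another a scalar shift from the tabular predictors, and class probabilities follow by differencing the reference CDF.

```python
# ontram-style additive transformation for K ordinal classes:
# h(y_k | image, x) = cutpoints_k(image) - shift(x),  P(Y <= y_k) = sigmoid(h).
import torch
import torch.nn as nn

K = 5
img_net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, K - 1))  # stand-in for a CNN
tab_net = nn.Linear(3, 1, bias=False)                             # linear shift for tabular data

def class_probs(img, x):
    raw = img_net(img)                                            # (batch, K - 1)
    first, deltas = raw[:, :1], nn.functional.softplus(raw[:, 1:])
    cuts = torch.cat([first, first + torch.cumsum(deltas, dim=1)], dim=1)  # increasing cut points
    cdf = torch.sigmoid(cuts - tab_net(x))                        # standard-logistic reference
    cdf = torch.cat([torch.zeros_like(cdf[:, :1]), cdf, torch.ones_like(cdf[:, :1])], dim=1)
    return cdf[:, 1:] - cdf[:, :-1]                               # P(Y = y_k), rows sum to 1

probs = class_probs(torch.randn(4, 1, 28, 28), torch.randn(4, 3))
```

Training would minimize the negative log-probability of the observed class, which is exactly the cross-entropy objective mentioned in the abstract.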
Article
One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms.
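For reference, the objective this review builds on can be stated compactly in standard notation (our summary, not a quotation): maximizing the evidence lower bound is equivalent to minimizing the KL divergence from the variational distribution to the posterior, because the log marginal likelihood is fixed.

```latex
\operatorname{ELBO}(q) = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)],
\qquad
\log p(x) = \operatorname{ELBO}(q) + \operatorname{KL}\bigl(q(z) \,\|\, p(z \mid x)\bigr).
```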
Article
One hundred years after the introduction of the Bernstein polynomial basis, we survey the historical development and current state of theory, algorithms, and applications associated with this remarkable method of representing polynomials over finite domains. Originally introduced by Sergei Natanovich Bernstein to facilitate a constructive proof of the Weierstrass approximation theorem, the leisurely convergence rate of Bernstein polynomial approximations to continuous functions caused them to languish in obscurity, pending the advent of digital computers. With the desire to exploit the power of computers for geometric design applications, however, the Bernstein form began to enjoy widespread use as a versatile means of intuitively constructing and manipulating geometric shapes, spurring further development of basic theory, simple and efficient recursive algorithms, recognition of its excellent numerical stability properties, and an increasing diversification of its repertoire of applications. This survey provides a brief historical perspective on the evolution of the Bernstein polynomial basis, and a synopsis of the current state of associated algorithms and applications.
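Since Bernstein polynomials also underlie the transformation functions used elsewhere on this page, here is a short, self-contained example of the kind of simple recursive algorithm the survey refers to, the de Casteljau recursion (coefficients are arbitrary):

```python
# Evaluate sum_k c_k * C(n, k) * x^k * (1 - x)^(n - k) on [0, 1] with the
# numerically stable de Casteljau recursion (no explicit binomial coefficients).
import numpy as np

def de_casteljau(coeffs, x):
    b = np.asarray(coeffs, dtype=float)
    for _ in range(len(b) - 1):
        b = (1.0 - x) * b[:-1] + x * b[1:]
    return float(b[0])

print(de_casteljau([0.0, 0.2, 0.9, 1.0], x=0.35))
```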
Article
The ultimate goal of regression analysis is to obtain information about the conditional distribution of a response given a set of explanatory variables. This goal is, however, seldom achieved because most established regression models only estimate the conditional mean as a function of the explanatory variables and assume that higher moments are not affected by the regressors. The underlying reason for such a restriction is the assumption of additivity of signal and noise. We propose to relax this common assumption in the framework of transformation models. The novel class of semiparametric regression models proposed herein allows transformation functions to depend on explanatory variables. These transformation functions are estimated by regularised optimisation of scoring rules for probabilistic forecasts, e.g. the continuous ranked probability score. The corresponding estimated conditional distribution functions are consistent. Conditional transformation models are potentially useful for describing possible heteroscedasticity, comparing spatially varying distributions, identifying extreme events, deriving prediction intervals and selecting variables beyond mean regression effects. An empirical investigation based on a heteroscedastic varying coefficient simulation model demonstrates that semiparametric estimation of conditional distribution functions can be more beneficial than kernel-based non-parametric approaches or parametric generalised additive models for location, scale and shape.
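To make the scoring-rule objective concrete, below is a small check of the continuous ranked probability score (CRPS) for a Gaussian predictive distribution, using its well-known closed form; this is our toy example, whereas the article itself scores far more general conditional transformation functions.

```python
# Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) for an observation y:
# CRPS = sigma * ( z * (2 * Phi(z) - 1) + 2 * phi(z) - 1 / sqrt(pi) ),  z = (y - mu) / sigma.
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

print(crps_gaussian(y=1.3, mu=0.0, sigma=1.0))   # lower is better
```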
Deep conditional transformation models
  • P F Baumann
  • T Hothorn
  • D Rügamer
Baumann, P. F., Hothorn, T., and Rügamer, D. Deep conditional transformation models. arXiv preprint arXiv:2010.07860, 2020.
Weight uncertainty in neural network
  • C Blundell
  • J Cornebise
  • K Kavukcuoglu
  • D Wierstra
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613-1622. PMLR, 2015.
NICE: Non-linear independent components estimation
  • L Dinh
  • D Krueger
  • Y Bengio
Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Liberty or depth: Deep Bayesian neural nets do not need complex weight posterior approximations
  • S Farquhar
  • L Smith
  • Y Gal
Farquhar, S., Smith, L., and Gal, Y. Liberty or depth: Deep Bayesian neural nets do not need complex weight posterior approximations. arXiv e-prints, 2020.
Multiplicative normalizing flows for variational Bayesian neural networks
  • C Louizos
  • M Welling
Louizos, C. and Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pp. 2218-2227. PMLR, 2017.
Robust normalizing flows using Bernstein-type polynomials
  • S Ramasinghe
  • K Fernando
  • S Khan
  • N Barnes
Ramasinghe, S., Fernando, K., Khan, S., and Barnes, N. Robust normalizing flows using Bernstein-type polynomials. arXiv preprint arXiv:2102.03509, 2021.
Variational inference with normalizing flows
  • D Rezende
  • S Mohamed
Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530-1538. PMLR, 2015.
Stacking for nonmixing Bayesian computations: The curse and blessing of multimodal posteriors
  • Y Yao
  • A Vehtari
  • A Gelman
Yao, Y., Vehtari, A., and Gelman, A. Stacking for nonmixing Bayesian computations: The curse and blessing of multimodal posteriors. arXiv preprint arXiv:2006.12335, 2020.