Figure 3
Results on the real datasets; bars show medians, points show each of 15 individual runs, excluding invalid values. (1st row) The estimate of the squared FSSD, a measure of model goodness of fit based on derivatives of the log density; lower is better. (2nd row) The p-value of a test that each model is no better than DKEF in terms of the FSSD; values near 0 indicate that DKEF fits the data significantly better than the other model. (3rd row) Value of the loss (4); lower is better. (4th row) Log-likelihoods; higher is better. DKEF estimates are based on 10^10 samples for Ẑθ, with vertical lines showing the upper bound on the bias from Proposition 1 (which is often too small to be easily visible).
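For readers unfamiliar with the first-row statistic, the following is a minimal NumPy sketch of a plug-in FSSD²-style estimate with a Gaussian kernel; the score function, bandwidth, test locations, and toy data are placeholders, and the paper's experiments use the unbiased FSSD estimator of Jitkrittum et al. rather than this simplified version.

```python
import numpy as np

def gaussian_kernel_and_grad(X, v, sigma):
    """k(x, v) = exp(-||x - v||^2 / (2 sigma^2)) and its gradient with respect to x."""
    diff = X - v                                                # (n, d)
    k = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))   # (n,)
    grad_k = -(diff / sigma ** 2) * k[:, None]                  # (n, d)
    return k, grad_k

def fssd2_plugin(X, score_fn, test_locs, sigma=1.0):
    """Plug-in estimate of FSSD^2: mean squared average of the Stein witness
    xi(x, v) = score(x) * k(x, v) + grad_x k(x, v), over test locations v."""
    n, d = X.shape
    S = score_fn(X)                                             # (n, d) model scores d/dx log p(x)
    total = 0.0
    for v in test_locs:
        k, grad_k = gaussian_kernel_and_grad(X, v, sigma)
        xi = S * k[:, None] + grad_k                            # (n, d)
        total += np.sum(xi.mean(axis=0) ** 2)
    return total / (d * len(test_locs))

# toy check: data drawn from the model itself gives a value near zero
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
score_fn = lambda X: -X                                         # score of N(0, I)
test_locs = rng.standard_normal((5, 2))
print(fssd2_plugin(X, score_fn, test_locs))
```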

Source publication
Conference Paper
Full-text available
The kernel exponential family is a rich class of distributions, which can be fit efficiently and with statistical guarantees by score matching. Being required to choose a priori a simple kernel such as the Gaussian, however, limits its practical applicability. We provide a scheme for learning a kernel parameterized by a deep network, which can find...
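The abstract's "fit efficiently ... by score matching" refers to the Hyvärinen score-matching objective; as a hedged illustration (not the paper's closed-form kernel solution), a generic autograd sketch in PyTorch with a toy Gaussian energy might look like this:

```python
import torch

def score_matching_loss(log_q, x):
    """Hyvarinen score-matching objective for an unnormalised log-density log_q:
    E[ trace(Hessian of log q) + 0.5 * ||grad log q||^2 ]."""
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(log_q(x).sum(), x, create_graph=True)[0]   # (n, d)
    loss = 0.5 * (grad ** 2).sum(dim=1)
    for i in range(x.shape[1]):
        # i-th diagonal entry of the Hessian of log q at each point
        hess_ii = torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
        loss = loss + hess_ii
    return loss.mean()

# toy usage: learn the precision of a Gaussian energy  log q(x) = -0.5 * a * ||x||^2
a = torch.tensor(0.3, requires_grad=True)
log_q = lambda x: -0.5 * a * (x ** 2).sum(dim=1)
x = torch.randn(1000, 2)
opt = torch.optim.Adam([a], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    score_matching_loss(log_q, x).backward()
    opt.step()
print(a)   # approaches 1 for standard-normal data
```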

Contexts in source publication

Context 1
... E.2 gives further details. Figure 3 shows results. In gradient matching as measured by the FSSD, DKEF tends to have the best values. ...

Similar publications

Article
Full-text available
Probabilistic latent component analysis (PLCA) is applied to the problem of gearbox vibration source separation. A model for the probability distribution of gearbox vibration employs a latent variable intended to correspond to a particular vibration source, with the measured vibration at a particular sensor for each source the product of a marginal...
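The gearbox-specific latent-variable model is particular to that article, but the underlying PLCA decomposition is standard; a minimal NumPy EM sketch for the two-dimensional case (shapes, initialisation, and iteration count are illustrative) is:

```python
import numpy as np

def plca(V, n_components, n_iters=100, seed=0):
    """Basic 2-D PLCA by EM: factor a nonnegative array V(f, t) as
    sum_z P(z) P(f|z) P(t|z)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Pf = rng.random((F, n_components)); Pf /= Pf.sum(axis=0)
    Pt = rng.random((T, n_components)); Pt /= Pt.sum(axis=0)
    Pz = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iters):
        # E-step: posterior P(z | f, t), normalised over z
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt[None, :, :]   # (F, T, Z)
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: reweight the posterior by the observed magnitudes V
        weighted = V[:, :, None] * post                               # (F, T, Z)
        Pf = weighted.sum(axis=1); Pf /= Pf.sum(axis=0, keepdims=True)
        Pt = weighted.sum(axis=0); Pt /= Pt.sum(axis=0, keepdims=True)
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()
    return Pz, Pf, Pt

# usage sketch (input is a nonnegative magnitude array such as a spectrogram):
# Pz, Pf, Pt = plca(np.abs(spectrogram), n_components=3)
```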

Citations

... Score-matching methods and their variants, first explored in [5,17,18], have faced challenges in EBM training, particularly in addressing the presence of multiple, imbalanced modes in the target distribution. These shortcomings are examined in [19,20,21]. Contrastive divergence, introduced in [6, 7, 8], is a widely used technique for approximating the gradient of the log-likelihood. ...
... The integration of the exact SDE (20) in principle produces samples from ρ(t, x) at each time, but numerical error and the arbitrary noise scale ε can be an issue here. We show how, using Jarzynski weights, one can better control these two hyperparameters. ...
... We can use (20) and ...
Preprint
Full-text available
Energy-Based Models (EBMs) provide a flexible framework for generative modeling, but their training remains theoretically challenging due to the need to approximate normalization constants and efficiently sample from complex, multi-modal distributions. Traditional methods, such as contrastive divergence and score matching, introduce biases that can hinder accurate learning. In this work, we present a theoretical analysis of Jarzynski reweighting, a technique from non-equilibrium statistical mechanics, and its implications for training EBMs. We focus on the role of the choice of the kernel and we illustrate these theoretical considerations in two key generative frameworks: (i) flow-based diffusion models, where we reinterpret Jarzynski reweighting in the context of stochastic interpolants to mitigate discretization errors and improve sample quality, and (ii) Restricted Boltzmann Machines, where we analyze its role in correcting the biases of contrastive divergence. Our results provide insights into the interplay between kernel choice and model performance, highlighting the potential of Jarzynski reweighting as a principled tool for generative learning.
... In this paper, we use a neural network as the default score model. Certainly, other models, such as deep kernel exponential families [56], could also be used to introduce a smoothing prior. Additionally, we adopt the slicing technique for score matching. ...
Preprint
Determining conditional independence (CI) relationships between random variables is a fundamental yet challenging task in machine learning and statistics, especially in high-dimensional settings. Existing generative model-based CI testing methods, such as those utilizing generative adversarial networks (GANs), often struggle with undesirable modeling of conditional distributions and training instability, resulting in subpar performance. To address these issues, we propose a novel CI testing method via score-based generative modeling, which achieves precise Type I error control and strong testing power. Concretely, we first employ a sliced conditional score matching scheme to accurately estimate conditional score and use Langevin dynamics conditional sampling to generate null hypothesis samples, ensuring precise Type I error control. Then, we incorporate a goodness-of-fit stage into the method to verify generated samples and enhance interpretability in practice. We theoretically establish the error bound of conditional distributions modeled by score-based generative models and prove the validity of our CI tests. Extensive experiments on both synthetic and real-world datasets show that our method significantly outperforms existing state-of-the-art methods, providing a promising way to revitalize generative model-based CI testing.
... Recently, energy-based models (EBMs) [1,2] have drawn some attention in the text-to-speech (TTS) area [3]. These probabilistic models define the log-likelihood of speech given text as the difference between the negative (unnormalised) energy function and the logarithm of the corresponding normalisation term. ...
Preprint
Full-text available
Noise contrastive estimation (NCE) is a popular method for training energy-based models (EBM) with intractable normalisation terms. The key idea of NCE is to learn by comparing unnormalised log-likelihoods of the reference and noisy samples, thus avoiding explicitly computing normalisation terms. However, NCE critically relies on the quality of noisy samples. Recently, sliced score matching (SSM) has been popularised by the closely related diffusion models (DM). Unlike NCE, SSM learns a gradient of the log-likelihood, or score, by learning the distribution of its projections on randomly chosen directions. However, both NCE and SSM disregard the form of the log-likelihood function, which is problematic given that EBMs and DMs make use of first-order optimisation during inference. This paper proposes a new criterion that learns scores more suitable for first-order schemes. Experiments contrast these approaches for training EBMs.
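Since the abstract contrasts NCE with sliced score matching, a generic PyTorch sketch of the standard SSM objective may help fix ideas; this is not the preprint's proposed criterion, and the MLP score model and batch below are placeholders.

```python
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """Sliced score matching: for random directions v,
    E[ v^T d/dx (v^T s(x)) + 0.5 * (v^T s(x))^2 ]."""
    x = x.clone().requires_grad_(True)
    loss = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)
        s = score_net(x)                             # (n, d) estimated scores
        sv = (s * v).sum()                           # sum over the batch of v^T s(x)
        grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
        loss = loss + ((grad_sv * v).sum(dim=1) + 0.5 * (s * v).sum(dim=1) ** 2).mean()
    return loss / n_projections

# toy usage with a small MLP score model (architecture is illustrative)
score_net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))
x = torch.randn(256, 2)
print(sliced_score_matching_loss(score_net, x))
```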
... Energy-based models (EBMs) (Murphy, 2012; LeCun et al., 2006; Ngiam et al., 2011), which provide a flexible way of parameterization, are widely used in data modeling such as sensitive estimation (Nguyen & Reiter, 2015; Wenliang et al., 2019; Jiang & Xiao, 2021), structured prediction (Belanger & McCallum, 2016; He & Jiang, 2020; Pan et al., 2020), and anomaly detection (Zhai et al., 2016). They have recently achieved successes as generative models for data of different modalities, such as images (Du & Mordatch, 2019; Vahdat & Kautz, 2020), texts (Deng et al., 2020), graphs (Liu et al., 2021; Chen et al., 2022; Xu et al., 2022) and point clouds (Xie et al., 2020; Luo & Hu, 2021), thanks to the advent of new training methods of score matching (Song & Ermon, 2019; Song et al., 2020b; Ho et al., 2020). ...
Preprint
Full-text available
With the advent of score-matching techniques for model training and Langevin dynamics for sample generation, energy-based models (EBMs) have gained renewed interest as generative models. Recent EBMs usually use neural networks to define their energy functions. In this work, we introduce a novel hybrid approach that combines an EBM with an exponential family model to incorporate inductive bias into data modeling. Specifically, we augment the energy term with a parameter-free statistic function to help the model capture key data statistics. Like an exponential family model, the hybrid model aims to align the distribution statistics with data statistics during model training, even when it only approximately maximizes the data likelihood. This property enables us to impose constraints on the hybrid model. Our empirical study validates the hybrid model's ability to match statistics. Furthermore, experimental results show that data fitting and generation improve when suitable informative statistics are incorporated into the hybrid model.
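As a purely hypothetical reading of "augment the energy term with a parameter-free statistic function", the hybrid unnormalised log-density could combine a neural energy with an exponential-family term; the class below is a sketch under that assumption, not the paper's actual model, and `HybridEnergy`, `stat_fn`, and the moment statistics are invented for illustration.

```python
import torch

class HybridEnergy(torch.nn.Module):
    """Hypothetical hybrid: a neural energy f_theta(x) plus an exponential-family
    term with fixed statistics T(x) and learnable natural parameters lam."""
    def __init__(self, net, stat_fn, n_stats):
        super().__init__()
        self.net = net                       # neural energy f_theta
        self.stat_fn = stat_fn               # parameter-free statistics T(x)
        self.lam = torch.nn.Parameter(torch.zeros(n_stats))

    def forward(self, x):
        # unnormalised log-density: -f_theta(x) + lam^T T(x)
        return -self.net(x).squeeze(-1) + self.stat_fn(x) @ self.lam

# hypothetical usage: first and second moments as the fixed statistics
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Softplus(), torch.nn.Linear(64, 1))
stat_fn = lambda x: torch.cat([x, x ** 2], dim=1)
model = HybridEnergy(net, stat_fn, n_stats=4)
print(model(torch.randn(8, 2)).shape)        # torch.Size([8])
```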
... Gaussian processes and tilted distributions have been shown to work well for nonparametric density estimation problems [21,22,23]. Kernel-based exponential family ideas have also been introduced for score matching, leading to iterative learning algorithms and deep representations [24,25,26,27]. In contrast with these methods, our algorithms will have closed form solutions. ...
... List of Expectations (also see Eqs. 18, 19, 24–26) ...
Preprint
Full-text available
We present three Fisher divergence (FD) minimization algorithms for learning Gaussian process (GP) based score models for lower dimensional density estimation problems. The density is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that all learning problems can be solved in closed form. This includes the basic and noise conditional versions of the Fisher divergence, as well as a novel alternative to noise conditional FD models based on variational inference (VI). Here, we propose using an ELBO-like optimization of the approximate posterior with which we derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single layer neural network score model with cosine activation, provides a unique linear form for which all expectations are in closed form. The Gaussian base also helps with tractability of the VI approximation. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative algorithms, making this technique particularly well-suited to large data sets.
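The GP-tilted score model itself is specific to this preprint, but the random Fourier feature representation it builds on is standard (Rahimi & Recht); a minimal NumPy sketch, with lengthscale and feature count as placeholders, is:

```python
import numpy as np

def rff_features(X, n_features, lengthscale, seed=0):
    """Random Fourier features phi(x) whose inner products approximate the Gaussian
    kernel exp(-||x - y||^2 / (2 lengthscale^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features)) / lengthscale
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# sanity check: feature inner products approximate the kernel matrix
X = np.random.default_rng(1).standard_normal((5, 3))
Phi = rff_features(X, n_features=5000, lengthscale=1.0)
approx = Phi @ Phi.T
exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
print(np.max(np.abs(approx - exact)))   # small
```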
... Score matching is a best-in-class algorithm for learning local features of the density; however, it is known to struggle to model a distribution with multiple modes that are separated by a large region of low probability [12, 41]. Because score matching is based only on the gradient of the log-density, all modes have to be connected by regions of positive density to correctly determine the relative weights of the multiple modes. However, for a distant system, the target density is nearly zero between two modes, leading to no or few samples in between. ...
Article
Full-text available
Variational ab initio methods in quantum chemistry stand out among other methods in providing direct access to the wave function. This allows, in principle, straightforward extraction of any other observable of interest, besides the energy, but, in practice, this extraction is often technically difficult and computationally impractical. Here, we consider the electron density as a central observable in quantum chemistry and introduce a novel method to obtain accurate densities from real-space many-electron wave functions by representing the density with a neural network that captures known asymptotic properties and is trained from the wave function by score matching and noise-contrastive estimation. We use variational quantum Monte Carlo with deep-learning Ansätze to obtain highly accurate wave functions free of basis set errors and from them, using our novel method, correspondingly accurate electron densities, which we demonstrate by calculating dipole moments, nuclear forces, contact densities, and other density-based properties.
... In this section, we compare the performance of our geometric optics approximation sampler (GOAS) with that of several existing MCMC methods, including the Metropolis-Hastings (MH) algorithm [52], the slice sampler [44], Hamiltonian Monte Carlo (HMC) [16,43] and the Metropolis-Adjusted Langevin Algorithm (MALA) [53]. For a comprehensive comparison, we consider models based on several two-dimensional synthetic datasets: Funnel, Banana, Mixture of Gaussians (MoG), Ring, and Cosine [29,66]. Together, these non-Gaussian distributions encompass a range of geometric complexities and multimodality. ...
Preprint
Full-text available
In this article, we propose a new dimensionality-independent and gradient-free sampler, called Geometric Optics Approximation Sampling, which is based on the reflector antenna system. The core idea is to construct a reflecting surface that redirects rays from a source with a predetermined simpler measure towards an output domain while achieving a desired distribution defined by the projection of a complex target measure of interest. Given such a reflecting surface, one can generate arbitrarily many independent and uncorrelated samples from the target measure simply by dual re-simulation or ray tracing of the reflector antenna system and then projecting the traced rays onto the target domain. To obtain the desired reflecting surface, we use the supporting-paraboloid method to solve the reflector antenna problem, which does not require gradient information about the density of the target measure. Furthermore, within the supporting-paraboloid method, we utilize a low-discrepancy sequence or a random sequence to discretize the target measure, which in turn yields a dimensionality-independent approach for constructing the reflecting surface. Meanwhile, we present a dual re-simulation or ray tracing method based on the dual reflecting surface, which enables drawing samples from the target measure using the reflector antenna system obtained through the dimensionality-independent method. Several examples and numerical experiments, comparing against measure transport samplers as well as traditional Markov chain Monte Carlo simulations, are provided to demonstrate the efficiency and applicability of our geometric optics approximation sampling, especially in the context of Bayesian inverse problems. Additionally, these numerical results confirm the theoretical findings.
... The idea of data-based initialization has appeared in the empirical machine learning literature in many different forms, for example as a part of the mechanics of "contrastive divergence" training for energy-based methods and other approximations to Maximum Likelihood Estimation. For a few related references in the empirical literature, see [Hin02; Xie+16; Gao+18; Nij+19; Nij+20; Wen+19], and see also [KV23] for more discussion. In particular, the terminology of "data-based initialization" is as used in [Nij+20]. ...
Preprint
We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a kth order spectral gap, initialization from a set of Õ(k/ε²) samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is ε-close in TV distance to the stationary measure. In particular, this applies to mixtures of k distributions satisfying a Poincaré inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over ℝ^d with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on k and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.
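As background for "data-based initialization", a minimal NumPy sketch of unadjusted Langevin dynamics started from samples of the stationary measure is given below; the two-mode Gaussian mixture, step size, and score function are illustrative and not taken from the paper.

```python
import numpy as np

def langevin_from_data(score_fn, data_samples, n_steps=1000, step=1e-3, seed=0):
    """Unadjusted Langevin dynamics initialised at samples from the data
    distribution ('data-based initialization'), driven by an estimated score."""
    rng = np.random.default_rng(seed)
    x = np.array(data_samples, dtype=float, copy=True)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x

# toy usage: score of an equal-weight two-component Gaussian mixture at +/- mu
mu = 4.0
def mixture_score(x):
    # d/dx log( 0.5 N(x; -mu, 1) + 0.5 N(x; mu, 1) ) = mu * tanh(mu * x) - x
    return mu * np.tanh(mu * x) - x

init = np.array([[-mu], [mu]])            # a few samples from the stationary measure
print(langevin_from_data(mixture_score, np.repeat(init, 50, axis=0)).mean())
```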
... Other works focusing on kernel machines in a deep learning context are deep Gaussian processes [4] and deep kernel learning [33,32]. Deep GPs concatenate multiple layers of kernelised GP operations; however, they are Bayesian, non-invertible models for prediction tasks instead of density estimation and involve high computational complexity due to operations that require inverting a kernel matrix. ...
Preprint
Normalising Flows are generative models characterised by their invertible architecture. However, the requirement of invertibility imposes constraints on their expressiveness, necessitating a large number of parameters and innovative architectural designs to achieve satisfactory outcomes. Whilst flow-based models predominantly rely on neural-network-based transformations for expressive designs, alternative transformation methods have received limited attention. In this work, we present Ferumal flow, a novel kernelised normalising flow paradigm that integrates kernels into the framework. Our results demonstrate that a kernelised flow can yield competitive or superior results compared to neural network-based flows whilst maintaining parameter efficiency. Kernelised flows excel especially in the low-data regime, enabling flexible non-parametric density estimation in applications with sparse data availability.
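For context on the change-of-variables computation that any normalising flow (kernelised or neural) relies on, here is a minimal PyTorch sketch with a single affine transformation; the kernelised transformations proposed in the preprint would replace this toy map, and all hyperparameters below are illustrative.

```python
import math
import torch

class AffineFlow(torch.nn.Module):
    """Toy invertible map z = (x - shift) * exp(-log_scale); by change of variables,
    log p_x(x) = log N(z; 0, I) - sum(log_scale)."""
    def __init__(self, dim):
        super().__init__()
        self.shift = torch.nn.Parameter(torch.zeros(dim))
        self.log_scale = torch.nn.Parameter(torch.zeros(dim))

    def log_prob(self, x):
        z = (x - self.shift) * torch.exp(-self.log_scale)
        log_base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
        return log_base - self.log_scale.sum()

# usage: fit the flow by maximum likelihood on shifted, scaled Gaussian data
flow = AffineFlow(dim=2)
x = 3.0 + 0.5 * torch.randn(512, 2)
opt = torch.optim.Adam(flow.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    (-flow.log_prob(x).mean()).backward()
    opt.step()
print(flow.shift.detach(), flow.log_scale.exp().detach())   # near 3.0 and 0.5
```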
... Score-matching techniques and variants originate from [24,39,40]; their shortcomings in the context of EBM training are investigated in [41], and their blindness to the presence of multiple, imbalanced modes in the target density has long been known: we refer to [42,43] for discussions. Contrastive divergence (CD) algorithms originate from [25,26,27]. ...
... The parameterization for our model potential U_z is consistent with (40), and the associated partition function and free energy are ...
Preprint
Full-text available
Energy-based models (EBMs) are generative models inspired by statistical physics with a wide range of applications in unsupervised learning. Their performance is best measured by the cross-entropy (CE) of the model distribution relative to the data distribution. Using the CE as the objective for training is however challenging because the computation of its gradient with respect to the model parameters requires sampling the model distribution. Here we show how results for nonequilibrium thermodynamics based on Jarzynski equality together with tools from sequential Monte-Carlo sampling can be used to perform this computation efficiently and avoid the uncontrolled approximations made using the standard contrastive divergence algorithm. Specifically, we introduce a modification of the unadjusted Langevin algorithm (ULA) in which each walker acquires a weight that enables the estimation of the gradient of the cross-entropy at any step during GD, thereby bypassing sampling biases induced by slow mixing of ULA. We illustrate these results with numerical experiments on Gaussian mixture distributions as well as the MNIST dataset. We show that the proposed approach outperforms methods based on the contrastive divergence algorithm in all the considered situations.
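A rough sketch of the weighted-walker idea, in the spirit of annealed importance sampling rather than the paper's exact estimator (which also corrects for the ULA transition kernels), might look like the following; `grad_U`, `U_old`, `U_new`, and the step size are placeholders.

```python
import numpy as np

def weighted_ula_step(x, log_w, grad_U, U_old, U_new, step, rng):
    """One illustrative update: walkers x move by unadjusted Langevin dynamics under
    the current energy, and log-weights absorb the energy change when the parameters
    move from 'old' to 'new' (AIS-style). This sketch omits the transition-kernel
    correction used in the paper."""
    x = x - step * grad_U(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    log_w = log_w + U_old(x) - U_new(x)
    return x, log_w

def weighted_mean(values, log_w):
    """Self-normalised importance-weighted average, used in place of a plain mean
    when estimating expectations from the weighted walkers."""
    w = np.exp(log_w - log_w.max())
    return np.sum(w * values) / np.sum(w)
```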