About
591 Publications
268,659 Reads
9,281 Citations
Introduction
I study the geometry of information (including the field of information geometry) and its applications in machine learning (including deep learning) and visual computing. I am interested in building and implementing efficient geometric algorithms in information spaces and in designing distances from first principles.
I co-organize the biennial conference Geometric Science of Information (GSI), and I am an associate editor of the 'Information Geometry' journal (Springer INGE).
My personal web page is https://franknielsen.github.io/ and I tweet at https://twitter.com/FrnkNlsn
Publications (591)
This work studies the Geometric Jensen-Shannon divergence, based on the notion of geometric mean of probability measures, in the setting of Gaussian measures on an infinite-dimensional Hilbert space. On the set of all Gaussian measures equivalent to a fixed one, we present a closed form expression for this divergence that directly generalizes the f...
Why do deep neural networks (DNNs) benefit from very high dimensional parameter spaces? The contrast between their huge parameter complexity and their stunning performance in practice is all the more intriguing because it is not explainable using the standard theory of model selection for regular models. In this work, we propose a geometrically flavored information-theoretic approac...
Slide deck of the talk given at
IMS-NTU joint workshop on Applied Geometry for Data Sciences Part II
https://ims.nus.edu.sg/events/applied-geometry-for-data-sciences-part-ii/
https://arxiv.org/abs/2504.05654
By analogy to curved exponential families, we define curved Bregman divergences as restrictions of Bregman divergences to sub-dimensional parameter subspaces, and prove that the barycenter of a finite weighted parameter set with respect to a curved Bregman divergence amounts to the Bregman projection onto the subsp...
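To make the ingredients above concrete, here is a minimal Python sketch (not the paper's code) of a Bregman divergence for the half squared Euclidean norm generator, together with a numerical one-sided Bregman projection onto a hypothetical one-dimensional curved parameter subspace θ(t) = (t, t²); the generator, the curve, and the optimizer choice are illustrative assumptions.

```python
# Minimal sketch: Bregman divergence B_F(p:q) = F(p) - F(q) - <p - q, grad F(q)> for the
# generator F(x) = ||x||^2 / 2, and a numerical Bregman projection of a point onto a
# hypothetical curved parameter subspace theta(t) = (t, t^2).
import numpy as np
from scipy.optimize import minimize_scalar

def F(x):                       # strictly convex generator (here: half squared norm)
    return 0.5 * np.dot(x, x)

def gradF(x):
    return x

def bregman(p, q):              # B_F(p : q)
    return F(p) - F(q) - np.dot(p - q, gradF(q))

p = np.array([2.0, 1.0])                     # point to project
curve = lambda t: np.array([t, t**2])        # curved parameter subspace (illustrative)

res = minimize_scalar(lambda t: bregman(curve(t), p))   # one of the two sided Bregman projections
print("projection parameter t*:", res.x, "projected point:", curve(res.x))
```

For this particular generator the Bregman divergence reduces to half the squared Euclidean distance, so the projection coincides with the ordinary Euclidean projection onto the parabola.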
Slides of the talk given at OIST ML Workshop 2025
March 3 – 5, 2025, Okinawa, Japan
The k-nearest neighbor (k-NN) is a widely adopted technique for nonparametric classification. However, the specification of the number of neighbors, k, often presents a challenge and highlights relevant constraints. Many desirable characteristics of a classifier-including the robustness to noise, smoothness of decision boundaries, bias-variance tra...
We present a generalization of Bregman divergences in finite-dimensional symplectic vector spaces that we term symplectic Bregman divergences. Symplectic Bregman divergences are derived from a symplectic generalization of the Fenchel–Young inequality which relies on the notion of symplectic subdifferentials. The symplectic Fenchel–Young inequality...
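For context, the classical (non-symplectic) Fenchel-Young inequality that the abstract generalizes can be stated as follows; this is the standard textbook form, not the symplectic version introduced in the paper:

$$
F(x) + F^*(y) \;\geq\; \langle x, y\rangle, \qquad F^*(y) := \sup_{x}\,\{\langle x, y\rangle - F(x)\},
$$

with equality if and only if $y = \nabla F(x)$ for a smooth strictly convex $F$. The induced nonnegative gap $Y_F(x, y) = F(x) + F^*(y) - \langle x, y\rangle$ is the Fenchel-Young divergence.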
Hyperbolic geometry has become popular in machine learning due to its capacity to embed hierarchical graph structures with low distortions for further downstream processing. It has thus become important to consider statistical models and inference methods for data sets grounded in hyperbolic spaces. In this paper, we study various information-theor...
The symmetric Kullback–Leibler centroid, also called the Jeffreys centroid, of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks, including information retrieval, information fusion, and clustering. However, the Jeffreys centroid is not availabl...
An inductive mean is a mean defined as the limit of a convergent sequence of other means. Historically, this notion of inductive means obtained as limits of sequences was pioneered independently by Lagrange and Gauss for defining the arithmetic-geometric mean. In this note, we first explain several generalizations of the scalar geometric mean to sym...
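As a concrete illustration of an inductive mean, here is a small Python sketch of the Lagrange-Gauss arithmetic-geometric mean mentioned above, computed as the common limit of interleaved arithmetic and geometric means (the tolerance and test values are arbitrary choices):

```python
# Minimal sketch of an inductive mean: the arithmetic-geometric mean (AGM) of two positive
# scalars, obtained as the common limit of interleaved arithmetic and geometric means.
import math

def agm(a, b, tol=1e-15):
    while abs(a - b) > tol * max(a, b):
        a, b = 0.5 * (a + b), math.sqrt(a * b)
    return 0.5 * (a + b)

print(agm(1.0, 2.0))   # lies between the geometric mean sqrt(2) ~ 1.414 and the arithmetic mean 1.5
```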
The symmetric Kullback-Leibler centroid also called the Jeffreys centroid of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks including information retrieval, information fusion, and clustering in image, video and sound processing. However, the...
"Fast Proxy Centers for the Jeffreys Centroid: The Jeffreys-Fisher-Rao Center and the Gauss-Bregman Inductive Center," Entropy 26(12) (2024): 1008.
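For readers unfamiliar with the objects above: the Jeffreys divergence is the symmetrized Kullback-Leibler divergence J(p, q) = KL(p:q) + KL(q:p), and the Jeffreys centroid of a set of histograms minimizes the total Jeffreys divergence to the set. The following Python sketch computes such a centroid by brute-force numerical minimization over the simplex; it only illustrates the definition, not the fast proxy centers proposed in the paper, and the three example histograms are made up.

```python
# Minimal numerical sketch: Jeffreys divergence J(p, q) = KL(p:q) + KL(q:p) between histograms,
# and a Jeffreys centroid obtained by direct minimization over the probability simplex via a
# softmax parametrization (illustrative, not the paper's closed-form proxies).
import numpy as np
from scipy.optimize import minimize

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jeffreys(p, q):
    return kl(p, q) + kl(q, p)

hists = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])

def objective(z):                        # z: unconstrained logits
    c = np.exp(z) / np.sum(np.exp(z))    # softmax -> point in the open simplex
    return sum(jeffreys(h, c) for h in hists)

res = minimize(objective, np.zeros(3))
centroid = np.exp(res.x) / np.sum(np.exp(res.x))
print("approximate Jeffreys centroid:", centroid)
```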
The k-nearest neighbor (k-NN) algorithm is one of the most popular methods for nonparametric classification. However, a relevant limitation concerns the definition of the number of neighbors k. This parameter exerts a direct impact on several properties of the classifier, such as the bias-variance tradeoff, smoothness of decision boundaries, robust...
We present a generalization of Bregman divergences in symplectic vector spaces called symplectic Bregman divergences. Symplectic Bregman divergences are derived from a symplectic generalization of the Fenchel-Young inequalities which rely on symplectic subdifferentials. The generic symplectic Fenchel-Young inequality is obtained using symplectic Fe...
A Bregman manifold is a synonym for a dually flat space in information geometry which admits as a canonical divergence a Bregman divergence. Bregman manifolds are induced by smooth strictly convex functions like the cumulant or partition functions of regular exponential families, the negative entropy of mixture families, or the characteristic funct...
In this paper, we introduce a notion of mean on irreducible symmetric cones, based on the product decomposition between the determinant-one hypersurface and the determinant. Irreducible symmetric cones and their determinant one surfaces form an important class of spaces for statistics and data science, since they encompass positive definite self-ad...
Presentation at the 9th European Congress of Mathematics
While powerful probabilistic models such as Gaussian processes are naturally distance-aware, deep neural networks often lack this property. In this paper, we introduce the Distance Aware Bottleneck (DAB), i.e., a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that...
Overview of some generalizations of Bregman divergences and the underlying geometry of Bregman manifolds
This talk explains three recent concepts in distances/divergences: projective distances, comparative convexity, and the maximal invariant.
In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, "à la Kantorovich", which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, "à la Sinkhorn-Cuturi", which gets near-linear approximation algorithms but leads to ma...
The Fisher-Rao distance between two probability distributions of a finite-dimensional parametric statistical model is defined as the Riemannian geodesic distance induced by the Fisher information metric. The Fisher-Rao distance is guaranteed by construction to be invariant under diffeomorphisms of both the sample space and the parameter space of t...
Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning, among others. An exponential family can either be normalized subtractively by its cumulant or free energy function, or equivalently normalized divisively by its partition function. Both the cumulant and partition functions a...
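In the standard notation (with the carrier measure omitted for brevity), the two normalizations mentioned above read:

$$
p_\theta(x) \;=\; \exp\big(\langle \theta, t(x)\rangle - F(\theta)\big) \;=\; \frac{\exp\big(\langle \theta, t(x)\rangle\big)}{Z(\theta)}, \qquad F(\theta) = \log Z(\theta),
$$

where $F$ is the cumulant function (subtractive normalizer) and $Z$ the partition function (divisive normalizer).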
Markov chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior works have constructed annealing paths using quas...
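For reference, the most common such annealing path is the geometric one, and the quasi-arithmetic construction alluded to above replaces the weighted geometric mean by a generic quasi-arithmetic mean; the sketch below uses the standard form, and the exact parametrization used in the paper may differ:

$$
p_\beta(x) \;\propto\; p_0(x)^{1-\beta}\, p_1(x)^{\beta}, \qquad \beta \in [0, 1],
$$

and more generally $p_\beta(x) \propto f^{-1}\big((1-\beta)\, f(p_0(x)) + \beta\, f(p_1(x))\big)$ for a strictly monotone generator $f$, the geometric path corresponding to $f = \log$.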
Rationale:
• Need to define statistical dissimilarity measures D(p, q) between statistical models p and q in statistics and machine learning: for example, total variation distance, Kullback-Leibler divergence, Wasserstein, Maximum Mean Discrepancy, etc.
• Infer models from a statistical model P = {p_θ}: estimate θ and measure goodness-of-fit from data...
In this paper, we first extend the result of Ali and Silvey [J R Stat Soc Ser B, 28:131–142, 1966] who proved that any f-divergence between two isotropic multivariate Gaussian distributions amounts to a corresponding strictly increasing scalar function of their corresponding Mahalanobis distance. We give sufficient conditions on the standard probab...
Exponential families are statistical models which are the workhorses in statistics, information theory, and machine learning. An exponential family can either be normalized subtractively by its cumulant function or equivalently normalized divisively by its partition function. Both subtractive and divisive normalizers are strictly convex and smooth...
This note explains the direct implementation of this demo video:
https://www.youtube.com/shorts/59j_-DpMTkY
Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms use...
An inductive mean is a mean defined as the limit of a convergent sequence of other means.
In the field of optimal transport, two prominent subfields face each other: (i) unregularized optimal transport, "à la Kantorovich", which leads to extremely sparse plans but with algorithms that scale poorly, and (ii) entropic-regularized optimal transport, "à la Sinkhorn-Cuturi", which gets near-linear approximation algorithms but leads t...
We first explain how the information geometry of Bregman manifolds brings a natural generalization of scalar quasi-arithmetic means that we term quasi-arithmetic centers. We study the invariance and equivariance properties of quasi-arithmetic centers from the viewpoint of the Fenchel-Young canonical divergences.
Second, we consider statistical qua...
Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor medical imaging, structure tensor computer vision, radar signal processing, machine learning, etc. In order to process those data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notions of dissimila...
Hyperbolic geometry has become popular in machine learning due to its capacity to embed discrete hierarchical graph structures with low distortions into continuous spaces for further downstream processing. It is thus becoming important to consider statistical models and inference methods for data sets grounded in hyperbolic spaces. In this work, we...
We first explain how the information geometry of Bregman manifolds brings a natural generalization of scalar quasi-arithmetic means that we term quasi-arithmetic centers. We study the invariance and equivariance properties of quasi-arithmetic centers from the viewpoint of the Fenchel-Young canonical divergences. Second, we consider statistical quas...
Presentation slides for the poster presented at
2nd Annual TAG in Machine Learning
A Workshop at the 40th International Conference on Machine Learning, Honolulu
https://www.tagds.com/events/conference-workshops/tag-ml23
Data sets of multivariate normal distributions abound in many scientific areas like diffusion tensor imaging, structure tensor computer vision, radar signal processing, machine learning, just to name a few. In order to process those normal data sets for downstream tasks like filtering, classification or clustering, one needs to define proper notion...
A key technique of machine learning is to embed discrete weighted graphs into continuous spaces for further downstream analysis. Embedding discrete hierarchical structures in hyperbolic geometry has proven very successful since it was shown that any weighted tree can be embedded in that geometry with arbitrary low distortion. Various optimization m...
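As an illustration of the hyperbolic geometry involved (not of the optimization methods discussed in the paper), here is the distance function of the Poincaré ball model commonly used for such embeddings, in a minimal Python sketch with made-up points:

```python
# Minimal sketch: the Poincare-ball distance commonly used for hyperbolic graph embeddings.
import numpy as np

def poincare_distance(x, y):
    """Hyperbolic distance between points in the open unit ball (curvature -1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

print(poincare_distance(np.array([0.1, 0.2]), np.array([-0.3, 0.4])))
```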
Very short description of the original purpose of information geometry and the extended scope of geometric science of information.
We describe the basic structures of information geometry:
- the geometry of dual torsion-free affine connections coupled to a Riemannian metric tensor
- the alpha-geometry derived from the cubic tensor
Then we instan...
A well-known bottleneck of Min-Sum-of-Square Clustering (MSSC, the celebrated k-means problem) is to tackle the presence of outliers. In this paper, we propose a Partial clustering variant termed PMSSC which considers a fixed number of outliers to remove. We solve PMSSC by Integer Programming formulations and complexity results extending the ones f...
We present a simple method to approximate the Fisher–Rao distance between multivariate normal distributions based on discretizing curves joining normal distributions and approximating the Fisher–Rao distances between successive nearby normal distributions on the curves by the square roots of their Jeffreys divergences. We consider experimentally th...
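Here is a minimal univariate Python sketch of the scheme described above: discretize a curve joining two normal distributions and sum the square roots of the Jeffreys divergences of successive nearby normals. The paper treats the multivariate case and several interpolation curves, so this is only an illustration with arbitrary endpoints and step count.

```python
# Minimal univariate sketch: approximate the Fisher-Rao distance between two normals by
# discretizing a curve joining them and summing square roots of Jeffreys divergences of
# successive nearby normals.
import numpy as np

def kl_normal(m0, s0, m1, s1):
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

def jeffreys_normal(m0, s0, m1, s1):
    return kl_normal(m0, s0, m1, s1) + kl_normal(m1, s1, m0, s0)

def approx_fisher_rao(m0, s0, m1, s1, steps=1000):
    ts = np.linspace(0.0, 1.0, steps + 1)
    ms = (1 - ts) * m0 + ts * m1        # linear interpolation in ordinary (mu, sigma) coordinates
    ss = (1 - ts) * s0 + ts * s1
    return sum(np.sqrt(jeffreys_normal(ms[i], ss[i], ms[i + 1], ss[i + 1])) for i in range(steps))

# Sanity check: for equal means, the exact Fisher-Rao distance is sqrt(2) * |log(s1/s0)|.
print(approx_fisher_rao(0.0, 1.0, 0.0, 3.0), np.sqrt(2) * np.log(3.0))
```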
The estimation of probability density functions is a nontrivial task that has been tackled in recent years with machine learning techniques. Successful applications can be obtained using models inspired by the Boltzmann machine (BM) architecture. In this manuscript, the product Jacobi-Theta Boltzmann machine (pJTBM) is introduced as a restrict...
We first explain how the information geometry of Bregman manifolds brings a natural generalization of scalar quasi-arithmetic means that we term quasi-arithmetic centers.
We study the invariance and equivariance properties of quasi-arithmetic centers from the viewpoint of the Fenchel-Young canonical divergences.
Second, we consider statistical qua...
Riemannian submanifold optimization with momentum is computationally challenging because ensuring iterates remain on the submanifold often requires solving difficult differential equations. We simplify such optimization algorithms for the submanifold of symmetric positive-definite matrices with the affine invariant metric. We propose a generalized...
We present a method to approximate Rao's distance between multivariate normal distributions based on discretizing curves joining normal distributions and approximating Rao distances between successive nearby normals on the curve by using the Jeffreys divergence. We consider experimentally the linear interpolation curves in the ordinary, natural and expe...
Hyperbolic geometry has become popular in machine learning due to its capacity to embed hierarchical graph structures with low distortions for further downstream processing. It has thus become important to consider statistical models and inference methods for data sets grounded in hyperbolic spaces. In this note, we study f-divergences between the...
A known bottleneck of Min-Sum-of-Square Clustering (MSSC, also denoted k-means problem) is to tackle the perturbation implied by outliers. This paper proposes a partial clustering variant, denoted PMSSC, considering a fixed number of outliers to remove. Integer Programming formulations are proposed. Complexity results extending the ones from MSSC a...
We generalize quasi-arithmetic means beyond scalars by considering the gradient map of a Legendre type real-valued function. The gradient map of a Legendre type function is proven strictly comonotone with a global inverse. It thus yields a generalization of strictly monotone and differentiable functions generating scalar quasi-arithmetic means. F...
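For context, the scalar quasi-arithmetic means that the abstract generalizes are defined by M_f(x; w) = f^{-1}(Σ_i w_i f(x_i)) for a strictly monotone generator f. A minimal Python sketch with made-up inputs:

```python
# Minimal sketch of a scalar quasi-arithmetic mean M_f(x; w) = f^{-1}(sum_i w_i f(x_i)),
# the construction that the paper generalizes via gradient maps of Legendre-type functions.
import numpy as np

def quasi_arithmetic_mean(x, w, f, finv):
    return finv(np.sum(w * f(np.asarray(x, dtype=float))))

x = [1.0, 2.0, 4.0]
w = np.array([1/3, 1/3, 1/3])
print(quasi_arithmetic_mean(x, w, np.log, np.exp))                 # geometric mean = 2.0
print(quasi_arithmetic_mean(x, w, lambda t: 1/t, lambda t: 1/t))   # harmonic mean ~ 1.714
```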
The family of α-divergences including the oriented forward and reverse Kullback–Leibler divergences is often used in signal processing, pattern recognition, and machine learning, among others. Choosing a suitable α-divergence can either be done beforehand according to some prior knowledge of the application domains or directly learned from data set...
Hyperbolic geometry has become popular in machine learning due to its capacity to embed hierarchical graph structures with low distortions for further downstream processing. It has thus become important to consider statistical models and inference methods for data sets grounded in hyperbolic spaces. In this paper, we study various information-theor...
We consider the zeta distributions, which are discrete power law distributions that can be interpreted as the counterparts of the continuous Pareto distributions with a unit scale. The family of zeta distributions forms a discrete exponential family with normalizing constants expressed using the Riemann zeta function. We present several information...
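Concretely, the zeta distribution with exponent s > 1 has probability mass function p(k) = k^(-s)/ζ(s) on k = 1, 2, 3, ...; a minimal Python sketch (the exponent value is arbitrary):

```python
# Minimal sketch of a zeta distribution: p(k) = k^(-s) / zeta(s) for k = 1, 2, 3, ...
# It forms a discrete exponential family with log-normalizer log zeta(s).
import numpy as np
from scipy.special import zeta

def zeta_pmf(k, s):
    return k ** (-s) / zeta(s)

s = 2.5
ks = np.arange(1, 8)
print(zeta_pmf(ks, s), "partial mass:", zeta_pmf(ks, s).sum())
```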
Video abstract for https://www.mdpi.com/1099-4300/21/5/485
The Chernoff information between two probability measures is a statistical divergence measuring their deviation defined as their maximally skewed Bhattacharyya distance. Although the Chernoff information was originally introduced for bounding the Bayes error in statistical hypothesis testing, the divergence found many other applications due to its...
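For reference, the definition recalled above reads, in standard notation:

$$
C(p, q) \;=\; \max_{\alpha \in (0,1)} \; -\log \int p^\alpha(x)\, q^{1-\alpha}(x)\, \mathrm{d}\mu(x),
$$

i.e., the skewed Bhattacharyya distance $B_\alpha(p, q) = -\log \int p^\alpha q^{1-\alpha}\, \mathrm{d}\mu$ maximized over the skewing parameter $\alpha$.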
see https://arxiv.org/abs/2207.03745 and journal paper in MDPI Entropy 2022.
Markov Chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior works have constructed annealing paths using quasi...
Tutorial on Information Geometry.
The video of the talk is available at:
https://www.youtube.com/watch?v=w6r_jsEBlgU
Slides for
"A note on some information-theoretic divergences between Zeta distributions",
https://arxiv.org/abs/2104.10548
(To be presented at MaxEnt 2022 https://maxent22.see.asso.fr/)
https://www.sciencedirect.com/handbook/handbook-of-statistics/vol/46/suppl/C
A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical m...
A smooth and strictly convex function on an open convex domain induces both (1) a Hessian structure with respect to the standard flat Euclidean connection, and (2) a dually flat structure in information geometry. We first review these fundamental constructions and illustrate how to instantiate them for (a) full regular exponential families from the...
We study the dually flat information geometry of the Tojo-Yoshino exponential family, which has as sample space the Poincaré upper plane and as parameter space the open convex cone of 2 × 2 positive-definite matrices. Using the framework of Eaton's maximal invariant, we prove that all f-divergences between Tojo-Yoshino Poincaré distributions are functions...
We first extend the result of Ali and Silvey [Journal of the Royal Statistical Society: Series B, 28.1 (1966), 131-142] who first reported that any $f$-divergence between two isotropic multivariate Gaussian distributions amounts to a corresponding strictly increasing scalar function of their corresponding Mahalanobis distance. We report sufficien...
The informational energy of Onicescu is a positive quantity that measures the amount of uncertainty of a random variable. However, contrary to Shannon’s entropy, the informational energy is strictly convex and increases when randomness decreases. We report a closed-form formula for Onicescu’s informational energy and its associated correlation coef...
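For reference, Onicescu's informational energy of a discrete distribution p = (p_1, ..., p_d) and of a density p(x) is commonly written as:

$$
E(p) \;=\; \sum_{i=1}^d p_i^2, \qquad E(p) \;=\; \int p^2(x)\,\mathrm{d}x,
$$

which, in the discrete case, is minimized by the uniform distribution and, contrary to Shannon entropy, increases as randomness decreases.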
A key technique of machine learning and computer vision is to embed discrete weighted graphs into continuous spaces for further downstream processing. Embedding discrete hierarchical structures in hyperbolic geometry has proven very successful since it was shown that any weighted tree can be embedded in that geometry with arbitrary low distortion....
By calculating the Kullback–Leibler divergence between two probability measures belonging to different exponential families dominated by the same measure, we obtain a formula that generalizes the ordinary Fenchel–Young divergence. Inspired by this formula, we define the duo Fenchel–Young divergence and report a majorization condition on its pair of...
This is the slide deck (in French) of a 40-minute lecture given at the Collège de France on 23 February 2022 in the "Information and Complexity" curriculum of Prof. Stéphane Mallat.
https://www.college-de-france.fr/site/stephane-mallat/seminar-2022-02-23-11h15.htm
By calculating the Kullback-Leibler divergence between two probability measures belonging to different exponential families, we end up with a formula that generalizes the ordinary Fenchel-Young divergence. Inspired by this formula, we define the duo Fenchel-Young divergence and report a majorization condition on its pair of generators which guaran...
A lattice Gaussian distribution of given mean and covariance matrix is a discrete distribution supported on a lattice maximizing Shannon’s entropy under these mean and covariance constraints. Lattice Gaussian distributions find applications in cryptography and in machine learning. The set of Gaussian distributions on a given lattice can be handled...
Information geometry [Ama16, AJLS17, Ama21] aims at unravelling the geometric structures of families of probability distributions and at studying their uses in information sciences. Information sciences is an umbrella term regrouping statistics, information theory, signal processing, machine learning and AI, etc. Information geometry was born indep...
We prove that all f-divergences between univariate Cauchy distributions are symmetric. Furthermore, those f-divergences can be calculated as strictly increasing scalar functions of the chi-square divergence. We report a criterion which allows one to expand f-divergences as converging series of power chi divergences, and exemplify the tec...
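The symmetry stated above can be checked numerically; the following Python sketch evaluates the Kullback-Leibler divergence (an f-divergence) between two univariate Cauchy densities in both directions by quadrature, with arbitrary location/scale parameters:

```python
# Quick numerical check (not a proof) of the symmetry of KL between univariate Cauchy densities.
import numpy as np
from scipy.integrate import quad

def cauchy_pdf(x, loc, scale):
    return scale / (np.pi * (scale**2 + (x - loc)**2))

def kl_cauchy(l1, s1, l2, s2):
    integrand = lambda x: cauchy_pdf(x, l1, s1) * np.log(cauchy_pdf(x, l1, s1) / cauchy_pdf(x, l2, s2))
    return quad(integrand, -np.inf, np.inf)[0]

print(kl_cauchy(0.0, 1.0, 2.0, 3.0))   # forward
print(kl_cauchy(2.0, 3.0, 0.0, 1.0))   # reverse: numerically equal, as the symmetry result predicts
```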
NeurIPS Meetup Japan 2021, December 13–14, 2021 | Online
https://neuripsmeetup.jp/2021/
When analyzing parametric statistical models, a useful approach consists in modeling geometrically the parameter space. However, even for very simple and commonly used hierarchical models like statistical mixtures or stochastic deep neural networks, the smoothness assumption of manifolds is violated at singular points which exhibit non-smooth neigh...
This note reports a proof of the well-known fact that the harmonic mean of two independent Cauchy distributions is a Cauchy distribution. A simulation of the formula is provided with R code.
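The note mentions an R simulation; the following Python sketch is an analogous, purely illustrative Monte Carlo check with made-up parameters: it forms the harmonic mean 2XY/(X+Y) of two independent Cauchy variates and compares the sample with a maximum-likelihood Cauchy fit.

```python
# Illustrative Monte Carlo check: the harmonic mean of two independent Cauchy variates
# is compared against a Cauchy distribution fitted by maximum likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = stats.cauchy.rvs(loc=1.0, scale=2.0, size=100_000, random_state=rng)
y = stats.cauchy.rvs(loc=-0.5, scale=1.0, size=100_000, random_state=rng)
h = 2 * x * y / (x + y)                    # harmonic mean of the two variates

loc, scale = stats.cauchy.fit(h)           # fit a Cauchy model to the simulated harmonic means
print("fitted Cauchy location/scale:", loc, scale)
# Quantile sanity check: quantiles of h should track those of the fitted Cauchy.
print(np.quantile(h, [0.25, 0.5, 0.75]), stats.cauchy.ppf([0.25, 0.5, 0.75], loc, scale))
```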
The list can be explored online at
https://franknielsen.github.io/Cards/index.html
Talk at the "Geometry and topology seminar" in Austria
The Jeffreys divergence is a renowned arithmetic symmetrization of the oriented Kullback-Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with advantages and disadvantages have been proposed in the literature to either estimate, a...
This note shows that Hilbert's metric distance on the probability simplex is a non-separable distance which satisfies information monotonicity. Consider the open cone $\mathbb{R}^d_{++}$ of positive measures (i.e., histograms with $d$ positive bins) and its open probability simplex subset $\Delta_d = \{(x_1, \ldots, x_d) \in \mathbb{R}^d_{++} : \sum_{i=1}^d x_i = 1\}$. The f-divergenc...
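Concretely, Hilbert's projective metric restricted to the probability simplex can be computed as d_H(p, q) = log(max_i(p_i/q_i) / min_i(p_i/q_i)); a minimal Python sketch with made-up histograms:

```python
# Minimal sketch of Hilbert's projective metric restricted to the probability simplex:
# d_H(p, q) = log( max_i (p_i / q_i) / min_i (p_i / q_i) ).
import numpy as np

def hilbert_distance(p, q):
    r = np.asarray(p, dtype=float) / np.asarray(q, dtype=float)
    return np.log(r.max() / r.min())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(hilbert_distance(p, q), hilbert_distance(q, p))   # symmetric by construction
```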
This note illustrates how to apply the generic formula of the Kullback-Leibler divergence between two densities of two different exponential families [2]. This column is also available as the file KLPoissonGeometricDistributions.pdf. It is well-known that the Kullback-Leibler divergence between two densities $P_{\theta_1}$ and $P_{\theta_2}$ of the same exponential family amount...
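As an illustration of the well-known same-family case recalled above (not of the cross-family formula the note is about), the Kullback-Leibler divergence between two Poisson distributions has the closed form KL(Poisson(a) : Poisson(b)) = b - a + a log(a/b); the Python sketch below checks it against a truncated direct summation with arbitrary rates.

```python
# Check the closed-form KL divergence between two Poisson distributions (same exponential family)
# against a direct (truncated) summation.
import numpy as np
from scipy.stats import poisson

a, b = 3.0, 5.0
closed_form = b - a + a * np.log(a / b)

ks = np.arange(0, 200)            # truncation is numerically sufficient for these rates
direct = np.sum(poisson.pmf(ks, a) * (poisson.logpmf(ks, a) - poisson.logpmf(ks, b)))

print(closed_form, direct)
```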
Discrete normal distributions are defined as the distributions with prescribed means and covariance matrices which maximize entropy on the integer lattice support. The set of discrete normal distributions forms an exponential family with cumulant function related to the Riemann holomorphic theta function. In this work, we present formulas for several...
A memo on dissimilarities, divergences and distances.
Poster that concisely presents the main actors who led to the field of Information Geometry
These are the slides for a presentation at GSI 2021.
This is related to the paper "On f-divergences between Cauchy distributions" by Frank Nielsen and Kazuki Okamura.
We propose a methodology to approximate conditional distributions in the elliptope of correlation matrices based on conditional generative adversarial networks. We illustrate the methodology with an application from quantitative finance: Monte Carlo simulations of correlated returns to compare risk-based portfolio construction methods. Finally, we...
In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach to design new algorithms in many settings such as gradient-free, adaptive-gradient, and second-order methods. Our st...
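For reference, the plain (unstructured) natural-gradient update that these methods build on is:

$$
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t),
$$

where $F(\theta)$ is the Fisher information matrix of the parametric model and $\eta$ a learning rate; the structured methods above are obtained by performing this kind of update on structured parameter spaces.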
The Jeffreys divergence is a renowned symmetrization of the oriented Kullback-Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with pros and cons have been proposed in the literature to either estimate, approximate, or lower and u...
We prove that all f-divergences between univariate Cauchy distributions are symmetric, and can be expressed as functions of the chi-squared divergence. This property does not hold anymore for multivariate Cauchy distributions.