Article
PDF available

Rényi Divergence and Kullback-Leibler Divergence

Authors: Tim van Erven and Peter Harremoës

Abstract

Rényi divergence is related to Rényi entropy much like Kullback-Leibler divergence is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Rényi divergence of order 1 equals the Kullback-Leibler divergence. We review and extend the most important properties of Rényi divergence and Kullback-Leibler divergence, including convexity, continuity, limits of σ-algebras and the relation of the special order 0 to the Gaussian dichotomy and contiguity. We also extend the known equivalence between channel capacity and minimax redundancy to continuous channel inputs (for all orders), and present several other minimax results.
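As a concrete illustration of the order parameter, the sketch below computes the Rényi divergence D_α(P‖Q) = (1/(α−1)) log Σ_x P(x)^α Q(x)^{1−α} for two finite distributions and checks numerically that it approaches the Kullback-Leibler divergence as the order tends to 1. This is a minimal sketch for finite alphabets only; the distributions `p` and `q` are made-up examples, not data from the paper.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Order-alpha Renyi divergence (in nats) between finite distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):
        # Order 1 is defined as the limit, which equals the KL divergence.
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])   # example distributions (hypothetical)
q = np.array([0.4, 0.4, 0.2])

for a in [0.5, 0.9, 0.999, 1.0, 1.001, 2.0]:
    print(f"D_{a}(P||Q) = {renyi_divergence(p, q, a):.6f}")
# The values at alpha = 0.999 and 1.001 bracket the KL divergence printed at alpha = 1.0.
```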
... i.e., w ≺ q, iff w(y) = 0 for all y satisfying q(y) = 0. For any α ∈ R+, the usual definition of the order α Rényi divergence, e.g., [20], can be extended as follows for any w ∈ P(Y) and q ∈ L+(Y), ...
... If w, q ∈ P(Y), then the Rényi divergence D_α(w‖q) is nonnegative and D_α(w‖q) = 0 iff w = q. Furthermore, [20]-[22], D_α(w‖q) ≥ ((1∧α)/2) ‖w − q‖₁² for all w, q ∈ P(Y). ...
... where (a) follows from (12) and (22), (b) follows from Ψᵀ_{α,p} p = p, (c) follows from Λᵀ_{α,p} p = 0. Both Ψᵀ_{α,p} p = p and Λᵀ_{α,p} p = 0 follow from (16), (20), and (21) by substitution. ...
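The first excerpt above quotes a Pinsker-type bound, D_α(w‖q) ≥ ((1∧α)/2)‖w − q‖₁². The snippet below is a small numerical sanity check of that inequality on randomly drawn finite distributions; it is illustrative only and, for self-containment, redefines a simplified version of the renyi_divergence helper from the earlier sketch (valid for strictly positive distributions).

```python
import numpy as np

rng = np.random.default_rng(0)

def renyi_divergence(p, q, alpha):
    # Simplified helper: assumes strictly positive p and q.
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

for _ in range(1000):
    # Random strictly positive distributions on a 4-letter alphabet.
    w = rng.dirichlet(np.ones(4)) + 1e-9
    q = rng.dirichlet(np.ones(4)) + 1e-9
    w, q = w / w.sum(), q / q.sum()
    for alpha in (0.25, 0.5, 1.0, 2.0):
        lhs = renyi_divergence(w, q, alpha)
        rhs = 0.5 * min(1.0, alpha) * np.sum(np.abs(w - q))**2
        assert lhs >= rhs - 1e-12, (alpha, lhs, rhs)
print("Pinsker-type bound held on all sampled pairs.")
```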
Conference Paper
Full-text available
For channels with finite input and output sets, under mild technical assumptions, the local behavior of the Augustin information as a function of the input distribution is characterized for all positive orders using the implicit function theorem and the characterization of the Augustin information in terms of the Augustin dual. For channels with (potentially multiple) linear constraints, the slowest decrease of Augustin information with increasing distance from the Augustin capacity-achieving input distributions is characterized within small neighborhoods around these distributions for all positive orders.
... The Rényi divergence, tilting, and Augustin's information measures are central to the analysis we present in the following sections. We introduce these concepts in §2.1 and §2.2; a more detailed discussion can be found in [27], [37]. In §2.3 we define the SPE and derive widely known properties of it for our general channel model. ...
... For properties of the Rényi divergence, throughout the manuscript, we will refer to the comprehensive study provided by van Erven and Harremoës [37]. Note that the order one Rényi divergence is the Kullback-Leibler divergence. ...
... Note that the order one Rényi divergence is the Kullback-Leibler divergence. For other orders, the Rényi divergence can be characterized in terms of the Kullback-Leibler divergence as well; see [37, Thm. 30]. ...
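One concrete identity of this kind, stated here in standard notation as a reminder rather than as a verbatim quotation of [37, Thm. 30], expresses the Rényi divergence of order α ∈ (0, 1) through Kullback-Leibler divergences:

```latex
% For probability measures P, Q and an order 0 < \alpha < 1:
(1 - \alpha)\, D_\alpha(P \| Q)
  = \min_{R} \big\{ \alpha\, D(R \| P) + (1 - \alpha)\, D(R \| Q) \big\},
% where the minimum runs over all probability measures R and is attained at
% R^*(x) \propto P(x)^{\alpha} Q(x)^{1-\alpha} whenever the normalizer is finite.
```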
Preprint
Sphere packing bounds (SPBs), with prefactors that are polynomial in the block length, are derived for codes on two families of memoryless channels using Augustin's method: (possibly non-stationary) memoryless channels with (possibly multiple) additive cost constraints, and stationary memoryless channels with convex constraints on the composition (i.e., empirical distribution, type) of the input codewords. A variant of Gallager's bound is derived in order to show that these sphere packing bounds are tight in terms of the exponential decay rate of the error probability with the block length under mild hypotheses.
... A. The Augustin Information and Mean. Definition 1: For any α ∈ ℜ+, W : X → P(Y), and p ∈ P(X), the order α Augustin information for the prior p is I_α(p; W) = inf_{q∈P(Y)} Σ_x p(x) D_α(W(x)‖q), and the infimum is achieved by a unique probability measure [7, Thm 12]. Such a q_{α,p} is called the order α Augustin mean for the prior p. ...
... There are cases for which C_{α,W,ϱ} is finite for all ϱ ∈ Γ_ρ, yet C^λ_{α,W} is infinite for λ small enough.⁷ The equality given in (c) might not hold if ϱ ∈ Γ_ρ \ int Γ_ρ and |X| = ∞. ...
... Recall that the product structure assertion for q_{α,W,ϱ} in Lemma 7 was qualified by the existence of a {ϱ_t}_{t∈T} satisfying Σ_{t∈T} C_{α,W_t,ϱ_t} = C_{α,W,ϱ} < ∞. In Lemma 10, on the other hand, the product structure assertion for q^λ_{α,W} is qualified only by C^λ_{α,W} < ∞. ⁷ In [3, §33-§35], Augustin considers bounded ρ's of the form ρ : X → [0, 1]^ℓ. In that case, it is easy to see that if ∃ϱ ∈ int Γ_ρ s.t. ...
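The first excerpt in this group defines the order α Augustin information as a minimization over output distributions. As a rough, finite-alphabet illustration of that definition (not of the cited papers' algorithms), the sketch below approximates I_α(p; W) and the Augustin mean by numerically minimizing Σ_x p(x) D_α(W(x)‖q) over the probability simplex via a softmax parameterization; the channel W and prior p are made-up examples.

```python
import numpy as np
from scipy.optimize import minimize

def renyi_divergence(p, q, alpha):
    # Assumes strictly positive p and q on a finite alphabet.
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

def augustin_information(W, p, alpha):
    """Approximate I_alpha(p; W) = min_q sum_x p(x) D_alpha(W(x)||q)."""
    n_out = W.shape[1]

    def objective(z):
        q = np.exp(z - z.max())
        q /= q.sum()                      # softmax keeps q on the simplex
        return sum(p[x] * renyi_divergence(W[x], q, alpha) for x in range(W.shape[0]))

    res = minimize(objective, np.zeros(n_out), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 20000})
    q = np.exp(res.x - res.x.max()); q /= q.sum()
    return res.fun, q                     # value and (approximate) Augustin mean

# Hypothetical binary-input, ternary-output channel and prior.
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
p = np.array([0.5, 0.5])

for alpha in (0.5, 1.0, 2.0):
    value, q_mean = augustin_information(W, p, alpha)
    print(f"alpha={alpha}: I_alpha(p;W) ~= {value:.4f}, Augustin mean ~= {np.round(q_mean, 4)}")
# For alpha = 1 the Augustin mean is just the output marginal sum_x p(x) W(x).
```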
Preprint
For any channel with a convex constraint set and finite Augustin capacity, existence of a unique Augustin center and an associated van Erven-Harremoës bound are established. Augustin-Legendre capacity, center, and radius are introduced and proved to be equal to the corresponding Rényi-Gallager entities. Sphere packing bounds with polynomial prefactors are derived for codes on two families of channels: (possibly non-stationary) memoryless channels with multiple additive cost constraints, and stationary memoryless channels with convex constraints on the empirical distribution of the input codewords.
... To compute I_f, we recall that the framework based on the extended least action principle [41] allows us to choose a general relative entropy definition such as the Kullback-Leibler divergence, the Rényi divergence, or the Tsallis divergence [48][49][50]. The Rényi divergence, like the Tsallis divergence, is a one-parameter generalization of the K-L divergence, where the parameter is called the order of the divergence. ...
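For reference, the two one-parameter families mentioned in this excerpt can be written out for discrete distributions as follows (standard definitions added here for context; the notation is not taken from the cited papers):

```latex
% Renyi divergence of order \alpha (\alpha > 0, \alpha \neq 1):
D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^{\alpha} Q(x)^{1-\alpha}
% Tsallis divergence of order \alpha:
D^{T}_\alpha(P\|Q) = \frac{1}{\alpha - 1} \Big( \sum_x P(x)^{\alpha} Q(x)^{1-\alpha} - 1 \Big)
% As \alpha \to 1 both converge to the Kullback-Leibler divergence
% D_{KL}(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}.
```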
Article
Full-text available
Quantum theory of electron spin is developed here based on the extended least action principle and assumptions of intrinsic angular momentum of an electron with random orientations. The novelty of the formulation is the introduction of relative entropy for the random orientations of intrinsic angular momentum when extremizing the total actions. Applying this extended least action principle recursively, we show that the quantization of electron spin is a mathematical consequence when the order of relative entropy approaches a limit. In addition, the formulation of the measurement probability when a second Stern-Gerlach apparatus is rotated relative to the first Stern-Gerlach apparatus, and the Schrödinger-Pauli equation, are recovered successfully. Furthermore, the principle allows us to provide an intuitive physical model and formulation to explain the entanglement phenomenon between two electron spins. In this model, spin entanglement is the consequence of the correlation between the random orientations of the intrinsic angular momenta of the two electrons. Since spin orientation is an intrinsic local property of the electron, the correlation of spin orientations can be preserved and manifested even when the two electrons are remotely separated. The entanglement of a spin singlet state is represented by two joint probability density functions that reflect the orientation correlation. Using these joint probability density functions, we prove that the Bell-CHSH inequality is violated in a Bell test. To test the validity of the spin-entanglement model, we propose a Bell test experiment with time delay. Such an experiment starts with a typical Bell test that confirms the violation of the Bell-CHSH inequality but adds an extra step in which Bob's measurement is delayed by a period of time after Alice's measurement. As the time delay increases, the test results are expected to change from violating the Bell-CHSH inequality to not violating the inequality.
... A simple derivation shows, indeed, that the right side of (3) is the Rényi divergence between p^(α) and q^(α) of order 1/α [27]. For an extensive study of properties of the Rényi divergence, we refer the reader to [18]. Note that, even though the relative α-entropy is connected to the Csiszár divergence which, in turn, is linked to the Bregman divergence B_f through ...
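For context, the escort distribution p^(α) appearing in this excerpt and the stated order-1/α relation can be written out as follows (standard conventions added here for reference; the symbol I_α for the relative α-entropy is notation assumed here, not the cited paper's):

```latex
% Escort distribution of a pmf P:
P^{(\alpha)}(x) = \frac{P(x)^{\alpha}}{\sum_{x'} P(x')^{\alpha}}
% Relation between the relative \alpha-entropy and the Renyi divergence:
I_\alpha(P \| Q) = D_{1/\alpha}\big(P^{(\alpha)} \,\big\|\, Q^{(\alpha)}\big)
```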
Preprint
We study the geometry of probability distributions with respect to a generalized family of Csiszár f-divergences. A member of this family is the relative α-entropy, which is also a Rényi analog of relative entropy in information theory and known as the logarithmic or projective power divergence in statistics. We apply Eguchi's theory to derive the Fisher information metric and the dual affine connections arising from these generalized divergence functions. This enables us to arrive at a more widely applicable version of the Cramér-Rao inequality, which provides a lower bound for the variance of an estimator for an escort of the underlying parametric probability distribution. We then extend Amari-Nagaoka's dually flat structure of the exponential and mixture models to other distributions with respect to the aforementioned generalized metric. We show that these formulations lead us to find unbiased and efficient estimators for the escort model. Finally, we compare our work with prior results on generalized Cramér-Rao inequalities that were derived from non-information-geometric frameworks.
... Furthermore, D_α(P‖Q) is nondecreasing in the order α, as mentioned in Theorem 3 of [21]. ...
Preprint
Message identification (M-I) divergence is an important measure of the information distance between probability distributions, similar to the Kullback-Leibler (K-L) and Rényi divergences. In fact, M-I divergence with a variable parameter can help characterize the distinction between two distributions. Furthermore, by choosing an appropriate parameter of the M-I divergence, it is possible to amplify the information distance between adjacent distributions while maintaining enough gap between two nonadjacent ones. Therefore, M-I divergence can play a vital role in distinguishing distributions more clearly. In this paper, we first define a parametric M-I divergence from the viewpoint of information theory and then present its major properties. In addition, we design an M-I divergence estimation algorithm by means of an ensemble of the proposed weighted kernel estimators, which improves the convergence of the mean squared error from O(Γ^{-j/d}) to O(Γ^{-1}) for j ∈ (0, d]. We also discuss decision-making with M-I divergence for clustering or classification, and investigate its performance in a statistical sequence model of big data for the outlier detection problem.
Article
Full-text available
Small convolutional neural network (CNN)-based models usually require transferring knowledge from a large model before they are deployed in computationally resource-limited edge devices. Masked image modelling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. This is mainly due to the significant discrepancy between the transformer-based large model and the CNN-based small network. In this paper, the authors develop the first heterogeneous self-supervised knowledge distillation (HSKD) method based on MIM, which can efficiently transfer knowledge from large transformer models to small CNN-based models in a self-supervised fashion. Our method builds a bridge between transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modelling. Our method is a simple yet effective learning paradigm for learning the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced self-supervised methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, on the ImageNet-1K dataset, HSKD improves the accuracy of ResNet-50 (sparse) from 76.98% to 80.01%.
Preprint
In recent years generative artificial intelligence has been used to create data to support science analysis. For example, Generative Adversarial Networks (GANs) have been trained using Monte Carlo simulated input and then used to generate data for the same problem. This has the advantage that a GAN creates data in a significantly reduced computing time. N training events for a GAN can result in GN generated events, with the gain factor G being more than one. This appears to violate the principle that one cannot get information for free. This is not the only way to amplify data, so this process will be referred to as data amplification, which is studied here using information-theoretic concepts. It is shown that a gain of greater than one is possible whilst keeping the information content of the data unchanged. This leads to a mathematical bound which only depends on the number of generated and training events. This study determines conditions on both the underlying and reconstructed probability distributions to ensure this bound. In particular, the resolution of variables in amplified data is not improved by the process, but the increase in sample size can still improve statistical significance. The bound is confirmed using computer simulation and analysis of GAN-generated data from the literature.
Preprint
It is common in phylogenetics to have some, perhaps partial, information about the overall evolutionary tree of a group of organisms and wish to find an evolutionary tree of a specific gene for those organisms. There may not be enough information in the gene sequences alone to accurately reconstruct the correct "gene tree." Although the gene tree may deviate from the "species tree" due to a variety of genetic processes, in the absence of evidence to the contrary it is parsimonious to assume that they agree. A common statistical approach in these situations is to develop a likelihood penalty to incorporate such additional information. Recent studies using simulation and empirical data suggest that a likelihood penalty quantifying concordance with a species tree can significantly improve the accuracy of gene tree reconstruction compared to using sequence data alone. However, the consistency of such an approach has not yet been established, nor have convergence rates been bounded. Because phylogenetics is a non-standard inference problem, the standard theory does not apply. In this paper, we propose a penalized maximum likelihood estimator for gene tree reconstruction, where the penalty is the square of the Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species tree. We prove that this method is consistent, and derive its convergence rate for estimating the discrete gene tree structure and continuous edge lengths (representing the amount of evolution that has occurred on that branch) simultaneously. We find that the regularized estimator is "adaptive fast converging," meaning that it can reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length. Our method does not require the species tree to be known exactly; in fact, our asymptotic theory holds for any such guide tree.
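As a compact summary of the estimator described above, the penalized maximum likelihood problem can be written as follows; the symbols (gene tree T, species/guide tree S, sequence data D, tuning parameter λ) are generic notation introduced here rather than the paper's own:

```latex
% Penalized maximum likelihood gene tree estimate, with d_{BHV} the
% Billera-Holmes-Vogtmann geodesic distance to the species (guide) tree S:
\hat{T} \;=\; \arg\max_{T} \Big[ \log L(T \mid D) \;-\; \lambda \, d_{BHV}(T, S)^{2} \Big]
```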
Preprint
We develop a structural default model for interconnected financial institutions in a probabilistic framework. For all possible network structures we characterize the joint default distribution of the system using Bayesian network methodologies. Particular emphasis is given to the treatment and consequences of cyclic financial linkages. We further demonstrate how Bayesian network theory can be applied to detect contagion channels within the financial network, to measure the systemic importance of selected entities on others, and to compute conditional or unconditional probabilities of default for single or multiple institutions.
Article
1. Motivation 2. A modicum of measure theory 3. Densities and derivatives 4. Product spaces and independence 5. Conditioning 6. Martingale et al 7. Convergence in distribution 8. Fourier transforms 9. Brownian motion 10. Representations and couplings 11. Exponential tails and the law of the iterated logarithm 12. Multivariate normal distributions Appendix A. Measures and integrals Appendix B. Hilbert spaces Appendix C. Convexity Appendix D. Binomial and normal distributions Appendix E. Martingales in continuous time Appendix F. Generalized sequences.
Article
Rényi's entropy and divergence of order α are given operational characterizations in terms of block coding and hypothesis testing, as so-called β-cutoff rates, with α = (1+β)^{-1} for entropy and α = (1−β)^{-1} for divergence. Out of several possible definitions of mutual information and channel capacity of order α, our approach distinguishes one that admits an operational characterization as the β-cutoff rate for channel coding, with α = (1−β)^{-1}. The ordinary cutoff rate of a DMC corresponds to β = −1.