Article

Abstract

In this work, the probability of an event under some joint distribution is bounded by measuring it with the product of the marginals instead (which is typically easier to analyze), together with a measure of the dependence between the two random variables. These results find applications in adaptive data analysis, where multiple dependencies are introduced, and in learning theory, where they can be employed to bound the generalization error of a learning algorithm. Bounds are given in terms of Sibson's Mutual Information, α-Divergences, Hellinger Divergences, and f-Divergences. A case of particular interest is the Maximal Leakage (or Sibson's Mutual Information of order infinity), since this measure is robust to post-processing and composes adaptively. The corresponding bound can be seen as a generalization of classical bounds, such as Hoeffding's and McDiarmid's inequalities, to the case of dependent random variables.
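For intuition, the mechanism behind these bounds can be sketched via Hölder's inequality and a change of measure (a hedged illustration; the paper's actual statements are formulated in terms of Sibson's Mutual Information, Maximal Leakage, and f-Divergences, with the corresponding constants): for an event E, a joint distribution P_{XY} with marginals P_X and P_Y, and any α > 1,

\[
P_{XY}(E) \;=\; \mathbb{E}_{P_X \otimes P_Y}\!\left[\mathbb{1}_E \, \frac{\mathrm{d}P_{XY}}{\mathrm{d}(P_X \otimes P_Y)}\right]
\;\le\; \big[(P_X \otimes P_Y)(E)\big]^{\frac{\alpha-1}{\alpha}} \, \exp\!\left(\frac{\alpha-1}{\alpha}\, D_\alpha\big(P_{XY} \,\|\, P_X \otimes P_Y\big)\right),
\]

where D_α denotes Rényi's divergence of order α: the event is measured under the (easier-to-analyze) product of the marginals, at the price of a multiplicative factor quantifying the dependence between X and Y.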

... The generalization error in statistical machine learning is the expectation (w.r.t. the joint probability distribution of datasets and models) of the difference between the population risk and the empirical risk [1]–[11]. The generalization error has become a key metric for assessing an algorithm's performance and its ability to generalize beyond training data. ...
... The main contribution of this manuscript is a method that allows constructing explicit expressions for the generalization error in (11). In the following, this method is referred to as the method of gaps and is introduced in the next section. ...
... The method of gaps is a two-step technique to construct explicit expressions for the generalization error G(P_{Θ|Z}, P_Z) in (11) in terms of information measures. The method is based on the analysis of the variations of the functionals R_θ in (8) and R_z in (9). ...
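Under the definition quoted in the first excerpt, the object being characterized can be written explicitly (a sketch; the symbols for the population and empirical risks are notational assumptions, while G(P_{Θ|Z}, P_Z) follows the excerpts):

\[
G\big(P_{\Theta|Z}, P_Z\big) \;=\; \mathbb{E}_{P_{\Theta, Z}}\!\left[\, \mathsf{R}_{P_Z}(\Theta) \;-\; \widehat{\mathsf{R}}_{Z}(\Theta) \,\right],
\]

i.e., the expectation, over the joint probability distribution of datasets Z and models Θ, of the difference between the population risk and the empirical risk.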
Preprint
Full-text available
In this paper, the method of gaps, a technique for deriving closed-form expressions in terms of information measures for the generalization error of machine learning algorithms, is introduced. The method relies on two central observations: (a) the generalization error is an average of the variation of the expected empirical risk with respect to changes on the probability measure (used for expectation); and (b) these variations, also referred to as gaps, exhibit closed-form expressions in terms of information measures. The expectation of the empirical risk can be either with respect to a measure on the models (with a fixed dataset) or with respect to a measure on the datasets (with a fixed model), which results in two variants of the method of gaps. The first variant, which focuses on the gaps of the expected empirical risk with respect to a measure on the models, appears to be the most general, as no assumptions are made on the distribution of the datasets. The second variant develops under the assumption that datasets are made of independent and identically distributed data points. All existing exact expressions for the generalization error of machine learning algorithms can be obtained with the proposed method. This method also allows obtaining numerous new exact expressions, which improve the understanding of the generalization error; establish connections with other areas in statistics, e.g., hypothesis testing; and might potentially guide algorithm design.
... Additionally, an approach that leverages the Kullback-Leibler divergence can be found in [22,23]. Finally, we remark that [24] exploits a technique similar to what is pursued in this work, in order to extend McDiarmid's inequality to the case where the function f depends on the random variables themselves (while the random variables remain, in fact, independent). Said result was then applied to a learning setting. ...
... Note that Equation (23) requires only absolute continuity as an assumption. Equations (24) and (28) instead leverage tensorisation properties of H_α. These tensorisation properties are particularly suited for Markovian settings, as Equation (28) shows. ...
... These tensorisation properties are particularly suited for Markovian settings, as Equation (28) shows. However, in the general case, one can still reduce H_α(P_{X^n} ‖ ∏_{i=1}^n P_{X_i}) (a divergence between n-dimensional measures) to n one-dimensional objects, see Appendix D for details about the tensorisation of both the Hellinger integral H_α and Rényi's α-divergence D_α, and thus retrieve Equation (24). Note that Equation (23) gives a natural generalisation of concentration inequalities to the case of arbitrarily dependent random variables (just like Equation (25) generalises them to Markovian settings). ...
Preprint
Full-text available
We propose a novel approach to concentration for non-independent random variables. The main idea is to "pretend" that the random variables are independent and pay a multiplicative price measuring how far they are from actually being independent. This price is encapsulated in the Hellinger integral between the joint and the product of the marginals, which is then upper bounded leveraging tensorisation properties. Our bounds represent a natural generalisation of concentration inequalities in the presence of dependence: we recover exactly the classical bounds (McDiarmid's inequality) when the random variables are independent. Furthermore, in a "large deviations" regime, we obtain the same decay in the probability as for the independent case, even when the random variables display non-trivial dependencies. To show this, we consider a number of applications of interest. First, we provide a bound for Markov chains with finite state space. Then, we consider the Simple Symmetric Random Walk, which is a non-contracting Markov chain, and a non-Markovian setting in which the stochastic process depends on its entire past. To conclude, we propose an application to Markov Chain Monte Carlo methods, where our approach leads to an improved lower bound on the minimum burn-in period required to reach a certain accuracy. In all of these settings, we provide a regime of parameters in which our bound fares better than what the state of the art can provide.
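For reference, the classical bound recovered in the independent case is McDiarmid's inequality, stated here in its standard form: if X_1, ..., X_n are independent and f has bounded differences, i.e. changing the i-th coordinate changes f by at most c_i, then

\[
\mathbb{P}\big( f(X_1,\dots,X_n) - \mathbb{E}\,f(X_1,\dots,X_n) \ge t \big) \;\le\; \exp\!\left( - \frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \right), \qquad t > 0.
\]

Roughly speaking, the approach described above pays, on top of (a power of) this independent-case tail, a multiplicative price expressed through the Hellinger integral between the joint distribution and the product of the marginals.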
... An alternative way to obtain single-draw generalization bounds is through the use of Hölder's inequality (Theorem 3.21), via an approach introduced by Esposito et al. (2021a). We start by providing the following general theorem. ...
... Theorem 5.12 and Corollary 5.13 are due to Esposito et al. (2021a), who also presented several additional bounds and results beyond learning theory. In Hellström and Durisi (2020a), the "strong converse" lemma from binary hypothesis testing is used to obtain single-draw bounds in terms of the tail of the information density. ...
... We will discuss this further in Section 7.3. The single-draw bounds in Theorem 6.10 and Corollary 6.11 can be found in Hellström and Durisi (2020a), and are extensions of bounds from Esposito et al. (2021a) to the CMI setting. ...
Preprint
Full-text available
A fundamental question in theoretical machine learning is generalization. Over the past decades, the PAC-Bayesian approach has been established as a flexible framework to address the generalization capabilities of machine learning algorithms, and design new ones. Recently, it has garnered increased interest due to its potential applicability for a variety of learning algorithms, including deep neural networks. In parallel, an information-theoretic view of generalization has developed, wherein the relation between generalization and various information measures has been established. This framework is intimately connected to the PAC-Bayesian approach, and a number of results have been independently discovered in both strands. In this monograph, we highlight this strong connection and present a unified treatment of generalization. We present techniques and results that the two perspectives have in common, and discuss the approaches and interpretations that differ. In particular, we demonstrate how many proofs in the area share a modular structure, through which the underlying ideas can be intuited. We pay special attention to the conditional mutual information (CMI) framework; analytical studies of the information complexity of learning algorithms; and the application of the proposed methods to deep learning. This monograph is intended to provide a comprehensive introduction to information-theoretic generalization bounds and their connection to PAC-Bayes, serving as a foundation from which the most recent developments are accessible. It is aimed broadly towards researchers with an interest in generalization and theoretical machine learning.
... Recall the definitions (4), (5), and (15). The following theorem states the main data-dependent tail bound of this paper. ...
... However, we hasten to mention that since our approach and the resulting general-purpose bounds of this section are meant to unify several distinct approaches, some of which are data-independent, special instances of our bounds obtained by specialization to those settings can be data-independent. A common choice is to consider Ŵ ⊆ W, as in the previous section. Although for simplicity ϵ is assumed to take a fixed value here, in general, it can be chosen to depend on ν_{S,W}. ...
... G^δ_{U,S,W}, S_{U,S,W}(·), and T_{α,P_U P_S}(ν_{U,S}, p_{Ŵ|S}, q_{Ŵ|S}, λf) are defined in (4), (5), and (15), respectively. ...
Preprint
Full-text available
In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce newly here. In this framework, the generalization error of an algorithm is linked to a variable-size 'compression rate' of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data at hand, rather than its unknown distribution. Our new generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectations bounds. Moreover, it is shown that our framework also allows to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume and possibly improve over several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds that are recovered as special cases, thus unveiling a unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.
... This work is closely related to a rich literature of information-theoretic generalization bounds, some of which were discussed earlier [Xu and Raginsky, 2017, Bassily et al., 2018, Pensia et al., 2018, Negrea et al., 2019, Bu et al., 2020, Steinke and Zakynthinou, 2020, Haghifam et al., 2020, Hafez-Kolahi et al., 2020, Alabdulmohsin, 2020, Neu et al., 2021, Raginsky et al., 2021, Esposito et al., 2021]. Most of these works derive generalization bounds that depend on a mutual information quantity measured between the output of the training algorithm and some quantity related to training data. ...
... In the previous chapter we discussed various information-theoretic bounds based on different notions of training set information captured by the training algorithm [Xu and Raginsky, 2017, Bassily et al., 2018, Negrea et al., 2019, Bu et al., 2020, Steinke and Zakynthinou, 2020, Haghifam et al., 2020, Neu et al., 2021, Raginsky et al., 2021, Hellström and Durisi, 2020, Esposito et al., 2021]. The data- and algorithm-dependent nature of these bounds makes them applicable to typical settings of deep learning, where powerful and overparameterized neural networks are employed. ...
... Information-theoretic generalization bounds have been improved or generalized in many ways. A few works have proposed to use other types of information measures and distances between distributions, instead of Shannon mutual information and Kullback-Leibler divergence respectively [Esposito et al., 2021, Aminian et al., 2021b, Rodríguez Gálvez et al., 2021]. In particular, Rodríguez Gálvez et al. [2021] derived expected generalization gap bounds that depend on the Wasserstein distance between P_{W|Z_i} and P_W. Aminian et al. [2021b] derived similar bounds that instead depend on sample-wise Jensen-Shannon information. Esposito et al. [2021] derive bounds on the probability of an event under a joint distribution P_{X,Y} in terms of the probability of the same event under the product of marginals P_X ⊗ P_Y and an information measure between X and Y (Sibson's α-mutual information, maximal leakage, f-mutual information) or a divergence between P_{X,Y} and P_X ⊗ P_Y (Rényi α-divergences, f-divergences, Hellinger divergences). ...
Preprint
Despite the popularity and success of deep learning, there is limited understanding of when, how, and why neural networks generalize to unseen examples. Since learning can be seen as extracting information from data, we formally study information captured by neural networks during training. Specifically, we start with viewing learning in presence of noisy labels from an information-theoretic perspective and derive a learning algorithm that limits label noise information in weights. We then define a notion of unique information that an individual sample provides to the training of a deep network, shedding some light on the behavior of neural networks on examples that are atypical, ambiguous, or belong to underrepresented subpopulations. We relate example informativeness to generalization by deriving nonvacuous generalization gap bounds. Finally, by studying knowledge distillation, we highlight the important role of data and label complexity in generalization. Overall, our findings contribute to a deeper understanding of the mechanisms underlying neural network generalization.
... The authors focused on the notion of ϕ-informativity (cf. [13]) and leveraged the Data-Processing inequality similarly to [14,Theorem 3]. A more thorough discussion of the differences between this work and [12] can be found in Appendix A. ...
... The behavior of the third actor in Equation (27), the dual of D_ϕ(· ‖ µ), is crucial in order to obtain bounds. For instance, when f is the indicator function of an event, one can explicitly compute the dual (and then retrieve a family of Fano-like inequalities involving arbitrary divergences, see [14], [29, Chapter 3]). When f is not an indicator function, one cannot typically compute the dual explicitly and has to upper-bound it leveraging properties of µ and f. ...
... Remark 3. Theorem 2 (and corresponding generalisations including Orlicz and Amemiya norms) has already appeared in [14] in a slightly less general form, and in [29] in a variety of forms. It has been re-stated here for ease of reference. ...
Preprint
Full-text available
This paper focuses on parameter estimation and introduces a new method for lower bounding the Bayesian risk. The method allows for the use of virtually any information measure, including Rényi's α-Divergences, φ-Divergences, and Sibson's α-Mutual Information. The approach considers divergences as functionals of measures and exploits the duality between spaces of measures and spaces of functions. In particular, we show that one can lower bound the risk with any information measure by upper bounding its dual via Markov's inequality. We are thus able to provide estimator-independent impossibility results thanks to the Data-Processing Inequalities that divergences satisfy. The results are then applied to settings of interest involving both discrete and continuous parameters, including the "Hide-and-Seek" problem, and compared to the state-of-the-art techniques. An important observation is that the behaviour of the lower bound in the number of samples is influenced by the choice of the information measure. We leverage this by introducing a new divergence inspired by the "Hockey-Stick" Divergence, which is demonstrated empirically to provide the largest lower-bound across all considered settings. If the observations are subject to privatisation, stronger impossibility results can be obtained via Strong Data-Processing Inequalities. The paper also discusses some generalisations and alternative directions.
... In general, classical measures of model expressivity (such as Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1991), Rademacher complexity (Bartlett and Mendelson, 2003), etc.) fail to explain the generalization abilities of DNNs due to the fact that they are typically over-parameterized models with less training data than model parameters. A novel approach was introduced by Russo and Zou (2016) and Xu and Raginsky (2017) (further developed by Steinke and Zakynthinou, 2020; Bu et al., 2020; Esposito et al., 2021; Esposito and Gastpar, 2022, and many others), where information-theoretic techniques are used to link the generalization capabilities of a learning algorithm to information measures. These quantities are algorithm-dependent and can be used to analyze the generalization capabilities of general classes of updates and models, e.g., noisy iterative algorithms like Stochastic Gradient Langevin Dynamics (SGLD) (Pensia et al., 2018; Wang et al., 2021), which can thus be applied to deep learning. ...
... In this work we adopt and expand the framework introduced by Pensia et al. (2018), but instead of focusing on the mutual information between the input and output of an iterative algorithm, we compute the maximal leakage (Issa et al., 2020). Maximal leakage, together with other information measures of the Sibson/Rényi family (maximal leakage can be shown to be Sibson Mutual Information of order infinity (Issa et al., 2020)), has been linked to high-probability bounds on the generalization error (Esposito et al., 2021). In particular, given a learning algorithm A trained on a dataset S (made of n samples), one can provide the following guarantee in the case of the 0-1 loss: P(|gen-err(A, S)| ≥ η) ≤ 2 exp(−2nη² + L(S → A(S))). ...
... General bounds on the expected generalization error leveraging arbitrary divergences were given in (Esposito and Gastpar, 2022; Lugosi and Neu, 2022). Another line of work considered instead bounds on the probability of having a large generalization error (Bassily et al., 2018; Esposito et al., 2021; Hellström and Durisi, 2020) and focused on large families of divergences and generalizations of the Mutual Information (in particular of the Sibson/Rényi family, including conditional versions). ...
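Since the excerpts above invoke the maximal leakage L(S → A(S)), it may help to recall its standard definition for finite alphabets (following Issa et al., 2020, cited in the excerpt):

\[
\mathcal{L}(X \to Y) \;=\; \log \sum_{y \in \mathcal{Y}} \; \max_{x \,:\, P_X(x) > 0} P_{Y|X}(y \mid x),
\]

which admits the operational interpretation of the (logarithm of the) multiplicative increase in the probability of correctly guessing any, possibly randomized, function of X after observing Y.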
Preprint
Full-text available
We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics (SGLD). Herein, we use the maximal leakage (equivalently, the Sibson mutual information of order infinity) metric, as it is simple to analyze, and it implies both bounds on the probability of having a large generalization error and on its expected value. We show that, if the update function (e.g., gradient) is bounded in L_2-norm, then adding isotropic Gaussian noise leads to optimal generalization bounds: indeed, the input and output of the learning algorithm in this case are asymptotically statistically independent. Furthermore, we demonstrate how the assumptions on the update function affect the optimal (in the sense of minimizing the induced maximal leakage) choice of the noise. Finally, we compute explicit tight upper bounds on the induced maximal leakage for several scenarios of interest.
... Meanwhile, bounds using chaining mutual information have been proposed in [9], [10]. Other authors have also constructed information-theoretic generalization error bounds based on other information measures, such as α-Rényi divergence for α > 1, f-divergence, and maximal leakage [11]. In [12], an upper bound based on α-Rényi divergence for 0 < α < 1 is derived by using the variational representation of α-Rényi divergence. ...
... This proposition shows that, unlike the mutual information-based and lautum information-based generalisation bounds that currently exist (e.g., [7], [8], [9], and [11]), the proposed generalised α-Jensen-Shannon information generalisation bound is always finite. Corollary 14: (Proved in Appendix B-G) Consider the assumptions in Proposition 8. ...
... If we consider bounded loss functions in the interval [a, b], then our upper bound (31) would be 2 log(2)(b − a), which is less than the total-variation-based constant upper bound 2(b − a) presented in [15], [48]. Note also that this result cannot be immediately recovered from existing approaches such as [11, Theorem 2]. ...
Preprint
Generalization error bounds are essential for comprehending how well machine learning models work. In this work, we suggest a creative method, i.e., the Auxiliary Distribution Method, that derives new upper bounds on generalization errors that are appropriate for supervised learning scenarios. We show that our general upper bounds can be specialized under some conditions to new bounds involving the generalized α-Jensen-Shannon, α-Rényi (0 < α < 1) information between a random variable modeling the set of training samples and another random variable modeling the set of hypotheses. Our upper bounds based on generalized α-Jensen-Shannon information are also finite. Additionally, we demonstrate how our auxiliary distribution method can be used to derive upper bounds on the generalization error under the distribution mismatch scenario in supervised learning algorithms, where the distributional mismatch is modeled as α-Jensen-Shannon or α-Rényi (0 < α < 1) divergence between the distribution of test and training data samples. We also outline the circumstances in which our proposed upper bounds might be tighter than other earlier upper bounds.
... PAC-Bayes bounds are usually based on the Kullback-Leibler divergence from the distribution of hypotheses after training (the "posterior" distribution) to a fixed "prior" distribution [4,5,6,7,8]. Information-theoretic bounds are based on various notions of training set information captured by the training algorithm [9,10,11,12,13,14,15,16,17,18,19]. For both types of bounds the main conclusion is that when a training algorithm captures little information about the training data then the generalization gap should be small. ...
... Information-theoretic generalization bounds have been improved or generalized in many ways. A few works have proposed to use other types of information measures and distances between distributions, instead of Shannon mutual information and Kullback-Leibler divergence respectively [19,22,23]. In particular, Rodríguez Gálvez et al. [23] derived expected generalization gap bounds that depend on the Wasserstein distance between P_{W|Z_i} and P_W. Aminian et al. [22] derived similar bounds that instead depend on sample-wise Jensen-Shannon information I_JS(W; Z_i) ≜ JS(P_{W,Z_i} ‖ P_W ⊗ P_{Z_i}) or on lautum information L(W; Z_i) ≜ KL(P_W ⊗ P_{Z_i} ‖ P_{W,Z_i}). ...
... In particular, Rodríguez Gálvez et al. [23] derived expected generalization gap bounds that depend on the Wasserstein distance between P_{W|Z_i} and P_W. Aminian et al. [22] derived similar bounds that instead depend on sample-wise Jensen-Shannon information I_JS(W; Z_i) ≜ JS(P_{W,Z_i} ‖ P_W ⊗ P_{Z_i}) or on lautum information L(W; Z_i) ≜ KL(P_W ⊗ P_{Z_i} ‖ P_{W,Z_i}). Esposito et al. [19] derive bounds on the probability of an event under a joint distribution P_{X,Y} in terms of the probability of the same event under the product of marginals P_X ⊗ P_Y and an information measure between X and Y (Sibson's α-mutual information, maximal leakage, f-mutual information) or a divergence between P_{X,Y} and P_X ⊗ P_Y (Rényi α-divergences, f-divergences, Hellinger divergences). They note that one can derive in-expectation generalization bounds from these results. ...
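For completeness, the two information measures appearing in this excerpt can be written out explicitly (standard definitions, recalled here as a convenience rather than quoted from the cited works):

\[
\mathrm{JS}\big(P \,\|\, Q\big) \;=\; \tfrac{1}{2}\,\mathrm{KL}\!\big(P \,\big\|\, \tfrac{P+Q}{2}\big) + \tfrac{1}{2}\,\mathrm{KL}\!\big(Q \,\big\|\, \tfrac{P+Q}{2}\big),
\qquad
L(W; Z_i) \;=\; \mathrm{KL}\big(P_W \otimes P_{Z_i} \,\big\|\, P_{W,Z_i}\big),
\]

so that I_JS(W; Z_i) = JS(P_{W,Z_i} ‖ P_W ⊗ P_{Z_i}), while the lautum information swaps the arguments of the KL divergence relative to the usual mutual information.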
Preprint
Some of the tightest information-theoretic generalization bounds depend on the average information between the learned hypothesis and a single training example. However, these sample-wise bounds were derived only for the expected generalization gap. We show that even for the expected squared generalization gap no such sample-wise information-theoretic bounds exist. The same is true for PAC-Bayes and single-draw bounds. Remarkably, PAC-Bayes, single-draw and expected squared generalization gap bounds that depend on information in pairs of examples exist.
... Definition 7 (Rényi's α-Divergence [11]). Let (Ω, F , P ), (Ω, F , Q) be two probability spaces. ...
... The information-theoretic approach was initially introduced by [1,2] and subsequently refined to derive tighter bounds [45–47]. Besides, various other information-theoretic bounds have been proposed, leveraging concepts such as conditional mutual information [12], f-divergence [11], the Wasserstein distance [48,49], and more [50,51]. Some studies have ...
Article
Full-text available
We analyze the generalization properties of batch reinforcement learning (batch RL) with value function approximation from an information-theoretic perspective. We derive generalization bounds for batch RL using (conditional) mutual information. In addition, we demonstrate how to establish a connection between certain structural assumptions on the value function space and conditional mutual information. As a by-product, we derive a high-probability generalization bound via conditional mutual information, which was left open and may be of independent interest.
... Remark 2. Inequality (12) shows that, when considering D_α instead of T_α in the case of α ∈ (1, ∞), the result of Theorem 1 remains valid. ...
... Remark 9. Inequality (12) suggests that all the achievability proofs for T_α as a security measure also hold for D_α when α ∈ (1, ∞). ...
Preprint
Full-text available
Random binning is a widely used technique in information theory with diverse applications. In this paper, we focus on the output statistics of random binning (OSRB) using the Tsallis divergence T_α. We analyze all values of α ∈ (0, ∞) ∪ {∞} and consider three scenarios: (i) the binned sequence is generated i.i.d., (ii) the sequence is randomly chosen from an ε-typical set, and (iii) the sequence originates from an ε-typical set and is passed through a non-memoryless virtual channel. Our proofs cover both achievability and converse results. To address the unbounded nature of T_∞, we extend the OSRB framework using Rényi's divergence with order infinity, denoted D_∞. As part of our exploration, we analyze a specific form of Rényi's conditional entropy and its properties. Additionally, we demonstrate the application of this framework in deriving achievability results for the wiretap channel, where Tsallis divergence serves as a security measure. The secure rate we obtain through the OSRB analysis matches the secure capacity for α ∈ (0, 2] ∪ {∞} and serves as a potential candidate for the secure capacity when α ∈ (2, ∞).
... [11] proposed various generalization error bounds for learning algorithms by defining various notions of stability. [12] and [13] first proposed upper bounds for the generalization error using mutual information, and these results have been generalized to other measures of dependence between input and output in [14], [15], [16], and [17]. [18] presented an algorithm for finite Littlestone classes with improved privacy parameters, from doubly exponential in d to polynomial in d. ...
... Also, H_failure = −Pr(Failure) log₂ Pr(Failure) ≤ 1/(e ln 2), (16) since H_failure achieves its maximum at Pr(Failure) = 1/e. ...
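The constant in the quoted inequality (16) follows from a one-line calculus check, reproduced here for convenience: writing h(p) = −p log₂ p = −p ln p / ln 2,

\[
\frac{\mathrm{d}}{\mathrm{d}p}\big(-p \ln p\big) \;=\; -\ln p - 1 \;=\; 0 \;\Longleftrightarrow\; p = e^{-1},
\qquad
h\big(e^{-1}\big) \;=\; \frac{e^{-1}}{\ln 2} \;=\; \frac{1}{e \ln 2},
\]

so H_failure ≤ 1/(e ln 2) holds irrespective of the actual value of Pr(Failure).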
Preprint
Full-text available
We prove that every online learnable class of functions of Littlestone dimension d admits a learning algorithm with finite information complexity. Towards this end, we use the notion of a globally stable algorithm. Generally, the information complexity of such a globally stable algorithm is large yet finite, roughly exponential in d. We also show there is room for improvement; for a canonical online learnable class, indicator functions of affine subspaces of dimension d, the information complexity can be upper bounded logarithmically in d.
... Data science, information theory, probability theory, statistical learning, statistical signal processing, and other related disciplines greatly benefit from non-negative measures of dissimilarity between pairs of probability measures. These are known as divergence measures, and exploring their mathematical foundations and diverse applications is of significant interest (see, e.g., [1–10] and references therein). ...
... Basic properties of an f-divergence are its non-negativity, convexity in the pair of probability measures, and the satisfiability of data-processing inequalities as a result of the convexity of the function f (and by the requirement that f vanishes at 1). These properties lead to f-divergence inequalities, and to information-theoretic applications (see, e.g., [4,10,32–37]). Furthermore, tightened (strong) data-processing inequalities for f-divergences have been of recent interest (see, e.g., [38–42]). ...
Article
Full-text available
Data science, information theory, probability theory, statistical learning, statistical signal processing, and other related disciplines greatly benefit from non-negative measures of dissimilarity between pairs of probability measures[...]
... • Generalization in Quantum Machine Learning: Classical maximal leakage has already been utilized to better understand generalization of machine learning models [16,17]. Therefore, we expect to be able to use maximal quantum leakage to analyze the generalization of various quantum machine learning models. ...
Preprint
Full-text available
A new measure of information leakage for quantum encoding of classical data is defined. An adversary can access a single copy of the state of a quantum system that encodes some classical data and is interested in correctly guessing a general randomized or deterministic function of the data (e.g., a specific feature or attribute of the data in quantum machine learning) that is unknown to the security analyst. The resulting measure of information leakage, referred to as maximal quantum leakage, is the multiplicative increase of the probability of correctly guessing any function of the data upon observing measurements of the quantum state. Maximal quantum leakage is shown to satisfy post-processing inequality (i.e., applying a quantum channel reduces information leakage) and independence property (i.e., leakage is zero if the quantum state is independent of the classical data), which are fundamental properties required for privacy and security analysis. It also bounds accessible information. Effects of global and local depolarizing noise models on the maximal quantum leakage are established.
... Hellström and Durisi [19] later resolved this issue to obtain analogous PAC-Bayes bounds employing properties unique to subgaussian random variables [20, Theorem 2.6]. Esposito et al. [39] also derived PAC-Bayes bounds for this setting considering different dependency measures. ...
Preprint
Full-text available
In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we present a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast rate and mixed rate bounds that are interpretable and tighter than previous bounds in the literature. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss's cumulant generating function is bounded, and a bound when the loss's second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. Finally, we extend all previous results to anytime-valid bounds using a simple technique applicable to any existing bound.
... Another recent measure of interest is maximal leakage, which seeks to control the adversary's ability to refine his or her estimate of any function of the data [26], [27]. Maximal leakage has been recently discussed in the context of hypothesis testing: privacy-utility trade-offs using maximal leakage as a privacy metric and the type II (false alarm) error exponent as the utility metric have been studied in [28]; in [29] the so-called "noiseless privacy" is related to hypothesis testing and to maximal leakage; and maximal leakage is used to bound generalization errors of learning algorithms in [30]. ...
Article
Full-text available
As the applications of biometric recognition systems are increasing rapidly, there is a growing need to secure the sensitive data used within these systems. Considering privacy challenges in such systems, different biometric template protection (BTP) schemes were proposed in the literature, and the ISO/IEC 24745 standard defined a number of requirements for protecting biometric templates. While there are several studies on evaluating different requirements of the ISO/IEC 24745 standard, there have been few studies on how to measure the linkability of biometric templates. In this paper, we propose a new method for measuring linkability of protected biometric templates. The proposed method is based on maximal leakage, which is a well-studied measure in information-theoretic literature. We show that the resulting linkability measure has a number of important theoretical properties and an operational interpretation in terms of statistical hypothesis testing. We compare the proposed measure to two other linkability measures: one previously introduced in the literature, and a similar measure based on differential privacy. In our experiments, we use the proposed measure to evaluate the linkability of biometric templates from different biometric characteristics (face, voice, and finger vein), which are protected with different BTP schemes. The source codes of our proposed measure and all experiments are publicly available.
... Most of these studies use a surrogate loss to avoid dealing with the zero gradient of the misclassification loss. There are some other works which use an information-theoretic approach to find PAC bounds on generalization errors for machine learning (Xu and Raginsky, 2017; Esposito et al., 2021) and deep learning (Jakubovitz et al., 2018). ...
Preprint
Full-text available
In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation and proved that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. In this paper, we show that this fact still holds for DEQs with any general activation which has bounded first and second derivatives. Since the new activation function is generally non-linear, a general population Gram matrix is designed, and a new form of dual activation with Hermite polynomial expansion is developed.
... Asadi et al. (2018) proposed using chaining mutual information, and Hafez-Kolahi et al. (2020); Haghifam et al. (2020); Steinke and Zakynthinou (2020) advocated the conditioning and processing techniques. Information-theoretic generalization error bounds using other information quantities are also studied, e.g., f-divergences, α-Rényi divergence, and generalized Jensen-Shannon divergence (Esposito et al., 2021; Aminian et al., 2022b, 2021b). ...
Preprint
Full-text available
This paper provides an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. This characterization is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. It can be applied to obtain distribution-free upper and lower bounds on the gen-error. Our findings offer new insights that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and input training data but also by the information shared between the labeled and pseudo-labeled data samples. To deepen our understanding, we further explore two examples -- mean estimation and logistic regression. In particular, we analyze how the ratio of the number of unlabeled to labeled data λ affects the gen-error under both scenarios. As λ increases, the gen-error for mean estimation decreases and then saturates at a value larger than when all the samples are labeled, and the gap can be quantified exactly with our analysis, and is dependent on the cross-covariance between the labeled and pseudo-labeled data sample. In logistic regression, the gen-error and the variance component of the excess risk also decrease as λ increases.
... [26] provides tighter bounds by considering the individual sample mutual information, [27], [28] propose using chaining mutual information, and [29]–[31] advocate the conditioning and processing techniques. Information-theoretic generalization error bounds using other information quantities are also studied, such as f-divergence [32], α-Rényi divergence and maximal leakage [33], [34], Jensen-Shannon divergence [35], [36], and Wasserstein distance [37]–[40]. Using rate-distortion theory, [41]–[43] provide information-theoretic generalization error upper bounds for model misspecification and model compression. ...
Preprint
Various approaches have been developed to upper bound the generalization error of a supervised learning algorithm. However, existing bounds are often loose and even vacuous when evaluated in practice. As a result, they may fail to characterize the exact generalization ability of a learning algorithm. Our main contributions are exact characterizations of the expected generalization error of the well-known Gibbs algorithm (a.k.a. Gibbs posterior) using different information measures, in particular, the symmetrized KL information between the input training samples and the output hypothesis. Our result can be applied to tighten existing expected generalization error and PAC-Bayesian bounds. Our information-theoretic approach is versatile, as it also characterizes the generalization error of the Gibbs algorithm with a data-dependent regularizer and that of the Gibbs algorithm in the asymptotic regime, where it converges to the standard empirical risk minimization algorithm. Of particular relevance, our results highlight the role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.
... Lopez and Jog (2018) derived upper bounds on the generalization error using the Wasserstein distance involving the distributions of input data and output hypothesis, which are shown to be tighter under some natural cases. Esposito et al. (2021) derived generalization error bounds via Rényi-, f -divergences and maximal leakage. Steinke and Zakynthinou (2020) proposed using the Conditional Mutual Information (CMI) to bound the generalization error; the CMI is useful as it possesses the chain rule property. ...
Article
Full-text available
Using information-theoretic principles, we consider the generalization error (gen-error) of iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data to progressively refine the model parameters. In contrast to most previous works that bound the gen-error, we provide an exact expression for the gen-error and particularize it to the binary Gaussian mixture model. Our theoretical results suggest that when the class conditional variances are not too large, the gen-error decreases with the number of iterations, but quickly saturates. On the flip side, if the class conditional variances (and so amount of overlap between the classes) are large, the gen-error increases with the number of iterations. To mitigate this undesirable effect, we show that regularization can reduce the gen-error. The theoretical results are corroborated by extensive experiments on the MNIST and CIFAR datasets in which we notice that for easy-to-distinguish classes, the gen-error improves after several pseudo-labelling iterations, but saturates afterwards, and for more difficult-to-distinguish classes, regularization improves the generalization performance.
... Several authors used other methods to estimate the misclassification error with a non-zero gradient by proposing new training algorithms to evaluate the optimal output distribution in PAC-Bayesian bounds analytically [19–21]. Recently, there have been some interesting works which use an information-theoretic approach to find PAC bounds on generalization errors for machine learning [22,23] and deep learning [24]. ...
Preprint
Full-text available
In this paper, we develop some novel bounds for the Rademacher complexity and the generalization error in deep learning with i.i.d. and Markov datasets. The new Rademacher complexity and generalization bounds are tight up to O(1/√n), where n is the size of the training set. They can decay exponentially in the depth L for some neural network structures. The development of Talagrand's contraction lemmas for high-dimensional mappings between function spaces and deep neural networks for general activation functions is a key technical contribution to this work.
... Bu et al. [8] have derived tighter generalization error bounds based on individual sample mutual information. Generalization error bounds based on other information measures such as α-Rényi divergence [9], maximal leakage [10], Jensen-Shannon divergence [11], Wasserstein distances [12,13], and individual sample Wasserstein distance [14] are also considered. The chaining mutual information technique is proposed in [15] and [16] to further improve the mutual information-based bound. ...
Preprint
Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.
... Hellström and Durisi (2020a,b) provide a variety of improvements over the standard bound, such as proving subgaussian high-probability bounds in terms of a "disintegrated" version of the mutual information, and highlighting connections with PAC-Bayes bounds. Among other contributions, Esposito et al. (2021) provided generalization bounds in terms of Rényi's α-divergences and Csiszár's f -divergences, focusing on high-probability guarantees with subgaussian tails. Going beyond subgaussian losses, Zhang et al. (2018) and Wang et al. (2019) provided bounds in terms of the Wasserstein distance between P Wn|Sn and P Wn under the condition that the loss function is Lipschitz. ...
Preprint
Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of p-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
Preprint
Full-text available
We study which machine learning algorithms have tight generalization bounds. First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently stable do have tight generalization bounds. We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm's loss.
Article
In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a “variable-size compressibility” framework that we introduce newly here. In this framework, the generalization error of an algorithm is linked to a variable-size ‘compression rate’ of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data at hand, rather than its unknown distribution. Our new generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectations bounds. Moreover, it is shown that our framework also allows to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume and possibly improve over several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds that are recovered as special cases, thus unveiling a unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.
Article
Generalization error bounds are essential for comprehending how well machine learning models work. In this work, we suggest a novel method, i.e., the Auxiliary Distribution Method, that leads to new upper bounds on expected generalization errors that are appropriate for supervised learning scenarios. We show that our general upper bounds can be specialized under some conditions to new bounds involving the α-Jensen-Shannon, α-Rényi (0 < α < 1) information between a random variable modeling the set of training samples and another random variable modeling the set of hypotheses. Our upper bounds based on α-Jensen-Shannon information are also finite. Additionally, we demonstrate how our auxiliary distribution method can be used to derive upper bounds on the excess risk of some learning algorithms in the supervised learning context and on the generalization error under the distribution mismatch scenario in supervised learning algorithms, where the distribution mismatch is modeled as α-Jensen-Shannon or α-Rényi divergence between the test and training data sample distributions. We also outline the conditions for which our proposed upper bounds might be tighter than other earlier upper bounds.
Article
We propose a novel approach to concentration for non-independent random variables. The main idea is to “pretend” that the random variables are independent and pay a multiplicative price measuring how far they are from actually being independent. This price is encapsulated in the Hellinger integral between the joint and the product of the marginals, which is then upper bounded leveraging tensorisation properties. Our bounds represent a natural generalisation of concentration inequalities in the presence of dependence: we recover exactly the classical bounds (McDiarmid’s inequality) when the random variables are independent. Furthermore, in a “large deviations” regime, we obtain the same decay in the probability as for the independent case, even when the random variables display non-trivial dependencies. To show this, we consider a number of applications of interest. First, we provide a bound for Markov chains with finite state space. Then, we consider the Simple Symmetric Random Walk, which is a non-contracting Markov chain, and a non-Markovian setting in which the stochastic process depends on its entire past. To conclude, we propose an application to Markov Chain Monte Carlo methods, where our approach leads to an improved lower bound on the minimum burn-in period required to reach a certain accuracy. In all of these settings, we provide a regime of parameters in which our bound fares better than what the state of the art can provide.
Article
An alternative measure of information leakage for quantum encoding of classical data is defined. An adversary can access a single copy of the state of a quantum system that encodes some classical data and is interested in correctly guessing a general randomized or deterministic function of the data (e.g., a specific feature or attribute of the data in quantum machine learning) that is unknown to the security analyst. The resulting measure of information leakage, referred to as maximal quantum leakage, is the multiplicative increase of the probability of correctly guessing any function of the classical data upon observing measurements of the quantum state. Maximal quantum leakage is shown to satisfy the postprocessing inequality (i.e., applying a quantum channel reduces information leakage) and independence property (i.e., leakage is zero if the quantum state is independent of the classical data), which are fundamental properties required for privacy and security analysis. It also bounds accessible information. Effects of global and local depolarizing noise models on the maximal quantum leakage are established.
Article
The empirical risk minimization (ERM) problem with relative entropy regularization (ERM-RER) is investigated under the assumption that the reference measure is a σ-finite measure, and not necessarily a probability measure. Under this assumption, which leads to a generalization of the ERM-RER problem allowing a larger degree of flexibility for incorporating prior knowledge, numerous relevant properties are stated. Among these properties, the solution to this problem, if it exists, is shown to be a unique probability measure, mutually absolutely continuous with the reference measure. Such a solution exhibits a probably-approximately-correct guarantee for the ERM problem independently of whether the latter possesses a solution. For a fixed dataset and under a specific condition, the empirical risk is shown to be a sub-Gaussian random variable when the models are sampled from the solution to the ERM-RER problem. The generalization capabilities of the solution to the ERM-RER problem (the Gibbs algorithm) are studied via the sensitivity of the expected empirical risk to deviations from such a solution towards alternative probability measures. Finally, an interesting connection between sensitivity, generalization error, and lautum information is established.
Article
Multisource information fusion is a comprehensive and interdisciplinary subject. Dempster-Shafer (D-S) evidence theory copes with uncertain information effectively. Pattern classification is the core research content of pattern recognition, and multisource information fusion based on D-S evidence theory can be effectively applied to pattern classification problems. However, in D-S evidence theory, highly-conflicting evidence may cause counterintuitive fusion results. Belief divergence theory is one of the theories that are proposed to address problems of highly-conflicting evidence. Although belief divergence can deal with conflict between evidence, none of the existing belief divergence methods has considered how to effectively measure the discrepancy between two pieces of evidence with time evolution. In this study, a novel fractal belief Rényi (FBR) divergence is proposed to handle this problem. We assume that it is the first divergence that extends the concept of fractal to Rényi divergence. The advantage is measuring the discrepancy between two pieces of evidence with time evolution, which satisfies several properties and is flexible and practical in various circumstances. Furthermore, a novel algorithm for multisource information fusion based on FBR divergence, namely FBReD-based weighted multisource information fusion, is developed. Ultimately, the proposed multisource information fusion algorithm is applied to a series of experiments for pattern classification based on real datasets, where our proposed algorithm achieved superior performance.
Article
Various approaches have been developed to upper bound the generalization error of a supervised learning algorithm. However, existing bounds are often loose and even vacuous when evaluated in practice. As a result, they may fail to characterize the exact generalization ability of a learning algorithm. Our main contributions are exact characterizations of the expected generalization error of the well-known Gibbs algorithm (a.k.a. Gibbs posterior) using different information measures, in particular, the symmetrized KL information between the input training samples and the output hypothesis. Our result can be applied to tighten existing expected generalization error and PAC-Bayesian bounds. Our information-theoretic approach is versatile, as it also characterizes the generalization error of the Gibbs algorithm with a data-dependent regularizer and that of the Gibbs algorithm in the asymptotic regime, where it converges to the standard empirical risk minimization algorithm. Of particular relevance, our results highlight the role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.
Preprint
Full-text available
The dependence on training data of the Gibbs algorithm (GA) is analytically characterized. By adopting the expected empirical risk as the performance metric, the sensitivity of the GA is obtained in closed form. In this case, sensitivity is the performance difference with respect to an arbitrary alternative algorithm. This description enables the development of explicit expressions involving the training errors and test errors of GAs trained with different datasets. Using these tools, dataset aggregation is studied and different figures of merit to evaluate the generalization capabilities of GAs are introduced. For particular sizes of such datasets and parameters of the GAs, a connection between Jeffrey's divergence, training and test errors is established.
Article
In this paper, we provide three applications for f-divergences: (i) we introduce Sanov's upper bound on the tail probability of the sum of independent random variables based on super-modular f-divergence and show that our generalized Sanov's bound strictly improves over the ordinary one, (ii) we consider the lossy compression problem which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual f-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular f-divergences, and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual f-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the f-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds.
Article
We introduce a tunable loss function called α\alpha -loss, parameterized by α(0,]\alpha \in (0,\infty] , which interpolates between the exponential loss ( α=1/2\alpha = 1/2 ), the log-loss ( α=1\alpha = 1 ), and the 0–1 loss ( α=\alpha = \infty ), for the machine learning setting of classification. Theoretically, we illustrate a fundamental connection between α\alpha -loss and Arimoto conditional entropy, verify the classification-calibration of α\alpha -loss in order to demonstrate asymptotic optimality via Rademacher complexity generalization techniques, and build upon a notion called strictly local quasi-convexity in order to quantitatively characterize the optimization landscape of α\alpha -loss. Practically, we perform class imbalance, robustness, and classification experiments on benchmark image datasets using convolutional neural networks. Our main practical conclusion is that certain tasks may benefit from tuning α\alpha -loss away from log-loss ( α=1\alpha = 1 ), and to this end we provide simple heuristics for the practitioner. In particular, navigating the α\alpha hyperparameter can readily provide superior model robustness to label flips ( α>1\alpha > 1 ) and sensitivity to imbalanced classes ( α<1\alpha < 1 ).
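As a small illustration of the interpolation described above, the sketch below assumes the commonly used parameterization ℓ_α(p) = α/(α−1) · (1 − p^{1−1/α}) of the loss assigned to the probability p of the true class; this exact form is not spelled out in the abstract, so treat it as an assumption.

```python
import math

def alpha_loss(p, alpha):
    """alpha-loss of the probability p assigned to the true class (a sketch).

    Assumed parameterization: l_alpha(p) = alpha/(alpha - 1) * (1 - p**(1 - 1/alpha)),
    which recovers the log-loss -log(p) as alpha -> 1 and 1 - p as alpha -> infinity.
    """
    if math.isclose(alpha, 1.0):
        return -math.log(p)              # log-loss limit
    if math.isinf(alpha):
        return 1.0 - p                   # (soft) 0-1 loss limit
    return alpha / (alpha - 1.0) * (1.0 - p ** (1.0 - 1.0 / alpha))

p = 0.7
for a in (0.5, 0.999, 1.0, 2.0, 10.0, math.inf):
    print(f"alpha={a:>6}: loss={alpha_loss(p, a):.4f}")
# Values near alpha = 1 should approach -log(0.7) ~ 0.3567,
# and large alpha should approach 1 - 0.7 = 0.3.
```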
Preprint
Full-text available
This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.
Article
Full-text available
We propose the notion of sub‐Weibull distributions, which are characterised by tails lighter than (or equally light as) the right tail of a Weibull distribution. This novel class generalises the sub‐Gaussian and sub‐Exponential families to potentially heavier‐tailed distributions. Sub‐Weibull distributions are parameterized by a positive tail index θ and reduce to sub‐Gaussian distributions for θ = 1/2 and to sub‐Exponential distributions for θ = 1. A characterisation of the sub‐Weibull property based on moments and on the moment generating function is provided and properties of the class are studied. An estimation procedure for the tail parameter is proposed and is applied to an example stemming from Bayesian deep learning.
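A quick numerical sketch of the moment side of this characterization, assuming the standard criterion that X is sub-Weibull with tail index θ when (E|X|^k)^{1/k} grows at most like k^θ; the Gaussian (θ = 1/2) and Exponential (θ = 1) cases are checked with exact moment formulas.

```python
import math

def norm_abs_moment_root(k):
    """(E|Z|^k)^(1/k) for Z ~ N(0,1), using E|Z|^k = 2^(k/2) * Gamma((k+1)/2) / sqrt(pi)."""
    m = 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)
    return m ** (1 / k)

def expo_moment_root(k):
    """(E X^k)^(1/k) for X ~ Exp(1), using E X^k = Gamma(k+1)."""
    return math.gamma(k + 1) ** (1 / k)

# Assumed moment criterion for sub-Weibull(theta): ||X||_k <= C * k**theta for all k >= 1.
# The ratios below should stay bounded if the claimed tail index is right.
for k in range(1, 21):
    print(k,
          round(norm_abs_moment_root(k) / k ** 0.5, 3),   # Gaussian: theta = 1/2
          round(expo_moment_root(k) / k ** 1.0, 3))       # Exponential: theta = 1
```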
Article
Full-text available
An information-theoretic upper bound on the generalization error of supervised learning algorithms is derived. The bound is constructed in terms of the mutual information between each individual training sample and the output of the learning algorithm. The bound is derived under more general conditions on the loss function than in existing studies; nevertheless, it provides a tighter characterization of the generalization error. Examples of learning algorithms are provided to demonstrate the tightness of the bound, and to show that it has a broad range of applicability. Application to noisy and iterative algorithms, e.g., stochastic gradient Langevin dynamics (SGLD), is also studied, where the constructed bound provides a tighter characterization of the generalization error than existing results. Finally, it is demonstrated that, unlike existing bounds, which are difficult to compute and evaluate empirically, the proposed bound can be estimated easily in practice.
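The per-sample idea can be made concrete with a small numerical sketch. The snippet below assumes the commonly cited individual-sample bound (1/n) Σ_i sqrt(2σ² I(Z_i; W)) for a σ-sub-Gaussian loss and, as a purely illustrative algorithm, takes W to be the sample mean of n i.i.d. N(μ, s²) points, for which I(Z_i; W) = ½ log(n/(n−1)) nats; both choices are assumptions for illustration rather than details taken from the abstract.

```python
import numpy as np

def individual_sample_mi_bound(mi_per_sample, sigma):
    """Per-sample mutual-information generalization bound (a sketch):
    (1/n) * sum_i sqrt(2 * sigma**2 * I(Z_i; W)),
    assuming the loss is sigma-sub-Gaussian under the data distribution."""
    mi = np.asarray(mi_per_sample, dtype=float)
    return float(np.mean(np.sqrt(2.0 * sigma ** 2 * mi)))

# Illustrative (hypothetical) algorithm: W = sample mean of n i.i.d. N(mu, s^2) points.
# Then I(Z_i; W) = 0.5 * log(n / (n - 1)) nats for every i.
for n in (10, 100, 1000):
    mi = np.full(n, 0.5 * np.log(n / (n - 1.0)))
    print(n, individual_sample_mi_bound(mi, sigma=1.0))   # decays roughly like 1/sqrt(n)
```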
Article
Full-text available
We study learning algorithms that are restricted to revealing little information about their input sample. Various manifestations of this notion have been recently studied. A central theme in these works, and in ours, is that such algorithms generalize. We study a category of learning algorithms, which we term d-bit information learners. These are algorithms whose output conveys at most d bits of information on their input. We focus on the learning capacity of such algorithms: we prove generalization bounds with tight dependencies on the confidence and error parameters. We observe connections with well studied notions such as PAC-Bayes and differential privacy. For example, it is known that pure differentially private algorithms leak little information. We complement this fact with a separation between bounded information and pure differential privacy in the setting of proper learning, showing that differential privacy is strictly more restrictive. We also demonstrate limitations by exhibiting simple concept classes for which every (possibly randomized) empirical risk minimizer must leak a lot of information. On the other hand, we show that in the distribution-dependent setting every VC class has empirical risk minimizers that do not leak a lot of information.
Article
Full-text available
This paper develops systematic approaches to obtain f-divergence inequalities, dealing with pairs of probability measures defined on arbitrary alphabets. Functional domination is one such approach, where special emphasis is placed on finding the best possible constant upper bounding a ratio of f-divergences. Another approach used for the derivation of bounds among f-divergences relies on moment inequalities and the logarithmic-convexity property, which results in tight bounds on the relative entropy and Bhattacharyya distance in terms of χ² divergences. A rich variety of bounds are shown to hold under boundedness assumptions on the relative information. Special attention is devoted to the total variation distance and its relation to the relative information and relative entropy, including “reverse Pinsker inequalities,” as well as to the E_γ divergence, which generalizes the total variation distance. Pinsker's inequality is extended for this type of f-divergence, a result which leads to an inequality linking the relative entropy and relative information spectrum. Integral expressions of the Rényi divergence in terms of the relative information spectrum are derived, leading to bounds on the Rényi divergence in terms of either the variational distance or relative entropy.
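Two relations of the kind surveyed in this line of work can be checked numerically on random discrete distributions. The snippet below is a minimal sketch assuming Pinsker's inequality TV ≤ sqrt(D/2) (in nats) and the elementary bound D ≤ log(1 + χ²); both are standard inequalities rather than statements quoted from the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):        # relative entropy (KL divergence) in nats
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):        # total variation distance
    return 0.5 * float(np.sum(np.abs(p - q)))

def chi2(p, q):      # chi-squared divergence
    return float(np.sum((p - q) ** 2 / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    # Pinsker:  TV <= sqrt(KL / 2)      Jensen:  KL <= log(1 + chi^2)
    print(round(tv(p, q), 4), "<=", round(np.sqrt(kl(p, q) / 2), 4),
          "|", round(kl(p, q), 4), "<=", round(np.log1p(chi2(p, q)), 4))
```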
Article
Full-text available
In this paper, we initiate a principled study of how the generalization properties of approximate differential privacy can be used to perform adaptive hypothesis testing, while giving statistically valid p-value corrections. We do this by observing that the guarantees of algorithms with bounded approximate max-information are sufficient to correct the p-values of adaptively chosen hypotheses, and then by proving that algorithms that satisfy (ϵ,δ)(\epsilon,\delta)-differential privacy have bounded approximate max information when their inputs are drawn from a product distribution. This substantially extends the known connection between differential privacy and max-information, which previously was only known to hold for (pure) (ϵ,0)(\epsilon,0)-differential privacy. It also extends our understanding of max-information as a partially unifying measure controlling the generalization properties of adaptive data analyses. We also show a lower bound, proving that (despite the strong composition properties of max-information), when data is drawn from a product distribution, (ϵ,δ)(\epsilon,\delta)-differentially private algorithms can come first in a composition with other algorithms satisfying max-information bounds, but not necessarily second if the composition is required to itself satisfy a nontrivial max-information bound. This, in particular, implies that the connection between (ϵ,δ)(\epsilon,\delta)-differential privacy and max-information holds only for inputs drawn from product distributions, unlike the connection between (ϵ,0)(\epsilon,0)-differential privacy and max-information.
Article
Full-text available
Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions. In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment. We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy based approach given in (Dwork et al., 2014) is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce.
Book
Full-text available
A comprehensive introduction and reference guide to the minimum description length (MDL) Principle that is accessible to researchers dealing with inductive inference in diverse areas including statistics, pattern classification, machine learning, data mining, biology, econometrics, and experimental psychology, as well as philosophers interested in the foundations of statistics. The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern. This extensive, step-by-step introduction to the MDL Principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
Article
Full-text available
Rényi divergence is related to Rényi entropy much like Kullback-Leibler divergence is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Rényi divergence of order 1 equals the Kullback-Leibler divergence. We review and extend the most important properties of Rényi divergence and Kullback-Leibler divergence, including convexity, continuity, limits of σ-algebras and the relation of the special order 0 to the Gaussian dichotomy and contiguity. We also extend the known equivalence between channel capacity and minimax redundancy to continuous channel inputs (for all orders), and present several other minimax results.
Article
Full-text available
We present a proof that in Orlicz spaces the Amemiya norm and the Orlicz norm coincide for any Orlicz function ϕ. This gives the answer for an open problem. We also give a description of the Amemiya type for the Mazur-Orlicz F-norm.
Conference Paper
Full-text available
Recent research in quantitative theories for information-hiding topics, such as Anonymity and Secure Information Flow, tends to converge towards the idea of modeling the system as a noisy channel in the information-theoretic sense. The notion of information leakage, or vulnerability of the system, has been related in some approaches to the concept of mutual information of the channel. A recent work of Smith has shown, however, that if the attack consists in one single try, then the mutual information and other concepts based on Shannon entropy are not suitable, and he has proposed to use Rényi's min-entropy instead. In this paper, we consider and compare two different possibilities of defining the leakage, based on the Bayes risk, a concept related to Rényi min-entropy.
Article
Full-text available
We define notions of stability for learning algorithms and derive bounds on their generalization error based on the empirical error and the leave-one-out error. We then study the stability properties of large classes of learning algorithms such as regularization based algorithms. In particular we focus on Hilbert space regularization and Kullback-Leibler regularization. We then apply the results to SVM for regression and classification and to maximum entropy discrimination. Deriving bounds on the generalization error of a learning system is one of the major issues in statistical learning theory. Many researchers [1, 2, 3, 8, 13] have focused on these bounds by developing worst-case analyses. That is, they showed that for all functions in a hypothesis space, the empirical error is close to the generalization error. In this report, we develop a different analysis inspired by [7, 6] where the generalization error is studied with respect to the learning algorithm. The goal ...
Article
Full-text available
We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms such as regularization based algorithms. In particular we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.
Book
This monograph presents a mathematical theory of concentration inequalities for functions of independent random variables. The basic phenomenon under investigation is that if a function of many independent random variables does not depend too much on any of them then it is concentrated around its expected value. This book offers a host of inequalities to quantify this statement. The authors describe the interplay between the probabilistic structure (independence) and a variety of tools ranging from functional inequalities, transportation arguments, to information theory. Applications to the study of empirical processes, random projections, random matrix theory, and threshold phenomena are presented. The book offers a self-contained introduction to concentration inequalities, including a survey of concentration of sums of independent random variables, variance bounds, the entropy method, and the transportation method. Deep connections with isoperimetric problems are revealed. Special attention is paid to applications to the supremum of empirical processes.
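As a minimal illustration of the basic phenomenon described here, the following sketch checks Hoeffding's inequality for the mean of n i.i.d. variables in [0, 1], namely P(mean − E[mean] ≥ t) ≤ exp(−2nt²), by Monte Carlo; the constants are the textbook ones, not taken from this monograph.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, t = 50, 100_000, 0.1

x = rng.uniform(0.0, 1.0, size=(trials, n))       # i.i.d. in [0, 1], mean 0.5
empirical = float(np.mean(x.mean(axis=1) - 0.5 >= t))
hoeffding = float(np.exp(-2.0 * n * t ** 2))      # P(mean - 0.5 >= t) <= exp(-2 n t^2)
print(empirical, "<=", hoeffding)                 # the empirical tail should sit below the bound
```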
Article
Given two random variables X and Y, an operational approach is undertaken to quantify the “leakage” of information from X to Y. The resulting measure L(X→Y) is called maximal leakage, and is defined as the multiplicative increase, upon observing Y, of the probability of correctly guessing a randomized function of X, maximized over all such randomized functions. A closed-form expression for L(X→Y) is given for discrete X and Y, and it is subsequently generalized to handle a large class of random variables. The resulting properties are shown to be consistent with an axiomatic view of a leakage measure, and the definition is shown to be robust to variations in the setup. Moreover, a variant of the Shannon cipher system is studied, in which performance of an encryption scheme is measured using maximal leakage. A single-letter characterization of the optimal limit of (normalized) maximal leakage is derived and asymptotically-optimal encryption schemes are demonstrated. Furthermore, the sample complexity of estimating maximal leakage from data is characterized up to subpolynomial factors. Finally, the guessing framework used to define maximal leakage is used to give operational interpretations of commonly used leakage measures, such as Shannon capacity, maximal correlation, and local differential privacy.
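A small sketch of the closed-form expression mentioned in the abstract: for discrete X with full support and channel P_{Y|X}, the maximal leakage takes the form L(X→Y) = log Σ_y max_x P_{Y|X}(y|x) (in nats here); the examples below are purely illustrative.

```python
import numpy as np

def maximal_leakage(channel):
    """Maximal leakage L(X -> Y) in nats for a discrete channel (a sketch).

    `channel[x, y] = P(Y = y | X = x)`; for X with full support the closed form
    is the log of the sum over y of the column-wise maxima.
    """
    channel = np.asarray(channel, dtype=float)
    return float(np.log(np.sum(channel.max(axis=0))))

# Binary symmetric channel with crossover probability eps.
eps = 0.1
bsc = np.array([[1 - eps, eps],
                [eps, 1 - eps]])
print(maximal_leakage(bsc))           # log(2 * (1 - eps)) = log(1.8) ~ 0.588 nats

# A noiseless channel leaks everything: log |X|.
print(maximal_leakage(np.eye(4)))     # log(4) ~ 1.386 nats
```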
Article
In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information I(S;W) between the algorithm input S and the algorithm output W, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.
Article
We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The upper bounds provide theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information of a learning algorithm. The results can also be used to analyze the generalization capability of learning algorithms under adaptive composition, and the bias-accuracy tradeoffs in adaptive data analytics. Our work extends and leads to nontrivial improvements on the recent results of Russo and Zou.
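For reference, a sketch of the bound in question in its commonly stated form, with σ the sub-Gaussianity parameter of the loss under the data distribution and n the number of training samples; the notation below is assumed for illustration rather than taken from the abstract.

```latex
% Assumed standard form of the input-output mutual information bound:
% if the loss \ell(w, Z) is \sigma-sub-Gaussian under the data distribution for every w,
% then the expected generalization error of an algorithm P_{W|S} satisfies
\bigl|\mathbb{E}\,[\,L_\mu(W) - L_S(W)\,]\bigr| \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S;W)},
% where L_\mu is the population risk, L_S the empirical risk on the training sample S,
% and I(S;W) the mutual information between the sample and the output hypothesis.
```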
Article
Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of analysis to be performed next depends on the results of the previous analyses on the same data. It's widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while adaptivity renders standard statistical theory invalid, folklore and experience suggest that not all types of adaptive analysis are equally at risk for false discoveries. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias and other statistics of an arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. We first consider several popular feature selection protocols, like rank selection or variance-based selection. We then consider the practice of adding random noise to the observations or to the reported statistics, which is advocated by related ideas from differential privacy and blinded data analysis. We discuss the connections between these techniques and our framework, and supplement our results with illustrative simulations.
Article
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
Article
Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.
Article
Rényi's entropy and divergence of order α are given operational characterizations in terms of block coding and hypothesis testing, as so-called β-cutoff rates, with α = (1 + β)⁻¹ for entropy and α = (1 − β)⁻¹ for divergence. Out of several possible definitions of mutual information and channel capacity of order α, our approach distinguishes one that admits an operational characterization as the β-cutoff rate for channel coding, with α = (1 − β)⁻¹. The ordinary cutoff rate of a DMC corresponds to β = −1.
Article
Rényi divergence is related to Rényi entropy much like Kullback-Leibler divergence is related to Shannon's entropy, and comes up in many settings. It was introduced by Rényi as a measure of information that satisfies almost the same axioms as Kullback-Leibler divergence, and depends on a parameter that is called its order. In particular, the Rényi divergence of order 1 equals the Kullback-Leibler divergence. We review and extend the most important properties of Rényi divergence and Kullback-Leibler divergence, including convexity, continuity, limits of σ\sigma -algebras, and the relation of the special order 0 to the Gaussian dichotomy and contiguity. We also show how to generalize the Pythagorean inequality to orders different from 1, and we extend the known equivalence between channel capacity and minimax redundancy to continuous channel inputs (for all orders) and present several other minimax results.
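For discrete distributions the order-α Rényi divergence is D_α(P‖Q) = (1/(α−1)) log Σ_x p(x)^α q(x)^{1−α}, and, as the abstract notes, the Kullback-Leibler divergence is recovered as α → 1. The sketch below checks this limit numerically; the definition used is the standard one and is assumed here rather than copied from the paper.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) = 1/(alpha - 1) * log sum_x p(x)^alpha q(x)^(1 - alpha), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))     # KL divergence (the alpha -> 1 limit)
    return float(np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
for a in (0.5, 0.9, 0.99, 1.0, 2.0, 10.0):
    print(f"alpha={a:>5}: {renyi_divergence(p, q, a):.4f}")
# The values at alpha = 0.99 and alpha = 1.0 should nearly coincide,
# and D_alpha should be non-decreasing in alpha.
```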
Article
This paper is an account of a new method of constructing measures of divergence between probability measures; the new divergence measures so constructed are called information radius measures. They are information-theoretic in character, and are based on the work of Rényi [8] and Csiszár [2, 3]. The divergence measure K_1 can be used for the measurement of dissimilarity in numerical taxonomy, and its application to this field is discussed in Jardine and Sibson [5]; it was this application which originally motivated the study of information radius. Other forms of information radius are related to the variation distance, and the normal information radius discussed in Section 3 is related to the Mahalanobis D² statistic. This paper is in part intended to lay the mathematical foundations for [5], but because information radius appears to be of some general interest, the investigation of its properties is here carried further than is needed for the applications discussed in [5].
Article
The paper deals with the f-divergences of Csiszár generalizing the discrimination information of Kullback, the total variation distance, the Hellinger divergence, and the Pearson divergence. All basic properties of f-divergences including relations to the decision errors are proved in a new manner replacing the classical Jensen inequality by a new generalized Taylor expansion of convex functions. Some new properties are proved too, e.g., relations to the statistical sufficiency and deficiency. The generalized Taylor expansion also shows very easily that all f-divergences are average statistical informations (differences between prior and posterior Bayes errors) mutually differing only in the weights imposed on various prior distributions. The statistical information introduced by De Groot and the classical information of Shannon are shown to be extremal cases corresponding to α = 0 and α = 1 in the class of the so-called Arimoto α-informations introduced in this paper for 0 < α < 1 by means of the Arimoto α-entropies. Some new examples of f-divergences are introduced as well, namely, the Shannon divergences and the Arimoto α-divergences leading for α ↑ 1 to the Shannon divergences. Square roots of all these divergences are shown to be metrics satisfying the triangle inequality. The last section introduces statistical tests and estimators based on the minimal f-divergence with the empirical distribution achieved in the families of hypothetic distributions. For the Kullback divergence this leads to the classical likelihood ratio test and estimator.
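The common construction behind these examples is D_f(P‖Q) = Σ_x q(x) f(p(x)/q(x)) for a convex f with f(1) = 0. The snippet below is a sketch instantiating a few standard choices of f; the specific conventions (e.g. for the squared Hellinger distance) are assumptions for illustration rather than those of the paper.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x)/q(x)) for convex f with f(1) = 0 (a sketch)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

generators = {
    "KL":            lambda t: t * np.log(t),          # Kullback-Leibler divergence
    "total var.":    lambda t: 0.5 * np.abs(t - 1),    # total variation distance
    "sq. Hellinger": lambda t: (np.sqrt(t) - 1) ** 2,  # sum_x (sqrt(p) - sqrt(q))^2 convention
    "Pearson chi2":  lambda t: (t - 1) ** 2,           # chi-squared divergence
}
for name, f in generators.items():
    print(f"{name:>13}: {f_divergence(p, q, f):.4f}")

# Sanity check: the f-divergence form of TV matches 0.5 * sum |p - q|.
print(abs(f_divergence(p, q, generators["total var."]) - 0.5 * np.sum(np.abs(p - q))) < 1e-12)
```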
Article
When studying convergence of measures, an important issue is the choice of probability metric. In this review, we provide a summary and some new results concerning bounds among ten important probability metrics/distances that are used by statisticians and probabilists. We focus on these metrics because they are either well-known, commonly used, or admit practical bounding techniques. We summarize these relationships in a handy reference diagram, and also give examples to show how rates of convergence can depend on the metric chosen.
Chaining mutual information and tightening generalization bounds
  • A R Asadi
  • E Abbe
  • S Verdú
α-mutual information
  • S Verdú
Methods for quantifying rates of convergence for random walks on groups
  • F E Su