
# Entropy and mutual information in models of deep neural networks*


## Abstract

We examine a class of stochastic deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) we show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experimental framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.
Journal of Statistical Mechanics: Theory and Experiment (2019) 124014 • Open Access
To cite this article: Marylou Gabrié et al J. Stat. Mech. (2019) 124014
Marylou Gabrié1, Andre Manoel2, Clément Luneau3,
Jean Barbier4, Nicolas Macris3, Florent Krzakala1
and Lenka Zdeborová5
1 Laboratoire de Physique de l'École Normale Supérieure, ENS,
Université PSL, CNRS, Sorbonne Université, Université de Paris, France
2 OWKIN, Inc., New York, NY, United States of America
3 Laboratoire de Théorie des Communications, École Polytechnique
Fédérale de Lausanne, Switzerland
4 International Center for Theoretical Physics, Trieste, Italy
5 Institut de Physique Théorique, CEA, CNRS, Université Paris-Saclay,
France
E-mail: marylou.gabrie@ens.fr
Accepted for publication 25 June 2019
Published 20 December 2019
Online at stacks.iop.org/JSTAT/2019/124014
https://doi.org/10.1088/1742-5468/ab3430
M Gabrié etal
Entropy and mutual information in models of deep neural networks
Printed in the UK
124014
JSMTC6
2019
19
J. Stat. Mech.
JSTAT
1742-5468
10.1088/1742-5468/ab3430
12
Journal of Statistical Mechanics: Theory and Experiment
ournal of Statistical Mechanics:
J
Theory and Experiment
IOP
Original content from this work may be used under the terms of the Creative Commons Attribution 3.0
licence. Any further distribution of this work must maintain attribution to the author(s) and
the title of the work, journal citation and DOI.
* This article is an updated version of: Gabrié M, Manoel A, Luneau C, Barbier J, Macris N, Krzakala F and Zdeborová L 2018 Entropy and mutual information in models of deep neural networks Advances in Neural Information Processing Systems 31 (Red Hook, NY: Curran Associates, Inc.) pp 1821-31
Keywords: machine learning

Contents

1. Multi-layer model and main theoretical results
   1.1. A stochastic multi-layer model
   1.2. Replica formula
   1.3. Rigorous statement
2. Tractable models for deep learning
   2.1. Other related works
3. Numerical experiments
   3.1. Estimators and activation comparisons
   3.2. Learning experiments with linear networks
   3.3. Learning experiments with deep non-linear networks
4. Conclusion and perspectives
Acknowledgments
References

The successes of deep learning methods have spurred efforts towards quantitative modeling of the performance of deep neural networks. In particular, an information-theoretic approach linking generalization capabilities to compression has been receiving increasing interest. The intuition behind the study of mutual informations in latent variable models dates back to the information bottleneck (IB) theory of [1]. Although recently reformulated in the context of deep learning [2], verifying its relevance in practice requires the computation of mutual informations for high-dimensional variables, a notoriously hard problem. Thus, pioneering works in this direction focused either on small network models with discrete (continuous, eventually binned) activations [3], or on linear networks [4, 5].

In the present paper we follow a different direction, and build on recent results from statistical physics [6, 7] and information theory [8, 9] to propose, in section 1, a formula to compute information-theoretic quantities for a class of deep neural network models. The models we approach, described in section 2, are non-linear feed-forward neural networks trained on synthetic datasets with constrained weights. Such networks capture some of the key properties of the deep learning setting that are usually difficult to include in tractable frameworks: non-linearities, arbitrarily large width and depth, and correlations in the input data. We demonstrate the proposed method in a series of numerical experiments in section 3. First observations suggest a rather complex picture, where the role of compression in the generalization ability of deep neural networks is yet to be elucidated.
## 1. Multi-layer model and main theoretical results

### 1.1. A stochastic multi-layer model

We consider a model of a multi-layer stochastic feed-forward neural network in which each element $x_i$ of the input layer $x \in \mathbb{R}^{n_0}$ is distributed independently as $P_0(x_i)$, while the hidden units $t_{\ell,i}$ at each successive layer $\ell$ (vectors are column vectors) come from $P_\ell(t_{\ell,i} \mid W_{\ell,i}\, t_{\ell-1})$, with $t_0 \equiv x$ and $W_{\ell,i}$ denoting the $i$th row of the matrix of weights $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$. In other words,

$$
t_{0,i} = x_i \sim P_0(\cdot), \qquad t_{1,i} \sim P_1(\cdot \mid W_{1,i}\,x), \qquad \ldots \qquad t_{L,i} \sim P_L(\cdot \mid W_{L,i}\, t_{L-1}),
\tag{1}
$$

given a set of weight matrices $\{W_\ell\}_{\ell=1}^{L}$ and distributions $\{P_\ell\}_{\ell=1}^{L}$, which encode possible non-linearities and stochastic noise applied to the hidden-layer variables, and $P_0$, which generates the visible variables. In particular, for a non-linearity $t_{\ell,i} = \varphi_\ell(h, \xi_{\ell,i})$, where $\xi_{\ell,i} \sim P_\xi(\cdot)$ is the stochastic noise (independent for each $i$), we have

$$
P_\ell(t_{\ell,i} \mid W_{\ell,i}\, t_{\ell-1}) = \int \mathrm{d}P_\xi(\xi_{\ell,i})\, \delta\!\left(t_{\ell,i} - \varphi_\ell(W_{\ell,i}\, t_{\ell-1}, \xi_{\ell,i})\right).
$$

Model (1) thus describes a Markov chain which we denote by $X \to T_1 \to T_2 \to \cdots \to T_L$, with $T_\ell = \varphi_\ell(W_\ell T_{\ell-1}, \xi_\ell)$, $\xi_\ell = \{\xi_{\ell,i}\}_{i=1}^{n_\ell}$, and the activation function $\varphi_\ell$ applied componentwise.

### 1.2. Replica formula

We shall work in the asymptotic high-dimensional statistics regime where all $\tilde\alpha_\ell \equiv n_\ell/n_0$ are of order one while $n_0 \to \infty$, and make the important assumption that all matrices $W_\ell$ are orthogonally-invariant random matrices independent from each other; in other words, each matrix $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ can be decomposed as a product of three matrices, $W_\ell = U_\ell S_\ell V_\ell$, where $U_\ell \in O(n_\ell)$ and $V_\ell \in O(n_{\ell-1})$ are independently sampled from the Haar measure, and $S_\ell$ is a diagonal matrix of singular values.

The main technical tool we use is a formula for the entropies of the hidden variables, $H(T_\ell) = -\mathbb{E}_{T_\ell} \ln P_{T_\ell}(t_\ell)$, and for the mutual information between adjacent layers, $I(T_\ell; T_{\ell-1}) = H(T_\ell) + \mathbb{E}_{T_\ell, T_{\ell-1}} \ln P_{T_\ell \mid T_{\ell-1}}(t_\ell \mid t_{\ell-1})$, based on the heuristic replica method [6, 7, 10, 11]:

**Claim 1 (Replica formula).** Assume model (1) with $L$ layers in the high-dimensional limit, with componentwise activation functions and weight matrices generated from the ensemble described above, and denote by $\lambda_{W_k}$ the eigenvalues of $W_k^\top W_k$. Then for any $\ell \in \{1, \ldots, L\}$ the normalized entropy of $T_\ell$ is given by the minimum among all stationary points of the replica potential:

$$
\lim_{n_0 \to \infty} \frac{1}{n_0} H(T_\ell) = \min\, \underset{A,\,V,\,\tilde A,\,\tilde V}{\operatorname{extr}}\ \phi_\ell(A, V, \tilde A, \tilde V),
\tag{2}
$$

which depends on the $\ell$-dimensional vectors $A, V, \tilde A, \tilde V$ and is written in terms of mutual informations $I$ and conditional entropies $H$ of scalar variables as

$$
\phi_\ell(A, V, \tilde A, \tilde V) = I\!\left(t_0;\, t_0 + \xi_0/\sqrt{\tilde A_1}\right)
- \frac{1}{2} \sum_{k=1}^{\ell} \tilde\alpha_{k-1} \left[ \tilde A_k V_k + \alpha_k A_k \tilde V_k - F_{W_k}(A_k V_k) \right]
+ \sum_{k=1}^{\ell-1} \tilde\alpha_k \left[ H(t_k \mid \xi_k;\, \tilde A_{k+1}, \tilde V_k, \tilde\rho_k) - \frac{1}{2}\log\!\left(2\pi e\, \tilde A_{k+1}^{-1}\right) \right]
+ \tilde\alpha_\ell\, H(t_\ell \mid \xi_\ell;\, \tilde V_\ell, \tilde\rho_\ell),
\tag{3}
$$

where $\alpha_k = n_k/n_{k-1}$, $\tilde\alpha_k = n_k/n_0$, $\rho_k = \int \mathrm{d}P_{k-1}(t)\, t^2$, $\tilde\rho_k = (\mathbb{E}_{\lambda_{W_k}} \lambda_{W_k})\, \rho_k / \alpha_k$, and $\xi_k \sim \mathcal{N}(0, 1)$ for $k = 0, \ldots, \ell$. In the computation of the conditional entropies in (3), the scalar $t_k$-variables are generated from $P(t_0) = P_0(t_0)$ and

$$
P(t_k \mid \xi_k;\, A, V, \rho) = \mathbb{E}_{\tilde\xi, \tilde z}\, P_k\!\left(t_k + \tilde\xi/\sqrt{A} \,\Big|\, \sqrt{\rho - V}\,\xi_k + \sqrt{V}\,\tilde z\right), \qquad k = 1, \ldots, \ell - 1,
\tag{4}
$$

$$
P(t_\ell \mid \xi_\ell;\, V, \rho) = \mathbb{E}_{\tilde z}\, P_\ell\!\left(t_\ell \,\Big|\, \sqrt{\rho - V}\,\xi_\ell + \sqrt{V}\,\tilde z\right),
\tag{5}
$$

where $\tilde\xi$ and $\tilde z$ are independent $\mathcal{N}(0, 1)$ random variables. Finally, the function $F_{W_k}(x)$ depends on the distribution of the eigenvalues $\lambda_{W_k}$ through

$$
F_{W_k}(x) = \min_{\theta \in \mathbb{R}} \left\{ 2\alpha_k \theta + (\alpha_k - 1)\ln(1 - \theta) + \mathbb{E}_{\lambda_{W_k}} \ln\!\left[\lambda_{W_k} x + (1 - \theta)(1 - \alpha_k \theta)\right] \right\}.
\tag{6}
$$

The computation of the entropy in the large-dimensional limit, a computationally difficult task, has thus been reduced to the extremization of a function of $4\ell$ variables, which requires evaluating single or bidimensional integrals. This extremization can be done efficiently by means of a fixed-point iteration starting from different initial conditions, as detailed in the supplementary material (stacks.iop.org/JSTAT/19/124014/mmedia).
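As an illustration of the variational formula (6), the sketch below evaluates $F_{W}(x)$ by direct scalar minimization for the empirical spectrum of a Gaussian i.i.d. matrix. This is our own minimal sketch, not the authors' dnner package [12]; the spectrum sampling, the search bounds, and the use of scipy are assumptions of the illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F_W(x, lambdas, alpha):
    """Evaluate F_W(x) of equation (6) by scalar minimization over theta.

    lambdas: samples of the eigenvalues of W^T W (the spectrum enters
             only through an expectation, here a sample average).
    alpha:   aspect ratio n_k / n_{k-1} of the weight matrix.
    """
    def objective(theta):
        inside = lambdas * x + (1.0 - theta) * (1.0 - alpha * theta)
        if np.any(inside <= 0):          # outside the domain of the log
            return np.inf
        return (2.0 * alpha * theta
                + (alpha - 1.0) * np.log1p(-theta)
                + np.mean(np.log(inside)))
    # theta < 1 is required by the log(1 - theta) term; bounds are a heuristic
    res = minimize_scalar(objective, bounds=(-10.0, 1.0 - 1e-9), method="bounded")
    return res.fun

# Spectrum of W^T W for a Gaussian i.i.d. W with entry variance 1/n_{k-1}
rng = np.random.default_rng(0)
n_in, n_out = 1000, 1000
W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
lambdas = np.linalg.eigvalsh(W.T @ W)
print(F_W(0.5, lambdas, alpha=n_out / n_in))
```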
Moreover, a user-friendly Python package is provided [12], which performs the computation for different choices of the prior $P_0$, the activations $\varphi_\ell$ and the spectra $\lambda_{W_\ell}$. Finally, the mutual information between successive layers $I(T_\ell; T_{\ell-1})$ can be obtained from the entropy at the cost of an additional bidimensional integral; see section 1.6.1 of the supplementary material.

Our derivation of (3) builds on recent progress in statistical estimation and information theory for generalized linear models following the application of methods from the statistical physics of disordered systems [10, 11] to communication [13], statistics [14] and machine learning problems [15, 16]. In particular, we use advanced mean field theory [17] and the heuristic replica method [6, 10], along with its recent extension to multi-layer estimation [7, 8]. The derivation is lengthy and is thus given in the supplementary material. In a related contribution, Reeves [9] proposed a formula for the mutual information in the multi-layer setting using heuristic information-theoretic arguments. Like ours, it exhibits layer-wise additivity, and the two formulas are conjectured to be equivalent.

### 1.3. Rigorous statement

We recall the assumptions under which the replica formula of claim 1 is conjectured to be exact: (i) the weight matrices are drawn from an ensemble of random orthogonally-invariant matrices, (ii) the matrices at different layers are statistically independent, and (iii) the layers have a large dimension and the respective sizes of adjacent layers are such that the weight matrices have aspect ratios $\{\alpha_k\}_{k=1}^{\ell}$ of order one. While we could not prove the replica prediction in full generality, we stress that it comes with multiple credentials: (i) for a Gaussian prior $P_0$ and Gaussian distributions $P_\ell$, it corresponds to the exact analytical solution when the weight matrices are independent of each other (see section 1.6.2 of the supplementary material). (ii) In the single-layer case with a Gaussian weight matrix, it reduces to formula (6) of the supplementary material, which has recently been rigorously proven for (almost) all activation functions $\varphi$ [18]. (iii) In the case of Gaussian distributions $P_\ell$, it has also been proven for a large ensemble of random matrices [19], and (iv) it is consistent with all the results of the AMP [20-22] and VAMP [23] algorithms, and of their multi-layer versions [7, 8], known to perform well for these estimation problems.

In order to go beyond results for the single-layer problem and heuristic arguments, we prove claim 1 for the more involved multi-layer case, assuming Gaussian i.i.d. matrices and two non-linear layers:

**Theorem 1 (Two-layer Gaussian replica formula).** Suppose (H1) the input-unit distribution $P_0$ is separable and has bounded support; (H2) the activations $\varphi_1$ and $\varphi_2$, corresponding to $P_1(t_{1,i} \mid W_{1,i}\,x)$ and $P_2(t_{2,i} \mid W_{2,i}\,t_1)$, are bounded $C^2$ functions with bounded first and second derivatives w.r.t. their first argument; and (H3) the weight matrices $W_1$ and $W_2$ have Gaussian i.i.d. entries. Then, for model (1) with two layers ($L = 2$), the high-dimensional limit of the entropy verifies claim 1.
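To make the setting of theorem 1 concrete, the following minimal sketch (our own; the layer sizes, the tanh activation and the noise level are illustrative assumptions) draws one realization of the two-layer stochastic model (1) with Gaussian i.i.d. weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n0 = n1 = n2 = 1000            # layer sizes, all of the same order
sigma_noise = 1e-2             # std of the per-unit stochastic noise

# Gaussian i.i.d. weight matrices with entry variance 1/n_{l-1} (hypothesis H3)
W1 = rng.normal(0.0, 1.0 / np.sqrt(n0), size=(n1, n0))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n1), size=(n2, n1))

def phi(h, xi):
    """Bounded C^2 activation with injected noise, t = tanh(h + xi)."""
    return np.tanh(h + xi)

# One realization of the Markov chain X -> T1 -> T2 of model (1)
x  = rng.normal(size=n0)                                   # x_i ~ P_0 = N(0,1)
t1 = phi(W1 @ x,  sigma_noise * rng.normal(size=n1))
t2 = phi(W2 @ t1, sigma_noise * rng.normal(size=n2))
```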
The theorem, which settles the conjecture presented in [7], is proven using the adaptive interpolation method of [18, 24, 25] in a multi-layer setting, as first developed in [26]. The lengthy proof, presented in detail in section 2 of the supplementary material, is of independent interest; it adds further credentials to the replica formula and offers a clear direction for further developments. Note that, following the same approximation arguments as in [18], where the proof is given for the single-layer case, hypothesis (H1) can be relaxed to the existence of the second moment of the prior, (H2) can be dropped, and (H3) can be extended to matrices with i.i.d. entries of zero mean, $O(1/n_0)$ variance and finite third moment.

## 2. Tractable models for deep learning

The multi-layer model presented above can be leveraged to simulate two prototypical settings of deep supervised learning on synthetic datasets amenable to the tractable replica computation of entropies and mutual informations.

The first scenario is the so-called teacher-student setting (see figure 1, left). Here, we assume that the input $x$ is distributed according to a separable prior distribution $P_X(x) = \prod_i P_0(x_i)$, factorized in the components of $x$, and the corresponding label $y$ is given by applying a mapping $x \to y$, called the teacher. After generating a training and a test set in this manner, we train a deep neural network, the student, on the synthetic dataset. In this case, the data themselves have a simple structure given by $P_0$.

In contrast, the second scenario allows for generative models (see figure 1, right) that create more structure, and is reminiscent of the generative-recognition pair of models of a variational autoencoder (VAE). A code vector $y$ is sampled from a separable prior distribution $P_Y(y) = \prod_i P_0(y_i)$ and a corresponding data point $x$ is generated by a possibly stochastic neural network, the generative model. This setting creates input data $x$ featuring correlations, unlike the teacher-student scenario. The supervised learning task under study then consists in training a deep neural network, the recognition model, to recover the code $y$ from $x$.

In both cases, the chain going from $X$ to any later layer is a Markov chain of the form (1). In the first scenario, model (1) directly maps to the student network. In the second scenario, however, model (1) actually maps to the feed-forward combination of the generative model followed by the recognition model. This shift is necessary to verify the assumption that the starting point (now given by $Y$) has a separable distribution. In particular, it generates correlated input data $X$ while still allowing for the computation of the entropy of any $T_\ell$.

At the start of a neural network training, weight matrices initialized as i.i.d. Gaussian random matrices satisfy the necessary assumptions of the formula of claim 1. In their singular value decomposition

$$
W_\ell = U_\ell S_\ell V_\ell,
\tag{7}
$$

the matrices $U_\ell \in O(n_\ell)$ and $V_\ell \in O(n_{\ell-1})$ are typical independent samples from the Haar measure across all layers. To make sure the weight matrices remain close enough to independent during learning, we define a custom weight constraint which consists in keeping $U_\ell$ and $V_\ell$ fixed while only the matrix $S_\ell$, constrained to be diagonal, is updated. The number of parameters is thus reduced from $n_\ell \times n_{\ell-1}$ to $\min(n_\ell, n_{\ell-1})$.
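This constraint (named a USV-layer in the next paragraph) can be sketched as a custom Keras layer. The implementation below is our own minimal illustration under stated assumptions (square layers, no biases), not the authors' code; their lsd package [50], introduced in section 3.2, is the reference implementation.

```python
import tensorflow as tf
from scipy.stats import ortho_group

class USVLayer(tf.keras.layers.Layer):
    """W = U diag(s) V with U, V fixed Haar-orthogonal; only s is trained."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        n_in = int(input_shape[-1])
        assert n_in == self.units, "sketch assumes square layers for simplicity"
        # Fixed Haar-distributed orthogonal factors (non-trainable)
        self.U = tf.constant(ortho_group.rvs(self.units), dtype=self.dtype)
        self.V = tf.constant(ortho_group.rvs(n_in), dtype=self.dtype)
        # Trainable singular values: min(n_in, units) parameters
        self.s = self.add_weight(name="s", shape=(self.units,),
                                 initializer="ones", trainable=True)

    def call(self, x):
        # x @ (U diag(s) V)^T  ==  ((x @ V^T) * s) @ U^T
        h = tf.matmul(x, self.V, transpose_b=True) * self.s
        h = tf.matmul(h, self.U, transpose_b=True)
        return self.activation(h)
```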
We refer to layers following this weight constraint as USV-layers. For the replica formula of claim 1 to be correct, the matrices $S_\ell$ from different layers should furthermore remain uncorrelated during learning. In section 3, we consider the training of linear networks, for which information-theoretic quantities can be computed analytically, and confirm numerically that with USV-layers the replica-predicted entropy is correct at all times. In the following, we assume that this is also the case for non-linear networks. In section 3.2 of the supplementary material, we train a neural network with USV-layers on a simple real-world dataset (MNIST), showing that these layers can learn to represent complex functions despite their restriction. We further note that such a product decomposition is reminiscent of a series of works on adaptive structured efficient linear layers (SELLs and ACDC) [27, 28], motivated in that case by speed gains, where only diagonal matrices are learned (in these works the matrices $U$ and $V$ are chosen instead as permutations of Fourier or Hadamard matrices, so that the matrix multiplication can be replaced by fast transforms). In section 3, we discuss learning experiments with USV-layers on synthetic datasets.

Figure 1. Two models of synthetic data.

While we have defined model (1) as a stochastic model, traditional feed-forward neural networks are deterministic. In the numerical experiments of section 3, we train and test networks without injecting noise, and only assume a noise model in the computation of information-theoretic quantities. Indeed, for continuous variables the presence of noise is necessary for mutual informations to remain finite (see the discussion of appendix C in [5]). To obtain $H(T_\ell)$ and $I(T_\ell; T_{\ell-1})$, we assume at layer $\ell$ an additive white Gaussian noise of small amplitude just before the activation function, while keeping the mapping $X \to T_{\ell-1}$ deterministic. This choice attempts to stay as close as possible to the deterministic neural network, but inevitably remains somewhat arbitrary (see again the discussion of appendix C in [5]).
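Concretely, this convention amounts to evaluating information-theoretic quantities on a noisy copy of the layer of interest while the trained network itself stays deterministic. A minimal sketch of the convention (our own illustration; the sizes and the tanh activation are placeholders, and the noise variance matches the experiments of section 3.1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
sigma2_noise = 1e-5                      # noise variance used only for MI estimates

W1 = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
f = np.tanh                              # any componentwise activation

x = rng.normal(size=n)

# Deterministic network used for training and testing
t1_det = f(W1 @ x)
t2_det = f(W2 @ t1_det)

# Noisy copies used only when computing H(T_l) and I(X; T_l):
# the noise enters right before the activation of the layer of interest,
# while the mapping up to the previous layer stays deterministic
eps1 = np.sqrt(sigma2_noise) * rng.normal(size=n)
eps2 = np.sqrt(sigma2_noise) * rng.normal(size=n)
t1_noisy = f(W1 @ x + eps1)
t2_noisy = f(W2 @ f(W1 @ x) + eps2)
```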
### 2.1. Other related works

The strategy of studying neural network models with random weight matrices and/or random data using methods originating in statistical physics heuristics, such as the replica and cavity methods [10], has a long history. Before the deep learning era, this approach led to pioneering results in learning for the Hopfield model [29] and for the random perceptron [15, 16, 30, 31]. Recently, the successes of deep learning, along with the disqualifying complexity of studying real-world problems, have sparked renewed interest in the direction of random weight matrices. Recent results (without any claim to exhaustiveness) were obtained on the spectrum of the Gram matrix at each layer using random matrix theory [32, 33], on the expressivity of deep neural networks [34], on the dynamics of propagation and learning [35-38], on the high-dimensional non-convex landscape where the learning takes place [39], and on the universal random Gaussian neural nets of [40].

The information bottleneck theory [1] applied to neural networks consists in computing the mutual information between the data and the learned hidden representations on the one hand, and between the labels and the same hidden representations on the other hand [2, 3]. A successful training should maximize the information with respect to the labels and simultaneously minimize the information with respect to the input data, preventing overfitting and leading to good generalization. While this intuition suggests new learning algorithms and regularizers [41-47], we can also hypothesize that this mechanism is already at play in a priori unrelated, commonly used optimization methods, such as plain stochastic gradient descent (SGD). This hypothesis was first tested in practice by [3] on very small neural networks, where the entropy could be estimated by binning the activities of the hidden neurons. Afterwards, the authors of [5] reproduced the results of [3] on small networks using the continuous entropy estimator of [45], but found that the overall behavior of mutual information during learning is greatly affected by the nature of the non-linearities. Additionally, they investigated the training of larger linear networks on i.i.d. normally distributed inputs, where the entropies at each hidden layer can be computed analytically for additive Gaussian noise. The strategy proposed in the present paper allows us to evaluate entropies and mutual informations in non-linear networks larger than those of [3, 5].

## 3. Numerical experiments

We present a series of experiments aiming both at further validating the replica estimator and at leveraging its power in noteworthy applications. A first application, presented in section 3.1, consists in using the replica formula, in settings where it is proven to be rigorously exact, as a basis of comparison for other entropy estimators. The same experiment also contributes to the discussion of the information bottleneck theory for neural networks by showing how, without any learning, information-theoretic quantities behave differently for different non-linearities. In section 3.2, we validate the accuracy of the replica formula in a learning experiment with USV-layers, where it is not proven to be exact, by considering the case of linear networks for which information-theoretic quantities can otherwise be computed in closed form. We finally consider, in section 3.3, a second application testing the information bottleneck theory for large non-linear networks. To this aim, we use the replica estimator to study compression effects during learning.

### 3.1. Estimators and activation comparisons

Two non-parametric estimators have already been considered by [5] to compute entropies and/or mutual informations during learning. The kernel-density approach of Kolchinsky et al [45] consists in fitting a mixture of Gaussians (MoG) to samples of the variable of interest and subsequently computing an upper bound on the entropy of the MoG [48]. The method of Kraskov et al [49] uses nearest-neighbor distances between samples to directly build an estimate of the entropy. Both methods require the computation of the matrix of distances between samples. Recently, [46] proposed a new non-parametric estimator for mutual informations which involves the optimization of a neural network to tighten a bound. It is unfortunately computationally hard to test how these estimators behave in high dimension, as even for a known distribution the computation of the entropy is intractable in most cases. The replica method proposed here is therefore a valuable point of comparison in the cases where it is rigorously exact.
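For reference, here is a minimal nearest-neighbor entropy estimator in the spirit of the Kozachenko-Leonenko construction underlying [49]; this simplified sketch is our own illustration, not the implementation benchmarked in the paper:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=3):
    """Kozachenko-Leonenko estimate of the differential entropy (in nats).

    samples: array of shape (N, d).
    k:       index of the nearest neighbor used for the distance statistic.
    """
    N, d = samples.shape
    tree = cKDTree(samples)
    # distance to the k-th neighbor (the first hit is the point itself)
    dist, _ = tree.query(samples, k=k + 1)
    r = dist[:, -1]
    # log-volume of the d-dimensional unit ball
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(N) - digamma(k) + log_vd + d * np.mean(np.log(r))

# Sanity check on a standard Gaussian, exact H = (d/2) * log(2*pi*e)
rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(2000, d))
print(knn_entropy(X), 0.5 * d * np.log(2 * np.pi * np.e))
```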
In the first numerical experiment we place ourselves in the setting of theorem 1: a two-layer network with i.i.d. weight matrices, where the formula of claim 1 is thus rigorously exact in the limit of large networks, and we compare the replica results with the non-parametric estimators of [45] and [49]. Note that the requirement of smooth activations (H2) in theorem 1 can be relaxed (see the discussion below the theorem). Additionally, non-smooth functions can be approximated arbitrarily closely by smooth functions with equal information-theoretic quantities, up to numerical precision.

We consider a neural network with layers of equal size $n = 1000$ that we denote $X \to T_1 \to T_2$. The input variable components are i.i.d. Gaussian with mean 0 and variance 1. The weight matrix entries are also i.i.d. Gaussian with mean 0; their standard deviation is rescaled by a factor $1/\sqrt{n}$ and then multiplied by a coefficient $\sigma$ varying between 0.1 and 10, i.e. around the value recommended for training initialization. To compute entropies, we consider noisy versions of the latent variables, where an additive white Gaussian noise of very small variance ($\sigma^2_{\rm noise} = 10^{-5}$) is added right before the activation function, $T_1 = f(W_1 X + \epsilon_1)$ and $T_2 = f(W_2 f(W_1 X) + \epsilon_2)$ with $\epsilon_1, \epsilon_2 \sim \mathcal{N}(0, \sigma^2_{\rm noise} I_n)$; the same is done in the remaining experiments to guarantee that the mutual informations remain finite. The non-parametric estimators [45, 49] were evaluated using 1000 samples, as the cost of computing pairwise distances is significant in such high dimension, and we checked that the entropy estimate is stable over independent draws of a sample of this size (error bars smaller than marker size).

In figure 2 we compare the different estimates of $H(T_1)$ and $H(T_2)$ for different activation functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise-linear approximation of the tanh, ${\rm hardtanh}(x) = -1$ for $x < -1$, $x$ for $-1 \leqslant x \leqslant 1$, and $1$ for $x > 1$, for which the integrals in the replica formula can be evaluated faster than for the tanh. In the linear and hardtanh cases, the non-parametric methods follow the tendency of the replica estimate when $\sigma$ is varied, but appear to systematically overestimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also a multivariate Gaussian, so entropies can be computed directly in closed form ('exact' in the plot legend). When using the Kolchinsky estimate in the linear case we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of the $T_\ell$ and augmenting the resulting variance by $\sigma^2_{\rm noise}$, as done in [45] ('Kolchinsky et al parametric' in the plot legend). In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights ('linear approx' in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a similar color to 'exact'). Lastly, in the case of the ReLU-ReLU network, we note that the non-parametric methods predict an entropy increasing like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and accurately features a slope equal to half of the linear-network entropy ('1/2 linear approx' in the plot legend).
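The closed-form reference for linear networks ('exact' in the legends) follows from Gaussianity: every layer is Gaussian, so $H(T) = \frac{1}{2}\ln\det(2\pi e\, \Sigma_T)$. A short sketch of this computation (our own illustration, with the scalings of the experiment):

```python
import numpy as np

def gaussian_entropy(cov):
    """H = 1/2 * logdet(2*pi*e*cov) for a multivariate Gaussian (in nats)."""
    n = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
n, sigma, s2_noise = 1000, 1.0, 1e-5
W1 = sigma * rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
W2 = sigma * rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))

# X ~ N(0, I): T1 = W1 X + eps1 and T2 = W2 W1 X + eps2 are Gaussian
cov_t1 = W1 @ W1.T + s2_noise * np.eye(n)
cov_t2 = (W2 @ W1) @ (W2 @ W1).T + s2_noise * np.eye(n)
print(gaussian_entropy(cov_t1) / n, gaussian_entropy(cov_t2) / n)
```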
While non-parametric estimators are invaluable tools, able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while being restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.

Beyond informing about the accuracy of estimators, this experiment also unveils a simple but possibly important distinction between activation functions. For the hardtanh activation, as the magnitude of the random weights increases, the entropies decrease after reaching a maximum, whereas they only increase for the unbounded activation functions we consider, even for the single-side-saturating ReLU. This loss of information for bounded activations was also observed by [5], where entropies were computed by discretizing the output of a single neuron into bins of equal size. In that setting, as the tanh activation starts to saturate for large inputs, the extreme bins (at $-1$ and $1$) concentrate more and more probability mass, which explains the information loss. Here we confirm that the phenomenon is also observed when computing the entropy of the hardtanh (without binning, and with small noise injected before the non-linearity). We check via the replica formula that the same phenomenology arises for the mutual informations $I(X; T_\ell)$ (see section 3.1 of the supplementary material).

Figure 2. Entropy of latent variables in stochastic networks $X \to T_1 \to T_2$, with equally sized layers $n = 1000$, inputs drawn from $\mathcal{N}(0, I_n)$, weight entries drawn i.i.d. from $\mathcal{N}(0, \sigma^2/n)$, as a function of the weight scaling parameter $\sigma$. An additive white Gaussian noise $\mathcal{N}(0, 10^{-5} I_n)$ is added inside the non-linearity. Left column: linear network. Center column: hardtanh-hardtanh network. Right column: ReLU-ReLU network.

### 3.2. Learning experiments with linear networks

In the following, and in section 3.3 of the supplementary material, we discuss training experiments on different instances of the deep learning models defined in section 2. We seek to study the simplest possible training strategies achieving good generalization. Hence, for all experiments we use plain stochastic gradient descent (SGD) with constant learning rates, without momentum and without any explicit form of regularization. The sizes of the training and test sets are taken equal and scale typically as a few hundred times the size of the input layer. Unless otherwise stated, plots correspond to single runs, yet we checked over a few repetitions that independent runs lead to identical qualitative behaviors. The values of the mutual informations $I(X; T_\ell)$ are computed by considering noisy versions of the latent variables, where an additive white Gaussian noise of very small variance ($\sigma^2_{\rm noise} = 10^{-5}$) is added right before the activation function, as in the previous experiment. This noise is neither present at training time, where it could act as a regularizer, nor at testing time. Given that the noise is only assumed at the last layer, the second-to-last layer is a deterministic mapping of the input variable; hence the replica formula yielding mutual informations between adjacent layers directly gives us $I(T_\ell; T_{\ell-1}) = H(T_\ell) - H(T_\ell \mid T_{\ell-1}) = H(T_\ell) - H(T_\ell \mid X) = I(T_\ell; X)$. We provide a second Python package [50] to implement learning experiments on synthetic datasets in Keras, using USV-layers and interfacing with the first Python package [12] for the replica computations.
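This training protocol translates into a minimal Keras configuration, sketched below for the linear student described next. The learning-rate value and the use of plain Dense layers (in place of USV-layers such as the USVLayer sketched in section 2) are assumptions made to keep the snippet self-contained:

```python
import tensorflow as tf

# Linear student for the teacher-student regression of section 3.2
# (in the paper's experiments the three hidden layers are USV-layers)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1500,)),
    tf.keras.layers.Dense(1500, use_bias=False),
    tf.keras.layers.Dense(1500, use_bias=False),
    tf.keras.layers.Dense(1500, use_bias=False),
    tf.keras.layers.Dense(4, use_bias=False),      # unconstrained output layer
])

# Plain SGD: constant learning rate, no momentum, no explicit regularization
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.0),
              loss="mse")
```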
To start with, we consider the training of a linear network in the teacher-student scenario. The teacher must also be linear to be learnable: we consider a simple single-layer network with additive white Gaussian noise, $Y = \tilde W_{\rm teach} X + \epsilon$, with input $x \sim \mathcal{N}(0, I_n)$ of size $n$, teacher matrix $\tilde W_{\rm teach}$ with entries i.i.d. normally distributed as $\mathcal{N}(0, 1/n)$, noise $\epsilon \sim \mathcal{N}(0, 0.01 I_{n_Y})$, and output of size $n_Y = 4$. We train a student network of three USV-layers plus one fully connected, unconstrained layer, $X \to T_1 \to T_2 \to T_3 \to \hat Y$, on the regression task, using plain SGD on the MSE loss $(\hat Y - Y)^2$. We recall that in the USV-layers (7) only the diagonal matrix is updated during learning. On the left panel of figure 3, we report the learning curve and the mutual informations between the hidden layers and the input in the case where all layers but the output have size $n = 1500$. Again, this linear setting is analytically tractable and does not require the replica formula; a similar situation was studied in [5]. In agreement with their observations, we find that the mutual informations $I(X; T_\ell)$ keep increasing throughout the learning, without compromising the generalization ability of the student.

We also use this linear setting to demonstrate (i) that the replica formula remains correct throughout the learning of the USV-layers and (ii) that the replica method gets closer and closer to the exact result in the limit of large networks, as theoretically predicted (2). To this aim, we repeat the experiment for $n$ varying between 100 and 1500, and report the maximum and the mean value of the squared error on the estimation of $I(X; T_\ell)$ over all epochs of five independent training runs. We find that, even if the errors tend to increase with the number of layers, they remain objectively very small and decrease drastically as the size of the layers increases.

### 3.3. Learning experiments with deep non-linear networks

Finally, we apply the replica formula to estimate mutual informations during the training of non-linear networks on correlated input data. We consider a simple single-layer generative model $X = \tilde W_{\rm gen} Y + \epsilon$ with normally distributed code $Y \sim \mathcal{N}(0, I_{n_Y})$ of size $n_Y = 100$, data of size $n_X = 500$ generated with a matrix $\tilde W_{\rm gen}$ with entries i.i.d. normally distributed as $\mathcal{N}(0, 1/n_Y)$, and noise $\epsilon \sim \mathcal{N}(0, 0.01 I_{n_X})$. We then train a recognition model to solve the binary classification problem of recovering the label $y = {\rm sign}(Y_1)$, the sign of the first neuron in $Y$, using plain SGD, this time to minimize the cross-entropy loss. Note that the rest of the initial code $(Y_2, \ldots, Y_{n_Y})$ acts as noise/nuisance with respect to the learning task. We compare two five-layer recognition models with four USV-layers plus one unconstrained layer, of sizes 500-1000-500-250-100-2, and activations either linear-ReLU-linear-ReLU-softmax (top row of figure 4) or linear-hardtanh-linear-hardtanh-softmax (bottom row). Because USV-layers only feature $O(n)$ parameters instead of $O(n^2)$, we observe that they generally require more iterations to train. In the case of the ReLU network, adding interleaved linear layers was key to successful training with two non-linearities, which explains the somewhat unusual architecture proposed. For the recognition model using hardtanh this was actually not an issue (see the supplementary material for an experiment using only hardtanh activations); however, we consider a similar architecture for a fair comparison. We further discuss the learning ability of USV-layers in the supplementary material.
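The synthetic dataset of this last experiment can be generated in a few lines; a minimal sketch with the stated parameters (our own illustration; the number of samples is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n_Y, n_X, n_samples = 100, 500, 50000

# Generative model X = W_gen Y + eps with i.i.d. Gaussian entries
W_gen = rng.normal(0.0, 1.0 / np.sqrt(n_Y), size=(n_X, n_Y))

Y = rng.normal(size=(n_samples, n_Y))             # code vectors ~ N(0, I)
eps = np.sqrt(0.01) * rng.normal(size=(n_samples, n_X))
X = Y @ W_gen.T + eps                             # correlated input data

# Binary label: sign of the first code neuron; Y_2..Y_nY act as nuisance
labels = (Y[:, 0] > 0).astype(int)
```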
For the recognition model using hardtanh, this was actually Entropy and mutual information in models of deep neural networks 12 https://doi.org/10.1088/1742-5468/ab3430 J. Stat. Mech. (2019) 124014 not an issue (see supplementary material for an experiment using only hardtanh acti- vations), however, we consider a similar architecture for fair comparison. We discuss further the ability of learning of USV-layers in the supplementary material. This experiment is reminiscent of the setting of [3], yet now tractable for networks of larger sizes. For both types of non-linearities we observe that the mutual information Figure 3. Training of a 4-layer linear student of varying size on a regression task generated by a linear teacher of output size nY=4 . Upper-left: MSE loss on the training and testing sets during training by plain SGD for layers of size n = 1500. Best training loss is 0.004 735, best testing loss is 0.004 789. Lower- left: corresponding mutual information evolution between hidden layers and input. Center-left, center-right, right: maximum and squared error of the replica estimation of the mutual information as a function of layers size n, over the course of five independent trainings for each value of n for the ﬁrst, second and third hidden layer. Figure 4. Training of two recognition models on a binary classiﬁcation task with correlated input data and either ReLU (top) or hardtanh (bottom) non-linearities. Left: training and generalization cross-entropy loss (left axis) and accuracies (right axis) during learning. Best training-testing accuracies are 0.9950.991 for ReLU version (top row) and 0.9980.996 for hardtanh version (bottom row). Remaining colums: mutual information between the input and successive hidden layers. Insets zoom on the ﬁrst epochs. Entropy and mutual information in models of deep neural networks 13 https://doi.org/10.1088/1742-5468/ab3430 J. Stat. Mech. (2019) 124014 between the input and all hidden layers decrease during the learning, except for the very beginning of training where we can sometimes observe a short phase of increase (see zoom in insets). For the hardtanh layers this phase is longer and the initial increase of noticeable amplitude. In this particular experiment, the claim of [3] that compression can occur during training even with non double-saturated activation seems corroborated (a phenomenon that was not observed by [5]). Yet we do not observe that the compression is more pronounced in deeper layers and its link to generalization remains elusive. For instance, we do not see a delay in the generalization w.r.t. training accuracy/loss in the recogni- tion model with hardtanh despite of an initial phase without compression in two layers. Futhermore, we ﬁnd that changing the weight initialization can drastically change the behavior of mutual informations during training while resulting in identical train- ing and testing ﬁnal performances. In an additional experiment, we consider a setting closely related to the classiﬁcation on correlated data presented above. On ﬁgure 5 we compare three identical 5-layers recognition models with sizes 500-1000-500-250-100-2, and activations hardtanhhardtanh-hardtanh- hartanh-softmax, for the same genera- tive model and binary classiﬁcation rule as the previous experiment. For the model pre- sented at the top row, initial weights were sampled according to W,ij ∼N (0, 4/n 1) , for the model of the middle row N(0, 1/n1) was used instead, and ﬁnally N(0, 1 / 4n 1 ) for the bottom row. 
The first column of figure 5 shows that training is delayed for the weights initialized at smaller values, but eventually catches up and reaches accuracies above 0.97 both in training and testing. Meanwhile, the mutual informations have different initial values for the different weight initializations and follow very different paths: they either decrease during the entire learning, or on the contrary only increase, or actually feature a hybrid path. We further note that it is somewhat surprising that the mutual information increases at all in the first row, given that the hardtanh saturation would be expected to induce compression instead. Figure 4 of the supplementary material presents a second run of the same experiment with a different random seed; findings are identical. Further learning experiments, including a second run of the last two experiments, are presented in the supplementary material.

Figure 5. Learning and hidden-layer mutual information curves for a classification problem with correlated input data, using four USV hardtanh layers and one unconstrained softmax layer, from three different initializations. Top: initial weights at layer $\ell$ of variance $4/n_{\ell-1}$, best training accuracy 0.999, best test accuracy 0.994. Middle: initial weights at layer $\ell$ of variance $1/n_{\ell-1}$, best training accuracy 0.994, best test accuracy 0.9937. Bottom: initial weights at layer $\ell$ of variance $0.25/n_{\ell-1}$, best training accuracy 0.975, best test accuracy 0.974. The overall direction of evolution of the mutual information can be flipped by a change in weight initialization without drastically changing the final performance on the classification task.

## 4. Conclusion and perspectives

We have presented a class of deep learning models together with a tractable method to compute the entropy and mutual information between layers. This, we believe, offers a promising framework for further investigations, and to this aim we provide Python packages that facilitate both the computation of mutual informations and the training, for an arbitrary implementation of the model. In the future, extending the proposed formula to allow for biases would improve the fitting power of the considered neural network models.

We observe in our high-dimensional experiments that compression can happen during learning, even when using ReLU activations. While we did not observe a clear link between generalization and compression in our setting, there are many directions to be further explored within the models presented in section 2. Studying the entropic effect of regularizers is a natural step towards formulating an entropic interpretation of generalization. Furthermore, while our experiments focused on supervised learning, the replica formula derived for multi-layer models is general and can be applied in unsupervised contexts, for instance in the theory of VAEs. On the rigorous side, the greater perspective remains proving the replica formula in the general case of multi-layer models, and further confirming that the replica formula stays true after the learning of the USV-layers. Another question worthy of future investigation is whether the replica method can be used to describe not only entropies and mutual informations for learned USV-layers, but also the optimal learning of the weights itself.
## Acknowledgments

The authors would like to thank Léon Bottou, Antoine Maillard, Marc Mézard, Léo Miolane, and Galen Reeves for insightful discussions. This work has been supported by the ERC under the European Union's FP7 Grant Agreement 307087-SPARCS and the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe, as well as by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL. Additional funding is acknowledged by MG from the Chaire de recherche sur les modèles et sciences des données, Fondation CFM pour la Recherche-ENS; by AM from the Labex DigiCosme; and by CL from the Swiss National Science Foundation under Grant 200021E-175541. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

## References

[1] Tishby N, Pereira F C and Bialek W 1999 The information bottleneck method 37th Annual Allerton Conf. on Communication, Control, and Computing
[2] Tishby N and Zaslavsky N 2015 Deep learning and the information bottleneck principle IEEE Information Theory Workshop pp 1-5
[3] Shwartz-Ziv R and Tishby N 2017 Opening the black box of deep neural networks via information (arXiv:1703.00810)
[4] Chechik G, Globerson A, Tishby N and Weiss Y 2005 Information bottleneck for Gaussian variables J. Mach. Learn. Res. 6 165-88
[5] Saxe A M, Bansal Y, Dapello J, Advani M, Kolchinsky A, Tracey B D and Cox D D 2018 On the information bottleneck theory of deep learning Int. Conf. on Learning Representations
[6] Kabashima Y 2008 Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels J. Phys.: Conf. Ser. 95 012001
[7] Manoel A, Krzakala F, Mézard M and Zdeborová L 2017 Multi-layer generalized linear estimation IEEE Int. Symp. on Information Theory pp 2098-102
[8] Fletcher A K, Rangan S and Schniter P 2018 Inference in deep networks in high dimensions IEEE Int. Symp. on Information Theory pp 1884-8
[9] Reeves G 2017 Additivity of information in multilayer networks via additive Gaussian noise transforms 55th Annual Allerton Conf. on Communication, Control, and Computing
[10] Mézard M, Parisi G and Virasoro M 1987 Spin Glass Theory and Beyond (Singapore: World Scientific)
[11] Mézard M and Montanari A 2009 Information, Physics, and Computation (Oxford: Oxford University Press)
[12] 2018 dnner: Deep Neural Networks Entropy with Replicas, Python library (https://github.com/sphinxteam/dnner)
[13] Tulino A M, Caire G, Verdú S and Shamai S 2013 Support recovery with sparsely sampled free random matrices IEEE Trans. Inf. Theory 59 4243-71
[14] Donoho D and Montanari A 2016 High dimensional robust M-estimation: asymptotic variance via approximate message passing Probab. Theory Relat. Fields 166 935-69
[15] Seung H S, Sompolinsky H and Tishby N 1992 Statistical mechanics of learning from examples Phys. Rev. A 45 6056
[16] Engel A and Van den Broeck C 2001 Statistical Mechanics of Learning (Cambridge: Cambridge University Press)
[17] Opper M and Saad D 2001 Advanced Mean Field Methods: Theory and Practice (Cambridge, MA: MIT Press)
[18] Barbier J, Krzakala F, Macris N, Miolane L and Zdeborová L 2019 Optimal errors and phase transitions in high-dimensional generalized linear models Proc. Natl Acad. Sci. 116 5451-60
[19] Barbier J, Macris N, Maillard A and Krzakala F 2018 The mutual information in random linear estimation beyond i.i.d. matrices IEEE Int. Symp. on Information Theory pp 625-32
[20] Donoho D, Maleki A and Montanari A 2009 Message-passing algorithms for compressed sensing Proc. Natl Acad. Sci. 106 18914-9
[21] Zdeborová L and Krzakala F 2016 Statistical physics of inference: thresholds and algorithms Adv. Phys. 65 453-552
[22] Rangan S 2011 Generalized approximate message passing for estimation with random linear mixing IEEE Int. Symp. on Information Theory pp 2168-72
[23] Rangan S, Schniter P and Fletcher A K 2017 Vector approximate message passing IEEE Int. Symp. on Information Theory pp 1588-92
[24] Barbier J and Macris N 2019 The adaptive interpolation method for proving replica formulas. Applications to the Curie-Weiss and Wigner spike models J. Phys. A 52 294002
[25] Barbier J and Macris N 2019 The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference Probab. Theory Relat. Fields 174 1133-85
[26] Barbier J, Macris N and Miolane L 2017 The layered structure of tensor estimation and its mutual information 55th Annual Allerton Conf. on Communication, Control, and Computing pp 1056-63
[27] Moczulski M, Denil M, Appleyard J and de Freitas N 2016 ACDC: a structured efficient linear layer Int. Conf. on Learning Representations
[28] Yang Z, Moczulski M, Denil M, de Freitas N, Smola A, Song L and Wang Z 2015 Deep fried convnets IEEE Int. Conf. on Computer Vision pp 1476-83
[29] Amit D J, Gutfreund H and Sompolinsky H 1985 Storing infinite numbers of patterns in a spin-glass model of neural networks Phys. Rev. Lett. 55 1530
[30] Gardner E and Derrida B 1989 Three unfinished works on the optimal storage capacity of networks J. Phys. A 22 1983
[31] Mézard M 1989 The space of interactions in neural networks: Gardner's computation with the cavity method J. Phys. A 22 2181
[32] Louart C and Couillet R 2017 Harnessing neural networks: a random matrix approach IEEE Int. Conf. on Acoustics, Speech and Signal Processing pp 2282-6
[33] Pennington J and Worah P 2017 Nonlinear random matrix theory for deep learning Advances in Neural Information Processing Systems
[34] Raghu M, Poole B, Kleinberg J, Ganguli S and Sohl-Dickstein J 2017 On the expressive power of deep neural networks Int. Conf. on Machine Learning
[35] Saxe A, McClelland J and Ganguli S 2014 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Int. Conf. on Learning Representations
[36] Schoenholz S S, Gilmer J, Ganguli S and Sohl-Dickstein J 2017 Deep information propagation Int. Conf. on Learning Representations
[37] Advani M and Saxe A 2017 High-dimensional dynamics of generalization error in neural networks (arXiv:1710.03667)
[38] Baldassi C, Braunstein A, Brunel N and Zecchina R 2007 Efficient supervised learning in networks with binary synapses Proc. Natl Acad. Sci. 104 11079-84
[39] Dauphin Y, Pascanu R, Gulcehre C, Cho K, Ganguli S and Bengio Y 2014 Identifying and attacking the saddle point problem in high-dimensional non-convex optimization Advances in Neural Information Processing Systems
[40] Giryes R, Sapiro G and Bronstein A M 2016 Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Trans. Signal Process. 64 3444-57
[41] Chalk M, Marre O and Tkacik G 2016 Relevant sparse codes with variational information bottleneck Advances in Neural Information Processing Systems
[42] Achille A and Soatto S 2018 Information dropout: learning optimal representations through noisy computation IEEE Trans. Pattern Anal. Mach. Intell. 40 2897-905
[43] Alemi A, Fischer I, Dillon J and Murphy K 2017 Deep variational information bottleneck Int. Conf. on Learning Representations
[44] Achille A and Soatto S 2017 Emergence of invariance and disentangling in deep representations ICML 2017 Workshop on Principled Approaches to Deep Learning
[45] Kolchinsky A, Tracey B D and Wolpert D H 2017 Nonlinear information bottleneck (arXiv:1705.02436)
[46] Belghazi M I, Baratin A, Rajeswar S, Ozair S, Bengio Y, Courville A and Hjelm R D 2018 MINE: mutual information neural estimation Int. Conf. on Machine Learning
[47] Zhao S, Song J and Ermon S 2017 InfoVAE: information maximizing variational autoencoders (arXiv:1706.02262)
[48] Kolchinsky A and Tracey B D 2017 Estimating mixture entropy with pairwise distances Entropy 19 361
[49] Kraskov A, Stögbauer H and Grassberger P 2004 Estimating mutual information Phys. Rev. E 69 066138
[50] 2018 lsd: Learning with Synthetic Data, Python library (https://github.com/marylou-gabrie/learning-synthetic-data)
In addition, using an information-geometric approach based on Kullback-Leibler distances, the analysis establishes results that determine the very existence of timers with which the application of sharp restart decreases or increases the completion-time's entropy. Our work sheds first light on the intricate interplay between restart and randomness -- as gauged by the Boltzmann-Gibbs-Shannon entropy. ... Referring to the last point relating to estimators, the dominant role of the discussion regards four classes of estimators: binning estimators (Shwartz-Ziv and Tishby, 2017), Kernel Density Estimation (KDE) (Kolchinsky and Tracey, 2017), variational and neural network based estimators (Belghazi et al., 2018), kernel-based estimators . Some of other estimators (Balda et al., 2018;Gabrié et al., 2019;Goldfeld et al., 2019b;Noshad et al., 2019;Shwartz-Ziv and Alemi, 2020) (Horn, 1990) will be numerically problematic for layers with many filters. Gabrié et al. (2019) propose a replica method from statistical physics to estimate the differential entropy (approximated MI). ... ... Some of other estimators (Balda et al., 2018;Gabrié et al., 2019;Goldfeld et al., 2019b;Noshad et al., 2019;Shwartz-Ziv and Alemi, 2020) (Horn, 1990) will be numerically problematic for layers with many filters. Gabrié et al. (2019) propose a replica method from statistical physics to estimate the differential entropy (approximated MI). To make this estimate, a network with wide layers is trained on a synthetic dataset to satisfy the orthogonal invariance of the weight matrices. ... ... Darlow and Storkey, 2019;Gabrié et al., 2019; Goldfeld et al., 2019a), believe that information compression does not induces improvements of generalisation. Much research also focuses on I(Y ; T ). ... Thesis This dissertation is on the analysis and applications of a constructive architecture for training Deep Neural Networks, which are usually trained by End-to-End gradient propagation with fixed depths. End-to-End training of Deep Neural Networks has proven to offer impressive performances in a number of applications such as computer vision, machine translation and in playing complex games such as GO. Cascade Learning, the approach of interest here, trains networks in a layer-wise fashion and has been demonstrated to achieve satisfactory performance in large scale tasks such as the popular ImageNet benchmark dataset, at substantially reduced computing and memory requirements. Here we focus on the nature of features extracted from Cascade Learning. By attempting to explain the process of learning using Tishby et al.s’ Information Bottleneck theory, we derive an empirical rule (Information Transition Ratio) to automatically determine a satisfactory depth for Deep Neural Networks. We suggest that Cascade Learning packs information in a hierarchical manner, with coarse features in early layers and more task specific features in later layers. This is verified by considering Transfer Learning whereby features learned from a data-rich source domain assist in learning a data-sparse target domain. Using a wide range of inference problems in medical imaging, human activity recognition and inference from single cell gene expression between mice and humans, we demonstrate that Transfer Learning from a cascade trained model outperforms results noted by previous authors. An exception to this is the single cell gene expression problem where a single hidden layer network happens to be an adequate solution. ... 
Inspired by studies exploring the flow of information in deep neural networks [23,24,34,35], the entropy of CNN activation maps was proposed as a compact description of texture in medical images [22,32,37]. Although these entropybased features were shown to be predictive of different diseases, they only offer a limited measure of heterogeneity and do not capture the full range of statistics describing the texture of affected tissues. ... Preprint Full-text available Imaging biomarkers offer a non-invasive way to predict the response of immunotherapy prior to treatment. In this work, we propose a novel type of deep radiomic features (DRFs) computed from a convolutional neural network (CNN), which capture tumor characteristics related to immune cell markers and overall survival. Our study uses four MRI sequences (T1-weighted, T1-weighted post-contrast, T2-weighted and FLAIR) with corresponding immune cell markers of 151 patients with brain tumor. The proposed method extracts a total of 180 DRFs by aggregating the activation maps of a pre-trained 3D-CNN within labeled tumor regions of MRI scans. These features offer a compact, yet powerful representation of regional texture encoding tissue heterogeneity. A comprehensive set of experiments is performed to assess the relationship between the proposed DRFs and immune cell markers, and measure their association with overall survival. Results show a high correlation between DRFs and various markers, as well as significant differences between patients grouped based on these markers. Moreover, combining DRFs, clinical features and immune cell markers as input to a random forest classifier helps discriminate between short and long survival outcomes, with AUC of 72\% and p=2.36$\times$10$^{-5}$. These results demonstrate the usefulness of proposed DRFs as non-invasive biomarker for predicting treatment response in patients with brain tumors. ... The case in which Z is actually Gaussian has been thoroughly studied [53,28,55,54,12,61]. Beyond Gaussianity, a rapidly growing literature is focusing on rotationally invariant models assuming perfect knowledge of the statistics of the structured matrix appearing in the problem (such as noise in inference, a sensing, data, or coding matrix in regression tasks, weight matrices in neural networks, or a matrix of interactions in spin glass models) [67,23,37,38,42,65,56,57,74,77,78,16,33,66,71,47]. However, despite this impressive progress when the noise statistics is known, low-rank estimation in a mismatched setting with partial to no knowledge of the statistics of the rotationally invariant noise matrix remains poorly understood. ... Preprint We consider the problem of estimating a rank-1 signal corrupted by structured rotationally invariant noise, and address the following question: how well do inference algorithms perform when the noise statistics is unknown and hence Gaussian noise is assumed? While the matched Bayes-optimal setting with unstructured noise is well understood, the analysis of this mismatched problem is only at its premises. In this paper, we make a step towards understanding the effect of the strong source of mismatch which is the noise statistics. Our main technical contribution is the rigorous analysis of a Bayes estimator and of an approximate message passing (AMP) algorithm, both of which incorrectly assume a Gaussian setup. 
The first result exploits the theory of spherical integrals and of low-rank matrix perturbations; the idea behind the second one is to design and analyze an artificial AMP which, by taking advantage of the flexibility in the denoisers, is able to "correct" the mismatch. Armed with these sharp asymptotic characterizations, we unveil a rich and often unexpected phenomenology. For example, although AMP is in principle designed to efficiently compute the Bayes estimator, the former is outperformed by the latter in terms of mean-square error. We show that this performance gap is due to an incorrect estimation of the signal norm. In fact, when the SNR is large enough, the overlaps of the AMP and the Bayes estimator coincide, and they even match those of optimal estimators taking into account the structure of the noise. ...

..., L. This scaling of the variable sizes is often assumed so that inferring σ* from Y is neither impossible nor trivial. This multi-layer GLM has been studied by various authors [36,40,56,77,85]. ...

Article
Full-text available
We consider generic optimal Bayesian inference, namely, models of signal reconstruction where the posterior distribution and all hyperparameters are known. Under a standard assumption on the concentration of the free energy, we show how replica symmetry in the strong sense of concentration of all multioverlaps can be established as a consequence of the Franz–de Sanctis identities; the identities themselves in the current setting are obtained via a novel perturbation coming from exponentially distributed "side-observations" of the signal. Concentration of multioverlaps means that asymptotically the posterior distribution has a particularly simple structure encoded by a random probability measure (or, in the case of a binary signal, a non-random probability measure). We believe that such strong control of the model should be key in the study of inference problems with underlying sparse graphical structure (error-correcting codes, block models, etc) and, in particular, in the rigorous derivation of replica-symmetric formulas for the free energy and mutual information in this context. ...

However, as neural networks are deterministic, the mutual information between layers can be either infinite or constant. Estimators can nevertheless be derived by injecting a small amount of noise into the layers' outputs, which makes it possible to demonstrate interesting compression behavior as the number of layers increases [138]; a minimal sketch of this construction is given after the thesis summary below. The information bottleneck framework can also be leveraged to study neural networks in the case of supervised learning, showing that during training the system first reduces the empirical error of the task (empirical error minimization) and then compresses its representation [139]. ...

Thesis
Among the diverse research fields within computer music, the synthesis and generation of audio signals epitomize the cross-disciplinarity of this domain, jointly nourishing scientific and artistic practices since its creation. Inherent in computer music since its genesis, audio generation has inspired numerous approaches, evolving with both musical practices and scientific/technical advances.
Moreover, some synthesis processes also naturally handle the reverse process, named analysis, such that synthesis parameters can be partially or totally extracted from actual sounds, providing an alternative representation of the analyzed audio signals. On top of that, the recent rise of machine learning algorithms has profoundly challenged scientific research, bringing powerful data-centred methods that, despite their efficiency, raise several epistemological questions among researchers. In particular, a family of machine learning methods called generative models focuses on the generation of original content using features extracted from an existing dataset. Such methods question not only previous approaches to generation, but also the way these methods can be integrated into existing creative processes. While these new generative frameworks are progressively being introduced in the domain of image generation, the application of such generative techniques to audio synthesis is still marginal.

In this work, we aim to propose a new audio analysis-synthesis framework based on these modern generative models, enhanced by recent advances in machine learning. We first review existing approaches, both in sound synthesis and in generative machine learning, and focus on how our work fits into both practices and what can be expected from bringing them together. Subsequently, we focus further on generative models, and on how modern advances in the domain can be exploited to learn complex sound distributions while remaining sufficiently flexible to be integrated into the creative flow of the user. We then propose an inference/generation process, mirroring the analysis/synthesis paradigms that are natural in the audio processing domain, using latent models based on a continuous higher-level space that we use to control the generation. We first provide preliminary results of our method applied to spectral information extracted from several datasets, and evaluate the obtained results both qualitatively and quantitatively. Subsequently, we study how to make these methods more suitable for learning audio data, tackling three different aspects in turn. First, we propose two latent regularization strategies specifically designed for audio, based on signal/symbol translation and on perceptual constraints. Then, we propose different methods to address the inner temporality of musical signals, based on the extraction of multi-scale representations and on prediction, so that the obtained generative spaces also model the dynamics of the signal.

In the last chapter, we shift from a scientific approach to a more research-and-creation-oriented point of view: first, we describe the architecture and design of our open-source library, vsacids, which aims to be usable by expert and non-expert music makers as an integrated creation tool. Then, we propose a first musical use of our system through the creation of a real-time performance, called aego, based jointly on our framework vsacids and on an explorative agent trained by reinforcement learning during the performance. Finally, we draw some conclusions on the different ways to improve and strengthen the proposed generation method, as well as on possible further creative applications. ...
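Returning to the point made before this thesis summary, that a deterministic layer yields an ill-defined I(X;T) unless noise is injected, the following sketch (our own construction under Gaussian-noise assumptions, not the estimator of [138]) adds isotropic Gaussian noise to the layer outputs, so that T becomes a finite-entropy Gaussian mixture whose entropy can be estimated by Monte Carlo:

```python
import numpy as np
from scipy.special import logsumexp

def noisy_layer_mi(h, sigma=0.1, rng=None):
    """I(X;T) for T = h(X) + N(0, sigma^2 I), X uniform over n inputs.

    h: (n, d) deterministic layer outputs. T is then a mixture of n
    Gaussians, so h(T) is estimated by Monte Carlo under the exact mixture
    density, while h(T|X) is the known Gaussian noise entropy.
    """
    rng = rng or np.random.default_rng(0)
    n, d = h.shape
    t = h + sigma * rng.normal(size=h.shape)             # samples of T
    sq = ((t[:, None, :] - h[None, :, :]) ** 2).sum(-1)  # |t_i - h_j|^2
    log_p = (logsumexp(-sq / (2 * sigma**2), axis=1) - np.log(n)
             - 0.5 * d * np.log(2 * np.pi * sigma**2))
    h_T = -log_p.mean()                                  # entropy of mixture
    h_T_given_X = 0.5 * d * np.log(2 * np.pi * np.e * sigma**2)
    return h_T - h_T_given_X
```

For well-separated outputs and small sigma the estimate approaches log n, the entropy of the finite input set, while larger sigma reveals the compression induced by the layer's geometry.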
High-dimensional analyses in the proportional asymptotics regime, similar to assumptions A1 to A3, have been widely used in statistical physics and random matrix-based analyses of inference algorithms (Zdeborová and Krzakala, 2016). The high-dimensional framework has yielded powerful results in a wide range of applications, such as estimation error in linear inverse problems (Donoho et al., 2009; Bayati and Montanari, 2011; Krzakala et al., 2012; Rangan et al., 2019; Hastie et al., 2019), convolutional inverse problems (Sahraee-Ardakan et al., 2021), dynamics of deep linear networks (Saxe et al., 2013), matrix factorization (Kabashima et al., 2016), binary classification (Taheri et al., 2020; Kini and Thrampoulidis, 2020), inverse problems with deep priors (Gabrié et al., 2019; Pandit et al., 2019, 2020), generalization error in linear and generalized linear models (Gerace et al., 2020; Emami et al., 2020; Loureiro et al., 2021; Gerbelot et al., 2020), random features (D'Ascoli et al., 2020), and the choice of optimal objective function for regression (Bean et al., 2013; Advani and Ganguli, 2016), to name a few. Our result that, under a similar set of assumptions, kernel regression degenerates to linear models is thus somewhat surprising. ...

Preprint
Empirical observations of high-dimensional phenomena, such as the double descent behaviour, have attracted a lot of interest in understanding classical techniques such as kernel methods, and their implications for explaining the generalization properties of neural networks. Many recent works analyze such models in a certain high-dimensional regime where the covariates are independent and the number of samples and the number of covariates grow at a fixed ratio (i.e. proportional asymptotics). In this work we show that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime. More surprisingly, when the data is generated by a kernel model where the relationship between the input and the response could be very nonlinear, we show that linear models are in fact optimal, i.e. linear models achieve the minimum risk among all models, linear or nonlinear. These results suggest that more complex models for the data, beyond independent features, are needed for high-dimensional analysis.

Chapter
Common medical image fusion algorithms often suffer from information redundancy and loss of spatial information. This paper therefore proposes a generative adversarial network method based on dual discriminators. First, the dual-discriminator network we designed produces an initial fusion; a fusion strategy is then used to enhance detailed features and complete the final fusion. Experimental results show that, compared with common representative methods, the fusion results obtained by our method have a clearer spatial structure and more accurate spectral information.

Preprint
We formulate and analyze the compound information bottleneck programming problem.
In this problem, a Markov chain $\mathsf{X} \rightarrow \mathsf{Y} \rightarrow \mathsf{Z}$ is assumed with fixed marginal distributions $\mathsf{P}_{\mathsf{X}}$ and $\mathsf{P}_{\mathsf{Y}}$, and the mutual information between $\mathsf{X}$ and $\mathsf{Z}$ is to be maximized over the choice of the conditional probability of $\mathsf{Z}$ given $\mathsf{Y}$ from a given class, under the worst choice of the joint probability of the pair $(\mathsf{X}, \mathsf{Y})$ from a different class. We consider several classes based on extremes of: mutual information; minimal correlation; total variation; and relative entropy. We provide values, bounds, and various characterizations for specific instances of this problem: the binary symmetric case, the scalar Gaussian case, the vector Gaussian case and the symmetric modulo-additive case. Finally, for the general case, we propose a Blahut-Arimoto-type alternating-iterations algorithm to find a consistent solution to this problem.
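Since the abstract above only names the algorithm, the following sketch shows the classical (non-compound) information-bottleneck alternating iterations that such a Blahut-Arimoto scheme generalizes; the compound version additionally optimizes over a worst-case joint distribution, which is not reproduced here. All function names are ours.

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_z, beta, n_iter=200, seed=0):
    """Classical information-bottleneck iterations (after Tishby et al.):
    alternate self-consistent updates of the encoder p(z|x), the marginal
    p(z) and the decoder p(y|z) to minimize I(X;Z) - beta * I(Z;Y).

    p_xy: (n_x, n_y) joint distribution of the observed pair (X, Y).
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    # Random stochastic initialization of the encoder p(z|x).
    enc = rng.random((n_x, n_z))
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_z = p_x @ enc
        # Decoder by Bayes: p(y|z) = sum_x p(x) p(z|x) p(y|x) / p(z).
        dec = (enc * p_x[:, None]).T @ p_y_given_x / p_z[:, None]
        # KL(p(y|x) || p(y|z)) for every (x, z) pair.
        log_ratio = np.log(p_y_given_x[:, None, :] + 1e-12) \
                  - np.log(dec[None, :, :] + 1e-12)
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=-1)
        # Encoder update: p(z|x) proportional to p(z) * exp(-beta * KL).
        logits = np.log(p_z + 1e-12)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)
        enc = np.exp(logits)
        enc /= enc.sum(axis=1, keepdims=True)
    return enc, dec
```

Each update exactly minimizes the IB Lagrangian with the other distributions held fixed, which is what drives the iterations to a self-consistent solution.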
Article
We investigate the analogy between the renormalization group (RG) and deep neural networks, wherein subsequent layers of neurons are analogous to successive steps along the RG. In particular, we quantify the flow of information by explicitly computing the relative entropy or Kullback-Leibler divergence in both the one- and two-dimensional Ising models under decimation RG, as well as in a feedforward neural network as a function of depth. We observe qualitatively identical behavior characterized by the monotonic increase to a parameter-dependent asymptotic value. On the quantum field theory side, the monotonic increase confirms the connection between the relative entropy and the c-theorem. For the neural networks, the asymptotic behavior may have implications for various information maximization methods in machine learning, as well as for disentangling compactness and generalizability. Furthermore, while both the two-dimensional Ising model and the random neural networks we consider exhibit non-trivial critical points, the relative entropy appears insensitive to the phase structure of either system. In this sense, more refined probes are required in order to fully elucidate the flow of information in these models.
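As a toy companion to this abstract, the following hedged sketch (a simplification of the paper's setup, under our own conventions) tracks the Kullback-Leibler divergence between the initial 1D Ising ensemble and the ensemble at the decimation-flowed coupling, where one RG step maps $\tanh K \mapsto \tanh^2 K$. On a small periodic chain everything can be enumerated exactly, and the divergence increases monotonically toward a parameter-dependent asymptote as the flow heads to the trivial fixed point.

```python
import numpy as np
from itertools import product

def boltzmann(N, K):
    """Exact Boltzmann distribution of a periodic 1D Ising chain of N spins
    with nearest-neighbour coupling K (all 2^N configurations enumerated)."""
    confs = np.array(list(product([-1, 1], repeat=N)))
    logw = K * (confs * np.roll(confs, 1, axis=1)).sum(axis=1)
    logw -= logw.max()
    p = np.exp(logw)
    return p / p.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

# One decimation step (tracing out every other spin) renormalizes the
# coupling as tanh(K') = tanh(K)^2.  Track D(p_K || p_{K_n}) versus the
# number n of RG steps: it grows monotonically toward the asymptote
# D(p_K || uniform) = N*log(2) - H(p_K) as K_n flows to zero.
N, K = 10, 1.0
p0 = boltzmann(N, K)
Kn = K
for step in range(1, 7):
    Kn = np.arctanh(np.tanh(Kn) ** 2)
    print(step, kl(p0, boltzmann(N, Kn)))
```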
Article
Full-text available
Generalized linear models (GLMs) are used in high-dimensional machine learning, statistics, communications, and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes, or benchmark models in neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal estimation and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and the dimension are large and their ratio is fixed. Nonrigorous predictions for the optimal errors existed for special cases of GLMs, e.g., for the perceptron, in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance and locate the associated sharp phase transitions separating learnable and nonlearnable regions. We believe that this random version of GLMs can serve as a challenging benchmark for multipurpose algorithms.
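For readers unfamiliar with the setup, here is a minimal sketch (variable names and the noise level are our own choices) of the random-data GLM the article analyzes, in teacher-student form: a random i.i.d. data matrix, a teacher weight vector, and labels produced through a nonlinear channel, with both dimensions taken large at fixed ratio alpha = n/d.

```python
import numpy as np

# Teacher-student GLM in the proportional high-dimensional regime:
# n samples, d covariates, with alpha = n/d held fixed as both grow.
rng = np.random.default_rng(0)
d, alpha = 500, 2.0
n = int(alpha * d)

w_star = rng.normal(size=d)                   # teacher weights
X = rng.normal(size=(n, d)) / np.sqrt(d)      # random i.i.d. data matrix
y = np.sign(X @ w_star + 0.1 * rng.normal(size=n))  # noisy perceptron labels
```

In the limit d → ∞ at fixed alpha, quantities such as the mutual information per variable and the Bayes-optimal generalization error of this ensemble concentrate on deterministic functions of alpha and the noise level, which is what the replica formula proved in the article computes.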
Conference Paper
Full-text available
The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case. Through a combination of analytical results and simulation, we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
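The saturation mechanism described above is easy to see numerically. The hedged sketch below (our own illustration, not the authors' code) pushes Gaussian inputs through a single random layer whose weight scale is gradually increased, and compares the binned entropy of tanh and ReLU activations: the tanh entropy collapses toward log 2 as activations pile up at ±1, while the ReLU histogram is scale-invariant in shape, so its entropy stays essentially unchanged.

```python
import numpy as np

def binned_entropy(t, n_bins=30):
    """Entropy (nats) of a 1D activation sample after uniform binning."""
    counts, _ = np.histogram(t, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 50))
W0 = rng.normal(size=(50, 50)) / np.sqrt(50)
for scale in [0.5, 1.0, 2.0, 4.0, 8.0]:
    pre = x @ (scale * W0)                    # pre-activations, std ~ scale
    print(f"scale={scale}: "
          f"H_tanh={binned_entropy(np.tanh(pre).ravel()):.2f}  "
          f"H_relu={binned_entropy(np.maximum(pre, 0).ravel()):.2f}")
```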
Article
Full-text available
We argue that the estimation of the mutual information between high dimensional continuous random variables is achievable by gradient descent over neural networks. This paper presents a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size. MINE is back-propagable and we prove that it is strongly consistent. We illustrate a handful of applications in which MINE is successfully applied to enhance the properties of generative models in both unsupervised and supervised settings. We apply our framework to estimate the information bottleneck, and apply it in tasks related to supervised classification problems. Our results demonstrate substantial added flexibility and improvement in these settings.
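The Donsker-Varadhan construction behind MINE fits in a few lines. The sketch below (a minimal PyTorch rendition under our own naming; the moving-average gradient bias correction of the original paper is omitted) trains a small statistics network T(x, z) and reports the lower bound E_{p(x,z)}[T] - log E_{p(x)p(z)}[e^T]:

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta(x, z): the critic in the Donsker-Varadhan bound."""
    def __init__(self, dim_x, dim_z, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1))

def dv_lower_bound(T, x, z):
    """I(X;Z) >= E_p(x,z)[T] - log E_p(x)p(z)[exp(T)]."""
    joint = T(x, z).mean()
    z_perm = z[torch.randperm(z.shape[0])]      # break the coupling
    t_marg = T(x, z_perm).squeeze(1)
    marginal = torch.logsumexp(t_marg, dim=0) - math.log(len(t_marg))
    return joint - marginal

# Toy usage: jointly Gaussian pair with known MI = 0.5*log(5) ~ 0.80 nats.
torch.manual_seed(0)
x = torch.randn(512, 1)
z = x + 0.5 * torch.randn(512, 1)
critic = StatisticsNetwork(1, 1)
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = -dv_lower_bound(critic, x, z)
    loss.backward()
    opt.step()
print("MI estimate (nats):", -loss.item())
```

Shuffling z within the batch is the standard trick for sampling the product of marginals; the resulting estimate is a lower bound, so undertraining the critic biases it downward rather than upward.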