Entropy and mutual information
in models of deep neural networks*
Marylou Gabrié1, Andre Manoel2, Clément Luneau3,
Jean Barbier4, Nicolas Macris3, Florent Krzakala1
and Lenka Zdeborová5
1 Laboratoire de Physique de l'École Normale Supérieure, ENS,
Université PSL, CNRS, Sorbonne Université, Université de Paris, France
2 OWKIN, Inc., New York, NY, United States of America
3 Laboratoire de Théorie des Communications, École Polytechnique
Fédérale de Lausanne, Switzerland
4 International Center for Theoretical Physics, Trieste, Italy
5 Institut de Physique Théorique, CEA, CNRS, Université Paris-Saclay,
France
E-mail: marylou.gabrie@ens.fr
Received 30 May 2019
Accepted for publication 25 June 2019
Published 20 December 2019
Online at stacks.iop.org/JSTAT/2019/124014
https://doi.org/10.1088/1742-5468/ab3430
Abstract. We examine a class of stochastic deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) we show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experimental framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.

© 2019 The Author(s). Published by IOP Publishing Ltd on behalf of SISSA Medialab srl. Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

* This article is an updated version of: Gabrié M, Manoel A, Luneau C, Barbier J, Macris N, Krzakala F and Zdeborová L 2018 Entropy and mutual information in models of deep neural networks Advances in Neural Information Processing Systems 31 (Red Hook, NY: Curran Associates, Inc.) pp 1821–31
Keywords: machine learning
Contents
1. Multi-layer model and main theoretical results
   1.1. A stochastic multi-layer model
   1.2. Replica formula
   1.3. Rigorous statement
2. Tractable models for deep learning
   2.1. Other related works
3. Numerical experiments
   3.1. Estimators and activation comparisons
   3.2. Learning experiments with linear networks
   3.3. Learning experiments with deep non-linear networks
4. Conclusion and perspectives
Acknowledgments
References
The successes of deep learning methods have spurred efforts towards quantitative modeling of the performance of deep neural networks. In particular, an information-theoretic approach linking generalization capabilities to compression has been receiving increasing interest. The intuition behind the study of mutual informations in latent variable models dates back to the information bottleneck (IB) theory of [1]. Although recently reformulated in the context of deep learning [2], verifying its relevance in practice requires the computation of mutual informations for high-dimensional variables, a notoriously hard problem. Thus, pioneering works in this direction focused either on small network models with discrete (or binned continuous) activations [3], or on linear networks [4, 5].
In the present paper we follow a dierent direction, and build on recent results
from statistical physics [6, 7] and information theory [8, 9] to propose, in section 1, a
formula to compute information-theoretic quantities for a class of deep neural network
models. The models we approach, described in section 2, are non-linear feed-forward
neural networks trained on synthetic datasets with constrained weights. Such networks
capture some of the key properties of the deep learning setting that are usually dicult
to include in tractable frameworks: non-linearities, arbitrary large width and depth,
and correlations in the input data. We demonstrate the proposed method in a series
of numerical experiments in section 3. First observations suggest a rather complex
Entropy and mutual information in models of deep neural networks
3
https://doi.org/10.1088/1742-5468/ab3430
J. Stat. Mech. (2019) 124014
picture, where the role of compression in the generalization ability of deep neural net-
works is yet to be elucidated.
1. Multi-layer model and main theoretical results
1.1. A stochastic multi-layer model
We consider a model of multi-layer stochastic feed-forward neural network where each element $x_i$ of the input layer $x \in \mathbb{R}^{n_0}$ is distributed independently as $P_0(x_i)$, while hidden units $t_{\ell,i}$ at each successive layer (vectors are column vectors) come from $P_\ell(t_{\ell,i} \mid W_{\ell,i}\, t_{\ell-1})$, with $t_0 \equiv x$ and $W_{\ell,i}$ denoting the $i$th row of the matrix of weights $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$. In other words
$$t_{0,i} = x_i \sim P_0(\cdot), \quad t_{1,i} \sim P_1(\cdot \mid W_{1,i}\, x), \;\dots,\; t_{L,i} \sim P_L(\cdot \mid W_{L,i}\, t_{L-1}), \qquad (1)$$
given a set of weight matrices $\{W_\ell\}_{\ell=1}^{L}$ and distributions $\{P_\ell\}_{\ell=1}^{L}$ which encode possible non-linearities and stochastic noise applied to the hidden layer variables, and $P_0$ that generates the visible variables. In particular, for a non-linearity $t_{\ell,i} = \varphi_\ell(h, \xi_{\ell,i})$, where $\xi_{\ell,i} \sim P_\xi(\cdot)$ is the stochastic noise (independent for each $i$), we have $P_\ell(t_{\ell,i} \mid W_{\ell,i}\, t_{\ell-1}) = \int \mathrm{d}P_\xi(\xi_{\ell,i})\, \delta\big(t_{\ell,i} - \varphi_\ell(W_{\ell,i}\, t_{\ell-1}, \xi_{\ell,i})\big)$. Model (1) thus describes a Markov chain which we denote by $X \to T_1 \to T_2 \to \cdots \to T_L$, with $T_\ell = \varphi_\ell(W_\ell T_{\ell-1}, \xi_\ell)$, $\xi_\ell = \{\xi_{\ell,i}\}_{i=1}^{n_\ell}$, and the activation function $\varphi_\ell$ applied componentwise.
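To make the model concrete, here is a minimal NumPy sketch of sampling from the Markov chain (1). The layer widths, the activations and the choice of additive Gaussian noise injected just before the non-linearity (the convention used in the experiments of section 3) are illustrative assumptions, not values prescribed above.

```python
# Minimal sketch: sampling (X, T_1, ..., T_L) from the stochastic model (1)
# with Gaussian i.i.d. weights and additive noise before each activation.
import numpy as np

rng = np.random.default_rng(0)

def sample_network(n_sizes, activations, noise_std=1e-2, n_samples=1):
    """n_sizes = [n0, n1, ..., nL]; activations = list of L componentwise maps."""
    x = rng.standard_normal((n_samples, n_sizes[0]))  # separable prior P_0
    layers = [x]
    t = x
    for l, (n_in, n_out) in enumerate(zip(n_sizes[:-1], n_sizes[1:])):
        W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)  # i.i.d. weights
        h = t @ W.T                                             # pre-activation W_l t_{l-1}
        xi = noise_std * rng.standard_normal(h.shape)           # stochastic noise xi_l
        t = activations[l](h + xi)                              # T_l = phi_l(W_l T_{l-1} + xi_l)
        layers.append(t)
    return layers

# Example: X -> T1 -> T2 with hardtanh then ReLU activations
hardtanh = lambda h: np.clip(h, -1.0, 1.0)
relu = lambda h: np.maximum(h, 0.0)
samples = sample_network([1000, 1000, 1000], [hardtanh, relu], n_samples=10)
```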
1.2. Replica formula
We shall work in the asymptotic high-dimensional statistics regime where all $\tilde\alpha_\ell \equiv n_\ell / n_0$ are of order one while $n_0 \to \infty$, and make the important assumption that all matrices $W_\ell$ are orthogonally-invariant random matrices independent from each other; in other words, each matrix $W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ can be decomposed as a product of three matrices, $W_\ell = U_\ell S_\ell V_\ell$, where $U_\ell \in O(n_\ell)$ and $V_\ell \in O(n_{\ell-1})$ are independently sampled from the Haar measure, and $S_\ell$ is a diagonal matrix of singular values.
The main technical tool we use is a formula for the entropies of the hidden variables, $H(T_\ell) = -\mathbb{E}_{T_\ell} \ln P_{T_\ell}(t_\ell)$, and the mutual information between adjacent layers, $I(T_\ell; T_{\ell-1}) = H(T_\ell) + \mathbb{E}_{T_\ell, T_{\ell-1}} \ln P_{T_\ell \mid T_{\ell-1}}(t_\ell \mid t_{\ell-1})$, based on the heuristic replica method [6, 7, 10, 11]:
Claim 1 (Replica formula). Assume model (1) with $L$ layers in the high-dimensional limit with componentwise activation functions and weight matrices generated from the ensemble described above, and denote by $\lambda_{W_k}$ the eigenvalues of $W_k^{\top} W_k$. Then for any $\ell \in \{1, \dots, L\}$ the normalized entropy of $T_\ell$ is given by the minimum among all stationary points of the replica potential:
$$\lim_{n_0 \to \infty} \frac{1}{n_0} H(T_\ell) = \min\, \underset{A, V, \tilde A, \tilde V}{\mathrm{extr}}\ \phi_\ell(A, V, \tilde A, \tilde V), \qquad (2)$$
which depends on $\ell$-dimensional vectors $A, V, \tilde A, \tilde V$, and is written in terms of mutual informations $I$ and conditional entropies $H$ of scalar variables as
$$\phi_\ell(A, V, \tilde A, \tilde V) = I\big(t_0;\, t_0 + \xi_0/\sqrt{\tilde A_1}\big) - \frac{1}{2}\sum_{k=1}^{\ell}\Big[\tilde\alpha_{k-1}\tilde A_k V_k + \alpha_k A_k \tilde V_k - F_{W_k}(A_k V_k)\Big] + \sum_{k=1}^{\ell-1}\tilde\alpha_k\Big[H\big(t_k \mid \xi_k;\, \tilde A_{k+1}, \tilde V_k \rho_k\big) - \tfrac{1}{2}\log\big(2\pi e\, \tilde A_{k+1}^{-1}\big)\Big] + \tilde\alpha_\ell\, H\big(t_\ell \mid \xi_\ell;\, \tilde V_\ell \rho_\ell\big), \qquad (3)$$
where $\alpha_k = n_k/n_{k-1}$, $\tilde\alpha_k = n_k/n_0$, $\rho_k = \int \mathrm{d}P_{k-1}(t)\, t^2$, $\tilde\rho_k = (\mathbb{E}_{\lambda_{W_k}} \lambda_{W_k})\, \rho_k / \alpha_k$, and $\xi_k \sim \mathcal{N}(0, 1)$ for $k = 0, \dots, \ell$. In the computation of the conditional entropies in (3), the scalar $t_k$-variables are generated from $P(t_0) = P_0(t_0)$ and
$$P(t_k \mid \xi_k;\, A, V, \rho) = \mathbb{E}_{\tilde\xi, \tilde z}\, P_k\big(t_k + \tilde\xi/\sqrt{A} \,\big|\, \sqrt{\rho_k - V}\,\xi_k + \sqrt{V}\,\tilde z\big), \quad k = 1, \dots, \ell-1, \qquad (4)$$
$$P(t_\ell \mid \xi_\ell;\, V, \rho) = \mathbb{E}_{\tilde z}\, P_\ell\big(t_\ell \,\big|\, \sqrt{\rho_\ell - V}\,\xi_\ell + \sqrt{V}\,\tilde z\big), \qquad (5)$$
where $\tilde\xi$ and $\tilde z$ are independent $\mathcal{N}(0, 1)$ random variables. Finally, the function $F_{W_k}(x)$ depends on the distribution of the eigenvalues $\lambda_{W_k}$ following
$$F_{W_k}(x) = \min_{\theta \in \mathbb{R}}\Big[\, 2\alpha_k \theta + (\alpha_k - 1)\ln(1 - \theta) + \mathbb{E}_{\lambda_{W_k}} \ln\big[\lambda_{W_k}\, x + (1 - \theta)(1 - \alpha_k \theta)\big] \Big]. \qquad (6)$$
The computation of the entropy in the large-dimensional limit, a computationally difficult task, has thus been reduced to the extremization of a function of $4\ell$ variables, which requires evaluating only single or two-dimensional integrals. This extremization can be done efficiently by means of a fixed-point iteration started from different initial conditions, as detailed in the supplementary material (stacks.iop.org/JSTAT/19/124014/mmedia). Moreover, a user-friendly Python package is provided [12], which performs the computation for different choices of prior $P_0$, activations $\varphi_\ell$ and spectra $\lambda_{W_\ell}$. Finally, the mutual information between successive layers $I(T_\ell; T_{\ell-1})$ can be obtained from the entropy following the evaluation of an additional two-dimensional integral; see section 1.6.1 of the supplementary material.

Our approach in the derivation of (3) builds on recent progress in statistical estimation and information theory for generalized linear models following the application of methods from the statistical physics of disordered systems [10, 11] in communications [13], statistics [14] and machine learning problems [15, 16]. In particular, we use advanced mean-field theory [17] and the heuristic replica method [6, 10], along with its recent extension to multi-layer estimation [7, 8], in order to derive the above formula (3). This derivation is lengthy and thus given in the supplementary material. In a related contribution, Reeves [9] proposed a formula for the mutual information in the multi-layer setting, using heuristic information-theoretic arguments. Like ours, it exhibits layer-wise additivity, and the two formulas are conjectured to be equivalent.
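The extremization is typically carried out with a damped fixed-point iteration run from several initial conditions, keeping the stationary point of smallest potential, as prescribed by the "min extr" in (2). The sketch below shows only this generic outer loop; the update map F (which follows from the saddle-point equations of (3)) and the potential itself are placeholders to be supplied by the user — they are implemented in the dnner package [12] and are not reproduced here.

```python
# Generic damped fixed-point scheme for extremizing a replica potential.
# F and potential are placeholders standing in for the saddle-point update
# and for phi_ell; they are NOT reproduced from the paper.
import numpy as np

def fixed_point(F, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Iterate x <- (1 - damping) * x + damping * F(x) until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - damping) * x + damping * np.asarray(F(x))
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

def extremize(potential, F, inits):
    """Run the iteration from several initializations and keep the stationary
    point with the smallest value of the potential ('min extr' in (2))."""
    candidates = [fixed_point(F, x0) for x0 in inits]
    return min(candidates, key=potential)
```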
1.3. Rigorous statement
We recall the assumptions under which the replica formula of claim 1 is conjectured to be exact: (i) weight matrices are drawn from an ensemble of random orthogonally-invariant matrices, (ii) matrices at different layers are statistically independent and (iii) layers have a large dimension and the respective sizes of adjacent layers are such that weight matrices have aspect ratios $\{\alpha_k\}_{k=1}^{L}$ of order one. While we could not prove the replica prediction in full generality, we stress that it comes with multiple credentials: (i) for a Gaussian prior $P_0$ and Gaussian distributions $P_\ell$, it corresponds to the exact analytical solution when weight matrices are independent of each other (see section 1.6.2 of the supplementary material). (ii) In the single-layer case with a Gaussian weight matrix, it reduces to formula (6) of the supplementary material, which has recently been rigorously proven for (almost) all activation functions $\varphi$ [18]. (iii) In the case of Gaussian distributions $P_\ell$, it has also been proven for a large ensemble of random matrices [19], and (iv) it is consistent with all the results of the AMP [20–22] and VAMP [23] algorithms, and their multi-layer versions [7, 8], known to perform well for these estimation problems.
In order to go beyond results for the single-layer problem and heuristic arguments, we prove claim 1 for the more involved multi-layer case, assuming Gaussian i.i.d. matrices and two non-linear layers:

Theorem 1 (Two-layer Gaussian replica formula). Suppose (H1) the input units distribution $P_0$ is separable and has bounded support; (H2) the activations $\varphi_1$ and $\varphi_2$, corresponding to $P_1(t_{1,i} \mid W_{1,i}\, x)$ and $P_2(t_{2,i} \mid W_{2,i}\, t_1)$, are bounded $C^2$ functions with bounded first and second derivatives w.r.t. their first argument; and (H3) the weight matrices $W_1$, $W_2$ have Gaussian i.i.d. entries. Then for model (1) with two layers ($L = 2$) the high-dimensional limit of the entropy verifies claim 1.
The theorem, which settles the conjecture presented in [7], is proven using the adaptive interpolation method of [18, 24, 25] in a multi-layer setting, as first developed in [26]. The lengthy proof, presented in detail in section 2 of the supplementary material, is of independent interest; it adds further credentials to the replica formula and offers a clear direction for further developments. Note that, following the same approximation arguments as in [18], where the proof is given for the single-layer case, hypothesis (H1) can be relaxed to the existence of the second moment of the prior, (H2) can be dropped, and (H3) can be extended to matrices with i.i.d. entries of zero mean, $O(1/n_0)$ variance and finite third moment.
2. Tractable models for deep learning
The multi-layer model presented above can be leveraged to simulate two prototypical settings of deep supervised learning on synthetic datasets amenable to the tractable replica computation of entropies and mutual informations.

The first scenario is the so-called teacher-student setting (see figure 1, left). Here, we assume that the input $x$ is distributed according to a separable prior distribution $P_X(x) = \prod_i P_0(x_i)$, factorized in the components of $x$, and the corresponding label $y$ is given by applying a mapping $x \to y$, called the teacher. After generating a train and a test set in this manner, we train a deep neural network, the student, on the synthetic dataset. In this case, the data themselves have a simple structure given by $P_0$.
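As an illustration, the sketch below generates a teacher-student dataset with a single-layer linear teacher and additive Gaussian noise, anticipating the experiment of section 3.2; all sizes and the noise level are illustrative choices.

```python
# Minimal sketch of the teacher-student scenario: inputs from a separable
# Gaussian prior, labels produced by a fixed noisy linear teacher.
import numpy as np

rng = np.random.default_rng(0)

def make_teacher_student_data(n_samples, n_in, n_out=4, noise_std=0.1, W_teach=None):
    X = rng.standard_normal((n_samples, n_in))            # P_X(x) = prod_i P_0(x_i)
    if W_teach is None:                                    # draw the teacher once
        W_teach = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    Y = X @ W_teach.T + noise_std * rng.standard_normal((n_samples, n_out))
    return X, Y, W_teach

# Train and test sets of equal size, generated by the same teacher
X_train, Y_train, W_teach = make_teacher_student_data(20_000, n_in=100)
X_test, Y_test, _ = make_teacher_student_data(20_000, n_in=100, W_teach=W_teach)
```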
In contrast, the second scenario allows for generative models (see figure 1, right) that create more structure, and that are reminiscent of the generative-recognition pair of models of a variational autoencoder (VAE). A code vector $y$ is sampled from a separable prior distribution $P_Y(y) = \prod_i P_0(y_i)$ and a corresponding data point $x$ is generated by a possibly stochastic neural network, the generative model. This setting makes it possible to create input data $x$ featuring correlations, differently from the teacher-student scenario. The studied supervised learning task then consists in training a deep neural net, the recognition model, to recover the code $y$ from $x$.
In both cases, the chain going from $X$ to any later layer is a Markov chain of the form (1). In the first scenario, model (1) directly maps to the student network. In the second scenario, however, model (1) actually maps to the feed-forward combination of the generative model followed by the recognition model. This shift is necessary to verify the assumption that the starting point (now given by $Y$) has a separable distribution. In particular, it generates correlated input data $X$ while still allowing for the computation of the entropy of any $T_\ell$.
At the start of neural network training, weight matrices initialized as i.i.d. Gaussian random matrices satisfy the necessary assumptions of the formula of claim 1. In their singular value decomposition
$$W_\ell = U_\ell S_\ell V_\ell, \qquad (7)$$
the matrices $U_\ell \in O(n_\ell)$ and $V_\ell \in O(n_{\ell-1})$ are typical independent samples from the Haar measure across all layers. To make sure weight matrices remain close enough to independent during learning, we define a custom weight constraint which consists in keeping $U_\ell$ and $V_\ell$ fixed while only the matrix $S_\ell$, constrained to be diagonal, is updated. The number of parameters is thus reduced from $n_\ell \times n_{\ell-1}$ to $\min(n_\ell, n_{\ell-1})$. We refer to layers following this weight constraint as USV-layers. For the replica formula of claim 1 to be correct, the matrices $S_\ell$ from different layers should furthermore remain uncorrelated during the learning. In section 3, we consider the training of linear networks for which information-theoretic quantities can be computed analytically, and confirm numerically that with USV-layers the replica-predicted entropy is correct at all times. In the following, we assume that this is also the case for non-linear networks.

Figure 1. Two models of synthetic data.
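For concreteness, below is a hypothetical tf.keras re-implementation of such a USV-layer: the factors $U$ and $V$ are drawn once at build time and frozen, and only the vector of singular values is trainable. This is an independent sketch written for this text, not the code of the package [50] used in the experiments.

```python
# Hypothetical USV-layer: fixed (approximately Haar) orthogonal factors U, V,
# trainable diagonal S only. Independent sketch, not the paper's package [50].
import numpy as np
import tensorflow as tf

class USVLayer(tf.keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        n_in, n_out = int(input_shape[-1]), self.units
        self.r = min(n_in, n_out)
        # Fixed factors U in O(n_out), V in O(n_in) (Haar up to a sign convention)
        u = np.linalg.qr(np.random.randn(n_out, n_out))[0]
        v = np.linalg.qr(np.random.randn(n_in, n_in))[0]
        self.u = tf.constant(u, dtype="float32")
        self.v = tf.constant(v, dtype="float32")
        # The only trainable parameters: the diagonal of S
        self.s = self.add_weight(name="s", shape=(self.r,),
                                 initializer="ones", trainable=True)

    def call(self, inputs):
        # Compute x W^T with W = U S V, i.e. x V^T S^T U^T
        z = tf.matmul(inputs, self.v, transpose_b=True)        # x V^T
        z = z[:, :self.r] * self.s                              # apply diagonal S
        if self.units > self.r:
            z = tf.pad(z, [[0, 0], [0, self.units - self.r]])   # embed into R^{n_out}
        z = tf.matmul(z, self.u, transpose_b=True)              # multiply by U^T
        return self.activation(z)
```

A USV-constrained network can then be assembled as usual, e.g. tf.keras.Sequential([tf.keras.Input(shape=(1500,)), USVLayer(1500), ...]).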
In section 3.2 of the supplementary material, we train a neural network with USV-layers on a simple real-world dataset (MNIST), showing that these layers can learn to represent complex functions despite their restriction. We further note that such a product decomposition is reminiscent of a series of works on adaptive structured efficient linear layers (SELLs and ACDC) [27, 28], motivated in that case by speed gains, where only diagonal matrices are learned (in these works the matrices $U$ and $V$ are chosen instead as permutations of Fourier or Hadamard matrices, so that the matrix multiplication can be replaced by fast transforms). In section 3, we discuss learning experiments with USV-layers on synthetic datasets.
While we have defined model (1) as a stochastic model, traditional feed-forward neural networks are deterministic. In the numerical experiments of section 3, we train and test networks without injecting noise, and only assume a noise model in the computation of information-theoretic quantities. Indeed, for continuous variables the presence of noise is necessary for mutual informations to remain finite (see the discussion of appendix C in [5]). We assume at layer $\ell$ an additive white Gaussian noise of small amplitude just before passing through its activation function to obtain $H(T_\ell)$ and $I(T_\ell; T_{\ell-1})$, while keeping the mapping $X \to T_{\ell-1}$ deterministic. This choice attempts to stay as close as possible to the deterministic neural network, but remains inevitably somewhat arbitrary (see again the discussion of appendix C in [5]).
2.1. Other related works
The strategy of studying neural network models with random weight matrices and/or random data, using methods originating in statistical physics heuristics such as the replica and cavity methods [10], has a long history. Before the deep learning era, this approach led to pioneering results on learning in the Hopfield model [29] and in the random perceptron [15, 16, 30, 31].

Recently, the successes of deep learning, along with the disqualifying complexity of studying real-world problems, have sparked renewed interest in the direction of random weight matrices. Recent results (without aiming at exhaustivity) were obtained on the spectrum of the Gram matrix at each layer using random matrix theory [32, 33], on the expressivity of deep neural networks [34], on the dynamics of propagation and learning [35–38], on the high-dimensional non-convex landscape where the learning takes place [39], and on the universal random Gaussian neural nets of [40].
The information bottleneck theory [1] applied to neural networks consists in computing the mutual information between the data and the learned hidden representations on the one hand, and between the labels and the same hidden representations on the other hand [2, 3]. A successful training should maximize the information with respect to the labels and simultaneously minimize the information with respect to the input data, preventing overfitting and leading to good generalization. While this intuition suggests new learning algorithms and regularizers [41–47], we can also hypothesize that this mechanism is already at play in a priori unrelated, commonly used optimization methods, such as plain stochastic gradient descent (SGD). It was first tested in practice by [3] on very small neural networks, to allow the entropy to be estimated by binning the hidden neurons' activities. Afterwards, the authors of [5] reproduced the results of [3] on small networks using the continuous entropy estimator of [45], but found that the overall behavior of mutual information during learning is greatly affected when changing the nature of the non-linearities. Additionally, they investigated the training of larger linear networks on i.i.d. normally distributed inputs, where entropies at each hidden layer can be computed analytically for an additive Gaussian noise. The strategy proposed in the present paper allows us to evaluate entropies and mutual informations in non-linear networks larger than those in [3, 5].
3. Numerical experiments
We present a series of experiments aiming both at further validating the replica estimator and at leveraging its power in noteworthy applications. A first application, presented in section 3.1, consists in using the replica formula, in settings where it is proven to be rigorously exact, as a basis of comparison for other entropy estimators. The same experiment also contributes to the discussion of the information bottleneck theory for neural networks by showing how, without any learning, information-theoretic quantities behave differently for different non-linearities. In section 3.2, we validate the accuracy of the replica formula in a learning experiment with USV-layers (where it is not proven to be exact) by considering the case of linear networks, for which information-theoretic quantities can otherwise be computed in closed form. We finally consider, in section 3.3, a second application testing the information bottleneck theory for large non-linear networks. To this aim, we use the replica estimator to study compression effects during learning.
3.1. Estimators and activation comparisons
Two non-parametric estimators have already been considered by [5] to compute entropies and/or mutual informations during learning. The kernel-density approach of Kolchinsky et al [45] consists in fitting a mixture of Gaussians (MoG) to samples of the variable of interest and subsequently computing an upper bound on the entropy of the MoG [48]. The method of Kraskov et al [49] uses nearest-neighbor distances between samples to directly build an estimate of the entropy. Both methods require the computation of the matrix of distances between samples. Recently, [46] proposed a new non-parametric estimator for mutual informations which involves the optimization of a neural network to tighten a bound. It is unfortunately computationally hard to test how these estimators behave in high dimension, as even for a known distribution the computation of the entropy is intractable in most cases. The replica method proposed here, however, is a valuable point of comparison for cases where it is rigorously exact.
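To make the nearest-neighbor idea concrete, below is a minimal Kozachenko-Leonenko-type k-nearest-neighbor entropy estimator, the basic ingredient behind the estimator of Kraskov et al [49]; this is a generic textbook version written for this text, not the exact implementation benchmarked here, and the sample size and k are arbitrary.

```python
# Generic k-nearest-neighbor (Kozachenko-Leonenko) differential entropy estimator.
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=3):
    """Estimate the differential entropy (nats) of i.i.d. samples of shape
    (n_samples, dim) from distances to the k-th nearest neighbor."""
    n, d = samples.shape
    tree = cKDTree(samples)
    # query returns the point itself first, hence k + 1 neighbors
    r_k = tree.query(samples, k=k + 1)[0][:, -1]
    log_ball_volume = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_ball_volume + d * np.mean(np.log(r_k))

# Sanity check on a known case: for X ~ N(0, I_2), H = log(2 * pi * e) ~ 2.84
x = np.random.default_rng(0).standard_normal((5000, 2))
print(knn_entropy(x), np.log(2 * np.pi * np.e))
```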
In the first numerical experiment we place ourselves in the setting of theorem 1: a 2-layer network with i.i.d. weight matrices, where the formula of claim 1 is thus rigorously exact in the limit of large networks, and we compare the replica results with the non-parametric estimators of [45] and [49]. Note that the requirement for smooth activations (H2) of theorem 1 can be relaxed (see the discussion below the theorem). Additionally, non-smooth functions can be approximated arbitrarily closely by smooth functions with equal information-theoretic quantities, up to numerical precision.
We consider a neural network with layers of equal size $n = 1000$ that we denote $X \to T_1 \to T_2$. The input variable components are i.i.d. Gaussian with mean 0 and variance 1. The weight matrix entries are also i.i.d. Gaussian with mean 0; their standard deviation is rescaled by a factor $1/\sqrt{n}$ and then multiplied by a coefficient $\sigma$ varying between 0.1 and 10, i.e. around the recommended value for training initialization. To compute entropies, we consider noisy versions of the latent variables where an additive white Gaussian noise of very small variance ($\sigma^2_{\mathrm{noise}} = 10^{-5}$) is added right before the activation function, $T_1 = f(W_1 X + \epsilon_1)$ and $T_2 = f(W_2 f(W_1 X) + \epsilon_2)$ with $\epsilon_{1,2} \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}} I_n)$, which is also done in the remaining experiments to guarantee that the mutual informations remain finite. The non-parametric estimators [45, 49] were evaluated using 1000 samples, as the cost of computing pairwise distances is significant in such high dimension, and we checked that the entropy estimate is stable over independent draws of a sample of such a size (error bars smaller than marker size). In figure 2, we compare the different estimates of $H(T_1)$ and $H(T_2)$ for different activation functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise linear approximation of the tanh, $\mathrm{hardtanh}(x) = -1$ for $x < -1$, $x$ for $-1 < x < 1$, and $1$ for $x > 1$, for which the integrals in the replica formula can be evaluated faster than for the tanh.
In the linear and hardtanh cases, the non-parametric methods follow the tendency of the replica estimate as $\sigma$ is varied, but appear to systematically overestimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also a multivariate Gaussian and therefore entropies can be directly computed in closed form ('exact' in the plot legend). When using the Kolchinsky estimate in the linear case we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of the $T_\ell$ and augmenting the resulting variance with $\sigma^2_{\mathrm{noise}}$, as done in [45] ('Kolchinsky et al parametric' in the plot legend). In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights ('linear approx' in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a similar color to 'exact'). Lastly, in the case of the ReLU-ReLU network, we note that the non-parametric methods predict an entropy increasing like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and accurately features a slope equal to half of the linear network entropy ('1/2 linear approx' in the plot legend). While non-parametric estimators are invaluable tools able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.
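For reference, the 'exact' baseline for the linear network follows from the closed-form entropy of a multivariate Gaussian, $H = \tfrac{1}{2}\ln\det(2\pi e\,\Sigma)$. A minimal sketch, with illustrative sizes, for the noisy linear two-layer network of this experiment:

```python
# Closed-form entropies for the linear case: with X ~ N(0, I), the noisy copies
# T1 = W1 X + eps1 and T2 = W2 (W1 X) + eps2 are Gaussian.
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of N(0, cov)."""
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * cov)
    return 0.5 * logdet

n, sigma, noise_var = 1000, 1.0, 1e-5
rng = np.random.default_rng(0)
W1 = sigma * rng.standard_normal((n, n)) / np.sqrt(n)
W2 = sigma * rng.standard_normal((n, n)) / np.sqrt(n)

cov_T1 = W1 @ W1.T + noise_var * np.eye(n)                 # T1 = W1 X + eps1
cov_T2 = (W2 @ W1) @ (W2 @ W1).T + noise_var * np.eye(n)   # T2 = W2 W1 X + eps2

print(gaussian_entropy(cov_T1) / n, gaussian_entropy(cov_T2) / n)  # per-neuron entropies
```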
Beyond informing about estimator accuracy, this experiment also unveils a simple but possibly important distinction between activation functions. For the hardtanh activation, as the random weights' magnitude increases, the entropies decrease after reaching a maximum, whereas they only increase for the unbounded activation functions we consider, even for the single-side saturating ReLU. This loss of information for bounded activations was also observed by [5], where entropies were computed by discretizing the output of each single neuron into bins of equal size. In that setting, as the tanh activation starts to saturate for large inputs, the extreme bins (at $-1$ and $1$) concentrate more and more probability mass, which explains the information loss. Here we confirm that the phenomenon is also observed when computing the entropy of the hardtanh (without binning and with small noise injected before the non-linearity). We check via the replica formula that the same phenomenology arises for the mutual informations $I(X; T_\ell)$ (see section 3.1 of the supplementary material).
3.2. Learning experiments with linear networks
In the following, and in section 3.3 of the supplementary material, we discuss training experiments on different instances of the deep learning models defined in section 2. We seek to study the simplest possible training strategies achieving good generalization. Hence for all experiments we use plain stochastic gradient descent (SGD) with constant learning rates, without momentum and without any explicit form of regularization. The sizes of the training and testing sets are taken equal and typically scale as a few hundred times the size of the input layer. Unless otherwise stated, plots correspond to single runs, yet we checked over a few repetitions that independent runs lead to identical qualitative behaviors. The values of the mutual informations $I(X; T_\ell)$ are computed by considering noisy versions of the latent variables where an additive white Gaussian noise of very small variance ($\sigma^2_{\mathrm{noise}} = 10^{-5}$) is added right before the activation function, as in the previous experiment. This noise is neither present at training time, where it could act as a regularizer, nor at testing time. Given that the noise is only assumed at the last layer, the second-to-last layer is a deterministic mapping of the input variable; hence the replica formula yielding mutual informations between adjacent layers gives us directly $I(T_\ell; T_{\ell-1}) = H(T_\ell) - H(T_\ell \mid T_{\ell-1}) = H(T_\ell) - H(T_\ell \mid X) = I(T_\ell; X)$. We provide a second Python package [50] to implement learning experiments on synthetic datasets in Keras, using USV-layers and interfacing the first Python package [12] for replica computations.

Figure 2. Entropy of latent variables in stochastic networks $X \to T_1 \to T_2$, with equally sized layers $n = 1000$, inputs drawn from $\mathcal{N}(0, I_n)$, weights from $\mathcal{N}(0, \sigma^2 I_n / n)$, as a function of the weight scaling parameter $\sigma$. An additive white Gaussian noise $\mathcal{N}(0, 10^{-5} I_n)$ is added inside the non-linearity. Left column: linear network. Center column: hardtanh-hardtanh network. Right column: ReLU-ReLU network.
To start with, we consider the training of a linear network in the teacher-student scenario. The teacher must also be linear to be learnable: we consider a simple single-layer network with additive white Gaussian noise, $Y = \tilde W_{\mathrm{teach}} X + \epsilon$, with input $x \sim \mathcal{N}(0, I_n)$ of size $n$, teacher matrix $\tilde W_{\mathrm{teach}}$ with entries i.i.d. normally distributed as $\mathcal{N}(0, 1/n)$, noise $\epsilon \sim \mathcal{N}(0, 0.01\, I_{n_Y})$, and output of size $n_Y = 4$. We train a student network of three USV-layers plus one fully connected unconstrained layer, $X \to T_1 \to T_2 \to T_3 \to \hat Y$, on the regression task, using plain SGD for the MSE loss $(\hat Y - Y)^2$. We recall that in the USV-layers (7) only the diagonal matrix is updated during learning. On the left panel of figure 3, we report the learning curve and the mutual informations between the hidden layers and the input in the case where all layers but the output have size $n = 1500$. Again this linear setting is analytically tractable and does not require the replica formula; a similar situation was studied in [5]. In agreement with their observations, we find that the mutual informations $I(X; T_\ell)$ keep on increasing throughout the learning, without compromising the generalization ability of the student. Now, we also use this linear setting to demonstrate (i) that the replica formula remains correct throughout the learning of the USV-layers and (ii) that the replica method gets closer and closer to the exact result in the limit of large networks, as theoretically predicted by (2). To this aim, we repeat the experiment for $n$ varying between 100 and 1500, and report the maximum and the mean value of the squared error on the estimation of $I(X; T_\ell)$ over all epochs of 5 independent training runs. We find that even if errors tend to increase with the number of layers, they remain objectively very small and decrease drastically as the size of the layers increases.
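A schematic Keras version of this training run is sketched below. It reuses the illustrative USVLayer class from section 2 and the teacher-student data generator sketched there; the learning rate, batch size and number of epochs are arbitrary choices, and the size is kept smaller than $n = 1500$ to keep the example light.

```python
# Schematic training of the linear USV student X -> T1 -> T2 -> T3 -> Y_hat:
# three USV-layers plus one unconstrained linear readout, plain SGD, MSE loss.
# Reuses USVLayer and make_teacher_student_data sketched earlier in this text.
import tensorflow as tf

n, n_y = 500, 4
X_train, Y_train, W_teach = make_teacher_student_data(100 * n, n_in=n)
X_test, Y_test, _ = make_teacher_student_data(100 * n, n_in=n, W_teach=W_teach)

student = tf.keras.Sequential([
    tf.keras.Input(shape=(n,)),
    USVLayer(n),                                 # T1 (linear USV-layer)
    USVLayer(n),                                 # T2
    USVLayer(n),                                 # T3
    tf.keras.layers.Dense(n_y, use_bias=False),  # unconstrained readout Y_hat
])
student.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
student.fit(X_train, Y_train, validation_data=(X_test, Y_test),
            batch_size=128, epochs=50)
```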
3.3. Learning experiments with deep non-linear networks
Finally, we apply the replica formula to estimate mutual informations during the training of non-linear networks on correlated input data.

We consider a simple single-layer generative model $X = \tilde W_{\mathrm{gen}} Y + \epsilon$ with normally distributed code $Y \sim \mathcal{N}(0, I_{n_Y})$ of size $n_Y = 100$, data of size $n_X = 500$ generated with a matrix $\tilde W_{\mathrm{gen}}$ with entries i.i.d. normally distributed as $\mathcal{N}(0, 1/n_Y)$, and noise $\epsilon \sim \mathcal{N}(0, 0.01\, I_{n_X})$. We then train a recognition model to solve the binary classification problem of recovering the label $y = \mathrm{sign}(Y_1)$, the sign of the first neuron in $Y$, using plain SGD, but this time to minimize the cross-entropy loss. Note that the rest of the initial code $(Y_2, \dots, Y_{n_Y})$ acts as noise/nuisance with respect to the learning task. We compare two 5-layer recognition models with 4 USV-layers plus one unconstrained layer, of sizes 500-1000-500-250-100-2, and activations either linear-ReLU-linear-ReLU-softmax (top row of figure 4) or linear-hardtanh-linear-hardtanh-softmax (bottom row). Because USV-layers only feature $O(n)$ parameters instead of $O(n^2)$, we observe that they generally require more iterations to train. In the case of the ReLU network, adding interleaved linear layers was key to successful training with two non-linearities, which explains the somewhat unusual architecture proposed. For the recognition model using hardtanh, this was actually not an issue (see the supplementary material for an experiment using only hardtanh activations); however, we consider a similar architecture for a fair comparison. We discuss the learning ability of USV-layers further in the supplementary material.
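For concreteness, the synthetic classification dataset of this section can be generated as follows (a minimal sketch; the number of samples is an arbitrary choice).

```python
# Correlated inputs from a linear generative model, binary label y = sign(Y_1).
import numpy as np

rng = np.random.default_rng(0)
n_y, n_x, n_samples = 100, 500, 50_000

Y = rng.standard_normal((n_samples, n_y))                      # code Y ~ N(0, I)
W_gen = rng.standard_normal((n_x, n_y)) / np.sqrt(n_y)         # generative weights
X = Y @ W_gen.T + 0.1 * rng.standard_normal((n_samples, n_x))  # noise variance 0.01
labels = (Y[:, 0] > 0).astype(int)                             # y = sign(Y_1) as 0/1
```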
Figure 3. Training of a 4-layer linear student of varying size on a regression task generated by a linear teacher of output size $n_Y = 4$. Upper-left: MSE loss on the training and testing sets during training by plain SGD for layers of size $n = 1500$. Best training loss is 0.004735, best testing loss is 0.004789. Lower-left: corresponding evolution of the mutual information between hidden layers and input. Center-left, center-right, right: maximum and mean of the squared error of the replica estimation of the mutual information as a function of layer size $n$, over the course of five independent trainings for each value of $n$, for the first, second and third hidden layer.

Figure 4. Training of two recognition models on a binary classification task with correlated input data and either ReLU (top) or hardtanh (bottom) non-linearities. Left: training and generalization cross-entropy loss (left axis) and accuracies (right axis) during learning. Best training-testing accuracies are 0.995-0.991 for the ReLU version (top row) and 0.998-0.996 for the hardtanh version (bottom row). Remaining columns: mutual information between the input and successive hidden layers. Insets zoom on the first epochs.

This experiment is reminiscent of the setting of [3], yet is now tractable for networks of larger sizes. For both types of non-linearities, we observe that the mutual information between the input and all hidden layers decreases during the learning, except at the very beginning of training where we can sometimes observe a short phase of increase (see zoom in insets). For the hardtanh layers this phase is longer and the initial increase of noticeable amplitude.
In this particular experiment, the claim of [3] that compression can occur during training even with non-double-saturated activations seems corroborated (a phenomenon that was not observed by [5]). Yet we do not observe that the compression is more pronounced in deeper layers, and its link to generalization remains elusive. For instance, we do not see a delay in the generalization w.r.t. training accuracy/loss in the recognition model with hardtanh, despite an initial phase without compression in two layers.
Furthermore, we find that changing the weight initialization can drastically change the behavior of mutual informations during training while resulting in identical final training and testing performances. In an additional experiment, we consider a setting closely related to the classification on correlated data presented above. In figure 5 we compare three identical 5-layer recognition models with sizes 500-1000-500-250-100-2 and activations hardtanh-hardtanh-hardtanh-hardtanh-softmax, for the same generative model and binary classification rule as in the previous experiment. For the model presented in the top row, initial weights were sampled according to $W_{\ell, ij} \sim \mathcal{N}(0, 4/n_{\ell-1})$; for the model of the middle row $\mathcal{N}(0, 1/n_{\ell-1})$ was used instead; and finally $\mathcal{N}(0, 0.25/n_{\ell-1})$ for the bottom row. The first column shows that training is delayed for the weights initialized at smaller values, but eventually catches up and reaches accuracies above 0.97 both in training and testing. Meanwhile, the mutual informations have different initial values for the different weight initializations and follow very different paths: they either decrease during the entire learning, or on the contrary only increase, or actually follow a hybrid path. We further note that it is to some extent surprising that the mutual information increases at all in the first row, if we expect the hardtanh saturation to instead induce compression. Figure 4 of the supplementary material presents a second run of the same experiment with a different random seed; the findings are identical.

Figure 5. Learning and hidden-layer mutual information curves for a classification problem with correlated input data, using 4 USV hardtanh layers and 1 unconstrained softmax layer, from three different initializations. Top: initial weights at layer $\ell$ of variance $4/n_{\ell-1}$, best training accuracy 0.999, best test accuracy 0.994. Middle: initial weights at layer $\ell$ of variance $1/n_{\ell-1}$, best train accuracy 0.994, best test accuracy 0.9937. Bottom: initial weights at layer $\ell$ of variance $0.25/n_{\ell-1}$, best train accuracy 0.975, best test accuracy 0.974. The overall direction of evolution of the mutual information can be flipped by a change in weight initialization without drastically changing the final performance on the classification task.

Further learning experiments, including a second run of the last two experiments, are presented in the supplementary material.
4. Conclusion and perspectives
We have presented a class of deep learning models together with a tractable method to compute entropies and mutual informations between layers. This, we believe, offers a promising framework for further investigations, and to this aim we provide Python packages that facilitate both the computation of mutual informations and the training, for an arbitrary implementation of the model. In the future, allowing for biases by extending the proposed formula would improve the fitting power of the considered neural network models.

We observe in our high-dimensional experiments that compression can happen during learning, even when using ReLU activations. While we did not observe a clear link between generalization and compression in our setting, there are many directions to be further explored within the models presented in section 2. Studying the entropic effect of regularizers is a natural step towards formulating an entropic interpretation of generalization. Furthermore, while our experiments focused on supervised learning, the replica formula derived for multi-layer models is general and can be applied in unsupervised contexts, for instance in the theory of VAEs. On the rigorous side, the broader perspective remains proving the replica formula in the general case of multi-layer models, and further confirming that the replica formula stays true after the learning of the USV-layers. Another question worthy of future investigation is whether the replica method can be used to describe not only entropies and mutual informations for learned USV-layers, but also the optimal learning of the weights itself.
Acknowledgments
The authors would like to thank Léon Bottou, Antoine Maillard, Marc Mézard, Léo Miolane, and Galen Reeves for insightful discussions. This work has been supported by the ERC under the European Union's FP7 Grant Agreement 307087-SPARCS and the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe, as well as by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL. Additional funding is acknowledged by MG from the Chaire de recherche sur les modèles et sciences des données, Fondation CFM pour la Recherche-ENS; by AM from Labex DigiCosme; and by CL from the Swiss National Science Foundation under Grant 200021E-175541. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
[1] Tishby N, Pereira F C and Bialek W 1999 The information bottleneck method 37th Annual Allerton Conf. on Communication, Control, and Computing
[2] Tishby N and Zaslavsky N 2015 Deep learning and the information bottleneck principle IEEE Information Theory Workshop pp 1–5
[3] Shwartz-Ziv R and Tishby N 2017 Opening the black box of deep neural networks via information (arXiv:1703.00810)
[4] Chechik G, Globerson A, Tishby N and Weiss Y 2005 Information bottleneck for Gaussian variables J. Mach. Learn. Res. 6 165–88
[5] Saxe A M, Bansal Y, Dapello J, Advani M, Kolchinsky A, Tracey B D and Cox D D 2018 On the information bottleneck theory of deep learning Int. Conf. on Learning Representations
[6] Kabashima Y 2008 Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels J. Phys.: Conf. Ser. 95 012001
[7] Manoel A, Krzakala F, Mézard M and Zdeborová L 2017 Multi-layer generalized linear estimation IEEE Int. Symp. on Information Theory pp 2098–102
[8] Fletcher A K, Rangan S and Schniter P 2018 Inference in deep networks in high dimensions IEEE Int. Symp. on Information Theory pp 1884–8
[9] Reeves G 2017 Additivity of information in multilayer networks via additive Gaussian noise transforms 55th Annual Allerton Conf. on Communication, Control, and Computing
[10] Mézard M, Parisi G and Virasoro M 1987 Spin Glass Theory and Beyond (Singapore: World Scientific)
[11] Mézard M and Montanari A 2009 Information, Physics, and Computation (Oxford: Oxford University Press)
[12] 2018 dnner: deep neural networks entropy with replicas, Python library (https://github.com/sphinxteam/dnner)
[13] Tulino A M, Caire G, Verdú S and Shamai S 2013 Support recovery with sparsely sampled free random matrices IEEE Trans. Inf. Theory 59 4243–71
[14] Donoho D and Montanari A 2016 High dimensional robust M-estimation: asymptotic variance via approximate message passing Probab. Theory Relat. Fields 166 935–69
[15] Seung H S, Sompolinsky H and Tishby N 1992 Statistical mechanics of learning from examples Phys. Rev. A 45 6056
[16] Engel A and Van den Broeck C 2001 Statistical Mechanics of Learning (Cambridge: Cambridge University Press)
[17] Opper M and Saad D 2001 Advanced Mean Field Methods: Theory and Practice (Cambridge, MA: MIT Press)
[18] Barbier J, Krzakala F, Macris N, Miolane L and Zdeborová L 2019 Optimal errors and phase transitions in high-dimensional generalized linear models Proc. Natl Acad. Sci. 116 5451–60
[19] Barbier J, Macris N, Maillard A and Krzakala F 2018 The mutual information in random linear estimation beyond i.i.d. matrices IEEE Int. Symp. on Information Theory pp 625–32
[20] Donoho D, Maleki A and Montanari A 2009 Message-passing algorithms for compressed sensing Proc. Natl Acad. Sci. 106 18914–9
[21] Zdeborová L and Krzakala F 2016 Statistical physics of inference: thresholds and algorithms Adv. Phys. 65 453–552
[22] Rangan S 2011 Generalized approximate message passing for estimation with random linear mixing IEEE Int. Symp. on Information Theory pp 2168–72
[23] Rangan S, Schniter P and Fletcher A K 2017 Vector approximate message passing IEEE Int. Symp. on Information Theory pp 1588–92
[24] Barbier J and Macris N 2019 The adaptive interpolation method for proving replica formulas. Applications to the Curie–Weiss and Wigner spike models J. Phys. A 52 294002
[25] Barbier J and Macris N 2019 The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference Probab. Theory Relat. Fields 174 1133–85
[26] Barbier J, Macris N and Miolane L 2017 The layered structure of tensor estimation and its mutual information 55th Annual Allerton Conf. on Communication, Control, and Computing pp 1056–63
[27] Moczulski M, Denil M, Appleyard J and de Freitas N 2016 ACDC: a structured efficient linear layer Int. Conf. on Learning Representations
[28] Yang Z, Moczulski M, Denil M, de Freitas N, Smola A, Song L and Wang Z 2015 Deep fried convnets IEEE Int. Conf. on Computer Vision pp 1476–83
[29] Amit D J, Gutfreund H and Sompolinsky H 1985 Storing infinite numbers of patterns in a spin-glass model of neural networks Phys. Rev. Lett. 55 1530
[30] Gardner E and Derrida B 1989 Three unfinished works on the optimal storage capacity of networks J. Phys. A 22 1983
[31] Mézard M 1989 The space of interactions in neural networks: Gardner's computation with the cavity method J. Phys. A 22 2181
[32] Louart C and Couillet R 2017 Harnessing neural networks: a random matrix approach IEEE Int. Conf. on Acoustics, Speech and Signal Processing pp 2282–6
[33] Pennington J and Worah P 2017 Nonlinear random matrix theory for deep learning Advances in Neural Information Processing Systems
[34] Raghu M, Poole B, Kleinberg J, Ganguli S and Sohl-Dickstein J 2017 On the expressive power of deep neural networks Int. Conf. on Machine Learning
[35] Saxe A, McClelland J and Ganguli S 2014 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Int. Conf. on Learning Representations
[36] Schoenholz S S, Gilmer J, Ganguli S and Sohl-Dickstein J 2017 Deep information propagation Int. Conf. on Learning Representations
[37] Advani M and Saxe A 2017 High-dimensional dynamics of generalization error in neural networks (arXiv:1710.03667)
[38] Baldassi C, Braunstein A, Brunel N and Zecchina R 2007 Efficient supervised learning in networks with binary synapses Proc. Natl Acad. Sci. 104 11079–84
[39] Dauphin Y, Pascanu R, Gulcehre C, Cho K, Ganguli S and Bengio Y 2014 Identifying and attacking the saddle point problem in high-dimensional non-convex optimization Advances in Neural Information Processing Systems
[40] Giryes R, Sapiro G and Bronstein A M 2016 Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Trans. Signal Process. 64 3444–57
[41] Chalk M, Marre O and Tkacik G 2016 Relevant sparse codes with variational information bottleneck Advances in Neural Information Processing Systems
[42] Achille A and Soatto S 2018 Information dropout: learning optimal representations through noisy computation IEEE Trans. Pattern Anal. Mach. Intell. pp 2897–905
[43] Alemi A, Fischer I, Dillon J and Murphy K 2017 Deep variational information bottleneck Int. Conf. on Learning Representations
[44] Achille A and Soatto S 2017 Emergence of invariance and disentangling in deep representations ICML 2017 Workshop on Principled Approaches to Deep Learning
[45] Kolchinsky A, Tracey B D and Wolpert D H 2017 Nonlinear information bottleneck (arXiv:1705.02436)
[46] Belghazi M I, Baratin A, Rajeswar S, Ozair S, Bengio Y, Courville A and Hjelm R D 2018 MINE: mutual information neural estimation Int. Conf. on Machine Learning
[47] Zhao S, Song J and Ermon S 2017 InfoVAE: information maximizing variational autoencoders (arXiv:1706.02262)
[48] Kolchinsky A and Tracey B D 2017 Estimating mixture entropy with pairwise distances Entropy 19 361
[49] Kraskov A, Stögbauer H and Grassberger P 2004 Estimating mutual information Phys. Rev. E 69 066138
[50] 2018 lsd: Learning with Synthetic Data, Python library (https://github.com/marylou-gabrie/learning-synthetic-data)
... Combining Eqs. (7), (24), (26) and (36)(37)(38)(39), we obtain the limit on the rate of KL divergence ...
... [8]). Moreover, the derived bounds on the rates of mutual information might be useful estimates for information flow in real neurons [36], artificial deep neural networks [37], molecular circuits [38], or other systems where exact values are difficult to obtain [39] and require heavy numerical calculations [43,44]. Finally, it is worth to mention that there are also possible other types of limits on the rates of statistical divergences, but the goal here was to have bounds that can be related clearly to the known physical observables. ...
Preprint
Full-text available
Statistical divergences are important tools in data analysis, information theory, and statistical physics, and there exist well known inequalities on their bounds. However, in many circumstances involving temporal evolution, one needs limitations on the rates of such quantities, instead. Here, several general upper bounds on the rates of some f-divergences are derived, valid for any type of stochastic dynamics (both Markovian and non-Markovian), in terms of information-like and/or thermodynamic observables. As special cases, the analytical bounds on the rate of mutual information are obtained, which may provide explicit and simple alternatives to the existing numerical algorithms for its estimation. The major role in all those limitations is played by temporal Fisher information, and some of them contain entropy production, suggesting a link with stochastic thermodynamics. Overall, the derived bounds can be applied to any complex network of interacting elements, where predictability of network dynamics is of prime concern.
... Conventionally, a number of additional computational procedures are added to a main learning procedure, which are in principle contradictory to each other, and contradictions in those procedures are forced to be resolved simultaneously. Let us take an example of the information-theoretic mutual information method [29][30][31][32][33][34][35][36], because those information-theoretic measures, such as entropy, are very close to the neutralization ones used in this paper. Mutual information maximization can be decomposed into entropy maximization, accompanied by conditional entropy minimization. ...
Article
Full-text available
The present paper aims to propose a new method for neutralizing contradictions in neural networks. Neural networks exhibit numerous contradictions in the form of contrasts, differences, and errors, making it extremely challenging to find a compromise between them. In this context, neutralization is introduced not to resolve these contradictions, but to weaken them by transforming them into more manageable and concrete forms. In this paper, contradictions are neutralized or weakened through four neutralization methods: comprehensive, nullified, compressive, and collective. Comprehensive neutralization involves increasing the neutrality of all components in a neural network. Nullified neutralization is employed to weaken contradictions among different computational and optimization procedures. Compressive neutralization aims to simplify multi-layered neural networks while preserving the original internal information as much as possible. Collective neutralization is achieved by considering as many final networks as possible under different conditions, inputs, learning steps, and so on. The proposed method was applied to two data sets, one of which consisted of irregular forms resulting from natural language processing. The experimental results demonstrate that comprehensive neutralization could enhance the neutrality of all components and represent features across a broader range of components, thereby improving generalization. Nullified neutralization enabled a compromise between neutrality maximization and error minimization. Through compressive and collective neutralization of a large number of compressed weights, it became possible to interpret compressed and collective weights. In particular, inputs that were considered relatively unimportant by conventional methods emerged as highly significant. Finally, these results were compared with those obtained in the field of the human-centered approach to provide a clearer understanding of the significance of contradiction resolution, applied to neural networks.
... When working with raw ECG data, one-dimensional convolutional neural networks (1D CNNs) apply kernels along the temporal dimension (Nurmaini et al., 2020b), whereas two-dimensional convolutional neural networks (2D CNNs) deal with ECG data transformed into images and other two-dimensional formats . Examples of such transformations include distance distribution matrices that are derived from entropy computations (Gabrié et al., 2018) as well as gray-level co-occurrence matrices (De Siqueira et al., 2013) and beat-to-beat correlations (Wen et al., 2019). When applied to ECG signal analysis, CNNs automatically learn and extract relevant features from raw ECG signals, improving the accuracy of arrhythmia detection. ...
Article
Full-text available
Cardiovascular diseases are a leading cause of mortality globally. Electrocardiography (ECG) still represents the benchmark approach for identifying cardiac irregularities. Automatic detection of abnormalities from the ECG can aid in the early detection, diagnosis, and prevention of cardiovascular diseases. Deep Learning (DL) architectures have been successfully employed for arrhythmia detection and classification and offered superior performance to traditional shallow Machine Learning (ML) approaches. This survey categorizes and compares the DL architectures used in ECG arrhythmia detection from 2017–2023 that have exhibited superior performance. Different DL models such as Convolutional Neural Networks (CNNs), Multilayer Perceptrons (MLPs), Transformers, and Recurrent Neural Networks (RNNs) are reviewed, and a summary of their effectiveness is provided. This survey provides a comprehensive roadmap to expedite the acclimation process for emerging researchers willing to develop efficient algorithms for detecting ECG anomalies using DL models. Our tailored guidelines bridge the knowledge gap allowing newcomers to align smoothly with the prevailing research trends in ECG arrhythmia detection. We shed light on potential areas for future research and refinement in model development and optimization, intending to stimulate advancement in ECG arrhythmia detection and classification.
... We formally write down the moment generating function (MGF) of the predictor. We then use the well-known replica method in statistical physics [32,33], which has also been shown to be a powerful tool for deriving analytical results for learning in NNs [34][35][36][37][38]. We analytically calculate the MGF after averaging over the posterior distribution of the network weights in the infinite width limit, which enables us to compute statistics of the predictor. ...
Preprint
Full-text available
Artificial neural networks have revolutionized machine learning in recent years, but a complete theoretical framework for their learning process is still lacking. Substantial progress has been made for infinitely wide networks. In this regime, two disparate theoretical frameworks have been used, in which the network's output is described using kernels: one framework is based on the Neural Tangent Kernel (NTK), which assumes linearized gradient descent dynamics, while the Neural Network Gaussian Process (NNGP) kernel assumes a Bayesian framework. However, the relation between these two frameworks has remained elusive. This work unifies these two distinct theories using a Markov proximal learning model for learning dynamics in an ensemble of randomly initialized infinitely wide deep networks. We derive an exact analytical expression for the network input-output function during and after learning, and introduce a new time-dependent Neural Dynamical Kernel (NDK) from which both NTK and NNGP kernels can be derived. We identify two learning phases characterized by different time scales: gradient-driven and diffusive learning. In the initial gradient-driven learning phase, the dynamics is dominated by deterministic gradient descent, and is described by the NTK theory. This phase is followed by the diffusive learning stage, during which the network parameters sample the solution space, ultimately approaching the equilibrium distribution corresponding to NNGP. Combined with numerical evaluations on synthetic and benchmark datasets, we provide novel insights into the different roles of initialization, regularization, and network depth, as well as phenomena such as early stopping and representational drift. This work closes the gap between the NTK and NNGP theories, providing a comprehensive framework for understanding the learning process of deep neural networks in the infinite width limit.
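For reference, the two limiting kernels contrasted in this abstract can be written in their standard form (the paper's time-dependent Neural Dynamical Kernel is not reproduced here). For a network output f(x; θ) with weights drawn from the initialization prior, and with the NTK understood as the infinite-width limit in which the empirical tangent kernel concentrates:

```latex
% Standard kernel definitions (not the paper's NDK), for a network output
% f(x;\theta) with weights drawn from the initialization prior:
\begin{align}
  K_{\mathrm{NNGP}}(x, x')     &= \mathbb{E}_{\theta}\big[\, f(x;\theta)\, f(x';\theta) \,\big], \\
  \Theta_{\mathrm{NTK}}(x, x') &= \mathbb{E}_{\theta}\big[\, \nabla_{\theta} f(x;\theta) \cdot \nabla_{\theta} f(x';\theta) \,\big].
\end{align}
```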
... We resort to mutual information (MI) [33,36] to design the learning objective in our paper, since it can measure the dependence between variables [10,47]. ...
Preprint
Full-text available
Self-supervised learning with masked autoencoders has recently gained popularity for its ability to produce effective image or textual representations, which can be applied to various downstream tasks without retraining. However, we observe that the current masked autoencoder models lack good generalization ability on graph data. To tackle this issue, we propose a novel graph masked autoencoder framework called GiGaMAE. Different from existing masked autoencoders that learn node representations by explicitly reconstructing the original graph components (e.g., features or edges), in this paper, we propose to collaboratively reconstruct informative and integrated latent embeddings. By considering embeddings encompassing graph topology and attribute information as reconstruction targets, our model could capture more generalized and comprehensive knowledge. Furthermore, we introduce a mutual information based reconstruction loss that enables the effective reconstruction of multiple targets. This learning objective allows us to differentiate between the exclusive knowledge learned from a single target and common knowledge shared by multiple targets. We evaluate our method on three downstream tasks with seven datasets as benchmarks. Extensive experiments demonstrate the superiority of GiGaMAE against state-of-the-art baselines. We hope our results will shed light on the design of foundation models on graph-structured data. Our code is available at: https://github.com/sycny/GiGaMAE.
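As a rough illustration of a mutual-information-based reconstruction objective of the kind described above, the sketch below implements a generic InfoNCE lower bound between latent embeddings and reconstruction targets. This is a common MI surrogate, not the exact GiGaMAE loss; the tensor shapes and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(z: torch.Tensor, t: torch.Tensor, temperature: float = 0.1):
    """Generic InfoNCE lower bound on I(z; t) for a batch of paired
    embeddings z and reconstruction targets t, both of shape (batch, dim).
    A common MI surrogate, not the GiGaMAE objective itself."""
    z = F.normalize(z, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = z @ t.T / temperature            # similarity of every (z_i, t_j) pair
    labels = torch.arange(z.size(0))          # positives sit on the diagonal
    # InfoNCE bound: log(batch size) minus the diagonal cross-entropy.
    return torch.log(torch.tensor(float(z.size(0)))) - F.cross_entropy(logits, labels)

z = torch.randn(32, 64)   # latent embeddings
t = torch.randn(32, 64)   # reconstruction targets
print(infonce_mi_lower_bound(z, t).item())
```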
... The statistical mechanics of neural networks has a long history dating back to the 1980s [7][8][9]. Studies of single perceptrons [8,9] and shallow networks [4,10] have provided many useful insights, and some progress has also been made on deeper networks [11][12][13][14]. However, what goes on in the hidden layers remains largely unknown. ...
Article
Full-text available
Despite spectacular successes, deep neural networks (DNNs) with a huge number of adjustable parameters remain largely black boxes. To shed light on the hidden layers of DNNs, we study supervised learning by a DNN of width N and depth L consisting of NL perceptrons with c inputs by a statistical mechanics approach called the teacher-student setting. We consider an ensemble of student machines that exactly reproduce M sets of N-dimensional input/output relations provided by a teacher machine. We show that the statistical mechanics problem becomes exactly solvable in a high-dimensional limit which we call a “dense limit”: N≫c≫1 and M≫1 with fixed α=M/c, using the replica method developed by Yoshino [SciPost Phys. Core 2, 005 (2020)]. In conjunction with the theoretical study, we also study the model numerically, performing simple greedy Monte Carlo simulations. Simulations reveal that learning by the DNN is quite heterogeneous in the network space: configurations of the teacher and the student machines are more correlated within the layers closer to the input/output boundaries, while the central region remains much less correlated due to the overparametrization, in qualitative agreement with the theoretical prediction. We evaluate the generalization error of the DNN with various depths L both theoretically and numerically. Remarkably, both the theory and the simulation suggest that the generalization ability of the student machines, which are only weakly correlated with the teacher in the center, does not vanish even in the deep limit L≫1, where the system becomes heavily overparametrized. We also consider the impact of the effective dimension D(≤N) of the data by incorporating the hidden manifold model [Goldt, Mézard, Krzakala, and Zdeborová, Phys. Rev. X 10, 041044 (2020)] into our model. Replica theory implies that the loop corrections to the dense limit, which reflect correlations between different nodes in the network, are enhanced by either decreasing the width N or decreasing the effective dimension D of the data. The simulations suggest that both lead to significant improvements in generalization ability.
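A stripped-down illustration of the "simple greedy Monte Carlo" idea mentioned in this abstract, applied to a single teacher-student perceptron with binary weights rather than the deep, dense-limit model studied in the paper; all sizes and the stopping rule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 51, 150                      # input dimension (odd, so sign() never hits 0), examples

# Teacher and student with binary (+/-1) weights; a single toy perceptron
# stands in for the deep dense-limit model of the paper.
teacher = rng.choice([-1, 1], size=N)
student = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(M, N))
y = np.sign(X @ teacher)

def n_errors(w):
    return int(np.sum(np.sign(X @ w) != y))

# Greedy Monte Carlo: propose single-weight flips, accept only if the
# number of training errors does not increase.
errors = n_errors(student)
for step in range(20000):
    i = rng.integers(N)
    student[i] *= -1                # propose a flip
    new_errors = n_errors(student)
    if new_errors <= errors:
        errors = new_errors         # accept
    else:
        student[i] *= -1            # reject: undo the flip
    if errors == 0:
        break

overlap = student @ teacher / N
print(f"training errors: {errors}, teacher-student overlap: {overlap:.2f}")
```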
... The analysis of the GLM-SBM requires gluing the two graphical models together, using the GLM as the prior for the SBM and the SBM as a source of uncertainty on the outputs of the GLM. Such a gluing of two dense, exactly solvable graphical models was developed in Manoel et al. (2017), with rigorous justifications given in Gabrié et al. (2018); Aubin et al. (2019); Gerbelot and Berthier (2021). Our work is the first, as far as we are aware, in which a sparse graphical model (the SBM) is glued to a dense graphical model (the GLM). ...
Article
Full-text available
The stochastic block model (SBM) is widely studied as a benchmark for graph clustering aka community detection. In practice, graph data often come with node attributes that bear additional information about the communities. Previous works modeled such data by considering that the node attributes are generated from the node community memberships. In this work, motivated by a recent surge of works in signal processing using deep neural networks as priors, we propose to model the communities as being determined by the node attributes rather than the opposite. We define the corresponding model; we call it the neural-prior SBM. We propose an algorithm, stemming from statistical physics, based on a combination of belief propagation and approximate message passing. We analyze the performance of the algorithm as well as the Bayes-optimal performance. We identify detectability and exact recovery phase transitions, as well as an algorithmically hard region. The proposed model and algorithm can be used as a benchmark for both theory and algorithms. To illustrate this, we compare the optimal performances to the performance of simple graph neural networks.
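A minimal generative sketch of the neural-prior SBM as described in this abstract: node attributes determine the community labels through a simple GLM (here a sign readout), and the graph is then drawn from a two-group SBM conditioned on those labels. Parameter names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 50                      # nodes, attribute dimension
c_in, c_out = 8.0, 2.0              # average degree within / across communities

# Communities determined by the attributes (the "neural prior"): a sign
# readout of a random linear combination of the node attributes.
B = rng.standard_normal((n, p))     # node attributes
v = rng.standard_normal(p)          # GLM weights (assumed Gaussian)
labels = np.sign(B @ v)             # two communities, +/-1

# Graph drawn from a two-group SBM conditioned on those labels.
same = np.equal.outer(labels, labels)
probs = np.where(same, c_in / n, c_out / n)
upper = np.triu(rng.random((n, n)) < probs, k=1)
adjacency = upper | upper.T         # symmetric, no self-loops

print(adjacency.sum() // 2, "edges;", int((labels > 0).sum()), "nodes in community +1")
```

Inference in the paper runs in the opposite direction, combining belief propagation on the sparse graph with approximate message passing on the dense GLM; only the forward generative step is sketched here.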
Preprint
Full-text available
The quest to comprehend the nature of consciousness has spurred the development of many theories that seek to explain its underlying mechanisms and account for its neural correlates. In this paper, I compare my own conscious electromagnetic information field (cemi field) theory with integrated information theory and global workspace theory for their ability to 'carve nature at its joints' in the sense of predicting the entities, structures, states and dynamics that are conventionally recognized as being conscious or nonconscious. I demonstrate that the cemi field theory shares features with both integrated information theory and global workspace theory but consistently outperforms both by correctly predicting the entities, structures, states and dynamics that support consciousness. I argue that the simplest explanation for why the cemi field theory consistently outperforms rival theories of consciousness is that the brain's EM field is consciousness.
Article
Full-text available
Generalized linear models (GLMs) are used in high-dimensional machine learning, statistics, communications, and signal processing. In this paper we analyze GLMs when the data matrix is random, as relevant in problems such as compressed sensing, error-correcting codes, or benchmark models in neural networks. We evaluate the mutual information (or “free entropy”) from which we deduce the Bayes-optimal estimation and generalization errors. Our analysis applies to the high-dimensional limit where both the number of samples and the dimension are large and their ratio is fixed. Nonrigorous predictions for the optimal errors existed for special cases of GLMs, e.g., for the perceptron, in the field of statistical physics based on the so-called replica method. Our present paper rigorously establishes those decades-old conjectures and brings forward their algorithmic interpretation in terms of performance of the generalized approximate message-passing algorithm. Furthermore, we tightly characterize, for many learning problems, regions of parameters for which this algorithm achieves the optimal performance and locate the associated sharp phase transitions separating learnable and nonlearnable regions. We believe that this random version of GLMs can serve as a challenging benchmark for multipurpose algorithms.
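For concreteness, here is a toy instance of the random-design GLM benchmark analyzed in the paper, in the regime where the number of samples and the dimension grow with a fixed ratio. The specific Gaussian prior and noiseless sign channel are illustrative choices; the paper's analysis covers general priors and channels.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000                            # signal dimension
alpha = 1.5                         # sample ratio m / n, kept fixed as n grows
m = int(alpha * n)

# Random-design GLM: i.i.d. Gaussian data matrix, Gaussian signal, and a
# (here noiseless sign, i.e. perceptron-like) output channel.
X = rng.standard_normal((m, n)) / np.sqrt(n)
w = rng.standard_normal(n)          # unknown signal
y = np.sign(X @ w)                  # observed outputs

print(X.shape, y.shape)
```

The quantity characterized in the paper is the mutual information (free entropy) between w and y in the limit n → ∞ at fixed α; the sketch only sets up such a benchmark instance.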
Conference Paper
Full-text available
The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case. Through a combination of analytical results and simulation, we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.
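Information-plane measurements of the kind discussed in this abstract typically rely on discretizing hidden activations. The sketch below shows that binning estimator in its simplest form: for a deterministic network with each training input occurring once, the estimated I(X;T) reduces to the entropy H(T) of the binned representation. The bin count and the toy tanh activations are arbitrary.

```python
import numpy as np

def binned_entropy(activations: np.ndarray, n_bins: int = 30) -> float:
    """Entropy (in nats) of binned hidden activations, the discretization
    trick commonly used in information-plane analyses; the bin count is a
    free choice that strongly affects the estimate."""
    lo, hi = activations.min(), activations.max()
    digitized = np.digitize(activations, np.linspace(lo, hi, n_bins))
    # Each row (one input's hidden representation) becomes one discrete symbol.
    _, counts = np.unique(digitized, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log(probs)))

# Toy tanh hidden layer: 1000 inputs, 10 units. For a deterministic network
# with distinct inputs, I(X;T) estimated this way equals H(T) of the bins.
T = np.tanh(np.random.default_rng(3).standard_normal((1000, 10)))
print(binned_entropy(T))
```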
Article
Full-text available
We argue that the estimation of the mutual information between high dimensional continuous random variables is achievable by gradient descent over neural networks. This paper presents a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size. MINE is back-propable, and we prove that it is strongly consistent. We illustrate a handful of applications in which MINE is successfully applied to enhance the properties of generative models in both unsupervised and supervised settings. We apply our framework to estimate the information bottleneck, and apply it in tasks related to supervised classification problems. Our results demonstrate substantial added flexibility and improvement in these settings.
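A simplified sketch of the Donsker-Varadhan objective at the core of MINE: a small statistics network is trained to maximize a neural lower bound on I(X;Z). It omits the bias-corrected gradient (exponential moving average) used in the paper, and the toy correlated data and network sizes are assumptions.

```python
import torch
import torch.nn as nn

class MINECritic(nn.Module):
    """Small statistics network T(x, z) for the Donsker-Varadhan bound."""
    def __init__(self, dim_x: int, dim_z: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_lower_bound(critic, x, z):
    # Joint samples (x_i, z_i) versus product-of-marginals samples
    # (x_i, z_pi(i)) obtained by shuffling z within the batch.
    joint = critic(x, z).mean()
    z_shuffled = z[torch.randperm(z.size(0))]
    marginal = torch.logsumexp(critic(x, z_shuffled), dim=0) - torch.log(
        torch.tensor(float(z.size(0)))
    )
    return joint - marginal          # I(X;Z) >= E_P[T] - log E_Q[exp(T)]

# Toy correlated pair: z = x + noise, so the true MI is positive.
x = torch.randn(512, 2)
z = x + 0.3 * torch.randn(512, 2)
critic = MINECritic(2, 2)
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = -dv_lower_bound(critic, x, z)   # maximize the bound
    loss.backward()
    opt.step()
print(f"estimated MI lower bound: {dv_lower_bound(critic, x, z).item():.3f} nats")
```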
Book
Learning is one of the things that humans do naturally, and it has always been a challenge for us to understand the process. Nowadays this challenge has another dimension as we try to build machines that are able to learn and to undertake tasks such as data mining, image processing and pattern recognition. We can formulate a simple framework, artificial neural networks, in which learning from examples may be described and understood. The contributions made to this subject over the last decade by researchers applying the techniques of statistical mechanics are the subject of this book. The authors provide a coherent account of various important concepts and techniques that are currently only found scattered in papers, supplement this with background material in mathematics and physics, and include many examples and exercises, making a book that can be used with courses, for self-teaching, or as a handy reference.