Article

Utilizing Information Bottleneck to Evaluate the Capability of Deep Neural Networks for Image Classification
Hao Cheng 1,2,3, Dongze Lian 3, Shenghua Gao 3 and Yanlin Geng 4,*

1 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China; chenghao@shanghaitech.edu.cn
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China; liandz@shanghaitech.edu.cn (D.L.); gaoshh@shanghaitech.edu.cn (S.G.)
4 State Key Laboratory of ISN, Xidian University, Xi'an 710071, China
* Correspondence: gengyanlin@gmail.com

This paper is an extended version of our paper published in the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.

Received: 10 February 2019; Accepted: 28 April 2019; Published: 1 May 2019


Abstract: Inspired by the pioneering work of the information bottleneck (IB) principle for Deep Neural Networks' (DNNs) analysis, we thoroughly study the relationship among the model accuracy, I(X;T), and I(T;Y), where I(X;T) and I(T;Y) are the mutual information of the DNN's output T with the input X and the label Y. Then, we design an information plane-based framework to evaluate the capability of DNNs (including CNNs) for image classification. Instead of each hidden layer's output, our framework focuses on the model output T. We successfully apply our framework to many application scenarios arising in deep learning and image classification problems, such as image classification with an unbalanced data distribution, model selection, and transfer learning. The experimental results verify the effectiveness of the information plane-based framework: our framework may facilitate a quick model selection and determine the number of samples needed for each class in the unbalanced classification problem. Furthermore, the framework explains the efficiency of transfer learning in the deep learning area.
Keywords: information bottleneck; mutual information; neural networks; image classification
1. Introduction
1.1. Deep Neural Networks
Deep neural networks (DNNs) are very powerful machine learning models that have revolutionized many research and application areas [1-6] in recent years. These include image recognition [1,4], speech recognition [2], and natural language processing [5]. A DNN is a type of representation learning that can automatically generate good representations from raw data for further processing.
With abundant training data and GPU acceleration [1], the performance of DNNs has greatly exceeded that of other traditional learning algorithms in image recognition tasks. Inspired by the work in [1], researchers have developed more efficient network structures: the work in [7] decreased the kernel size and found better representations of data. The work in [8] proposed a facial recognition tool, named FaceNet, whose embeddings can be used for face recognition and clustering. The work in [9] proposed ResNet by utilizing identity mappings between layers. ResNet can make the network deeper than 1000 layers without losing performance and has been widely used in many applications.
However, DNN has millions of parameters and numerous hyper-parameters. Before training
a DNN model, one needs to define all the hyper-parameters heuristically. These include the initial
learning rate, momentum, batch size, activation functions, number of layers, weight decay, etc. There is
no standard guideline on how to choose appropriate hyper-parameters. Furthermore, the same network structure or hyper-parameter setting may be suitable for one task (dataset), but fail on another. Thus, we still lack a deep understanding or an explanation of how networks work or behave.
Recently, there have been some works that try to understand networks by visualization. The work in [10] proposed a method to compute the image-resolution receptive field of neural activations in a feature map. Accurate calculation of the receptive field helps people understand the representation of the filter. The work in [11,12] understood DNNs by estimating the feature distribution of each class in the feature space of the pre-trained model. The work in [13] used the deconvnet operation for visualizing discriminant features from data. The work in [14] introduced a tool, named network dissection, that can be used to quantify the interpretability of hidden representations of kernels in convolutional networks by evaluating the alignment between individual hidden units and a set of semantic concepts. However, almost all of these visualization techniques understand DNNs by visualizing features or activation maps to gain insights into neural networks, which somehow lacks a theoretical basis. In this work, we use the tool of mutual information, a fundamental measure in information theory, to represent networks in the information plane for better understanding.
Remark 1. This is an extended version of our conference paper published in ECCV 2018 [15].
1.2. Mutual Information and Information Bottleneck
The reason we use mutual information is that it provides a measure to quantify the common part between two random variables, especially when the relationship between their samples is highly non-linear. Notice that both supervised and unsupervised deep learning models have input data and output data; the relationship or uncertainty between them can be measured by mutual information. For example, the work in [16] introduced a new learning rate schedule based on mutual information. The work in [17] utilized a mutual information method to score all the candidate features in the network and selected the most informative one as the input for the DNN. The work in [18] learned a better representation by maximizing the mutual information between the DNN's input and output.
In supervised deep learning, the label Y is also available. In addition to the relationship between the input X and output T, the correlation between the output T and label Y is also worth studying. We use I(X;T) and I(T;Y) to represent the corresponding mutual information. The study of I(X;T) and I(T;Y) inspires the information bottleneck (IB) method. IB was introduced by Naftali Tishby et al. [19] and serves as a method for compressing the source data (random variable) without losing the predictive capability for the target label (random variable). Since then, the modification and generalization of IB have attracted considerable interest. The work in [20] generalized the IB method to the continuous setting, which deals with variables such as Gaussians. The work in [21] proposed the deterministic IB (DIB), a different formulation that uses entropy instead of mutual information; it stated that DIB can better distinguish relevant features. The work in [22] deeply explored the relationship between IB and minimal sufficient statistics.
In recent years, IB has been applied to the deep learning research field. The work in [23] introduced a variational approximation technique to IB. The new method uses a neural network to parameterize the IB model, which leads to an efficient and stable training process. The work in [24] introduced a new IB-based framework, called parametric IB (PIB). The advantage of PIB is that it jointly optimizes the compressed information and the relevant information of the layers in the neural network, which leads to a better representation ability of DNNs. The work in [25] proposed a method called information dropout, which generalizes the original dropout with theoretical considerations. The new dropout method makes it easier for the network to fit the training data.
1.3. Visualizing DNN via Information Bottleneck
Different from the work in [23-25], Ravid and Naftali [26] took IB as a visualization tool to understand how neural networks work in general. Specifically, they estimated the mutual information I(X;T) and I(T;Y) for each hidden layer, where X is the input data, Y is the label, and T is the hidden layer output. Figure 1, from [15], draws I(X;T) and I(T;Y) along the training epochs in the two-dimensional plane (referred to as the information plane). In the figure, the green points (transition points) separate the mutual information paths into two stages:
- The first phase, also called the "fitting phase", takes only a little time in the whole training procedure. I(X;T) and I(T;Y) both increase in this stage. This can be explained as follows: the network needs to remember the information from the data (I(X;T)) in order to fit the label (I(T;Y)).
- The second phase, also called the "compression phase", takes most of the training time. I(T;Y) still increases while I(X;T) decreases. This can be explained as follows: the value of I(X;T) obtained in the first fitting phase contains too much irrelevant information (for classification). The network needs to drop this irrelevant information for better generalization.
Figure 1. In this figure (from [15]), the network that performs the experiment is a fully-connected neural network. The input data X are 12-dimensional binary vectors. The label Y takes values from {0, 1}. Each mutual information path corresponds to a hidden layer. The leftmost path represents the last hidden layer, and the rightmost path represents the first hidden layer. The green points (also called transition points) separate the mutual information paths into two stages. Best viewed in color.
The dynamic change of I(X;T) and I(T;Y) in the information plane helps us understand how neural networks work. However, there are still some unsolved problems: (1) "Accuracy" is an important indicator for DNNs. To evaluate DNNs via IB, one needs to fully investigate the relationship among the model accuracy, I(X;T), and I(T;Y). One benefit of this investigation is that we can use the information plane to help us evaluate or select a better model that leads to a higher accuracy in a short time (see Section 3.3). (2) The models used in [26] are simple fully-connected neural networks (FCs), while convolutional neural networks (CNNs) are generally used in many applications, like computer vision and speech recognition. Is there any difference between FCs and CNNs in the information plane? (3) Can IB be used in some other problems arising in deep learning? This paper aims to address these problems to make IB theory more acceptable and applicable in deep learning research. The contributions of our work are summarized in two parts as follows:
(1) Framework:
- "Accuracy" is a common indicator that reflects a DNN's quality. Since we use mutual information to evaluate DNNs, we need to explore the relationship among I(X;T), I(T;Y), and accuracy. With extensive experiments, we found that low I(X;T) and high I(T;Y) both contribute to the accuracy. Furthermore, the correlation between I(X;T) and accuracy grows stronger as the network gets deeper.
- An information plane-based framework is proposed based on the investigation of the relationship among I(X;T), I(T;Y), and accuracy. We used the framework to evaluate fully-connected neural networks (FCs) and convolutional neural networks (CNNs) in the information plane and found some interesting phenomena. For example, when a dataset is harder to recognize, FCs with fewer layers do not exhibit the second phase.
- Our results suggest that the information plane is more informative than the loss curve; thus, it may be used for better evaluation of neural networks. For example, two models with similar loss curves (both decreasing along with the training epochs) behave differently in the information plane.
(2) Applications:
- We successfully applied the framework to the image classification problem with an unbalanced data distribution. The framework helped us determine the number of samples of each class needed in the image classification task.
- We applied the framework to explain the efficiency of transfer learning. We found that the weight-transfer method helps the model fit the data only in the first learning phase. This observation is consistent with [27]; differently, we verify this result from the information plane.
- We applied the framework to analyze how different optimization methods affect DNNs. We compared the optimization methods SGD and Adam in the information plane and found that Adam may have the advantage of fitting labels at the beginning of training, but generalizes more poorly than SGD in the end. This finding is also consistent with [28].
Compared to our original conference version [15], this paper focuses more on the application scenarios arising in deep learning and image classification problems, like transfer learning, and on the comparison of different optimization algorithms listed above.
2. Materials and Methods
In this section, we introduce the definition of mutual information and explain representation
learning in the language of information bottleneck. Then, we show how to estimate mutual information
in DNNs in detail.
2.1. Some Concepts of Mutual Information
The mutual information of two discrete random variables X and Y is defined as:

$$ I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}, \qquad (1) $$

where p(x,y) is the joint probability mass function of X and Y, and p(x) and p(y) are the marginal probability mass functions of X and Y, respectively.
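As a concrete illustration (ours, not part of the original paper), the discrete definition (1) can be evaluated directly from a joint probability table; the sketch below assumes a NumPy array `p_xy` holding p(x,y):

```python
import numpy as np

def mutual_information(p_xy, eps=1e-12):
    """Compute I(X;Y) in bits from a joint probability mass function.

    p_xy: 2D array with p_xy[i, j] = P(X = x_i, Y = y_j), entries summing to 1.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    ratio = p_xy / (p_x @ p_y + eps)        # p(x,y) / (p(x) p(y))
    terms = np.where(p_xy > 0, p_xy * np.log2(ratio + eps), 0.0)
    return terms.sum()

# Example: X uniform on {0,1} and Y = X (perfect dependence) gives I(X;Y) = 1 bit.
p_xy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
print(mutual_information(p_xy))  # ~1.0
```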
For continuous random variables, the summation is replaced by a double integral:

$$ I(X;Y) = \int_y \int_x p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy, \qquad (2) $$

where p(x,y), with an abuse of notation, is now the joint probability density function of X and Y. Compared with (2), (1) is easier and more practical to calculate (integrating probability density functions is time consuming in networks). Thus, in the experiments, we first use a discretization technique to transform continuous values into discrete ones before calculating the mutual information. The details can be found in Section 2.3.
For a discrete X, the entropy H(X) can be written in the form of mutual information:

$$ H(X) = I(X;X) = -\sum_{x} p(x)\,\log p(x). \qquad (3) $$
In general, mutual information quantifies the similarity, or dependence, between two variables. The following two properties of mutual information are essential for analyzing neural networks.

- Function transformation: for any invertible functions ψ and φ,

$$ I(X;Y) = I(\psi(X);\varphi(Y)). \qquad (4) $$

- Data processing inequality: if X → Y → Z forms a Markov chain, then:

$$ I(X;Y) \ge I(X;Z). \qquad (5) $$
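As an illustrative check (ours, not from the paper), property (4) can be verified numerically with the `mutual_information` helper sketched above: relabeling X by a permutation of its alphabet, which is an invertible map, only permutes the rows of p(x,y) and so leaves I(X;Y) unchanged:

```python
# Relabeling X by an invertible map (here a permutation of its two symbols)
# permutes the rows of p(x,y); by (4), the mutual information is unchanged.
perm = [1, 0]                              # psi: x0 -> x1, x1 -> x0
p_permuted = p_xy[perm, :]
print(mutual_information(p_permuted))      # same value as before, ~1.0
```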
2.2. Optimal Representations of Information Bottleneck
In machine learning, representation learning is a set of techniques that allows a system to
discover the representations needed for feature detection or classification from raw data automatically.
The neural network is a type of representation learning that can be used to perform feature learning,
since it learns a representation of the input at the hidden layer, which is subsequently used for
classification or regression at the output layer. The goal of representation learning is to extract an efficient representation of the raw data X. This is often called "compression". However, the compression should not lose the predictive capability for the label Y. This learning process is identical to the concept of minimal sufficient statistics. A minimal sufficient statistic T(X) is the solution to the following optimization problem:

$$ T(X) = \operatorname*{arg\,min}_{S(X):\, I(S(X);Y) = I(X;Y)} I(S(X);X). \qquad (6) $$
From (6), the minimal T(X) is the simplest sufficient statistic; in other words, T(X) is a function of every other sufficient statistic. One way of formulating this is through the Markov chain Y → X → S(X) → T(X). Thus, from the concept of minimal sufficient statistics, the goal of DNNs is to make I(X;S(X)) as small as possible, which means the representation is efficient, while I(S(X);Y) should remain equal to I(X;Y), which means the information on Y is not reduced. Since explicit formulae for minimal sufficient statistics only exist for very special distributions (exponential families), the work in [19] relaxed this optimization problem, first by allowing the map to be stochastic, defined as an encoder P(T|X), and then by allowing the map to capture as much of I(X;T) as possible, not necessarily all of it. This leads to the IB method [19].
IB is a special case of rate-distortion theory, which provides a technique for finding approximate minimal sufficient statistics, or the optimal trade-off between the compression of X and the predictive ability of Y. Let x be an input, t be the output of the model (the compressed representation of x), and p(t|x) be a probabilistic mapping. We can formulate the information bottleneck as the following optimization problem:

$$ \min_{p(t|x),\; Y \to X \to T} \; \{ I(X;T) - \beta\, I(T;Y) \}. \qquad (7) $$
The positive Lagrange multiplier β operates as a trade-off parameter between the compression I(X;T) and the amount of preserved relevant information I(T;Y). The solution to (7) is given by three IB self-consistent equations:

$$ p(t|x) = \frac{p(t)}{Z(x;\beta)}\,\exp\{-\beta\, D_{\mathrm{KL}}(p(y|x)\,\|\,p(y|t))\}, \qquad (8) $$

$$ p(t) = \sum_{x} p(t|x)\,p(x), \qquad (9) $$

$$ p(y|t) = \sum_{x} p(y|x)\,p(x|t), \qquad (10) $$

where Z(x;β) is the normalization function.
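To make the self-consistent equations concrete, here is a small sketch (ours, not from the paper) that alternates updates (8)-(10) for a known discrete joint distribution p(x,y), in the style of the classical iterative IB algorithm; all function and variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def ib_iterate(p_xy, n_t, beta, n_iters=200, eps=1e-12):
    """Iterate the IB self-consistent equations (8)-(10) for a discrete p(x,y).

    p_xy: joint pmf of shape (|X|, |Y|); n_t: cardinality of T; beta: trade-off.
    Returns the soft encoder p(t|x) of shape (|X|, n_t).
    """
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                          # p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)       # p(y|x)

    p_t_given_x = rng.dirichlet(np.ones(n_t), size=n_x)   # random encoder init

    for _ in range(n_iters):
        p_t = p_x @ p_t_given_x                     # Equation (9)
        p_x_given_t = (p_t_given_x * p_x[:, None]) / (p_t[None, :] + eps)
        p_y_given_t = p_x_given_t.T @ p_y_given_x   # Equation (10)

        # Equation (8): D_KL(p(y|x) || p(y|t)) for every (x, t) pair.
        kl = np.zeros((n_x, n_t))
        for t in range(n_t):
            kl[:, t] = np.sum(
                p_y_given_x * np.log((p_y_given_x + eps) /
                                     (p_y_given_t[t][None, :] + eps)), axis=1)
        logits = np.log(p_t[None, :] + eps) - beta * kl
        p_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)   # Z(x; beta)

    return p_t_given_x
```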
From (7), a good representation is a trade-off between I(X;T) and I(T;Y). These two terms are essential for analyzing neural networks. In the next section, we show how to estimate mutual information in neural networks approximately.
2.3. Calculating Mutual Information in DNNs
In Section 2.2, we emphasized that I(X;T) and I(T;Y) are important for analyzing DNNs (a type of representation learning model). Next, we show step by step how to calculate the mutual information I(X;T) and I(T;Y) in neural networks.
First, Figure 2 shows a general network structure. Y → X → T forms a Markov chain. The number of neurons in T equals the number of classes C in the training set. Notice that the outputs of T are the scores of different classes and are unbounded. To better calculate the mutual information I(X;T) and I(T;Y), a normalized exponential function is employed to squash a C-dimensional vector z of arbitrary real values to a C-dimensional vector σ(z) of real values in the range [0, 1] that add up to one. The normalization function is written as:

$$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{C} e^{z_i}} \quad \text{for } j = 1, \dots, C, \qquad (11) $$

which is of the same form as the softmax function in the neural network. Now, in the layer T, the value of each neuron ranges from 0 to 1. Then, we bin each neuron's output into 10 equal intervals between zero and one and get our final model output T, which is the discretization process shown in Figure 2.
Figure 2. This figure, from [15], depicts a general neural network structure (label Y, input X, output T; discretization is applied after the CNN architecture). The blue circles represent the softmax probabilities. Discretization is performed on the output of this softmax function.
Second, from (1), one needs to calculate the joint distribution of two random variables for estimating mutual information. Suppose we want to estimate I(X;T) from a mini-batch (containing n samples) in the image classification task. Since every image is different from the other images, for the i-th input image x_i, it is common to let p(x_i) = 1/n. When calculating the mutual information at each epoch, the network's parameter θ is fixed; t_i is generated by inputting x_i into the deterministic network, which we can write as t_i = f(x_i, θ). In this case, p(t_i|x_i) = 1. There are multiple data points; thus, one can take X as a random variable (of the data), but not θ at this epoch, since it is fixed. Thus, T is a deterministic function of X, and therefore H(T|X) = 0. At the next epoch, the network's parameter θ is updated, and we continue this calculation process. Now, due to binning, we have:

$$ I(X;T) = H(T) - H(T|X) = H(T) = \sum_{t} p(t)\,\log\frac{1}{p(t)} = \sum_{i=1}^{n} \frac{1}{n}\,\log\frac{1}{p(t_i)}. \qquad (12) $$
From (12), I(X;T) is related to the distribution of T. This does not mean that calculating I(X;T) does not need X, since T is generated by inputting X into the network. Suppose that in the mini-batch, after binning, each value t_i is different from the others; then, I(X;T) achieves the maximum log n. Suppose some t_i's have the same value; then, I(X;T) becomes lower. Furthermore, (12) shows that I(X;T) is not related to the dimension of the images. Thus, for high-dimensional images, we can still estimate mutual information this way (usually, the number of classes is much smaller than the dimension of the image).
However, there are several points worth noticing:

- We make the hypothesis that in the mini-batch, each image is different from the others. Actually, it can happen that some images are the same. However, this does not affect expression (12), since the same images must produce the same outputs; only the probability mass function of T might change.
- The mutual information we estimate may differ from the ground truth, since we perform discretization on the layer T. Thus, for a fair comparison, for different data with the same number of classes, we need to use the same discretization interval.
- If the number of neurons in the layer T is large, the mutual information I(X;T) barely changes, since in this case the sample space of T is huge even if we decrease the number of intervals. As a result, X → T is approximately a bijection (p(x|t) is almost deterministic). Thus, I(X;T) ≈ H(X) from (1) and (3).

From the above analysis, the dimension of T cannot be huge. Since the number of neurons in the layer T equals the number of classes in the training set, our method is subject to the number of classes. For cases with a large number of classes, we investigated the method of clustering to estimate the probability mass functions. We provide an API at https://github.com/piratehao/API-of-estimating-mutual-information-in-networks.
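Combining the two steps, the following sketch (our illustration, not the authors' released API; it consumes the binned outputs produced by the `discretize` helper above) estimates I(X;T) = H(T) as in (12), together with I(T;Y) from the empirical joint distribution of (T, Y):

```python
from collections import Counter
import numpy as np

def estimate_I_XT(binned):
    """I(X;T) = H(T) under p(x_i) = 1/n, as in Equation (12).

    binned: discretized outputs of shape (n, C); each row is one sample's T.
    """
    n = len(binned)
    counts = Counter(map(tuple, binned))           # empirical p(t)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

def estimate_I_TY(binned, labels):
    """I(T;Y) from the empirical joint distribution of (T, Y)."""
    n = len(binned)
    p_ty = Counter(zip(map(tuple, binned), labels))
    p_t = Counter(map(tuple, binned))
    p_y = Counter(labels)
    # I(T;Y) = sum p(t,y) log [ p(t,y) / (p(t) p(y)) ], with counts c/n etc.
    return sum((c / n) * np.log2(c * n / (p_t[t] * p_y[y]))
               for (t, y), c in p_ty.items())
```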
3. Main Results
This section is structured as follows: in Section 3.1, we explore the relationship among the model accuracy, I(X;T), and I(T;Y); in Section 3.2, we introduce our information plane-based framework and utilize it to compare the differences between FCs and CNNs in the information plane; in Section 3.3, we compare the evaluation framework with loss curves and state that our framework is more informative when evaluating neural networks; in Section 3.4, we evaluate modern popular CNNs in the information plane; in Section 3.5, we show that the information plane can be used to evaluate DNNs, not only on a dataset, but also on each single class of the dataset in image classification tasks; in Section 3.6, we use IB to evaluate DNNs for image classification problems with an unbalanced data distribution; in Section 3.7, we analyze the efficiency of transfer learning in the information plane; in Section 3.8, we compare different optimization algorithms of neural networks in the information plane.
3.1. Relationship among Model Accuracy, I(X;T), and I(T;Y)
Accuracy is a common indicator of a model's quality. Since we use mutual information to evaluate neural networks, it is necessary to investigate the relationship among I(X;T), I(T;Y), and the accuracy. The work in [22] studied the relationship between mutual information and accuracy and stated that I(T;Y) explains the training accuracy, while I(X;T) serves as a regularization term that controls the generalization. From our experiments, however, we found that in deep neural networks, when I(T;Y) is fixed, there is also a correlation between I(X;T) and the training accuracy, where X represents the training data. Furthermore, the correlation grows stronger as the network gets deeper.

To validate this hypothesis experimentally, we need to sample the values of the training accuracy, I(X;T), and I(T;Y). The method of estimating mutual information was discussed in Section 2.3. The network and dataset we chose were VGG-16 and CIFAR-10, respectively. The sampling was performed after every fixed number of iteration steps during training. We first plotted I(X;T), I(T;Y), and the training accuracy of the sampled data in 3D space, as shown in Figure 3.
Figure 3. Visualization of the sampled data in 3D space.
From Figure 3, we can see the data did not fully fill the whole space: they came from a narrow area of the 3D space. Furthermore, there existed two points that had similar I(T;Y), but the point with the smaller I(X;T) had a higher training accuracy. To verify the correlation between I(X;T) and the training accuracy quantitatively, we used the method of "checking inversions".
First, we use I(X;T)_i, I(T;Y)_i, and Acc_i to denote the values of the i-th sampling result, which corresponds to a blue circle in Figure 3. Suppose there exists a sample pair (i, j) such that I(T;Y)_i = I(T;Y)_j; then, we can directly check the relationship between I(X;T) and the training accuracy to find how they are related. However, each I(T;Y)_i is a real number, and it is difficult to find samples that have exactly the same value of I(T;Y). Thus, we verify the hypothesis by checking inversions. An inversion is defined as a pair of samples (i, j) that satisfies I(T;Y)_i < I(T;Y)_j and Acc_i > Acc_j. A "satisfied inversion" is defined as an inversion for which I(X;T)_i < I(X;T)_j and Acc_i > Acc_j. The "percentage", defined as:

$$ \text{percentage} = \frac{\#\,\text{satisfied inversions}}{\#\,\text{all inversions}}, \qquad (13) $$
is a proper indicator to reflect the correctness of our hypothesis that low I(X;T) also contributes to the training accuracy. Consider two cases (a computational sketch follows this list):

- The percentage is near 0.5. This means that almost half of the inversion pairs satisfy I(X;T)_i < I(X;T)_j and Acc_i > Acc_j, while the other inversion pairs satisfy I(X;T)_i < I(X;T)_j and Acc_i < Acc_j. Thus, I(X;T) has almost no relation to the training accuracy.
- The percentage is high. This means that a large number of inversion pairs satisfy I(X;T)_i < I(X;T)_j and Acc_i > Acc_j. Thus, low I(X;T) also contributes to the training accuracy.
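A sketch of this "checking inversions" computation (ours; `ixt`, `ity`, and `acc` are assumed arrays holding the sampled I(X;T), I(T;Y), and accuracy values):

```python
def inversion_percentage(ixt, ity, acc):
    """Equation (13): fraction of inversions that are 'satisfied'."""
    inversions = satisfied = 0
    n = len(acc)
    for i in range(n):
        for j in range(n):
            if ity[i] < ity[j] and acc[i] > acc[j]:   # an inversion
                inversions += 1
                if ixt[i] < ixt[j]:                   # ... that is also satisfied
                    satisfied += 1
    return satisfied / inversions if inversions else float("nan")
```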
We performed experiments under different training conditions to train the neural networks.
Table 1, from [15], records percentages with different training conditions.
Table 1. In this table (from [15]), we include 600 samples when performing experiments. The percentage already converged for 600 samples. SGD represents stochastic gradient descent, which means that we stochastically selected a mini-batch from all training samples each time for training the network. BGD represents batch gradient descent, which means we selected all the training samples for training the network. For an equal comparison, BGD and SGD used the same training set. CNN-9 is a convolutional neural network with 9 hidden layers. The linear network is a fully-connected neural network whose activation function is the identity mapping.

Network Structure   Training Method   Percentage with 600 Samples
CNN-9               SGD               0.865
CNN-9               BGD               0.821
Linear Network      SGD               0.755
Linear Network      BGD               0.594
Table 1 shows that under various condition settings, the percentages were all over 0.5, which indicates that I(X;T) also contributed to the training accuracy. Furthermore, different training schemes (SGD, batch gradient descent (BGD)) and network structures (FC, CNN) may lead to different percentages. We explain these phenomena as follows:
- Even though I(T;Y) represents the correlation between the model output and the label, I(T;Y) is not a monotonic function of the training accuracy. Suppose there are C classes in the image dataset, and let C_i be the i-th class. Consider two cases: in the first case, T = σ(Y), where σ is the identity mapping, which indicates that the model output T always equals the true class. In the other case, T = φ(Y), where φ is a shift mapping, which indicates that if the true class is C_i, the prediction of T is C_{i+1}. In both cases, since σ and φ are invertible functions, from (4), we have I(T;Y) = I(σ(Y);Y) = I(φ(Y);Y) = H(Y). However, in the first case, the training accuracy is one, whereas in the other case it is zero.
- The activation function of the linear network is the identity mapping; thus, the loss function of the linear network is a convex function, whereas the loss function of convolutional neural networks is highly non-convex. By using BGD to optimize a convex function with a proper learning rate, the training loss with respect to all the training data always decreases (the model is the most stable in this case), which indicates that T gradually gets closer to Y during training. Thus, I(T;Y) can fully explain the training accuracy, and I(X;T) may not contribute to the training accuracy greatly. However, when the loss function is non-convex, or the training scheme is SGD, the loss with respect to all the training data does not decrease all the time (the network is sometimes learning in the wrong direction). Thus, I(T;Y) cannot fully explain the training accuracy for SGD and convolutional neural networks.
- The work in [29] used I(X;T) to represent the stability of the learning algorithm. A model with low I(X;T) tends to be more stable on the training data. Thus, when I(T;Y)_i and I(T;Y)_j are equal, the model with the lower I(X;T) may lead to a higher training accuracy.
Another interesting phenomenon is that there was a correlation between the percentage and the number of layers of convolutional neural networks. Table 2, from [15], shows that the percentage rose as we increased the number of layers. This result may relate to some inherent properties of CNNs, which is worth investigating in future work.
Table 2. This table (from [15]) lists the percentages for convolutional neural networks with varying numbers of hidden layers. CNN-i is a deep convolutional neural network with i convolutional layers.

Network Structure             CNN-2   CNN-4   CNN-9   CNN-16
Percentage with 600 samples   0.56    0.68    0.87    0.96
Compared to the training accuracy, we were more interested in the validation accuracy. Thus, we also validated our hypothesis on the validation data, where X represents the validation input. Table 3, from [15], shows that low I(X;T) also contributed to the validation accuracy. This observation forms the basis of the evaluation framework proposed in the next section.
Table 3. This table (from [15]) lists the percentages with the numbers of samples on the validation set. The training scheme and the network structure were SGD and VGG-16, respectively.

Number of Samples   100     200     300     400     500     600
Percentage          0.905   0.921   0.912   0.924   0.924   0.924
3.2. Evaluating DNNs in the Information Plane
In Section 2.2, we explained that I(X;T) and I(T;Y) are essential for analyzing neural networks. The neural network training process can be regarded as finding a trade-off between I(X;T) and I(T;Y). Furthermore, in Section 3.1, the experiments showed that not only high I(T;Y), but also low I(X;T) contributes to the validation accuracy, where X represents the validation input. Since I(X;T) and I(T;Y) are both indicators that reflect the model's accuracy, we used I(T;Y)/I(X;T) to represent the model's learning capability in the information plane. Notice that I(T;Y)/I(X;T) is exactly the slope of the mutual information curve; thus, it represents the model's learning capability at each moment.

Figure 1, from [15], shows that a mutual information path contains two learning stages. The first fitting phase takes only very little time compared to the whole learning process; the model begins to generalize in the second compression phase. Thus, we only use I(T;Y)/I(X;T) to represent the model's learning capability in the second phase. A better model is expected to have a smaller (negative) I(T;Y)/I(X;T) in the second phase. In the first fitting phase, I(T;Y) of the model increases (the model needs to remember the training data for fitting the label); thus, we only use I(T;Y) to represent the model's capability of fitting the label. Based on the analysis above, we propose our information plane-based framework in Figure 4.
Figure 4. This figure (from [15]) shows the information plane-based framework (axes: I(X;T) and I(T;Y)). I(T;Y) before the transition point represents the model's capability of fitting the label; I(T;Y)/I(X;T) after the transition point represents the model's capability of generalization.
To further explore how different datasets or network structures influence the mutual information curves in the information plane, we performed an experiment on two datasets: MNIST and CIFAR-10. The network structures included fully-connected neural networks and convolutional neural networks with various numbers of layers. Notice that for subsequent experiments, X in I(X;T) represents the validation input, since we cared about the validation accuracy rather than the training accuracy. Furthermore, we smoothed the mutual information curves in the information plane for simplicity and better visualization. Table 4 and Figure 5, from [15], show the experimental results.
Figure 5. In these two figures (from [15]; left: MNIST, right: CIFAR-10, with I(X;T) on the horizontal axis and I(T;Y) on the vertical axis), FC-i is a fully-connected neural network with i hidden layers, and CNN-i is a convolutional neural network with i convolutional layers.
Table 4. This table (from [15]) lists I(T;Y), I(X;T), training epochs, and validation accuracies of each network from Figure 5. The transition points of FC-3, CNN-2, and CNN-4 are just their convergence points, since they did not show the second phase.

                     Transition Point                     Convergence Point
Dataset    Model    I(T;Y)  I(X;T)   Epochs  Accuracy    I(T;Y)  I(X;T)   Epochs  Accuracy
MNIST      FC-3     2.96    7.183    1       0.836       3.259   4.358    51      0.983
MNIST      FC-6     2.962   7.532    1       0.846       3.249   3.746    56      0.988
MNIST      FC-9     2.803   7.166    1       0.774       3.214   3.647    54      0.988
MNIST      CNN-2    2.952   7.898    1       0.75        3.282   3.916    50      0.99
MNIST      CNN-4    2.286   7.683    1       0.451       3.284   3.621    53      0.994
MNIST      CNN-6    2.236   6.184    1       0.515       3.275   3.592    54      0.994
CIFAR-10   FC-3     2.671   10.085   65      0.534       2.671   10.085   65      0.534
CIFAR-10   FC-6     2.604   9.321    20      0.537       2.218   7.197    66      0.575
CIFAR-10   FC-9     2.55    9.02     21      0.555       2.218   7.197    66      0.56
CIFAR-10   CNN-2    1.816   8.133    63      0.451       1.816   8.133    63      0.451
CIFAR-10   CNN-4    2.840   8.761    67      0.705       2.840   8.761    67      0.705
CIFAR-10   CNN-6    2.301   8.891    5       0.52        2.472   4.862    66      0.781
There are some interesting observations from Figure 5:

- Not all networks exhibited the second compression phase. For the MNIST experiment, all networks had two learning stages. However, for CIFAR-10, the networks with fewer hidden layers (FC-3, CNN-2, CNN-4) did not show the second phase. The reason is that CIFAR-10 is a more difficult dataset for the network to classify; a network with fewer hidden layers may not have enough generalization capability.
- Convolutional neural networks had a better generalization capability than fully-connected neural networks, as seen by observing I(T;Y)/I(X;T) in the second phase. This led to a higher validation accuracy. However, I(T;Y) of convolutional neural networks was sometimes lower than that of fully-connected neural networks when comparing I(T;Y) at the transition point, which indicates that fully-connected networks may have a better capability of fitting the labels in the first phase. The fact that fully-connected neural networks have a large number of weights may contribute to this.
- Not all networks have an increasing I(T;Y) in the second phase. For CIFAR-10, I(T;Y) of FC-6 and FC-9 dropped in the second phase. One possible reason is that FCs with more layers may over-fit the training data.
3.3. Informativeness and Guidance of the Information Plane
There does not exist a network that is optimal for all problems (datasets). Usually, researchers have to test many neural network structures on a specific dataset, and the network is chosen by comparing the final validation accuracy of each candidate. This process is time consuming, since we have to train each network until convergence. In this section, we show that the information plane can ease this searching process and facilitate a quick model selection of neural networks.

In Section 3.2, we showed that I(T;Y)/I(X;T) can represent the capability of generalization at each moment in the second compression phase. Thus, a direct way to select the model is to compare each model's I(T;Y)/I(X;T) at the beginning of the compression phase. Since the first fitting phase only takes a little time, we can select a better model in a short time. Figure 6 and Table 5, from [15], show the mutual information paths and validation accuracies of different networks on the CIFAR-10 dataset.
Figure 6. In these three figures (from [15]): (a) mutual information paths of neural networks (CNN-2, CNN-4, CNN-9, CNN-16) on the training set of CIFAR-10; (b) mutual information paths of the same networks on the validation set; (c) training losses of the networks over the training iterations.
Table 5. This table (from [15]) lists the percentages of each network from Table 2.

Network Structure                  CNN-2   CNN-4   CNN-9   CNN-16
Percentage with 600 samples        0.56    0.68    0.87    0.96
Final accuracy on validation set   0.45    0.70    0.77    0.89
We can gain some information from Figure 6 and Table 5:

- I(T;Y)/I(X;T) can be used to select a better network. Compare CNN-9 and CNN-16 in Figure 6b: the slope of the mutual information curve of CNN-16 was smaller (more negative) than that of CNN-9, which represents a better generalization capability. The final validation accuracies in Table 5 are consistent with this analysis. Thus, for a specific problem, we can visualize each model's mutual information curve on the validation data to select a better model quickly.
- The information plane is more informative than the loss curve. Comparing Figure 6c with Figure 6a,b, we can see that the training loss of each model continued to decrease with the training steps. However, the mutual information curves behaved differently: the models with fewer layers did not clearly show the second compression phase. Thus, the information plane reveals more information about the network.
- I(X;T) mostly contributed to the accuracy in the second phase. We record the percentage of each network in Table 5. From Table 5, when a network had fewer layers, its mutual information path did not clearly show the second phase, and its percentage was low; whereas for the networks that did have the compression phase, the percentage was high. One possible reason is that in the second phase, the model learned to generalize (extract common features from each mini-batch). The values of the percentages indicate that the compression happened even when the I(T;Y)'s remained the same.

We can view I(X;T) and I(T;Y) as follows: I(T;Y) determines how much knowledge T has about the label Y, and I(X;T) determines how easily this knowledge can be learned by the network.
3.4. Evaluating Popular CNNs in the Information Plane
The architecture of neural networks has undergone substantial development. From AlexNet [1] and VGG [7] to ResNet [9] and DenseNet [30], researchers have made great efforts in searching for efficient networks. In this section, we visualize these popular networks in the information plane. Figure 7 and Table 6 show the information curve, training epochs, and accuracy of each network structure on the CIFAR-10 dataset.
Figure 7. Mutual information paths of different network architectures (AlexNet, VGG-16, ResNet-50, DenseNet-121) on the CIFAR-10 dataset, with transition and convergence points marked. For each mutual information path, the training time spent on the dotted line was three times longer than the time spent on the solid line.
Table 6. This table records I(X;T), I(T;Y), training epochs, and model accuracy at the transition point and convergence point for every neural network.

                     Transition Point                    Convergence Point
               I(T;Y)  I(X;T)  Epochs  Accuracy    I(T;Y)  I(X;T)  Epochs  Accuracy
AlexNet        2.496   9.476   5       0.524       2.339   5.907   62      0.673
VGG-16         2.262   8.862   7       0.451       2.733   4.682   69      0.839
ResNet-50      2.387   9.389   2       0.541       2.989   4.956   55      0.877
DenseNet-121   2.702   8.965   1       0.604       2.96    4.492   51      0.902
Looking only at the solid lines in Figure 7, we may infer that when classifying the CIFAR-10 dataset, AlexNet was the worst neural network among all the models: AlexNet has a low capability of fitting the labels, and after the transition point, I(T;Y) of the model even dropped, indicating a loss of information about the labels. VGG had a stronger capability of fitting the labels than ResNet, but its generalization capability was relatively lower. From the trend of the curves, I(T;Y) of ResNet may become larger than that of VGG in the future. DenseNet was better than all the other models in both learning phases. The model accuracies in Table 6 are consistent with our analysis.
Here, we emphasize that our prediction may not always be true, since the mutual information path may exhibit a larger slope change in the future (see ResNet in Figure 7). Thus, there exists a trade-off between the training time and the confidence of our prediction: we can make a more confident prediction by training the network for a longer time. Nevertheless, it is still an efficient way of guiding us in choosing a better network that leads to a high validation accuracy. Table 6 also shows that with more convolutional layers, the model reaches the transition point more quickly. This means that for many deep networks, we can predict the networks' quality and choose a better model in a short time.
3.5. Evaluating DNN’s Capability of Recognizing Objects from Different Classes
The previous sections evaluated neural networks on a whole dataset. However, this evaluation can only test the performance of a network on all the classes together. In this section, we show how to evaluate DNNs on a single class in the information plane when performing image classification tasks.

Suppose there are C classes in the dataset and C_i denotes the i-th class. To evaluate networks on the i-th class, we relabel all other classes as one class; thus, the dimension of the label Y changes from C to two. We also balance the label Y when calculating mutual information, so that H(Y) equals one. This process is similar to one-vs.-all classification [31]. However, instead of concentrating on the accuracy, we pay attention to the mutual information. Furthermore, notice that we did not change the network structure; we only performed pre-processing on the validation data, as sketched below. The training scheme and other conditions did not change during this process. We selected three classes (airplane, automobile, bird) of CIFAR-10 to perform the experiment. Figure 8, from [15], compares the performance of two networks on these three classes. Figure 9, from [15], shows the mutual information curve of each class in the information plane.
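A small sketch of this relabel-and-balance pre-processing (our illustration; `outputs` and `labels` are assumed arrays for the validation set):

```python
import numpy as np

def one_vs_all_subset(outputs, labels, class_i, seed=0):
    """Relabel class_i vs. rest and subsample the rest so that H(Y) = 1 bit."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == class_i)
    neg = np.flatnonzero(labels != class_i)
    neg = rng.choice(neg, size=len(pos), replace=False)  # balance the two labels
    idx = np.concatenate([pos, neg])
    y_binary = (labels[idx] == class_i).astype(int)      # 1 = class_i, 0 = rest
    return outputs[idx], y_binary
```

The binary labels can then be fed to the `estimate_I_TY` sketch from Section 2.3 to obtain per-class mutual information curves.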
Comparing the mutual information curves of automobile and airplane, we find that they had almost the same value of I(T;Y) at the transition point; however, the slope for automobile was smaller than that for airplane, so the final validation accuracy of automobile should be higher than that of airplane. Comparing the mutual information curves of airplane and bird, we find that they had almost the same slope in the second phase; however, I(T;Y) of airplane was higher than that of bird, so the final validation accuracy of airplane should be higher than that of bird. The true validation accuracies were 0.921 (airplane), 0.961 (automobile), and 0.825 (bird), which are consistent with our analysis.
Figure 8. This figure (from [15]) lists I(X;T), I(T;Y), and accuracies for well-trained AlexNet and VGG-16 on the classes airplane, automobile, and bird. The accuracy is defined as the percentage of samples correctly predicted out of all samples belonging to that class. For better visualization, we divided I(X;T) by its upper bound H(X), so that I(X;T), I(T;Y), and the validation accuracy have the same magnitude.
Figure 9. This figure (from [15]) shows the mutual information curves of each class (airplane, automobile, bird) on the CIFAR-10 dataset based on VGG-16, with transition and convergence points marked.
Furthermore, we examined the two models on the same classes. From Figure 8, VGG-16 had better performance than AlexNet on each class when comparing I(X;T), I(T;Y), and the validation accuracies. Thus, the information plane also provides a way to analyze the performance of a network on each class in the image classification task.
3.6. Evaluating DNNs for Image Classification with an Unbalanced Data Distribution
In a multi-class classification problem, each class may have a different number of samples. Suppose we want our model to have a balanced classification capability: we need to control the number of training samples for each class. We suggest that the information plane can help in a quick way.

In CIFAR-10, each class has 5000 training samples. For simplicity and better presentation, in this experiment, we chose the classes "automobile" and "bird". We hoped that the model would have a balanced classification capability for these two classes. We fixed the number of training samples for bird and performed the experiments with varying numbers of samples for automobile, as sketched below. The results are shown in Figure 10 and Table 7.
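For illustration (our sketch, with an assumed label encoding of automobile = 1 and bird = 2; whether the remaining eight classes stay at full size is also our assumption), such an unbalanced training subset can be drawn as follows:

```python
import numpy as np

def unbalanced_indices(train_labels, n_auto, n_bird=5000, seed=0):
    """Keep n_auto automobile samples and n_bird bird samples for training."""
    rng = np.random.default_rng(seed)
    auto = rng.choice(np.flatnonzero(train_labels == 1), n_auto, replace=False)
    bird = rng.choice(np.flatnonzero(train_labels == 2), n_bird, replace=False)
    other = np.flatnonzero((train_labels != 1) & (train_labels != 2))
    # e.g., n_auto = 200, 500, 1500, or 3000, as in Figure 10
    return np.concatenate([auto, bird, other])
```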
From Figure 10a, when there were only 200 samples of automobile, I(X;T) of automobile even decreased in the second stage; in this case, the network may just over-fit the small-sample noise. From Figure 10b,c, the slope for automobile in the second phase was higher (less negative) than that for bird, indicating a low generalization capability. When automobile:bird = 3:5 (Figure 10d), the generalization capability of automobile was already comparable to that of bird. The final validation accuracies of these two classes in Table 7 are consistent with the generalization capabilities in Figure 10. Since the first phase only takes a little time, we can decide how many samples are needed by checking the slope in the second phase.

We believe this application has huge potential in some serious areas, such as medical science, especially where people expect computers to automatically classify diseased tissues from samples. For example, at present, cancer classification has two major problems: the healthy and diseased samples are very unbalanced, and the AI system lacks interpretability. Since IB explains the network in an informative way, and our framework shows that we can determine the number of samples for each class in a short time, our framework could be applied to medical science in the future.
Figure 10. Mutual information paths of automobile and bird with varying numbers of training samples on the CIFAR-10 dataset. The sample ratios of automobile to bird are (a) 200:5000, (b) 500:5000, (c) 1500:5000, and (d) 3000:5000.
Table 7. Final validation accuracies for automobile and bird with different numbers of training samples.

Number of Samples (automobile:bird)   200:5000   500:5000   1500:5000   3000:5000
automobile                            0.55       0.68       0.85        0.91
bird                                  0.90       0.91       0.90        0.90
3.7. Transfer Learning in the Information Plane
Using a pre-trained model instead of training a neural network from scratch has become common in recent years. It can greatly reduce the training time and make the model easier to learn. In this section, we study why the pre-trained model is better in the language of the information plane. We used the same neural network structure (ResNet-34) for our base model and pre-trained model. The pre-trained model was trained on the ImageNet dataset. We resized the images of CIFAR-10 to 224 × 224 to fit the network structure. Figure 11 shows the mutual information paths in the information plane.
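A minimal PyTorch sketch of this setup (ours; aside from ResNet-34 and the 224 × 224 resize, which come from the text, the details are assumptions):

```python
import torch.nn as nn
import torchvision
from torchvision import transforms

# CIFAR-10 images are resized to 224x224 so they fit the ImageNet-shaped network.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
])

def make_model(pretrained):
    """ResNet-34, either with ImageNet weights (transfer) or from scratch (base)."""
    model = torchvision.models.resnet34(pretrained=pretrained)
    model.fc = nn.Linear(model.fc.in_features, 10)   # new 10-class output layer T
    return model

base_model = make_model(pretrained=False)
pretrained_model = make_model(pretrained=True)
```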
We can see that these two models had very similar curves in the "compression" phase and ended up at the same convergence point. However, in the "fitting" phase, they acted differently. The pre-trained model had larger I(X;T) and I(T;Y) at the initial time, which is attributed to the similarity between CIFAR-10 and the ImageNet dataset. Thus, the pre-trained model reached the transition point much faster than the base model. After the transition point, since they had the same architecture, the generalization capabilities of the two models were the same.

Therefore, the information plane reveals that the pre-trained model improves the learning algorithm in the "fitting" phase: it helps the model fit the labels efficiently. However, the pre-trained model and the base model without pre-training may have an equal generalization capability. Interestingly, our result is consistent with a recent paper [27], where Kaiming He et al. found that ImageNet pre-training is not necessary: ImageNet can help speed up convergence, but does not necessarily improve accuracy unless the target dataset is too small. Differently, we verified this result by utilizing the information plane.
Figure 11. Mutual information paths of ResNet and pre-trained ResNet, with transition and convergence points marked.
3.8. SGD versus Adam in the Information Plane
Choosing an appropriate optimization method is essential in the neural network training process. A straightforward way to optimize the loss of a DNN is via stochastic gradient descent (SGD). However, the loss function of a DNN is highly non-convex, which makes SGD easily get stuck in local minima or saddle points. Thus, several advanced optimization methods have been developed in recent years and have become popular. Momentum SGD [32] is a method that helps accelerate SGD in the relevant direction and dampens oscillations. The work in [33] adapted the learning rate and performed smaller updates for some parameters, which can improve the robustness of SGD. The work in [34] proposed Adam, an optimization method that has been widely used in deep learning. Adam also keeps an exponentially-decaying average of past gradients, similar to momentum; whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

However, recently, there have been some criticisms of Adam [28]. It was stated that Adam may have better performance at the beginning of training, but may end up with a lower accuracy than SGD. To verify this phenomenon, we visualize the SGD and Adam methods in the information plane in Figure 12.
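A sketch of the two optimizer configurations being compared (ours; SGD's learning rate 0.1 and momentum 0.9 match Appendix A, while the Adam setting is an assumed common default):

```python
import torch

def make_optimizer(name, params):
    """The two optimizers compared in the information plane."""
    if name == "SGD":
        return torch.optim.SGD(params, lr=0.1, momentum=0.9)
    if name == "Adam":
        return torch.optim.Adam(params, lr=1e-3)   # assumed default Adam setting
    raise ValueError(name)
```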
From Figure 12, we can see that Adam had a better capability of fitting the labels in the first stage, while in the second stage, SGD behaved better than Adam. Since SGD converged to a point with a lower I(X;T), this indicates that SGD had a better generalization capability than Adam at the end of training. Our finding is consistent with [28]. The experiments in Sections 3.7 and 3.8 show that the information plane can be used as a tool to analyze various deep learning algorithms.
Figure 12. Mutual information paths of SGD and Adam in the information plane.
4. Conclusions and Future Work
Our work in this paper aimed to bridge the gap between the theoretical understanding of neural networks and their applications by visualizing mutual information. We summarize our work as follows:
- We investigated the relationship among I(X;T), I(T;Y), and accuracy. The experiment in Section 3.1 showed that low I(X;T) and high I(T;Y) both contributed to the validation accuracy when X represented the validation input. Based on this exploration, we proposed a framework utilizing mutual information to evaluate neural networks. A model with a lower I(X;T) and a higher I(T;Y) in the second stage of the learning process is more likely to achieve a higher accuracy. The benefit of mutual information is that it provides two terms, I(X;T) and I(T;Y); compared to the accuracy alone, mutual information can better explore the network's representation in terms of the capability of fitting the labels and the capability of generalization.
- There were some interesting phenomena when we compared fully-connected neural networks (FCs) and convolutional neural networks (CNNs) in the information plane. For example, sometimes a network may not show the second learning stage in its mutual information path; this happens when a dataset is harder to recognize and the network has fewer layers. It cannot be seen by simply visualizing the loss curve. Thus, we suggest that the information plane is a more powerful tool than the loss for evaluating neural networks.
- Since a model with a lower I(X;T) and a higher I(T;Y) has a high probability of achieving a higher accuracy, I(T;Y)/I(X;T) can represent the model's momentary learning capability in the second stage. The first learning stage takes only a little time compared to the whole learning process; we can compare each model's I(T;Y)/I(X;T) at the beginning of the second stage and select a better one. Thus, the information plane can facilitate a quick model selection for a specific problem.
- From Sections 3.5 and 3.6, with some preprocessing techniques on the data, the information plane can also reveal a neural network's recognition capability on each class. Thus, we can use the information plane to deal with the classification problem with an unbalanced data distribution: by comparing the mutual information curves of each class, we can determine how many samples are needed to train a classifier with a balanced recognition capability.
- There are many sub-fields in deep learning research. We suggest that the information plane is a powerful tool for analyzing particular problems in these sub-fields. For example, weight-transfer learning is often regarded as a technique that improves model performance, while from our visualization in Section 3.7, weight-transfer learning had an almost equal generalization capability compared to the original model without pre-training. This observation is consistent with a recent study [27]; however, the information plane draws this conclusion in a more informative way. Furthermore, optimization methods are essential in the neural networks' training process; we compared SGD and Adam in the information plane in Section 3.8 to show how each optimization method behaves during training.
There are some future directions based on our current work:
The mutual information terms I(X;T) and I(T;Y) are estimated from the last layer of CNNs. One direction for future work is to develop techniques that estimate the information of each kernel in the hidden layers with regard to the input and output. This may help determine how many hidden layers or how many kernels are needed when designing the network structure. Furthermore, visualizing hidden units may facilitate a better understanding of neural networks.
We used the binning technique to estimate the mutual information. This estimation method has limitations, since the bin size and the dimension of the layer may affect the value of the mutual information. Other estimation techniques, such as the K-nearest-neighbor estimator [35] and the kernel-density-based estimator [36], are also affected by the dimension of the layer. A possible direction for future work is to develop more efficient estimation techniques that stabilize the calculation of mutual information.
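As a quick illustration of this sensitivity, the bin count can simply be swept with the estimator and toy data from the earlier sketch; the bin counts below are arbitrary.

```python
# Sweep the bin count to see how strongly the estimates depend on it.
for n_bins in (10, 30, 100, 256):
    t_ids = bin_activations(T, n_bins)
    print(f"bins={n_bins:4d}  I(X;T)={entropy(t_ids):6.3f}  "
          f"I(T;Y)={mutual_information(t_ids, y):6.3f}")
```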
Author Contributions: H.C. proposed the main idea and wrote the manuscript. D.L. performed the experiments to validate the results. Y.G. and S.G. reviewed and edited the manuscript. All authors read and approved the manuscript.
Funding: This work is supported by NSFC (No. 61601288 and No. 61502304).
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. Experiment Settings
This appendix details the experimental settings used in the paper.
(1) In Sections 3.1–3.3, the networks include FCs and CNNs. The setting for each network is listed in Table A1; a PyTorch sketch of the shared training schedule is given after item (5).

Table A1. Network settings for Sections 3.1–3.3.

Network        | Hidden Layers | Batch Size | Initial lr | lr Decay | Optimizer | Momentum
Linear Network | 9             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
FC-3           | 3             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
FC-6           | 6             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
FC-9           | 9             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
CNN-2          | 2             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
CNN-4          | 4             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
CNN-6          | 6             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
CNN-9          | 9             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
CNN-16         | 16            | 128        | 0.1        | 0.1/60   | SGD       | 0.9
(2) In Section 3.4, we visualize popular networks in the information plane. The network settings are listed in Table A2.

Table A2. Network settings for Section 3.4.

Network   | Hidden Layers | Batch Size | Initial lr | lr Decay | Optimizer | Momentum
AlexNet   | 4             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
VGG       | 16            | 128        | 0.1        | 0.1/60   | SGD       | 0.9
ResNet-50 | 50            | 128        | 0.1        | 0.1/60   | SGD       | 0.9
DenseNet  | 121           | 128        | 0.1        | 0.1/60   | SGD       | 0.9
(3) In Sections 3.5 and 3.6, we use AlexNet and VGG to evaluate the classification capability of the networks for each class in the image classification task. The network settings are listed in Table A3.

Table A3. Network settings for Sections 3.5 and 3.6.

Network | Hidden Layers | Batch Size | Initial lr | lr Decay | Optimizer | Momentum
AlexNet | 4             | 128        | 0.05       | 0.95     | SGD       | 0.9
VGG     | 16            | 128        | 0.05       | 0.95     | SGD       | 0.9
(4) In Section 3.7, we use the information plane to analyze the efficiency of transfer learning. The network is ResNet-34, and its settings are listed in Table A4.

Table A4. Network settings for Section 3.7.

Network   | Hidden Layers | Batch Size | Initial lr | lr Decay | Optimizer | Momentum
ResNet-34 | 34            | 128        | 0.1        | 0.1/60   | SGD       | 0.9
(5) In Section 3.8, we use the information plane to analyze different optimization methods in neural networks. The network is CNN-6, and its settings are listed in Table A5.

Table A5. Network settings for Section 3.8.

Network | Hidden Layers | Batch Size | Initial lr | lr Decay | Optimizer | Momentum
CNN-6   | 6             | 128        | 0.1        | 0.1/60   | SGD       | 0.9
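Reading the "lr Decay" entry 0.1/60 in the tables above as "multiply the learning rate by 0.1 every 60 epochs" (our interpretation of the table), the shared training schedule maps to PyTorch as follows; the stand-in model and the random batch are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))  # stand-in net
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Multiply the learning rate by 0.1 every 60 epochs ("0.1/60" in the tables).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

x = torch.randn(128, 3, 32, 32)     # one random batch of size 128 (placeholder)
y = torch.randint(0, 10, (128,))
for epoch in range(180):
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()
    optimizer.step()
    scheduler.step()                # advance the decay schedule once per epoch
```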
References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105.
2. Seide, F.; Li, G.; Yu, D. Conversational speech transcription using context-dependent deep neural networks. In Proceedings of the Twelfth Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011.
3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529.
4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
5. Zhang, J.; Zong, C. Deep Neural Networks in Machine Translation: An Overview. IEEE Intell. Syst. 2015, 30, 16–25.
6. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
7. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
8. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
10. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Object detectors emerge in deep scene CNNs. arXiv 2014, arXiv:1412.6856.
11. Lu, Y. Unsupervised learning on neural network outputs: With application in zero-shot learning. arXiv 2015, arXiv:1506.00990.
12. Aubry, M.; Russell, B.C. Understanding deep features with computer-generated imagery. In Proceedings of the IEEE International Conference on Computer Vision, Tampa, FL, USA, 5–8 December 2015; pp. 2875–2883.
13. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833.
14. Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6541–6549.
15. Cheng, H.; Lian, D.; Gao, S.; Geng, Y. Evaluating Capability of Deep Neural Networks for Image Classification via Information Plane. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 168–182.
16. Vasudevan, S. Dynamic learning rate using Mutual Information. arXiv 2018, arXiv:1805.07249.
17. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
18. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv 2018, arXiv:1808.06670.
19. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
20. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188.
21. Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017, 29, 1611–1630.
22. Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711.
23. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410.
24. Nguyen, T.T.; Choi, J. Layer-wise Learning of Stochastic Neural Networks with Information Bottleneck. arXiv 2018, arXiv:1712.01272.
25. Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905.
26. Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810.
27. He, K.; Girshick, R.; Dollár, P. Rethinking ImageNet Pre-training. arXiv 2018, arXiv:1811.08883.
28. Keskar, N.S.; Socher, R. Improving Generalization Performance by Switching from Adam to SGD. arXiv 2017, arXiv:1712.07628.
29. Raginsky, M.; Rakhlin, A.; Tsao, M.; Wu, Y.; Xu, A. Information-theoretic analysis of stability and bias of learning algorithms. In Proceedings of the 2016 IEEE Information Theory Workshop (ITW), Cambridge, UK, 11–14 September 2016; pp. 26–30.
30. Huang, G.; Liu, Z.; Weinberger, K.Q.; van der Maaten, L. Densely connected convolutional networks. arXiv 2016, arXiv:1608.06993.
31. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
32. Qian, N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999, 12, 145–151.
33. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
35. Gao, W.; Oh, S.; Viswanath, P. Demystifying Fixed k-Nearest Neighbor Information Estimators. IEEE Trans. Inf. Theory 2018, 64, 5629–5661.
36. Kolchinsky, A.; Tracey, B. Estimating mixture entropy with pairwise distances. Entropy 2017, 19, 361.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 dif- ferent classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implemen- tation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry