

Utilizing Information Bottleneck to Evaluate the Capability of Deep Neural Networks for Image Classification †

Hao Cheng 1,2,3, Dongze Lian 3, Shenghua Gao 3 and Yanlin Geng 4,*

1 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China; chenghao@shanghaitech.edu.cn
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China; liandz@shanghaitech.edu.cn (D.L.); gaoshh@shanghaitech.edu.cn (S.G.)
4 State Key Laboratory of ISN, Xidian University, Xi'an 710071, China
* Correspondence: gengyanlin@gmail.com
† This paper is an extended version of our paper published in the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.

Received: 10 February 2019; Accepted: 28 April 2019; Published: 1 May 2019

Abstract: Inspired by the pioneering work of the information bottleneck (IB) principle for the analysis of Deep Neural Networks (DNNs), we thoroughly study the relationship among the model accuracy, I(X;T), and I(T;Y), where I(X;T) and I(T;Y) are the mutual information of the DNN's output T with the input X and the label Y. We then design an information plane-based framework to evaluate the capability of DNNs (including CNNs) for image classification. Instead of each hidden layer's output, our framework focuses on the model output T. We successfully apply our framework to many application scenarios arising in deep learning and image classification, such as image classification with an unbalanced data distribution, model selection, and transfer learning. The experimental results verify the effectiveness of the information plane-based framework: our framework may facilitate a quick model selection and determine the number of samples needed for each class in the unbalanced classification problem. Furthermore, the framework explains the efficiency of transfer learning in the deep learning area.

Keywords: information bottleneck; mutual information; neural networks; image classiﬁcation

1. Introduction

1.1. Deep Neural Networks

Deep neural networks (DNNs) are very powerful machine learning models that have revolutionized many research and application areas [1–6] in recent years. These include image recognition [1,4], speech recognition [2], and natural language processing [5]. A DNN is a type of representation learning model that can automatically generate good representations from raw data for further processing.

With large amounts of training data and GPU acceleration [1], the performance of DNNs has greatly exceeded that of traditional learning algorithms on image recognition tasks. Inspired by the work in [1], researchers have developed more efficient network structures: the work in [7] decreased the kernel size and found better representations of the data; the work in [8] proposed FaceNet, a facial recognition tool that learns an embedding for face recognition and clustering; and the work in [9] proposed ResNet, which utilizes identity mappings between layers. ResNet can make the network deeper than 1000 layers without losing performance and has been widely used in many applications.

Entropy 2019,21, 456; doi:10.3390/e21050456 www.mdpi.com/journal/entropy


However, a DNN has millions of parameters and numerous hyper-parameters. Before training a DNN model, one needs to define all the hyper-parameters heuristically. These include the initial learning rate, momentum, batch size, activation functions, number of layers, weight decay, etc. There is no standard guideline on how to choose appropriate hyper-parameters. Furthermore, the same network structure or hyper-parameter setting may be suitable for one task (dataset), but fail on another. Thus, we still lack a deep understanding or explanation of how networks work or behave.

Recently, there have been some works that try to understand networks by visualization. The work in [10] proposed a method to compute the image-resolution receptive field of neural activations in a feature map; an accurate calculation of the receptive field helps people understand the representation of the filter. The works in [11,12] understood DNNs by estimating the feature distribution of each class in the feature space of the pre-trained model. The work in [13] used the deconvnet operation to visualize discriminant features from the data. The work in [14] introduced a tool named network dissection, which quantifies the interpretability of the hidden representations of kernels in convolutional networks by evaluating the alignment between individual hidden units and a set of semantic concepts. However, almost all of these visualization techniques understand DNNs by visualizing features or activation maps to gain insights into neural networks, which somewhat lacks a theoretical basis. In this work, we use the tool of mutual information, a fundamental measure in information theory, to represent networks in the information plane for a better understanding.

Remark 1. This is an extended version of our conference paper published in ECCV 2018 [15].

1.2. Mutual Information and Information Bottleneck

The reason we use mutual information is that it provides a measure to quantify the shared information between two random variables, especially when their relationship is highly non-linear. Notice that both supervised and unsupervised deep learning models have input data and output data; the relationship, or uncertainty, between them can be measured by mutual information. For example, the work in [16] introduced a new learning rate schedule based on mutual information. The work in [17] utilized a mutual information method to score all the candidate features in the network and selected the most informative one as the input for the DNN. The work in [18] learned a better representation by maximizing the mutual information between the DNN's input and output.

In supervised deep learning, the label Y is also available. In addition to the relationship between the input X and the output T, the correlation between the output T and the label Y is also worth studying. We use I(X;T) and I(T;Y) to denote the corresponding mutual information. The study of I(X;T) and I(T;Y) inspires the information bottleneck (IB) method. IB was introduced by Naftali Tishby et al. [19] and serves as a method for compressing the source data (a random variable) without losing the predictive capability for the target label (another random variable). Since then, the modification and generalization of IB have attracted considerable interest. The work in [20] generalized the IB method to the continuous setting, which deals with variables such as Gaussians. The work in [21] proposed the deterministic IB (DIB), a different formulation that uses entropy instead of mutual information, and stated that DIB could better distinguish relevant features. The work in [22] deeply explored the relationship between IB and minimal sufficient statistics.

In recent years, IB has been applied in deep learning research. The work in [23] introduced a variational approximation technique for IB. The new method uses a neural network to parameterize the IB model, which can lead to an efficient and stable training process. The work in [24] introduced a new IB-based framework, called parametric IB (PIB). The advantage of PIB is that it jointly optimizes the compressed information and the relevant information of the layers in the neural network, leading to a better representation ability of DNNs. The work in [25] proposed a method called information dropout, which generalizes the original dropout with theoretical consideration. The new dropout method can make it easier for the network to fit the training data.

1.3. Visualizing DNN via Information Bottleneck

Different from the works in [23–25], Ravid and Naftali [26] took IB as a visualization tool to understand how neural networks work in general. Specifically, they estimated the mutual information I(X;T) and I(T;Y) for each hidden layer, where X is the input data, Y is the label, and T is the hidden layer output, respectively. Figure 1, from [15], draws I(X;T) and I(T;Y) over the training epochs in the two-dimensional plane (referred to as the information plane). From the figure, the green points (transition points) separate the mutual information paths into two stages:

• The first phase, also called the "fitting phase", takes only a little time in the whole training procedure. I(X;T) and I(T;Y) both increase in this stage. It can be explained as follows: the network needs to remember the information from the data, I(X;T), in order to fit the label, I(T;Y).

• The second phase, also called the "compression phase", takes most of the training time. I(T;Y) still increases while I(X;T) decreases. It can be explained as follows: the value of I(X;T) obtained in the first fitting phase contains too much irrelevant information (for classification), and the network needs to drop this irrelevant information for better generalization.

Figure 1. In this figure (from [15]), the network that performs the experiment is a fully-connected neural network. The input data X are 12-dimensional binary vectors. The label Y takes values from {0, 1}. Each mutual information path corresponds to a hidden layer: the leftmost path represents the last hidden layer, and the rightmost path represents the first hidden layer. The green points (also called transition points) separate the mutual information paths into two stages. Best viewed in color.

The dynamic change of I(X;T) and I(T;Y) in the information plane helps us understand how neural networks work. However, there are still some unsolved problems: (1) "Accuracy" is an important indicator for DNNs. To evaluate DNNs via IB, one needs to fully investigate the relationship among the model accuracy, I(X;T), and I(T;Y). One benefit of this investigation is that we can use the information plane to evaluate or select a better model that leads to a higher accuracy in a short time (see Section 3.3). (2) The models used in [26] are simple fully-connected neural networks (FCs), while convolutional neural networks (CNNs) are generally used in many applications, like computer vision and speech recognition. Is there any difference between FCs and CNNs in the information plane? (3) Could IB be used for some other problems arising in deep learning? This paper aims to address these problems to make IB theory more acceptable and applicable in deep learning research. The contributions of our work are summarized in two parts as follows:


(1) Framework:

• "Accuracy" is a common indicator that reflects a DNN's quality. Since we are to use mutual information to evaluate DNNs, we need to explore the relationship among I(X;T), I(T;Y), and accuracy. With extensive experiments, we found that low I(X;T) and high I(T;Y) both contribute to the accuracy. Furthermore, the correlation between I(X;T) and accuracy grew stronger as the network got deeper.

• An information plane-based framework is proposed based on the investigation of the relationship among I(X;T), I(T;Y), and accuracy. We used the framework to evaluate fully-connected neural networks (FCs) and convolutional neural networks (CNNs) in the information plane and found some interesting phenomena. For example, when a dataset is harder to recognize, FCs with fewer layers do not exhibit the second phase.

•

Our results suggest that the information plane is more informative than the loss curve; thus,

it may be used for better evaluation of neural networks. For example, two models with

similar loss curves (both decreasing along with the training epochs) behave differently in the

information plane.

(2) Applications:

• We successfully applied the framework to the image classification problem with an unbalanced data distribution. The framework helped us determine the number of samples of each class needed in the image classification task.

• We applied the framework to explain the efficiency of transfer learning. We found that the weight-transfer method helped the model fit the data only in the first learning phase. Our observation is consistent with [27], although, differently, we verify this result from the information plane.

• We applied the framework to analyze how different optimization methods affect DNNs. We compared the optimization methods SGD and Adam in the information plane. We found that Adam may have the advantage of fitting labels at the beginning of training, but generalizes more poorly than SGD in the end. This finding is also consistent with [28].

Compared to our original conference version [15], this paper focuses more on application scenarios arising in deep learning and image classification, such as transfer learning and the comparison of different optimization algorithms listed above.

2. Materials and Methods

In this section, we introduce the deﬁnition of mutual information and explain representation

learning in the language of information bottleneck. Then, we show how to estimate mutual information

in DNNs in detail.

2.1. Some Concepts of Mutual Information

The mutual information of two discrete random variables X and Y is defined as:

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)},    (1)

where p(x,y) is the joint probability mass function of X and Y, and p(x) and p(y) are the marginal probability mass functions of X and Y, respectively.

For continuous random variables, the summation is replaced by a double integral:

I(X;Y) = \int_y \int_x p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy,    (2)

where p(x,y), with an abuse of notation, is now the joint probability density function of X and Y. Compared with (2), (1) is easier and more practical to calculate (integrating probability density functions is time consuming in networks). Thus, in the experiments, we first use a discretization technique to transform continuous values into discrete ones before calculating the mutual information. The details can be found in Section 2.3.

For a discrete X, the entropy H(X) can be written in the form of mutual information:

H(X) = I(X;X) = -\sum_{x} p(x) \log p(x).    (3)

In general, mutual information quantifies the similarity, or dependence, between two variables. The following two properties of mutual information are essential for analyzing neural networks.

• Function transformation: for any invertible functions ψ and φ,

I(X;Y) = I(\psi(X); \phi(Y)).    (4)

• Data processing inequality: if X → Y → Z forms a Markov chain, then:

I(X;Y) \geq I(X;Z).    (5)
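To make the definitions above concrete, the following sketch (our addition, in Python with NumPy; the function and variable names are illustrative) estimates I(X;Y) of Equation (1) and H(X) of Equation (3), in bits, from paired discrete samples.

```python
import numpy as np
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete samples (Eq. (1))."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))   # joint counts
    p_x = Counter(xs)             # marginal counts of X
    p_y = Counter(ys)             # marginal counts of Y
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * np.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

def entropy(xs):
    """H(X) = I(X;X) (Eq. (3))."""
    n = len(xs)
    return -sum((c / n) * np.log2(c / n) for c in Counter(xs).values())

# e.g. mutual_information([0, 0, 1, 1], [0, 0, 1, 0]) gives about 0.31 bits.
```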

2.2. Optimal Representations of Information Bottleneck

In machine learning, representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. The neural network is a type of representation learning that can be used to perform feature learning, since it learns a representation of the input at the hidden layers, which is subsequently used for classification or regression at the output layer. The goal of representation learning is to extract an efficient representation of the raw data X. This is often called "compression". However, the compression should not lose the predictive capability for the label Y. This learning process is identical to the concept of minimal sufficient statistics. A minimal sufficient statistic T(X) is the solution to the following optimization problem:

T(X) = \arg\min_{S(X):\, I(S(X);Y) = I(X;Y)} I(S(X);X).    (6)

From (6), the minimal sufficient statistic T(X) is the simplest sufficient statistic; in other words, T(X) is a function of every other sufficient statistic. One way of formulating this is through the Markov chain Y → X → S(X) → T(X). Thus, from the concept of minimal sufficient statistics, the goal of DNNs is to make I(X;S(X)) as small as possible, which means the representation is efficient, while I(S(X);Y) should equal I(X;Y), which means the information on Y is not reduced. Since explicit formulae for minimal sufficient statistics only exist for very special distributions (exponential families), the work in [19] relaxed this optimization problem by first allowing the map to be stochastic, defined as an encoder P(T|X), and then by allowing the map to capture as much of I(X;Y) as possible, not necessarily all of it. This leads to the IB method [19].

IB is a special case of rate-distortion theory, which provides a technique for finding approximate minimal sufficient statistics, or the optimal trade-off between the compression of X and the predictive ability for Y. Let x be an input, t be the output of the model (the compressed representation of x), and p(t|x) be a probabilistic mapping. We can formulate the information bottleneck as the following optimization problem:

\min_{p(t|x),\; Y \to X \to T} \{ I(X;T) - \beta I(T;Y) \}.    (7)

The positive Lagrange multiplier β operates as a trade-off parameter between the compression I(X;T) and the amount of preserved relevant information I(T;Y). The solution to (7) is given by three IB self-consistent equations:

p(t|x) = \frac{p(t)}{Z(x;\beta)} \exp\{-\beta D_{\mathrm{KL}}(p(y|x) \,\|\, p(y|t))\},    (8)

p(t) = \sum_{x} p(t|x) p(x),    (9)

p(y|t) = \sum_{x} p(y|x) p(x|t),    (10)

where Z(x;β) is the normalization function.
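As a concrete illustration (our addition, not part of the original method), Equations (8)–(10) can be iterated alternately for a discrete joint distribution p(x,y). The sketch below assumes NumPy, a given joint matrix p_xy, and an illustrative cardinality n_t for T.

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) over the last axis, with safe handling of zeros."""
    mask = p > 0
    return np.sum(np.where(mask, p * np.log(np.maximum(p, 1e-12) / np.maximum(q, 1e-12)), 0.0), axis=-1)

def iterative_ib(p_xy, n_t, beta, n_iter=200, seed=0):
    """Alternate Eqs. (8)-(10) for a discrete p(x, y); returns the encoder p(t|x)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                                  # p(x)
    p_y_given_x = p_xy / p_x[:, None]                       # p(y|x)
    p_t_given_x = rng.dirichlet(np.ones(n_t), size=n_x)     # random initial encoder
    for _ in range(n_iter):
        p_t = p_t_given_x.T @ p_x                           # Eq. (9)
        p_x_given_t = (p_t_given_x * p_x[:, None]).T / np.maximum(p_t, 1e-12)[:, None]
        p_y_given_t = p_x_given_t @ p_y_given_x             # Eq. (10)
        # Eq. (8): p(t|x) is proportional to p(t) * exp(-beta * KL(p(y|x) || p(y|t)))
        d = np.array([[kl(p_y_given_x[i], p_y_given_t[t]) for t in range(n_t)] for i in range(n_x)])
        logits = np.log(np.maximum(p_t, 1e-12))[None, :] - beta * d
        p_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x
```

Larger β preserves more of I(T;Y) at the cost of weaker compression, which is exactly the trade-off discussed next.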

From (7), a good representation is a trade-off between I(X;T) and I(T;Y). These two terms are essential for analyzing neural networks. In the next section, we show how to estimate mutual information in neural networks approximately.

2.3. Calculating Mutual Information in DNNs

In Section 2.2, we emphasized that I(X;T) and I(T;Y) are important for analyzing DNNs (a type of representation learning model). Next, we show step by step how to calculate the mutual information I(X;T) and I(T;Y) in neural networks.

First, Figure 2 shows a general network structure. Y → X → T forms a Markov chain. The number of neurons in T equals the number of classes C in the training set. Notice that the outputs of T are the scores of the different classes and are unbounded. To better calculate the mutual information I(X;T) and I(T;Y), a normalized exponential function is employed to squash a C-dimensional vector z of arbitrary real values into a C-dimensional vector σ(z) of real values in the range [0, 1] that add up to one. The normalization function is written as:

\sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{C} e^{z_i}} \quad \text{for } j = 1, \ldots, C,    (11)

which is of the same form as the softmax function in the neural network. Now, in the layer T, the value of each neuron ranges from 0 to 1. Then, we bin each neuron's output into 10 equal intervals between zero and one and obtain our final model output T, which is the discretization process shown in Figure 2.

[Figure 2 diagram: label Y, input X, and a CNN architecture producing the output T, followed by discretization.]

Figure 2. This figure, from [15], depicts a general neural network structure. The blue circles represent the softmax probability. Discretization is performed on the output of this softmax function.

Second, from (1), one needs the joint distribution of the two random variables to estimate their mutual information. Suppose we want to estimate I(X;T) from a mini-batch (containing n samples) in the image classification task. Since every image is different from the other images, for the ith input image x_i it is common to let p(x_i) = 1/n. When the mutual information is calculated at each epoch, the network's parameters θ are fixed, and t_i is generated by feeding x_i into the deterministic network; we can write this as t_i = f(x_i, θ). In this case, p(t_i|x_i) = 1. There are multiple data points; thus, one can take X as a random variable (over the data), but not θ at this epoch, since it is fixed. Thus, T is a deterministic function of X, and therefore H(T|X) = 0. At the next epoch, the network's parameters θ are updated, and we repeat this calculation. Now, due to binning, we have:

I(X;T) = H(T) - H(T|X) = H(T) = \sum_{t} p(t) \log \frac{1}{p(t)} = \sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{p(t_i)}.    (12)

From (12), I(X;T) is related to the distribution of T. This does not mean that calculating I(X;T) does not need X, since T is generated by feeding X into the network. Suppose that in the mini-batch, after binning, each value t_i is different from the others; then I(X;T) achieves its maximum, log n. Suppose some t_i's have the same value; then I(X;T) becomes lower. Furthermore, (12) shows that I(X;T) is not related to the dimension of the images. Thus, for high-dimensional images, we can still estimate mutual information this way (usually, the number of classes is much smaller than the dimension of the image).

However, there are several points worth noticing:

• We make the hypothesis that in the mini-batch, each image is different from the others. Actually, some images may be the same. However, this does not affect the expression (12), since identical images must produce identical outputs; only the probability mass function of T might change.

• The mutual information we estimate may differ from the ground truth, since we perform discretization on the layer T. Thus, for a fair comparison between different data with the same number of classes, we need to use the same discretization interval.

• If the number of neurons in the layer T is large, the mutual information I(X;T) barely changes, since in this case the sample space of T is huge even if we decrease the number of intervals. As a result, X → T is approximately a bijection (p(x|t) is almost deterministic). Thus, I(X;T) ≈ H(X) from (1) and (3).

From the above analysis, the dimension of T cannot be huge. Since the number of neurons in the layer T equals the number of classes in the training set, our method is limited by the number of classes. For cases with a large number of classes, we investigated clustering methods to estimate the probability mass functions. We provide an API at https://github.com/piratehao/API-of-estimating-mutual-information-in-networks.
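A minimal sketch of this estimation procedure (our addition; the 10-bin discretization follows the description above, while the function and array names are illustrative) is given below, assuming NumPy, a matrix of softmax outputs for one batch, and the corresponding integer labels.

```python
import numpy as np
from collections import Counter

def estimate_plane_point(softmax_out, labels, n_bins=10):
    """Estimate (I(X;T), I(T;Y)) in bits from one batch.

    softmax_out: (n, C) softmax probabilities of the model output T.
    labels:      (n,) integer class labels Y.
    """
    n = softmax_out.shape[0]
    # Discretize each neuron's output into n_bins equal intervals on [0, 1].
    binned = np.floor(softmax_out * n_bins).clip(0, n_bins - 1).astype(int)
    t_ids = [tuple(row) for row in binned]          # discrete value of T per sample

    def entropy(values):
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    h_t = entropy(t_ids)
    i_xt = h_t                                      # I(X;T) = H(T), since H(T|X) = 0 (Eq. (12))
    # I(T;Y) = H(T) - H(T|Y), with H(T|Y) averaged over the label groups.
    h_t_given_y = 0.0
    for y in np.unique(labels):
        idx = np.where(labels == y)[0]
        h_t_given_y += (len(idx) / n) * entropy([t_ids[i] for i in idx])
    return i_xt, h_t - h_t_given_y                  # (I(X;T), I(T;Y))
```

Calling this estimator on the validation set after each epoch yields one point of the mutual information path used throughout the next section.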

3. Main Results

This section is structured as follows. In Section 3.1, we explore the relationship among the model accuracy, I(X;T), and I(T;Y). In Section 3.2, we introduce our information plane-based framework and utilize it to compare the differences between FCs and CNNs in the information plane. In Section 3.3, we compare the evaluation framework with loss curves and show that our framework is more informative when evaluating neural networks. In Section 3.4, we evaluate modern popular CNNs in the information plane. In Section 3.5, we show that the information plane can be used to evaluate DNNs not only on a whole dataset, but also on each single class of the dataset in image classification tasks. In Section 3.6, we use IB to evaluate DNNs for image classification problems with an unbalanced data distribution. In Section 3.7, we analyze the efficiency of transfer learning in the information plane. In Section 3.8, we compare different optimization algorithms of neural networks in the information plane.


3.1. Relationship among Model Accuracy, I(X;T), and I(T;Y)

Accuracy is a common "indicator" of a model's quality. Since we are to use mutual information to evaluate neural networks, it is necessary to investigate the relationship among I(X;T), I(T;Y), and the accuracy. The work in [22] studied the relationship between mutual information and accuracy and stated that I(T;Y) explains the training accuracy, while I(X;T) serves as a regularization term that controls the generalization. From our experiments, however, we found that in deep neural networks, when I(T;Y) is fixed, there is also a correlation between I(X;T) and the training accuracy, where X represents the training data. Furthermore, the correlation grew stronger as the network got deeper.

To validate the hypothesis experimentally, we need to sample values of the training accuracy, I(X;T), and I(T;Y). The method of estimating mutual information was discussed in Section 2.3. The network and dataset we chose were VGG-16 and CIFAR-10, respectively. The sampling was performed after every fixed number of iteration steps during training. We first plotted I(X;T), I(T;Y), and the training accuracy of the sampled data in 3D space, as shown in Figure 3.

[Figure 3: 3D scatter plot of the sampled data, with axes I(X;T), I(T;Y), and accuracy.]

Figure 3. Visualization of the sampled data in 3D space.

From Figure 3, we can see that the data did not fill the whole space; they came from a narrow region of the 3D space. Furthermore, there existed two points that had similar I(T;Y), where the point with the smaller I(X;T) had a higher training accuracy. To verify the correlation between I(X;T) and the training accuracy quantitatively, we used the method of "checking inversions".

First, we use I(X;T)_i, I(T;Y)_i, and Acc_i to denote the values of the ith sampling result, which corresponds to a blue circle in Figure 3. Suppose there exists a sample pair (i, j) such that I(T;Y)_i = I(T;Y)_j; then we can directly check the relationship between I(X;T) and the training accuracy. However, each I(T;Y)_i is a real number, and it is difficult to find samples that have exactly the same value of I(T;Y). Thus, we verify the hypothesis by checking inversions.

An inversion is defined as a pair of samples (i, j) that satisfies I(T;Y)_i < I(T;Y)_j and Acc_i > Acc_j. A "satisfied inversion" is defined as an inversion for which additionally I(X;T)_i < I(X;T)_j. The "percentage", defined as:

\mathrm{percentage} = \frac{\#\,\text{satisfied inversions}}{\#\,\text{all inversions}},    (13)

is a proper indicator of the validity of our hypothesis that low I(X;T) also contributes to the training accuracy. Consider two cases:

training accuracy. Consider two cases:

•

The percentage is near 0.5. This means that almost half of the inversion pairs satisfy

I(X;T)i<I(X;T)j

and

Acci>Accj

, and the other inversion pairs satisfy

I(X

;

T)i<I(X

;

T)j

and

Acci<Accj. Thus, I(X;T)almost has no relation to the training accuracy.

•

The percentage is high. This means that a large number of inversion pairs satisfy

I(X;T)i<I(X;T)jand Acci>Accj. Thus, low I(X;T)also contributes to the training accuracy.
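A minimal sketch of this inversion-counting procedure (our addition; names are illustrative) is shown below, taking the sampled sequences of I(X;T), I(T;Y), and accuracy as NumPy arrays.

```python
import numpy as np

def satisfied_inversion_percentage(i_xt, i_ty, acc):
    """Fraction of inversions (Eq. (13)) in which the lower I(X;T) goes with the higher accuracy."""
    n = len(acc)
    inversions = satisfied = 0
    for i in range(n):
        for j in range(n):
            if i_ty[i] < i_ty[j] and acc[i] > acc[j]:   # inversion: I(T;Y) ordering disagrees with accuracy
                inversions += 1
                if i_xt[i] < i_xt[j]:                   # satisfied: lower I(X;T) with higher accuracy
                    satisfied += 1
    return satisfied / inversions if inversions else float("nan")
```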

We performed experiments under different training conditions to train the neural networks.

Table 1, from [15], records percentages with different training conditions.

Table 1. In this table (from [15]), we include 600 samples when performing the experiments. The percentage has already converged at 600 samples. SGD represents stochastic gradient descent, which means that we stochastically selected a mini-batch from all training samples each time for training the network. BGD represents batch gradient descent, which means we selected all the training samples for training the network. For an equal comparison, BGD and SGD used the same training set. CNN-9 is a convolutional neural network with 9 hidden layers. The linear network is a fully-connected neural network whose activation function is the identity mapping.

Network Structure   Training Method   Percentage with 600 Samples
CNN-9               SGD               0.865
CNN-9               BGD               0.821
Linear Network      SGD               0.755
Linear Network      BGD               0.594

Table 1 shows that under the various condition settings, the percentages were all over 0.5, which indicates that I(X;T) also contributed to the training accuracy. Furthermore, different training schemes (SGD, batch gradient descent (BGD)) and network structures (FC, CNN) may lead to different percentages. We explain these phenomena as follows:

• Even though I(T;Y) represents the correlation between the model output and the label, I(T;Y) is not a monotonic function of the training accuracy. Suppose there are C classes in the image dataset, and let C_i be the ith class. Consider two cases. In the first case, T = σ(Y), where σ is the identity mapping, which indicates that the model output T always equals the true class. In the other case, T = ϕ(Y), where ϕ is a shift mapping, which indicates that if the true class is C_i, the prediction of T is C_{i+1}. In both cases, since σ and ϕ are invertible functions, from (4) we have I(T;Y) = I(σ(Y);Y) = I(ϕ(Y);Y) = H(Y). However, in the first case the training accuracy is one, whereas in the other case it is zero.

• The activation function of the linear network is the identity mapping, so the loss function of the linear network is convex, whereas the loss function of a convolutional neural network is highly non-convex. When using BGD to optimize a convex function with a proper learning rate, the training loss with respect to all the training data always decreases (the model is most stable in this case), which indicates that T gradually moves closer to Y during training. Thus, I(T;Y) can fully explain the training accuracy, and I(X;T) may not contribute greatly to it. However, when the loss function is non-convex or the training scheme is SGD, the loss with respect to all the training data does not decrease all the time (the network is sometimes learning in the wrong direction). Thus, I(T;Y) cannot fully explain the training accuracy for SGD and convolutional neural networks.

• The work in [29] used I(X;T) to represent the stability of the learning algorithm. A model with low I(X;T) tends to be more stable on the training data. Thus, when I(T;Y)_i and I(T;Y)_j are equal, the model with the lower I(X;T) may achieve a higher training accuracy.


Another interesting phenomenon is that there was a correlation between the percentage and the number of layers of convolutional neural networks. Table 2, from [15], shows that the percentage rose as we increased the number of layers. This result may relate to some inherent properties of CNNs, which is worth investigating in future work.

Table 2. This table (from [15]) lists the percentages for convolutional neural networks with varying numbers of hidden layers. CNN-i is a deep convolutional neural network with i convolutional layers.

Network Structure             CNN-2   CNN-4   CNN-9   CNN-16
Percentage with 600 samples   0.56    0.68    0.87    0.96

Compared to the training accuracy, we are more interested in the validation accuracy. Thus, we also validated our hypothesis on the validation data, where X represents the validation input. Table 3, from [15], shows that low I(X;T) also contributed to the validation accuracy. This observation will form the basis of our evaluation framework proposed in the next section.

Table 3. This table (from [15]) lists the percentages with the numbers of samples on the validation set.

The training scheme and the network structure were SGD and VGG-16, respectively.

Number of Samples 100 200 300 400 500 600

percentage 0.905 0.921 0.912 0.924 0.924 0.924

3.2. Evaluating DNNs in the Information Plane

In Section 2.2, we explained that I(X;T) and I(T;Y) are essential for analyzing neural networks. The neural network training process can be regarded as finding a trade-off between I(X;T) and I(T;Y). Furthermore, in Section 3.1, the experiments showed that not only high I(T;Y) but also low I(X;T) contributed to the validation accuracy, where X represents the validation input. Since I(X;T) and I(T;Y) are both indicators that reflect the model's accuracy, we use ΔI(T;Y)/ΔI(X;T) to represent the model's learning capability in the information plane. Notice that ΔI(T;Y)/ΔI(X;T) is exactly the slope of the mutual information curve; thus, it represents the model's learning capability at each moment.

Figure 1, from [15], shows that a mutual information path contains two learning stages. The first fitting phase takes only a little time compared to the whole learning process, and the model begins to generalize in the second compression phase. Thus, we only use ΔI(T;Y)/ΔI(X;T) to represent the model's learning capability in the second phase: a better model is expected to have a smaller (more negative) ΔI(T;Y)/ΔI(X;T) in the second phase. In the first fitting phase, I(T;Y) of the model increases (the model needs to remember the training data in order to fit the label), so we only use I(T;Y) to represent the model's capability of fitting the label. Based on the analysis above, we propose our information plane-based framework in Figure 4.


[Figure 4 schematic: the information plane with axes I(X;T) and I(T;Y); the transition point separates the region of fitting labels from the region of generalization.]

Figure 4. This figure (from [15]) shows the information plane-based framework. I(T;Y) before the transition point represents the model's capability of fitting the label; ΔI(T;Y)/ΔI(X;T) after the transition point represents the model's capability of generalization.
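The two quantities of this framework can be read off a recorded mutual information path. The sketch below (our addition, with illustrative names) detects the transition point as the epoch at which I(X;T) first starts to decrease and measures the slope after it.

```python
import numpy as np

def transition_index(i_xt):
    """Index of the transition point: the first epoch after which I(X;T) starts to decrease."""
    i_xt = np.asarray(i_xt)
    drops = np.where(np.diff(i_xt) < 0)[0]
    return int(drops[0]) if len(drops) else len(i_xt) - 1   # no compression phase observed yet

def framework_scores(i_xt, i_ty):
    """Return (fitting score, generalization score) for one mutual information path.

    Fitting score: I(T;Y) at the transition point.
    Generalization score: average slope dI(T;Y)/dI(X;T) after the transition point
    (more negative means better generalization in the compression phase).
    """
    k = transition_index(i_xt)
    if k >= len(i_xt) - 1:
        return i_ty[k], None                                 # second phase not visible
    slope = (i_ty[-1] - i_ty[k]) / (i_xt[-1] - i_xt[k])
    return i_ty[k], slope
```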

To further explore how different datasets or network structures influence the mutual information curves in the information plane, we performed an experiment on two datasets: MNIST and CIFAR-10. The network structures included fully-connected neural networks and convolutional neural networks with various numbers of layers. Notice that for the subsequent experiments, X in I(X;T) represents the validation input, since we care about the validation accuracy rather than the training accuracy. Furthermore, we smoothed the mutual information curves in the information plane for simplicity and better visualization. Table 4 and Figure 5, from [15], show the experimental results.
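How such curves can be recorded is sketched below (our addition, assuming PyTorch and the estimate_plane_point helper sketched in Section 2.3; the model and loader names are illustrative): one point of the path is computed on the validation set after each training epoch, and the path is then drawn in the information plane.

```python
import numpy as np
import torch

@torch.no_grad()
def information_plane_point(model, val_loader, device="cpu"):
    """One (I(X;T), I(T;Y)) point on the validation set, via estimate_plane_point from Section 2.3."""
    model.eval()
    probs, labels = [], []
    for x, y in val_loader:
        probs.append(torch.softmax(model(x.to(device)), dim=1).cpu().numpy())
        labels.append(y.numpy())
    return estimate_plane_point(np.concatenate(probs), np.concatenate(labels))

# Illustrative training loop: record one point per epoch, then plot the path, e.g.
# path = []
# for epoch in range(n_epochs):
#     train_one_epoch(model, train_loader, optimizer)      # assumed training helper
#     path.append(information_plane_point(model, val_loader))
# import matplotlib.pyplot as plt
# i_xt, i_ty = zip(*path)
# plt.plot(i_xt, i_ty, marker="o"); plt.xlabel("I(X; T)"); plt.ylabel("I(T; Y)"); plt.show()
```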

[Figure 5 plots: information plane paths (I(X;T) versus I(T;Y)) on MNIST (left) and CIFAR-10 (right) for FC-3, FC-6, FC-9, CNN-2, CNN-4, and CNN-6.]

Figure 5. In these two figures (from [15]), FC-i is a fully-connected neural network with i hidden layers, and CNN-i is a convolutional neural network with i convolutional layers.


Table 4. This table (from [15]) lists I(T;Y), I(X;T), training epochs, and validation accuracies of each network from Figure 5. The transition points of FC-3, CNN-2, and CNN-4 are just their convergence points, since they did not show the second phase. Each cell gives I(T;Y) / I(X;T) / Epochs / Accuracy.

Dataset    Model   Transition Point              Convergence Point
MNIST      FC-3    2.96 / 7.183 / 1 / 0.836      3.259 / 4.358 / 51 / 0.983
MNIST      FC-6    2.962 / 7.532 / 1 / 0.846     3.249 / 3.746 / 56 / 0.988
MNIST      FC-9    2.803 / 7.166 / 1 / 0.774     3.214 / 3.647 / 54 / 0.988
MNIST      CNN-2   2.952 / 7.898 / 1 / 0.75      3.282 / 3.916 / 50 / 0.99
MNIST      CNN-4   2.286 / 7.683 / 1 / 0.451     3.284 / 3.621 / 53 / 0.994
MNIST      CNN-6   2.236 / 6.184 / 1 / 0.515     3.275 / 3.592 / 54 / 0.994
CIFAR-10   FC-3    2.671 / 10.085 / 65 / 0.534   2.671 / 10.085 / 65 / 0.534
CIFAR-10   FC-6    2.604 / 9.321 / 20 / 0.537    2.218 / 7.197 / 66 / 0.575
CIFAR-10   FC-9    2.55 / 9.02 / 21 / 0.555      2.218 / 7.197 / 66 / 0.56
CIFAR-10   CNN-2   1.816 / 8.133 / 63 / 0.451    1.816 / 8.133 / 63 / 0.451
CIFAR-10   CNN-4   2.840 / 8.761 / 67 / 0.705    2.840 / 8.761 / 67 / 0.705
CIFAR-10   CNN-6   2.301 / 8.891 / 5 / 0.52      2.472 / 4.862 / 66 / 0.781

There are some interesting observations from Figure 5:

•

Not all networks had exactly the second compression phase. For the MNIST experiment, all

networks had two learning stages. However, for CIFAR-10, the networks with fewer hidden

layers (FC-3, CNN-2, CNN-4) did not show the second phase. The reason is that CIFAR-10 is a

more difﬁcult dataset for the network to classify. The network with fewer hidden layers may not

have enough generalization capability.

• Convolutional neural networks had a better generalization capability than fully-connected neural networks, as observed from ΔI(T;Y)/ΔI(X;T) in the second phase; this led to higher validation accuracy. However, I(T;Y) of the convolutional neural networks was sometimes lower than that of the fully-connected neural networks at the transition point, which indicates that fully-connected networks may have a better capability of fitting the labels than convolutional neural networks in the first phase. The fact that fully-connected neural networks have a large number of weights may contribute to this.

• Not all networks have an increasing I(T;Y) in the second phase. For CIFAR-10, I(T;Y) of FC-6 and FC-9 dropped in the second phase. One possible reason is that FCs with more layers may over-fit the training data.

3.3. Informativeness and Guidance of the Information Plane

There does not exist a network that is optimal for all problems (datasets). Usually, researchers

have to test many neural network structures on a speciﬁc dataset. The network is chosen by comparing

the ﬁnal validation accuracy of each network. This process is time consuming since we have to train

each network until convergence. In this section, we will show that the information plane could ease

this searching process and facilitate a quick model selection of neural networks.

In Section 3.2, we showed that ΔI(T;Y)/ΔI(X;T) represents the capability of generalization at each moment in the second compression phase. Thus, a direct way to select a model is to compare each model's ΔI(T;Y)/ΔI(X;T) at the beginning of the compression phase. Since the first fitting phase takes only a little time, we can select a better model in a short time. Figure 6 and Table 5, from [15], show the mutual information paths and validation accuracies of different networks on the CIFAR-10 dataset.


[Figure 6 plots: (a) information plane paths of CNN-2, CNN-4, CNN-9, and CNN-16 on the CIFAR-10 training set; (b) their paths on the validation set; (c) their training losses versus training iterations.]

Figure 6. In these three figures (from [15]): (a) mutual information paths of neural networks on the training set of CIFAR-10; (b) mutual information paths of neural networks on the validation set; (c) training losses of neural networks over the training iterations.

Table 5. This table (from [15]) lists the percentages of each network from Table 2.

Network Structure               CNN-2   CNN-4   CNN-9   CNN-16
Percentage with 600 samples     0.56    0.68    0.87    0.96
Final acc. on validation set    0.45    0.70    0.77    0.89

We can gain some information from Figure 6 and Table 5:

• ΔI(T;Y)/ΔI(X;T) can be used to select a better network. Compare CNN-9 and CNN-16 in Figure 6b: the slope of the mutual information curve of CNN-16 was smaller (more negative) than that of CNN-9, which represents a better generalization capability. The final validation accuracies in Table 5 are consistent with this analysis. Thus, for a specific problem, we can visualize each model's mutual information curve on the validation data to select a better model quickly.

•

The information plane is more informative than the loss curve. By comparing Figure 6c with

Figure 6a,b, we can see that the training loss of each model continued to decrease with training

steps. However, the mutual information curves behaved differently. The model with fewer layers

did not clearly show the second compression phase. Thus, the information plane could reveal

more information about the network.

• I(X;T) mostly contributed to the accuracy in the second phase. We record the percentage of each network in Table 5. From Table 5, when the network had fewer layers, the mutual information path did not clearly show the second phase, and the percentage was low; whereas for the networks that had the compression phase, the percentage was high. One possible reason is that in the second phase, the model learns to generalize (extract common features from each mini-batch). The values of the percentages indicate that compression happened even when I(T;Y) remained the same.

We can view I(X;T) and I(T;Y) as follows: I(T;Y) determines how much knowledge T has about the label Y, and I(X;T) determines how easily this knowledge can be learned by the network.

3.4. Evaluating Popular CNNs in the Information Plane

The architecture of neural networks has undergone substantial development. From AlexNet [1] and VGG [7] to ResNet [9] and DenseNet [30], researchers have made great efforts in searching for efficient networks. In this section, we visualize these popular networks in the information plane. Figure 7 and Table 6 show the information curve, training epochs, and accuracy of each network structure on the CIFAR-10 dataset.


[Figure 7 plot: information plane paths of AlexNet, VGG-16, ResNet-50, and DenseNet-121 on CIFAR-10, with transition and convergence points marked.]

Figure 7.

Mutual information paths of different network architectures on the CIFAR-10 dataset. For each

mutual information path, training time spent on the dotted line was three-times longer than the time

spent on the solid line.

Table 6. This table records I(X;T), I(T;Y), training epochs, and model accuracy at the transition point and at the convergence point for every neural network. Each cell gives I(T;Y) / I(X;T) / Epochs / Accuracy.

Model          Transition Point            Convergence Point
AlexNet        2.496 / 9.476 / 5 / 0.524   2.339 / 5.907 / 62 / 0.673
VGG-16         2.262 / 8.862 / 7 / 0.451   2.733 / 4.682 / 69 / 0.839
ResNet-50      2.387 / 9.389 / 2 / 0.541   2.989 / 4.956 / 55 / 0.877
DenseNet-121   2.702 / 8.965 / 1 / 0.604   2.96 / 4.492 / 51 / 0.902

Looking only at the solid lines in Figure 7, we may infer that, when classifying the CIFAR-10 dataset, AlexNet was the worst neural network among all the models: it has a low capability of fitting the labels, and after the transition point, I(T;Y) of the model even dropped, indicating a loss of information about the labels. VGG had a stronger capability of fitting labels than ResNet, but its generalization capability was relatively lower; from the trend of the curves, I(T;Y) of ResNet may become larger than that of VGG with further training. DenseNet was better than all the other models in both learning phases. The model accuracies in Table 6 are consistent with our analysis.

Here, we emphasize that our prediction may not always be true, since the mutual information path may show a larger slope change later in training (see ResNet in Figure 7). Thus, there exists a trade-off between the training time and the confidence of our prediction: we can make a more confident prediction by training the network for a longer time. Nevertheless, it is still an efficient way to guide the choice of a better network that leads to a high validation accuracy. Table 6 also shows that with more convolutional layers, the model reaches the transition point more quickly. This means that for many deep networks, we can predict the networks' quality and choose a better model in a short time.

3.5. Evaluating DNN’s Capability of Recognizing Objects from Different Classes

The previous sections evaluated neural networks on a whole dataset. However, this evaluation

can only test the performance of the network on all the classes. In this section, we will show how to

evaluate DNNs on a single class in the information plane when performing image classiﬁcation tasks.

Suppose there are C classes in the dataset and C_i denotes the ith class. To evaluate networks on the ith class, we label all the other classes as one class; thus, the dimension of the label Y changes from C to two. We also make the label Y balanced when calculating mutual information, so that H(Y) is equal to one. This process is similar to one-vs.-all classification [31]. However, instead of concentrating on the accuracy, we pay attention to the mutual information. Furthermore, notice that we do not change the network structure; we only perform pre-processing on the validation data. The training scheme and other conditions do not change during this process. We selected three classes (airplane, automobile, bird) of CIFAR-10 to perform the experiment. Figure 8, from [15], compares the performance of two networks on these three classes. Figure 9, from [15], shows the mutual information curve of each class in the information plane.
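A minimal sketch of this per-class relabeling (our addition; the names are illustrative) is shown below: the labels are mapped to a binary target for class i, and the evaluation set is balanced so that H(Y) equals one bit.

```python
import numpy as np

def one_vs_all_balanced(labels, outputs, target_class, seed=0):
    """Relabel to {target class, rest} and subsample the 'rest' group so the two labels are balanced."""
    rng = np.random.default_rng(seed)
    y_bin = (labels == target_class).astype(int)          # 1 for the target class, 0 for all others
    pos = np.where(y_bin == 1)[0]
    neg = rng.choice(np.where(y_bin == 0)[0], size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])                      # balanced subset: H(Y) = 1 bit
    return outputs[idx], y_bin[idx]

# Usage with the estimator sketched in Section 2.3, e.g. for class 0 (airplane):
# probs_c, y_c = one_vs_all_balanced(val_labels, val_softmax, target_class=0)
# i_xt_c, i_ty_c = estimate_plane_point(probs_c, y_c)
```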

By comparing the mutual information curves of automobile and airplane, we find that they had almost the same value of I(T;Y) at the transition point. However, the slope for automobile was smaller than that for airplane; thus, the final validation accuracy of automobile should be higher than that of airplane. By comparing the mutual information curves of airplane and bird, we find that they had almost the same slope in the second phase. However, I(T;Y) of airplane was higher than that of bird; thus, the final validation accuracy of airplane should be higher than that of bird. The true validation accuracies were 0.921 (airplane), 0.961 (automobile), and 0.825 (bird), which are consistent with our analysis.

[Figure 8 bar charts: I(X;T), I(T;Y), and accuracy for the classes airplane, automobile, and bird, for AlexNet (left) and VGG-16 (right).]

Figure 8. This figure (from [15]) lists I(X;T), I(T;Y), and accuracies for the well-trained AlexNet and VGG-16. The accuracy is defined as the percentage of samples that are correctly predicted out of all samples belonging to that class. For better visualization, we divided I(X;T) by its upper bound H(X) so that I(X;T), I(T;Y), and the validation accuracy have the same magnitude.

[Figure 9 plot: information plane paths for the classes airplane, automobile, and bird, with transition and convergence points marked.]

Figure 9. This figure (from [15]) shows the mutual information curves of each class of the CIFAR-10 dataset based on VGG-16.


Furthermore, we examined the two models on the same classes. From Figure 8, VGG-16 had a better performance than AlexNet on each class in terms of I(X;T), I(T;Y), and validation accuracy. Thus, the information plane also provides a way to analyze the performance of a network on each class in the image classification task.

3.6. Evaluating DNNs for Image Classiﬁcation with an Unbalanced Data Distribution

In a multi-class classification problem, each class may have a different number of samples. Suppose we want our model to have a balanced classification capability; then we need to control the number of training samples for each class. We suggest that the information plane can help do this quickly.

In CIFAR-10, each class has 5000 training samples. For simplicity and better presentation, in this experiment we chose the classes "automobile" and "bird" and required the model to have a balanced classification capability for these two classes. We fixed the number of training samples for bird and performed the experiments with varying numbers of samples for automobile (a sketch of this subsampling is given below). The results are shown in Figure 10 and Table 7.
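A minimal sketch of building such an unbalanced two-class training subset (our addition, assuming torchvision's CIFAR-10 dataset and its standard class indices, 1 = automobile and 2 = bird) is:

```python
import numpy as np
from torch.utils.data import Subset
from torchvision import transforms
from torchvision.datasets import CIFAR10

def unbalanced_two_class_subset(n_auto, n_bird=5000, root="./data", seed=0):
    """Training subset with n_auto automobile images and n_bird bird images."""
    rng = np.random.default_rng(seed)
    train = CIFAR10(root, train=True, download=True, transform=transforms.ToTensor())
    targets = np.array(train.targets)
    auto_idx = rng.choice(np.where(targets == 1)[0], size=n_auto, replace=False)  # class 1: automobile
    bird_idx = rng.choice(np.where(targets == 2)[0], size=n_bird, replace=False)  # class 2: bird
    return Subset(train, np.concatenate([auto_idx, bird_idx]).tolist())

# e.g. the four splits used in Figure 10:
# subsets = {n: unbalanced_two_class_subset(n) for n in (200, 500, 1500, 3000)}
```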

From Figure 10a, when there were only 200 samples of automobile, I(X;T) of automobile even decreased in the second stage; in this case, the network may just over-fit the small-sample noise. From Figure 10b,c, the slope for automobile in the second phase was higher (less negative) than that for bird, indicating a lower generalization capability. When automobile:bird = 3:5, from Figure 10d, the generalization capability of automobile was already comparable to that of bird. The final validation accuracies of these two classes in Table 7 are consistent with the generalization capabilities in Figure 10. Since the first phase takes only a little time, we can decide how many samples are needed by checking the slope in the second phase.

We believe this application has huge potential in some serious areas, such as medical science,

especially where people expect computers to automatically classify diseased tissues from samples.

For example, at present, cancer classiﬁcation has two major problems: the healthy and diseased

samples are very unbalanced; the AI system lacks interpretability. Since IB explains the network from

an informative way and our framework shows we could determine the number of samples for each

class in a short time, our framework could be applied to medical science in the future.

[Figure 10 plots: information plane paths of automobile and bird for the four training-sample ratios in panels (a)–(d).]

Figure 10. Mutual information paths of automobile and bird with varying numbers of training samples on the CIFAR-10 dataset. The numbers of automobile and bird samples are (a) 200:5000, (b) 500:5000, (c) 1500:5000, and (d) 3000:5000.

Table 7. Final validation accuracies for automobile and bird with different training samples.

Number of Samples (automobile:bird) 200:5000 500:5000 1500:5000 3000:5000

automobile 0.55 0.68 0.85 0.91

bird 0.90 0.91 0.90 0.90

3.7. Transfer Learning in the Information Plane

Using a pre-trained model instead of training a neural network from scratch has become common practice in recent years. It may greatly reduce the training time and make the model easier to train. In this section, we study why the pre-trained model is better, in the language of the information plane. We used the same neural network structure (ResNet-34) for our base model and the pre-trained model. The pre-trained model was trained on the ImageNet dataset. We resized the images of CIFAR-10 to 224 × 224 to fit the network structure. Figure 11 shows the mutual information paths in the information plane.
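A minimal sketch of this setup (our addition, assuming PyTorch/torchvision with the older pretrained flag; the training hyper-parameters are not shown and are not those of the paper) is:

```python
import torch.nn as nn
from torchvision import models, transforms
from torchvision.datasets import CIFAR10

# CIFAR-10 images resized to 224x224 so that they fit the ResNet-34 input size.
transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_set = CIFAR10("./data", train=True, download=True, transform=transform)

# Base model: randomly initialized ResNet-34; pre-trained model: ImageNet weights.
base_model = models.resnet34(pretrained=False)
pretrained_model = models.resnet34(pretrained=True)
for model in (base_model, pretrained_model):
    model.fc = nn.Linear(model.fc.in_features, 10)   # 10 CIFAR-10 classes
```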

We can see that the two models had very similar curves in the "compression" phase and ended up at the same convergence point. However, in the "fitting" phase, they behaved differently. The pre-trained model had larger I(X;T) and I(T;Y) at the initial time, which we attribute to the similarity between CIFAR-10 and the ImageNet dataset. Thus, the pre-trained model reached the transition point much faster than the base model. After the transition point, since the two models had the same architecture, their generalization capabilities were the same.

Therefore, the information plane reveals that the improvement brought by the pre-trained model happens in the "fitting" phase: it helps the model fit the labels efficiently. However, the pre-trained model and the base model without pre-training may have equal generalization capability. Interestingly, our result is consistent with a recent paper [27], in which Kaiming He et al. found that ImageNet pre-training is not necessary: it can speed up convergence, but does not necessarily improve accuracy unless the target dataset is too small. Differently, we verified this result by utilizing the information plane.


[Figure 11 plot: information plane paths of ResNet-34 and the pre-trained ResNet-34, with transition and convergence points marked.]

Figure 11. Mutual information paths in ResNet and pre-trained ResNet.

3.8. SGD versus Adam in the Information Plane

Choosing an appropriate optimization method is essential in the neural network training process. A straightforward way to optimize the loss of a DNN is via stochastic gradient descent (SGD). However, the loss function of a DNN is highly non-convex, which makes SGD easily stuck in local minima or saddle points. Thus, several advanced optimization methods have been developed in recent years and have become popular. Momentum SGD [32] is a method that helps accelerate SGD in the relevant direction and dampens oscillations. The work in [33] adapted the learning rate and performed smaller updates for parameters, which can improve the robustness of SGD. The work in [34] proposed Adam, an optimization method that has been generally used in deep learning. Adam also keeps an exponentially-decaying average of past gradients, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
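For reference, the two optimizers compared below can be set up as follows (our addition, assuming PyTorch; the stand-in model and the learning rates are illustrative rather than the exact values used in the experiments):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in model; any nn.Module works here

# Stochastic gradient descent with momentum (cf. the settings in Appendix A).
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Adam with its common default learning rate.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```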

However, recently, there have been some criticisms of Adam [28]. It was stated that Adam may have better performance at the beginning of training, but may end up with lower accuracy than SGD. To verify this phenomenon, we visualize SGD and Adam in the information plane in Figure 12.

From Figure 12, we can see that Adam had a better capability of fitting the labels in the first stage, while in the second stage, SGD behaved better than Adam. Since SGD converged to a point with lower I(X;T), this indicates that SGD had a better generalization capability than Adam at the end of training. Our finding is consistent with [28]. The experiments in Sections 3.7 and 3.8 show that the information plane can be used as a tool to analyze various deep learning algorithms.


[Figure 12 plot: information plane paths of SGD and Adam.]

Figure 12. Mutual information paths of SGD and Adam in the information plane.

4. Conclusions and Future Work

Our work aims to bridge the gap between the theoretical understanding of neural networks and their applications by visualizing mutual information. We summarize our work as follows:

• We investigated the relationship among I(X;T), I(T;Y), and accuracy. The experiment in Section 3.1 showed that low I(X;T) and high I(T;Y) both contributed to the validation accuracy when X represented the validation input. Based on this exploration, we proposed a framework that utilizes mutual information to evaluate neural networks. A model with lower I(X;T) and higher I(T;Y) in the second stage of the learning process is more likely to achieve a higher accuracy. The benefit of mutual information is that it provides two terms, I(X;T) and I(T;Y). Compared to the "accuracy", mutual information can better describe the network's representation in terms of the capability of fitting the labels and the capability of generalization.

• There were some interesting phenomena when we compared fully-connected neural networks (FCs) and convolutional neural networks (CNNs) in the information plane. For example, a network may not show the second learning stage in its mutual information path; this happens when a dataset is harder to recognize and the network has fewer layers, and it cannot be seen by simply visualizing the loss curve. Thus, we suggest that the information plane is a more powerful tool than the loss for evaluating neural networks.

• Since a model with lower I(X;T) and higher I(T;Y) has a high probability of achieving a higher accuracy, ΔI(T;Y)/ΔI(X;T) can represent the model's momentary learning capability in the second stage. The first learning stage takes only a little time compared to the whole learning process, so we can compare each model's ΔI(T;Y)/ΔI(X;T) at the beginning of the second stage and select a better one. Thus, the information plane facilitates a quick model selection for a specific problem.

• From Sections 3.5 and 3.6, with some preprocessing techniques on the data, the information plane can also reveal the recognition capability of neural networks on each class. Thus, we can use the information plane to deal with the classification problem with an unbalanced data distribution: by comparing the mutual information curves of each class, we can determine how many samples are needed to train a classifier with balanced recognition capability.

• There are many sub-fields in deep learning research, and we suggest that the information plane is a powerful tool for analyzing particular problems in these sub-fields. For example, weight-transfer learning is often regarded as a technique that improves model performance, while from our visualization in Section 3.7, weight-transfer learning had almost the same generalization capability as the original model without pre-training. This observation is consistent with a recent study [27], but the information plane draws this conclusion in a more informative way. Furthermore, optimization methods are essential in the training of neural networks; we compared SGD and Adam in the information plane in Section 3.8 to show how each optimization method behaved during training.
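As a concrete illustration of the model-selection criterion in the third point, the snippet below compares two candidate models by the change of I(T; Y) per unit of compression of I(X; T) over the first few epochs of the second stage. The recorded paths (`fc6_path`, `cnn6_path`) and the choice of `start` and `window` are hypothetical, and the selection rule encodes one reasonable reading of the criterion rather than the exact procedure used in our experiments.

```python
def stage2_deltas(path, start, window=5):
    """Changes in I(X;T) and I(T;Y) over `window` epochs from `start`,
    where `start` marks the (manually identified) onset of the second stage.
    `path` is a list of (I(X;T), I(T;Y)) points, one per epoch."""
    ixt0, ity0 = path[start]
    ixt1, ity1 = path[start + window]
    return ixt1 - ixt0, ity1 - ity0

def selection_score(path, start, window=5):
    """One reading of Delta I(T;Y) / Delta I(X;T): per unit of compression
    (decrease in I(X;T)), prefer the model that gains, or loses least of, I(T;Y)."""
    d_ixt, d_ity = stage2_deltas(path, start, window)
    compression = max(-d_ixt, 1e-8)   # amount by which I(X;T) dropped
    return d_ity / compression

# Hypothetical usage with two recorded mutual-information paths.
candidates = {"FC-6": fc6_path, "CNN-6": cnn6_path}
best = max(candidates, key=lambda name: selection_score(candidates[name], start=20))
```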

There are some future directions based on our current work:

• The mutual information terms I(X; T) and I(T; Y) are estimated from the last layer of CNNs. One future work is to develop techniques to estimate the information of each kernel in hidden layers with regard to the input and output. This process may help us determine how many hidden layers or how many kernels are needed when designing the network structure. Furthermore, the visualization of hidden units may facilitate a better understanding of neural networks.

• We used the binning technique to estimate the mutual information. This estimation method has limitations, since the bin size and the dimension of the layer may affect the value of the mutual information (a minimal version of this estimator is sketched after this list). Other estimation techniques, such as the K-nearest-neighbor estimator [35] and the kernel-density-based estimator [36], are also affected by the dimension of the layer. A possible future work is to develop more efficient estimation techniques that stabilize the calculation of mutual information.
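For reference, a minimal binning estimator of the kind discussed above can be written as follows. The function names and the choice of 30 equal-width bins are illustrative and not taken from our implementation; changing the bin count directly changes the estimated values, which is exactly the sensitivity noted in the last bullet.

```python
import numpy as np
from collections import Counter

def discretize(t, n_bins=30):
    """Bin each activation dimension of T into equal-width bins and map every
    row (one sample's output) to a single discrete symbol: a tuple of bin indices."""
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    binned = np.digitize(t, edges[1:-1])          # per-dimension bin index in [0, n_bins-1]
    return [tuple(row) for row in binned]

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits from two aligned lists of discrete symbols."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Illustrative use (assuming T is an array of network outputs on the validation
# set and `labels` the class indices; each validation sample is its own X symbol):
# I_TY = mutual_information(discretize(T), [int(y) for y in labels])
# I_XT = mutual_information(list(range(len(T))), discretize(T))
```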

Author Contributions: H.C. proposed the main idea and wrote the manuscript. D.L. performed the experiments to validate the results. Y.G. and S.G. reviewed and edited the manuscript. All authors read and approved the manuscript.

Funding: This work is supported by NSFC (No. 61601288 and No. 61502304).

Conﬂicts of Interest: The authors declare no conﬂict of interest.

Appendix A. Experiment Settings

This section introduces the detailed experiment settings used in the paper; a hedged code sketch of the shared optimizer settings is given after the tables.

(1)

In Sections 3.1–3.3, the networks include FCs and CNNs. The setting for each network is listed in

Table A1.

Table A1. Network settings for Sections 3.1–3.3.

Hidden Layers Batch Size Initial lr lr Decay Optimizer Momentum

Linear Network 9 128 0.1 0.1//60 SGD 0.9

FC-3 3 128 0.1 0.1//60 SGD 0.9

FC-6 6 128 0.1 0.1//60 SGD 0.9

FC-9 9 128 0.1 0.1//60 SGD 0.9

CNN-2 2 128 0.1 0.1//60 SGD 0.9

CNN-4 4 128 0.1 0.1//60 SGD 0.9

CNN-6 6 128 0.1 0.1//60 SGD 0.9

CNN-9 9 128 0.1 0.1//60 SGD 0.9

CNN-16 16 128 0.1 0.1//60 SGD 0.9

(2)

In Section 3.4, we visualize popular networks in the information plane. The network settings are

listed in Table A2.

Table A2. Network settings for Section 3.4.

Hidden Layers Batch Size Initial lr lr Decay Optimizer Momentum

AlexNet 4 128 0.1 0.1//60 SGD 0.9

VGG 16 128 0.1 0.1//60 SGD 0.9

ResNet-50 50 128 0.1 0.1//60 SGD 0.9

DenseNet 121 128 0.1 0.1//60 SGD 0.9


(3)

In Sections 3.5 and 3.6, we use AlexNet and VGG to evaluate the classiﬁcation capability of

networks for each class in the image classiﬁcation task. The network settings are listed in

Table A3.

Table A3. Network settings for Sections 3.5 and 3.6.

Hidden Layers Batch Size Initial lr lr Decay Optimizer Momentum

AlexNet 4 128 0.05 0.95 SGD 0.9

VGG 16 128 0.05 0.95 SGD 0.9

(4)

In Section 3.7, we use the information plane to analyze the efﬁciency of transfer learning.

The network was ResNet-34, and its settings are listed in Table A4.

Table A4. Network settings for Section 3.7.

Hidden Layers Batch Size Initial lr lr Decay Optimizer Momentum

ResNet-34 34 128 0.1 0.1//60 SGD 0.9

(5)

In Section 3.8, we use the information plane to analyze different optimization methods in neural

networks. The network was CNN-6, and its settings are listed in Table A5.

Table A5. Network settings for Section 3.8.

Hidden Layers Batch Size Initial lr lr Decay Optimizer Momentum

CNN-6 6 128 0.1 0.1//60 SGD 0.9
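To make the tables concrete, the following PyTorch snippet shows one way the shared optimizer settings (SGD, initial learning rate 0.1, momentum 0.9, batch size 128) could be instantiated. It is a sketch under assumptions: `model` and `train_set` stand for any of the listed networks and the corresponding dataset, the epoch count is illustrative, and the "0.1//60" lr-decay entry is read as multiplying the learning rate by 0.1 every 60 epochs.

```python
import torch
from torch.utils.data import DataLoader

# Hedged sketch of the shared settings in Tables A1, A2, and A4;
# `model` and `train_set` are placeholders for a listed network and its dataset.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# "0.1//60" is interpreted here as: multiply the learning rate by 0.1 every 60 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

num_epochs = 150   # illustrative; the epoch count is not given in the tables
for epoch in range(num_epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```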

References

1.

Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classiﬁcation with deep convolutional neural networks.

In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8

December 2012; pp. 1097–1105.

2.

Seide, F.; Li, G.; Yu, D. Conversational speech transcription using context-dependent deep neural networks.

In Proceedings of the Twelfth Annual Conference of the International Speech Communication Association,

Florence, Italy, 27–31 August 2011.

3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529.

4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015,521, 436–444.

5. Zhang, J.; Zong, C. Deep Neural Networks in Machine Translation: An Overview. IEEE Intell. Syst. 2015, 30, 16–25.

6.

Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.;

Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural

networks and tree search. Nature 2016,529, 484–489.

7.

Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv

2014, arXiv:1409.1556.

8.

Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A uniﬁed embedding for face recognition and clustering.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA,

7–12 June 2015; pp. 815–823.

9.

He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.

10.

Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Object detectors emerge in deep scene cnns. arXiv

2014, arXiv:1412.6856.


11. Lu, Y. Unsupervised learning on neural network outputs: With application in zero-shot learning. arXiv 2015, arXiv:1506.00990.

12.

Aubry, M.; Russell, B.C. Understanding deep features with computer-generated imagery. In Proceedings of

the IEEE International Conference on Computer Vision, Tampa, FL, USA, 5–8 December 2015; pp. 2875–2883.

13.

Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the

European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833.

14.

Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; Torralba, A. Network dissection: Quantifying interpretability of deep

visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

Honolulu, HI, USA, 21–26 July 2017; pp. 6541–6549.

15.

Cheng, H.; Lian, D.; Gao, S.; Geng, Y. Evaluating Capability of Deep Neural Networks for Image

Classiﬁcation via Information Plane. In Proceedings of the European Conference on Computer Vision

(ECCV), Munich, Germany, 8–14 September 2018; pp. 168–182.

16. Vasudevan, S. Dynamic learning rate using Mutual Information. arXiv 2018, arXiv:1805.07249.

17.

Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans.

Neural Netw. 1994,5, 537–550.

18.

Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Trischler, A.; Bengio, Y. Learning deep

representations by mutual information estimation and maximization. arXiv 2018, arXiv:1808.06670.

19. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.

20.

Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach.

Learn. Res. 2005,6, 165–188.

21. Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017,29, 1611–1630.

22.

Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the information bottleneck.

Theor. Comput. Sci. 2010,411, 2696–2711.

23. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410.

24. Nguyen, T.T.; Choi, J. Layer-wise Learning of Stochastic Neural Networks with Information Bottleneck. arXiv 2018, arXiv:1712.01272.

25.

Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation.

IEEE Trans. Pattern Anal. Mach. Intell. 2018,40, 2897–2905.

26. Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810.

27. He, K.; Girshick, R.; Dollár, P. Rethinking ImageNet Pre-training. arXiv 2018, arXiv:1811.08883.

28.

Keskar, N.S.; Socher, R. Improving Generalization Performance by Switching from Adam to SGD. arXiv

2017, arXiv:1712.07628.

29.

Raginsky, M.; Rakhlin, A.; Tsao, M.; Wu, Y.; Xu, A. Information-theoretic analysis of stability and bias of

learning algorithms. In Proceedings of the 2016 IEEE Information Theory Workshop (ITW), Cambridge, UK,

11–14 September 2016; pp. 26–30.

30.

Huang, G.; Liu, Z.; Weinberger, K.Q.; van der Maaten, L. Densely connected convolutional networks. arXiv

2016, arXiv:1608.06993.

31. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.

32. Qian, N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999,12, 145–151.

33.

Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization.

J. Mach. Learn. Res. 2011,12, 2121–2159.

34. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.

35. Gao, W.; Oh, S.; Viswanath, P. Demystifying Fixed k-Nearest Neighbor Information Estimators. IEEE Trans. Inf. Theory 2018, 64, 5629–5661.

36. Kolchinsky, A.; Tracey, B. Estimating mixture entropy with pairwise distances. Entropy 2017,19, 361.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).