
NIF: A Framework for Quantifying Neural Information Flow in Deep Networks

Brian Davis∗, Umang Bhatt∗, Kartikeya Bhardwaj∗, Radu Marculescu, José Moura

Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3890

{briandavis, umang, kbhardwa, radum, moura}@cmu.edu

Abstract

In this paper, we present a new approach to interpreting deep learning models. More precisely, by coupling mutual information with network science, we explore how information flows through feedforward networks. We show that efficiently approximating mutual information via the dual representation of the Kullback-Leibler divergence allows us to create an information measure that quantifies how much information flows between any two neurons of a deep learning model. To that end, we propose NIF, Neural Information Flow, a new metric for codifying information flow which exposes the internals of a deep learning model while providing feature attributions.

Introduction

As deep learning gains popularity, there has been an influx of methods that attempt to explain how deep learning begets its predictive power. Most approaches to interpreting deep learning models are model-agnostic and make local approximations in the feature-space region around the datapoints to be explained (Ribeiro, Singh, and Guestrin 2016). However, such techniques fail to capture global, model-specific behavior that is crucial to understanding whether the function learned by a deep learning model aligns well with a user's intention. Moreover, current noisy approximations neglect the topological structure of the model used for prediction (Sundararajan, Taly, and Yan 2017).

We note that it is easy to forget the network structure of deep learning models, particularly feedforward models, which resemble directed acyclic graphs. However, understanding the topological structure of different models can not only help decide the architecture best suited for the task at hand, but also help expose the internal interactions between neurons at inference time. While existing interpretability techniques (Chen et al. 2018) shed light on which input features are responsible for a given prediction, prior art still fails to quantify how information flows through a deep network at the neuron level. This prevents answering one of the most fundamental questions in deep learning: How much information flows through a deep network model from input features to each of its intermediate neurons?

∗Equal Contribution
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

To address this question, we consider two types of interpretability notions: (i) model interpretability via attribution to input features, and (ii) network-architecture interpretability with respect to how information flows from neuron to neuron within a given pretrained model. We believe that addressing notion (ii) from a fundamental information-theoretic standpoint will automatically reveal insights about the precise decision-making process followed by the model (i.e., notion (i) above).

To that end, using an information-theoretic measure, we propose to model the flow of information via Neural Information Flow (NIF) between neurons in consecutive layers to expose how simple deep learning models can learn complex functions of their input features. We further analyze this flow of information between neurons from a network science (Barabási and Bonabeau 2003) perspective, where each neuron in the deep network essentially becomes a node in the network of information flow. Ultimately, NIF can help recover an information-theoretic feature attribution, a ranking of feature importance for a given class.

Combining an information measure with the ability to propagate information through the network can help us visualize the information flow. Feature attributions not only expose which features are important to a model (just like current feature attribution techniques (Ribeiro, Singh, and Guestrin 2016)), but also which information flow paths in the network are crucial to a model's prediction; the latter will allow us to study how information flow is amplified or thwarted when we introduce state-of-the-art deep learning building blocks: shortcuts, residuals, dropout, etc. To the best of our knowledge, we are the first to propose an information- and network-theoretic model for explaining how information flows through a deep learning model while accounting for its network structure.

Background

Network Science

Network science has attracted significant interest in many biological and social science applications. However, to the best of our knowledge, network concepts have not been used to understand the inner workings of deep neural networks. To that end, several ideas from network science can be used to better understand deep network architectures.

Betweenness Centrality

Given a network $G = \{V, E\}$, the betweenness centrality $B(v)$ of a node $v \in V$ is a measure of how central the node is in the network. Specifically, $B(v)$ counts how many shortest paths between different pairs of nodes in the network pass through node $v$. Mathematically, betweenness can be computed as:

$$B(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths between nodes $s, t \in V$, and $\sigma_{st}(v)$ is the number of those shortest paths that pass through $v$.
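As a concrete illustration, betweenness centrality is available in standard graph libraries; below is a minimal sketch using NetworkX (the toy graph is our own illustrative assumption, not one from the paper):

```python
# A minimal sketch, assuming NetworkX; the toy graph below is illustrative.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (2, 4)])

# Maps each node to its (normalized) betweenness centrality B(v).
centrality = nx.betweenness_centrality(G)
print(centrality)  # node 2 lies on the most shortest paths
```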

Community Structure

Communities in a network refer to groups of tightly connected nodes. Intuitively, a group of nodes forms a community if the number of connections within the group is significantly higher than what we would expect at random. Mathematically, communities can be computed by maximizing a modularity function as follows (Newman 2006):

$$\max_{g=\{g_1, g_2, \ldots, g_k\}} \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{1}{\gamma} \cdot \frac{k_i k_j}{2m} \right] \delta(g_i, g_j) \qquad (1)$$

where $m$ is the number of edges, $k_i$ is the degree (number of connections) of node $i$, $A_{ij}$ is the weight of the link between nodes $i$ and $j$, and $\delta$ is the Kronecker delta. The idea is to find groups of tightly connected nodes, $g = \{g_1, g_2, \ldots, g_k\}$, which map the nodes $V$ to $k$ communities. The factor $k_i k_j / 2m$ represents the number of links one would expect in a randomly connected network. Finally, $\gamma$ controls the resolution of the communities: a lower $\gamma$ detects a larger number of smaller communities (i.e., the number of communities $k$ depends on the resolution $\gamma$).
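Equation (1) is typically maximized heuristically; below is a minimal sketch, assuming NetworkX's greedy modularity heuristic as a stand-in for exact maximization. The example graph is illustrative, and note that NetworkX's `resolution` parameter multiplies the null-model term, so its direction may be inverted relative to the convention in Equation (1):

```python
# A minimal sketch, assuming a recent NetworkX whose greedy modularity
# heuristic accepts a resolution parameter; the graph is illustrative.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # a standard toy social network

# `resolution` plays the role of the resolution parameter; NetworkX
# multiplies the null-model term k_i * k_j / 2m by it.
communities = greedy_modularity_communities(G, resolution=1.0)
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```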

Interpretability

Current interpretability techniques fall into two classes. The first class consists of gradient-based methods, which compute the gradient of the output with respect to the input, treating the gradient flow as a saliency map (Sundararajan, Taly, and Yan 2017). The other line of research leverages perturbation-based techniques to approximate a complex model using a locally additive model, thus explaining the difference between a test output-input pair and some reference output-input pair. Lundberg and Lee proposed SHAP, a class of methods that randomly draws points from a kernel centered at the test point and fits a sparse linear model to locally approximate the decision boundary (Lundberg and Lee 2017). By approximating Shapley values to quantify the importance of the features of a given input, kernel SHAP can learn a feature attribution. While gradient-based techniques like (Sundararajan, Taly, and Yan 2017) consider infinitesimal regions on the decision surface and take the first-order term in the Taylor expansion as the additive model, perturbation-based additive models consider the finite difference between an input vector and a reference vector.
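For concreteness, the vanilla gradient saliency that the first class of methods builds on can be sketched as follows (a minimal sketch, assuming PyTorch; the model and input are placeholders, not artifacts from the papers cited above):

```python
# A minimal sketch of a vanilla gradient saliency map, assuming PyTorch.
# The model and input below are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 5), nn.ReLU(), nn.Linear(5, 3))
x = torch.randn(1, 4, requires_grad=True)

# Gradient of the top class score with respect to the input features.
score = model(x).max()
score.backward()
saliency = x.grad.abs()  # per-feature saliency
print(saliency)
```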

Information Theory

Mutual information has proven to be a valuable tool for feature selection and dimensionality reduction at training time (Bollacker and Ghosh 1996). More recent work has represented deep neural networks as Markov chains to create an information bottleneck theory for deep learning (Shwartz-Ziv and Tishby 2017). However, these works do not tackle the interpretability problem directly.

Other works look to find $I(X; Y)$, the mutual information between a subset of the input vector and the output vector. In order to explain the conditional distribution of the output vector given the input vector, Chen et al. develop an efficient variational approximation to mutual information (Chen et al. 2018). However, this model fails to recover the per-feature mutual information, a requisite of our model to explain how information flows through all possible paths.

Additionally, (Belghazi et al. 2018) propose to estimate mutual information via a neural information measure, $I_\Theta(X, Z)$: this quantity is grounded in the dual representation of the Kullback-Leibler divergence between the joint and the product of the marginals, parameterized by $\theta \in \Theta$ from a statistics network $T_\Theta : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$, a deep neural network used to estimate the neural information measure from empirical samples of the joint ($P_{XZ}$) and of the product of the marginal distributions ($P_X \otimes P_Z$). The empirical neural information measure ($\widehat{I}$) is defined as follows:

$$\widehat{I}(X, Z) = \sup_{\theta \in \Theta} \mathbb{E}_{P_{XZ}}[T_\theta] - \log\left(\mathbb{E}_{P_X \otimes P_Z}\left[e^{T_\theta}\right]\right)$$

We use this approximation to start our detailed understanding of information flow in deep networks. We control $X$ and $Z$ to be different quantities of interest within our model $M$, namely a specific input feature, a hidden neuron, etc.
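A minimal sketch of this estimator, assuming PyTorch, is shown below; the statistics-network architecture, hyperparameters, and placeholder data are illustrative choices rather than the exact setup of (Belghazi et al. 2018):

```python
# A minimal sketch of the neural information measure, assuming PyTorch.
# The architecture, hyperparameters, and data below are illustrative.
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta: maps an (x, z) pair to a real-valued statistic."""
    def __init__(self, x_dim, z_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def mine_lower_bound(T, x, z):
    """E_P[T_theta] - log E_{P x P}[exp(T_theta)]; samples from the
    product of marginals are obtained by shuffling z within the batch."""
    joint = T(x, z).mean()
    z_marginal = z[torch.randperm(z.size(0))]
    log_mean_exp = (torch.logsumexp(T(x, z_marginal).squeeze(-1), dim=0)
                    - math.log(z.size(0)))
    return joint - log_mean_exp

# Maximize the bound over theta; x, z are empirical samples of the joint.
T = StatisticsNetwork(x_dim=4, z_dim=1)
optimizer = torch.optim.Adam(T.parameters(), lr=1e-3)
x, z = torch.randn(512, 4), torch.randn(512, 1)  # placeholder data
for _ in range(200):
    loss = -mine_lower_bound(T, x, z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```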

Approach

Our proposed approach, NIF, transforms a traditional deep learning model into a representation that captures the information-theoretic relationships between nodes learned by the model (Figure 1).

Figure 1: Traditional model to NIF network. Node color corresponds to community; node size corresponds to betweenness centrality.

Our approach extends the work of (Belghazi et al. 2018) and decomposes their approximation of $I(X; Z)$ to give us $I(X_i; Q_k)$, where $X_i$ is a dimension of $X$ (specifically the $i$th feature of the input vector) and $Q_k$ is any quantity of interest (perhaps the $k$th neuron in a hidden layer or a class of the output vector).

Assuming that mutual information is composable and entropy is non-decreasing, we can calculate the mutual information for any feature $X_i$ by leveraging a tractable approximation from (Bollacker and Ghosh 1996):

$$I(X_i; Q_k) = I(X; Q_k) - \beta \sum_{j=1}^{i-1} I(X_i; X_j) \qquad (2)$$

where $\beta$ can be used to tune the interactive effect of mutual information between features. The first term is referred to as the relevance of $X$ to $Q_k$, and the second term is called the redundancy, as it removes interactions between dimensions of the input. We desire a tractable approximation to Equation (2) using the statistics network $T_\Theta$ (for a thorough derivation of the statistics network as a valid measure of mutual information, see (Belghazi et al. 2018)), which calculates the mutual information between two empirical distributions (in this case, $X$ and $Q_k$). We can find the relevance term via:

$$\widehat{I}(X, Q_k, T_\Theta) = \sup_{\theta \in \Theta} \mathbb{E}_{P}[T_\theta] - \log\left(\mathbb{E}_{P \otimes P}\left[e^{T_\theta}\right]\right)$$

Similarly, the redundancy term is as follows:

$$\widehat{I}(X_i, X_j, T_\Theta) = \sup_{\theta \in \Theta} \mathbb{E}_{P_{ij}}[T_\theta] - \log\left(\mathbb{E}_{P_i \otimes P_j}\left[e^{T_\theta}\right]\right)$$

Combining both relevance and redundancy, we get the following estimate of neural information:

$$\widehat{I}(X_i, Q_k, T_\Theta) = \widehat{I}(X, Q_k, T_\Theta) - \beta \sum_{j=1}^{i-1} \widehat{I}(X_i, X_j, T_\Theta)$$

Since we share model parameters between the redundancy and relevance components, we derive a weaker least upper bound that allows us to be granular about distributional interactions. First, let the following hold:

$$A = \mathbb{E}_{P}[T_\theta] - \log\left(\mathbb{E}_{P \otimes P}\left[e^{T_\theta}\right]\right)$$

$$B = \mathbb{E}_{P_{ij}}[T_\theta] - \log\left(\mathbb{E}_{P_i \otimes P_j}\left[e^{T_\theta}\right]\right)$$

To that end, we propose NIF, a new metric for neural information flow:

$$NIF = \sup_{\theta \in \Theta} \left( A - \beta \sum_{j=1}^{i-1} B \right) \geq \widehat{I}(X_i, Q, T_\Theta) \qquad (3)$$

By jointly training $T_\Theta$, we can approximate the mutual information between a feature and a quantity of interest; for concreteness, assume the quantity of interest is the first neuron of a hidden layer. Solving Equation (3) for all possible $X_i$ places a weight on every edge between a feature and a quantity of interest (note that we can scale $X_i$ and $Q_k$ to be any two model internals).
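For concreteness, the combination in Equation (3) can be sketched as follows, assuming the relevance and redundancy terms have already been estimated with trained statistics networks as in the MINE sketch above (the numeric values are hypothetical):

```python
# A minimal sketch of Equation (3), assuming the relevance and redundancy
# terms were estimated with trained statistics networks (see the MINE
# sketch above); the numbers below are hypothetical.
def nif_edge_weight(relevance, redundancies, beta=0.5):
    """NIF for feature i: I(X; Q_k) - beta * sum_{j < i} I(X_i; X_j)."""
    return relevance - beta * sum(redundancies)

# Weight on the edge between feature x_i and hidden neuron q_k.
w = nif_edge_weight(relevance=0.92, redundancies=[0.10, 0.04])
print(w)  # 0.85
```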

Results

In order to test the fidelity of NIF, we run a few experiments that not only validate our proposed metric, but also lead to novel interpretations of deep learning models. We run all experiments on UCI datasets, namely Iris and Banknote Authentication, both of which provide us with a feature space small enough to interpret and visualize (Dheeru and Karra Taniskidou 2017).


One Layer Perceptron

We start by visualizing NIF for a one-layer perceptron trained on the Iris dataset with ReLU activations and optimized via ADAM. Note that we make a feature-independence assumption for the Iris dataset, since its low number of samples hinders NIF convergence.

Figure 2: One-layer perceptron for the Iris dataset with ReLU activation, trained with ADAM. (a) NIF network model. (b) Activation distribution of the original model.

In Figure 2, we show the NIF network created using Equation (3) and a distribution of activations as a sanity check. In particular, in Figure 2(a), we normalize the information flow per layer to ease visualization of the edges. The thickness of an edge denotes how much information is flowing between any two nodes: the thicker the connection, the more information traveling from one node to the next. The size of a node denotes its centrality: the bigger the node, the more central it is for information to freely propagate through the network. The color of a node denotes which community the node is a member of: using a standard resolution of $\gamma = 1$, we use Equation (1) to find three distinct communities in the network. At first glance, it is clear that of the five hidden neurons in the one hidden layer, only three are central to the model's final prediction. This result makes intuitive sense, as the ReLU activation at the other nodes is zero (see Figure 2(b)): thus, we can reason that ReLU effectively stifles information from flowing through the network. Moreover, Figure 2(b) confirms that the distributions of activations at nodes three and five are zero, and those nodes therefore have no connections in the NIF model.

Figure 3: One-layer perceptron for the Banknote dataset with ReLU activation, trained with ADAM. (a) NIF network. (b) Activation distribution of the original model.

We perform a similar analysis for the Banknote dataset and report the results in Figure 3. We see a strong information propagation from feature one to hidden-layer node five, so much so that both nodes belong to their own community. The activation distribution in Figure 3(b) confirms the equal importance of all central nodes to the model's prediction.

Two Layered Network

To show the initial ability of NIF to generalize to larger networks, we train a two-layer network with ReLU activation on the Banknote dataset. As shown in Figure 4, we find that two nodes per layer are zero, which means there are information pathways that are inherently stifled due to the use of the ReLU activation.

Figure 4: NIF network for a two-layer MLP for the Banknote dataset with ReLU activation, trained with ADAM.

Accuracy Recovery

It is worthwhile to note that all of the models described above achieved upwards of 96% accuracy on a held-out test set. The NIF models shown in Figure 2 and Figure 4 reveal that ReLU can zero out activations at certain neurons in the hidden layer of a network while still passing enough information through the rest of the neurons to maintain predictive accuracy. We ran another set of experiments wherein we use NIF to identify useless weights and then zero out the weights and biases of the original model corresponding to its zero-activation neurons. To our surprise, we did not find a drop in accuracy when we zeroed out these weights. This will have significant implications as we scale to larger networks.
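A sketch of this pruning step follows (a minimal sketch, assuming PyTorch; the layer and the dead-neuron indices are hypothetical placeholders for the neurons NIF flags as zero-activation):

```python
# A minimal sketch of the accuracy-recovery experiment, assuming PyTorch.
# `hidden` and `dead_neurons` are hypothetical placeholders; in practice
# the indices come from neurons NIF identifies as passing no information.
import torch
import torch.nn as nn

@torch.no_grad()
def zero_out_neurons(layer: nn.Linear, dead_neurons):
    """Zero the incoming weights and bias of the given hidden units."""
    for k in dead_neurons:
        layer.weight[k].zero_()
        layer.bias[k].zero_()

hidden = nn.Linear(4, 5)  # stand-in for a trained hidden layer
zero_out_neurons(hidden, dead_neurons=[2, 4])
# Re-evaluate held-out accuracy after this step to check for a drop.
```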

Feature Attribution

NIF naturally recovers a feature attribution, which we calculate in the following manner. We find all possible paths between a feature of interest $x_i$ and any of the outputs $y_1, \ldots, y_c$. To find the value of a path, we take the product of all NIF calculations along the path. We then sum over all of the possible path values to find $A_{i,j}$, our desired feature attribution for feature $x_i$ and class $y_j$. Mathematically, the element $A_{i,j}$ of our attribution matrix $A \in \mathbb{R}^{n \times c}$ (where $n$ is the number of features and $c$ is the number of classes) can be given as:

$$A_{ij} = \sum_{p \in P} \prod_{l \in L} NIF_p(l)$$

where $P$ is the set of all directed paths from input $x_i$ to class $y_j$ in the neural information flow network, and $L$ is the set of links on each path $p \in P$.
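This path-product computation can be sketched directly on the NIF graph (a minimal sketch, assuming NetworkX; the edge-attribute name and node labels are illustrative assumptions):

```python
# A minimal sketch of the path-product attribution, assuming the NIF
# network is stored as a NetworkX DiGraph whose edges carry a 'nif'
# weight; node labels such as 'x0' and 'y1' are illustrative.
import networkx as nx

def attribution(G: nx.DiGraph, feature, cls):
    """A_ij: sum over all directed feature-to-class paths of the
    product of NIF edge weights along each path."""
    total = 0.0
    for path in nx.all_simple_paths(G, source=feature, target=cls):
        value = 1.0
        for u, v in zip(path, path[1:]):
            value *= G[u][v]["nif"]
        total += value
    return total
```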

We compare NIF to the current feature attribution techniques SHAP (Lundberg and Lee 2017) and Integrated Gradients (Sundararajan, Taly, and Yan 2017) in Table 1. Using the two-sample Kolmogorov-Smirnov test for goodness of fit between two empirical distributions (in this case, the raw mutual information attribution between the input and output classes and the attribution in question), we find that NIF surpasses the current benchmarks, which suggests that NIF is likely drawn from the same distribution as the raw mutual information between the input and output classes. This leads us to believe that information-theoretic feature attribution is viable.

Attribution            K-S Statistic   p-value
NIF                    1.00            0.011
SHAP                   0.75            0.107
Integrated Gradients   0.25            0.996

Table 1: Feature attribution comparison
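The goodness-of-fit computation itself is standard (a minimal sketch, assuming SciPy; the attribution vectors below are placeholders, not the paper's data):

```python
# A minimal sketch of the comparison methodology, assuming SciPy; the
# attribution vectors below are placeholders, not the paper's data.
import numpy as np
from scipy.stats import ks_2samp

raw_mi_attribution = np.array([0.80, 0.10, 0.60, 0.30])
candidate_attribution = np.array([0.70, 0.20, 0.50, 0.40])  # e.g., NIF

statistic, p_value = ks_2samp(raw_mi_attribution, candidate_attribution)
print(f"K-S statistic: {statistic:.2f}, p-value: {p_value:.3f}")
```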

Conclusion and Future Work

We have proposed NIF, Neural Information Flow, a new metric for measuring information flow through deep learning models. Merging a dual representation of the Kullback-Leibler divergence with the classical feature selection literature, we find that NIF not only provides insight into which information pathways are crucial within a network, but also allows us to use fewer parameters at inference time, since we can remove parameters deemed useless by NIF without loss of accuracy. Finally, we have shown how NIF recovers an information-theoretic feature attribution that aligns with existing benchmarks. In future work, we plan to apply NIF to larger architectures.

References

Barabási, A.-L., and Bonabeau, E. 2003. Scale-free networks. Scientific American 288:60–69.

Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; and Hjelm, D. 2018. Mutual information neural estimation. In Proc. ICML, volume 80, 531–540. PMLR.

Bollacker, K. D., and Ghosh, J. 1996. Linear feature extractors based on mutual information. In Proc. ICPR, 720–724 vol. 2.

Chen, J.; Song, L.; Wainwright, M. J.; and Jordan, M. I. 2018. Learning to explain: An information-theoretic perspective on model interpretation. In Proc. ICML.

Dheeru, D., and Karra Taniskidou, E. 2017. UCI machine learning repository.

Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, 4765–4774.

Newman, M. E. 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23):8577–8582.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. KDD, 1135–1144.

Shwartz-Ziv, R., and Tishby, N. 2017. Opening the black box of deep neural networks via information. CoRR.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic attribution for deep networks. In Proc. ICML, volume 70, 3319–3328. PMLR.