Understanding the Role of Individual Units
in a Deep Neural Network
David Bau (a,1), Jun-Yan Zhu (a,b), Hendrik Strobelt (c), Agata Lapedriza (d,e), Bolei Zhou (f), and Antonio Torralba (a)

(a) Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139; (b) Adobe Research, Adobe Inc., San Jose, CA 95110; (c) Massachusetts Institute of Technology–International Business Machines (IBM) Watson Artificial Intelligence Laboratory, Cambridge, MA 02142; (d) Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139; (e) Estudis d'Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya, 08018 Barcelona, Spain; (f) Department of Information Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China

(1) Corresponding author e-mail:
This preprint was compiled on September 15, 2020. The peer-reviewed version can be found at
Deep neural networks excel at finding hierarchical representations that solve complex tasks over large data sets. How can we humans
understand these learned representations? In this work, we present network dissection, an analytic framework to systematically identify the
semantics of individual hidden units within image classification and image generation networks. First, we analyze a convolutional neural
network (CNN) trained on scene classification and discover units that match a diverse set of object concepts. We find evidence that the
network has learned many object classes that play crucial roles in classifying scene classes. Second, we use a similar analytic method to
analyze a generative adversarial network (GAN) model trained to generate scenes. By analyzing changes made when small sets of units are
activated or deactivated, we find that objects can be added and removed from the output scenes while adapting to the context. Finally, we
apply our analytic framework to understanding adversarial attacks and to semantic image editing.
Machine Learning | Deep Networks | Computer Vision
Can the individual hidden units of a deep network teach us how
the network solves a complex task? Intriguingly, within state-of-the-
art deep networks, it has been observed that many single units
match human-interpretable concepts that were not explicitly taught
to the network: units have been found to detect objects, parts,
textures, tense, gender, context, and sentiment (1–7). Finding
such meaningful abstractions is one of the main goals of deep
learning (8), but the emergence and role of such concept-specific
units are not well understood. Thus we ask: how can we quantify
the emergence of concept units across the layers of a network?
What types of concepts are matched, and what function do they
serve? When a network contains a unit that activates on trees, we
wish to understand if it is a spurious correlation, or if the unit has
a causal role that reveals how the network models its higher-level
notions about trees.
To investigate these questions, we introduce network dissection (9, 10), our method for systematically mapping the semantic
concepts found within a deep convolutional neural network. The
basic unit of computation within such a network is a learned con-
volutional filter; this architecture is the state-of-the-art for solving
a wide variety of discriminative and generative tasks in computer
vision (11–19). Network dissection identifies, visualizes, and quan-
tifies the role of individual units in a network by comparing the
activity of each unit to a range of human-interpretable pattern-
matching tasks such as the detection of object classes.
Previous approaches for understanding a deep network include
the use of salience maps (20–27): those methods ask where a
network looks when it makes a decision. The goal of our current
inquiry is different: we ask what a network is looking for, and
why. Another approach is to create simplified surrogate models to
mimic and summarize a complex network's behavior (28–30); and
another technique is to train explanation networks that generate
human-readable explanations of a network (31). In contrast to
those methods, network dissection aims to directly interpret the
internal computation of the network itself, rather than training an
auxiliary model.
We dissect the units of networks trained on two different types of
tasks: image classification and image generation. In both settings,
we find that a trained network contains units that correspond to
high-level visual concepts that were not explicitly labeled in the
training data. For example, when trained to classify or generate
natural scene images, both types of networks learn individual units
that match the visual concept of a ‘tree’ even though we have never
taught the network the tree concept during training.
Focusing our analysis on the units of a network allows us to
test the causal structure of network behavior by activating and
deactivating the units during processing. In a classifier, we use
these interventions to ask whether the classification performance
of a specific class can be explained by a small number of units that
identify visual concepts in the scene class. For example, we ask
how the ability of the network to classify an image as a ski resort
is affected when removing a few units that detect snow, mountains,
trees, and houses. Within a scene generation network, we ask how
the rendering of objects in a scene is affected by object-specific
units. How does the removal of tree units affect the appearance of
trees and other objects in the output image?
Finally, we demonstrate the usefulness of our approach with
two applications. We show how adversarial attacks on a classifier
can be understood as attacks on the important units for a class.
And we apply unit intervention on a generator to enable a human
user to modify semantic concepts such as trees and doors in an
image by directly manipulating units.
Fig. 1.
The emergence of single-unit object detectors within a VGG-16 scene classifier. (a) VGG-16 consists of 13 convolutional layers, conv1_1 through conv5_3, followed by three fully connected layers, fc6, fc7, and fc8. (b) The activation of a single filter on an input image can be visualized as the region where the filter activates beyond its top 1% quantile level. (c) Single units are scored by matching high-activating regions against a set of human-interpretable visual concepts; each unit is labeled with its best-matching concept and visualized with maximally activating images. (d) Concepts that match units in the final convolutional layer are summarized, showing a broad diversity of detectors for objects, object parts, materials, and colors. Many concepts are associated with multiple units. (e) Comparing all the layers of the network reveals that most object detectors emerge at the last convolutional layers. (f) Although the training set contains no object labels, unit 150 emerges as an 'airplane' object detector that activates much more strongly on airplane objects than on non-airplane objects, as tested against a dataset of labeled object images not previously seen by the network. The jitter plot shows peak activations of the unit on 1,000 randomly sampled airplane and 1,000 non-airplane Imagenet images, and the curves show kernel density estimates of these activations.
Emergence of Object Detectors in a Scene Classifier.
We first identify individual units that emerge as object detectors when training a network on a scene classification task. The network we analyze is a VGG-16 CNN (13) trained to classify images into 365 scene categories using the Places365 dataset (32). We analyze all units within the 13 convolutional layers of the network (Figure 1a). Please refer to Materials and Methods for further details on networks and datasets.
Each unit u computes an activation function a_u(x, p) that outputs a signal at every image position p given a test image x. Filters with low-resolution outputs are visualized and analyzed at high-resolution positions p using bilinear upsampling. Denote by t_u the top 1% quantile level for a_u; that is, writing P_{x,p}[·] to indicate the probability that an event is true when sampled over all positions and images, we define the threshold t_u ≡ max{t : P_{x,p}[a_u(x, p) > t] ≥ 0.01}. In visualizations we highlight the activation region {p | a_u(x, p) > t_u} above the threshold. As seen in Figure 1b, this region can correspond to semantics such as the heads of all the people in the image. To identify filters that match semantic concepts, we measure the agreement between each filter and a visual concept c using a computer vision segmentation model (33), s_c(x, p) → {0, 1}, that is trained to predict the presence of the visual concept c within image x at position p. We quantify the agreement between concept c and unit u using the intersection over union (IoU) ratio:

IoU_{u,c} = P_{x,p}[s_c(x, p) ∧ (a_u(x, p) > t_u)] / P_{x,p}[s_c(x, p) ∨ (a_u(x, p) > t_u)]    [1]
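Concretely, Eq. [1] can be estimated with a few lines of array code. The sketch below is illustrative rather than the paper's released implementation; it assumes the unit's upsampled activations and the reference segmentation masks are available as NumPy arrays of matching shape, and the function names are our own.

```python
import numpy as np

def unit_threshold(activations, top_fraction=0.01):
    # t_u: the level exceeded by the top 1% of this unit's activations,
    # pooled over all positions p and all images x.
    return np.quantile(activations, 1.0 - top_fraction)

def iou_unit_concept(activations, seg_masks, t_u):
    # activations: float array (num_images, H, W) for one unit, bilinearly
    # upsampled to the segmentation resolution.
    # seg_masks: boolean array (num_images, H, W); True where the reference
    # segmenter s_c predicts concept c.
    active = activations > t_u
    intersection = np.logical_and(active, seg_masks).sum()
    union = np.logical_or(active, seg_masks).sum()
    return intersection / union if union > 0 else 0.0
```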
This IoU ratio is computed over the held-out validation set images. Within this validation set, each unit is scored against 1,825 segmented concepts c, including object classes, parts of
objects, materials, and colors. Then each unit is labeled with
the highest-scoring matching concept. Figure 1c shows several
labeled concept detector units along with the five images with the
highest unit activations.
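Given per-concept IoU scores for a unit, the labeling step is an argmax with a minimum-IoU cutoff (the layer summaries below use a 4% cutoff); a minimal sketch:

```python
def label_unit(iou_by_concept, min_iou=0.04):
    # iou_by_concept: mapping from concept name to IoU_{u,c} for one unit.
    # Returns the unit's best-matching concept, or None if no concept
    # reaches the minimum IoU cutoff.
    concept, iou = max(iou_by_concept.items(), key=lambda kv: kv[1])
    return concept if iou >= min_iou else None
```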
When examining all 512 units in the last convolutional layer, we
find many detected object classes and relatively fewer detected
object parts and materials: units match 51 object classes, 22 parts, 12 materials, and 8 colors. Several visual concepts such as 'airplane' and 'head' are matched by more than one unit. Figure 1d lists every segmented concept matching units in layer conv5_3, excluding any units with an IoU ratio below 4%, and shows the
frequency of units matching each concept. Across different lay-
ers, the last convolutional layer has the largest number of object
classes detected by units, while the number of object parts peaks
two layers earlier, at conv5_1, which has units matching 28 ob-
ject classes, 25 parts, 9 materials, and 8 colors (Figure 1e). A
complete visualization of all the units of conv5_3 is provided in SI, as well as more detailed comparisons between layers of VGG-16, comparisons to layers of AlexNet (12) and ResNet (16), and an
[Figure 2 data. (a) conv5_3 units causing the most accuracy loss on the single class 'ski resort' when removed individually: unit 261 'snow' (acc lost: train 8.9%, val 10.9%); unit 149 'mountain top' (train 1.2%, val 3.5%); unit 242 'house' (train 1.5%, val 2.5%); unit 105 'tree top' (part; train 0.8%, val 2.0%). (b) Validation accuracy when units are removed as a set:

                                       'ski resort' accuracy (balanced)   all-class accuracy
    unchanged VGG-16                                81.4%                       53.3%
    4 most important units removed                  64.0%                       53.2%
    20 most important units removed                 53.5%                       52.6%
    492 least important units removed               77.7%                        2.1%
    chance level                                    50.0%                        0.27%]
Fig. 2.
A few units play important roles in classification performance. (a) The four conv5_3 units that cause the most damage to balanced classification accuracy for 'ski resort' when each unit is individually removed from the network; dissection reveals that these most-important units detect visual concepts that are salient to ski resorts. (b) When the most-important units for the class are removed all together, balanced single-class accuracy drops to near chance levels. When the 492 least-important units in conv5_3 are removed all together (leaving only the 20 most-important units), accuracy remains high. (c) The effect on 'ski resort' prediction accuracy when removing sets of units of successively larger sizes, with units sorted in ascending and descending order of each unit's individual impact on accuracy. (d) Repeating the experiment for each of the 365 scene classes. Each point plots single-class classification accuracy in one of three settings: the original network; the network after removing the 20 units most important to the class; and the network with all conv5_3 units removed except the 20 most important ones. On the x axis, classes are ordered alphabetically. (e) The relationship between unit importance and interpretability: units that are among the top-4 most important units for more classes are also closer matches for semantic concepts, as measured by IoU_{u,c}. Bars show the standard error over n units.
analysis of the texture versus shape sensitivity of units using a
stylization method based on (34).
Interestingly, object detectors emerge despite the absence of
object labels in the training task. For example, the aviation-related
scene classes in the training set are ‘airfield’, ‘airport terminal’,
‘hangar’, ‘landing deck’, and ‘runway.’ Scenes in these classes
do not always contain airplanes, and there is no explicit ‘airplane’
object label in the training set. Yet unit 150 emerges as a detector that locates airplanes, scoring IoU_{u,c} = 9.0% agreement with our reference airplane segmentations in scene images. The accuracy
of the unit as an airplane classifier can be further verified on Imagenet (35), a dataset that contains 1,000 object classes; its images and classes are disjoint from the Places365 training set. Imagenet contains two airplane class labels, 'airliner' and 'warplane', and a simple threshold on the peak activation of unit 150 achieves 85.6% balanced classification accuracy on the task of distinguishing these airplane classes from the other object classes. Figure 1f
shows the distribution of activations of this unit on a sample of
airplane and non-airplane Imagenet images.
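The balanced-accuracy test in Figure 1f can be reproduced along the following lines; this is a sketch that assumes the per-image peak activations of unit 150 have already been collected for the two groups.

```python
import numpy as np

def balanced_accuracy(pos_peaks, neg_peaks, threshold):
    # pos_peaks: per-image peak activations on airplane images.
    # neg_peaks: per-image peak activations on non-airplane images.
    tpr = np.mean(np.asarray(pos_peaks) > threshold)   # true positive rate
    tnr = np.mean(np.asarray(neg_peaks) <= threshold)  # true negative rate
    return (tpr + tnr) / 2.0
```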
Role of Units in a Scene Classifier.
How does the network use
the above object detector units? Studies of network compression
have shown that many units can be eliminated from a network while recovering overall classification accuracy by retraining (36, 37). One way to estimate the importance of an individual unit is to examine the impact of the removal of the unit on mean network accuracy (38, 39).
To obtain a more fine-grained understanding of the causal role
of each unit within a network, we measure the impact of removing each unit on the network's ability to classify each individual scene class. Units are removed by forcing the specified unit to
output zero and leaving the rest of the network intact. No retraining
is done. Single-class accuracy is tested on the balanced two-way
classification problem of discriminating the specified class from all
the other classes.
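In a framework such as PyTorch, this zeroing intervention can be implemented with a forward hook, so no weights are modified; a minimal sketch (the usage comment assumes a torchvision-style VGG-16, in which conv5_3 is features[28]):

```python
import torch

def ablate_units(layer, unit_indices):
    # Force the given channels of a convolutional layer to output zero,
    # leaving the rest of the network intact (no retraining).
    def hook(module, inputs, output):
        output[:, unit_indices] = 0.0
        return output
    return layer.register_forward_hook(hook)

# Hypothetical usage:
# handle = ablate_units(vgg16.features[28], [261])  # remove the snow unit
# ... evaluate balanced single-class accuracy ...
# handle.remove()  # restores the original network
```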
The relationships between objects and scenes learned by the
network can be revealed by identifying the most important units
for each class. For example, the four most important conv5_3 units for the class 'ski resort' are shown in Figure 2a: these units
damage ski resort accuracy most when removed. The units detect
snow, mountains, houses, and trees, all of which seem salient to
ski resort scenes.
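Ranking follows directly: remove each unit in turn and record the single-class accuracy lost. A sketch, reusing ablate_units from above; eval_single_class_accuracy is a hypothetical helper computing the balanced two-way accuracy described earlier.

```python
def rank_units_by_importance(model, layer, num_units, loader, target_class):
    baseline = eval_single_class_accuracy(model, loader, target_class)
    loss_when_removed = {}
    for u in range(num_units):
        handle = ablate_units(layer, [u])
        acc = eval_single_class_accuracy(model, loader, target_class)
        loss_when_removed[u] = baseline - acc
        handle.remove()
    # Most important first: units whose removal costs the most accuracy.
    return sorted(loss_when_removed, key=loss_when_removed.get, reverse=True)
```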
To test whether the ability of the network to classify ski resorts
can be attributed to just the most important units, we remove se-
lected sets of units. Figure 2b shows that removing just these 4
(out of 512) units reduces the network’s accuracy at discriminating
‘ski resort’ scenes from 81.4% to 64.0%, and removing the 20
most important units in conv5_3 reduces class accuracy further to
53.5%, near chance levels (chance is 50%), even though classifi-
cation accuracy over all scene classes is hardly affected (changing
from 53.3% to 52.6%, where chance is 0.27%). In contrast, re-
moving the 492 least-important units, leaving only the 20 most
important units in conv5_3, has only a small impact on accuracy
for the specific class, reducing ski resort accuracy by only 3.7%,
to 77.7%. Of course, removing so many units damages the ability
of the network to classify other scene classes: removing the 492
least-important units reduces all-class accuracy to 2.1% (chance
is 0.27%).
The effect of removing varying numbers of most-important and
least-important units upon ‘ski resort’ accuracy is shown in Fig-
ure 2c. To avoid overfitting to the evaluation data, we rank the
importance of units according to their individual impact on single-
class ski resort accuracy on the training set, and the plotted impact
of removing sets of units is evaluated on the held-out validation set.
The network can be seen to derive most of its performance for ski
resort classification from just the most important units. Single-class
accuracy can even be improved by removing the least important
units; this effect is further explored in SI.
This internal organization, in which the network relies on a small
number of important units for most of its accuracy with respect
to a single output class, is seen across all classes. Figure 2d
repeats the same experiment for each of the 365 scene classes.
Removing the 20 most important conv5_3 units for each class
reduces single-class accuracy to 53.0% on average, near chance
[Figure 3 panels: (a) Progressive GAN architecture: random vector → 15 convolutional layers → output image. (b) unit 381 activation (dissected as 'lamp'). (c) dissection of each convolutional layer. (d) layer5 detail: 14 objects, 33 parts, 1 material, 6 colors, including units matching 'exhaust hood', 'kitchen island', and 'work surface' along with their top/bottom/left/right part variants. (e) single units, e.g. unit 498 'oven' (object), unit 244 'chair top' (part), unit 58 'red' (color). (f) unit 314 'window bottom' (part): activation for images with and without large windows. (g) window images in which unit 314 does not activate.]
Fig. 3.
The emergence of object- and part-specific units within a Progressive GAN generator (19). (a) The analyzed Progressive GAN consists of 15 convolutional layers that transform a random input vector into a synthesized image of a kitchen. (b) A single filter is visualized as the region of the output image where the filter activates beyond its top 1% quantile level; note that the filters are all precursors to the output. (c) Dissecting all the layers of the network shows a peak in object-specific units at layer5 of the network. (d) A detailed examination of layer5 shows more part-specific units than object-specific ones, and many visual concepts corresponding to multiple units. (e) Units do not correspond to exact pixel patterns: a wide range of visual appearances for ovens and chairs is generated when an oven or chair-part unit is activated. (f) When a unit specific to window parts is tested as a classifier, on average the unit activates more strongly on generated images that contain large windows than on images that do not. The jitter plot shows the peak activation of unit 314 on 800 generated images that have windows larger than 5% of the image area, as estimated by a segmentation algorithm, and 800 generated images that do not. (g) Some counterexamples: images for which unit 314 does not activate but where windows are synthesized nevertheless.
levels. In contrast, removing the 492 least important units only
reduces single-class accuracy by an average of 3.6%, just a slight
reduction. We conclude that the emergent object detection done by units of conv5_3 is not spurious: each unit is important to a
specific set of classes, and the object detectors can be interpreted
as decomposing the network’s classification of individual scene
classes into simpler sub-problems.
Why do some units match interpretable concepts so well while
other units do not? The data in Figure 2e show that the most
interpretable units are those that are important to many different
output classes. Units that are important to only one class (or
none) are less interpretable, as measured by IoU_{u,c}. We further find that
important units are predominantly positively correlated with their
associated classes, and different combinations of units provide
support for each class. Measurements of unit-class correlations
and examples of overlapping combinations of important units are
detailed in SI.
Does the emergence of interpretable units such as airplane,
snow, and tree detectors depend on having training set labels that
divide the visual world into hundreds of scene classes? Perhaps
the taxonomy of scenes encodes distinctions that are necessary
to learn about objects. Or is it possible for a network to infer such
concepts from the visual data itself? To investigate this question,
we next conduct a similar set of experiments on networks trained
to solve unsupervised tasks.
Emergence of Object Detectors in a GAN. A generative adver-
sarial network (GAN) learns to synthesize random realistic images
that mimic the distribution of real images in a training set (14). Architecturally, a trained GAN generator is the reverse of a classi-
fier, producing a realistic image from a random input latent vector.
Unlike classification, it is an unsupervised setting: no human an-
notations are provided to a GAN, so the network must learn the
structure of the images by itself.
Remarkably, GANs have been observed to learn global seman-
tics of an image: for example, interpolating between latent vectors
can smoothly transform the layout of a room (40) or change the texture of an object (41). We wish to understand whether the
GAN also learns hierarchical structure, for example, if it learns to
decompose the generation of a scene into meaningful parts.
We test a Progressive GAN architecture (19) trained to imitate LSUN kitchen images (42). This network architecture con-
sists of 15 convolutional layers, as shown in Figure 3a. Given
a 512-dimensional vector sampled from a multivariate Gaussian
distribution, the network produces a 256×256 realistic image after
processing the data through the 15 layers. As with a classifier
network, each unit is visualized by showing the regions where the
filter activates above its top 1% quantile level, as shown in Fig-
ure 3b. Importantly, causality in a generator flows in the opposite
direction as a classifier: when unit 381 activates on lamp shades
in an image, it is not detecting objects in the image, because the
[Figure 4 panels: (a) causal effect on generation of trees when 'tree' units are removed: unchanged, 2 units, 4 units, 8 units, 20 units removed (units ranked by IoU match with tree segmentations). (b) qualitative examples of tree removal. (c) effect of activating 'door' units depends on location (cases 1–4; area shows 99% confidence intervals over n = 10,000 images). (d) effect of activating 'door' units depends on object context (bars show standard deviation over n = 10,000 images; original object shown at the modified location).]
Fig. 4.
The causal effect of altering units
within a GAN generator. (a) When succes-
sively larger sets of units are removed from
a GAN trained to generate outdoor church
scenes, the tree area of the generated im-
ages is reduced. Removing 20 tree units
removes more than half the generated tree
pixels from the output. (b) Qualitative results:
removing tree units affects trees while leaving
other objects intact. Building parts that were
previously occluded by trees are rendered as
if revealing the objects that were behind the
trees. (c) Doors can be added to buildings by
activating 20 door units. The location, shape, size, and style of the rendered door depend on the location of the activated units. The
same activation levels produce different doors,
or no door at all (case 4) depending on loca-
tions. (d) Similar context dependence can be
seen quantitatively: doors can be added in
reasonable locations such as at the location
of a window, but not in abnormal locations
such as on a tree or in the sky.
filter activation occurs before the image is generated. Instead, the
unit is part of the computation that ultimately renders the objects.
To identify the location of units in the network that are associ-
ated with object classes, we apply network dissection to the units
of every layer of the network. In this experiment, the reference seg-
mentation models and thresholds used are the same as those used
to analyze the VGG-16 classifier. However, instead of analyzing
agreement with objects that appear in the input data, we analyze
agreement with segmented objects found in the generated output
images. As shown in Figure 3c, the largest number of emergent concept units does not appear at the edge of the network, as we saw in the classifier, but in the middle: layer5 has units that match the largest number of distinct object and part classes.
Figure 3d shows each object, part, material, and color that matches a unit in layer5 with IoU > 4%. This layer contains 19
object-specific units, 41 units that match object parts, one material,
and six color units. As seen in the classification network, visual
concepts such as ‘oven’ and ‘chair’ match many units. Different
from the classifier, more object parts are matched than whole objects.
In Figure 3d, individual units show a wide range of visual di-
versity: the units do not appear to rigidly match a specific pixel
pattern, but rather different appearances for a particular class, for
example, various styles of ovens, or different colors and shapes of
kitchen stools.
In Figure 3f, we apply the window-specific unit 314 as an image
classifier. We find a strong gap between the activation of the unit
when a large window is generated and when no large window
is generated. Furthermore, a simple threshold on the unit's peak activation can achieve a 78.2% accuracy in predicting whether the generated image will have a large window or not. Nevertheless,
the distribution density curve reveals that images that contain large
windows can be often generated without activating unit 314. Two
such samples are shown in Figure 3g. These examples suggest
that other units could potentially synthesize windows.
Role of Units in a GAN.
The correlations between units and
generated object classes are suggestive, but they do not prove
that the units that correlate with an object class actually cause the
generator to render instances of the object class. To understand
the causal role of a unit in a GAN generator, we test the output of
the generator when sets of units are directly removed or activated.
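Both interventions amount to a single featuremap edit during the forward pass: set the chosen channels to zero (removal) or to a high constant (activation), optionally restricted to a spatial region. A minimal PyTorch sketch, with names of our own choosing:

```python
import torch

def intervene_units(layer, unit_indices, value, region=None):
    # value = 0.0 removes the units; a high positive value forces them on.
    # region, if given, is (r0, r1, c0, c1) in featuremap coordinates.
    def hook(module, inputs, output):
        if region is None:
            output[:, unit_indices] = value
        else:
            r0, r1, c0, c1 = region
            for u in unit_indices:
                output[:, u, r0:r1, c0:c1] = value
        return output
    return layer.register_forward_hook(hook)
```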
We first remove successively larger sets of tree units from a
Progressive GAN (19) trained on LSUN church scenes (42). We rank units according to their IoU_{u,c} match with tree segmentations to identify the most tree-
specific units. When successively larger sets of these tree units are
removed from the network, the GAN generates images with fewer
and smaller trees (Figure 4a). Removing the 20 most tree-specific
units reduces the number of tree pixels in the generated output by
53.3%, as measured over 10,000 randomly generated images.
When tree-specific units are removed, the generated images
continue to look similarly realistic. Although fewer and smaller trees
are generated, other objects such as buildings are unchanged.
Remarkably, parts of buildings that were occluded by trees are
hallucinated, as if removing the trees reveals the walls and win-
dows behind them (Figure 4b). The generator appears to have
computed more details than are necessary to render the final out-
put; the details of a building that are hidden behind a tree can
only be revealed by suppressing the generation of the tree. The
appearance of such hidden details strongly suggests that the GAN
is learning a structured statistical model of the scene that extends
beyond a flat summarization of visible pixel patterns.
Units can also be forced on to insert new objects into a gen-
erated scene. We use IoU_{u,c} to find the 20 most door-specific units in the same outdoor church GAN. At
tested locations, the activations for this set of 20 units are all forced to a high value. Figure 4c shows the effect of applying this
procedure to activate 20 door units at two different locations in
two generated images. Although the same intervention is applied to all four cases, the door obtained in each situation is different: in cases 1–3, the newly synthesized door has a size, style, and location that is appropriate to the scene context. In case 4, where door units are activated on a tree, no new door is added to the scene.
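In terms of the intervene_units sketch above, the door intervention forces the 20 door units to a high value inside a small featuremap window; the layer handle, activation value, and region below are illustrative.

```python
# door_units: indices of the 20 most door-specific units in the chosen layer.
handle = intervene_units(gan_layer, door_units, value=10.0,
                         region=(4, 6, 3, 5))
image = generator(z)  # the edit is applied during this forward pass
handle.remove()
```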
Figure 4d quantifies the context-sensitivity of activating door
units in different locations. In 10,000 randomly generated images,
the same 20-door-unit activation is tested at every featuremap loca-
tion, and the number of newly synthesized door pixels is evaluated
using a segmentation algorithm. Doors can be easily added in
some locations, such as in buildings and especially on top of an
[Figure 5 data. (a) Adversarial attack changes an image imperceptibly to fool the classifier: 'ski resort' + perturbation = 'bedroom'. (b) Activation change in the 4 units most important to ski resort: unit 261 snow (Δpeak −53.2), unit 149 mountain (Δpeak −21.0), unit 242 house (Δpeak −19.8), unit 105 tree (Δpeak −28.1); and in the 4 units most important to bedroom: unit 317 bed (Δpeak +23.8), unit 290 bed (Δpeak +24.5), unit 157 head (Δpeak +43.5), unit 75 sofa (Δpeak −1.0). Green = +10 activation, red = −10 activation. (c) Mean peak unit activation change when attacked, for units in conv5_3.]
Fig. 5.
Application: visualizing an adversarial attack. (a) The test image is correctly labeled as a ski resort, but when an adversarial perturbation is added, the visually indistinguishable result is classified as a bedroom. (b) Visualization of the attack on the four units most important to the ski resort class and the four units most important to the bedroom class. Areas of maximum increase and decrease are shown; Δpeak indicates the change in the peak activation level for the unit. (c) Over 1,000 images attacked to be misclassified into various incorrect target classes, the units that change most are those that dissection has identified as most important to the source and target classes. The mean absolute change in peak unit activation is graphed, with 99% confidence intervals shown.
Fig. 6.
Application: painting by manipulating GAN neurons. (a) An interactive interface
allows a user to choose several high-level semantic visual concepts and paint them
on to an image. Each concept corresponds to 20 units in the GAN. (b) After the user
adds a dome in the specified location, the result is a modified image in which a dome
has been added in place of the original steeple. Once the user’s high-level intent has
been expressed by changing 20 dome units, the generator automatically handles the
pixel-level details of how to fit together objects to keep the output scene realistic.
existing window, but it is nearly impossible to add a door into trees
or in the sky.
By learning to solve the unsupervised image generation prob-
lem, a GAN has learned units for emergent objects such as doors
and trees. It has also learned a computational structure over those
units that prevents it from rendering nonsensical output such as a
door in the sky, or a door in a tree.
We now turn to two applications enabled by our understanding
of the role of units: understanding attacks on a classifier, and
interactively editing a photo by activating units of a GAN.
Analyzing Adversarial Attack of a Classifier.
The sensitivity
of image classifiers to adversarial attacks is an active research
area (
). To visualize and understand how an attack works,
we can examine the effects on important object detector units. In
Figure 5a, a correctly classified ‘ski resort’ image is attacked to
the target 'bedroom' by the C&W optimization method (45). The adversarial algorithm computes a small perturbation which,
when added to the original, results in a misclassified image that is
visually indistinguishable from the original image. To understand
how the attack works, we examine the four most important units
to the ski resort class and the four most important units to the
bedroom class. Figure 5b visualizes changes in the activations
for these units between the original image and the adversarial
image. This reveals that the attack has fooled the network by
reducing detection of snow, mountain, house, and tree objects,
and by increasing activations of detectors for beds, person heads,
and sofas in locations where those objects do not actually exist in
the image. Figure 5c shows that, across many images and classes,
the units that are most changed by an attack are the few units that
are important to a class.
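The Δpeak statistic of Figure 5 is simply the change in a unit's maximum activation between the clean and the adversarial image; a sketch, assuming a hypothetical helper that returns the conv5_3 activations for one image as a (units, H, W) tensor:

```python
import torch

@torch.no_grad()
def peak_changes(get_activations, x, x_adv):
    # get_activations(image) -> activations of shape (units, H, W).
    peak = get_activations(x).amax(dim=(1, 2))
    peak_adv = get_activations(x_adv).amax(dim=(1, 2))
    return peak_adv - peak  # one Δpeak value per unit
```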
Semantic Paint using a GAN.
Understanding the roles of units
within a network allows us to create a human interface for con-
trolling the network via direct manipulation of its units. We apply
this method to a GAN to create an interactive painting application.
Instead of painting with a palette of colors, the application allows
painting with a palette of high-level object concepts. Each concept
is associated with 20 units that maximize IoU_{u,c} for the concept.
Figure 6a shows our interactive interface. When a user adds brush
strokes with a concept, the units for the concept are activated (if
the user is drawing) or zeroed (if the user is erasing). Figure 6b
shows typical results after the user adds an object to the image.
The GAN deals with the pixel-level details of how to add objects
while keeping the scene reasonable and realistic. Multiple changes
in a scene can be composed for creative effects; movies of image
editing demos are included in SI; online demos are also available
at the website
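A brush stroke reduces to the same unit intervention used above: the stroke mask is resampled to the featuremap grid, and the concept's 20 units are forced high (drawing) or to zero (erasing) wherever the mask is set. A sketch with hypothetical names:

```python
import torch
import torch.nn.functional as F

def paint_hook(layer, concept_units, brush_mask, value):
    # brush_mask: (H, W) float tensor of user strokes in image space;
    # value > 0 draws the concept, value = 0.0 erases it.
    def hook(module, inputs, output):
        h, w = output.shape[2], output.shape[3]
        mask = F.interpolate(brush_mask[None, None], size=(h, w)) > 0.5
        for u in concept_units:
            painted = torch.full_like(output[:, u:u + 1], value)
            output[:, u:u + 1] = torch.where(mask, painted, output[:, u:u + 1])
        return output
    return layer.register_forward_hook(hook)
```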
Discussion

Simple measures of performance, such as classification accuracy,
do not reveal how a network solves its task: good performance can
be achieved by networks that have differing sensitivities to shapes,
textures, or perturbations (34,48).
To develop an improved understanding of how a network works,
we have presented a way to analyze the roles of individual network
units. In a classifier, the units reveal how the network decomposes
the recognition of specific scene classes into particular visual
concepts that are important to each scene class. And within a
Bau et al. |September 15, 2020 |6–8
generator, the behavior of the units reveals contextual relationships
that the model enforces between classes of objects in a scene.
Network dissection relies on the emergence of disentangled,
human-interpretable units during training. We have seen that
many such interpretable units appear in state-of-the-art models,
both supervised and unsupervised. How to train better disentan-
gled models is an open problem that is the subject of ongoing
efforts (49–52).
We conclude that a systematic analysis of individual units can
yield insights about the black box internals of deep networks. By
observing and manipulating units of a deep network, it is possible
to understand the structure of the knowledge that the network has
learned, and to build systems that help humans interact with these
powerful models.
Materials and Methods
Data sets.
Places365 (32) consists of 1.80 million photographic
images, each labeled with one of 365 scene classes. The dataset
also includes 36,500 labeled validation images (100 per class) that
are not used for training. Imagenet (35) consists of 1.28 million
photographic images, each focused on a single main object and
labeled with one of 1,000 object classes. LSUN is a dataset with a large number of 256×256 images in a few classes (42); LSUN kitchens consists of 2.21 million indoor kitchen photographs, and
LSUN outdoor churches consists of 1.26 million photographs of
church building exteriors. Recognizable people in dataset images
have been anonymized by pixelating faces in visualizations.
Tested Networks.
We analyze the VGG-16 classifier (13) trained by the Places365 authors (32) to classify Places365 images. The
network achieves classification accuracy of 53.3% on the held-out
validation set (chance is 0.27%). The 13 convolutional layers of
VGG-16 are divided into 5 groups. The layers in the first group contain 64 units that process image data at the full 224×224 resolution; at each successive group the feature maps are pooled to halve the resolution and the feature depth is increased (up to a maximum of 512), so that the final stage, which includes conv5_3, contains layers of 512 units at 14×14 resolution. The GAN models we analyze are
trained by the Progressive GAN authors (19). The models are configured to generate 256×256 output images using 15 convolutional layers divided into 8 groups, starting with 512 units in each layer at 4×4 resolution and doubling resolution at each successive group, so that layer4 has 8×8 resolution and 512 units and layer5 has 16×16 resolution and 512 units. Unit depth is halved in each later group, so that the 14th layer has 32 units and 256×256 resolution. The 15th layer (not pictured in Figure 3a) produces a 3-channel RGB image.
Reference Segmentation.
To locate human-interpretable visual
concepts within large-scale data sets of images, we use the Unified Perceptual Parsing image segmentation network (33) trained on the ADE20K scene dataset (53) and an assignment of RGB values to color names (54). The segmentation algorithm achieves a mean
IoU of 23.4% on objects, 28.8% on parts, and 54.2% on materials.
To further identify units that specialize in object parts, we expand
each object class into four additional object part classes which
denote the top, bottom, left, or right half of the bounding box of a
connected component. Our reference segmentation algorithm can
detect 335 object classes, 1,452 object parts, 25 materials, and 11 colors.
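The part expansion can be computed directly from each object mask. A sketch assuming one connected component per boolean mask (a connected-components pass, e.g. with scipy.ndimage.label, would handle the general case):

```python
import numpy as np

def part_masks(mask):
    # Split a boolean object mask into four derived part masks: the top,
    # bottom, left, and right halves of its bounding box.
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return {}
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
    halves = {"top": (slice(y0, ym), slice(x0, x1)),
              "bottom": (slice(ym, y1), slice(x0, x1)),
              "left": (slice(y0, y1), slice(x0, xm)),
              "right": (slice(y0, y1), slice(xm, x1))}
    parts = {}
    for name, (sy, sx) in halves.items():
        part = np.zeros_like(mask)
        part[sy, sx] = mask[sy, sx]
        parts[name] = part
    return parts
```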
Data Availability.
The code, trained model weights and data sets
needed to reproduce the results in this paper are public and avail-
able for download at GitHub at
and at the project website
Acknowledgments.
We are indebted to Aditya Khosla, Aude Oliva, William Peebles,
Jonas Wulff, Joshua B. Tenenbaum, and William T. Freeman for
their advice and collaboration. And we are grateful for the support
of the MIT-IBM Watson AI Lab, the DARPA XAI program FA8750-
18-C0004, NSF 1524817 on Advancing Visual Recognition with
Feature Visualizations, NSF BIGDATA 1447476, Grant RTI2018-
095232-B-C22 from the Spanish Ministry of Science, Innovation
and Universities to A.L., the Early Career Scheme (ECS) of Hong Kong (No. 24206219) to B.Z., and a hardware donation from NVIDIA.
1 Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Object detectors emerge in deep scene CNNs, in ICLR.
2 Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks, in ECCV.
3 Mahendran A, Vedaldi A (2015) Understanding deep image representations by inverting them, in CVPR.
4 Olah C, et al. (2018) The building blocks of interpretability. Distill 3(3):e10.
5 Bau A, et al. (2018) Identifying and controlling important neurons in neural machine translation, in NeurIPS.
6 Karpathy A, Johnson J, Fei-Fei L (2016) Visualizing and understanding recurrent networks, in ICLR.
7 Radford A, Jozefowicz R, Sutskever I (2017) Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
8 Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828.
9 Zhou B, Bau D, Oliva A, Torralba A (2018) Interpreting deep visual representations via network dissection. PAMI.
10 Bau D, et al. (2019) GAN dissection: Visualizing and understanding generative adversarial networks, in ICLR.
11 LeCun Y, Bengio Y, et al. (1995) Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361(10):1995.
12 Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks, in NeurIPS. pp. 1097–1105.
13 Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition, in ICLR.
14 Goodfellow I, et al. (2014) Generative adversarial nets, in NeurIPS.
15 Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator, in CVPR. pp. 3156–3164.
16 He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition, in CVPR.
17 Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks, in CVPR.
18 Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks, in ICCV.
19 Karras T, Aila T, Laine S, Lehtinen J (2018) Progressive growing of GANs for improved quality, stability, and variation, in ICLR.
20 Bach S, et al. (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7).
21 Zhou T, Krahenbuhl P, Aubry M, Huang Q, Efros AA (2016) Learning dense correspondence via 3D-guided cycle consistency, in CVPR.
22 Fong RC, Vedaldi A (2017) Interpretable explanations of black boxes by meaningful perturbation, in ICCV. pp. 3429–3437.
23 Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions, in NeurIPS.
24 Selvaraju RR, et al. (2017) Grad-CAM: Visual explanations from deep networks via gradient-based localization, in ICCV.
25 Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks, in ICML.
26 Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M (2017) SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825.
27 Petsiuk V, Das A, Saenko K (2018) RISE: Randomized input sampling for explanation of black-box models, in BMVC.
28 Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?": Explaining the predictions of any classifier, in SIGKDD. (ACM), pp. 1135–1144.
29 Kim B, Gilmer J, Viegas F, Erlingsson U, Wattenberg M (2017) TCAV: Relative concept importance testing with linear concept activation vectors. arXiv preprint arXiv:1711.11279.
30 Koul A, Fern A, Greydanus S (2019) Learning finite state representations of recurrent policy networks, in ICLR.
31 Hendricks LA, et al. (2016) Generating visual explanations, in ECCV. (Springer), pp. 3–19.
32 Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using Places database, in NeurIPS.
33 Xiao T, Liu Y, Zhou B, Jiang Y, Sun J (2018) Unified perceptual parsing for scene understanding, in ECCV.
34 Geirhos R, et al. (2018) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.
35 Deng J, et al. (2009) ImageNet: A large-scale hierarchical image database, in CVPR.
36 Wen W, Wu C, Wang Y, Chen Y, Li H (2016) Learning structured sparsity in deep neural networks, in NeurIPS. pp. 2074–2082.
37 Li H, Kadav A, Durdanovic I, Samet H, Graf HP (2017) Pruning filters for efficient convnets, in ICLR.
38 Morcos AS, Barrett DG, Rabinowitz NC, Botvinick M (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
39 Zhou B, Sun Y, Bau D, Torralba A (2018) Revisiting the importance of individual units in CNNs via ablation. arXiv preprint arXiv:1806.02891.
40 Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks, in ICLR.
41 Zhu JY, Krähenbühl P, Shechtman E, Efros AA (2016) Generative visual manipulation on the natural image manifold, in ECCV.
42 Yu F, et al. (2015) LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
43 Szegedy C, et al. (2014) Intriguing properties of neural networks, in ICLR.
44 Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples, in ICLR.
45 Carlini N, Wagner D (2017) Towards evaluating the robustness of neural networks, in 2017 IEEE Symposium on Security and Privacy (SP). (IEEE), pp. 39–57.
46 Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models resistant to adversarial attacks, in ICLR.
47 Rauber J, Brendel W, Bethge M (2017) Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
48 Ilyas A, et al. (2019) Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175.
49 Chen X, et al. (2016) InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, in NeurIPS. pp. 2172–2180.
50 Higgins I, et al. (2017) beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR 2(5):6.
51 Zhang Q, Nian Wu Y, Zhu SC (2018) Interpretable convolutional neural networks, in CVPR.
52 Achille A, Soatto S (2018) Emergence of invariance and disentanglement in deep representations. JMLR 19(1):1947–1980.
53 Zhou B, et al. (2017) Scene parsing through ADE20K dataset, in CVPR.
54 Van De Weijer J, Schmid C, Verbeek J, Larlus D (2009) Learning color names for real-world applications. IEEE Transactions on Image Processing 18(7):1512–1523.
... From these approaches, we can at most obtain a global understanding of all possible features the model can use, but how these features interact with each other given some specific data sample and how the model infers a decision remain hidden. Other branches of global XAI propose methods, for example, to test a model's sensitivity to a priori known, expected or pre-categorized stimuli [28][29][30][31] . These approaches require labelled data, thus limiting, and standing in contrast to, the exploratory potential of local XAI. ...
... In this process, relevance is propagated backwards through the network, starting from the output until the input layer (Fig. 2a), which also provides relevance values for each intermediate element of the model (for example, channel of an intermediate layer). As the literature suggests that latent structures of neural networks are encoding abstract human-understandable concepts with distinct semantics, especially in higher layers 23,31,[37][38][39][40] , the channel-wise relevance values can be interpreted as scores quantifying the importance of the corresponding concepts in the inference process. ...
Full-text available
The field of explainable artificial intelligence (XAI) aims to bring transparency to today’s powerful but opaque deep learning models. While local XAI methods explain individual predictions in the form of attribution maps, thereby identifying ‘where’ important features occur (but not providing information about ‘what’ they represent), global explanation techniques visualize what concepts a model has generally learned to encode. Both types of method thus provide only partial insights and leave the burden of interpreting the model’s reasoning to the user. Here we introduce the Concept Relevance Propagation (CRP) approach, which combines the local and global perspectives and thus allows answering both the ‘where’ and ‘what’ questions for individual predictions. We demonstrate the capability of our method in various settings, showcasing that CRP leads to more human interpretable explanations and provides deep insights into the model’s representation and reasoning through concept atlases, concept-composition analyses, and quantitative investigations of concept subspaces and their role in fine-grained decision-making.
... Investigating these functions requires experimentation with trained models, using tools that incorporate varying degrees of human input. Hand-tooled approaches that rely on close manual inspection [Zeiler and Fergus, 2014, Zhou et al., 2014, Mahendran and Vedaldi, 2015, Olah et al., 2017, 2020, Elhage et al., 2021 or search for predefined phenomena [Wang et al., 2022 are increasingly complemented by more automatic approaches that enable larger-scale analysis [Bau et al., 2020, Mu and Andreas, 2020, Hernandez et al., 2022, Oikarinen and Weng, 2023, Conmy et al., 2023. Recent open-ended Figure 1: The FIND benchmark. ...
... For example, if the entity is plants and botany, f e (garden) and f e (tree) should return high values, while f e (car) should return a low value. This task is relevant to lines of work that automatically summarize maximally activating exemplar sets to characterize neuron function [Bau et al., 2020, Mu and Andreas, 2020, Hernandez et al., 2022, Bills et al., 2023, Singh et al., 2023, generalized to a setting where the interpreter must also produce the data. ...
Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior. We find that an off-the-shelf LM augmented with only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results show that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.
... Driven by their superior performance and the availability of open-source libraries, different forms of deep neural networks (DNNs) and specifically deep convolutional neural networks (DCNNs) became the standard solution to problems related to automatic image analysis. DCNNs are used for supervised and unsupervised machine learning and provide good performance in tasks related to analysis of image data [11][12][13][14][15][16][17][18]. However, the similarity between image variables must not be confused with correlation between two image variables. ...
Full-text available
Correlation is one of the most common statistical tools and one of the most basic methods for discovery from data. Over the past century, correlation has led to an extremely high number of discoveries in virtually all fields of science and is currently one of the basic building blocks that enable scientific research. In most cases, correlation is detected between numerical variables using tools such as Pearson or Spearman correlation. This paper describes a method that can identify rank correlation between two sequences of images. That is, instead of identifying a rank correlation between two numerical variables, the method identifies rank correlations between two variables such that each variable is an image. Experimental results show that the method is able to detect rank correlations between sequences of synthetic images, as well as sequences of natural images. The method can provide useful information for profiling visual data, and example discoveries that can be made with the method are discussed. Given the prevalence of correlation in data-driven discovery, combined with the increasing size and availability of databases of visual data, identifying rank correlations in image data can provide useful insights and new knowledge from the data. The method is nonparametric and can be applied to various types of image data.
... The main divergence of our method is in the Inspiration phase where we introduce a human observer element that generates a list of exploratory hypotheses. A similar human-introduced influence is evident, albeit to a lesser extent, in the work of Bau and colleagues [3]. In their analytic method, termed network dissection, Bau et al. (2020) identify object-detector units in a CNN, tuned to concepts that were not part of the set of labels the model was trained on. ...
Full-text available
We present a structured approach to combine explainability of artificial intelligence (AI) with the scientific method leading to scientific discovery. We demonstrate the utility of this approach in a proof-of-concept study where we extract previously unknown biomarkers from a convolutional neural network (CNN) model that is trained to classify patient sex in retinal fundus images. We note that this is a trait that is not currently recognized by diagnosticians in retinal images, yet, one that can be classified successfully by CNNs. Our methodology augments existing explainability tools in AI and results in novel retinal features that are teachable to clinicians. Specifically, our methodology consists of four phases: In Phase 1, CNN development, we train a Visual Geometry Group (VGG) model to recognize patient sex in retinal images. In Phase 2, Inspiration, we review visualizations that we obtained from post-hoc interpretability tools to make observations, and liberally articulate exploratory hypotheses regarding the decision process of the CNN model. Here, we listed 14 testable hypotheses based on specific retinal parameters related to possible differences in vasculature and the optic disc. In Phase 3, Exploration, we test all exploratory hypotheses generated at the previous phase on an independent dataset. Out of the 14 exploratory hypotheses, nine revealed significant differences. In Phase 4, Verification, we re-tested the nine flagged hypotheses from Phase 3 on a new independent dataset. Five of these hypotheses were verified, revealing (i) significantly greater length, (ii) more nodes, and (iii) more branches of retinal vasculature, (iv) greater retinal area covered by the vessels in the superior temporal quadrant, and (v) darker peri-papillary region in male eyes. Finally, we trained a group of ophthalmologists (N = 26) to recognize the newly discovered retinal features for sex-classification. While their pre-training performance was not different from chance-level or the performance of a non-expert group (N = 31), post-training, their performance increased significantly (p < .001, d = 2.63). These findings showcase the potential of our methodology in leveraging AI applications for retinal biomarker discovery, with the added utility of empowering medical practitioners with new diagnostic capabilities to enhance their clinical toolkit.
... Large language models are capable of unprecedented feats, yet largely remain inscrutable black boxes. Yet evidence has accumulated that models act as feature extractors: identifying increasingly complex properties of the input and representing these in the internal activations (Geva et al., 2021;Bau et al., 2020;Gurnee et al., 2023;Belinkov, 2022a;Burns et al., 2022;Goh et al., 2021;Elhage et al., 2022a). A key first step for interpreting them is understanding how these features are represented. ...
Full-text available
How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for ``my colour'' vs. ``opponent's colour'' may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.
... DNNs are adept at finding hierarchical representations and at solving complex tasks on large datasets. Bau [11] proposed an analytic framework for dissecting networks based on this. ...
... Recently, some studies have attempted to demonstrate the ability to identify the object represented by each neuron (Bau et al., 2017; Bau et al., 2020). While the mechanisms proposed may be effective at the engineering and application levels, it is debatable whether they are useful. ...
The past decade has witnessed the rise of connectionism, which has led to intensive attempts in the architectural research community to form a novel digital design paradigm built on the probabilistic computational mechanisms of deep neural networks. This thesis project aims to investigate the conceptual shift of architectural rationality in the emerging modes of connectionist design, and to relate this shift to the history of architectural theory. It argues that the rise of connectionism is a fundamental alternative to the ‘symbolic’ view of computation, as well as an alternative to modern design thinking. More generally, connectionism is an alternative scientific method to traditional ‘rationality’, in which design is addressed through rule-based problem solving that usually employs formal systems. The first half of the thesis contextualises the rise of connectionist design and discusses the development of the architectural relevance of the probability theory that connectionism relies on. The second half is an extended case study of selected connectionist design projects. It generalises across the projects to develop a critical discussion of type, style, and collective design thinking. Inspired by the operational nature of autoencoders, the discussion defines the connectionist concept of ‘latent space’ as a quintessential data-compression technology, of which types and styles become the ideal residents. The differentiable nature of latent space may allow designers to consciously intervene in the construction of design schemata in an encoded, vectorial field of artefact creation.
Objective: This investigation delved into the inner workings of previously published Deep Neural Networks (DNNs) designed to detect changes in diastolic function related to age using echocardiographic parameters. The primary goal was to decipher the predictive mechanism and determine whether biological age and gender played concealed roles within the DNN model. Methods: We conducted a secondary analysis of a previously published DNN model trained on data from 1009 patients (average age: 62±17 years, 57% females) to forecast risk phenogroups based on nine echocardiographic parameters. This model was assessed on both an internal cohort (n=243, mean age 62±17 years, ~57% females) and an external validation cohort (n=5596, mean age 76±5 years, ~57% females). To predict biological age and gender, we developed linear regression and classification models employing hidden-layer activations from the DNN. Model performance was assessed using Pearson correlation for regression and using accuracy and area under the curve (AUC) for classification. Results: Upon scrutinizing the hidden-layer activations, we observed that the model accurately captured biological age in both younger and older populations, particularly in low-risk phenogroups, with robust correlations for the entire population (0.94, p<0.001), males (0.90, p<0.001), and females (0.94, p<0.001). In high-risk phenogroups, the correlations were lower: 0.31 (p=0.274) for the entire population, 0.76 (p=0.003) for males, and 0.11 (p=0.723) for females. Predicting gender as an underlying factor yielded accuracies of 58.02% and 52.27% and an AUC of 0.65 on the two validation cohorts. Conclusion: This study underscores that the latent space within DNNs maintains a link with age in relation to diastolic functional parameters, offering an age-independent solution for predicting diastolic dysfunction. Dissecting the network can further enhance our comprehension of the information learned by DNNs, thereby providing novel pathophysiological insights for medical professionals.
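The probing strategy described above, fitting linear models on a frozen network's hidden activations to test for concealed factors, is straightforward to sketch. The following is a hedged illustration with scikit-learn; `hidden_activations.npy`, `age.npy`, and `sex.npy` are assumed, hypothetical inputs rather than the study's data pipeline.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one row of hidden-layer activations per patient.
activations = np.load("hidden_activations.npy")  # shape (n_patients, n_units)
age = np.load("age.npy")                         # biological age in years
sex = np.load("sex.npy")                         # 0/1 labels

A_tr, A_te, age_tr, age_te, sex_tr, sex_te = train_test_split(
    activations, age, sex, test_size=0.3, random_state=0)

# Linear probe for age: Pearson r between predicted and true age.
age_probe = LinearRegression().fit(A_tr, age_tr)
r, p = pearsonr(age_probe.predict(A_te), age_te)
print(f"age probe: r={r:.2f}, p={p:.3g}")

# Logistic probe for sex: held-out accuracy (an AUC could be added similarly).
sex_probe = LogisticRegression(max_iter=1000).fit(A_tr, sex_tr)
print(f"sex probe accuracy: {sex_probe.score(A_te, sex_te):.2%}")
```

High probe performance indicates the factor is linearly decodable from the latent space, which is the sense in which the study finds age "concealed" in the model.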
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network, or a sequence of words in a captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model families: (1) CNNs with fully-connected layers (e.g., VGG), (2) CNNs used for structured outputs (e.g., captioning), (3) CNNs used in tasks with multi-modal inputs (e.g., visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention-based models learn to localize discriminative regions of the input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al., in: Computer vision and pattern recognition, 2017) to provide textual explanations for model decisions. Finally, we design and conduct human studies to measure whether Grad-CAM explanations help users establish appropriate trust in predictions from deep networks, and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is publicly available, along with a demo on CloudCV (Agrawal et al., in: Mobile cloud visual media computing, pp 265-290. Springer, 2015) and a video demonstration.
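The core Grad-CAM computation (pool the gradients of the class score over space, weight the final conv layer's channels by those pooled gradients, and ReLU the weighted sum) is compact enough to sketch. This is a minimal PyTorch illustration against torchvision's pretrained VGG16, not the authors' released implementation; the layer index and normalization are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
target_layer = model.features[28]  # last conv layer of VGG16 (assumed)

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: (1, 3, 224, 224) normalized tensor. Returns a (224, 224) heatmap."""
    score = model(image)[0, class_idx]   # target class score
    model.zero_grad()
    score.backward()                     # gradients flow into the hooked layer
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # spatially pooled grads
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()           # normalize to [0, 1]
```

The ReLU keeps only regions that positively influence the target class, which is what makes the resulting map class-discriminative.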
Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.
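The intervention described here, switching small sets of units off and observing which objects disappear from the output, reduces to zeroing feature-map channels at an intermediate generator layer. The sketch below is a generic hedged illustration; `generator`, `layer`, and the unit indices are placeholders, not the paper's released tooling.

```python
import torch

def ablate_units(generator, z, layer, units):
    """Generate an image with the given channels of `layer` zeroed out.

    generator: a convolutional GAN generator.
    z: latent input batch.
    layer: the intermediate module whose output channels we intervene on.
    units: list of channel indices to switch off.
    """
    def zero_channels(module, inputs, output):
        output = output.clone()
        output[:, units] = 0.0  # the causal intervention: deactivate the units
        return output

    handle = layer.register_forward_hook(zero_channels)
    try:
        with torch.no_grad():
            image = generator(z)
    finally:
        handle.remove()  # restore the unmodified generator
    return image
```

Comparing `generator(z)` against `ablate_units(generator, z, layer, units)` on the same latent `z` reveals which objects (trees, windows, artifacts) the ablated units control; setting channels to a positive constant instead of zero gives the complementary "activation" intervention.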
Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: one with a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights. Finally, we show that invariance and independence of the components of the representation learned by the network are bounded above and below by the information in the weights, and therefore are implicitly optimized during training. The theory enables us to quantify and predict sharp phase transitions between underfitting and overfitting of random labels when using our regularized loss, which we verify in experiments, and sheds light on the relation between the geometry of the loss function, invariance properties of the learned representation, and generalization error.
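To make the two equivalent bounds concrete, here is a hedged reconstruction of the regularized objective in my own notation (not copied from the paper): the cross-entropy loss is penalized either by the mutual information between the weights and the dataset, or by a KL term over a weight posterior, which upper-bounds that mutual information in expectation.

```latex
% Hedged reconstruction; notation mine. D is the dataset, w the weights,
% q(w|D) a posterior over weights, p(w) a prior, beta a trade-off parameter.
\begin{aligned}
\mathcal{L} &= H_{p,q}(y \mid x, w) \;+\; \beta\, I(w; \mathcal{D})
  && \text{(Information Bottleneck for the weights)} \\
I(w; \mathcal{D}) &\le \mathbb{E}_{\mathcal{D}}\,
  \mathrm{KL}\!\left( q(w \mid \mathcal{D}) \,\Vert\, p(w) \right)
  && \text{(PAC-Bayes style upper bound)}
\end{aligned}
```

Sweeping beta is what produces the sharp underfitting-to-overfitting phase transitions on random labels that the abstract reports.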
Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, it remains unclear how the learned solutions of networks that generalize differ from those of networks that do not. Additionally, the tuning properties of single directions (defined as the activation of a single unit or some linear combination of units in response to some input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network's reliance on single directions is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
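The central measurement here, how quickly accuracy degrades as progressively more units are clamped, can be sketched as a simple ablation loop. This is a hedged illustration of the idea; `model`, `layer`, and `eval_accuracy` are hypothetical placeholders for a trained network, the layer under study, and an evaluation routine.

```python
import random

def single_direction_reliance(model, layer, eval_accuracy, n_units, n_steps=10):
    """Measure accuracy as progressively more random units are clamped to zero.

    Per the paper's finding, a network that relies heavily on single
    directions degrades quickly, and slow degradation tends to track
    better generalization.
    """
    order = random.sample(range(n_units), n_units)  # random ablation order
    ablated = []

    def clamp(module, inputs, output):
        output = output.clone()
        output[:, ablated] = 0.0  # zero the currently ablated units
        return output

    handle = layer.register_forward_hook(clamp)
    curve = []
    try:
        for k in range(0, n_units + 1, max(1, n_units // n_steps)):
            ablated = order[:k]                     # grow the ablated set
            curve.append((k, eval_accuracy(model)))  # (units removed, accuracy)
    finally:
        handle.remove()
    return curve
```

Plotting the returned curve for networks trained with different fractions of corrupted labels reproduces, in miniature, the comparison the abstract describes.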