Understanding the Role of Individual Units
in a Deep Neural Network
David Bau^{a,1}, Jun-Yan Zhu^{a,b}, Hendrik Strobelt^{c}, Agata Lapedriza^{d,e}, Bolei Zhou^{f}, and Antonio Torralba^{a}
This preprint was compiled on September 15, 2020. The peer-reviewed version can be found at
Deep neural networks excel at finding hierarchical representations that solve complex tasks over large data sets. How can we humans
understand these learned representations? In this work, we present network dissection, an analytic framework to systematically identify the
semantics of individual hidden units within image classification and image generation networks. First, we analyze a convolutional neural
network (CNN) trained on scene classification and discover units that match a diverse set of object concepts. We find evidence that the
network has learned many object classes that play crucial roles in classifying scene classes. Second, we use a similar analytic method to
analyze a generative adversarial network (GAN) model trained to generate scenes. By analyzing changes made when small sets of units are
activated or deactivated, we find that objects can be added and removed from the output scenes while adapting to the context. Finally, we
apply our analytic framework to understanding adversarial attacks and to semantic image editing.
Machine Learning | Deep Networks | Computer Vision
Can the individual hidden units of a deep network teach us how
the network solves a complex task? Intriguingly, within state-of-the-
art deep networks, it has been observed that many single units
match human-interpretable concepts that were not explicitly taught
to the network: units have been found to detect objects, parts,
textures, tense, gender, context, and sentiment. Finding
such meaningful abstractions is one of the main goals of deep
learning (8), but the emergence and role of such concept-specific
units is not well-understood. Thus we ask: how can we quantify
the emergence of concept units across the layers of a network?
What types of concepts are matched; and what function do they
serve? When a network contains a unit that activates on trees, we
wish to understand if it is a spurious correlation, or if the unit has
a causal role that reveals how the network models its higher-level
notions about trees.
To investigate these questions, we introduce network dissection,
our method for systematically mapping the semantic
concepts found within a deep convolutional neural network. The
basic unit of computation within such a network is a learned con-
volutional filter; this architecture is the state-of-the-art for solving
a wide variety of discriminative and generative tasks in computer
vision. Network dissection identifies, visualizes, and quantifies
the role of individual units in a network by comparing the
activity of each unit to a range of human-interpretable pattern-
matching tasks such as the detection of object classes.
Previous approaches for understanding a deep network include
the use of salience maps: those methods ask where a
network looks when it makes a decision. The goal of our current
inquiry is different: we ask what a network is looking for, and
why. Another approach is to create simplified surrogate models to
mimic and summarize a complex network’s behavior; and
another technique is to train explanation networks that generate
^a Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139;
^b Adobe Research, Adobe Inc., San Jose, CA 95110;
^c Massachusetts Institute of Technology–International Business Machines (IBM) Watson Artificial Intelligence Laboratory, Cambridge, MA 02142;
^d Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139;
^e Estudis d’Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya, 08018 Barcelona, Spain;
^f Department of Information Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
^1 Corresponding author e-mail:
human-readable explanations of a network. In contrast to
those methods, network dissection aims to directly interpret the
internal computation of the network itself, rather than training an
auxiliary model.
We dissect the units of networks trained on two different types of
tasks: image classification and image generation. In both settings,
we find that a trained network contains units that correspond to
high-level visual concepts that were not explicitly labeled in the
training data. For example, when trained to classify or generate
natural scene images, both types of networks learn individual units
that match the visual concept of a ‘tree’ even though we have never
taught the network the tree concept during training.
Focusing our analysis on the units of a network allows us to
test the causal structure of network behavior by activating and
deactivating the units during processing. In a classifier, we use
these interventions to ask whether the classification performance
of a specific class can be explained by a small number of units that
identify visual concepts in the scene class. For example, we ask
how the ability of the network to classify an image as a ski resort
is affected when removing a few units that detect snow, mountains,
trees, and houses. Within a scene generation network, we ask how
the rendering of objects in a scene is affected by object-specific
units. How does the removal of tree units affect the appearance of
trees and other objects in the output image?
Finally, we demonstrate the usefulness of our approach with
two applications. We show how adversarial attacks on a classifier
can be understood as attacks on the important units for a class.
And we apply unit intervention on a generator to enable a human
user to modify semantic concepts such as trees and doors in an
image by directly manipulating units.
arXiv:2009.05041v2 [cs.CV] 12 Sep 2020
[Figure 1. Panels: (a) VGG-16 architecture (input image, convolutional layers, fully connected layers fc6, fc7, fc8, scene prediction such as “conference room”); (b) unit 10 activation (from layer conv5_3); (c) single units tested on scenes: unit 150 “airplane” (object), unit 141 “fur” (material), unit 208 “person top” (part); (d) conv5_3 summary of matched visual concepts for all units: 51 objects, 22 parts, 12 materials, 8 colors; (e) dissection of each convolutional layer; (f) out-of-domain object detection test: conv5_3 unit 150 peak activation per image on airplane and non-airplane images.]
Fig. 1. The emergence of single-unit object detectors within a VGG-16 scene classifier. (a) VGG-16 consists of 13 convolutional layers followed by
three fully connected layers, fc6, fc7, and fc8. (b) The activation of a single filter on an input image can be visualized as the region where the filter activates beyond its top 1% quantile
level. (c) Single units are scored by matching high-activating regions against a set of human-interpretable visual concepts; each unit is labeled with its best-matching concept
and visualized with maximally-activating images. (d) Concepts that match units in the final convolutional layer are summarized, showing a broad diversity of detectors for
objects, object parts, materials, and colors. Many concepts are associated with multiple units. (e) Comparing all the layers of the network reveals that most object detectors
emerge at the last convolutional layers. (f) Although the training set contains no object labels, unit 150 emerges as an ‘airplane’ object detector that activates much more
strongly on airplane objects than non-airplane objects, as tested against a dataset of labeled object images not previously seen by the network. The jitter plot shows peak
activations for the unit on 1,000 airplane and 1,000 non-airplane Imagenet images sampled at random, and the curves show the kernel density estimates of these activations.
Emergence of Object Detectors in a Scene Classifier. We first
identify individual units that emerge as object detectors when
training a network on a scene classification task. The network
we analyze is a VGG-16 CNN trained to classify images into
365 scene categories using the Places365 data set. We
analyze all units within the 13 convolutional layers of the network
(Figure 1a). Please refer to Materials and Methods for further
details on networks and datasets.
Each unit $u$ computes an activation function $a_u(x, p)$ that outputs
a signal at every image position $p$ given a test image $x$. Filters
with low-resolution outputs are visualized and analyzed at high-resolution
positions $p$ using bilinear upsampling. Denote by $t_u$ the
top 1% quantile level for $a_u$, that is, writing $P_{x,p}[\cdot]$ to indicate the
probability that an event is true when sampled over all positions
and images, we define the threshold
$t_u \equiv \max \{ t : P_{x,p}[a_u(x, p) > t] \geq 0.01 \}$.
In visualizations we highlight the activation region
$\{ p \mid a_u(x, p) > t_u \}$ above the threshold. As seen in Figure 1b,
this region can correspond to semantics such as the heads of all
the people in the image. To identify filters that match semantic
concepts, we measure the agreement between each filter and a
visual concept $c$ using a computer vision segmentation model
$s_c : (x, p) \to \{0, 1\}$ that is trained to predict the presence of the
visual concept $c$ within image $x$ at position $p$. We quantify the
agreement between concept $c$ and unit $u$ using the intersection
over union (IoU) ratio:

$$\mathrm{IoU}_{u,c} = \frac{P_{x,p}\left[ s_c(x, p) \wedge (a_u(x, p) > t_u) \right]}{P_{x,p}\left[ s_c(x, p) \vee (a_u(x, p) > t_u) \right]} \tag{1}$$

This IoU ratio is computed on the set of held-out validation set
images. Within this validation set, each unit is scored against
1,825 segmented concepts
, including object classes, parts of
objects, materials, and colors. Then each unit is labeled with
the highest-scoring matching concept. Figure 1c shows several
labeled concept detector units along with the five images with the
highest unit activations.
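
The scoring procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, activations are assumed to be already upsampled to the resolution of the segmentation masks, and the empirical quantile over all sampled positions stands in for the threshold $t_u$.

```python
import numpy as np

def top_quantile_threshold(activations, quantile=0.01):
    """Threshold t_u such that about `quantile` of all values exceed it.

    activations: float array of shape (n_images, H, W) for a single unit.
    """
    return np.quantile(activations, 1.0 - quantile)

def unit_concept_iou(activations, concept_masks, threshold):
    """Intersection over union of {a_u > t_u} and the concept mask s_c.

    concept_masks: boolean array with the same shape as `activations`,
    True wherever the segmentation model reports the concept.
    """
    active = activations > threshold
    intersection = np.logical_and(active, concept_masks).sum()
    union = np.logical_or(active, concept_masks).sum()
    # An empty union means neither the unit nor the concept ever fires.
    return float(intersection) / float(union) if union > 0 else 0.0

def best_matching_concept(activations, masks_by_concept, quantile=0.01):
    """Label a unit with the concept that has the highest IoU."""
    threshold = top_quantile_threshold(activations, quantile)
    scores = {name: unit_concept_iou(activations, mask, threshold)
              for name, mask in masks_by_concept.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Pooling counts over all images before dividing, instead of averaging per-image ratios, mirrors the definition of the IoU as a ratio of probabilities taken over all positions and images.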
When examining all 512 units in the last convolutional layer, we
find many detected object classes and relatively fewer detected
object parts and materials:
conv5_3 units match 51 object classes,
22 parts, 12 materials, and 8 colors. Several visual concepts
such as ‘airplane’ and ‘head’ are matched by more than one unit.
Figure 1d lists every segmented concept matching units in layer
conv5_3, excluding any units with IoU ratio below 4%, and shows the
frequency of units matching each concept. Across different layers,
the last convolutional layer has the largest number of object
classes detected by units, while the number of object parts peaks
two layers earlier, at conv5_1, which has units matching 28 object
classes, 25 parts, 9 materials, and 8 colors (Figure 1e). A
complete visualization of all the units of conv5_3 is provided in SI,
as well as more detailed comparisons between layers of VGG-16,
comparisons to layers of AlexNet and ResNet, and an
Bau et al. |September 15, 2020 |2–8
[Figure 2. Panels: (a) units of conv5_3 causing most accuracy loss on the single class “ski resort” when removed individually: unit 261 “snow” (acc lost: train 8.9%, val 10.9%); unit 149 “mountain top” (acc lost: train 1.2%, val 3.5%); unit 242 “house” (acc lost: train 1.5%, val 2.5%); unit 105 “tree top” (part; acc lost: train 0.8%, val 2.0%). (b) Validation accuracy when units are removed as a set:

                                     Balanced single-class      All-class
                                     ‘ski resort’ accuracy      accuracy
Unchanged VGG-16                     81.4%                      53.3%
4 most important units removed       64.0%                      53.2%
20 most important units removed      53.5%                      52.6%
492 least important units removed    77.7%                      2.1%
Chance level                         50.0%                      0.27%

(c) “Ski resort” accuracy when removing sets of units of different sizes. (d) Removing the 20 most-important and 492 least-important units for all scene classes. (e) Interpretability of units that are important to more and fewer classes; bars show standard error over n units.]
Fig. 2. A few units play important roles in classification performance. (a) The four conv5_3 units that
cause the most damage to balanced classification accuracy for “ski resort” when each unit is individually
removed from the network; dissection reveals that these most-important units detect visual concepts
that are salient to ski resorts. (b) When the most-important units to the class are removed all
together, balanced single-class accuracy drops to near chance levels. When the 492 least-important
units in conv5_3 are removed all together (leaving only the 20 most-important units), accuracy remains
high. (c) The effect on “ski resort” prediction accuracy when removing sets of units of successively
larger sizes. These units are sorted in ascending and descending order of each unit’s individual
impact on accuracy. (d) Repeating the experiment for each of 365 scene classes. Each point plots
single-class classification accuracy in one of three settings: the original network; the network after
removing the 20 units most important to the class; and with all conv5_3 units removed except the 20
most important ones. Classes are ordered alphabetically along the class axis. (e) The relationship between
unit importance and interpretability. Units that are among the top-4 important units for more classes
are also closer matches for semantic concepts, as measured by $\mathrm{IoU}_{u,c}$.
analysis of the texture versus shape sensitivity of units using a
stylization method based on (34).
Interestingly, object detectors emerge despite the absence of
object labels in the training task. For example, the aviation-related
scene classes in the training set are ‘airfield’, ‘airport terminal’,
‘hangar’, ‘landing deck’, and ‘runway.’ Scenes in these classes
do not always contain airplanes, and there is no explicit ‘airplane’
object label in the training set. Yet unit 150 emerges as a detector
that locates airplanes, scoring $\mathrm{IoU} = 9.0\%$ agreement with our
reference airplane segmentations in scene images. The accuracy
of the unit as an airplane classifier can be further verified on Ima-
genet (
), a dataset that contains 1,000 object classes; its images
and classes are disjoint from the Places365 training set. Imagenet
contains two airplane class labels: ‘airliner’ and ‘warplane’, and
a simple threshold on unit 150 (peak activation
4) achieves
85.6% balanced classification accuracy on the task of distinguish-
ing these airplane classes from the other object classes. Figure 1f
shows the distribution of activations of this unit on a sample of
airplane and non-airplane Imagenet images.
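
Testing a single unit as a classifier, as described above, amounts to thresholding its peak activation per image and averaging the accuracy on the two groups. The sketch below is illustrative only: the function name and the toy numbers are assumptions, and the threshold passed in is a free parameter rather than the value used in the paper.

```python
import numpy as np

def balanced_threshold_accuracy(peak_positive, peak_negative, threshold):
    """Balanced accuracy of the rule 'peak activation > threshold'.

    peak_positive: peak unit activation for each image of the target class.
    peak_negative: peak unit activation for each image of any other class.
    """
    true_positive_rate = np.mean(np.asarray(peak_positive) > threshold)
    true_negative_rate = np.mean(np.asarray(peak_negative) <= threshold)
    # Averaging the two rates keeps the score at 50% for a trivial
    # classifier, regardless of how unbalanced the two groups are.
    return 0.5 * (true_positive_rate + true_negative_rate)

# Toy example with made-up peak activations.
airplane_peaks = [5.0, 6.0, 1.0, 7.0]
other_peaks = [0.0, 1.0, 2.0, 6.0]
score = balanced_threshold_accuracy(airplane_peaks, other_peaks, threshold=3.0)
```

In this toy case three of the four images in each group fall on the correct side of the threshold, so the balanced accuracy is 0.75.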
Role of Units in a Scene Classifier. How does the network use
the above object detector units? Studies of network compression
have shown that many units can be eliminated from a network while
recovering overall classification accuracy by retraining.
One way to estimate the importance of an individual unit is to
examine the impact of the removal of the unit on mean network
accuracy (38, 39).
To obtain a more fine-grained understanding of the causal role
of each unit within a network, we measure the impact of removing
each unit on the network’s ability to classify each individual
scene class. Units are removed by forcing the specified unit to
output zero and leaving the rest of the network intact. No retraining
is done. Single-class accuracy is tested on the balanced two-way
classification problem of discriminating the specified class from all
the other classes.
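
The intervention and the metric described above can be sketched as follows. This is a hedged illustration, not the authors' code: `ablate_units` and `balanced_single_class_accuracy` are hypothetical names, the feature tensor layout `(n_images, n_units, H, W)` is an assumption, and the step that feeds the ablated features through the remaining layers to obtain predictions is left to the caller.

```python
import numpy as np

def ablate_units(features, units):
    """Return a copy of a layer's output with the given units forced to zero.

    features: array of shape (n_images, n_units, H, W).
    units:    list of unit (channel) indices to remove.
    """
    ablated = features.copy()
    ablated[:, units] = 0.0
    return ablated

def balanced_single_class_accuracy(predictions, labels, target):
    """Accuracy on the two-way problem 'target class' versus 'any other class'.

    The positive and negative groups are weighted equally, so always
    answering 'other' scores 50% even though most images are negatives.
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    is_target = labels == target
    hit_rate = np.mean(predictions[is_target] == target)
    reject_rate = np.mean(predictions[~is_target] != target)
    return 0.5 * (hit_rate + reject_rate)
```

Because no retraining is done, any drop in this score after calling `ablate_units` is attributable to the removed units alone.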
The relationships between objects and scenes learned by the
network can be revealed by identifying the most important units
for each class. For example, the four most important conv5_3
units for the class ‘ski resort’ are shown in Figure 2a: these units
damage ski resort accuracy most when removed. The units detect
snow, mountains, houses, and trees, all of which seem salient to
ski resort scenes.
To test whether the ability of the network to classify ski resorts
can be attributed to just the most important units, we remove se-
lected sets of units. Figure 2b shows that removing just these 4
(out of 512) units reduces the network’s accuracy at discriminating
‘ski resort’ scenes from 81.4% to 64.0%, and removing the 20
most important units in conv5_3 reduces class accuracy further to
53.5%, near chance levels (chance is 50%), even though classifi-
cation accuracy over all scene classes is hardly affected (changing
from 53.3% to 52.6%, where chance is 0.27%). In contrast, re-
moving the 492 least-important units, leaving only the 20 most
important units in conv5_3, has only a small impact on accuracy
for the specific class, reducing ski resort accuracy by only 3.7%,
to 77.7%. Of course, removing so many units damages the ability
of the network to classify other scene classes: removing the 492
least-important units reduces all-class accuracy to 2.1% (chance
is 0.27%).
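
The chance levels and the accuracy change quoted above follow from simple arithmetic, written out here as a check. It assumes that chance on the all-class task means uniform guessing over the 365 scene classes, which the text implies but does not state explicitly.

```latex
\begin{align*}
\text{all-class chance} &= \frac{1}{365} \approx 0.27\% \\
\text{balanced two-way chance} &= \frac{1}{2} = 50\% \\
\text{ski resort accuracy with the 492 least-important units removed} &= 81.4\% - 3.7\% = 77.7\%
\end{align*}
```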
The effect of removing varying numbers of most-important and
least-important units upon ‘ski resort’ accuracy is shown in Fig-
ure 2c. To avoid overfitting to the evaluation data, we rank the
importance of units according to their individual impact on single-
class ski resort accuracy on the training set, and the plotted impact
of removing sets of units is evaluated on the held-out validation set.
The network can be seen to derive most of its performance for ski
resort classification from just the most important units. Single-class
accuracy can even be improved by removing the least important
units; this effect is further explored in SI.
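
The evaluation protocol in this paragraph, ranking on the training set and measuring on held-out data, can be sketched as below. The callable `accuracy_fn(removed_units, split)` is a hypothetical stand-in for running the network with the listed units zeroed and returning single-class accuracy on the named split; everything else is plain Python.

```python
def rank_units_by_importance(n_units, accuracy_fn):
    """Order units from most to least important for one class.

    Importance is the drop in training-set accuracy when a unit is
    removed on its own.
    """
    baseline = accuracy_fn([], "train")
    drops = [baseline - accuracy_fn([u], "train") for u in range(n_units)]
    return sorted(range(n_units), key=lambda u: drops[u], reverse=True)

def removal_curve(ranked_units, accuracy_fn, sizes):
    """Held-out accuracy after removing the first k ranked units, for each k."""
    return [accuracy_fn(ranked_units[:k], "val") for k in sizes]

# Toy stand-in for a network: each unit contributes a fixed amount of accuracy.
contribution = [0.0, 0.2, 0.05, 0.1]

def toy_accuracy(removed_units, split):
    return 0.9 - sum(contribution[u] for u in removed_units)

ranking = rank_units_by_importance(len(contribution), toy_accuracy)
curve = removal_curve(ranking, toy_accuracy, sizes=[0, 1, 2])
```

Passing `ranking[::-1]` to `removal_curve` gives the complementary curve for removing the least important units first.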
This internal organization, in which the network relies on a small
number of important units for most of its accuracy with respect
to a single output class, is seen across all classes. Figure 2d
repeats the same experiment for each of the 365 scene classes.
Removing the 20 most important conv5_3 units for each class
reduces single-class accuracy to 53.0% on average, near chance
[Figure 3. Panels: (a) Progressive GAN architecture (random vector to output image); (b) unit 381 activation (dissected as “lamp”); (c) dissection of each convolutional layer; (d) layer5 detail: 14 objects, 33 parts, 1 material, 6 colors, with matched concepts including exhaust hood, kitchen island, and work surface and their parts; (e) single units: unit 314 “window bottom” (part), unit 498 “oven” (object), unit 244 “chair top” (part), unit 58 “red” (color); (f) unit 314 activation for images with and without large windows; (g) window images in which unit 314 does not activate.]
Fig. 3. The emergence of object- and part-specific units within a Progressive GAN generator. (a) The analyzed Progressive GAN consists of 15 convolutional layers that
transform a random input vector into a synthesized image of a kitchen. (b) A single filter is visualized as the region of the output image where the filter activates beyond its top
1% quantile level; note that the filters are all precursors to the output. (c) Dissecting all the layers of the network shows a peak in object-specific units at layer5 of the network.
(d) A detailed examination of layer5 shows more part-specific units than objects, and many visual concepts corresponding to multiple units. (e) Units do not correspond to
exact pixel patterns: a wide range of visual appearances for ovens and chairs are generated when an oven or chair part unit is activated. (f) When a unit specific to window
parts is tested as a classifier, on average the unit activates more strongly on generated images that contain large windows than images that do not. The jitter plot shows the
peak activation of unit 314 on 800 generated images that have windows larger than 5% of the image area as estimated by a segmentation algorithm, and 800 generated images
that do not. (g) Some counterexamples: images for which unit 314 does not activate but where windows are synthesized nevertheless.
levels. In contrast, removing the 492 least important units only
reduces single-class accuracy by an average of 3.6%, just a slight
reduction. We conclude that the emergent object detection done
by units of conv5_3 is not spurious: each unit is important to a
specific set of classes, and the object detectors can be interpreted
as decomposing the network’s classification of individual scene
classes into simpler sub-problems.
Why do some units match interpretable concepts so well while
other units do not? The data in Figure 2e show that the most
interpretable units are those that are important to many different
output classes. Units that are important to only one class (or
none) are less interpretable, as measured by $\mathrm{IoU}_{u,c}$. We further find that
important units are predominantly positively correlated with their
associated classes, and different combinations of units provide
support for each class. Measurements of unit-class correlations
and examples of overlapping combinations of important units are
detailed in SI.
Does the emergence of interpretable units such as airplane,
snow, and tree detectors depend on having training set labels that
divide the visual world into hundreds of scene classes? Perhaps
the taxonomy of scenes encodes distinctions that are necessary
to learn about objects. Or is it possible for a network to infer such
concepts from the visual data itself? To investigate this question,
we next conduct a similar set of experiments on networks trained
to solve unsupervised tasks.
Emergence of Object Detectors in a GAN. A generative adver-
sarial network (GAN) learns to synthesize random realistic images
that mimic the distribution of real images in a training set.
Architecturally, a trained GAN generator is the reverse of a classifier,
producing a realistic image from a random input latent vector.
Unlike classification, it is an unsupervised setting: no human an-
notations are provided to a GAN, so the network must learn the
structure of the images by itself.
Remarkably, GANs have been observed to learn global seman-
tics of an image: for example, interpolating between latent vectors
can smoothly transform the layout of a room or change the
texture of an object. We wish to understand whether the
GAN also learns hierarchical structure, for example, if it learns to
decompose the generation of a scene into meaningful parts.
We test a Progressive GAN architecture trained to imitate
LSUN kitchen images. This network architecture consists
of 15 convolutional layers, as shown in Figure 3a. Given
a 512-dimensional vector sampled from a multivariate Gaussian
distribution, the network produces a realistic 256 × 256 image after
processing the data through the 15 layers. As with a classifier
network, each unit is visualized by showing the regions where the
filter activates above its top 1% quantile level, as shown in
Figure 3b. Importantly, causality in a generator flows in the opposite
direction from that in a classifier: when unit 381 activates on lamp shades
in an image, it is not detecting objects in the image, because the
[Figure 4. Panels: (a) causal effect on generation of trees when “tree” units are removed (unchanged, 2, 4, 8, and 20 units removed; units ranked by IoU match with tree segmentations); (c) effect of activating “door” units depends on location (cases 1 to 4); (d) effect of activating “door” units depends on object context (original object at modified location). Plot annotations: area shows 99% confidence intervals over n=10000 images; bars show standard deviation over n=10000 images.]
Fig. 4. The causal effect of altering units within a GAN generator. (a) When successively
larger sets of units are removed from a GAN trained to generate outdoor church
scenes, the tree area of the generated images is reduced. Removing 20 tree units
removes more than half the generated tree pixels from the output. (b) Qualitative results:
removing tree units affects trees while leaving other objects intact. Building parts that were
previously occluded by trees are rendered as if revealing the objects that were behind the
trees. (c) Doors can be added to buildings by activating 20 door units. The location, shape,
size, and style of the rendered door depend on the location of the activated units. The
same activation levels produce different doors, or no door at all (case 4), depending on
location. (d) Similar context dependence can be seen quantitatively: doors can be added in
reasonable locations such as at the location of a window, but not in abnormal locations
such as on a tree or in the sky.
filter activation occurs before the image is generated. Instead, the
unit is part of the computation that ultimately renders the objects.
To identify the location of units in the network that are associ-
ated with object classes, we apply network dissection to the units
of every layer of the network. In this experiment, the reference seg-
mentation models and thresholds used are the same as those used
to analyze the VGG-16 classifier. However, instead of analyzing
agreement with objects that appear in the input data, we analyze
agreement with segmented objects found in the generated output
images. As shown in Figure 3c, the largest number of emergent
concept units does not appear at the edge of the network as we saw
in the classifier, but in the middle: layer5 has units that match
the largest number of distinct object and part classes.
Figure 3d shows each object, part, material, and color that
matches a unit in layer5 with IoU > 4%. This layer contains 19
object-specific units, 41 units that match object parts, one material unit,
and six color units. As seen in the classification network, visual
concepts such as ‘oven’ and ‘chair’ match many units. Unlike
in the classifier, more object parts are matched than whole
objects.
In Figure 3e, individual units show a wide range of visual diversity:
the units do not appear to rigidly match a specific pixel
pattern, but rather different appearances for a particular class, for
example, various styles of ovens, or different colors and shapes of
kitchen stools.
In Figure 3f, we apply the window-specific unit 314 as an image
classifier. We find a strong gap between the activation of the unit
when a large window is generated and when no large window
is generated. Furthermore, a simple threshold (peak activation
03) can achieve a 78.2% accuracy in predicting whether the
generated image will have a large window or not. Nevertheless,
the distribution density curve reveals that images that contain large
windows can often be generated without activating unit 314. Two
such samples are shown in Figure 3g. These examples suggest
that other units could potentially synthesize windows.
Role of Units in a GAN. The correlations between units and
generated object classes are suggestive, but they do not prove
that the units that correlate with an object class actually cause the
generator to render instances of the object class. To understand
the causal role of a unit in a GAN generator, we test the output of
the generator when sets of units are directly removed or activated.
We first remove successively larger sets of tree units from a
Progressive GAN trained on LSUN church scenes. We
rank units by their IoU match with tree segmentations
to identify the most tree-specific units.
When successively larger sets of these tree units are
removed from the network, the GAN generates images with fewer
and smaller trees (Figure 4a). Removing the 20 most tree-specific
units reduces the number of tree pixels in the generated output by
53.3%, as measured over 10,000 randomly generated images.
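
The reduction figure quoted above is a ratio of segmented pixel counts before and after the intervention. A minimal sketch of that measurement follows, assuming boolean tree masks have already been produced by a segmentation model for the same latent vectors with and without the units removed; the function name and array layout are hypothetical.

```python
import numpy as np

def relative_pixel_reduction(masks_before, masks_after):
    """Fraction of concept pixels that disappear after an intervention.

    masks_before, masks_after: boolean arrays of shape (n_images, H, W)
    marking concept pixels in the original and the modified outputs.
    """
    before = int(np.asarray(masks_before, dtype=bool).sum())
    after = int(np.asarray(masks_after, dtype=bool).sum())
    if before == 0:
        return 0.0  # nothing to remove in the first place
    return (before - after) / before
```

Summing over the whole batch before dividing weights every pixel equally, so images with large trees dominate the estimate; a per-image average would be a different, equally defensible choice.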
When tree-specific units are removed, the generated images
continue to look similarly realistic. Although fewer and smaller trees
are generated, other objects such as buildings are unchanged.
Remarkably, parts of buildings that were occluded by trees are
hallucinated, as if removing the trees reveals the walls and win-
dows behind them (Figure 4b). The generator appears to have
computed more details than are necessary to render the final out-
put; the details of a building that are hidden behind a tree can
only be revealed by suppressing the generation of the tree. The
appearance of such hidden details strongly suggests that the GAN
is learning a structured statistical model of the scene that extends
beyond a flat summarization of visible pixel patterns.
Units can also be forced on to insert new objects into a generated
scene. We use IoU to find the 20 most door-specific
units of the same outdoor church GAN. At
tested locations, the activations for this set of 20 units are all forced
to a high value. Figure 4c shows the effect of applying this
procedure to activate 20 door units at two different locations in
two generated images. Although the same intervention is applied
in all four cases, the door obtained in each situation is different:
in cases 1-3, the newly synthesized door has a size, style, and
location that are appropriate to the scene context. In case 4, where
door units are activated on a tree, no new door is added to the
image.
Figure 4d quantifies the context-sensitivity of activating door
units in different locations. In 10,000 randomly generated images,
the same 20-door-unit activation is tested at every featuremap loca-
tion, and the number of newly synthesized door pixels is evaluated
using a segmentation algorithm. Doors can be easily added in
some locations, such as in buildings and especially on top of an
[Figure 5. Panels: (a) adversarial attack changes an image imperceptibly to fool the classifier (example image labeled ‘ski resort’); (b) activation change in the 4 units most important to ski resort and to bedroom (green = +10 activation, red = -10 activation): unit 261 snow, Δpeak -53.2; unit 242 house, Δpeak -19.8; unit 149 mountain, Δpeak -21.0; unit 105 tree, Δpeak -28.1; unit 317 bed, Δpeak +23.8; unit 157 head, Δpeak +43.5; unit 290 bed, Δpeak +24.5; unit 75 sofa, Δpeak -1.0; (c) mean peak unit activation change when attacked, for units in conv5_3.]
Fig. 5. Application: visualizing an adversarial attack. (a) The test image is correctly labeled as a ski resort, but when an adversarial perturbation is added, the visually
indistinguishable result is classified as a bedroom. (b) Visualization of the attack on the four most important units to the ski resort class and the four units most important to the
bedroom class. Areas of maximum increase and decrease are shown; Δpeak indicates the change in the peak activation level for the unit. (c) Over 1,000 images attacked to
misclassify images to various incorrect target classes, the units that are changed most are those that dissection has identified as most important to the source and target
classes. Mean absolute value change in peak unit activation is graphed, with 99% confidence intervals shown.
Fig. 6. Application: painting by manipulating GAN neurons. (a) An interactive interface
allows a user to choose several high-level semantic visual concepts and paint them
onto an image. Each concept corresponds to 20 units in the GAN. (b) After the user
adds a dome in the specified location, the result is a modified image in which a dome
has been added in place of the original steeple. Once the user’s high-level intent has
been expressed by changing 20 dome units, the generator automatically handles the
pixel-level details of how to fit together objects to keep the output scene realistic.
existing window, but it is nearly impossible to add a door into trees
or in the sky.
By learning to solve the unsupervised image generation prob-
lem, a GAN has learned units for emergent objects such as doors
and trees. It has also learned a computational structure over those
units that prevents it from rendering nonsensical output such as a
door in the sky, or a door in a tree.
We now turn to two applications enabled by our understanding
of the role of units: understanding attacks on a classifier, and
interactively editing a photo by activating units of a GAN.
Analyzing Adversarial Attack of a Classifier.
The sensitivity
of image classifiers to adversarial attacks is an active research
area (
). To visualize and understand how an attack works,
we can examine the effects on important object detector units. In
Figure 5a, a correctly classified ‘ski resort’ image is attacked to
the target ‘bedroom’ by the C & W optimization method (
).
The adversarial algorithm computes a small perturbation which,
when added to the original, results in a misclassified image that is
visually indistinguishable from the original image. To understand
how the attack works, we examine the four most important units
to the ski resort class and the four most important units to the
bedroom class. Figure 5b visualizes changes in the activations
for these units between the original image and the adversarial
image. This reveals that the attack has fooled the network by
reducing detection of snow, mountain, house, and tree objects,
and by increasing activations of detectors for beds, person heads,
and sofas in locations where those objects do not actually exist in
the image. Figure 5c shows that, across many images and classes,
the units that are most changed by an attack are the few units that
are important to a class.
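The per-unit comparison behind Figure 5b can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the activations of one layer are available as one 2-D feature map per unit, for both the original and the attacked image, and the helper names `peak` and `peak_activation_change` as well as the toy numbers are hypothetical.

```python
def peak(feature_map):
    """Largest activation in one unit's 2-D feature map (a list of rows)."""
    return max(max(row) for row in feature_map)


def peak_activation_change(acts_orig, acts_adv):
    """Change in peak activation per unit: attacked image minus original."""
    return [peak(a) - peak(o) for o, a in zip(acts_orig, acts_adv)]


# Toy data: 3 units, each with a 2x2 feature map (made-up values).
orig = [
    [[5.0, 1.0], [0.0, 2.0]],  # unit 0 responds strongly to the original
    [[0.0, 0.5], [0.2, 0.1]],  # unit 1 responds weakly to the original
    [[1.0, 1.0], [1.0, 1.0]],  # unit 2 is unaffected by the attack
]
adv = [
    [[1.0, 0.5], [0.0, 0.5]],  # unit 0 is suppressed by the attack
    [[0.0, 3.5], [0.2, 0.1]],  # unit 1 is boosted by the attack
    [[1.0, 1.0], [1.0, 1.0]],
]

delta_peak = peak_activation_change(orig, adv)
# Units ordered from most to least changed, by absolute change.
ranked = sorted(range(len(delta_peak)), key=lambda u: -abs(delta_peak[u]))
print(delta_peak, ranked)
```

Averaging the absolute values of such changes over many attacked images, separately for important and unimportant units, would give the kind of summary plotted in Figure 5c.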
Semantic Paint using a GAN.
Understanding the roles of units
within a network allows us to create a human interface for con-
trolling the network via direct manipulation of its units. We apply
this method to a GAN to create an interactive painting application.
Instead of painting with a palette of colors, the application allows
painting with a palette of high-level object concepts. Each concept
is associated with 20 units that maximize IoU for the concept.
Figure 6a shows our interactive interface. When a user adds brush
strokes with a concept, the units for the concept are activated (if
the user is drawing) or zeroed (if the user is erasing). Figure 6b
shows typical results after the user adds an object to the image.
The GAN deals with the pixel-level details of how to add objects
while keeping the scene reasonable and realistic. Multiple changes
in a scene can be composed for creative effects; movies of image
editing demos are included in SI; online demos are also available
at the website
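The brush interaction can be illustrated with a small sketch. It is not the authors' implementation: the function name `paint_units`, the fixed activation level written when drawing, and the data layout (a dictionary from unit index to a 2-D grid of activations) are all assumptions made for illustration. Units tied to the chosen concept are forced to a high value under the brush when drawing and set to zero when erasing; everything outside the brush is left unchanged.

```python
def paint_units(features, concept_units, brush_mask, level=10.0, erase=False):
    """Return an edited copy of `features`; the input is not modified.

    features      : dict {unit index: 2-D list of activations}
    concept_units : indices of the units associated with the concept
    brush_mask    : 2-D list of booleans, True where the user painted
    level         : activation written when drawing (hypothetical value)
    erase         : if True, write zeros instead of `level`
    """
    value = 0.0 if erase else level
    edited = {u: [row[:] for row in fmap] for u, fmap in features.items()}
    for u in concept_units:
        for y, mask_row in enumerate(brush_mask):
            for x, painted in enumerate(mask_row):
                if painted:
                    edited[u][y][x] = value
    return edited


# Toy example: two units on a 2x2 grid; unit 0 stands in for a "dome" unit.
features = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[0.5, 0.5], [0.5, 0.5]],
}
brush = [[True, False], [False, False]]

drawn = paint_units(features, [0], brush)
erased = paint_units(features, [0], brush, erase=True)
print(drawn[0], erased[0])
```

In an actual editor the edited feature maps would then be passed through the remaining generator layers to render the image, which is where the network fills in the pixel-level details.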
Discussion
Simple measures of performance, such as classification accuracy,
do not reveal how a network solves its task: good performance can
be achieved by networks that have differing sensitivities to shapes,
textures, or perturbations (34,48).
To develop an improved understanding of how a network works,
we have presented a way to analyze the roles of individual network
units. In a classifier, the units reveal how the network decomposes
the recognition of specific scene classes into particular visual
concepts that are important to each scene class. And within a
generator, the behavior of the units reveals contextual relationships
that the model enforces between classes of objects in a scene.
Network dissection relies on the emergence of disentangled,
human-interpretable units during training. We have seen that
many such interpretable units appear in state-of-the-art models,
both supervised and unsupervised. How to train better disentan-
gled models is an open problem that is the subject of ongoing
efforts (49–52).
We conclude that a systematic analysis of individual units can
yield insights about the black box internals of deep networks. By
observing and manipulating units of a deep network, it is possible
to understand the structure of the knowledge that the network has
learned, and to build systems that help humans interact with these
powerful models.
Materials and Methods
Data sets.
Places365 (
) consists of 1.80 million photographic
images, each labeled with one of 365 scene classes. The dataset
also includes 36,500 labeled validation images (100 per class) that
are not used for training. ImageNet (
) consists of 1.28 million
photographic images, each focused on a single main object and
labeled with one of 1,000 object classes. LSUN is a dataset with
a large number of 256 × 256 images in a few classes (
). LSUN kitchens consists of 2.21 million indoor kitchen photographs, and
LSUN outdoor churches consists of 1.26 million photographs of
church building exteriors. Recognizable people in dataset images
have been anonymized by pixelating faces in visualizations.
Tested Networks.
We analyze the VGG-16 classifier (
) trained
by the Places365 authors (
) to classify Places365 images. The
network achieves classification accuracy of 53.3% on the held-out
validation set (chance is 0.27%). The 13 convolutional layers of
VGG-16 are divided into 5 groups. The layers in the first group
contain 32 units that process image data at the full 224 × 224
resolution; at each successive group the feature depth is doubled and
the feature maps are pooled to halve the resolution, so that at the
final stage that includes conv5_3 the layers contain 512 units at
14 × 14 resolution. The GAN models we analyze are
trained by the Progressive GAN authors (
). The models are configured to generate 256 × 256 output images
using 15 convolutional layers divided into 8 groups, starting with
512 units in each layer at 4 × 4 resolution, and doubling resolution
at each successive group, so that the second group has 8 × 8
resolution and 512 units and the third group has 16 × 16 resolution
and 512 units. Unit depth is halved in each later group, so that
the 14th layer has 32 units and 256 × 256 resolution. The 15th layer
(not pictured in Figure 3a) produces a 3-channel RGB image.
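As a quick consistency check on the generator layout described above, the sketch below tabulates resolution and unit count for each group that doubles the resolution. It encodes only the figures stated in the text (512 units at 4 × 4, resolution doubling from group to group, 512 units retained up to 16 × 16 and halved afterwards, 256 × 256 output). The function and parameter names are hypothetical, and no actual model is loaded.

```python
def generator_schedule(start_res=4, out_res=256, units=512, keep_until=16):
    """List (resolution, units) for each resolution-doubling group.

    The unit count stays at `units` up to resolution `keep_until`
    and is halved in every group after that.
    """
    schedule = []
    res, depth = start_res, units
    while res <= out_res:
        schedule.append((res, depth))
        res *= 2
        if res > keep_until:
            depth //= 2
    return schedule


schedule = generator_schedule()
for res, depth in schedule:
    print(f"{res}x{res}: {depth} units")
```

The last entry, 32 units at 256 × 256, agrees with the 14th layer described in the text; the final RGB layer is not part of this table.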
Reference Segmentation.
To locate human-interpretable visual
concepts within large-scale data sets of images, we use the Unified
Perceptual Parsing image segmentation network (
) trained on
the ADE20K scene dataset (
) and an assignment of RGB values
to color names (
). The segmentation algorithm achieves mean
IoU of 23.4% on objects, 28.8% on parts, and 54.2% on materials.
To further identify units that specialize in object parts, we expand
each object class into four additional object part classes which
denote the top, bottom, left, or right half of the bounding box of a
connected component. Our reference segmentation algorithm can
detect 335 object classes, 1452 object parts, 25 materials, and 11 colors.
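The expansion of an object into half-box part classes can be sketched as follows. This is one illustrative reading of the description, not the authors' code: it assumes a connected component is given as a non-empty 2-D boolean mask and uses hypothetical names throughout. How the original handles boxes with an odd number of rows or columns is not stated, so the integer midpoint rule used here is an assumption.

```python
def half_box_parts(mask):
    """Split one connected component into top/bottom/left/right parts.

    mask : 2-D list of booleans marking the component's pixels
           (assumed to contain at least one True value).
    Returns a dict of four masks, each keeping only the component
    pixels that fall in that half of the component's bounding box.
    """
    ys = [y for y, row in enumerate(mask) for v in row if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
    y_mid = (y0 + y1 + 1) // 2  # first row of the bottom half
    x_mid = (x0 + x1 + 1) // 2  # first column of the right half

    def keep(test):
        return [[v and test(y, x) for x, v in enumerate(row)]
                for y, row in enumerate(mask)]

    return {
        "top": keep(lambda y, x: y < y_mid),
        "bottom": keep(lambda y, x: y >= y_mid),
        "left": keep(lambda y, x: x < x_mid),
        "right": keep(lambda y, x: x >= x_mid),
    }


# Toy component: a 2x2 square in the middle of a 4x4 image.
F, T = False, True
mask = [
    [F, F, F, F],
    [F, T, T, F],
    [F, T, T, F],
    [F, F, F, F],
]
parts = half_box_parts(mask)
print(parts["top"][1], parts["left"][1])
```

Each of the four returned masks can then be treated as its own concept label, so a unit that fires only on the upper half of an object can be matched to the "top" part class.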
Data Availability.
The code, trained model weights and data sets
needed to reproduce the results in this paper are public and avail-
able for download at GitHub at
and at the project website
Acknowledgments.
We are indebted to Aditya Khosla, Aude Oliva, William Peebles,
Jonas Wulff, Joshua B. Tenenbaum, and William T. Freeman for
their advice and collaboration. And we are grateful for the support
of the MIT-IBM Watson AI Lab, the DARPA XAI program FA8750-
18-C0004, NSF 1524817 on Advancing Visual Recognition with
Feature Visualizations, NSF BIGDATA 1447476, Grant RTI2018-
095232-B-C22 from the Spanish Ministry of Science, Innovation
and Universities to AL, Early Career Scheme (ECS) of Hong Kong
(No.24206219) to BZ, and a hardware donation from NVIDIA.
1 Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Object detectors emerge in deep
scene cnns in ICLR.
2 Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks in ECCV.
3 Mahendran A, Vedaldi A (2015) Understanding deep image representations by inverting them
in CVPR.
4 Olah C, et al. (2018) The building blocks of interpretability. Distill 3(3):e10.
5 Bau A, et al. (2018) Identifying and controlling important neurons in neural machine translation
in NeurIPS.
6 Karpathy A, Johnson J, Fei-Fei L (2016) Visualizing and understanding recurrent networks in
7 Radford A, Jozefowicz R, Sutskever I (2017) Learning to generate reviews and discovering
sentiment. arXiv preprint arXiv:1704.01444.
8 Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspec-
tives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828.
9 Zhou B, Bau D, Oliva A, Torralba A (2018) Interpreting deep visual representations via network
dissection. PAMI.
10 Bau D, et al. (2019) Gan dissection: Visualizing and understanding generative adversarial
networks in ICLR.
11 LeCun Y, Bengio Y, et al. (1995) Convolutional networks for images, speech, and time series.
The handbook of brain theory and neural networks 3361(10):1995.
12 Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks in NeurIPS. pp. 1097–1105.
13 Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image
recognition in ICLR.
14 Goodfellow I, et al. (2014) Generative adversarial nets in NeurIPS.
15 Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption gener-
ator in CVPR. pp. 3156–3164.
16 He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition in CVPR.
17 Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversar-
ial networks in CVPR.
18 Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-
consistent adversarial networks in ICCV.
19 Karras T, Aila T, Laine S, Lehtinen J (2018) Progressive growing of gans for improved quality,
stability, and variation in ICLR.
20 Bach S, et al. (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise
relevance propagation. PloS one 10(7).
21 Zhou T, Krahenbuhl P, Aubry M, Huang Q, Efros AA (2016) Learning dense correspondence
via 3d-guided cycle consistency in CVPR.
22 Fong RC, Vedaldi A (2017) Interpretable explanations of black boxes by meaningful perturba-
tion in ICCV. pp. 3429–3437.
23 Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions in NeurIPS.
24 Selvaraju RR, et al. (2017) Grad-cam: Visual explanations from deep networks via gradient-
based localization. in ICCV.
25 Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks in ICML.
26 Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M (2017) Smoothgrad: removing noise by
adding noise. arXiv preprint arXiv:1706.03825.
27 Petsiuk V, Das A, Saenko K (2018) Rise: Randomized input sampling for explanation of black-
box models in BMVC.
28 Ribeiro MT, Singh S, Guestrin C (2016) Why should i trust you?: Explaining the predictions of
any classifier in SIGKDD. (ACM), pp. 1135–1144.
29 Kim B, Gilmer J, Viegas F, Erlingsson U, Wattenberg M (2017) Tcav: Relative concept impor-
tance testing with linear concept activation vectors. arXiv preprint arXiv:1711.11279.
30 Koul A, Fern A, Greydanus S (2019) Learning finite state representations of recurrent policy
networks in ICLR.
31 Hendricks LA, et al. (2016) Generating visual explanations in ECCV. (Springer), pp. 3–19.
32 Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene
recognition using places database in NeurIPS.
33 Xiao T, Liu Y, Zhou B, Jiang Y, Sun J (2018) Unified perceptual parsing for scene understanding
in ECCV.
34 Geirhos R, et al. (2018) Imagenet-trained cnns are biased towards texture; increasing shape
bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.
35 Deng J, et al. (2009) Imagenet: A large-scale hierarchical image database in CVPR.
36 Wen W, Wu C, Wang Y, Chen Y, Li H (2016) Learning structured sparsity in deep neural
networks in NeurIPS. pp. 2074–2082.
37 Li H, Kadav A, Durdanovic I, Samet H, Graf HP (2017) Pruning filters for efficient convnets in
38 Morcos AS, Barrett DG, Rabinowitz NC, Botvinick M (2018) On the importance of single direc-
tions for generalization. arXiv preprint arXiv:1803.06959.
39 Zhou B, Sun Y, Bau D, Torralba A (2018) Revisiting the importance of individual units in cnns
via ablation. arXiv preprint arXiv:1806.02891.
40 Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convo-
lutional generative adversarial networks in ICLR.
41 Zhu JY, Krähenbühl P, Shechtman E, Efros AA (2016) Generative visual manipulation on the
natural image manifold in ECCV.
42 Yu F, et al. (2015) Lsun: Construction of a large-scale image dataset using deep learning with
humans in the loop. arXiv preprint arXiv:1506.03365.
43 Szegedy C, et al. (2014) Intriguing properties of neural networks in ICLR.
44 Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples
in ICLR.
45 Carlini N, Wagner D (2017) Towards evaluating the robustness of neural networks in 2017
IEEE Symposium on Security and Privacy (SP). (IEEE), pp. 39–57.
46 Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models
resistant to adversarial attacks in ICLR.
47 Rauber J, Brendel W, Bethge M (2017) Foolbox: A python toolbox to benchmark the robustness
of machine learning models. arXiv preprint arXiv:1707.04131.
48 Ilyas A, et al. (2019) Adversarial examples are not bugs, they are features. arXiv preprint
49 Chen X, et al. (2016) Infogan: Interpretable representation learning by information maximizing
generative adversarial nets in NeurIPS. pp. 2172–2180.
50 Higgins I, et al. (2017) beta-vae: Learning basic visual concepts with a constrained variational
framework. ICLR 2(5):6.
51 Zhang Q, Nian Wu Y, Zhu SC (2018) Interpretable convolutional neural networks in CVPR.
52 Achille A, Soatto S (2018) Emergence of invariance and disentanglement in deep representa-
tions. JMLR 19(1):1947–1980.
53 Zhou B, et al. (2017) Scene parsing through ade20k dataset in CVPR.
54 Van De Weijer J, Schmid C, Verbeek J, Larlus D (2009) Learning color names for real-world
applications. IEEE Transactions on Image Processing 18(7):1512–1523.
Bau et al. |September 15, 2020 |8–8
... Here, we assume that f provides a good estimation of P (z|X; θ); Tishby et al. [54] provides empirical evidence that a discriminative model may first learn how to extract proper attributes to model images X conditioned on c, then learn to discriminate their categories based on the attribute distribution. Some dimensions of z (usually 5% ∼ 10% of the total dimensions) capture concrete semantic attributes of visual concepts when the activation score f z (X) is relatively high [4]. Combining this concept-conditional measurement with attribute modeling, we rewrite the category prediction considering the attribute as a latent variable and marginalize the observable joint distribution (X, c) over z: (1) where Z is the space of all attributes. ...
... where is the error rate threshold, and φ the parameter of f 's suffix in the same discriminative model for visual categorization (e.g., the fully-connected layer). We calculate P (c|z 1 , · · · , z K ; φ) by removing the neurons' effects corresponding to z K+1 , · · · , z d [4]. Instead of maintaining all error rate thresholds, we leverage the accuracy gain between every two iterations to search for the minimum K. ...
... 2.1 that RoA has the potential to serve as a prior for Bayesian generalization. Further, we visualize the most representative attributes of each concept by upsampling the activated feature vector to the size of the original image [4]; the attributes are located around the peaks. Most attributes with high RoA are explainable, such as the shape attribute shared by blue cylinder and green cylinder, shape and color captured by two distinct attributes in banana and watermelon, and foreground object (plane, car) and background (road, field) attributes in airport and car on the road. ...
Full-text available
We consider concept generalization at a large scale in the diverse and natural visual spectrum. Established computational modes (i.e., rule-based or similarity-based) are primarily studied isolated and focus on confined and abstract problem spaces. In this work, we study these two modes when the problem space scales up, and the $complexity$ of concepts becomes diverse. Specifically, at the $representational \ level$, we seek to answer how the complexity varies when a visual concept is mapped to the representation space. Prior psychology literature has shown that two types of complexities (i.e., subjective complexity and visual complexity) (Griffiths and Tenenbaum, 2003) build an inverted-U relation (Donderi, 2006; Sun and Firestone, 2021). Leveraging Representativeness of Attribute (RoA), we computationally confirm the following observation: Models use attributes with high RoA to describe visual concepts, and the description length falls in an inverted-U relation with the increment in visual complexity. At the $computational \ level$, we aim to answer how the complexity of representation affects the shift between the rule- and similarity-based generalization. We hypothesize that category-conditioned visual modeling estimates the co-occurrence frequency between visual and categorical attributes, thus potentially serving as the prior for the natural visual world. Experimental results show that representations with relatively high subjective complexity outperform those with relatively low subjective complexity in the rule-based generalization, while the trend is the opposite in the similarity-based generalization.
... In order to investigate this question we employed two different approaches. In one we examined the network by dissecting it (Bau et al., 2020). In the other, we combine the prediction scores of the network with our computational factors to examine how much of human complexity ratings can be explained. ...
... This extracted semantics is what we hypothesise as being part of the reason for a specific region being labelled as ''complex'' or ''simple'' by a human, and which is naturally unextractable by low-level computer vision techniques. Research exploring the internal representations of neural networks (Bau et al., 2020;Zhou et al., 2014) have shown that it has learned to individually extract semantically relevant features from the scenes it is shown: from picking out wings and beaks to identify a bird, to detecting chairs and tables to identify a living room. Notably, the network learns that it is the combinations of these elements (i.e, the association between scene elements) that identifies the bird, or the living room. ...
... Hence, we trained and evaluated ComplexityNet, and combined its ''semantic features'' with our global image features, and found this significantly boosts the level of variance in human complexity ratings we can explain. Dissecting the network (Bau et al., 2020) revealed both low-level (checkerboards, lined surfaces) and semantic features (sky, architectural details) extractors arise in the neurons of such a model. ...
Full-text available
Humans can effortlessly assess the complexity of the visual stimuli they encounter. However, our understanding of how we do this, and the relevant factors that result in our perception of scene complexity remain unclear; especially for the natural scenes in which we are constantly immersed. We introduce several new datasets to further understanding of human perception of scene complexity. Our first dataset (VISC-C) contains 800 scenes and 800 corresponding two-dimensional complexity annotations gathered from human observers, allowing exploration for how complexity perception varies across a scene. Our second dataset, (VISC-CI) consists of inverted scenes (reflection on the horizontal axis) with corresponding complexity maps, collected from human observers. Inverting images in this fashion is associated with destruction of semantic scene characteristics when viewed by humans, and hence allows analysis of the impact of semantics on perceptual complexity. We analysed perceptual complexity from both a single-score and a two-dimensional perspective, by evaluating a set of calculable and observable perceptual features based upon grounded psychological research (clutter, symmetry, entropy and openness). We considered these factors’ relationship to complexity via hierarchical regressions analyses, tested the efficacy of various neural models against our datasets, and validated our perceptual features against a large and varied complexity dataset consisting of nearly 5000 images. Our results indicate that both global image properties and semantic features are important for complexity perception. We further verified this by combining identified perceptual features with the output of a neural network predictor capable of extracting semantics, and found that we could increase the amount of explained human variance in complexity beyond that of low-level measures alone. 
Finally, we dissect our best performing prediction network, determining that artificial neurons learn to extract both global image properties and semantic details from scenes for complexity prediction. Based on our experimental results, we propose the “dual information” framework of complexity perception, hypothesising that humans rely on both low-level image features and high-level semantic content to evaluate the complexity of images.
... However, in this paper, we find that conventional feature alignment strategies of contrastive learning would instead hinder the model generalization. To illustrate this point, we employ a neuron activation view inspired by the recent progress in neuron interpretability [2,3,25,41,42,45,75]. As shown in Figure 1, since all feature elements stem from the neuron output, we treat feature embeddings as neuron activation states. ...
... Given semantic meanings of neurons, however, we find that current alignment strategies perform via indiscriminately aligning all neuron activations, i.e., |z − z + | 2 . This instead ignores rich relations among neurons: many of them often play the same roles in network propagations (i.e., identifying the same visual concepts (See Figure 2)) though they emerge differently [3]. From this standpoint, we thus argue that such direct alignment methods -overly stressing the same route in propagationscould constrain the diversity of neuron activations. ...
... Neuron Interpretability. In recent years, there has been much research revealing that a single or combination of neurons detects meaningful information in network propagations [2,3,22,25,41,45]. Earlier works such as [72] and [75] seek to elucidate neuron behaviors by visualizing image patches that maximize neuron activations. ...
Full-text available
Learning invariant representations via contrastive learning has seen state-of-the-art performance in domain generalization (DG). Despite such success, in this paper, we find that its core learning strategy -- feature alignment -- could heavily hinder the model generalization. Inspired by the recent progress in neuron interpretability, we characterize this problem from a neuron activation view. Specifically, by treating feature elements as neuron activation states, we show that conventional alignment methods tend to deteriorate the diversity of learned invariant features, as they indiscriminately minimize all neuron activation differences. This instead ignores rich relations among neurons -- many of them often identify the same visual concepts though they emerge differently. With this finding, we present a simple yet effective approach, \textit{Concept Contrast} (CoCo), which relaxes element-wise feature alignments by contrasting high-level concepts encoded in neurons. This approach is highly flexible and can be integrated into any contrastive method in DG. Through extensive experiments, we further demonstrate that our CoCo promotes the diversity of feature representations, and consistently improves model generalization capability over the DomainBed benchmark.
... Instead of discovering attributes in the latent space, one can also define concepts by characterizing individual neurons' functionality. To this end, Bau et al. proposed a neuron interpretation method that utilizes an image dataset with rich segment-level labels [10,11]. In the proposed work, we also adopt a similar neuron-level semantic modeling approach. ...
... In the proposed work, we also adopt a similar neuron-level semantic modeling approach. However, instead of relying on the human annotation of objects in the image as in [10,12], we leverage the recently CLIP multimodal embeddings [13,14]. Discovering semantically aligned neurons from a StyleGAN will naturally enable us to produce an arbitrarily large "pseudo"-labeled dataset. ...
... One of the key ingredients that facilitates the proposed approach is the expressiveness of generative models (e.g., Style-GAN) and the semantic control they exhibit [10,16,17,18]. However, existing unsupervised methods for discovering task-relevant attributes from generative models often produce a large number of unlabeled suggestions, which require human supervision to be leveraged for any downstream task. ...
Full-text available
We present a fully automated framework for building object detectors on satellite imagery without requiring any human annotation or intervention. We achieve this by leveraging the combined power of modern generative models (e.g., StyleGAN) and recent advances in multi-modal learning (e.g., CLIP). While deep generative models effectively encode the key semantics pertinent to a data distribution, this information is not immediately accessible for downstream tasks, such as object detection. In this work, we exploit CLIP's ability to associate image features with text descriptions to identify neurons in the generator network, which are subsequently used to build detectors on-the-fly.
... Debugging and Explainability: Most of the existing works on explainability of deep networks focus on inspecting the decisions for a single image [2,8,9,15,33,37,40,41,53,56,62,63,65,69]. These include saliency maps [48,51,52,54], activation maps [4,24,25,47,70], removing image patches [26], interpreting local decision boundaries [45], finding influential [28] or counterfactual inputs [17,32,36]. However, examining several single-image explanations for debugging can be time-consuming. ...
Full-text available
Deep neural networks can be unreliable in the real world when the training set does not adequately cover all the settings where they are deployed. Focusing on image classification, we consider the setting where we have an error distribution $\mathcal{E}$ representing a deployment scenario where the model fails. We have access to a small set of samples $\mathcal{E}_{sample}$ from $\mathcal{E}$ and it can be expensive to obtain additional samples. In the traditional model development framework, mitigating failures of the model in $\mathcal{E}$ can be challenging and is often done in an ad hoc manner. In this paper, we propose a general methodology for model debugging that can systemically improve model performance on $\mathcal{E}$ while maintaining its performance on the original test set. Our key assumption is that we have access to a large pool of weakly (noisily) labeled data $\mathcal{F}$. However, naively adding $\mathcal{F}$ to the training would hurt model performance due to the large extent of label noise. Our Data-Centric Debugging (DCD) framework carefully creates a debug-train set by selecting images from $\mathcal{F}$ that are perceptually similar to the images in $\mathcal{E}_{sample}$. To do this, we use the $\ell_2$ distance in the feature space (penultimate layer activations) of various models including ResNet, Robust ResNet and DINO where we observe DINO ViTs are significantly better at discovering similar images compared to Resnets. Compared to LPIPS, we find that our method reduces compute and storage requirements by 99.58\%. Compared to the baselines that maintain model performance on the test set, we achieve significantly (+9.45\%) improved results on the debug-heldout sets.
... Network Dissection Network dissection by Bau et al. [3] proposes to identify concepts associated with individual neurons in CNNs. Specifically, the emergence of object detectors in scene classifiers is examined. ...
Full-text available
The present paper reviews and discusses work from computer science that proposes to identify concepts in internal representations (hidden layers) of DNNs. It is examined, first, how existing methods actually identify concepts that are supposedly represented in DNNs. Second, it is discussed how conceptual spaces -- sets of concepts in internal representations -- are shaped by a tradeoff between predictive accuracy and compression. These issues are critically examined by drawing on philosophy. While there is evidence that DNNs able to represent non-trivial inferential relations between concepts, our ability to identify concepts is severely limited.
Many studies have shown that generative adversarial networks (GANs) can discover semantics at various levels of abstraction, yet GANs do not provide an intuitive way to show how they understand and control semantics. In order to identify interpretable directions in GAN’s latent space, both supervised and unsupervised approaches have been proposed. But the supervised methods can only find the directions consistent with the supervised conditions. However, many current unsupervised methods are hampered by varying degrees of semantic property disentanglement. This paper proposes an unsupervised method with a layer-wise design. The model embeds subspace in each generator layer to capture the disentangled interpretable semantics in GAN. And the research also introduces a latent mapping network to map the inputs to an intermediate latent space with rich disentangled semantics. Additionally, the paper applies an Orthogonal Jacobian regularization to the model to impose constraints on the overall input, further enhancing disentanglement. Experiments demonstrate the method’s applicability in the human face, anime face, and scene datasets and its efficacy in finding interpretable directions. Compared with existing unsupervised methods in both qualitative and quantitative aspects, this study proposed method achieves excellent improvement in the disentanglement effect.
The human brain achieves visual object recognition through multiple stages of linear and nonlinear transformations operating at a millisecond scale. To predict and explain these rapid transformations, computational neuroscientists employ machine learning modeling techniques. However, state-of-the-art models require massive amounts of data to properly train, and to the present day there is a lack of vast brain datasets which extensively sample the temporal dynamics of visual object recognition. Here we collected a large and rich dataset of high temporal resolution EEG responses to images of objects on a natural background. This dataset includes 10 participants, each with 82,160 trials spanning 16,740 image conditions. Through computational modeling we established the quality of this dataset in five ways. First, we trained linearizing encoding models that successfully synthesized the EEG responses to arbitrary images. Second, we correctly identified the recorded EEG data image conditions in a zero-shot fashion, using EEG synthesized responses to hundreds of thousands of candidate image conditions. Third, we show that both the high number of conditions as well as the trial repetitions of the EEG dataset contribute to the trained models’ prediction accuracy. Fourth, we built encoding models whose predictions well generalize to novel participants. Fifth, we demonstrate full end-to-end training of randomly initialized DNNs that output EEG responses for arbitrary input images. We release this dataset as a tool to foster research in visual neuroscience and computer vision.
Deep Metric Learning (DML) learns an embedding function that projects semantically similar data to nearby points in the embedding space, and it plays a vital role in many applications, such as image retrieval and face recognition. However, the performance of DML methods often depends heavily on the sampling method used to choose effective data from the embedding space during training. In practice, the embeddings are produced by deep models, and the embedding space often contains barren areas due to the absence of training points, resulting in the so-called “missing embedding” issue. This issue may impair sample quality, which leads to degraded DML performance. In this work, we investigate how to alleviate the “missing embedding” issue to improve sampling quality and achieve effective DML. To this end, we propose a Densely-Anchored Sampling (DAS) scheme that treats an embedding with a corresponding data point as an “anchor” and exploits the anchor’s nearby embedding space to densely produce embeddings without data points. Specifically, we propose to exploit the embedding space around a single anchor with Discriminative Feature Scaling (DFS) and around multiple anchors with Memorized Transformation Shifting (MTS). In this way, by combining the embeddings with and without data points, we are able to provide more embeddings to facilitate the sampling process, thus boosting the performance of DML. Our method is effortlessly integrated into existing DML frameworks and improves them without bells and whistles. Extensive experiments on three benchmark datasets demonstrate the superiority of our method. Keywords: Deep metric learning; Missing embedding; Embedding space exploitation; Densely-Anchored Sampling
We consider the targeted image editing problem, namely blending a region in a source image with a driver image that specifies the desired change. Unlike prior work, we solve this problem by learning a conditional probability distribution of the edits, end-to-end in code space. Training such a model requires addressing the lack of example edits for training. To this end, we propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain. The benefits are remarkable: implemented as a state-of-the-art auto-regressive transformer, our approach is simple, sidesteps difficulties with previous methods based on GAN-like priors, obtains significantly better edits, and is efficient. Furthermore, we show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture. We demonstrate the superiority of this approach across several datasets in extensive quantitative and qualitative experiments, including human studies, significantly outperforming prior work.
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in a captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g., VGG), (2) CNNs used for structured outputs (e.g., captioning), (3) CNNs used in tasks with multi-modal inputs (e.g., visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models learn to localize discriminative regions of the input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al. in Computer vision and pattern recognition, 2017) to provide textual explanations for model decisions.
Finally, we design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at, along with a demo on CloudCV (Agrawal et al., in: Mobile cloud visual media computing, pp 265–290. Springer, 2015) ( and a video at
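To make the Grad-CAM description above concrete, here is a minimal sketch of how a coarse localization map could be computed once the final convolutional layer's activations and the gradients of the target score with respect to them are available. It assumes the commonly described formulation (per-channel weights obtained by spatially averaging the gradients, followed by a ReLU), which the abstract does not spell out. The function name `grad_cam` and the nested-list representation are illustrative choices, not the authors' code.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Combine per-channel activation maps into one coarse localization map.

    activations[k][i][j]: activation of channel k at spatial position (i, j).
    gradients[k][i][j]:   gradient of the target score w.r.t. that activation.
    Both inputs are assumed to come from the final convolutional layer.
    """
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        # Channel weight (assumption): gradient averaged over all positions.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    # ReLU keeps only locations with a positive influence on the target.
    return [[max(0.0, value) for value in row] for row in cam]


if __name__ == "__main__":
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
    print(grad_cam(acts, grads))
```

In practice the resulting map would be upsampled to the input resolution before being overlaid on the image; that step is omitted here.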
The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. However, CNNs are often criticized as being black boxes that lack interpretability, since they have millions of unexplained model parameters. In this work, we describe Network Dissection, a method that interprets networks by providing labels for the units of their deep visual representations. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human-interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The method reveals that deep representations are more transparent and interpretable than expected: we find that representations are significantly more interpretable than they would be under a random equivalently powerful basis. We apply the method to interpret and compare the latent representations of various network architectures trained to solve different supervised and self-supervised training tasks. We then examine factors affecting network interpretability, such as the number of training iterations, regularizations, different initializations, and the network depth and width. Finally, we show that the interpreted units can be used to provide explicit explanations of a prediction given by a CNN for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure.
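The "alignment between individual hidden units and a set of visual semantic concepts" can be illustrated with a small sketch. It assumes alignment is scored as intersection-over-union (IoU) between a unit's thresholded activation map and a binary concept segmentation mask. The function name, the single-image scope, and the caller-supplied threshold are assumptions made for illustration; the abstract does not specify how the threshold is chosen or how scores are aggregated over a dataset.

```python
from typing import List


def unit_concept_iou(
    activation: List[List[float]],
    concept_mask: List[List[bool]],
    threshold: float,
) -> float:
    """IoU between the region where a unit fires and a concept's mask.

    activation:   the unit's activation map, already resized to the mask's
                  resolution (assumed to be done by the caller).
    concept_mask: True where the concept is present in the image.
    threshold:    activation level above which the unit counts as "on"
                  (hypothetical parameter, chosen by the caller).
    """
    intersection = 0
    union = 0
    for act_row, mask_row in zip(activation, concept_mask):
        for value, in_concept in zip(act_row, mask_row):
            unit_on = value > threshold
            if unit_on and in_concept:
                intersection += 1
            if unit_on or in_concept:
                union += 1
    # An empty union means neither the unit nor the concept appears.
    return intersection / union if union else 0.0


if __name__ == "__main__":
    act = [[0.9, 0.1], [0.8, 0.2]]
    mask = [[True, False], [False, True]]
    print(unit_concept_iou(act, mask, threshold=0.5))
```

A unit would then be labeled with whichever concept yields the highest score, in keeping with the abstract's "identifying the best alignments".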
Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.
Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: one with a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights. Finally, we show that invariance and independence of the components of the representation learned by the network are bounded above and below by the information in the weights, and therefore are implicitly optimized during training. The theory enables us to quantify and predict sharp phase transitions between underfitting and overfitting of random labels when using our regularized loss, which we verify in experiments, and sheds light on the relation between the geometry of the loss function, invariance properties of the learned representation, and generalization error.