Generating Interpretable Class Model Visualizations
for CNNs with Varying Dilation Factors
Shengyi Huang
David Grethlein
November 20, 2019
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Using datasets highly optimized for image classification (MNIST and Fashion-MNIST), we explored the role of the input layer's dilation factor within a convolutional neural network (CNN) as it pertains to classification accuracy. We trained CNNs using a range of dilation factors, then numerically generated class model visualizations: input images for each trained network that maximize the likelihood of a given class being predicted. Our contribution is twofold. We found that, for the given datasets, a dilation factor greater than 1 (i.e. a sparse kernel) may detract from classification accuracy while generating more human-interpretable class model visualizations. Additionally, our experiments suggest that class model visualizations may be starkly different from a typical representation (e.g. sample mean) of the class whose prediction they maximize.¹
Keywords: CNN, Dilation Factor, Classification, Semantic Segmentation, Parameter Tuning, Class Model Visualizations, MNIST, Fashion-MNIST
1 Introduction
Convolutional neural networks (CNNs) have overwhelmingly become the preferred tool for researchers studying image classification tasks, in some cases vastly outperforming traditional machine learning techniques and fully connected multi-layer perceptrons (MLPs). Network learning often requires very large labelled datasets to achieve high performance, and the recent increase in availability of distributed computing resources has allowed researchers to reduce their network training times and explore a wider variety of architectures. CNNs are more than capable of reliably predicting
classes given enough training time and samples; however, what exactly is
learned as the anticipated representation of each class has largely been a
Simonyan et al. presented class model visualizations (CMVs), which are numerically generated images that have been optimized to maximize the likelihood of a desired class prediction [1]. CMVs help human researchers intuit what a trained model understands to be the relevant input features for confidently classifying a sample as a particular class. CNNs were developed to replicate the combined functions of the human eye and brain in processing visual input and identifying scene objects. Therefore, we hypothesized that the learned representations that maximize each class's likelihood of prediction would bear resemblance to their corresponding aggregate representations (i.e. sample means) within the training set.
¹ Our project is hosted at
Figure 1: First samples (left) and image averages (right) for each class value
for both MNIST and Fashion-MNIST datasets.
For all experiments reported herein, we trained CNNs on datasets (MNIST and Fashion-MNIST) that have been highly optimized for classification, where the class averages bear strong visual resemblance to individual samples [2, 3]. Most noticeably, the sample means appear to preserve the overall shapes of each class's activated pixels, but with blurred edges that obscure the unique characteristics exhibited by individual samples (loops versus sharp angles, skewing, brand logos, patterns, localized textures).
It is worth noting that both datasets have practically uniform class distributions and are composed entirely of gray-scale images, so our findings may not extend to multi-channel image sets or to those with a skewed class distribution.
2 Related Work
There has been extensive research into fine-tuning network parameters and constructing CNN architectures to adapt to a given image dataset [4, 5, 6]. There has been substantially less investigation into how much a kernel's dilation factor in a CNN's input layer (the gap size between input features in a neuron's receptive field) influences either classification accuracy or CMV generation.
Figure 2: 3x3 Kernels with varying dilation factors superimposed over their
receptive fields to illustrate the relative sparseness of computed features.
A dilation factor greater than 1 is used to enlarge a neuron's receptive field without increasing the kernel size or incurring additional computational cost. Dilated kernels aren't specific to image processing (they have also been used for processing and segmenting audio) [7], and they intuitively allow for the detection of sparser features within an image by computing neuron activations from non-contiguous data points (pixels). Recent studies have also suggested that using a dilation factor greater than 1 may produce more accurate results in semantic segmentation tasks, which identify specific objects within regions of an input image [8].
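As a minimal illustration of how dilation enlarges the receptive field without adding weights, the sketch below computes the span covered by a dilated kernel along one axis using the standard relation k + (k - 1)(d - 1). The function names are hypothetical and are not taken from the authors' code.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of pixels spanned along one axis by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def tap_offsets(kernel_size, dilation):
    """Pixel offsets sampled along one axis, relative to the first tap."""
    return [i * dilation for i in range(kernel_size)]


# A 3x3 kernel always has 9 weights; only the spacing of its taps changes.
for d in (1, 2, 3):
    span = effective_kernel_size(3, d)
    print(f"d={d}: spans {span}x{span} pixels, taps at {tap_offsets(3, d)}")
```

With d = 3, for instance, the 3x3 kernel samples every third pixel and spans a 7x7 window.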
3 Contribution
Our experimentation can be broken into two steps: first we trained CNNs with varying dilation factors, then we used the trained models to produce CMVs.
Input image (shape=(N, 1, 28, 28))
Conv2D(channel_in=1, channel_out=32, kernel=3x3, dilation=d)
Conv2D(channel_in=32, channel_out=32, kernel=5x5)
Conv2D(channel_in=32, channel_out=64, kernel=5x5)
Figure 3: The neural network architecture used for classification, where N is the batch size and d is the dilation factor.
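To see how the input-layer dilation affects the feature-map sizes in this architecture, the following sketch traces the spatial size through the three convolutional layers listed in Figure 3. It assumes a 28x28 input (the MNIST and Fashion-MNIST image size), stride 1, and no padding; the paper does not state the stride or padding, so these are assumptions, and any pooling layers are ignored.

```python
def conv_output_size(size, kernel, dilation=1, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    span = dilation * (kernel - 1) + 1  # pixels covered by the dilated kernel
    return (size + 2 * padding - span) // stride + 1


def feature_map_sizes(input_size, first_layer_dilation):
    """Spatial size after each of the three Conv2D layers in Figure 3."""
    # (kernel size, dilation) per layer; only the first layer is dilated.
    layers = [(3, first_layer_dilation), (5, 1), (5, 1)]
    sizes = []
    size = input_size
    for kernel, dilation in layers:
        size = conv_output_size(size, kernel, dilation)
        sizes.append(size)
    return sizes


for d in range(1, 8):
    print(f"d={d}: feature-map sizes {feature_map_sizes(28, d)}")
```

Under these assumptions the final feature map shrinks from 18x18 at d = 1 to 6x6 at d = 7, so a larger dilation also leaves fewer spatial positions for the later layers to work with.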
3.1 Training the Neural Networks
We constructed the neural networks according to the architecture defined in Figure 3. A combination of 2-D convolutional filters, ReLU activation functions, and dropout layers was used to calculate the logits for each class, and a LogSoftmax layer was used at the end to extract the classification probability for each class. Our neural networks then selected the class with the highest classification probability as the prediction, and were trained by back-propagation using the cross-entropy loss function and the Adam optimizer with a learning rate of 0.001. Based on this architecture, we trained one neural network for each dilation factor d = 1, 2, ..., 7 for 5 epochs.
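The relationship between logits, the LogSoftmax output, the predicted class, and the cross-entropy loss can be sketched in plain Python as follows. The logits are made-up values for a hypothetical three-class problem, and this is only an illustration of the arithmetic, not the authors' training code. Note that a log-softmax layer outputs log-probabilities, which are exponentiated here to recover probabilities.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Cross-entropy for one sample: negative log-probability of the target."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]                   # hypothetical logits for 3 classes
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]  # probabilities sum to 1
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = nll_loss(log_probs, target=0)

print("probabilities:", [round(p, 3) for p in probs])
print("predicted class:", predicted)
print("loss if the true class is 0:", round(loss, 3))
```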
3.2 CMVs Generation
CMVs visualize the most representative image I for each class j such that the probability of I being classified as j is maximized. More formally, let S_j be the class score/probability function of the neural network predicting the image as class j. For each class j, a CMV is the L2-regularised image I that solves

\arg\max_{I} \; S_j(I) - \lambda \lVert I \rVert_2^2,

where λ is the regularisation parameter. Implementation-wise, we started by creating an image I filled with zeroes, passing it through our neural networks, using Adam optimization to find the gradient with respect to the image I, and then applying stochastic gradient ascent to update the image I [9]. In our experiments, we generated the CMVs using the trained networks with dilation factors d = 1, 2, ..., 7 for analysis.
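The optimization loop can be illustrated with a toy stand-in for the network. In the sketch below the class score is assumed to be a simple linear function S(I) = w · I (a hypothetical substitute for a trained CNN), and plain gradient ascent is used instead of Adam, so it shows only the structure of the procedure: start from a zero image and repeatedly step along the gradient of the L2-regularised score.

```python
def generate_cmv(weights, lam=0.5, lr=0.1, steps=200):
    """Gradient ascent on S(I) - lam * ||I||^2 for a toy linear score S(I) = w . I."""
    image = [0.0] * len(weights)  # start from an all-zero image
    for _ in range(steps):
        # d/dI [w . I - lam * ||I||^2] = w - 2 * lam * I
        grad = [w - 2.0 * lam * x for w, x in zip(weights, image)]
        image = [x + lr * g for x, g in zip(image, grad)]  # ascent step
    return image


def objective(image, weights, lam=0.5):
    """Regularised class score being maximized."""
    score = sum(w * x for w, x in zip(weights, image))
    penalty = lam * sum(x * x for x in image)
    return score - penalty


toy_weights = [1.0, -2.0, 0.5, 0.0]  # hypothetical "pixel" weights for one class
cmv = generate_cmv(toy_weights)
print("generated image:", [round(x, 4) for x in cmv])
print("objective:", round(objective(cmv, toy_weights), 4))
```

For this toy score the maximizer is known in closed form, I = w / (2λ), which makes it easy to check that the loop converges. For a real CNN the gradient would instead come from back-propagation through the trained network.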
4 Results and Analysis
Empirically, we found that with higher dilation factors, classification accu-
racy generally decreases while the CMVs generated are usually more inter-
pretable with the MNIST dataset.
4.1 Training Accuracy of Classifiers
The classification accuracies for the MNIST and Fashion-MNIST datasets are presented in Table 1 for each trained CNN. Our experiments show that classification accuracy generally decreases with higher dilation factors for both datasets. Notably, the classification accuracy for the MNIST dataset is substantially higher than that of the Fashion-MNIST dataset. This is likely because the features in the MNIST dataset (i.e. written digits) are simpler and more readily detectable than the features in the Fashion-MNIST dataset (i.e. articles of clothing).
Dilation Factor    MNIST    Fashion-MNIST
1                  98.2%    82.4%
2                  97.9%    82.5%
3                  97.7%    74.0%
4                  96.9%    81.6%
5                  94.9%    74.3%
6                  90.7%    76.1%
7                  96.1%    74.3%
Table 1: Classification accuracies for CNNs trained with varying dilation
factors for both our experimental datasets.
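The trend in Table 1 is general rather than strictly monotone, which a short script over the reported values makes explicit. The numbers below are copied from Table 1; the helper name is hypothetical.

```python
# Accuracies (%) from Table 1, indexed by dilation factor d = 1..7.
accuracy = {
    "MNIST": [98.2, 97.9, 97.7, 96.9, 94.9, 90.7, 96.1],
    "Fashion-MNIST": [82.4, 82.5, 74.0, 81.6, 74.3, 76.1, 74.3],
}
dilations = list(range(1, 8))


def summarize(values):
    """Return the best dilation, step-to-step changes, and dilations where accuracy rose."""
    best = max(range(len(values)), key=values.__getitem__)
    changes = [round(b - a, 1) for a, b in zip(values, values[1:])]
    increases = [dilations[i + 1] for i, c in enumerate(changes) if c > 0]
    return dilations[best], changes, increases


for name, values in accuracy.items():
    best_d, changes, increases = summarize(values)
    print(f"{name}: best d={best_d}, changes={changes}, accuracy rose at d={increases}")
```

On these values, MNIST accuracy peaks at d = 1 and rises only at d = 7, while Fashion-MNIST accuracy peaks at d = 2 and fluctuates more from one dilation factor to the next.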
4.2 More Interpretable CMVs
Some examples of CMVs generated with varying dilation factors for the MNIST dataset are shown in Figures 4, 5, and 6, where the heading c = j, 0.99 above each image I indicates that the probability of image I being classified as j is 99%. The comprehensive list of generated CMVs can be found in the Appendix. Unfortunately, we could not give much insight into the CMVs generated for the Fashion-MNIST dataset because they aren't as clearly interpretable. We suspect that the lower classification accuracy for the Fashion-MNIST dataset causes this ambiguity.
Through observation, we found that the CMVs in Figure 4 don't match the aggregated class representations shown in Figure 1, which contradicts our hypothesis. However, there are interesting shapes and edges that bear resemblance to both the samples themselves and the aggregate class representations (sample means).
Upon further analysis, we found that CNNs with a dilation factor greater than 1 produce more interpretable CMVs, up to a point. Most digits in the CMVs with d = 3 are noticeably more recognizable than those in the CMVs with d = 1. For example, the CMV for class 0 with d = 3 has a distinct hole in the middle and a ring around that hole, which is very indicative of the digit 0. Also, the CMV generated for class 7 with d = 3 almost forms a triangle whose bottom-left side is somewhat faint, which suggests the digit 7.
Figure 4: CMVs for the MNIST dataset with dilation d= 1.
Figure 5: CMVs for the MNIST dataset with dilation d= 3
Comparatively, the digits in the CMVs with d = 1 are more random, blended, and harder to interpret. That being said, higher dilation only helps up to a certain point. CMVs with d = 5 seem to experience gradient explosion, causing most of the pixels to be yellow.
Figure 6: CMVs for the MNIST dataset with dilation d= 5
We hypothesize that the CMVs with d = 3 are recognizable because kernels using a higher dilation factor in the input layer allow a CNN to pick up more spatial features, characterizing the overall shape of important features within the input images. This is evidenced by the sharper edges and distinguishable strokes comprising the digits in each class representation. These findings support those demonstrated by Simonyan et al. in semantic segmentation tasks [1].
5 Conclusions and Future Work
In conclusion, leveraging higher dilation factors in CNN input layers during training resulted in a general decrease in classification accuracy while producing more clearly interpretable CMVs with sharper edges and distinct features of the class they represent. That being said, it is not clear how this technique will scale to larger and more complex images. Specifically, all the images in the MNIST dataset used in our paper are purely 2-dimensional and generally centered, while other image datasets such as ImageNet contain images of 3-dimensional objects (e.g. dogs) viewed from different angles and posed in different positions [10]. Our findings suggest that an increase in the complexity of the image classes may introduce more noise into the underlying CMVs. Further experimentation is needed to determine whether a network trained with a dilation factor greater than 1 will generally produce more interpretable CMVs.
For future work, we would like to apply our current methodology to more complicated datasets such as ImageNet. In addition, we are interested in studying the effect of other parameters, such as kernel size and padding, on CMVs. All images used in our experiments were fairly small; therefore, reproducing these results using larger images would bolster our claims.
6 References
[1] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep in-
side convolutional networks: Visualising image classification models and
saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[2] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel
image dataset for benchmarking machine learning algorithms. arXiv
preprint arXiv:1708.07747, 2017.
[3] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1999.
[4] Filip Radenović, Giorgos Tolias, and Ondřej Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In European Conference on Computer Vision, pages 3–20. Springer, 2016.
[5] Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst,
Christopher B Kendall, Michael B Gotway, and Jianming Liang. Con-
volutional neural networks for medical image analysis: Full training or
fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312,
[6] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan
Carlsson. Cnn features off-the-shelf: an astounding baseline for recog-
nition. In Proceedings of the IEEE conference on computer vision and
pattern recognition workshops, pages 806–813, 2014.
[7] Xiaohu Zhang, Yuexian Zou, and Wei Shi. Dilated convolution neu-
ral network with leakyrelu for environmental sound classification. In
2017 22nd International Conference on Digital Signal Processing (DSP),
pages 1–5. IEEE, 2017.
[8] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by di-
lated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[9] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic opti-
mization (2014). arXiv preprint arXiv:1412.6980, 15, 2015.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet:
A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
7 Appendix
The comprehensive list of generated CMVs for both the MNIST and Fashion-MNIST datasets is presented below.
Figure 7: CMVs for the MNIST dataset with dilation d= 1
Figure 8: CMVs for the MNIST dataset with dilation d= 2
Figure 9: CMVs for the MNIST dataset with dilation d= 3
Figure 10: CMVs for the MNIST dataset with dilation d= 4
Figure 11: CMVs for the MNIST dataset with dilation d= 5
Figure 12: CMVs for the MNIST dataset with dilation d= 6
Figure 13: CMVs for the MNIST dataset with dilation d= 7
Figure 14: CMVs for the Fashion-MNIST dataset with dilation d= 1
Figure 15: CMVs for the Fashion-MNIST dataset with dilation d= 2
Figure 16: CMVs for the Fashion-MNIST dataset with dilation d= 3
Figure 17: CMVs for the Fashion-MNIST dataset with dilation d= 4
Figure 18: CMVs for the Fashion-MNIST dataset with dilation d= 5
Figure 19: CMVs for the Fashion-MNIST dataset with dilation d= 6
Figure 20: CMVs for the Fashion-MNIST dataset with dilation d= 7