International Journal of Computer Vision (2021) 129:736–760
Compositional Convolutional Neural Networks: A Robust and
Interpretable Model for Object Recognition Under Occlusion
Adam Kortylewski¹ · Qing Liu¹ · Angtian Wang¹ · Yihong Sun¹ · Alan Yuille¹
Received: 20 January 2020 / Accepted: 4 November 2020 / Published online: 24 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this
work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion.
We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural
Networks (CompositionalNets), an interpretable deep architecture with innate robustness to partial occlusion. Specifically,
we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that
can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into
objects and context, as well as to further decompose object representations in terms of individual parts and the objects’ pose.
The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their
non-occluded parts. We conduct extensive experiments on image classification and object detection, using images of artificially
occluded objects from the PASCAL3D+ and ImageNet datasets and real images of partially occluded vehicles from the
MS-COCO dataset. Our experiments show that CompositionalNets made from several popular DCNN backbones (VGG-16,
ResNet50, ResNeXt) improve by a large margin over their non-compositional counterparts at classifying and detecting partially
occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only.
Finally, we demonstrate that CompositionalNets provide human-interpretable predictions, as their individual components can
be understood as detecting parts and estimating an object's viewpoint.
Keywords Compositional models · Robustness to partial occlusion · Image classification · Object detection · Out-of-
Communicated by Mei Chen.
¹ Johns Hopkins University, Baltimore, MD, USA
Advances in the architecture design of deep convolutional
neural networks (DCNNs) (Krizhevsky et al. 2012;
Simonyan and Zisserman 2014; He et al. 2016) increased
the performance of computer vision systems at object recognition
enormously. This led to the deployment of computer
vision models in safety-critical real-world applications, such
as self-driving cars and security systems. In these application
areas, we expect models to reliably generalize to previously
unseen visual stimuli. However, in practice we observe that
deep models do not generalize as well as humans in scenarios
that are different from what has been observed during training,
e.g., unseen partial occlusion, rare object poses, or changes
in the environment. This lack of generalization may
lead to fatal consequences in real-world applications, e.g.,
when driver-assistance systems fail to detect partially occluded
pedestrians (Economist 2017).
In particular, a key problem for computer vision systems is
how to deal with partial occlusion. In natural environments,
objects are often surrounded and partially occluded by each
other. The large variability of occluders in terms of their