Article · Publisher preview available

Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion


Abstract

Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets)—an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, as well as to further decompose object representations in terms of individual parts and the objects’ pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments in terms of image classification and object detection on images of artificially occluded objects from the PASCAL3D+ and ImageNet datasets, and real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets made from several popular DCNN backbones (VGG-16, ResNet50, ResNext) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human interpretable predictions as their individual components can be understood as detecting parts and estimating an object’s viewpoint.
International Journal of Computer Vision (2021) 129:736–760
https://doi.org/10.1007/s11263-020-01401-3
Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion
Adam Kortylewski1 · Qing Liu1 · Angtian Wang1 · Yihong Sun1 · Alan Yuille1
Received: 20 January 2020 / Accepted: 4 November 2020 / Published online: 24 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Keywords Compositional models · Robustness to partial occlusion · Image classification · Object detection · Out-of-distribution generalization
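The core idea described in the abstract, replacing the fully connected classification head with a part-based, occlusion-aware model, can be illustrated with a minimal sketch. The code below is not the authors' implementation (the paper's model is generative, built from von Mises-Fisher part kernels with viewpoint-specific mixtures and an occluder model); it is a simplified, discriminative stand-in in PyTorch whose class and parameter names are invented, intended only to show how per-position part evidence and an occluder alternative can be combined into class scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalHead(nn.Module):
    """Illustrative occlusion-aware classification head (not the authors' code).

    Each class owns a bank of unit-norm part templates. The feature vector at
    every spatial position is explained either by one of the class's parts or
    by a shared occluder score; taking the better explanation per position
    makes the class score depend mainly on non-occluded parts.
    """

    def __init__(self, feat_dim, num_classes, parts_per_class=64):
        super().__init__()
        # one bank of part templates per class (vMF-like mean directions)
        self.templates = nn.Parameter(
            torch.randn(num_classes, parts_per_class, feat_dim))
        # learned scalar score for "this position belongs to an occluder"
        self.occluder_score = nn.Parameter(torch.tensor(0.0))

    def forward(self, feats):
        # feats: (B, C, H, W) feature map from the DCNN backbone
        B, C, H, W = feats.shape
        f = F.normalize(feats, dim=1).permute(0, 2, 3, 1).reshape(B, H * W, C)
        t = F.normalize(self.templates, dim=-1)            # (K, P, C)
        # cosine similarity of every position to every part of every class
        sim = torch.einsum('bnc,kpc->bknp', f, t)          # (B, K, HW, P)
        part_score = sim.max(dim=-1).values                # best-matching part per position
        # keep whichever explanation (object part vs. occluder) scores higher
        per_pos = torch.maximum(part_score, self.occluder_score)
        return per_pos.mean(dim=-1)                        # class scores, shape (B, K)
```

Such a head would sit on top of a VGG-16 or ResNet50 feature map in place of global pooling plus a linear classifier and be trained end-to-end with a standard classification loss; positions better explained by the occluder score simply stop pulling the class score down, which is the intuition behind the occlusion robustness reported in the paper.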
Communicated by Mei Chen.

Adam Kortylewski (corresponding author): akortyl1@jhu.edu
Qing Liu: qingliu@jhu.edu
Angtian Wang: angtianwang@jhu.edu
Yihong Sun: ysun86@jhu.edu
Alan Yuille: ayuille1@jhu.edu
1 Johns Hopkins University, Baltimore, MD, USA

1 Introduction

Advances in the architecture design of deep convolutional neural networks (DCNNs) (Krizhevsky et al. 2012; Simonyan and Zisserman 2014; He et al. 2016) have enormously increased the performance of computer vision systems at object recognition. This has led to the deployment of computer vision models in safety-critical real-world applications, such as self-driving cars and security systems. In these application areas, we expect models to reliably generalize to previously unseen visual stimuli. However, in practice we observe that deep models do not generalize as well as humans in scenarios that differ from what has been observed during training, e.g., unseen partial occlusion, rare object poses, or changes in the environment. This lack of generalization may lead to fatal consequences in real-world applications, e.g., when driver-assistance systems fail to detect partially occluded pedestrians (Economist 2017).

In particular, a key problem for computer vision systems is how to deal with partial occlusion. In natural environments, objects are often surrounded and partially occluded by each other. The large variability of occluders in terms of their ...
... Figure 1 illustrates the occlusion problem for the panoptic segmentation task, where instances suffering from occlusion are wrongly predicted. To address this challenge, some studies have explored model architectures specially designed for handling occlusion in object detection [33] and instance segmentation [34]-[37]. For example, [33] introduces Compositional Convolutional Neural Networks (CompositionalNets), which enhance the robustness of deep convolutional neural networks to partial occlusion by integrating a differentiable compositional model, and [34] generalizes CompositionalNets to ... The COCO-Occ database annotations and method code will be made available on acceptance of the paper. ...
Preprint
To help address the occlusion problem in panoptic segmentation and image understanding, this paper proposes a new large-scale dataset, COCO-Occ, which is derived from the COCO dataset by manually labelling the COCO images into three perceived occlusion levels. Using COCO-Occ, we systematically assess and quantify the impact of occlusion on panoptic segmentation across samples with different occlusion levels. Comparative experiments with SOTA panoptic models demonstrate that the presence of occlusion significantly affects performance, with higher occlusion levels resulting in notably poorer performance. Additionally, we propose a straightforward yet effective method as an initial attempt to leverage the occlusion annotation using contrastive learning, yielding a model that learns a more robust representation capturing different severities of occlusion. Experimental results demonstrate that the proposed approach boosts the performance of the baseline model and achieves SOTA performance on the proposed COCO-Occ dataset.
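The abstract does not detail how the occlusion annotation drives contrastive learning; one plausible reading is a supervised contrastive objective in which the three perceived occlusion levels serve as labels. The sketch below (PyTorch; the function name and the assumption that embeddings come from the segmentation backbone are ours) shows that variant.

```python
import torch
import torch.nn.functional as F

def occlusion_supcon_loss(embeddings, occ_levels, temperature=0.1):
    """Supervised contrastive loss keyed on occlusion level (illustrative sketch).

    embeddings: (N, D) image embeddings from the segmentation backbone
    occ_levels: (N,) integer perceived-occlusion level per image (e.g. 0, 1, 2)
    Samples with the same occlusion level act as positives for each other,
    pushing the representation to reflect occlusion severity.
    """
    z = F.normalize(embeddings, dim=1)
    sim = (z @ z.t()) / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))        # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (occ_levels.unsqueeze(0) == occ_levels.unsqueeze(1)) & ~self_mask
    pos_counts = positives.sum(dim=1).clamp(min=1)
    per_anchor = -torch.where(positives, log_prob,
                              torch.zeros_like(log_prob)).sum(dim=1) / pos_counts
    return per_anchor[positives.any(dim=1)].mean()         # skip anchors with no positives
```

In practice such a term would be added to the panoptic segmentation loss with a weighting hyperparameter, rather than trained on its own.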
... As a basic computer vision task, human pose estimation has been applied to many crucial high-semantic vision tasks, such as action recognition [1], human body generation [2,3], person re-identification [4], pedestrian tracking [5], human-computer interaction [6], object detection [7], etc. Many existing researchers who propose advanced human pose models mainly study high-resolution input images from existing datasets, that is, input images of size 256×256 on the MPII dataset and 256×192 or 384×288 on the MSCOCO dataset. ...
Preprint
Full-text available
Human pose estimation is a basic task in computer vision, so improving its accuracy matters for many higher-level vision tasks. Many existing methods solve human pose estimation for high-resolution images and achieve excellent results. However, current high-resolution methods still face two problems on low-resolution inputs. (1) Low-resolution images lose much of the key information, such as texture, and joint positions become blurred, which reduces recognition accuracy. (2) Existing super-resolution networks generally have a fixed receptive field and cannot extract features and fuse multi-scale information well for images at different magnifications. To address these problems, we propose the Super-Resolution Human Pose Estimation Network (SRHPENet), a method suited to low-resolution inputs. The network consists of two parts. First, we design the SReNe super-resolution network, built from our MSE module, which effectively extracts multi-scale information and alleviates the fixed-receptive-field problem. Second, we feed the high-resolution images produced by the super-resolution network into the human pose estimation network. Additionally, during the training phase we also introduce real high-resolution images into the pose estimation network to improve its accuracy. Finally, by jointly training the two parts, we obtain low-resolution performance close to state-of-the-art performance with high-resolution images as input.
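Since the abstract describes a two-part pipeline — a super-resolution network (SReNe) whose output feeds a pose estimation network, trained jointly, with real high-resolution images also shown to the pose network during training — a schematic training step might look as follows. All module and argument names are placeholders, not the paper's code.

```python
import torch
import torch.nn as nn

def joint_training_step(sr_net, pose_net, lr_img, hr_img, heatmap_gt,
                        sr_weight=1.0, pose_weight=1.0):
    """One joint update of a super-resolution + pose-estimation pipeline (sketch).

    lr_img:     batch of low-resolution input images
    hr_img:     corresponding real high-resolution images
    heatmap_gt: ground-truth joint heatmaps at the high resolution
    """
    sr_criterion = nn.L1Loss()     # reconstruction loss for the super-resolved image
    pose_criterion = nn.MSELoss()  # heatmap regression loss for the pose network

    # 1) super-resolve the low-resolution input
    sr_img = sr_net(lr_img)
    loss_sr = sr_criterion(sr_img, hr_img)

    # 2) estimate pose on the super-resolved image
    loss_pose_sr = pose_criterion(pose_net(sr_img), heatmap_gt)

    # 3) also feed real high-resolution images to the pose network during training,
    #    as the abstract suggests, to anchor it to genuine high-resolution statistics
    loss_pose_hr = pose_criterion(pose_net(hr_img), heatmap_gt)

    loss = sr_weight * loss_sr + pose_weight * (loss_pose_sr + loss_pose_hr)
    loss.backward()
    return loss.detach()
```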
Article
Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV-v2, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and weather conditions, and enables benchmarking of models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1) some nuisance factors have a much stronger negative effect on performance than others, depending also on the vision task; 2) current approaches to enhance robustness have only marginal effects, and can even reduce robustness; 3) we do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich test bed to study robustness and will help push forward research in this area. Our dataset is publicly available online at https://genintel.mpi-inf.mpg.de/ood-cv-v2.html.
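A natural way to use a benchmark annotated with nuisance factors is to report accuracy separately per factor, which is how conclusions such as "some nuisance factors have a much stronger negative effect" are typically reached. A minimal sketch follows; the loader's (images, labels, nuisance) format is an assumption, not the benchmark's actual API.

```python
from collections import defaultdict
import torch

@torch.no_grad()
def accuracy_by_nuisance(model, loader, device="cuda"):
    """Report classification accuracy separately per nuisance factor (sketch).

    Assumes each batch yields (images, labels, nuisance), where `nuisance` is a
    sequence of strings such as "pose", "shape", "texture", "context", "weather".
    """
    model.eval()
    correct, total = defaultdict(int), defaultdict(int)
    for images, labels, nuisance in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for p, y, n in zip(preds, labels, nuisance):
            correct[n] += int(p == y)
            total[n] += 1
    return {n: correct[n] / total[n] for n in total}
```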
Article
Full-text available
This is an opinion paper about the strengths and weaknesses of Deep Nets for vision. They are at the heart of the enormous recent progress in artificial intelligence and are of growing importance in cognitive science and neuroscience. They have had many successes but also have several limitations, and there is limited understanding of their inner workings. At present, Deep Nets perform very well on specific visual tasks with benchmark datasets, but they are much less general-purpose, flexible, and adaptive than the human visual system. We argue that Deep Nets in their current form are unlikely to be able to overcome the fundamental problem of computer vision, namely how to deal with the combinatorial explosion caused by the enormous complexity of natural images, and obtain the rich understanding of visual scenes that the human visual system achieves. We argue that this combinatorial explosion takes us into a regime where “big data is not enough” and where we need to rethink our methods for benchmarking performance and evaluating vision algorithms. We stress that, as vision algorithms are increasingly used in real-world applications, performance evaluation is not merely an academic exercise but has important consequences in the real world. It is impractical to review the entire Deep Net literature, so we restrict ourselves to a limited range of topics and references which are intended as entry points into the literature. The views expressed in this paper are our own and do not necessarily represent those of anybody else in the computer vision community.
Article
Parsing humans into semantic parts is crucial to human-centric analysis. In this paper, we propose a human parsing pipeline that uses pose cues, e.g., estimates of human joint locations, to provide pose-guided segment proposals for semantic parts. These segment proposals are ranked using standard appearance cues, deep-learned semantic features, and a novel pose feature called pose-context. The proposals are then selected and assembled using an And-Or graph to output a parse of the person. The And-Or graph is able to deal with large variability in human appearance due to pose, choice of clothing, etc. We evaluate our approach on the popular Penn-Fudan pedestrian parsing dataset, showing that it significantly outperforms the state of the art, and perform diagnostics to demonstrate the effectiveness of the different stages of our pipeline.
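The proposal-ranking stage of such a pipeline amounts to scoring each pose-guided segment proposal from a combination of its cues. A toy version follows; the feature-extraction callables and the learned weight vector are assumptions, and the paper's actual scoring and And-Or-graph assembly are considerably richer.

```python
import numpy as np

def rank_proposals(proposals, appearance_feat, semantic_feat, pose_context_feat, weights):
    """Score and sort part-segment proposals by a linear combination of cues (toy sketch).

    Each *_feat argument maps a proposal to a feature vector; `weights` is a
    learned scoring vector over the concatenated features.
    """
    scored = []
    for prop in proposals:
        feat = np.concatenate([appearance_feat(prop),
                               semantic_feat(prop),
                               pose_context_feat(prop)])
        scored.append((float(weights @ feat), prop))
    # highest score first; the top-ranked proposals feed the And-Or graph assembly
    return sorted(scored, key=lambda s: s[0], reverse=True)
```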
Chapter
Despite deep convolutional neural networks’ great success in object classification, recent work has shown that they suffer from a severe generalization performance drop under occlusion conditions that do not appear in the training data. Due to the large variability of occluders in terms of shape and appearance, training data can hardly cover all possible occlusion conditions. However, in practice we expect models to reliably generalize to various novel occlusion conditions, rather than being limited to the training conditions. In this work, we integrate inductive priors including prototypes, partial matching and top-down modulation into deep neural networks to realize robust object classification under novel occlusion conditions, with limited occlusion in the training data. We first introduce prototype learning, since its regularization encourages compact data clusters and thus better generalization. Then, a visibility map at an intermediate layer, based on a feature dictionary and activation scale, is estimated for partial matching; this prior sifts irrelevant information out when comparing features with prototypes. Further, inspired by the important role of feedback connections in neuroscience for object recognition under occlusion, a structural prior, i.e., top-down modulation, is introduced into the convolution layers, purposefully reducing contamination by occlusion during feature extraction. Experimental results on partially occluded MNIST, vehicles from the PASCAL3D+ dataset, and vehicles from the cropped COCO dataset demonstrate improvements under both simulated and real-world novel occlusion conditions, as well as under dataset transfer.
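The partial-matching step described here, comparing intermediate features to class prototypes while a visibility map down-weights positions believed to be occluded, can be sketched compactly. In the chapter the visibility map is estimated from a learned feature dictionary and activation scale; the sketch below simply takes it as an input, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def partial_matching_score(feats, prototypes, visibility):
    """Visibility-weighted similarity between a feature map and class prototypes (sketch).

    feats:      (C, H, W) intermediate feature map of one image
    prototypes: (K, C, H, W) one prototype feature map per class (or per cluster)
    visibility: (H, W) values in [0, 1], near 0 where a position is believed occluded
    Occluded positions contribute little to the comparison, so the score is
    dominated by the visible object parts.
    """
    f = F.normalize(feats, dim=0)
    p = F.normalize(prototypes, dim=1)
    cos = (p * f.unsqueeze(0)).sum(dim=1)                 # (K, H, W) per-position cosine
    weighted = (cos * visibility).sum(dim=(1, 2))         # visibility-weighted evidence
    return weighted / visibility.sum().clamp(min=1e-6)    # (K,) normalized class scores
```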