Article · Publisher preview available

Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion

Authors: Adam Kortylewski, Qing Liu, Angtian Wang, Yihong Sun, Alan Yuille

Abstract

Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets)—an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, as well as to further decompose object representations in terms of individual parts and the objects’ pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments in terms of image classification and object detection on images of artificially occluded objects from the PASCAL3D+ and ImageNet datasets, and real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets made from several popular DCNN backbones (VGG-16, ResNet50, ResNext) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human interpretable predictions as their individual components can be understood as detecting parts and estimating an object’s viewpoint.
International Journal of Computer Vision (2021) 129:736–760
https://doi.org/10.1007/s11263-020-01401-3

Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion

Adam Kortylewski · Qing Liu · Angtian Wang · Yihong Sun · Alan Yuille

Received: 20 January 2020 / Accepted: 4 November 2020 / Published online: 24 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Keywords: Compositional models · Robustness to partial occlusion · Image classification · Object detection · Out-of-distribution generalization
Communicated by Mei Chen.

Corresponding author: Adam Kortylewski (akortyl1@jhu.edu)
Qing Liu (qingliu@jhu.edu) · Angtian Wang (angtianwang@jhu.edu) · Yihong Sun (ysun86@jhu.edu) · Alan Yuille (ayuille1@jhu.edu)
Johns Hopkins University, Baltimore, MD, USA

1 Introduction

Advances in the architecture design of deep convolutional neural networks (DCNNs) (Krizhevsky et al. 2012; Simonyan and Zisserman 2014; He et al. 2016) increased the performance of computer vision systems at object recognition enormously. This led to the deployment of computer vision models in safety-critical real-world applications, such as self-driving cars and security systems. In these application areas, we expect models to reliably generalize to previously unseen visual stimuli. However, in practice we observe that deep models do not generalize as well as humans in scenarios that differ from what has been observed during training, e.g., unseen partial occlusion, rare object poses, changes in the environment, etc. This lack of generalization may lead to fatal consequences in real-world applications, e.g. when driver-assistance systems fail to detect partially occluded pedestrians (Economist 2017).

In particular, a key problem for computer vision systems is how to deal with partial occlusion. In natural environments, objects are often surrounded and partially occluded by each other. The large variability of occluders in terms of their ...
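The abstract above describes replacing the fully connected classification head with a differentiable, generative compositional model: part detectors score each feature-map position, spatial mixtures per class capture the object's pose, and an occluder model explains positions that do not fit the object. The sketch below illustrates that idea in PyTorch; it is a minimal, hedged approximation, and the vMF-style kernels, the fixed occluder score, and all tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalHead(nn.Module):
    """Sketch of an occlusion-aware compositional classification head.

    Replaces a fully connected head: backbone features are compared to
    vMF-like part kernels, spatial mixtures per class encode pose, and an
    occluder model lets poorly explained positions opt out of the object model.
    """

    def __init__(self, feat_dim, num_parts, num_classes, num_mixtures, height, width):
        super().__init__()
        # Part "templates": unit-norm vMF mean directions in feature space.
        self.vmf_kernels = nn.Parameter(torch.randn(num_parts, feat_dim))
        # Per class and mixture (viewpoint): expected part activations per position.
        self.spatial_mixtures = nn.Parameter(
            torch.rand(num_classes, num_mixtures, num_parts, height, width))
        # Constant log-likelihood assigned to the "occluder" explanation (assumption).
        self.occluder_score = nn.Parameter(torch.tensor(-1.0), requires_grad=False)

    def forward(self, features):                       # features: (B, C, H, W)
        f = F.normalize(features, dim=1)
        k = F.normalize(self.vmf_kernels, dim=1)
        # Part likelihoods ~ cosine similarity to each part kernel: (B, P, H, W)
        part_ll = torch.einsum('bchw,pc->bphw', f, k)
        mix = F.softmax(self.spatial_mixtures, dim=2)   # normalize over parts
        # Object log-likelihood per class, mixture and position: (B, K, M, H, W)
        obj_ll = torch.log(
            torch.einsum('bphw,kmphw->bkmhw', part_ll.clamp(min=1e-6), mix) + 1e-6)
        # Occlusion robustness: each position is explained by the object model
        # OR by the occluder model, whichever fits better.
        pos_ll = torch.maximum(obj_ll, self.occluder_score.expand_as(obj_ll))
        # Sum over positions, take the best mixture (pose) per class.
        return pos_ll.flatten(3).sum(-1).max(dim=2).values   # (B, K) class scores
```

Occluder localization, as mentioned in the abstract, would then correspond to the positions where the occluder explanation scores higher than the object model.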
... Recognition of objects and faces under occlusion has advanced rapidly in recent years, yielding novel ideas and approaches for recognizing occluded commodities [12][13][14]. These are broadly classified into two classes: one adds weight to the unoccluded part, and the other restores the occluded part. Kortylewski et al. [15] proposed a compositional convolutional neural network (CNN) model to recognize products based on unoccluded parts. Wang et al. [16] proposed an object shape feature extraction approach called slope difference distribution (SDD), which extracts shape features as a sparse representation and uses the detected SDD features of all shape models and the minimum distance between SDD features for object recognition. ...
Article
Full-text available
For the recognition of goods in intelligent retail dynamic visual containers, two problems that lead to low recognition accuracy must be addressed: one is the lack of goods features caused by occlusion by the hand, and the other is the high similarity of goods. Therefore, this study proposes an approach for recognizing occluded goods based on a generative adversarial network combined with prior inference to address the two abovementioned problems. With DarkNet53 as the backbone network, semantic segmentation is used to locate the occluded part in the feature extraction network, and simultaneously the YOLOX decoupled head is used to obtain the detection frame. Subsequently, a generative adversarial network under prior inference is used to restore and expand the features of the occluded parts, and an attention module that weights multi-scale spatial attention and efficient channel attention is proposed to select fine-grained features of goods. Finally, a metric learning method based on the von Mises–Fisher distribution is proposed to increase the class spacing of features and thereby make them more distinguishable, and the distinguished features are used to recognize goods at a fine-grained level. The experimental data used in this study were all obtained from a self-made smart retail container dataset, which contains 12 types of goods used for recognition and includes four pairs of similar goods. Experimental results reveal that the peak signal-to-noise ratio and structural similarity under the improved prior inference are 0.7743 and 0.0183 higher than those of the other models, respectively. Compared with other optimal models, the proposed approach improves the mAP by 1.2% and the recognition accuracy by 2.82%. This study thus addresses both the occlusion caused by hands and the high similarity of goods, meeting the accuracy requirements of commodity recognition in intelligent retail and exhibiting good application prospects.
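The metric-learning component mentioned in this abstract is based on the von Mises–Fisher distribution. A common way to realize such a loss is to L2-normalize embeddings, treat class prototypes as vMF mean directions, and scale the cosine logits by a concentration parameter before a cross-entropy loss; the sketch below follows that generic recipe, and the names `prototypes` and `kappa` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VMFMetricLoss(nn.Module):
    """Sketch of a vMF-style metric-learning loss: class prototypes act as
    mean directions on the unit hypersphere and a shared concentration
    parameter kappa scales the cosine logits, pushing classes apart."""

    def __init__(self, feat_dim, num_classes, kappa=16.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.kappa = kappa

    def forward(self, embeddings, labels):
        z = F.normalize(embeddings, dim=1)          # unit-norm features
        mu = F.normalize(self.prototypes, dim=1)    # unit-norm class directions
        logits = self.kappa * z @ mu.t()            # vMF log-likelihood up to a constant
        return F.cross_entropy(logits, labels)

# usage sketch: loss = VMFMetricLoss(feat_dim=256, num_classes=12)(features, labels)
```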
... Complex scene understanding has become a research focus in the image understanding domain [91,92,93,94,95,96,97,98,99,100]. For example, Ke et al. [101] propose Bilayer Convolutional Network (BCNet) to decouple overlapping objects into occluder and occludee layers. ...
Preprint
Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmentation of objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings on the proposed MOSE dataset and conduct comprehensive comparisons. The experiments show that current VOS algorithms do not perceive objects well in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes and more effort is needed to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE.
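The J&F numbers quoted above are the standard VOS metric: the mean of region similarity J (mask IoU) and contour accuracy F (boundary precision/recall within a small tolerance). The sketch below is a simplified re-implementation for per-frame binary masks, not the official DAVIS/MOSE evaluation code; the tolerance band and the morphological boundary extraction are simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def jaccard(pred, gt):
    """Region similarity J: IoU of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt, tol=2):
    """Contour accuracy F (simplified): precision/recall of boundary pixels
    matched within a small tolerance band of width `tol`."""
    def boundary(mask):
        mask = np.asarray(mask, dtype=bool)
        return mask & ~binary_erosion(mask)
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (bp & binary_dilation(bg, struct)).sum() / max(bp.sum(), 1)
    recall = (bg & binary_dilation(bp, struct)).sum() / max(bg.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def j_and_f(preds, gts):
    """Mean of J and F over a sequence of per-frame boolean masks."""
    j = np.mean([jaccard(p, g) for p, g in zip(preds, gts)])
    f = np.mean([boundary_f(p, g) for p, g in zip(preds, gts)])
    return (j + f) / 2
```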
... Despite their incontestable success in a number of visual tasks, deep models are not fully trusted for real-world applications because of their sensitivity to input changes. This is an active area of research, and proposed solutions operate both at the data level (DeVries & Taylor, 2017; Yun et al., 2019) and at the algorithm level (Globerson & Roweis, 2006; Kortylewski et al., 2020; Zhu et al., 2019; Xu et al., 2020). A widely adopted method for measuring occlusion robustness is the accuracy obtained after superimposing a rectangular patch on an image (Chun et al., 2020; Yun et al., 2019; Fawzi & Frossard, 2016; Zhong et al., 2020; Kokhlikyan et al., 2020). ...
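The rectangular-patch protocol referenced in this excerpt can be written down in a few lines: superimpose a patch at a random location in each test image and measure the accuracy that remains. The patch size, fill value, and evaluation loop below are illustrative choices for the sketch, not any specific paper's setup.

```python
import torch

def occlude_with_patch(images, patch_frac=0.3, value=0.0):
    """Superimpose a square patch at a random location in each image
    (images: tensor of shape (B, C, H, W)); a simple, common proxy for
    partial occlusion when measuring robustness."""
    b, _, h, w = images.shape
    out = images.clone()
    ph, pw = int(h * patch_frac), int(w * patch_frac)
    for i in range(b):
        top = torch.randint(0, h - ph + 1, (1,)).item()
        left = torch.randint(0, w - pw + 1, (1,)).item()
        out[i, :, top:top + ph, left:left + pw] = value
    return out

@torch.no_grad()
def occluded_accuracy(model, loader, device="cpu"):
    """Classification accuracy of `model` on patch-occluded inputs."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(occlude_with_patch(images)).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```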
Preprint
Full-text available
Over the past years, the crucial role of data has largely been shadowed by the field's focus on architectures and training procedures. We often cause changes to the data without being aware of their wider implications. In this paper we show that distorting images without accounting for the artefacts introduced leads to biased results when establishing occlusion robustness. To ensure models behave as expected in real-world scenarios, we need to rule out the impact added artefacts have on evaluation. We propose a new approach, iOcclusion, as a fairer alternative for applications where the possible occluders are unknown.
Article
Neural network-based solutions have revolutionized the field of computer vision by achieving outstanding performance in a number of applications. Yet, while these deep learning models outclass previous methods, they still have significant shortcomings relating to generalization and robustness to input disturbances, such as occlusion. Most existing methods that tackle this latter problem use passive neural network architectures that are unable to act on and, thus, influence the observed scene. In this paper, we argue that an active observer agent may be able to achieve superior performance by changing the parameters of the scene, thus avoiding occlusion by moving to a different position in the scene. To demonstrate this, a reinforcement learning environment is introduced that implements OpenAI Gym’s interface and allows the creation of synthetic scenes with realistic occlusion. The environment is implemented using differentiable rendering, allowing us to perform direct gradient-based optimization of the camera position. Moreover, two additional methods are also presented, one utilizing self-supervised learning to predict occlusion segments and optimal camera positions, while the other learns to avoid occlusion using reinforcement learning. We present comparative experiments of the proposed methods to demonstrate their efficiency. It was shown, via Bayesian t-tests, that the neural network-based methods credibly outperformed the gradient-based avoidance strategy by avoiding occlusion with an average of 5.0 fewer steps in multi-object scenes.
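The abstract states that the environment implements OpenAI Gym's interface and that the agent is rewarded for avoiding occlusion by moving the camera. The skeleton below is a hypothetical sketch of such an environment; the `renderer` object and its two methods, the action/observation spaces, and the reward and termination rules are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
import gym
from gym import spaces

class OcclusionAvoidanceEnv(gym.Env):
    """Hypothetical Gym-style environment: the agent moves the camera and is
    rewarded for keeping the target object unoccluded."""

    def __init__(self, renderer):
        super().__init__()
        self.renderer = renderer  # assumed scene renderer providing the two methods used below
        self.action_space = spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32)        # camera move
        self.observation_space = spaces.Box(0.0, 1.0, shape=(64, 64, 3), dtype=np.float32)
        self.camera_pos = np.zeros(3, dtype=np.float32)

    def reset(self):
        self.camera_pos = np.random.uniform(-1.0, 1.0, size=3).astype(np.float32)
        return self.renderer.render(self.camera_pos)

    def step(self, action):
        self.camera_pos = self.camera_pos + 0.1 * np.asarray(action, dtype=np.float32)
        obs = self.renderer.render(self.camera_pos)
        visibility = self.renderer.target_visibility(self.camera_pos)  # unoccluded fraction of target
        reward = float(visibility)
        done = visibility > 0.95
        return obs, reward, done, {}
```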
Chapter
Computer vision (CV) is a branch of artificial intelligence that trains and assists computers in recognizing and comprehending the content of digital images. It is primarily concerned with replicating attributes of the human vision system and empowering computer systems to process and categorize artifacts in digital images similarly to humans. CV can be applied in various domains, including robotics, autonomous vehicles, remote sensing, medical diagnosis, pattern recognition, etc. Extracting image features has become a key element in CV applications. For this purpose, we use shape feature detectors and descriptors. Motivated by the need to understand shape feature detector fundamentals and applications in CV, the present work aims to explore various feature extraction techniques and shape detection approaches required for image retrieval. In addition, real-time applications of shape feature extraction and object recognition techniques are also discussed with examples.
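As one concrete example of the shape descriptors discussed in this chapter summary, the snippet below extracts the largest contour from a grayscale image and computes Hu moment invariants with OpenCV, a classic translation-, scale-, and rotation-invariant shape feature often used for retrieval; the Otsu thresholding and the log scaling are illustrative choices, not taken from the chapter.

```python
import cv2
import numpy as np

def hu_shape_descriptor(gray_image):
    """Compute a 7-dimensional Hu-moment shape descriptor for the largest
    object in a uint8 grayscale image."""
    _, binary = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.zeros(7)
    largest = max(contours, key=cv2.contourArea)
    hu = cv2.HuMoments(cv2.moments(largest)).flatten()
    # Log-scale the moments for numerical comparability across shapes.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)
```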
Chapter
Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context, and weather conditions, and enables benchmarking of models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1) some nuisance factors have a much stronger negative effect on performance than others, depending also on the vision task; 2) current approaches to enhance robustness have only marginal effects, and can even reduce robustness; 3) we do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich testbed to study robustness and will help push forward research in this area.
Chapter
We consider the problem of category-level 6D pose estimation from a single RGB image. Our approach represents an object category as a cuboid mesh and learns a generative model of the neural feature activations at each mesh vertex to perform pose estimation through differentiable rendering. A common problem of rendering-based approaches is that they rely on bounding box proposals, which do not convey information about the 3D rotation of the object and are not reliable when objects are partially occluded. Instead, we introduce a coarse-to-fine optimization strategy that utilizes the rendering process to estimate a sparse set of 6D object proposals, which are subsequently refined with gradient-based optimization. The key to enabling the convergence of our approach is a neural feature representation that is trained to be scale- and rotation-invariant using contrastive learning. Our experiments demonstrate an enhanced category-level 6D pose estimation performance compared to prior work, particularly under strong partial occlusion. Keywords: Category-level 6D pose estimation · Render-and-compare
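The render-and-compare refinement described here can be viewed as gradient-based optimization of the pose against observed backbone features. The sketch below shows one such loop; `render_vertex_features`, the 6D pose parameterization, and the cosine-similarity loss are hypothetical placeholders, not the chapter's actual implementation.

```python
import torch
import torch.nn.functional as F

def refine_pose(pose_init, observed_feats, render_vertex_features, steps=100, lr=0.02):
    """Gradient-based render-and-compare refinement of a 6D pose proposal.

    pose_init: (6,) tensor, e.g. axis-angle rotation + translation (assumed).
    observed_feats: (C, H, W) feature map from the backbone.
    render_vertex_features: assumed differentiable function pose -> (C, H, W) rendered features.
    """
    pose = pose_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = render_vertex_features(pose)
        # Negative feature similarity as the reconstruction loss.
        loss = -F.cosine_similarity(rendered.flatten(1), observed_feats.flatten(1), dim=0).mean()
        loss.backward()
        opt.step()
    return pose.detach()
```

In the chapter, a sparse set of coarse 6D proposals would first be scored by the rendering process and only the promising ones refined in this fashion.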
Article
Full-text available
This is an opinion paper about the strengths and weaknesses of Deep Nets for vision. They are at the heart of the enormous recent progress in artificial intelligence and are of growing importance in cognitive science and neuroscience. They have had many successes but also have several limitations and there is limited understanding of their inner workings. At present Deep Nets perform very well on specific visual tasks with benchmark datasets but they are much less general purpose, flexible, and adaptive than the human visual system. We argue that Deep Nets in their current form are unlikely to be able to overcome the fundamental problem of computer vision, namely how to deal with the combinatorial explosion, caused by the enormous complexity of natural images, and obtain the rich understanding of visual scenes that the human visual system achieves. We argue that this combinatorial explosion takes us into a regime where “big data is not enough” and where we need to rethink our methods for benchmarking performance and evaluating vision algorithms. We stress that, as vision algorithms are increasingly used in real-world applications, performance evaluation is not merely an academic exercise but has important consequences in the real world. It is impractical to review the entire Deep Net literature so we restrict ourselves to a limited range of topics and references which are intended as entry points into the literature. The views expressed in this paper are our own and do not necessarily represent those of anybody else in the computer vision community.
Article
Parsing humans into semantic parts is crucial to human-centric analysis. In this paper, we propose a human parsing pipeline that uses pose cues, e.g., estimates of human joint locations, to provide pose-guided segment proposals for semantic parts. These segment proposals are ranked using standard appearance cues, deep-learned semantic features, and a novel pose feature called pose-context. The proposals are then selected and assembled using an And-Or graph to output a parse of the person. The And-Or graph is able to deal with large human appearance variability due to pose, choice of clothing, etc. We evaluate our approach on the popular Penn-Fudan pedestrian parsing dataset, showing that it significantly outperforms the state of the art, and perform diagnostics to demonstrate the effectiveness of the different stages of our pipeline.
Chapter
Despite the great success of deep convolutional neural networks in object classification, recent work has shown that they suffer from a severe generalization performance drop under occlusion conditions that do not appear in the training data. Due to the large variability of occluders in terms of shape and appearance, training data can hardly cover all possible occlusion conditions. However, in practice we expect models to reliably generalize to various novel occlusion conditions, rather than being limited to the training conditions. In this work, we integrate inductive priors including prototypes, partial matching, and top-down modulation into deep neural networks to realize robust object classification under novel occlusion conditions, with limited occlusion in the training data. We first introduce prototype learning, since its regularization encourages compact data clusters and thus better generalization ability. Then, a visibility map is estimated at an intermediate layer from a feature dictionary and activation scales and used for partial matching; its prior sifts irrelevant information out when comparing features with prototypes. Further, inspired by the important role of feedback connections in neuroscience for object recognition under occlusion, a structural prior, i.e., top-down modulation, is introduced into the convolution layers, purposefully reducing the contamination by occlusion during feature extraction. Experimental results on partially occluded MNIST, vehicles from the PASCAL3D+ dataset, and vehicles from the cropped COCO dataset demonstrate the improvement under both simulated and real-world novel occlusion conditions, as well as under transfer across datasets.
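The partial-matching mechanism described in this abstract (a visibility map estimated from a feature dictionary, used so that occluded positions do not contaminate the comparison with prototypes) might look roughly like the sketch below; the cosine-based visibility estimate, the threshold, and the squared-distance matching are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def visibility_map(features, dictionary, threshold=0.45):
    """Estimate which spatial positions look like known object parts.

    features: (B, C, H, W) intermediate-layer activations.
    dictionary: (D, C) learned feature dictionary (e.g., cluster centers of part features).
    Returns a (B, H, W) binary map: 1 where some dictionary entry matches well.
    """
    f = F.normalize(features, dim=1)
    d = F.normalize(dictionary, dim=1)
    sim = torch.einsum('bchw,dc->bdhw', f, d)        # cosine similarity to every entry
    return (sim.max(dim=1).values > threshold).float()

def partial_matching_distance(features, prototype, vis):
    """Distance to a class prototype of shape (C, H, W), computed only over
    visible positions so occluded regions do not contaminate the comparison."""
    diff = ((features - prototype) ** 2).sum(dim=1)   # (B, H, W) squared distance
    return (diff * vis).sum(dim=(1, 2)) / vis.sum(dim=(1, 2)).clamp(min=1.0)
```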