Preprint

Weakly-Supervised Amodal Instance Segmentation with Compositional Priors

Abstract

Amodal segmentation in biological vision refers to the perception of the entire object when only a fraction is visible. This ability to see through occluders and reason about occlusion is innate to biological vision but not adequately modeled in current machine vision approaches. A key challenge is that ground-truth supervision for amodal object segmentation is inherently difficult to obtain. In this paper, we present a neural network architecture that is capable of amodal perception when weakly supervised with standard (inmodal) bounding box annotations. Our model extends compositional convolutional neural networks (CompositionalNets), which have been shown to be robust to partial occlusion by explicitly representing objects as compositions of parts. In particular, we extend CompositionalNets by 1) expanding their innate part-voting mechanism to perform instance segmentation, and 2) exploiting their internal representations to enable amodal completion of both the bounding box and the segmentation mask. Our extensive experiments show that the proposed model segments amodal masks robustly, with much improved mask prediction quality compared to state-of-the-art amodal segmentation approaches.

Conference Paper
This paper presents a weakly supervised instance segmentation method that consumes training data with tight bounding box annotations. The major difficulty lies in the uncertain figure-ground separation within each bounding box since there is no supervisory signal about it. We address the difficulty by formulating the problem as a multiple instance learning (MIL) task, and generate positive and negative bags based on the sweeping lines of each bounding box. The proposed deep model integrates MIL into a fully supervised instance segmentation network, and can be derived by the objective consisting of two terms, i.e., the unary term and the pairwise term. The former estimates the foreground and background areas of each bounding box while the latter maintains the unity of the estimated object masks. The experimental results show that our method performs favorably against existing weakly supervised methods and even surpasses some fully supervised methods for instance segmentation on the PASCAL VOC dataset. The code is available at https://github.com/chengchunhsu/WSIS_BBTP.
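The bag construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their code is at the URL above). It assumes a single object per image and a tight box, so that every horizontal or vertical line crossing the box touches at least one object pixel (a positive bag), while every line that misses the box contains none (a negative bag). The function names `sweeping_line_bags` and `mil_unary_loss` are hypothetical, and only a unary-style term is shown; the pairwise term is omitted.

```python
import math


def sweeping_line_bags(box, height, width):
    """Build MIL bags from one tight box (x0, y0, x1, y1), inclusive pixels.

    Assumption: a line crossing a tight box hits the object at least once,
    so its in-box pixels form a positive bag; a line missing the box forms
    a negative bag. Each bag is a list of (row, col) pixel coordinates.
    """
    x0, y0, x1, y1 = box
    positive, negative = [], []
    for r in range(height):  # horizontal sweeping lines
        if y0 <= r <= y1:
            positive.append([(r, c) for c in range(x0, x1 + 1)])
        else:
            negative.append([(r, c) for c in range(width)])
    for c in range(width):  # vertical sweeping lines
        if x0 <= c <= x1:
            positive.append([(r, c) for r in range(y0, y1 + 1)])
        else:
            negative.append([(r, c) for r in range(height)])
    return positive, negative


def mil_unary_loss(prob, positive, negative, eps=1e-7):
    """Average log loss on the maximum foreground probability of each bag.

    Positive bags should contain at least one confident foreground pixel;
    negative bags should contain none.
    """
    loss = 0.0
    for bag in positive:
        p = max(prob[r][c] for r, c in bag)
        loss -= math.log(max(p, eps))
    for bag in negative:
        p = max(prob[r][c] for r, c in bag)
        loss -= math.log(max(1.0 - p, eps))
    return loss / (len(positive) + len(negative))
```

Using the bag maximum means a positive bag is satisfied by a single confident foreground pixel, which is why the abstract pairs the unary term with a pairwise term that keeps the estimated mask coherent.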
Article
Amodal completion is the representation of those parts of the perceived object that we get no sensory stimulation from. In the case of vision, it is the representation of occluded parts of objects we see: When we see a cat behind a picket fence, our perceptual system represents those parts of the cat that are occluded by the picket fence. The aim of this piece is to argue that amodal completion plays a constitutive role in our everyday perception and trace the theoretical consequences of this claim.
Conference Paper
Pedestrian detection in crowded scenes is a challenging problem since pedestrians often gather together and occlude each other. In this paper, we propose a new occlusion-aware R-CNN (OR-CNN) to improve detection accuracy in crowds. Specifically, we design a new aggregation loss that enforces proposals to be close to, and compactly located around, the corresponding ground-truth objects. Meanwhile, we use a new part occlusion-aware region of interest (PORoI) pooling unit to replace the RoI pooling layer, in order to integrate prior structure information of the human body with visibility prediction into the network to handle occlusion. Our detector is trained in an end-to-end fashion, achieves state-of-the-art results on three pedestrian detection datasets, i.e., CityPersons, ETH, and INRIA, and performs on par with the state of the art on Caltech.
Article
Visual area V4 is a midtier cortical area in the ventral visual pathway. It is crucial for visual object recognition and has been a focus of many studies on visual attention. However, there is no unifying view of V4's role in visual processing. Neither is there an understanding of how its role in feature processing interfaces with its role in visual attention. This review captures our current knowledge of V4, largely derived from electrophysiological and imaging studies in the macaque monkey. Based on recent discovery of functionally specific domains in V4, we propose that the unifying function of V4 circuitry is to enable selective extraction of specific functional domain-based networks, whether it be by bottom-up specification of object features or by top-down attentionally driven selection.
Chapter
Despite deep convolutional neural networks’ great success in object classification, recent work has shown that they suffer from a severe generalization performance drop under occlusion conditions that do not appear in the training data. Due to the large variability of occluders in terms of shape and appearance, training data can hardly cover all possible occlusion conditions. However, in practice we expect models to reliably generalize to various novel occlusion conditions, rather than being limited to the training conditions. In this work, we integrate inductive priors including prototypes, partial matching and top-down modulation into deep neural networks to realize robust object classification under novel occlusion conditions, with limited occlusion in training data. We first introduce prototype learning as its regularization encourages compact data clusters for better generalization ability. Then, a visibility map at the intermediate layer based on feature dictionary and activation scale is estimated for partial matching, whose prior sifts irrelevant information out when comparing features with prototypes. Further, inspired by the important role of feedback connection in neuroscience for object recognition under occlusion, a structural prior, i.e. top-down modulation, is introduced into convolution layers, purposefully reducing the contamination by occlusion during feature extraction. Experiment results on partially occluded MNIST, vehicles from the PASCAL3D+ dataset, and vehicles from the cropped COCO dataset demonstrate the improvement under both simulated and real-world novel occlusion conditions, as well as under the transfer of datasets.
Article
Real-world value often depends on subtle, continuously variable visual cues specific to particular object categories, like the tailoring of a suit, the condition of an automobile, or the construction of a house. Here, we used microelectrode recording in behaving monkeys to test two possible mechanisms for category-specific value-cue processing: (1) previous findings suggest that prefrontal cortex (PFC) identifies object categories, and based on category identity, PFC could use top-down attentional modulation to enhance visual processing of category-specific value cues, providing signals to PFC for calculating value, and (2) a faster mechanism would be first-pass visual processing of category-specific value cues, immediately providing the necessary visual information to PFC. This, however, would require learned mechanisms for processing the appropriate cues in a given object category. To test these hypotheses, we trained monkeys to discriminate value in four letter-like stimulus categories. Each category had a different, continuously variable shape cue that signified value (liquid reward amount) as well as other cues that were irrelevant. Monkeys chose between stimuli of different reward values. Consistent with the first-pass hypothesis, we found early signals for category-specific value cues in area TE (the final stage in monkey ventral visual pathway) beginning 81 ms after stimulus onset, essentially at the start of TE responses. Task-related activity emerged in lateral PFC approximately 40 ms later and consisted mainly of category-invariant value tuning. Our results show that, for familiar, behaviorally relevant object categories, high-level ventral pathway cortex can implement rapid, first-pass processing of category-specific value cues.
Conference Paper
We consider the problem of amodal instance segmentation, the objective of which is to predict the region encompassing both visible and occluded parts of each object. Thus far, the lack of publicly available amodal segmentation annotations has stymied the development of amodal segmentation methods. In this paper, we sidestep this issue by relying solely on standard modal instance segmentation annotations to train our model. The result is a new method for amodal instance segmentation, which represents the first such method to the best of our knowledge. We demonstrate the proposed method’s effectiveness both qualitatively and quantitatively.
Conference Paper
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
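The randomized seeding this abstract refers to is commonly known as k-means++. A minimal sketch is given below, assuming the standard formulation: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `kmeans_pp_seeds` and the uniform fallback for degenerate inputs are illustrative choices, not taken from the paper.

```python
import random


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers from `points` by squared-distance weighted sampling."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # First center: uniform at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        total = sum(d2)
        if total == 0.0:
            # Every point coincides with a chosen center: fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample index i with probability d2[i] / total.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Update nearest-center distances with the newly added center.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers
```

The returned centers would then be passed to an ordinary k-means (Lloyd) iteration; only the initialization changes.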
Article
Sparse coding has long been recognized as a primary goal of image transformation in the visual system. Sparse coding in early visual cortex is achieved by abstracting local oriented spatial frequencies and by excitatory/inhibitory surround modulation. Object responses are thought to be sparse at subsequent processing stages, but neural mechanisms for higher-level sparsification are not known. Here, convergent results from macaque area V4 neural recording and simulated V4 populations trained on natural object contours suggest that sparse coding is achieved in midlevel visual cortex by emphasizing representation of acute convex and concave curvature. We studied 165 V4 neurons with a random, adaptive stimulus strategy to minimize bias and explore an unlimited range of contour shapes. V4 responses were strongly weighted toward contours containing acute convex or concave curvature. In contrast, the tuning distribution in nonsparse simulated V4 populations was strongly weighted toward low curvature. But as sparseness constraints increased, the simulated tuning distribution shifted progressively toward more acute convex and concave curvature, matching the neural recording results. These findings indicate a sparse object coding scheme in midlevel visual cortex based on uncommon but diagnostic regions of acute contour curvature.
Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Adam Kortylewski, Qing Liu, Angtian Wang, Yihong Sun, and Alan Yuille. Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion. arXiv preprint arXiv:2006.15538, 2020.
Hongru Zhu, Peng Tang, Jeongho Park, Soojin Park, and Alan Yuille. Robustness of object recognition under extreme occlusion in humans and computational models. CogSci Conference, 2019.