Conference Paper

Amodal Segmentation through Out-of-Task and Out-of-Distribution Generalization with a Bayesian Model

... Related works on amodal segmentation often adopt a fully-supervised approach, with training supervision coming from human annotations [6,29] or synthetic occlusions [15,17,24]. Recent work [32] introduces a Bayesian approach that is trained on non-occluded objects only and does not require any amodal supervision. Moreover, our model takes a 3D-aware approach to amodal segmentation, such that our probabilistic model is built on top of deformable object meshes. ...
... Amodal segmentation. We compare our model with Bayesian-Amodal [32], which extends deep neural networks with a Bayesian generative model of neural features. We use their official implementation to train on all categories of the PASCAL3D+ dataset and evaluate on the Occluded PASCAL3D+ dataset. ...
... Amodal segmentation predicts the region of both the visible and occluded parts of an object. Following previous works [29,32], we evaluate the average IoU between the predicted segmentation masks and the ground-truth segmentation masks. ...
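This IoU evaluation can be stated in a few lines; a minimal sketch assuming binary NumPy masks (the function names are illustrative, not from the cited implementations):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pairs):
    """Average IoU over (predicted mask, ground-truth mask) pairs."""
    return float(np.mean([mask_iou(p, g) for p, g in pairs]))
```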
Preprint
Human vision demonstrates higher robustness than current AI algorithms under out-of-distribution scenarios. It has been conjectured that such robustness benefits from performing analysis-by-synthesis. Our paper formulates three vision tasks in a consistent manner using approximate analysis-by-synthesis via render-and-compare algorithms on neural features. In this work, we introduce Neural Textured Deformable Meshes, an object model with deformable geometry that allows optimization over both camera parameters and object geometry. The deformable mesh is parameterized as a neural field and covered by whole-surface neural texture maps, which are trained to be spatially discriminative. During inference, we extract the feature map of the test image and then optimize the 3D pose and shape parameters of our model using differentiable rendering to best reconstruct the target feature map. We show that our analysis-by-synthesis is much more robust than conventional neural networks when evaluated on real-world images and even in challenging out-of-distribution scenarios, such as occlusion and domain shift. Our algorithms are competitive with standard algorithms when tested on conventional performance measures.
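The inference loop this abstract describes is, at its core, gradient descent on pose and shape through a differentiable renderer. A minimal sketch under that reading; `render_features` is a placeholder for a differentiable renderer of neural textures, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def render_and_compare(target_feat, render_features, pose, shape,
                       steps=100, lr=0.05):
    """Optimize 3D pose/shape so rendered features match the target map.

    target_feat: (C, H, W) feature map extracted from the test image.
    render_features: differentiable placeholder, (pose, shape) -> (C, H, W).
    """
    pose = pose.detach().clone().requires_grad_(True)
    shape = shape.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([pose, shape], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rendered = render_features(pose, shape)
        # Per-pixel cosine distance between rendered and target features.
        loss = 1.0 - F.cosine_similarity(rendered, target_feat, dim=0).mean()
        loss.backward()
        opt.step()
    return pose.detach(), shape.detach()
```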
... Finally, the pixels belonging to objects, background, or other objects in the image are labeled separately by generating target segmentation masks. Besides the work in [17], a series of works on amodal instance segmentation has been proposed [18][19][20][21]. Based on the above, amodal instance segmentation is well-suited for C. elegans segmentation due to its ability to manage occlusions and overlapping structures. C. ...
... $L = L_{\text{coarse}} + \lambda_{\text{dec}} L_{\text{dec}} + \lambda_{\text{rmask}} L_{\text{rmask}} + \lambda_{\text{cons}} L_{\text{cons}}$ (21), where $L_{\text{coarse}}$ is the loss for RoI extraction and coarse mask prediction, $L_{\text{dec}}$ is the decomposition loss for overlapping and non-overlapping region segmentation, $L_{\text{rmask}}$ is the segmentation loss for refined masks, and $L_{\text{cons}}$ supervises the semantic consistency between the overall instance and sub-regions. ...
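How the four terms combine into the training objective, as a schematic; the λ weights are hyperparameters whose values are not given in this excerpt, so the defaults below are placeholders:

```python
def br_net_loss(loss_coarse, loss_dec, loss_rmask, loss_cons,
                lambda_dec=1.0, lambda_rmask=1.0, lambda_cons=1.0):
    """Weighted sum of the BR-Net loss terms (weights are placeholders)."""
    return (loss_coarse                      # RoI extraction + coarse masks
            + lambda_dec * loss_dec          # overlap/non-overlap decomposition
            + lambda_rmask * loss_rmask      # refined mask segmentation
            + lambda_cons * loss_cons)       # semantic consistency
```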
Preprint
Caenorhabditis elegans (C. elegans) is an excellent model organism because of its short lifespan and high degree of homology with human genes, and it has been widely used in a variety of human health and disease models. However, the segmentation of C. elegans remains challenging due to the following reasons: 1) the activity trajectory of C. elegans is uncontrollable, and multiple nematodes often overlap, resulting in blurred boundaries of C. elegans. This makes it impossible to clearly study the life trajectory of a certain nematode; and 2) in the microscope images of overlapping C. elegans, the translucent tissues at the edges obscure each other, leading to inaccurate boundary segmentation. To solve these problems, a Bilayer Segmentation-Recombination Network (BR-Net) for the segmentation of C. elegans instances is proposed. The network consists of three parts: A Coarse Mask Segmentation Module (CMSM), a Bilayer Segmentation Module (BSM), and a Semantic Consistency Recombination Module (SCRM). The CMSM is used to extract the coarse mask, and we introduce a Unified Attention Module (UAM) in CMSM to make CMSM better aware of nematode instances. The Bilayer Segmentation Module (BSM) segments the aggregated C. elegans into overlapping and non-overlapping regions. This is followed by integration by the SCRM, where semantic consistency regularization is introduced to segment nematode instances more accurately. Finally, the effectiveness of the method is verified on the C. elegans dataset. The experimental results show that BR-Net exhibits good competitiveness and outperforms other recently proposed instance segmentation methods in processing C. elegans occlusion images.
... Amodal segmentation has been widely studied since then. Researchers explored fully supervised approaches [27,55,39,45,49,29,11], weakly supervised approaches [51,35,41,32], and diverse applications including self-driving [39,4], image augmentation [13,29,38] and robotic gripping systems [43,44,18]. ...
... In the early stages of research, besides direct approaches [27,55,39], many fully supervised methods were proposed involving various concepts, such as depth relationships [53], region correlation [10,20], and shape priors [49,29,5,11]. Soon afterward, a series of weakly supervised methods [51,35,23,24,41,32] began to appear, using simpler annotations such as bounding boxes and categories for supervision. Exploiting the capabilities of amodal segmentation, researchers are concerned with its diverse applications. ...
Preprint
Aiming to predict the complete shapes of partially occluded objects, amodal segmentation is an important step towards visual intelligence. Practical prior knowledge derives from sufficient training, while limited amodal annotations make it challenging to achieve better performance. To tackle this problem, utilizing the mighty priors accumulated in the foundation model, we propose the first SAM-based amodal segmentation approach, PLUG. Methodologically, a novel framework with hierarchical focus is presented to better fit the task characteristics and unleash the potential capabilities of SAM. At the region level, due to the association and division between visible and occluded areas, inmodal and amodal regions are assigned as the focuses of distinct branches to avoid mutual disturbance. At the point level, we introduce the concept of uncertainty to explicitly assist the model in identifying and focusing on ambiguous points. Guided by the uncertainty map, a computationally economical point loss is applied to improve the accuracy of predicted boundaries. Experiments are conducted on several prominent datasets, and the results show that our proposed method outperforms existing methods by large margins. Even with fewer total parameters, our method still exhibits remarkable advantages.
... In a related vein, Sun et al. [50] contributed to research on amodal segmentation under partial occlusion by extending CompNet to infer amodal segmentation. Leveraging a Bayesian generative model with neural network features, they replaced the fully connected classifier in the CNN. ...
... The network comparison indicates that YOLONAS-Cutout performs exceptionally well in detecting objects in less challenging scenarios. Furthermore, our evaluation of CompNet involved adapting its original object segmentation evaluation script [50] to a detection setting, emphasizing its versatility. These comprehensive results offer valuable insights into the relative strengths of generative and deep learning approaches tailored for SVS applications. ...
Article
Full-text available
Smart video surveillance systems (SVSs) have garnered significant attention for their autonomous monitoring capabilities, encompassing automated detection, tracking, analysis, and decision making within complex environments, with minimal human intervention. In this context, object detection is a fundamental task in SVS. However, many current approaches often overlook occlusion by nearby objects, posing challenges to real-world SVS applications. To address this crucial issue, this paper presents a comprehensive comparative analysis of occlusion-handling techniques tailored for object detection. The review outlines the pretext tasks common to both domains and explores various architectural solutions to combat occlusion. Unlike prior studies that primarily focus on a single dataset, our analysis spans multiple benchmark datasets, providing a thorough assessment of various object detection methods. By extending the evaluation to datasets beyond the KITTI benchmark, this study offers a more holistic understanding of each approach’s strengths and limitations. Additionally, we delve into persistent challenges in existing occlusion-handling approaches and emphasize the need for innovative strategies and future research directions to drive substantial progress in this field.
... Several methods [11,25,27,29,34,35,40] are developed by extending the methods designed for visible segmentation or object detection [5, 16-18, 20, 59], while the relative occlusion order information is leveraged by many methods [38,58,61,63]. Since shape prior knowledge is an essential approach for inferring the shape of the occluded region, several methods [7,10,12,26,28,31,47,52] are designed to learn and utilize it. For instance, VRSP [52] designs a codebook to store and retrieve shape priors for amodal masks. ...
Preprint
Full-text available
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.
... Their model directly encapsulates this notion, learning shape priors and using them to drive amodal segmentation. The idea of a prior was also explored by [14], who used Bayesian models, training them on modal data and treating amodal segmentation as out-of-distribution generalization. [15] modeled amodal instance segmentation as two separate tasks, occluded object segmentation and occluding object segmentation, training a single model, BCNet, to exploit relationships between objects. ...
Article
Full-text available
Automatic dimensioning of mealworms based on computer vision is challenging due to occlusions. Amodal instance segmentation (AIS) could be a viable solution, but the acquisition of annotated training data is difficult and time-consuming. This work proposes a new method to prepare data for training AIS models that significantly reduces the human annotation effort. Instead of acquiring the occluded images directly, only images of fully visible larvae are acquired and processed, allowing their contours to be obtained via automatic segmentation. Next, synthetic images with occlusions are generated from the database of automatically extracted instances. The generation procedure uses simple computer graphics tools and is computationally inexpensive, yet yields images that allow training off-the-shelf AIS models. Since those models need to be tested on real data, which requires manual annotation, a data acquisition method that significantly simplifies the test set annotation process is demonstrated. Results are reported in terms of amodal segmentation quality as well as the accuracy of larvae dimensioning, measured using the histogram intersection metric. The best-performing model achieves a mean average precision of 0.41 and a histogram intersection of 0.77, confirming the effectiveness of the proposed method of data acquisition and generation. The method is not specific to mealworm detection and could be applied to other similar problems where object occlusions pose a challenge.
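The histogram intersection metric used here for dimensioning accuracy has a standard closed form; a minimal sketch over two histograms with identical binning:

```python
import numpy as np

def histogram_intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    """Intersection of two histograms with identical bins, in [0, 1]."""
    h1 = h1 / h1.sum()   # normalize so each histogram sums to 1
    h2 = h2 / h2.sum()
    return float(np.minimum(h1, h2).sum())
```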
... First methods investigate predicting an amodal mask given the visible mask and the input image [27,55]. This has been extended to instance segmentation methods predicting the amodal instead of the visible mask directly from the input image [16,22,30,34,36,40,44]. Amodal semantic segmentation methods apply grouping and multi-task training to look behind occlusions [4,6,35]. ...
Preprint
In this work, we study amodal video instance segmentation for automated driving. Previous works perform amodal video instance segmentation relying on methods trained on entirely labeled video data, with techniques borrowed from standard video instance segmentation. Such amodally labeled video data is difficult and expensive to obtain, and the resulting methods suffer from a trade-off between instance segmentation and tracking performance. To largely resolve this issue, we propose to study the application of foundation models to this task. More precisely, we exploit the extensive knowledge of the Segment Anything Model (SAM) while fine-tuning it to the amodal instance segmentation task. Given an initial video instance segmentation, we sample points from the visible masks to prompt our amodal SAM. We use a point memory to store those points. If a previously observed instance is not predicted in a following frame, we retrieve its most recent points from the point memory and use a point tracking method to follow those points to the current frame, together with the corresponding last amodal instance mask. This way, while basing our method on amodal instance segmentation, we nevertheless obtain video-level amodal instance segmentation results. The resulting S-AModal method achieves state-of-the-art results in amodal video instance segmentation while removing the need for amodal video-based labels. Code for S-AModal is available at https://github.com/ifnspaml/S-AModal.
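One plausible way to draw prompt points from a visible instance mask, as the pipeline above describes; this is an illustrative sketch, not the released S-AModal code:

```python
import numpy as np

def sample_prompt_points(visible_mask: np.ndarray, n_points: int = 5,
                         seed: int = 0) -> np.ndarray:
    """Uniformly sample (x, y) prompt points from a binary visible mask."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(visible_mask)
    if len(xs) == 0:
        return np.empty((0, 2), dtype=int)  # instance not visible this frame
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)  # SAM expects (x, y) order
```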
... Hyperspherical representation. Hyperspherical representation under vMF distribution has been extensively used in various machine learning applications, such as supervised classification [79,80,81], face verification [82,83], generative modeling [84], segmentation [85,86], and clustering [87]. In addition, some researchers have utilized the vMF distribution for anomaly detection by employing generative models and using it as the prior for zero-shot learning [88] and document analysis [89]. ...
Preprint
Full-text available
The ability to detect out-of-distribution (OOD) inputs is critical to guarantee the reliability of classification models deployed in an open environment. A fundamental challenge in OOD detection is that a discriminative classifier is typically trained to estimate the posterior probability p(y|z) for class y given an input z, but lacks the explicit likelihood estimation of p(z) ideally needed for OOD detection. While numerous OOD scoring functions have been proposed for classification models, these estimated scores are often heuristic-driven and cannot be rigorously interpreted as likelihood. To bridge the gap, we propose Intrinsic Likelihood (INK), which offers a rigorous likelihood interpretation for modern discriminative classifiers. Specifically, our proposed INK score operates on the constrained latent embeddings of a discriminative classifier, which are modeled as a mixture of hyperspherical embeddings with constant norm. We draw a novel connection between the hyperspherical distribution and the intrinsic likelihood, which can be effectively optimized in modern neural networks. Extensive experiments on the OpenOOD benchmark empirically demonstrate that INK establishes a new state-of-the-art in a variety of OOD detection setups, including both far-OOD and near-OOD. Code is available at https://github.com/deeplearning-wisc/ink.
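The core scoring idea, an unnormalized log-likelihood of a unit-norm embedding under a mixture of von Mises-Fisher components, can be sketched as follows. This is a simplification of INK that drops the vMF normalization constant and assumes equal class priors and a shared, given concentration κ:

```python
import numpy as np

def vmf_mixture_score(z: np.ndarray, class_means: np.ndarray,
                      kappa: float) -> float:
    """Unnormalized vMF-mixture log-likelihood of an embedding.

    z: (d,) embedding; class_means: (K, d) unit-norm class prototypes.
    Each vMF log-density is kappa * mu_k^T z up to a constant, so a
    log-sum-exp over classes yields a likelihood-style OOD score.
    """
    z = z / np.linalg.norm(z)
    logits = kappa * class_means @ z           # (K,)
    m = logits.max()
    return float(m + np.log(np.exp(logits - m).sum()))  # stable log-sum-exp
```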
... Recently, a SAM-based approach [18] achieved state-of-the-art performance, exploiting the mighty feature extraction capability provided by the large-scale foundation model [12]. Noting the labor-intensive and error-prone nature of amodal mask annotation, researchers have also proposed many weakly supervised approaches for amodal segmentation using only box-level supervision or self-supervision [34,19,13,14,26,17]. Building on these advances in amodal segmentation algorithms, researchers have developed various applications. ...
Preprint
Full-text available
Segmentation of surgical instruments is crucial for enhancing surgeon performance and ensuring patient safety. Conventional techniques such as binary, semantic, and instance segmentation share a common drawback: they do not accommodate the parts of instruments obscured by tissues or other instruments. Precisely predicting the full extent of these occluded instruments can significantly improve laparoscopic surgeries by providing critical guidance during operations and assisting in the analysis of potential surgical errors, as well as serving educational purposes. In this paper, we introduce Amodal Segmentation to the realm of surgical instruments in the medical field. This technique identifies both the visible and occluded parts of an object. To achieve this, we introduce a new Amodal Instruments Segmentation (AIS) dataset, which was developed by reannotating each instrument with its complete mask, utilizing the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset. Additionally, we evaluate several leading amodal segmentation methods to establish a benchmark for this new dataset.
... Early works [Kar et al. 2015; Li and Malik 2016] propose to predict the amodal bounding box and pixel-wise masks that encompass the entire extent of an object, respectively. Other works [Sun et al. 2022; Wang et al. 2020; Yuan et al. 2021] integrate compositional models for amodal segmentation. Another line of work proposes semantic-aware distance maps [Zhang et al. 2019], amodal semantic segmentation maps [Breitenstein and Fingscheidt 2022; Mohan and Valada 2022a,b], and amodal scene layouts [Liu et al. 2022; Mani et al. 2020] for amodal prediction. ...
Preprint
Full-text available
Deoccluding the hidden portions of objects in a scene is a formidable task, particularly when addressing real-world scenes. In this paper, we present a new self-supervised PArallel visible-to-COmplete diffusion framework, named PACO, a foundation model for object-level scene deocclusion. Leveraging the rich priors of pre-trained models, we first design the parallel variational autoencoder, which produces a full-view feature map that simultaneously encodes multiple complete objects, and the visible-to-complete latent generator, which learns to implicitly predict the full-view feature map from the partial-view feature map and text prompts extracted from the incomplete objects in the input image. To train PACO, we create a large-scale dataset with 500k samples to enable self-supervised learning, avoiding tedious annotation of amodal masks and occluded regions. At inference, we devise a layer-wise deocclusion strategy to improve efficiency while maintaining deocclusion quality. Extensive experiments on COCOA and various real-world scenes demonstrate the superior capability of PACO for scene deocclusion, surpassing the state of the art by a large margin. Our method can also be extended to cross-domain scenes and novel categories that are not covered by the training set. Further, we demonstrate the applicability of PACO to deocclusion in single-view 3D scene reconstruction and object recomposition.
... Amodal segmentation [37] is a challenging task that involves predicting both the visible and occluded parts of objects. Numerous approaches [37,7,32,35,29,14,33,28] have sought to recover the occluded parts of objects using additional information or by learning the objects' shape priors, often achieving excellent performance. For example, SaVos [34] employs optical flow, and A3D [17] learns 3D shape priors to enhance performance. ...
Preprint
Amodal object completion is a complex task that involves predicting the invisible parts of an object based on visible segments and background information. Learning shape priors is crucial for effective amodal completion, but traditional methods often rely on two-stage processes or additional information, leading to inefficiencies and potential error accumulation. To address these shortcomings, we introduce a novel framework named the Hyper-Transformer Amodal Network (H-TAN). This framework utilizes a hyper transformer equipped with a dynamic convolution head to directly learn shape priors and accurately predict amodal masks. Specifically, H-TAN uses a dual-branch structure to extract multi-scale features from both images and masks. The multi-scale features from the image branch guide the hyper transformer in learning shape priors and in generating the weights for dynamic convolution tailored to each instance. The dynamic convolution head then uses the features from the mask branch to predict precise amodal masks. We extensively evaluate our model on three benchmark datasets: KINS, COCOA-cls, and D2SA, where H-TAN demonstrated superior performance compared to existing methods. Additional experiments validate the effectiveness and stability of the novel hyper transformer in our framework.
... In [12], Sun et al. used CompNet [13], a Bayesian generative model with neural network features, to replace the fully-connected classifier in a CNN. The model applies a probability distribution to describe the image's features, including object classes and amodal segmentation, to accurately classify images of partly occluded objects. ...
... For example, a model trained only on unoccluded objects can classify objects when they are occluded (out-of-distribution) (Zhu, Tang, Park, Park, & Yuille, 2019). Similarly, a model trained to classify objects robustly to occlusion can, as a side effect, estimate the amodal boundary of an object (out-of-task transfer) (Sun, Kortylewski, & Yuille, 2020). Such representations appear to be closely linked to the human capacity to learn from a few examples (sometimes zero-shot or one-shot learning) and to generalize learning to novel situations (Bapst et al., 2019). ...
Article
Advances in artificial intelligence have raised a basic question about human intelligence: Is human reasoning best emulated by applying task‐specific knowledge acquired from a wealth of prior experience, or is it based on the domain‐general manipulation and comparison of mental representations? We address this question for the case of visual analogical reasoning. Using realistic images of familiar three‐dimensional objects (cars and their parts), we systematically manipulated viewpoints, part relations, and entity properties in visual analogy problems. We compared human performance to that of two recent deep learning models (Siamese Network and Relation Network) that were directly trained to solve these problems and to apply their task‐specific knowledge to analogical reasoning. We also developed a new model using part‐based comparison (PCM) by applying a domain‐general mapping procedure to learned representations of cars and their component parts. Across four‐term analogies (Experiment 1) and open‐ended analogies (Experiment 2), the domain‐general PCM model, but not the task‐specific deep learning models, generated performance similar in key aspects to that of human reasoners. These findings provide evidence that human‐like analogical reasoning is unlikely to be achieved by applying deep learning with big data to a specific type of analogy problem. Rather, humans do (and machines might) achieve analogical reasoning by learning representations that encode structural information useful for multiple tasks, coupled with efficient computation of relational similarity.
... VRSP [33] was the first to explicitly design a shape prior module to refine the amodal mask. There are also some approaches [39,9,38,32,36,28,14,34,27,18] that focus on modeling shape priors with shape statistics, making it challenging to extend these models to open-world applications where object category distributions are long-tailed and hard to pre-define. SaVos [35] leverages spatiotemporal consistency and dense object motion to alleviate this problem. ...
Preprint
Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However, this latent space lacks detailed information about the object, which makes it difficult to provide a precise segmentation directly. To address this issue, we propose a convolution refine module to inject fine-grained information and provide a more precise amodal object segmentation based on visual features and coarse-predicted segmentation. To help the studies of amodal object segmentation, we create a synthetic amodal dataset, named MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We extensively evaluate our model on two benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A. Project page at: http://jianxgao.github.io/C2F-Seg.
... The proposed model shows promise in accurately detecting masks for shapes and preserving their identities even when they are barely visible, but faces challenges in distinguishing heavily overlapping shapes and requires improvements in handling complete occlusion and capturing more complex features in embeddings. Y. Sun et al. presented a Bayesian approach [30] for amodal instance segmentation, promising a more data-efficient and robust computer vision model that addresses out-of-task and out-of-distribution challenges. They claim to outperform existing weakly and fully supervised techniques in scenarios with high occlusion levels. ...
Article
Full-text available
Amodal segmentation is a critical task in the field of computer vision as it involves accurately estimating object boundaries that extend beyond occlusion. This paper introduces a network named after the Amodal Segmentation Head, ASH-Net, a novel architecture specifically designed for amodal segmentation. ASH-Net is comprised of a ResNet-50 backbone, a Feature Pyramid Network middle layer, and an Amodal Segmentation Head. The evaluation encompasses three diverse datasets, namely COCOA-cls, KINS, and D2SA, providing a comprehensive analysis of ASH-Net’s capabilities. The results obtained demonstrate the superiority of ASH-Net in accurately estimating object boundaries beyond occlusion across multiple datasets. Specifically, ASH-Net achieves an Average Precision of 62.15% on the COCOA-cls dataset, 72.58% on the KINS dataset, and an impressive 91.4% on the D2SA dataset. Through extensive evaluation using average precision and average recall metrics, ASH-Net exhibits exceptional performance compared to state-of-the-art models. These findings highlight the remarkable performance of ASH-Net in overcoming occlusion challenges and accurately delineating object boundaries. This research identifies optimal training parameters, such as coefficient dimensions and aspect ratios, that significantly enhance segmentation performance while maintaining computational efficiency. The proposed ASH-Net architecture and its performance pave the way for improved object recognition, enhanced scene understanding, and the development of practical applications in various domains.
... Bayesian deep learning [47] has recently gained increasing attention in the machine learning community as a means for uncertainty quantification (e.g., [21,48]) and model selection (e.g., [3,18]), comprising, among others, advancements in prior specification (e.g., [10,[33][34][35]) and efficient approximate inference schemes (e.g., [6,24,29]). Even though some of these advancements have recently found application in computer vision (e.g., [40,44,46]), they have not found adoption for decision-making in DNNs. ...
Preprint
Dynamic neural networks are a recent technique that promises a remedy for the increasing size of modern deep learning models by dynamically adapting their computational cost to the difficulty of the input samples. In this way, the model can adjust to a limited computational budget. However, the poor quality of uncertainty estimates in deep learning models makes it difficult to distinguish between hard and easy samples. To address this challenge, we present a computationally efficient approach for post-hoc uncertainty quantification in dynamic neural networks. We show that adequately quantifying and accounting for both aleatoric and epistemic uncertainty through a probabilistic treatment of the last layers improves the predictive performance and aids decision-making when determining the computational budget. In the experiments, we show improvements on CIFAR-100 and ImageNet in terms of accuracy, capturing uncertainty, and calibration error.
... Unfortunately, SAIL-VOS has frequent camera view switches, making it a less-than-ideal testbed for applying video tracking or motion cues. Several efforts have been made towards amodal segmentation on these datasets [46,24,7,45,41,44,33,17,43,30,20]. Generally speaking, most of the methods operate at the image level and model type priors with shape statistics; as such, it is challenging to extend them to open-world applications where object category distributions are long-tailed. ...
Preprint
Amodal perception requires inferring the full shape of an object that is partially occluded. This task is particularly challenging on two levels: (1) it requires more information than what is contained in the instant retina or imaging sensor, and (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To this end, this paper develops a new framework of Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information of video temporal sequences to infer the amodal masks of objects. The key intuition is that the occluded part of an object can be explained away if that part is visible in other frames, possibly deformed, as long as the deformation can be reasonably learned. Accordingly, we derive a novel self-supervised learning paradigm that efficiently utilizes the visible object parts as supervision to guide training on videos. In addition to learning a type prior for completing masks of known types, SaVos also learns a spatiotemporal prior, which is also useful for the amodal task and can generalize to unseen types. The proposed framework achieves state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real-world benchmark KINS-Video-Car. Further, it lends itself well to transfer to novel distributions using test-time adaptation, outperforming existing models even after the transfer to a new distribution.
... In the inference phase, Mask R-CNN trained on the generated amodal ground truth is expected to yield the correct AIS. Another approach in WAIS uses object bounding boxes as inputs [35,36], where shape priors and probabilistic models are applied to classify visible, invisible, and background parts within the bounding box. Our AISFormer belongs to the first group of fully supervised learning, in which visible, amodal, and occluder masks are represented by queries in a unified manner. ...
Preprint
Full-text available
Amodal Instance Segmentation (AIS) aims to segment the region of both visible and possible occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model high-level features coherence due to the limited receptive field. The most recent transformer-based models show impressive performance on vision tasks, even better than Convolution Neural Networks (CNN). In this work, we present AISFormer, an AIS framework, with a Transformer-based mask head. AISFormer explicitly models the complex coherence between occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI and learn both short-range and long-range visual features. (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings by a transformer decoder (iii) invisible mask embedding: model the coherence between the amodal and visible masks, and (iv) mask predicting: estimate output masks including occluder, visible, amodal and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks i.e. KINS, D2SA, and COCOA-cls to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
... The corrected segmentations are then passed back into the network to self-correct and improve the prediction results. Also inspired by CompositionalNet, a Bayesian generative model of the neural network features is used to replace the fully-connected classifier in the CNN to infer the amodal segmentation in [69]. The Bayesian model then uses the probability distribution to explain the features of the image, including the object classes and amodal segmentation. ...
Preprint
Existing computer vision systems can compete with humans in understanding the visible parts of objects, but still fall far short of humans when it comes to depicting the invisible parts of partially occluded objects. Image amodal completion aims to equip computers with human-like amodal completion functions to understand an intact object despite it being partially occluded. The main purpose of this survey is to provide an intuitive understanding of the research hotspots, key technologies and future trends in the field of image amodal completion. Firstly, we present a comprehensive review of the latest literature in this emerging field, exploring three key tasks in image amodal completion, including amodal shape completion, amodal appearance completion, and order perception. Then we examine popular datasets related to image amodal completion along with their common data collection methods and evaluation metrics. Finally, we discuss real-world applications and future research directions for image amodal completion, facilitating the reader's understanding of the challenges of existing technologies and upcoming research trends.
Article
This study explores cost-effective, real-time strategies for bin picking in industrial quality control. An anomaly detection solution was developed for a screw production plant, utilizing machine vision and AI to identify overlapping screws as anomalies. Two improvements are proposed to a basic solution initially relying on a laser profiler for depth images. The first improvement applies a Convolutional Neural Network (CNN) to the laser profiler's output, and the second replaces the laser profiler with a camera that captures color images, applying a CNN to its output. The first improvement was tested with real laser profiler data using YOLOv8 and Mask R-CNN segmentation models. After achieving comparable results on the real dataset, the second improvement was tested on multiple synthetic datasets, simulating different scenarios, including setups with mixed screws. Results demonstrated that model performance on color images, represented in the RGB color space (red, green, and blue), was comparable to depth images, validating color cameras as an appropriate alternative. Since color cameras are cheaper and capture images faster, they are well-suited for high-speed quality control systems, offering significant cost and performance advantages. Code is available at: https://github.com/enmarchi/overlapping_screws_geneneration_code .
Article
Sow crushing significantly increases the mortality rate of preweaning piglets, leading to substantial economic losses for the pig industry. However, automated detection of piglet-crushing events has been rarely reported in the literature. In this study, we proposed a three-stage computer vision-based detection method for piglet-crushing events. In the first stage, an anchor-based object detection and segmentation framework is adopted to segment piglet instances, classify sow postures, and detect sow's keypoints. In the second stage, by leveraging the sow mask and the two keypoints, a sow back-related angle is determined to judge if the sow back is visible or invisible. If the sow back is invisible, the fatal zone is identified by the self-adaptive fatal zone localization method (SFZLM). In the third stage, a weighted mask-based IoU is proposed to track the piglets within the fatal area during the previous frames and infer whether invisible piglets are being crushed under the sow. If the number of piglets entering the fatal zone exceeds the number of visibly crushed piglets, it is inferred that some non-visible piglets are also being crushed. Out of 238 video clips, 166 are used for training the model, while 72 short clips and four long videos are utilized for validating the proposed method. The classified results demonstrate a recall of 0.919, a precision of 0.895, and an F1 score of 0.906. The piglet-crushing events are detected accurately in four long videos with a TIoU of 0.968, a recall of crushing events of 100%, and a recall of crushed piglets of 90%. The favorable performance of detecting piglet crushing events suggests that the proposed method is feasible in automatically detecting and analysing the crushing events caused by lateral lying sows.
Preprint
Full-text available
Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Blending Panoramic Amodal Seamless Segmentation, i.e., BlendPASS. Besides, we propose the first solution UnmaskFormer, aiming at unmasking the narrow FoV, occlusions, and domain gaps all at once. Specifically, UnmaskFormer includes the crucial designs of Unmasking Attention (UA) and Amodal-oriented Mix (AoMix). Our method achieves state-of-the-art performance on the BlendPASS dataset, reaching a remarkable mAPQ of 26.58% and mIoU of 43.66%. On public panoramic semantic segmentation datasets, i.e., SynPASS and DensePASS, our method outperforms previous methods and obtains 45.34% and 48.08% in mIoU, respectively. The fresh BlendPASS dataset and our source code will be made publicly available at https://github.com/yihong-97/OASS.
Article
Amodal scene analysis entails interpreting the occlusion relationship among scene elements and inferring the possible shapes of the invisible parts. Existing methods typically frame this task as an extended instance segmentation or a pair-wise object de-occlusion problem. In this work, we propose a new framework, which comprises a Holistic Occlusion Relation Inference (HORI) module followed by an instance-level Generative Mask Completion (GMC) module. Unlike previous approaches, which rely on mask completion results for occlusion reasoning, our HORI module directly predicts an occlusion relation matrix in a single pass. This approach is much more efficient than the pair-wise de-occlusion process and it naturally handles mutual occlusion, a common but often neglected situation. Moreover, we formulate the mask completion task as a generative process and use a diffusion-based GMC module for instance-level mask completion. This improves mask completion quality and provides multiple plausible solutions. We further introduce a large-scale amodal segmentation dataset with high-quality human annotations, including mutual occlusions. Experiments on our dataset and two public benchmarks demonstrate the advantages of our method. Code publicly available at https://github.com/zbwxp/Amodal-AAAI.
Article
Precise segmentation is an important computer vision task for automated animal monitoring, with potential application in the quantitative analysis of piglet crushing. However, automated instance segmentation of piglets under crushing events has rarely been reported in the computer vision community. In this study, a two-stage amodal instance segmentation method is proposed for piglets under crushing events. In the first stage, a Mask R-CNN framework is used for the initial instance segmentation of piglets. From this segmentation, the masks are used to obtain minimum bounding boxes, which are then extended using the proposed self-adaptive bounding box extension approach. Extended bounding boxes that overlap with the sow are mapped onto the feature maps F output by ResNeXt-101; otherwise they are not processed further. The sub-feature maps cropped by the extended bounding boxes are encoded using von Mises-Fisher (vMF) distributions. The vMF cluster centers {µk} resemble feature activation patterns that frequently occur in the training dataset. The encoded features are then input into the Bayesian lattice classifier (BLC), which is trained using a foreground prior and a visibility prior. The lattice states (foreground or not, and visible or not) estimated by the BLC are mapped back onto the original RGB images. The resulting masks are then refined using morphological image processing to finalize the amodal instance segmentation of piglets under crushing events. Extracted from 104 video clips of Danish Genetics® piglets and sows, the training dataset consisted of 836 images in which piglets were not occluded, and the test dataset consisted of 412 images in which crushing events occurred. The proposed method achieved IoUs of 0.930, 0.911, 0.888, 0.853, and 0.834 over five occlusion levels, respectively, with an overall IoU of 0.893, an mIoU of 0.883, and a wIoU of 0.902. The method outperformed other state-of-the-art methods by a large margin. This favorable segmentation performance demonstrates the strengths of our amodal instance segmentation method and taps the potential of automated, quantitative analysis of piglet crushing events.
Chapter
Amodal perception is the ability to hallucinate full shapes of (partially) occluded objects. While natural to humans, learning-based perception methods often only focus on the visible parts of scenes. This constraint is critical for safe automated driving since detection capabilities of perception methods are limited when faced with (partial) occlusions. Moreover, corner cases can emerge from occlusions while the perception method is oblivious. In this work, we investigate the possibilities of joint prediction of amodal and visible semantic segmentation masks. More precisely, we investigate whether both perception tasks benefit from a joint training approach. We report our findings on both the Cityscapes and the Amodal Cityscapes dataset. The proposed joint training outperforms the separately trained networks in terms of mean intersection over union in amodal areas of the masks by 6.84% absolute, while even slightly improving the visible segmentation performance.
Article
Full-text available
Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets)—an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, as well as to further decompose object representations in terms of individual parts and the objects’ pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments in terms of image classification and object detection on images of artificially occluded objects from the PASCAL3D+ and ImageNet dataset, and real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets made from several popular DCNN backbones (VGG-16, ResNet50, ResNext) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human interpretable predictions as their individual components can be understood as detecting parts and estimating an objects’ viewpoint.
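The generative scoring idea, explaining each pixel's feature either by a vMF part model or by an occluder model and scoring the object on the non-occluded pixels, can be caricatured as below. The real CompositionalNets additionally use mixtures over viewpoints, spatial part priors, and end-to-end training, so this is only an illustrative sketch:

```python
import numpy as np

def compositional_score(feat, part_means, occluder_mean, kappa=20.0):
    """Occlusion-robust object score from per-pixel vMF likelihoods.

    feat: (H, W, d) unit-norm feature vectors; part_means: (P, d) unit-norm
    part templates; occluder_mean: (d,) unit-norm occluder template.
    """
    H, W, d = feat.shape
    f = feat.reshape(-1, d)
    part_ll = (kappa * f @ part_means.T).max(axis=1)  # best part per pixel
    occ_ll = kappa * f @ occluder_mean                # occluder likelihood
    visible = part_ll > occ_ll                        # estimated occlusion map
    # Score the object only on pixels better explained by its parts.
    return float(part_ll[visible].sum()), visible.reshape(H, W)
```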
Conference Paper
Full-text available
This paper presents a weakly supervised instance segmentation method that consumes training data with tight bounding box annotations. The major difficulty lies in the uncertain figure-ground separation within each bounding box since there is no supervisory signal about it. We address the difficulty by formulating the problem as a multiple instance learning (MIL) task, and generate positive and negative bags based on the sweeping lines of each bounding box. The proposed deep model integrates MIL into a fully supervised instance segmentation network, and can be derived by the objective consisting of two terms, i.e., the unary term and the pairwise term. The former estimates the foreground and background areas of each bounding box while the latter maintains the unity of the estimated object masks. The experimental results show that our method performs favorably against existing weakly supervised methods and even surpasses some fully supervised methods for instance segmentation on the PASCAL VOC dataset. The code is available at https://github.com/chengchunhsu/WSIS_BBTP.
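A schematic of how bags might be built from the sweeping lines of a tight box: each horizontal or vertical line crossing the box must contain at least one object pixel (positive bag), while lines entirely outside it contain only background for that instance (negative bag). The tuple encoding is purely illustrative; the exact bag construction follows the paper:

```python
def sweeping_line_bags(box, image_shape):
    """Build MIL bags from a tight bounding box (x0, y0, x1, y1), inclusive."""
    x0, y0, x1, y1 = box
    H, W = image_shape
    # Positive bags: lines crossing the tight box hit the object somewhere.
    pos = ([("row", y, x0, x1) for y in range(y0, y1 + 1)] +
           [("col", x, y0, y1) for x in range(x0, x1 + 1)])
    # Negative bags: lines fully outside the box miss this instance entirely.
    neg = ([("row", y, 0, W - 1) for y in range(H) if y < y0 or y > y1] +
           [("col", x, 0, H - 1) for x in range(W) if x < x0 or x > x1])
    return pos, neg
```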
Article
Full-text available
Amodal completion is the representation of those parts of the perceived object that we get no sensory stimulation from. In the case of vision, it is the representation of occluded parts of objects we see: When we see a cat behind a picket fence, our perceptual system represents those parts of the cat that are occluded by the picket fence. The aim of this piece is to argue that amodal completion plays a constitutive role in our everyday perception and trace the theoretical consequences of this claim.
Article
Full-text available
Faces in natural images are often occluded by a variety of objects. We propose a fully automated, probabilistic and occlusion-aware 3D morphable face model adaptation framework following an analysis-by-synthesis setup. The key idea is to segment the image into regions explained by separate models. Our framework includes a 3D morphable face model, a prototype-based beard model and a simple model for occlusions and background regions. The segmentation and all the model parameters have to be inferred from the single target image. Face model adaptation and segmentation are solved jointly using an expectation–maximization-like procedure. During the E-step, we update the segmentation and in the M-step the face model parameters are updated. For face model adaptation we apply a stochastic sampling strategy based on the Metropolis–Hastings algorithm. For segmentation, we apply loopy belief propagation for inference in a Markov random field. Illumination estimation is critical for occlusion handling. Our combined segmentation and model adaptation needs a proper initialization of the illumination parameters. We propose a RANSAC-based robust illumination estimation technique. By applying this method to a large face image database we obtain a first empirical distribution of real-world illumination conditions. The obtained empirical distribution is made publicly available and can be used as prior in probabilistic frameworks, for regularization or to synthesize data for deep learning methods.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Full-text available
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. The adaptation, in essence, allows us to find needles in haystacks in the form of very predictive yet rarely observed features. Our paradigm stems from recent advances in online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies the task of setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We corroborate our theoretical results with experiments on a text classification task, showing substantial improvements for classification with sparse datasets.
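The best-known member of this family is the diagonal AdaGrad update, where each coordinate's step size shrinks with its accumulated squared gradients, so rarely active (sparse) features keep large steps; a minimal sketch of one update:

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad update on parameter vector w (in place)."""
    accum += grad ** 2                        # running sum of squared grads
    w -= lr * grad / (np.sqrt(accum) + eps)   # per-coordinate step size
    return w, accum
```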
Article
Full-text available
Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
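The spherical k-means special case mentioned at the end follows from hard-assignment EM on a shared-concentration vMF mixture; a minimal sketch on unit-normalized rows:

```python
import numpy as np

def spherical_kmeans(X, k, iters=50, seed=0):
    """k-means under cosine similarity on the unit hypersphere.

    X: (n, d) data, normalized row-wise below; returns unit-norm centers
    and hard assignments (vMF-mixture EM with a shared concentration).
    """
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        assign = np.argmax(X @ centers.T, axis=1)   # max cosine similarity
        for j in range(k):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)  # renormalized mean
    return centers, assign
```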
Article
Full-text available
We consider data that are images containing views of multiple objects. Our task is to learn about each of the objects present in the images. This task can be approached as a factorial learning problem, where each image must be explained by instantiating a model for each of the objects present with the correct instantiation parameters. A major problem with learning a factorial model is that as the number of objects increases, there is a combinatorial explosion of the number of configurations that need to be considered. We develop a method to extract object models sequentially from the data by making use of a robust statistical method, thus avoiding the combinatorial explosion, and present results showing successful extraction of objects from real images.
Article
Almost all existing amodal segmentation methods infer occluded regions using features of the whole image. This is contrary to human amodal perception, where humans use the visible part and shape prior knowledge of the target to infer the occluded region. To mimic this behavior and resolve the ambiguity in learning, we propose a framework that first estimates a coarse visible mask and a coarse amodal mask. Then, based on the coarse prediction, our model infers the amodal mask by concentrating on the visible region and utilizing the shape prior in the memory. In this way, features corresponding to background and occlusion can be suppressed for amodal mask estimation. Consequently, the amodal mask is not affected by what the occluder is, given the same visible regions. The leverage of shape priors makes amodal mask estimation more robust and reasonable. Our proposed model is evaluated on three datasets. Experiments show that it outperforms existing state-of-the-art methods. The visualization of shape priors indicates that the category-specific features in the codebook have a certain interpretability. The code is available at https://github.com/YutingXiao/Amodal-Segmentation-Based-on-Visible-Region-Segmentation-and-Shape-Prior.
Chapter
Recent approaches for weakly supervised instance segmentation depend on two components: (i) a pseudo label generation model which provides instances that are consistent with a given annotation; and (ii) an instance segmentation model, which is trained in a supervised manner using the pseudo labels as ground-truth. Unlike previous approaches, we explicitly model the uncertainty in the pseudo label generation process using a conditional distribution. The samples drawn from our conditional distribution provide accurate pseudo labels due to the use of semantic class aware unary terms, boundary aware pairwise smoothness terms, and annotation aware higher order terms. Furthermore, we represent the instance segmentation model as an annotation agnostic prediction distribution. In contrast to previous methods, our representation allows us to define a joint probabilistic learning objective that minimizes the dissimilarity between the two distributions. Our approach achieves state of the art results on the PASCAL VOC 2012 data set, outperforming the best baseline by $4.2\%\ \text{mAP}^r_{0.5}$ and $4.8\%\ \text{mAP}^r_{0.75}$.
Chapter
Scene understanding tasks such as the prediction of object pose, shape, appearance and illumination are hampered by the occlusions often found in images. We propose a vision-as-inverse-graphics approach to handle these occlusions by making use of a graphics renderer in combination with a robust generative model (GM). Since searching over scene factors to obtain the best match for an image is very inefficient, we make use of a recognition model (RM) trained on synthetic data to initialize the search. This paper addresses two issues: (i) We study how the inferences are affected by the degree of occlusion of the foreground object, and show that a robust GM which includes an outlier model to account for occlusions works significantly better than a non-robust model. (ii) We characterize the performance of the RM and the gains that can be made by refining the search using the GM, using a new dataset that includes background clutter and occlusions. We find that pose and shape are predicted very well by the RM, but appearance and especially illumination less so. However, accuracy on these latter two factors can be clearly improved with the generative model.
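The benefit of the robust GM can be made concrete with a per-pixel outlier mixture: occluded pixels are explained by a uniform outlier component instead of penalizing the current scene hypothesis. Below is a minimal sketch under assumed Gaussian inlier noise on intensities in [0, 1]; the mixing weight and noise scale are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def robust_log_likelihood(rendered, observed, sigma=0.05, pi_out=0.2):
    """Per-pixel robust likelihood:
    (1 - pi_out) * N(obs; rendered, sigma^2) + pi_out * Uniform(0, 1).
    Occluded pixels fall back on the uniform outlier term instead of
    dragging down the score of an otherwise good scene hypothesis."""
    inlier = (1.0 - pi_out) * np.exp(-0.5 * ((observed - rendered) / sigma) ** 2) \
             / (np.sqrt(2 * np.pi) * sigma)
    outlier = pi_out * 1.0  # uniform density on [0, 1]
    return np.log(inlier + outlier).sum()
```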
Article
Weakly supervised instance segmentation with image-level labels, instead of expensive pixel-level masks, remains largely unexplored. In this paper, we tackle this challenging problem by exploiting class peak responses to enable a classification network for instance mask extraction. With image-label supervision only, CNN classifiers in a fully convolutional manner can produce class response maps, which specify classification confidence at each image location. We observe that local maxima, i.e., peaks, in a class response map typically correspond to strong visual cues residing inside each instance. Motivated by this, we first design a process to stimulate peaks to emerge from a class response map. The emerged peaks are then back-propagated and effectively mapped to highly informative regions of each object instance, such as instance boundaries. We refer to the maps generated from class peak responses as Peak Response Maps (PRMs). PRMs provide a fine-detailed instance-level representation, which allows instance masks to be extracted even with off-the-shelf methods. To the best of our knowledge, we are the first to report results for the challenging image-level supervised instance segmentation task. Extensive experiments show that our method also boosts weakly supervised pointwise localization and semantic segmentation performance, reporting state-of-the-art results on popular benchmarks including PASCAL VOC 2012 and MS COCO.
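A peak in a class response map is simply a location that equals the maximum of its local neighborhood and clears a confidence threshold. The sketch below illustrates only this extraction step with a maximum filter; the window size and threshold are illustrative assumptions, and the paper's full method additionally stimulates and back-propagates the peaks.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(response_map, window=3, threshold=0.5):
    """Return (row, col) coordinates where the class response map
    attains a local maximum within a window x window neighborhood
    and exceeds a confidence threshold."""
    local_max = maximum_filter(response_map, size=window)
    peaks = (response_map == local_max) & (response_map > threshold)
    return np.argwhere(peaks)
```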
Conference Paper
We consider the problem of amodal instance segmentation, the objective of which is to predict the region encompassing both visible and occluded parts of each object. Thus far, the lack of publicly available amodal segmentation annotations has stymied the development of amodal segmentation methods. In this paper, we sidestep this issue by relying solely on standard modal instance segmentation annotations to train our model. The result is a new method for amodal instance segmentation, which represents the first such method to the best of our knowledge. We demonstrate the proposed method’s effectiveness both qualitatively and quantitatively.
Conference Paper
3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small number of images per category or are captured in controlled environments. In this paper, we contribute the PASCAL3D+ dataset, a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines by the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d.
Article
Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old, along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
In this paper we present a computationally efficient framework for part-based modeling and recognition of objects. Our work is motivated by the pictorial structure models introduced by Fischler and Elschlager. The basic idea is to represent an object by a collection of parts arranged in a deformable configuration. The appearance of each part is modeled separately, and the deformable configuration is represented by spring-like connections between pairs of parts. These models allow for qualitative descriptions of visual appearance, and are suitable for generic recognition problems. We address the problem of using pictorial structure models to find instances of an object in an image as well as the problem of learning an object model from training examples, presenting efficient algorithms in both cases. We demonstrate the techniques by learning models that represent faces and human bodies and using the resulting models to locate the corresponding objects in novel images.
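The core scoring rule of a star-structured pictorial model can be written as appearance match minus spring-like deformation costs between the root and each part. Below is a minimal sketch of that scoring rule; all array shapes and names are assumed for illustration, and the paper's efficient matching and model-learning algorithms are omitted.

```python
import numpy as np

def star_model_score(appearance, root_pos, part_offsets, part_positions, spring_k):
    """Score one configuration of a star-structured pictorial model.

    appearance     : per-part appearance scores at the chosen positions.
    root_pos       : (2,) root location.
    part_offsets   : (P, 2) ideal displacement of each part from the root.
    part_positions : (P, 2) hypothesized part locations.
    spring_k       : (P,) spring stiffness of each root-part connection.

    Total score = appearance match minus quadratic "spring" deformation
    cost for each part's deviation from its ideal anchor point.
    """
    anchors = root_pos + part_offsets
    deform = spring_k * np.sum((part_positions - anchors) ** 2, axis=1)
    return float(np.sum(appearance) - np.sum(deform))
```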
Article
Summary. A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations; applications to grouped, censored or truncated data; finite mixture models; variance component estimation; hyperparameter estimation; iteratively reweighted least squares; and factor analysis.
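For concreteness, here is the EM recipe instantiated for the simplest finite mixture case, a two-component 1-D Gaussian mixture; the initialization and iteration count are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture: alternate the
    E-step (posterior responsibilities) and the M-step (weighted
    maximum-likelihood updates); the data log-likelihood is
    non-decreasing at every iteration."""
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = p(component j | x_i).
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, variances, and mixing proportions.
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return mu, var, w
```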
Conference Paper
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
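The diagonal form of the adaptive update can be sketched in a few lines: per-coordinate step sizes shrink with the accumulated squared gradients, which is what lets rarely-seen but predictive features keep large effective learning rates. This is a minimal sketch of that special case only; the paper's framework covers general proximal functions and composite objectives.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad update. `accum` holds the running sum of
    squared gradients per coordinate; dividing by its square root
    gives each coordinate its own adaptive step size."""
    accum += grad ** 2
    x -= lr * grad / (np.sqrt(accum) + eps)
    return x, accum
```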
Conference Paper
We describe a general method for building cascade classifiers from part-based deformable models such as pictorial structures. We focus primarily on the case of star-structured models and show how a simple algorithm based on partial hypothesis pruning can speed up object detection by more than one order of magnitude without sacrificing detection accuracy. In our algorithm, partial hypotheses are pruned with a sequence of thresholds. In analogy to probably approximately correct (PAC) learning, we introduce the notion of probably approximately admissible (PAA) thresholds. Such thresholds provide theoretical guarantees on the performance of the cascade method and can be computed from a small sample of positive examples. Finally, we outline a cascade detection algorithm for a general class of models defined by a grammar formalism. This class includes not only tree-structured pictorial structures but also richer models that can represent each part recursively as a mixture of other parts.
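The pruning schedule itself is simple to state: evaluate parts in a fixed order and abandon a partial hypothesis as soon as its running score falls below the stage threshold. Below is a minimal sketch with the thresholds assumed to be given; the paper's contribution includes how to estimate PAA thresholds from a small sample of positive examples, which this omits.

```python
def cascade_score(part_scores, thresholds):
    """Evaluate parts in a fixed order, accumulating a partial score;
    prune the hypothesis (return None) as soon as the running total
    drops below the stage threshold t_i."""
    total = 0.0
    for score, t in zip(part_scores, thresholds):
        total += score
        if total < t:
            return None  # hypothesis pruned early
    return total
```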
Conference Paper
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
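The seeding step is short enough to sketch directly: the first center is chosen uniformly at random, and each subsequent center is drawn with probability proportional to its squared distance to the nearest center already chosen (D² sampling). Names and RNG handling below are illustrative.

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """k-means++ seeding: first center uniform at random, then each
    new center sampled with probability proportional to D(x)^2, the
    squared distance to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```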
Where are the masks: Instance segmentation with image-level supervision
  • Issam H. Laradji
  • David Vazquez
  • Mark Schmidt
Nemo: Neural mesh models of contrastive features for robust 3d pose estimation
  • Angtian Wang
  • Adam Kortylewski
  • Alan Yuille
Variational amodal object completion
  • Huan Ling
  • David Acuna
  • Karsten Kreis
  • Seung Wook Kim
  • Sanja Fidler
Robustness of object recognition under extreme occlusion in humans and computational models
  • Hongru Zhu
  • Peng Tang
  • Jeongho Park
  • Soojin Park
  • Alan Yuille
Self-supervised scene de-occlusion
  • Xiaohang Zhan
  • Xingang Pan
  • Bo Dai
  • Ziwei Liu
  • Dahua Lin
  • Chen Change Loy