Article

Abstract

While supervised object detection and segmentation methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on. To address this when annotating data is prohibitively expensive, we introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera. At the heart of our approach lies the observation that object segmentation and background reconstruction are linked tasks, and that, for structured scenes, background regions can be re-synthesized from their surroundings, whereas regions depicting the moving object cannot. We encode this intuition into a self-supervised loss function that we exploit to train a proposal-based segmentation network. To account for the discrete nature of the proposals, we develop a Monte Carlo-based training strategy that allows the algorithm to explore the large space of object proposals. We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.
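The linkage between background reconstruction and segmentation described above can be made concrete. Below is a minimal PyTorch sketch of the idea, not the authors' implementation: `inpainter` stands in for a generic hole-filling network, and the proposal space is reduced to a coarse grid of box masks for illustration.

```python
import torch

def box_masks(h, w, grid=4):
    """Discrete proposal space, reduced here to one box mask per grid cell."""
    masks = []
    for i in range(grid):
        for j in range(grid):
            m = torch.zeros(1, 1, h, w)
            m[..., i * h // grid:(i + 1) * h // grid,
                   j * w // grid:(j + 1) * w // grid] = 1.0
            masks.append(m)
    return torch.cat(masks)  # (K, 1, h, w)

def inpainting_error(image, mask, inpainter):
    """Error of re-synthesizing the masked region from its surroundings.
    Background proposals should be easy to inpaint (low error); a region
    covering a moving object should not."""
    recon = inpainter(image * (1.0 - mask), mask)        # fill hole from context
    err = ((recon - image) ** 2).mean(dim=1, keepdim=True)
    return (err * mask).sum() / mask.sum().clamp(min=1.0)

def monte_carlo_step(image, logits, inpainter, n_samples=8):
    """REINFORCE-style update over the discrete proposals: sample a few
    masks and reward those whose interior is hard to reconstruct."""
    masks = box_masks(image.shape[-2], image.shape[-1])
    dist = torch.distributions.Categorical(logits=logits)  # (K,) proposal scores
    loss = 0.0
    for _ in range(n_samples):
        k = dist.sample()
        reward = inpainting_error(image, masks[k:k + 1], inpainter).detach()
        loss = loss - reward * dist.log_prob(k)
    return loss / n_samples
```

Sampling rather than enumerating proposals is what keeps this tractable when the proposal space is large, which is the role the abstract assigns to the Monte Carlo strategy.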

... Hence, our method relates to self-supervised approaches. We show that the proposed multi-view training improves single-image accuracy at inference time, which allows us to outperform state-of-the-art single-view [22,53,9,31,20] and multi-view [37] approaches. Our code is publicly available at https://github.com/isinsukatircioglu/mvc. ...
... Building upon [43], [9] trains an ensemble of networks, which comes at the cost of requiring significant amounts of additional data. In [20], an inpainting network is trained to identify regions that are hard to reconstruct from the surrounding image patches, and it encodes and decodes the content of such regions to learn the scene decomposition. [53] employs a similar inpainting network, but on flow fields obtained by [44], and aims to generate the mask of a moving object in the region where the inpainting network yields a poor reconstruction. ...
... Let us consider the network F of [20], which we use as the backbone of our approach. It takes an image I ∈ R^{W×H×3} as input and resynthesizes it. ...
Article
Full-text available
Katircioglu I, Rhodin H, Spörri J, Salzmann M, Fua P. Human Detection and Segmentation via Multi-view Consensus. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada, 2021, pp. 2855-2864. https://dx.doi.org/10.1109/ICCV48922.2021.00285
... It is also not necessary to employ models to generate entire scenes [15,20], which can be challenging to train. Our working principle exploits observations also made by [17][18][19]. They point out that the correct mask maximizes the inpainting error both for the background and the foreground. ...
... This property holds for objects in the foreground, as they occlude all other objects in the scene. This basic idea has already been exploited in prior work with relative success [13][14][15][16][17][18][19]. Nonetheless, here we introduce a novel formulation based on movability that yields a significant performance boost across several datasets for salient object detection. ...
Preprint
Full-text available
We introduce MOVE, a novel method to segment objects without any form of supervision. MOVE exploits the fact that foreground objects can be shifted locally relative to their initial position and still result in realistic (undistorted) new images. This property allows us to train a segmentation model on a dataset of images without annotation and to achieve state of the art (SotA) performance on several evaluation datasets for unsupervised salient object detection and segmentation. In unsupervised single object discovery, MOVE gives an average CorLoc improvement of 7.2% over the SotA, and in unsupervised class-agnostic object detection it gives a relative AP improvement of 53% on average. Our approach is built on top of self-supervised features (e.g. from DINO or MAE), an inpainting network (based on the Masked AutoEncoder) and adversarial training.
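The shift-and-composite operation at the core of this idea can be sketched in a few lines. This is an illustrative reconstruction, not the MOVE code; `inpainter` is a placeholder hole-filling network:

```python
import torch

def shift_composite(image, mask, dx, dy, inpainter):
    """Move the predicted foreground by (dx, dy) over an inpainted
    background. If the mask is correct, the composite looks realistic,
    which a discriminator can then be trained to verify."""
    background = inpainter(image * (1 - mask), mask)       # fill the hole
    fg = torch.roll(image * mask, shifts=(dy, dx), dims=(-2, -1))
    m = torch.roll(mask, shifts=(dy, dx), dims=(-2, -1))
    return fg + (1 - m) * background
```

An adversarial loss on such composites then pushes the segmentation network toward masks that cut out whole, movable objects.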
... Our method is related to the recent teacher-student formulation of pseudo labeling [47,62,52,59,25,64,28,63,57,43,22], which works under the assumption that the high-quality pseudo labels generated by the teacher can be used to supervise the student network when it receives the same input as the teacher, only under different data augmentations. Although this simple general framework has been widely used in image classification [47,62], object detection [59,28,64,51], and semantic segmentation [25], it can only work with high-quality pseudo labels. ...
Preprint
Most self-supervised 6D object pose estimation methods can only work with additional depth information or rely on the accurate annotation of 2D segmentation masks, limiting their application range. In this paper, we propose a 6D object pose estimation method that can be trained with pure RGB images without any auxiliary information. We first obtain a rough pose initialization from networks trained on synthetic images rendered from the target's 3D mesh. Then, we introduce a refinement strategy leveraging the geometry constraint in synthetic-to-real image pairs from multiple different views. We formulate this geometry constraint as pixel-level flow consistency between the training images with dynamically generated pseudo labels. We evaluate our method on three challenging datasets and demonstrate that it outperforms state-of-the-art self-supervised methods significantly, with neither 2D annotations nor additional depth images.
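A hedged sketch of what such a pixel-level flow-consistency term could look like; the names and shapes below are assumptions for illustration, not the paper's code:

```python
import torch

def flow_consistency_loss(flow_pred, flow_pseudo, valid):
    """Average endpoint error between the predicted flow and a dynamically
    generated pseudo-label, restricted to valid pixels.
    flow_pred, flow_pseudo: (B, 2, H, W); valid: (B, H, W) mask."""
    epe = torch.linalg.norm(flow_pred - flow_pseudo, dim=1)  # (B, H, W)
    return (epe * valid).sum() / valid.sum().clamp(min=1.0)
```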
... On the other hand, instance segmentation goes one step further and tries to distinguish between different occurrences of the same class (see Figure 2). This paper focuses on semantic segmentation (SS), which has gained much interest in recent years, with important applications in different areas such as medical imaging [3], autonomous driving [4], aerial scene analysis [5] or metallographic images [6], among others [7,8]. ...
Preprint
Semantic segmentation is one of the most challenging tasks in computer vision. However, in many applications, a frequent obstacle is the lack of labeled images, due to the high cost of pixel-level labeling. In this scenario, it makes sense to approach the problem from a semi-supervised point of view, where both labeled and unlabeled images are exploited. In recent years this line of research has gained much interest and many approaches have been published in this direction. Therefore, the main objective of this study is to provide an overview of the current state of the art in semi-supervised semantic segmentation, offering an updated taxonomy of all existing methods to date. This is complemented by experiments with a variety of models representing all the categories of the taxonomy on the most widely used benchmark datasets in the literature, and a final discussion on the results obtained, the challenges and the most promising lines of future research.
... It involves filling in missing parts or removing undesired objects from an image [21,22]. Specifically, image inpainting can remove selected objects in an image, such as text, dates, or other selected targets, and then recover the removed area using the information of the surrounding area [20,23]. Most image inpainting algorithms are based on the following techniques: copy-paste texture synthesis, coherence between neighboring pixels, and geometric partial differential equations (PDEs). ...
Article
Full-text available
Small target detection is a crucial and challenging task in infrared search and track systems. Background estimation-based methods are an effective and important approach for infrared small target detection. Affected by the target pixels, existing background estimation methods may reconstruct an inaccurate background. Based on the image inpainting technique, we propose a novel two-stage approach to predict more accurate backgrounds. In the first stage, inner and outer window-based image inpainting (IOWII) is used to obtain a rough background estimate. Then a mask of the candidate target region is automatically obtained by calculating and evaluating the difference between the raw image and the rough background. In the second stage, the final accurate background is predicted by mask-based image inpainting (MII). It recovers the removed candidate target area using the information of surrounding background pixels, preventing target pixels from participating in the background reconstruction. Finally, the target saliency map is obtained by subtracting the final estimated background from the original image, and a simple adaptive threshold is used to segment the target. Experimental results on real infrared images and sequences demonstrate that the proposed method outperforms other state-of-the-art methods. It is simple and effective, with strong robustness and good real-time performance.
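The two-stage pipeline reads naturally as a short procedure. The NumPy sketch below follows the description above; `rough_inpaint` and `mask_inpaint` are placeholders for the paper's IOWII and MII steps, and the mean-plus-k-sigma threshold is a common heuristic assumed here, not necessarily the paper's rule:

```python
import numpy as np

def detect_small_targets(img, rough_inpaint, mask_inpaint, k=3.0):
    """Two-stage background estimation for infrared small-target detection."""
    rough_bg = rough_inpaint(img)                       # stage 1: rough estimate
    diff = img - rough_bg
    cand_mask = diff > diff.mean() + k * diff.std()     # candidate target region
    final_bg = mask_inpaint(img, cand_mask)             # stage 2: target-free fill
    saliency = img - final_bg                           # target saliency map
    thr = saliency.mean() + k * saliency.std()          # simple adaptive threshold
    return saliency > thr
```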
... Bielski et al. [2] propose to reposition the generated foreground to disentangle it from the background. Katircioglu et al. [24,25] take as foreground the region that cannot be inpainted from its surroundings, a strategy previously applied to optical flow [63]; however, this requires either a similar background in all training examples or an optical flow estimator. Chen et al. [4] achieved unsupervised foreground segmentation by resampling the foreground appearance to disentangle foreground and background. ...
Preprint
Full-text available
Segmenting an image into its parts is a frequent preprocess for high-level vision tasks such as image editing. However, annotating masks for supervised training is expensive. Weakly-supervised and unsupervised methods exist, but they depend on the comparison of pairs of images, such as from multi-views, frames of videos, and image transformations of single images, which limits their applicability. To address this, we propose a GAN-based approach that generates images conditioned on latent masks, thereby alleviating the full or weak annotations required in previous approaches. We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner on latent keypoints that define the position of parts explicitly. Without requiring supervision of masks or points, this strategy increases robustness to changes in viewpoint and object position. It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
Article
Background and objective: Accurate segmentation of electron microscopy (EM) volumes of the brain is essential to characterize neuronal structures at a cell or organelle level. While supervised deep learning methods have led to major breakthroughs in that direction during the past years, they usually require large amounts of annotated data to be trained, and perform poorly on other data acquired under similar experimental and imaging conditions. This is a problem known as domain adaptation, since models that learned from a sample distribution (or source domain) struggle to maintain their performance on samples extracted from a different distribution or target domain. In this work, we address the complex case of deep learning based domain adaptation for mitochondria segmentation across EM datasets from different tissues and species. Methods: We present three unsupervised domain adaptation strategies to improve mitochondria segmentation in the target domain based on (1) state-of-the-art style transfer between images of both domains; (2) self-supervised learning to pre-train a model using unlabeled source and target images, and then fine-tune it only with the source labels; and (3) multi-task neural network architectures trained end-to-end with both labeled and unlabeled images. Additionally, to ensure good generalization in our models, we propose a new training stopping criterion based on morphological priors obtained exclusively in the source domain. The code and its documentation are publicly available at https://github.com/danifranco/EM_domain_adaptation. Results: We carried out all possible cross-dataset experiments using three publicly available EM datasets. We evaluated our proposed strategies and those of others based on the mitochondria semantic labels predicted on the target datasets. Conclusions: The methods introduced here outperform the baseline methods and compare favorably to the state of the art. In the absence of validation labels, monitoring our proposed morphology-based metric is an intuitive and effective way to stop the training process and select, on average, optimal models.
Chapter
Full-text available
We developed a real-time, high-quality semi-supervised video object segmentation algorithm. Its accuracy is on par with the most accurate, time-consuming online-learning model, while its speed is similar to the fastest template-matching method with sub-optimal accuracy. The core component of the model is a novel global context module that effectively summarizes and propagates information through the entire video. Compared to previous approaches that only use one frame or a few frames to guide the segmentation of the current frame, the global context module uses all past frames. Unlike the previous state-of-the-art space-time memory network that caches a memory at each spatio-temporal position, the global context module uses a fixed-size feature representation. Therefore, it uses constant memory regardless of the video length and costs substantially less memory and computation. With the novel module, our model achieves top performance on standard benchmarks at a real-time speed.
Conference Paper
Full-text available
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to further improve the state-of-the-art deep-learning-based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer frequently reappearing and salient foreground objects. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks demonstrate that COSNet outperforms the current alternatives by a large margin.
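Stripped of the architectural details, co-attention between two frames reduces to an affinity matrix with softmax normalization. The following is a generic sketch of such a layer, assuming feature maps of equal size; it illustrates the mechanism rather than COSNet's exact formulation:

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Affinity S = F1' W F2 between two frames' features; each location in
    frame 1 then aggregates frame-2 features weighted by softmax(S)."""
    def __init__(self, channels):
        super().__init__()
        self.W = nn.Parameter(torch.eye(channels))  # learnable bilinear weight

    def forward(self, f1, f2):
        b, c, h, w = f1.shape
        a = f1.flatten(2)                                    # (B, C, N)
        bb = f2.flatten(2)                                   # (B, C, N)
        S = torch.einsum("bcn,cd,bdm->bnm", a, self.W, bb)   # (B, N, N) affinity
        att = torch.einsum("bdm,bnm->bdn", bb, S.softmax(dim=2))
        return att.view(b, c, h, w)                          # f1 attending to f2
```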
Article
Full-text available
Unsupervised learning represents one of the most interesting challenges in computer vision today. The task has an immense practical value with many applications in artificial intelligence and emerging technologies, as large quantities of unlabeled images and videos can be collected at low cost. In this paper, we address the unsupervised learning problem in the context of segmenting the main foreground objects in single images. We propose an unsupervised learning system, which has two pathways, the teacher and the student, respectively. The system is designed to learn over several generations of teachers and students. At every generation the teacher performs unsupervised object discovery in videos or collections of images and an automatic selection module picks up good frame segmentations and passes them to the student pathway for training. At every generation multiple students are trained, with different deep network architectures to ensure a better diversity. The students at one iteration help in training a better selection module, forming together a more powerful teacher pathway at the next iteration. In experiments, we show that the improvement in the selection power, the training of multiple students and the increase in unlabeled data significantly improve segmentation accuracy from one generation to the next. Our method achieves top results on three current datasets for object discovery in video, unsupervised image segmentation and saliency detection. At test time, the proposed system is fast, being one to two orders of magnitude faster than published unsupervised methods. We also test the strength of our unsupervised features within a well known transfer learning setup and achieve competitive performance, proving that our unsupervised approach can be reliably used in a variety of computer vision tasks.
Conference Paper
Full-text available
This paper conducts a systematic study on the role of visual attention in the Unsupervised Video Object Segmentation (UVOS) task. By elaborately annotating three popular video segmentation datasets (DAVIS16, Youtube-Objects and SegTrackV2) with dynamic eye-tracking data in the UVOS setting, for the first time, we quantitatively verified the high consistency of visual attention behavior among human observers, and found strong correlation between human attention and explicit primary object judgements during dynamic, task-driven viewing. Such novel observations provide an in-depth insight into the underlying rationale behind UVOS. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major merits: 1) modular training without using expensive video segmentation annotations; instead, it uses more affordable dynamic fixation data to train the initial video attention module and existing fixation-segmentation paired static/image data to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically-inspired and assessable attention. Experiments on popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance in comparison with the state of the art.
Chapter
Full-text available
This paper proposes a fast video salient object detection model, based on a novel recurrent network architecture, named Pyramid Dilated Bidirectional ConvLSTM (PDB-ConvLSTM). A Pyramid Dilated Convolution (PDC) module is first designed for simultaneously extracting spatial features at multiple scales. These spatial features are then concatenated and fed into an extended Deeper Bidirectional ConvLSTM (DB-ConvLSTM) to learn spatiotemporal information. Forward and backward ConvLSTM units are placed in two layers and connected in a cascaded way, encouraging information flow between the bi-directional streams and leading to deeper feature extraction. We further augment DB-ConvLSTM with a PDC-like structure, by adopting several dilated DB-ConvLSTMs to extract multi-scale spatiotemporal information. Extensive experimental results show that our method outperforms previous video saliency models by a large margin, with a real-time speed of 20 fps on a single GPU. With unsupervised video object segmentation as an example application, the proposed model (with a CRF-based post-process) achieves state-of-the-art results on two popular benchmarks, well demonstrating its superior performance and high applicability.
Article
Full-text available
Rhodin H, Spörri J, Katircioglu I, Constantin V, Meyer F, Müller E, Salzmann M, Fua P. Learning Monocular 3D Human Pose Estimation from Multi-view Images. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8437-8446. https://dx.doi.org/10.1109/CVPR.2018.00880
Article
Full-text available
Reusable model design becomes desirable with the rapid expansion of computer vision and machine learning applications. In this paper, we focus on the reusability of pre-trained deep convolutional models. Specifically, different from treating pre-trained models as feature extractors, we reveal more treasures beneath convolutional layers, i.e., the convolutional activations could act as a detector for the common object in the image co-localization problem. We propose a simple yet effective method, termed Deep Descriptor Transforming (DDT), for evaluating the correlations of descriptors and then obtaining the category-consistent regions, which can accurately locate the common object in a set of unlabeled images, i.e., unsupervised object discovery. Empirical studies validate the effectiveness of the proposed DDT method. On benchmark image co-localization datasets, DDT consistently outperforms existing state-of-the-art methods by a large margin. Moreover, DDT also demonstrates good generalization ability for unseen categories and robustness for dealing with noisy data. Beyond those, DDT can be also employed for harvesting web images into valid external data sources for improving performance of both image recognition and object detection.
Article
Full-text available
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
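For reference, the estimator fits in a few lines of PyTorch (modern PyTorch also ships it as `torch.nn.functional.gumbel_softmax`). This is a standard textbook rendition, with the straight-through variant included:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax(logits, tau=1.0, hard=False):
    """Differentiable (relaxed) sample from a categorical distribution."""
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = F.softmax((logits + gumbel) / tau, dim=-1)       # soft sample
    if hard:  # straight-through: one-hot forward pass, soft gradient backward
        index = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, index, 1.0)
        y = (y_hard - y).detach() + y
    return y
```

Annealing the temperature `tau` toward zero makes the samples approach true one-hot categorical draws, which is the property the abstract highlights.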
Chapter
How to make a segmentation model efficiently adapt to a specific video as well as online target appearance variations is a fundamental issue in the field of video object segmentation. In this work, a graph memory network is developed to address the novel idea of “learning to update the segmentation model”. Specifically, we exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges. Further, learnable controllers are embedded to ease memory reading and writing, as well as maintain a fixed memory scale. The structured, external memory design enables our model to comprehensively mine and quickly store new knowledge, even with limited visual information, and the differentiable memory controllers slowly learn an abstract method for storing useful representations in the memory and how to later use these representations for prediction, via gradient descent. In addition, the proposed graph memory network yields a neat yet principled framework, which can generalize well to both one-shot and zero-shot video object segmentation tasks. Extensive experiments on four challenging benchmark datasets verify that our graph memory network is able to facilitate the adaptation of the segmentation network for case-by-case video object segmentation.
Chapter
In this paper, we introduce a novel network, called discriminative feature network (DFNet), to address the unsupervised video object segmentation task. To capture the inherent correlation among video frames, we learn discriminative features (D-features) from the input images that reveal feature distribution from a global perspective. The D-features are then used to establish correspondence with all features of test image under conditional random field (CRF) formulation, which is leveraged to enforce consistency between pixels. The experiments verify that DFNet outperforms state-of-the-art methods by a large margin with a mean IoU score of 83.4% and ranks first on the DAVIS-2016 leaderboard while using much fewer parameters and achieving much more efficient performance in the inference phase. We further evaluate DFNet on the FBMS dataset and the video saliency dataset ViSal, reaching a new state-of-the-art. To further demonstrate the generalizability of our framework, DFNet is also applied to the image object co-segmentation task. We perform experiments on a challenging dataset PASCAL-VOC and observe the superiority of DFNet. The thorough experiments verify that DFNet is able to capture and mine the underlying relations of images and discover the common foreground objects.
Chapter
Semi-supervised video object segmentation (VOS) is a task that involves predicting a target object in a video when the ground truth segmentation mask of the target object is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising solution for semi-supervised VOS. However, an important point is overlooked when applying STM to VOS. The solution (STM) is non-local, but the problem (VOS) is predominantly local. To solve the mismatch between STM and VOS, we propose a kernelized memory network (KMN). Before being trained on real videos, our KMN is pre-trained on static images, as in previous works. Unlike in previous works, we use the Hide-and-Seek strategy in pre-training to obtain the best possible results in handling occlusions and segment boundary extraction. The proposed KMN surpasses the state-of-the-art on standard benchmarks by a significant margin (+5% on DAVIS 2017 test-dev set). In addition, the runtime of KMN is 0.12 s per frame on the DAVIS 2016 validation set, and the KMN rarely requires extra computation, when compared with STM.
Chapter
We present a method for simultaneously learning, in an unsupervised manner, (i) a conditional image generator, (ii) foreground extraction and segmentation, (iii) clustering into a two-level class hierarchy, and (iv) object removal and background completion, all done without any use of annotation. The method combines a Generative Adversarial Network and a Variational Auto-Encoder, with multiple encoders, generators and discriminators, and benefits from solving all tasks at once. The input to the training scheme is a varied collection of unlabeled images from the same domain, as well as a set of background images without a foreground object. In addition, the image generator can mix the background from one image with a foreground that is conditioned either on that of a second image or on the index of a desired cluster. The method obtains state-of-the-art results when compared to the literature methods in each of the tasks.
Chapter
Object tracking is a well-studied problem in computer vision, while identifying salient spots of objects in a video is a less explored direction in the literature. Video eye gaze estimation methods aim to tackle a related task, but salient spots in those methods are not bounded by objects and tend to produce very scattered, unstable predictions due to the noisy ground truth data. We reformulate the problem of detecting and tracking salient object spots as a new task called object hotspot tracking. In this paper, we propose to tackle this task jointly with unsupervised video object segmentation, in real-time, with a unified framework to exploit the synergy between the two. Specifically, we propose a Weighted Correlation Siamese Network (WCS-Net) which employs a Weighted Correlation Block (WCB) for encoding the pixel-wise correspondence between a template frame and the search frame. In addition, WCB takes the initial mask/hotspot as guidance to enhance the influence of salient regions for robust tracking. Our system can operate online during inference and jointly produce the object mask and hotspot tracklets at 33 FPS. Experimental results validate the effectiveness of our network design, and show the benefits of jointly solving the hotspot tracking and object segmentation problems. In particular, our method performs favorably against state-of-the-art video eye gaze models in object hotspot tracking, and outperforms existing methods on three benchmark datasets for unsupervised video object segmentation.
Chapter
We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs, and estimates the object masks referred by the given language expression in the whole video frames. Our algorithm addresses the challenging problem by performing language-based object segmentation and mask propagation jointly using a single deep neural network with a proper combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets including ours and demonstrate the effectiveness of the proposed approach. The dataset is released at https://github.com/skynbe/Refer-Youtube-VOS.
Chapter
This paper investigates the principles of embedding learning to tackle the challenging semi-supervised video object segmentation. Different from previous practices that only explore embedding learning using pixels from the foreground object(s), we consider that the background should be treated equally and thus propose the Collaborative video object segmentation by Foreground-Background Integration (CFBI) approach. Our CFBI implicitly imposes the feature embedding from the target foreground object and its corresponding background to be contrastive, promoting the segmentation results accordingly. With the feature embedding from both foreground and background, our CFBI performs the matching process between the reference and the predicted sequence at both pixel and instance levels, making CFBI robust to various object scales. We conduct extensive experiments on three popular benchmarks, i.e., DAVIS 2016, DAVIS 2017, and YouTube-VOS. Our CFBI achieves a performance (\(\mathcal{J}\&\mathcal{F}\)) of 89.4%, 81.9%, and 81.4%, respectively, outperforming all the other state-of-the-art methods. Code: https://github.com/z-x-yang/CFBI.
Chapter
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking.
Book
This book addresses one of the most important unsolved problems in artificial intelligence: the task of learning, in an unsupervised manner, from massive quantities of spatiotemporal visual data that are available at low cost. The book covers important scientific discoveries and findings, with a focus on the latest advances in the field. Presenting a coherent structure, the book logically connects novel mathematical formulations and efficient computational solutions for a range of unsupervised learning tasks, including visual feature matching, learning and classification, object discovery, and semantic segmentation in video. The final part of the book proposes a general strategy for visual learning over several generations of student-teacher neural networks, along with a unique view on the future of unsupervised learning in real-world contexts. Offering a fresh approach to this difficult problem, several efficient, state-of-the-art unsupervised learning algorithms are reviewed in detail, complete with an analysis of their performance on various tasks, datasets, and experimental setups. By highlighting the interconnections between these methods, many seemingly diverse problems are elegantly brought together in a unified way. Serving as an invaluable guide to the computational tools and algorithms required to tackle the exciting challenges in the field, this book is a must-read for graduate students seeking a greater understanding of unsupervised learning, as well as researchers in computer vision, machine learning, robotics, and related disciplines. Dr. Marius Leordeanu is an Associate Professor (Senior Lecturer) at the Computer Science & Engineering Department, Polytechnic University of Bucharest and a Senior Researcher at the Institute of Mathematics of the Romanian Academy (IMAR), Bucharest, Romania. In 2014, he was awarded the Grigore Moisil Prize, the most prestigious award in mathematics bestowed by the Romanian Academy, for his work on unsupervised learning.
Chapter
Unsupervised learning represents one of the most interesting challenges in computer vision today, as the material presented so far in the book has so often shown. The problem is essential for visual learning, coming in many different forms and tasks. Consequently, it has an immense practical value with many applications in artificial intelligence and emerging technologies, as large quantities of unlabeled images and videos can be collected at low cost. In this chapter, we address the unsupervised learning problem in the context of segmenting the main foreground objects in single images. We propose an unsupervised teacher-student learning system, which has two pathways, the teacher and the student, respectively. The system is designed to learn over several generations of teachers and students. At every generation, the teacher does unsupervised object discovery in videos or collections of images and an automatic selection module picks up good frame segmentations and passes them to the student pathway for training. At every generation, multiple students are trained, with different deep network architectures to ensure a better diversity. The students at one iteration help in training a better selection module, forming together a more powerful teacher pathway at the next iteration. The approach presented in this chapter gathers together many ideas studied throughout the book. On one hand, the pathway used as teacher at the first iteration is an unsupervised video object discoverer that immediately relates to the VideoPCA method introduced in Chap. 5 and the space-time clustering GO-VOS method introduced in Chap. 6. Then, the idea of using agreements between multiple student nets at the current generation as teacher at the next generation relates directly to the principles for unsupervised learning proposed in Chap. 1. Such agreements constitute a reliable HPP signal (Chap. 5), which becomes a robust teacher for training the next-generation students. Their agreements are also used to train, in an unsupervised fashion, a deep network that selects only high-quality HPP object masks as supervisory signal. The principles for unsupervised learning are exploited fully in the teacher-student system presented here. The material in this chapter makes some original contributions, discussed next. They lay the foundation of our general concept of a universal unsupervised learning machine, the Visual Story Network (VSN), which we introduce in the final chapter of the book. In the extensive experimental validation of our proposed system, we show that the improvement in the selection power, the training of multiple students, and the increase in unlabeled data significantly improve segmentation accuracy from one generation to the next. Our method achieves top results on three challenging datasets for object discovery in video, unsupervised image segmentation, and saliency detection. At test time, our system is fast, being one to two orders of magnitude faster than published unsupervised methods. We also test the strength of our unsupervised features within a well-known transfer learning setup and achieve competitive performance, proving that our unsupervised approach can be reliably used in a variety of computer vision tasks.
Conference Paper
We propose an adversarial contextual model for detecting moving objects in images. A deep neural network is trained to predict the optical flow in a region using information from everywhere else but that region (context), while another network attempts to make such context as uninformative as possible. The result is a model where hypotheses naturally compete with no need for explicit regularization or hyper-parameter tuning. Although our method requires no supervision whatsoever, it outperforms several methods that are pre-trained on large annotated datasets. Our model can be thought of as a generalization of classical variational generative region-based segmentation, but in a way that avoids explicit regularization or solution of partial differential equations at run-time.
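The two-player objective described above can be sketched as follows; `flow_inpainter` is a placeholder context network, and the losses are an illustrative rendering of the stated competition, not the authors' exact formulation:

```python
import torch

def adversarial_context_losses(flow, mask, flow_inpainter):
    """The inpainter predicts the flow inside `mask` from everything outside
    it; the mask network is rewarded when that prediction fails, i.e. when
    the context is uninformative about the masked region."""
    recon = flow_inpainter(flow * (1 - mask), mask)
    err = (((recon - flow) ** 2).sum(dim=1, keepdim=True) * mask).sum()
    err = err / mask.sum().clamp(min=1.0)
    return err, -err   # (inpainter loss, mask-network loss)
```

The minimax structure is what makes mask hypotheses compete without explicit regularization: a mask that covers a coherently moving object is exactly one whose flow cannot be predicted from its context.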
Article
There are many reasons to expect an ability to reason in terms of objects to be a crucial skill for any generally intelligent agent. Indeed, recent machine learning literature is replete with examples of the benefits of object-like representations: generalization, transfer to new tasks, and interpretability, among others. However, in order to reason in terms of objects, agents need a way of discovering and detecting objects in the visual world - a task which we call unsupervised object detection. This task has received significantly less attention in the literature than its supervised counterpart, especially in the case of large images containing many objects. In the current work, we develop a neural network architecture that effectively addresses this large-image, many-object setting. In particular, we combine ideas from Attend, Infer, Repeat (AIR), which performs unsupervised object detection but does not scale well, with recent developments in supervised object detection. We replace AIR’s core recurrent network with a convolutional (and thus spatially invariant) network, and make use of an object-specification scheme that describes the location of objects with respect to local grid cells rather than the image as a whole. Through a series of experiments, we demonstrate a number of features of our architecture: that, unlike AIR, it is able to discover and detect objects in large, many-object scenes; that it has a significant ability to generalize to images that are larger and contain more objects than images encountered during training; and that it is able to discover and detect objects with enough accuracy to facilitate non-trivial downstream processing.
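The grid-relative object specification mentioned above is the same trick used by one-stage supervised detectors. A generic sketch (not the paper's code) of decoding per-cell predictions into image coordinates:

```python
import torch

def decode_grid_boxes(raw, cell_size):
    """raw: (B, H, W, 4) per-cell predictions. Each cell predicts its
    object's center offset within the cell plus a size bounded relative to
    the cell, which makes the parameterization spatially invariant."""
    _, h, w, _ = raw.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs + torch.sigmoid(raw[..., 0])) * cell_size   # center, image coords
    cy = (ys + torch.sigmoid(raw[..., 1])) * cell_size
    bw = torch.sigmoid(raw[..., 2]) * 2 * cell_size      # size near cell scale
    bh = torch.sigmoid(raw[..., 3]) * 2 * cell_size
    return torch.stack([cx, cy, bw, bh], dim=-1)         # (B, H, W, 4)
```

Because every cell reasons only about its local neighborhood, the same weights apply to images of any size, which is what gives the architecture its ability to generalize to larger, many-object scenes.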
Chapter
In this work, we study the unsupervised video object segmentation problem where moving objects are segmented without prior knowledge of these objects. First, we propose a motion-based bilateral network to estimate the background based on the motion pattern of non-object regions. The bilateral network reduces false positive regions by accurately identifying background objects. Then, we integrate the background estimate from the bilateral network with instance embeddings into a graph, which allows multiple frame reasoning with graph edges linking pixels from different frames. We classify graph nodes by defining and minimizing a cost function, and segment the video frames based on the node labels. The proposed method outperforms previous state-of-the-art unsupervised video object segmentation methods against the DAVIS 2016 and the FBMS-59 datasets.
Chapter
Unsupervised video segmentation plays an important role in a wide variety of applications from object identification to compression. However, to date, fast motion, motion blur and occlusions pose significant challenges. To address these challenges for unsupervised video segmentation, we develop a novel saliency estimation technique as well as a novel neighborhood graph, based on optical flow and edge cues. Our approach leads to significantly better initial foreground-background estimates and their robust as well as accurate diffusion across time. We evaluate our proposed algorithm on the challenging DAVIS, SegTrack v2 and FBMS-59 datasets. Despite the usage of only a standard edge detector trained on 200 images, our method achieves state-of-the-art results outperforming deep learning based methods in the unsupervised setting. We even demonstrate competitive results comparable to deep learning based methods in the semi-supervised setting on the DAVIS dataset.
Chapter
Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. While weakly-supervised methods require less supervision, by utilizing 2D poses or multi-view imagery without annotations, they still need a sufficiently large set of samples with 3D annotations for learning to succeed. In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. As evidenced by our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.
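The geometry-aware part of such a representation can be captured by making the latent code a set of 3D vectors that are rotated between cameras before decoding. A minimal sketch of that rotation step, as an illustration of the principle rather than the paper's architecture:

```python
import torch

def rotate_latent(z, R):
    """z: (B, K, 3) latent 3D vectors from the encoder; R: (B, 3, 3)
    relative rotation from the source to the target camera. Decoding the
    target view from the rotated code forces the latent space to encode
    3D geometry rather than flat appearance."""
    return torch.einsum("bij,bkj->bki", R, z)
```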
Book
Simulation and the Monte Carlo Method, Third Edition reflects the latest developments in the field and presents a fully updated and comprehensive account of the state-of-the-art theory, methods and applications that have emerged in Monte Carlo simulation since the publication of the classic First Edition more than a quarter of a century ago. While maintaining its accessible and intuitive approach, this revised edition features a wealth of up-to-date information that facilitates a deeper understanding of problem solving across a wide array of subject areas, such as engineering, statistics, computer science, mathematics, and the physical and life sciences. The book begins with a modernized introduction that addresses the basic concepts of probability, Markov processes, and convex optimization. Subsequent chapters discuss the dramatic changes that have occurred in the field of the Monte Carlo method, with coverage of many modern topics including: Markov Chain Monte Carlo, variance reduction techniques such as importance (re-)sampling and the transform likelihood ratio method, the score function method for sensitivity analysis, the stochastic approximation method and the stochastic counterpart method for Monte Carlo optimization, the cross-entropy method for rare events estimation and combinatorial optimization, and application of Monte Carlo techniques for counting problems. An extensive range of exercises is provided at the end of each chapter, as well as a generous sampling of applied examples. The Third Edition features a new chapter on the highly versatile splitting method, with applications to rare-event estimation, counting, sampling, and optimization. A second new chapter introduces the stochastic enumeration method, which is a new fast sequential Monte Carlo method for tree search. In addition, the Third Edition features new material on: random number generation, including multiple-recursive generators and the Mersenne Twister; simulation of Gaussian processes, Brownian motion, and diffusion processes; the multilevel Monte Carlo method; new enhancements of the cross-entropy (CE) method, including the "improved" CE method, which uses sampling from the zero-variance distribution to find the optimal importance sampling parameters; over 100 algorithms in modern pseudo code with flow control; and over 25 new exercises. Simulation and the Monte Carlo Method, Third Edition is an excellent text for upper-undergraduate and beginning graduate courses in stochastic simulation and Monte Carlo techniques. The book also serves as a valuable reference for professionals who would like to achieve a more formal understanding of the Monte Carlo method.
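As a concrete taste of the variance-reduction material covered in the book, here is importance sampling for a rare-event probability, P(X > 4) with X ~ N(0, 1), where naive Monte Carlo almost always returns zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Naive Monte Carlo: the event has probability ~3.2e-5, so a run of this
# size sees only a handful of hits (often none) and the estimate is noisy.
x = rng.standard_normal(n)
naive = (x > 4).mean()

# Importance sampling from N(4, 1): hits are now common, and each sample is
# reweighted by the likelihood ratio N(0,1)/N(4,1) = exp(8 - 4y).
y = rng.normal(4.0, 1.0, n)
is_est = ((y > 4) * np.exp(8.0 - 4.0 * y)).mean()

print(naive, is_est)  # the importance-sampling estimate is far more stable
```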
Conference Paper
There is broad consensus that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.
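The architecture lends itself to a compact sketch. The toy two-level version below keeps the defining structure (contracting path, expanding path, skip connection) but uses padded convolutions and far fewer levels than the original:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net: the contracting path captures context, the expanding
    path restores resolution, and the skip connection carries fine detail."""
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)       # 64 skip + 64 upsampled channels
        self.head = nn.Conv2d(64, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                      # full resolution
        e2 = self.enc2(self.pool(e1))          # half resolution
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)                   # per-pixel class logits

seg = TinyUNet()(torch.randn(1, 1, 64, 64))    # -> (1, 2, 64, 64)
```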