Dynamic Sampling Network for Semantic Segmentation

Sampling is a basic operation in modern convolutional neural networks (CNNs): down-sampling operators enlarge the receptive field, while up-sampling operators increase resolution. Most existing deep segmentation networks employ regular grid sampling operators, which can be suboptimal for the semantic segmentation task because of large variance in object shape and scale. To address this problem, this paper proposes a Context Guided Dynamic Sampling (CGDS) module that obtains an effective representation with rich shape and scale information by adaptively sampling useful segmentation information in the spatial domain. Moreover, we use multi-scale contextual representations to guide the sampling process. Our CGDS can therefore adaptively capture shape and scale information according to not only the input feature map but also the multi-scale semantic context. CGDS is a plug-and-play module that can be easily incorporated into deep segmentation networks. We incorporate the proposed CGDS module into the Dynamic Sampling Network (DSNet) and perform extensive experiments on segmentation datasets. Experimental results show that CGDS significantly improves semantic segmentation performance and achieves state-of-the-art performance on the PASCAL VOC 2012 and ADE20K datasets. Our model achieves 85.2% mIOU on the PASCAL VOC 2012 test set without pre-training on the MS COCO dataset and 46.4% on the ADE20K validation set. The code will be made publicly available after publication.
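To make the contrast with regular grid sampling concrete, the following is a minimal sketch of offset-based sampling with bilinear interpolation. It is not the authors' CGDS implementation: the function names `bilinear_sample` and `dynamic_sample` are hypothetical, and the offsets are hard-coded here, whereas in a module of this kind they would presumably be predicted from the feature map and the multi-scale context.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D map (list of rows) at a fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling location to the valid range of the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature[y0][x0] + dx * feature[y0][x1]
    bottom = (1 - dx) * feature[y1][x0] + dx * feature[y1][x1]
    return (1 - dy) * top + dy * bottom

def dynamic_sample(feature, offsets):
    """Resample location (i, j) at (i + dy, j + dx), with offsets[i][j] = (dy, dx)."""
    return [
        [bilinear_sample(feature, i + offsets[i][j][0], j + offsets[i][j][1])
         for j in range(len(feature[0]))]
        for i in range(len(feature))
    ]

feature = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0]]
zero = [[(0.0, 0.0)] * 3 for _ in range(3)]   # regular grid: no displacement
half = [[(0.5, 0.5)] * 3 for _ in range(3)]   # every location shifted by half a pixel
identity = dynamic_sample(feature, zero)
shifted = dynamic_sample(feature, half)
```

With all-zero offsets the operation reduces to regular grid sampling and returns the input unchanged; non-zero offsets let each output location read from a different, possibly fractional, position.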


... DeepLabv2 [23], DeepLabv3 [3], and DeepLabv3+ [18] apply several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP) to capture rich contextual information. DSNet [24] proposes a Context-Guided Dynamic Sampling (CGDS) module that adaptively samples useful segmentation information in the spatial domain to obtain an effective representation with rich shape and scale information. APCNet [25] proposes the Adaptive Context Module (ACM), which uses the GLA to compute a context vector for each local location in order to aggregate contextual information. ...
... We set the atrous rate r differently in SPM1 and SPM2. In SPM1, the atrous rate r is set to [4,8,12] to capture smaller objects in the image; in SPM2, the atrous rate r is set to [12,24,36] to capture larger objects in the image. ...
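As a rough illustration of why the two rate sets target different object sizes, the sketch below computes the spatial extent covered by a single atrous kernel using the standard relation k_eff = k + (k - 1)(r - 1). The 3x3 kernel size is an assumption, since the excerpt does not state it, and `effective_kernel_size` is a hypothetical helper name.

```python
def effective_kernel_size(kernel_size, rate):
    """Extent (in pixels) covered by a kernel with `rate - 1` gaps between taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

# Assumed 3x3 kernels; the rates are those quoted for SPM1 and SPM2.
spm1 = [effective_kernel_size(3, r) for r in [4, 8, 12]]
spm2 = [effective_kernel_size(3, r) for r in [12, 24, 36]]
print(spm1)  # [9, 17, 25]
print(spm2)  # [25, 49, 73]
```

Under this assumption, the SPM1 branches cover windows of 9 to 25 pixels, while the SPM2 branches cover 25 to 73 pixels, with the same number of weights per kernel in every branch.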
Low-level features contain spatial detail information, and high-level features contain rich semantic information. Semantic segmentation research focuses on fully acquiring and effectively fusing spatial detail with semantic information. This paper proposes a multiscale feature-enhanced adaptive fusion network named MFEAFN to improve semantic segmentation performance. First, we designed a Double Spatial Pyramid Module named DSPM to extract more high-level semantic information. Second, we designed a Focusing Selective Fusion Module named FSFM to fuse different scales and levels of feature maps. Specifically, the feature maps are enhanced to adaptively fuse these features by generating attention weights through a spatial attention mechanism and a two-dimensional discrete cosine transform, respectively. To validate the effectiveness of FSFM, we designed different fusion modules for comparison and ablation experiments. MFEAFN achieved 82.64% and 78.46% mIoU on the PASCAL VOC2012 and Cityscapes datasets. In addition, our method has better segmentation results than state-of-the-art methods.
... Within the past decades, increasing attention to hierarchically organizing images has been drawn from the communities of computer vision and multimedia, by concerning the principle of perceptual systems. For example, an image can be spatially segmented into a set of object instances or super-pixels [14,26,35,53,52,8,13], which serve as primitives for further processing. Different from the spatial whole-part perspective, this paper concentrates on another organization manner from a scale-space/information eliciting perspective. ...
The importance of hierarchical image organization has been witnessed by a wide spectrum of applications in computer vision and graphics. Different from image segmentation with the spatial whole-part consideration, this work designs a modern framework for disassembling an image into a family of derived signals from a scale-space perspective. Specifically, we first offer a formal definition of image disassembly. Then, by considering desired properties such as peeling hierarchy and structure preservation, we convert the original complex problem into a series of two-component separation sub-problems, significantly reducing the complexity. The proposed framework is flexible to both supervised and unsupervised settings. A compact recurrent network, namely the hierarchical image peeling net, is customized to fulfill the task efficiently and effectively. It is about 3.5Mb in size and can handle 1080p images at more than 60 fps per recurrence on a GTX 2080Ti GPU, making it attractive for practical use. Both theoretical findings and experimental results are provided to demonstrate the efficacy of the proposed framework, reveal its superiority over other state-of-the-art alternatives, and show its potential in various application scenarios. Our code is available at \url{}.
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of existing contextual models for detection is rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.
Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy.
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
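The attention mechanism this abstract refers to can be sketched in a few lines. The example below is a simplified, single-head scaled dot-product attention written with plain Python lists; it omits the learned projections, masking, and multi-head structure of the full architecture, and the function names are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# Identical keys give uniform weights, so the output is the mean of the values.
result = attention([[1.0, 0.0]],
                   [[0.0, 0.0], [0.0, 0.0]],
                   [[1.0, 3.0], [3.0, 5.0]])
```

Every query attends to all keys at once, which is what makes the computation easy to parallelize compared with a recurrent model that processes positions one after another.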
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
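The claim that atrous convolution enlarges the field of view without adding parameters can be illustrated with a one-dimensional sketch. This is a simplified example, not the DeepLab code: `atrous_conv1d` is a hypothetical helper that uses "valid" boundaries and the cross-correlation convention common in deep learning frameworks.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D atrous convolution: kernel taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]

signal = [float(i) for i in range(10)]  # a simple ramp: 0, 1, ..., 9
kernel = [1.0, 0.0, -1.0]               # the same three weights in both cases

dense = atrous_conv1d(signal, kernel, rate=1)    # taps at i, i+1, i+2
dilated = atrous_conv1d(signal, kernel, rate=3)  # taps at i, i+3, i+6
```

Both calls use the same three weights, but with `rate=3` each output compares samples six positions apart instead of two, so the filter sees a wider context at no extra parameter cost.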
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4 year old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.