Article

Abstract

Camouflaged objects attempt to conceal their texture within the background, and discriminating them from the background is hard even for human beings. The main objective of this paper is to explore the camouflaged object segmentation problem, namely, segmenting the camouflaged object(s) in a given image. This problem has not been well studied despite a wide range of potential applications, including the preservation of wild animals and the discovery of new species, surveillance systems, and search-and-rescue missions in the event of natural disasters such as earthquakes, floods, or hurricanes. To address this new and challenging problem, we provide a new image dataset of camouflaged objects for benchmarking purposes. In addition, we propose a general end-to-end network, called the Anabranch Network, that leverages both classification and segmentation tasks. Different from existing segmentation networks, our proposed network possesses a second, classification branch that predicts the probability of an image containing camouflaged object(s); this probability is then fused into the main segmentation branch to boost segmentation accuracy. Extensive experiments conducted on the newly built dataset demonstrate the effectiveness of our network using various fully convolutional networks.
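The classification-to-segmentation fusion described in the abstract can be sketched as follows; this is a minimal illustration assuming a shared backbone and simple heads, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AnabranchSketch(nn.Module):
    """Illustrative two-branch layout: a segmentation stream plus a classification
    stream whose camouflage probability rescales the segmentation map.
    Module names are hypothetical; ANet itself builds on specific FCN backbones."""

    def __init__(self, backbone: nn.Module, seg_head: nn.Module, cls_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # shared feature extractor
        self.seg_head = seg_head   # produces a 1-channel segmentation logit map
        self.cls_head = cls_head   # produces a single camouflage-presence logit per image

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)
        seg = torch.sigmoid(self.seg_head(feats))        # B x 1 x H x W segmentation map
        p_cam = torch.sigmoid(self.cls_head(feats))      # B x 1 probability of camouflage presence
        # Fusion: suppress the segmentation map for images unlikely to contain camouflage.
        return seg * p_cam.view(-1, 1, 1, 1)
```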


... However, due to the local receptive fields of CNNs, they tend to activate local semantic elements of an object or even some non-object regions, producing missing and redundant parts in COD results. Recently, much effort has been made to solve this problem for COD by proposing auxiliary tasks (i.e., classification [13], identification [8], or edge detection [35]) to augment the feature representations. However, none of these works has paid attention to fundamentally solving the inherent defects of the CNN's local representation. ...
... Camouflaged Object Detection (COD). Recently, deep CNN networks with large capacity have been widely used to recognize complex camouflaged objects. Le et al. [13] propose an end-to-end network called ANet, which adds a classification stream to a salient object detection model to boost segmentation accuracy. Yan et al. [33] find that horizontally flipped images can provide vital cues and propose a two-stream MirrorNet that takes the original and flipped images as inputs. ...
... Datasets. We evaluate our method on four benchmark datasets: CHAMELEON [23], CAMO [13], COD10K [8], and NC4K [18]. CHAMELEON [23] has 76 images collected from the Internet via the Google search engine. ...
Preprint
Camouflaged objects are seamlessly blended into their surroundings, which makes their detection a challenging task in computer vision. Optimizing a convolutional neural network (CNN) for camouflaged object detection (COD) tends to activate local discriminative regions while ignoring the complete object extent, causing the partial activation issue which inevitably leads to missing or redundant object regions. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNNs, where the convolution operations produce local receptive fields and have difficulty capturing long-range feature dependencies among image regions. In order to obtain feature maps that activate the full object extent, while keeping the segmentation results from being overwhelmed by noisy features, a novel framework termed the Cross-Model Detail Querying network (DQnet) is proposed. It reasons about the relations between long-range-aware representations and multi-scale local details to make the enhanced representation fully highlight the object regions and eliminate noise in non-object regions. Specifically, a vanilla ViT pretrained with self-supervised learning (SSL) is employed to model long-range dependencies among image regions. A ResNet is employed to learn fine-grained spatial local details at multiple scales. Then, to effectively retrieve object-related details, a Relation-Based Querying (RBQ) module is proposed to explore window-based interactions between the global representations and the multi-scale local details. Extensive experiments are conducted on the widely used COD datasets and show that our DQnet outperforms the current state-of-the-art methods.
... To validate the effectiveness of our CamoFormer, we conduct extensive experiments on three popular COD benchmarks (NC4K [36], COD10K [11], and CAMO [27]). On all these benchmarks, CamoFormer achieves new state-of-the-art (SOTA) records compared to recent methods. ...
... Unlike previous works [11,22,41] that mainly use the addition operation or the concatenation operation to fuse features from different feature levels, we first compute the element-wise product between them and then use the summation operation. We empirically found that such a simple modification brings about a 0.2%+ relative improvement in terms of S-measure and weighted F-measure on average on NC4K [36], COD10K-test [10], and CAMO-test [27]. ...
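Read literally, the fusion described in this snippet multiplies two feature maps element-wise and then sums; the function below is one plausible reading of that description (not the authors' code), assuming same-shaped feature tensors.

```python
import torch

def fuse_levels(f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
    # Element-wise product first, then summation with the original features
    # (one plausible interpretation of the description above).
    return f_a * f_b + f_a + f_b
```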
... Datasets. We evaluate our method on three popular COD benchmarks, including CAMO [27], COD10K [10], and NC4K [36]. CAMO comprises 2,500 images, half of which contain camouflaged objects and half do not. ...
Preprint
Full-text available
How to identify and segment camouflaged objects from the background is challenging. Inspired by the multi-head self-attention in Transformers, we present a simple masked separable attention (MSA) for camouflaged object detection. We first separate the multi-head self-attention into three parts, which are responsible for distinguishing the camouflaged objects from the background using different mask strategies. Furthermore, we propose to capture high-resolution semantic representations progressively based on a simple top-down decoder with the proposed MSA to attain precise segmentation results. These structures plus a backbone encoder form a new model, dubbed CamoFormer. Extensive experiments show that CamoFormer surpasses all existing state-of-the-art methods on three widely-used camouflaged object detection benchmarks. There are on average around 5% relative improvements over previous methods in terms of S-measure and weighted F-measure.
... In the early years, several traditional COD methods [9], [10] have been proposed to segment camouflaged objects by using manually designed features. Recently, due to the development of deep learning-based representation methods, many deep learning-based COD methods have been proposed to obtain state-of-the-art performance [11]- [15]. For example, ANet [11] utilizes a classification network to determine whether the image contains camouflaged objects or not, and then uses a fully convolutional network for COD. ...
... Recently, due to the development of deep learning-based representation methods, many deep learning-based COD methods have been proposed to obtain state-of-the-art performance [11]- [15]. For example, ANet [11] utilizes a classification network to determine whether the image contains camouflaged objects or not, and then uses a fully convolutional network for COD. SINet [12] is proposed to utilize a search module to coarsely select the candidate regions of camouflaged objects and then proposes an identification module to precisely detect camouflaged objects. ...
... • CHAMELEON [12] is collected via the Google search engine with the keyword "camouflage animals", containing 76 camouflaged images, which are all used for testing. • CAMO [11] has 1,250 images in 8 categories, of which 1,000 images are for training and the remaining 250 are for testing. • COD10K [12] is currently the largest camouflaged object dataset with high-quality pixel-level annotations. ...
Preprint
Full-text available
Camouflaged object detection (COD) aims to detect/segment camouflaged objects embedded in the environment, which has attracted increasing attention over the past decades. Although several COD methods have been developed, they still suffer from unsatisfactory performance due to the intrinsic similarities between the foreground objects and background surroundings. In this paper, we propose a novel Feature Aggregation and Propagation Network (FAP-Net) for camouflaged object detection. Specifically, we propose a Boundary Guidance Module (BGM) to explicitly model the boundary characteristic, which can provide boundary-enhanced features to boost the COD performance. To capture the scale variations of the camouflaged objects, we propose a Multi-scale Feature Aggregation Module (MFAM) to characterize the multi-scale information from each layer and obtain the aggregated feature representations. Furthermore, we propose a Cross-level Fusion and Propagation Module (CFPM). In the CFPM, the feature fusion part can effectively integrate the features from adjacent layers to exploit the cross-level correlations, and the feature propagation part can transmit valuable context information from the encoder to the decoder network via a gate unit. Finally, we formulate a unified and end-to-end trainable framework where cross-level features can be effectively fused and propagated for capturing rich context information. Extensive experiments on three benchmark camouflaged datasets demonstrate that our FAP-Net outperforms other state-of-the-art COD models. Moreover, our model can be extended to the polyp segmentation task, and the comparison results further validate the effectiveness of the proposed model in segmenting polyps. The source code and results will be released at https://github.com/taozh2017/FAPNet.
... We tested our method on three popular datasets: CHAMELEON [32], CAMO [21], and COD10K [10]. These datasets contain various challenging scenarios, and the proposed model achieves state-of-the-art performance on all three datasets. ...
... • The proposed network achieves state-of-the-art performance on the CHAMELEON [32], CAMO [21], and COD10K [10] datasets. Additionally, we demonstrate the effectiveness of the proposed method through various ablation studies. ...
... We perform experiments on three popular COD benchmarks to validate the effectiveness of the proposed method: CHAMELEON [32], CAMO [21], and COD10K [10]. CHAMELEON [32] is a small dataset containing only 76 images, which are collected from the Internet. ...
Preprint
The camouflaged object detection (COD) task aims to find and segment objects that have a color or texture that is very similar to that of the background. Despite the difficulties of the task, COD is attracting attention in medical, lifesaving, and anti-military fields. To overcome the difficulties of COD, we propose a novel global-local aggregation architecture with a deformable point sampling method. Further, we propose a global-local aggregation transformer that integrates an object's global information, background, and boundary local information, which is important in COD tasks. The proposed transformer obtains global information from feature channels and effectively extracts important local information from the subdivided patch using the deformable point sampling method. Accordingly, the model effectively integrates global and local information for camouflaged objects and also shows that important boundary information in COD can be efficiently utilized. Our method is evaluated on three popular datasets and achieves state-of-the-art performance. We prove the effectiveness of the proposed method through comparative experiments.
... It thus has wide applications in medical treatment (e.g., polyp segmentation [3], COVID-19 infection segmentation [4]), agriculture (e.g., pest identification [5]), and art (e.g., camouflage painting [6]). Unlike salient object detection, which identifies objects with high contrast between foreground and background, COD is challenging since it has to distinguish intrinsic similarities between foreground objects and background surroundings [7]. Although several CNN-based deep learning approaches have been proposed to overcome this challenge, they still have inherent limitations in representing and learning explicit global contexts for precise and robust detection and segmentation [8]-[16]. ...
... Fig. 1 shows the results of COD using the proposed TCU-Net, which can conduct accurate camouflaged object segmentations, almost identical to the ground truths. We compare the proposed TCU-Net with previous studies by analyzing various metrics based on four public datasets such as CAMO [7], CHAMELEON [24], COD10K [8], and NC4K [14]. Comprehensive evaluation proves the robustness and novelty of TCU-Net. ...
... Four public datasets were used for comparative evaluation of COD, such as CAMO [7], CHAMELEON [24], COD10K [8], and NC4K [14]. We trained the proposed TCU-Net with the public datasets that contain 3,040 images from COD10K and 1,000 images from CAMO. ...
Article
Full-text available
Camouflaged object detection (COD) seeks to find concealed objects hidden in natural surroundings. COD is challenging since it has to distinguish intrinsic similarities between foreground objects and background surroundings, unlike salient object detection. Convolutional neural network (CNN)-based approaches are proposed to overcome this challenge. However, they have inherent limitations in modeling and extracting global contexts. Although Transformer-based approaches are proposed to tackle this problem, which can maintain the semantic features of input images, they have limitations in learning localized spatial features in the limited receptive field. Therefore, one of the main challenges is to conduct accurate and robust COD while maintaining global contexts without sacrificing low-level contexts. This study proposes a novel concealed object detection and segmentation method using Transformer and CNN-based advanced U-Net (TCU-Net). TCU-Net can extract globalized semantic features using the Swin Transformer-based encoder and localized spatial features using the attentive inception decoder. In particular, multi-dilated residual (MDR) blocks connecting the encoder and decoder generate refined multi-level features to improve discriminability. Finally, the attentive inception decoder generates the final camouflaged object mask by maintaining the localized spatial information. Instead of simple up-sampling of the feature map, the attentive inception decoder conducts cascaded deconvolution through inception and attention modules. A weighted hybrid loss function is used for optimizing the model, consisting of the binary cross entropy (BCE) and intersection over union (IoU) losses. We comprehensively compared the proposed TCU-Net with previous studies by analyzing different metrics based on four public datasets, such as CAMO, CHAMELEON, COD10K, and NC4K. An ablation study was also conducted to evaluate network architectures and loss functions to verify advantages of the proposed approach. Experimental analysis on public datasets proves that the proposed TCU-Net outperforms previous approaches.
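The weighted hybrid loss mentioned above combines BCE and IoU terms; a minimal, unweighted sketch of such a loss is shown below (illustrative only, not the TCU-Net implementation; a weighted variant would additionally multiply per-pixel weights into both terms).

```python
import torch
import torch.nn.functional as F

def hybrid_bce_iou_loss(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of a BCE + soft-IoU hybrid loss.
    logits, mask: B x 1 x H x W tensors; mask is the binary ground truth."""
    bce = F.binary_cross_entropy_with_logits(logits, mask)
    prob = torch.sigmoid(logits)
    inter = (prob * mask).sum(dim=(2, 3))
    union = (prob + mask - prob * mask).sum(dim=(2, 3))
    iou_loss = 1.0 - (inter + 1.0) / (union + 1.0)  # smoothed to avoid division by zero
    return bce + iou_loss.mean()
```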
... This domain has a very limited collection of datasets; due to its challenging nature, the images are difficult to find and prepare, i.e., image collection and GT annotation, respectively. The CAMO dataset [10] consists of 1,250 images of ecological camouflage, and Kvasir-SEG [11] contains 1,000 images of medical camouflage. The COD10K dataset contains 10,000 images of natural camouflage. ...
... The evaluation time on this framework is 83.3 ms/sample. The CAMO dataset [10] consists of generic camouflaged objects that depend on the specific situation or condition, as shown in Figure 6a. COD10K [4] contains 10,000 images, as shown in Figure 6b, of which 5,066 are defined as camouflaged, 3,000 as background, and 1,934 as non-camouflaged. ...
Article
Full-text available
Camouflaged objects hide information physically by matching the texture or boundary features of the background. Texture matching and similarities between camouflaged objects and their surroundings make them difficult to differentiate from generic and salient objects, thus making camouflaged object detection (COD) more challenging. Existing techniques perform well; however, the challenging nature of camouflaged objects demands more accurate detection and segmentation. To overcome this challenge, an optimized modular framework for COD tasks, named Optimize Global Refinement (OGR), is presented. This framework comprises a parallel approach to feature extraction for enhancing the learned parameters, and globally refined feature maps that abstract all intuitive feature sets at each extraction block's output. Additionally, an optimized local best feature node-based rule is proposed to reduce the complexity of the proposed model. In light of the baseline experiments, OGR was applied and evaluated on a benchmark, outperforming prior results on the publicly available datasets with state-of-the-art structural similarity of 94%, 93%, and 96% for the Kvasir-SEG, COD10K, and Camouflaged Object (CAMO) datasets, respectively. OGR is general and can be integrated into real-time applications in future development.
... Four datasets were used to evaluate the proposed model: CHAMELEON [79], CAMO [80], COD10K [2], and NC4K [36]. These four datasets differ significantly. ...
... The smallest dataset is the CHAMELEON dataset, which does not provide a training set and contains only 76 images for testing. Following [2], [11], [33], we used the training images from the CAMO and COD10K datasets as the training set (4,040 images) and the test images from CHAMELEON [79], CAMO [80], COD10K [2], and NC4K [36] as the test set. ...
... In order to verify the effectiveness of the parameter selection in each module in the experiments, we used Res2Net-50 as the backbone and selected two COD datasets (CAMO [80], COD10K [2]) and two SOD datasets (ECSSD [70], PASCAL-S [69]) for the validation experiments. ...
Preprint
Binary segmentation is used to distinguish objects of interest from background, and is an active area of convolutional encoder-decoder network research. The current decoders are designed for specific objects based on the common backbones as the encoders, but cannot deal with complex backgrounds. Inspired by the way human eyes detect objects of interest, a new unified dual-branch decoder paradigm named the difference-aware decoder is proposed in this paper to explore the difference between the foreground and the background and separate the objects of interest in optical images. The difference-aware decoder imitates the human eye in three stages using the multi-level features output by the encoder. In Stage A, the first branch decoder of the difference-aware decoder is used to obtain a guide map. The highest-level features are enhanced with a novel field expansion module and a dual residual attention module, and are combined with the lowest-level features to obtain the guide map. In Stage B, the other branch decoder adopts a middle feature fusion module to make trade-offs between textural details and semantic information and generate background-aware features. In Stage C, the proposed difference-aware extractor, consisting of a difference guidance model and a difference enhancement module, fuses the guide map from Stage A and the background-aware features from Stage B, to enlarge the differences between the foreground and the background and output a final detection result. The results demonstrate that the difference-aware decoder can achieve a higher accuracy than the other state-of-the-art binary segmentation methods for these tasks.
... Recently, many studies have put emphasis on learning from a fixed single view with either auxiliary tasks [18,32,34,58,67,15], uncertainty discovery [20,26], or vision transformers [56,38], and their proposed methods have achieved significant progress. Nevertheless, due to the visual insignificance of camouflaged objects and the contextual insufficiency of single-view input, they still struggle to precisely recognize camouflaged objects, and their performance needs to be improved. ...
... In recent years, some researches applied multi-task learning to detect the camouflaged objects. Le et al. [18] introduced the binary ...
... Datasets. We use four COD datasets, CAMO [18], CHAMELEON [42], COD10K [6] and NC4K [32]. Evaluation Metrics. ...
Preprint
Full-text available
Recent research about camouflaged object detection (COD) aims to segment highly concealed objects hidden in complex surroundings. The tiny, fuzzy camouflaged objects result in visually indistinguishable properties. However, current single-view COD detectors are sensitive to background distractors. Therefore, blurred boundaries and variable shapes of the camouflaged objects are challenging to be fully captured with a single-view detector. To overcome these obstacles, we propose a behavior-inspired framework, called Multi-view Feature Fusion Network (MFFN), which mimics the human behaviors of finding indistinct objects in images, i.e., observing from multiple angles, distances, perspectives. Specifically, the key idea behind it is to generate multiple ways of observation (multi-view) by data augmentation and apply them as inputs. MFFN captures critical boundary and semantic information by comparing and fusing extracted multi-view features. In addition, our MFFN exploits the dependence and interaction between views and channels. Specifically, our methods leverage the complementary information between different views through a two-stage attention module called Co-attention of Multi-view (CAMV). And we design a local-overall module called Channel Fusion Unit (CFU) to explore the channel-wise contextual clues of diverse feature maps in an iterative manner. The experiment results show that our method performs favorably against existing state-of-the-art methods via training with the same data. The code will be available at https://github.com/dwardzheng/MFFN_COD.
... The number of images in the CHAMELEON dataset is small, with only 76 published images collected from the internet [43]. The CAMO dataset contains 1250 images in eight categories [44]. In 2020, Fan et al. proposed the COD10K universal camouflaged object dataset, which has 78 subclasses of 10K images, and this dataset is very precise and challenging [39]. ...
... We evaluate the CAMO [44] and COD10K [39] datasets with relatively large data volumes. CAMO includes 1250 images, and COD10K includes 5066 camouflage images. ...
Article
Full-text available
In recent years, protecting important objects by simulating animal camouflage has been widely employed in many fields. Therefore, camouflaged object detection (COD) technology has emerged. COD is more difficult to achieve than traditional object detection techniques due to the high degree of fusion of objects camouflaged with the background. In this paper, we strive to more accurately and efficiently identify camouflaged objects. Inspired by the use of magnifiers to search for hidden objects in pictures, we propose a COD network that simulates the observation effect of a magnifier called the MAGnifier Network (MAGNet). Specifically, our MAGNet contains two parallel modules: the ergodic magnification module (EMM) and the attention focus module (AFM). The EMM is designed to mimic the process of a magnifier enlarging an image, and AFM is used to simulate the observation process in which human attention is highly focused on a particular region. The two sets of output camouflaged object maps were merged to simulate the observation of an object by a magnifier. In addition, a weighted key point area perception loss function, which is more applicable to COD, was designed based on two modules to give greater attention to the camouflaged object. Extensive experiments demonstrate that compared with 19 cutting-edge detection models, MAGNet can achieve the best comprehensive effect on eight evaluation metrics in the public COD dataset. Additionally, compared to other COD methods, MAGNet has lower computational complexity and faster segmentation. We also validated the model’s generalization ability on a military camouflaged object dataset constructed in-house. Finally, we experimentally explored some extended applications of COD.
... The COD10K [36] and CAMO [37] benchmarks are the two largest camouflage datasets, covering artificial camouflage, animal camouflage and insect camouflage. After discarding the duplicate insect images, we assembled a total of 1900 ecological images and ground truth pairs from the COD10K and CAMO for this study, which included 10 orders of typical camouflaged insects, such as Coleoptera, Hemiptera, Odonata, Neuroptera, Hymenoptera, Diptera, Lepidoptera, Phasmatodea, Mantodea, and Orthoptera ( Figure 2). ...
Article
Full-text available
Accurately segmenting an insect from its original ecological image is the core technology restricting the accuracy and efficiency of automatic recognition. However, the performance of existing segmentation methods is unsatisfactory in insect images shot in wild backgrounds on account of challenges: various sizes, similar colors or textures to the surroundings, transparent body parts and vague outlines. These challenges of image segmentation are accentuated when dealing with camouflaged insects. Here, we developed an insect image segmentation method based on deep learning termed the progressive refinement network (PRNet), especially for camouflaged insects. Unlike existing insect segmentation methods, PRNet captures the possible scale and location of insects by extracting the contextual information of the image, and fuses comprehensive features to suppress distractors, thereby clearly segmenting insect outlines. Experimental results based on 1900 camouflaged insect images demonstrated that PRNet could effectively segment the camouflaged insects and achieved superior detection performance, with a mean absolute error of 3.2%, pixel-matching degree of 89.7%, structural similarity of 83.6%, and precision and recall error of 72%, which achieved improvements of 8.1%, 25.9%, 19.5%, and 35.8%, respectively, when compared to the recent salient object detection methods. As a foundational technology for insect detection, PRNet provides new opportunities for understanding insect camouflage, and also has the potential to lead to a step progress in the accuracy of the intelligent identification of general insects, and even being an ultimate insect detector.
... Kajiura et al. [31] improve the detection accuracy by exploring the uncertainties of pseudo-edge and pseudo-map labels. Zhuge et al. [32] propose a cube-like architecture for COD, which combines attention fusion and X-shaped connections to sufficiently integrate multi-layer features. 2) Two-stage strategy: the search-and-identification strategy [1] is an early practice for modeling the COD task. ...
Article
Full-text available
This paper introduces deep gradient network (DGNet), a novel deep framework that exploits object gradient supervision for camouflaged object detection (COD). It decouples the task into two connected branches, i.e., a context and a texture encoder. The essential connection is the gradient-induced transition, representing a soft grouping between context and texture features. Benefiting from the simple but efficient framework, DGNet outperforms existing state-of-the-art COD models by a large margin. Notably, our efficient version, DGNet-S, runs in real-time (80 fps) and achieves comparable results to the cutting-edge model JCSOD-CVPR21 with only 6.82% parameters. The application results also show that the proposed DGNet performs well in the polyp segmentation, defect detection, and transparent object segmentation tasks. The code will be made available at https://github.com/GewelsJI/DGNet .
... Thus, we extended SPNet to the RGB-D COD task. We conducted this experiment on three public benchmark datasets for camouflaged object detection: (i) CHAMELEON [100], consisting of 76 camouflaged images, (ii) CAMO [104], with 1250 images (1000 for training, 250 for testing) in 8 categories, and (iii) COD10K [100], with 5066 camouflaged images (3040 for training, 2026 for testing) in 5 super-classes and 69 sub-classes. Following the same setting in Ref. [105], we divided the training and testing sets and then trained our model on the training set. ...
Article
Full-text available
Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information to boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet .
... With the development of reconnaissance technology, the research and development level of intelligent instruments is increasing. Intelligent instruments have provided significant improvements in the precision and accuracy of recognition, and higher performance is now required for the camouflage technology of military targets [5]. Camouflage technology includes protection camouflage, deformation camouflage, digital camouflage and so on. ...
... Dataset Preparation: To better illustrate the generalizability of our approach, we evaluate the effectiveness of our approach on both COD and SOD benchmarks. We choose four widely used COD datasets, i.e., CAMO [32], CHAMELEON [57], COD10K [8], and NC4K [43], as well as four RGB-D SOD datasets, i.e., NLPR [48], NJU2K [25], STERE [46], and SIP [9]. For the unimodal COD dataset, we compare with both RGB COD models and RGB-D SOD models retrained on the COD datasets with the same source-free depth D_sf. ...
Preprint
Full-text available
Depth cues are known to be useful for visual perception. However, direct measurement of depth is often impracticable. Fortunately, though, modern learning-based methods offer promising depth maps by inference in the wild. In this work, we adapt such depth inference models for object segmentation using the objects' "pop-out" prior in 3D. The "pop-out" is a simple composition prior that assumes objects reside on the background surface. Such compositional prior allows us to reason about objects in the 3D space. More specifically, we adapt the inferred depth maps such that objects can be localized using only 3D information. Such separation, however, requires knowledge about contact surface which we learn using the weak supervision of the segmentation mask. Our intermediate representation of contact surface, and thereby reasoning about objects purely in 3D, allows us to better transfer the depth knowledge into semantics. The proposed adaptation method uses only the depth model without needing the source data used for training, making the learning process efficient and practical. Our experiments on eight datasets of two challenging tasks, namely camouflaged object detection and salient object detection, consistently demonstrate the benefit of our method in terms of both performance and generalizability.
... Camouflage breaking aims to discover objects that are hiding in a scene, where the visual appearance barely provides informative cues [3,24,25]. As such, the objects only become apparent when they start to move, which makes the task closely related to motion segmentation. ...
Conference Paper
Biological visual systems are exceptionally good at perceiving objects that undergo changes in appearance, pose, and position. In this paper, we aim to train a computational model with similar functionality to segment the moving objects in videos. We target the challenging cases when objects are "invisible" in the RGB video sequence; for example, breaking camouflage, where visual appearance from a static scene can barely provide informative cues, or locating the objects as a whole even under partial occlusion. To this end, we make the following contributions: (i) In order to train a motion segmentation model, we propose a scalable pipeline for generating synthetic training data, significantly reducing the requirements for labour-intensive annotations; (ii) We introduce a dual-head architecture (hybrid of ConvNets and Transformer) that takes a sequence of optical flows as input, and learns to segment the moving objects even when they are partially occluded or stop moving at certain points in videos; (iii) We conduct thorough ablation studies to analyse the critical components in data simulation, and validate the necessity of Transformer layers for aggregating temporal information and for developing object permanence. When evaluating on the MoCA camouflage dataset, the model trained only on synthetic data demonstrates state-of-the-art segmentation performance, even outperforming strong supervised approaches. In addition, we also evaluate on the popular benchmarks DAVIS2016 and SegTrackv2, and show competitive performance despite only processing optical flow. The project webpage is at: www.robots.ox.ac.uk/~vgg/research/simo/
... A further method combines RGB-D and SOBS [26] to detect camouflaged objects. In addition to the existing segmentation branch, the anabranch network [27] provides a classification branch that predicts the probability of a camouflaged target being present in an image, which is combined with the segmentation branch to increase accuracy. However, its accuracy is lower on some datasets. ...
Article
Detecting camouflaged moving objects in a video sequence is a big challenge in computer vision. Detecting moving objects against a dynamic background is also very difficult, as the background itself may be detected as a moving object. Mask RCNN is a deep neural network that solves the problem of separating instances of the same object in machine learning or computer vision; thus, it separates different objects in a video. It is an extension of Faster RCNN in which an extra branch is added to create an object mask simultaneously along with the bounding box and classifier. Given an input, Mask RCNN outputs a rectangle around the object, the class to which the object belongs, and the object mask. This article introduces the Mask RCNN algorithm along with some modifications for target detection against a dynamic background and for camouflage handling. After target object detection, contrast-limited adaptive histogram equalization is applied. Morphological operations are used to improve the results. For both challenges, quantitative and qualitative measures were obtained and compared with existing algorithms. Our method efficiently detects the moving object from the input sequence and gives the best results in both situations.
... We evaluate the proposed architecture on four COD datasets: CAMO-Test (Le et al. 2019), CPD1K-Test (Zheng et al. 2019), CHAMELEON, COD10K-Test (Fan et al. 2020a) and five SOD datasets: ECSSD (Yan et al. 2013), PASCAL-S (Li et al. 2014), DUT-OMRON (Li et al. 2014), DUTS, HKU-IS-Test (Li and Yu 2015). Six metrics are adopted to evaluate the performance of TINet and other models. ...
Article
Camouflaged objects, similar to the background, show indefinable boundaries and deceptive textures, which increases the difficulty of detection task and makes the model rely on features with more information. Herein, we design a texture label to facilitate our network for accurate camouflaged object segmentation. Motivated by the complementary relationship between texture labels and camouflaged object labels, we propose an interactive guidance framework named TINet, which focuses on finding the indefinable boundary and the texture difference by progressive interactive guidance. It maximizes the guidance effect of refined multi-level texture cues on segmentation. Specifically, texture perception decoder (TPD) makes a comprehensive analysis of texture information in multiple scales. Feature interaction guidance decoder (FGD) interactively refines multi-level features of camouflaged object detection and texture detection level by level. Holistic perception decoder (HPD) enhances FGD results by multi-level holistic perception. In addition, we propose a boundary weight map to help the loss function pay more attention to the object boundary. Sufficient experiments conducted on COD and SOD datasets demonstrate that the proposed method performs favorably against 23 state-of-the-art methods.
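One common way to construct a boundary weight map like the one mentioned above is to up-weight pixels whose local neighborhood disagrees with the ground-truth value; the sketch below is a generic construction under that assumption, not necessarily TINet's exact scheme.

```python
import torch
import torch.nn.functional as F

def boundary_weight_map(mask: torch.Tensor, kernel: int = 31, gamma: float = 5.0) -> torch.Tensor:
    """mask: B x 1 x H x W float tensor with binary ground-truth values.
    Pixels near the object boundary, where the local average differs from the
    pixel value, receive larger weights for use in a weighted loss."""
    local_mean = F.avg_pool2d(mask, kernel_size=kernel, stride=1, padding=kernel // 2)
    return 1.0 + gamma * torch.abs(local_mean - mask)
```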
Preprint
In many binary segmentation tasks, most CNN-based methods use a U-shape encoder-decoder network as their basic structure. They ignore two key problems when the encoder exchanges information with the decoder: one is the lack of an interference control mechanism between them, and the other is the failure to consider the disparity of the contributions from different encoder levels. In this work, we propose a simple yet general gated network (GateNet) to tackle them all at once. With the help of multi-level gate units, the valuable context information from the encoder can be selectively transmitted to the decoder. In addition, we design a gated dual branch structure to build cooperation among the features of different levels and improve the discrimination ability of the network. Furthermore, we introduce a "Fold" operation to improve the atrous convolution and form a novel folded atrous convolution, which can be flexibly embedded in ASPP or DenseASPP to accurately localize foreground objects of various scales. GateNet can be easily generalized to many binary segmentation tasks, including general and specific object segmentation and multi-modal segmentation. Without bells and whistles, our network consistently performs favorably against the state-of-the-art methods under 10 metrics on 33 datasets of 10 binary segmentation tasks.
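A multi-level gate unit of the kind described above can be pictured as an encoder-to-decoder connection modulated by a learned gate; the module below is a hypothetical sketch (names and layout are illustrative, not the GateNet code), assuming the encoder and decoder features share the same shape.

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Hypothetical sketch: encoder features are scaled by a gate computed from the
    concatenated encoder and decoder features before being passed to the decoder."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([enc_feat, dec_feat], dim=1))  # B x 1 x H x W gate map
        return enc_feat * g  # suppress distracting encoder responses before decoding
```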
Chapter
The camouflaged object detection (COD) task is challenging because of the high similarity between target and background. Most existing COD methods are based on transfer learning from salient object detection (SOD) networks, which is not efficient for the COD task. It is also difficult to accurately capture the edge information of the object after coarse-grained localization of the camouflaged object. In this paper, we propose a novel network, the Attention Guided Fusion Network (AGFNet), for the COD task. We use low-level and high-level features to extract edge and semantic information. To solve the problem of discriminating and localizing the camouflaged object, we adopt a dual-attention module, which can selectively determine the more discriminative information of the camouflaged object. In addition, our method applies a module that fuses edge and semantic information for refinement to generate sharp edges. The experiments demonstrate the effectiveness and superiority of the proposed network over state-of-the-art methods.
Chapter
Photobombing occurs very often in photography. This causes inconvenience to the main person(s) in the photos. Therefore, there is a legitimate need to remove the photobombing from such images to produce a pleasing image. In this paper, the aim is to conduct a benchmark on this problem. To this end, we first collect a dataset of images with undesired and distracting elements which require the removal of photobombing. Then, we annotate the photobombed regions which should be removed. Next, different image inpainting methods are leveraged to remove the photobombed regions and reconstruct the image. We further invited professional photoshoppers to remove the unwanted regions. These photoshopped images are considered as the ground truth. In our benchmark, several performance metrics are leveraged to compare the results of different methods with the ground truth. The experiments provide insightful results which demonstrate the effectiveness of inpainting methods in this particular problem. Keywords: Photobombing removal, Image inpainting, Benchmark, Performance metrics
Chapter
Full-text available
Infrared (IR) spectroscopic imaging offers label-free visualization of sample heterogeneity via spatially localized chemical information. This spatial-spectral data set is amenable to computational algorithms that highlight functional properties of the sample. Although Fourier transform IR (FT-IR) imaging provides reliable analytical information over a wide spectral profile, long data acquisition times are a major challenge impeding broad adoptability. Discrete frequency (DF) IR imaging is considerably faster, first by reducing the total number of spectral frequencies acquired to only those necessary for the task, and second by using substantially higher optical power via IR lasers. Further acceleration of imaging is hindered by high laser noise and usually relies on time-consuming averaging of ensemble measurements to achieve useful signal-to-noise ratio (SNR). Here, we develop a novel convolutional neural network (CNN) architecture capable of denoising discrete frequency infrared (DFIR) images in real-time, removing the need for excessive co-averaging, thereby reducing the total data acquisition time accordingly. Our architecture is based on dilated residual block network (DRB-Net), which outperforms state-of-the-art CNN models for the image denoising task. To validate the robustness of DRB-Net, we demonstrate its efficacy on various unseen samples including SU-8 targets, polymers, cells, and prostate tissues. Our findings demonstrate that DRB-Net recovers high-quality data from noisy input without supervision and with minimal computation time. Keywords: Infrared imaging, Image enhancement, Computer vision, Denoising
Article
Camouflaged object detection (COD) aims to detect/segment camouflaged objects embedded in the environment, which has attracted increasing attention over the past decades. Although several COD methods have been developed, they still suffer from unsatisfactory performance due to the intrinsic similarities between the foreground objects and background surroundings. In this paper, we propose a novel Feature Aggregation and Propagation Network (FAP-Net) for camouflaged object detection. Specifically, we propose a Boundary Guidance Module (BGM) to explicitly model the boundary characteristic, which can provide boundary-enhanced features to boost the COD performance. To capture the scale variations of the camouflaged objects, we propose a Multi-scale Feature Aggregation Module (MFAM) to characterize the multi-scale information from each layer and obtain the aggregated feature representations. Furthermore, we propose a Cross-level Fusion and Propagation Module (CFPM). In the CFPM, the feature fusion part can effectively integrate the features from adjacent layers to exploit the cross-level correlations, and the feature propagation part can transmit valuable context information from the encoder to the decoder network via a gate unit. Finally, we formulate a unified and end-to-end trainable framework where cross-level features can be effectively fused and propagated for capturing rich context information. Extensive experiments on three benchmark camouflaged datasets demonstrate that our FAP-Net outperforms other state-of-the-art COD models. Moreover, our model can be extended to the polyp segmentation task, and the comparison results further validate the effectiveness of the proposed model in segmenting polyps. The source code and results will be released at https://github.com/taozh2017/FAPNet .
Chapter
Although CNN-based camouflaged object detection (COD) methods have made great progress in recent years, their prediction maps usually contain incomplete detail information due to the similarity between the camouflaged object and the background. To alleviate this, a CNN-based framework named SACF-Net is designed for COD via cross-level fusion that facilitates the detection of camouflaged object detail. On the one hand, the low-level features contain abundant edge detail information to distinguish the camouflaged object from the background. On the other hand, the Polarized Self-Attention (PSA) mechanism is introduced to refine high-level features that contain extensive semantic information to enhance inner details and performance. Finally, cross-level complementary fusion is performed progressively to generate prediction maps in a top-down manner. Extensive experiments on four COD datasets show that the proposed method is better than the state-of-the-art methods. Keywords: Camouflaged object detection, Self-attention, Cross-level fusion
Chapter
We present OSFormer, the first one-stage transformer framework for camouflaged instance segmentation (CIS). OSFormer is based on two key designs. First, we design a location-sensing transformer (LST) to obtain the location label and instance-aware parameters by introducing the location-guided queries and the blend-convolution feed-forward network. Second, we develop a coarse-to-fine fusion (CFF) to merge diverse context information from the LST encoder and CNN backbone. Coupling these two components enables OSFormer to efficiently blend local features and long-range context dependencies for predicting camouflaged instances. Compared with two-stage frameworks, our OSFormer reaches 41% AP and achieves good convergence efficiency without requiring enormous training data, i.e., only 3,040 samples under 60 epochs. Code link: https://github.com/PJLallen/OSFormer. Keywords: Camouflage, Instance segmentation, Transformer
Chapter
We present a systematic study on a new task called dichotomous image segmentation (DIS), which aims to segment highly accurate objects from natural images. To this end, we collected the first large-scale DIS dataset, called DIS5K, which contains 5,470 high-resolution (e.g., 2K, 4K or larger) images covering camouflaged, salient, or meticulous objects in various backgrounds. DIS is annotated with extremely fine-grained labels. Besides, we introduce a simple intermediate supervision baseline (IS-Net) using both feature-level and mask-level guidance for DIS model training. IS-Net outperforms various cutting-edge baselines on the proposed DIS5K, making it a general self-learned supervision network that can facilitate future research in DIS. Further, we design a new metric called human correction efforts (HCE) which approximates the number of mouse clicking operations required to correct the false positives and false negatives. HCE is utilized to measure the gap between models and real-world applications and thus can complement existing metrics. Finally, we conduct the largest-scale benchmark, evaluating 16 representative segmentation models, providing a more insightful discussion regarding object complexities, and showing several potential applications (e.g., background removal, art design, 3D reconstruction). Hoping these efforts can open up promising directions for both academic and industries. Project page: https://xuebinqin.github.io/dis/index.html.
Chapter
What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY.
Article
Salient Object Detection (SOD) has witnessed remarkable improvement during the past decade. However, RGB-based SOD methods may fail for real-world applications in some extreme environments like low-light conditions and cluttered backgrounds. Thermal (T) images can capture the heat radiation from the surface of the objects and overcome such extreme situations. Therefore, some researchers introduce the T modality to the SOD task. Existing RGB-T SOD methods fail to explicitly explore multi-scale complementary saliency cues from dual modalities and lack the full explorations of individual RGB and T modalities. To deal with such problems, we propose the Three-stream Interaction Decoder Network (TIDNet) for the RGB-T SOD task. Specifically, the feature maps from the encoder branches are fed to the three-stream interaction decoder for in-depth saliency exploration, catching the single modality and multi-modality saliency cues. For single modality decoder streams, Contextual-enhanced Channel Reduction units (CCR) firstly reduce the channel dimension of feature maps from RGB and T modalities, reducing the computational burden and discriminatively enriching the multi-scale information. For the multi-modality decoder stream, Multi-scale Cross Modality Fusion (MCMF) unit is proposed to explore the complementary multi-scale information from RGB and T modalities. Then Internal and Multiple Decoder Interaction (IMDI) units further dig the specified and complementary saliency cues from the three-stream decoder. Three-stream deep supervision has been deployed on each feature level to facilitate the training strategy. Comprehensive experiments show our method outperforms fifteen state-of-the-art methods in terms of seven metrics. The codes and models are available at https://github.com/huofushuo/TIDNet.
Article
Camouflaged objects share very similar colors but have different semantics with the surroundings. Cognitive scientists observe that both the global contour (i.e., boundary) and the local pattern (i.e., texture) of camouflaged objects are key cues to help humans find them successfully. Inspired by the cognitive scientist's observation, we propose a novel boundary-and-texture enhancement network (FindNet) for camouflaged object detection (COD) from single images. Different from most of existing COD methods, FindNet embeds both the boundary-and-texture information into the camouflaged object features. The boundary enhancement (BE) module is leveraged to focus on the global contour of the camouflaged object, and the texture enhancement (TE) module is utilized to focus on the local pattern. The enhanced features from BE and TE, which complement each other, are combined to obtain the final prediction. FindNet performs competently on various conditions of COD, including slightly clear boundaries but very similar textures, fuzzy boundaries but slightly differentiated textures, and simultaneous fuzzy boundaries and textures. Experimental results exhibit clear improvements of FindNet over fifteen state-of-the-art methods on four benchmark datasets, in terms of detection accuracy and boundary clearness. The code will be publicly released.
Article
Most existing methods mainly input images into a CNN backbone to obtain image features. However, compared with convolutional features, the recently emerging transformer features can more accurately express the meaningful features of images. In this paper, we use a transformer backbone to capture multiple feature layers of an image, and design an Object Localization and Edge Refinement (OLER) Network for saliency detection. Our network is divided into two stages, the first stage for object positioning and the second stage for refining their boundaries. In the first stage, we directly apply multiple feature layers to identify salient regions, where we design an Information Multiple Selection (IMS) module to capture saliency cues for each feature layer. The IMS module contains multiple pathways, each of which is a judgment of the location of saliency information. After the input feature layer is processed by the IMS module, its potential salient object information is mined. The second stage consists of two modules, namely the edge generation module and the edge refinement module. The edge generation module takes the original image and saliency map as inputs, and then outputs two edge maps focusing on different edge ranges. To make the object edges sharp, the original image, initial saliency map and two edge maps are fed into the edge refinement module, and the final saliency map is output. Our network as a whole is relatively simple and easy to build without involving complex components. Experimental results on five public datasets demonstrate that our method has tremendous advantages in terms of not only significantly improving detection accuracy, but also achieving better detection efficiency. The code will be available at https://github.com/CKYiu/OLER.
Article
In this paper, we introduce a practical system for interactive video object mask annotation, which can support multiple back-end methods. To demonstrate the generalization of our system, we introduce a novel approach for video object annotation. Our proposed system takes scribbles at a chosen key-frame from the end-users via a user-friendly interface and produces masks of corresponding objects at the key-frame via the Control-Point-based Scribbles-to-Mask (CPSM) module. The object masks at the key-frame are then propagated to other frames and refined through the Multi-Referenced Guided Segmentation (MRGS) module. Last but not least, the user can correct wrong segmentation at some frames, and the corrected mask is continuously propagated to other frames in the video via the MRGS to produce the object masks at all video frames.
Article
In this paper, we investigate the interesting yet challenging problem of camouflaged instance segmentation. To this end, we first annotate the available CAMO dataset at the instance level. We also embed the data augmentation in order to increase the number of training samples. Then, we train different state-of-the-art instance segmentation on the CAMO-instance data. Last but not least, we develop an interactive user interface which demonstrates the performance of different state-of-the-art instance segmentation methods on the task of camouflaged instance segmentation. The users are able to compare the results of different methods on the given input images. Our work is expected to push the envelope of the camouflage analysis problem.
Article
Full-text available
Visual saliency analysis detects salient regions/objects that attract human attention in natural scenes. It has attracted intensive research in different fields such as computer vision, computer graphics, and multimedia. While many such computational models exist, the focused study of what and how applications can be beneficial is still lacking. In this article, our ultimate goal is thus to provide a comprehensive review of the applications using saliency cues, the so-called attentive systems. We would like to provide a broad vision about saliency applications and what visual saliency can do. We categorize the vast amount of applications into different areas such as computer vision, computer graphics, and multimedia. Intensively covering 200+ publications we survey (1) key application trends, (2) the role of visual saliency, and (3) the usability of saliency into different tasks.
Article
Full-text available
This paper presents a method for detecting salient objects in videos where temporal information in addition to spatial information is fully taken into account. Following recent reports on the advantage of deep features over conventional hand-crafted features, we propose the SpatioTemporal Deep (STD) feature that utilizes local and global contexts over frames. We also propose the SpatioTemporal Conditional Random Field (STCRF) to compute saliency from STD features. STCRF is our extension of CRF toward the temporal domain and formulates the relationship between neighboring regions both in a frame and over frames. STCRF leads to temporally consistent saliency maps over frames, contributing to the accurate detection of the boundaries of salient objects and the reduction of noise in detection. Our proposed method first segments an input video into multiple scales and then computes a saliency map at each scale level using STD features with STCRF. The final saliency map is computed by fusing saliency maps at different scale levels. Our intensive experiments using publicly available benchmark datasets confirm that the proposed method significantly outperforms state-of-the-art methods. We also applied our saliency computation to the video object segmentation task, showing that our method outperforms existing video object segmentation methods.
Conference Paper
Full-text available
Image saliency detection has recently witnessed rapid progress due to deep convolutional neural networks. However, none of the existing methods is able to identify object instances in the detected salient regions. In this paper, we present a salient instance segmentation method that produces a saliency mask with distinct object instance labels for an input image. Our method consists of three steps, estimating saliency map, detecting salient object contours and identifying salient object instances. For the first two steps, we propose a multiscale saliency refinement network, which generates high-quality salient region masks and salient object contours. Once integrated with multiscale combinatorial grouping and a MAP-based subset optimization framework, our method can generate very promising salient object instance segmentation results. To promote further research and evaluation of salient instance segmentation, we also construct a new database of 1000 images and their pixelwise salient instance annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks for salient region detection as well as on our new dataset for salient instance segmentation.
Conference Paper
Full-text available
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the datasets of HMDB51 (69.4%) and UCF101 (94.2%). We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices.
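As a rough illustration of the sparse temporal sampling and video-level consensus described above, here is a minimal Python sketch; function names such as `sample_snippet_indices` and `segment_consensus` are illustrative stand-ins, not the authors' code, and a full ConvNet is replaced by random per-snippet scores.

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments=3, rng=None):
    """Sparse temporal sampling: divide the video into equal-length segments
    and draw one snippet (frame index) uniformly from each segment."""
    if rng is None:
        rng = np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
            for i in range(num_segments)]

def segment_consensus(snippet_scores):
    """Video-level prediction: average the per-snippet class scores."""
    return np.mean(np.stack(snippet_scores, axis=0), axis=0)

# toy usage: 3 snippets from a 90-frame clip, each scored over 5 classes
idx = sample_snippet_indices(90, num_segments=3)
scores = [np.random.rand(5) for _ in idx]   # stand-in for per-snippet ConvNet outputs
video_prediction = segment_consensus(scores)
```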
Conference Paper
Full-text available
Salient object detection has recently witnessed substantial progress due to powerful features extracted using deep convolutional neural networks (CNNs). However, existing CNN-based methods operate at the patch level instead of the pixel level. Resulting saliency maps are typically blurry, especially near the boundary of salient objects. Furthermore, image patches are treated as independent samples even when they are overlapping, giving rise to significant redundancy in computation and storage. In this CVPR 2016 paper, we propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The first stream directly produces a saliency map with pixel-level accuracy from an input image. The second stream extracts segment-wise features very efficiently, and better models saliency discontinuities along object boundaries. Finally, a fully connected CRF model can be optionally incorporated to improve spatial coherence and contour localization in the fused result from these two streams. Experimental results demonstrate that our deep model significantly improves the state of the art.
Article
Full-text available
The analysis and evaluation of camouflage performance is an important procedure in digital camouflage pattern design, as it helps to improve the design quality of camouflage patterns. In this paper, we propose a novel framework that uses the nonlinear fusion of multiple image features to quantitatively evaluate the degree to which the target and surrounding background differ with respect to background-related and internal features. In our framework, background-related features are first formulated as a measure of conspicuousness, which is calculated and quantized by the saliency detection method, whereas internal features refer to the interior saliency of camouflage textures, such as lines and other regular patterns. These two features are fused to evaluate the camouflage effect. A subjective evaluation is carried out as the baseline of our evaluation model. Experimental results show that our camouflage evaluation framework accords with the human visual perception mechanism, and is an effective method for evaluating camouflage pattern design.
Article
Full-text available
To detect moving objects with camouflage coloration, a scheme based on an optical flow model is proposed. First, the optical flow model is used to model the motion patterns of the object and the background. Second, the magnitude and location of the optical flow are used to cluster the motion patterns, yielding an initial object detection result. Finally, with the location and scale of the object as state variables, a Kalman filter is used to improve detection performance and produce the final result. Experimental results show that the algorithm handles moving camouflaged object detection satisfactorily.
Article
Full-text available
Camouflage is an attempt to blend the texture of a foreground object into the texture of the background. Camouflage detection (decamouflaging) methods are used to detect foreground objects hidden in the background image. In this paper, the authors present a survey of camouflage detection methods for different applications and areas.
Conference Paper
Full-text available
Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be trained with multiscale features extracted using a popular deep learning framework, convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.
Conference Paper
Full-text available
This paper deals with foreground object segmentation in the context of moving camera sequences. The method that we propose computes a foreground object segmentation in a MAP-MRF framework between foreground and background classes. We use region-based models to model the foreground object and the background region that surrounds the object. Moreover, the global background of the sequence is also included in the classification process by using a pixel-wise color GMM. We compute the foreground segmentation for each of the frames by using a Bayesian classification and a graph-cut regularization between the classes, where the prior probability maps for both foreground and background are included in the formulation, thus using the cumulative knowledge of the object from the segmentation obtained in the previous frames. The results presented in the paper show how the false positive and false negative detections are reduced, while the robustness of the system is improved thanks to the use of the prior probability maps in the classification process.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Conference Paper
Full-text available
Most existing bottom-up methods measure the foreground saliency of a pixel or region based on its contrast within a local context or the entire image, whereas a few methods focus on segmenting out background regions and thereby salient objects. Instead of considering the contrast between the salient objects and their surrounding regions, we consider both foreground and background cues in a different way. We rank the similarity of the image elements (pixels or regions) with foreground cues or background cues via graph-based manifold ranking. The saliency of the image elements is defined based on their relevances to the given seeds or queries. We represent the image as a closed-loop graph with superpixels as nodes. These nodes are ranked based on the similarity to background and foreground queries, based on affinity matrices. Saliency detection is carried out in a two-stage scheme to extract background regions and foreground salient objects efficiently. Experimental results on two large benchmark databases demonstrate that the proposed method performs well against state-of-the-art methods in terms of accuracy and speed. We also create a more difficult benchmark database containing 5,172 images to test the proposed saliency model and make this database publicly available with this paper for further studies in the saliency field.
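The ranking step referred to above has a simple closed form in the graph-based manifold-ranking literature. The sketch below assumes the unnormalized-Laplacian variant f* = (D − αW)⁻¹y and a precomputed superpixel affinity matrix; the paper's exact affinity construction and two-stage scheme are not reproduced here.

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.99):
    """Closed-form graph-based manifold ranking.
    W : (n, n) symmetric affinity matrix over superpixel nodes.
    y : (n,)  indicator vector of query (seed) nodes.
    Returns one ranking score per node; higher = more relevant to the queries."""
    D = np.diag(W.sum(axis=1))
    # f* = (D - alpha * W)^{-1} y   (unnormalized-Laplacian variant)
    return np.linalg.solve(D - alpha * W, y)

# toy usage: 4 nodes on a chain graph, node 0 is the query
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0])
scores = manifold_ranking(W, y)
```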
Article
Full-text available
In this paper, we present a novel foreground object detection scheme that integrates top-down information based on the expectation maximization (EM) framework. In this generalized EM framework, the top-down information is incorporated in an object model. Based on the object model and the state of each target, a foreground model is constructed. This foreground model can augment foreground detection for the camouflage problem. Thus, an object's state-specific Markov random field (MRF) model is constructed for detection based on the foreground model and the background model. This MRF model depends on the latent variables that describe each object's state. The maximization of the MRF model is the M-step in the EM framework. Besides fusing spatial information, this MRF model can also adjust the contribution of the top-down information for detection. To obtain the detection result using this MRF model, sampling importance resampling is used to sample the latent variables, and the EM framework refines the detection iteratively. In addition, our method does not need any prior information about the moving object, because we use the detection result of the moving object to incorporate domain knowledge of object shapes into the construction of the top-down information. Moreover, in our method, a kernel density estimation (KDE)-Gaussian mixture model (GMM) hybrid model is proposed to construct the probability density functions of the background and the moving object model. As a background model, it has some advantages over GMM- and KDE-based methods. Experimental results demonstrate the capability of our method, particularly in handling the camouflage problem.
Conference Paper
Full-text available
Reliable estimation of visual saliency allows appropriate processing of images without prior knowledge of their contents, and thus remains an important step in many computer vision tasks including image segmentation, object recognition, and adaptive compression. We propose a regional contrast based saliency extraction algorithm, which simultaneously evaluates global contrast differences and spatial coherence. The proposed algorithm is simple, efficient, and yields full resolution saliency maps. Our algorithm consistently outperformed existing saliency detection methods, yielding higher precision and better recall rates, when evaluated using one of the largest publicly available data sets. We also demonstrate how the extracted saliency map can be used to create high quality segmentation masks for subsequent image processing.
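A heavily simplified, per-color version of the global-contrast idea can be sketched as follows; it omits the region segmentation and spatial-coherence weighting of the full algorithm, so it is only a toy stand-in, with the bin count chosen arbitrarily.

```python
import numpy as np

def global_color_contrast_saliency(img, bins=12):
    """Toy global color-contrast saliency: each quantized color is scored by its
    distance to all other colors, weighted by how frequently those colors occur."""
    h, w, _ = img.shape
    q = (img.astype(np.int32) // (256 // bins)).reshape(-1, 3)   # quantize colors
    colors, inverse, counts = np.unique(q, axis=0,
                                        return_inverse=True, return_counts=True)
    freq = counts / counts.sum()
    dist = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=-1)
    color_sal = (dist * freq[None, :]).sum(axis=1)               # contrast vs. all colors
    sal = color_sal[inverse].reshape(h, w)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)

# usage: saliency = global_color_contrast_saliency(np.asarray(image_uint8))
```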
Article
Full-text available
Image retrieval is an active research area in image processing, pattern recognition, and computer vision. To retrieve more similar images from digital image databases effectively, this paper uses local HSV color and gray-level co-occurrence matrix (GLCM) texture features. The image is divided into sub-blocks of equal size, and the color and texture features of each sub-block are computed. The color of each sub-block is extracted by quantizing the HSV color space into non-equal intervals and is represented by a cumulative color histogram. The texture of each sub-block is obtained using the gray-level co-occurrence matrix. An integrated matching scheme based on the Most Similar Highest Priority (MSHP) principle is used to compare the query and target images: the adjacency matrix of a bipartite graph is formed from the sub-blocks of the query and target images and used for matching. The Euclidean distance measure is used to retrieve similar images. As the experimental results indicate, the proposed technique outperforms other retrieval schemes in terms of average precision.
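A hedged sketch of such a per-block color-plus-texture descriptor is given below. It assumes scikit-image ≥ 0.19 (where `greycomatrix` was renamed `graycomatrix`) and uses an illustrative uniform HSV quantization and GLCM settings rather than the paper's exact non-equal intervals and parameters.

```python
import numpy as np
from skimage.color import rgb2hsv, rgb2gray
from skimage.feature import graycomatrix, graycoprops  # 'greycomatrix' in older scikit-image

def block_color_texture_features(block_rgb, hsv_bins=(8, 3, 3)):
    """Per-block descriptor: a quantized HSV colour histogram concatenated with
    a few GLCM texture statistics (quantization/intervals are assumptions)."""
    hsv = rgb2hsv(block_rgb)
    hist, _ = np.histogramdd(hsv.reshape(-1, 3),
                             bins=hsv_bins, range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel() / hist.sum()

    gray = (rgb2gray(block_rgb) * 255).astype(np.uint8)
    glcm = graycomatrix(gray, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    texture = np.array([graycoprops(glcm, p).mean()
                        for p in ("contrast", "homogeneity", "energy", "correlation")])
    return np.concatenate([hist, texture])
```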
Conference Paper
Full-text available
Texture segmentation is a difficult problem, as is apparent from camouflage pictures. A textured region can contain texture elements of various sizes, each of which can itself be textured. We approach this problem using a bottom-up aggregation framework that combines structural characteristics of texture elements with filter responses. Our process adaptively identifies the shape of texture elements and characterizes them by their size, aspect ratio, orientation, brightness, etc., and then uses various statistics of these properties to distinguish between different textures. At the same time, our process uses the statistics of filter responses to characterize textures. In our process the shape measures and the filter responses crosstalk extensively. In addition, a top-down cleaning process is applied to avoid mixing the statistics of neighboring segments. We tested our algorithm on real images and demonstrate that it can accurately segment regions that contain challenging textures.
Article
Salient object detection aims to detect the main objects in a given image. In this paper, we propose an approach that integrates semantic priors into the salient object detection process. The method first obtains an explicit saliency map that is refined by explicit semantic priors learned from data. Then an implicit saliency map is constructed using a trained model that maps implicit semantic priors embedded in superpixel features to saliency values. Next, the fusion saliency map is computed by adaptively fusing both the explicit and implicit semantic maps. The final saliency map is computed via a post-processing refinement step. Experimental results demonstrate the effectiveness of the proposed method; in particular, it achieves competitive performance against state-of-the-art baselines on three challenging datasets, namely ECSSD, HKUIS, and iCoSeg.
Conference Paper
Focusing only on semantic instances that are salient in a scene benefits robot navigation and self-driving cars more than looking at all objects in the whole scene. This paper pushes the envelope on salient regions in a video by decomposing them into semantically meaningful components, namely, semantic salient instances. We provide the baseline for the new task of video semantic salient instance segmentation (VSSIS), that is, the Semantic Instance - Salient Object (SISO) framework. The SISO framework is simple yet efficient, leveraging the advantages of two different segmentation tasks, i.e., semantic instance segmentation and salient object segmentation, and eventually fusing them for the final result. In SISO, we introduce a sequential fusion that looks at overlapping pixels between semantic instances and salient regions to obtain non-overlapping instances one by one. We also introduce a recurrent instance propagation to refine the shapes and semantic meanings of instances, and an identity tracking to maintain both the identity and the semantic meaning of instances over the entire video. Experimental results demonstrate the effectiveness of our SISO baseline, which can handle occlusions in videos. In addition, to tackle the task of VSSIS, we augment the DAVIS-2017 benchmark dataset by assigning semantic ground truth for salient instance labels, obtaining the SEmantic Salient Instance Video (SESIV) dataset. Our SESIV dataset consists of 84 high-quality video sequences with pixel-wise per-frame ground-truth labels.
Article
Object detection, including objectness detection (OD), salient object detection (SOD), and category-specific object detection (COD), is one of the most fundamental yet challenging problems in the computer vision community. Over the last several decades, great efforts have been made by researchers to tackle this problem, due to its broad range of applications for other computer vision tasks such as activity or event recognition, content-based image retrieval and scene understanding, etc. While numerous methods have been presented in recent years, a comprehensive review for the proposed high-quality object detection techniques, especially for those based on advanced deep-learning techniques, is still lacking. To this end, this article delves into the recent progress in this research field, including 1) definitions, motivations, and tasks of each subdirection; 2) modern techniques and essential research trends; 3) benchmark data sets and evaluation metrics; and 4) comparisons and analysis of the experimental results. More importantly, we will reveal the underlying relationship among OD, SOD, and COD and discuss in detail some open questions as well as point out several unsolved challenges and promising future works.
Article
Given a set of images that contain objects from a common category, object co-segmentation aims at automatically discovering and segmenting such common objects from each image. During the past few years, object co-segmentation has received great attention in the computer vision community. However, the existing approaches are usually designed with misleading assumptions, unscalable priors, or subjective computational models, which do not have sufficient robustness for dealing with complex and unconstrained real-world image contents. This paper proposes a novel two-stage co-segmentation framework, mainly for addressing the robustness issue. In the proposed framework, we first introduce the concept of union background and use it to improve the robustness for suppressing the image backgrounds contained by the given image groups. Then, we also weaken the requirement for the strong prior knowledge by using the background prior instead. This can improve the robustness when scaling up for the unconstrained image contents. Based on the weak background prior, we propose a novel MR-SGS model, i.e., manifold ranking with self-learnt graph structure, which can infer suitable graph structures in a data-driven manner rather than building the fixed graph structure relying on the subjective design. Such capacity is critical for further improving the robustness in inferring the foreground/background probability of each image pixel. Comprehensive experiments and comparisons with other state-of-the-art approaches can demonstrate the effectiveness of the proposed work.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
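A minimal PyTorch sketch of the core fully convolutional idea follows: a classification backbone is reused as a dense feature extractor, a 1x1 scoring convolution replaces the classifier, and the coarse score map is upsampled to the input resolution. This simplified variant omits the paper's skip connections, uses a ResNet-18 backbone purely for brevity, and assumes torchvision >= 0.13 for the `weights=None` argument.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class TinyFCN(nn.Module):
    """Minimal FCN-style model: backbone features -> 1x1 per-pixel scores -> upsample."""
    def __init__(self, num_classes=21):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)    # dense class scores

    def forward(self, x):
        h, w = x.shape[-2:]
        scores = self.classifier(self.features(x))            # coarse (h/32, w/32) map
        return F.interpolate(scores, size=(h, w),
                             mode="bilinear", align_corners=False)

# usage: logits = TinyFCN()(torch.randn(1, 3, 224, 224))   # -> (1, 21, 224, 224)
```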
Conference Paper
Salient object detection has increasingly become a popular topic in cognitive and computational sciences, including computer vision and artificial intelligence research. In this paper, we propose integrating semantic priors into the salient object detection process. Our algorithm consists of three basic steps. Firstly, the explicit saliency map is obtained based on the semantic segmentation refined by the explicit saliency priors learned from the data. Next, the implicit saliency map is computed based on a trained model which maps the implicit saliency priors embedded into regional features with the saliency values. Finally, the explicit semantic map and the implicit map are adaptively fused to form a pixel-accurate saliency map which uniformly covers the objects of interest. We further evaluate the proposed framework on two challenging datasets, namely, ECSSD and HKUIS. The extensive experimental results demonstrate that our method outperforms other state-of-the-art methods.
Conference Paper
This paper presents a novel end-to-end 3D fully convolutional network for salient object detection in videos. The proposed network uses 3D filters in the spatiotemporal domain to directly learn both spatial and temporal information to have 3D deep features, and transfers the 3D deep features to pixel-level saliency prediction, outputting saliency voxels. In our network, we combine the refinement at each layer and deep supervision to efficiently and accurately detect salient object boundaries. The refinement module recurrently enhances to learn contextual information into the feature map. Applying deeply-supervised learning to hidden layers, on the other hand, improves details of the intermediate saliency voxel, and thus the saliency voxel is refined progressively to become finer and finer. Intensive experiments using publicly available benchmark datasets confirm that our network outperforms state-of-the-art methods. The proposed saliency model also effectively works for video object segmentation.
Article
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
Conference Paper
The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. In addition to all of this, the ability to detect motion is nearly instantaneous. While there has been much recent progress in motion segmentation, it still appears we are far from human capabilities. In this work, we derive from first principles a likelihood function for assessing the probability of an optical flow vector given the 2D motion direction of an object. This likelihood uses a novel combination of the angle and magnitude of the optical flow to maximize the information about how objects are moving differently. Using this new likelihood and several innovations in initialization, we develop a motion segmentation algorithm that beats current state-of-the-art methods by a large margin. We compare to five state-of-the-art methods on two established benchmarks, and a third new data set of camouflaged animals, which we introduce to push motion segmentation to the next level.
Conference Paper
Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs to obtain deep features with the two key learning objectives, inter-class dispersion and intra-class compactness, which are very essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve state-of-the-art accuracy on several important face recognition benchmarks: Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and the MegaFace Challenge. In particular, our new approach achieves the best results on MegaFace (the largest public-domain face benchmark) under the protocol of a small training set (under 500,000 images and under 20,000 persons), significantly improving the previous results and setting a new state-of-the-art for both face recognition and face verification tasks.
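A minimal PyTorch sketch of the center-loss term is shown below. Here the class centers are ordinary learnable parameters updated by the optimizer, whereas the paper uses a dedicated center update rule, so this is an approximation of the formulation rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Keep one learnable center per class and penalize the squared distance
    between each deep feature and the center of its ground-truth class."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # features: (batch, feat_dim), labels: (batch,) class indices
        return 0.5 * (features - self.centers[labels]).pow(2).sum(dim=1).mean()

# joint objective (lam balances the two terms, as in the paper's joint supervision):
# loss = nn.CrossEntropyLoss()(logits, labels) + lam * center_loss(features, labels)
```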
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
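The RPN head structure described above can be sketched as follows: a 3x3 convolution slides over the shared feature map, followed by two sibling 1x1 convolutions that predict, for each of the k anchors at every position, an objectness score and four box offsets. This sketch uses one logit per anchor (the paper's original formulation uses a two-way softmax, i.e. 2k scores), and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of a Region Proposal Network head over a shared feature map."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        t = torch.relu(self.conv(feature_map))
        return self.objectness(t), self.bbox_deltas(t)

# usage: obj, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
# obj: (1, 9, 38, 50), deltas: (1, 36, 38, 50)
```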
Article
To effectively detect camouflaged targets against complex backgrounds, a target detection method based on 3D convexity is proposed in this article. The method uses a new operator that exploits the gray-level structure of the target represented by its convexity, applies an appropriate threshold together with median filtering on the gray-level surface of the target image to suppress background noise, and thereby achieves effective detection and identification of convex targets. Experimental results show that this method can successfully detect camouflaged targets in complex backgrounds, outperforming classic edge detection methods. As a new camouflaged target detection and evaluation technique, it can provide the necessary elements for the design and implementation of camouflage technology and promote its development.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
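A late-fusion sketch of the two-stream idea is given below: one network scores a single RGB frame (appearance), another scores a stack of optical-flow fields (motion), and their softmax class scores are averaged. Averaging is one of the fusion schemes reported in the paper (an SVM fusion is also reported); the backbones themselves are left abstract here.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Late fusion of a spatial (appearance) stream and a temporal (motion) stream."""
    def __init__(self, spatial_net: nn.Module, temporal_net: nn.Module):
        super().__init__()
        self.spatial_net = spatial_net      # input: (B, 3, H, W) RGB frame
        self.temporal_net = temporal_net    # input: (B, 2L, H, W) stacked optical flow

    def forward(self, rgb_frame, flow_stack):
        s = torch.softmax(self.spatial_net(rgb_frame), dim=1)
        t = torch.softmax(self.temporal_net(flow_stack), dim=1)
        return (s + t) / 2                  # averaged class scores (late fusion)
```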
Conference Paper
It is known that purely low-level saliency cues such as frequency do not lead to good salient object detection results; high-level knowledge must be adopted for the successful discovery of task-independent salient objects. In this paper, we propose an efficient way to combine such high-level saliency priors and low-level appearance models. We obtain the high-level saliency prior with the objectness algorithm to find potential object candidates without the need of category information, and then enforce the consistency among the salient regions using a Gaussian MRF with the weights scaled by diverse density that emphasizes the influence of potential foreground pixels. Our model obtains saliency maps that assign high scores to the whole salient object, and achieves state-of-the-art performance on benchmark datasets covering various foreground statistics.
Article
We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
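A toy numpy illustration of the procedure follows, using one hidden layer of sigmoid units and a squared-error objective; the dimensions, learning rate, and iteration count are arbitrary choices for the sketch, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                 # 4 examples, 3 inputs
T = rng.normal(size=(4, 2))                 # desired output vectors
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))
lr = 0.05

for _ in range(200):
    H = sigmoid(X @ W1)                     # hidden-unit activations (forward pass)
    Y = H @ W2                              # actual output vector
    dY = Y - T                              # dE/dY for E = 0.5 * ||Y - T||^2
    dW2 = H.T @ dY                          # gradient w.r.t. hidden-to-output weights
    dH = (dY @ W2.T) * H * (1 - H)          # error propagated back through the sigmoid
    dW1 = X.T @ dH                          # gradient w.r.t. input-to-hidden weights
    W1 -= lr * dW1                          # repeated weight adjustments
    W2 -= lr * dW2
```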
Conference Paper
The traditional way to evaluate camouflage texture effectiveness is subjective evaluation, which is tedious and inconvenient for guiding texture design. In this paper, a systematic and rational method for guiding and evaluating camouflage texture design is proposed. First, a new camouflage texture evaluation method based on WSSIM (weighted structural similarity) is given to assess the effect of a camouflage texture. Then, natural image features shared between the camouflage texture and the background image are computed to help guide the design of the camouflage texture. Preliminary experimental results show that the proposed method is helpful for the evaluation and design of camouflage textures.
We propose a novel approach to learn and recognize natural scene categories. Unlike previous work [9,17], it does not require experts to annotate the training set. We represent the image of a scene by a collection of local regions, denoted as codewords obtained by unsupervised learning. Each region is represented as part of a "theme". In previous work, such themes were learnt from hand-annotations of experts, while our method learns the theme distributions as well as the codewords distribution over the themes without supervision. We report satisfactory categorization performances on a large set of 13 categories of complex scenes.
Article
Psychophysical and physiological evidence indicates that the visual system of primates and humans has evolved a specialized processing focus moving across the visual scene. This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention. Specifically, we propose the following: (1) A number of elementary features, such as color, orientation, direction of movement, disparity etc. are represented in parallel in different topographical maps, called the early representation. (2) There exists a selective mapping from the early topographic representation into a more central non-topographic representation, such that at any instant the central representation contains the properties of only a single location in the visual scene, the selected location. We suggest that this mapping is the principal expression of early selective visual attention. One function of selective attention is to fuse information from different maps into one coherent whole. (3) Certain selection rules determine which locations will be mapped into the central representation. The major rule, using the conspicuity of locations in the early representation, is implemented using a so-called Winner-Take-All network. Inhibiting the selected location in this network causes an automatic shift towards the next most conspicuous location. Additional rules are proximity and similarity preferences. We discuss how these rules can be implemented in neuron-like networks and suggest a possible role for the extensive back-projection from the visual cortex to the LGN.
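A toy numpy sketch of the winner-take-all selection with inhibition of return is shown below: the most conspicuous location wins, its neighbourhood is suppressed, and attention shifts to the next most conspicuous location. The suppression radius and number of shifts are arbitrary, and the conspicuity map is assumed to be given.

```python
import numpy as np

def attention_scanpath(conspicuity, num_shifts=5, inhibition_radius=2):
    """Winner-take-all with inhibition of return over a 2D conspicuity map."""
    sal = conspicuity.astype(float).copy()
    visited = []
    for _ in range(num_shifts):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)   # winner-take-all selection
        visited.append((y, x))
        y0, y1 = max(0, y - inhibition_radius), y + inhibition_radius + 1
        x0, x1 = max(0, x - inhibition_radius), x + inhibition_radius + 1
        sal[y0:y1, x0:x1] = -np.inf                          # inhibit the selected location
    return visited

# usage: fixations = attention_scanpath(np.random.rand(32, 32))
```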
Conference Paper
Camouflaging is the process of disguising an object to blend it with its surroundings; a well-camouflaged object is very difficult for the human visual system to see. De-camouflaging is the identification and recognition of a camouflaged object. To de-camouflage an image, texture analysis is carried out on different parts of the image, and the camouflaged object is detected based on the results. This paper focuses mainly on the application of de-camouflaging in the defense domain, which is important for protecting soldiers: soldiers wear uniforms matched to the battlefield, so they cannot easily be spotted by enemies. We propose a system to identify a camouflaged object and extract it from the background efficiently. De-camouflaging is normally done in an unsupervised way, meaning that no prior knowledge of either the background or the camouflaged object is available. Camouflaged objects are surface objects given hiding characteristics by merging their appearance with the nature of the background, i.e., its color, texture, shape, motion, etc. We use a texture analysis method to isolate the camouflaged object.
Conference Paper
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
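As a hedged practical companion, the following sketch matches these features between two images using the OpenCV implementation; `cv2.SIFT_create` requires opencv-python ≥ 4.4, and the ratio test is the standard practice from the later journal version of this work rather than from this paper itself.

```python
import cv2

def match_keypoints(img1_gray, img2_gray, ratio=0.75):
    """Detect SIFT keypoints/descriptors in two grayscale images and keep
    nearest-neighbour matches that pass the distance-ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1_gray, None)   # scale/rotation-invariant keys
    kp2, des2 = sift.detectAndCompute(img2_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]          # reject ambiguous matches
    return kp1, kp2, good
```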