Conference Paper

Fully convolutional networks for semantic segmentation

Authors: Jonathan Long, Evan Shelhamer, Trevor Darrell

... Compared with manually designed methods, deep learning-based segmentation methods perform considerably better in many applications [4]-[6]. The fully convolutional network (FCN) [7], which consists of alternating convolutional and pooling layers, was the first end-to-end semantic segmentation neural network. To alleviate the loss of local detail caused by the pooling operations in FCNs, U-Net [8] was proposed; it preserves local detail at the original input resolution by connecting low-level and high-level features through skip connections. ...
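For readers unfamiliar with the skip-connection idea mentioned in the excerpt above, the following PyTorch sketch shows one decoder step that upsamples coarse features and concatenates the matching encoder features before convolving. The channel sizes and layer choices are illustrative assumptions, not the configuration of U-Net or of any model cited here.

import torch
import torch.nn as nn

class SkipUpBlock(nn.Module):
    # One decoder step with a skip connection: upsample the coarse features,
    # concatenate the matching encoder features, then convolve.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch // 2 + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                       # recover spatial resolution
        x = torch.cat([x, skip], dim=1)      # reinject low-level detail from the encoder
        return self.conv(x)

decoder = SkipUpBlock(in_ch=128, skip_ch=64, out_ch=64)
out = decoder(torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32))   # -> (1, 64, 32, 32)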
... In this section, to verify the effectiveness of the modules of popular deep networks (e.g., CNN, EfficientNet-V2, and CSwin Transformer) for citrus tree canopy segmentation, classic CNN models (including FCN [7], U-Net [8], FPN [46], PSPNet [34], DeepLab-V3 [9], BiseNet-V2 [47], DANet [11], and HRCNet [35]) and Transformer models (including SwinT [19], CSwinT [20], Beit [18], and ResT [17]) were selected for comparison. Among them, DANet, PSPNet, and DeepLab-V3 all use ResNet-50 [41] as the backbone. ...
Article
Full-text available
Existing convolutional neural network (CNN)-based methods usually tend to ignore the contextual information for citrus tree canopy segmentation. Although popular Transformer models are helpful in extracting global semantic information, they ignore the edge details between citrus tree canopies and the background. To address these issues, we propose a parallel fusion neural network considering both local and global semantic information for citrus tree canopy segmentation from 3D data, which are derived by unmanned aerial vehicle (UAV) mapping. In the feature extraction stage, a parallel architecture, concatenated by EfficientNet-V2 and CSwin Transformer, is used to extract local and global information of citrus trees. In the feature fusion stage, we design a coordinate attention-based fusion module to retain the contextual information and local edge details of citrus tree canopies. Additionally, to exaggerate the exclusivity between tree canopies and complex backgrounds, 3D data incorporating RGB imagery and canopy height model derived by UAV photogrammetry are generated for citrus tree canopy segmentation. Experimental results indicate that the proposed method performs considerably better than methods based only on CNN or Transformer models, and is superior to state-of-the-art methods (e.g., the highest mIoU score of 93.46%).
... With the emergence of large-scale datasets and computing resources, Convolutional Neural Networks (CNNs) [22] have become the mainstream in visual recognition. Long et al. [23] introduced a semantic segmentation method based on Fully Convolutional Networks (FCN), which addressed image segmentation at the semantic level and classified images at the pixel level. Ronneberger et al. [24] proposed the U-Net, an encoder-decoder network capable of fusing low-resolution and high-resolution features, thus improving accuracy; however, such networks can incur increased computational burdens when processing large images and may not adequately capture pixel-level detail, which limits segmentation accuracy and the recognition of small structures. ...
... where C_in and C_out are the numbers of input and output channels, respectively. For the fully connected (linear) layer, the FLOPs are determined by the input feature dimension D_in and the output feature dimension D_out and are calculated accordingly. Since the pooling layer has no learnable parameters, its FLOP formula is usually simpler: for a pooling layer applied to a feature map of size H × W with C channels and stride s, the FLOPs are given by Equation (22) for max pooling and Equation (23) for average pooling. ...
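As a rough illustration of the FLOP accounting discussed in the excerpt above, the short sketch below counts operations for convolutional, linear, and pooling layers. The constants used (a multiply-accumulate counted as two FLOPs, one comparison per element for max pooling) are common conventions assumed here, not necessarily the exact definitions behind the cited paper's Equations (22) and (23).

# Rough FLOP estimates for common layers; the counting conventions are assumptions.

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    # Each output element needs c_in * k * k multiply-accumulates.
    return 2 * c_in * k * k * c_out * h_out * w_out

def linear_flops(d_in, d_out):
    # Fully connected (linear) layer: d_in * d_out multiply-accumulates.
    return 2 * d_in * d_out

def pool2d_flops(c, h, w, k, s, mode="max"):
    # Output grid is roughly (h // s) * (w // s) per channel; max pooling costs
    # about k*k comparisons per output, average pooling k*k adds plus one divide.
    n_out = (h // s) * (w // s) * c
    per_out = k * k if mode == "max" else k * k + 1
    return n_out * per_out

if __name__ == "__main__":
    print(conv2d_flops(64, 128, 3, 56, 56))   # a typical 3x3 convolution
    print(linear_flops(2048, 1000))           # a classifier head
    print(pool2d_flops(64, 112, 112, 2, 2))   # 2x2 max pooling, stride 2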
Article
Full-text available
Floods represent a significant natural hazard with the potential to inflict substantial damage on human society. The swift and precise delineation of flood extents is of paramount importance for effectively supporting flood response and disaster relief efforts. In comparison to optical sensors, Synthetic Aperture Radar (SAR) sensor data acquisition exhibits superior capabilities, finding extensive application in flood detection research. Nonetheless, current methodologies exhibit limited accuracy in flood boundary detection, leading to elevated instances of both false positives and false negatives, particularly in the detection of smaller-scale features. In this study, we proposed an advanced flood detection method called FWSARNet, which leveraged a deformable convolutional visual model with Sentinel-1 SAR images as its primary data source. This model centered around deformable convolutions as its fundamental operation and took inspiration from the structural merits of the Vision Transformer. Through the introduction of a modest number of supplementary parameters, it significantly extended the effective receptive field, enabling the comprehensive capture of intricate local details and spatial fluctuations within flood boundaries. Moreover, our model employed a multi-level feature map fusion strategy that amalgamated feature information from diverse hierarchical levels. This enhancement substantially augmented the model’s capability to encompass various scales and boost its discriminative power. To validate the effectiveness of the proposed model, experiments were conducted using the ETCI2021 dataset. The results demonstrated that the Intersection over Union (IoU) and mean Intersection over Union (mIoU) metrics for flood detection achieved impressive values of 80.10% and 88.47%, respectively. These results surpassed the performance of state-of-the-art (SOTA) models. Notably, in comparison to the best results documented on the official ETCI2021 dataset competition website, our proposed model in this paper exhibited a remarkable 3.29% improvement in flood prediction IoU. The experimental outcomes underscore the capability of the FWSARNet method outlined in this paper for flood detection using Synthetic Aperture Radar (SAR) data. This method notably enhances the accuracy of flood detection, providing essential technical and data support for real-world flood monitoring, prevention, and response efforts.
... The feature map from the convolutional layers of the region detector in Fast R-CNN serves to generate candidate regions for the RPN. Building upon this feature map, several additional convolutional layers are added to create the region proposal network, which is itself a fully convolutional network (FCN) (Long et al., 2015). ...
... Two parallel loss functions, softmax and smooth L1, classify and regress each region of interest (RoI), respectively. In this way, the model obtains a class label and more precise coordinates, length, and width for each RoI (Long et al., 2015). ...
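For illustration, here is a minimal sketch of such a two-branch RoI loss: cross-entropy for classification plus smooth L1 for box regression, applied only to foreground RoIs. The tensor shapes, the single shared box prediction per RoI, and the weighting factor lam are simplifying assumptions, not the exact formulation of the cited detectors.

import torch
import torch.nn.functional as F

def roi_multitask_loss(cls_logits, bbox_preds, labels, bbox_targets, lam=1.0):
    # Classification branch: softmax cross-entropy over object classes plus background.
    cls_loss = F.cross_entropy(cls_logits, labels)
    # Regression branch: smooth L1 on box offsets, only for foreground RoIs (label > 0).
    fg = labels > 0
    if fg.any():
        reg_loss = F.smooth_l1_loss(bbox_preds[fg], bbox_targets[fg])
    else:
        reg_loss = bbox_preds.sum() * 0.0
    return cls_loss + lam * reg_loss

# Toy batch: 4 RoIs, 3 object classes plus background (class 0).
cls_logits   = torch.randn(4, 4)
bbox_preds   = torch.randn(4, 4)           # (dx, dy, dw, dh) per RoI
labels       = torch.tensor([1, 0, 2, 3])
bbox_targets = torch.randn(4, 4)
print(roi_multitask_loss(cls_logits, bbox_preds, labels, bbox_targets))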
Article
Full-text available
Recently, deep learning algorithms have become increasingly instrumental in autonomous driving by identifying and acknowledging road entities to ensure secure navigation and decision-making. Autonomous car datasets play a vital role in developing and evaluating perception systems. Nevertheless, the majority of current datasets are acquired using Light Detection and Ranging (LiDAR) and camera sensors. Utilizing deep neural networks yields remarkable outcomes in object recognition, especially when applied to analyze data from cameras and LiDAR sensors, which perform poorly under adverse weather conditions such as rain, fog, and snow due to the sensor wavelengths. This paper aims to evaluate the ability to use a RADAR dataset for detecting objects in adverse weather conditions, when LiDAR and cameras may fail to be effective. This paper presents two experiments for object detection using the Faster R-CNN architecture with a ResNet-50 backbone and COCO evaluation metrics. Experiment 1 is object detection over only one class, while Experiment 2 is object detection over eight classes. The results show that, as expected, the average precision (AP) of detecting one class (47.2) is better than that of detecting eight classes (27.4). Comparing my results from Experiment 1 to the literature results, which achieved an overall AP of 45.77, my result was slightly better in accuracy, mainly due to hyper-parameter optimization. The outcomes of object detection and recognition based on RADAR indicate the potential effectiveness of RADAR data in automotive applications, particularly in adverse weather conditions, where vision and LiDAR may encounter limitations.
... Considering the weed-crop segmentation task, current deep learning-based semantic segmentation models for weed-crop detection [19][20][21][22][23][24] face challenges in achieving the desired results. This is because the majority of current deep learning-based weed-crop segmentation models rely on Fully Convolutional Neural Networks (FCNs) [25] and U-Net [26], which inherit the following limitations: (1) These models suffer from the gradient vanishing problem, especially when the number of layers increases. In this situation, these networks lose spatial information, which is crucial for accurate segmentation. ...
... Considering the aforementioned challenges in the weed-crop segmentation task, a framework is proposed that addresses the limitations of the FCN- [25] and U-Net [26]-based networks. Generally, the proposed framework consists of two main parts: (1) Encoder and (2) Decoder. ...
Article
Full-text available
The rapid expansion of the world’s population has resulted in an increased demand for agricultural products which necessitates the need to improve crop yields. To enhance crop yields, it is imperative to control weeds. Traditionally, weed control predominantly relied on the use of herbicides; however, the indiscriminate application of herbicides presents potential hazards to both crop health and productivity. Fortunately, the advent of cutting-edge technologies such as unmanned vehicle technology (UAVs) and computer vision has provided automated and efficient solutions for weed control. These approaches leverage drone images to detect and identify weeds with a certain level of accuracy. Nevertheless, the identification of weeds in drone images poses significant challenges attributed to factors like occlusion, variations in color and texture, and disparities in scale. The utilization of traditional image processing techniques and deep learning approaches, which are commonly employed in existing methods, presents difficulties in extracting features and addressing scale variations. In order to address these challenges, an innovative deep learning framework is introduced which is designed to classify every pixel in a drone image into categories such as weed, crop, and others. In general, our proposed network adopts an encoder–decoder structure. The encoder component of the network effectively combines the Dense-inception network with the Atrous spatial pyramid pooling module, enabling the extraction of multi-scale features and capturing local and global contextual information seamlessly. The decoder component of the network incorporates deconvolution layers and attention units, namely, channel and spatial attention units (CnSAUs), which contribute to the restoration of spatial information and enhance the precise localization of weeds and crops in the images. The performance of the proposed framework is assessed using a publicly available benchmark dataset known for its complexity. The effectiveness of the proposed framework is demonstrated via comprehensive experiments, showcasing its superiority by achieving a 0.81 mean Intersection over Union (mIoU) on the challenging dataset.
... The decoder then receives the encoded feature map and determines the size of the expected map. Decoding is carried out by the deconvolution technique described by Long et al. [32] and Noh et al. [15]. Skip connections were introduced by Ronneberger et al. [44] to link encoded features with their corresponding decoded features, which can enrich the segmentation output. ...
... In this part, we will describe the overall design of our network. We chose the encoder-decoder architecture because it is straightforward, user-friendly, and generally follows earlier work [15,32,44,45]. We will first outline the general structure of our network before describing how the IERM and MSCM modules affect the effectiveness of semantic segmentation. ...
Article
Full-text available
Multi-scale representation provides an effective answer to the scale variation of objects and entities in semantic segmentation. The ability to capture long-range pixel dependency facilitates semantic segmentation. In addition, semantic segmentation necessitates the effective use of pixel-to-pixel similarity in the channel direction to enhance pixel areas. By reviewing the characteristics of earlier successful segmentation models, we discover a number of crucial elements that enhance segmentation model performance, including a robust encoder structure, multi-scale interactions, attention mechanisms, and a robust decoder structure. The attention mechanism of the asymmetric non-local neural network (ANNet) is merged with multi-scale pyramidal modules to accelerate model segmentation while maintaining high accuracy. However, ANNet does not account for the similarity between pixels in the feature map channel direction, making the segmentation accuracy unsatisfactory. As a result, we propose EMSNet, a straightforward convolutional network architecture for semantic segmentation that consists of Integration of enhanced regional module (IERM) and Multi-scale convolution module (MSCM). The IERM module generates weights using four or five-stage feature maps, then fuses the input features with the weights and uses more computation. The similarity of the channel direction feature graphs is also calculated using ANNet’s auxiliary loss function. The MSCM module can more accurately describe the interactions between various channels, capture the interdependencies between feature pixels, and capture the multi-scale context. Experiments prove that we perform well in tests using the benchmark dataset. On Cityscapes test data, we get 82.2% segmentation accuracy. The mIoU in the ADE20k and Pascal VOC datasets are, respectively, 45.58% and 85.46%.
... The primary objective is to detect cracks as early as possible, ensuring timely maintenance, prevention of potential catastrophic failures, and extension of the lifespan of structures. With the rapid development of deep learning [1], [2], [3], [4] and the emergence of segmentation models such as SegNet [5], UNet [6], FCNs [7], and the DeepLab series [8], [9], general semantic segmentation tasks and crack segmentation tasks, in particular, have achieved significant improvements [10], [11], [12]. However, crack segmentation still poses numerous challenges that need addressing [13], [14], [15]. ...
... To demonstrate the effectiveness and superiority of the proposed method, we adopted eight existing and popular methods to compare to the proposed model, including Unet [6], TransUNet [27], FCNs [7], SegNet [5], and DeepCrack [28]. Table II shows results on the CrackTree260 dataset. ...
... Chen et al. [2]: "Semantic image segmentation with deep convolutional nets and fully connected CRFs," arXiv preprint arXiv:1412.7062, 2014. Long, Shelhamer, and Darrell [5]: "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440. ...
... The advent of deep learning introduced convolutional neural networks (CNNs) [4] for segmentation tasks. Early works like Fully Convolutional Networks (FCN) [5] demonstrated effective dense predictions but lacked global context information. Encoder-decoder architectures like U-Net [8] and SegNet [6] addressed this by combining an encoder for feature extraction and a decoder for precise pixel-level predictions. ...
Presentation
Full-text available
Lung cancer represents a significant health threat, demanding accurate diagnostic tools for effective intervention. Precise lung region segmentation stands as a fundamental step in tumor localization, providing crucial insights for comprehensive lung image analysis. This research introduces an innovative approach for lung image segmentation, leveraging ResNet34 as the foundational encoder architecture coupled with a custom-designed decoder. The devised model exhibits proficiency in extracting intricate high-level information and expanding feature maps to recover nuanced segmentation details. Notably, the incorporation of dilated convolutions within the decoder block enhances the receptive field, refining the segmentation accuracy. Experimental evaluation showcases promising outcomes, yielding an Intersection over Union (IoU) score of 60% and a Dice score of 76%. These results underscore the model's potential while signaling avenues for further enhancement and exploration.
... In the past decades, semantic segmentation has achieved remarkable progress thanks to deep neural networks. One well-known method for semantic segmentation using Convolutional Neural Networks (CNNs) is Fully Convolutional Networks (FCN) [15], which replaces the fully connected output layers of the classification framework with convolutional layers, allowing the network to make dense pixel-wise predictions for the image. From then on, a series of segmentation methods based on CNNs were proposed, accompanied by a variety of techniques such as atrous spatial pyramid pooling [16] and pyramid pooling modules [17]. ...
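To make the "fully connected layers replaced by convolutions" idea from the excerpt above concrete, here is a minimal, hypothetical PyTorch sketch; the tiny backbone, channel counts, and number of classes are assumptions for illustration, not the architecture of FCN or of any cited method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    # Minimal fully convolutional head: conv features -> 1x1 conv scores -> upsample.
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(            # stand-in backbone (assumed)
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # A 1x1 convolution plays the role of the fully connected classifier,
        # so class scores are produced at every spatial location.
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        scores = self.classifier(self.features(x))
        # Upsample the coarse score map back to the input resolution.
        return F.interpolate(scores, size=(h, w), mode="bilinear", align_corners=False)

logits = TinyFCN()(torch.randn(1, 3, 64, 64))     # -> shape (1, 21, 64, 64)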
Article
Full-text available
Few-shot Semantic Segmentation (FSS) endeavors to segment novel categories in a query image by referring to a support set comprising only a few annotated examples. Presently, many existing FSS methodologies primarily embrace the prototype learning paradigm and concentrate on optimizing the matching mechanism. However, these approaches tend to overlook the discrimination between the features of the foreground and background. Consequently, the segmentation results are often imprecise when it comes to capturing intricate structures, such as boundaries and small objects. In this study, we introduce the Discriminative Foreground-and-Background feature learning Network (DFBNet) to enhance the distinguishability of bilateral features. DFBNet comprises three major modules: a multi-level self-matching module (MSM), a feature separation module (FSM), and a semantic alignment module (SAM). The MSM generates prior masks separately for the foreground and background, employing a self-matching strategy across different feature levels. These prior masks are subsequently used as scaling factors within the FSM, where the features of the query's foreground and background are independently scaled up and then concatenated along the channel dimension. Furthermore, we incorporate a two-layer Transformer encoder-based semantic alignment module (SAM) in DFBNet to refine the features, thereby creating a greater distinction between the foreground and background features. The performance of DFBNet is evaluated on the PASCAL-5^i and COCO-20^i benchmarks, demonstrating its superiority over existing solutions and establishing new state-of-the-art results in the field of few-shot semantic segmentation. The codes will be released if this paper is accepted.
... Building upon previous research [11][12][13], we define the task of semantic part detection as the classification of individual pixels within the image to determine whether each pixel belongs to a particular part. To achieve this goal, it is essential to establish a suitable representation for each pixel and a classifier for each part. ...
Article
Full-text available
Semantic part detection within an object is of importance in the field of computer vision. This study proposes a novel approach to semantic part detection that starts by employing a convolutional neural network to concatenate a selection of feature maps from the network into a long vector for pixel representation. Using this dedicated pixel representation, we implement a range of techniques, such as Poisson disk sampling for pixel sampling and Poisson matting for pixel label correction. These techniques efficiently facilitate the training of a practical pixel classifier for part detection. Our experimental exploration investigated various factors that affect the model’s performance, including training data labeling (with or without the aid of Poisson matting), hypercolumn representation dimensionality, neural network architecture, post-processing techniques, and pixel classifier selection. In addition, we conducted a comparative analysis of our approach with established object detection methods.
... In order to more comprehensively evaluate the performance of the ASFF-SENet model, FCN-4s [32], PspNet, Deeplabv3+, BC-DUnet [33], and HC-Unet++ [34] were used to conduct comparative experiments. The experimental results are shown in Table 6. ...
Article
Full-text available
Wrinkles, crucial for age estimation and skin quality assessment, present challenges due to their uneven distribution, varying scale, and sensitivity to factors like lighting. To overcome these challenges, this study presents facial wrinkle detection with multiscale spatial feature fusion based on image enhancement and an adaptively spatial feature fusion squeeze-and-excitation Unet network (ASFF-SEUnet) model. Firstly, in order to improve wrinkle features and address the issue of uneven illumination in wrinkle images, an innovative image enhancement algorithm named Coiflet wavelet transform Donoho threshold and improved Retinex (CT-DIR) is proposed. Secondly, the ASFF-SEUnet model is designed to enhance the accuracy of full-face wrinkle detection across all age groups under the influence of lighting factors. It replaces the encoder part of the Unet network with EfficientNet, enabling the simultaneous adjustment of depth, width, and resolution for improved wrinkle feature extraction. The squeeze-and-excitation (SE) attention mechanism is introduced to grasp the correlation and importance among features, thereby enhancing the extraction of local wrinkle details. Finally, the adaptively spatial feature fusion (ASFF) module is incorporated to adaptively fuse multiscale features, capturing facial wrinkle information comprehensively. Experimentally, the method excels in detecting facial wrinkles amid complex backgrounds, robustly supporting facial skin quality diagnosis and age assessment.
... Long et al. [30] proposed FCN, which extended deep learning to the field of image segmentation and realized end-to-end pixel-level semantic segmentation. Noh et al. [31] proposed DeconvNet, which adopts a symmetrical encoder-decoder structure to optimize the FCN. ...
Article
Full-text available
Change detection in high resolution (HR) remote sensing images faces more challenges than in low resolution images because of the variations of land features, which prompts this research on faster and more accurate change detection methods. We propose a pixel-level semantic change detection method to solve the fine-grained semantic change detection for HR remote sensing image pairs, which takes one lightweight semantic segmentation network (LightNet), using the parameter-sharing SiameseNet, as the architecture to carry out pixel-level semantic segmentations for the dual-temporal image pairs and achieve pixel-level change detection based directly on semantic comparison. LightNet consists of four long–short branches, each including lightweight dilated residual blocks and an information enhancement module. The feature information is transmitted, fused, and enhanced among the four branches, where two large-scale feature maps are fused and then enhanced via the channel information enhancement module. The two small-scale feature maps are fused and then enhanced via a spatial information enhancement module, and the four upsampling feature maps are finally concatenated to form the input of the Softmax. We used high resolution remote sensing images of Lake Erhai in Yunnan Province in China, collected by GF-2, to make one dataset with a fine-grained semantic label and a dual-temporal image-pair label to train our model, and the experiments demonstrate the superiority of our method and the accuracy of LightNet; the pixel-level semantic change detection methods are up to 89% and 86%, respectively.
... This method verifies the reliability of the model and can provide an objective evaluation of the landslide detection effect [38,39]. Owing to its robustness, this network is also widely used in remote sensing image segmentation. ...
Article
Full-text available
Accurate and efficient landslide identification is an important basis for landslide disaster prevention and control. Due to the diversity of landslide features, vegetation occlusion, and the complexity of the surrounding surface environment in remote sensing images, deep learning models (such as U-Net) for landslide detection based only on optical remote sensing images will lead to false and missed detection. The detection accuracy is not high, and it is difficult to satisfy the demand. SAR has penetrability, and SAR images are highly sensitive to changes in surface morphology and structure. In this study, a multi-input channel U-Net landslide detection method fusing SAR, optical, and topographic multi-source remote sensing data is proposed. Firstly, a multi-input channel U-Net model fusing SAR multi-source remote sensing data is constructed, then an attention mechanism is introduced into the multi-input channel U-Net to adjust the spatial weights of the feature maps of the multi-source data to emphasize the landslide-related features, and finally, the proposed model is applied to the experimental scene for validation. The experimental results demonstrate that the proposed model combined with SAR multi-source remote sensing data improves the perception ability of landslide features, focuses on learning landslide-related features, improves the accuracy of landslide detection, and reduces the rate of false detections and missed detections. Compared with the traditional U-Net landslide detection method based on SAR multi-source remote sensing data and the traditional U-Net method that disregards SAR multi-source remote sensing data, the proposed method has the best quantitative evaluation indicators. Among them, the proposed method obtained the highest F1 value (66.18%), indicating that fused SAR remote sensing data can provide rich and complementary landslide feature information, simultaneously setting up a multi-channel U-Net model to input multi-source remote sensing data can effectively process landslide feature information. The proposed method can provide theoretical and technical support for landslide disaster prevention and control.
... In recent years, deep learning methods, especially the convolutional neural network (CNN) represented by fully convolutional networks (FCN) (Long et al., 2015), have become the mainstream approach for building extraction from remote sensing images, benefiting from their flexibility and adaptability (Ji et al., 2018). Following the pioneering FCN structure, encoder-decoder architectures for segmentation such as UNet (Ronneberger et al., 2015) and SegNet (Badrinarayanan et al., 2017), which aim to address the coarse-resolution segmentation of FCN-based networks, were also introduced and improved for building extraction (Shi et al., 2022; Qiu et al., 2023; Deng et al., 2023). ...
Article
Full-text available
The density of buildings is an important index to reflect the productivity and prosperity of an economic entity. Automatically monitoring the change and development of buildings through satellite can not only benefit the assessment of the status of urban development but also contribute to suburban construction planning. Apparently, more accurate building extraction performance can be guaranteed with higher-resolution remote sensing images. However, the desired high-resolution images are not always available limited by the remote sensing imaging technology and the expensive cost of updating the sensors and equipment. Therefore, the super-resolution technology, which aims at restoring the high-resolution images from the given low-resolution images, is a promising solution to resolve the dilemma. Therefore, in this paper, we investigate the potential application of super-resolution technology for cross-domain building extraction. The experiment results demonstrate that super-resolution can indeed improve building extraction accuracy.
... Second, the quality and accuracy of data are greatly improved by using aerial surveys and remote sensing technology. Third, deep learning has achieved state-of-the-art performance in a number of vision tasks (Dosovitskiy et al., 2020;Girshick, 2015;Long et al., 2015;Redmon et al., 2016;Ren et al., 2015). However, extracting components from building façades with UAV data and deep learning methods is barely studied in current research (Liu et al., 2020). ...
Article
Full-text available
Rapid access to the basic structure of village buildings is conducive to the investigation of the load-bearing bodies of village houses and provides data support for disaster assessment and post-disaster rescue and reconstruction. The development of computer vision technology provides new ideas and tools for identifying and extracting basic structures of housing buildings. Considering that the original Mask R-CNN ignores the spatial association and relationship of door and window elements, an advanced deep learning model based on Mask R-CNN network is proposed in this paper to detect and segment the door and window structure from the façade images. The improved network architectures integrate the attention mechanism with the original network, containing an improved Coordinate Attention(CA) module and a relationship module-based head network. The experimental results show that the Average Precision(AP) value of the backbone combined with the improved CA module is increased by 0.7% and 0.7% on regression and segmentation tasks respectively, compared with the original Mask R-CNN network. In the head network based on the relationship module, the calculation strategy of the relational module proposed in this paper increases the AP values of detection and segmentation from 76.7% and 77.7% to 80.6% and 80.0%, respectively.
... Notably, Long and his team introduced the concept of Fully Convolutional Networks (FCN), capable of pixel-wise classification through forward propagation. This technology transforms image input into image output, enabling end-to-end segmentation [19]. FCN-based semantic segmentation has attracted substantial research efforts, with novel techniques emerging to facilitate hierarchical feature learning, classification optimization, and the creation of dense predictions for entire images. ...
Article
Full-text available
The field of medical image segmentation, particularly in the context of brain tumor delineation, plays an instrumental role in aiding healthcare professionals with diagnosis and accurate lesion quantification. Recently, Convolutional Neural Networks (CNNs) have demonstrated substantial efficacy in a range of computer vision tasks. However, a notable limitation of CNNs lies in their inadequate capability to encapsulate global and distal semantic information effectively. In contrast, the advent of Transformers, which has established their prowess in natural language processing and computer vision, offers a promising alternative. This is primarily attributed to their self-attention mechanisms that facilitate comprehensive modeling of global information. This research delineates an innovative methodology to augment brain tumor segmentation by synergizing UNET architecture with Transformer technology (denoted as UT), and integrating advanced feature enhancement (FE) techniques, specifically Modified Histogram Equalization (MHE), Contrast Limited Adaptive Histogram Equalization (CLAHE), and Modified Bi-histogram Equalization Based on Optimization (MBOBHE). This integration fosters the development of highly efficient image segmentation algorithms, namely FE1-UT, FE2-UT, and FE3-UT. The methodology is predicated on three pivotal components. Initially, the study underscores the criticality of feature enhancement in the image preprocessing phase. Herein, techniques such as MHE, CLAHE, and MBOBHE are employed to substantially ameliorate the visibility of salient details within the medical images. Subsequently, the UT model is meticulously engineered to refine segmentation outcomes through a customized configuration within the UNET framework. The integration of Transformers within this model is instrumental in imparting contextual comprehension and capturing long-range data dependencies, culminating in more precise and context-sensitive segmentation. Empirical evaluation of the model on two extensively acknowledged public datasets yielded accuracy rates exceeding 99%.
... The intuition behind feature integration is to take advantage of global semantic information captured by fully connected layers and instance-level information preserved by convolutional layers [54]. Ref. [55] merged the features from intermediate and high-level convolutional activations in their convolutional network to exploit low-level details and high-level semantics for image segmentation. Similarly, for localization and segmentation, Ref. [56] concatenated the feature maps of convolutional layers at a pixel as a vector to form a descriptor. ...
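A minimal sketch of such a per-pixel descriptor (often called a hypercolumn) is shown below: feature maps from several layers are upsampled to a common resolution and concatenated channel-wise. The layer names and tensor shapes are toy assumptions, not those of the works cited above.

import torch
import torch.nn.functional as F

def hypercolumn(feature_maps, out_size):
    # Upsample each feature map to `out_size` and stack along the channel axis,
    # so every pixel ends up with a concatenated multi-layer descriptor.
    upsampled = [
        F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
        for f in feature_maps
    ]
    return torch.cat(upsampled, dim=1)            # (N, sum of C_i, H, W)

# Toy activations standing in for low-, mid-, and high-level layers (assumed shapes).
low  = torch.randn(1, 64, 56, 56)
mid  = torch.randn(1, 256, 28, 28)
high = torch.randn(1, 512, 14, 14)
desc = hypercolumn([low, mid, high], out_size=(56, 56))   # -> (1, 832, 56, 56)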
Article
Full-text available
Fine-grained classifiers collect information about inter-class variations to best use the underlying minute and subtle differences. The task is challenging due to the minor differences between the colors, viewpoints, and structure in the same class entities. The classification becomes difficult and challenging due to the similarities between the differences in viewpoint with other classes and its own. This work investigates the performance of landmark traditional CNN classifiers, presenting top-notch results on large-scale classification datasets and comparing them against state-of-the-art fine-grained classifiers. This paper poses three specific questions. (i) Do the traditional CNN classifiers achieve comparable results to fine-grained classifiers? (ii) Do traditional CNN classifiers require any specific information to improve fine-grained ones? (iii) Do current traditional state-of-the-art CNN classifiers improve the fine-grained classification while utilized as a backbone? Therefore, we train the general CNN classifiers throughout this work without introducing any aspect specific to fine-grained datasets. We show an extensive evaluation on six datasets to determine whether the fine-grained classifier can elevate the baseline in their experiments. We provide ablation studies regarding efficiency, number of parameters, flops, and performance.
... In image segmentation settings, most models require the original training images to be annotated with segmentation labels, that is, all pixels are labelled with a ground-truth class. There are several existing approaches to image segmentation, such as Fully Convolutional Networks (Long et al., 2015), U-Net (Ronneberger et al., 2015), and Pyramid Networks (Lin et al., 2017). Existing works have applied these or similar approaches to LCC (Kuo et al., 2018;Rakhlin et al., 2018;Seferbekov et al., 2018;Tong et al., 2020;Wang et al., 2020;Karra et al., 2021); we refer readers to for a more in-depth review of existing work. ...
Article
Full-text available
Land cover classification (LCC) and natural disaster response (NDR) are important issues in climate change mitigation and adaptation. Existing approaches that use machine learning with Earth observation (EO) imaging data for LCC and NDR often rely on fully annotated and segmented datasets. Creating these datasets requires a large amount of effort, and a lack of suitable datasets has become an obstacle in scaling the use of machine learning for EO. In this study, we extend our prior work on Scene-to-Patch models: an alternative machine learning approach for EO that utilizes Multiple Instance Learning (MIL). As our approach only requires high-level scene labels, it enables much faster development of new datasets while still providing segmentation through patch-level predictions, ultimately increasing the accessibility of using machine learning for EO. We propose new multi-resolution MIL architectures that outperform single-resolution MIL models and non-MIL baselines on the DeepGlobe LCC and FloodNet NDR datasets. In addition, we conduct a thorough analysis of model performance and interpretability.
... Sa et al. [14] employed the Faster R-CNN (Faster Regional Convolutional Neural Network) algorithm [15] to detect multi-colored (green, red, yellow) pepper fruits. Chen et al. [16] used fully convolutional networks [17] to count fruits in apple and orange orchards. Bargoti and Underwood [18] utilized Faster R-CNN and transfer learning to estimate yield in apple, mango, and almond orchards. ...
Article
Full-text available
The apple is a delicious fruit with high nutritional value that is widely grown around the world. Apples are traditionally picked by hand, which is very inefficient. The development of advanced fruit-picking robots has great potential to replace manual labor. A major prerequisite for a robot to successfully pick fruits is the accurate identification and positioning of the target fruit. Active laser vision systems based on structured algorithms can achieve higher recognition rates by quickly capturing the three-dimensional information of objects. This study proposes to combine the laser active vision system with the YOLOv5 neural network model to recognize and locate apples on trees. The method obtained accurate two-dimensional pixel coordinates, which, when combined with the active laser vision system, can be converted into three-dimensional world coordinates for apple recognition and positioning. On this basis, we built a picking robot platform equipped with this visual recognition system and carried out a robot picking experiment. The experimental findings showcase the efficacy of the neural network recognition algorithm proposed in this study, which achieves a precision rate of 94%, a mean average precision (mAP) of 92.86%, and a spatial localization accuracy of approximately 4 mm for the visual system. The implementation of this control method in simulated harvesting operations shows the promise of more precise and successful fruit positioning. In summary, the integration of the YOLOv5 neural network model with an active laser vision system presents a novel and effective approach for the accurate identification and positioning of apples. The achieved precision and spatial accuracy indicate the potential for enhanced fruit-harvesting operations, marking a significant step towards the automation of fruit-picking processes.
... These methods fail to model the global context and cannot be trained end to end. Inspired by FCNs [18], CNN-based methods [19][20][21][22][23][24][25] formulate the salient object detection as a pixel-level prediction task and generate the saliency map in an end-to-end manner. Recently, Transformer has also been applied to SOD [26][27][28][29][30]. ...
Article
Full-text available
Salient Object Detection (SOD) aims at identifying the most visually distinctive objects in a scene. However, learning a mapping directly from a raw image to its corresponding saliency map is still challenging. First, the binary annotations of SOD impede the model from learning the mapping smoothly. Second, the annotator's preference introduces noisy labeling in the SOD datasets. Motivated by these, we propose a novel learning framework which consists of the Self-Improvement Training (SIT) strategy and the Augmentation-based Consistent Learning (ACL) scheme. SIT aims at reducing the learning difficulty, which provides smooth labels and improves the SOD model in a momentum-updating manner. Meanwhile, ACL focuses on improving the robustness of models by regularizing the consistency between raw images and their corresponding augmented images. Extensive experiments on five challenging benchmark datasets demonstrate that the proposed framework can play a plug-and-play role in various existing state-of-the-art SOD methods and improve their performances on multiple benchmarks without any architecture modification.
... DL methods employ neural networks to automatically extract features from images with minimal manual intervention. DL has been widely applied to the segmentation of medical images for many years [15]. For example, Lei designed SGU-Net, an ultralight shape-guided network for the segmentation of abdominal medical images [16]. ...
Article
Full-text available
Aortic segmentation from computed tomography (CT) is crucial for facilitating aortic intervention, as it enables clinicians to visualize aortic anatomy for diagnosis and measurement. However, aortic segmentation faces the challenge of variable geometry in space, as the geometric diversity of different diseases and the geometric transformations that occur between raw and measured images. Existing constraint-based methods can potentially solve the challenge, but they are hindered by two key issues: inaccurate definition of properties and inappropriate topology of transformation in space. In this paper, we propose a deformable constraint transport network (DCTN). The DCTN adaptively extracts aortic features to define intra-image constrained properties and guides topological implementation in space to constrain inter-image geometric transformation between raw and curved planar reformation (CPR) images. The DCTN contains a deformable attention extractor, a geometry-aware decoder and an optimal transport guider. The extractor generates variable patches that preserve semantic integrity and long-range dependency in long-sequence images. The decoder enhances the perception of geometric texture and semantic features, particularly for low-intensity aortic coarctation and false lumen, which removes background interference. The guider explores the geometric discrepancies between raw and CPR images, constructs probability distributions of discrepancies, and matches them with inter-image transformation to guide geometric topology in space. Experimental studies on 267 aortic subjects and four public datasets show the superiority of our DCTN over 23 methods. The results demonstrate DCTN’s advantages in aortic segmentation for different types of aortic disease, for different aortic segments, and in the measurement of clinical indexes.
... Therefore, among today's DL models, encoder-decoder-based pixel-level crack detection models (e.g., FCN [33], U-Net [34]) are becoming more popular for improving detection accuracy, as these models can extract the geometrical shape of the cracks while also localizing them. Li et al. proposed a novel encoder-decoder-based model, called FCN, for detecting cracks, in which the VGG19 model was used as the downsampler of the proposed FCN. ...
Preprint
Full-text available
Crack inspection in railway sleepers is crucial for ensuring rail safety and avoiding deadly accidents. Traditional methods for detecting cracks on railway sleepers are very time-consuming and lack efficiency. Therefore, nowadays researchers are paying attention to vision-based algorithms, especially deep learning algorithms. In this work, we adopted the U-net for the first time for detecting cracks on a railway sleeper and proposed a modified U-net architecture named Dense U-net for segmenting the cracks. In the Dense U-net structure, we established several short connections between the encoder and decoder blocks, which enabled the architecture to obtain better pixel information flow. Thus, the model extracted the necessary information in more detail to predict the cracks. We collected images from railway sleepers, processed them into a dataset, and finally trained the model with the images. The model achieved an overall F1-score, precision, recall, and IoU of 86.5%, 88.53%, 84.63%, and 76.31%, respectively. We compared our suggested model with the original U-net, and the results demonstrate that our model outperformed the U-net in both quantitative and qualitative results. Moreover, we considered the necessity of crack severity analysis and measured a few parameters of the cracks (e.g., length, maximum width, area, density). The engineers must know the severity of the cracks to have an idea about the most severe locations and take the necessary steps to repair the badly affected sleepers.
... Since the application of deep learning to semantic segmentation [23], which classifies regions of an entire image, it has been applied in various fields of manufacturing and resource management. The accuracy of semantic segmentation has been evaluated on benchmark datasets, including datasets for cityscapes and road scenes, and increasingly accurate models have been announced for these datasets every year in recent years. ...
Preprint
Full-text available
River bed materials serve multiple environmental functions as a habitat for aquatic invertebrates and fishes. At the same time, the particle size of the bed material reflects the tractive force of the flow regime in a flood and provides useful information for flood control. Traditional river bed particle size surveys, such as sieving, require time and labor to investigate river bed materials. The authors previously proposed a method to classify aerial images taken by unmanned aerial vehicle (UAV) using convolutional neural networks (CNN), and that study showed that terrestrial riverbed material could be classified with high accuracy. In this study, we attempted to classify riverbed materials distributed in shallow waters where the bottom can be seen from UAVs. After training the CNN to classify images with the same grain size into the same class, even when the surface flow types overlapping the riverbed material were different, the total accuracy reached 90.3%. Moreover, the proposed method was applied to a wide-ranging area to determine the distribution of the particle size. In parallel, the microtopography was surveyed using Lidar-UAV, and the relationship between the microtopography and the particle size distribution was discussed. In the steep section, coarse particles were distributed and formed a rapid. Fine particles were deposited on the upstream side of those rapids, where the slope had become gentler due to the damming. There was good agreement between the microtopographical trends and the grain size distribution.
... Faster R-CNN eliminates the need for an external region-proposal algorithm by introducing what is called a Region Proposal Network (RPN). The RPN can be described as a fully convolutional network that takes an image as input and outputs a correspondingly sized segmented output map containing features in the image [24]. This feature map and its corresponding RoIs are subject to a set of k = 9 rectangular anchor boxes representing 3 scales and 3 aspect ratios. ...
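For illustration, the sketch below generates the k = 9 anchor boxes (3 scales × 3 aspect ratios) mentioned above for a single location; the base size and the particular scale and ratio values are common conventions assumed here, not necessarily those used in the cited thesis.

import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    # Generate len(scales) * len(ratios) boxes centred on the origin, as (x1, y1, x2, y2).
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2        # target box area for this scale
            w = np.sqrt(area / ratio)              # width shrinks as the box gets taller
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)                       # shape (9, 4) for the defaults

print(make_anchors().round(1))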
Thesis
Full-text available
This thesis investigates how a tailored CNN can aid autonomous surface vehicles (ASVs) in detecting and classifying maritime traffic for collision avoidance. Several state-of-the-art CNN models are presented and trained on data sets relevant to the above-mentioned objective. Data collected from different sources are used for training these CNN models in pursuit of a well-performing detector. The main data sets are large, general-purpose image sets of ships and boats. A smaller image set is also developed in this thesis. This custom data set is constructed from images taken from a video camera along a predefined path at sea. This includes images along docks and of ships in transit at sea. This data set is then split into training and testing images which are in close relation to each other. Through experiments, variations of the general-purpose data sets are used to train both a 5-layer-deep and a 16-layer-deep CNN model to detect ships in an image.
... Medical Image Segmentation: Most medical segmentation models are neural networks, primarily convolutional neural networks (CNN). These neural networks include fully convolutional networks (FCN) (6) and autoencoders that are based on the U-Net (7). Modified architectures of these models have been used for the semantic segmentation of both medical and non-medical images. ...
Preprint
Full-text available
Brain tumors are one of the deadliest forms of cancer with a mortality rate of over 80%. A quick and accurate diagnosis is crucial to increase the chance of survival. However, in medical analysis, the manual annotation and segmentation of a brain tumor can be a complicated task. Multiple MRI modalities are typically analyzed as they provide unique information regarding the tumor regions. Although these MRI modalities are helpful for segmenting gliomas, they tend to increase overfitting and computation. This paper proposes a region of interest detection algorithm that is implemented during data preprocessing to locate salient features and remove extraneous MRI data. This decreases the input size, allowing for more aggressive data augmentations and deeper neural networks. Following the preprocessing of the MRI modalities, a fully convolutional autoencoder with soft attention segments the different brain MRIs. When these deep learning algorithms are implemented in practice, analysts and physicians cannot differentiate between accurate and inaccurate predictions. Subsequently, test time augmentations and an energy-based model were used for voxel-based uncertainty predictions. Experimentation was conducted on the BraTS benchmarks and achieved state-of-the-art segmentation performance. Additionally, qualitative results were used to assess the segmentation models and uncertainty predictions. The code for this work is made available online at: https://github.com/WeToTheMoon/BrainTumorSegmentation.
Article
Full-text available
Deep learning-based methods for building extraction from remote sensing images have been widely applied in fields such as land management and urban planning. However, extracting buildings from remote sensing images commonly faces challenges due to specific shooting angles. First, there exists a foreground–background imbalance issue, and the model excessively learns features unrelated to buildings, resulting in performance degradation and propagative interference. Second, buildings have complex boundary information, while conventional network architectures fail to capture fine boundaries. In this paper, we designed a multi-task U-shaped network (BFL-Net) to solve these problems. This network enhances the expression of the foreground and boundary features in the prediction results through foreground learning and boundary refinement, respectively. Specifically, the Foreground Mining Module (FMM) utilizes the relationship between buildings and multi-scale scene spaces to explicitly model, extract, and learn foreground features, which can enhance foreground and related contextual features. The Dense Dilated Convolutional Residual Block (DDCResBlock) and the Dual Gate Boundary Refinement Module (DGBRM) individually process the diverted regular stream and boundary stream. The former can effectively expand the receptive field, and the latter utilizes spatial and channel gates to activate boundary features in low-level feature maps, helping the network refine boundaries. The predictions of the network for the building, foreground, and boundary are respectively supervised by ground truth. The experimental results on the WHU Building Aerial Imagery and Massachusetts Buildings Datasets show that the IoU scores of BFL-Net are 91.37% and 74.50%, respectively, surpassing state-of-the-art models.
Article
Full-text available
Reading out neuronal activity from three-dimensional (3D) functional imaging requires segmenting and tracking individual neurons. This is challenging in behaving animals if the brain moves and deforms. The traditional approach is to train a convolutional neural network with ground-truth (GT) annotations of images representing different brain postures. For 3D images, this is very labor intensive. We introduce ‘targeted augmentation’, a method to automatically synthesize artificial annotations from a few manual annotations. Our method (‘Targettrack’) learns the internal deformations of the brain to synthesize annotations for new postures by deforming GT annotations. This reduces the need for manual annotation and proofreading. A graphical user interface allows the application of the method end-to-end. We demonstrate Targettrack on recordings where neurons are labeled as key points or 3D volumes. Analyzing freely moving animals exposed to odor pulses, we uncover rich patterns in interneuron dynamics, including switching neuronal entrainment on and off.
Chapter
In this chapter, we will elaborate on the task of video object tracking (VOT), which aims at producing tight bounding boxes around one or multiple target objects in the video.
Article
Full-text available
Background and Motivation: Coronary artery disease (CAD) has the highest mortality rate; therefore, its diagnosis is vital. Intravascular ultrasound (IVUS) is a high-resolution imaging solution that can image coronary arteries, but the diagnosis software via wall segmentation and quantification has been evolving. In this study, a deep learning (DL) paradigm was explored along with its bias. Methods: Using a PRISMA model, 145 best UNet-based and non-UNet-based methods for wall segmentation were selected and analyzed for their characteristics and scientific and clinical validation. This study computed the coronary wall thickness by estimating the inner and outer borders of the coronary artery IVUS cross-sectional scans. Further, the review explored the bias in the DL system for the first time when it comes to wall segmentation in IVUS scans. Three bias methods, namely (i) ranking, (ii) radial, and (iii) regional area, were applied and compared using a Venn diagram. Finally, the study presented explainable AI (XAI) paradigms in the DL framework. Findings and Conclusions: UNet provides a powerful paradigm for the segmentation of coronary walls in IVUS scans due to its ability to extract automated features at different scales in encoders, reconstruct the segmented image using decoders, and embed the variants in skip connections. Most of the research was hampered by a lack of motivation for XAI and pruned AI (PAI) models. None of the UNet models met the criteria for bias-free design. For clinical assessment and settings, it is necessary to move from a paper-to-practice approach.
Article
Full-text available
Plastic greenhouses (PGs) play a vital role in modern agricultural development by providing a controlled environment for the cultivation of food crops. Their widespread adoption has the potential to revolutionize agriculture and impact the local environment. Accurate mapping and estimation of PG coverage are critical for strategic planning in agriculture. However, the challenge lies in the extraction of small and densely distributed PGs; this is often compounded by issues like irrelevant and redundant features and spectral confusion in high-resolution remote-sensing imagery, such as Gaofen-2 data. This paper proposes an innovative approach that combines the power of a fully convolutional network (FC-DenseNet103) with an image enhancement index. The image enhancement index effectively accentuates the boundary features of PGs in Gaofen-2 satellite images, enhancing the unique spectral characteristics of PGs. FC-DenseNet103, known for its robust feature propagation and extensive feature reuse, complements this by addressing challenges related to feature fusion and misclassification at the boundaries of PGs and adjacent features. The results demonstrate the effectiveness of this approach. By incorporating the image enhancement index into the DenseNet103 model, the proposed method successfully eliminates issues related to the fusion and misclassification of PG boundaries and adjacent features. The proposed method, known as DenseNet103 (Index), excels in extracting the integrity of PGs, especially in cases involving small and densely packed plastic sheds. Moreover, it holds the potential for large-scale digital mapping of PG coverage. In conclusion, the proposed method provides a practical and versatile tool for a wide range of applications related to the monitoring and evaluation of PGs, which can help to improve the precision of agricultural management and quantitative environmental assessment.
Article
Full-text available
This report presents the results of the 2006 PASCAL Visual Object Classes Challenge (VOC2006). Details of the challenge, data, and evaluation are presented. Participants in the challenge submitted descriptions of their methods, and these have been included verbatim. This document should be considered preliminary, and subject to change.
Article
Full-text available
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as feature representation. However, the information in this layer may be too coarse to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [21], where we improve the state of the art from 49.7 [21] mean AP^r to 59.0; keypoint localization, where we get a 3.3 point boost over [19]; and part labeling, where we show a 6.6 point gain over a strong baseline.
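A minimal sketch (not the authors' code) of the hypercolumn idea described above: feature maps from several layers of a small stand-in CNN are upsampled to image resolution and stacked, so each pixel gets one long descriptor vector.

```python
# Sketch: build per-pixel "hypercolumn" descriptors by upsampling feature maps
# from several layers and concatenating them along the channel dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.block3 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):
        f1 = self.block1(x)          # 1/2 resolution
        f2 = self.block2(f1)         # 1/4 resolution
        f3 = self.block3(f2)         # 1/8 resolution
        return [f1, f2, f3]

def hypercolumns(features, out_size):
    """Upsample each feature map to out_size and concatenate per pixel."""
    upsampled = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
                 for f in features]
    return torch.cat(upsampled, dim=1)   # (N, C1+C2+C3, H, W)

x = torch.randn(1, 3, 64, 64)
feats = TinyCNN()(x)
print(hypercolumns(feats, out_size=(64, 64)).shape)   # torch.Size([1, 112, 64, 64])
```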
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
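A sketch of one Inception-style module along the lines described above: parallel 1x1, 3x3, and 5x5 convolutions plus pooling, with 1x1 reductions to keep the computational budget in check. The branch widths below follow the commonly cited configuration of the first Inception block of GoogLeNet; the rest of the 22-layer network is omitted.

```python
# Sketch of an Inception-style multi-branch module with 1x1 bottlenecks.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # Concatenate all branches along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```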
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
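A minimal pycaffe usage sketch consistent with the description above. The .prototxt and .caffemodel file names are placeholders (no such files ship with Caffe), and the exact Net constructor signature varies slightly across Caffe versions.

```python
# Sketch: load a trained Caffe model and run a forward pass with pycaffe.
import numpy as np
import caffe

caffe.set_mode_cpu()                               # or caffe.set_mode_gpu()
net = caffe.Net("deploy.prototxt",                 # placeholder architecture file
                "weights.caffemodel",              # placeholder trained weights
                caffe.TEST)

image = np.random.rand(3, 224, 224).astype(np.float32)   # stand-in input
net.blobs["data"].reshape(1, *image.shape)
net.blobs["data"].data[...] = image
output = net.forward()                             # dict of output blobs
print({k: v.shape for k, v in output.items()})
```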
Conference Paper
Full-text available
In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.
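A simplified sketch of a geocentric (HHA-style) encoding in the spirit of the embedding described above, assuming the camera y-axis is aligned with gravity (the paper estimates the gravity direction instead) and using a nominal stereo baseline to form the disparity channel.

```python
# Sketch: encode a depth map as (horizontal disparity, height, angle with gravity).
import numpy as np

def geocentric_encoding(depth, fx, fy, cx, cy, baseline=0.075):
    """depth: (H, W) metric depth in meters. Returns an (H, W, 3) encoding."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    X = (u - cx) * depth / fx            # back-project to camera coordinates
    Y = (v - cy) * depth / fy            # image y-axis assumed aligned with gravity
    Z = depth

    disparity = baseline * fx / np.maximum(depth, 1e-6)
    height = Y.max() - Y                 # height above the lowest observed point

    # Surface normals from gradients of the point cloud, then angle with "up".
    P = np.dstack([X, Y, Z])
    du = np.gradient(P, axis=1)
    dv = np.gradient(P, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=2, keepdims=True) + 1e-6
    up = np.array([0.0, -1.0, 0.0])
    angle = np.degrees(np.arccos(np.clip(normals @ up, -1.0, 1.0)))
    return np.dstack([disparity, height, angle])

depth = 1.0 + np.random.rand(480, 640) * 3.0        # synthetic depth map
hha = geocentric_encoding(depth, fx=570.0, fy=570.0, cx=320.0, cy=240.0)
print(hha.shape)                                     # (480, 640, 3)
```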
Article
Full-text available
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.
Article
Full-text available
Latest results indicate that features learned via convolutional neural networks outperform previous descriptors on classification tasks by a large margin. It has been shown that these networks still work well when they are applied to datasets or recognition tasks different from those they were trained on. However, descriptors like SIFT are not only used in recognition but also for many correspondence problems that rely on descriptor matching. In this paper we compare features from various layers of convolutional neural nets to standard SIFT descriptors. We consider a network that was trained on ImageNet and another one that was trained without supervision. Surprisingly, convolutional neural networks clearly outperform SIFT on descriptor matching.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state of the art results for the detection and classifications tasks. Finally, we release a feature extractor from our best model called OverFeat.
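The efficient sliding-window evaluation described above can be illustrated by writing the "fully connected" classifier as convolutions, so a larger input yields a dense grid of class scores in a single forward pass. The layer sizes below are illustrative, not the OverFeat architecture.

```python
# Sketch: a ConvNet classifier whose FC layers are expressed as convolutions,
# so larger inputs produce a spatial map of scores (implicit sliding window).
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
classifier = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=8), nn.ReLU(),   # "FC" layer as an 8x8 convolution
    nn.Conv2d(256, 1000, kernel_size=1),            # "FC" layer as a 1x1 convolution
)

small = torch.randn(1, 3, 64, 64)     # training-size crop -> single score vector
large = torch.randn(1, 3, 128, 128)   # larger image -> dense grid of scores
print(classifier(features(small)).shape)   # torch.Size([1, 1000, 1, 1])
print(classifier(features(large)).shape)   # torch.Size([1, 1000, 9, 9])
```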
Article
Full-text available
Scene labeling consists of labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation that captures texture, shape, and contextual information. We report results using multiple postprocessing methods to produce the final labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of components that best explain the scene; these components are arbitrary, for example, they can be taken from a segmentation tree or from any family of oversegmentations. The system yields record accuracies on the SIFT Flow dataset (33 classes) and the Barcelona dataset (170 classes) and near-record accuracy on Stanford background dataset (eight classes), while being an order of magnitude faster than competing approaches, producing a $(320\times 240)$ image labeling in less than a second, including feature extraction.
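A minimal sketch of the multiscale idea above: one CNN with shared weights is run on several scales of the image, and its feature maps are upsampled back to full resolution and concatenated per pixel. The small network is a stand-in, not the architecture from the paper, and the postprocessing steps are omitted.

```python
# Sketch: shared-weight CNN applied over an image pyramid; features are
# upsampled to full resolution and concatenated per pixel.
import torch
import torch.nn as nn
import torch.nn.functional as F

cnn = nn.Sequential(nn.Conv2d(3, 16, 7, padding=3), nn.ReLU(),
                    nn.Conv2d(16, 32, 7, padding=3), nn.ReLU())

def multiscale_features(image, scales=(1.0, 0.5, 0.25)):
    h, w = image.shape[-2:]
    feats = []
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        f = cnn(scaled)                               # same weights at every scale
        feats.append(F.interpolate(f, size=(h, w), mode="bilinear",
                                   align_corners=False))
    return torch.cat(feats, dim=1)                    # (N, 3 * 32, H, W)

print(multiscale_features(torch.randn(1, 3, 64, 64)).shape)
```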
Conference Paper
Full-text available
This paper describes the use of a convolutional neural network to perform address block location on machine-printed mail pieces. Locating the address block is a difficult object recognition problem because there is often a large amount of extraneous printing on a mail piece and because address blocks vary dramatically in size and shape. We used a convolutional locator network with four outputs, each trained to find a different corner of the address block. A simple set of rules was used to generate ABL candidates from the network output. The system performs very well: when allowed five guesses, the network will tightly bound the address delivery information in 98.2% of the cases.
Conference Paper
Full-text available
This paper presents a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach requires no training, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. It works by scene-level matching with global image descriptors, followed by superpixel-level matching with local features and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art nonparametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 15,150 images and 170 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem.
Article
Full-text available
We describe a trainable system for analyzing videos of developing C. elegans embryos. The system automatically detects, segments, and locates cells and nuclei in microscopic images. The system was designed as the central component of a fully automated phenotyping system. The system contains three modules: 1) a convolutional network trained to classify each pixel into five categories: cell wall, cytoplasm, nucleus membrane, nucleus, outside medium; 2) an energy-based model, which cleans up the output of the convolutional network by learning local consistency constraints that must be satisfied by label images; 3) a set of elastic models of the embryo at various stages of development that are matched to the label images.
Article
Full-text available
We present a feed-forward network architecture for recognizing an unconstrained handwritten multi-digit string. This is an extension of previous work on recognizing isolated digits. In this architecture a single digit recognizer is replicated over the input. The output layer of the network is coupled to a Viterbi alignment module that chooses the best interpretation of the input. Training errors are propagated through the Viterbi module. The novelty in this procedure is that segmentation is done on the feature maps developed in the Space Displacement Neural Network (SDNN) rather than the input (pixel) space. In previous work (Le Cun et al., 1990) we have demonstrated a feed-forward backpropagation network that recognizes isolated handwritten digits at state-of-the-art performance levels. The natural extension of this work is towards recognition of unconstrained strings of handwritten digits. The most straightforward solution is to divide the process into two: segmentati...
Conference Paper
Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.
Conference Paper
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Book
Mallat's book is the undisputed reference in this field - it is the only one that covers the essential material in such breadth and depth. - Laurent Demanet, Stanford University The new edition of this classic book gives all the major concepts, techniques and applications of sparse representation, reflecting the key role the subject plays in today's signal processing. The book clearly presents the standard representations with Fourier, wavelet and time-frequency transforms, and the construction of orthogonal bases with fast algorithms. The central concept of sparsity is explained and applied to signal compression, noise reduction, and inverse problems, while coverage is given to sparse representations in redundant dictionaries, super-resolution and compressive sensing applications. Features: * Balances presentation of the mathematics with applications to signal processing * Algorithms and numerical examples are implemented in WaveLab, a MATLAB toolbox * Companion website for instructors and selected solutions and code available for students New in this edition * Sparse signal representations in dictionaries * Compressive sensing, super-resolution and source separation * Geometric image processing with curvelets and bandlets * Wavelets for computer graphics with lifting on surfaces * Time-frequency audio processing and denoising * Image compression with JPEG-2000 * New and updated exercises A Wavelet Tour of Signal Processing: The Sparse Way, third edition, is an invaluable resource for researchers and R&D engineers wishing to apply the theory in fields such as image processing, video processing and compression, bio-sensing, medical imaging, machine vision and communications engineering. Stephane Mallat is Professor in Applied Mathematics at École Polytechnique, Paris, France. From 1986 to 1996 he was a Professor at the Courant Institute of Mathematical Sciences at New York University, and between 2001 and 2007, he co-founded and became CEO of an image processing semiconductor company. Includes all the latest developments since the book was published in 1999, including its application to JPEG 2000 and MPEG-4 Algorithms and numerical examples are implemented in Wavelab, a MATLAB toolbox Balances presentation of the mathematics with applications to signal processing.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
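As a small illustration of the dropout regularization mentioned above, here is the fully connected classifier head of an AlexNet-style network with dropout before the two hidden layers; the convolutional trunk and training loop are omitted.

```python
# Sketch: dropout-regularized fully connected classifier head (AlexNet-style sizes).
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),           # 1000-way scores; softmax applied via the loss
)
scores = classifier(torch.randn(8, 256 * 6 * 6))
print(scores.shape)                  # torch.Size([8, 1000])
```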
Convolutional neural nets (convnets) trained from massive labeled datasets have substantially improved the state-of-the-art in image classification and object detection. However, visual understanding requires establishing correspondence on a finer level than object category. Given their large pooling regions and training from whole-image labels, it is not clear that convnets derive their success from an accurate correspondence model which could be used for precise localization. In this paper, we study the effectiveness of convnet activation features for tasks requiring correspondence. We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass alignment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
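A sketch of the design principle above: depth built from stacks of 3x3 convolutions with occasional 2x2 max-pooling (two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters than one 5x5 convolution). The block counts below follow a 16-layer, VGG-style convolutional trunk; the classifier is omitted.

```python
# Sketch: VGG-style trunk made of repeated 3x3 convolution blocks.
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# 2 + 2 + 3 + 3 + 3 = 13 convolutional layers in the trunk.
trunk = nn.Sequential(
    vgg_block(3, 64, 2), vgg_block(64, 128, 2), vgg_block(128, 256, 3),
    vgg_block(256, 512, 3), vgg_block(512, 512, 3),
)
print(trunk)
```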
Conference Paper
We propose a new architecture for difficult image processing operations, such as natural edge detection or thin object segmentation. The architecture is based on a simple combination of convolutional neural networks with the nearest neighbor search. We focus on situations in which the desired image transformation is too hard for a neural network to learn explicitly. We show that in such situations, the use of the nearest neighbor search on top of the network output considerably improves the results and accounts for the underfitting effect during the neural network training. The approach is validated on three challenging benchmarks, where the performance of the proposed architecture matches or exceeds the state-of-the-art.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
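A minimal sketch of spatial pyramid pooling as described above: the feature map is pooled into fixed grids of several sizes and the results are concatenated, so the output length does not depend on the input resolution. The pyramid levels below are illustrative.

```python
# Sketch: spatial pyramid pooling producing a fixed-length vector for any input size.
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    n, c = features.shape[:2]
    pooled = [F.adaptive_max_pool2d(features, output_size=l).view(n, -1)
              for l in levels]
    return torch.cat(pooled, dim=1)   # length c * (1 + 4 + 16), independent of H, W

print(spatial_pyramid_pool(torch.randn(1, 64, 13, 13)).shape)   # torch.Size([1, 1344])
print(spatial_pyramid_pool(torch.randn(1, 64, 20, 31)).shape)   # torch.Size([1, 1344])
```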
Article
Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.
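The scale-invariant error mentioned above has a compact form: with d_i the per-pixel difference of log depths, the loss is the mean of d_i^2 minus lambda times the squared mean of d_i, so a global scale shift is (partly) discounted. A sketch follows; the paper uses lambda = 0.5, and the synthetic tensors here are placeholders.

```python
# Sketch: scale-invariant log-depth error.
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    d = torch.log(pred + eps) - torch.log(target + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2

pred   = torch.rand(1, 1, 32, 32) * 10       # predicted depth (meters)
target = pred * 2.0                          # same depth up to a global scale
print(scale_invariant_loss(pred, target, lam=1.0))   # ~0 when lam = 1
```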
Conference Paper
We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.
Conference Paper
Photographs taken through a window are often compromised by dirt or rain present on the window surface. Common cases of this include pictures taken from inside a vehicle, or outdoor security cameras mounted inside a protective enclosure. At capture time, defocus can be used to remove the artifacts, but this relies on achieving a shallow depth-of-field and placement of the camera close to the window. Instead, we present a post-capture image processing solution that can remove localized rain and dirt artifacts from a single image. We collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network. This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of dirt and water droplets in natural images. Our models demonstrate effective removal of dirt and rain in outdoor test conditions.
This paper presents a system for image parsing, or labeling each pixel in an image with its semantic category, aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. The system combines region-level features with per-exemplar sliding window detectors. Per-exemplar detectors are better suited for our parsing task than traditional bounding box detectors: they perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. The proposed system achieves state-of-the-art accuracy on three challenging datasets, the largest of which contains 45,676 images and 232 labels.
We address the problems of contour detection, bottom-up grouping and semantic segmentation using RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset [27]. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach of [2] by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We then turn to the problem of semantic segmentation and propose a simple approach that classifies super pixels into the 40 dominant object categories in NYUD2. We use both generic and class-specific features to encode the appearance and geometry of objects. We also show how our approach can be used for scene classification, and how this contextual information in turn improves object recognition. In all of these tasks, we report significant improvements over the state-of-the-art.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
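A schematic sketch of the region-based pipeline described above: each bottom-up proposal is cropped, warped to a fixed size, passed through a CNN, and scored per class. The random boxes and tiny network below are placeholders for selective-search proposals and the actual feature extractor.

```python
# Sketch: score region proposals with CNN features (R-CNN-style pipeline shape).
import torch
import torch.nn as nn
import torch.nn.functional as F

cnn = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(16, 21)                 # e.g. 20 PASCAL classes + background

def region_scores(image, boxes, crop_size=64):
    scores = []
    for x0, y0, x1, y1 in boxes:
        crop = image[:, :, y0:y1, x0:x1]                       # crop the proposal
        warped = F.interpolate(crop, size=(crop_size, crop_size),
                               mode="bilinear", align_corners=False)
        scores.append(classifier(cnn(warped)))                 # per-class scores
    return torch.cat(scores)                                   # (num_boxes, num_classes)

image = torch.randn(1, 3, 240, 320)
boxes = [(10, 20, 110, 150), (50, 60, 200, 220)]   # placeholder proposals
print(region_scores(image, boxes).shape)           # torch.Size([2, 21])
```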
Article
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
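A small sketch of the transfer recipe above: freeze a network's feature layers, treat an upper-layer activation as a fixed descriptor, and train only a linear classifier for the new task. The backbone here is randomly initialized and the data synthetic, purely for illustration.

```python
# Sketch: frozen convolutional features + trainable linear classifier on a new task.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in backbone.parameters():
    p.requires_grad = False                       # features are kept fixed

linear = nn.Linear(32, 5)                         # new 5-class task
opt = torch.optim.SGD(linear.parameters(), lr=0.1)

images = torch.randn(64, 3, 32, 32)
labels = torch.randint(0, 5, (64,))
for _ in range(10):
    feats = backbone(images)                      # fixed feature vectors
    loss = nn.functional.cross_entropy(linear(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```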
Article
Scene parsing is a technique that consists of assigning a label to every pixel in an image according to the class it belongs to. To ensure good visual coherence and high class accuracy, it is essential for a scene parser to capture image long range dependencies. In a feed-forward architecture, this can be simply achieved by considering a sufficiently large input context patch around each pixel to be labeled. We propose an approach consisting of a recurrent convolutional neural network which allows us to consider a large input context, while limiting the capacity of the model. Contrary to most standard approaches, our method does not rely on any segmentation method or task-specific features. The system is trained in an end-to-end manner over raw pixels, and models complex spatial dependencies with low inference cost. As the context size increases with the built-in recurrence, the system identifies and corrects its own errors. Our approach yields state-of-the-art performance on both the Stanford Background Dataset and the SIFT Flow Dataset, while remaining very fast at test time.
Chapter
The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
Conference Paper
We study the challenging problem of localizing and classifying category-specific object contours in real world images. For this purpose, we present a simple yet effective method for combining generic object detectors with bottom-up contours to identify object contours. We also provide a principled way of combining information from different part detectors and across categories. In order to study the problem and evaluate quantitatively our approach, we present a dataset of semantic exterior boundaries on more than 20,000 object instances belonging to 20 categories, using the images from the VOC2011 PASCAL challenge [7].
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
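A sketch of the architectural constraint described above, in modern notation: local receptive fields with shared weights (convolutions) followed by subsampling, ending in a small classifier over ten digit classes. The layer sizes are illustrative rather than those of the original network.

```python
# Sketch: convolution + subsampling network for digit images, with shared weights
# as the built-in task-domain constraint.
import torch
import torch.nn as nn

digit_net = nn.Sequential(
    nn.Conv2d(1, 4, 5), nn.Tanh(), nn.AvgPool2d(2),    # shared-weight feature maps
    nn.Conv2d(4, 12, 5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Flatten(),
    nn.Linear(12 * 4 * 4, 10),                          # 10 digit classes
)
print(digit_net(torch.randn(1, 1, 28, 28)).shape)       # torch.Size([1, 10])
```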
Article
While image alignment has been studied in different areas of computer vision for decades, aligning images depicting different scenes remains a challenging problem. Analogous to optical flow, where an image is aligned to its temporally adjacent frame, we propose SIFT flow, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes. The SIFT flow algorithm consists of matching densely sampled, pixelwise SIFT features between two images while preserving spatial discontinuities. The SIFT features allow robust matching across different scene/object appearances, whereas the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. Experiments show that the proposed approach robustly aligns complex scene pairs containing significant spatial differences. Based on SIFT flow, we propose an alignment-based large database framework for image analysis and synthesis, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence. This framework is demonstrated through concrete applications such as motion field prediction from a single image, motion synthesis via object transfer, satellite image registration, and face recognition.
Article
It is shown that a convolution with certain reasonable receptive field (RF) profiles yields the exact partial derivatives of the retinal illuminance blurred to a specified degree. Arbitrary concatenations of such RF profiles yield again similar ones of higher order and for a greater degree of blurring. By replacing the illuminance with its third order jet extension we obtain position dependent geometries. It is shown how such a representation can function as the substrate for “point processors” computing geometrical features such as edge curvature. We obtain a clear dichotomy between local and multilocal visual routines. The terms of the truncated Taylor series representing the jets are partial derivatives whose corresponding RF profiles closely mimic the well known units in the primary visual cortex. Hence this description provides a novel means to understand and classify these units. Taking the receptive field outputs as the basic input data one may devise visual routines that compute geometric features on the basis of standard differential geometry exploiting the equivalence with the local jets (partial derivatives with respect to the space coordinates).
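The central observation above translates directly into code: filtering an image with Gaussian derivative kernels yields the partial derivatives of the blurred image, i.e., the truncated local jet, from which geometric features can be computed pointwise. A minimal sketch with SciPy follows; the image and scale are placeholders.

```python
# Sketch: Gaussian derivative filters as receptive-field profiles; the outputs
# are the partial derivatives of the Gaussian-blurred image (the local jet).
import numpy as np
from scipy import ndimage

image = np.random.rand(128, 128)
sigma = 2.0

blurred = ndimage.gaussian_filter(image, sigma)
Lx  = ndimage.gaussian_filter(image, sigma, order=(0, 1))   # d/dx of blurred image
Ly  = ndimage.gaussian_filter(image, sigma, order=(1, 0))   # d/dy
Lxx = ndimage.gaussian_filter(image, sigma, order=(0, 2))   # second-order terms
Lyy = ndimage.gaussian_filter(image, sigma, order=(2, 0))

gradient_magnitude = np.hypot(Lx, Ly)   # a simple pointwise feature from the jet
laplacian = Lxx + Lyy
print(gradient_magnitude.shape, laplacian.shape)
```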
Simultaneous detection and segmentation
  • B Hariharan
  • P Arbeláez
  • R Girshick
  • J Malik
Restoring an image taken through a window covered with dirt or rain. In Computer Vision (ICCV), 2013 IEEE International Conference
  • D Eigen
  • D Krishnan
  • R Fergus