Article

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
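As a minimal sketch of the pooling strategy the abstract describes, the following PyTorch layer maps feature maps of arbitrary spatial size to a fixed-length vector by pooling over a small pyramid of grids; the pyramid levels {4, 2, 1} and the feature dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative spatial pyramid pooling layer (sketch, not the authors' code).
import torch
import torch.nn as nn


class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        # One adaptive max-pool per pyramid level; level l yields an l x l grid of bins.
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(l) for l in levels])

    def forward(self, x):                                # x: (N, C, H, W), arbitrary H and W
        pooled = [p(x).flatten(start_dim=1) for p in self.pools]
        return torch.cat(pooled, dim=1)                  # fixed length: C * (16 + 4 + 1)


if __name__ == "__main__":
    spp = SpatialPyramidPooling()
    for h, w in [(14, 14), (11, 19)]:                    # feature maps from different image sizes ...
        feats = torch.randn(1, 256, h, w)
        print(spp(feats).shape)                          # ... always (1, 256 * 21)
```

Because the output length depends only on the channel count and the pyramid levels, the fully connected layers that follow can be trained on images (or regions) of any size.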


... To improve the ability of CNNs to aggregate context information, UNet [2], UNet++ [3], VNet [4], and UNet3+ [5] use a symmetric encoding-decoding structure to compensate for the loss of information caused by downsampling and utilize the skip connection method to transmit feature information between different layers, aggregating multi-scale information and improving the model's ability to learn contextual information. SPPNet [6] and PSPNet [7] construct a pooled pyramid to capture the contextual information at different levels, which can simultaneously process both large- and small-scale objects and structures, thus enhancing their ability to learn context. Capturing different levels of contextual information by applying atrous convolution and pooling operations in parallel at different scales leads to enhanced comprehension of complex scenes while maintaining resolution and effective learning of a wider range of contextual information [8,9,10,11,12]. ...
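To make the parallel atrous-convolution-and-pooling idea above concrete, here is a minimal ASPP-style block in PyTorch; the channel counts and the dilation rates (6, 12, 18), borrowed from the rate ablation quoted below, are assumptions rather than the exact modules of the cited works.

```python
# Hedged sketch of a parallel atrous (dilated) convolution block with a global-pooling branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelAtrousBlock(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # A 1x1 branch plus one 3x3 atrous branch per rate; padding = rate keeps H x W unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # Image-level context branch: global average pool, then broadcast back to H x W.
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        g = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))
```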
... The results indicate that both randomized multi-scale inputs and horizontal flipping can further improve the performance of the segmentation model, thereby avoiding overfitting. Impact of atrous convolution rates on the model's performance: we applied four atrous convolution rate configurations in the S-ASPP: (3,6,9), (4,8,12), (6,12,18), and (8,16,24). As Table 6 shows, (6,12,18) performed best. ...
Article
Semantic segmentation is a critical task in computer vision. Constructing complex semantic segmentation models with high accuracy, low spatial occupancy, and low computational complexity remains a challenge. To address this, this paper proposes a semantic segmentation network based on a hybrid architecture of convolutional neural network and Transformer, named Shuffle Window Transformer DeeplabV3+. The network introduces a new module, called the Shuffle Window Transformer. When the window size is fixed, by integrating window attention and shuffle window attention mechanisms, cross-window global context modeling with linear computational complexity is achieved. Additionally, we enhance the atrous spatial pyramid pooling by incorporating strip pooling to construct a strip atrous spatial pyramid pooling, effectively extracting both regular and irregular multi-scale features. Simultaneously, the network adopts adaptive spatial feature fusion in the shallow layers. Dynamic adjustment of multi-scale feature weights improves the backbone network's ability to capture shallow discriminative features. Experimental results demonstrate that on three public datasets (PASCAL VOC 2012, Cityscapes, and CamVid), Shuffle Window Transformer DeeplabV3+ exhibits outstanding segmentation performance under conditions of lower parameter count and computational cost, validating the model's capability to achieve efficient processing while maintaining high accuracy.
... The inclusion of residual connections preserves and propagates features across layers, enabling more robust feature extraction at scale. This approach is particularly effective for identifying complex structures across various ... [Fig. 1: Localization comparison across baseline models and the proposed YOLO-DCAP on GW, Bore, and OE datasets. In the 'Labeled Data', pink regions represent objects of interest (GW, Bore, or OE), while blue regions indicate non-objects of interest such as city lights and clouds (labeled as 'noise' in GW and Bore data, and as 'non-eddy' in OE data).] ...
... The Attention-aided Spatial Pooling (AaSP) module enhances the Spatial Pyramid Pooling (SPP) framework [12] by introducing architectural modifications and incorporating an attention mechanism inspired by Squeeze-and-Excitation (SE) [40]. This enables the network to focus on the most informative features and adapt to varying input characteristics. ...
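The excerpt above says AaSP augments SPP-style pooling with a Squeeze-and-Excitation (SE)-inspired attention mechanism. As a rough illustration of that ingredient only (this is a generic SE gate, not the authors' AaSP module; the reduction ratio of 16 is an assumption):

```python
# Minimal squeeze-and-excitation gate: per-channel reweighting from globally pooled statistics.
import torch
import torch.nn as nn


class SEGate(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global average pool -> (N, C)
        return x * w[:, :, None, None]           # excite: scale each channel by its weight
```

Such a gate can be applied to the concatenated pyramid-pooled features so the network emphasizes the most informative pooling branches.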
... We evaluated our YOLO-DCAP model, which integrates MDRC and the attention-aided AaSP module, against YOLOv5 and state-of-the-art methods, including CBAM [13], Transformer [15], SPP [12], and ViT [14], as shown in Table I. The comparison spans three datasets, GW, Bore, and OE using precision, recall, mean average precision (mAP50 and mAP50-95), and intersection over union (IoU) metrics. ...
Preprint
Full-text available
Object localization in satellite imagery is particularly challenging due to the high variability of objects, low spatial resolution, and interference from noise and dominant features such as clouds and city lights. In this research, we focus on three satellite datasets: upper atmospheric Gravity Waves (GW), mesospheric Bores (Bore), and Ocean Eddies (OE), each presenting its own unique challenges. These challenges include the variability in the scale and appearance of the main object patterns, where the size, shape, and feature extent of objects of interest can differ significantly. To address these challenges, we introduce YOLO-DCAP, a novel enhanced version of YOLOv5 designed to improve object localization in these complex scenarios. YOLO-DCAP incorporates a Multi-scale Dilated Residual Convolution (MDRC) block to capture multi-scale features at scale with varying dilation rates, and an Attention-aided Spatial Pooling (AaSP) module to focus on the global relevant spatial regions, enhancing feature selection. These structural improvements help to better localize objects in satellite imagery. Experimental results demonstrate that YOLO-DCAP significantly outperforms both the YOLO base model and state-of-the-art approaches, achieving an average improvement of 20.95% in mAP50 and 32.23% in IoU over the base model, and 7.35% and 9.84% respectively over state-of-the-art alternatives, consistently across all three satellite datasets. These consistent gains across all three satellite datasets highlight the robustness and generalizability of the proposed approach. Our code is open sourced at https://github.com/AI-4-atmosphere-remote-sensing/satellite-object-localization.
... The CNN backbone extracts features from the 2,000 region proposals and is then fine-tuned to get the final prediction. [Fig. 2: The difference between image classification (a) and object detection (b).] SPP-NET [27] generates the region proposals in a separate branch and passes the entire image through the CNN to extract features from the image only once. An SPP layer on top of the last convolutional layer maps each region proposal onto the feature map and obtains features of fixed length for the FC layers. ...
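A simplified sketch of that "compute features once, pool each region" step is shown below; the feature stride of 16 and the single 7x7 output grid are assumptions (SPP-net actually pools a multi-level pyramid per region), so this is only meant to show how a box in image coordinates is projected onto a shared feature map and reduced to a fixed-length vector.

```python
# Sketch: pool a fixed-length descriptor for one region from a shared convolutional feature map.
import torch
import torch.nn as nn


def pool_region(feature_map, box, stride=16, output_size=(7, 7)):
    """feature_map: (C, H, W); box: (x1, y1, x2, y2) in image pixels, assumed valid and non-empty."""
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]      # project box onto feature-map coords
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]               # crop the shared feature map
    return nn.AdaptiveMaxPool2d(output_size)(region).flatten()  # fixed length: C * 7 * 7


if __name__ == "__main__":
    fmap = torch.randn(256, 40, 60)                              # features of the whole image
    print(pool_region(fmap, (48, 32, 320, 240)).shape)           # torch.Size([12544])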
... YOLO v4 chooses CSPDarknet53 as its backbone rather than CSPDarknet50, as it is more suitable for detection. On top of the backbone, SPP-NET [27] is used as an additional module to increase the receptive field and extract significant features, and PANet [47] is used as a path aggregation neck to collect feature maps from different stages instead of FPN [44], while YOLO v3 [33] serves as the head of YOLO v4. YOLO v4 uses a genetic algorithm to search for optimal hyperparameters during the training phase and introduces new data augmentation methods, e.g., Mosaic and self-adversarial training. ...
Article
Full-text available
In recent years, there has been impressive development in human detection. The main challenge in pedestrian detection is the training data. To assess detectors in crowd scenarios more effectively, a novel dataset in this study called the HEP dataset (Hybrid Egyptian Pedestrian dataset) is introduced. The HEP dataset is extensive, has comprehensive annotations, and is highly diverse. The dataset images are collected by two different means. Most of the images are collected from different mobile cameras for people crossing the street in highly crowded streets in Egypt, and the rest of the images are collected from the web. That is why the dataset is called hybrid. The collected dataset is more suitable for pedestrian detection as the whole images focus on pedestrian scenarios for people outdoors crossing the street. This outperforms the previous benchmarks such as CrowdHuman and WiderPerson, which collect data from the web and surveillance cameras with lots of images for indoor people. GS-YOLO is also proposed to address the issues of real-time performance and occlusion in crowd scenes. GS-YOLO is a novel pedestrian detection model that utilizes efficient Ghost and depth separable convolution modules. GS-YOLO replaces all the convolution layers in the backbone and the head of the original YOLOv8 with Ghost and depth separable modules, respectively. A deformable to-features module is proposed to enrich features for the different feature pyramid networks. GS-YOLO is trained and tested over the collected dataset and other benchmarks like the CrowdHuman and WiderPerson datasets. GS-YOLO achieves competitive results over the state-of-the-art models such as YOLOv5 and YOLOv8. GS-YOLO achieves 92.8% mAP on the HEP dataset, while YOLOv5 achieves 90.3% mAP and YOLOv8 achieves 91.1% mAP.
... This structure solves the problem of the fixed-size input required by the R-CNN network and avoids redundant computation of feature maps. At the same time, the new network is robust to deformation of the target objects being recognized [25,26]. However, SPPNet still used a CNN for feature extraction and then trained an SVM for classification, so it did not solve the problems of large storage requirements and a cumbersome training process. ...
... The SPPF module (Maxpool2d in the module is max pooling, and Concat is concatenation) draws on the idea of SPPNet [25]. It fuses features at different levels by passing the feature vectors from the previous layer through multiple branches. ...
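For reference, the following is a hedged sketch of an SPPF-style block as popularized by YOLOv5: three sequential 5x5 max-pools whose outputs are concatenated with the input and fused by a 1x1 convolution. The channel split and kernel size are the usual choices and are assumptions here; the cited model's exact SPPF may differ in detail.

```python
# SPPF-style block: sequential pooling reuses earlier results, approximating parallel 5/9/13 pools.
import torch
import torch.nn as nn


class SPPF(nn.Module):
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        hidden = in_ch // 2
        self.cv1 = nn.Conv2d(in_ch, hidden, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = nn.Conv2d(hidden * 4, out_ch, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)           # ~5x5 receptive field
        y2 = self.pool(y1)          # ~9x9, computed from the previous pooling result
        y3 = self.pool(y2)          # ~13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```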
Article
Full-text available
Remote sensing images contain complex scene information and have multi-scale characteristics after being imaged by remote sensing equipment because the target instances in natural scenes are different sizes. In addition, small target instances are more difficult to identify in complex backgrounds, resulting in serious scale imbalance problems, which pose a serious challenge to identifying and positioning target objects. To address this problem, this article proposes a detection method for multi-scale remote sensing image targets by combining the frequency attention mechanism. The model first introduces an improved frequency channel attention mechanism to design a feature extraction module to improve the extraction of key features by the neural network; second, considering that the complete intersection over union method does not comprehensively consider the aspect ratio of the bounding box, which will cause the loss of small-scale target feature information, the efficient intersection over union method is used to improve it; then, because of the high missed detection rate of the non-maximum suppression (NMS) method, Soft-EIoU-NMS is used to replace the original NMS method. The experiment first conducted ablation experiments on the LEVIR dataset, where the target scale changes little and the number of ground object categories is small, and the DIOR dataset, where the target scale changes greatly and the number of ground object categories is large. The mAP@0.5 on the LEVIR dataset reached 0.935; the mAP@0.5 on the DIOR dataset reached 0.882. Then, the model proposed in this article was compared with the mainstream target detection methods. The experimental results verified the effectiveness of the model in this article. Finally, the model was applied to the disaster remote sensing image scene for detection, again demonstrating the model’s good detection performance. Therefore, the experimental results show that the method proposed in this article can effectively alleviate the scale imbalance problem and achieve a good target detection effect.
... AHP/ANP are AI-based commercial tools to resolve SCA's inventory-based problems. Formulation of Operational and production data in automated digitalization improves the overall performance of SC activities (Dubey et al., 2019;Simonyan & Zisserman, 2014;He et al., 2015b;Ioffe & Szegedy, 2015;Howard et al., 2017;Jia et al., 2014;Tieleman & Hinton, 2012). The multi-criterion decision-making problem (MCDM) explores various learning capabilities of neural networks. ...
... Zhang et al. (2010) integrated IoT and blockchain to create a supply chain provenance system to forecast demand for food. He et al. (2015b) proposed a multimodal classification to estimate the maturity of food through feature concatenation of hyperspectral images. This approach reduces food wastage along with food production costs. ...
Article
Full-text available
The Supply chain business model is being reshaped and transformed by artificial intelligence (AI) using contemporary, environmentally friendly AI-based techniques. Customers frequently receive the incorrect items in parcel boxes when they order online. Supply Chain (SC) companies uphold a simple return policy for clients, which incurs additional expenses for a specific order, to preserve their reputation. A certain amount of time is spent on the return procedure, which can result in losses for significant returns. By incorporating deep learning algorithms into the SC packaging process, the suggested method attempts to address this issue. Before the things are delivered, the darknet architecture in this document detects incorrect product delivery. Semantic properties are extracted from the video footage using the following framework. The suggested method is made with CSPDarknet53 to enhance learning capacity. The spatial pyramid pooling candidate window preserves 6×6 features as a connection parameter while providing special information processing to the various CNN layers. The DPM model is used for the encoding process. In the suggested information construction, a selective, optimistic search pat
... For visual representation, to address the limitations of flattening image features into a single vector, which can discard some spatial structures and semantic information [24,25,26], H 3 DP employs multi-scale visual representation, where different scales capture features at varying granularity levels, ranging from global context to fine visual details. ...
... An effective visual encoder should capture various granularity features of the visual scenarios and guide the policy to predict the action distribution. However, existing methods typically extract features at a single spatial scale or compress them into a fixed-resolution representation, limiting the expressiveness of learned features [24,25,26]. To address this problem, we hierarchically partition the feature map into multiple scales, enabling the capture of both coarse structural information and detailed local cues. ...
Preprint
Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H3DP), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H3DP contains 3 levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H3DP yields a +27.5% average relative improvement over baselines across 44 simulation tasks and achieves superior performance in 4 challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.
... Spatial pyramid pooling (SPP) [55] was proposed to address this problem. The core idea of the SPP network is to introduce pyramid pooling layers in CNNs that map different sizes of feature maps to a fixed-size feature vector through pooling operations, thereby avoiding the need for upsampling operations and greatly reducing the computational cost and memory overhead without sacrificing accuracy. Skip connections carry significant relevance in semantic segmentation. ...
Article
Full-text available
Semantic segmentation is a critical task in computer vision that aims to assign each pixel in an image a corresponding label on the basis of its semantic content. This task is commonly referred to as dense labeling because it requires pixel-level classification of the image. The research area of semantic segmentation is vast and has achieved critical advances in recent years. Deep learning architectures in particular have shown remarkable performance in generating high-level, hierarchical, and semantic features from images. Among these architectures, convolutional neural networks have been widely used to address semantic segmentation problems. This work aims to review and analyze recent technological developments in image semantic segmentation. It provides an overview of traditional and deep-learning-based approaches and analyzes their structural characteristics, strengths, and limitations. Specifically, it focuses on technical developments in deep-learning-based 2D semantic segmentation methods proposed over the past decade and discusses current challenges in semantic segmentation. The future development direction of semantic segmentation and the potential research areas that need further exploration are also examined.
... To overcome these limitations, we propose the MDSC module (Fig. 5). Our design incorporates three 3×3 convolutional kernels with progressively increasing dilation rates (1, 3, 5), systematically expanding the receptive field [19]. As demonstrated in Fig. 6 ...
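A minimal sketch of that multi-dilation idea is given below: three parallel 3x3 convolutions with dilation rates 1, 3 and 5 whose outputs are fused. How the original MDSC module shares weights and fuses its branches is not specified in the excerpt, so the concatenation plus 1x1 projection here is an assumption.

```python
# Parallel 3x3 convolutions with increasing dilation rates, fused by a 1x1 convolution.
import torch
import torch.nn as nn


class MultiDilationBlock(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 3, 5)):
        super().__init__()
        # padding = rate keeps the spatial size identical across branches.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```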
Article
Computer vision-based traffic object detection plays a critical role in road traffic safety. Under hazy weather conditions, images captured by road monitoring systems exhibit three main challenges: significant scale variations, abundant background noise, and diverse perspectives. These factors lead to insufficient detection accuracy and limited real-time performance in object detection algorithms. We propose AMC-YOLO, an improved YOLOv11-based traffic detection algorithm, to address these challenges. In this work, we replace the C3k block's bottleneck module with our novel attention-gate convolution (AGConv), which improves contextual information capture, enhances feature extraction, and reduces computational redundancy. Additionally, we introduce the multi-dilation sharing convolution (MDSC) module to prevent feature information loss during pooling operations, enhancing the model's sensitivity to multi-scale features. We design a lightweight and efficient cross-channel feature fusion module (CCFM) for the path aggregation neck to adaptively adjust feature weights and optimize the model's overall performance. Experimental results demonstrate that AMC-YOLO achieves a 1.1% improvement in mAP@0.5 and a 2.7% increase in mAP@0.5:0.95 compared to YOLOv11n. On graphics processing unit (GPU) hardware, it achieves real-time performance at 376 frames per second (FPS) with only 2.6 million parameters, ensuring high-precision traffic detection while meeting deployment requirements on resource-constrained devices.
... SPP (Spatial Pyramid Pooling), proposed by He et al. [25], is a pooling structure used for image processing and computer vision tasks. This structure can perform standard pooling on images of different sizes and ultimately combine them into feature vectors of the same size as the input to the fully connected layer. ...
... Automated tree enumeration can serve as a key input for urban development projects and environmental conservation efforts. ...
Article
Accurate tree enumeration is essential for ecological monitoring, urban planning, and forest management. Traditionally, this task has been carried out manually, which is not only time-consuming but also prone to human error. This project aims to simplify and automate the process of tree counting using image analytics. By leveraging a deep learning-based object detection model—YOLOv8—we trained the system to detect and count trees from aerial and landscape images. The approach involves curating a custom dataset, annotating it using tools like Roboflow, and training the model on Google Colab. A simple web interface was developed using Flask to allow users to upload an image and receive real-time results showing the number and type of trees detected. The model performed well on various test images, showing a high detection accuracy. This system not only reduces manual effort but also provides a scalable and efficient solution for large-scale environmental data collection. Keywords: Tree enumeration, Image analytics, YOLOv8, YOLOv9, YOLOv10, Deep learning, Object detection, Streamlit, Python, OpenCV, TensorFlow, Environmental monitoring, Forest management.
... The goal is to develop a reliable and efficient solution that regulates traffic patterns using advanced deep learning structures like LSTMs and CNNs. The effectiveness of this approach is evaluated through quantitative tests and comparisons with state-of-the-art formulations [17]. ...
Article
Full-text available
Utilizing the cloud is not growing as fast as it might because of security and privacy issues. False positives are still a problem with network intrusion detection systems (NIDS), even with their widespread usage. Moreover, in many studies the intrusion detection problem has not been treated as a time series problem, which necessitates time series modelling. In this paper, we use time series data to suggest a unique method for early cloud computing intrusion detection. Our strategy uses a forecasting model built using the Multivariate Neural Model of the prophet and an Improved Zebra Optimization Algorithm (IZOA) to gauge its effectiveness. The problem of making false linkages between time series anomalies and assaults is particularly addressed by this method. Our findings show a notable decrease in the quantity of forecasters used within our forecast model—from 70 to 10—while demonstrating an improvement in performance metrics like median absolute percentage error (MDAPE), dynamic temporal warping (DTW), mean absolute percentage error (MAPE), mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE). In addition, our method has shown reductions in cross-validation, forecasting, as well as training durations of around 97%, 15%, and 85%, respectively.
... The Spatial Pyramid Pooling-Fast Cross Stage Partial Connections (SPPFCSPC) block [15] is a feature extraction component commonly utilized in deep learning models, particularly for object detection tasks. It combines Spatial Pyramid Pooling (SPP) [45], which captures multi-scale spatial features by applying pooling operations at various scales, with Cross Stage Partial Connections (CSPC) [46], which improves gradient flow and reduces computational redundancy by splitting feature maps into transformed and shortcut paths. This integration enables the module to extract both local and global contextual information effectively while maintaining lightweight computational requirements. ...
Preprint
Rapid advances in deep learning for computer vision have driven the adoption of RGB camera-based adaptive traffic light systems to improve traffic safety and pedestrian comfort. However, these systems often overlook the needs of people with mobility restrictions. Moreover, the use of RGB cameras presents significant challenges, including limited detection performance under adverse weather or low-visibility conditions, as well as heightened privacy concerns. To address these issues, we propose a fully automated, thermal detector-based traffic light system that dynamically adjusts signal durations for individuals with walking impairments or mobility burden and triggers the auditory signal for visually impaired individuals, thereby advancing towards barrier-free intersection for all users. To this end, we build the thermal dataset for people with mobility restrictions (TD4PWMR), designed to capture diverse pedestrian scenarios, particularly focusing on individuals with mobility aids or mobility burden under varying environmental conditions, such as different lighting, weather, and crowded urban settings. While thermal imaging offers advantages in terms of privacy and robustness to adverse conditions, it also introduces inherent hurdles for object detection due to its lack of color and fine texture details and generally lower resolution of thermal images. To overcome these limitations, we develop YOLO-Thermal, a novel variant of the YOLO architecture that integrates advanced feature extraction and attention mechanisms for enhanced detection accuracy and robustness in thermal imaging. Experiments demonstrate that the proposed thermal detector outperforms existing detectors, while the proposed traffic light system effectively enhances barrier-free intersection. The source codes and dataset are available at https://github.com/leon2014dresden/YOLO-THERMAL.
... GELAN boosts extraction and computational efficiency. It integrates advanced layer aggregation techniques and combines Spatial Pyramid Pooling (SPP) (He et al., 2015) with channel dimension adjustment through convolutional layers, making it ideal for processing complex data patterns. ...
Article
Full-text available
Computer vision plays a vital role in automating environmental analysis by enabling real-time object detection and classification in diverse conditions. Litter pollution poses significant health and environmental risks due to inefficient disposal, and manual oversight is labor intensive. Effective litter detection is crucial for large-scale environmental monitoring. However, existing models face challenges such as the complexity of detecting shadowy objects in varying lighting conditions (e.g., during rain or under sun rays), difficulty in recognizing small objects, low accuracy, and poor real-time performance. Existing two-stage detectors, such as Faster R-CNN, also struggle with these issues. This paper introduces an automated deep learning-based image processing approach for accurate litter detection across different locations, using an enhanced version of YOLOv9s called LD-YOLOv9s. Key improvements in this novel approach include replacing convolutional layers with DynConvLayer in the backbone, integrating an SDConv-ADown module to substitute down-sampling layers in the neck, and using MPD-IoU instead of CIoU. These modifications reduce the chances of overlooking small objects, such as caps or lids, which had the least class meaning in the dataset, achieving a mAP of 78.3% with an inference time of 6.7ms. A significant contribution of this work is the LD-2024 dataset, curated from indoor and outdoor environments with manually annotated images. Performance comparisons were made with several YOLO versions (YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9 variants) and traditional object detectors (Faster R-CNN, Center Net, Retina Net, Cascade R-CNN). Ablation studies validate the effectiveness of LD-YOLOv9s, which outperforms conventional methods, achieving a 6.3% improvement in mean average precision (mAP) over YOLOv9s on the LD-2024 dataset.
... Inspired by spatial pyramid matching (SPM) [84], He et al. proposed SPP-net [85] to boost R-CNN's speed and learn more distinct features. Unlike the traditional method of clipping proposed regions and passing them separately into a CNN model, SPP-net uses a deep convolutional network to generate the feature map from the entire image. ...
Article
Full-text available
Underwater waste detection is a critical challenge for preserving aquatic ecosystems, particularly due to inherent underwater distortions such as light refraction, occlusion, and scattering. In this study, we present a novel deep learning framework for real-time underwater waste detection by evaluating state-of-the-art object detection algorithms on a manually annotated custom dataset comprising 10,000 curated augmented images across various water bodies to simulate real-world turbidity, illumination, and occlusion, consisting of three classes: Underwater Trash, Rover, and Biological Life. Our approach incorporates robust feature extraction and specialized data augmentation techniques that effectively mitigate the adverse effects of underwater distortions. We investigate multiple architectures, including YOLOv8n, YOLOv7, YOLOv6s, YOLOv5s, Faster R-CNN, and Mask R-CNN, analyzing key parameters such as mean average precision (mAP), inference speed, and computational efficiency. Our results demonstrate that YOLO-based models achieve faster & accurate performance, with YOLOv8n and YOLOv7 reaching a detection accuracy of 96%, compared to 81% and 83% for Faster R-CNN and Mask R-CNN, respectively. Computationally, YOLOv8n achieves 76 ms per frame, while YOLOv5s runs at 14 ms per frame, confirming real-time viability for AUV deployment. The proposed approach offers significant advantages over existing methods by enabling rapid, accurate detection with low computational overhead, thereby paving the way for integration with autonomous underwater vehicles (AUVs) for environmental monitoring and waste management.
... The main difference between them is that the twostage model first generates candidate bounding boxes that may contain defects and then classifies the cropped subregions for defect identification, whereas the one-stage model directly predicts the location, size, class, and confidence score of defects based on features extracted by the CNN. The representative two-stage detection algorithms include R-CNN [16], SPP-net [17], and Fast R-CNN [18]. He et al. [19] subsequently proposed Mask R-CNN in 2017, which employs a Residual Network (ResNet) as the backbone and utilizes a Feature Pyramid Network (FPN) to extract multiscale feature information, integrating a fully connected segmentation subnet. ...
Article
Full-text available
To address the issues of missed detections and low algorithmic accuracy in detecting surface defects of hot-rolled strip steel in complex backgrounds, this paper proposes the PMSE-YOLO surface defect detection algorithm. First, to enhance the model’s ability to extract multi-scale features, the CSP_PMS structure is designed to optimize the C2f structure in the Backbone, improving feature extraction for multi-scale targets. Second, to preserve the semantic information of small targets in the Neck, the EfficientRepBiPAN structure is adopted, utilizing cross-layer connections to enhance multi-scale feature representation and achieve small target feature fusion. Finally, the Wise-ShapeIoU loss function is incorporated to enhance the model’s detection performance. Experimental validation on the NEU-DET dataset demonstrates that PMSE-YOLO reduces parameters by 17% and computational cost by 18.5%, while improving mAP@0.5 by 3.1 percentage points to 82.2% compared to the baseline network. PMSE-YOLO balances lightweight design and real-time performance while enhancing surface defect detection accuracy for hot-rolled strips, facilitating deployment on edge devices. Furthermore, experimental results on the GC10-DET dataset confirm the superior generalization capability of the proposed model.
... To improve the ability of road sensing network to feature extraction, CSPDarknet [30] is selected as the backbone network. The backbone network consists of five layers, the last three layers output the feature map via the spatial pyramid pooling (SPP) module [31] then send it to AUDNet for multiscale information enhancement. ...
Article
Full-text available
Vision-based environmental perception has demonstrated significant promise for autonomous driving applications. However, the traditional unidirectional feature flow in many perception networks often leads to inadequate information propagation, which hinders the system’s ability to comprehensively perceive complex driving environments. Issues such as similar objects, illumination variations, and scale differences aggravate this limitation, introducing noise and reducing the reliability of the perception system. To address these challenges, we propose a novel Attention-Aware Upsampling-Downsampling Network (AUDNet). AUDNet utilizes a bidirectional feature fusion structure, incorporating a multi-scale attention upsampling module (MAU) to enhance the fine details in high-level features by guiding the selection of feature information. Additionally, the multi-scale attention downsampling module (MAD) is designed to reinforce the semantic understanding of low-level features by emphasizing relevant spatial details. Extensive experiments on a large-scale, real-world driving dataset demonstrate the superior performance of AUDNet, particularly in multi-task environment perception in complex and dynamic driving scenarios.
... They are also computationally more expensive, which limits their practicality in scenarios that require immediate feedback during procedures. Although advancements like spatial pyramid pooling have improved their speed, they still lag behind single-stage models in real-time performance [14]. Leveraging DL detectors for polyps detection may aid endoscopists in reducing the missed polyp rate. ...
Preprint
Full-text available
Colorectal cancer is one of the deadliest cancers today, but it can be prevented through early detection of malignant polyps in the colon, primarily via colonoscopies. While this method has saved many lives, human error remains a significant challenge, as missing a polyp could have fatal consequences for the patient. Deep learning (DL) polyp detectors offer a promising solution. However, existing DL polyp detectors often mistake white light reflections from the endoscope for polyps, which can lead to false positives. To address this challenge, in this paper we propose a novel data augmentation approach that artificially adds more white light reflections to create harder training scenarios. Specifically, we first generate a bank of artificial lights using the training dataset. Then we identify the regions of the training images to which these artificial lights should not be added. Finally, we propose a sliding window method to add the artificial light to suitable areas of the training images, resulting in augmented images. By providing the model with more opportunities to make mistakes, we hypothesize that it will also have more chances to learn from those mistakes, ultimately improving its performance in polyp detection. Experimental results demonstrate the effectiveness of our new data augmentation method.
... In addition to OBB, the model incorporates advanced attention mechanisms to further enhance detection performance. A notable improvement is the integration of a Transformer Module into the backbone of MHA-YOLOv8, strategically positioned before the Spatial Pyramid Pooling with Faster (SPPF) module [6]. The Transformer Encoder within this module leverages Multi-Head Attention (MHA) to refine feature extraction by capturing global semantic relationships across the entire image. ...
... The formulation is as follows: M_c = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))), where F is the input feature map; AvgPool(F) and MaxPool(F) are the global average pooling and max pooling outputs; MLP is a two-layer fully connected network with a bottleneck structure using a reduction ratio r; σ denotes the sigmoid activation function; and M_c is the channel attention weight vector. The spatial attention mechanism takes the feature map F output from the channel attention module as its input [28]. First, it applies global max pooling and average pooling along the channel dimension to generate two feature maps of size H × W × 1. ...
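The channel and spatial attention steps spelled out in the excerpt correspond to the usual CBAM formulation; the sketch below follows those definitions, with the reduction ratio of 16 and the 7x7 spatial kernel being the customary choices and therefore assumptions here.

```python
# Sketch of CBAM-style channel and spatial attention, matching the formulas described above.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                        # shared two-layer bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))               # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))                # MLP(MaxPool(F))
        mc = torch.sigmoid(avg + mx)[:, :, None, None]   # M_c: channel attention weights
        return x * mc


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise average and max pooling produce the two H x W x 1 maps mentioned above.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```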
Article
Full-text available
In response to the decreased accuracy in person detection caused by densely populated areas and mutual occlusions in public spaces, a human head-detection approach is employed to assist in detecting individuals. To address key issues in dense scenes—such as poor feature extraction, rough label assignment, and inefficient pooling—we improved the YOLOv7 network in three aspects: adding attention mechanisms, enhancing the receptive field, and applying multi-scale feature fusion. First, a large amount of surveillance video data from crowded public spaces was collected to compile a head-detection dataset. Then, based on YOLOv7, the network was optimized as follows: (1) a CBAM attention module was added to the neck section; (2) a Gaussian receptive field-based label-assignment strategy was implemented at the junction between the original feature-fusion module and the detection head; (3) the SPPFCSPC module was used to replace the multi-space pyramid pooling. By seamlessly uniting CBAM, RFLAGauss, and SPPFCSPC, we establish a novel collaborative optimization framework. Finally, experimental comparisons revealed that the improved model’s accuracy increased from 92.4% to 94.4%; recall improved from 90.5% to 93.9%; and inference speed increased from 87.2 frames per second to 94.2 frames per second. Compared with single-stage object-detection models such as YOLOv7 and YOLOv8, the model demonstrated superior accuracy and inference speed. Its inference speed also significantly outperforms that of Faster R-CNN, Mask R-CNN, DINOv2, and RT-DETRv2, markedly enhancing both small-object (head) detection performance and efficiency.
... These models can be categorized based on several factors, including one-stage and two-stage models in terms of the detection process steps required, anchor-based and anchor-free models depending on the presence of predefined anchors, and CNN-based and Transformer-based models according to the backbone blocks employed. Two-stage models, for instance, identify targets by initially generating numerous regions that could potentially contain targets, followed by refining subsequent regressions based on these regions, leading to longer processing times (Cai & Vasconcelos, 2019; He et al., 2015, 2017; Liu et al., 2018). Anchor-based models entail the predefined specification of various anchor box sizes, with the model subsequently predicting the detection box by determining the offset relative to the anchor box. ...
Article
Full-text available
Recent advancements in automated tunnel defect detection have utilized high-resolution cameras and mobile laser scanners. However, the inability of cameras to accurately capture 3D spatial coordinates complicates tasks such as 3D visualization, while the relatively low resolution of laser scanners makes it difficult to detect small defects such as microcracks. In this paper, a comprehensive inspection method is proposed to address these limitations by integrating multi-defect detection, 3D coordinate acquisition, and visualization. The inspection process involves the capture of both image data and point cloud data of tunnel linings using the newly developed inspection cart (MTI-300). The proposed fusion approach combines image and point cloud data, leveraging the enhanced YOLOv8-seg instance segmentation model for defect identification. The scale-invariant feature transform (SIFT) algorithm is used to match local defect regions in the image data with the corresponding point cloud data, enabling the extraction of 3D coordinates and the integration of defect pixels with the point cloud information. Subsequently, a lightweight 3D reconstruction model is developed to visualize the entire tunnel and its defects using the fused data. The performance of the proposed method is validated and substantiated through a field experiment on Metro Line 8 in Qingdao, China.
... This model, conceived by Yann LeCun et al. in the early 1990s, stands as a seminal convolutional neural network (CNN) architecture [35]. Primarily engineered for handwritten digit recognition tasks, it attained notable success in discerning digits from the MNIST dataset. ...
Article
Full-text available
In urban road autonomous driving, accurately identifying various objects on the road is crucial for ensuring safety and reliability. The Faster R-CNN model, known for its real-time object detection capabilities, is a key component in urban road traffic monitoring systems. It uses a two-stage approach involving a region proposal network (RPN) for generating object bounding boxes and a convolutional neural network (CNN) for refining these proposals to classify and localize objects accurately. Enhanced by improvements like refined anchor box prediction, feature pyramid networks, and advanced backbone architecture, Faster R-CNN offers better accuracy and efficiency, although it can be computationally intensive and slower for real-time applications. Urban road traffic monitoring faces challenges such as high computational demands, the need for real-time processing, and complex urban environments. Traditional models like Faster R-CNN, while accurate, can be too slow for real-time applications. The DETR (DEtection TRansformer) model, using a transformer-based architecture, provides robust feature extraction but may struggle with speed. YOLO models, on the other hand, are fast but may sacrifice some accuracy for speed. To address these issues, the proposed DETR-YOLO v8n Fusion model combines the strengths of DETR's robust feature extraction and YOLO v8n's real-time detection capabilities. The fusion leverages attention mechanisms to improve accuracy without compromising speed, providing a balanced solution for urban traffic monitoring. The study highlights improvements in YOLO v8, including refined anchor box prediction, feature pyramid networks for multi-scale detection, and an advanced backbone architecture. These enhancements enable the model to identify objects with greater accuracy, especially in complex urban traffic scenarios. The model introduced in this paper, YOLO v8n, is faster and more accurate in detecting and monitoring vehicles on urban roads, with an accuracy of 0.860 mAP@50, a precision of 85.0%, and a recall of 85.8%. This research contributes to the field of intelligent transportation by showcasing advancements in object detection for autonomous vehicles and setting new benchmarks in urban road traffic monitoring systems. The integration of YOLO v8 and DETR continues to refine object detection mechanisms, enhancing the safety and efficiency of intelligent vehicles navigating complex urban environments.
... The use of this cross-overlapping pooling structure facilitates smoother transitions during the pooling process, thereby mitigating the abrupt changes caused by excessive pooling and enabling the extraction of more local information [38]. The designed structure of the spatial feature pyramid module is illustrated in Figure 3. ...
Article
Full-text available
Object detection in low-light environments is often hampered by unfavorable factors such as low brightness, low contrast, and noise, which lead to issues like missed detections and false positives. To address these challenges, this paper proposes a low-light object detection algorithm named Dark-YOLO, which dynamically extracts features. First, an adaptive image enhancement module is introduced to restore image information and enrich feature details. Second, the spatial feature pyramid module is improved by incorporating cross-overlapping average pooling and max pooling to extract salient features while retaining global and local information. Then, a dynamic feature extraction module is designed, which combines partial convolution with a parameter-free attention mechanism, allowing the model to flexibly capture critical and effective information from the image. Finally, a dimension reciprocal attention module is introduced to ensure the model can comprehensively consider various features within the image. Experimental results show that the proposed model achieves an mAP@50 of 71.3% and an mAP@50-95 of 44.2% on the real-world low-light dataset ExDark, demonstrating that Dark-YOLO effectively detects objects under low-light conditions. Furthermore, facial recognition in dark environments is a particularly challenging task. Dark-YOLO demonstrates outstanding performance on the DarkFace dataset, achieving an mAP@50 of 49.1% and an mAP@50-95 of 21.9%, further validating its effectiveness for face detection under complex low-light conditions.
... This results in very slow detection speeds for the R-CNN algorithm, making it unable to satisfy industrial demands. In order to reduce the computational redundancy brought by lots of overlapping region proposals, He et al. proposed Spatial Pyramid Pooling Network (SPP-Net) 25 . SPP-Net directly feeds the entire image into a convolutional neural network (CNN), then extracts image features using Spatial Pyramid Pooling (SPP), and finally utilizes a fully connected neural network to produce the final output. ...
Article
Full-text available
Defect detection is vital for product quality in industrial production, yet current surface defect detection technologies struggle with diverse defect types and complex backgrounds. The challenge intensifies with multi-scale small targets, leading to significantly reduced detection performance. Therefore, this paper proposes the EPSC-YOLO algorithm to improve the efficiency and accuracy of defect detection. The algorithm first introduces multi-scale attention modules and uses two newly designed pyramid convolutions in the backbone network to better identify multi-scale defects; secondly, Soft-NMS is introduced to replace traditional NMS, which can reduce information loss and improve multi-target detection accuracy by smoothing and suppressing the scores of overlapping boxes. In addition, a new convolutional attention module, CISBA, is designed to enhance the detection capability of small targets in complex backgrounds. In the end, we validate the effectiveness of EPSC-YOLO on the NEU-DET and GC10-DET datasets. The experimental results show that, compared to YOLOv9c, mAP^{val}_{50} increases by 2% and 2.4%, and mAP^{val}_{50:95} increases by 5.1% and 2.4%, respectively. Meanwhile, EPSC-YOLO demonstrates superior accuracy and significant advantages in real-time detection of surface defects on products compared to algorithms such as YOLOv10 and MSFT-YOLO.
Article
Synthetic aperture radar (SAR) is characterized by its all-weather monitoring capabilities and high-resolution imaging. It plays a crucial role in operations such as marine salvage and strategic deployments. However, existing vessel detection technologies face challenges such as occlusion and deformation of targets in multi-scale target detection and significant interference noise in complex scenarios like coastal areas and ports. To address these issues, this paper proposes an algorithm based on YOLOv8 for detecting ship targets in complex backgrounds using SAR images, named DFENet (Denoising and Feature Enhancement Network). First, we design a background suppression and target enhancement module (BSTEM), which aims to suppress noise interference in complex backgrounds. Second, we further propose a feature enhancement attention module (FEAM) to enhance the network’s ability to extract edge and contour features, as well as to improve its dynamic awareness of critical areas. Experiments conducted on public datasets demonstrate the effectiveness and superiority of DFENet. In particular, compared with the benchmark network, the detection accuracy of mAP75 on the SSDD and HRSID is improved by 2.3% and 2.9%, respectively. In summary, DFENet demonstrates excellent performance in scenarios with significant background interference or high demands for positioning accuracy, indicating strong potential for various applications.
Article
Power equipment anomaly detection is essential for ensuring the stable operation of power systems. Existing models have high false and missed detection rates in complex weather and multi-scale equipment scenarios. This paper proposes a YOLO-SRSA-based anomaly detection algorithm. For data enhancement, geometric and color transformations and rain-fog simulations are applied to preprocess the dataset, improving the model’s robustness in outdoor complex weather. In the network structure improvements, first, the ACmix module is introduced to reconstruct the SPPCSPC network, effectively suppressing background noise and irrelevant feature interference to enhance feature extraction capability; second, the BiFormer module is integrated into the efficient aggregation network to strengthen focus on critical features and improve the flexible recognition of multi-scale feature images; finally, the original loss function is replaced with the MPDIoU function, optimizing detection accuracy through a comprehensive bounding box evaluation strategy. The experimental results show significant improvements over the baseline model: mAP@0.5 increases from 89.2% to 93.5%, precision rises from 95.9% to 97.1%, and recall improves from 95% to 97%. Additionally, the enhanced model demonstrates superior anti-interference performance under complex weather conditions compared to other models.
Article
Full-text available
Background Idiopathic macular hole is an ophthalmic disease that seriously affects vision, and its early diagnosis and treatment have important clinical significance to reduce the occurrence of blindness. At present, OCT is the gold standard for diagnosing this disease, but its application is limited due to the need for professional ophthalmologist to diagnose it. The introduction of artificial intelligence will break this situation and make its diagnosis efficient, and how to build an effective predictive model is the key to the problem, and more clinical trials are still needed to verify it. Objective This study aims to evaluate the role of deep learning systems in Idiopathic Macular Hole diagnosis, grading, and prediction. Methods A single-center, retrospective study used binocular OCT images from IMH patients at the First Affiliated Hospital of Nanchang University (November 2019 - January 2023). A deep learning algorithm, including traditional omics, Resnet101, and fusion models incorporating multi-feature fusion and transfer learning, was developed. Model performance was evaluated using accuracy and AUC. Logistic regression analyzed clinical factors, and a nomogram predicted surgical risk. Analysis was conducted with SPSS 22.0 and R 3.6.3. P < 0.05 was statistically significant. Results Among 229 OCT images, the traditional omics, Resnet101, and fusion models achieved accuracies of 93%, 94%, and 95%, respectively, in the training set. In the test set, the fusion model and Resnet101 correctly identified 39 images, while the traditional omics model identified 35. The nomogram had a C-index of 0.996, with macular hole diameter most strongly associated with surgical risk. Conclusion The deep learning system with transfer learning and multi-feature fusion effectively diagnoses and grades IMH from OCT images.
Preprint
Full-text available
The breakthrough of artificial intelligence (AI) technology provides strong support for education. This paper adopts the Citespace visualization method to quickly identify the key research on the application of artificial intelligence in the field of engineering education, as well as the most active research frontiers and development trends. The analysis is conducted from six dimensions: the number of publications, cooperating countries, research institutions, co-citations, keywords, and emergent words. The hotspots focus on how to use AI and machine learning technologies, especially by building intelligent education models, to revolutionize engineering education models and improve education quality. At the forefront of the latest research, artificial neural networks, optimization, performance, impact, deep neural networks, management, and data science have become hot topics. This implies that future research will pay more attention to deep integration, interdisciplinary integration, data-driven approaches, personalization, and intelligence. These research results can provide a global knowledge map and literature basis for the in-depth development and wide application of AI in the field of engineering education.
Conference Paper
Earth remote sensing data can be applied to detect and assess the condition of infrastructure objects on vast territories. One such object is electric pylons, which ensure the sustainability of the energy supply in rural and urban areas. In some remote regions, power lines can be damaged by natural hazards such as earthquakes, strong winds, or floods. Currently, the main limitation in developing highly effective algorithms for electric utilities assessment is associated with data availability and diverse environmental conditions. Therefore, in this study, we aim to explore solutions for new study territories with various backgrounds and forms of electric pylons. We examined several detection algorithms from the YOLO family. The study includes experiments with datasets for Chinese territories and additionally collected data for regions in Russia. We managed to improve the initial score for pylon detection, achieving an mAP of 79.8%. The obtained results demonstrate high potential for power lines assessment and damage detection through satellite data and deep learning algorithms.
Preprint
Full-text available
Vehicle detection is crucial for intelligent decision support in transportation systems. However, real-time detection of vehicles is challenging due to geometric variations of vehicles and complex environmental factors such as light conditions and weather. To address these issues, the paper introduces the You-Only-Look-Once with Deformable Convolution and Cross-channel Coordinate Attention (YOLO-DC) framework that improves the performance and reliability of vehicle detection. First, YOLO-DC incorporates Cross-channel Coordinate Attention, which combines channel attention and coordinate attention, to more accurately cover target sampling positions and enhance feature extractions from vehicles of various shapes. Second, to better handle vehicles of different sizes, we employ Multi-scale Grouped Convolution to enable multi-scale awareness and streamline parameter sharing. Additionally, we incorporate channel prior convolutional attention so that the model can concentrate on areas of vehicles that are critical for detection. We also optimize feature fusion by leveraging a highly efficient fusion of C2f(CSP Bottleneck with 2 Convolutions) and FasterNet to reduce the model size. Experimental results demonstrate that YOLO-DC performs better than state-of-the-art YOLOv8n method in detecting small, medium, and large-sized vehicles, and in detecting vehicles in adverse weather conditions. In addition to its superior performance, YOLO-DC also features fast detection speed, making it appropriate for real-time detection on devices with limited computational power.
Article
Accurate printed circuit board (PCB) defect detection is crucial for ensuring manufacturing efficiency and minimizing failure rates in electronic devices. This study addresses the limitations of traditional bounding box-based methods by employing instance segmentation using YOLOv7 and YOLOv8. The proposed models leverage pixel-level feature extraction to precisely localize and classify PCB defects, achieving higher accuracy than existing techniques. Trained and evaluated on a test dataset of 69 images, our approach demonstrates superior performance in precision, recall, and mean Average Precision (mAP). Experimental results show that YOLOv7seg outperforms YOLOv8seg across multiple metrics for both bounding box and instance segmentation. Specifically, YOLOv7seg achieves a precision of 0.863 (bounding box) and 0.44 (mask), with recall of 0.884 and 0.411, and mAP@0.5 of 0.897 (bounding box) and 0.318 (mask), whereas YOLOv8seg attains a precision of 0.722 (bounding box) and 0.305 (mask), with recall of 0.582 and 0.29, and mAP@0.5 of 0.627 (bounding box) and 0.207 (mask). While YOLOv8seg demonstrates faster inference times (29.4 ms per image on GPU versus 42.5 ms for YOLOv7seg), the latter consistently delivers higher segmentation accuracy. These findings highlight the potential of YOLOv7-based instance segmentation to enhance defect detection in industrial PCB inspection, offering a balance between precision and real-time feasibility.
Thesis
Full-text available
The visual assessment of microscopic samples by pathologists constitutes an essential component of cancer diagnostics. Traditional pathology workflows were based on the visual assessment of samples under the microscope. The development of designated slide scanners has facilitated the digitization of microscopy samples, which not only allowed for digital archiving and remote expert consultancy but also facilitated the use of machine learning-based image analysis algorithms for computer-aided diagnosis. Meanwhile, a wide range of computer-aided systems has been developed in the field of histopathology, often matching the performance of trained pathologists. Previous work has shown that machine learning-based image analysis algorithms, and especially convolutional neural networks, can be very susceptible to changes in the visual appearance of images. In pathology, these domain shifts can be caused when applying trained algorithms to different morphologies, or samples prepared at a different pathology lab. The preparation of histologic samples follows routine stages, including tissue fixation, dehydration, paraffin embedding, and microtome sectioning. Subsequently, a sample is usually stained with a specific dye and digitized with a designated scanning system. The visual manifestation of these sample preparation steps can be very specific to the respective pathology lab. This thesis investigates the impact of different domain shifts on the performance of convolutional neural networks in histopathology. For these experiments, three routine tasks in cancer diagnostics were considered: cross-scanner mitotic figure detection, cross-domain tumor segmentation, and pan-tumor T-lymphocyte detection on immunohistochemistry samples. For the task of cross-scanner mitotic figure detection, domain adversarial training was employed. Evaluations of the learned embeddings demonstrated the successful extraction of scanner-agnostic features. For the task of cross-domain tumor segmentation, representation learning and, in particular, self-supervised learning was explored as a pre-training strategy to align feature embeddings across domains and thereby enhance the domain agnosticity for the downstream task. The results provide insights into the applicability of self-supervised learning in the context of histopathology. To date, this technique has mostly been employed in the field of natural images. In a project addressing the detection of tumor-infiltrating lymphocytes in immunohistochemistry samples, fine-tuning was leveraged to bridge the domain gap between different tumor indications. Initial experiments exhibited degraded performance on out-of-distribution samples. By exploiting fine-tuning on a limited number of target domain samples, this degradation was effectively mitigated. The experiments allowed for recommendations on the development of robust algorithms for the detection of lymphocytes across different tumor morphologies. In the course of the thesis, several cross-domain datasets were curated, focusing on different sources of domain shift. This includes a fully annotated dataset of 350 whole slide images covering seven canine cutaneous tumor subtypes, which constitutes one of the most comprehensive open histopathology segmentation datasets to date. A high annotation quality of each published dataset was ensured through extensive multi-rater experiments on selected subsets of the data. By making these datasets publicly available, future work on cross-domain generalization for histopathology was facilitated.
Article
Synthetic aperture radar (SAR) offers robust Earth observation capabilities under diverse lighting and weather conditions, making SAR-based aircraft detection crucial for various applications. However, this task presents significant challenges, including extracting discrete scattering features, mitigating interference from complex backgrounds, and handling potential label noise. To tackle these issues, we propose the scattering feature extraction and fusion network (SFEF-Net). First, we propose an innovative sparse convolution operator and apply it to feature extraction. Compared to traditional convolution, sparse convolution offers more flexible sampling positions and a larger receptive field without increasing the number of parameters, which enables SFEF-Net to better extract discrete features. Second, we develop the global information fusion and distribution module (GIFD) to fuse feature maps of different levels and scales. GIFD possesses the capability for global modeling, enabling the comprehensive fusion of multi-scale features and the utilization of contextual information. Additionally, we introduce a noise-robust loss to mitigate the adverse effects of label noise by reducing the weight of outliers. To assess the performance of our proposed method, we carried out comprehensive experiments on the SAR-AIRcraft1.0 dataset. The experimental results demonstrate the outstanding performance of SFEF-Net.
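The noise-robust loss is described only at a high level; the PyTorch sketch below shows one simple way of down-weighting large-residual samples, as illustration only, not SFEF-Net's actual formulation (the exponential weighting and the beta parameter are assumptions).

import torch

def noise_robust_l1(pred, target, beta=1.0):
    # Residuals that are unusually large (likely label noise) get exponentially
    # smaller weights, so outliers contribute little to the loss and its gradient.
    residual = (pred - target).abs()
    weight = torch.exp(-residual / beta)
    return (weight * residual).mean()

pred = torch.tensor([0.1, 0.2, 5.0])   # the last prediction behaves like an outlier
target = torch.zeros(3)
print(noise_robust_l1(pred, target))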
Article
Full-text available
Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is, not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC.
Article
Full-text available
Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
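A minimal sketch of this kind of transfer, assuming a recent torchvision with the weights API: the ImageNet-pretrained convolutional layers are frozen and reused as a mid-level representation while a new classification head is trained. This is an approximation of the scheme, not the authors' exact architecture (which adds adaptation layers); the 20-class VOC head and hyperparameters are illustrative.

import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and freeze its convolutional layers.
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for p in net.features.parameters():
    p.requires_grad = False

# Replace the final fully connected layer with a head for 20 PASCAL VOC classes.
net.classifier[6] = nn.Linear(4096, 20)

# Only the still-trainable parameters (the new head and remaining FC layers)
# are optimized; the transferred mid-level representation is reused as-is.
optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)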
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
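A simplified PyTorch sketch of an Inception-style module: parallel 1x1, 3x3, 5x5, and pooling branches whose outputs are concatenated along the channel axis, with 1x1 convolutions reducing channels before the expensive filters. The branch widths below follow the commonly cited first Inception block of GoogLeNet; this is a sketch of the module, not the full 22-layer network.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel 1x1, 3x3, 5x5 and pooling branches; 1x1 convolutions reduce the
    # channel count before the expensive filters, and outputs are concatenated.
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192, 64, 96, 128, 16, 32, 32)(x).shape)  # (1, 256, 28, 28)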
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
This paper addresses the challenge of establishing a bridge between deep convolutional neural networks and conventional object detection frameworks for accurate and efficient generic object detection. We introduce Dense Neural Patterns, short for DNPs, which are dense local features derived from discriminatively trained deep convolutional neural networks. DNPs can be easily plugged into conventional detection frameworks in the same way as other dense local features (like HOG or LBP). The effectiveness of the proposed approach is demonstrated with the Regionlets object detection framework. It achieved 46.1% mean average precision on the PASCAL VOC 2007 dataset, and 44.1% on the PASCAL VOC 2010 dataset, which dramatically improves the original Regionlets approach without DNPs.
Conference Paper
Full-text available
Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Remarkably we report better or competitive results compared to the state-of-the-art in all the tasks on various datasets. The results are achieved using a linear SVM classifier applied to a feature representation of size 4096 extracted from a layer in the net. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual classification tasks.
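A minimal sketch of this recipe with scikit-learn, assuming 4096-D activations have already been extracted from a pretrained network; random placeholders stand in for real features and labels, and the L2 normalization is a common but optional preprocessing choice.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Placeholder data standing in for 4096-D activations extracted from a CNN layer.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 4096)).astype(np.float32)
labels = rng.integers(0, 10, size=200)

X = normalize(features)          # L2-normalise the deep features
clf = LinearSVC(C=1.0).fit(X, labels)
print(clf.score(X, labels))      # training accuracy of the linear classifier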
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near-state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Full-text available
We investigate multiple techniques to improve upon the current state-of-the-art deep convolutional neural network based image classification pipeline. The techniques include adding more image transformations to training data, adding more transformations to generate additional predictions at test time, and using complementary models applied to higher resolution images. This paper summarizes our entry in the ImageNet Large Scale Visual Recognition Challenge 2013. Our system achieved a top-5 classification error rate of 13.55% using no external data, which is over a 20% relative improvement on the previous year's winner.
Article
Full-text available
Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.
Article
Full-text available
This report presents the results of the 2006 PASCAL Visual Object Classes Challenge (VOC2006). Details of the challenge, data, and evaluation are presented. Participants in the challenge submitted descriptions of their methods, and these have been included verbatim. This document should be considered preliminary, and subject to change.
Conference Paper
Full-text available
The traditional SPM approach based on bag-of-features (BoF) requires nonlinear classifiers to achieve good image classification performance. This paper presents a simple but effective coding scheme called Locality-constrained Linear Coding (LLC) in place of the VQ coding in traditional SPM. LLC utilizes the locality constraints to project each descriptor into its local-coordinate system, and the projected coordinates are integrated by max pooling to generate the final representation. With a linear classifier, the proposed approach performs remarkably better than the traditional nonlinear SPM, achieving state-of-the-art performance on several benchmarks. Compared with the sparse coding strategy [22], the objective function used by LLC has an analytical solution. In addition, the paper proposes a fast approximated LLC method by first performing a K-nearest-neighbor search and then solving a constrained least squares fitting problem, bearing computational complexity of O(M + K^2). Hence even with very large codebooks, our system can still process multiple frames per second. This efficiency significantly adds to the practical values of LLC for real applications.
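A NumPy sketch of the approximated LLC encoder as described: coding is restricted to the K codewords nearest to the descriptor, and a small constrained least-squares problem with a sum-to-one constraint is solved over them. The regularization constant and the random codebook below are illustrative.

import numpy as np

def llc_encode(x, codebook, k=5, reg=1e-4):
    # Approximated LLC: restrict coding to the k nearest codewords and solve a
    # small least-squares system whose solution is rescaled to sum to one.
    d2 = np.sum((codebook - x) ** 2, axis=1)
    idx = np.argsort(d2)[:k]
    z = codebook[idx] - x                     # shift codewords to the descriptor
    C = z @ z.T
    C += reg * np.trace(C) * np.eye(k)        # regularise the local covariance
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                              # sum-to-one constraint
    code = np.zeros(codebook.shape[0])
    code[idx] = w
    return code

rng = np.random.default_rng(0)
B = rng.standard_normal((1024, 128))          # 1024 codewords for 128-D descriptors
print(np.count_nonzero(llc_encode(rng.standard_normal(128), B)))  # -> 5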
Conference Paper
Full-text available
Recently, SVMs using the spatial pyramid matching (SPM) kernel have been highly successful in image classification. Despite its popularity, these nonlinear SVMs have a complexity of O(n^2)~O(n^3) in training and O(n) in testing, where n is the training size, implying that it is nontrivial to scale up the algorithms to handle more than thousands of training images. In this paper we develop an extension of the SPM method, by generalizing vector quantization to sparse coding followed by multi-scale spatial max pooling, and propose a linear SPM kernel based on SIFT sparse codes. This new approach remarkably reduces the complexity of SVMs to O(n) in training and a constant in testing. In a number of image categorization experiments, we find that, in terms of classification accuracy, the suggested linear SPM based on sparse coding of SIFT descriptors always significantly outperforms the linear SPM kernel on histograms, and is even better than the nonlinear SPM kernels, leading to state-of-the-art performance on several benchmarks by using a single type of descriptors.
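A NumPy sketch of the multi-scale spatial max-pooling step: per-descriptor codes are max-pooled within each cell of a 1x1, 2x2, 4x4 pyramid and the cell vectors are concatenated. The pyramid levels, code dimension, and random data are illustrative.

import numpy as np

def spatial_pyramid_max_pool(codes, xy, levels=(1, 2, 4)):
    # codes: (N, K) sparse codes of N local descriptors; xy: (N, 2) positions in [0, 1).
    # Codes are max-pooled inside every cell of each pyramid level and concatenated.
    pooled = []
    for l in levels:
        cell = np.minimum((xy * l).astype(int), l - 1)
        for i in range(l):
            for j in range(l):
                mask = (cell[:, 0] == i) & (cell[:, 1] == j)
                pooled.append(codes[mask].max(axis=0) if mask.any()
                              else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)              # (1 + 4 + 16) * K dimensions

rng = np.random.default_rng(0)
print(spatial_pyramid_max_pool(rng.random((500, 64)), rng.random((500, 2))).shape)  # (1344,)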
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
Conference Paper
Full-text available
For object recognition, the current state-of-the-art is based on exhaustive search. However, to enable the use of more expensive features and classifiers and thereby progress beyond the state-of-the-art, a selective search strategy is needed. Therefore, we adapt segmentation as a selective search by reconsidering segmentation: We propose to generate many approximate locations over few and precise object delineations because (1) an object whose location is never generated cannot be recognised and (2) appearance and immediate nearby context are most effective for object recognition. Our method is class-independent and is shown to cover 96.7% of all objects in the Pascal VOC 2007 test set using only 1,536 locations per image. Our selective search enables the use of the more expensive bag-of-words method which we use to substantially improve the state-of-the-art by up to 8.5% for 8 out of 20 classes on the Pascal VOC 2010 detection challenge.
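For experimentation, an implementation of selective search ships with the OpenCV contrib modules; a minimal usage sketch, assuming the opencv-contrib-python package and a placeholder image path (this wraps the same family of algorithm, not the authors' original code).

import cv2

img = cv2.imread("example.jpg")            # placeholder path; any RGB image works
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()           # the faster, lower-recall mode
boxes = ss.process()                       # array of (x, y, w, h) proposals
print(len(boxes), boxes[:3])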
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
Article
Full-text available
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
Article
Full-text available
This paper addresses the problem of large-scale image search. Three constraints have to be taken into account: search accuracy, efficiency, and memory usage. We first present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual words approach for any given vector dimension. We then jointly optimize dimensionality reduction and indexing in order to obtain a precise vector comparison as well as a compact representation. The evaluation shows that the image representation can be reduced to a few dozen bytes while preserving high accuracy. Searching a 100 million image dataset takes about 250 ms on one processor core.
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
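A minimal sketch of computing such a HOG descriptor with scikit-image, using the parameter choices the abstract argues for (9 orientation bins, 8x8-pixel cells, overlapping 2x2-cell blocks, local contrast normalization); the sample image is just a placeholder for a real detection window.

from skimage import data, transform
from skimage.feature import hog

# A 64x128 detection window, the size typically used for pedestrian detection.
window = transform.resize(data.astronaut()[:, :, 0], (128, 64))
descriptor = hog(window,
                 orientations=9,            # fine orientation binning
                 pixels_per_cell=(8, 8),    # spatial cells
                 cells_per_block=(2, 2),    # overlapping blocks
                 block_norm="L2-Hys")       # local contrast normalisation
print(descriptor.shape)                     # (3780,) for a 64x128 window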
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
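One concrete component of such proposal-based pipelines is the greedy non-maximum suppression applied to the per-class scores of the proposal boxes; a minimal NumPy sketch follows (box format, scores, and threshold are illustrative, and this is a generic NMS, not the authors' exact implementation).

import numpy as np

def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thr=0.3):
    # Greedy non-maximum suppression: keep the highest-scoring box, drop the
    # remaining boxes that overlap it by more than the threshold, then repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(i)
        order = order[1:][[iou(boxes[i], boxes[j]) < thr for j in order[1:]]]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]])
print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # -> [0, 2]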
Conference Paper
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box's boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with a well-defined closed boundary can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows into a small fixed size. Based on this observation and computational reasons, we propose to resize the window to 8 × 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300 fps on a single laptop CPU) generates a small set of category-independent, high quality object windows, yielding a 96.2% object detection rate (DR) with 1,000 proposals. Increasing the number of proposals and color spaces for computing BING features, our performance can be further improved to 99.5% DR.
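A NumPy sketch of the 64-D normed-gradient (NG) feature described above: the window is resized to 8x8 and the clipped gradient magnitude is taken at each pixel. The nearest-neighbour resize and the omission of the binarized BING variant are simplifications made for brevity.

import numpy as np

def ng_feature(window):
    # 64-D normed-gradient feature: resize the window to 8x8 (nearest neighbour
    # here, for simplicity) and take the clipped gradient magnitude per pixel.
    h, w = window.shape
    small = window[np.arange(8) * h // 8][:, np.arange(8) * w // 8].astype(float)
    gy, gx = np.gradient(small)
    return np.minimum(np.abs(gx) + np.abs(gy), 255).ravel()

rng = np.random.default_rng(0)
print(ng_feature(rng.integers(0, 256, size=(120, 200))).shape)   # (64,)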
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Conference Paper
Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposals from selective search based on segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves a detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
Article
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Conference Paper
In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to-date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities, where each identity has an average of over a thousand samples. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.25% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 25%, closely approaching human-level performance.
Article
The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. A particularly significant one is data augmentation, which achieves a boost in performance in shallow methods analogous to that observed with CNN-based methods. Finally, we are planning to provide the configurations and code that achieve the state-of-the-art performance on the PASCAL VOC Classification challenge, along with alternative configurations trading-off performance, computation speed and compactness.
Conference Paper
Deep convolutional neural networks (CNN) have shown their promise as a universal representation for recognition. However, global CNN activations at present lack geometric invariance, which limits their robustness for tasks such as classification and matching of highly variable scenes. To improve the invariance of CNN activations without degrading their discriminative power, this paper presents a simple but effective scheme called multi-scale orderless pooling (or MOP-CNN for short). This approach works by extracting CNN activations for local patches at multiple scales, followed by orderless VLAD pooling of these activations at each scale level and concatenating the result. This feature representation decisively outperforms global CNN activations and achieves state-of-the-art performance for scene classification on such challenging benchmarks as SUN397, MIT Indoor Scenes, and ILSVRC2012, as well as for instance-level retrieval on the Holidays dataset.
Article
We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNN) have been shown to perform very well on large scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and it may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets which are not presently available. Part-based models, such as poselets and DPM have been shown to perform well for this problem but they are limited by flat low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Conference Paper
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user-outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation, where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full length feature films.
Article
Current computational approaches to learning visual object categories require thousands of training images, are slow, cannot learn in an incremental manner and cannot incorporate prior information into the learning process. In addition, no algorithm presented in the literature has been tested on more than a handful of object categories. We present a method for learning object categories from just a few training images. It is quick and it uses prior information in a principled way. We test it on a dataset composed of images of objects belonging to 101 widely varied categories. Our proposed method is based on making use of prior information, assembled from (unrelated) object categories which were previously learnt. A generative probabilistic model is used, which represents the shape and appearance of a constellation of features belonging to the object. The parameters of the model are learnt incrementally in a Bayesian manner. Our incremental algorithm is compared experimentally to an earlier batch Bayesian algorithm, as well as to one based on maximum likelihood. The incremental and batch versions have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible. Both Bayesian methods outperform maximum likelihood on small training sets.
Conference Paper
While vector quantization (VQ) has been applied widely to generate features for visual recognition problems, much recent work has focused on more powerful methods. In particular, sparse coding has emerged as a strong alternative to traditional VQ approaches and has been shown to achieve consistently higher performance on benchmark datasets. Both approaches can be split into a training phase, where the system learns a dictionary of basis functions, and an encoding phase, where the dictionary is used to extract features from new inputs. In this work, we investigate the reasons for the success of sparse coding over VQ by decoupling these phases, allowing us to separate out the contributions of training and encoding in a controlled way. Through extensive experiments on CIFAR, NORB and Caltech 101 datasets, we compare several training and encoding schemes, including sparse coding and a form of VQ with a soft threshold activation function. Our results show not only that we can use fast VQ algorithms for training, but that we can just as well use randomly chosen exemplars from the training set. Rather than spend resources on training, we find it is more important to choose a good encoder - which can often be a simple feed forward non-linearity. Our results include state-of-the-art performance on both CIFAR and NORB.
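A NumPy sketch of the soft-threshold encoder discussed above, with randomly chosen, normalized exemplars serving as the dictionary. The dictionary size, patch dimension, and threshold value are illustrative.

import numpy as np

def soft_threshold_encode(x, dictionary, alpha=0.25):
    # Soft-threshold encoder: f_k(x) = max(0, d_k . x - alpha), a fast stand-in
    # for sparse coding at feature-extraction time.
    return np.maximum(0.0, dictionary @ x - alpha)

rng = np.random.default_rng(0)
exemplars = rng.standard_normal((256, 36))                        # random training patches
D = exemplars / np.linalg.norm(exemplars, axis=1, keepdims=True)  # normalised dictionary
code = soft_threshold_encode(rng.standard_normal(36), D)
print(code.shape, np.count_nonzero(code))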
Conference Paper
This paper introduces a method for scene categorization by modeling ambiguity in the popular codebook approach. The codebook approach describes an image as a bag of discrete visual codewords, where the frequency distributions of these words are used for image categorization. There are two drawbacks to the traditional codebook model: codeword uncertainty and codeword plausibility. Both of these drawbacks stem from the hard assignment of visual features to a single codeword. We show that allowing a degree of ambiguity in assigning codewords improves categorization performance for three state-of-the-art datasets.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
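A minimal OpenCV sketch of the first stage of this pipeline, SIFT keypoint matching with the nearest-neighbor ratio test (assuming opencv-python 4.4 or newer, where SIFT is available); the image paths are placeholders, and the Hough clustering and pose-verification stages are omitted.

import cv2

img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]               # Lowe's ratio test
print(len(good), "tentative matches")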
Article
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.