Conference Paper

Rich feature hierarchies for accurate object detection and semantic segmentation


Abstract

Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.


... These models are particularly popular in industrial applications. The primary target detection networks used in defect detection include YOLO [9][10][11][12], R-CNN (Region-based Convolutional Neural Networks) [13], Mask R-CNN [14], RetinaNet [15], and SSD (Single Shot MultiBox Detector) [16]. In automatic defect detection, these networks robustly identify and localize various defect types in images. These techniques autonomously learn and recognize defect features from extensive image data, thereby reducing dependence on manually designed features and enhancing accuracy and stability in identifying defects across various types and sizes. The use of target detection networks can significantly enhance the accuracy and efficiency of defect detection on bearing ring end faces. ...
... Common methods include super-resolution techniques [23], multi-scale learning, adjusting anchor boxes, modifying training strategies, and changing the loss function. The study enhances YOLOv5s' performance for bearing ring end face defects by adding a small object detection head, combining multi-scale learning with anchor computation, and using an attention mechanism to refine the receptive field. ...
... To assess the performance of the enhanced YOLOv5s network, this study employs evaluation metrics. In turn, the accuracy and average accuracy can be further derived from the above formulae, and the equation for calculating the average accuracy (mAP) is shown in (9): $\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AP}_i$ (9). Here, N represents the number of categories within the dataset. ...
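As a quick illustration of equation (9): the mean average precision is the arithmetic mean of the per-class AP values over the N categories. A minimal sketch, with hypothetical class names and AP values:

```python
def mean_average_precision(ap_per_class):
    """Equation (9): mAP is the mean of per-class AP values over N categories."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values for a bearing-defect dataset.
ap = {"scratch": 0.97, "dent": 0.95, "rust": 0.96}
print(f"mAP = {mean_average_precision(ap):.4f}")  # mAP = 0.9600
```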
Article
Bearings, extensively utilized in the industrial sector, play a pivotal role in the defect detection of industrial components. This paper proposes a defect detection algorithm based on an enhanced YOLOv5s to improve the accuracy and speed of bearing ring end face defect detection. The algorithm improves small-target detection by incorporating a small object detection head and combining multi-scale representation learning with anchor box calculations. It introduces a formula to calculate the receptive field sizes of convolutional layers. By calculating the receptive field sizes corresponding to the feature map pixels and integrating this data with anchor analysis, three anchors optimal for the small object detection head are determined. Additionally, an attention mechanism refines the receptive fields of the neural network's output feature maps, enhancing the model's performance. Extensive experiments on a dataset of bearing ring end face defects from industrial sites reveal that the improved YOLOv5s algorithm achieves a detection accuracy (mAP) of 96.14%, a detection speed of up to 44 FPS, and a model size of 20.08M. Compared to other mainstream detection models, this algorithm not only meets but exceeds the real-time detection requirements of industrial production in terms of accuracy and model complexity. With its high precision, compact model size, and rapid detection speed, this algorithm provides a robust foundation for quality control in bearing production.
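The receptive-field formula itself is not reproduced in this summary; a common recurrence (an assumption here, not necessarily the paper's exact formulation) grows the receptive field layer by layer from each kernel size and the product of preceding strides:

```python
def receptive_field(layers):
    """Receptive field of a conv stack via RF_l = RF_{l-1} + (k_l - 1) * prod(earlier strides).

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf, jump = 1, 1  # jump = cumulative stride between adjacent output pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions with strides 2, 1, 2 (illustrative, not YOLOv5s itself).
print(receptive_field([(3, 2), (3, 1), (3, 2)]))  # -> 11
```

Matching each detection head's anchors against receptive fields computed this way is one plausible reading of the anchor-analysis step described above.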
... General object detection is a fundamental and widely studied task in computer vision [90,43,42,17,137,47,48,100], aiming to identify and localize objects of interest within natural images. Specifically, the task requires detecting various categories of objects and providing their corresponding bounding boxes. ...
... Examples of sketch-to-image results generated by GPT-4o. ...
Preprint
Full-text available
Recently, the visual generation ability of GPT-4o(mni) has been unlocked by OpenAI. It demonstrates a very remarkable generation capability with excellent multimodal condition understanding and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous studies, we constructed a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative test. Benefiting from GPT-4o's powerful multimodal comprehension, its image-generation process demonstrates abilities surpassing those of traditional image-generation tasks. Thus, regarding the dimensions of model capabilities, we evaluate its performance across six task categories: traditional image generation tasks, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model's outputs but also probe deeper into GPT-4o's understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well in general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in its ability to perform precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, when faced with knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, there is still a long way to go before it can be reliably applied to professional or safety-critical domains.
... Early deep learning models in the field of object detection are exemplified by the regionbased convolutional neural network (R-CNN) series. The R-CNN model, proposed by Ross Girshick et al. in 2014, marked the first integration of convolutional neural networks (CNNs) with region proposal algorithms (e.g., selective search). This approach enabled object localization and classification by generating candidate regions and extracting CNN features, significantly enhancing the detection accuracy of object detection models and establishing the foundation for the two-stage detection framework [3]. ...
... The R-CNN model, proposed by Ross Girshick et al. in 2014, marked the first integration of convolutional neural networks (CNNs) with region proposal algorithms (e.g., selective search). This approach enabled object localization and classification by generating candidate regions and extracting CNN features, significantly enhancing the detection accuracy of object detection models and establishing the foundation for the two-stage detection framework [3]. Subsequent improvements, such as Faster R-CNN, further advanced the detection accuracy [4,5]. ...
Article
Full-text available
To address the issue of the low precision in detecting defects in aluminum alloy weld seam digital radiography (DR) images using the current target detection algorithms, a modified algorithm named YOLOv8-ELA based on YOLOv8 is proposed. The model integrates a novel HS-FPN feature fusion module, which optimizes the parameter efficiency and enhances the detection performance. For better identification of small defect features, the CA attention mechanism within HS-FPN is substituted with the ELA attention mechanism. Additionally, the first output layer is enhanced with a SimAM attention mechanism to improve the small target recognition. The experimental findings indicate that, at a 0.5 threshold, the YOLOv8-ELA model achieves mean average precision (mAP@0.5) values of 93.3%, 96.4%, and 96.5% for detecting pores, inclusions, and incomplete welds, respectively. These values surpass those of the original YOLOv8 model by 1.4, 2.3, and 0.1 percentage points. Overall, the model attains an average mAP of 95.4%, marking a 1.3% improvement over its predecessor, confirming its superior defect detection capabilities.
... Among the many algorithms developed, the region-based convolutional neural networks (R-CNN) family has been particularly influential in pushing the boundaries of object detection. This family consists of R-CNN [16], Fast R-CNN [17], Faster R-CNN [7], Mask R-CNN [18], Cascade R-CNN [19] and so on. The R-CNN [16] algorithm introduced the idea of combining region proposals with CNNs to achieve accurate object detection. ...
... This family consists of R-CNN [16], Fast R-CNN [17], Faster R-CNN [7], Mask R-CNN [18], Cascade R-CNN [19], and so on. The R-CNN [16] algorithm introduced the idea of combining region proposals with CNNs to achieve accurate object detection. In this approach, selective search generates region proposals, which are passed through a CNN to extract features; a classifier is then trained on these features to detect objects. ...
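A schematic sketch of that pipeline, with stand-ins for the external components (random boxes in place of selective search, random vectors in place of a pretrained CNN, and illustrative SVM weights):

```python
import numpy as np

def propose_regions(image, n=20):
    """Stand-in for selective search: random candidate boxes (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    x1 = np.random.randint(0, w - 32, n)
    y1 = np.random.randint(0, h - 32, n)
    return np.stack([x1, y1, x1 + 32, y1 + 32], axis=1)

def cnn_features(crop):
    """Stand-in for a pretrained CNN; R-CNN used a 4096-d fc feature vector."""
    return np.random.rand(4096)

def rcnn_detect(image, svm_weights, svm_bias):
    """R-CNN-style scoring: one linear SVM score per class for every proposal."""
    boxes = propose_regions(image)
    feats = np.stack([cnn_features(image[y1:y2, x1:x2])
                      for x1, y1, x2, y2 in boxes])
    return boxes, feats @ svm_weights.T + svm_bias  # scores: (n_proposals, n_classes)

image = np.zeros((256, 256, 3))
w, b = np.random.randn(21, 4096), np.zeros(21)  # 20 object classes + background
boxes, scores = rcnn_detect(image, w, b)
print(boxes.shape, scores.shape)  # (20, 4) (20, 21)
```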
Article
Full-text available
In recent years, automatic check-out (ACO) has gained increasing interest and has been widely used in daily life. However, current works mainly rely on both counter and product prototype images in the training phase, and it is hard to maintain the performance in an incremental setting. To deal with this problem, in this paper, we propose a robust prototype-free retrieval method (ROPREM) for ACO, which is a cascaded framework composed of a product detector module and a product retrieval module. Specifically, we use the product detector module to locate the products and then deliver the results to the subsequent product retrieval module for counting. The product detector module is trained without product class information, which can avoid the model overfitting for the known classes and improve the performance for the incremental setting. Additionally, we treat the check-out process as a retrieval process rather than a classification process. The retrieval result is considered as the product class by calculating the feature similarity between the query and gallery templates. The quantities of each category are treated as the final counts. To the best of our knowledge, this is the first attempt to treat ACO as a retrieval task. Extensive experiments are conducted to validate the proposed ROPREM, and the results show that ROPREM achieves the best performance in comparison with several state-of-the-art methods on the public retail product checkout (RPC) dataset.
... The data were gathered from real-world automotive scenarios. Two object detection models were tested: Faster R-CNN [12], [13] and RetinaNet [14]. When pretrained with the COCO dataset [15], these models showed significant degradation among the three environmental conditions, where sunny was the best and heavy rain was the worst. ...
... Faster R-CNN is a common and robust two-stage detector. This algorithm is an advancement over the previous two-stage detectors Fast R-CNN and R-CNN [12], [13]. It uses a region proposal network (RPN) to reduce the computational complexity of determining region proposals (the first of the two stages in a two-stage architecture). ...
Article
Full-text available
Visible spectrum cameras have emerged as a key technology in Advanced Driving Assistance Systems (ADAS) and automated vehicles. An important question to be answered is how these sensors perform in challenging adverse weather conditions, such as rain. Although progress has been made in determining the impact of rain on computer vision performance, previous studies have generally focused on end-to-end object detection system performance and have not addressed the specific impact of rain in detail. Moreover, the lack of image datasets with detailed labeling acquired under rain conditions means that the impact of rain remains a relatively under-researched question. The purpose of this study is to examine the impact of rain in the propagation path on perception tasks, where other factors affecting performance are removed or controlled as far as possible. This study presents the results of controlled experimental testing designed to measure the impact of rain on automated vehicle perception performance. Object detection is performed on the captured data to determine the impact of rain on performance. Four object detection algorithms, a segmentation algorithm, and an optical character recognition algorithm are used as representative examples of typical algorithms used in ADAS. It is shown that the impact of rain varies between models, and at larger distances, rain has a greater impact. In the case of the OCR algorithm, rain is shown to have a larger impact at certain distances. The findings of this study are useful for ADAS design, as they provide more detailed insight into the impact of rain on ADAS and provide guidance on potential breaking points for algorithms typically used in this type of system.
... In recent years, advances in deep learning (DL) have become increasingly popular in big data analysis, leading to remarkable achievements in various computer vision tasks such as object detection and semantic segmentation [15], image classification [16], and natural language processing [17]. Deep learning techniques use Convolutional Neural Networks (CNNs), which have risen tremendously in HSI classification. ...
... After extracting the features, the attention mechanism and the 3D-2D network are combined and flattened for spatial and spectral information. The following expression derives the concatenated features of the mechanism: $\mathrm{Concatenated\ features} = \mathrm{Flatten}(\mathrm{SelfAtt}(\mathrm{2D\ features})) \oplus \mathrm{Flatten}(\mathrm{3D\ features})$ (15). The concatenation operation is denoted by ⊕. ...
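A small sketch of equation (15) under assumed feature shapes; the self-attention block is abstracted to an identity stand-in, since only the flatten-and-concatenate structure is specified here:

```python
import numpy as np

def self_att(x):
    """Stand-in for the paper's self-attention block; identity here, shapes only."""
    return x

# Assumed shapes: a 2D-CNN feature map (C2 x H x W) and a 3D-CNN feature
# volume (C3 x B x H x W), where B is the number of spectral bands kept.
feat_2d = np.random.rand(64, 9, 9)
feat_3d = np.random.rand(32, 7, 9, 9)

# Equation (15): concatenated = Flatten(SelfAtt(2D features)) (+) Flatten(3D features)
concat = np.concatenate([self_att(feat_2d).ravel(), feat_3d.ravel()])
print(concat.shape)  # (64*9*9 + 32*7*9*9,) = (23328,)
```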
Preprint
Full-text available
Hyperspectral image classification presents significant challenges due to the high dimensionality of the data and the intricate spatial-spectral relationships inherent within hyperspectral imagery. This proposed work builds on certain well-established techniques; its novelty lies in the integration and adaptation of these components into a unified framework designed to address specific challenges in hyperspectral image classification. Unlike traditional PCA applied globally, this research work performs PCA within graph-based segmented regions. This localized approach preserves the spatial coherence of hyperspectral data while reducing dimensionality efficiently. Following this step, a novel self-attention mechanism within a hybrid 3D-2D CNN architecture is introduced that allows the model to dynamically prioritize critical spectral bands while extracting comprehensive spatial-spectral features. The combination of graph-based segmentation, localized PCA, and attention-guided CNNs creates a robust and cohesive framework that enhances feature extraction and classification accuracy. The proposed framework is evaluated on four publicly available hyperspectral datasets (Indian Pines, Kennedy Space Center, Salinas, and Pavia University) and compared against seven state-of-the-art models, including SVM, 2D-CNN, 3D-CNN, AHAN, AF2GNN, DSN, and TDS-BiGRU. The experimental results demonstrate the superiority of the proposed approach, achieving an overall accuracy of 99.28% on Indian Pines, 99.99% on Salinas, 99.97% on Pavia University, and 99.34% on the KSC dataset, consistently outperforming the existing methods. This framework effectively leverages segmentation, localized dimensionality reduction, and attention mechanisms, offering a robust and efficient solution for hyperspectral image classification. These results confirm the model's capability to address the complexities of hyperspectral data, providing a significant advancement in the field.
... However, these traditional algorithms and other traditional object detection models, such as Viola-Jones [6,7], have significant flaws, including low processing speed and accuracy, poor generalization, and high operating cost, which is why deep convolutional neural networks (CNNs) have steadily replaced them. The success of CNNs following the advent of AlexNet [8] in 2012 and of R-CNN (Region-Based CNN) [9] in 2014, along with the rest of the R-CNN series, led to a paradigm shift toward deep learning-based methods for object detection and CV. The dominance of deep learning has spread across object detection, leading to tremendous growth in the methods developed for it, motivated by high processing speed and accuracy, generalization capability, large-scale datasets, and progress in other CV tasks. ...
... However, R-CNN lacks efficiency, which affects both the training and inference phases. Moreover, the approach of dividing the detection process into four stages negatively affected both the processing speed of model training and the processing resources. (Figure: overview of the R-CNN object detection model [9]; the model (1) accepts an input image, (2) ...) ...
Chapter
Full-text available
Object detection is a major branch and fundamental task in computer vision, aiming to localize, identify and classify even the smallest objects of interest in images. Features can be extracted efficiently by deep convolutional neural networks (CNNs) serving as the backbone, yielding real-time or near real-time object detection performance beyond that of traditional hand-crafted methods. In the past few years, the advent of transformer-based models with robust self-attention mechanisms has raised object detection performance to a higher level and enabled it to produce excellent results. Many object detection tasks in the real world require that 3D information about the object be obtained, thus strengthening active research in 3D object detection. However, 3D object detection algorithms are not easy to propagate in real-world applications due to many factors, making the reconstruction of 2D object detection algorithms into 3D ones a suitable alternative. Therefore, we review the evolution of 2D object detection algorithms for digital imaging applications, focusing on their developments, models, applications, datasets, evaluation metrics, strengths and weaknesses, for a better understanding of their landmarks and contributions to the advancement of the field.
... Visual object detection refers to the method of using image data obtained from a single camera or multiple cameras for object recognition and localization [4]. Currently, mainstream visual object detection can be divided into two categories: methods based on region, such as R-CNN [5], Fast R-CNN [6], Faster R-CNN [7], etc., which generate candidate regions for object detection, and methods based on single-stage detectors, such as the YOLO series [8], which directly detect the entire image. Compared to other methods, visual object detection has the advantages of low cost and rich color, texture, text, and shape information. ...
... $\mathrm{AP} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall})\,d\mathrm{Recall}$ (5), where Precision_ij represents the precision of class i in the j-th image, and Recall_ij denotes the recall of class i in the j-th image. TP (true positive) refers to correctly predicted targets, while FP (false positive) indicates incorrect detections. ...
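A minimal numerical sketch of equation (5): approximating the integral of precision over recall from a ranked list of detections. The detections and ground-truth count below are illustrative:

```python
import numpy as np

def average_precision(is_tp, n_gt):
    """Approximate AP = integral of Precision(Recall) dRecall from ranked detections.

    is_tp: booleans for each detection, sorted by descending confidence.
    n_gt:  number of ground-truth objects for the class.
    """
    is_tp = np.asarray(is_tp)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    precision = tp / (tp + fp)
    recall = tp / n_gt
    # Integrate precision over recall (rectangles between recall steps).
    return float(np.sum(np.diff(np.concatenate([[0.0], recall])) * precision))

# Five ranked detections, three ground-truth objects (illustrative).
print(average_precision([True, True, False, True, False], n_gt=3))  # ~0.9167
```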
Article
Full-text available
Multi-sensor fusion object detection is an advanced method that improves object recognition and tracking accuracy by integrating data from different types of sensors. As it can overcome the limitations of a single sensor in complex environments, the method has been widely applied in fields such as autonomous driving, intelligent monitoring, robot navigation, drone flight and so on. In the field of autonomous driving, multi-sensor fusion object detection has become a hot research topic. To further explore the future development trends of multi-sensor fusion object detection, we introduce the mainstream framework Transformer model of the multi-sensor fusion object detection algorithm, and we also provide a comprehensive summary of the feature fusion algorithms used in multi-sensor fusion object detection, specifically focusing on the fusion of camera and LiDAR data. This article provides an overview of feature fusion’s development into feature-level fusion and proposal-level fusion, and it specifically reviews multiple related algorithms. We discuss the application of current multi-sensor object detection algorithms. In the future, with the continuous advancement of sensor technology and the development of artificial intelligence algorithms, multi-sensor fusion object detection will show great potential in more fields.
... Deep neural networks have significantly advanced the field of computer vision, demonstrating exceptional capabilities across a range of tasks, including image classification [1][2][3], object detection [4][5][6], semantic segmentation [7][8][9], and image captioning [10], among others. This success, primarily driven by supervised learning techniques, relies heavily on the availability of extensive labeled datasets. ...
Article
Full-text available
Self-supervised learning has emerged as a powerful paradigm for leveraging unlabeled data to learn rich feature representations. However, the efficacy of self-supervised models is often limited by the degree and complexity of the augmentations used during training. In this work, we propose a novel framework that enhances self-supervised learning by incorporating a generative network designed to produce adversarial examples that challenge the learning process. By integrating adversarially generated data, our method extends three well-known self-supervised architectures (SimCLR, BYOL, and SimSiam) and improves their generalization and robustness. We evaluate our approach on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, demonstrating consistent improvements in classification accuracy over baseline models. Notably, our proposed method outperforms standard self-supervised learning techniques, achieving significant gains in top-1 accuracy across all datasets and training epochs. This substantiates our hypothesis that adversarial examples can significantly contribute to the feature learning capabilities of self-supervised models. Furthermore, our findings suggest that the integration of generative networks can serve as a catalyst for the development of more advanced self-supervised learning algorithms. This study lays the groundwork for future research exploring the potential of adversarial training in self-supervised learning and its applications across diverse domains.
... Finally, the feature maps are sent to support vector machine (SVM) for classification (SVM classifiers are trained according to the target categories on the input image, and each class corresponds to an SVM classifier). At the same time, the loss value of the bounding box is used to train a regressor with a correction factor to adjust the position of the bounding box [24]. In response to the problems of repeated calculation of candidate boxes and fixed-scale input of R-CNN, He et al. proposed SPPNet, which designed a spatial pyramid structure. ...
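The regressor described here predicts correction factors that adjust a proposal's position and size. A minimal sketch of the standard R-CNN-style box parameterization (center offsets scaled by box size, log-space scale factors); the delta values are illustrative:

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Adjust a proposal (x1, y1, x2, y2) with learned corrections (dx, dy, dw, dh).

    dx, dy shift the box center (scaled by width/height); dw, dh rescale
    the box in log-space, as in the R-CNN bounding-box regressor.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])

# A proposal nudged right and slightly widened (illustrative deltas).
print(apply_box_deltas((10, 10, 50, 50), (0.1, 0.0, 0.2, 0.0)))
```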
Article
Full-text available
Remote sensing images contain complex scene information and have multi-scale characteristics after being imaged by remote sensing equipment because the target instances in natural scenes are different sizes. In addition, small target instances are more difficult to identify in complex backgrounds, resulting in serious scale imbalance problems, which pose a serious challenge to identifying and positioning target objects. To address this problem, this article proposes a detection method for multi-scale remote sensing image targets by combining the frequency attention mechanism. The model first introduces an improved frequency channel attention mechanism to design a feature extraction module to improve the extraction of key features by the neural network; second, considering that the complete intersection over union method does not comprehensively consider the aspect ratio of the bounding box, which will cause the loss of small-scale target feature information, the efficient intersection over union method is used to improve it; then, because of the high missed detection rate of the non-maximum suppression (NMS) method, Soft-EIoU-NMS is used to replace the original NMS method. The experiment first conducted ablation experiments on the LEVIR dataset, where the target scale changes little and the number of ground object categories is small, and the DIOR dataset, where the target scale changes greatly and the number of ground object categories is large. The mAP@0.5 on the LEVIR dataset reached 0.935; the mAP@0.5 on the DIOR dataset reached 0.882. Then, the model proposed in this article was compared with the mainstream target detection methods. The experimental results verified the effectiveness of the model in this article. Finally, the model was applied to the disaster remote sensing image scene for detection, again demonstrating the model’s good detection performance. Therefore, the experimental results show that the method proposed in this article can effectively alleviate the scale imbalance problem and achieve a good target detection effect.
... The summary of research on leukocyte detection and classification [25]-[33] is shown in Table 1. R-CNN [34], [35] was one of the first successful object detectors. It uses a selective search to generate probable bounding boxes in an image, which are then processed by a CNN to extract features and classified with a support vector machine (SVM). ...
Article
Full-text available
Accurate and timely white blood cell (WBC) analysis is crucial for diagnosing hematological disorders, often requiring microscopic examination of peripheral blood smears (PBS). While manual counting by trained specialists is considered the gold standard, it is time-consuming and impractical in resource-limited settings. Automated cell counters can misclassify similar-looking or immature cells, hindering accurate diagnosis. To address these limitations, we propose an AI-powered web application that utilizes YOLOv11 with enhanced small object detection capabilities, enabled by integrating our C3k2-Conv blocks, an architecture inspired by C3k2. Our model, trained on eleven WBC classes and nucleated red blood cells (NRBCs), achieves an impressive mean average precision (mAP@0.5) of approximately 0.9000 on validation and unseen test sets, demonstrating a performance comparable to human specialists in identifying and quantifying WBCs. Furthermore, our research demonstrates that providing general practitioners and medical students with PBS images annotated by our AI model significantly improves their counting accuracy and reduces the time spent on manual counting. Our web application, Myelosoft, allows clinicians to upload smartphone-captured PBS images for rapid and automated analysis. The system provides comprehensive differential counts for 11 WBC classes, including atypical lymphocytes, band neutrophils, basophils, blasts, eosinophils, lymphocytes, metamyelocytes, monocytes, myelocytes, promyelocytes, and segmented neutrophils, as well as NRBCs. This real-time analysis facilitates timely diagnosis and treatment, potentially reducing risks associated with delayed interventions. Our approach offers a robust and accessible solution for improving hematologic treatment, especially in resource-constrained environments.
... With the advancement of CNNs in object detection and their inherent advantages in image processing, researchers have proposed numerous object detection algorithms. CNN-based detection methods are broadly categorized into two-stage detectors and single-stage detectors. Girshick et al. [19] pioneered the first end-to-end two-stage framework, Faster R-CNN, by introducing a Region Proposal Network (RPN) to replace traditional selective search. The RPN processes images of varying sizes to generate region proposals, significantly improving both detection accuracy and speed. ...
Article
Full-text available
To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the high-frequency/low-frequency information allocation ratio in the DMIR-Fusion integrator across different feature extraction stages through four-channel scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, synergistically optimizing spatial-semantic alignment and multi-scale feature representation, thereby significantly improving feature extraction performance. Extensive experiments on three benchmark RSIs datasets (RSOD, NWPU VHR-10 and DIOR) validate the superiority of LQ-MixerNeT. Results demonstrate that our method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies reveal that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement, while the enhanced AFPN boosts performance by 0.54%. These findings highlight the model's robustness in handling complex RSIs scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection.
... Deep learning-based surface defect detection methods are classified into two-stage and single-stage approaches. Two-stage detection methods, including region-based convolutional neural network (R-CNN), 3 Fast-RCNN, 4 and Faster-RCNN, 5 first generate region proposals before classifying them to obtain predictions. While these methods achieve high recognition accuracy, they suffer from high resource consumption, slow detection speeds, and large model sizes. ...
Article
Full-text available
In view of the problems in industrial steel plate surface defect detection, such as high model complexity, insufficient recognition of small targets, and inefficient hardware deployment, this study proposes the StarNet‐GSConv‐RetC3 detection transformer (SSR‐DETR) lightweight framework. The framework comprises a StarNet backbone network and an innovative star operation optimization structure to reduce computational complexity while enhancing feature representation capabilities. In the feature fusion stage, the RetBlock CSP bottleneck with 3 convolutions (RetC3) module with an explicit attenuation mechanism is designed to enhance the extraction of geometric features of microscopic defects by combining two‐dimensional spatial priors, and grouped spatial convolution (GSConv) is used to optimize the aggregation of multiscale features. Experiments show that the algorithm achieves a mean average precision (mAP) of 88.2% and a classification accuracy of 92.0% on the Northeastern University steel surface defect (NEU‐DET) dataset, which is 4.8% and 3.7% higher than the baseline model, respectively. Meanwhile, the model's computational load and size are reduced by 59.5% and 47.8%, respectively. Actual deployment tests show that this algorithm operates at 98.1 frames per second (FPS) on personal computer platforms and at 40.3 FPS, with a single‐frame processing time of 24.8 ms, on the RK3568 embedded system, fully meeting the comprehensive requirements of industrial scenarios.
... In practical deep learning applications, it is common to leverage pre-trained models and fine-tune them on the target task's dataset [6]. This paradigm has also been gaining popularity in CL approaches. ...
Preprint
Continual learning (CL) aims to train models that can learn a sequence of tasks without forgetting previously acquired knowledge. A core challenge in CL is balancing stability (preserving performance on old tasks) and plasticity (adapting to new ones). Recently, large pre-trained models have been widely adopted in CL for their ability to support both, offering strong generalization for new tasks and resilience against forgetting. However, their high computational cost at inference time limits their practicality in real-world applications, especially those requiring low latency or energy efficiency. To address this issue, we explore model compression techniques, including pruning and knowledge distillation (KD), and propose two efficient frameworks tailored for class-incremental learning (CIL), a challenging CL setting where task identities are unavailable during inference. The pruning-based framework includes pre- and post-pruning strategies that apply compression at different training stages. The KD-based framework adopts a teacher-student architecture, where a large pre-trained teacher transfers downstream-relevant knowledge to a compact student. Extensive experiments on multiple CIL benchmarks demonstrate that the proposed frameworks achieve a better trade-off between accuracy and inference complexity, consistently outperforming strong baselines. We further analyze the trade-offs between the two frameworks in terms of accuracy and efficiency, offering insights into their use across different scenarios.
... These algorithms are typically classified into two categories: two-stage and single-stage target detection algorithms. Two-stage algorithms (e.g., R-CNN (Region-based Convolutional Neural Networks) [7], Fast R-CNN [8], Faster R-CNN [9]) excel in detection accuracy but suffer from slower inference times due to the need for candidate region generation followed by classification. In contrast, single-stage algorithms (e.g., YOLO [10], SSD (Single Shot MultiBox Detector) [11]) streamline the detection process, improving speed while still meeting real-time detection requirements, albeit with slightly reduced accuracy in more complex scenarios. ...
Article
Full-text available
As the global population of visually impaired individuals continues to grow, traditional assistive tools struggle to meet the demands of safe navigation. This paper presents the design of an intelligent guide glasses system based on the deep learning model YOLOv5, integrating technologies such as ultrasonic ranging, computer vision, and global positioning to enable obstacle detection, environmental sensing, and navigation. By constructing a specialized dataset, Blind Vision-YOLO, and training the model, the system demonstrates impressive real-time performance and high precision in target detection. Experimental results reveal that the system achieves real-time target detection at 28.01 FPS with a mean average precision (mAP) of 74.1%, accurately identifying potential obstacles for visually impaired individuals during daily travel and providing timely voice feedback. The smart guide glasses designed in this study offer excellent portability and practicality, providing a safer and more convenient travel experience for the visually impaired.
... The main difference between them is that the twostage model first generates candidate bounding boxes that may contain defects and then classifies the cropped subregions for defect identification, whereas the one-stage model directly predicts the location, size, class, and confidence score of defects based on features extracted by the CNN. The representative two-stage detection algorithms include R-CNN [16], SPP-net [17], and Fast R-CNN [18]. He et al. [19] subsequently proposed Mask R-CNN in 2017, which employs a Residual Network (ResNet) as the backbone and utilizes a Feature Pyramid Network (FPN) to extract multiscale feature information, integrating a fully connected segmentation subnet. ...
Article
Full-text available
To address the issues of omission and low algorithmic accuracy in detecting surface defects of hot-rolled strip steel in complex backgrounds, this paper proposes the PMSE-YOLO surface defect detection algorithm. First, to enhance the model's ability to extract multi-scale features, the CSP_PMS structure is designed to optimize the C2f structure in the Backbone, improving feature extraction for multi-scale targets. Second, to preserve the semantic information of small targets in the Neck, the EfficientRepBiPAN structure is adopted, utilizing cross-layer connections to enhance multi-scale feature representation and achieve small target feature fusion. Finally, the Wise-ShapeIoU loss function is incorporated to enhance the model's detection performance. Experimental validation on the NEU-DET dataset demonstrates that PMSE-YOLO reduces parameters by 17% and computational cost by 18.5%, while improving mAP@0.5 by 3.1 percentage points to 82.2% compared to the baseline network. PMSE-YOLO balances lightweight design and real-time performance while enhancing surface defect detection accuracy for hot-rolled strips, facilitating deployment on edge devices. Furthermore, experimental results on the GC10-DET dataset confirm the superior generalization capability of the proposed model.
... One-stage detectors regard object detection as a regression or classification problem and use a unified framework to obtain the final categories and locations directly 23 , such as RetinaNet 24 , Single Shot Detector (SSD) 25 , AttentionNet 26 or You Only Look Once (YOLO). On the contrary, two-stage detectors generate regions and classify each area to get different object categories, such as Regions with CNN features (R-CNN) 27 , Faster Region-based Convolutional Neural Network (Faster R-CNN) 28 or Region-based Fully Convolutional Network (R-FCN) 29 . One-stage detectors are typically faster and are commonly used for real-time applications. ...
Article
Full-text available
Wildlife biologists increasingly use camera traps for monitoring animal populations. However, manually sifting through the collected images is expensive and time-consuming. Current deep learning studies for camera trap images do not adequately tackle real-world challenges such as imbalances between animal and empty images, distinguishing similar species, and the impact of backgrounds on species identification, limiting the models’ applicability in new locations. Here, we present a novel two-stage deep learning framework. First, we train a global deep-learning model using all animal species in the dataset. Then, an agglomerative clustering algorithm groups animals based on their appearance. Subsequently, we train a specialized deep-learning expert model for each animal group to detect similar features. This approach leverages Transfer Learning from the MegaDetectorV5 (YOLOv5 version) model, already pre-trained on various animal species and ecosystems. Our two-stage deep learning pipeline uses the global model to redirect images to the appropriate expert models for final classification. We validated this strategy using 1.3 million images from 91 camera traps encompassing 24 mammal species and used 120,000 images for testing, achieving an F1-Score of 96.2% using expert models for final classification. This method surpasses existing deep learning models, demonstrating improved precision and effectiveness in automated wildlife detection.
... In the field of deep learning for target detection, detection networks can be categorized into two types based on the steps they take to generate detection targets, i.e., two-stage methods and one-stage methods. Two-stage methods, such as the R-CNN [10] series, first extract feature layers from the backbone network and generate candidate region proposals and then remove and aggregate overlapping region proposals through Non-Maximum Suppression (NMS) to generate regions of interest (RoIs). Finally, they output target bounding boxes, target types, and confidence scores through certain postprocessing methods. ...
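For reference, a compact sketch of the NMS step described here: greedily keep the highest-scoring box and suppress overlapping proposals above an IoU threshold (the 0.5 threshold is an illustrative choice):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop
    overlapping boxes with IoU above the threshold, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]; the second box overlaps the first
```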
Article
Full-text available
The accurate detection of small ships based on images or vision is critical for many scenarios, like maritime surveillance, port security, and navigation safety. However, accurate detection of small ships is a challenge for cost-efficient models, while the models that could meet this requirement have unacceptable computation costs for real-time surveillance. We propose YOLO-LPSS, a novel model designed to significantly improve small ship detection accuracy at low computation cost. The characteristics of YOLO-LPSS are as follows: (1) Strengthening the backbone's ability to extract and emphasize features relevant to small ship objects, particularly in semantic-rich layers. (2) A sophisticated, learnable method for up-sampling processes is employed, taking into account both deep image information and semantic information. (3) Introducing a post-processing mechanism in the final output of the resampling process to restore the missing local region features in the high-resolution feature map and capture the global-dependence features. The experimental results show that YOLO-LPSS outperforms the known YOLOv8 nano baseline and other works, and the number of parameters increases by only 0.33 M compared to the original YOLOv8n while achieving 0.796 and 0.831 AP50:95 in classes consisting mainly of small ship targets (the bounding box of the target area is less than 5% of the image resolution), which is 3-5% higher than the vanilla model and recent SOTA models.
... Object detection algorithms based on deep learning are usually divided into two-stage and single-stage detection algorithms. Two-stage detection algorithms, such as R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7], are renowned for their high accuracy, but they are computationally expensive and slow, which limits their applicability to real-time applications. Therefore, single-stage detection algorithms such as the YOLO series [8][9][10], SSD [11], and RetinaNet [12] have received significant attention due to their efficiency and speed. ...
Preprint
Full-text available
To address the issues of high environmental complexity, insufficient model performance, and low computational efficiency in underwater image analysis, this paper proposes a lightweight underwater object detection algorithm named SGL-YOLO, which is based on an enhanced YOLOv8 framework. In the backbone network, we introduce a newly designed SCA-C2f module that replaces the original C2f module, thereby improving feature discrimination and representation capabilities. To solve the problem of insufficient feature fusion in the neck network, a global-local cross-layer aligned bidirectional feature pyramid network (GLCA-BiFPN) is constructed, which improves the detection performance of small objects while reducing the model parameters and computational cost. A lightweight shared convolutional detection head (LSCD) is introduced in the head to optimize the computational efficiency. Finally, the WIoU is adopted to replace the CIoU loss function, thereby improving the accuracy of bounding box regression. SGL-YOLO achieves outstanding results on the DUO dataset. Compared with YOLOv8, the model parameters are reduced by 45.2%, the model computational complexity is decreased by 27.2%, and the average detection accuracy is increased by 1.1%, achieving a good balance between detection accuracy and detection speed.
... With the continuous development and maturation of deep learning technologies, deep learning-based object detection has gradually replaced traditional methods. Current mainstream deep learning object-detection algorithms are mainly divided into three categories: two-stage algorithms (such as RCNN [8], Faster-RCNN [9], Mask-RCNN [10]), single-stage algorithms (such as the SSD series [11] and YOLO series [12]), and end-to-end detection approaches [13]. While two-stage algorithms generally achieve higher detection accuracy, they are hampered by lower prediction efficiency and slower processing speeds. ...
Article
Full-text available
In response to the decreased accuracy in person detection caused by densely populated areas and mutual occlusions in public spaces, a human head-detection approach is employed to assist in detecting individuals. To address key issues in dense scenes—such as poor feature extraction, rough label assignment, and inefficient pooling—we improved the YOLOv7 network in three aspects: adding attention mechanisms, enhancing the receptive field, and applying multi-scale feature fusion. First, a large amount of surveillance video data from crowded public spaces was collected to compile a head-detection dataset. Then, based on YOLOv7, the network was optimized as follows: (1) a CBAM attention module was added to the neck section; (2) a Gaussian receptive field-based label-assignment strategy was implemented at the junction between the original feature-fusion module and the detection head; (3) the SPPFCSPC module was used to replace the multi-space pyramid pooling. By seamlessly uniting CBAM, RFLAGauss, and SPPFCSPC, we establish a novel collaborative optimization framework. Finally, experimental comparisons revealed that the improved model’s accuracy increased from 92.4% to 94.4%; recall improved from 90.5% to 93.9%; and inference speed increased from 87.2 frames per second to 94.2 frames per second. Compared with single-stage object-detection models such as YOLOv7 and YOLOv8, the model demonstrated superior accuracy and inference speed. Its inference speed also significantly outperforms that of Faster R-CNN, Mask R-CNN, DINOv2, and RT-DETRv2, markedly enhancing both small-object (head) detection performance and efficiency.
... The necessity for models to generalize and transfer effectively to new clinical environments underscored the importance of addressing these challenges [28,29]. To adapt models to new, unseen medical environments, fine-tuning can be employed, which involves extending the training process with a relatively small number of new samples [30]. To date, there have been very few studies on ML-based phase detection using multi-centric surgical video datasets [3,31]. ...
... At present, object detection algorithms can be divided into two categories: two-stage detection algorithms and one-stage detection algorithms. Two-stage detection algorithms, including convolutional neural networks [2,3], CNN [4], and R-CNN [5], generate candidate boxes containing potential targets and then use region classifiers to predict them. Single-stage detection algorithms, such as the SSD [6] and YOLO [7][8][9][10][11][12][13] series, directly classify and predict targets at each position on the feature map, thereby improving detection speed and practicality. ...
Article
Full-text available
At present, UAV aerial photography has good application prospects in agricultural production and disaster response. The application of drones can greatly improve work efficiency and decision-making accuracy. However, drone aerial images have inherent characteristics such as high image density, small target sizes, and complex backgrounds. To solve these problems, this paper proposes a small target detection algorithm for UAV aerial photography based on an improved YOLOv11n. Firstly, the FADC module was introduced into the backbone network to optimize the feature extraction process. Then, a small target detection layer was introduced into the algorithm to improve the detection performance of small targets in aerial images. Secondly, the scale sequence feature fusion network ASF-YOLO was used to replace the PANet network to improve the speed and accuracy of target detection. Then, Wise IoU is used to replace CIoU to speed up network convergence and improve regression accuracy. The algorithm was evaluated on the VisDrone-2019 dataset. Compared with YOLOv11n, the algorithm improves mAP@50 and mAP@0.5:0.95 by 5.7% and 4.3%, respectively. Experiments show that, compared with YOLOv11n, the algorithm's performance on small targets is greatly improved.
... • Backbone: A convolutional neural network (CNN) that extracts features from the input image. YOLOv11 employs an optimized version of the CSPDarknet backbone, incorporating cross-stage partial connections to enhance feature extraction while reducing computational overhead. This backbone uses advanced techniques like spatial attention modules to focus on relevant regions. ...
Preprint
Full-text available
The pharmaceutical industry is tasked with ensuring the production and distribution of medications that meet stringent quality and safety standards, yet it grapples with significant challenges in automating critical processes such as quality control, pill sorting, and inventory management. These challenges arise from the inherent complexity of identifying medical pills, which vary widely in shape, size, color, and imprints, often requiring meticulous human intervention that is both time-consuming and error-prone. This study introduces a proof-of-concept (POC) dataset and a cutting-edge YOLOv11-based computer vision model tailored for medical-pills detection, with the overarching goal of advancing automation within pharmaceutical workflows. The dataset comprises 115 meticulously labeled images, split into 92 training and 23 validation samples, and serves as the foundation for training our YOLOv11 model, which achieves an exceptional mean Average Precision (mAP@0.5) of 0.995. We rigorously evaluate the model’s performance using a suite of analytical tools, including precision-recall curves, F1-confidence curves, confusion matrices, and bounding box visualizations, providing a comprehensive assessment of its capabilities. The results underscore the transformative potential of AI-driven solutions in pharmaceutical applications, such as automated sorting, defect detection, counterfeit identification, and real-time inventory tracking. However, we also acknowledge limitations, such as the dataset’s modest size and the controlled conditions of our experiments, which temper the generalizability of our findings. This work establishes a foundational resource for researchers and industry practitioners, offering both a dataset and a high-performing model to catalyze the development of scalable, efficient systems for healthcare automation, while transparently outlining areas for future improvement.
... Compared to one-stage detectors, two-stage detectors, while slower, achieve higher detection accuracy. For example, the R-CNN series [32], [33], [34] generates a set of candidate regions from the image using a Region Proposal Network (RPN), then classifies and regresses these regions to determine the object's category and precise location. Building upon these methods, researchers have proposed various improved algorithms. ...
Article
Full-text available
Accurate detection of road markings is critically important in fields such as autonomous driving technology, high-precision mapping, and intelligent transportation systems. Unlike traditional object detection tasks, road marking detection faces many challenges, including significant direction changes, complex backgrounds, and diverse types of road markings. To address these issues, this paper proposes a deep learning approach based on mobile laser scanning (MLS) point cloud intensity images to accurately detect road markings with arbitrary orientations. Firstly, we employ an oriented bounding box (OBB) to precisely represent the direction and location of road markings, thereby improving detection accuracy. Additionally, the channel semantic enhanced module (CSEM) is designed to enhance the feature representation capacity of the backbone network, effectively reducing the impact of complex backgrounds on foreground targets. Finally, the spatial semantic enhanced module (SSEM) is introduced to enhance the feature pyramid network (FPN)'s ability to capture contextual information. The module consists of the large selective kernel network (LSKNet) and the semantic enhanced module (SEM). LSKNet extracts contextual information around the target by adaptively selecting receptive fields of various scales, while SEM, a lightweight semantic segmentation branch, further strengthens the feature representation ability of this information by incorporating a semantic supervision mechanism. The proposed method achieves an mAP of 84.7% on Test Set I, representing a 3.2% improvement over the baseline model, while maintaining high efficiency, offering a robust and reliable solution for real-time road marking detection tasks.
... The paradigm of transfer learning, particularly leveraging models pre-trained on large-scale datasets like ImageNet [1] or COCO [2], has become foundational to modern object detection [3]. Pre-training captures robust hierarchical features that often generalize well, significantly reducing the need for vast amounts of labeled data and computational resources for downstream tasks [4,5]. Common adaptation strategies involve either using the pre-trained model as a fixed feature extractor and training only a new task-specific head, or fine-tuning some or all layers of the pre-trained network on the target dataset [6]. ...
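A minimal PyTorch-style sketch of the two adaptation strategies just described; the backbone, head, and layer split are illustrative placeholders, not any particular detector's architecture:

```python
import torch.nn as nn

# Illustrative detector: a pre-trained "backbone" plus a new task-specific head.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
head = nn.Conv2d(32, 5, 1)  # new head for 5 target classes (hypothetical)

# Strategy 1: fixed feature extractor -- freeze every backbone parameter
# and train only the new head.
for p in backbone.parameters():
    p.requires_grad = False

# Strategy 2: partial fine-tuning -- unfreeze only the later backbone layers,
# keeping the earlier, more generic features frozen.
for layer in list(backbone.children())[2:]:
    for p in layer.parameters():
        p.requires_grad = True

trainable = [p for p in list(backbone.parameters()) + list(head.parameters())
             if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```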
Preprint
Full-text available
The success of large pre-trained object detectors hinges on their adaptability to diverse downstream tasks. While fine-tuning is the standard adaptation method, specializing these models for challenging fine-grained domains necessitates careful consideration of feature granularity. The critical question remains: how deeply should the pre-trained backbone be fine-tuned to optimize for the specialized task without incurring catastrophic forgetting of the original general capabilities? Addressing this, we present a systematic empirical study evaluating the impact of fine-tuning depth. We adapt a standard YOLOv8n model to a custom, fine-grained fruit detection dataset by progressively unfreezing backbone layers (freeze points at layers 22, 15, and 10) and training. Performance was rigorously evaluated on both the target fruit dataset and, using a dual-head evaluation architecture, on the original COCO validation set. Our results demonstrate unequivocally that deeper fine-tuning (unfreezing down to layer 10) yields substantial performance gains (e.g., +10% absolute mAP50) on the fine-grained fruit task compared to only training the head. Strikingly, this significant adaptation and specialization resulted in negligible performance degradation (<0.1% absolute mAP difference) on the COCO benchmark across all tested freeze levels. We conclude that adapting mid-to-late backbone features is highly effective for fine-grained specialization. Critically, our results demonstrate this adaptation can be achieved without the commonly expected penalty of catastrophic forgetting, presenting a compelling case for exploring deeper fine-tuning strategies, particularly when targeting complex domains or when maximizing specialized performance is paramount.
... Object detection frameworks have evolved significantly in agricultural applications. Girshick et al. [27] initially demonstrated promising results with R-CNN variants, but found them to be computationally intensive. Redmon et al. [28] revolutionized real-time object detection with the one-stage YOLO architecture, finding it to be highly suitable for agricultural applications. ...
Article
Full-text available
Accurate tomato maturity detection represents a critical challenge in precision agriculture. A YOLOv11-based algorithm named YOLO-PGC is proposed in this study for tomato maturity detection. Its three innovative components are denoted by “PGC”, respectively representing the Polarization State Space Strategy with Dynamic Weight Allocation, the Global Horizontal–Vertical Context Module, and the Convolutional–Inductive Feature Fusion Module. The Polarization Strategy enhances robustness against occlusion through adaptive feature importance modulation, the Global Context Module integrates cross-dimensional attention mechanisms with hierarchical feature extraction, and the Convolutional–Inductive Feature Fusion Module employs multimodal integration for improved object discrimination in complex scenes. Experimental results demonstrate that YOLO-PGC achieves superior precision and mean average precision compared to state-of-the-art methods. Validation on the COCO benchmark confirms the framework’s generalization capabilities, maintaining computational efficiency for real-time deployment. YOLO-PGC establishes new performance standards for agricultural object detection with potential applications in similar computer vision challenges. Overall, these components and strategies are integrated into YOLO-PGC to achieve robust object detection in complex scenarios.
... Target detection algorithms based on deep learning are divided into one-stage and two-stage algorithms. Typical two-stage algorithms include R-CNN (Region-based Convolutional Neural Network) [3] and Fast R-CNN (Fast Region-based Convolutional Neural Network) [4]. These algorithms generate candidate boxes that are subsequently fed into the network for classification and regression. ...
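For readers who want to see this two-stage flow end to end, the snippet below runs torchvision's off-the-shelf Faster R-CNN, in which an internal region proposal network produces candidate boxes that a second head then classifies and regresses. This is a generic illustration of the two-stage idea, not the cited papers' original code, and the input image is a random placeholder.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)

# Stage 1 (region proposals) and stage 2 (classification + box regression)
# both happen inside this single forward pass.
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
img = torch.rand(3, 600, 800)              # placeholder RGB tensor in [0, 1]
with torch.no_grad():
    pred = model([img])[0]                 # dict with 'boxes', 'labels', 'scores'
print(pred["boxes"].shape, pred["scores"][:5])
```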
Article
Full-text available
In UAV aerial image target detection, the presence of small-scale objects, complex backgrounds, and weak illumination leads to difficulties in feature extraction and low detection accuracy. To address these issues, this paper proposes an aerial image target detection algorithm named Dual-YOLO. First, a parallel dual-path backbone network is designed to achieve complementary feature extraction, thereby enhancing the feature extraction capability for targets. Second, a bidirectional feature pyramid network (BiFPN) structure is implemented in the neck to optimize multi-scale feature fusion, which enhances feature representation capabilities through its bidirectional cross-scale connectivity. Finally, a dynamic head framework is employed to unify the object detection head and the attention mechanism, thereby enhancing overall detection performance. Experimental results show that the Dual-YOLO algorithm achieves mean Average Precision at 50% IoU (mAP50) scores of 43.1% and 76.3% on the VisDrone2019 and HazyDet datasets, respectively, outperforming the baseline model by 9.3% and 6.4% and significantly enhancing detection accuracy for aerial targets.
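The bidirectional cross-scale connectivity mentioned above rests on BiFPN's fast normalized fusion (introduced in EfficientDet), sketched below: each input feature map gets a learnable non-negative weight and the fused map is their weighted average. Dual-YOLO's full bidirectional topology and its resizing/convolution steps are omitted, so treat this as a minimal illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """BiFPN-style fusion of same-shaped feature maps with learnable weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):               # feats: list of same-shaped tensors
        w = F.relu(self.w)                  # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)        # normalize so weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, feats))

fuse = FastNormalizedFusion(2)
p4 = fuse([torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)])
print(p4.shape)                             # torch.Size([1, 64, 40, 40])
```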
... The object detection model [14] is used to predict the tracking status. After this stage, we apply the centroid tracking algorithm to track each car. ...
Article
Full-text available
With the volume of cars in traffic increasing exponentially worldwide, traffic management has become a critical challenge in most developed countries. To address this issue, the intelligent traffic control system will use automatic vehicle counting as one of its core tasks to facilitate access, particularly in parking lots. The primary benefit of automatic vehicle counting is that it allows for managing and evaluating traffic conditions in the urban transportation system. The new era of technologies such as the Internet of Things and computer vision has transformed traditional systems into new smart city networks. Because of the proliferation of computer vision, traffic counting from low-cost control cameras may emerge as an appealing candidate for traffic flow control automation. This paper proposes a low-cost embedded car-counting system on a Jetson Nano board, implemented with computer vision and IoT technologies. In the proposed system, we apply a combination of background subtraction and counters, trackable objects, centroid tracking, and direction counting. Moreover, we implement the MoG foreground-background subtractor method. The proposed system is connected to the Internet using the Telegram API to send hourly notifications to a smartphone for analyzing traffic congestion. In addition, we compared the performance of the Jetson Nano with the Raspberry Pi 4 platform.
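A condensed sketch of the described pipeline follows, combining OpenCV's MoG2 background subtractor with nearest-centroid association and a virtual counting line. The video path "traffic.mp4", the blob-area threshold, and the line position are placeholder assumptions, and the Jetson Nano/Telegram integration is omitted.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("traffic.mp4")          # hypothetical input video
mog = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
prev_centroids, count = [], 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = mog.apply(frame)                      # MoG foreground mask
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        if cv2.contourArea(c) < 500:           # ignore small blobs (assumed threshold)
            continue
        x, y, w, h = cv2.boundingRect(c)
        centroids.append((x + w // 2, y + h // 2))
    line_y = frame.shape[0] // 2               # virtual counting line
    for cx, cy in centroids:
        if not prev_centroids:
            continue
        # nearest-centroid association: the core of centroid tracking
        px, py = min(prev_centroids, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        if py < line_y <= cy:                  # crossed the line moving down
            count += 1
    prev_centroids = centroids

cap.release()
print("vehicles counted:", count)
```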
... Nowadays, sophisticated deep-network-based object detection approaches can be divided into proposal-based [8][9][10][11][12] and proposal-free [13,14,2,15] frameworks. Proposal-based methods are composed of two stages: proposal generation and classification. ...
Preprint
Full-text available
Detecting small objects remains a significant challenge in single-shot object detectors due to the inherent trade-off between spatial resolution and semantic richness in convolutional feature maps. To address this issue, we propose a novel framework that enables small object representations to "borrow" discriminative features from larger, semantically richer instances within the same class. Our architecture introduces three key components: the Feature Matching Block (FMB) to identify semantically similar descriptors across layers, the Feature Representing Block (FRB) to generate enhanced shallow features through weighted aggregation, and the Feature Fusion Block (FFB) to refine feature maps by integrating original, borrowed, and context information. Built upon the SSD framework, our method improves the descriptive capacity of shallow layers while maintaining real-time detection performance. Experimental results demonstrate that our approach significantly boosts small object detection accuracy over baseline methods, offering a promising direction for robust object detection in complex visual environments.
... However, current semantic segmentation-based detection methods are unable to accurately segment and detect small targets with clear contours. The second type is detection-based methods, such as R-CNN [44], Fast R-CNN [45], RetinaNet [31], AMFLW-YOLO [46], and DET-YOLO [47], which enhance the probability of detecting small targets by precisely locating objects within detection boxes. To ensure a fair comparison, each model was retrained using the three public datasets for 300 epochs, with all other parameters maintained at their default values. ...
Preprint
Infrared small target detection (IRSTD) is widely recognized as a challenging task due to the inherent limitations of infrared imaging, including low signal-to-noise ratios, lack of texture details, and complex background interference. Most existing methods model IRSTD as a semantic segmentation task, but they suffer from two critical drawbacks: (1) blurred target boundaries caused by long-distance imaging dispersion, and (2) excessive computational overhead due to indiscriminate feature stacking. To address these issues, we propose Lightweight Efficiency Infrared Small Target Detection (LE-IRSTD), a lightweight and efficient framework based on YOLOv8n, with the following key innovations. First, we identify that the multiple bottleneck structures within the C2f component of the YOLOv8n backbone contribute to an increased computational burden. Therefore, we implement the Mobile Inverted Bottleneck Convolution block (MBConvblock) and Bottleneck Structure block (BSblock) in the backbone, effectively balancing the trade-off between computational efficiency and the extraction of deep semantic information. Second, we introduce the Attention-based Variable Convolution Stem (AVCStem) structure, substituting the final convolution with Variable Kernel Convolution (VKConv), which allows adaptive convolutional kernels that can take various shapes, adapting the receptive field for target extraction. Finally, we employ Global Shuffle Convolution (GSConv) to shuffle the channel-dimension features obtained from different convolutional approaches, thereby enhancing the robustness and generalization capabilities of our method. Experimental results demonstrate that our LE-IRSTD method achieves compelling results in both accuracy and lightweight performance, outperforming several state-of-the-art deep learning methods.
... Techniques like Deformable Parts Models (DPM) used a sliding window approach to run classifiers at evenly spaced locations over the entire image [24]. More recent approaches, such as R-CNN (Regions with CNN features), first generate potential bounding boxes in an image and then classify these proposed regions [10]. This method laid the groundwork for many modern object detectors but was computationally expensive and slow. ...
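To make the proposal stage mentioned above concrete, here is a minimal proposal-generation sketch using the selective-search implementation shipped in opencv-contrib (`cv2.ximgproc`); R-CNN would then warp and classify each box with a CNN, which is not shown, and "scene.jpg" is a placeholder path.

```python
import cv2

# Bottom-up region proposals in the spirit of R-CNN's first stage.
# Requires the opencv-contrib-python package for cv2.ximgproc.
img = cv2.imread("scene.jpg")                  # hypothetical input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()               # trade some recall for speed
proposals = ss.process()                       # Nx4 array of (x, y, w, h) boxes
print(len(proposals), "candidate boxes")
```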
Preprint
Full-text available
Traditional 3D modeling requires technical expertise, specialized software, and time-intensive processes, making it inaccessible for many users. Our research aims to lower these barriers by combining generative AI and augmented reality (AR) into a cohesive system that allows users to easily generate, manipulate, and interact with 3D models in real time, directly within AR environments. Utilizing cutting-edge AI models like Shap-E, we address the complex challenges of transforming 2D images into 3D representations in AR environments. Key challenges such as object isolation, handling intricate backgrounds, and achieving seamless user interaction are tackled through advanced object detection methods, such as Mask R-CNN. Evaluation results from 35 participants reveal an overall System Usability Scale (SUS) score of 69.64, with participants who engaged with AR/VR technologies more frequently rating the system significantly higher, at 80.71. This research is particularly relevant for applications in gaming, education, and AR-based e-commerce, offering intuitive model creation for users without specialized skills.
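For context, object isolation of the kind described can be prototyped with torchvision's pretrained Mask R-CNN, as sketched below. The score threshold and the random placeholder image are assumptions, and the paper's actual Shap-E/AR pipeline is not reproduced here.

```python
import torch
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights,
)

# Off-the-shelf instance segmentation as one way to isolate a foreground
# object before handing it to a 3D generation model.
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
img = torch.rand(3, 480, 640)                  # placeholder RGB image in [0, 1]
with torch.no_grad():
    out = model([img])[0]
keep = out["scores"] > 0.8                     # keep confident detections (assumed)
masks = out["masks"][keep]                     # Nx1xHxW soft masks for isolation
print(masks.shape)
```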
... If the condition is met, the feature point is considered dynamic. If not, it is considered static (Girshick et al., 2014; Borrego et al., 2018; Ren et al., 2015). ...
Article
Conventional simultaneous localization and mapping (SLAM) systems for agricultural robots rely heavily on static rigidity assumptions, which makes them susceptible to the influence of dynamic target feature points in the environment and leads to poor localization accuracy and robustness. To address these issues, this paper proposes a method that utilizes a target detection algorithm to identify and eliminate dynamic target feature points in a farm depot. The method initially employs the YOLOv5 target detection algorithm to recognize dynamic targets in the captured warehouse environment images. The detected targets are then integrated into the feature extraction process at the front end of the visual SLAM system. Next, dynamic feature points belonging to the dynamic targets are eliminated from the extracted image feature points using the LK optical flow method. Finally, the remaining feature points are used for location matching, map construction, and localization. The final test on the TUM dataset shows that the enhanced visual SLAM system improves localization accuracy by 91.47% compared to ORB-SLAM2 in highly dynamic scenes. This improvement increases the accuracy and robustness of the system and outperforms some of the best SLAM algorithms while maintaining high real-time performance. These features make it more valuable for mobile devices.
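The elimination step can be pictured as follows: track feature points with Lucas-Kanade optical flow and discard those falling inside detector-reported dynamic boxes. The helper below is a hypothetical sketch; the YOLOv5 inference producing `dynamic_boxes` and the integration into an ORB-SLAM2-style front end are assumed, not shown.

```python
import cv2
import numpy as np

def filter_dynamic_points(prev_gray, gray, points, dynamic_boxes):
    """Keep only static feature points.

    points: Nx1x2 float32 array of previous-frame feature locations;
    dynamic_boxes: list of (x1, y1, x2, y2) boxes from a detector.
    """
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    kept = []
    for (x, y), ok in zip(nxt.reshape(-1, 2), status.reshape(-1)):
        if not ok:
            continue                          # tracking failed for this point
        inside = any(x1 <= x <= x2 and y1 <= y <= y2
                     for x1, y1, x2, y2 in dynamic_boxes)
        if not inside:                        # reject points on dynamic targets
            kept.append((x, y))
    return np.float32(kept).reshape(-1, 1, 2)
```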
... The fused features are then processed through CSP blocks for final refinement before being delivered to the detection head. The detection head retains YOLOv5's anchor-based [31] design, performing classification and bounding box regression using predefined anchor boxes, with Non-Maximum Suppression (NMS) [32] filtering redundant detections. ...
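For reference, the NMS filtering mentioned above can be reproduced with `torchvision.ops.nms`, which keeps the highest-scoring box among heavily overlapping ones; YOLOv5 applies it per class in practice, which this toy example omits.

```python
import torch
from torchvision.ops import nms

# Two heavily overlapping boxes plus one separate box: NMS suppresses the
# lower-scoring duplicate and keeps the rest.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 104., 102.],   # IoU ~0.9 with the first box
                      [200., 200., 300., 280.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)    # indices of retained boxes
print(keep)                                      # tensor([0, 2])
```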
Preprint
Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, preventing downstream tasks from obtaining accurate content information. To obtain fine-grained spatial coordinates of cells, we propose the OG-HFYOLO model, which enhances the edge response with a Gradient Orientation-aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale-aware loss function to adapt to multi-scale object features, and introduces mask-driven non-maximum suppression in the post-processing, replacing the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the dataset gap for fine-grained spatial coordinate localization of deformed table cells, and derive a large-scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy compared with mainstream instance segmentation models. The dataset and the source code are open source: https://github.com/justliulong/OGHFYOLO.
... Object detection methods built on deep learning are divided into two basic types: single-and two-stage. Two-stage object detection methods comprise R-CNN [7], Faster R-CNN [8], and Fast R-CNN [9]. For instance, Hu et al. implemented the GARPN method to enhance the accuracy of anchor predictions and effectively integrated multi-level feature information into the Faster R-CNN framework [10]. ...
Article
Full-text available
The manufacturing quality of printed circuit boards (PCBs) significantly influences the functionality and life expectancy of electronic devices. This paper introduces YOLO-WWBi, a surface-defect detection method based on an improved YOLO11 framework. First, an improved weighted and re-parameterized ghost multi-scale feature aggregation module (WRGMSFA) is designed. This module focuses more on defect information channels, enhancing multi-scale feature extraction while suppressing redundant information. Then, BiFPN is integrated into the neck to enhance the quality of fused features and deepen the interaction of feature information. Finally, the WIoU loss function is employed to optimize the localization of defect positions, thereby enhancing robustness against highly similar PCB background interference. The experimental results indicate that YOLO-WWBi achieves an mAP of 96.6%, surpassing YOLO11 by 5.4 points. Its performance metrics satisfactorily meet the requirements for high-precision, real-time PCB defect detection.
... Composite Feature: In deep neural networks, the spatial resolution of input features becomes smaller after being down-sampled by multiple network layers [28], while high-level semantic and contextual information becomes richer. In terms of implementation, the final outputs of the V1, V2, and V3 structures differ slightly. ...
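The resolution-versus-semantics trade-off described here is easy to verify empirically. The sketch below uses torchvision's feature-extraction utility on ResNet-18 (standing in for the unnamed backbone) to show spatial size halving while channel depth grows at each down-sampling stage.

```python
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Tap the output of each residual stage of a standard backbone.
extractor = create_feature_extractor(
    resnet18(weights=None),
    return_nodes={"layer1": "C2", "layer2": "C3", "layer3": "C4", "layer4": "C5"},
)
feats = extractor(torch.randn(1, 3, 224, 224))
for name, f in feats.items():
    print(name, tuple(f.shape))
# C2 (1, 64, 56, 56) -> C3 (1, 128, 28, 28) -> C4 (1, 256, 14, 14) -> C5 (1, 512, 7, 7):
# resolution shrinks while channel (semantic) capacity grows.
```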
Article
Full-text available
Fabric defect detection is a major quality control process in the textile industry. Compared with common detection tasks, however, printed fabric defect detection presents several difficulties: the pattern texture of printed fabric is complex; defects are of numerous types and vary greatly in size and shape; and most defects are extremely small. To address these difficulties in real scenes, this paper first adopts the idea of template matching from change detection, using template image features to eliminate the background pattern texture of target images. To handle the large scale differences among defects, the paper proposes Ratio DCN, built on deformable convolution, which better models defects of different scales, especially those with abnormal ratios of long to short edges. To cope with the large number of small defects, the paper uses template image features as an auxiliary structure and fuses the high-level and low-level features of the detected and template images, improving the detection rate of small defects. In addition, a Diff Attention structure is proposed: based on the differential features between the target and template images, the model's attention is focused on the region where the defect is located, strengthening its feature extraction for defects. The good generalization of the resulting DPDAN is verified on both its own dataset and a public dataset.
... This method achieves high detection accuracy but slow detection speed. R-CNN (Regions with Convolutional Neural Network Features) [13][14][15] is a two-stage algorithm: first, the candidate boxes are extracted. ...
Article
Full-text available
Unmanned aerial vehicles (UAVs) play an important role in many areas. To improve small object detection accuracy in UAV images, a small object detection model based on YOLOv7 is proposed. First, a multi-scale feature fusion structure for extracting small object features is configured. Second, a dynamic head with scale attention is introduced to focus on the shallow feature maps where small objects reside. Finally, an auxiliary anchor label allocation strategy is proposed to provide positive labels for small objects by expanding the ground truth (GT) and prior anchor areas. Experiments were carried out on the VisDrone2019 and UAVDT datasets to verify that the proposed algorithm achieves higher detection accuracy on UAV images. The experimental results show that the proposed algorithm improves mAP0.5 by 6% and mAP0.5:0.95 by 6.5% on VisDrone2019, and mAP0.5 by 1.2% and mAP0.5:0.95 by 4.5% on UAVDT.
Preprint
Full-text available
Vehicle detection is crucial for intelligent decision support in transportation systems. However, real-time detection of vehicles is challenging due to geometric variations of vehicles and complex environmental factors such as light conditions and weather. To address these issues, the paper introduces the You-Only-Look-Once with Deformable Convolution and Cross-channel Coordinate Attention (YOLO-DC) framework that improves the performance and reliability of vehicle detection. First, YOLO-DC incorporates Cross-channel Coordinate Attention, which combines channel attention and coordinate attention, to more accurately cover target sampling positions and enhance feature extraction from vehicles of various shapes. Second, to better handle vehicles of different sizes, we employ Multi-scale Grouped Convolution to enable multi-scale awareness and streamline parameter sharing. Additionally, we incorporate channel prior convolutional attention so that the model can concentrate on areas of vehicles that are critical for detection. We also optimize feature fusion by leveraging a highly efficient fusion of C2f (CSP Bottleneck with 2 Convolutions) and FasterNet to reduce the model size. Experimental results demonstrate that YOLO-DC performs better than the state-of-the-art YOLOv8n method in detecting small, medium, and large-sized vehicles, and in detecting vehicles in adverse weather conditions. In addition to its superior performance, YOLO-DC also features fast detection speed, making it appropriate for real-time detection on devices with limited computational power.
Thesis
Full-text available
The visual assessment of microscopic samples by pathologists constitutes an essential component of cancer diagnostics. Traditional pathology workflows were based on the visual assessment of samples under the microscope. The development of designated slide scanners has facilitated the digitization of microscopy samples which not only allowed for digital archiving and remote expert consultancy but also facilitated the use of machine learning-based image analysis algorithms for computer-aided diagnosis. Meanwhile, a wide range of computer-aided systems has been developed in the field of histopathology, often matching the performance of trained pathologists. Previous work has shown that machine learning-based image analysis algorithms, and especially convolutional neural networks, can be very susceptible to changes in the visual appearance of images. In pathology, these domain shifts can be caused when applying trained algorithms to different morphologies, or samples prepared at a different pathology lab. The preparation of histologic samples follows routine stages, including tissue fixation, dehydration, paraffin embedding, and microtome sectioning. Subsequently, a sample is usually stained with a specific dye and digitized with a designated scanning system. The visual manifestation of these sample preparation steps can be highly specific to the respective pathology lab. This thesis investigates the impact of different domain shifts on the performance of convolutional neural networks in histopathology. For these experiments, three routine tasks in cancer diagnostics were considered: cross-scanner mitotic figure detection, cross-domain tumor segmentation, and pan-tumor T-lymphocyte detection on immunohistochemistry samples. For the task of cross-scanner mitotic figure detection, domain adversarial training was employed. Evaluations of the learned embeddings demonstrated the successful extraction of scanner-agnostic features. For the task of cross-domain tumor segmentation, representation learning and, in particular, self-supervised learning was explored as a pre-training strategy to align feature embeddings across domains and thereby enhance the domain agnosticity for the downstream task. The results provide insights into the applicability of self-supervised learning in the context of histopathology. To date, this technique has mostly been employed in the field of natural images. In a project addressing the detection of tumor-infiltrating lymphocytes in immunohistochemistry samples, fine-tuning was leveraged to bridge the domain gap between different tumor indications. Initial experiments exhibited degraded performance on out-of-distribution samples. By exploiting fine-tuning on a limited number of target domain samples, this degradation was effectively mitigated. The experiments allowed for recommendations on the development of robust algorithms for the detection of lymphocytes across different tumor morphologies. In the course of the thesis, several cross-domain datasets were curated, focusing on different sources of domain shift. This includes a fully annotated dataset of 350 whole slide images covering seven canine cutaneous tumor subtypes, which constitutes one of the most comprehensive open histopathology segmentation datasets to date. A high annotation quality of each published dataset was ensured through extensive multi-rater experiments on selected subsets of the data.
By making these datasets publicly available, future work on the cross-domain generalization for histopathology was facilitated.
Preprint
Full-text available
The task of weakly supervised temporal sentence grounding (WSTSG) aims to detect temporal intervals corresponding to a language description from untrimmed videos with only video-level video-language correspondence. For an anchor sample, most existing approaches generate negative samples either from other videos or within the same video for contrastive learning. However, some training samples are highly similar to the anchor sample, directly regarding them as negative samples leads to difficulties for optimization and ignores the correlations between these similar samples and the anchor sample. To address this, we propose Positive Sample Mining (PSM), a novel framework that mines positive samples from the training set to provide more discriminative supervision. Specifically, for a given anchor sample, we partition the remaining training set into semantically similar and dissimilar subsets based on the similarity of their text queries. To effectively leverage these correlations, we introduce a PSM-guided contrastive loss to ensure that the anchor proposal is closer to similar samples and further from dissimilar ones. Additionally, we design a PSM-guided rank loss to ensure that similar samples are closer to the anchor proposal than to the negative intra-video proposal, aiming to distinguish the anchor proposal and the negative intra-video proposal. Experiments on the WSTSG and grounded VideoQA tasks demonstrate the effectiveness and superiority of our method.
Article
Full-text available
Data augmentation (DA) tailored to instances is vital for instance segmentation to improve model robustness and accuracy without high manual annotation costs. Existing erasing methods risk losing information about instances, whereas instance-level methods necessitate additional overhead, such as an object bank and context calculations to decide locations to attach objects to an image. Thus, we propose the instance-centric erasing-based DA method, FlickBI, which enhances focus on the target object and diversity by randomly eliminating confusing information. FlickBI consists of two separate methods: FlickBack and FlickIns. FlickBack removes the unrelated background information based on the given annotations. FlickIns stochastically deletes instances, assuming that instances within one image are independent of each other. The experiments reveal that the proposed simple yet effective method consistently enhances performance across detectors and backbones on three benchmark datasets with just a few lines of code, even if the diversity of transformed images and the number of used instances are lower. On the COCO dataset, FlickBI achieves mask mAP improvements ranging from 1.1 to 4.9. Moreover, on the Cityscapes and LVIS datasets, there is an average improvement in mask AP of +3.1 and +4.8, respectively. Furthermore, FlickBI demonstrates synergistic improvements with other instance-level DA methods.
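A rough sketch of the two erasing behaviors, as we read them from the abstract, follows: a FlickBack-like pass removes background outside the annotated regions, and a FlickIns-like pass independently drops instances. The probabilities, the box-based (rather than mask-based) erasing, and the function itself are our assumptions, not the authors' code.

```python
import numpy as np

def flick_style_erase(image, boxes, p_bg=0.5, p_ins=0.3, rng=None):
    """Hypothetical instance-centric erasing in the spirit of FlickBack/FlickIns.

    image: HxWx3 uint8 array; boxes: list of (x1, y1, x2, y2) int tuples.
    Returns the augmented image and the boxes that survive instance dropping.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    if rng.random() < p_bg:            # FlickBack-like: erase unrelated background
        keep_mask = np.zeros(image.shape[:2], dtype=bool)
        for x1, y1, x2, y2 in boxes:
            keep_mask[y1:y2, x1:x2] = True
        out[~keep_mask] = 0
    kept = []
    for box in boxes:                  # FlickIns-like: drop instances independently
        if rng.random() < p_ins:
            x1, y1, x2, y2 = box
            out[y1:y2, x1:x2] = 0      # erase this instance
        else:
            kept.append(box)
    return out, kept
```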
Article
Full-text available
Multi-object tracking (MOT) is a critical task involving detecting and continuously tracking multiple objects within a video sequence. It is widely used in various fields, such as autonomous driving and intelligent security. In recent years, deep learning architectures have effectively promoted the development of MOT. However, this task poses significant challenges regarding accuracy due to occlusion/truncation, light variation, and camera movement. Researchers have proposed many methods to address these issues and reduce trajectory fragmentation, identity switches, and missing targets. To better understand these advancements, it is essential to categorize the approaches based on their methodologies. This article reviews recent developments in MOT, divided into Tracking by Detection (TBD) and End-to-End (E2E) approaches. By introducing and comparing the two types of tracking algorithms, readers can quickly understand the current state of MOT. The review also summarizes links to open-source code of notable algorithms and common benchmark datasets in the appendix, and provides a unified MOT toolkit that includes evaluation and visualization at https://github.com/guanzhiyu817/MOT-tools. In addition, this review discusses future directions of MOT, specifically cross-modal reasoning.
Article
Full-text available
Coffee cultivation is vital to the global economy, but faces significant challenges with diseases such as rust, miner, phoma, and cercospora, which impact production and sustainable crop management. In this scenario, deep learning techniques have shown promise for the early identification of these diseases, enabling more efficient monitoring. This paper proposes an approach for detecting diseases and pests on coffee leaves using an efficient single-shot object-detection algorithm. The experiments were conducted using the YOLOv8, YOLOv9, YOLOv10 and YOLOv11 versions, including their variations. The BRACOL dataset, annotated by an expert, was used in the experiments to guarantee the quality of the annotations and the reliability of the trained models. The evaluation of the models included quantitative and qualitative analyses, considering the mAP, F1-Score, and recall metrics. In the analyses, YOLOv8s stands out as the most effective, with a mAP of 54.5%, an inference time of 11.4 ms and the best qualitative predictions, making it ideal for real-time applications.
Article
Full-text available
Model lightweighting and efficiency are essential in UAV target recognition. Given the limited computational resources of UAVs and the system’s high stability demands, existing complex models often do not meet practical application requirements. To tackle these challenges, this paper proposes LW-YOLOv8, a lightweight object detection algorithm based on the YOLOv8s model for UAV deployment. First, Cross Stage Partial Convolutional Neural Network (CNN) Transformer Fusion Net (CSP-CTFN) is proposed. It integrates convolutional neural networks and a multi-head self-attention (MHSA) mechanism, and achieves comprehensive global feature extraction through an expanded receptive field. Second, Parameter Shared Convolution Head (PSC-Head) is designed to enhance detection efficiency and further minimize model size. Furthermore, the original loss function is replaced with SIoU to enhance detection accuracy. Extensive experiments on the VisDrone2019 dataset show that the proposed model reduces parameters by 37.9%, computational cost by 22.8%, and model size by 36.9%, while improving AP, AP50, and AP75 by 0.2%, 0.2%, and 0.4%, respectively. The results indicate that the proposed model performs effectively in UAV recognition applications.
Article
In the age of digitalization, the proliferation of counterfeit logos has become a significant concern for businesses and consumers alike. Counterfeit logos not only deceive consumers but also tarnish the brand's reputation and result in substantial economic losses. This study addresses the challenge of detecting fake logos using advanced machine learning techniques. We propose a robust framework for fake logo detection that leverages convolutional neural networks (CNNs) to differentiate between authentic and counterfeit logos. The framework consists of a preprocessing step where logos are normalized and augmented to enhance the model's generalization capabilities. The CNN architecture is designed to capture intricate features of logos through multiple layers of convolution, pooling, and fully connected networks. The model is trained and validated on a comprehensive dataset that includes a diverse range of authentic and fake logos across various industries. To ensure the robustness of our approach, we employ data augmentation techniques such as rotation, scaling, and color variations, thereby simulating real-world scenarios where logos might appear in different orientations and lighting conditions. Our experimental results demonstrate that the proposed CNN-based model achieves high accuracy and precision in detecting fake logos, outperforming traditional image processing and machine learning methods. We also conduct a comparative analysis with existing state-of-the-art techniques, highlighting the strengths and limitations of our approach. The findings of this study have significant implications for brand protection and intellectual property rights enforcement. By deploying the proposed fake logo detection system, businesses can safeguard their brand integrity and consumers can be protected from counterfeit products. Future work will focus on enhancing the model's scalability and real-time detection capabilities, as well as expanding the dataset to include more diverse logo designs and counterfeit techniques. In conclusion, this study presents a novel and effective solution for fake logo detection using deep learning, contributing to the broader effort of combating counterfeiting in the digital era.
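Since the abstract describes the architecture only at the level of convolution, pooling, and fully connected layers plus rotation/scaling/color augmentation, the sketch below shows one plausible PyTorch instantiation; every layer size and augmentation parameter here is an assumption, not the authors' configuration.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Hypothetical augmentation pipeline mirroring the rotations, scaling,
# and color variations the abstract describes.
augment = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(128, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, saturation=0.2),
    transforms.ToTensor(),
])

# A minimal conv/pool/fully-connected binary classifier of the kind described.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 16 * 16, 256), nn.ReLU(),
    nn.Linear(256, 2),                 # authentic vs. counterfeit
)
logits = model(torch.randn(1, 3, 128, 128))
print(logits.shape)                    # torch.Size([1, 2])
```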
Article
Full-text available
Target detection helps to identify, locate, and monitor key components and potential issues in power sensing networks. The fusion of infrared and visible light images can effectively integrate the target-indication characteristics of infrared images with the rich scene detail of visible light images, thereby enhancing target detection for power equipment in complex environments. In order to improve the registration accuracy and feature extraction stability of traditional registration algorithms for infrared and visible light images, an image registration method based on an improved SIFT algorithm is proposed. The image is preprocessed using edge detection and corner detection algorithms to extract relatively stable feature points, and the feature vectors with excessive gradient values in the normalized visible light image are truncated and normalized again to eliminate the influence of nonlinear lighting. To address the issue of insufficient deep information extraction during image fusion with a single deep learning network, a dual ResNet network is designed to extract deep-level feature information from infrared and visible light images, improving the similarity of the fused images. The image fusion technology based on the dual ResNet network was applied to the target detection of sensing insulators in the power sensing network, improving the average accuracy of target detection. The experimental results show that the improved registration algorithm increases the registration accuracy of each group of images by more than 1%, the structural similarity of image fusion in the dual ResNet network improves by about 0.2% compared to the single ResNet network, and the mean Average Precision (mAP) of the fusion image obtained via the dual ResNet network improves by 3% and 6% compared to the infrared and visible light images, respectively.
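The classical SIFT-based registration backbone being improved here can be sketched with OpenCV as follows: detect keypoints, match descriptors with a ratio test, and estimate a homography with RANSAC. The paper's edge/corner preprocessing and descriptor re-normalization steps are not included, and the file names are placeholders.

```python
import cv2
import numpy as np

ir = cv2.imread("infrared.png", cv2.IMREAD_GRAYSCALE)    # placeholder images
vis = cv2.imread("visible.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(ir, None)
kp2, des2 = sift.detectAndCompute(vis, None)

# Lowe's ratio-test matching, then RANSAC homography for registration.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
aligned = cv2.warpPerspective(ir, H, (vis.shape[1], vis.shape[0]))
```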
Article
In this paper, an improved vehicle detection method based on YOLOv11 and an attention mechanism is proposed. By optimizing the network structure and integrating the attention mechanism, the accuracy and efficiency of vehicle detection are significantly enhanced. The experimental results show that the proposed method outperforms the traditional YOLOv11 in various scenarios, providing a more reliable solution for intelligent transportation systems and related fields.
Technical Report
DeepQuery VisionXTrans is an advanced computer vision model leveraging Vision Transformers (ViTs) to detect deviations from expected crowd behavior. It combines spatial and temporal analysis for real-time anomaly detection and behavior understanding. Optimized for high accuracy and efficiency, the model is ideal for surveillance and safety-critical applications. Its robust performance in varying conditions makes it suitable for dynamic, large-scale environments.
Article
Full-text available
Defect detection in industrial computed tomography (CT) images remains challenging due to small defect sizes, low contrast, and noise interference. To address these issues, we propose Defect R-CNN, a novel detection framework designed to capture the structural characteristics of defects in CT images. The model incorporates an edge-prior convolutional block (EPCB) that guides the network to focus on extracting edge information, particularly along defect boundaries, improving both localization and classification. Additionally, we introduce a custom backbone, edge-prior net (EP-Net), to capture features across multiple spatial scales, enhancing the recognition of subtle and complex defect patterns. During inference, the multi-branch structure is consolidated into a single-branch equivalent to accelerate detection without compromising accuracy. Experiments conducted on a CT dataset of nuclear graphite components from a high-temperature gas-cooled reactor (HTGR) demonstrate that Defect R-CNN achieves average precision (AP) exceeding 0.9 for all defect types. Moreover, the model attains mean average precision (mAP) scores of 0.983 for bounding boxes (mAP-bbox) and 0.956 for segmentation masks (mAP-segm), surpassing established methods such as Faster R-CNN, Mask R-CNN, EfficientNet, RT-DETR, and YOLOv11. The inference speed reaches 76.2 frames per second (FPS), representing an optimal balance between accuracy and efficiency. This study demonstrates that Defect R-CNN offers a robust and reliable approach for industrial scenarios that require high-precision and real-time defect detection.
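The inference-time consolidation the authors report is in the spirit of structural re-parameterization (RepVGG-style). Since the exact EPCB/EP-Net design is not given here, the sketch below only demonstrates the generic trick of fusing a parallel 3x3 + 1x1 convolution pair into a single 3x3 convolution with identical outputs.

```python
import torch
import torch.nn as nn

def fuse_branches(conv3, conv1):
    """Fold a parallel 3x3 + 1x1 conv pair into one equivalent 3x3 conv."""
    fused = nn.Conv2d(conv3.in_channels, conv3.out_channels, 3, padding=1, bias=True)
    w = conv3.weight.data.clone()
    # Pad the 1x1 kernel to 3x3 (centered) and add it to the 3x3 kernel;
    # convolution is linear, so the sum of branches equals one fused conv.
    w += nn.functional.pad(conv1.weight.data, [1, 1, 1, 1])
    b = conv3.bias.data + conv1.bias.data
    fused.weight.data.copy_(w)
    fused.bias.data.copy_(b)
    return fused

c3 = nn.Conv2d(8, 8, 3, padding=1)
c1 = nn.Conv2d(8, 8, 1)
x = torch.randn(1, 8, 32, 32)
y_multi = c3(x) + c1(x)                      # training-time multi-branch output
y_single = fuse_branches(c3, c1)(x)          # inference-time single-branch output
print(torch.allclose(y_multi, y_single, atol=1e-5))  # True: identical outputs
```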
Article
Full-text available
Fine-grained ship detection tasks require models to accurately classify fine-grained categories and precisely localize them within complex backgrounds, relying on detailed features. The challenges of this task mainly lie in bird’s-eye viewpoints, scale variations, rotational changes, and environmental factors, which lead to minor inter-class differences and significant intra-class variations. This paper presents a novel model, called Wavelet Transform-based Dual-Stream Backbone Network (WTDBNet), which effectively integrates three key strengths: the ability of the Transformer to model long-range dependencies for global context, the capability of convolutional neural networks to extract detailed local features, and the efficiency of wavelet transform in frequency-domain decomposition for enhancing edges and texture details. These components are fused via channel and spatial attention mechanisms, thereby improving the model’s ability to extract discriminative features. The effectiveness of WTDBNet is validated on two widely used benchmarks for fine-grained oriented ship detection, as well as on a self-constructed dataset designed to represent complex scenarios. Experimental results demonstrate the superior performance of the proposed method.
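The frequency-domain decomposition WTDBNet leverages can be illustrated with PyWavelets: a single-level 2D Haar DWT splits an image into a low-frequency approximation and three detail bands carrying edge and texture information. How these bands feed the dual-stream backbone is specific to the paper and not reproduced here; the input below is a random placeholder.

```python
import numpy as np
import pywt

# Single-level 2D Haar discrete wavelet transform: LL is the low-frequency
# approximation; LH, HL, HH are horizontal, vertical, and diagonal detail
# bands that emphasize edges and texture.
image = np.random.rand(256, 256).astype(np.float32)  # placeholder grayscale input
LL, (LH, HL, HH) = pywt.dwt2(image, "haar")
print(LL.shape, LH.shape)                            # each band is (128, 128)
```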