Conference Paper

You Only Look Once: Unified, Real-Time Object Detection

... Historically, object detection methods such as Background Subtraction [7], Haar Cascades [8], Histogram of Oriented Gradients (HOG) [9], and Template Matching [10] laid foundational principles. Transitioning to deep learning (DL) and machine learning (ML), significant advances were made through methods such as SSD [11], YOLO (You Only Look Once) [12], Faster R-CNN [13], Mask R-CNN [14], RetinaNet [15], and EfficientDet [16], which revolutionized speed and accuracy in detection tasks. More recently, LVLMs such as ContextDET [17], VOLTRON [18], DVDet [19], DOD Framework [20], Synthetic negative generation [21], and DetGPT [22] have integrated complex language understanding capabilities, enabling dynamic and contextually aware object detection across diverse and challenging environments. ...
... For instance, the Single Shot MultiBox Detector (SSD) [11] efficiently processes images in one shot to detect objects, delivering both their locations and class predictions. Likewise, YOLO streamlines detection by dividing images into grids, each predicting bounding boxes and probabilities, enabling rapid real-time detection [12,36]. Additionally, Fast R-CNN [13] and Faster R-CNN [13] enhance detection by using shared convolutional features and region proposal networks, respectively, to quickly and accurately predict object locations and classes [37]. ...
... • Architectural Innovations in LVLMs: Traditional deep learning models such as YOLO, SSD, and Faster R-CNN are fundamentally built on convolutional neural networks, each optimized for specific aspects of object detection [105]. YOLO, known for its single-stage detection mechanism, divides the image into grids, predicting bounding boxes and class probabilities directly from these grid cells using anchor boxes [12,106]. SSD extends this by utilizing multiple feature maps to detect objects across various scales in a single forward pass, optimizing for speed [11]. ...
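The excerpts above repeatedly describe YOLO's grid formulation: each cell of an S x S grid predicts B boxes and C class scores in one forward pass. A minimal NumPy sketch of decoding such a prediction tensor into image-space boxes is shown below; it follows a YOLOv1-style layout and omits non-maximum suppression, so treat it as an illustration rather than the paper's implementation.

```python
import numpy as np

def decode_yolo_grid(pred, B, C, img_w, img_h, conf_thresh=0.25):
    """Decode a YOLOv1-style tensor of shape (S, S, B*5 + C) into boxes.

    Per cell: B boxes of (x, y, w, h, conf) followed by C class scores.
    (x, y) are offsets within the cell; (w, h) are fractions of the image.
    Simplified sketch: no non-maximum suppression is applied here.
    """
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_scores = cell[B * 5: B * 5 + C]
            cls_id = int(np.argmax(class_scores))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                score = float(conf * class_scores[cls_id])
                if score < conf_thresh:
                    continue
                cx = (col + x) / S * img_w          # cell offset -> pixels
                cy = (row + y) / S * img_h
                bw, bh = w * img_w, h * img_h
                detections.append((cls_id, score,
                                   cx - bw / 2, cy - bh / 2,
                                   cx + bw / 2, cy + bh / 2))
    return detections

# Toy usage: a 7x7 grid, 2 boxes per cell, 20 classes (YOLOv1's PASCAL VOC sizes)
pred = np.random.rand(7, 7, 2 * 5 + 20).astype(np.float32)
boxes = decode_yolo_grid(pred, B=2, C=20, img_w=448, img_h=448)
print(len(boxes), "candidate detections above the confidence threshold")
```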
Preprint
Full-text available
The integration of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent vision language models for object detection, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the integration of visual and textual information, demonstrating the progress made in object detection using vision language models that facilitate more sophisticated object detection and localization strategies. Additionally, the review includes visualizations depicting LVLMs' effectiveness across diverse scenarios, extending beyond conventional object detection to include localization and segmentation tasks in images. Moreover, in this review, we present the comparative performance of LVLMs against traditional deep learning systems for object detection in terms of their real-time capabilities, adaptability, and system complexities. Concluding with a discussion on the future roadmap, this review outlines the transformative impact of LVLMs in object detection, providing a fundamental understanding and critical insights into their operational efficacy and scientific advancements. Based on the findings of this review, we anticipate that LVLMs will soon surpass traditional deep learning methods in object detection. This progress paves the way for hybrid models that integrate the precision of conventional architectures with the contextual and semantic strengths of LVLMs, thereby maximizing performance across diverse and complex detection tasks. Index Terms: Object detection with large language models, Vision-language model (VLM) integration, Multimodal object detection, Cross-modal understanding in AI, Object segmentation with VLMs, Image segmentation using vision-language models, Vision perception evaluation metrics, Deep learning for object recognition, Generative AI for visual tasks, Semantic segmentation with LLMs, Zero-shot object detection, Few-shot learning in vision-language tasks, CLIP model for object detection, DETR architectures for vision-language tasks, Automated image annotation with AI, Visual question answering (VQA), Scene understanding with LLMs, Image-text alignment in VLMs, Panoptic segmentation with LLMs, Transformer-based vision models, Self-supervised learning for VLMs, BERT for visual grounding, GPT-4 vision capabilities, SAM (Segment Anything Model) applications, Open-vocabulary detection systems, Instance segmentation with VLMs, Vision-language pretraining techniques, Multimodal deep learning frameworks, NLP in computer vision tasks, Image captioning with object detection, Visual grounding in language models, Context-aware object recognition, Real-time detection using VLMs, Transfer learning for vision-language models
... However, due to its limited performance [8], computer vision did not receive wider attention until the year 2012 [9], when the world witnessed the rebirth of the convolutional neural network (CNN) [10]. Following this, R. Girshick et al. proposed the prestigious Regions with CNN features (R-CNN) [11] in 2014, by exploiting CNNs in object detection, while both "You Only Look Once" (YOLO) [12] and the Single Shot MultiBox Detector (SSD) [13] were proposed in 2016. Since then, VA has evolved rapidly and has been harnessed in various scenarios. ...
... In other words, several steps can be integrated into a single step. For example, an object can be recognized directly by a CNN-based algorithm [12], without separating image segmentation and object detection. Furthermore, given that VA tasks vary over diverse scenarios, one or multiple steps can be skipped. ...
Article
Video analytics (VA), capable of autonomously understanding events in video content, has demonstrated significant potential across various applications, from surveillance to self-driving cars and industrial automation. However, traditional VA, relying on either end-device or cloud-based solutions, faces limitations such as restricted on-device computing power and network congestion at cloud centers. Edge computing offers a promising solution, enabling low-latency, high-accuracy, and bandwidth-efficient performance, thus supporting the rapid growth of VA deployment. This article provides a comprehensive review of VA at the edge, examining aspects of model training, deployment, end-edge-cloud orchestration, and VA platforms. Specifically, we explore model training approaches conducted in the cloud, at the edge, and in hybrid cloud-edge configurations. We also discuss various model deployment techniques, including quantization and network pruning. Furthermore, the article surveys end-edge-cloud orchestration strategies, categorized into VA query offloading and query scheduling. We evaluate practical deployments and review the literature on VA platforms. Finally, we outline several promising future research directions for advancing this field.
... The six approaches depicted in Figure 1 are Convolutional Neural Networks (CNNs) [9] , Transformer-based models [10,11], Vision Language Model-based approaches [12,13], Hybrid models such as RetinaMask and EfficientDet [14,15], Sparse Coding and Dictionary Learning models, and Traditional Feature-based approaches. Among them, CNNs, including the YOLO (You Only Look Once) series [16,17] and R-CNN (Region-based CNN) family such as Mask R-CNN [18], have become staples in practical deployments due to their proficient handling of spatial hierarchies. Transformer-based models like DETR (Detection Transformer) such as dynamic DETR [19] and deformable DETR [20] utilize self-attention mechanisms to treat images as sequences of patches, which helps in integrating a global context and eliminates the need for Non-Maximum Suppression (NMS) [21], streamlining postprocessing [22]. ...
... • YOLO Series: Starting with YOLOv1 [16], which reframed object detection as a single regression problem from image pixels to bounding box coordinates and class probabilities, through to YOLOv12 [32], which introduced improvements like anchor-free detection and dynamic label assignment for enhanced accuracy and efficiency [33,34,35,31]. ...
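One excerpt above notes that DETR-style detectors eliminate the Non-Maximum Suppression step that grid-based CNN detectors such as YOLO rely on for post-processing. For context, here is a minimal greedy NMS sketch over axis-aligned boxes, with an illustrative overlap threshold, not tied to any particular paper's settings.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop heavily overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the near-duplicate of box 0 is suppressed
```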
Preprint
Full-text available
This study conducts a detailed comparison of the RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
... Zhang et al. proposed a costsensitive residual convolutional neural network that effectively balances the different misclassification costs of sample imbalance and false defects in PCB detection [16]. Another category is represented by first-stage algorithms such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) [17,18]. These algorithms directly utilize convolutional neural networks to extract target features and perform classification and regression on the targets. ...
Article
Full-text available
Aiming at the problems of low accuracy and large computation in PCB defect detection, this paper proposes a lightweight PCB defect detection algorithm based on YOLO. First, to address the problem of large numbers of parameters and calculations, GhostNet is used in the backbone to keep the model lightweight. Second, the ordinary convolution of the neck network is replaced with depthwise separable convolution, reducing redundant parameters within the neck network. Afterwards, the Swin-Transformer is integrated with the C3 module in the neck to build the C3STR module, which aims to address the issue of cluttered backgrounds in defective images and the confusion caused by simple defect types. Finally, the PANet network structure is replaced with the bidirectional feature pyramid network (BiFPN) structure to enhance the fusion of multi-scale features in the network. The results indicated that, compared with the original model, there was a 47.2% reduction in the model's parameter count, a 48.5% reduction in GFLOPs, a 42.4% reduction in weight, a 2.0% reduction in FPS, and a 2.4% rise in mAP. As a result, the model is better suited for deployment on platforms with limited computing power.
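One of the lightweighting steps above swaps ordinary convolutions for depthwise separable ones; the small sketch below compares parameter counts of the two forms for an illustrative 3x3 layer (bias terms ignored), which is where most of that saving comes from.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv per input channel + 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 128, 256, 3          # illustrative layer sizes
std = conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std:,}  depthwise-separable: {dws:,}  "
      f"reduction: {1 - dws / std:.1%}")
```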
... YOLO was first proposed in 2016 [17]. Owing to its superior speed and accuracy compared to other models, YOLO has gained widespread adoption in the field of object detection. ...
Article
Full-text available
In rice cultivation, manually identifying rice blast disease is time-consuming and labor-intensive. Computer vision technology enables real-time and accurate disease detection in complex rice field environments. However, conventional object detection models face challenges in direct deployment on embedded edge devices due to their high computational demands. To address this issue, this paper proposes a lightweight rice blast detection model based on YOLOv8n. First, a Star Operation is introduced to optimize the C2f module in YOLOv8, enhancing the model’s ability to extract multi-scale features of rice blast disease. Second, a lightweight High-level Screening Feature Pyramid Network (HSFPN) is adopted to improve feature fusion efficiency, and a detection head incorporating shared parameters and Detail Enhancement Convolution (DEConv) is designed to reduce computational costs while improving detection accuracy. Finally, the LAMP pruning algorithm is applied to remove redundant parameters, further lightweighting the model. Experimental results demonstrate that, compared to YOLOv8n (3.01M parameters, 8.1 GFLOPs), the proposed model reduces parameters by 88% (from 3.01M to 0.36M) and decreases computational cost by 59.2% (from 8.1 GFLOPs to 3.3 GFLOPs), while maintaining an mAP@50 of 78.6% (an improvement of 0.2%). Furthermore, the model achieves a real-time performance of 22.5 FPS on the Jetson Nano, providing practical value for rice blast disease detection and rice crop protection.
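The headline reductions quoted above follow directly from the reported figures; a quick sketch that reproduces those percentages (input values copied from the abstract, which rounds the GFLOPs figure slightly differently).

```python
def reduction(before, after):
    """Relative reduction from 'before' to 'after'."""
    return (before - after) / before

print(f"parameters: {reduction(3.01, 0.36):.0%} reduction")   # ~88%
print(f"GFLOPs:     {reduction(8.1, 3.3):.1%} reduction")     # ~59.3% (abstract: 59.2%)
```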
... To explore how current object detection models meet these demands, we begin by examining known architectures that have demonstrated strong performance in real-time tasks. Among the most widely adopted object detection models is You Only Look Once (YOLO), a family of architectures designed for high-speed, single-shot detection [28]. The original YOLO model introduced a grid-based detection framework, dividing an image into an S × S grid and assigning bounding boxes based on object centers. ...
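The excerpt notes that YOLO assigns each object to the grid cell containing its centre. Below is a short sketch of that target-assignment step for an S x S grid, assuming normalised (cx, cy, w, h) boxes and a simplified YOLOv1-like per-cell layout (one object per cell, no anchors).

```python
import numpy as np

def assign_targets(boxes, labels, S, num_classes):
    """Build a YOLOv1-style target grid from ground-truth boxes.

    boxes: (N, 4) array of normalised (cx, cy, w, h) in [0, 1].
    Each object is assigned to the cell containing its centre.
    Layout per cell: (x, y, w, h, objectness, one-hot classes).
    """
    target = np.zeros((S, S, 5 + num_classes), dtype=np.float32)
    for (cx, cy, w, h), cls in zip(boxes, labels):
        col = min(int(cx * S), S - 1)
        row = min(int(cy * S), S - 1)
        target[row, col, 0] = cx * S - col   # offset of the centre inside the cell
        target[row, col, 1] = cy * S - row
        target[row, col, 2:4] = (w, h)
        target[row, col, 4] = 1.0            # objectness
        target[row, col, 5 + cls] = 1.0
    return target

grid = assign_targets(np.array([[0.5, 0.5, 0.2, 0.3]]), [7], S=7, num_classes=20)
print(grid.shape, grid[3, 3, 4])   # the centre cell carries the object
```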
Article
Full-text available
Event-based vision revolutionizes traditional image sensing by capturing asynchronous intensity variations rather than static frames, enabling ultrafast temporal resolution, sparse data encoding, and enhanced motion perception. While this paradigm offers significant advantages, conventional event-based datasets impose a fixed thresholding constraint to determine pixel activations, severely limiting adaptability to real-world environmental fluctuations. Lower thresholds retain finer details but introduce pervasive noise, whereas higher thresholds suppress extraneous activations at the expense of crucial object information. To mitigate these constraints, we introduce the Event-Based Crossing Dataset (EBCD), a comprehensive dataset tailored for pedestrian and vehicle detection in dynamic outdoor environments, incorporating a multi-thresholding framework to refine event representations. By capturing event-based images at ten distinct threshold levels (4, 8, 12, 16, 20, 30, 40, 50, 60, and 75), this dataset facilitates an extensive assessment of object detection performance under varying conditions of sparsity and noise suppression. We benchmark state-of-the-art detection architectures, including YOLOv4, YOLOv7, YOLOv10, EfficientDet-b0, MobileNet-v1, and Histogram of Oriented Gradients (HOG), to examine the nuanced impact of threshold selection on detection performance. By offering a systematic approach to threshold variation, we foresee that EBCD fosters a more adaptive evaluation of event-based object detection, aligning diverse neuromorphic vision with real-world scene dynamics. We present the dataset as publicly available to propel further advancements in low-latency, high-fidelity neuromorphic imaging: https://ieee-dataport.org/documents/event-based-crossing-dataset-ebcd
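EBCD renders event frames at ten contrast thresholds; the sketch below imitates that idea by thresholding per-pixel intensity change between two frames at the same ten levels. It is a simplified stand-in, not the dataset's actual event-generation procedure.

```python
import numpy as np

def event_frames(prev, curr, thresholds=(4, 8, 12, 16, 20, 30, 40, 50, 60, 75)):
    """Threshold per-pixel intensity change at several levels.

    prev, curr: uint8 grayscale frames. Returns a dict mapping each
    threshold to a binary activation map (1 where |change| exceeds it).
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return {t: (diff > t).astype(np.uint8) for t in thresholds}

prev = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
curr = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
frames = event_frames(prev, curr)
for t, mask in frames.items():
    print(f"threshold {t:>2}: {mask.mean():.1%} of pixels active")
```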
... Single-stage detection methods, exemplified by the YOLO algorithm [22], [23], [24], [25], prioritize computational efficiency, making them well-suited for real-time applications. YOLO has evolved from YOLOv2 [23] to YOLOv5 [19], continually optimizing the balance between detection speed and accuracy. ...
Article
Full-text available
Road damage detection is crucial for ensuring road safety and maintaining infrastructure durability, especially as increasing traffic and aging road networks create growing challenges for transportation systems worldwide. This process typically involves detecting irregularities on the road surface, such as cracks and potholes, from images. Current methods, integrating Single Shot Detector with MobileNet and Faster R-CNN, have successfully detected large potholes and long, deep cracks. However, detecting smaller, shallower cracks and minor potholes remains a challenge. Additionally, the appearance of road damage can vary under different weather conditions, making consistent detection difficult. To address these challenges, three modules are proposed. The Weather Trim Augment module leverages stable diffusion to generate weather-specific road damage datasets by adjusting prompt words and parameters, ensuring crack location and type remain unchanged, thus improving detection under various weather conditions. The Flexi Corner Block module enhances road damage detection by combining deformable convolutions with lightweight MLP and Learnable Local Attention, improving feature learning in corner areas. Moreover, the HXIOU loss function uses weighted calculations to effectively mine difficult samples, enhancing detection capabilities for challenging damages such as blurry potholes and fine cracks. Experimental results on the RDD2020 and CNRDD datasets demonstrate improved performance. On one hand, the proposed method achieves 64.9 on Test1 and a 40.6 F1-Score, demonstrating strong detection capabilities. On the other hand, the model exhibits robust generalization under adverse weather conditions, such as rain and snow, further validating its effectiveness in diverse environments.
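The HXIOU loss above mines hard samples through weighted calculations; its exact form is not given in the abstract, so the sketch below shows only a generic hard-example weighting of per-box IoU losses, with an illustrative exponent, to make the general idea concrete.

```python
import numpy as np

def weighted_box_loss(ious, gamma=2.0):
    """Generic hard-example weighting of per-box regression losses.

    Boxes with low IoU against their targets (hard samples) receive larger
    weights, in the spirit of, but not identical to, the HXIOU idea.
    """
    ious = np.clip(np.asarray(ious, dtype=np.float64), 0.0, 1.0)
    per_box_loss = 1.0 - ious                 # plain IoU loss per box
    weights = per_box_loss ** gamma           # emphasise hard samples
    return float((weights * per_box_loss).sum() / max(weights.sum(), 1e-9))

print(weighted_box_loss([0.9, 0.7, 0.2]))   # dominated by the hard box with IoU 0.2
```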
... Advanced algorithms used in this field make it possible to identify and analyze these elements by utilizing machine learning and deep learning techniques (Smith & Brown, 2022;Johnson & Davis, 2021;Lee & Kim, 2020;Szegedy et al., 2014;He, Zhang, Ren, & Sun, 2015;Redmon, Divvala, Girshick, & Farhadi, 2015;Bishop, 2016;Tan, Yüksel, Aydemir, & Ersoy, 2021). ...
Conference Paper
Full-text available
Image recognition is a technology that detects and interprets objects, faces, places, or other distinct elements in digital photographs or videos. Image processing methods are commonly employed in this field, and in recent years, artificial intelligence and deep learning approaches have significantly enhanced this process. Today, image recognition technology is not limited to technology companies; it is actively used in various sectors, including health services, agriculture, forestry, mapping, and even more technical fields, such as maritime and coastal engineering. When it works in conjunction with remote sensing systems, it yields highly efficient results, particularly in the analysis of large geographical areas. In this study, a classification model was developed using forest and lake-themed satellite images from the Satellite Image Classification dataset. MATLAB software and its Deep Network Designer tool were used to create the model. With the prepared deep learning-based artificial neural network model, the images were successfully identified and classified. As a result of the tests, it was seen that the model provided approximately 95% accuracy. Such studies not only contribute to image processing and recognition but also provide an inspiring basis for remote sensing, environmental analysis, and artificial intelligence-based research. They are highly valuable in preparing the ground for future developments in this field.
... Due to their capacity to automatically identify relevant characteristics from the input data, CNNs have demonstrated significant success in this field. There are many CNN-based architectures specifically designed for object detection, such as You Only Look Once (YOLO) [16], the Single Shot Multibox Detector (SSD) [17], Faster R-CNN [18], Fast R-CNN [19], region-based CNNs (R-CNN) [19], and the more popular convolutional neural network models [20,21]. Convolutional neural networks have excelled in a variety of computer vision tasks, including semantic segmentation, object identification and instance segmentation, demonstrating their high effectiveness in this area. ...
Article
Full-text available
In computer vision tasks, object detection is the most significant challenge. Numerous studies on convolutional neural network based techniques have been extensively utilized in computer vision for the detection of objects. Scale variation, illumination variation, and occlusion are the most common challenges in crowd counting. To address this, a MaskFormer EfficientNetB7 instance segmentation architecture has been proposed that utilizes EfficientNetB7 as the backbone for feature extraction to achieve efficient and accurate counting of people in challenging scenarios. A basic mask classification model called MaskFormer has been used to predict a series of binary masks, each of which is connected to a single global class for label prediction, and EfficientNetB7 uses a compound scaling algorithm that scales each dimension uniformly with a predetermined set of scaling coefficients. Experimental findings on the UCF-QNRF, ShanghaiTech (Part A and Part B) and Mall datasets have demonstrated that the suggested strategy provides remarkable outcomes in contrast to existing crowd counting approaches in terms of Mean Absolute Error and Root Mean Squared Error. Therefore, the proposed crowd-counting model has been proven to be more adaptable to different environments and scenarios, along with strong generalizability on unseen data.
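Crowd-counting performance above is reported as Mean Absolute Error and Root Mean Squared Error over per-image counts; a small sketch of the two metrics, using made-up counts rather than the study's data.

```python
import numpy as np

def mae_rmse(pred_counts, true_counts):
    """Mean Absolute Error and Root Mean Squared Error over test images."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    true = np.asarray(true_counts, dtype=np.float64)
    err = pred - true
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())

mae, rmse = mae_rmse([102, 48, 310], [95, 50, 330])   # illustrative counts
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```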
... Single-stage detection algorithms achieve high detection speed by directly predicting and recognizing targets, but their detection accuracy is relatively low. This type of algorithm is represented by YOLOv1, proposed by Redmon et al. [20] in 2015. Although its detection speed is fast, there are obvious shortcomings in localization accuracy. ...
Article
Full-text available
Recently, the application of target detection in urban transportation has become increasingly widespread. However, in foggy environments, due to low visibility, targets such as pedestrians and vehicles are not obvious, which easily leads to low detection accuracy and poor robustness. We investigate detection methods in foggy environments, using DehazeFormer and YOLOv8 as benchmark algorithms. To address the issue of ambiguous target features in foggy weather, we introduce an innovative spatial pyramid pooling structure, SimSPPFCSPC, to accelerate network convergence and boost the accuracy and efficiency of target detection. To address the issue of detail loss and contextual information loss in foggy targets, we suggest substituting the backbone feature extraction network with the EfficientViT network, implementing a lightweight multi-scale linear attention mechanism to augment the model's resilience and enhance its detection capability. To address the issue of target size variation in foggy weather, we propose an IoU-based dynamic adjustment gradient distribution strategy to refine the loss function, bolstering the model's generalization capability. Experimental results indicate that the proposed method not only surpasses the baseline in performance, but also occupies an advantageous position among similar target detection methods, demonstrating excellent detection performance. Specifically, our method achieves Precision, Recall, and mAP of 62.7%, 37.0%, and 42.1% on Foggy Cityscapes, respectively. In addition, it attains Precision, Recall, and mAP of 81.8%, 74.8%, and 82.2% on PASCAL VOC, respectively. The related code of our method is available at https://github.com/chuanchuan0423/YOLO-SEW.
... Conversely, deep learning techniques are capable of extracting specific features of targets in complex images, and their application in agriculture has yielded favorable results [4]. In the context of object detection, the prevailing detection models can be classified into the following categories: The R-CNN [5,6] series, the SSD [7], the YOLO [8][9][10] series, and so forth. The specific optimization focus of each model varies. ...
Article
Full-text available
The current level of mechanization in the eggplant planting industry is low, with most planting processes still carried out manually. In the context of using agricultural machinery for automated operations, such as automatic pesticide spraying, the real-time detection of eggplant seedling centers is a crucial step. An EggYOLOPlant real-time detection model is designed to address the challenges of detecting seedling centers posed by the diverse appearance features of the eggplant in complex planting environments, where there are different lighting conditions on sunny days, cloudy days, and evenings, as well as the presence of many weeds. We constructed a small dataset using real-world images of eggplant seedlings captured in their natural environment, which was used to train and evaluate EggYOLOPlant as well as other object detection models. The EggYOLOPlant model is based on an enhanced YOLOv8 architectural framework. The original backbone was replaced with an Mblock-based architecture, which originates from MobileNetV3's feature extraction module. Additionally, we replaced the C2f module in the neck layer with the C2f-faster module. The Grad-CAM visualization of the backbone layer outputs revealed that EggYOLOPlant's feature activations were more focused on target regions than those of YOLOv8, demonstrating superior background suppression. The experimental results demonstrate that EggYOLOPlant achieves 94.7% precision (P) and 95.9% mAP50 on the test set, representing a +4.0% (P) and +3.8% (mAP50) improvement compared to the baseline YOLOv8 model (90.7% P, 92.1% mAP50). Additionally, the number of parameters has been reduced from 2.7 to 1.9 M (a 30% decrease), while the FPS has increased from 267 to 294, achieving a 10.1% improvement in speed. In comparison to Faster-RCNN, YOLOv5s, RT-DETR, and YOLO11s, the mAP50 of this model demonstrates an improvement of 3.2%, 1.8%, 3.5% and 1.5%, respectively. Moreover, the detection speed exhibits a notable enhancement, reaching approximately 24 times, 1.59 times, 1.95 times, and 1.61 times that of the aforementioned models, respectively. Furthermore, we deployed EggYOLOPlant on the Honor X10 smartphone using the NCNN framework, achieving an average runtime speed of 25.2 FPS, thereby preliminarily validating the model's real-time performance on mobile devices.
... To achieve an automatic and accurate evaluation of the root canal treatment, we proposed the use of the single-stage target detection model, YOLO [17]. YOLOv5 is a single-stage target detection model that enables the simultaneous localization and classification of target regions. ...
Article
Full-text available
Background Deep-learning networks are promising techniques in dentistry. This study developed and validated a deep-learning network, You Only Look Once (YOLO) v5, for the automatic evaluation of root-canal filling quality on periapical radiographs. Methods YOLOv5 was developed using 1,008 periapical radiographs (training set: 806, validation set: 101, testing set: 101) from one center and validated on an external data set of 500 periapical radiographs from another center. We compared the network's performance with that of an inexperienced endodontist in terms of recall, precision, F1 scores, and Kappa values, using the results from specialists as the gold standard. We also compared the evaluation durations between the manual method and the network. Results On the external test data set, the YOLOv5 network performed better than the inexperienced endodontist in terms of overall comprehensive performance. The F1 index values of the network for correct and incorrect filling were 92.05% and 82.93%, respectively. The network outperformed the inexperienced endodontist in all tooth regions, especially in the more difficult-to-assess upper molar regions. Notably, the YOLOv5 network evaluated images 150–220 times faster than manual evaluation. Conclusions The YOLOv5 deep learning network provided clinicians with a new, relatively accurate and efficient auxiliary tool for assessing the radiological quality of root canal fillings, enhancing work efficiency with large sample sizes. However, its use should be complemented by clinical expertise for accurate evaluations.
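The comparison above is reported in recall, precision, F1 scores, and Kappa values against specialist labels; a short sketch of those metrics from a binary confusion table, using illustrative counts rather than the study's data.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def cohens_kappa(a, b, c, d):
    """Two-rater agreement from a 2x2 table [[a, b], [c, d]] (rows: rater 1)."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (po - pe) / (1 - pe)

print(prf1(tp=85, fp=10, fn=5))        # illustrative counts, not the paper's
print(cohens_kappa(85, 10, 5, 400))    # strong agreement on this toy table
```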
... Deep learning image object detection methods rely solely on spatial image information to extract features and detect regions of objects in the image. You Only Look Once (YOLO) (Redmon et al., 2016) is a one-stage object detector and one of the fastest object detectors, which is important for processing millions of images or deploying on edge computers. In our work, YOLOv5 (Glenn Jocher, 2020) with CSPDarknet53 as the backbone was evaluated. ...
Article
Full-text available
Insects represent nearly half of all known multicellular species, but knowledge about them lags behind for most vertebrate species. In part for this reason, they are often neglected in biodiversity conservation policies and practice. Computer vision tools, such as insect camera traps, for automated monitoring have the potential to revolutionize insect study and conservation. To further advance insect camera trapping and the analysis of their image data, effective image processing pipelines are needed. In this paper, we present a flexible and fast processing pipeline designed to analyse these recordings by detecting, tracking and classifying nocturnal insects in a broad taxonomy of 15 insect classes and resolution of individual moth species. A classifier with anomaly detection is proposed to filter dark, blurred or partially visible insects that will be uncertain to classify correctly. A simple track‐by‐detection algorithm is proposed to track classified insects by incorporating feature embeddings, distance and area cost. We evaluated the computational speed and power performance of different edge computing devices (Raspberry Pi's and NVIDIA Jetson Nano) and compared various time‐lapse (TL) strategies with tracking. The minimum difference of detections was found for 2‐min TL intervals compared to tracking with 0.5 frames per second; however, for insects with fewer than one detection per night, the Pearson correlation decreases. Shifting from tracking to TL monitoring would reduce the number of recorded images and would allow for edge processing of images in real‐time on a camera trap with Raspberry Pi. The Jetson Nano is the most energy‐efficient solution, capable of real‐time tracking at nearly 0.5 fps. Our processing pipeline was applied to more than 5.7 million images recorded at 0.5 frames per second from 12 light camera traps during two full seasons located in diverse habitats, including bogs, heaths and forests. Our results thus show the scalability of insect camera traps.
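The track-by-detection step above associates classified insects across frames using feature embeddings, distance, and area cost. Below is a compact sketch of such an association step with illustrative weights and a Hungarian assignment via SciPy; it is not the paper's exact cost design.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, w_emb=0.5, w_dist=0.3, w_area=0.2):
    """Associate tracks with detections via a combined cost matrix.

    Each track/detection is a dict with 'center' (x, y), 'area', and a
    unit-norm 'embedding'. The weights are illustrative, not the paper's.
    """
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            emb = 1.0 - float(np.dot(t["embedding"], d["embedding"]))
            dist = float(np.linalg.norm(np.subtract(t["center"], d["center"]))) / 100.0
            area = abs(t["area"] - d["area"]) / max(t["area"], d["area"])
            cost[i, j] = w_emb * emb + w_dist * dist + w_area * area
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
tracks = [{"center": (10, 10), "area": 400, "embedding": e1},
          {"center": (200, 150), "area": 900, "embedding": e2}]
detections = [{"center": (205, 148), "area": 880, "embedding": e2},
              {"center": (12, 11), "area": 420, "embedding": e1}]
print(associate(tracks, detections))   # -> [(0, 1), (1, 0)]
```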
... To achieve this, computer vision techniques were employed to effectively analyze visual data and extract meaningful information [31]. The You Only Look Once (YOLO) model, first introduced in 2016, revolutionized object detection by enabling real-time processing with a single network pass, eliminating the need for sliding window or two-stage detection approaches [32]. YOLOv8, released by Ultralytics in 2023, was employed in this study for instance segmentation due to its superior speed and accuracy [33]. ...
Article
Full-text available
Segregation in self-consolidating concrete (SCC) can significantly impact the quality and structural integrity of concrete applications. Traditional methods for assessing segregation, such as the visual stability index and column segregation tests, often involve manual intervention, introducing subjectivity and delaying the assessment process. This study proposes a novel image-based approach using deep learning, specifically the YOLOv8 segmentation model, to quantify and assess segregation in fresh SCC mixes. Utilizing high-resolution images from slump flow tests, the model identifies critical indicators of segregation, including the mortar halo and aggregate pile. These features are evaluated with two newly introduced quantitative metrics: the mortar halo index (Imh) and the aggregate pile index (Iap). Experimental validation demonstrates high model precision (96.4%) and recall (85.6%), establishing it as a robust tool for on-site quality control. Furthermore, the study examines the relationship between segregation levels and compressive strength, revealing a strong correlation between increased segregation and reduced strength. The proposed feedback-based optimization strategy for mix proportions enables real-time adjustments to mitigate segregation risks. This approach enhances the objectivity and efficiency of segregation assessments, facilitating improved mix design and overall concrete performance on construction sites.
... YOLO (You Only Look Once) is an image-segmentation and object-detection model that uses deep learning and computer vision and has been widely used due to its great performance in terms of speed and accuracy. It uses a simple neural network to find objects and each predicted object is enclosed in a bounding box called a region of interest or ROI [18,19]. Since YOLO11n is a pre-trained version, it has the ability to detect different types of objects. ...
Article
Full-text available
Recently, video surveillance systems have evolved from expensive, human-operated monitoring systems that were only useful after the crime was committed to systems that monitor 24/7, in real time, and with less and less human involvement. This is partly due to the use of smart cameras, the improvement of the Internet, and AI-based algorithms that allow the classifying and tracking of objects in images and in some cases identifying them as threats. Threats are often associated with abnormal or unexpected situations such as the presence of unauthorized persons in a given place or time, the manifestation of a different behavior by one or more persons compared to the behavior of the majority, or simply an unexpected number of people in the place, which depends largely on the available information of their context, i.e., place, date, and time of capture. In this work, we propose a model to automatically contextualize video capture scenarios, generating data such as location, date, time, and flow of people in the scene. A strategy to measure the accuracy of the data generated for such contextualization is also proposed. The pre-trained YOLO11n algorithm and the Bot-SORT algorithm gave the best results in person detection and tracking, respectively.
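The pipeline above couples a pretrained YOLO11n detector with the BoT-SORT tracker. A minimal usage sketch with the Ultralytics API is given below; it assumes the library's published default weight and tracker-config names ("yolo11n.pt", "botsort.yaml") and a hypothetical video file, so treat it as an illustration rather than the authors' code.

```python
# pip install ultralytics  -- a usage sketch; API details can differ across versions
from ultralytics import YOLO

model = YOLO("yolo11n.pt")               # pretrained nano detection weights

# Track people (COCO class 0) in a clip with the BoT-SORT tracker configuration.
# "crosswalk.mp4" is a hypothetical input file.
results = model.track(source="crosswalk.mp4", tracker="botsort.yaml", classes=[0])

# Rough "flow of people": count the distinct track IDs seen across the clip.
track_ids = set()
for frame_result in results:
    if frame_result.boxes.id is not None:
        track_ids.update(int(i) for i in frame_result.boxes.id.tolist())
print(f"{len(track_ids)} unique people tracked")
```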
Article
In the actual industrial production environment, the surface defects of products are subtle, and the number of different types of defect data samples is also quite small. Most deep learning models rely on a large number of training samples and parameters to achieve high-precision defect detection. At the same time, the edge computing layer in the actual industrial environment may also encounter transmission delays and insufficient resources. Training a proper model for a specific type of surface defect while simultaneously satisfying the real-time accuracy of defect detection is still a challenging task. To effectively deal with the above challenges, we propose an edge-cloud computing defect detection model based on the intrinsic mean feature detector in the Lie Group space. The modules in the model adopt a symmetrical structure, which can extract related features more effectively. Different from existing models, this model utilizes the Lie Group space intrinsic mean feature as a metric to characterize the essential attributes of different types of surface defects. In addition, we propose an intrinsic mean attention mechanism in the Lie Group manifold space that is easy to implement at the edge service layer without increasing the number of model parameters, thereby enhancing the detection performance of tiny surface defects. Extensive experiments on three publicly available and challenging datasets reveal the superiority of our model in terms of detection accuracy, real-time detection, number of parameters, and computational performance. In addition, our proposed model also shows competitiveness and advantages compared with state-of-the-art models.
Article
Deep-sea biological detection is a pivotal technology for the exploration and conservation of marine resources. Nonetheless, the inherent complexities of the deep-sea environment, the scarcity of available deep-sea organism samples, and the significant refraction and scattering effects of underwater light collectively impose formidable challenges on the current detection algorithms. To address these issues, we propose an advanced deep-sea biometric identification framework based on an enhanced YOLOv8n architecture, termed PSVG-YOLOv8n. Specifically, our model integrates a highly efficient Partial Spatial Attention module immediately preceding the SPPF layer in the backbone, thereby facilitating the refined, localized feature extraction of deep-sea organisms. In the neck network, a Slim-Neck module (GSconv + VoVGSCSP) is incorporated to reduce the parameter count and model size while simultaneously augmenting the detection performance. Moreover, the introduction of a squeeze–excitation residual module (C2f_SENetV2), which leverages a multi-branch fully connected layer, further bolsters the network’s global representational capacity. Finally, an improved detection head synergistically fuses all the modules, yielding substantial enhancements in the overall accuracy. Experiments conducted on a dataset of deep-sea images acquired by the Jiaolong manned submersible indicate that the proposed PSVG-YOLOv8n model achieved a precision of 79.9%, an mAP50 of 67.2%, and an mAP50-95 of 50.9%. These performance metrics represent improvements of 1.2%, 2.3%, and 1.1%, respectively, over the baseline YOLOv8n model. The observed enhancements underscore the effectiveness of the proposed modifications in addressing the challenges associated with deep-sea organism detection, thereby providing a robust framework for accurate deep-sea biological identification.
Article
Electric power operation violation recognition (EPOVR) is essential for personnel safety, achieved by detecting key objects in electric power operation scenarios. Recent methods usually use the YOLOv8 model to achieve EPOVR; however, the YOLOv8 model still has four problems that need to be addressed. Firstly, the capability for feature representation of irregularly shaped objects is not strong enough. Secondly, the capability for feature representation is not strong enough to precisely detect multi-scale objects. Thirdly, the localization accuracy is not ideal. Fourthly, many violation categories in electric power operation cannot be covered by the existing datasets. To address the first problem, a deformable C2f (DC2f) module is proposed, which contains deformable convolutions and depthwise separable convolutions. For the second problem, an adaptive multi-scale feature enhancement (AMFE) module is proposed, which integrates multi-scale depthwise separable convolutions, adaptive convolutions, and a channel attention mechanism to optimize multi-scale feature representation while minimizing the number of parameters. For the third problem, an optimized complete intersection over union (OCIoU) loss is proposed for bounding box localization. Finally, a novel dataset named EPOVR-v1.0 is proposed to evaluate the performance of the object detection model applied in EPOVR. Ablation studies validate the effectiveness of the DC2f module, AMFE module, OCIoU loss, and their combinations. Compared with the baseline YOLOv8 model, the mAP@0.5 and mAP@0.5–0.95 are improved by 3.2% and 4.4%, while SDAP@0.5 and SDAP@0.5–0.95 are reduced by 0.34 and 0.019, respectively. Furthermore, the number of parameters and GFLOPS are shown to have slightly decreased. Comparison with seven YOLO models shows that our DAO-YOLO model achieves the highest detection accuracy while achieving real-time object detection for EPOVR.
Article
The detection of cannabis and cannabis‐related products is a critical task for forensic laboratories and law enforcement agencies, given their harmful effects. Forensic laboratories analyze large quantities of plant material annually to identify genuine cannabis and its illicit substitutes. Ensuring accurate identification is essential for supporting judicial proceedings and combating drug‐related crimes. The naked eye alone cannot distinguish between genuine cannabis and non‐cannabis plant material that has been sprayed with synthetic cannabinoids, especially after distribution into the market. Reliable forensic identification typically requires two colorimetric tests (Duquenois‐Levine and Fast Blue BB), as well as a drug laboratory expert test for affirmation or negation of cannabis hair (non‐glandular trichomes), making the process time‐consuming and resource‐intensive. Here, we propose a novel deep learning‐based computer vision method for identifying non‐glandular trichome hairs in cannabis. A dataset of several thousand annotated microscope images was collected, including genuine cannabis and non‐cannabis plant material apparently sprayed with synthetic cannabinoids. Ground‐truth labels were established using three forensic tests, two chemical assays, and expert microscopic analysis, ensuring reliable classification. The proposed method demonstrated an accuracy exceeding 97% in distinguishing cannabis from non‐cannabis plant material. These results suggest that deep learning can reliably identify non‐glandular trichome hairs in cannabis based on microscopic trichome features, potentially reducing reliance on costly and time‐consuming expert microscopic analysis. This framework provides forensic departments and law enforcement agencies with an efficient and accurate tool for identifying non‐glandular trichome hairs in cannabis, supporting efforts to combat illicit drug trafficking.
Article
Full-text available
In institutions such as universities, corporate offices, and restricted-access areas, enforcing ID card compliance is critical for ensuring security, tracking attendance, and maintaining discipline. Manual enforcement is often inefficient and prone to oversight. To address this, we propose an automated ID Card Detection and Penalty Mechanism that leverages deep learning models for object detection and facial recognition. The system utilizes YOLOv5 for real-time identification of ID cards worn by individuals in front of a camera. If the system fails to detect an ID card, it automatically initiates a secondary process that uses facial recognition to identify the person, predicts their roll number, and triggers an alert mechanism. This includes sending an automated email notification to a predefined recipient, reporting the incident along with the identified individual's details. The system is trained specifically on a dataset comprising known faces and ID card positions to ensure high accuracy in controlled environments. It includes a user-friendly interface where users can start the camera, initiate detection, and send email notifications directly through the GUI. The model is effective both in detecting the presence of ID cards and in handling non-compliance scenarios by linking the individual's identity to the infraction. Experimental evaluations show that the system performs reliably across various lighting conditions and backgrounds, with minimal false detections. The proposed solution offers a scalable and efficient method to automate ID enforcement, enhance security monitoring, and reduce dependency on manual supervision.
INTRODUCTION
In today's technologically advanced world, automated surveillance and identity verification systems have become increasingly important across various sectors, including educational institutions, corporate offices, research labs, and secure government facilities. One fundamental component of such security frameworks is the enforcement of visible identification cards (ID cards) worn by employees, students, or visitors. ID cards serve not only as authentication tools but also as key enablers for access control, attendance monitoring, and accountability. However, ensuring consistent compliance with ID-wearing policies remains a challenge when done manually. Relying on security personnel or administrative staff to monitor ID card usage is time-consuming, resource-intensive, and susceptible to human error. To address this issue, there is a growing need for automated systems that can detect whether individuals are wearing their ID cards and take corrective actions if non-compliance is observed. In this context, computer vision and deep learning techniques offer powerful tools for real-time monitoring and decision-making. Object detection models such as YOLO (You Only Look Once), combined with face recognition and identity prediction algorithms, enable systems to detect ID cards, recognize faces, and link individuals to a known database. These technologies allow institutions to build intelligent surveillance systems that can proactively enforce policies without requiring continuous human intervention. This paper presents an integrated ID Card Detection and Penalty Mechanism system that automates the process of identifying individuals who are not wearing their ID cards and subsequently triggering a disciplinary or notification process.
The system uses the YOLOv5 object detection model to identify the presence or absence of an ID card in live camera feeds. If no card is detected, the system uses facial recognition to predict the identity or roll number of the person. Once the individual is identified, the system allows an administrator or supervisor to send a warning message to a designated email address directly from the application interface. The proposed system is particularly useful on educational campuses where students are required to wear ID cards as part of institutional discipline. In such environments, the model can be trained on a dataset containing students' facial images and sample ID card images. The system interface includes real-time camera access, detection initiation, identity display, and email alert generation, making it a complete solution for daily compliance enforcement. Additionally, the model is designed to be lightweight, fast, and easy to deploy on any machine with a webcam. It achieves high detection accuracy under various lighting conditions and works effectively in real time, thus meeting the practical requirements of a surveillance-grade system. In summary, this research contributes an end-to-end automated framework that enforces ID-wearing compliance using deep learning. By eliminating manual checking and incorporating intelligent alert mechanisms, the system significantly enhances institutional security, operational efficiency, and rule enforcement.
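The detection-then-fallback flow described above (YOLOv5 ID-card check, face-recognition lookup, e-mail alert) reduces to a small decision routine. In the sketch below the detector and recognizer are hypothetical placeholder callables and the SMTP settings are illustrative, so this is a shape of the logic rather than the authors' implementation.

```python
import smtplib
from email.message import EmailMessage

def notify(recipient, roll_number, smtp_host="localhost"):
    """Send a plain-text non-compliance alert (illustrative SMTP settings)."""
    msg = EmailMessage()
    msg["Subject"] = f"ID card not detected for {roll_number}"
    msg["From"] = "idcard-monitor@example.edu"     # hypothetical sender address
    msg["To"] = recipient
    msg.set_content(f"Student {roll_number} was observed without a visible ID card.")
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)

def enforce(frame, detect_id_card, recognize_face, recipient):
    """Hypothetical glue logic: alert only when no ID card is detected."""
    if detect_id_card(frame):            # e.g. a YOLOv5 'id_card' detection
        return "compliant"
    roll_number = recognize_face(frame)  # e.g. a face-embedding lookup
    if roll_number is not None:
        notify(recipient, roll_number)
        return f"alert sent for {roll_number}"
    return "person not identified"

# Dry-run demo with stub callables (no detection model or mail server involved)
print(enforce(frame=None,
              detect_id_card=lambda f: False,
              recognize_face=lambda f: None,
              recipient="warden@example.edu"))    # -> "person not identified"
```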
Conference Paper
Full-text available
Security and surveillance are a major concern in today's world, and thus the need for the creation of intelligent and autonomous monitoring systems. This paper discusses a new security and surveillance system that utilizes real-time face tracking and deep learning to improve monitoring efficiency and reliability. The system consists of a high-resolution webcam on a robot arm driven by MG995 servo motors and an Arduino microcontroller. Face detection is achieved with a Convolutional Neural Network (CNN)-based model to ensure high accuracy detection even at low illumination. A real-time object tracking algorithm is used to keep the focus of the camera on detected faces, enabling the camera to change its position dynamically and ensure constant monitoring. Addition of a PID-based servo control algorithm provides maximum movement accuracy with reduced lag and improved response time. The system is made such that it requires minimal human intervention, and hence it is a viable solution for automated security solutions. Performance tests show that the proposed system provides an average accuracy of 92.5% in face tracking and response latency of around 100ms. Power efficiency and Hardware-software optimization also contribute to the success of the system in real-world applications. Multi-person tracking, integration with IoT-based security networks, and AI-based decision-making capabilities are future areas of exploration. This research is a showcase of how robotics and artificial intelligence can be combined to create intelligent security solutions with reliability and automation.
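The tracking loop above pairs a CNN face detector with PID-controlled servos. The sketch below is a generic discrete PID update for a single pan axis, with made-up gains and simulated detections rather than the paper's tuning or hardware interface.

```python
class PID:
    """Discrete PID controller for one servo axis (illustrative gains)."""

    def __init__(self, kp=0.02, ki=0.001, kd=0.002):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Keep a detected face centred: error = face centre x - frame centre x (pixels).
# The sign convention (adding the output to the angle) depends on camera mounting.
pan = PID()
angle = 90.0                                   # current pan-servo angle, degrees
for face_x in (400, 380, 355, 330, 322):       # simulated detections, frame width 640
    error = face_x - 320
    angle += pan.update(error, dt=0.1)
    angle = max(0.0, min(180.0, angle))        # clamp to the servo's range
    print(f"face_x={face_x}  ->  servo angle {angle:.1f} deg")
```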
Article
The rapid development of social media has driven the need for opinion mining and sentiment analysis based on multimodal samples. As a fine-grained task within multimodal sentiment analysis, aspect-based multimodal sentiment analysis (ABMSA) enables the accurate and efficient determination of sentiment polarity for aspect-level targets. However, traditional ABMSA methods often perform suboptimally on social media samples, as the images in these samples typically contain embedded text that conventional models overlook. Such text influences sentiment judgment. To address this issue, we propose a text-in-image enhanced self-supervised alignment model (TESAM) that accounts for multimodal information more comprehensively. Specifically, we employed Optical Character Recognition technology to extract embedded text from images and, based on the principle that text-in-image is an integral part of the visual modality, fused it with visual features to obtain more comprehensive image representations. Additionally, we incorporate aspect words to guide the model in disregarding irrelevant semantic features, thereby reducing noise interference. Furthermore, to mitigate the semantic gap between modalities, we propose pre-training the feature extraction module with self-supervised alignment. During this pre-training stage, unimodal semantic embeddings from both modalities are aligned by calculating errors using Euclidean distance and cosine similarity. Experimental results demonstrate that TESAM achieved remarkable performances on three ABMSA benchmarks. These results validate the rationale and effectiveness of our proposed improvements.
Article
Particleboard is an important forest product that can be reprocessed using wood processing by-products. This approach has the potential to achieve significant conservation of forest resources and contribute to the protection of forest ecology. Most current detection models require a significant number of tagged samples for training. However, with the advancement of industrial technology, the prevalence of surface defects in particleboard is decreasing, making the acquisition of sample data difficult and significantly limiting the effectiveness of model training. Deep reinforcement learning-based detection methods have been shown to exhibit strong generalization ability and sample utilization efficiency when the number of samples is limited. This paper focuses on the potential application of deep reinforcement learning in particleboard defect detection and proposes a novel detection method, PPOBoardNet, for the identification of five typical defects: dust spot, glue spot, scratch, sand leak and indentation. The proposed method is based on the proximal policy optimization (PPO) algorithm of the Actor-Critic framework, and defect detection is achieved by performing a series of scaling and translation operations on the mask. The method integrates the variable action space and the composite reward function and achieves the balanced optimization of different types of defect detection performance by adjusting the scaling and translation amplitude of the detection region. In addition, this paper proposes a state characterization strategy of multi-scale feature fusion, which integrates global features, local features and historical action sequences of the defect image and provides reliable guidance for action selection. On the particleboard defect dataset with limited images, PPOBoardNet achieves a mean average precision (mAP) of 79.0%, representing a 5.3% performance improvement over the YOLO series of optimal detection models. This result provides a novel technical approach to the challenge of defect detection with limited samples in the particleboard domain, with significant practical application value.
Article
This paper provides a comprehensive study of the security of YOLO (You Only Look Once) model series for object detection, emphasizing their evolution, technical innovations, and performance across the COCO dataset. The robustness of YOLO models under adversarial attacks and image corruption, offering insights into their resilience and adaptability, is analyzed in depth. As real-time object detection plays an increasingly vital role in applications such as autonomous driving, security, and surveillance, this review aims to clarify the strengths and limitations of each YOLO iteration, serving as a valuable resource for researchers and practitioners aiming to optimize model selection and deployment in dynamic, real-world environments. The results reveal that YOLOX models, particularly their large variants, exhibit superior robustness compared to other YOLO versions, maintaining higher accuracy under challenging conditions. Our findings serve as a valuable resource for researchers and practitioners aiming to optimize YOLO models for dynamic and adversarial real-world environments while guiding future research toward developing more resilient object detection systems.
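The robustness analysis above subjects YOLO models to adversarial attacks and image corruption. As a generic illustration of the simplest such attack, here is a one-step FGSM sketch in PyTorch for an arbitrary differentiable model and loss; the stand-in linear model and epsilon are illustrative, and this is not the paper's evaluation protocol.

```python
import torch

def fgsm_perturb(model, images, loss_fn, targets, epsilon=4 / 255):
    """One-step FGSM: perturb images along the sign of the input gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), targets)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()

# Toy demo with a stand-in model and random targets (illustrative only)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
images = torch.rand(4, 3, 32, 32)
targets = torch.randint(0, 10, (4,))
adv = fgsm_perturb(model, images, torch.nn.functional.cross_entropy, targets)
print((adv - images).abs().max())   # perturbation magnitude stays within epsilon
```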
Article
Full-text available
The proliferation of Arduino has led to numerous low-cost replicas, complicating defect detection due to style variability. Existing detectors struggle to generalize with synthetic data. To address this, we introduce Context-Guided Triplet Attention YOLO-Faster (CGTA-YOLO-F), a real-time model that enhances feature extraction through CGTA blocks, along with a novel C2f-FCGA block (Faster Context Guidance with simplified Attention) for enhancing multi-scale feature fusion. Trained on synthesized data and tested on real data, the method achieves 97.4% mean average precision (mAP) for component detection, outperforming YOLOv8 and YOLOv10 by 3% and 3.4%. It also achieves 91.4% accuracy for misalignment classification, 7.1% higher than the baseline. The model performs well on two additional datasets and integrates detection and classification into a unified framework. It is efficient in speed and memory, making it practical for industrial defect detection tasks.
Article
As a key global food reserve, rice disease detection technology plays an important role in promoting food production, protecting ecological balance and supporting sustainable agricultural development. However, existing rice disease identification techniques face many challenges, such as low training efficiency, insufficient model accuracy, incompatibility with mobile devices, and the need for a large number of training datasets. This study aims to develop a rice disease detection model that is highly accurate, resource efficient, and suitable for mobile deployment to address the limitations of existing technologies. We propose the Transfer Layer iRMB-YOLOv8 (TLI-YOLO) model, which modifies some components of the YOLOv8 network structure based on transfer learning. The innovation of this method is mainly reflected in four key components. First, transfer learning is used to import the pretrained model weights into the TLI-YOLO model, which significantly reduces the dataset requirements and accelerates model convergence. Secondly, it innovatively integrates a new small object detection layer into the feature fusion layer, which enhances the detection ability by combining shallow and deep feature maps so as to learn small object features more effectively. Third, this study is the first to introduce the iRMB attention mechanism, which effectively integrates Inverted Residual Blocks and Transformers, and introduces deep separable convolution to maintain the spatial integrity of features, thus improving the efficiency of computational resources on mobile platforms. Finally, this study adopted the WIoUv3 loss function and added a dynamic non-monotonic aggregation mechanism to the standard IoU calculation to more accurately evaluate and penalize the difference between the predicted and actual bounding boxes, thus improving the robustness and generalization ability of the model. The final test shows that the TLI-YOLO model achieved 93.1% precision, 88% recall, 95% mAP, and a 90.48% F1 score on the custom dataset, with only 12.60 GFLOPS of computation. Compared with YOLOv8n, the precision improved by 7.8%, the recall rate improved by 7.2%, and mAP@.5 improved by 7.6%. In addition, the model demonstrated real-time detection capability on an Android device and achieved efficiency of 30 FPS, which meets the needs of on-site diagnosis. This approach provides important support for rice disease monitoring.
Article
The increasing utilization of surveillance cameras in smart cities, coupled with the surge of online video applications, has heightened concerns regarding public security and privacy protection, propelling automated Video Anomaly Detection (VAD) into a fundamental research task within the Artificial Intelligence (AI) community. With advances in deep learning and edge computing, VAD has made significant progress in synergy with emerging applications in smart cities and the video internet, moving beyond the conventional scope of algorithm engineering toward deployable Networking Systems for VAD (NSVAD), a practical hotspot at the intersection of the AI, IoVT, and computing fields. In this article, we delineate the foundational assumptions, learning frameworks, and applicable scenarios of various deep learning-driven VAD routes, offering an exhaustive tutorial for novices in NSVAD. In addition, this article elucidates core concepts by reviewing recent advances and typical solutions and by aggregating available research resources, accessible at https://github.com/fdjingliu/NSVAD. Lastly, this article projects future development trends and discusses how the integration of AI and computing technologies can address existing research challenges and open up new opportunities, serving as an insightful guide for prospective researchers and engineers.
Article
Traditional steel strip defect detection methods face challenges such as low accuracy and slow speed, failing to meet the high standards of modern industry. To handle these issues, this study proposes SDF-YOLO, an efficient and accurate defect detection method based on the YOLOv8 algorithm, designed to enhance detection performance and optimize production processes. Integrating Shuffle Attention into the Spatial Pyramid Pooling network improves contextual understanding; combining Deformable Convolution (DCNv2) with the C2f module enhances the detection of overlapping defects; and replacing CIoU with Focal-SIoU increases the precision of bounding box shape and localization. Experiments on the NEU-DET dataset demonstrate that SDF-YOLO achieves an accuracy of 76.8%, an mAP50 of 76.9%, and an mAP50-95 of 45%, outperforming YOLOv8n by 7.9%, 2.4%, and 1.4%, respectively. Additionally, applying it to the GC-DET dataset yields a 2.3% improvement in mAP50, confirming its stronger generalization and significantly improved detection of overlapping defects compared to YOLOv8n.
Article
Full-text available
The cross-depiction problem is that of recognising visual objects regardless of whether they are photographed, painted, drawn, etc. It is a potentially significant yet under-researched problem. Emulating the remarkable human ability to recognise objects in an astonishingly wide variety of depictive forms is likely to advance both the foundations and the applications of Computer Vision. In this paper we benchmark classification, domain adaptation, and deep learning methods, demonstrating that none perform consistently well on the cross-depiction problem. Despite the current interest in deep learning, such methods exhibit the same behaviour as all but one of the other methods: a significant fall in performance on inhomogeneous databases compared with their peak performance, which is always achieved on data comprising photographs only. We find that methods with strong models of the spatial relations between parts tend to be more robust, and we therefore conclude that such information is important in modelling object classes regardless of appearance details.
Article
Full-text available
Although the human visual system is surprisingly robust to extreme distortion when recognizing objects, most evaluations of computer object detection methods focus only on robustness to natural form deformations such as people's pose changes. To determine whether algorithms truly mirror the flexibility of human vision, they must be compared against human vision at its limits. For example, in Cubist abstract art, painted objects are distorted by object fragmentation and part-reorganization, to the point that human vision often fails to recognize them. In this paper, we evaluate existing object detection methods on these abstract renditions of objects, comparing human annotators to four state-of-the-art object detectors on a corpus of Picasso paintings. Our results demonstrate that while human perception significantly outperforms current methods, human perception and part-based models exhibit a similarly graceful degradation in object detection performance as the objects become increasingly abstract and fragmented, corroborating the theory of part-based object representation in the brain.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high-quality locations, yielding 99% recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).
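A simplified illustration (not the authors' released software) of the two proposal-quality metrics quoted above, recall at an IoU threshold and Average Best Overlap, using toy hand-written boxes:

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def proposal_quality(gt_boxes, proposals, thresh=0.5):
    # Best IoU between each ground-truth box and any proposal.
    best = [max(iou(g, p) for p in proposals) for g in gt_boxes]
    recall = sum(b >= thresh for b in best) / len(best)   # recall at the threshold
    abo = sum(best) / len(best)                           # Average Best Overlap
    return recall, abo

gt = [(10, 10, 50, 50), (60, 60, 120, 120)]
props = [(12, 8, 48, 52), (0, 0, 40, 40), (55, 65, 125, 115)]
print(proposal_quality(gt, props))  # (1.0, ~0.78) for these toy boxes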
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near-state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Full-text available
Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.
Article
Full-text available
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
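A minimal sketch of the idea: each unit's activation is zeroed with probability 0.5 during training; the sketch uses the inverted-dropout scaling that is common today rather than the paper's test-time rescaling, which is an assumption of this illustration.

import numpy as np

def dropout(activations: np.ndarray, p: float = 0.5, train: bool = True) -> np.ndarray:
    if not train or p == 0.0:
        return activations              # at test time all units are kept
    # Randomly zero each unit with probability p, rescaling survivors so the
    # expected activation matches test time (inverted dropout).
    mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * mask / (1.0 - p)

x = np.ones((2, 8), dtype=np.float32)   # toy layer activations
print(dropout(x, p=0.5))                # roughly half the entries are zeroed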
Article
Full-text available
This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the Integral Image which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a cascade which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
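A short sketch of the Integral Image trick described above: after one cumulative-sum pass, the sum over any axis-aligned rectangle, and hence any Haar-like feature built from rectangle sums, can be read off with four array lookups.

import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    # ii[y, x] = sum of img[:y, :x]; a leading row/column of zeros simplifies lookups.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii: np.ndarray, y0: int, x0: int, y1: int, x1: int) -> int:
    # Sum of img[y0:y1, x0:x1] in O(1) using four corners of the integral image.
    return int(ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0])

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())  # both print 30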
Article
Full-text available
Sliding window classifiers are among the most successful and widely applied techniques for object localization. However, training is typically done in a way that is not specific to the localization task. First a binary classifier is trained using a sample of positive and negative examples, and this classifier is subsequently applied to multiple regions within test images. We propose instead to treat object localization in a principled way by posing it as a problem of predicting structured data: we model the problem not as binary classification, but as the prediction of the bounding box of objects located in images. The use of a joint-kernel framework allows us to formulate the training procedure as a generalization of an SVM, which can be solved efficiently. We further improve computational efficiency by using a branch-and-bound strategy for localization during both training and testing. Experimental evaluation on the PASCAL VOC and TU Darmstadt datasets show that the structured training procedure improves performance over binary training as well as the best previously published scores.
Conference Paper
Full-text available
This paper presents a general trainable framework for object detection in static images of cluttered scenes. The detection technique we develop is based on a wavelet representation of an object class derived from a statistical analysis of the class instances. By learning an object class in terms of a subset of an overcomplete dictionary of wavelet basis functions, we derive a compact representation of an object class which is used as an input to a support vector machine classifier. This representation overcomes the problem of in-class variability and provides a low false detection rate in unconstrained environments. We demonstrate the capabilities of the technique in two domains whose inherent information content differs significantly. The first system is face detection and the second is the domain of people which, in contrast to faces, vary greatly in color, texture, and patterns. Unlike previous approaches, this system learns from examples and does not rely on any a priori (hand-crafted) models or motion-based segmentation. The paper also presents a motion-based extension to enhance the performance of the detection algorithm over video sequences. The results presented here suggest that this architecture may well be quite general.
Article
This paper describes a visual object detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the "Integral Image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features and yields extremely efficient classifiers [6]. The third contribution is a method for combining classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. A set of experiments in the domain of face detection are presented. The system yields face detection performance comparable to the best previous systems [18, 13, 16, 12, 1]. Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
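A toy sketch of the core HOG step described above, per-cell histograms of gradient orientations weighted by gradient magnitude; the 8-pixel cells and 9 orientation bins are the commonly used defaults, and the block-level contrast normalization the abstract emphasizes is omitted here.

import numpy as np

def cell_orientation_histograms(img: np.ndarray, cell: int = 8, bins: int = 9):
    gy, gx = np.gradient(img.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # unsigned orientation
    h, w = img.shape
    hists = np.zeros((h // cell, w // cell, bins))
    for cy in range(h // cell):
        for cx in range(w // cell):
            sl = (slice(cy * cell, (cy + 1) * cell), slice(cx * cell, (cx + 1) * cell))
            # Accumulate gradient magnitude into the orientation bin of each pixel.
            idx = np.minimum((ang[sl] / (180.0 / bins)).astype(int), bins - 1)
            np.add.at(hists[cy, cx], idx.ravel(), mag[sl].ravel())
    return hists  # block normalization over neighbouring cells would follow here

img = np.random.rand(64, 64).astype(np.float32)
print(cell_orientation_histograms(img).shape)  # (8, 8, 9)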
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Conference Paper
The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box's boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
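A minimal sketch of how an RPN tiles anchor boxes over every position of the shared feature map; the stride, scales, and aspect ratios below are typical values rather than settings taken from the abstract above.

import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # centre in image coords
            for s in scales:
                for r in ratios:
                    # Width/height chosen so the box area stays s*s and w/h = r.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)   # one (x1, y1, x2, y2) box per position/scale/ratio

print(generate_anchors(2, 3).shape)  # (2 * 3 * 9, 4) = (54, 4)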
Conference Paper
Object detection and semantic segmentation are two strongly correlated tasks, yet typically solved separately or sequentially with substantially different techniques. Motivated by the complementary effect observed from the typical failure cases of the two tasks, we propose a unified framework for joint object detection and semantic segmentation. By enforcing the consistency between final detection and segmentation results, our unified framework can effectively leverage the advantages of leading techniques for these two tasks. Furthermore, both local and global context information are integrated into the framework to better distinguish the ambiguous samples. By jointly optimizing the model parameters for all the components, the relative importance of different components is automatically learned for each category to guarantee the overall performance. Extensive experiments on the PASCAL VOC 2010 and 2012 datasets demonstrate encouraging performance of the proposed unified framework for both object detection and semantic segmentation tasks.
Article
Deep convolutional neural networks (CNNs) have had a major impact in most areas of image understanding, including object category detection. In object detection, methods such as R-CNN have obtained excellent results by integrating CNNs with region proposal generation algorithms such as selective search. In this paper, we investigate the role of proposal generation in CNN-based detectors in order to determine whether it is a necessary modelling component, carrying essential geometric information not contained in the CNN, or whether it is merely a way of accelerating detection. We do so by designing and evaluating a detector that uses a trivial region generation scheme, constant for each image. Combined with SPP, this results in an excellent and fast detector that does not require processing the image with any algorithm other than the CNN itself. We also streamline and simplify the training of CNN-based detectors by integrating several learning steps in a single algorithm, as well as by proposing a number of improvements that accelerate detection.
Conference Paper
We describe an implementation of the Deformable Parts Model [1] that operates in a user-defined time-frame. Our implementation uses a variety of mechanisms to trade off speed against accuracy. Our implementation can detect all 20 PASCAL 2007 objects simultaneously at 30Hz with an mAP of 0.26. At 15Hz, its mAP is 0.30; and at 100Hz, its mAP is 0.16. By comparison, the reference implementation of [1] runs at 0.07Hz with an mAP of 0.33, and a fast GPU implementation runs at 1Hz. Our technique is over an order of magnitude faster than the previous fastest DPM implementation. Our implementation exploits a series of important speedup mechanisms. We use the cascade framework of [3] and the vector quantization technique of [2]. To speed up feature computation, we compute HOG features at few scales, and apply many interpolated templates. A hierarchical vector quantization method is used to compress HOG features for fast template evaluation. An object proposal step uses hash-table methods to identify locations where evaluating templates would be most useful; these locations are inserted into a priority queue, and processed in a detection phase. Both proposal and detection phases have an any-time property. Our method applies to legacy templates, and no retraining is required.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
We propose an object detection system that relies on a multi-region deep convolutional neural network (CNN) that also encodes semantic segmentation-aware features. The resulting CNN-based representation aims at capturing a diverse set of discriminative appearance factors and exhibits localization sensitivity that is essential for accurate object localization. We exploit the above properties of our recognition module by integrating it on an iterative localization mechanism that alternates between scoring a box proposal and refining its location with a deep CNN regression model. Thanks to the efficient use of our modules, we detect objects with very high localization accuracy. On the detection challenges of PASCAL VOC2007 and PASCAL VOC2012 we achieve mAP of 74.9% and 70.7% correspondingly, surpassing any other published work by a significant margin.
Article
Most object detectors contain two important components: a feature extractor and an object classifier. The feature extractor has rapidly evolved with significant research efforts leading to better deep ConvNet architectures. The object classifier, however, has not received much attention and most state-of-the-art systems (like R-CNN) use simple multi-layer perceptrons. This paper demonstrates that carefully designing deep networks for object classification is just as important. We take inspiration from traditional object classifiers, such as DPM, and experiment with deep networks that have part-like filters and reason over latent variables. We discover that on pre-trained convolutional feature maps, even randomly initialized deep classifiers produce excellent results, while the improvement due to fine-tuning is secondary; on HOG features, deep classifiers outperform DPMs and produce the best HOG-only results without external data. We believe these findings provide new insight for developing object detection systems. Our framework, called Networks on Convolutional feature maps (NoC), achieves outstanding results on the PASCAL VOC 2007 (73.3% mAP) and 2012 (68.8% mAP) benchmarks.
Article
We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can simultaneously perform classification so that in a single step it recognizes the object and finds a good grasp rectangle. A modification to this model predicts multiple grasps per object by using a locally constrained prediction mechanism. The locally constrained model performs significantly better, especially on objects that can be grasped in a variety of ways.
Article
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
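Since average precision is the headline metric across these references, the sketch below shows a compact, illustrative computation of VOC-style AP for one class (area under the monotone precision envelope); it is not the challenge's official evaluation code, and the inputs are assumed to be pre-matched detections.

import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    # Rank detections by confidence, then sweep the ranked list.
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Integrate the monotonically decreasing precision envelope over recall.
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Four detections (scores) marked TP/FP against 4 ground-truth objects.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=4))  # 0.625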
Article
Deep Convolutional Neural Networks (CNNs) have gained great success in image classification and object detection. In these applications, the outputs of all layers of CNNs are usually considered as a high dimensional feature vector extracted from an input image. However, for image classification, elements of the feature vector with large intra-class and small inter-class variations may not help and thus it makes sense to drop them out. Inspired by this, we propose a novel approach named Feature Edit, which generates an edited version for each original CNN feature vector by firstly selecting units and then setting them to zeros. It is observed via visualization that almost all the edited feature vectors are very close to their corresponding original versions. The experimental results for classification-based object detection on canonical datasets including VOC 2007 (60.1%), 2010 (56.4%) and 2012 (56.3%) show obvious improvement in mean average precision (mAP) if edited features as well as the original ones are altogether used to train a simple linear support vector machine.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
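A minimal sketch of the pooling step itself: max-pooling a C x H x W feature map over 1x1, 2x2, and 4x4 grids and concatenating the results yields a fixed-length vector regardless of H and W. The pyramid levels here are assumptions for illustration, not necessarily the configuration used in the paper.

import numpy as np

def spatial_pyramid_pool(feat: np.ndarray, levels=(1, 2, 4)) -> np.ndarray:
    c, h, w = feat.shape
    pooled = []
    for n in levels:
        # Split H and W into n roughly equal bins and max-pool each bin.
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                region = feat[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(region.max(axis=(1, 2)))
    return np.concatenate(pooled)   # length = C * sum(n * n for n in levels)

print(spatial_pyramid_pool(np.random.rand(256, 13, 13)).shape)  # (256 * 21,)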
Conference Paper
This paper shows how to analyze the influences of object characteristics on detection performance and the frequency and impact of different types of false positives. In particular, we examine effects of occlusion, size, aspect ratio, visibility of parts, viewpoint, localization error, and confusion with semantically similar objects, other labeled objects, and background. We analyze two classes of detectors: the Vedaldi et al. multiple kernel learning detector and different versions of the Felzenszwalb et al. detector. Our study shows that sensitivity to size, localization error, and confusion with similar objects are the most impactful forms of error. Our analysis also reveals that many different kinds of improvement are necessary to achieve large gains, making more detailed analysis essential for the progress of recognition research. By making our software and annotations available, we make it effortless for future researchers to perform similar analysis.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Conference Paper
Many object detection systems are constrained by the time required to convolve a target image with a bank of filters that code for different aspects of an object's appearance, such as the presence of component parts. We exploit locality-sensitive hashing to replace the dot-product kernel operator in the convolution with a fixed number of hash-table probes that effectively sample all of the filter responses in time independent of the size of the filter bank. To show the effectiveness of the technique, we apply it to evaluate 100,000 deformable-part models requiring over a million (part) filters on multiple scales of a target image in less than 20 seconds using a single multi-core processor with 20GB of RAM. This represents a speed-up of approximately 20,000 times (four orders of magnitude) when compared with performing the convolutions explicitly on the same hardware. While mean average precision over the full set of 100,000 object classes is around 0.16, due in large part to the challenges in gathering training data and collecting ground truth for so many classes, we achieve a mAP of at least 0.20 on a third of the classes and 0.30 or better on about 20% of the classes.
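The sketch below shows a generic random-hyperplane (SimHash) locality-sensitive hash, not necessarily the specific hashing scheme used in the paper: similar feature vectors tend to land in the same bucket, so candidate filters can be retrieved by table lookup instead of evaluating every dot product.

import numpy as np

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 128))          # 16 random hyperplanes -> 16-bit keys

def lsh_key(v: np.ndarray) -> int:
    bits = (planes @ v) > 0                       # which side of each hyperplane
    return int(bits.dot(1 << np.arange(16)))      # pack the sign bits into an int

# Index a bank of "filters" by their hash key.
filters = rng.standard_normal((1000, 128))
table = {}
for i, f in enumerate(filters):
    table.setdefault(lsh_key(f), []).append(i)

# A slightly perturbed copy of filter 42 will likely hash to the same bucket,
# so it is found with one lookup rather than 1000 dot products.
query = filters[42] + 0.05 * rng.standard_normal(128)
candidates = table.get(lsh_key(query), [])
print(42 in candidates, len(candidates))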
Conference Paper
Recently Viola et al. [2001] have introduced a rapid object detection scheme based on a boosted cascade of simple feature classifiers. In this paper we introduce a novel set of rotated Haar-like features. These novel features significantly enrich the simple features of Viola et al. and can also be calculated efficiently. With these new rotated features our sample face detector shows off on average a 10% lower false alarm rate at a given hit rate. We also present a novel post optimization procedure for a given boosted cascade improving on average the false alarm rate further by 12.5%.
Conference Paper
Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other (10, 11). However, current state-of-the-art models use a separate representation for each task, making joint inference clumsy and leaving the classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach simultaneously reasons about pixels, regions and objects in a coherent probabilistic model. Pixel appearance features allow us to perform well on classifying amorphous background classes, while the explicit representation of regions facilitates the computation of more sophisticated features necessary for object detection. Importantly, our model gives a single unified description of the scene: we explain every pixel in the image and enforce global consistency between all random variables in our model. We run experiments on the challenging Street Scene dataset (2) and show significant improvement over state-of-the-art results for object detection accuracy.
Conference Paper
Sliding window classifiers are among the most successful and widely applied techniques for object localization. However, training is typically done in a way that is not specific to the localization task. First a binary classifier is trained using a sample of positive and negative examples, and this classifier is subsequently applied to multiple regions within test images. We propose instead to treat object localization in a principled way by posing it as a problem of predicting structured data: we model the problem not as binary classification, but as the prediction of the bounding box of objects located in images. The use of a joint-kernel framework allows us to formulate the training procedure as a generalization of an SVM, which can be solved efficiently. We further improve computational efficiency by using a branch-and-bound strategy for localization during both training and testing. Experimental evaluation on the PASCAL VOC and TU Darmstadt datasets show that the structured training procedure improves performance over binary training as well as the best previously published scores.
Conference Paper
We address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet. We postulate two criteria: (1) it should be easy to find a poselet given an input image, and (2) it should be easy to localize the 3D configuration of the person conditioned on the detection of a poselet. To permit this we have built a new dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. This enables us to implement a data-driven search procedure for finding poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. The algorithm discovers poselets that correspond to frontal and profile faces, pedestrians, head and shoulder views, among others. Each poselet provides examples for training a linear SVM classifier which can then be run over the image in a multiscale scanning mode. The outputs of these poselet detectors can be thought of as an intermediate layer of nodes, on top of which one can run a second layer of classification or regression. We show how this permits detection and localization of torsos or keypoints such as left shoulder, nose, etc. Experimental results show that we obtain state-of-the-art performance on people detection in the PASCAL VOC 2007 challenge, among other datasets. We are making publicly available both the H3D dataset as well as the poselet parameters for use by other researchers.
Article
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
Conference Paper
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered, partially occluded images with a computation time of under 2 seconds.
Models accuracy on imagenet 2012 val
  • D Mishkin
Simultaneous detection and segmentation
  • B Hariharan
  • P Arbeláez
  • R Girshick
  • J Malik