Figure 5 - uploaded by Yurong Chen
Object detection and bounding box regression modules. Top: bounding box regression; Bottom: object classification. 

Source publication
Article
We present RON, an efficient and effective framework for generic object detection. Our motivation is to smartly associate the best of the region-based (e.g., Faster R-CNN) and region-free (e.g., SSD) methodologies. Under fully convolutional architecture, RON mainly focuses on two fundamental problems: (a) multi-scale object localization and (b) neg...

Context in source publication

Context 1
... Softmax, the sub-network outputs the per-class score that indicates the presence of a class-specific instance. For bounding box regression, we predict the offsets relative to the default boxes in the cell (see Figure 5). ...
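The offset prediction described in this snippet follows the standard default-box parameterization. As a minimal sketch (the encoding below is the common SSD/Faster R-CNN convention; RON's exact variance-scaling terms are not given here), the offsets and their inversion look like:

```python
import math

def encode_box(default_box, gt_box):
    """Encode a ground-truth box as offsets relative to a default box.

    Boxes are (cx, cy, w, h) tuples. Center offsets are normalized by the
    default-box size; width/height offsets are log-space ratios.
    """
    dcx, dcy, dw, dh = default_box
    gcx, gcy, gw, gh = gt_box
    return ((gcx - dcx) / dw, (gcy - dcy) / dh,
            math.log(gw / dw), math.log(gh / dh))

def decode_box(default_box, offsets):
    """Invert encode_box: recover a predicted box from regressed offsets."""
    dcx, dcy, dw, dh = default_box
    tx, ty, tw, th = offsets
    return (dcx + tx * dw, dcy + ty * dh,
            dw * math.exp(tw), dh * math.exp(th))
```

Encoding then decoding round-trips exactly, so the regression sub-network only has to predict small, scale-normalized values.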

Similar publications

Article
Multi-scale object detection, especially small object detection, is still a challenging task. This paper proposes an improved multi-scale object detection network based on the Single Shot MultiBox Detector (SSD), named SSD-MSN. SSD-MSN can learn richer features of small objects from enlarged areas, which are clipped from...
Article
Downsampling input images is a simple trick to speed up visual object-detection algorithms, especially on robotic vision and applied mobile vision systems. However, this trick comes with a significant decline in accuracy. In this paper, dual-resolution dual-path Convolutional Neural Networks (CNNs), named DualNets, are proposed to bump up the accur...
Conference Paper
Most multi-scale detectors face a challenge of small-size false positives due to the inadequacy of low-level features, which have small receptive fields and weak semantic capability. This paper demonstrates that independent predictions from different feature layers on the same region are beneficial for reducing false positives. We propose a novel...
Article
With the rapid development of computer vision and machine vision, methods based on deep learning have achieved good results in the field of object detection, identification, and tracking. However, for the detection and identification of rebars in smart construction sites, it is very difficult to perform accurate real-time detection of rebars by usi...
Article
Printed circuit board (PCB) defect detection is one of the primary problems in quality control for most electronic products. Industrial PCB imagery usually has high resolution, but defects occupy only a small proportion of it (often only ∼10 pixels in size), which makes traditional machine vision methods difficult to apply. To this end, a novel sing...

Citations

... When combined with the TFDS, MAFENet can realize efficient and accurate automatic fault detection of train parts. MAFENet uses the coarse-to-fine two-step adjustment structure (TAS) of RON [2] to alleviate the impact of the imbalance between positive and negative samples and to improve detection accuracy. A feature transfer block [3] is used to solve the problem of shallow feature maps lacking deep semantic features. ...
Article
Faults in train mechanical parts pose a significant safety hazard to railway transportation. Although some image detection methods have replaced manual fault detection of train mechanical parts, the detection of small mechanical parts under low illumination is still not ideal. To improve the accuracy and efficiency of detecting train faults in different environments, we propose a multi-mode aggregation feature enhanced network (MAFENet) based on a single-stage detector (SSD). The network uses a coarse-to-fine two-step adjustment structure and uses the K-means algorithm to design anchors. The receptive field enhancement module (RFEM) is designed to obtain fused features from different receptive fields. The attention-guided detail feature enhancement module (ADEM) is designed to complement the detailed features of deep-level feature maps. Meanwhile, the complete intersection over union (CIoU) loss is used to obtain more accurate bounding boxes. Experimental results on the train mechanical parts fault (TMPF) dataset show that the detection performance of MAFENet is better than that of other SSD models. MAFENet with an input size of 320 × 320 pixels achieves a mean average precision (mAP) of 0.9787 at a detection speed of 33 frames per second (FPS), which indicates that it can perform real-time detection, is robust to images captured under different environmental conditions, and can improve the efficiency of detecting faulty train parts.
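The K-means anchor design mentioned in this abstract is usually done by clustering ground-truth box sizes under a 1 − IoU distance (the YOLOv2-style recipe). The initialization and iteration budget in this sketch are illustrative assumptions, not MAFENet's published settings:

```python
def iou_wh(box, centroid):
    """IoU between two (w, h) boxes assumed to share the same center."""
    w = min(box[0], centroid[0])
    h = min(box[1], centroid[1])
    inter = w * h
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100):
    """Cluster (w, h) box sizes with k-means under a 1 - IoU distance."""
    centroids = boxes[:k]  # naive init: first k boxes (an assumption)
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        # Recompute centroids as per-cluster mean sizes.
        new = []
        for i, c in enumerate(clusters):
            if not c:
                new.append(centroids[i])
            else:
                new.append((sum(b[0] for b in c) / len(c),
                            sum(b[1] for b in c) / len(c)))
        if new == centroids:
            break
        centroids = new
    return sorted(centroids)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, so the resulting anchors cover small objects better.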
... One example is the Feature Pyramid Network (FPN) [9], as shown in Fig. 1b. Like many other detection frameworks, such as DSSD [10], DSOD [11], RON [12] and StairNet [13], FPN aims to iteratively integrate high-level semantic features into high-resolution features by employing lateral connections and top-down pathways. However, a disadvantage of this method is that serially merging two adjacent features may cause details in lower feature maps to be swamped by semantics from deeper feature maps. ...
... DSSD [10] implements extra deconvolution layers and skip connections to capture additional large-scale context and enhance the deep semantic information of low-level features. Similar modules can be seen in TDM [22], StairNet [13], DSOD [11] and RON [12]. RefineDet [23] introduces two-step cascade regression into a single-stage pipeline, using a transfer connection block (TCB) to transfer features from the anchor refinement module into the object detection module. ...
Article
Benefiting from multi-scale feature pyramid methods, single-stage object detectors have recently achieved promising accuracy and fast inference speed. However, the majority of existing feature pyramid techniques only superficially describe the complex contextual relationships between different scales. Not only are there no effective modules that adaptively propagate appropriate semantic information from deeper layers, but the finer spatial localization cues from lower layers are also often ignored. In this paper, we present a Local Enhancement and Bidirectional Feature Refinement Network (LFBFR), which includes two optimization methods that achieve remarkable improvements in detection accuracy. First, to make the backbone more suitable for the detection task, we modify the pre-trained classification backbone to mitigate the loss of detail in small objects caused by the successive reduction of image resolution. Then we propose a Bidirectional Feature Refinement Pyramid, which effectively utilizes the inter-channel relationships of higher-level features and the fine appearance cues of lower-level features through an attention residual refinement module and a feature reuse module. Finally, to assess the performance of the proposed LFBFR, we design a powerful end-to-end single-stage detector called LFBFR-SSD by embedding it into the SSD framework. Extensive experiments on PASCAL VOC and MS COCO verify that our LFBFR-SSD outperforms many state-of-the-art detectors while maintaining real-time speed.
... A multi-scale feature fusion module is widely adopted for accurate object detection [19,16,22,11]. As shown in Fig. 5(a), the Feature Pyramid Network (FPN) [19] utilizes a top-down architecture to aggregate features from different levels and enhance the high-level semantic features for all scales. ...
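The top-down aggregation that FPN performs can be sketched as follows. Random per-level 1×1 channel projections stand in for FPN's learned lateral convolutions, and the 3×3 smoothing convolutions are omitted for brevity, so this is only a structural sketch:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features, out_channels=4, seed=0):
    """Top-down pathway with lateral connections, FPN-style.

    `features` is a list of (C_i, H_i, W_i) maps ordered fine to coarse,
    each level half the resolution of the previous one.
    """
    rng = np.random.default_rng(seed)
    # 1x1 "convolutions" as per-level channel projections (assumed weights).
    laterals = [rng.standard_normal((out_channels, f.shape[0])) for f in features]
    # Start from the coarsest level, then merge downward.
    prev = np.einsum('oc,chw->ohw', laterals[-1], features[-1])
    outputs = [prev]
    for f, w in zip(features[-2::-1], laterals[-2::-1]):
        # Lateral projection plus the upsampled coarser map.
        prev = np.einsum('oc,chw->ohw', w, f) + upsample2x(prev)
        outputs.append(prev)
    return outputs[::-1]  # finest level first
```

Each output level thus carries its own spatial detail plus semantics accumulated from every coarser level above it.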
Preprint
Adder neural networks (AdderNets) have shown impressive performance on image classification with only addition operations, which are more energy efficient than traditional convolutional neural networks built with multiplications. Beyond classification, there is a strong demand for reducing the energy consumption of modern object detectors via AdderNets for real-world applications such as autonomous driving and face detection. In this paper, we present an empirical study of AdderNets for object detection. We first reveal that the batch normalization statistics in the pre-trained adder backbone should not be frozen, owing to the relatively large feature variance of AdderNets. Moreover, we insert more shortcut connections in the neck part and design a new feature fusion architecture to avoid the sparse features of adder layers. We present extensive ablation studies to explore several design choices of adder detectors. Comparisons with the state of the art are conducted on the COCO and PASCAL VOC benchmarks. Specifically, the proposed Adder FCOS achieves 37.8% AP on the COCO val set, demonstrating performance comparable to that of its convolutional counterpart with about a 1.4× energy reduction.
... Both these approaches rely on assimilating information via their pixel-connectivity to improve feature representations. For scale relations, many efforts have been made on fusing features across scales to alleviate the discrepancy of feature maps from different levels of bottom-up hierarchy and feature scale-space, including top-down information flow [15, 40, 54], an extra bottom-up information path [31,43,68], multiple hourglass structures [46,81], concatenating features from different layers [4,20,38,59] or different tasks [52], gradual multi-stage local information fusions [58,75], pyramid convolutions [67], etc. Even though standard design principles for scale relations are emerging for ConvNet architectures, the problem is far from being solved. ...
Preprint
Incorporating relational reasoning in neural networks for object recognition remains an open problem. Although many attempts have been made for relational reasoning, they generally only consider a single type of relationship. For example, pixel relations through self-attention (e.g., non-local networks), scale relations through feature fusion (e.g., feature pyramid networks), or object relations through graph convolutions (e.g., reasoning-RCNN). Little attention has been given to more generalized frameworks that can reason across these relationships. In this paper, we propose a hierarchical relational reasoning framework (HR-RCNN) for object detection, which utilizes a novel graph attention module (GAM). This GAM is a concise module that enables reasoning across heterogeneous nodes by operating on the graph edges directly. Leveraging heterogeneous relationships, our HR-RCNN shows great improvement on COCO dataset, for both object detection and instance segmentation.
... With the development of artificial intelligence, mainstream pedestrian detection methods are based on deep learning. They fall into two categories: two-stage, classification-based target detection algorithms, represented by R-CNN, Faster R-CNN [2], Hypernet [3] and Mask R-CNN [4], and one-stage, regression-based target detection algorithms, represented by YOLO, SSD [5], G-CNN [6] and RON [7]. In recent years, breakthroughs have been made in applying deep-learning-based target detection algorithms to pedestrian detection. ...
Article
As a research hotspot in computer vision, pedestrian detection is widely applied in many fields, such as video surveillance and autonomous driving. However, the accuracy of pedestrian detection under video surveillance is poor, and the miss rate for small-target pedestrians is high. In this paper, we improve the YOLOv3 algorithm and propose a YOLOv3-Multi pedestrian detection model. First, referring to the residual structure of DarkNet, shallow and deep features are up-sampled and concatenated to obtain a multi-scale detection layer. Then, for specific detection categories, spatial pyramid pooling (SPP) is introduced to strengthen the detection of small targets. Experimental results show that our method improves the average accuracy by 2.54%, 6.43% and 8.99% compared with YOLOv3, SSD and YOLOv2 on the VOC dataset.
... In recent years, this approach has been improved by various excellent works [73,74,75,76,77,78,79,80,81,82,83,84] and has achieved remarkable performance. The one-stage approach, on the other hand, discards the proposal generation procedure in exchange for higher computational efficiency and faster inference speed, with anchor-based [85,86,87,88,89,90,91,92] or anchor-free detectors [93,94,95,96,97,98,99,100]. ...
Preprint
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions: 1) a large-scale video dataset, FSVOD-500, comprising 500 classes with class-balanced videos in each category for few-shot learning; 2) a novel Tube Proposal Network (TPN) to generate high-quality video tube proposals that aggregate feature representations for the target video object; 3) a strategically improved Temporal Matching Network (TMN+) to match representative query tube features and supports with better discriminative ability. Our TPN and TMN+ are jointly trained end-to-end. Extensive experiments demonstrate that our method produces significantly better detection results on two few-shot video object detection datasets than image-based methods and other naive video-based extensions. Code and datasets will be released at https://github.com/fanq15/FewX.
... Due to the independent feature representations of its detection layers, the vanilla SSD achieves an AP score of only 25.1%, which is significantly worse than other detectors that use a feature-fusing strategy. RON384++ (Kong et al. 2017), DSSD321 (Fu et al. 2017) and RetinaNet400 (Lin et al. 2017b) observe that shallow layers in SSD lack high-level semantic information. They therefore designed several feature-fusing strategies and yielded great AP improvements at different object scales. ...
Article
Feature pyramids have delivered significant improvements in object detection. However, building effective feature pyramids relies heavily on expert knowledge and requires strenuous effort to balance effectiveness and efficiency. Automatic search methods, such as NAS-FPN, automate the design of feature pyramids, but their low search efficiency makes them difficult to apply in a large search space. In this paper, we propose a novel search framework for feature pyramid networks, called AutoDet, which enables automatic discovery of informative connections between multi-scale features and configures detection architectures with both high efficiency and state-of-the-art performance. In AutoDet, a new search space is specifically designed for feature pyramids in object detectors, which is more general than that of NAS-FPN. Furthermore, the architecture search process is formulated as a combinatorial optimization problem and solved by a Simulated Annealing-based Network Architecture Search method (SA-NAS). Compared with existing NAS methods, AutoDet ensures a dramatic reduction in search time. For example, our SA-NAS can be up to 30x faster than reinforcement-learning-based approaches. Furthermore, AutoDet is compatible with both one-stage and two-stage structures with all kinds of backbone networks. We demonstrate the effectiveness of AutoDet with outperforming single-model results on the COCO dataset. Without pre-training on OpenImages, AutoDet with a ResNet-101 backbone achieves an AP of 39.7 and 47.3 for one-stage and two-stage architectures, respectively, surpassing current state-of-the-art methods.
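The SA-NAS method in this abstract builds on simulated annealing; the abstract does not give its search space or scoring, so the following is only a textbook simulated-annealing skeleton of the kind such a search loops over, with assumed names and a toy scalar search space:

```python
import math
import random

def simulated_annealing(init, neighbor, score, steps=1000, t0=1.0, cooling=0.995):
    """Generic simulated-annealing maximization loop.

    Accepts worse candidates with probability exp(delta / T), so the search
    can escape local optima; the temperature T decays geometrically.
    """
    random.seed(0)  # deterministic for the sake of the example
    cur, cur_score = init, score(init)
    best, best_score = cur, cur_score
    t = t0
    for _ in range(steps):
        cand = neighbor(cur)
        s = score(cand)
        delta = s - cur_score
        if delta > 0 or random.random() < math.exp(delta / t):
            cur, cur_score = cand, s
        if cur_score > best_score:
            best, best_score = cur, cur_score
        t *= cooling
    return best, best_score
```

In an architecture search, `init` would be a pyramid configuration, `neighbor` a random mutation of its connections, and `score` a proxy for validation accuracy.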
... HON [22] aggregates high-level features; [23] incorporates hierarchical feature maps and compresses them into a fixed-size space. To perform detection at multiple scales, RON [24] uses reverse connections to predict objects at different layers, while FPN [25] presents a clean and simple framework for building feature pyramids inside ConvNets; it achieved good results when trained on the COCO trainval35k dataset. EfficientDet [26] proposes a weighted bidirectional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion. ...
Article
Object detection is widely used in smart cities, including safety monitoring, traffic control, and car driving. However, in the smart city scenario, many objects suffer from occlusion, and most popular object detectors are sensitive to various real-world occlusions. This paper proposes a feature-enhanced occlusion perception object detector that simultaneously detects occluded objects and fully utilizes spatial information. To generate hard examples with occlusions, a mask generator localizes and masks discriminative regions with weakly supervised methods. To obtain enriched feature representations, we design a multiscale representation fusion module to combine hierarchical feature maps. Moreover, the method exploits contextual information by aggregating representations from different regions of the feature maps. The model is trained end-to-end by minimizing the multitask loss. Our model obtains superior performance compared to previous object detectors: 77.4% mAP and 74.3% mAP on PASCAL VOC 2007 and PASCAL VOC 2012, respectively. It also achieves 24.6% mAP on MS COCO. Experiments demonstrate that the proposed method improves the effectiveness of object detection, making it highly suitable for smart city applications that need to discover key objects under occlusion.
... The dominant two-stage detectors are Faster R-CNN [52] and its variants [21,1,31,58,59,7,89,17,82,49], which first adopt a Region Proposal Network (RPN) to generate region proposals as coarse localizations and then perform per-region classification and location fine-tuning. In contrast, single-stage detectors [41,51,26,33,95,70,90,40] employ densely placed anchors as region proposals and make predictions on them directly. These methods still rely on many heuristics, such as anchor generation. ...
Preprint
Few-shot object detection aims at detecting novel objects with only a few annotated examples. Prior works have proved meta-learning a promising solution, and most of them essentially address detection by meta-learning over regions for their classification and location fine-tuning. However, these methods substantially rely on initially well-located region proposals, which are usually hard to obtain under the few-shot settings. This paper presents a novel meta-detector framework, namely Meta-DETR, which eliminates region-wise prediction and instead meta-learns object localization and classification at image level in a unified and complementary manner. Specifically, it first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. To facilitate meta-learning with deep networks, we design a simple but effective Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. Experiments over multiple few-shot object detection benchmarks show that Meta-DETR outperforms state-of-the-art methods by large margins.
... For example, SSD [24] predicts object classes and anchor box offsets directly from anchor boxes on multi-scale layers. Thereafter, many works have been presented to boost its performance: improving different layers [6,17], architecture redesign [16], training from scratch [35,50], anchor refinement and matching [48,47], data augmentation strategies [51], feature enrichment and alignment [42,28], and new loss functions [3,22]. At present, RetinaNet [22] has replaced SSD as a milestone detector among one-stage methods. Building on RetinaNet, a series of improved models followed [46,41,39]. Even compared with these latest methods, RetinaNet remains outstanding in both accuracy and speed. ...
Preprint
By definition, object detection requires a multi-task loss to solve classification and regression tasks simultaneously. In practice, however, the loss weight tends to be set manually. This raises a very practical problem that has not been studied so far: how to quickly find the loss weight that fits the current loss functions. In addition, when we choose different regression loss functions, whether the loss weight needs to be adjusted, and if so how, remains an open problem. In this paper, through experiments and theoretical analysis of prediction box shifting, we first establish three important conclusions about the optimal loss weight allocation strategy: (1) the classification loss curve decays faster than the regression loss curve; (2) the loss weight is less than 1; (3) the gap between the classification and regression loss weights should not be too large. Based on these conclusions, we propose Adaptive Loss Weight Adjustment (ALWA), which solves the above two problems by dynamically adjusting the loss weight during training according to statistical characteristics of the loss values. By incorporating ALWA into both one-stage and two-stage object detectors, we show a consistent improvement in their performance using L1, SmoothL1 and CIoU losses on popular object detection benchmarks, including PASCAL VOC and MS COCO. The code is available at https://github.com/ywx-hub/ALWA.
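The abstract does not give ALWA's exact update rule, so the following is only an illustrative sketch of adjusting a loss weight from loss statistics, shaped by the paper's stated conclusions (a weight below 1 and a bounded gap between the classification and regression weights); the function name and clamping bounds are assumptions:

```python
def adjust_loss_weight(cls_losses, reg_losses, base=0.5, lo=0.1, hi=0.9):
    """Rescale the regression-loss weight by the ratio of the recent mean
    classification loss to the recent mean regression loss, clamped so the
    weight stays below 1 and the cls/reg gap stays bounded."""
    mean_cls = sum(cls_losses) / len(cls_losses)
    mean_reg = sum(reg_losses) / len(reg_losses)
    return max(lo, min(hi, base * mean_cls / mean_reg))
```

A training loop would call this every few iterations on a sliding window of recorded loss values and use the result as the coefficient on the regression term of the multi-task loss.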