Conference Paper

Abstract

The paper analyzes datasets containing images with labeled traffic signs, as well as modern approaches for their detection and classification in images of urban scenes. Particular attention is paid to the recognition of Russian traffic sign types. Several modern deep neural network architectures for simultaneous object detection and classification were studied, including Faster R-CNN, Mask R-CNN, Cascade R-CNN, and RetinaNet. To increase the efficiency of neural network recognition of objects in a video sequence, the Seq-BBox Matching algorithm is used. Training and testing of the proposed approach were carried out on the Russian Traffic Sign Dataset and the IceVision Dataset, which together contain over 150 types of road signs and more than 65,000 labeled images. For all the approaches considered, quality metrics are reported: mean average precision (mAP), mean average recall (mAR), and per-frame processing time. The highest detection quality was demonstrated by Faster R-CNN with Seq-BBox Matching, while the highest processing speed was provided by RetinaNet. The implementation uses the Python 3.7 programming language and the PyTorch deep learning library with NVIDIA CUDA technology. Performance figures were obtained on a workstation with an NVIDIA Tesla V100 32 GB video card. The results demonstrate that the proposed approach is applicable both to the resource-intensive procedure of automated labeling of road-scene images when preparing new datasets, and to traffic sign recognition in the on-board computer vision systems of unmanned vehicles.
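The cross-frame idea behind Seq-BBox Matching can be illustrated with a minimal sketch: detections in adjacent frames are matched by intersection over union (IoU), and a matched detection's confidence is supported by its neighbor. The function names and the simple score-averaging rule here are our illustration under stated assumptions, not the paper's exact algorithm.

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def link_and_rescore(prev_dets, cur_dets, iou_thr=0.5):
    """Match detections in adjacent frames by IoU and average the
    score of a matched current detection with its best match from
    the previous frame. Each detection is a (box, score) pair."""
    out = []
    for box, score in cur_dets:
        best = max((s for b, s in prev_dets if iou(box, b) >= iou_thr),
                   default=None)
        out.append((box, (score + best) / 2 if best is not None else score))
    return out
```

A stable sign whose score dips in one frame (motion blur, partial occlusion) is thus pulled back up by its high-confidence match in the neighboring frame.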
... Bergmann et al. [8] propose to use a detector to do the tracking, called "Tractor". Han et al. [22] propose "Seq-NMS" and Belkin et al. [6] propose "BBox-NMS". They both use good detections in nearby frames to boost detections with lower scores. ...
... Similar to Seq-NMS [22] and BBox-NMS [6], we test an approach that also utilizes IoU information and carries out NMS. We call this method the IoU-based scheme, as shown in Fig. 7. ...
Article
Continuously detecting traffic signs in a video sequence is necessary for autonomous or assisted driving scenarios, since a vehicle needs the information from the signs to facilitate navigation. A single-image traffic sign detector may fail in many cases when the car moves fast on the road, owing to motion blur, partial occlusion, and abrupt environmental change. In this paper, we propose an effective methodology, called detection-by-tracking, for robust traffic sign detection in videos, so as to improve detection performance beyond a basic object detector. We explore the temporal cues among frames to help with proposal reasoning for further regression. The correlations of spatial location and appearance similarity for the same sign in adjacent frames are considered in our approach. Experimental results show that the proposed detection-by-tracking mechanism is helpful, substantially improving detection performance. Moreover, the idea of detection-by-tracking can also be generalized to other scenarios for object detection tasks in videos.
... And we pay special attention to LiDAR data segmentation algorithms that are based on point cloud projections (see figure 1 on the bottom) because of their superiority over other heavy algorithms that process input point clouds directly in terms of low latency. This transition from a 3D point cloud to a 2D image makes it possible to actively use the advantages of deep convolutional neural networks, which have proven themselves well for object detection [5], semantic [6] and instance segmentation [7]. ...
Article
Among the problems in behavior planning for an unmanned vehicle, the central one is movement through difficult areas. In particular, such areas are intersections, at which direct interaction with other road agents takes place. In our work, we offer a new approach to training an intelligent agent that simulates the behavior of an unmanned vehicle, based on the integration of reinforcement learning and computer vision. Using full visual information about the road intersection obtained from aerial photographs, we study automatic detection of the relative positions of all road agents with various deep neural network architectures (YOLOv3, Faster R-CNN, RetinaNet, Cascade R-CNN, Mask R-CNN, Cascade Mask R-CNN). The possibility of estimating the vehicle orientation angle with a convolutional neural network is also investigated. The obtained additional features are used in the modern, effective reinforcement learning methods Soft Actor-Critic and Rainbow, which accelerates the convergence of the learning process. To demonstrate the operation of the developed system, an intersection simulator was developed, on which a number of model experiments were carried out.
Article
The paper considers the task of detecting in two-dimensional images not only the face but the whole head of a person, regardless of its turn toward the observer. The task is further complicated by the fact that the image received at the input of the recognition algorithm may be noisy or captured in low-light conditions. The minimum size of a person's head to be detected in an image is 10 × 10 pixels. In the course of development, a dataset was prepared containing over 1000 labeled images of classrooms at BSTU n.a. V.G. Shukhov. The markup was carried out using a segmentation software tool specially developed by the authors. Three convolutional neural network architectures were trained for the human head detection task: a fully convolutional network (FCN) with clustering, Faster R-CNN, and Mask R-CNN. The third architecture works more than ten times slower than the first, but it produces almost no false positives and achieves head-detection precision and recall over 90% on both the test and training samples. Faster R-CNN gives worse accuracy than Mask R-CNN but fewer false positives than the FCN with clustering. Based on Mask R-CNN, the authors have developed software for human head detection in low-quality images. It is a two-level web service with client and server modules, used to detect and count people in premises. The developed software works with IP cameras, which ensures its scalability across different practical computer vision applications.
Article
There has been significant progress in image object detection in recent years. Nevertheless, video object detection has received little attention, although it is more challenging and more important in practical scenarios. Building upon recent works, this work proposes a unified approach based on the principle of multi-frame end-to-end learning of features and cross-frame motion. Our approach extends prior works with three new techniques and steadily pushes forward the performance envelope (the speed-accuracy tradeoff), towards high-performance video object detection.
Article
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
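The reshaped cross-entropy can be written out directly. A minimal scalar sketch of the α-balanced binary focal loss with the commonly cited defaults γ = 2, α = 0.25 (the function name is ours):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p: predicted probability of the foreground class.
    y: ground-truth label (1 = foreground, 0 = background).
    The (1 - p_t)**gamma modulating factor down-weights
    well-classified examples so hard examples dominate the loss.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 this reduces to α-weighted cross entropy; raising γ shrinks the loss of easy examples (p_t close to 1) by orders of magnitude, which is what keeps the huge set of easy negatives from overwhelming a dense detector.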
Article
In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector trained with a low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade with increasing IoU thresholds. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next, higher-quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at https://github.com/zhaoweicai/cascade-rcnn.
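The effect of resampling progressively improved hypotheses can be illustrated with a toy sketch. The fixed per-stage IoU improvement below is our simplification standing in for the stage-wise box regressor, not the actual model:

```python
def cascade_positive_counts(ious, thresholds, improvement=0.1):
    """Count positive proposals at each cascade stage, assuming each
    stage's box regressor improves every proposal's IoU with its
    ground truth by a fixed amount (a toy stand-in for the
    'progressively improved hypotheses' idea)."""
    counts = []
    for thr in thresholds:
        counts.append(sum(i >= thr for i in ious))
        # Refined boxes feed the next, stricter stage.
        ious = [min(1.0, i + improvement) for i in ious]
    return counts
```

Applying a single detector with thresholds 0.5, 0.6, 0.7 to proposals whose IoUs are 0.45, 0.55, 0.65 leaves fewer positives at each stricter threshold (2, 1, 0), whereas the cascaded refinement keeps the positive set at an equivalent size (2, 2, 2), which is the observation behind the reduced overfitting.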