Chapter

Can Existing 3D Monocular Object Detection Methods Work in Roadside Contexts? A Reproducibility Study


Abstract

Detecting 3D objects in images from urban monocular cameras is essential to enable intelligent monitoring applications for local municipalities' decision-support systems. However, existing detection methods in this domain are mainly focused on autonomous driving and limited to frontal views from sensors mounted on the vehicle. In contrast, to monitor urban areas, local municipalities rely on streams collected from fixed cameras, especially at intersections and in particularly dangerous areas. Such streams represent a rich source of data for applications focused on traffic patterns, road conditions, and potential hazards. In this paper, given the limited availability of large-scale datasets of images from roadside cameras and the time-consuming process of generating real labelled data, we first built a synthetic dataset using the CARLA simulator, which makes dataset creation efficient while keeping the resulting data acceptable for the task. The dataset consists of 7,481 development images and 7,518 test images. Then, we reproduced state-of-the-art models for monocular 3D object detection proven to work well in autonomous driving (e.g., M3DRPN, Monodle, SMOKE, and Kinematic) and tested them on the newly generated dataset. Our results show that our dataset can serve as a reference for future experiments and that state-of-the-art models from the autonomous driving domain do not always generalize well to monocular roadside camera images. Source code and data are available at https://bit.ly/monocular-3d-odt.
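
The abstract does not spell out how the fixed roadside cameras were configured in CARLA, so the following is only a minimal sketch of how a static RGB camera can be spawned at an intersection with the CARLA Python API; the pole height, orientation, resolution, and output path are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch: spawn a fixed roadside RGB camera in CARLA and save frames.
# All poses and camera parameters below are illustrative assumptions.
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

cam_bp = world.get_blueprint_library().find("sensor.camera.rgb")
cam_bp.set_attribute("image_size_x", "1242")  # KITTI-like resolution (assumed)
cam_bp.set_attribute("image_size_y", "375")
cam_bp.set_attribute("fov", "90")

# Hypothetical pole-mounted pose overlooking an intersection (~6 m high, tilted down).
cam_pose = carla.Transform(
    carla.Location(x=-47.0, y=20.0, z=6.0),
    carla.Rotation(pitch=-15.0, yaw=90.0, roll=0.0),
)

# No parent actor: the camera stays fixed at the roadside instead of following a vehicle.
camera = world.spawn_actor(cam_bp, cam_pose)
camera.listen(lambda image: image.save_to_disk(f"out/{image.frame:06d}.png"))
```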


Conference Paper
Urban environments demand effective and efficient 3D object detection from monocular cameras, e.g., for intelligent monitoring or decision support. Unfortunately, the limited availability of large-scale roadside camera datasets and the exclusive focus of existing 3D object detection methods on autonomous driving scenarios pose significant challenges to their practical adoption. In this paper, we conduct a systematic analysis of 3D object detection methods, originally applied to autonomous driving scenarios, on monocular roadside images. Under a common evaluation protocol, based on a synthetic dataset with images from monocular roadside cameras located at intersection areas, we analyze the detection quality achieved by these methods in the roadside context and the influence of key operational parameters. Our study finally highlights open challenges and future directions in this field. © 2024 Copyright held by the owner/author(s).
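
The "common evaluation protocol" suggests a KITTI-style average-precision metric; whether the paper uses exactly this variant is an assumption. Below is a small sketch of interpolated AP over a fixed grid of recall levels, as commonly reported for 3D detection benchmarks (the exact recall grid differs between protocol variants).

```python
# Sketch of interpolated average precision over a fixed recall grid,
# in the spirit of the KITTI evaluation (assumed, not the paper's exact code).
import numpy as np

def interpolated_ap(recall: np.ndarray, precision: np.ndarray, n_points: int = 40) -> float:
    """Mean of the best precision achievable at each sampled recall level."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, n_points):
        above = precision[recall >= r]
        ap += (above.max() if above.size else 0.0) / n_points
    return ap
```
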
Conference Paper
Full-text available
Current perception models in autonomous driving have become notorious for relying heavily on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solution for next-generation industry-level powerful and robust perception models in autonomous driving. However, the research community has generally suffered from a lack of such essential real-world scene data, which hampers the future exploration of fully/semi/self-supervised methods for 3D perception. In this paper, we introduce the ONCE (One millioN sCenEs) dataset for 3D object detection in the autonomous driving scenario. The ONCE dataset consists of 1 million LiDAR scenes and 7 million corresponding camera images. The data is selected from 144 driving hours, which is 20x longer than the largest 3D autonomous driving datasets available (e.g., nuScenes and Waymo), and it is collected across a range of different areas, periods and weather conditions. To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset. We conduct extensive analyses of those methods and provide valuable observations on their performance related to the scale of the data used. Data, code, and more information are available at http://www.once-for-auto-driving.com.
Preprint
Full-text available
Autonomous driving faces great safety challenges for a lack of global perspective and the limitation of long-range perception capabilities. It has been widely agreed that vehicle-infrastructure cooperation is required to achieve Level 5 autonomy. However, there is still NO dataset from real scenarios available for computer vision researchers to work on vehicle-infrastructure cooperation-related problems. To accelerate computer vision research and innovation for Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD), we release DAIR-V2X Dataset, which is the first large-scale, multi-modality, multi-view dataset from real scenarios for VICAD. DAIR-V2X comprises 71254 LiDAR frames and 71254 Camera frames, and all frames are captured from real scenes with 3D annotations. The Vehicle-Infrastructure Cooperative 3D Object Detection problem (VIC3D) is introduced, formulating the problem of collaboratively locating and identifying 3D objects using sensory inputs from both vehicle and infrastructure. In addition to solving traditional 3D object detection problems, the solution of VIC3D needs to consider the temporal asynchrony problem between vehicle and infrastructure sensors and the data transmission cost between them. Furthermore, we propose Time Compensation Late Fusion (TCLF), a late fusion framework for the VIC3D task as a benchmark based on DAIR-V2X. Find data, code, and more up-to-date information at https://thudair.baai.ac.cn/index and https://github.com/AIR-THU/DAIR-V2X.
Conference Paper
Full-text available
Since their appearance, Smart Cities have aimed at improving the daily life of people, helping to make public services smarter and more efficient. Several of these services are often intended to provide better security conditions for citizens and drivers. In this vein, we present HEIMDALL, an AI-based video surveillance system for traffic monitoring and anomalies detection. The proposed system features three main tiers: a ground level, consisting of a set of smart lampposts equipped with cameras and sensors, and an advanced AI unit for detecting accidents and traffic anomalies in real time; a territorial level, which integrates and combines the information collected from the different lampposts, and cross-correlates it with external data sources, in order to coordinate and handle warnings and alerts; a training level, in charge of continuously improving the accuracy of the modules that have to sense the environment. Finally, we propose and discuss an early experimental approach for the detection of anomalies, based on a Faster R-CNN, and adopted in the proposed infrastructure.
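
The HEIMDALL modules themselves are not public; as a rough illustration of the kind of detector the ground level builds on, here is a hedged sketch of running a pretrained Faster R-CNN from torchvision on a single lamppost frame. The file name and confidence threshold are assumptions.

```python
# Illustrative only: off-the-shelf Faster R-CNN inference on one camera frame.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = Image.open("lamppost_frame.png").convert("RGB")  # hypothetical input frame
with torch.no_grad():
    pred = model([to_tensor(frame)])[0]  # dict with 'boxes', 'labels', 'scores'

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.5:  # confidence threshold (assumption)
        print(int(label), [round(v, 1) for v in box.tolist()], float(score))
```
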
Chapter
Full-text available
Perceiving the physical world in 3D is fundamental for self-driving applications. Although temporal motion is an invaluable resource to human vision for detection, tracking, and depth perception, such features have not been thoroughly utilized in modern 3D object detectors. In this work, we propose a novel method for monocular video-based 3D object detection which leverages kinematic motion to extract scene dynamics and improve localization accuracy. We first propose a novel decomposition of object orientation and a self-balancing 3D confidence. We show that both components are critical to enable our kinematic model to work effectively. Collectively, using only a single model, we efficiently leverage 3D kinematics from monocular videos to improve the overall localization precision in 3D object detection while also producing useful by-products of scene dynamics (ego-motion and per-object velocity). We achieve state-of-the-art performance on monocular 3D object detection and the Bird’s Eye View tasks within the KITTI self-driving dataset.
Article
Full-text available
LiDAR-based or RGB-D-based object detection is used in numerous applications, ranging from autonomous driving to robot vision. Voxel-based 3D convolutional networks have been used for some time to enhance the retention of information when processing point cloud LiDAR data. However, problems remain, including a slow inference speed and low orientation estimation performance. We therefore investigate an improved sparse convolution method for such networks, which significantly increases the speed of both training and inference. We also introduce a new form of angle loss regression to improve the orientation estimation performance and a new data augmentation approach that can enhance the convergence speed and performance. The proposed network produces state-of-the-art results on the KITTI 3D object detection benchmarks while maintaining a fast inference speed.
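
The "new form of angle loss regression" mentioned above is, in the SECOND paper, a sine-error loss on the yaw difference, which removes the discontinuity at ±π (a separate direction classifier resolves the remaining 180° ambiguity). A minimal PyTorch sketch of that idea:

```python
# Sine-error yaw regression in the spirit of SECOND (direction classifier omitted).
import torch
import torch.nn.functional as F

def sine_angle_loss(pred_yaw: torch.Tensor, gt_yaw: torch.Tensor) -> torch.Tensor:
    # Regress sin(pred - gt) towards zero instead of the raw angle difference.
    diff = torch.sin(pred_yaw - gt_yaw)
    return F.smooth_l1_loss(diff, torch.zeros_like(diff))
```
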
Conference Paper
Full-text available
Scene parsing aims to assign a class (semantic) label to each pixel in an image; it is a comprehensive analysis of an image. Given the rise of autonomous driving, pixel-accurate environmental perception is expected to be a key enabling technology. However, providing a large-scale dataset for the design and evaluation of scene parsing algorithms, in particular for outdoor scenes, has been difficult: the per-pixel labelling process is prohibitively expensive, limiting the scale of existing datasets. In this paper, we present a large-scale open dataset, ApolloScape, that consists of RGB videos and corresponding dense 3D point clouds. Compared with existing datasets, ours has the following unique properties. The first is its scale: our initial release contains over 140K images, each with a per-pixel semantic mask, and up to 1M images are scheduled. The second is its complexity: captured in various traffic conditions, the number of moving objects averages from tens to over one hundred per image. The third is its 3D attribute: each image is tagged with high-accuracy pose information at cm accuracy, and the static background point cloud has mm relative accuracy. We are able to label this many images through an interactive and efficient labelling pipeline that utilizes the high-quality 3D point cloud. Moreover, our dataset also contains different lane markings based on lane colors and styles. We expect our new dataset can deeply benefit various autonomous-driving-related applications, including but not limited to 2D/3D scene understanding, localization, transfer learning, and driving simulation.
Article
Full-text available
In this paper, we focus on fine-grained recognition of vehicles mainly in traffic surveillance applications. We propose an approach orthogonal to recent advancements in fine-grained recognition (automatic part discovery, bilinear pooling). Also, in contrast to other methods focused on fine-grained recognition of vehicles, we do not limit ourselves to frontal/rear viewpoints but allow the vehicles to be seen from any viewpoint. Our approach is based on 3D bounding boxes built around the vehicles. The bounding box can be automatically constructed from traffic surveillance data. For scenarios where the precise construction cannot be used, we propose a method for estimating the 3D bounding box. The 3D bounding box is used to normalize the image viewpoint by unpacking the image into a plane. We also propose to randomly alter the color of the image and to add a rectangle filled with random noise at a random position in the image when training convolutional neural networks. We have collected a large fine-grained vehicle dataset, BoxCars116k, with 116k images of vehicles from various viewpoints taken by numerous surveillance cameras. We performed a number of experiments which show that our proposed method significantly improves CNN classification accuracy (the accuracy is increased by up to 12 percentage points and the error is reduced by up to 50% compared to CNNs without the proposed modifications). We also show that our method outperforms state-of-the-art methods for fine-grained recognition.
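
The two training-time augmentations described above (random colour alteration and a random-noise rectangle pasted at a random position) can be sketched in a few lines of NumPy; the scaling range and rectangle sizes below are illustrative assumptions, not the values used for BoxCars116k.

```python
# Hedged sketch of the colour-alteration + random-noise-rectangle augmentation.
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """image: HxWx3 uint8 array; returns an augmented copy."""
    out = image.astype(np.float32)
    out *= rng.uniform(0.7, 1.3, size=3)  # per-channel colour alteration (assumed range)
    out = np.clip(out, 0, 255).astype(np.uint8)

    h, w = out.shape[:2]
    rh, rw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
    y, x = rng.integers(0, h - rh), rng.integers(0, w - rw)
    out[y:y + rh, x:x + rw] = rng.integers(0, 256, size=(rh, rw, 3), dtype=np.uint8)
    return out
```
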
Conference Paper
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
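
To make the "default boxes over different aspect ratios and scales per feature map location" concrete, here is a short sketch of how such boxes can be generated for one feature map; the particular scale and aspect-ratio values are assumptions rather than the exact SSD configuration.

```python
# Sketch of SSD-style default (prior) box generation for a single feature map.
import itertools
import math

def default_boxes(fm_size: int, scale: float, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes in relative [0, 1] coordinates."""
    boxes = []
    for i, j in itertools.product(range(fm_size), repeat=2):
        cx, cy = (j + 0.5) / fm_size, (i + 0.5) / fm_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

# e.g. a 38x38 map with scale 0.1 yields 38 * 38 * 3 = 4332 default boxes
print(len(default_boxes(38, 0.1)))
```
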
Article
Face biometrics play a primary role in smart cities, from consumer- to organizational-level applications. This class of technologies has recently been shown to exhibit performance disparities across gender and ethnic groups. Prior work on demographic bias in deep face recognition has tended to focus exclusively on high-resolution images, leaving out low-resolution images captured, for example, by a surveillance camera at a distance. Notable reasons for this focus include the lack of low-resolution face image data sets that report users' demographic attributes. Therefore, demographic bias in low-resolution deep face recognition in the wild still remains under-explored. In this paper, we propose a framework for exploring demographic disparities in low-resolution face recognition at a distance. To this end, we devised a deep generative approach to turn high-resolution face images into their low-resolution counterparts. We then trained state-of-the-art face recognition models on different combinations of high- and low-resolution images. Finally, we assessed the models' demographic disparities on artificially degraded images from five data sets. Our results show that face images artificially degraded through our generative approach are more realistic than those obtained with other statistical changes. Key disparities across gender and ethnic groups exist and urge timely interventions. Source code: https://cutt.ly/FChLfOC
Conference Paper
In this paper, we propose a novel explanatory framework aimed at providing a better understanding of how face recognition models perform as the underlying data characteristics (protected attributes: gender, ethnicity, age; non-protected attributes: facial hair, makeup, accessories, face orientation and occlusion, image distortion, emotions) on which they are tested change. With our framework, we evaluate ten state-of-the-art face recognition models, comparing their fairness in terms of security and usability on two data sets, involving six groups based on gender and ethnicity. We then analyze the impact of image characteristics on model performance. Our results show that trends appearing in a single-attribute analysis disappear or reverse when multi-attribute groups are considered, and that performance disparities are also related to non-protected attributes. Source code: https://cutt.ly/2XwRLiA
Chapter
Nowadays, Smart Cities applications are becoming steadily more popular, thanks to their main objective of improving people's daily habits. The services provided by such applications may be addressed either to the entire digital population or to a specific kind of audience, like drivers and pedestrians. In this sense, this paper describes a Deep Learning solution designed to manage traffic control tasks in Smart Cities. It involves a network of smart lampposts, in charge of directly monitoring the traffic by means of a bullet camera and equipped with an advanced System-on-Module where the data are efficiently processed. In particular, our solution provides both: i) a risk estimation module, and ii) a license plate recognition module. The first module analyses the scene by means of a Faster R-CNN, trained over an ad-hoc set of synthetically generated videos, to estimate the risk of potential traffic anomalies. Concurrently, the license plate recognition module, by leveraging YOLO and Tesseract, retrieves the plate numbers of the vehicles involved. Preliminary experimental findings, from a prototype of the solution applied in a real-world scenario, are provided.
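
As a hedged illustration of the plate-reading step only (the paper's YOLO-based plate detection is assumed to have produced the crop upstream), Tesseract can be invoked through pytesseract roughly as follows; the file name and OCR options are assumptions.

```python
# Illustrative OCR of an already-cropped licence plate with Tesseract.
import pytesseract
from PIL import Image

plate_crop = Image.open("plate_crop.png").convert("L")  # hypothetical cropped plate
text = pytesseract.image_to_string(
    plate_crop,
    config="--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
)
print(text.strip())
```
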
Chapter
We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.1% AP on MS COCO, outperforming all existing one-stage detectors.
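
Corner pooling is the part of CornerNet that is easiest to misread, so here is a compact sketch of the top-left variant: each location takes the maximum activation of everything below it plus the maximum of everything to its right. This cummax-based version assumes NCHW tensors and is meant as a readable equivalent of the paper's dedicated kernels, not their implementation.

```python
# Top-left corner pooling sketch (readable cummax version, NCHW tensors assumed).
import torch

def top_left_corner_pool(feat_top: torch.Tensor, feat_left: torch.Tensor) -> torch.Tensor:
    # For each pixel, max over all rows below it: flip vertically, running max, flip back.
    top = torch.cummax(feat_top.flip([2]), dim=2).values.flip([2])
    # For each pixel, max over all columns to its right: flip horizontally, running max, flip back.
    left = torch.cummax(feat_left.flip([3]), dim=3).values.flip([3])
    return top + left

x = torch.randn(1, 256, 64, 64)
pooled = top_left_corner_pool(x, x)  # same shape as the inputs
```
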
Article
Cloud-connected mobile applications are becoming a popular solution for ubiquitous access to online services, such as cloud data storage platforms. The adoption of such applications has security and privacy implications that are making individuals hesitant to migrate sensitive data to the cloud; thus, new secure authentication protocols are needed. In this article, we propose a continuous-authentication approach integrating physical (face) and behavioral (touch and hand movements) biometrics to control user access to cloud-based mobile services, going beyond one-time login. Experimental results show the security-usability tradeoff achieved by our approach.
Article
We introduce CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation platform supports flexible specification of sensor suites and environmental conditions. We use CARLA to study the performance of three approaches to autonomous driving: a classic modular pipeline, an end-to-end model trained via imitation learning, and an end-to-end model trained via reinforcement learning. The approaches are evaluated in controlled scenarios of increasing difficulty, and their performance is examined via metrics provided by CARLA, illustrating the platform's utility for autonomous driving research. The supplementary video can be viewed at https://youtu.be/Hp8Dz-Zek2E
Conference Paper
The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth-informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on all three KITTI object classes.
Article
Convolutional networks have had great success in image classification and other areas of computer vision. Recent efforts have designed deeper or wider networks to improve performance; as convolutional blocks are usually stacked together, blocks at different depths represent information at different scales. Recent models have explored 'skip' connections to aggregate information across layers, but heretofore such skip connections have themselves been 'shallow', projecting to a single fusion node. In this paper, we investigate new deep-across-layer architectures to aggregate the information from multiple layers. We propose novel iterative and hierarchical structures for deep layer aggregation. The former can produce deep high-resolution representations from a network whose final layers have low resolution, while the latter can effectively combine scale information from all blocks. Results show that our proposed architectures can make use of network parameters and features more efficiently without dictating convolution module structure. We also show transfer of the learned networks to semantic segmentation tasks and achieve better results than alternative networks with baseline training settings.
Article
Due to their complexity, public intersections are challenging locations for drivers. Therefore, the German joint project Ko-PER, which is part of the project initiative Ko-FAS, has equipped a public intersection with several laserscanners and video cameras to generate a comprehensive dynamic model of the ongoing traffic. Results of the intersection perception can be communicated to equipped vehicles by wireless communication. This contribution shares a dataset of the Ko-PER intersection with the research community for further research in the field of multi-object detection and tracking. The dataset consists of sensor data from the laserscanner network and cameras as well as reference data and object labels. With this dataset, we aim to stimulate further research in this area.
Conference Paper
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti.
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 91-99 (2015). https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html
Chen, X., Kundu, K., Zhu, Y., Berneshawi, A.G., Ma, H., Fidler, S., Urtasun, R.: 3D object proposals for accurate object class detection. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 424-432 (2015). https://proceedings.neurips.cc/paper/2015/hash/6da37dd3139aa4d9aa55b8d237ec5d4a-Abstract.html
Deng, Y., Wang, D., Cao, G., Ma, B., Guan, X., Wang, Y., Liu, J., Fang, Y., Li, J.: BAAI-VANJEE roadside dataset: towards the connected automated vehicle highway technologies in challenging environments of China. CoRR abs/2105.14370 (2021). https://arxiv.org/abs/2105.14370
Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV. Lecture Notes in Computer Science, vol. 11218, pp. 765-781. Springer (2018). https://doi.org/10.1007/978-3-030-01264-9_45