Jonah Philion's research while affiliated with NVIDIA and other places

What is this page?


This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

ResearchGate created this page automatically to provide a record of this author's body of work. We create such pages to advance our goal of maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (28)


Towards Viewpoint Robustness in Bird’s Eye View Segmentation
  • Conference Paper

October 2023 · 2 Reads · Jonah Philion · [...]


Figure 7. Evaluation Data: We use images from NVIDIA DRIVE Sim [20] to evaluate our method on a diverse set of target rigs. Shown here are example test images with different viewpoints.
Towards Viewpoint Robustness in Bird's Eye View Segmentation
  • Preprint
  • File available

September 2023 · 49 Reads

Autonomous vehicles (AV) require that neural networks used for perception be robust to different viewpoints if they are to be deployed across many types of vehicles without the repeated cost of data collection and labeling for each. AV companies typically focus on collecting data from diverse scenarios and locations, but not camera rig configurations, due to cost. As a result, only a small number of rig variations exist across most fleets. In this paper, we study how AV perception models are affected by changes in camera viewpoint and propose a way to scale them across vehicle types without repeated data collection and labeling. Using bird's eye view (BEV) segmentation as a motivating task, we find through extensive experiments that existing perception models are surprisingly sensitive to changes in camera viewpoint. When trained with data from one camera rig, small changes to pitch, yaw, depth, or height of the camera at inference time lead to large drops in performance. We introduce a technique for novel view synthesis and use it to transform collected data to the viewpoint of target rigs, allowing us to train BEV segmentation models for diverse target rigs without any additional data collection or labeling cost. To analyze the impact of viewpoint changes, we leverage synthetic data to mitigate other gaps (content, ISP, etc). Our approach is then trained on real data and evaluated on synthetic data, enabling evaluation on diverse target rigs. We release all data for use in future work. Our method is able to recover an average of 14.7% of the IoU that is otherwise lost when deploying to new rigs.
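The sensitivity described above is to small geometric changes in the camera rig. As a concrete illustration of what such a change means, the sketch below expresses a pitch and height perturbation as an edit to a camera-to-ego extrinsic matrix; the function names, axis conventions, and example values are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch (not the paper's code): a camera-rig change expressed as
# a perturbation of the camera-to-ego extrinsic, i.e. the kind of pitch/height
# shift the paper finds BEV models to be sensitive to.
import numpy as np

def rot_pitch(deg: float) -> np.ndarray:
    """Rotation about the camera's lateral (x) axis by `deg` degrees."""
    r = np.deg2rad(deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(r), -np.sin(r)],
                     [0.0, np.sin(r),  np.cos(r)]])

def perturb_extrinsic(T_cam_to_ego: np.ndarray,
                      pitch_deg: float = 0.0,
                      height_m: float = 0.0) -> np.ndarray:
    """Return a new 4x4 extrinsic with the camera pitched and raised.

    Assumes T_cam_to_ego maps camera coordinates to ego coordinates and that
    the ego frame is z-up (both are assumptions for this illustration).
    """
    T = T_cam_to_ego.copy()
    T[:3, :3] = T[:3, :3] @ rot_pitch(pitch_deg)   # tilt in the camera frame
    T[2, 3] += height_m                            # raise the mount point
    return T

# Example: the source rig perturbed by +3 degrees pitch and +0.3 m height.
T_source = np.eye(4)
T_target = perturb_extrinsic(T_source, pitch_deg=3.0, height_m=0.3)
```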


How Much More Data Do I Need? Estimating Requirements for Downstream Tasks

July 2022 · 62 Reads

Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
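The baseline idea the abstract starts from, fitting a saturating power law to a learning curve and inverting it to estimate a required dataset size, can be sketched in a few lines. The functional form, initial guesses, and data points below are illustrative assumptions; the paper's point is precisely that this naive extrapolation needs a tuned correction factor and multiple collection rounds to become reliable.

```python
# Hedged sketch: fit a saturating power law to observed (dataset size, score)
# pairs and invert it to estimate the data needed for a target score.
# The functional form and numbers are illustrative, not the paper's estimator.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # score(n) = c - a * n^(-b): rises toward an asymptote c as data grows
    return c - a * np.power(n, -b)

# Observed learning-curve points (illustrative).
sizes  = np.array([1e3, 2e3, 5e3, 1e4, 2e4])
scores = np.array([0.52, 0.58, 0.65, 0.70, 0.74])

(a, b, c), _ = curve_fit(power_law, sizes, scores,
                         p0=[1.0, 0.3, 0.9], maxfev=10000)

target = 0.80
if target < c:
    # Invert the fit: n = (a / (c - target))^(1 / b)
    n_needed = (a / (c - target)) ** (1.0 / b)
    print(f"Estimated samples for score {target}: {n_needed:.0f}")
else:
    print("Target exceeds the fitted asymptote; extrapolation is unreliable.")
```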




M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

April 2022 · 107 Reads

In this paper, we propose M²BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works which separately process detection and segmentation, M²BEV infers both tasks with a unified model and improves efficiency. M²BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in ego-car coordinates. Such BEV representation is important as it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) An efficient BEV encoder design that reduces the spatial dimension of a voxel feature map. (2) A dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes with anchors. (3) A BEV centerness re-weighting that assigns larger weights to more distant predictions, and (4) Large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit the ill-posed camera-based 3D perception tasks where depth information is missing. M²BEV is memory efficient, allowing significantly higher resolution images as input, with faster inference speed. Experiments on nuScenes show that M²BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU in these two tasks, respectively.
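The central step the abstract describes, transforming multi-view 2D image features into a 3D BEV feature volume in ego-car coordinates, can be sketched as projecting ego-frame voxel centers into each camera and gathering the corresponding image features. The sketch below is a simplified illustration under assumed conventions (ego-to-camera extrinsics, feature maps at full image resolution); it is not the released M²BEV implementation.

```python
# Illustrative 2D-to-BEV lifting: project ego-frame voxel centers into each
# camera image and average the sampled 2D features over the cameras that see
# each voxel. Conventions and shapes are assumptions for this sketch.
import torch

def lift_to_bev(feats_2d, intrinsics, extrinsics, voxel_centers):
    """
    feats_2d:      (N_cam, C, H, W) image feature maps
    intrinsics:    (N_cam, 3, 3) camera intrinsics (assumed in feature pixels)
    extrinsics:    (N_cam, 4, 4) ego-to-camera transforms
    voxel_centers: (V, 3) voxel center coordinates in the ego frame
    returns:       (V, C) features averaged over the cameras seeing each voxel
    """
    n_cam, C, H, W = feats_2d.shape
    V = voxel_centers.shape[0]
    acc = torch.zeros(V, C)
    hits = torch.zeros(V, 1)

    homog = torch.cat([voxel_centers, torch.ones(V, 1)], dim=1)  # (V, 4)
    for i in range(n_cam):
        cam_pts = (extrinsics[i] @ homog.T).T[:, :3]             # ego -> camera
        z = cam_pts[:, 2]
        pix = (intrinsics[i] @ cam_pts.T).T                      # pinhole projection
        u = pix[:, 0] / z.clamp(min=1e-5)
        v = pix[:, 1] / z.clamp(min=1e-5)
        valid = (z > 0.1) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        ui, vi = u[valid].long(), v[valid].long()
        acc[valid] += feats_2d[i, :, vi, ui].T                   # (num_valid, C)
        hits[valid] += 1

    return acc / hits.clamp(min=1)                               # mean over views
```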


M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

April 2022 · 90 Reads · 108 Citations

In this paper, we propose M²BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works which separately process detection and segmentation, M²BEV infers both tasks with a unified model and improves efficiency. M²BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in ego-car coordinates. Such BEV representation is important as it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) An efficient BEV encoder design that reduces the spatial dimension of a voxel feature map. (2) A dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes with anchors. (3) A BEV centerness re-weighting that assigns larger weights to more distant predictions, and (4) Large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit the ill-posed camera-based 3D perception tasks where depth information is missing. M²BEV is memory efficient, allowing significantly higher resolution images as input, with faster inference speed. Experiments on nuScenes show that M²BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU in these two tasks, respectively.
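Of the four designs listed above, the BEV centerness re-weighting is the easiest to illustrate in isolation: BEV cells farther from the ego vehicle receive a larger loss weight so that distant, harder predictions are not drowned out. The weighting formula below is an assumption chosen for illustration, not necessarily the one used in the paper.

```python
# Toy illustration of distance-based re-weighting in the BEV plane: cells
# farther from the ego vehicle get a larger loss weight. The exact formula is
# an assumption for illustration, not necessarily the one used in M²BEV.
import torch

def bev_distance_weights(x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                         resolution=0.5, alpha=1.0):
    xs = torch.arange(x_range[0], x_range[1], resolution)
    ys = torch.arange(y_range[0], y_range[1], resolution)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    dist = torch.sqrt(gx**2 + gy**2)              # distance from the ego (m)
    return 1.0 + alpha * dist / dist.max()        # weights in [1, 1 + alpha]

weights = bev_distance_weights()
# weighted_loss = (weights * per_cell_loss).mean()
```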


Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior

December 2021 · 49 Reads

Evaluating and improving planning for autonomous vehicles requires scalable generation of long-tail traffic scenarios. To be useful, these scenarios must be realistic and challenging, but not impossible to drive through safely. In this work, we introduce STRIVE, a method to automatically generate challenging scenarios that cause a given planner to produce undesirable behavior, like collisions. To maintain scenario plausibility, the key idea is to leverage a learned model of traffic motion in the form of a graph-based conditional VAE. Scenario generation is formulated as an optimization in the latent space of this traffic model, effected by perturbing an initial real-world scene to produce trajectories that collide with a given planner. A subsequent optimization is used to find a "solution" to the scenario, ensuring it is useful to improve the given planner. Further analysis clusters generated scenarios based on collision type. We attack two planners and show that STRIVE successfully generates realistic, challenging scenarios in both cases. We additionally "close the loop" and use these scenarios to optimize hyperparameters of a rule-based planner.
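The scenario-generation loop the abstract describes, optimizing in the latent space of a learned traffic model so that the decoded scene induces a collision while a prior term keeps it plausible, can be sketched as follows. Here `decode_traffic` and `plan_ego` are placeholder callables standing in for the paper's graph-based conditional VAE and the planner under test, and the loss terms are simplified assumptions rather than STRIVE's actual objective.

```python
# Hedged sketch of latent-space scenario optimization: perturb a latent code of
# a (placeholder) traffic model so the decoded adversary trajectory approaches
# the planner's trajectory, while a prior term keeps the latent plausible.
import torch

def optimize_scenario(z_init, decode_traffic, plan_ego, steps=200, lr=0.05,
                      prior_weight=0.1):
    """z_init: latent encoding of a real scene; returns an adversarial latent."""
    z = z_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        adv_traj = decode_traffic(z)    # (T, 2) adversary positions
        ego_traj = plan_ego(adv_traj)   # (T, 2) planner response (assumed differentiable here)
        # Encourage a near-collision: minimize the closest approach distance.
        min_gap = torch.linalg.norm(adv_traj - ego_traj, dim=-1).min()
        # Keep the scenario plausible by staying close to the traffic prior.
        prior = prior_weight * (z ** 2).sum()
        loss = min_gap + prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```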


Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation

November 2021 · 10 Reads

Autonomous driving relies on a huge volume of real-world data to be labeled to high precision. Alternative solutions seek to exploit driving simulators that can generate large amounts of labeled data with a plethora of content variations. However, the domain gap between the synthetic and real data remains, raising the following important question: What are the best ways to utilize a self-driving simulator for perception tasks? In this work, we build on top of recent advances in domain-adaptation theory, and from this perspective, propose ways to minimize the reality gap. We primarily focus on the use of labels in the synthetic domain alone. Our approach introduces both a principled way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator. Our method is easy to implement in practice as it is agnostic of the network architecture and the choice of the simulator. We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data (cameras, lidar) using an open-source simulator (CARLA), and evaluate the entire framework on a real-world dataset (nuScenes). Last but not least, we show what types of variations (e.g. weather conditions, number of assets, map design, and color diversity) matter to perception networks when trained with driving simulators, and which ones can be compensated for with our domain adaptation technique.
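One standard member of the family of domain-adaptation techniques the abstract refers to is a domain-adversarial objective: a domain classifier learns to tell synthetic from real features while a gradient-reversal layer pushes the feature extractor toward representations the classifier cannot separate, with task labels coming from the synthetic domain only. The sketch below illustrates that general recipe; the network shapes and loss weighting are assumptions, not the paper's exact formulation.

```python
# Illustrative DANN-style setup for learning features that a domain classifier
# cannot separate (sim vs. real). This shows the general family of techniques;
# it is not claimed to be the paper's exact objective.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

feature_net = nn.Sequential(nn.Linear(256, 128), nn.ReLU())  # assumed input dim
task_head   = nn.Linear(128, 2)   # e.g. vehicle / background per BEV cell
domain_head = nn.Linear(128, 2)   # sim vs. real

def losses(x_sim, y_sim, x_real):
    # Task loss uses labels from the synthetic domain only.
    f_sim, f_real = feature_net(x_sim), feature_net(x_real)
    task_loss = nn.functional.cross_entropy(task_head(f_sim), y_sim)
    # Domain loss: the reversed gradient pushes features toward invariance.
    feats  = torch.cat([f_sim, f_real])
    labels = torch.cat([torch.zeros(len(f_sim)), torch.ones(len(f_real))]).long()
    domain_loss = nn.functional.cross_entropy(domain_head(grad_reverse(feats)), labels)
    return task_loss + domain_loss
```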


Citations (12)


... Additionally, falsification-based testing strategies [15] have been introduced to discern the safety thresholds of the VUT with greater efficiency. Some recent studies generate challenging test cases by using naturalistic driving data (NDD) [16]. ...

Reference:

Wang Wishart et al 2024 - Comprehensive Evaluation of Behavioral Competence of an AV using the DA Methodology
Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior
  • Citing Conference Paper
  • June 2022

... In recent years, significant progress has been made in the field of neural networks, particularly in the domain of biomedical image analysis (1). Their performance gains, however, often require large numbers of samples for training (2,3) in accordance with increased model complexity (4). ...

How Much More Data Do I Need? Estimating Requirements for Downstream Tasks
  • Citing Conference Paper
  • June 2022

... In order to overcome some of the model-driven simulations' limitations, a lot of research is put into data-driven approaches today. Various AD companies and research groups have demonstrated impressive results in this domain recently [3,16,27,56]. Data-driven simulators eliminate the need for explicit (3D-, behavior-, sensor-, etc.) models, but instead involve (generative) neural networks that learn to reconstruct or synthesize virtual worlds and behaviors from real-world data with little to no human supervision. They aim to produce photo-realistic outputs, are characterized by a small domain and appearance gap, and can be scaled easily, since they do not require a human in the loop. ...

DriveGAN: Towards a Controllable High-Quality Neural Simulation
  • Citing Article
  • January 2021

... BEV-based scene representations collapse 3D features onto the Bird's Eye View (BEV) plane and achieve a good balance between accuracy and efficiency. They show their effectiveness in 3D object detection (Li et al. 2022c; Huang et al. 2021; Li et al. 2022b,a; Zhang et al. 2022) and BEV segmentation (Philion and Fidler 2020; Li et al. 2022c; Hu et al. 2021; Xie et al. 2022) but are not applicable ...

M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation
  • Citing Article
  • April 2022

... where w_r(o, ω_o, t) is the same as that in Equation (1). As suggested by many scene understanding works (e.g., Xing et al. 2018; Zhou, Yu, and Jacobs 2019; Wang et al. 2021), we assume the BRDF in irradiance fields to be diffuse (Lambertian model), i.e., f_r only depends on the position of the 3D point x while being free from the view direction, ...

Learning Indoor Inverse Rendering with 3D Spatially-Varying Lighting
  • Citing Conference Paper
  • October 2021

... The author introduces Game GAN, a generative model that, through training with screenplay and keyboard input, learns to graphically mimic the desired game. In response to a key the agent has pressed, Game GAN uses a well-built GAN to "draw" the subsequent screen [26]. ...

Learning to Simulate Dynamic Environments With GameGAN
  • Citing Article
  • January 2020

... The powerful capability of VLMs can be used to generate data [198]- [201]. For example, DriveGAN [202] generated high-quality AD data by disentangling different components without supervision. Additionally, due to the capability of world models for comprehending driving environments, some works [203]- [205] explored world models to generate high-quality driving videos. ...

DriveGAN: Towards a Controllable High-Quality Neural Simulation
  • Citing Conference Paper
  • June 2021

... A primary concern of automation is its ability to integrate with existing systems, which, when considering road usage, have a strong social element [1]. On top of optimising aspects of driving that naturally lend themselves to reward metrics, i.e., time to destination, fuel efficiency and lane discipline [2], vehicle automation must learn valid and useful behaviours when interacting with pedestrians and other road users [3], [4]. A key challenge researchers face in building such an agent is designing a suitable reward function to elicit such behaviours [5]. ...

Emergent Road Rules in Multi-Agent Driving Environments

... HD maps are significant to deploying autonomous driving systems, and a number of studies have utilized HD maps for localization [9], [28], motion forecasting [29], [30], and planning [31]. Furthermore, online HD map construction [14], [15], [16] has garnered considerable research attention, but there has been no attempt to utilize maps for place recognition tasks. ...

Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
  • Citing Chapter
  • November 2020

... They extend the mean Average Precision metric by not only considering distance to a detected pedestrian but also the corresponding time-to-collision. Philion et al. [81] measure importance of a perceived object by removing it from their scene and then assessing whether the decision of a subsequent planning component significantly changes. Moreover, their corresponding metric can also consider the effect of synthetically added phantom objects. ...

Learning to Evaluate Perception Models Using Planner-Centric Metrics
  • Citing Conference Paper
  • June 2020