Xinshuo Weng

Xinshuo Weng
NVIDIA | Nvidia · Department of Research

PhD
We are hiring research interns (Ph.D. students only)! If you are interested in working with me, please contact me!

About

59
Publications
22,458
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
849
Citations
Introduction
I am a research scientist at NVIDIA Autonomous Vehicle Research. My research interest lies in generative models and 3D computer vision for autonomous systems. I have developed 3D multi-object tracking systems such as AB3DMOT that received >1,300 stars on GitHub. I was awarded a number of fellowships and nominations such as the Qualcomm Innovation Fellowship for 2020 and Facebook Fellowship Finalist for 2021.
Additional affiliations
June 2022 - March 2022
NVIDIA
Position
  • Research Scientist
May 2021 - March 2022
NVIDIA
Position
  • Research Scientist Intern
February 2018 - August 2018
Meta
Position
  • Research Engineer
Education
September 2018 - June 2022
Carnegie Mellon University
Field of study
  • Computer Science
September 2016 - February 2018
Carnegie Mellon University
Field of study
  • Computer Science

Publications

Publications (59)
Preprint
Full-text available
3D multi-object tracking (MOT) is an essential component technology for many real-time applications such as autonomous driving or assistive robotics. However, recent works for 3D MOT tend to focus more on developing accurate systems giving less regard to computational cost and system complexity. In contrast, this work proposes a simple yet accurate...
Conference Paper
Full-text available
Monocular 3D scene understanding tasks, such as object size estimation, heading angle estimation and 3D localization, is challenging. Successful modern day methods for 3D scene understanding require the use of a 3D sensor. On the other hand, single image based methods have significantly worse performance. In this work, we aim at bridging the perfor...
Preprint
Full-text available
3D multi-object tracking (MOT) and trajectory forecasting are two critical components in modern 3D perception systems that require accurate modeling of multi-agent interaction. We hypothesize that it is beneficial to unify both tasks under one framework in order to learn a shared feature representation of agent interaction. To evaluate this hypothe...
Preprint
Full-text available
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. Then the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this stand...
Preprint
Full-text available
Reinforcement learning has steadily improved and outperform human in lots of traditional games since the resurgence of deep neural network. However, these success is not easy to be copied to autonomous driving because the state spaces in real world are extreme complex and action spaces are continuous and fine control is required. Moreover, the auto...
Preprint
Trajectory prediction using deep neural networks (DNNs) is an essential component of autonomous driving (AD) systems. However, these methods are vulnerable to adversarial attacks, leading to serious consequences such as collisions. In this work, we identify two key ingredients to defend trajectory prediction models against adversarial attacks inclu...
Preprint
Photorealistic avatars of human faces have come a long way in recent years, yet research along this area is limited by a lack of publicly available, high-quality datasets covering both, dense multi-view camera captures, and rich facial expressions of the captured subjects. In this work, we present Multiface, a new multi-view, high-resolution human...
Preprint
Full-text available
Multi-Object Tracking (MOT) has rapidly progressed with the development of object detection and re-identification. However, motion modeling, which facilitates object association by forecasting short-term trajectories with past observations, has been relatively under-explored in recent years. Current motion models in MOT typically assume that the ob...
Preprint
Full-text available
Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, there has been less attention given to the principled integration of these components...
Preprint
LiDAR sensors can be used to obtain a wide range of measurement signals other than a simple 3D point cloud, and those signals can be leveraged to improve perception tasks like 3D object detection. A single laser pulse can be partially reflected by multiple objects along its path, resulting in multiple measurements called echoes. Multi-echo measurem...
Preprint
Full-text available
Point clouds and RGB images are naturally complementary modalities for 3D visual understanding - the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this potential for close sensor fusion, many methods train two models in isolation and use simple feature conc...
Preprint
Full-text available
We address the problem of estimating the 3D pose of a network of cameras for large-environment wide-baseline scenarios, e.g., cameras for construction sites, sports stadiums, and public spaces. This task is challenging since detecting and matching the same 3D keypoint observed from two very different camera views is difficult, making standard struc...
Article
Full-text available
Multi-object tracking (MOT) and trajectory forecasting are two critical components in modern 3D perception systems that require accurate modeling of multi-agent interaction. We hypothesize that it is beneficial to unify both tasks under one framework in order to learn a shared feature representation of agent interaction. Furthermore, instead of per...
Preprint
Full-text available
Predicting accurate future trajectories of multiple agents is essential for autonomous systems, but is challenging due to the complex agent interaction and the uncertainty in each agent's future behavior. Forecasting multi-agent trajectories requires modeling two key dimensions: (1) time dimension, where we model the influence of past agent states...
Preprint
Full-text available
We present Supervision by Registration and Triangulation (SRT), an unsupervised approach that utilizes unlabeled multi-view video to improve the accuracy and precision of landmark detectors. Being able to utilize unlabeled data enables our detectors to learn from massive amounts of unlabeled data freely available and not be limited by the quality a...
Preprint
Full-text available
Multi-object tracking (MOT) and trajectory forecasting are two critical components in modern 3D perception systems that require accurate modeling of multi-agent interaction. We hypothesize that it is beneficial to unify both tasks under one framework in order to learn a shared feature representation of agent interaction. Furthermore, instead of per...
Preprint
Full-text available
3D multi-object tracking is an important component in robotic perception systems such as self-driving vehicles. Recent work follows a tracking-by-detection pipeline, which aims to match past tracklets with detections in the current frame. To avoid matching with false positive detections, prior work filters out detections with low confidence scores...
Conference Paper
Full-text available
Many autonomous systems forecast aspects of the future in order to aid decision-making. For example, self-driving vehicles and robotic manipulation systems often forecast future object poses by first detecting and tracking objects. However, this detect-then-forecast pipeline is expensive to scale, as pose forecasting algorithms typically require la...
Conference Paper
Full-text available
Figure 1: Given a probe image and ten gallery figures from our synthesized dataset (five from same identity and five with same color of clothes as the probe), we aim to prioritize matching the images with same body shape though in different wearings while previous work [59] is dominated by clothing color information. Abstract Person re-identificati...
Preprint
Full-text available
Object detection and data association are critical components in multi-object tracking (MOT) systems. Despite the fact that the two components are dependent on each other, prior work often designs detection and data association modules separately which are trained with different objectives. As a result, we cannot back-propagate the gradients and op...
Preprint
Full-text available
3D multi-object tracking (MOT) and trajectory forecasting are two critical components in modern 3D perception systems. We hypothesize that it is beneficial to unify both tasks under one framework to learn a shared feature representation of agent interaction. To evaluate this hypothesis, we propose a unified solution for 3D MOT and trajectory foreca...
Preprint
Full-text available
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work often uses a tracking-by-detection pipeline, where the feature of each object is extracted independently to compute an affinity matrix. Then, the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this pipeline is to learn discrimin...
Preprint
Full-text available
3D multi-object tracking (MOT) is essential to applications such as autonomous driving. Recent work focuses on developing accurate systems giving less attention to computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT system with strong performance. Our system first obtains 3D detections from a LiDAR po...
Conference Paper
Full-text available
3D multi-object tracking (MOT) is essential to applications such as autonomous driving. Recent work focuses on developing accurate systems giving less attention to computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT system with strong performance. Our system first obtains 3D detections from a LiDAR po...
Conference Paper
Full-text available
3D Multi-object tracking (MOT) is crucial to autonomous systems. Recent work often uses a tracking-by-detection pipeline, where the feature of each object is extracted independently to compute an affinity matrix. Then, the affinity matrix is passed to the Hungarian algorithm for data association. A key process of this pipeline is to learn discrimin...
Conference Paper
Full-text available
Predicting the future is a crucial first step to effective control. In this work, we study the problem of future prediction of 3D scenes, represented by point clouds captured by a LiDAR sensor, i.e., directly forecasting the evolution of >100,000 points that comprise a complete scene. We term this Sequential Pointcloud Forecasting (SPF). By directl...
Conference Paper
Full-text available
3D multi-object tracking (MOT) is an essential component for many applications such as autonomous driving and assistive robotics. Recent work on 3D MOT focuses on developing accurate systems giving less attention to practical considerations such as computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT s...
Preprint
Full-text available
Object detection and data association are critical components in multi-object tracking (MOT) systems. Despite the fact that these two components are highly dependent on each other, one popular trend in MOT is to perform detection and data association as separate modules, processed in a cascaded order. Due to this cascaded process, the resulting MOT...
Preprint
Full-text available
We aim to enable robots to visually localize a target person through the aid of an additional sensing modality -- the target person's 3D inertial measurements. The need for such technology may arise when a robot is to meet person in a crowd for the first time or when an autonomous vehicle must rendezvous with a rider amongst a crowd without knowing...
Article
Full-text available
We present Supervision by Registration and Triangulation (SRT), an unsupervised approach that utilizes unlabeled multi-view video to improve the accuracy and precision of landmark detectors. Being able to utilize unlabeled data enables our detectors to learn from massive amounts of unlabeled data freely available and not be limited by the quality a...
Preprint
Full-text available
Predicting the future is a crucial first step to effective control, since systems that can predict the future can select plans that lead to desired outcomes. In this work, we study the problem of future prediction at the level of 3D scenes, represented by point clouds captured by a LiDAR sensor, i.e., directly learning to forecast the evolution of...
Preprint
Full-text available
Person re-identification (re-ID) aims to recognize instances of the same person contained in multiple images taken across different cameras. Existing methods for re-ID tend to rely heavily on the assumption that both query and gallery images of the same person have the same clothing. Unfortunately, this assumption may not hold for datasets captured...
Conference Paper
Full-text available
We focus on the word-level visual lipreading, which requires recognizing the word being spoken, given only the video but not the audio. State-of-the-art methods explore the use of end-to-end neural networks, including a shallow (up to three layers) 3D con-volutional neural network (CNN) + a deep 2D CNN (e.g., ResNet) as the front-end to extract vis...
Conference Paper
Full-text available
We focus on estimating the 3D orientation of the ground plane from a single image. We formulate the problem as an inter-mingled multi-task prediction problem by jointly optimizing for pixel-wise surface normal direction, ground plane segmentation, and depth estimates. Specifically, our proposed model, GroundNet, first estimates the depth and surfac...
Conference Paper
Full-text available
We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot being pushed by its user and other nearby pedestrians. We develop a purely image-based deep learning approach that directly estimates the time to collision without the need of relying on explicit geometric depth estimates...
Preprint
Full-text available
Monocular 3D scene understanding tasks, such as object size estimation, heading angle estimation and 3D lo-calization, is challenging. Successful modern day methods for 3D scene understanding require the use of a 3D sensor such as a depth camera, a stereo camera or LiDAR. On the other hand, single image based methods have significantly worse perfor...
Preprint
Full-text available
Figure 1. Results of our approach: (a) input image; (b)(c) geometrically-consistent depth and surface normal estimation; (d) ground segmentation; (e) ground plane normal estimation; (f) computed horizon line from the estimated ground plane normal. Abstract We focus on the problem of estimating the 3D orientation of the ground plane from a single im...
Preprint
Full-text available
We focus on the word-level visual lipreading, which requires recognizing the word being spoken, given only the video but not the audio. State-of-the-art methods explore the use of end-to-end neural networks, including a shallow (up to three layers) 3D convolutional neural network (CNN) + a deep 2D CNN (\emph{e.g.}, ResNet) as the front-end to extra...
Preprint
Full-text available
Monocular 3D scene understanding tasks, such as object size estimation, heading angle estimation and 3D localization, is challenging. Successful modern day methods for 3D scene understanding require the use of a 3D sensor such as a depth camera, a stereo camera or LiDAR. On the other hand, single image based methods have significantly worse perform...
Preprint
We focus on the word-level visual lipreading, which requires to decode the word from the speaker's video. Recently, many state-of-the-art visual lipreading methods explore the end-to-end trainable deep models, involving the use of 2D convolutional networks (e.g., ResNet) as the front-end visual feature extractor and the sequential model (e.g., Bi-L...
Preprint
Full-text available
We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot being pushed by its user and other nearby pedestrians. We develop a purely image-based deep learning approach that directly estimates the time to collision without the need of relying on explicit geometric depth estimates...
Preprint
Full-text available
Most existing methods for object segmentation in computer vision are formulated as a labeling task. This, in general, could be transferred to a pixel-wise label assignment task, which is quite similar to the structure of hidden Markov random field. In terms of Markov random field, each pixel can be regarded as a state and has a transition probabili...
Preprint
Full-text available
Across a majority of modern learning-based tracking systems, expensive annotations are needed to achieve state-of-the-art performance. In contrast, the Lucas-Kanade (LK) algorithm works well without any annotation. However, LK has a strong assumption of photometric (brightness) consistency on image intensity and is easy to drift because of large mo...
Preprint
Full-text available
We focus on the problem of estimating the orientation of the ground plane with respect to a mobile monocular camera platform (e.g., ground robot, wearable camera, assistive robotic platform). To address this problem, we formulate the ground plane estimation problem as an inter-mingled multi-task prediction problem by jointly optimizing for point-wi...
Preprint
Full-text available
In this paper, we present supervision-by-registration, an unsupervised approach to improve the precision of facial landmark detectors on both images and video. Our key observation is that the detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow. Interestingly, the coherency of optical flow is a...
Article
Full-text available
Pedestrian detection performance has steadily improved on a variety of benchmark datasets such as Caltech, KITTI, INRIA and ETH, since the resurgence of deep neural networks. Across a majority of pedestrian datasets, it is assumed that pedestrians will be standing upright with respect to the image coordinate system. This assumption however, is not...