BOSTON UNIVERSITY
GRADUATE SCHOOL OF ARTS AND SCIENCES
Thesis
TOWARDS CLOSING THE GENERALIZATION GAP IN
AUTONOMOUS DRIVING
by
ANIMIKH AICH
B.E., Visvesvaraya Technological University, 2019
Submitted in partial fulfillment of the
requirements for the degree of
Master of Science
2024
©2024 by
ANIMIKH AICH
All rights reserved
Approved by
First Reader
Eshed Ohn-Bar, PhD
Assistant Professor of Electrical and Computer Engineering
Second Reader
Renato Mancuso, PhD
Assistant Professor of Electrical and Computer Engineering
Third Reader
Wenchao Li, PhD
Assistant Professor of Electrical and Computer Engineering
Acknowledgments
I’m immensely grateful to my advisor, Prof. Eshed Ohn-Bar for his steadfast support,
invaluable guidance, and continuous encouragement during my thesis journey. His
expertise, patience, and insightful feedback have shaped this work and guided me
towards excellence.
I’m deeply thankful to Prof. Eshed Ohn-Bar and Boston University for providing
crucial computational resources and platforms essential for the research in this the-
sis. Their support enabled complex simulations and analyses, greatly enhancing the
quality and depth of this work.
I extend my heartfelt appreciation to my friends and research colleagues who have
been an integral part of this project. Their insightful ideas, stimulating discussions,
and unwavering support have been invaluable in shaping my thoughts and refining
the direction of this research.
Last but not least, I am deeply indebted to my family for their unconditional
love, encouragement, and understanding throughout this exceptional journey. Their
unwavering support and belief in me have been my source of strength, motivating me
to persevere through challenges and strive for excellence.
To all those mentioned above, and to countless others who have contributed in
ways big and small, I extend my sincerest thanks. This thesis would not have been
possible without your support, guidance, and encouragement.
Animikh Aich
Computer Science Department
TOWARDS CLOSING THE GENERALIZATION GAP IN
AUTONOMOUS DRIVING
ANIMIKH AICH
ABSTRACT
Autonomous driving research faces significant challenges in transitioning from
simulation-based evaluations to real-world implementations. While simulation en-
vironments offer controlled settings for training driving agents, real-world scenarios
introduce unforeseen complexities crucial for assessing the robustness and adaptabil-
ity of these agents. This study addresses two pivotal questions in autonomous driving
research: (1) the translation of simulated experiences to a real-world environment,
and (2) the correlation between offline evaluation metrics and closed-loop driving
performance.
To address the first question, we employ a novel method using pre-trained foun-
dation models to abstract vision input. This allows us to train driving policies in
simulation and assess their performance with real-world data, investigating the effec-
tiveness of Sim2Real for driving scenarios.
For the second question, we analyze the relationship between a selected set of
offline metrics and established closed-loop metrics in both simulation and real-world
contexts. By comparing their performance, we aim to ascertain the efficacy of offline
evaluations in predicting closed-loop driving behavior.
Our research aims to bridge the gap between simulation and real-world environ-
ments, understanding the efficacy of open-loop evaluation in autonomous driving.
Contents

1 Introduction
1.1 Motivation
1.2 Problem Formulation
1.3 Organization

2 Related Work
2.1 Conditional Imitation Learning
2.2 Transfer from Simulation to the Real World
2.3 Open and Closed Loop Evaluation

3 Sim2Real: Transfer Simulation Experiences to the Real World
3.1 Data and Setup
3.1.1 Training Setup
3.1.2 Data Collection
3.1.3 Real World Setup
3.1.4 Real World Expert Dataset
3.2 Methodology
3.2.1 Data Abstraction for Sim2Real
3.2.2 Network Architecture
3.3 Results and Discussions
3.3.1 Indoor Experiments
3.3.2 Outdoor Experiments
3.4 Conclusion

4 Correlation Analysis: Offline and Online Evaluation Metrics
4.1 Data and Setup
4.2 Metrics
4.2.1 Pearson Correlation Coefficient
4.2.2 Open-Loop Metrics
4.2.3 Closed-Loop Metrics
4.2.4 Real World Metrics
4.3 Methodology
4.3.1 Network Architecture
4.3.2 Model Sampling
4.3.3 Model Evaluation
4.4 Results and Discussion
4.4.1 Simulation Experiments
4.4.2 Real World Experiments
4.5 Conclusion

5 Conclusions
5.1 Contributions
5.2 Future Work
5.3 Limitations & Broader Impacts

References

Curriculum Vitae
List of Tables

3.1 Sim2Real Indoor Results (MAE): This table shows offline evaluation results of five model architectures for indoor driving, indicating performance in steering and action MAE across various input configurations. While models incorporating additional information like semantic segmentation and depth show promise with clear improvements over the baseline, no clear winner emerges due to variability attributed to abstract input methods. Among Methods, ‘D’ represents Depth, and ‘C’ denotes contours. Similarly, SMAE and AMAE represent MAE errors based on steer and action (steer + throttle) respectively, and SW represents their speed-weighted versions.

3.2 Sim2Real Outdoor Results (MAE): Diverse model architectures and input setups show an improvement over the baseline while showing no clear top-performing model or abstraction. RegNetY-008 stands out with multi-abstraction inputs, RegNetY-004 performs well with raw RGB with depth, and PilotNet improves with mask-based architectures. Among Methods, ‘D’ represents Depth, and ‘C’ denotes contours. Similarly, SMAE and AMAE represent MAE errors based on steer and action (steer + throttle) respectively, and SW represents their speed-weighted versions.
List of Figures

3·1 Real-world Outdoor and Indoor Setup: Outdoor data collection (left) involves a remote-controlled car (xma, 2024) with a Raspberry Pi, a joystick controller, and an Android smartphone following the red path. Indoor data collection (right) employs an AgileX LIMO (agi, 2024) with an Nvidia Jetson Nano, set within a miniature urban scenario with marked lanes and objects.

3·2 Sim2Real Architecture: Stage 1 is input data abstraction using foundation models to reduce the domain gap between simulation and the real world. Stage 2 is training or inference of the Sim2Real policy network by training the CIL network in simulation and evaluating it in the real world with abstracted inputs.

3·3 Data Abstraction: This grid of images shows qualitative examples of abstracted inputs for both simulation (top row) and the real world (bottom row). From left to right: RGB image, depth image, segmentation masks, contours for those masks, and segmentation mask + contours. We use the foundation models Depth Anything and Segment Anything (SAM) for these abstractions.

4·1 Real-world CIL Network Architecture: A 640x480 image I is given as input along with a one-hot encoded command C to predict steer S and throttle T. The encoded command C is added to the network multiple times at different stages to improve the robustness of the policy.

4·2 Simulation Correlation Plot: Results from 110 models highlight strong correlations between the driving score and the ADE and action-based metrics. A lack of correlation is observed for stop interactions due to the models’ disregard for stop signs. TRE also shows a good overall correlation.

4·3 Real World Correlation Analysis: Indoor analysis (left) showed consistent correlations between offline metrics (e.g., action MSE) and online metrics, except for TRE and QCE. Outdoor analysis (right) exhibited more varied correlations, likely due to single-route evaluations and fewer data points compared to simulation.
List of Abbreviations

AMAE        Action-Mean Absolute Error
BC          Behavior Cloning
CIL         Conditional Imitation Learning
CR          Collision Rate
DA          Depth Anything
DS          Driving Score
IP          Infraction Penalty
IR          Infraction Rate
MAE         Mean Absolute Error
MSE         Mean Squared Error
QCE         Quantized Classification Error
RC          Route Completion Rate
SOTA        State-of-the-Art
Sim2Real    Transfer of Simulation Experiences to the Real World
SMAE        Steer-Mean Absolute Error
SR          Success Rate
SW-MAE      Speed Weighted Mean Absolute Error
TRE         Thresholded Relative Error
Chapter 1
Introduction
1.1 Motivation
Our objective is to leverage simulation-based data collection for training and evalu-
ate the trained model in the real world. This approach is motivated by two primary
factors. Firstly, simulation environments offer a cost-effective way to generate vast
quantities of accurately labeled data spanning a diverse range of scenarios, ensur-
ing both dataset quality and consistency. In contrast, gathering equivalent data
in real-world settings is characterized by significant time and resource expenditure.
Secondly, despite the advancements in simulation technology, it remains inherently
challenging to encompass the entirety of potential scenarios and driving conditions
that an autonomous agent may encounter in the unpredictable and complex real
world. Therefore, while simulation is useful for training, the true test of our model’s
performance lies in real-world application.
However, while controlled environments may allow for safe evaluations, real-world testing remains
expensive and poses numerous logistical challenges. Furthermore, controlled envi-
ronments may not fully encompass the unforeseen challenges the policy may face upon
real-world deployment. To mitigate these concerns, researchers often resort to offline
evaluation methods, analyzing performance against a pre-collected dataset without
exposing the model to potential hazards.
Assessing autonomous driving progress through offline evaluation, particularly
comparing model predictions to human-driven benchmarks, has significantly advanced
research (Bansal et al., 2019; Hu et al., 2023; Jiang et al., 2023; Wu et al., 2023; Dauner
et al., 2023; Hu et al., 2022; Zhang et al., 2022). However, how well offline metrics
reflect actual driving performance, especially in urgent situations requiring quick and
accurate responses, remains unclear. Several factors contribute to the difference be-
tween offline predictions and real-world performance. Firstly, real-world driving data
mostly involves routine events, making it difficult for models to handle critical situ-
ations (Tu et al., 2023; Li et al., 2023). Secondly, while techniques exist to address
data biases, standard evaluation metrics may not fully capture the subtle aspects of
safe driving (Li et al., 2020; Codevilla et al., 2018a). Lastly, online testing involves
dynamic feedback loops where small errors can lead to unpredictable outcomes (Ross
et al., 2011; Osa et al., 2018; Prakash et al., 2020; Laskey et al., 2017). Despite rec-
ognizing the limitations of offline evaluation (Zhang et al., 2022; Li et al., 2023), few
studies have explored the transition from offline to online assessment beyond basic
scenarios. This study aims to fill this gap by investigating how well offline metrics
correlate with real-time performance.
End-to-end deep learning methods aim to bypass intricate hand-engineering by
training driving policies directly on data (Pomerleau, 1988; LeCun et al., 2005; Bo-
jarski et al., 2016b). However, scaling these solutions to realistic urban driving sce-
narios presents significant hurdles. The sheer volume of data required to encompass
the full spectrum of driving situations poses a major challenge. Additionally, deploy-
ment and testing become intricate due to safety considerations: the opaque nature of
end-to-end models complicates risk assessment and evaluation.
Furthermore, deep learning-based autonomous driving models face the challenge
of obtaining complex training datasets for policy development. Real-world dataset
acquisition requires significant time and financial investment, while simulation offers
a faster and more scalable alternative. In simulated environments, abundant training
data and safe testing of driving policies are possible. However, transferring learned
policies from simulation to real-world settings presents a hurdle due to the reality gap,
resulting in performance disparity between simulated and physical environments. As a
result, transferring control policies for autonomous vehicles in complex urban settings
remains a persistent challenge.
1.2 Problem Formulation
In this section, we outline the problem formulation addressing two primary challenges
within autonomous driving. Given a dataset curated in simulation, we first try to an-
swer the question: Can a policy derived from a curated simulation dataset effectively
extrapolate its learnings to navigate the complexities of the real-world environment?
Subsequently, we formulate our second problem statement: Given a driving policy, can
we accurately evaluate its driving performance using a pre-collected dataset without
driving on the road? Our research primarily aims to find answers to these questions,
thus taking a step toward closing the generalization gap in autonomous driving.
1.3 Organization
The organization of this document is as follows.
Chapter 2 serves as the foundation of our research, offering an exploration of re-
lated work. It provides essential background information, discussing three primary
topics: conditional imitation learning, offline and online evaluation metrics, and ef-
fective transfer of knowledge from simulation to the real world.
In Chapter 3, we first discuss the dataset and setup for the Sim2Real experiments.
We then introduce the foundation models and the CIL network architecture power-
ing the experiments. Finally, we discuss the quantitative improvement in Sim2Real
transfer over the baselines.
Chapter 4 introduces our correlation analysis studying the relationship between
offline and online metrics in both simulation and the real world. We first outline the
experimental setup and the evaluation metrics. Then, we analyze the results and
identify key findings highlighting the top-performing offline metrics. Finally, we draw
upon the analysis to identify if open-loop evaluation results represent closed-loop
driving performance and to what extent.
The final chapter, Chapter 5, serves as a culmination of our research, summarizing
our contributions and findings. We reiterate the importance of our work and its
potential impact on the field of autonomous driving research. Additionally, we present
future directions and avenues for continued progress, providing a roadmap for the
advancement of this field.
This structured approach ensures a thorough exploration of our research, from
its foundational concepts to practical implementation and analysis, ultimately con-
tributing to the body of knowledge in the domain of autonomous driving.
Chapter 2
Related Work
2.1 Conditional Imitation Learning
The landscape of autonomous driving research has witnessed an evolution with the
advent of deep learning methodologies. Departing from conventional systems, recent
research endeavors have adopted learning-based models, indicating a paradigm shift
in the approach to realizing autonomous driving capabilities. This transformation is
notably highlighted by the rise of end-to-end autonomous driving architectures,
which leverage the power of deep learning techniques for task accomplishment (Muller
et al., 2006; Bojarski et al., 2016a; Xu et al., 2017; Pomerleau, 1989; Bansal et al.,
2019; Casas et al., 2021; Cui et al., 2021; Hu et al., 2022).
A prominent manifestation of this trend can be observed through the lens of the
CARLA Leaderboard (car, 2024), a benchmarking platform for autonomous driving in
simulation. Here, end-to-end deep learning-based models show superior performance
metrics compared to their modular counterparts (car, 2024; Müller et al., 2018; Tang
et al., 2018; Chen et al., 2021; González et al., 2015; Kendall et al., 2019; Xu et al.,
2021). This shift underscores the growing efficacy and adoption of end-to-end deep
learning models in tackling the complex challenges in autonomous navigation tasks.
Moreover, the optimization of end-to-end autonomous driving systems has been
further enriched by integrating high-level navigational commands during test-time
policy control. Previous studies have demonstrated the efficacy of this approach
in enhancing overall system performance and adaptability to diverse driving sce-
narios (Codevilla et al., 2018b; Hawke et al., 2020). By incorporating higher-level
guidance cues into the decision-making process, these systems demonstrate height-
ened robustness and versatility, thereby advancing the frontier of autonomous driving
research.
Beyond the realm of visual perception, the integration of additional sensor modal-
ities such as LiDAR has emerged as a pivotal strategy for improving the safety and
reliability of autonomous driving policies. Through multi-modal sensor fusion tech-
niques, LiDAR-equipped systems can glean rich environmental information, thereby
enhancing situational awareness and mitigating potential hazards (Chitta et al., 2023;
Shao et al., 2023; Prakash et al., 2021). This holistic approach to sensor integration
underscores a fundamental principle in autonomous driving research: the fusion of
complementary sensory inputs to construct a comprehensive understanding of the sur-
rounding environment, thereby empowering more robust and dependable autonomous
driving systems.
2.2 Transfer from Simulation to the Real World
The transition from simulated environments to real-world applications has been exten-
sively explored in the fields of computer vision and robotics. Synthetic data has proven
invaluable for training and assessing perception systems across various settings, in-
cluding indoor environments and driving scenarios (Zhang et al., 2017; Handa et al.,
2016; McCormac et al., 2017; Ros et al., 2016; Gaidon et al., 2016; Mayer et al., 2016;
Richter et al., 2016; Skinner et al., 2016; Johnson-Roberson et al., 2017; Tsirikoglou
et al., 2017; Alhaija et al., 2017). Despite advancements in high-fidelity simulation,
directly transferring algorithms from simulation to reality remains challenging (Mc-
Cormac et al., 2017; Richter et al., 2016; Tsirikoglou et al., 2017; Zhang et al., 2017),
although successful examples exist for tasks like optical flow estimation (Dosovitskiy
et al., 2015) and object detection (Johnson-Roberson et al., 2017; Hinterstoisser et al.,
2017).
Research on transferring sensorimotor control policies has primarily focused on
manual grasping and manipulation tasks. Various approaches employ specialized
learning techniques and network architectures to facilitate transfer, including domain
adaptation methods and specialized network architectures (Tzeng et al., 2016; Gupta
et al., 2017; Wulfmeier et al., 2017; Bousmalis et al., 2018; Pinto et al., 2018; Rusu
et al., 2017; Zhang and McCarthy, 2017). Utilizing depth maps instead of color
images has been shown to simplify transfer tasks (Viereck et al., 2017; Mahler et al.,
2017b,a), while domain randomization techniques enhance sim-to-real generalization
by maximizing simulation diversity (Tobin et al., 2017; James et al., 2017; Sadeghi
et al., 2017b,a).
In the context of manual grasping and pushing, modularization approaches have
been explored to facilitate transfer. Clavera et al. (2017) proposed a modular system
consisting of perception, high-level policy, and low-level motion control modules, al-
beit requiring special instrumentation and limited to specific tasks with a robotic arm
against a uniform background. Similarly, Devin et al. (2017) investigated modular
neural networks for control tasks but only evaluated them in simulation with simple
tasks. In contrast, our research focuses on developing policies for outdoor mobile
robots in more complex perceptual environments.
The transfer of driving policies traces back to pioneering work by Pomerleau
(1988), who trained a neural network for lane keeping using synthetic road images
and deployed it on a physical vehicle. However, this early work was limited to basic
lane-keeping. Michels et al. (2005) trained an obstacle avoidance policy in simulation
and transferred it to a robotic vehicle using depth-based representations, although
insufficient for urban driving scenarios.
A more recent work by Müller et al. (2018) introduces a modular deep architecture
for autonomous driving that combines traditional modular pipelines with end-to-
end deep learning approaches, enabling flexibility in adapting to new environments
and domains. The study concludes by suggesting future extensions to enhance the
model’s applicability for real autonomous vehicles, such as incorporating additional
information like lane markings and traffic signs, exploring more complex controllers,
and utilizing data-hungry learning algorithms like reinforcement learning.
2.3 Open and Closed Loop Evaluation
The advancement of learning-based models in autonomous driving primarily relies on
assessing their performance through two key evaluation approaches: open-loop (or
offline) and closed-loop (or online) (Codevilla et al., 2018a). In open-loop evaluation,
models are tested on pre-collected datasets, without deploying them to drive on the
road. This method provides a structured way to gauge performance but might not
fully capture the complexities of actual driving scenarios. Closed-loop evaluation, on
the other hand, involves evaluating models in real-world conditions by letting them
drive based on the predictions of the policy. This approach offers a more realistic
assessment but comes with challenges in safety, repeatability, and cost.
In a more recent study, Schreier et al. (2023) applied the NDS and demonstrated
improved correlation with comprehensive online metrics like the CARLA Driving
Score (Dosovitskiy et al., 2017). Despite these advancements, Li et al. (2023) show-
cased in their work that a basic model lacking visual input can achieve competitive
performance on leaderboards evaluated by offline metrics such as NDS. This highlights
the limitations of current offline metrics in accurately predicting driving performance.
These findings underscore the need to re-evaluate the adequacy of offline metrics
in predicting real-world driving performance.
Chapter 3
Sim2Real: Transfer Simulation
Experiences to the Real World
In this chapter, we explore the Sim2Real transfer process, aiming to enhance the
development and deployment of autonomous driving technologies. This examination
is structured to sequentially address the various phases of our experimental approach,
progressing from initial simulation training to real-world application and analysis.
Initially, we focus on training our autonomous driving model within a simulated
environment. Following the simulation training, the model is then tested in real-world
scenarios to evaluate its adaptability and to benchmark its performance against real-
world data. This stage is crucial for establishing a baseline of performance without
applying Sim2Real algorithms.
Once the initial testing is complete, we introduce the concept of abstraction in the
training process. Here, we utilize foundation models: Segment Anything by Kirillov
et al. (2023) and Depth Anything by Yang et al. (2024) to train on abstract visual
inputs. This step is designed to refine the model’s ability to process and react to
varied and unforeseen environmental factors that are typically encountered in real
driving conditions.
The culmination of this chapter is an in-depth evaluation of the model’s per-
formance in the real world, post-abstraction training. We analyze the outcomes to
identify improvements, challenges, and the model’s overall efficacy in navigating real-
world environments compared to its simulated training.
Through this structured exploration, the chapter aims to provide a comprehen-
sive understanding of the Sim2Real process, offering insights into the practicalities
and potentials of employing advanced abstraction techniques in autonomous driving
technology.
3.1 Data and Setup
In this section, we outline our experimental setup. We first discuss the training setup
and further discuss the evaluation setup along with their respective datasets.
To establish a robust foundation for our Sim2Real experiments, we meticulously
configured a simulation setup within the CARLA driving simulator (Dosovitskiy et al., 2017).
This setup involves a detailed orchestration of routes, weather conditions, vehicle dy-
namics, and scenario-based data collection, all designed to mirror real-world urban
driving conditions as closely as possible to facilitate effective transfer learning.
We then introduce the setup scenarios for our experimental target real-world do-
mains: indoor and outdoor environments. Each domain presents unique challenges
and variables for testing autonomous driving algorithms. We’ll delve into the specifics
of these setups, highlighting their significance in evaluating the performance of our
models.
3.1.1 Training Setup
Our experiment utilizes the CARLA simulator, where we use the expert from Trans-
fuser by Chitta et al. (2023). This expert agent navigates through two specifically
chosen towns: Town01 and Town02. These towns are selected for their narrow road
configurations, offering a simplified yet challenging environment that helps avoid the
complexities of multi-lane roads, thus aligning more closely with our target real-world
scenarios.
3.1.2 Data Collection
In both towns, the vehicle is exposed to a variety of driving conditions and events,
including routine traffic interactions, collision avoidance scenarios, and traffic light
compliance, to create a diverse and comprehensive dataset. The data collection en-
compasses several key parameters:
Visual Frames: We capture RGB images from the vehicle’s onboard camera,
ensuring that the camera’s field of view and placement accurately reflect those
used in real-world autonomous driving setups.
Driving Commands: At intersections, specific navigational commands are
recorded: left, right, or straight, which mirror the decision-making process re-
quired in real driving situations.
Vehicle Controls: Detailed records of the vehicle’s controls are maintained,
including steering angle, throttle, and brake application, which provides the
ground truth labels to train the Conditional Imitation Learning Agent.
Overall, our dataset consists of approximately 20,000 frames, each tagged with
corresponding control and command data across various simulated weather condi-
tions, which is used to train our driving agents. Critical to our Sim2Real initiative
is the alignment of our simulated data collection with real-world driving conditions.
The camera setup, vehicle control dynamics, and scenario designs are all crafted to
ensure that the transition from simulation to real-world application is as seamless
and effective as possible.
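For illustration, a single collected frame can be thought of as a record like the one sketched below; the field names and layout are assumptions made for exposition rather than the thesis's actual storage format.

```python
from dataclasses import dataclass

# Illustrative record for one collected frame (field names are assumptions,
# not the thesis's actual schema).
@dataclass
class DrivingSample:
    image_path: str   # RGB frame from the onboard camera
    command: int      # navigational command: 0 = left, 1 = right, 2 = straight
    steer: float      # ground-truth steering in [-1, 1]
    throttle: float   # ground-truth throttle in [0, 1]
    speed: float      # ego speed in m/s
    weather: str      # simulated weather preset used during collection
```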
3.1.3 Real World Setup
Indoor Setup: A model city is constructed using a black-colored mat with taped
lanes to simulate roadways, as shown in Fig. 3·1 (right). We use the AgileX LIMO (agi, 2024)
vehicle, equipped with a built-in camera, for navigation within this controlled environment.

Figure 3·1: Real-world Outdoor and Indoor Setup: Outdoor data collection (left) involves a
remote-controlled car (xma, 2024) with a Raspberry Pi, a joystick controller, and an Android
smartphone following the red path. Indoor data collection (right) employs an AgileX LIMO
(agi, 2024) with an Nvidia Jetson Nano, set within a miniature urban scenario with marked
lanes and objects.

Various obstacles, such as toys and robot figures, are strategically
positioned along the lanes, mimicking typical urban obstacles. Additionally, simulated
traffic lights are incorporated to introduce traffic regulation challenges. This setup
allows for precise control and observation of the vehicle’s behavior in a confined yet
dynamic environment, facilitating targeted experimentation and analysis.
Outdoor Setup: This setup takes place in a park setting featuring narrow pathways
(paved roads) and natural terrain. A larger remote-controlled vehicle (xma, 2024) is
employed for mobility within this open environment. The vehicle is equipped with
sensors and cameras to capture real-time data and navigate through the challenging
terrain as shown in Fig. 3·1 (left). Unlike the indoor setup, the outdoor environment
presents unpredictable variables such as uneven terrain, varying weather conditions,
and potential interaction with pedestrians. Despite these complexities, the outdoor
setup offers valuable insights into the adaptability and robustness of autonomous
driving algorithms in real-world scenarios, validating the findings from simulation-
based experiments.
3.1.4 Real World Expert Dataset
In addition to our simulation-based experiments, we leverage real-world data collected
from expert human drivers to validate our Sim2Real models. This dataset is obtained
by rigging the outdoor vehicle (xma, 2024) with a Raspberry Pi while the indoor
vehicle (agi, 2024) comes pre-installed with a Jetson Nano, enabling human operators
to navigate designated paths while recording critical data points.
Data Collection Process: Using joystick controllers, human operators drive vehi-
cles along predetermined paths, mimicking real-world driving behavior. During these
maneuvers, RGB frames from onboard cameras are captured, alongside control inputs
and navigation commands provided by the operator. This comprehensive data collec-
tion process yields a ground-truth trajectory against which our models’ performance
can be evaluated.
Dataset Characteristics: Our real-world expert dataset comprises a total of 200,000
frames collected from both indoor and outdoor scenarios. By capturing data in diverse
environments, we aim to capture a wide range of driving conditions and challenges
to evaluate the robustness and generalization capabilities of our autonomous driving
algorithms.
Figure 3·2: Sim2Real Architecture: Stage 1 is input data ab-
straction using foundation models to reduce the domain gap between
simulation and the real world. Stage 2 is training or inference of the
Sim2Real policy network by training the CIL network in simulation
and evaluating it in the real world with abstracted inputs.
3.2 Methodology
In this section, we discuss our approach to dataset abstraction, a pivotal step in our
research methodology. Here, we explain our process of transforming raw RGB inputs
into segmented and depth-estimated representations during train and test time. This
abstraction serves to streamline the input domain, enhancing the Sim2Real transfer.
Additionally, we detail our utilization of a custom architecture, comprising con-
volutional neural networks (CNNs) with command inputs, creating an end-to-end
convolutional neural network for supervised imitation learning. The combined architecture
is shown in Fig. 3·2.

Figure 3·3: Data Abstraction: This grid of images shows qualitative examples of abstracted
inputs for both simulation (top row) and the real world (bottom row). From left to right:
RGB image, depth image, segmentation masks, contours for those masks, and segmentation
mask + contours. We use the foundation models Depth Anything and Segment Anything
(SAM) for these abstractions.
3.2.1 Data Abstraction for Sim2Real
In our research, we employ data abstraction techniques to simplify and enhance the
input data for training and inference in our autonomous driving models. The primary
approach applies foundation models: Segment Anything (SAM) (Kirillov et al., 2023)
and Depth Anything (DA) (Yang et al., 2024). These components serve as the build-
ing blocks for abstracting the raw RGB input into segmented and depth-enhanced
formats as shown in Fig. 3·3.
Abstraction During Training (Simulation): During simulation training, the
model is presented with the raw RGB input, which is simultaneously transformed
into segmented and depth-enhanced representations using SAM and SAM + DA,
respectively. This abstraction process aims to simplify the input data while retaining
essential information relevant to driving tasks. Consequently, the model learns to
interpret the segmented and depth-enhanced inputs, facilitating better generalization
to real-world scenarios.
Abstraction During Test/Inference (Real World): Similarly, during real-world
inference, the model receives input from onboard sensors, which are processed using
SAM and DA to generate segmented and depth-enhanced inputs. By abstracting
the raw sensor data in this manner, the model remains consistent in its perception
regardless of the input source, ensuring robust performance in diverse environments.
Formulations of Data Abstraction: To explore the effectiveness of different ab-
straction formulations, we experiment with various combinations of input types:
1. RGB Only: Training the model solely on raw RGB input.
2. RGB with Depth (RGB-D): Training the model on RGB input with an
additional channel containing the depth information.
3. Segmentation Mask Only: Training the model exclusively on the segmented
output of SAM.
4. Segmentation Mask with Depth (Mask-D): Training the model on the
SAM segmented output along with depth information.
5. Contour with Depth: Incorporating depth information into the SAM contour-
based input.
6. Mask with Contour and Depth: Combining the segmented mask with its
contours and depth information.
By exploring these different formulations, we aim to identify the most effective
abstraction strategy for enhancing the performance and generalization capabilities of
our autonomous driving models.
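As a rough illustration of how these formulations can be assembled, the sketch below stacks precomputed SAM masks and Depth Anything depth maps into multi-channel network inputs. The function name, channel layout, and normalization conventions are assumptions; the SAM and Depth Anything outputs are taken as already-rendered single-channel arrays rather than computed here.

```python
import cv2
import numpy as np

def build_input(rgb, mask=None, depth=None, formulation="rgb"):
    """Assemble a channel-stacked input for one abstraction formulation.
    rgb:   HxWx3 uint8 image.
    mask:  HxW float array in [0, 1], a single-channel rendering of the SAM masks.
    depth: HxW float array in [0, 1] from Depth Anything.
    All names and conventions here are illustrative assumptions."""
    rgb_f = rgb.astype(np.float32) / 255.0
    if formulation == "rgb":
        return rgb_f
    if formulation == "rgb_d":
        return np.concatenate([rgb_f, depth[..., None]], axis=-1)
    if formulation == "mask":
        return mask[..., None].astype(np.float32)
    if formulation == "mask_d":
        return np.concatenate([mask[..., None], depth[..., None]], axis=-1).astype(np.float32)

    # Contour-based inputs: draw the boundaries of the segmentation mask.
    contours, _ = cv2.findContours((mask * 255).astype(np.uint8),
                                   cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    contour = np.zeros(mask.shape, dtype=np.uint8)
    cv2.drawContours(contour, contours, -1, color=255, thickness=1)
    contour_f = contour.astype(np.float32)[..., None] / 255.0

    if formulation == "contour_d":
        return np.concatenate([contour_f, depth[..., None]], axis=-1)
    if formulation == "mask_c_d":
        return np.concatenate([mask[..., None].astype(np.float32),
                               contour_f, depth[..., None]], axis=-1)
    raise ValueError(f"unknown formulation: {formulation}")
```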
3.2.2 Network Architecture
In crafting our network architecture for the Conditional Imitation Learning (CIL)
network, we adopt a customized approach tailored to our specific research objectives.
Our architecture incorporates a convolutional block that offers flexibility through
its modularity, allowing for seamless swapping of components. In our experiments
we replace this CNN Block with variants of the RegNetY family of networks by
Radosavovic et al. (2020) as well as a custom five-layer convolution network.
Our architecture accommodates inputs with varying dimensions, typically con-
sisting of three-channel RGB images. However, for inputs augmented with depth
information, such as RGB-D or SAM-D images, we introduce a four-channel to three-
channel convolution layer to handle the additional depth dimension seamlessly.
Following feature extraction, the output is passed through a linear layer where it
undergoes concatenation with the command input, denoted as C, which is a one-hot
encoded vector. Initially, we feed this encoded command through two fully connected
layers, following a similar approach as described in Codevilla et al. (2018b), and
concatenate it with the flattened image feature vector. However, acknowledging the
limitations highlighted in previous works (Codevilla et al., 2018b), particularly in
ensuring command adherence, we draw inspiration from Hawke et al. (2020). Here,
we directly concatenate the encoded command to each fully connected layer within
the Control Block.
The output from the control block is subsequently branched into two linear layers
responsible for predicting the scalar values corresponding to driving controls: Steer
and Throttle. For network optimization, we employ the AdamW optimizer (Loshchilov and
Hutter, 2017) and compute the loss as the sum of L1 losses for Steer and Throttle.
Additionally, to ensure consistency in target ranges, we normalize the targets for Steer
and Throttle to (−1, 1) and (0, 1), respectively.
$$\mathcal{L}(\hat{a}, a) = |\hat{S} - S| + |\hat{T} - T|$$

The loss function $\mathcal{L}(\hat{a}, a)$ is defined as the sum of the absolute differences between
the predicted and ground-truth actions, where $\hat{a}$ denotes the predicted action and
$a$ represents the ground-truth action. Similarly, $\hat{S}$ and $S$ denote the predicted and
ground-truth steering angles, and $\hat{T}$ and $T$ the predicted and ground-truth throttle
values. $|\cdot|$ denotes the absolute difference between the corresponding predicted and
ground-truth values.
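To make the description above concrete, the following is a minimal PyTorch sketch of such a CIL policy: a swappable CNN block, a control block that concatenates the one-hot command at each fully connected layer (following Hawke et al., 2020), and two heads for Steer and Throttle trained with an L1 loss. Layer sizes, activations, and the tanh/sigmoid output squashing are illustrative assumptions, not the exact thesis configuration; in practice the stand-in CNN block would be replaced with a RegNetY variant.

```python
import torch
import torch.nn as nn

class CILPolicy(nn.Module):
    """Sketch of the conditional imitation learning policy described above."""

    def __init__(self, in_channels=3, num_commands=3, feat_dim=512, hidden=256):
        super().__init__()
        # Inputs with a depth channel (e.g., RGB-D) are first projected back to 3 channels.
        self.reduce = (nn.Conv2d(in_channels, 3, kernel_size=1)
                       if in_channels != 3 else nn.Identity())
        # Stand-in CNN block; swappable for a RegNetY backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        # Control block: the one-hot command is concatenated at every layer.
        self.fc1 = nn.Linear(feat_dim + num_commands, hidden)
        self.fc2 = nn.Linear(hidden + num_commands, hidden)
        self.steer_head = nn.Linear(hidden, 1)     # steer in (-1, 1)
        self.throttle_head = nn.Linear(hidden, 1)  # throttle in (0, 1)

    def forward(self, image, command_onehot):
        feat = self.cnn(self.reduce(image))
        x = torch.relu(self.fc1(torch.cat([feat, command_onehot], dim=1)))
        x = torch.relu(self.fc2(torch.cat([x, command_onehot], dim=1)))
        steer = torch.tanh(self.steer_head(x)).squeeze(1)
        throttle = torch.sigmoid(self.throttle_head(x)).squeeze(1)
        return steer, throttle

def l1_action_loss(pred_steer, pred_throttle, gt_steer, gt_throttle):
    # L = |S_hat - S| + |T_hat - T|, averaged over the batch.
    return (pred_steer - gt_steer).abs().mean() + (pred_throttle - gt_throttle).abs().mean()
```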
3.3 Results and Discussions
In this section, we present a comprehensive analysis of the outcomes derived from
our Sim2Real experiments. This section encapsulates our findings from the transi-
tion of models trained in simulation within the CARLA environment to real-world
application in both indoor and outdoor settings. Across all experiments, we sys-
tematically evaluate the performance of various model configurations, ranging from
simple RGB baselines to more complex formulations incorporating mask, contours,
and depth estimation.
3.3.1 Indoor Experiments
Setup: Our indoor experiments are designed to simulate challenging urban driving
scenarios, with a focus on maneuvering through complex environments and navigating
potential obstacles. In total, we evaluate each of our models across eight distinct
scenarios, each tailored to emulate real-world urban driving challenges.
1. Pedestrian Crossing on Turn with No Lights: A scenario where a pedes-
trian crosses the road during a turn without any traffic lights.
2. Car Crossing on Green: Simulation of a scenario where a car crosses the
road while the traffic light is green.
3. Car Crossing on Red: Similar to the previous scenario, while the traffic light
is red.
4. Object Crossing and Turn: A scenario where an object crosses the road
during a turn, requiring the vehicle to navigate around it.
5. Hidden Pedestrian Behind Object: Emulates a situation where a pedes-
trian hidden behind an object suddenly appears in front of the vehicle.
6. Traffic Light Scenarios: Various scenarios involving traffic lights, including
different signal states and pedestrian crossings.
7. Wheelchair Person Crossing on Green: Simulation of a wheelchair or a
person in a wheelchair crossing the road while the traffic light is green.
8. Car Parked in the Middle of the Road: A scenario where a car is parked
obstructing the road, requiring the ego vehicle to stop.
For each scenario, we collect expert trajectory data, where a human driver navi-
gates through the scenario and stops accordingly. We then compare the performance
of our models against these expert trajectories using offline metrics, averaging the
results to obtain comprehensive insights into our model’s performance across diverse
urban driving scenarios.
Results: In Table 3.1 we present the results of our offline evaluation of seven different
model architectures across various input configurations in indoor driving scenarios.
The architectures range from the baseline custom architecture, PilotNet, to increas-
ingly complex RegNetY variants (RegNetY-002 to RegNetY-016). For each architec-
ture, we assess the performance in terms of steering Mean Absolute Error (MAE)
and action MAE, which includes both steering and throttle control. The SW metric
denotes speed-weighted variants of steering and action MAE. Here, the ground-truth
action is $a = [S, T]$, where $S$ is steering and $T$ is throttle, and similarly $\hat{a}$, $\hat{S}$, $\hat{T}$ are their
predicted counterparts. The metrics are mathematically defined in Section 4.2.
Across all input configurations, we observe improvements over the RGB base-
line, indicating the efficacy of incorporating additional information into the model
through the foundation models. Notably, methods utilizing semantic segmentation or
contour-based inputs, combined with depth information, exhibit particularly promis-
ing results. However, no clear winner emerges among the different model architec-
tures, with performance varying across scenarios.
This variability in performance can be attributed to several factors, including the
abstract nature of mask-based methods, which may not capture critical details such
as traffic light colors. Despite this, qualitative analysis suggests that models utilizing
input abstractions generally outperform the baseline in overall performance.
3.3.2 Outdoor Experiments
Setup: The outdoor route spanned approximately 250 feet, comprising five turns,
interspersed with straight sections. Pedestrians entered the vehicle’s path three times
during the route. The route traversed various backgrounds, including pedestrian-
filled areas adjacent to the path, multi-road intersections, and turns exceeding 90
degrees. The terrain of the outdoor path was characterized by a brick road surface,
contributing to significant vibrations experienced by the vehicle during traversal. This
uneven surface posed a challenge to the vehicle’s stability and control. Moreover, blind
corners were encountered due to the presence of trees and shrubs along the road’s
edge, requiring the vehicle to navigate cautiously through these obscured areas.
Results: As shown in Table 3.2, similar to our indoor findings, no single model or
abstraction emerges as a clear winner. However, observable patterns emerge across
different model architectures and input configurations. For instance, RegNetY-008
demonstrates promising performance when the input is abstracted with mask, con-
tour, and depth information, indicating the efficacy of incorporating multiple abstrac-
tion methods. We follow a similar notation for action-based metrics as discussed in
Subsection 3.3.1.
Conversely, RegNetY-004 exhibits better performance when utilizing raw RGB in-
put combined with depth information, suggesting that certain models may excel under
specific input configurations. Similarly, PilotNet demonstrates improved performance
with mask-based architectures, highlighting the importance of model optimization
and architectural considerations in achieving optimal results.
Overall, while all proposed abstraction methods outperform the RGB baseline,
there is no definitive winner identified in this evaluation. Rather, the performance
appears to be influenced by how effectively the model adapts and optimizes for dif-
ferent input configurations.
3.4 Conclusion
In this chapter, we explored how modern foundation models, like Segment Anything
and Depth Anything, enhance Sim2Real transfer in autonomous driving. We tested
various input configurations but found no clear winner based on our offline met-
rics. However, all configurations performed better than the baseline RGB-only input,
indicating the efficacy of Sim2Real transfer in improving model generalization to
real-world scenarios.
Despite these positive findings, our results aren’t conclusive, leading us to question
the adequacy of our current evaluation metrics. In the next chapter, we’ll discuss the
limitations of traditional offline metrics and identify the correlation between offline
and online evaluation. This will help us explore how accurately we can predict real-
world performance based on offline metrics, paving the way for more robust evaluation
methodologies in autonomous driving research.
Table 3.1: Sim2Real Indoor Results (MAE): This table shows
offline evaluation results of five model architectures for indoor driv-
ing, indicating performance in steering and action MAE across various
input configurations. While models incorporating additional informa-
tion like semantic segmentation and depth show promise with clear
improvements over the baseline, no clear winner emerges due to vari-
ability attributed to abstract input methods. Among Methods, ‘D’ rep-
resents Depth, and ‘C’ denotes contours. Similarly, SMAE and AMAE
represent MAE errors based on steer and action (steer + throttle) re-
spectively, and SW represents their speed-weighted versions.
Backbone Method SMAE SW-SMAE AMAE SW-AMAE
PilotNet
RGB (Baseline) 0.249 0.344 0.271 0.946
RGB+D 0.245 0.33 0.269 0.920
Mask 0.264 0.362 0.292 0.938
Mask+D 0.306 0.392 0.302 0.988
Contour+D 0.319 0.398 0.314 0.993
Mask+C+D 0.307 0.400 0.310 0.990
RegNetY 002
RGB (Baseline) 0.294 0.391 0.319 0.850
RGB+D 0.276 0.383 0.331 0.812
Mask 0.295 0.397 0.304 0.955
Mask+D 0.308 0.405 0.332 0.861
Contour+D 0.324 0.414 0.347 0.876
Mask+C+D 0.284 0.377 0.312 0.853
RegNetY 004
RGB (Baseline) 0.278 0.376 0.296 0.935
RGB+D 0.304 0.407 0.321 0.945
Mask 0.253 0.355 0.278 0.888
Mask+D 0.321 0.418 0.331 0.892
Contour+D 0.296 0.397 0.330 0.838
Mask+C+D 0.300 0.392 0.317 0.909
RegNetY 008
RGB (Baseline) 0.260 0.355 0.295 0.813
RGB+D 0.250 0.342 0.345 0.786
Mask 0.269 0.367 0.308 0.899
Mask+D 0.324 0.420 0.354 0.833
Contour+D 0.299 0.390 0.331 0.864
Mask+C+D 0.289 0.394 0.324 0.868
RegNetY 016
RGB (Baseline) 0.262 0.357 0.278 0.911
RGB+D 0.282 0.377 0.309 0.888
Mask 0.253 0.347 0.291 0.815
Mask+D 0.312 0.396 0.341 0.861
Contour+D 0.356 0.432 0.353 0.916
Mask+C+D 0.277 0.375 0.301 0.904
Table 3.2: Sim2Real Outdoor Results (MAE): Diverse model
architectures and input setups show an improvement over the baseline
while showing no clear top-performing model or abstraction. RegNetY-
008 stands out with multi-abstraction inputs, RegNetY-004 performs
well with raw RGB with depth, and PilotNet improves with mask-based
architectures. Among Methods, ‘D’ represents Depth, and ‘C’ denotes
contours. Similarly, SMAE and AMAE represent MAE errors based
on steer and action (steer + throttle) respectively, and SW represents
their speed-weighted versions.
Backbone Method SMAE SW-SMAE AMAE SW-AMAE
PilotNet
RGB (Baseline) 0.139 0.304 0.447 1.595
RGB+D 0.113 0.251 0.459 1.788
Mask 0.110 0.238 0.446 1.697
Mask+D 0.125 0.272 0.456 1.719
Contour+D 0.199 0.402 0.466 1.606
Mask+C+D 0.162 0.347 0.461 1.657
RegNetY 002
RGB (Baseline) 0.080 0.178 0.357 1.586
RGB+D 0.079 0.179 0.195 0.782
Mask 0.100 0.220 0.439 1.986
Mask+D 0.086 0.191 0.335 1.475
Contour+D 0.122 0.264 0.394 1.753
Mask+C+D 0.097 0.214 0.316 1.370
RegNetY 004
RGB (Baseline) 0.093 0.205 0.481 2.199
RGB+D 0.082 0.190 0.242 1.015
Mask 0.082 0.194 0.435 1.977
Mask+D 0.116 0.251 0.370 1.641
Contour+D 0.097 0.214 0.338 1.486
Mask+C+D 0.102 0.223 0.365 1.611
RegNetY 008
RGB (Baseline) 0.094 0.207 0.393 1.233
RGB+D 0.100 0.218 0.225 0.919
Mask 0.092 0.206 0.324 1.427
Mask+D 0.113 0.247 0.346 1.520
Contour+D 0.110 0.239 0.355 1.567
Mask+C+D 0.086 0.198 0.218 1.388
RegNetY 016
RGB (Baseline) 0.086 0.190 0.415 1.875
RGB+D 0.140 0.305 0.270 1.128
Mask 0.084 0.198 0.341 1.511
Mask+D 0.112 0.249 0.289 1.238
Contour+D 0.133 0.291 0.388 1.721
Mask+C+D 0.093 0.203 0.425 1.922
Chapter 4
Correlation Analysis: Offline and Online
Evaluation Metrics
In this chapter, we take a closer look at what we learned in the previous chapter
about Sim2Real transfer in autonomous driving. Although we made some progress,
like outperforming the baseline RGB-only input, we also found some issues. One
main issue is that the results were not consistent across different setups, which makes
us question how reliable conventional evaluation methods are.
To tackle this problem, we dive into comparing offline and online evaluation meth-
ods. We want to see if the scores we get in offline tests match up with how well the
models perform in the real world. By doing this, we hope to figure out if our tests
accurately predict how autonomous driving systems will do when they are out there
on the road.
4.1 Data and Setup
We operate under the premise of having access to a compiled dataset $\mathcal{D} = \{(o_i, a_i)\}_{i=1}^{N}$,
wherein each observation $o_i = (I_i, L_i, c_i, v_i) \in \mathcal{O}$ comprises an image $I_i \in \mathbb{R}^{N_W \times N_H \times 3}$,
a LiDAR point cloud with alpha channel $L_i \in \mathbb{R}^{N_X \times N_Y \times N_Z \times 1}$, a conditional command
$c_i \in \mathbb{N}$, and a speed measurement $v_i \in \mathbb{R}$, alongside the ground-truth action vector
$a_i \in \mathcal{A}$ executed by the driver. Here, $\pi_\theta : \mathcal{O} \to \mathcal{A}$ represents the policy, a mapping
function parameterized by weights $\theta \in \mathbb{R}^d$, which predicts actions $\hat{a}_i$ based on input
observations.
In our investigation, the action can take two forms: a low-level control action,
such as a 3D vector encompassing steering, throttle, and brake magnitudes, or a
collection of future 2D waypoints visible in the vehicle’s map perspective, outlining
the intended trajectory (Codevilla et al., 2018a; Müller et al., 2018). Present-day
vision-based planners typically employ waypoints as the output model (Jiang et al.,
2023; Hu et al., 2023; Chitta et al., 2023), with a low-level controller, such as PID,
utilized to generate the final control. Conversely, earlier studies have emphasized
direct prediction of steering values (Codevilla et al., 2018a). Given the prevalent
adoption of waypoint-based metrics (expanded upon in the subsequent section), we
explore the significance of both action definitions and their correlation with closed-
loop performance.
4.2 Metrics
4.2.1 Pearson Correlation Coefficient
To understand the relationship between performance metrics evaluated in offline sim-
ulation environments and those observed in a closed-loop setup, we calculate the
correlation between these metrics. We quantify the Pearson Correlation Coefficient
($r$) between online metrics ($X$) and offline metrics ($Y$) as:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

where $n$ represents the number of samples, $\bar{X}$ denotes the mean of the online
metrics, and $\bar{Y}$ represents the mean of the offline metrics. The numerator calculates
the covariance between $X$ and $Y$, while the denominator normalizes this covariance
by the standard deviations of $X$ and $Y$, ensuring that $r$ falls within the range
$[-1, 1]$, with 1 indicating perfect positive correlation, $-1$ indicating perfect negative
correlation, and 0 indicating no correlation.
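As a quick illustration of how this coefficient is computed in practice, the snippet below correlates a hypothetical offline metric with a hypothetical online metric over a few sampled models using SciPy; the numbers are placeholders, not results from this thesis.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical example: offline action MAE vs. online driving score for a
# handful of sampled models (values are illustrative only).
offline_mae   = np.array([0.12, 0.18, 0.25, 0.31, 0.40])
driving_score = np.array([78.0, 71.0, 63.0, 55.0, 41.0])

r, p_value = pearsonr(offline_mae, driving_score)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")  # strong negative correlation expected
```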
4.2.2 Open-Loop Metrics
Mean Absolute Error (MAE): The Mean Absolute Error (MAE) serves as a metric
to quantify the average magnitude of errors between predicted and actual values.
It is calculated using Equation 4.1, where $a_i$ represents the ground-truth action vector
and $\hat{a}_i$ denotes the predicted action vector. The $\|\cdot\|_1$ notation signifies the $L_1$
norm, also known as the Manhattan distance, which computes the sum of absolute
differences between corresponding elements of the vectors. In this formulation, the weight
scalars $\alpha_i$ are set to 1, implying equal importance for all observations.

$$\mathcal{L}_{\text{MAE}} = \|a_i - \hat{a}_i\|_1 \tag{4.1}$$
Mean Squared Error (MSE): The Mean Squared Error (MSE) serves as another
metric to gauge the average magnitude of squared errors between predicted and actual
values. It is computed using Equation 4.2, where $a_i$ represents the ground-truth
action vector and $\hat{a}_i$ denotes the predicted action vector. The $\|\cdot\|_2^2$ notation signifies
the squared $L_2$ norm, which calculates the sum of squared differences between corresponding
elements of the vectors. As with MAE, in this formulation of MSE, the weight scalars
$\alpha_i$ are set to 1, indicating equal weighting for all observations.

$$\mathcal{L}_{\text{MSE}} = \|a_i - \hat{a}_i\|_2^2 \tag{4.2}$$
Speed Weighted Mean Absolute Error (SW-MAE): The Speed Weighted Mean
Absolute Error (SW-MAE) is a modified version of the Mean Absolute Error (MAE)
that accounts for the speed of the vehicle when computing prediction accuracy. It is
calculated as the weighted sum of absolute errors between predicted and actual values,
with weights determined by the speed of the vehicle. SW-MAE is expressed as:

$$\mathcal{L}_{\text{SW-MAE}} = \frac{1}{N} \sum_{i=1}^{N} \left| v_i \cdot (a_i - \hat{a}_i) \right| \tag{4.3}$$

where $a_i$ represents the ground-truth action vector, $\hat{a}_i$ denotes the predicted
action vector, $v_i$ is the speed of the vehicle at observation $i$, and $N$ denotes the total
number of observations.
The SW-MAE metric emphasizes errors occurring during higher-speed scenarios
more than those at lower speeds, acknowledging that accuracy requirements may
differ depending on the velocity of the vehicle. This adjustment allows for a more
nuanced evaluation of model performance, particularly in driving scenarios where
speed variability significantly impacts control accuracy.
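A minimal NumPy sketch of these three regression-style metrics is shown below, evaluated over a batch of ground-truth and predicted action vectors. The exact reduction (summing over action dimensions before averaging over samples) is an assumed convention, not one taken from the thesis.

```python
import numpy as np

def mae(a, a_hat):
    """Mean absolute (L1) error between ground-truth and predicted actions (N, D)."""
    return np.mean(np.sum(np.abs(a - a_hat), axis=-1))

def mse(a, a_hat):
    """Mean squared error between ground-truth and predicted actions (N, D)."""
    return np.mean(np.sum((a - a_hat) ** 2, axis=-1))

def speed_weighted_mae(a, a_hat, speed):
    """Speed-weighted MAE: per-sample errors are scaled by the ego speed (N,)."""
    per_sample = np.sum(np.abs(a - a_hat), axis=-1)
    return np.mean(speed * per_sample)
```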
Quantized Classification Error (QCE): Quantized Classification Error (QCE)
offers a method to quantify prediction accuracy by discretizing predictions based
on a predetermined threshold, focusing solely on substantial errors relative to the
ground truth. QCE is expressed as:

$$\mathcal{L}_{\text{QCE}} = 1 - \delta\big(Q(a_i, \sigma),\, Q(\hat{a}_i, \sigma)\big) \tag{4.4}$$

Here, $\delta$ represents the Kronecker delta function, and $Q(x, \sigma)$ is defined as:

$$Q(x, \sigma) = \begin{cases} -1 & \text{if } x < -\sigma \\ 0 & \text{if } -\sigma \le x \le \sigma \\ 1 & \text{if } x > \sigma \end{cases} \tag{4.5}$$

The parameter $\sigma$ sets the threshold for quantization. To ensure meaningful comparisons,
we adopt the same threshold settings as provided by Codevilla et al. (2018a).
Thresholded Relative Error (TRE): Thresholded Relative Error (TRE) enhances
QCE by introducing an adaptive threshold proportional to the ground-truth steering
angle, thereby penalizing errors more heavily when the ground-truth action values are small,
particularly when the steering angle is minimal. TRE is represented as:

$$\mathcal{L}_{\text{TRE}} = H\big(|\hat{a}_i - a_i| - \lambda |a_i|\big) \tag{4.6}$$

In this equation, $H$ denotes the Heaviside step function. Consistent with Codevilla
et al. (2018a), we set $\lambda = 0.1$ and use $\sigma = 0.5$, the midpoint of our steering range.
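The two classification-style metrics can be sketched as follows; the quantization band and the default thresholds are assumptions based on the definitions above.

```python
import numpy as np

def quantize(x, sigma=0.5):
    """Map a scalar action into {-1, 0, +1} around the +/- sigma band."""
    return np.where(x < -sigma, -1, np.where(x > sigma, 1, 0))

def qce(a, a_hat, sigma=0.5):
    """Quantized classification error: fraction of samples whose bins disagree."""
    return np.mean(quantize(a, sigma) != quantize(a_hat, sigma))

def tre(a, a_hat, lam=0.1):
    """Thresholded relative error: fraction of errors exceeding lam * |ground truth|."""
    return np.mean(np.abs(a_hat - a) > lam * np.abs(a))
```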
Average Displacement Error (ADE): The Average Displacement Error is a mea-
sure that computes the average Euclidean distance between the predicted trajectory
and the ground truth positions of the vehicle at each time step over a given trajectory.
Mathematically, it is defined as the sum of these distances over all time steps, nor-
malized by the total number of steps T. The ADE offers a comprehensive overview of
the prediction accuracy throughout the entire predicted trajectory, allowing for the
assessment of the overall consistency of the prediction algorithm. Mathematically,
ADE is defined by:
$$\text{ADE} = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\left(\hat{a}_x^{(t)} - a_x^{(t)}\right)^2 + \left(\hat{a}_y^{(t)} - a_y^{(t)}\right)^2} \tag{4.7}$$

In this equation, $x$ and $y$ denote the x and y coordinate values of the predicted trajectory point, denoted by $\hat{a}$, or the ground-truth trajectory point, denoted by $a$, for timestep $t \le T$.
Final Displacement Error (FDE): The Final Displacement Error measures the
prediction accuracy at the final time step of the trajectory. It is the Euclidean distance
between the predicted final position and the true final position of the vehicle. This
metric is particularly important as it directly relates to the capability of the prediction
system to accurately forecast the vehicle’s final destination point, which is crucial for
planning and collision avoidance in autonomous driving scenarios. FDE is given by:

$$\text{FDE} = \sqrt{(\hat{a}_x - a_x)^2 + (\hat{a}_y - a_y)^2} \tag{4.8}$$

In this equation, $x$ and $y$ denote the x and y coordinate values of the predicted trajectory point, denoted by $\hat{a}$, and the ground-truth trajectory point, denoted by $a$.
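Both displacement metrics reduce to simple per-timestep Euclidean distances, as in the sketch below; the three-step example trajectories and the metric units are illustrative assumptions.

```python
import numpy as np

def ade(traj_true: np.ndarray, traj_pred: np.ndarray) -> float:
    """Average Displacement Error over a trajectory of shape (T, 2) (Eq. 4.7)."""
    dists = np.linalg.norm(traj_pred - traj_true, axis=1)  # per-step Euclidean distance
    return float(dists.mean())

def fde(traj_true: np.ndarray, traj_pred: np.ndarray) -> float:
    """Final Displacement Error: distance at the last time step (Eq. 4.8)."""
    return float(np.linalg.norm(traj_pred[-1] - traj_true[-1]))

# Illustrative 3-step (x, y) waypoint trajectories, assumed to be in meters.
gt   = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.1], [1.0, 0.2], [2.0, 0.4]])
print(ade(gt, pred))  # (0.1 + 0.2 + 0.4) / 3 ~= 0.233
print(fde(gt, pred))  # 0.4
```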
4.2.3 Closed-Loop Metrics
Success Rate (SR): Success Rate (SR) denotes the proportion of successfully completed routes by our autonomous agent, averaged across a set of $N$ routes. It provides a holistic measure of the agent's ability to navigate from start to finish without any critical errors or failures.

$$\text{SR} = \frac{\text{Number of successfully completed routes}}{N} \tag{4.9}$$
Route Completion (RC): Route Completion (RC) quantifies the percentage of a given route successfully traversed by our autonomous agent. It is calculated by averaging the completion percentages across all $N$ routes.

$$\text{RC} = \frac{1}{N} \sum_{i=1}^{N} R_i \tag{4.10}$$

where $R_i$ represents the completion percentage of route $i$.
Collision Rate (CR): The Collision Rate (CR) quantifies the frequency of collisions encountered by the ego vehicle relative to the distance traveled along a given route. It is calculated by dividing the total number of collisions by the distance traveled by the ego vehicle and then averaging across all $N$ routes.

$$\text{CR} = \frac{\text{Total number of collisions}}{\text{Total distance traveled by the ego vehicle}} \tag{4.11}$$
Infraction Rate (IR): The Infraction Rate (IR) measures the frequency of infractions, including collisions and other traffic violations, relative to the distance traveled by the ego vehicle along a given route. It is computed by dividing the total number of infractions by the distance traveled and then averaging across all $N$ routes.

$$\text{IR} = \frac{\text{Total number of infractions}}{\text{Total distance traveled by the ego vehicle}} \tag{4.12}$$
These refined metrics, Collision Rate and Infraction Rate, offer a more standard-
ized assessment of safety and rule compliance, accounting for the distance traveled
by the ego vehicle during route traversal.
Infraction Penalty (IP): To rigorously evaluate safety and rule compliance, the Infraction Penalty (IP) metric incorporates a geometric series of infraction penalty coefficients $p_j$, where $j$ represents different types of infractions incurred during route traversal. The agent's initial score of 1.0 is systematically diminished with each infraction, as per the equation:

$$\text{IP} = \prod_{j}^{\{\text{Ped, Veh, Stat, Red, Stop}\}} (p_j)^{\#\text{infractions}_j} \tag{4.13}$$
Driving Score (DS): The Driving Score (DS) offers a comprehensive measure of the autonomous agent's performance by blending Route Completion (RC) and Infraction Penalty (IP) into a single evaluative metric. Calculated for each route and then averaged across all $N$ routes, DS is expressed as:

$$\text{DS} = \frac{1}{N} \sum_{i=1}^{N} R_i \times P_i \tag{4.14}$$

where $R_i$ represents the Route Completion percentage for route $i$, and $P_i$ represents the Infraction Penalty for route $i$.
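A minimal sketch of how the Infraction Penalty and Driving Score could be computed is given below; the specific penalty coefficients follow the publicly documented CARLA leaderboard convention and are used here only for illustration, not as the exact values in our evaluation.

```python
import numpy as np

# Assumed per-infraction penalty coefficients (CARLA leaderboard convention).
PENALTIES = {"ped": 0.50, "veh": 0.60, "stat": 0.65, "red": 0.70, "stop": 0.80}

def infraction_penalty(counts: dict) -> float:
    """Geometric infraction penalty starting from 1.0 (Eq. 4.13)."""
    ip = 1.0
    for kind, p in PENALTIES.items():
        ip *= p ** counts.get(kind, 0)
    return ip

def driving_score(route_completions, infraction_counts) -> float:
    """Driving Score: mean of per-route RC x IP over N routes (Eq. 4.14)."""
    scores = [rc * infraction_penalty(c)
              for rc, c in zip(route_completions, infraction_counts)]
    return float(np.mean(scores))

rc = [1.0, 0.8]            # per-route completion fractions
counts = [{"veh": 1}, {}]  # route 1 had one collision with a vehicle
print(driving_score(rc, counts))  # (1.0 * 0.6 + 0.8 * 1.0) / 2 = 0.7
```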
4.2.4 Real World Metrics
In addition to the online metrics utilized in our simulation, we have incorporated
additional metrics to ensure a comprehensive and robust evaluation in real-world
settings:
Interventions: Number of interventions per route (e.g., readjusting to stay on
track).
Lane Infractions: Number of times the vehicle crossed lanes and had to be
brought back on track. This is a subset of Infractions.
Collision Rate: Frequency of collisions between the vehicle and other objects
or entities relative to the distance traveled.
Disobeyed Navigation: Number of times the vehicle did not follow navigation
commands and deviated from the route.
Scenario Timeout: Number of instances when the vehicle came to a stop but
did not resume motion even after the stopping stimulus was removed.
Success Rate: Number of times the vehicle reached the intended goal location
without major incidents (e.g., Collisions) out of total scenarios.
Route Completion Rate: Percentage of the route completed.
Running a Red Light: Instances of not stopping at red or orange lights or
stopping at a green light without an obstacle.
Infraction Rate: Frequency of infractions, including lane infractions and colli-
sions, relative to the distance traveled.
Infraction Penalty: Aggregated weighted infractions as a geometric series; the score starts at 1.0 and is reduced each time an infraction is committed.
Driving Score: Product of Route Completion and Infraction Penalty.
4.3 Methodology
In this section, we provide a detailed outline of our experimental setup to determine
the relationship between offline and online evaluation metrics in autonomous driving
research. Furthermore, our primary goal for this correlation analysis is to refine and
ensure the accuracy of correlation results by exploring a diverse array of models within
our simulation environment. While our focus for correlation analysis is predominantly
on simulation due to its advantages, it is imperative to validate our findings in real-
world settings. Hence, we conduct validation experiments on a smaller dataset and
a subset of models in the real world. It is worth noting that our research represents a pioneering effort in the field, as no prior correlation analysis has been conducted in the real world. This novelty underscores the significance of our research in advancing
the understanding and application of autonomous driving technologies.
4.3.1 Network Architecture
For our simulation network architecture, we leverage Transfuser by Chitta et al. (2023)
as our base code, utilizing its default variation due to its state-of-the-art capabilities
and multimodality, providing an optimal foundation for correlation analysis. We
vary the Transfuser model backbone across several alternatives to explore their impact on correlation analysis: two ConvNeXt variants by Liu et al. (2022), small and tiny, and nine RegNet variants by Radosavovic et al. (2020), ranging from RegNetY-200MF to RegNetY-8.0GF. Additionally, we investigate whether the inclusion of LiDAR enhances correlation analysis by comparing models with and without LiDAR. To this end, we also experiment with RGB-only models based on the Transfuser architecture, referred to as Latent-TF.
For the real world, we employ a custom architecture similar to our Sim2Real
setup, as shown in Fig. 4·1. However, in this architecture, we omit the segmentation and depth abstractions, directly accommodating RGB images and control inputs. To gather data, we collect our own training data in the real world, both indoors and outdoors. Evaluation is carried out on the same routes and scenarios as previously mentioned in the Sim2Real experiments, encompassing both online and offline evaluation. Due to the simplified nature of our network, which eliminates semantic and depth features, it operates more efficiently, enabling real-time closed-loop evaluations.

Figure 4·1: Real-world CIL network architecture. A 640x480 image I is given as input along with a one-hot encoded command C to predict the steering S and throttle T. The encoded command C is added to the network multiple times at different stages to improve the robustness of the policy.
4.3.2 Model Sampling
To draw a statistically conclusive correlation analysis, we need multiple models with a varying range in performance. Hence, we train 22 models (11 multimodal and 11 RGB-only) for 40 epochs, saving checkpoints at five epochs (1, 11, 21, 31, and 40), which yields a total of 110 models. By evaluating these models over multiple routes across two towns, we aim to establish statistical significance for the correlation analysis across diverse towns and weather conditions. In the real world, we evaluate seven model variants for the correlation analysis in each of the indoor and outdoor settings.
4.3.3 Model Evaluation
In our model evaluation process, we use the CARLA simulator to assess the performance
of our models across different scenarios. For simulation-based evaluations, our anal-
ysis covers two towns, namely Town02 and Town05. Within each town, we evaluate
performance across six distinct routes, resulting in a total of 12 evaluation routes
spanning the two towns. In the real-world scenario, our evaluation encompasses 10
different scenarios within indoor environments. Additionally, we conduct evaluations
on a single long 250-foot route for outdoor experiments. This comprehensive evalu-
ation strategy allows us to thoroughly assess the robustness and effectiveness of our
models across a range of simulated and real-world conditions.
4.4 Results and Discussion
In this section, we explore the correlation results obtained from our experiments,
focusing on simulation and real-world scenarios. Using radar plots, we compare the
correlation coefficients for each online metric individually, providing insights into the
performance of our models across different environments.
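As a sketch of the underlying computation, the correlation coefficient for one (offline metric, online metric) pair across models could be obtained as below; the use of the Pearson coefficient via np.corrcoef and the synthetic per-model values are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative arrays with one entry per evaluated model checkpoint:
# an offline metric (e.g., ADE) and a closed-loop metric (e.g., driving score).
rng = np.random.default_rng(0)
ade_per_model = rng.uniform(0.1, 2.0, size=110)
ds_per_model = 1.0 - 0.3 * ade_per_model + rng.normal(0.0, 0.1, size=110)

# Pearson correlation coefficient between the offline and online metric.
r = np.corrcoef(ade_per_model, ds_per_model)[0, 1]
print(f"correlation(ADE, Driving Score) = {r:.2f}")

# Repeating this for every (offline metric, online metric) pair yields the
# values plotted on each axis of the radar plots.
```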
4.4.1 Simulation Experiments
We plot the results of all 110 models and summarize them in Fig. 4·2, which shows the correlation of nine offline metrics, including action- and steering-based metrics, against various online metrics such as driving score, success rate, route completion, and infractions. The examination revealed a robust correlation between driving score and ADE (Average Displacement Error), a waypoint-based metric. Moreover, metrics proposed by Codevilla et al. (2018a), notably TRE (Thresholded Relative Error), also show strong correlations with driving score and other performance metrics. However, a discernible lack of correlation was observed for stop infractions, attributable to the models' disregard for stop signs, which produces frequent stop infractions regardless of model quality. Nonetheless, overall infractions displayed a relatively high correlation, peaking at 0.67, signifying a good alignment between offline and online metrics.

Figure 4·2: Simulation Correlation Plot: Results from 110 models highlight strong correlations between driving score and the ADE and action metrics. A lack of correlation is observed for stop infractions due to the models' disregard for stop signs. TRE also shows a good overall correlation. (Online metric axes: Driving Score, Success, Route Completion, All Infractions, Collisions (Object), Collisions (Environment), Route Infractions, Red Light Violations, Stop Infractions; offline metrics compared: Action MAE, Action MSE, FDE, ADE, Steer MAE, Steer MSE, TRE, QCE, SW Steer MAE.)
Figure 4·3: Real World Correlation Analysis: (a) Indoor analysis (left) showed consistent correlations between offline metrics (e.g., action MSE) and online metrics, except for TRE and QCE. (b) Outdoor analysis (right) exhibited more varied correlations, likely due to single-route evaluations and fewer data points compared to simulations. (Online metric axes: Driving Score, Success, Route Completion, Infractions, Collisions; offline metrics compared: Action MAE, Steer MAE, TRE, Action MSE, Steer MSE, QCE.)
4.4.2 Real World Experiments
Indoor Correlation Analysis: As shown in Fig. 4·3a, we observed consistent trends across various offline metrics, including action and steering Mean Squared/Absolute Error (MSE/MAE). These metrics exhibited uniform correlation patterns across all evaluated online metrics, likely attributed to the diverse set of scenario-based evaluations. Since the assessments were event-driven rather than encompassing entire routes, a balanced and consistent correlation was observed across standard metrics. However, notable discrepancies were observed for metrics such as Thresholded Relative Error (TRE) and Quantized Classification Error (QCE), originally proposed by Codevilla et al. (2018a), which displayed poor correlations compared to other metrics.
Outdoor Correlation Analysis: Fig. 4·3b displayed a more varied and unpredictable pattern. While action and steering MSE maintained reasonable correlations
with the Driving Score, other metrics exhibited less consistent correlations. This vari-
ability may stem from evaluating a single route, in contrast to the diverse evaluation
routes used in simulation experiments. Furthermore, the limited number of evaluation
points in the real-world experiments, compared to the 110 models evaluated in the
simulation, likely contributed to the unpredictable correlation observed in outdoor
scenarios since a single outlier may affect the correlation significantly.
4.5 Conclusion
In this study, we investigate the adequacy of offline evaluation for autonomous driving development through extensive simulations and real-world experiments, extending our validation efforts to pioneer the first real-world correlation analysis.
Our findings reveal notable correlations, particularly for waypoint-based metrics
in simulation and action-based metrics in the real world. However, these correla-
tions leave room for further examination, as deeper analysis across a larger number
of models is warranted. Although our analysis reveals promising results, the correla-
tion between offline and online metrics remains sub-optimal, indicating ample room
for enhancement. Consequently, our research contributes to a deeper understanding
of autonomous driving evaluation metrics, aiming to narrow the divide between of-
fline and online evaluation frameworks across simulated environments and real-world
scenarios.
For future work, it is essential to analyze correlations across various robotic tasks
and policies to understand their effectiveness in different contexts. Further explo-
ration of complex real-world data is needed to strengthen our findings. Future re-
search could also focus on creating better offline metrics that correlate more strongly
with existing online metrics.
Chapter 5
Conclusions
5.1 Contributions
In this study, our research aimed to narrow the generalization gap in autonomous driving by addressing key challenges in Sim2Real transfer and offline-online correlation analysis.
Our investigation began with a focus on bridging the gap between simulation and
the real world. By leveraging modern foundation models to abstract the RGB inputs in the dataset, we demonstrated improved performance of models trained in simulation when deployed in real-world scenarios.
Subsequently, we performed a comprehensive analysis of the correlation between
offline and online metrics for autonomous driving networks. Our research revealed
some strong correlations, particularly for trajectory and action-based offline metrics.
However, despite these promising findings, our study also underscored the need for
further refinement. While certain metrics exhibited strong correlations, there re-
mained areas of divergence, highlighting the complexity of evaluating driving model
performance. Our investigation serves as a critical step toward understanding the
nuances of offline-online correlation analysis and provides valuable insights for future
research endeavors.
Overall, our study pioneers the application of foundation models in Sim2Real for
autonomous driving and conducts novel correlation analysis in real-world settings.
These contributions offer vital insights for advancing autonomous driving research,
laying the foundation for more robust and reliable autonomous systems.
5.2 Future Work
In future work, there is significant potential for advancing Sim2Real analysis beyond
our current study’s scope. While our initial findings show promising improvements
over baseline models, it’s crucial to recognize the limitations of our evaluation ap-
proach, as highlighted in our correlation analysis. Future research should prioritize
comprehensive quantitative analysis across diverse real-world driving scenarios, par-
ticularly in urban environments. Additionally, addressing drawbacks such as traffic
light abstraction presents opportunities for enhancement. Future work could focus on
modifying the network architecture to incorporate traffic light detection and recogni-
tion capabilities, improving the model’s ability to interpret and respond to complex
traffic scenarios, thereby enhancing overall driving performance in real-world settings.
Similarly, extending the analysis beyond autonomous driving and exploring cor-
relations across a spectrum of robotic tasks and policies can be a good starting point
for future work. Furthermore, encompassing complex real-world datasets and eval-
uation scenarios will bolster the robustness of our conclusions and provide deeper
insights into end-to-end policies. Additionally, there is a pressing need for the de-
velopment of improved offline metrics that demonstrate stronger correlations with
established online metrics. By refining our evaluation methodologies and metrics, we
can enhance the accuracy and reliability of performance assessments for autonomous
systems, ultimately advancing the field of robotics and AI.
5.3 Limitations & Broader Impacts
In reflecting on our research, two primary limitations emerge. Firstly, our evaluation
methodology for Sim2Real experiments primarily focuses on offline scenarios, lacking
assessment in online and closed-loop evaluations. Incorporating online evaluation
would provide a more comprehensive understanding of model performance in real-time
driving conditions. Secondly, our real-world evaluation is constrained by the limited
number of models used for correlation analysis, with only seven models each for indoor
and outdoor scenarios. Expanding the research to include a larger number of models
would enhance the robustness of correlation analysis, offering greater resilience against
outliers or noise in the data. Addressing these limitations in future research endeavors
would contribute to a more thorough and reliable assessment of autonomous driving
systems.
References
(2024). Agile-X LIMO. https://docs.trossenrobotics.com/agilex limo docs.
(2024). Carla autonomous driving leaderboard. https://leaderboard.carla.org/.
(2024). Traxxas X-Maxx RC Vehicle. https://traxxas.com/products/landing/x-maxx.
Alhaija, H. A., Mustikovela, S. K., Mescheder, L., Geiger, A., and Rother, C. (2017).
Augmented reality meets deep learning for car instance segmentation in urban
scenes. In BMVC.
Bansal, M., Krizhevsky, A., and Ogale, A. (2019). ChauffeurNet: Learning to drive
by imitating the best and synthesizing the worst. In RSS.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel,
L. D., Monfort, M., Muller, U., Zhang, J., et al. (2016a). End to end learning for
self-driving cars. arXiv preprint arXiv:1604.07316.
Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., and
Jackel, L. D. (2016b). End to end learning for self-driving cars. arXiv:1604.07316.
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., Downs,
L., Ibarz, J., Pastor, P., Konolige, K., Levine, S., and Vanhoucke, V. (2018). Using
simulation and domain adaptation to improve efficiency of deep robotic grasping.
In ICRA.
Casas, S., Sadat, A., and Urtasun, R. (2021). Mp3: A unified model to map, perceive,
predict and plan. In CVPR.
Chen, L., Platinsky, L., Speichert, S., Osiński, B., Scheel, O., Ye, Y., Grimmett, H., Del Pero, L., and Ondruska, P. (2021). What data do we need for training an AV motion planner? In ICRA.
Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., and Geiger, A. (2023). Trans-
fuser: Imitation with transformer-based sensor fusion for autonomous driving.
PAMI.
Clavera, I., Held, D., and Abbeel, P. (2017). Policy transfer via modularity and
reward guiding. In IROS.
Codevilla, F., López, A. M., Koltun, V., and Dosovitskiy, A. (2018a). On offline evaluation of vision-based driving models. In ECCV.
Codevilla, F., Müller, M., López, A., Koltun, V., and Dosovitskiy, A. (2018b). End-to-end driving via conditional imitation learning. In ICRA.
Cui, A., Casas, S., Sadat, A., Liao, R., and Urtasun, R. (2021). Lookout: Diverse
multi-future prediction and planning for self-driving. In ICCV.
Dauner, D., Hallgarten, M., Geiger, A., and Chitta, K. (2023). Parting with
misconceptions about learning-based vehicle motion planning. arXiv preprint
arXiv:2306.07962.
Devin, C., Gupta, A., and Gupta, P. A. (2017). Learning modular neural network
policies for multi-task and multi-robot transfer. arXiv:1710.03641.
Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., and Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In ICCV.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. (2017). CARLA:
An open urban driving simulator. In CoRL.
Gaidon, A., Wang, Q., Cabon, Y., and Vig, E. (2016). Virtual worlds as proxy for
multi-object tracking analysis. In CVPR.
González, D., Pérez, J., Milanés, V., and Nashashibi, F. (2015). A review of motion planning techniques for automated vehicles. T-ITS.
Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2017). Learning invariant
feature spaces to transfer skills with reinforcement learning. In ICLR.
Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., and Cipolla, R. (2016).
Understanding realworld indoor scenes with synthetic data. In CVPR.
Hawke, J., Shen, R., Gurau, C., Sharma, S., Reda, D., Nikolov, N., Mazur, P.,
Micklethwaite, S., Griffiths, N., Shah, A., et al. (2020). Urban driving with
conditional imitation learning. In ICRA.
Hinterstoisser, S., Lepetit, V., Wohlhart, P., and Konolige, K. (2017). On pre-trained
image features and synthetic images for deep learning. In arXiv:1710.10710.
Hu, S., Chen, L., Wu, P., Li, H., Yan, J., and Tao, D. (2022). ST-P3: End-to-end
vision-based autonomous driving via spatial-temporal feature learning. In ECCV.
Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang,
W., et al. (2023). Planning-oriented autonomous driving. In CVPR.
James, S., Paleyes, A., and Kiciman, E. (2017). Transferring end-to-end visuomotor
control from simulation to real world for a multi-stage task. In CoRL.
Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W.,
Huang, C., and Wang, X. (2023). VAD: Vectorized scene representation for efficient
autonomous driving. ICCV.
Johnson-Roberson, M., Barto, C., Mehta, R., Sridhar, S. N., Rosaen, K., and Vasude-
van, R. (2017). Driving in the matrix: Can virtual worlds replace human-generated
annotations for real world tasks? In ICRA.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., Lam, V.-D.,
Bewley, A., and Shah, A. (2019). Learning to drive in a day. In ICRA.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A. C., Lo, W.-Y., et al. (2023). Segment anything. In
Proceedings of the IEEE/CVF International Conference on Computer Vision.
Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. (2017). Dart: Noise
injection for robust imitation learning. In CoRL.
LeCun, Y., Muller, U., Ben, J., Cosatto, E., and Flepp, B. (2005). Off-road ob-
stacle avoidance through end-to-end learning. In Advances in Neural Information
Processing Systems.
Li, L. L., Yang, B., Liang, M., Zeng, W., Ren, M., Segal, S., and Urtasun, R. (2020).
End-to-end contextual perception and prediction with interaction transformer. In
IROS.
Li, Z., Yu, Z., Lan, S., Li, J., Kautz, J., Lu, T., and Alvarez, J. M. (2023). Is ego
status all you need for open-loop end-to-end autonomous driving? arXiv preprint
arXiv:2312.03031.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). A
convnet for the 2020s. In CVPR.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv
preprint arXiv:1711.05101.
Mahler, J., Liang, J., Niyaz, S., Laskey, M., Doan, R., Liu, X., Ojea, J. A., and Gold-
berg, K. (2017a). Learning ambidextrous robot grasping policies. arXiv:1711.06839.
Mahler, J., Matl, M., Liu, X., Li, A., Gealy, D., and Goldberg, K. (2017b). Dex-net
2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic
grasp metrics. arXiv:1703.09312.
Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., and Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR.
McCormac, J., Handa, A., Leutenegger, S., and Davison, A. J. (2017). Scenenet
RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor
segmentation? In ICCV.
Michels, J., Saxena, A., and Ng, A. Y. (2005). High performance visual servoing:
robot driving by visual road detection. ICRA.
Müller, M., Dosovitskiy, A., Ghanem, B., and Koltun, V. (2018). Driving policy transfer via modularity and abstraction. In CoRL.
Muller, U., Ben, J., Cosatto, E., Flepp, B., and Cun, Y. L. (2006). Off-road obstacle
avoidance through end-to-end learning. In NeurIPS.
Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., Peters, J., et al. (2018). An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics.
Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W., and Abbeel, P. (2018).
Asymmetric actor critic for image-based robot learning. In RSS.
Pomerleau, D. (1988). ALVINN: An autonomous land vehicle in a neural network.
In Advances in Neural Information Processing Systems.
Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network.
In NeurIPS.
Prakash, A., Behl, A., Ohn-Bar, E., Chitta, K., and Geiger, A. (2020). Exploring
data aggregation in policy learning for vision-based urban autonomous driving. In
CVPR.
Prakash, A., Chitta, K., and Geiger, A. (2021). Multi-modal fusion transformer for
end-to-end autonomous driving. In CVPR.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. (2020). Designing network design spaces. In CVPR.
Richter, S. R., Vineet, V., Roth, S., and Koltun, V. (2016). Playing for data: Ground
truth from computer games. In ECCV.
Ros, G., Sellart, L., Materzynska, J., Vázquez, D., and López, A. M. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR.
Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and
structured prediction to no-regret online learning. In AISTATS.
Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R., and Hadsell, R. (2017). Sim-to-real robot learning from pixels with progressive nets. In CoRL.
Sadeghi, F., Toshev, A., Jang, E., Levine, S., and Mahadevan, S. (2017a). Cad2rl:
Real single-image flight without a single real image. arXiv:1708.07848.
Sadeghi, F., Toshev, A., Jang, E., Levine, S., and Mahadevan, S. (2017b). Sim2real
view invariant visual servoing by recurrent control. In CoRL.
Schreier, T., Renz, K., Geiger, A., and Chitta, K. (2023). On offline evaluation of 3d
object detection for autonomous driving. In ICCV.
Shao, H., Wang, L., Chen, R., Li, H., and Liu, Y. (2023). Safety-enhanced au-
tonomous driving using interpretable sensor fusion transformer. In CoRL.
Skinner, J., Garg, S., Sünderhauf, N., Corke, P., Upcroft, B., and Milford, M. (2016). High-fidelity simulation for evaluating robotic vision performance. In IROS.
Tang, J., Shaoshan, L., Pei, S., Zuckerman, S., Chen, L., Shi, W., and Gaudiot, J.-L.
(2018). Teaching autonomous driving using a modular and integrated approach.
In COMPSAC.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. (2017).
Domain randomization for transferring deep neural networks from simulation to
the real world. In IROS.
Tsirikoglou, A., Kronander, J., Wrenninge, M., and Unger, J. (2017). Procedural
modeling and physically based rendering for synthetic data generation in automo-
tive applications. arXiv:1710.06270.
Tu, J., Suo, S., Zhang, C., Wong, K., and Urtasun, R. (2023). Towards scalable
coverage-based testing of autonomous vehicles. In CoRL.
Tzeng, E., Devin, C., Hoffman, J., Finn, C., Abbeel, P., Levine, S., Saenko, K., and
Darrell, T. (2016). Towards adapting deep visuomotor representations from simu-
lated to real environments. In Workshop on Algorithmic Foundations of Robotics
(WAFR).
Viereck, U., Wijmans, E., Prasad, A., Tulsiani, S., Zhu, J., Davidson, J., Corso, J. J.,
Fitter, N., Boots, B., and Koltun, V. (2017). Learning a visuomotor controller for
real world robotic grasping using simulated depth images. In arXiv:1710.04835.
Wu, P., Chen, L., Li, H., Jia, X., Yan, J., and Qiao, Y. (2023). Policy pre-training
for end-to-end autonomous driving via self-supervised geometric modeling. ICLR.
Wulfmeier, M., Posner, I., and Abbeel, P. (2017). Mutual alignment transfer learning.
In CoRL.
Xu, H., Gao, Y., Yu, F., and Darrell, T. (2017). End-to-end learning of driving
models from large-scale video datasets. In CVPR.
Xu, W., Wang, Q., and Dolan, J. M. (2021). Autonomous vehicle motion planning
via recurrent spline optimization. In ICRA.
Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., and Zhao, H. (2024). Depth
anything: Unleashing the power of large-scale unlabeled data. In CVPR.
Zhang, C., Guo, R., Zeng, W., Xiong, Y., Dai, B., Hu, R., Ren, M., and Urtasun, R.
(2022). Rethinking closed-loop training for autonomous driving. In ECCV.
Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J.-Y., Jin, H., and Funkhouser, T.
(2017). Physically-based rendering for indoor scene understanding using convolu-
tional neural networks. In CVPR.
Zhang, Z. and McCarthy, Z. K. (2017). Reaching the unreachables: A probabilistic
look at motion prediction for collision avoidance. In CoRL.
CURRICULUM VITAE
Animikh Aich
animikh@bu.edu - animikhaich.github.io
EDUCATION
Boston University (BU) - Boston, MA September 2022 - Present
MS in Artificial Intelligence - Advisor: Eshed Ohn-Bar
Area of Interests: Autonomous Driving, Generative AI, Vision & Language
Visvesvaraya Technological University, India August 2015 - July 2019
Bachelor of Engineering in Electronics & Communication Engineering
WORK EXPERIENCE
Boston University, Boston, MA January 2023 - Present
Research Assistant
Evaluated and refined 160 state-of-the-art models to establish precise open and
closed-loop assessment metrics, introducing innovative real-world evaluations as
a groundbreaking research advancement.
Invented an original event-based metric to bolster safety assessments, yielding
a significant correlation boost of up to 7.3%, particularly in crucial turning and
braking situations.
Innovated model sampling techniques for multi-modal end-to-end Transformer-
based imitation learning frameworks, incorporating test-time dropout and di-
verse backbones to train agents for CARLA evaluations.
Spearheaded the integration of conditional imitation learning into real-world
scenarios, successfully deploying it indoors and outdoors.
Utilized DINOv2 transfer learning to detect AI-generated images from human
artwork, enhancing analysis of changes in artist productivity trends based on
the adoption of Generative AI models.
PRADCO - Outdoor Brands, Remote, USA June 2023 - August 2023
Machine Learning Engineer
Applied Real-ESRGAN to enhance and upscale wildlife images from 640x360
to 2560x1440, delivering high-resolution reconstructions across diverse lighting
conditions for users.
Developed models for both attacking and defending in large citation graphs.
Designed antler segmentation and counting using object detection, semantic
segmentation, and pose estimation by applying foundation models Grounding
DINO, SegmentAnything, and ViTPose+.
Demonstrated viability of utilizing NeRF-based algorithms to create 3D re-
constructions of antlers from monocular video clips through development of a
proof-of-concept.
Wobit.ai, India June 2019 - June 2022
Computer Vision Engineer and Lead
Led a team of 14 engineers and developed 90 real-time video analytics solutions
deployed globally across 20K+ CCTV cameras, increasing hygiene compliance
by 2x in the food and hospitality industry.
Enforced safety and hygiene compliance with distinctive vision-based person
identification and multi-object tracking algorithms, deployed across 3 continents
lowering non-compliance by over 25%.
Reduced data-to-production time by building development tools for data and
models resulting in a 3x increase in productivity, positively impacting the team’s
efficiency and decreasing time-to-market by 50%.
Implemented Synthetic Dataset Generation for object detection, shrinking la-
beled data requirements by 35% and accelerating computer vision model devel-
opment, improving time-to-market by 25%.
Improved alert precision by up to 95% by implementing a novel algorithm lever-
aging ensemble models and temporal image features, resulting in a 30% decrease
in false positive alerts.
PUBLICATIONS
Sanket Kalwar, Animikh Aich, and Tanay Dixit, “LatentGAN Autoencoder:
Learning Disentangled Latent Distribution.” arXiv preprint arXiv:2204.02010
(2022).
Animikh Aich, Akshay Krishna, Akhilesh V., and Chetana Hegde, “Encoding
web-based data for efficient storage in machine learning applications.” In 2019
Fifteenth International Conference on Information Processing (ICINPRO), pp.
1-6. IEEE, 2019.
Akshay Krishna, Akhilesh V., Animikh Aich, and Chetana Hegde, “Sentiment
analysis of restaurant reviews using machine learning techniques.” In Emerging
Research in Electronics, Computer Science and Technology: Proceedings of
International Conference, ICERECT 2018, pp. 687-696. Springer Singapore,
2019.
Akshay Krishna, Akhilesh V., Animikh Aich, and Chetana Hegde, “Sales-
forecasting of retail stores using machine learning techniques.” In 2018 3rd in-
ternational conference on computational systems and information technology
for sustainable solutions (CSITSS), pp. 160-166. IEEE, 2018.
Akshay Krishna, Animikh Aich, and Chetana Hegde, “Analysis of customer
opinion using machine learning and NLP techniques.” International Journal of
Advanced Studies of Scientific Research 3, no. 9 (2018).
SKILLS
Languages and Libraries: Python, PyTorch, HuggingFace, TensorFlow, Keras,
PyTorch Lightning, OpenCV, NumPy, pandas, Seaborn, W&B, Docker, AWS,
Azure, Intel OpenVINO, Nvidia Triton, Tensorflow Serving, TensorRT.
Technologies: Transfer Learning, Transformers, Foundation Models, Gener-
ative Models, Multi-Modal Learning, Self-Supervised Learning, Fine-Tuning,
Sensor Fusion, Imitation Learning, LLMs, LangChain.
ACADEMIC HONORS, AWARDS AND ACTIVITIES
Reviewer: The Journal of Open Source Software (2023-2024)
Teaching Assistant: EC 518-Robot Learning (Fall 2023, BU)
Best Outgoing Student: RNS Institute of Technology (2019)
Letter of Appreciation: RNS Institute of Technology (2019)
First Prize: BITES BXSPA, IIIT-Bangalore (2019)
Best Paper Award: ICERECT (2018)