Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey

The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches, and eventually to increase the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.
Julian Wörmann7, Daniel Bogdoll9, Etienne Bührle9, Han Chen2, Evaristus Fuh Chuo2, Kostadin Cvejoski8,
Ludger van Elst4, Tobias Gleißner8, Philip Gottschall8, Stefan Griesche10, Christian Hellert3,
Christian Hesels8, Sebastian Houben8, Tim Joseph9, Niklas Keil1, Johann Kelsch5, Hendrik Königshof9,
Erwin Kraft3, Leonie Kreuser1, Kevin Krone8, Tobias Latka6, Denny Mattern8, Stefan Matthes7,
Mohsin Munir4, Moritz Nekolla9, Adrian Paschke8, Maximilian Alexander Pintz8, Tianming Qiu7,
Faraz Qureishi11, Syed Tahseen Raza Rizvi4, Jörg Reichardt3, Laura von Rueden8, Stefan Rudolph6,
Alexander Sagel7, Gerhard Schunk11, Hao Shen7, Hendrik Stapelbroek2, Vera Stehr11,
Gurucharan Srinivas5, Anh Tuan Tran10, Abhishek Vivekanandan9, Ya Wang8, Florian Wasserrab1,
Tino Werner5, Christian Wirth3, and Stefan Zwicklbauer3
1Alexander Thamm GmbH
2Capgemini Engineering
3Continental AG
4Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI)
5Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)
6Elektronische Fahrwerksysteme GmbH
7fortiss GmbH
8Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. (FOKUS & IAIS)
9FZI Forschungszentrum Informatik
10Robert Bosch GmbH
11Valeo Schalter und Sensoren GmbH
Acknowledgement: The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Climate
Action within the project “KI Wissen Entwicklung von Methoden für die Einbindung von Wissen in maschinelles Lernen”. The
authors would like to thank the consortium for the successful cooperation.
arXiv:2205.04712v1 [cs.LG] 10 May 2022
1 Introduction
2 Overview use case domains
2.1 Perception
2.2 Situation Interpretation
2.3 Planning
3 Knowledge Representations
3.1 Symbolic Representations and Knowledge Crafting
3.2 Knowledge Representation Learning
4 Knowledge Integration
4.1 Auxiliary Losses and Constraints
4.2 Neural-symbolic Integration
4.3 Attention Mechanism
4.4 Data Augmentation
4.5 State Space Models
4.6 Reinforcement Learning
4.7 Deep-Learning with Prior Knowledge Maps
5 Knowledge Transfer
5.1 Transfer Learning
5.2 Continual Learning
5.3 Meta Learning
5.4 Active Learning
6 Knowledge Extraction - Symbolic Explanations
6.1 Rule Extraction and Rule Learning
6.2 Structured Output Prediction
6.3 Natural Language Processing for Legal Domain
6.4 Question Answering
7 Knowledge Extraction - Visual Explanations
7.1 Visual Analytics
7.2 Saliency Maps
7.3 Interpretable Feature Learning
8 Knowledge Conformity
8.1 Uncertainty Estimation
8.2 Causal Reasoning
8.3 Rule Conformity
List of Abbreviations
References
1 Introduction
Data-driven learning, first and foremost deep learning, has
become a key paradigm in the vast majority of current
Artificial Intelligence (AI) and Machine Learning (ML)
applications. The excellent performance of many models
learned in a supervised manner can be predominantly
attributed to the availability of huge amounts of labeled
data. Prominent examples are image classification and object
detection, sequential data processing as well as decision
making. On the downside, the unprecedented performance
comes at the cost of a lack of interpretability and transparency,
leading to so-called black-box models that do not allow for
easy and straightforward human inspection.
Transferring data-driven approaches to safety-critical
applications thus becomes a major challenge. Usually, in
these scenarios labeled data is scarce due to high acquisition
costs or, not least, for ethical reasons. Furthermore, both
developers and users postulate the requirement to be able to
understand the decisions made by the deployed model. In
order to tackle both problems, the exploitation of knowledge
sources in the form of, e.g., basic laws of physics, logical
databases of facts, common behaviour in certain scenarios,
or simply counterexamples is key to evolve purely data-
driven models towards robustness against perturbations,
better generalization to unseen samples, and conformity to
existing principles of safe and reliable behaviour.
This survey provides a collection of different methods
that are suitable for augmenting data-driven models
with knowledge, for extracting informative concepts and patterns
from given models, and for comparing observed outputs and
representations with existing basic assumptions and common
understanding about safe, reliable and intuitive behaviour.
Eventually, this overview on the integration of knowledge
and data will pave the way to trustworthy AI
that can be used safely in critical applications.
This review of the state of the art is structured as follows.
In Chapter 2, we introduce three major tasks
that autonomous agents encounter during interaction with
their environment. Chapter 3 discusses different perspectives
on representing knowledge and making it machine readable.
Chapter 4 then reviews general methods for
combining knowledge with data-driven approaches, as well
as more specific methods tailored to the autonomous driving
use case. Furthermore, Chapter 5
introduces learning paradigms in the context of knowledge
transfer.
focusing on the extraction of concepts and structures are
outlined in the subsequent chapters. While Chapter 6 summa-
rizes methods that provide symbolic, partly natural language
explanations, Chapter 7 puts emphasis on procedures that
allow for visual inspection of the decision process. We
conclude our survey in Chapter 8 with an overview of
techniques that consider conformity to already existing as
well as newly discovered knowledge components, which
eventually completes the pipeline of knowledge empowered
artificial intelligence.
2 Overview use case domains
The task of automated driving may be subcategorized
into the following categories: perception, situation
interpretation, planning and control [306]. The foremost task
in autonomous driving is to perceive and understand
the environment around the vehicle. Section 2.1 provides an
introduction to the perception module with a special focus
on image-based pedestrian detection. Once the objects are
detected and segmented, the second task in autonomous
driving is to understand the environment along with the
road users. Situation interpretation is a decisive step
for performing safe maneuvers. In this module, the goal is
to answer important questions related to objects' states and
actions, such as what an object could do next. An overview
is given in Section 2.2. After these situational questions
have been resolved, the next task in autonomous driving is
to plan the motion of the ego vehicle. The planning module
described in Section 2.3 utilizes the output of the previous
two modules and takes high-level routing and trajectory
planning decisions.
2.1 Perception
Authors: Syed Tahseen Raza Rizvi, Mohsin Munir, Ludger van Elst
2.1.1 Perception in the AD Stack
Perception plays a crucial role in attaining the goal of au-
tonomous driving. An ego-vehicle is generally equipped with
a variety of sensors including cameras, lidar and radar. These
sensors serve as the senses of an ego-vehicle and therefore
enable the capability of perceiving the environment around
the ego-vehicle in different spectrums. Object detection, and
in particular pedestrian detection, has significant importance
in the perception spectra as it serves as a critical piece of
information for the downstream tasks associated with the
autonomous driving pipeline.
2.1.2 Task Formulation
Autonomous driving systems rely heavily on object detection
models to identify all traffic participants. Pedestrians
are usually the most common traffic
participants. Therefore, the detection of pedestrians is particularly
prominent and crucial for the perception of an autonomous
driving system.
Pedestrian detection deals with the identification of
pedestrians in the environment around an ego-vehicle. There
exist approaches in the literature which perform pedestrian
detection using only lidar sensors [438]. However, such
approaches are usually not popular in the community because
the features obtained from camera images
are significantly richer than those obtained
from lidar or radar. Other work, such as [223], uses lidar
to incorporate depth information into the image data for
the pedestrian detection task. Consequently, approaches that
perform pedestrian detection mainly on camera images
are the most widely adopted. The images from the mounted
cameras serve as an input from which individual pedestrians
are identified and enclosed in bounding boxes. A variety
of solutions have been proposed to effectively identify
individual pedestrians in the surrounding environment.
The neural network based object detection solutions can
be divided into two main categories: One-stage and Two-
stage approaches. One-stage approaches are generally based
on a fully convolutional architecture and consider the object
detection problem as a simple regression problem [693]. For
a given input image, the One-stage detectors learn class
probabilities and the coordinates of a bounding box encom-
passing an object. On the other hand, Two-stage approaches
are more sophisticated where each stage specializes in a
sub-task which eventually contributes to the final output of
the system. The first stage is responsible for identifying the
region of interest and the second stage is responsible for the
object classification and bounding box regression. Both types
of approaches have certain pros and cons. Most notably, Two-
stage approaches yield better detection accuracy than One-
stage approaches as they have specialized stages where the
output of the second stage is built on top of the output of the
first stage. However, One-stage approaches are much faster
than Two-stage approaches as they do not have an additional
stage with supplementary computational overhead.
Single Shot MultiBox Detector (SSD) [447], You Only
Look Once (YOLO) [589, 591, 590], RetinaNet [437] and
Fully Convolutional One-Stage object detector [720] are
the most prominent One-stage object detectors. Generally,
these approaches divide the image into a grid and then
predict the probability of a class object in each grid cell
along with its bounding box coordinates. However, some
of these approaches differ slightly, as they employ
a focal loss or pixel-wise classification to achieve a
higher detection accuracy in real time. On the other hand,
Fast Regions with CNN (R-CNN) [249], Faster R-CNN [297]
and MimicDet [459] are the most common
examples of Two-stage object detectors. Generally, the first
stage in these Two-stage object detection models consists of
a Region Proposal Network (RPN), whereas in the second
stage the candidate region proposals are classified based
on the feature maps. Approaches like Mask R-CNN add
a mask branch, which is a small Fully Convolutional Network
(FCN) [144] applied to each Region of Interest (RoI),
predicting a pixel-wise segmentation mask. Additionally, a
Feature Pyramid Network (FPN) [436] is generally used in
combination with RPN and Faster R-CNN to make bounding
box proposals more robust, especially for small objects.
Pedestrian detection is applied in various vision-based
applications ranging from surveillance to autonomous driving.
Despite the good performance of existing detectors, it often
remains unclear how well they perform on unseen data. Hasan et al. [288]
presented a study on the generalization capabilities of
pedestrian detectors. In their cross-dataset evaluation, they
tested several backbones with their baseline detector
Cascade R-CNN [88] on well-known autonomous driving datasets
including Caltech [175], CityPersons [854], ECP [77],
CrowdHuman [666], and Wider Pedestrian [124]. Cross-dataset
evaluation is an effective way of evaluating a method
on unseen data and checking its generalization capability;
otherwise, a method may overfit to a single dataset. The
authors demonstrated that existing pedestrian
detection methods perform poorly compared with
general object detection methods when given larger and more diverse
datasets. A carefully trained state-of-the-art general-purpose
object detector can outperform pedestrian-specific detection
methods. The key lies in the training pipeline and the
dataset. In this study, the authors used large datasets that
contain more persons per image. These general-purpose
datasets, generally collected by crawling the web and
through surveillance cameras, are likely to cover more human
poses, appearances, and occlusion cases than
pedestrian-specific datasets. The study also shows
that progressively fine-tuning the models from the largest
(general-purpose) to the smallest (close to target domain) dataset
improves performance. The generalization ability
of pedestrian detectors has been compromised by the
lack of diversity and density in the pedestrian benchmarks.
However, benchmarks such as WiderPerson [856], Wider
Pedestrian [124], and CrowdHuman [666] provide much
higher diversity and density.
Pedestrian detection has improved considerably in recent years;
however, it is still challenging to detect occluded pedestrians.
Pedestrian appearance varies across scenarios and
depends on a wide range of occlusion patterns. To address
this issue, Zhang et al. [855] proposed an architecture for
pedestrian detection based on Faster R-CNN. In contrast
to ensemble models for the most frequent occlusion patterns, the
authors leverage different attention mechanisms to guide the
detector in paying more attention to the visible body parts.
They propose to employ channel-wise attention in a
convolutional network, which allows the network to learn more
representative features for different occluded body parts
in one model. The observation that many Convolutional
Neural Network (CNN) channels in a pedestrian detector are
localizable strongly motivates them to re-weight
channel features so that the detector pays more
attention to the visible body parts. In order to generate
the attention vector, different realizations of attention
networks are examined. The attention vector is trained end-to-end
for all of the attention networks, either through self-attention
or guided by additional external information
such as convolution features, visible bounding boxes, or part
detection heatmaps. Eventually, the features are passed to the
classification network for category prediction and bounding
box regression. Experimental results are reported on the
CityPersons [854], Caltech [175], and ETH [192] datasets and
show improvements over the baseline Faster R-CNN.
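As a rough illustration of the channel-wise re-weighting idea described above, the following sketch scales each channel of a feature map by an attention weight in (0, 1). It is not the architecture of [855]: the global-average "squeeze", the sigmoid gate, and the names `channel_attention` and `reweight` are assumptions chosen for brevity, and a trained attention network would replace the hand-written gate.

```python
import math


def channel_attention(features):
    """Return one weight in (0, 1) per channel.

    `features` is a list of channels, each a 2D list (H x W) of floats.
    Assumption: the weight is the sigmoid of the channel's global average.
    """
    weights = []
    for channel in features:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        weights.append(1.0 / (1.0 + math.exp(-mean)))
    return weights


def reweight(features, weights):
    """Scale every activation of channel c by weights[c]."""
    return [
        [[v * w for v in row] for row in channel]
        for channel, w in zip(features, weights)
    ]


# Two toy 2x2 channels: one with strong positive activations (think of a
# visible body part) and one with strong negative activations.
features = [
    [[4.0, 4.0], [4.0, 4.0]],
    [[-4.0, -4.0], [-4.0, -4.0]],
]
weights = channel_attention(features)
attended = reweight(features, weights)
```

In this toy setting the first channel keeps almost all of its magnitude while the second is strongly suppressed, which is the qualitative effect the attention vector is meant to have on channels that respond to occluded parts.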
Vulnerable road user detection is another major challenge
in pedestrian detection. The safety of road users is and
should be the utmost priority in the domain of autonomous
driving. In addition to detecting occluded pedestrians, another
key challenge is to detect pedestrians at long range. Detecting a
pedestrian at long range increases the safety
of both the pedestrian and the driver and also leads to
a more comfortable driving experience. Fürst et al. [223] introduced
an approach that targets long-range 3D pedestrian detection.
Their approach leverages the density of Red Green Blue
(RGB) images and the precision of lidar. The symmetrical fusion
of RGB and lidar helps them outperform the current state of the
art for long-range 3D pedestrian detection.
2.1.3 Goals and Requirements
Perception plays a pivotal role in autonomous driving. It
enables the ego vehicle to analyze and understand the
traffic scene and surrounding circumstances. Detection of
traffic participants, e.g., pedestrians, vehicles, and cyclists,
serves as the core of perception involved in autonomous
driving. Additionally, traffic circumstances like road, weather,
and light conditions are also important factors in a traffic
scenario. For instance, rainy weather results in a wet road,
which in turn has a direct impact on decisions that depend on the
braking distance, because the braking distance is
significantly longer on a wet road than on a normal dry
road. Therefore, traffic participants and their surrounding
circumstances collectively provide a basis for planning and
executing decisions taken by an ego vehicle. The significance
of the perception can be understood by the fact that it di-
rectly contributes towards use cases like collision avoidance,
trajectory planning, etc.
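The effect of a wet road on braking distance can be made concrete with the idealized textbook relation d = v^2 / (2 * mu * g), which ignores reaction time and assumes constant friction. The sketch below is only an illustration: the function name `braking_distance` and the friction coefficients (0.7 for dry, 0.4 for wet asphalt) are assumed values, not taken from the survey.

```python
G = 9.81  # gravitational acceleration in m/s^2


def braking_distance(speed_kmh, friction):
    """Idealized braking distance in metres: d = v^2 / (2 * mu * g).

    Assumptions: constant friction coefficient, no reaction time,
    level road, full braking from `speed_kmh` to standstill.
    """
    v = speed_kmh / 3.6  # convert km/h to m/s
    return v * v / (2.0 * friction * G)


# Illustrative friction coefficients (assumed, not measured values).
dry = braking_distance(50.0, friction=0.7)
wet = braking_distance(50.0, friction=0.4)
```

Under these assumptions, braking from 50 km/h takes roughly 14 m on the dry surface and roughly 25 m on the wet one, so a perception module that reports road condition changes the safe following distance substantially.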
With the rise of deep learning for solving a wide range
of different tasks, object detection has also benefited from
deep learning One- and Two-stage models to achieve higher
detection performance. The effectiveness of an object detection
approach relies heavily on the efficacy of the trained
object detection model. In other words,
given an effective object detection model, the quality of
perception can be ensured. Training an effective object
detection model requires a large amount of high-quality
data. For this purpose, several real-life public datasets are
available, e.g., Caltech [175], CityPersons [854], and ECP [77].
However, certain scenarios are scarce or outright
infeasible to record in such pedestrian detection datasets. For example,
it is infeasible to find a dataset that contains a traffic scenario
where the ego vehicle is about to collide with another traffic
participant. Such a scenario can be helpful for evaluating how well
an object detection model supports detecting and evading
a collision in such a hazardous environment. For this purpose,
datasets with simulated custom scenarios can be generated
to fill this gap in real-life datasets. Ultimately, a combination
of real and simulated data is key to enabling the object
detection model to perform effectively under unseen
or rarely occurring traffic scenarios.
2.1.4 Necessity of Knowledge Integration
Computer vision methods and ML methods in general have
significantly improved over the last years. Different methods
are able to accurately interpret a situation presented in an
image or video. Even with such advancements, there are
scenarios where ML methods react differently from humans.
The main reason for this gap is the absence of background
knowledge in the learned model. ML methods only
account for patterns present in the training data, whereas
humans have implicit knowledge that helps them to
interpret a critical situation more robustly. In the context of
autonomous driving, and in general too, it is not possible to
train a model for every possible scenario that could happen
on the road. To provide a safer environment for pedestrians
and autonomous vehicles, it is important to incorporate
knowledge into the module that is responsible for taking
important decisions.
2.2 Situation Interpretation
Authors: Daniel Bogdoll, Abhishek Vivekanandan, Faraz Qureishi,
Gerhard Schunk
2.2.1 Situation Interpretation in the AD Stack
Situation interpretation is typically a follow-up module of the
perception stage as shown in Section 2.1. Accordingly, this
module is aware of objects, their states, and classifications
within the surrounding environment. Its main objective is
to interpret the situation, which includes questions such as
“What is an object doing next?”, “Is there an implicit meaning
of an object’s action?” or “Is a rule exception applicable right now?”
Automated driving relies on accurate perception of the
environment. We follow the concept of Gerwien et al. [244],
who describe situation interpretation as a module which provides
a “situation-aware environment model” that expands
an environment model, which is typically the result of
the perception stage, by situation recognition and situation
prediction. They classify these three modules as Situation
Awareness (SA) levels 1-3. The output of the perception
layer can be represented in various forms, for instance
as object lists or probabilistic maps. Independent of
the structure, the output is critical for the functioning of
subsequent Autonomous Driving (AD) layers, which are
tasked with situation interpretation, path planning as
shown in Section 2.3, and vehicle control.
Nevertheless, sometimes raw data in addition to the
outputs of the perception layer is relevant to detect intentions
or meanings which are typically not addressed by perception
systems. Two examples are direction of view [287] and hand
gestures [781].
Situation interpretation works in tandem with perception,
planning and control. A typical example of situation
interpretation involves cut-in scenarios during automated driving
using adaptive cruise control [559]. In a cut-in scenario, the
situation interpretation system shall be able to detect whether a
collision is imminent (using perception and planning output)
and employ mitigation measures (braking in this case) in due
time, ensuring the safety of the ego vehicle and its occupants.
In the aforementioned example, collision detection
and avoidance can be designed using vehicle motion
models and traffic rules. In complex situations, however, the
task of situation interpretation may not be accomplished by
only using a predefined set of rules, especially in urban
scenarios, where the number of interactions between the ego
vehicle and the objects in the scene is significantly higher.
Additionally, there might be situations where a particular
rule needs to be violated in order to ensure the safety of humans.
To be consistent with the previously defined SA levels, level
2 takes in raw data and adds semantic meaning to it in the
form of semantic data models. Many works, especially [244],
have defined the operational context with regard to adding
more semantic structure to identify situations of interest.
In line with SA level 3 as defined by [244], motion prediction
forms the abstract layer for situation understanding, which
comprises different actors in the ego space. It plays a
crucial role in safety-critical applications of
the autonomous driving stack by providing the service of
estimating the future positions of an object. For instance,
when driving in a highway scenario in which a lead
vehicle suddenly merges or cuts in to the ego lane, the
primary goal of this layer is to mitigate the collision by
anticipating the intention of the lead vehicle(s). The crash
avoidance maneuver should have safety properties such that
the maneuver itself does not cause an additional collision;
e.g., while hard braking could prevent the crash, it could lead
to a rear-end collision with other vehicles. This requires not
only a prediction module but also a system that checks
the validity of the planned decision based on dynamic safety
reasoning methodologies that account for factors influencing the
Time-To-Collision (TTC), such as weather constraints.
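For readers unfamiliar with the quantity, a minimal constant-velocity sketch of TTC is given below. It assumes both vehicles keep their current speeds and travel in the same lane; the function names and the weather-dependent thresholds in `TTC_THRESHOLD_S` are hypothetical values for illustration and do not come from the cited works.

```python
import math


def time_to_collision(gap_m, ego_speed_ms, lead_speed_ms):
    """Constant-velocity TTC in seconds; infinite if the gap is not closing."""
    closing_speed = ego_speed_ms - lead_speed_ms
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


# Hypothetical reaction thresholds in seconds. A wet road gets a larger
# margin because braking takes longer; the numbers are illustrative only.
TTC_THRESHOLD_S = {"dry": 2.0, "wet": 3.5}


def is_critical(ttc_s, road_condition):
    """Flag a situation whose TTC falls below the condition's threshold."""
    return ttc_s < TTC_THRESHOLD_S[road_condition]


# A lead vehicle cuts in 15 m ahead while driving 5 m/s slower than the ego.
ttc = time_to_collision(gap_m=15.0, ego_speed_ms=30.0, lead_speed_ms=25.0)
critical_dry = is_critical(ttc, "dry")
critical_wet = is_critical(ttc, "wet")
```

With these assumed thresholds the same cut-in is uncritical on a dry road but critical on a wet one, which is the kind of weather-dependent validity check the paragraph above alludes to.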
Most of the existing behavior prediction approaches
perform simultaneous tracking and forecasting with
Kalman filters or rule-based approaches,
as seen in previous work [418]. Although
variants of Kalman filters are good for short-term predictions,
their performance degrades for long-term motion problems
as they fail to make use of situational or environmental
knowledge [132], which could be obtained via vectored maps.
As a result, prediction modules should make use of domain
knowledge to produce reliable forecasts [76].
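The short-horizon versus long-horizon behaviour can be illustrated with the prediction step of a textbook one-dimensional constant-velocity Kalman filter. This is a generic sketch, not the method of [418] or [132]; the function name `predict`, the diagonal process noise, and all numeric values are assumptions.

```python
def predict(state, cov, dt, process_noise=0.1):
    """One Kalman prediction step for a constant-velocity model.

    state: (position, velocity); cov: 2x2 covariance as nested tuples.
    Uses F = [[1, dt], [0, 1]] and P' = F P F^T + Q, where Q is assumed
    to be process_noise times the identity matrix.
    """
    pos, vel = state
    new_state = (pos + vel * dt, vel)
    (p00, p01), (p10, p11) = cov
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + process_noise
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + process_noise
    return new_state, ((n00, n01), (n10, n11))


# Forecast five seconds ahead without any new measurements.
state = (0.0, 10.0)             # start at 0 m, moving at 10 m/s
cov = ((1.0, 0.0), (0.0, 1.0))  # initial uncertainty (assumed)
position_variances = []
for _ in range(5):
    state, cov = predict(state, cov, dt=1.0)
    position_variances.append(cov[0][0])
```

The predicted position simply extrapolates the current velocity, and the position variance grows with every step. Without map or situational knowledge, nothing in the model can bend the forecast along a curve or a lane change, which is why such filters suit short horizons only.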
In a typical AD stack, motion prediction is a separate
module which predicts based on the outputs of the
preceding perception layer. For example, the object detector
outputs the bounding box coordinates of an object along with
the probability score of the class it belongs to, such as truck,
car, or construction cone. When this is used as an input to
motion prediction, uncertainty is not propagated
because of the softmax outputs [224]. To alleviate these
shortcomings, end-to-end networks, which take raw inputs
such as fused lidar point clouds and camera images to produce
motion predictions directly [797, 174], should be considered.
Additionally, knowledge about one’s own path planning can
be integrated into the prediction component [29].
2.2.4 Necessity of Knowledge Integration
Vehicles equipped with a level 4 or 5 driving automation
system are expected to master a wide variety of situations
within their Operational Design Domain (ODD) [628]. Since
many situations do not occur frequently in real life, ML-based
systems struggle to extrapolate beyond their training
domain. Therefore, hybrid approaches that integrate rule-
and knowledge-based algorithms and insights into ML
systems have the potential to combine the best of two worlds:
great general performance and improved handling of rare
situations, such as corner cases.
2.3 Planning
Authors: Etienne Bührle, Hendrik Königshof, Abhishek Vivekanandan,
Moritz Nekolla
2.3.1 Motion Planning in the AD Stack
The planning module uses the outputs of the perception and
prediction modules to plan a trajectory for the vehicle, which
is subsequently handed down to the vehicle controls to be
executed. This plan considers high-level routing decisions,
and follows the rules of the road as well as basic principles
of safe and comfortable driving.
A wide range of methods has been developed to tackle
the trajectory tracking control problem, and we refer to [535]
for an overview. However, the motion planning problem,
especially in highly complex and dynamic environments like
road traffic, remains largely unsolved and constitutes an area
of ongoing research.
2.3.2 Task Formulation
Formally, the solution to the trajectory planning problem is a
function that assigns every point in time a position in con-
figuration space (typically, planar coordinates and heading).
Classical approaches include variational methods (which
represent the path as a function of continuously adjustable
parameters), graph-search methods (which discretize the
configuration space), and incremental search methods (which
improve upon graph-search methods by using iterative
refinement procedures). An excellent overview is given in
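To make the graph-search family of classical planners concrete, the sketch below discretizes a planar workspace into a small occupancy grid and finds a shortest obstacle-free path with breadth-first search. It is a deliberately simplified illustration: heading and time are ignored, all moves cost the same, and the function name `plan_path` and the grid layout are assumptions rather than anything taken from the cited overview.

```python
from collections import deque


def plan_path(grid, start, goal):
    """Breadth-first search on a 4-connected occupancy grid.

    grid[r][c] == 0 marks a free cell, 1 an obstacle. Returns the list of
    cells from start to goal (inclusive), or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


# A wall with a single gap separates the start row from the goal row.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = plan_path(grid, start=(0, 0), goal=(2, 0))
```

A real planner would search over a richer configuration space (position and heading, often with time) and use edge costs and heuristics, for example A*, but the discretize-then-search structure is the same.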
The mentioned approaches are usually modular and
interpretable. However, as hand-engineered solutions to dif-
ficult problems, they tend to be brittle and require extensive
manual fine-tuning. Additionally, isolated changes to parts
of the system might reduce or break the overall system
performance, requiring careful re-tuning [840].
These drawbacks motivate the use of deep-learning-based
approaches, which have proven more robust to
variations and can be trained in an end-to-end fashion.
The current applications of deep learning to autonomous
driving can roughly be classified into two groups: full
end-to-end approaches that map raw sensory input directly
to vehicle commands (steering, acceleration), and methods
that produce or work on intermediate representations. An
overview can be found in [709].
2.3.3 Goals and Requirements
The motion planning system is in charge of ensuring be-
havioral safety of the self-driving vehicle [515, 516]. This
includes taking the correct behavior and driving decisions,
based on the knowledge of traffic rules and the behavior
of other traffic participants, as well as the ability to safely
navigate expected and unexpected scenarios.
The U.S. Department of Transportation (DOT) has recommended
that Level 3, Level 4, and Level 5 self-driving vehicles
should be able to demonstrate at least 28 core competencies
adapted from research by California Partners for Advanced
Transportation Technology (PATH) at the Institute of
Transportation Studies at University of California, Berkeley.
These basic behavioral competencies include, amongst others,
keeping the vehicle in lane, obeying traffic laws, following
road etiquette, responding to other vehicles, and responding
to hazards [516].
While the majority of these behavioral competencies cover
normal driving, i.e., regularly encountered situations, a self-
driving vehicle is also responsible for Object and Event
Detection and Response (OEDR), which includes detecting
unusual circumstances (emergency vehicles, work zones, ...)
as well as planning an appropriate reaction, which typically
takes place in the behavior and planning components. Above
all, the planning system is responsible for crash avoidance,
and should be able to handle control loss, crossing-path
crashes, lane changes/merges, head-on/opposite-direction
travel, rear-end, road departure, and low speed situations
(backing, parking). At any time, the system should be able to
execute a fallback action that brings the vehicle to a minimal
risk condition. According to [515], "a minimal risk condition
will vary according to the type and extent of a given failure,
but may include automatically bringing the vehicle to a safe
stop, preferably outside of an active lane of traffic."
Finally, the motion planner not only interacts with other
traffic participants, but also to a great extent with its
passengers. In particular, it must be able to communicate
proper function, malfunction, as well as an eventual takeover
request to a human driver, who must be able to take over in
2.3.4 Necessity of Knowledge Integration
Level 5 self-driving vehicles are expected to function in a
wide variety of operational design domains (we refer to
[81] for a taxonomy). While the basic principles of safe and
comfortable driving remain unchanged, the concrete imple-
mentations at the level of traffic laws, customary behavior,
and scene structure might be subject to change. We argue that
the inclusion of knowledge into a motion planning system
will make it easier to handle these situations by increasing
traceability (e.g., in the case of crash reconstructions) and
reliability. Furthermore, a transparent decision process based
on a common understanding between humans and machines
will increase interpretability and trust. Finally, we expect
the emergence of alternatives to extensive simulation testing,
which is at the core of present validation concepts [25, 450,
Emphasizing the advantages of knowledge integration,
[116] demonstrates many of the aspects mentioned above.
Fan Chen et al. integrate rules, in the form of social norms,
by extending the agent's reward function, e.g., passing objects
with a minimum distance. Violating these rules results in a
reward penalty. According to their results, agents with such
restrictions exhibit more human-like behavior.
Therefore, when knowledge is integrated into the machine
learning pipeline, models become more interpretable and
trustworthy, not solely for experts but also for ordinary people,
since these constraints occur in everyday life. Furthermore,
their extension of the agent's knowledge reduces the learning
effort, which accelerates training and enables the agents to
outperform their benchmark algorithm in most cases. Despite
those promising benefits, integrating knowledge typically
narrows down the broad variety of possible solutions while
consuming human effort for hand engineering. This
restricts the original, holistic approach of machine learning.
Therefore, the trade-off between knowledge integration and
self-learning needs to be chosen carefully [116].
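To make the reward-shaping idea concrete, the following minimal Python sketch subtracts a penalty from a task reward whenever an object is passed closer than a minimum distance. The function name, the threshold and the penalty weight are illustrative assumptions and are not taken from [116].

```python
def shaped_reward(task_reward, distances_to_objects, min_distance=1.5, penalty=0.25):
    """Return the task reward minus a penalty per violated distance rule.

    All names and default values are hypothetical, chosen for illustration only.
    """
    # Count every object that is passed closer than the minimum distance.
    violations = sum(1 for d in distances_to_objects if min_distance > d)
    return task_reward - penalty * violations
```

With these assumed values, an agent that keeps its distance retains the full task reward, while each violation lowers the reward by a fixed amount; the size of the penalty controls the trade-off between rule conformity and self-learning.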
The symbolic and the sub-symbolic methods represent two
ends of the spectrum. The former is driven more by
knowledge and the latter by data. A plethora of
ongoing research in the literature aims to develop
systems which exploit the strengths of both.
However, a core challenge remains: how to represent
knowledge used in symbolic space so that it can be integrated into, or augment,
the data-driven sub-symbolic/statistical world. An
overview of formalisms and languages for representing
symbolic knowledge, which exists in the form of facts,
rules and structured information, is given in Section 3.1.
Furthermore, Section 3.2 presents a survey on knowledge
embedding, which focuses on transforming prior
knowledge from the symbolic space to a real-vector space,
i.e., embeddings. These embeddings can be leveraged to
improve the sub-symbolic methods (Neural Networks (NNs),
Deep Learning (DL)) for effective training, inference and
improved reasoning. In addition, methods and approaches
dealing with the injection of hard and soft rules together with
embeddings are discussed in Section 3.2.3. Each of the
sections in this chapter dealing with different mechanisms for
representing knowledge concludes with an outlook that is
more tailored to the field of autonomous driving. Mapping
perceived information to semantic concepts and reasoning
using symbolic models provides an improved understanding of
the driving situation. Furthermore, formalized traffic rules and
legal concepts are used to derive possible driving actions
conditioned on their legal consequences (Section 3.1.3).
3.1 Symbolic Representations and Knowledge Crafting
Authors: Denny Mattern, Tobias Gleißner
In contrast to numerical representations, e.g., vector embed-
dings, symbolic representations use symbols to represent
things (cars, motorcycles, traffic signs), people (pedestrians,
driver, police), abstract concepts (overtake, brake, slow down)
or non-physical things (website, blog, god) as well as their
relations. Symbolic knowledge representations comprise all
kinds of logical formalisms, as well as structural knowledge
representing entities with their attributes, class hierarchies
and relations.
3.1.1 Logic Formalism
Logic formalisms are used to express knowledge (mostly
facts and rules) as formal logical terms. Logic formalisms
or logic systems differ in expressivity, complexity and
decidability. The choice of the right formalism depends on
the concrete problem to model. The simplest (and
decidable) logic formalism is propositional logic. It consists
of a set of symbols representing the individual propositions
and a set of junctors that define the relation between
propositions or modify the value of a proposition. The value
of a proposition can either be true or false.
$P$: The car drives carefully.
$Q$: The car is in good condition.
$R$: The car does not cause an accident.
Logical connectors (e.g., $\neg$, $\wedge$, $\vee$, $\rightarrow$, $\leftrightarrow$) are used to build
compound propositions that again can either be true or false.
$P \wedge Q \rightarrow R$: The car does not cause an accident if it
drives carefully and is in good condition.
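Since every proposition is either true or false, a compound proposition can be checked by enumerating all truth assignments, which is also why propositional logic is decidable. The following sketch does this for the car example, with P and Q standing for the two premises and R for the conclusion; the helper names are illustrative assumptions.

```python
from itertools import product


def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b


def compound(p, q, r):
    """(P and Q) -> R for the car example."""
    return implies(p and q, r)


# Enumerate all eight truth assignments of (P, Q, R).
truth_table = {
    (p, q, r): compound(p, q, r)
    for p, q, r in product([True, False], repeat=3)
}

# Collect the assignments under which the compound proposition is false.
falsifying = [assignment for assignment, value in truth_table.items() if not value]
```

The compound proposition is falsified only by the assignment in which the car drives carefully and is in good condition but nevertheless causes an accident.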
In order to make logic statements that apply to many
objects, predicate logic (also known as first-order logic (FOL))
extends propositional logic with truth-valued functions, predicates,
constants, variables and quantifiers ($\forall$, $\exists$). Predicate
logic is more expressive than propositional logic but not
always decidable, meaning that the truth value of a statement
cannot be inferred in every case.
$\mathit{Destroyable}(X)$: $X$ is destroyable.
$\mathit{Car}(X)$: $X$ is a car.
We can deduce from the set of axioms that
$\mathit{Car}(\text{Model T}) \rightarrow \mathit{Destroyable}(\text{Model T})$: If Model T
is a car (which is true), then it is destroyable. Note that the
truth value of the proposition made by a predicate depends
on the actual variable.
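The deduction in this example can be reproduced with a few lines of forward chaining. The sketch below assumes, for illustration, a single universally quantified axiom stating that every car is destroyable, together with the fact that Model T is a car; the data layout and function name are hypothetical.

```python
# Ground facts as (predicate, constant) pairs.
facts = {("Car", "Model T")}

# Assumed axiom: for all X, Car(X) -> Destroyable(X), stored as (premise, conclusion).
rules = [("Car", "Destroyable")]


def forward_chain(facts, rules):
    """Apply unary implication rules until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, constant in list(derived):
                new_fact = (conclusion, constant)
                if predicate == premise and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived
```

Calling `forward_chain(facts, rules)` adds the fact that Model T is destroyable, while nothing is derived for constants that are not known to be cars.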
Although predicate logic is more expressive than propositional
logic, both share the property of being truth-valued,
i.e., both are binary and both deal with statements of
ontological nature (statements of being). In order to model
legal norms, as we are aiming for, we need logic formalisms
that are concerned with the concepts of obligation and
permission and are no longer binary.
According to scholars of legal theory, norms can be
expressed with rules of the structure IF $A_1, \ldots, A_n \Rightarrow C$, where $A_1, \ldots, A_n$
are the pre-conditions of a norm, $C$
is the normative effect and $\Rightarrow$ is the normative
conditional [26]. In contrast to implications in propositional
logic ($\rightarrow$), the normative conditional is generally defeasible,
meaning that even if all pre-conditions hold, the normative
effect is only concluded presumably but not necessarily,
which allows for reasoning in the face of contradictions.
Defeasibility is the property that a conclusion is open in
principle to revision in case more evidence to the contrary is
provided [26].
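Defeasibility can be illustrated with a strongly simplified rule evaluator in which a rule fires presumptively when all its pre-conditions hold and no defeating evidence is present. The traffic rules, fact names and data layout below are hypothetical and only serve as an example; full defeasible logics are considerably richer.

```python
def conclude(facts, rules):
    """Return the conclusions of all applicable rules that are not defeated.

    Each rule is a triple (preconditions, conclusion, defeaters).
    """
    conclusions = set()
    for preconditions, conclusion, defeaters in rules:
        applicable = preconditions.issubset(facts)
        defeated = bool(defeaters.intersection(facts))
        if applicable and not defeated:
            conclusions.add(conclusion)
    return conclusions


# Hypothetical norms: stop at a red light unless an officer waves the vehicle through.
rules = [
    ({"red_light"}, "stop", {"officer_waves_through"}),
    ({"officer_waves_through"}, "proceed", set()),
]
```

With only `red_light` known, the evaluator concludes `stop`; once the additional evidence `officer_waves_through` is provided, this conclusion is revised to `proceed`.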
Logic formalisms concerned with statements of obligation,
permission, prohibition, right, etc. are a type of modal
logic known as deontic logic. Propositions are augmented
with deontic modal operators
$[\mathrm{OBL}]$, $[\mathrm{PERM}]$, $[\mathrm{FOR}]$
meaning obligation, permission and prohibition/forbidden.
The operators qualify the content of the augmented proposition.
E.g., $[\mathrm{OBL}]p$ means that the content of $p$ is considered
obligatory. The deontic operators relate to one another as follows
(see [26] for more details):
$[\mathrm{OBL}]p \rightarrow \neg[\mathrm{PERM}]\neg p$: if $p$ is obligatory, then its
opposite, $\neg p$, is not permitted.
$[\mathrm{FOR}]p \rightarrow [\mathrm{OBL}]\neg p$: if $p$ is forbidden, then its
opposite is obligatory.
$[\mathrm{PERM}]p \wedge \neg[\mathrm{OBL}]p \wedge \neg[\mathrm{FOR}]p$: $p$ is neither
obligatory nor forbidden.
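The interplay of the deontic operators can be sketched in code by storing only the obligations and deriving prohibition and permission from them. The sketch assumes the standard deontic logic reading in which a proposition is permitted when its opposite is not obligatory; the class, the string encoding of negation and the example norms are illustrative assumptions, and the consistency of the obligation set is not checked.

```python
def negate(p):
    """Return the syntactic negation of a proposition given as a string."""
    return p[1:] if p.startswith("~") else "~" + p


class DeonticState:
    """Derives [FOR] and [PERM] from a set of obligatory propositions."""

    def __init__(self, obligations):
        self.obligations = set(obligations)

    def obligatory(self, p):
        return p in self.obligations

    def forbidden(self, p):
        # [FOR]p holds when the opposite of p is obligatory.
        return self.obligatory(negate(p))

    def permitted(self, p):
        # [PERM]p holds when the opposite of p is not obligatory.
        return not self.obligatory(negate(p))


# Hypothetical norms: stopping at red is obligatory, not speeding is obligatory.
state = DeonticState({"stop_at_red", "~speeding"})
```

In this state, not stopping at red is not permitted, speeding is forbidden, and an unregulated action such as overtaking is permitted while being neither obligatory nor forbidden.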
Computer-interpretable formalization of legal norms is a
topic of active research in the field of legal informatics. There
are multiple logic formalisms for formalizing legal rules and
norms, e.g., Standard Deontic Logic (SDL) ([143, 478]), Reified
Input-Output Logic ([609, 610]) or (non-deontic) Temporal
Logic ([471, 193]). However, there is still no consensus on
the "best" logic formalism. In order to keep the formalized
legal rules agnostic to possible (deontic) logic systems, an
intermediate formal representation of the legal norms can
be used. LegalRuleML ([537, 27, 26]) aims to provide such
an interchange format for legal rules, supporting deontic
operators and defeasibility among other features for formalizing
legal norms. Just as open as the question of the "best"
logic formalism for norm representation is the question of a
good interface for legal experts who want to represent legal
norms in a computer-understandable form. A recent work proposes a
dedicated editor allowing for intuitive formalization of legal
texts and featuring consistency checks as well [433]. Another
approach proposes an agile and repetitive process [42].
3.1.2 Relational Knowledge
Knowledge concerning entities, concepts, their hierarchies
and properties as well as their relations to one another is naturally
represented by graph structures. Prominent examples of
graph-structured representations of structural knowledge are
Taxonomies, Ontologies and Knowledge Graphs.
Taxonomies categorize entities into a hierarchy of classes
and sub-classes represented as a directed acyclic graph
with nodes representing the entities, classes and sub-classes,
and edges representing the relations. Taxonomies categorize
objects regarding one specific aspect and commonly use only
one type of relation, the "is-a" relation. E.g., a car is a vehicle,
which is a machine.
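A taxonomy of this kind can be stored as a mapping from each class to its direct super-classes, and all implied "is-a" relations follow by traversing the graph. The following sketch uses the car example; the additional class and the function name are illustrative assumptions.

```python
# Direct "is-a" edges of a small, hypothetical taxonomy.
is_a = {
    "car": ["vehicle"],
    "motorcycle": ["vehicle"],
    "vehicle": ["machine"],
}


def ancestors(cls, is_a):
    """Return all super-classes reachable from cls via "is-a" edges."""
    result = set()
    stack = list(is_a.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in result:
            result.add(parent)
            stack.extend(is_a.get(parent, []))
    return result
```

For instance, `ancestors("car", is_a)` yields both `vehicle` and `machine`, reflecting the transitivity of the "is-a" relation.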
An Ontology is a formal, explicit specification of a shared
conceptualization [699]. This means an Ontology is an abstract
model of explicitly defined, relevant concepts of the specific
domain of discourse and their relations which is constructed
in a computer understandable manner. The definitions of the
meaning of the relevant concepts and relations reflect the
common sense of domain experts. The given definition of
what an Ontology actually is implies that the development
of a specific Ontology is a process which involves different
persons (e.g., the knowledge engineer, the domain experts,
maybe also the users) and that it takes a certain communication
effort to develop a shared understanding of the concepts,
the formalizations of those concepts as well as the usability of
the Ontology for the user. Hence, Ontology building is ideally
an iterative and repetitive design process for which multiple
process patterns have been developed [525, 156, 237].
Concrete Ontologies consist of classes and sub-classes
which refer to domain concepts as well as the properties and
relations between those, which is referred to as terminological
knowledge. In addition to the class definitions, relations
and constraints for concrete instances of classes are also defined
in an Ontology and referred to as assertional knowledge.
These definitions and constraints are expressed in description
logic, which is a decidable fragment of predicate logic, where
the terms TBox and ABox are often used instead of
terminological knowledge and assertional knowledge. The logic is
commonly represented in the Web Ontology Language (OWL),
which is a computational language based on description logic
that allows for formalizing complex knowledge such that it
can be exploited by computer programs [534]. An Ontology
can be interpreted as a meta-schema for domain-specific data,
that not only specifies the relational structure and semantics
of the data but also allows, e.g., to verify the consistency of
that knowledge or to infer implicit knowledge through its
strong logical foundation. Ontologies have been developed
for a wide range of domains and applications.
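The distinction between terminological and assertional knowledge, as well as the two uses mentioned above (inferring implicit knowledge and verifying consistency), can be illustrated with a toy example. The sketch below is not an OWL reasoner; it only supports sub-class and disjointness axioms, and all class and individual names are hypothetical.

```python
# TBox (terminological knowledge): sub-class axioms and disjoint classes.
tbox_subclass = {"Car": "Vehicle", "Pedestrian": "Person"}
tbox_disjoint = [("Vehicle", "Person")]

# ABox (assertional knowledge): asserted classes of concrete individuals.
abox = {"ego": {"Car"}, "p1": {"Pedestrian"}}


def inferred_types(individual, abox, subclass):
    """Return asserted plus implicitly inferred classes of an individual."""
    types = set(abox[individual])
    frontier = list(types)
    while frontier:
        parent = subclass.get(frontier.pop())
        if parent is not None and parent not in types:
            types.add(parent)
            frontier.append(parent)
    return types


def is_consistent(abox, subclass, disjoint):
    """Check that no individual belongs to two disjoint classes."""
    for individual in abox:
        types = inferred_types(individual, abox, subclass)
        if any(a in types and b in types for a, b in disjoint):
            return False
    return True
```

Here, the individual `ego` is implicitly inferred to be a `Vehicle`, and an ABox asserting that one individual is both a `Car` and a `Pedestrian` would be detected as inconsistent.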
In the literature, Knowledge Graphs and Ontologies were often
used as synonyms until [187] proposed the following
definition: "A knowledge graph acquires and integrates
information into an Ontology and applies a reasoner to
derive new knowledge." In a Knowledge Graph, data from
heterogeneous data sources is integrated, linked, enriched
with contextual information and meta-data (e.g., information
about provenance or versioning) and semantically described
with an Ontology. Through their linked structure, Knowledge
Graphs are prominently used in semantic search applications
and recommender systems but also allow for logical reasoning
when featuring a formal meta-schema in the form of an
Ontology. Surveys on Knowledge Graphs and their general
applications are provided by [349, 884] and Knowledge Graphs
for recommender systems specifically by [267].
3.1.3 Applications
Symbolic representations improve scene understanding by
mapping detected objects to a formal semantic representation
of the current traffic scene (e.g., as a scene graph ([6, 102])).
To integrate knowledge into machine learning algorithms,
a representation of this knowledge is essential. While this
knowledge is in form of embeddings, a symbolic represen-
tation allows traceability and makes it understandable for
Given a sound formalization of traffic rules and a semantic
representation of the entities, actions and legal concepts
in traffic scenes (analogous to the legal ontology modeling
the concepts of privacy proposed by [538]), we can derive
the current legal state of an AD vehicle. An example where
knowledge graphs are used as embeddings is [528]. In this
case, a knowledge graph is built upon a road scene ontology
to recognize similar situations that are visually different.
Using this technique to integrate legal knowledge and derive
the legal state of different situations is a possible approach.
Analogous to the application of symbolic representations
for situation understanding, we make use of formal representations
of traffic rules and legal concepts as well as symbolic
scene descriptions for planning tasks by ranking possible
alternative trajectories and actions, e.g., according to their
legal consequences.
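The ranking of alternative trajectories by their legal consequences can be sketched as follows, assuming that a symbolic reasoner has already annotated each candidate with the rules it would violate. The violation names, severity values and data layout are purely hypothetical.

```python
# Hypothetical severity scores per violated rule (higher means more severe).
SEVERITY = {"speeding": 1, "crossing_solid_line": 2, "running_red_light": 10}


def legal_cost(candidate):
    """Sum the severities of all rule violations of one candidate."""
    return sum(SEVERITY[violation] for violation in candidate["violations"])


def rank_by_legal_consequences(candidates):
    """Return the candidates sorted from least to most severe violations."""
    return sorted(candidates, key=legal_cost)


candidates = [
    {"name": "run_red", "violations": ["running_red_light"]},
    {"name": "swerve", "violations": ["crossing_solid_line", "speeding"]},
    {"name": "brake", "violations": []},
]
```

Under these assumed severities, braking is ranked first, followed by swerving and running the red light. In practice, such a legal ranking would be one criterion among others, e.g., safety and comfort.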
3.2 Knowledge Representation Learning
Author: Stefan Zwicklbauer
Complementary strengths and weaknesses of data-driven
and knowledge-driven systems have led to a plethora
of research works that focus on combining both symbolic
(e.g., Knowledge Graphs (KGs)) and statistical (e.g., NNs)
methods [148]. One promising approach is the conversion of
symbolic knowledge into embeddings, i.e., dense, real-valued vector
representations of prior knowledge, that can be naturally processed
by NNs. Typical examples of symbolic knowledge are
textual descriptions, graph-based definitions or propositional
logical rules. The research area of Knowledge Representation
Learning (KRL) aims to represent prior knowledge, e.g.,
entities, relations or rules, as embeddings that can be used
to improve or solve inference or reasoning tasks ([439],
[424]). Most existing literature narrows down the problem by
treating KRL as converting prior knowledge from KGs only
[439]. Thus, our focus in this survey also lies on knowledge
modeled in graph-based structures.
3.2.1 Textual Embeddings
With the development and advances in DL, Natural Language
Representation Learning has become a hot topic
over the last couple of years. Natural Language Models,
such as those proposed in [164], [563], [78], are capable of directly
converting natural language text, e.g., common sense text like
Wikipedia articles or textual rules like road traffic regulations,
into embeddings that implicitly represent the syntactic or
semantic features of the language [576]. Those embeddings
are mostly used for specific downstream tasks like Question
Answering (QA) [875], Neural Machine Translation (NMT)
[817] or Common Sense Reasoning [697], but probably lack
expressive power when it comes to representing
specific rules and logic. As a consequence, most research
works extract entities, relations and rules from sentences
first and model them in a more expressive representation
format, e.g., KGs, afterwards. In the following, we do not
further elaborate literature regarding Natural Language
Representation Learning but refer to the respective surveys
([576], [131]) and assume that knowledge has already been
converted to an expressive format like KGs or another logical
3.2.2 Knowledge Graph Embeddings
Many research works have described how to create dense-vector
representations for either homogeneous (i.e., graphs with a
single type of edge) or heterogeneous (i.e., graphs with
multiple types of edges) graphs [145]. Graphs with auxiliary
information ([519], [270]) and graphs constructed from non-relational
data [845] are out of scope in this survey. For
homogeneous graphs, the authors of [562] made significant
progress in KRL. They created a node corpus by randomly
walking over the graph and applied Word2Vec [485] to generate
node embeddings. The authors of [885] further improved
this approach and used it for heterogeneous graphs. Tang et
al. [713] and especially Grover et al. [262] proposed state-of-the-art
works which intelligently explore the specific and
varying neighborhoods of nodes and consider the respective
node order to create their embeddings. Most research works,
however, focus on heterogeneous graphs since they are best
suited for rule and relation modeling. We first focus on pure
node (entity) and edge (relation) representation learning, also
called Triplet Fact-based Representation Learning Models.
Hereby, we further distinguish between Translation-Based
Models, Tensor Factorization-Based Models and NN-Based Models.
Starting with Translation-Based Models, the first influential
work proposed TransE [73], a framework to create
embeddings for heterogeneous graphs. Given a triple
$(h, r, t)$, with $h$ and $t$ denoting the head and tail entity and $r$
the respective relation, the idea is to embed each component
into a low-dimensional space $\mathbb{R}^d$ in a way that
$h + r$ translates to $t$, i.e., $h + r \approx t$. The authenticity of the
respective triplet is defined via a specific scoring function,
which is the distance under either the $\ell_1$ or $\ell_2$ norm:
$$f_r(h, t) = \| h + r - t \|_p \qquad (1)$$
with $p = 1$ or $p = 2$. This objective function is minimized
with a margin-based hinge ranking loss function over the
training process. Since TransE comes with several limitations,
such as not being able to model one-to-many, many-to-one
and many-to-many relations, various authors addressed
these shortcomings by using TransE as the foundation for their
works. For instance, the authors of TransH [773] introduced
relation-related projection vectors where the entities are
projected onto relation-related hyperplanes. TransH enables
different embeddings based on the underlying relation. All
entities and relations are still represented in the same feature
space. In TransR [440], the entities $h, t$ are projected from
their initial entity vector space into the relation space of the
connecting relation $r$. This allows us to render entities that
are similar to the head or tail entity in the entity space as
distinct in the relation space. Further improvements can be
found in the TransD [348] model, which has fewer parameters
and replaces matrix-vector multiplication by vector-vector
multiplication for an entity-relation pair, which is more
scalable and can be applied to large-scale graphs. Another
problem of existing approaches is the non-consideration of
crossover interactions, i.e., bi-directional effects between entities
and relations including interactions from relations to entities
and interactions from entities to relations [858]. To provide an
example, predicting a specific relation between two entities
typically relies on the entities’ relevant topic in the form of their
connecting entities/relations. Not all connected entities and
relations belong to the topic of the relation to be found.
This is modeled in CrossE [858], which simulates crossover
interactions between entities and relations by learning an
interaction matrix to generate multiple specific interaction
embeddings. Another state-of-the-art approach, Hake [862], is
capable of modeling a) entities at different levels of the
semantic hierarchy, and b) entities on the same level of
the semantic hierarchy. This is achieved by mapping the
entities into the polar coordinate system. Entities on different
hierarchy levels are modeled with the modulus part,
whereas the phase part aims to model the entities at the
same level of the semantic hierarchy.
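The translation-based scoring in Eq. (1), the margin-based ranking loss and the hyperplane projection idea of TransH can be sketched in plain Python. The function names, the margin value and the use of an unsquared norm for the TransH score are illustrative assumptions; embeddings are represented as simple lists, and the hyperplane normal is assumed to have unit length.

```python
def lp_norm(vector, p=1):
    """Return the l1 (p=1) or l2 (p=2) norm of a list of numbers."""
    if p == 1:
        return sum(abs(v) for v in vector)
    if p == 2:
        return sum(v * v for v in vector) ** 0.5
    raise ValueError("p must be 1 or 2")


def transe_score(h, r, t, p=1):
    """Distance ||h + r - t||_p as in Eq. (1); lower means more plausible."""
    return lp_norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)], p)


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that wants a true triple to score `margin` below a corrupted one."""
    return max(0.0, margin + pos_score - neg_score)


def project_onto_hyperplane(e, w):
    """Project e onto the hyperplane with unit normal vector w (TransH idea)."""
    dot = sum(ei * wi for ei, wi in zip(e, w))
    return [ei - dot * wi for ei, wi in zip(e, w)]


def transh_score(h, d_r, t, w_r, p=2):
    """Translation distance between the projections of h and t."""
    h_proj = project_onto_hyperplane(h, w_r)
    t_proj = project_onto_hyperplane(t, w_r)
    return transe_score(h_proj, d_r, t_proj, p)
```

The projection illustrates why TransH copes better with one-to-many relations: two tail entities that differ only along the normal vector of the relation hyperplane receive the same score, whereas plain TransE cannot place both at $h + r$.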
Regarding Tensor Factorization-Based Models, RESCAL
[517] represents the foundational work for most follow-up
works. RESCAL uses a tensor representation to model the
structure of KGs. More specifically, a rank-$d$ factorization is
used to obtain the latent semantics:
$X_k \approx A R_k A^\top$, for $k = 1, 2, \ldots, m$,
with $A$ being a matrix that captures the
latent semantic representation of entities and $R_k$
being a matrix that models the pairwise interactions in the
$k$-th relation. Based on this principle, the scoring function
is defined as
$f_r(h, t) = h^\top M_r t$, where $h, t$ denote the
entity embeddings and the matrix $M_r$ represents the
pairwise interactions in the respective relation ([517], [145]). The work
DistMult [813] improves RESCAL in terms of algorithmic
complexity and embedding accuracy by restricting $M_r$ to
be diagonal matrices. To overcome the problem of DistMult
that head and tail entities are treated symmetrically for each relation,
the works ComplEx [728] and QuatRE [513] satisfy
the key desiderata of relational representation learning, i.e.,
modeling symmetry, anti-symmetry and inversion. Both
approaches leverage complex-valued embeddings to support
asymmetric relations. More recently proposed state-of-the-art
models use special tensor factorization methods. For instance,
SimplE [372] leverages an adapted and simpler version of
Canonical Polyadic Decomposition to allow head and tail
entities to have embeddings that are dependent on each
other, which would be impossible with the original model.
Similarly, TuckER [36] is based on the Tucker decomposition
of a binary entity-relation-entity tensor.
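The difference between the bilinear RESCAL score and its diagonal DistMult restriction can be made tangible with a small example. The sketch below uses plain lists for embeddings and relation matrices; the function names and numbers are illustrative assumptions.

```python
def bilinear_score(h, M, t):
    """RESCAL-style score h^T M t with a full relation matrix M."""
    Mt = [sum(M[i][j] * t[j] for j in range(len(t))) for i in range(len(h))]
    return sum(h[i] * Mt[i] for i in range(len(h)))


def distmult_score(h, r_diag, t):
    """DistMult-style score with a diagonal relation matrix given as a vector."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r_diag, t))


h = [1, 2]
t = [3, 4]
M = [[0, 1], [0, 0]]  # full, non-symmetric relation matrix
r_diag = [2, 3]       # diagonal entries of a DistMult relation matrix
```

Swapping head and tail changes the bilinear score for a non-symmetric matrix, whereas the DistMult score stays the same. This is the symmetry limitation that complex-valued embeddings are meant to overcome.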
Due to their success in the last decade, NN-Based Models
also became a hot topic for KRL. The first shallow
approaches comprise standard feed-forward networks [73]
(with linear layers) and neural tensor networks [692] (with
bi-linear tensor layers). Over time, deeper variants such as
NAM [446] have been established to provide more flexibility
when it comes to training a network towards the underlying
training goal. More recently, graph neural networks
[872] were introduced, which strive to explicitly model the
peculiarities of (knowledge) graphs. In particular, graph
convolutional networks for multi-relational graphs [384]
generalize convolutional neural networks to non-Euclidean
data and gather information from the entity’s neighborhood,
where all neighbors contribute equally to the information
passing. Graph convolutional networks are mostly built on
top of the message passing neural networks framework [248]
for node aggregation. Many works are limited to creating
embeddings for knowledge entities only ([637], [664]), but
recent approaches tried to overcome this limitation ([163],
[738], [826], [739]). A neighborhood attention operation in
graph attention networks [742] can enhance the representation
power of graph neural networks [808]. Similar to
natural language models, these approaches apply a multi-head
self-attention mechanism [740] to focus on specific
neighbor interactions when aggregating messages ([808], [4],
[467]). Many authors incorporated mechanisms to improve
the overall quality of entity and relation embeddings. For
instance, the idea of negative sampling is to intelligently
sample specific wrong samples that are needed for margin-based
loss functions. Recent methods employed Generative
Adversarial Networks (GANs) [253] in which the generator
is trained to generate negative samples ([761], [86]). Another
work, ATransN, suggested improving existing embeddings
by leveraging GANs to correctly align the embeddings with
those from teacher KGs [756].
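The neighborhood aggregation in which all neighbors contribute equally can be sketched as a single message passing step with mean aggregation. The sketch deliberately omits learnable weight matrices, relation types and non-linearities, includes the node itself in the average, and uses hypothetical node names and features.

```python
def mean_aggregate(features, neighbors):
    """One message passing step: average each node with its neighbors."""
    updated = {}
    for node, own_feature in features.items():
        # Every neighbor (and the node itself) contributes with equal weight.
        messages = [features[n] for n in neighbors.get(node, [])] + [own_feature]
        updated[node] = [sum(values) / len(messages) for values in zip(*messages)]
    return updated


features = {"a": [3.0, 0.0], "b": [0.0, 3.0], "c": [3.0, 3.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
```

Attention-based variants replace the uniform average by learned, neighbor-specific weights, so that some neighbor interactions influence the aggregated message more strongly than others.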
In this section, we mostly concentrated on methods that
generate their embeddings exclusively from relational data.
However, some approaches consider additional information,
such as textual (entity) descriptions (e.g., [221], [774], [273]),
path-based information (e.g., [521], [266]) and even hierarchies
(e.g., [862], [863]) as available in ontologies.
3.2.3 Knowledge Graph Embeddings with Rule Injection
So far, we have discussed approaches to embed knowledge
that is formalized within KGs. These methods create
representations that purely reflect the items’ graph-based
modeling (e.g., triples). In addition to this, specific rules
(soft or hard rules) can be derived from KGs, which is
also known as rule learning (e.g., [312], [859], [529]), or
be leveraged in the embedding learning process, which is also
known as rule injection. In the following, we focus on the
latter, i.e., how to additionally integrate pre-defined or
mined rules into embeddings. The authors of RUGE [269]
presented a novel paradigm to leverage Horn soft rules mined
from the underlying KG in addition to the existing triples.
Their iterative training procedure improves the transfer of
the knowledge contained in logic rules into the learned
embeddings. The framework SLRE [268] also presents an
option to leverage Horn-based soft rules with confidence
scores to improve the accuracy of downstream tasks. These
rules are directly integrated as regularization terms in the
training mechanism for relation embeddings. The authors
of [521] additionally enriched the Horn-based rules with
path information to improve the state of the art. A related
work [762] first mines inference, transitivity and anti-symmetry
rules from the given KG and converts them into first-order
logic rules in a second step. Finally, the proposed
rule-enhanced embedding method can be integrated in any
translation-based KG embedding model.
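The general idea of injecting soft rules as regularization terms can be sketched as follows. This is a generic illustration and not the specific formulation of RUGE or SLRE: it assumes that the embedding model assigns each grounded triple a truth value in [0, 1], and it penalizes rule groundings whose body is rated more plausible than their head, weighted by the rule confidence. All names and the weighting factor are hypothetical.

```python
def rule_penalty(body_truth, head_truth, confidence):
    """Penalty for a grounded soft rule body -> head with a confidence score."""
    # The rule is violated to the degree that the body is more true than the head.
    return confidence * max(0.0, body_truth - head_truth)


def total_loss(triple_loss, grounded_rules, weight=0.5):
    """Triple-based loss plus a weighted sum of soft rule penalties."""
    regularizer = sum(
        rule_penalty(body, head, confidence)
        for body, head, confidence in grounded_rules
    )
    return triple_loss + weight * regularizer
```

Rules with high confidence thus pull the embeddings more strongly towards satisfying the rule, while groundings that are already satisfied contribute nothing to the loss.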
Apart from rules directly mined from the underlying
knowledge graph, other approaches exist that try to apply
more extrinsic rules. For instance, the authors of [171] try to
improve the embeddings’ capability of modeling rules by
using non-negativity and approximate entailment constraints
to learn compact entity representations. The former naturally
induces sparsity and embedding interpretability, and the
latter can encode regularities of logical entailment between
relations in their distributed representations. Other works
propose to encode knowledge items into geometric regions.
For instance, [274] encodes relations into convex regions,
which is a natural way to take into account prior knowledge
about dependencies between different relations. Query2box
[594] encodes entities (and queries) into hyper-rectangles,
also called box embeddings, to overcome the problem of
point queries, i.e., a complex query represents a potentially
large set of answer entities, but it is unclear how such a set
can be represented as a single point. Box embeddings have
also been used to model the hierarchical nature of ontology
concepts with uncertainty [429].
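The geometric intuition behind box embeddings can be illustrated with an axis-aligned box given by a center and a non-negative offset per dimension. In the following simplified sketch, a query is a box and an entity is a point; an entity counts as an answer if it lies inside the box, and otherwise its distance to the box is measured. The function names are illustrative assumptions, and the distance is a simplification of the scoring used in [594].

```python
def in_box(point, center, offset):
    """Check whether a point lies inside the axis-aligned box."""
    return all(o >= abs(x - c) for x, c, o in zip(point, center, offset))


def outside_distance(point, center, offset):
    """L1 distance from a point to the box; zero for points inside the box."""
    return sum(
        max(x - (c + o), 0.0) + max((c - o) - x, 0.0)
        for x, c, o in zip(point, center, offset)
    )
```

A single box can therefore contain an arbitrarily large set of answer entities, which is not possible when a query is represented as a single point.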
Most approaches described above rely on common-sense
knowledge bases like DBpedia [28] or Freebase [70] and
leverage their developed embedding approaches for knowledge
base link prediction or inference/reasoning tasks.
However, we believe that existing models and algorithms
can be similarly applied to special domain knowledge bases,
e.g., knowledge bases with data for AD [787].
3.2.4 Applications
The application of KGs in the AD domain has not received
much attention so far, although it can
be an effective way to help situation or scene understanding
[732]. For instance, the authors of [283] built a specific
ontology to represent all core concepts that are essential
to model the driving concept. The resulting KG, CoSi, models
information about driver, vehicle, road infrastructure, driving
situation and interacting traffic participants [283]. To classify
the underlying traffic situation with a NN, a relational
graph convolutional network [637] is used to convert the
KG into embeddings first. Similarly, the work
by Buechel et al. [82] presented a framework for driving
scene representation and incorporated traffic regulations.
Wickramarachchi et al. [787] focused on KG embedding
and investigated the quality of the trained embeddings given
various degrees of scene detail in the KG. Moreover, the
authors evaluated the created embeddings on two relevant
use cases, namely Scene Distinction and Scene Similarity.
A plethora of methods and approaches have been proposed
in the literature that focus on augmenting data-driven models
and algorithms with additional prior knowledge. Among
the most prominent approaches is the modification of the
training objective via customized cost functions, especially
knowledge-based constraints and penalties. An overview
of auxiliary losses and constraints that take into account physical
and domain knowledge in various forms is presented
in Section 4.1. Often these approaches are accompanied
by problem-specific designs of the architecture, leading to
hybrid models that leverage symbolic knowledge in the form
of logical expressions or knowledge graphs. The merging
of symbolic and sub-symbolic methods, also referred to as
neural-symbolic integration, is the focus of Section 4.2.
Besides external input, recent methods preferably rely on
internal representations in order to focus attention on distinct
features and concepts within a network itself. Key weighting
and guidance approaches are discussed in Section 4.3. Last
but not least, data augmentation techniques form the backbone
to integrate additional domain knowledge into the data and
thus indirectly into the model. Approaches starting from
data transformations to augmentations in feature space up
to simulations are discussed in Section 4.4.
In addition to these prevalent general approaches, this
chapter concludes with methods and paradigms that are
more tailored to the field of autonomous driving, considering
multiple agents that interact with specific environments
typical for the application under investigation. Especially
inferring and predicting the state of an agent plays an
essential role in the state space models considered in Section 4.5
and in reinforcement learning in Section 4.6. The involvement of
positional as well as semantic information is an essential part of
the information fusion approach outlined in Section 4.7.
4.1 Auxiliary Losses and Constraints
Authors: Tino Werner, Maximilian Alexander Pintz, Laura von
Rueden, Vera Stehr
The usual Empirical Risk Minimization (ERM) principle in
machine learning amounts to replacing the minimization of
an intractable risk, i.e., an expected loss over a ground-truth
data distribution, by the minimization of the empirical risk.
A mismatch between the expected loss and its empirical
approximation causes ERM to result in models that do not
generalize well to unseen data. This manifests either in
overfitting, where the model represents the training data
too closely and fails to capture the overall data distribu-
tion, or underfitting, where the model fails to capture the
underlying structure of the data. Regularization schemes
have been proposed to mitigate the problem of overfitting.
The Structural Risk Minimization (SRM) principle [275, 736]
extends the ERM principle with regularization. SRM seeks to
find models with the best tradeoff between the empirical
risk and model complexity as measured by the Vapnik-
Chervonenkis dimension or Rademacher complexity. In
practice, this amounts to minimizing an empirical risk with
an added regularization term. This technique has been applied
successfully to variable selection, as in the path-breaking
work of [721], which introduced the Lasso. Regularization
in general has proved indispensable in high-dimensional
regression [186, 882, 89, 835, 685], classification [544, 735],
clustering [791], ranking [410] and sparse covariance or
precision matrix estimation [38, 220, 87].
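For concreteness, the objectives described above can be written out as
follows. This is a sketch in generic notation that is assumed here and
not taken from the survey: $\ell$ denotes a loss function, $P$ the
ground-truth data distribution, $\mathcal{F}$ the model class,
$\Omega$ a complexity penalty and $\lambda \ge 0$ its weight. The
$1/(2n)$ scaling of the Lasso objective is one common convention among
several.

```latex
% Risk (intractable) and its empirical approximation on n samples
R(f) = \mathbb{E}_{(x,y)\sim P}\bigl[\ell(f(x), y)\bigr],
\qquad
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)

% ERM: minimize the empirical risk over the model class
\hat{f}_{\mathrm{ERM}} \in \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)

% Regularized variant used in practice: empirical risk plus penalty
\hat{f}_{\lambda} \in \arg\min_{f \in \mathcal{F}}
    \hat{R}_n(f) + \lambda\,\Omega(f)

% Lasso as a special case: squared loss, linear model, L1 penalty
\hat{\beta}_{\lambda} \in \arg\min_{\beta \in \mathbb{R}^p}
    \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
    + \lambda\,\lVert \beta \rVert_1
```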
As for knowledge infusion into neural networks, a natural strategy
is to similarly use regularization terms (so-called auxiliary
losses) that correspond to formalized knowledge. However,
constraints may also take the form of hard constraints (for
example, when a logic rule must not be violated, so that
integrating it in a soft manner via auxiliary losses would
not be appropriate), of dependencies, or of regularization
priors. This section is structured as follows: After describing
techniques that integrate physical knowledge or other
domain knowledge via auxiliary losses, we review ideas
to incorporate constraints into the neural network training and the
architecture, followed by works that propose uncertainty
quantification for knowledge-infused networks. At the end,
we review applications of knowledge-infused networks for
perception and planning in the automotive context.
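The difference between a soft auxiliary loss and a hard constraint
can be illustrated with a minimal sketch. It is a toy example and not
a method from the surveyed works: a linear model is fitted by gradient
descent, and the assumed piece of domain knowledge is that no feature
has a negative effect (all weights are non-negative). The function
name fit, the parameters lam and project, and the synthetic data are
hypothetical choices made for this illustration.

```python
import numpy as np


def fit(X, y, lam=0.0, project=False, lr=0.05, steps=5000):
    """Least-squares fit by gradient descent with optional sign knowledge.

    lam > 0      : soft constraint, adds the auxiliary loss
                   lam * sum(min(w, 0)^2) to the mean squared error.
    project=True : hard constraint, projects w onto {w >= 0} after
                   every gradient step (projected gradient descent).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Gradient of the data term (mean squared error).
        grad = 2.0 / n * X.T @ (X @ w - y)
        # Gradient of the auxiliary loss; zero wherever w >= 0.
        grad += 2.0 * lam * np.minimum(w, 0.0)
        w = w - lr * grad
        if project:
            # The constraint can never be violated after this step.
            w = np.maximum(w, 0.0)
    return w


# Synthetic, deterministic data whose unconstrained fit violates the
# assumed knowledge: the second weight comes out negative.
t = np.linspace(0.0, 3.0, 20)
X = np.column_stack([t / 3.0, np.cos(t)])
y = X @ np.array([2.0, -0.5])

w_free = fit(X, y)                 # purely data-driven
w_soft = fit(X, y, lam=10.0)       # knowledge as auxiliary loss
w_hard = fit(X, y, project=True)   # knowledge as hard constraint

print("unconstrained:", w_free)
print("soft penalty :", w_soft)
print("hard project :", w_hard)
```

In this sketch the auxiliary loss only shrinks the violation, leaving
a small negative second weight, whereas the projection removes it
entirely. This mirrors the point made above that a soft penalty is not
appropriate when a rule must never be violated.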
Let us first point out the advantages of such techniques,
besides the stronger adaptation of the model to the knowl-
edge. For example, the authors in [370] highlighted that
the knowledge-based regularization term generally does not
require labeled inputs, which enables data augmentation
with unlabeled instances, saving a large amount of time and
money that would be required to generate a large labeled
data set. The approach in [230] does not even require any
labeled instance. Moreover, a common result is better gener-
alizability of the model, paralleling the improved generaliza-
tion ability of models trained with complexity regularization.
Each improvement in explainability and