LONGICONTROL: A REINFORCEMENT LEARNING
ENVIRONMENT FOR LONGITUDINAL VEHICLE CONTROL
A PREPRINT
Jan Dohmen*, Roman Liessner*, Christoph Friebel* and Bernard Bäker*
October 12, 2020
ABSTRACT
Reinforcement Learning (RL) is a promising approach for solving a variety of challenges in the field
of autonomous driving due to its ability to find long-term oriented solutions in complex decision
scenarios. For training and validation of an RL algorithm, a simulated environment is advantageous
due to risk reduction and the saving of resources. This contribution presents an RL environment designed
for the optimization of longitudinal vehicle control. In addition to details on the implementation, reference is also
made to areas where further research is required.
1 Introduction
A large proportion of road traffic accidents are due to human error [Gründl, 2005]. Autonomous vehicles and driver
assistance systems are therefore promising ways to increase road safety in the future [Bertoncello and Wee, 2015].
Moreover, global climate change and dwindling resources are increasingly raising society's awareness
of environmental policy issues. In addition to vehicle electrification, advancing automation in transport promises a
much more efficient use of energy. In particular, assistance systems that support the predictive longitudinal control of
a vehicle can lead to significant energy savings [Radke, 2013].
A commonly chosen approach for the energy-efficient optimization of longitudinal control is based on the use of
dynamic programming [Uebel et al., 2018]. Although this is in principle capable of finding the discrete global optimum,
it requires comprehensive problem modelling in advance, a deterministic environment and a discretization of the action
space. Especially when other road users are considered, such conventional approaches therefore reach their limits [Ye et al.,
2017]. Arbitrary traffic cannot be sufficiently modelled a priori, and thus no precise knowledge of the entire route can
be assumed. Furthermore, the computing power available in the vehicle is not sufficient to perform new optimizations
in response to the constantly changing environment. Online use in the vehicle is therefore unlikely.
The developments in the field of machine learning, especially deep reinforcement learning (DRL), are very promising.
The learning system recognizes the relations between its actions and the associated effect on the environment. This
enables the system to react immediately to environmental influences instead of just following a previously calculated
plan. After proving in recent years that it can solve challenging video games [Mnih et al., 2013] at a partly superhuman level,
DRL has lately been used increasingly for engineering and physical tasks [Hinssen and Abbeel, 2018]. Examples include
the cooling of data centers [Gao, 2014], robotics [Gu et al., 2016], the energy management of hybrid vehicles [Liessner
et al., 2018] and self-driving vehicles [Sallab et al., 2017, Kendall et al., 2018]. This motivates applying such an approach
to the problem of optimizing longitudinal control as well.
In this contribution we propose LongiControl [Dohmen et al., 2019], an RL environment adapted to the OpenAI
Gym standardization. Thereby, we aim to bridge real-world motivated RL and easy accessibility within a highly relevant
problem. The environment is designed in such a way that RL agents can be trained even on an ordinary notebook in a
relatively short period of time. At the same time, the longitudinal control problem poses several easily comprehensible
challenges, making it a suitable example for investigating advanced topics such as multi-objective RL (trade-off between
the conflicting goals of travel time minimization and energy consumption) or safe RL (violation of speed limits may lead to accidents).
Dresden Institute of Automobile Engineering, TU Dresden, George-Bähr-Straße 1c, 01069 Dresden, Germany
This paper is structured as follows. In section 2, overviews are given of the longitudinal control problem and of the
basic principles of RL. In section 3 we present the LongiControl environment, describing the route simulation, the
vehicle model and its interaction with an RL agent. Thereafter, in section 4, we show exemplary results for different
training phases and give a brief insight into the challenges of contrary reward formulations. This is followed by the
concluding discussion in section 5, providing a basis for future working directions.
2 Background
2.1 Longitudinal control
Energy-efficient driving
In general terms, energetically optimal driving corresponds to a global minimization of
the input energy $E$ in the interval $t_0 \le t \le T$ as a function of acceleration $a$, velocity $v$ and power $P$:

$$E = \int_{t_0}^{T} P(t, a(t), v(t)) \, \mathrm{d}t \qquad (1)$$
At the same time, according to external requirements, such as other road users or speed limits, the following boundary
conditions must be met:
$$v_{\mathrm{lim,min}}(x) \le v \le v_{\mathrm{lim,max}}(x)$$
$$a_{\mathrm{lim,min}}(v) \le a \le a_{\mathrm{lim,max}}(v)$$
$$\dot{a}_{\mathrm{lim,min}}(v) \le \dot{a} \le \dot{a}_{\mathrm{lim,max}}(v) \qquad (2)$$

where $v$ is the velocity, $a$ is the acceleration and $\dot{a}$ is the jerk, with $(\cdot)_{\mathrm{lim,min}}$ and $(\cdot)_{\mathrm{lim,max}}$
representing the lower and upper limits respectively.
Following Freuer [Freuer, 2015] the optimization can be divided roughly into four areas:
1. optimization of the vehicle properties,
2. optimization of traffic routing,
3. optimization on an organizational level,
4. optimization of vehicle control.
This paper deals with the last point. In various contributions [Barkenbus, 2010, Uebel et al., 2018] an adapted
vehicle control system is credited with enormous savings potential. In addition to the safety aspect, assistance systems
supporting vehicle control are becoming increasingly important for this reason as well. This trend is made possible by
comprehensive sensor technology and the supply of up-to-date route data. In terms of longitudinal control, energy-saving
driving modes can thus be encouraged:
driving in energy-efficient speed ranges,
keeping an appropriate distance to vehicles in front,
anticipatory deceleration and acceleration.
Simulation
Simulations are becoming more and more important in automotive engineering. According to Winner and Wachenfeld
[Winner and Wachenfeld, 2015], in the context of the automotive industry the overall system is composed of three parts:
the vehicle, the driving environment and the vehicle control. These three components interact through an exchange of
information and energy.
Within the simulation, a vehicle model is needed which indicates the energy consumption. In general, physical and
data-based approaches are suitable for this kind of modelling [Isermann, 2008].
External influences are represented by the driving environment. This includes, for example, information about other road
users and route data such as traffic light signals or speed limits. This information is then used by the vehicle control
as boundary conditions for the driving strategy.
While in reality the information content of the sensor systems in vehicles is increasing with advancing automation
[Winner et al., 2015], this information can easily be generated in the simulation. Regarding the modelling of the
driving environment, a distinction must be made between deterministic and stochastic approaches. In the deterministic
case it is assumed that the driving environment behaves identically in every run. Changes during the simulation are not
allowed. This means that reality can only be represented in a very simplified way. For example a sudden change of
a traffic light signal or an unforeseen braking of the vehicle in front is not represented by such a model. In contrast,
the stochastic approach offers the possibility to vary external influences during the simulation. Therefore, this type of
modeling is much closer to the real driving situation.
Optimization
The aim of the RL environment is to train an agent to drive an electric vehicle along a single-lane route as
energy-efficiently as possible. This corresponds to the minimization of equation 1 while considering the corresponding
boundary conditions in equation 2.
Examples of state-of-the-art approaches for the optimization of the longitudinal control problem are Dynamic
Programming [Bellman, 1954], Pontryagin's Maximum Principle [Pontryagin et al., 1962] or a combination of both [Uebel
et al., 2018]. As previously mentioned, these approaches have two basic limitations: they are based on deterministic
models and suffer from the curse of dimensionality [Bellman, 1961].
According to [Sutton and Barto, 2018] and [Bertsekas and Tsitsiklis, 1999], RL approaches are a solution to this dilemma.
The main difference between Dynamic Programming and RL is that the former assumes complete knowledge of the model,
whereas RL approaches only require the possibility of interaction with the environment model; solutions are learned without
knowledge of its inner structure. In modern deep RL (DRL), the use of neural networks for function
approximation also allows handling continuous state spaces and reacting to previously unknown states.
2.2 Reinforcement Learning
A standard reinforcement learning framework is considered, consisting of an agent that interacts with an environment
(see Fig. 1). The agent perceives its state $s_t \in \mathcal{S}$ in the environment in each time step $t = 0, 1, 2, \ldots$ and consequently
chooses an action $a_t \in \mathcal{A}$. With this, the agent in turn directly influences the environment, resulting in an updated state
$s_{t+1}$ for the next time step. The selected action is evaluated using a numerical reward $r_{t+1}(s, a)$. The sets $\mathcal{S}$ and $\mathcal{A}$
contain all possible states and actions that can occur in the description of the problem to be learned.
The policy $\pi(a|s)$ specifies for each time step which action is to be executed depending on the state. The aim is to
select actions in such a way that the cumulative reward is maximized.
Policy gradient methods are probably the most popular class of RL algorithms for continuous problems. Currently
very relevant examples of such methods are Proximal Policy Optimization (PPO) [Schulman et al., 2017], Deep
Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2015] and Soft Actor-Critic (SAC) [Haarnoja et al., 2018].
Figure 1: Agent-environment interaction (the agent receives state and reward from the environment and selects an action).
3 RL Environment
3.1 OpenAI Gym
OpenAI Gym [Brockman et al., 2016] is a widely used open-source framework with a large number of well-designed
environments for comparing RL algorithms. It does not rely on a specific agent structure or deep learning framework. To
provide an easy starting point for RL and the longitudinal control problem, the implementation of the LongiControl
environment follows the OpenAI Gym standardization.
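To illustrate the resulting interface, the following minimal sketch shows a random-action rollout against a Gym-style environment. The environment id passed to gym.make is a hypothetical placeholder for illustration; the actual registration name should be taken from the repository.

# Minimal sketch of the Gym-style interaction loop; "longicontrol-v0" is an
# assumed placeholder id, not necessarily the name registered by the package.
import gym

env = gym.make("longicontrol-v0")  # hypothetical environment id

state = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()            # random pedal action in [-1, 1]
    state, reward, done, info = env.step(action)  # standard Gym step signature
    episode_return += reward
print("episode return:", episode_return)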
3.2 Route simulation
Fig. 2 shows an example of the simplified track implementation within the simulation.
Figure 2: An example of the track visualization (speed limit signs of 50, 70 and 90 km/h along the route).
Equation of motion
The vehicle motion is modelled in a simplified manner as uniformly accelerated. The simulation is based on a
time discretization of $\Delta t = 0.1\,\mathrm{s}$. The current velocity $v_t$ and position $x_t$ are calculated as follows:

$$v_t = a_t \, \Delta t + v_{t-1}$$
$$x_t = \tfrac{1}{2} a_t \, (\Delta t)^2 + v_{t-1} \, \Delta t + x_{t-1}$$

The acceleration $a_t$ must be specified through the agent's action in each time step $t$. Since only the longitudinal control
is considered, the track can be modelled as single-laned. Therefore, one-dimensional velocities $v_t$ and positions $x_t$ are
sufficient at this point.
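As a minimal sketch (assuming SI units and the time step stated above), this update can be written as:

# Discrete-time longitudinal motion update, assuming SI units and dt = 0.1 s.
DT = 0.1  # simulation time step in seconds


def motion_step(x_prev, v_prev, a, dt=DT):
    """Return updated position and velocity for one uniformly accelerated step."""
    v = a * dt + v_prev
    x = 0.5 * a * dt ** 2 + v_prev * dt + x_prev
    return x, v


# Example: one step of accelerating at 1 m/s^2 from standstill
x, v = motion_step(x_prev=0.0, v_prev=0.0, a=1.0)
print(x, v)  # 0.005 m, 0.1 m/s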
Stochastic route modelling
The route simulation is modelled in such a way that the track length may be arbitrarily
long and arbitrarily positioned speed limits specify an arbitrary permissible velocity. Here, it is argued that this can
be considered equivalent to stochastically modelled traffic.
Under the requirement that a certain safety distance to the vehicle in front must be maintained, other road users are
simply treated as further speed limits which depend directly on the distance and the difference in speed. For each time
step, the relevant speed limit is then equal to the minimum of the distance-related and the traffic-related limit, as sketched below.
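As a small illustrative sketch of this reduction (the functional form of the traffic-related limit below is an assumption for illustration, not the environment's exact formula):

# Combine the posted limit with a traffic-related limit derived from the lead
# vehicle. The linear relaxation with growing gap is an illustrative assumption.
def effective_speed_limit(posted_limit, gap_m, lead_speed, min_gap_m=10.0):
    """Return the relevant limit as the minimum of the posted and the traffic-related limit."""
    if gap_m <= min_gap_m:
        traffic_limit = 0.0  # safety distance violated: do not close in further
    else:
        # allow a larger speed surplus over the lead vehicle for larger gaps
        traffic_limit = lead_speed + 0.1 * (gap_m - min_gap_m)
    return min(posted_limit, traffic_limit)


# Example: posted 27.8 m/s (100 km/h), 40 m gap to a vehicle driving 22 m/s
print(effective_speed_limit(posted_limit=27.8, gap_m=40.0, lead_speed=22.0))  # 25.0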
As a restriction, speed limits are generated with a minimum possible spacing of $100\,\mathrm{m}$. The permissible velocities are
sampled from $\{20, 30, 40, 50, 60, 70, 80, 90, 100\}\,\mathrm{km/h}$, while the difference between contiguous limits may not be greater
than $40\,\mathrm{km/h}$. It should therefore hold that $x_{\mathrm{lim},j+1} - x_{\mathrm{lim},j} \ge 100\,\mathrm{m}$ and $|v_{\mathrm{lim},j+1} - v_{\mathrm{lim},j}| \le 40\,\mathrm{km/h}$. The former
is a good compromise between inducing as many speed changes per trajectory as possible and still being able to identify
anticipatory driving. The latter is introduced as a further simplification to speed up the learning process, since very
large speed changes may be very hard for the agent to handle.
Up to 150 m in advance, the agent receives information about the upcoming two speed limits.
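A minimal sketch of one way such a route could be sampled under the stated constraints (100 m minimum spacing, limits drawn from the given set, at most 40 km/h between contiguous limits); the additional spacing spread is an assumption, and the environment's exact sampling scheme may differ.

import random

SPEED_LIMITS_KMH = [20, 30, 40, 50, 60, 70, 80, 90, 100]
MIN_SPACING_M = 100.0
MAX_JUMP_KMH = 40


def sample_route(track_length_m, seed=None):
    """Sample (position, limit) pairs respecting the spacing and jump constraints."""
    rng = random.Random(seed)
    position = 0.0
    current = rng.choice(SPEED_LIMITS_KMH)
    signs = [(position, current)]
    while position < track_length_m:
        # next sign at least 100 m further on; the extra spread is an assumption
        position += MIN_SPACING_M + rng.uniform(0.0, 200.0)
        candidates = [v for v in SPEED_LIMITS_KMH
                      if v != current and abs(v - current) <= MAX_JUMP_KMH]
        current = rng.choice(candidates)
        signs.append((position, current))
    return signs


print(sample_route(1000.0, seed=0))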
3.3 Vehicle model
The vehicle model, derived from vehicle measurement data (see Figure 3), consists of several subcomponents. These
have the function of receiving the action of the agent, assigning it a physical acceleration value and outputting the
corresponding energy consumption.
Assigning the action to an acceleration
The action of the agent is interpreted in this environment as the actuation of
the vehicle pedals. In this sense, a positive action actuates the accelerator pedal; a negative action analogously actuates
the brake pedal. Due to the limited vehicle motorization, the acceleration resulting from the pedal actuation depends on
the current vehicle speed (road slopes are neglected).
If neither pedal is actuated (corresponding to action = 0), the vehicle decelerates according to the simulated driving
resistance. This means that to maintain a positive speed a positive action must be selected.
It becomes clear from these explanations that three speed-dependent acceleration values determine the physical range of
the agent: the maximum acceleration, the minimum acceleration and the acceleration value for action = 0.
Determination of the acceleration values
The speed-dependent maximum and minimum acceleration can be determined from the measurement data and the technical data of the vehicle. In the RL environment, the maximum
and minimum values for each speed are stored as characteristic curves. The resulting acceleration at action = 0
is calculated physically: using the driving resistance equation and the vehicle parameters, an acceleration value is
calculated for each speed. This is stored in the environment as a speed-dependent characteristic curve, analogous to the
other two acceleration values.

Figure 3: Assigning the action to an acceleration.

Once the action, the current vehicle speed and the three acceleration values are available, the resulting acceleration can
be calculated as follows:

$$a_t = \begin{cases} (a_{\max} - a_0) \cdot \mathrm{action} + a_0 & \text{if } \mathrm{action} > 0 \\ a_0 & \text{if } \mathrm{action} = 0 \\ (a_0 - a_{\min}) \cdot \mathrm{action} + a_0 & \text{if } \mathrm{action} < 0 \end{cases}$$
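A minimal sketch of this mapping, assuming the three speed-dependent values have already been looked up from the characteristic curves (the numbers in the example are illustrative, not taken from the measurement data):

def action_to_acceleration(action, a_min, a_0, a_max):
    """Map a pedal action in [-1, 1] to a physical acceleration.

    a_min, a_0 and a_max are the speed-dependent minimum, coasting
    (action = 0) and maximum accelerations from the stored curves.
    """
    if action > 0.0:
        return (a_max - a_0) * action + a_0
    if action < 0.0:
        # action = -1 yields a_min; action close to 0 approaches a_0
        return (a_0 - a_min) * action + a_0
    return a_0


# Illustrative values at one fixed speed
print(action_to_acceleration(0.5, a_min=-3.0, a_0=-0.4, a_max=2.0))   # 0.8
print(action_to_acceleration(-1.0, a_min=-3.0, a_0=-0.4, a_max=2.0))  # -3.0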
Calculation of energy consumption
Knowing the vehicle speed and acceleration, the energy consumption can be
estimated from these two values. For this purpose, measured values of an electric vehicle [Argonne National Laboratory,
2013] were learned with a neural network, and this network was stored in the environment.
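The paper does not specify the network architecture; as an illustration only, the following sketch fits a small regression model mapping (velocity, acceleration) to electrical power, with made-up placeholder samples standing in for the dynamometer measurements.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder samples (velocity [m/s], acceleration [m/s^2]) -> power [kW];
# purely illustrative values, not the Argonne measurement data.
X = np.array([[10.0, 0.0], [10.0, 1.0], [20.0, 0.0], [20.0, 1.0], [30.0, 0.5]])
y = np.array([3.0, 18.0, 8.0, 35.0, 30.0])

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
model.fit(X, y)

# Energy for one 0.1 s step at 15 m/s and 0.5 m/s^2 (kW * s = kJ)
power_kw = model.predict([[15.0, 0.5]])[0]
energy_kj = power_kw * 0.1
print(energy_kj)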
3.4 Agent environment interaction
In accordance with the basic principle of RL an agent interacts with its environment through its actions and receives an
updated state and reward.
Action
The agent selects an action in the value range [-1, 1]. The agent can thus choose between the speed-dependent
maximum and minimum acceleration of the vehicle. This type of modelling ensures that the agent can only select
valid actions.
State
The features of the state must provide the agent with all the necessary information to enable a goal-oriented
learning process. The individual features and their meaning are listed in Table 1.
When training neural networks, the learning process often benefits from input variables whose dimensions
do not differ greatly from one another. According to Ioffe and Szegedy [Ioffe and Szegedy, 2015], the gradient descent algorithm
converges faster if the individual features have the same order of magnitude. Since, according to Table 1, different
physical quantities with different value ranges enter the state, a normalization step seems reasonable at this
point. For this purpose, all features are min-max scaled so that they always lie in the fixed interval [0, 1].
Reward
In the following, the reward function, which combines several objectives, is presented. The explanations
indicate the complexity of the multi-objective setting. The LongiControl environment thus provides a good basis for
investigating these issues and for developing automated solutions to address them.
A reward function defines the feedback the agent receives for each action and is the only way to control the agent's
behavior. It is one of the most important and challenging components of an RL environment.
Table 1: Meaning of state features.
Feature  Meaning
v(t)  Vehicle's current velocity
a_prev(t)  Vehicle acceleration of the last time step, so that the agent can develop an intuition for the jerk
v_lim(t)  Current speed limit
v_lim,fut(t)  The next two speed limit changes, as long as they are within a range of 150 m
d_vlim,fut(t)  Distances to the next two speed limit changes, as long as they are within a range of 150 m
If only the energy consumption were rewarded (negatively), the vehicle would simply stand still. The agent would learn that,
from the point of view of energy consumption, it is most efficient simply not to drive. Although this is true, we still want the agent to
drive in our environment. So we need a reward that makes driving more appealing to the agent. By comparing different
approaches, the difference between the current speed and the current speed limit has proven to be particularly suitable.
By minimizing this difference, the agent automatically sets itself in motion. In order to still take energy consumption into
account, a reward component for the energy consumption is maintained. A third reward component is caused by the jerk,
because our autonomous vehicle should also be able to drive comfortably. To finally also penalize the violation of
the speed limits, a fourth reward part is added. Since RL is designed for a scalar reward, it is necessary to weight
these four parts.
A suitable weighting is not trivial and poses a great challenge.
For the combined reward we propose the following (see also Table 2):
$$r_t = -\,\xi_{\mathrm{forward}}\, r_{\mathrm{forward}}(t) - \xi_{\mathrm{energy}}\, r_{\mathrm{energy}}(t) - \xi_{\mathrm{jerk}}\, r_{\mathrm{jerk}}(t) - \xi_{\mathrm{safe}}\, r_{\mathrm{safe}}(t),$$

where

$$r_{\mathrm{forward}}(t) = \frac{|v(t) - v_{\mathrm{lim}}(t)|}{v_{\mathrm{lim}}(t)}, \quad
r_{\mathrm{energy}}(t) = \hat{E}, \quad
r_{\mathrm{jerk}}(t) = \frac{|a(t) - a_{\mathrm{prev}}(t)|}{\Delta t}, \quad
r_{\mathrm{safe}}(t) = \begin{cases} 0 & v(t) \le v_{\mathrm{lim}}(t) \\ 1 & v(t) > v_{\mathrm{lim}}(t). \end{cases}$$

The $\xi$ are the weighting parameters for the individual reward shares. The terms are used as penalties so that the learning
algorithm minimizes their amount. To make it easier to get started with the environment, we have preconfigured
a functioning weighting (see Table 3). In the next section we will show some examples of the effects of different
weightings.
Table 2: Meaning of reward terms.
Reward  Meaning
r_forward(t)  Penalty for slow driving
r_energy(t)  Penalty for energy consumption
r_jerk(t)  Penalty for jerk
r_safe(t)  Penalty for speeding
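A minimal sketch of the combined reward under the definitions above, with all four terms entering as penalties and the default weights of Table 3; the energy term is represented here by the energy consumed in the current step.

DT = 0.1  # simulation time step in seconds


def combined_reward(v, v_lim, a, a_prev, energy_step,
                    xi_forward=1.0, xi_energy=0.5, xi_jerk=1.0, xi_safe=1.0):
    """Weighted multi-objective reward; all four terms act as penalties."""
    r_forward = abs(v - v_lim) / v_lim       # relative deviation from the allowed speed
    r_energy = energy_step                   # energy consumed in this step
    r_jerk = abs(a - a_prev) / DT            # jerk approximation
    r_safe = 1.0 if v > v_lim else 0.0       # speed limit violation indicator
    return -(xi_forward * r_forward + xi_energy * r_energy
             + xi_jerk * r_jerk + xi_safe * r_safe)


# Example: slightly below the limit, smooth acceleration, small energy use
print(combined_reward(v=12.0, v_lim=13.9, a=0.3, a_prev=0.2, energy_step=0.05))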
4 Examples
In the following, various examples of the environment are presented. For training, the agent is confronted with new
routes in each run using the stochastic mode of the environment. For validation, the same deterministic route is always
used in order to compare like with like.
Table 3: Weighting parameters for the reward.
Parameter  Value
ξ_forward(t)  1.0
ξ_energy(t)  0.5
ξ_jerk(t)  1.0
ξ_safe(t)  1.0
4.1 Learning progress
In the following, different stages of an exemplary learning process are presented. An implementation of SAC [Haarnoja
et al., 2018] was chosen as the deep RL algorithm. The hyperparameters used are listed in Table 4. Animated
visualizations of the learning stages described below can be found on GitHub [Dohmen et al., 2019].
Table 4: SAC hyperparameters
Parameter  Value
optimizer  Adam [Kingma and Ba, 2014]
learning rate  0.001
discount γ  0.99
replay buffer size  1000000
number of hidden layers (all networks)  2
number of hidden units per layer  64
optimization batch size  256
target entropy  −dim(A)
activation function  ReLU
soft update factor τ  0.01
target update interval  1
gradient steps  1
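As one possible way to reproduce such a run, the sketch below uses the SAC implementation from stable-baselines3 configured with the hyperparameters of Table 4; this library choice and the environment id are assumptions for illustration and not necessarily the authors' own setup.

# Sketch of a SAC training run with the Table 4 hyperparameters, using
# stable-baselines3 as one possible off-the-shelf implementation (assumption).
import gym
from stable_baselines3 import SAC

env = gym.make("longicontrol-v0")  # hypothetical environment id

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    gamma=0.99,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.01,
    gradient_steps=1,
    policy_kwargs=dict(net_arch=[64, 64]),  # two hidden layers with 64 units each
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("sac_longicontrol")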
Beginning of the learning process
At the very beginning of the learning process the agent remains in place and does
not move at all. Then after a few more training epochs the agent starts to move but is not yet able to finish the track.
Figure 4a shows this stage in the deterministic validation run.
After some learning progress
After some progress the agent is able to complete the course (Figure 4b) but ignores
the speed limits and drives very jerkily. Obviously, this is not desirable. Therefore the training continues.
After a longer training procedure
By letting the agent train even longer, it learns to drive more comfortably and
finally starts to respect the speed limits by decelerating early enough. However, in general it still drives quite slowly in
relation to the maximum allowed speed (see Figure 4c).
After an even longer training period
Finally, after an even longer training period, the agent drives very smoothly and respects the speed
limits while minimizing the safety margin to the maximum allowed speed (see Figure 4d).
4.2 Multi-objective optimization
As mentioned before, this problem has several contrary objectives. Thus, multi-objective investigations can also be
carried out. For a better understanding we present three examples.
Reward Example 1
If only the movement reward – the deviation from the allowed speed – is applied (reward
weighting [ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 0]), the agent violates the speed limits because
being 5 km/h too fast is rewarded the same as being 5 km/h too slow (see Figure 5a).
(a) Beginning of the learning process
(b) After some learning progress
(c) After a longer training procedure
(d) After an even longer training period
Figure 4: Learning progress
(a) ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 0
(b) ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 1
(c) ξ_forward(t) = 1, ξ_energy(t) = 0.5, ξ_jerk(t) = 1, ξ_safe(t) = 1
Figure 5: Reward weighting
Reward Example 2
In the second example, the penalty for exceeding the speed limit is added (reward weighting
[ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 1]). This results in the agent actually complying with the
limits (see Figure 5b).
Reward Example 3
In the third example we add the energy and jerk rewards (reward weighting [ξ_forward(t) = 1,
ξ_energy(t) = 0.5, ξ_jerk(t) = 1, ξ_safe(t) = 1]). This results in the agent driving more energy-efficiently and also
choosing smoother accelerations (see Figure 5c).
These examples illustrate that the environment provides a basis to investigate multi-objective optimization algorithms.
For such investigations the weights of the individual rewards can be used as control variables and the travel time, energy
consumption and the number of speed limit violations can be used to evaluate the higher-level objectives.
5 Discussion and Conclusion
Through the proposed RL environment, which is adapted to the OpenAI Gym standardization, we show that it is easy to
prototype and implement state-of-the-art RL algorithms.
In addition, the LongiControl environment is suitable for various investigations. Besides the comparison of RL
algorithms and the evaluation of safety algorithms, investigations in the area of Multi-Objective Reinforcement Learning
are possible. Further possible research objectives are the comparison with planning algorithms for known routes, the
investigation of the influence of model uncertainties and the consideration of very long-term objectives such as arriving at a
specific time.
LongiControl is designed to enable the community to leverage the latest strategies of reinforcement learning to address
a real-world and high-impact problem in the field of autonomous driving.
References
Martin Gründl. Fehler und Fehlverhalten als Ursache von Verkehrsunfällen und Konsequenzen für das Unfallvermei-
dungspotenzial und die Gestaltung von Fahrerassistenzsystemen. PhD thesis, University Regensburg, 2005.
Michelle Bertoncello and Dominik Wee. McKinsey: Ten ways autonomous driving could redefine the automotive
world, 2015.
Tobias Radke. Energieoptimale Längsführung von Kraftfahrzeugen durch Einsatz vorausschauender Fahrstrategien.
PhD thesis, Karlsruhe Institute of Technology (KIT), 2013.
S. Uebel, N. Murgovski, C. Tempelhahn, and B. Bäker. Optimal energy management and velocity control of hy-
brid electric vehicles. IEEE Transactions on Vehicular Technology, 67(1):327–337, Jan 2018. ISSN 0018-9545.
doi:10.1109/TVT.2017.2727680.
Ziqi Ye, Thorsten Plum, Stefan Pischinger, Jakob Andert, Michael Franz Stapelbroek, and Jan-Simon Remco Pfluger.
Vehicle speed trajectory optimization under limits in time and spatial domains. In International ATZ Conference
Automated Driving, volume 3, Wiesbaden, 2017.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A.
Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
Peter Hinssen and Pieter Abbeel. Everything is going to be touched by AI, 2018.
Jim Gao. Machine learning applications for data center optimization, 2014.
Shixiang Gu, Timothy P. Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-
efficient policy gradient with an off-policy critic. CoRR, abs/1611.02247, 2016.
Roman Liessner, Christian Schroer, Ansgar Dietermann, and Bernard Bäker. Deep reinforcement learning for advanced
energy management of hybrid electric vehicles. In Proceedings of the 10th International Conference on Agents and
Artificial Intelligence (ICAART), volume 2, pages 61–72, 2018.
Ahmad Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for
autonomous driving. Electronic Imaging, 2017:70–76, 01 2017. doi:10.2352/ISSN.2470-1173.2017.19.AVM-023.
Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex
Bewley, and Amar Shah. Learning to drive in a day. CoRR, abs/1807.00412, 2018.
Jan Dohmen, Roman Liessner, Christoph Friebel, and Bernard Bäker. LongiControl environment for OpenAI Gym.
https://github.com/dynamik1703/gym_longicontrol, 2019.
Andreas Freuer. Ein Assistenzsystem für die energetisch optimierte Längsführung eines Elektrofahrzeugs. PhD thesis,
2015.
J. N. Barkenbus. Eco-driving: An overlooked climate change initiative. Energy Policy, 38, 2010.
Hermann Winner and Walther Wachenfeld. Auswirkungen des autonomen Fahrens auf das Fahrzeugkonzept, 2015.
Rolf Isermann. Mechatronische Systeme - Grundlagen. Springer-Verlag, Berlin Heidelberg, 2nd edition, 2008.
Hermann Winner, Stephan Hakuli, Felix Lotz, and Christina Singer, editors. Handbuch Fahrerassistenzsysteme.
ATZ/MTZ-Fachbuch. Springer Vieweg, Wiesbaden, 3rd edition, 2015. ISBN 978-3-658-05733-6. doi:10.1007/978-3-
658-05734-3.
Richard Bellman. The theory of dynamic programming. Bull. Amer. Math. Soc., 60(6):503–515, 1954.
L. S. Pontryagin, V. G. Boltyanshii, R. V. Gamkrelidze, and E. F. Mishenko. The Mathematical Theory of Optimal
Processes. John Wiley and Sons, New York, 1962.
Richard E. Bellman. Adaptive Control Processes: A Guided Tour. Princeton Legacy Library, 1961. ISBN
9781400874668.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA,
2nd edition, 2018.
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. 2nd edition, 1999. ISBN 1-886529-10-8.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms. CoRR, abs/1707.06347, 2017.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver,
and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry
Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. CoRR,
abs/1812.05905, 2018. visited: 07.07.2020.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
OpenAI Gym. CoRR, abs/1606.01540, 2016.
Argonne National Laboratory. Downloadable Dynamometer Database (D3), generated at the Advanced Mobility Technology
Laboratory (AMTL) under the funding and guidance of the U.S. Department of Energy (DOE), 2013. Visited: 07.07.2020.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal
covariate shift. CoRR, abs/1502.03167, 2015.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning
Representations, 12 2014.