LONGICONTROL: A REINFORCEMENT LEARNING
ENVIRONMENT FOR LONGITUDINAL VEHICLE CONTROL
A PREPRINT
Jan Dohmen*, Roman Liessner*, Christoph Friebel* and Bernard Bäker*
October 12, 2020
ABSTRACT
Reinforcement Learning (RL) is a promising approach for solving a variety of challenges in the field
of autonomous driving due to its ability to find long-term oriented solutions in complex decision
scenarios. For training and validation of an RL algorithm, a simulated environment is advantageous
due to risk reduction and the saving of resources. This contribution presents an RL environment designed
for the optimization of longitudinal vehicle control. In addition to details on the implementation, reference is also
made to areas where further research is required.
1 Introduction
A large proportion of road traffic accidents are due to human error [Gründl, 2005]. Autonomous vehicles and driver
assistance systems are therefore promising ways to increase road safety in the future [Bertoncello and Wee, 2015].
Moreover, global climate change and dwindling resources are increasingly raising society's awareness
of environmental policy issues. In addition to vehicle electrification, advancing automation in transport promises a
much more efficient use of energy. In particular, assistance systems that support the predictive longitudinal control of
a vehicle can lead to significant energy savings [Radke, 2013].
A commonly chosen approach for the energy-efficient optimization of longitudinal control is based on the use of
dynamic programming [Uebel et al., 2018]. Although this is in principle capable of finding the discrete global optimum,
it requires comprehensive problem modelling in advance, a deterministic environment and a discretization of the action
space. Especially when other road users are considered, such conventional approaches therefore reach their limits [Ye et al.,
2017]. Arbitrary traffic cannot be sufficiently modelled a priori, and thus no precise knowledge of the entire route can
be assumed. Furthermore, the computing power available in the vehicle is not sufficient to perform new optimizations
in response to the constantly changing environment. Online use in the vehicle is therefore unlikely.
The developments in the field of machine learning, especially deep reinforcement learning (DRL), are very promising.
The learning system recognizes the relations between its actions and the associated effect on the environment. This
enables the system to react immediately to environmental influences instead of just following a previously calculated
plan. After proving in recent years that it can solve challenging video games [Mnih et al., 2013] at a partly superhuman level,
DRL has lately been used increasingly for engineering and physical tasks [Hinssen and Abbeel, 2018]. Examples include
the cooling of data centers [Gao, 2014], robotics [Gu et al., 2016], the energy management of hybrid vehicles [Liessner
et al., 2018] and self-driving vehicles [Sallab et al., 2017, Kendall et al., 2018]. This motivates applying such an approach
to the problem of optimizing longitudinal control as well.
In this contribution we propose LongiControl [Dohmen et al., 2019], an RL environment adapted to the OpenAI
Gym standardization. Thereby, we aim to bridge real-world motivated RL and easy accessibility within a highly relevant
problem. The environment is designed in such a way that RL agents can be trained even on an ordinary notebook in a
relatively short period of time. At the same time, the longitudinal control problem poses several easily comprehensible
challenges, making it a suitable example for investigating advanced topics such as multi-objective RL (trade-off between
the conflicting goals of travel time minimization and energy consumption) or safe RL (violation of speed limits may lead to accidents).
Dresden Institute of Automobile Engineering, TU Dresden, George-Bähr-Straße 1c, 01069 Dresden, Germany
This paper is structured as follows. In section 2, overviews are given of the longitudinal control problem and of the
basic principles of RL. In section 3 we present the LongiControl environment, describing the route simulation, the
vehicle model and its interaction with an RL agent. Thereafter, in section 4, we show exemplary results for different
training phases and give a brief insight into the challenges of contrary reward formulations. This is followed by the
concluding discussion in section 5, providing a basis for future working directions.
2 Background
2.1 Longitudinal control
Energy-efficient driving
In general terms, energetically optimal driving corresponds to a global minimization of
the input energy $E$ in the interval $t_0 \le t \le T$ as a function of acceleration $a$, velocity $v$ and power $P$:

$$E = \int_{t_0}^{T} P(t, a(t), v(t)) \, \mathrm{d}t \qquad (1)$$
At the same time, according to external requirements, such as other road users or speed limits, the following boundary
conditions must be met:
$$v_{\mathrm{lim,min}}(x) \le v \le v_{\mathrm{lim,max}}(x)$$
$$a_{\mathrm{lim,min}}(v) \le a \le a_{\mathrm{lim,max}}(v)$$
$$\dot{a}_{\mathrm{lim,min}}(v) \le \dot{a} \le \dot{a}_{\mathrm{lim,max}}(v) \qquad (2)$$

where $v$ is the velocity, $a$ is the acceleration and $\dot{a}$ is the jerk, with $(\cdot)_{\mathrm{lim,min}}$ and $(\cdot)_{\mathrm{lim,max}}$
representing the lower and upper limits respectively.
Following Freuer [Freuer, 2015] the optimization can be divided roughly into four areas:
1. optimization of the vehicle properties,
2. optimization of traffic routing,
3. optimization on an organizational level,
4. optimization of vehicle control.
This paper deals with the last point. In various contributions [Barkenbus, 2010, Uebel et al., 2018] an adapted
vehicle control system is credited with enormous savings potential. In addition to the safety aspect, assistance systems
supporting vehicle control are becoming increasingly important for this reason as well. This trend is made possible by
comprehensive sensor technology and the supply of up-to-date route data. In terms of longitudinal control, energy-saving
driving modes can thus be encouraged:
driving in energy-efficient speed ranges,
keeping an appropriate distance to vehicles in front,
anticipatory deceleration and acceleration.
Simulation
Simulations are becoming more and more important in automotive engineering. According to Winner and Wachenfeld
[Winner and Wachenfeld, 2015], in the context of the automotive industry the overall system is composed of three parts:
the vehicle, the driving environment and the vehicle control. These three components interact through an exchange of
information and energy.
Within the simulation, a vehicle model is needed which indicates the energy consumption. In general, physical and
data-based approaches are suitable for this kind of modelling [Isermann, 2008].
External influences are represented by the driving environment. This includes, for example, information about other road
users and route data such as traffic light signals or speed limits. This information is then used by the vehicle control
as boundary conditions for the driving strategy.
While in reality the information content of the sensor systems in vehicles is increasing with advancing automation
[Winner et al., 2015], this information can easily be generated in the simulation. Regarding the modelling of the
driving environment, a distinction must be made between deterministic and stochastic approaches. In the deterministic
case it is assumed that the driving environment behaves identically in every run. Changes during the simulation are not
allowed. This means that reality can only be represented in a very simplified way. For example a sudden change of
a traffic light signal or an unforeseen braking of the vehicle in front is not represented by such a model. In contrast,
the stochastic approach offers the possibility to vary external influences during the simulation. Therefore, this type of
modeling is much closer to the real driving situation.
Optimization
The aim of the RL environment is to train an agent to drive an electric vehicle along a single-lane route as
energy-efficiently as possible. This corresponds to the minimization of equation 1 while considering the corresponding
boundary conditions in equation 2.
Examples of state-of-the-art approaches for the optimization of the longitudinal control problem are Dynamic
Programming [Bellman, 1954], Pontryagin's Maximum Principle [Pontryagin et al., 1962] or a combination of both [Uebel
et al., 2018]. As previously mentioned, these approaches have two basic limitations: they are based on deterministic
models and suffer from the curse of dimensionality [Bellman, 1961].
According to [Sutton and Barto, 2018] and [Bertsekas and Tsitsiklis, 1999], RL approaches are a solution to this dilemma.
The main difference between Dynamic Programming and RL is that the former assumes complete knowledge of the model,
whereas RL approaches only require the possibility of interaction with the environment model; solutions are learned without
knowledge of its inner structure. In modern deep RL (DRL), the use of neural networks for function
approximation also allows handling continuous state spaces and reacting to previously unknown states.
2.2 Reinforcement Learning
A standard reinforcement learning framework is considered, consisting of an agent that interacts with an environment
(see Fig. 1). The agent perceives its state $s_t \in \mathcal{S}$ in the environment in each time step $t = 0, 1, 2, \ldots$ and consequently
chooses an action $a_t \in \mathcal{A}$. With this, the agent in turn directly influences the environment, resulting in an updated state
$s_{t+1}$ for the next time step. The selected action is evaluated using a numerical reward $r_{t+1}(s, a)$. The sets $\mathcal{S}$ and $\mathcal{A}$
contain all possible states and actions that can occur in the description of the problem to be learned.
The policy $\pi(a|s)$ specifies for each time step which action is to be executed depending on the state. The aim is to
select actions in such a way that the cumulative reward is maximized.
Policy gradient methods are probably the most popular class of RL algorithms for continuous problems. Currently
very relevant examples of such methods are Proximal Policy Optimization (PPO) [Schulman et al., 2017], Deep
Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2015] and Soft Actor-Critic (SAC) [Haarnoja et al., 2018].
Figure 1: Agent-environment interaction (the agent receives state and reward from the environment and selects an action).
3 RL Environment
3.1 OpenAI Gym
OpenAI Gym [Brockman et al., 2016] is a widely used open-source framework with a large number of well-designed
environments for comparing RL algorithms. It does not rely on a specific agent structure or deep learning framework. To
provide an easy starting point for RL and the longitudinal control problem, the implementation of the LongiControl
environment follows the OpenAI Gym standardization.
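To illustrate the resulting interface, the following minimal sketch shows a random-action rollout against a Gym-style environment. The environment id passed to gym.make is a hypothetical placeholder for illustration; the actual registration name should be taken from the repository.

# Minimal sketch of the Gym-style interaction loop; "longicontrol-v0" is an
# assumed placeholder id, not necessarily the name registered by the package.
import gym

env = gym.make("longicontrol-v0")  # hypothetical environment id

state = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()            # random pedal action in [-1, 1]
    state, reward, done, info = env.step(action)  # standard Gym step signature
    episode_return += reward
print("episode return:", episode_return)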
3.2 Route simulation
Fig. 2 shows an example of the simplified track implementation within the simulation.
Figure 2: An example of the track visualization (speed limit signs of 50, 70 and 90 km/h along the route).
Equation of motion
The vehicle motion is modelled in a simplified manner as uniformly accelerated. The simulation is based on a
time discretization of $\Delta t = 0.1\,\mathrm{s}$. The current velocity $v_t$ and position $x_t$ are calculated as follows:

$$v_t = a_t \, \Delta t + v_{t-1}$$
$$x_t = \tfrac{1}{2} a_t \, (\Delta t)^2 + v_{t-1} \, \Delta t + x_{t-1}$$

The acceleration $a_t$ must be specified through the agent's action in each time step $t$. Since only the longitudinal control
is considered, the track can be modelled as single-laned. Therefore, one-dimensional velocities $v_t$ and positions $x_t$ are
sufficient at this point.
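As a minimal sketch (assuming SI units and the time step stated above), this update can be written as:

# Discrete-time longitudinal motion update, assuming SI units and dt = 0.1 s.
DT = 0.1  # simulation time step in seconds


def motion_step(x_prev, v_prev, a, dt=DT):
    """Return updated position and velocity for one uniformly accelerated step."""
    v = a * dt + v_prev
    x = 0.5 * a * dt ** 2 + v_prev * dt + x_prev
    return x, v


# Example: one step of accelerating at 1 m/s^2 from standstill
x, v = motion_step(x_prev=0.0, v_prev=0.0, a=1.0)
print(x, v)  # 0.005 m, 0.1 m/s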
Stochastic route modelling
The route simulation is modelled in such a way that the track length may be arbitrarily
long and arbitrarily positioned speed limits specify an arbitrary permissible velocity. Here, it is argued that this can
be considered equivalent to stochastically modelled traffic.
Under the requirement that a certain safety distance to the vehicle in front must be maintained, other road users are
simply treated as further speed limits which depend directly on the distance and the difference in speed. For each time
step, the relevant speed limit is then equal to the minimum of the distance-related and the traffic-related limit, as sketched below.
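As a small illustrative sketch of this reduction (the functional form of the traffic-related limit below is an assumption for illustration, not the environment's exact formula):

# Combine the posted limit with a traffic-related limit derived from the lead
# vehicle. The linear relaxation with growing gap is an illustrative assumption.
def effective_speed_limit(posted_limit, gap_m, lead_speed, min_gap_m=10.0):
    """Return the relevant limit as the minimum of the posted and the traffic-related limit."""
    if gap_m <= min_gap_m:
        traffic_limit = 0.0  # safety distance violated: do not close in further
    else:
        # allow a larger speed surplus over the lead vehicle for larger gaps
        traffic_limit = lead_speed + 0.1 * (gap_m - min_gap_m)
    return min(posted_limit, traffic_limit)


# Example: posted 27.8 m/s (100 km/h), 40 m gap to a vehicle driving 22 m/s
print(effective_speed_limit(posted_limit=27.8, gap_m=40.0, lead_speed=22.0))  # 25.0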
As a restriction, speed limits are generated with a minimum possible spacing of $100\,\mathrm{m}$. The permissible velocities are
sampled from $\{20, 30, 40, 50, 60, 70, 80, 90, 100\}\,\mathrm{km/h}$, while the difference between contiguous limits may not be greater
than $40\,\mathrm{km/h}$. It should therefore hold that $x_{\mathrm{lim},j+1} - x_{\mathrm{lim},j} \ge 100\,\mathrm{m}$ and $|v_{\mathrm{lim},j+1} - v_{\mathrm{lim},j}| \le 40\,\mathrm{km/h}$. The former
is a good compromise between inducing as many speed changes per trajectory as possible and still being able to identify
anticipatory driving. The latter is introduced as a further simplification to speed up the learning process, since very
large speed changes may be very hard for the agent to handle.
Up to 150 m in advance, the agent receives information about the upcoming two speed limits.
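A minimal sketch of one way such a route could be sampled under the stated constraints (100 m minimum spacing, limits drawn from the given set, at most 40 km/h between contiguous limits); the additional spacing spread is an assumption, and the environment's exact sampling scheme may differ.

import random

SPEED_LIMITS_KMH = [20, 30, 40, 50, 60, 70, 80, 90, 100]
MIN_SPACING_M = 100.0
MAX_JUMP_KMH = 40


def sample_route(track_length_m, seed=None):
    """Sample (position, limit) pairs respecting the spacing and jump constraints."""
    rng = random.Random(seed)
    position = 0.0
    current = rng.choice(SPEED_LIMITS_KMH)
    signs = [(position, current)]
    while position < track_length_m:
        # next sign at least 100 m further on; the extra spread is an assumption
        position += MIN_SPACING_M + rng.uniform(0.0, 200.0)
        candidates = [v for v in SPEED_LIMITS_KMH
                      if v != current and abs(v - current) <= MAX_JUMP_KMH]
        current = rng.choice(candidates)
        signs.append((position, current))
    return signs


print(sample_route(1000.0, seed=0))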
3.3 Vehicle model
The vehicle model, derived from vehicle measurement data (see Figure 3), consists of several subcomponents. These
have the function of receiving the action of the agent, assigning it a physical acceleration value and outputting the
corresponding energy consumption.
Assigning the action to an acceleration
The action of the agent is interpreted in this environment as the actuation of
the vehicle pedals. In this sense, a positive action actuates the accelerator pedal; a negative action analogously actuates
the brake pedal. Due to the limited vehicle motorization, the acceleration resulting from the pedal actuation depends on
the current vehicle speed (road slopes are neglected).
If neither pedal is actuated (corresponding to action = 0), the vehicle decelerates according to the simulated driving
resistance. This means that to maintain a positive speed a positive action must be selected.
It becomes clear from these explanations that three speed-dependent acceleration values determine the physical range of
the agent: the maximum acceleration, the minimum acceleration and the acceleration value for action = 0.
Determination of the acceleration values
The speed-dependent maximum and minimum acceleration can be determined from the measurement data and the technical data of the vehicle. In the RL environment, the maximum
and minimum values for each speed are stored as characteristic curves. The resulting acceleration at action = 0
is calculated physically: using the driving resistance equation and the vehicle parameters, an acceleration value is
calculated for each speed. This is stored in the environment as a speed-dependent characteristic curve, analogous to the
other two acceleration values.

Figure 3: Assigning the action to an acceleration.

Once the action, the current vehicle speed and the three acceleration values are available, the resulting acceleration can
be calculated as follows:

$$a_t = \begin{cases} (a_{\max} - a_0) \cdot \mathrm{action} + a_0 & \text{if } \mathrm{action} > 0 \\ a_0 & \text{if } \mathrm{action} = 0 \\ (a_0 - a_{\min}) \cdot \mathrm{action} + a_0 & \text{if } \mathrm{action} < 0 \end{cases}$$
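A minimal sketch of this mapping, assuming the three speed-dependent values have already been looked up from the characteristic curves (the numbers in the example are illustrative, not taken from the measurement data):

def action_to_acceleration(action, a_min, a_0, a_max):
    """Map a pedal action in [-1, 1] to a physical acceleration.

    a_min, a_0 and a_max are the speed-dependent minimum, coasting
    (action = 0) and maximum accelerations from the stored curves.
    """
    if action > 0.0:
        return (a_max - a_0) * action + a_0
    if action < 0.0:
        # action = -1 yields a_min; action close to 0 approaches a_0
        return (a_0 - a_min) * action + a_0
    return a_0


# Illustrative values at one fixed speed
print(action_to_acceleration(0.5, a_min=-3.0, a_0=-0.4, a_max=2.0))   # 0.8
print(action_to_acceleration(-1.0, a_min=-3.0, a_0=-0.4, a_max=2.0))  # -3.0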
Calculation of energy consumption
Knowing the vehicle speed and acceleration, the energy consumption can be
estimated from these two values. For this purpose, measured values of an electric vehicle [Argonne National Laboratory,
2013] were learned with a neural network, and this network was stored in the environment.
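The paper does not specify the network architecture; as an illustration only, the following sketch fits a small regression model mapping (velocity, acceleration) to electrical power, with made-up placeholder samples standing in for the dynamometer measurements.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder samples (velocity [m/s], acceleration [m/s^2]) -> power [kW];
# purely illustrative values, not the Argonne measurement data.
X = np.array([[10.0, 0.0], [10.0, 1.0], [20.0, 0.0], [20.0, 1.0], [30.0, 0.5]])
y = np.array([3.0, 18.0, 8.0, 35.0, 30.0])

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
model.fit(X, y)

# Energy for one 0.1 s step at 15 m/s and 0.5 m/s^2 (kW * s = kJ)
power_kw = model.predict([[15.0, 0.5]])[0]
energy_kj = power_kw * 0.1
print(energy_kj)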
3.4 Agent environment interaction
In accordance with the basic principle of RL an agent interacts with its environment through its actions and receives an
updated state and reward.
Action
The agent selects an action in the value range [-1, 1]. The agent can thus choose between the speed-dependent
maximum and minimum acceleration of the vehicle. This type of modelling ensures that the agent can only select
valid actions.
State
The features of the state must provide the agent with all the necessary information to enable a goal-oriented
learning process. The individual features and their meaning are listed in Table 1.
When training neural networks, the learning process often benefits from input variables whose dimensions
do not differ greatly from one another. According to Ioffe and Szegedy [Ioffe and Szegedy, 2015], the gradient descent algorithm
converges faster if the individual features have the same order of magnitude. Since, according to Table 1, different
physical quantities with different value ranges enter the state, a normalization step seems reasonable at this
point. For this purpose, all features are min-max scaled so that they always lie in the fixed interval [0, 1].
Reward
In the following, the reward function, which combines several objectives, is presented. The explanations
indicate the complexity of the multi-objective setting. The LongiControl environment thus provides a good basis for
investigating these issues and for developing automated solutions to address them.
A reward function defines the feedback the agent receives for each action and is the only way to control the agent's
behavior. It is one of the most important and challenging components of an RL environment.
Table 1: Meaning of state features.
Feature  Meaning
v(t)  Vehicle's current velocity
a_prev(t)  Vehicle acceleration of the last time step, so that the agent can develop an intuition for the jerk
v_lim(t)  Current speed limit
v_lim,fut(t)  The next two speed limit changes, as long as they are within a range of 150 m
d_vlim,fut(t)  Distances to the next two speed limit changes, as long as they are within a range of 150 m
If only the energy consumption were rewarded (negatively), the vehicle would simply stand still. The agent would learn that,
from the point of view of energy consumption, it is most efficient simply not to drive. Although this is true, we still want the agent to
drive in our environment. So we need a reward that makes driving more appealing to the agent. By comparing different
approaches, the difference between the current speed and the current speed limit has proven to be particularly suitable.
By minimizing this difference, the agent automatically sets itself in motion. In order to still take energy consumption into
account, a reward component for the energy consumption is maintained. A third reward component is caused by the jerk,
because our autonomous vehicle should also be able to drive comfortably. To finally also penalize the violation of
the speed limits, a fourth reward part is added. Since RL is designed for a scalar reward, it is necessary to weight
these four parts.
A suitable weighting is not trivial and poses a great challenge.
For the combined reward we propose the following (see also Table 2):
$$r_t = -\,\xi_{\mathrm{forward}}\, r_{\mathrm{forward}}(t) - \xi_{\mathrm{energy}}\, r_{\mathrm{energy}}(t) - \xi_{\mathrm{jerk}}\, r_{\mathrm{jerk}}(t) - \xi_{\mathrm{safe}}\, r_{\mathrm{safe}}(t),$$

where

$$r_{\mathrm{forward}}(t) = \frac{|v(t) - v_{\mathrm{lim}}(t)|}{v_{\mathrm{lim}}(t)}, \quad
r_{\mathrm{energy}}(t) = \hat{E}, \quad
r_{\mathrm{jerk}}(t) = \frac{|a(t) - a_{\mathrm{prev}}(t)|}{\Delta t}, \quad
r_{\mathrm{safe}}(t) = \begin{cases} 0 & v(t) \le v_{\mathrm{lim}}(t) \\ 1 & v(t) > v_{\mathrm{lim}}(t). \end{cases}$$

The $\xi$ are the weighting parameters for the individual reward shares. The terms are used as penalties so that the learning
algorithm minimizes their amount. To make it easier to get started with the environment, we have preconfigured
a functioning weighting (see Table 3). In the next section we will show some examples of the effects of different
weightings.
Table 2: Meaning of reward terms.
Reward  Meaning
r_forward(t)  Penalty for slow driving
r_energy(t)  Penalty for energy consumption
r_jerk(t)  Penalty for jerk
r_safe(t)  Penalty for speeding
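A minimal sketch of the combined reward under the definitions above, with all four terms entering as penalties and the default weights of Table 3; the energy term is represented here by the energy consumed in the current step.

DT = 0.1  # simulation time step in seconds


def combined_reward(v, v_lim, a, a_prev, energy_step,
                    xi_forward=1.0, xi_energy=0.5, xi_jerk=1.0, xi_safe=1.0):
    """Weighted multi-objective reward; all four terms act as penalties."""
    r_forward = abs(v - v_lim) / v_lim       # relative deviation from the allowed speed
    r_energy = energy_step                   # energy consumed in this step
    r_jerk = abs(a - a_prev) / DT            # jerk approximation
    r_safe = 1.0 if v > v_lim else 0.0       # speed limit violation indicator
    return -(xi_forward * r_forward + xi_energy * r_energy
             + xi_jerk * r_jerk + xi_safe * r_safe)


# Example: slightly below the limit, smooth acceleration, small energy use
print(combined_reward(v=12.0, v_lim=13.9, a=0.3, a_prev=0.2, energy_step=0.05))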
4 Examples
In the following, various examples of the environment are presented. For training, the agent is confronted with new
routes in each run using the stochastic mode of the environment. For validation, the same deterministic route is always
used in order to compare like with like.
Table 3: Weighting parameters for the reward.
Parameter  Value
ξ_forward(t)  1.0
ξ_energy(t)  0.5
ξ_jerk(t)  1.0
ξ_safe(t)  1.0
4.1 Learning progress
In the following, different stages of an exemplary learning process are presented. An implementation of SAC [Haarnoja
et al., 2018] was chosen as the deep RL algorithm. The hyperparameters used are listed in Table 4. Animated
visualizations of the learning stages described below can be found on GitHub [Dohmen et al., 2019].
Table 4: SAC hyperparameters
Parameter  Value
optimizer  Adam [Kingma and Ba, 2014]
learning rate  0.001
discount γ  0.99
replay buffer size  1000000
number of hidden layers (all networks)  2
number of hidden units per layer  64
optimization batch size  256
target entropy  −dim(A)
activation function  ReLU
soft update factor τ  0.01
target update interval  1
gradient steps  1
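As one possible way to reproduce such a run, the sketch below uses the SAC implementation from stable-baselines3 configured with the hyperparameters of Table 4; this library choice and the environment id are assumptions for illustration and not necessarily the authors' own setup.

# Sketch of a SAC training run with the Table 4 hyperparameters, using
# stable-baselines3 as one possible off-the-shelf implementation (assumption).
import gym
from stable_baselines3 import SAC

env = gym.make("longicontrol-v0")  # hypothetical environment id

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    gamma=0.99,
    buffer_size=1_000_000,
    batch_size=256,
    tau=0.01,
    gradient_steps=1,
    policy_kwargs=dict(net_arch=[64, 64]),  # two hidden layers with 64 units each
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
model.save("sac_longicontrol")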
Beginning of the learning process
At the very beginning of the learning process the agent remains in place and does
not move at all. Then after a few more training epochs the agent starts to move but is not yet able to finish the track.
Figure 4a shows this stage in the deterministic validation run.
After some learning progress
After some progress the agent is able to complete the course (Figure 4b) but ignores
the speed limits and drives very jerkily. Obviously, this is not desirable. Therefore the training continues.
After a longer training procedure
By letting the agent train even longer, it learns to drive more comfortably and
finally starts to respect the speed limits by decelerating early enough. However, in general it still drives quite slowly in
relation to the maximum allowed speed (see Figure 4c).
After an even longer training period
Finally, after an even longer training period, the agent drives very smoothly and respects the speed
limits while minimizing the safety margin to the maximum allowed speed (see Figure 4d).
4.2 Multi-objective optimization
As mentioned before, this problem has several contrary objectives. Thus, multi-objective investigations can also be
carried out. For a better understanding we present three examples.
Reward Example 1
If only the movement reward – the deviation from the allowed speed – is applied (reward
weighting [ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 0]), the agent violates the speed limits because
being 5 km/h too fast is rewarded the same as being 5 km/h too slow (see Figure 5a).
(a) Beginning of the learning process
(b) After some learning progress
(c) After a longer training procedure
(d) After an even longer training period
Figure 4: Learning progress
(a) ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 0
(b) ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 1
(c) ξ_forward(t) = 1, ξ_energy(t) = 0.5, ξ_jerk(t) = 1, ξ_safe(t) = 1
Figure 5: Reward weighting
Reward Example 2
In the second example, the penalty for exceeding the speed limit is added (reward weighting
[ξ_forward(t) = 1, ξ_energy(t) = 0, ξ_jerk(t) = 0, ξ_safe(t) = 1]). This results in the agent actually complying with the
limits (see Figure 5b).
Reward Example 3
In the third example we add the energy and jerk rewards (reward weighting [ξ_forward(t) = 1,
ξ_energy(t) = 0.5, ξ_jerk(t) = 1, ξ_safe(t) = 1]). This results in the agent driving more energy-efficiently and also
choosing smoother accelerations (see Figure 5c).
These examples illustrate that the environment provides a basis to investigate multi-objective optimization algorithms.
For such investigations the weights of the individual rewards can be used as control variables and the travel time, energy
consumption and the number of speed limit violations can be used to evaluate the higher-level objectives.
5 Discussion and Conclusion
Through the proposed RL environment, which is adapted to the OpenAI Gym standardization, we show that it is easy to
prototype and implement state-of-the-art RL algorithms.
In addition, the LongiControl environment is suitable for various investigations. Besides the comparison of RL
algorithms and the evaluation of safety algorithms, investigations in the area of Multi-Objective Reinforcement Learning
are possible. Further possible research objectives are the comparison with planning algorithms for known routes, the
investigation of the influence of model uncertainties and the consideration of very long-term objectives such as arriving at a
specific time.
LongiControl is designed to enable the community to leverage the latest strategies of reinforcement learning to address
a real-world and high-impact problem in the field of autonomous driving.
References
Martin Gründl. Fehler und Fehlverhalten als Ursache von Verkehrsunfällen und Konsequenzen für das Unfallvermei-
dungspotenzial und die Gestaltung von Fahrerassistenzsystemen. PhD thesis, University Regensburg, 2005.
Michelle Bertoncello and Dominik Wee. McKinsey: Ten ways autonomous driving could redefine the automotive
world, 2015.
Tobias Radke. Energieoptimale Längsführung von Kraftfahrzeugen durch Einsatz vorausschauender Fahrstrategien.
PhD thesis, Karlsruhe Institute of Technology (KIT), 2013.
S. Uebel, N. Murgovski, C. Tempelhahn, and B. Bäker. Optimal energy management and velocity control of hy-
brid electric vehicles. IEEE Transactions on Vehicular Technology, 67(1):327–337, Jan 2018. ISSN 0018-9545.
doi:10.1109/TVT.2017.2727680.
Ziqi Ye, Thorsten Plum, Stefan Pischinger, Jakob Andert, Michael Franz Stapelbroek, and Jan-Simon Remco Pfluger.
Vehicle speed trajectory optimization under limits in time and spatial domains. In International ATZ Conference
Automated Driving, volume 3, Wiesbaden, 2017.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A.
Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
Peter Hinssen and Pieter Abbeel. Everything is going to be touched by AI, 2018.
Jim Gao. Machine learning applications for data center optimization, 2014.
Shixiang Gu, Timothy P. Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q-Prop: Sample-
efficient policy gradient with an off-policy critic. CoRR, abs/1611.02247, 2016.
Roman Liessner, Christian Schroer, Ansgar Dietermann, and Bernard Bäker. Deep reinforcement learning for advanced
energy management of hybrid electric vehicles. In Proceedings of the 10th International Conference on Agents and
Artificial Intelligence (ICAART), volume 2, pages 61–72, 2018.
Ahmad Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for
autonomous driving. Electronic Imaging, 2017:70–76, 01 2017. doi:10.2352/ISSN.2470-1173.2017.19.AVM-023.
Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex
Bewley, and Amar Shah. Learning to drive in a day. CoRR, abs/1807.00412, 2018.
Jan Dohmen, Roman Liessner, Christoph Friebel, and Bernard Bäker. LongiControl environment for OpenAI Gym.
https://github.com/dynamik1703/gym_longicontrol, 2019.
Andreas Freuer. Ein Assistenzsystem für die energetisch optimierte Längsführung eines Elektrofahrzeugs. PhD thesis,
2015.
J. N. Barkenbus. Eco-driving: An overlooked climate change initiative. Energy Policy, 38, 2010.
Hermann Winner and Walther Wachenfeld. Auswirkungen des autonomen Fahrens auf das Fahrzeugkonzept, 2015.
Rolf Isermann. Mechatronische Systeme - Grundlagen. Springer-Verlag, Berlin Heidelberg, 2nd edition, 2008.
Hermann Winner, Stephan Hakuli, Felix Lotz, and Christina Singer, editors. Handbuch Fahrerassistenzsysteme.
ATZ/MTZ-Fachbuch. Springer Vieweg, Wiesbaden, 3rd edition, 2015. ISBN 978-3-658-05733-6. doi:10.1007/978-3-
658-05734-3.
Richard Bellman. The theory of dynamic programming. Bull. Amer. Math. Soc., 60(6):503–515, 1954.
L. S. Pontryagin, V. G. Boltyanshii, R. V. Gamkrelidze, and E. F. Mishenko. The Mathematical Theory of Optimal
Processes. John Wiley and Sons, New York, 1962.
Richard E. Bellman. Adaptive Control Processes: A Guided Tour. Princeton Legacy Library, 1961. ISBN
9781400874668.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA,
2nd edition, 2018.
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. 2nd edition, 1999. ISBN 1-886529-10-8.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms. CoRR, abs/1707.06347, 2017.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver,
and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry
Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. CoRR,
abs/1812.05905, 2018. visited: 07.07.2020.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
OpenAI Gym. CoRR, abs/1606.01540, 2016.
Argonne National Laboratory. Downloadable Dynamometer Database (D3), generated at the Advanced Mobility Technology
Laboratory (AMTL) under the funding and guidance of the U.S. Department of Energy (DOE), 2013. Visited: 07.07.2020.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal
covariate shift. CoRR, abs/1502.03167, 2015.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning
Representations, 12 2014.