
Charging control of electric vehicles using contextual bandits considering the electrical distribution grid


Christian Römer1 [0000-0003-4386-9323], Johannes Hiry2 [0000-0002-1447-0607],
Chris Kittl2 [0000-0002-1187-0568], Thomas Liebig1 [0000-0002-9841-1101], and
Christian Rehtanz2 [0000-0002-8134-6841]
1 TU Dortmund University
Department of Computer Science
Otto-Hahn-Str. 12, 44227 Dortmund, Germany
{christian.roemer, thomas.liebig}
2 TU Dortmund University
Institute of Energy Systems, Energy Efficiency and Energy Economics
Emil-Figge-Str. 70, 44227 Dortmund, Germany
{johannes.hiry, chris.kittl, christian.rehtanz}
Abstract. With the proliferation of electric vehicles, the electrical
distribution grids are more prone to overloads. In this paper, we study an
intelligent pricing and power control mechanism based on contextual bandits
to provide incentives for distributing charging load and preventing network
failure. The presented work combines the microscopic mobility simulator SUMO
with electric network simulator SIMONA and thus produces reliable electrical
distribution load values. Our experiments are carefully conducted under
realistic conditions and reveal that conditional bandit learning outperforms
context-free reinforcement learning algorithms and our approach is suitable
for the given problem. As reinforcement learning algorithms can be adapted
rapidly to include new information we assume these to be suitable as part of
a holistic traffic control scenario.
Keywords: electric mobility · power supply · grid planning · reinforcement
learning · contextual bandit · artificial intelligence · travel demand
analysis and prediction · intelligent mobility models and policies for urban
environments
1 Motivation
The amount of street-based individual traffic is rising worldwide. In Germany,
the number of registered passenger cars rose by 20.9% between the years 2000
and 2018 [14]. At the same time, the use of electric vehicles is becoming more
popular: since the year 2006, the number of registered electric vehicles in
Germany has risen from 1931 to 53861 [5]. This can be seen as an indicator of
a worldwide development. When such a large number of vehicles enters the
market, it has to be considered that they act as new loads in the electrical
grid. Grid operators need to include this additional load in their future
planning. According to Uhlig et al. [17], the charging of electric vehicles
(EVs) in low voltage grids occurs mainly in the early to late evening, when
load peaks are already common. The combination of already existing peaks and
the additional load due to the charging of EVs can lead to critical loads in
the involved operating resources. Schierhorn and Martensen [13] found that the
first line overloads occur at EV market coverages of 8% when vehicles are
allowed to charge at all times without restrictions or sophisticated charging
strategies. In this paper we investigate the load reduction achieved by a
smart charging strategy.
The situation is further complicated by the emergence of renewable energies
and the liberalization of the energy market, leading to challenging daily
fluctuations with partly opposing targets. While grid operators need to
uphold a high service quality by keeping the additional strain on the grid
caused by electric vehicles within reasonable bounds, the charging station
operators try to maximize the utilization of their assets.
2 Related Works
Waraich et al. [20] examined an approach for studying the effects of electric
vehicles on the power grid and possible strategies for smart charging and the
prevention of power shortages. They integrated the simulation tool MATSim for
the simulation of traffic flows and energy usage with a "PEV management and
power system simulation (PMPSS)", which simulates electrical power grids and
EV charging stations. In the simulation they equipped each subnet with a
"PEV management device", which optimizes the charging of electric vehicles.
The optimization routine changes a price signal, which depends on the grid's
state, the number of charging vehicles and their urgency. The vehicles react
to the price signal within the scope of the MATSim simulation cycle. The
authors survey different scenarios and the effects of EV charging on the power
grid regarding average and peak loads over the course of a day. The management
device knows the exact daily routine of every vehicle and can use that to plan
In [15], the authors present an optimization method to plan the charging of
electric vehicles. The main focus of their study is the decision on a charging
time interval that maintains the performance of the individual EV and reduces
energy costs. In contrast, our work adjusts either energy prices or charging
power such that the quality of service is guaranteed by the smart grid. An
interesting related study was published in [18] and discusses bidding
strategies of electric vehicle aggregators on the energy market.
The application of machine learning methods, especially reinforcement
learning, to problems in the planning and operation of power grids does not
seem to be well researched yet. Vlachogiannis et al. [19] used a Q-learning
algorithm for reactive power control, finding that while the algorithm was
able to solve the problem, it required a long training phase to converge.
However, they considered neither EVs nor renewable (weather-dependent) energy
generation. Other authors used learning methods to provide a frequency
regulation service via a vehicle-to-grid market [21] or to optimize the
residential demand response of single households using an EV battery as a
buffer for high-demand situations [11].
The usage of bandit algorithms and the test environment was motivated by
the successful application to the problem of avoiding traffic jams using a self-
organizing trip-planning solution in [10].
3 Fundamentals
This section describes the fundamentals regarding electric power grids, electric
mobility and the used learning algorithms needed for comprehension of this work.
3.1 Electric power grids
An electric power grid encompasses all utilities required for the transmission
and distribution of electric power, such as cables, overhead lines,
substations and switching devices. Generally, the operational costs of
utilities are higher for higher voltages. However, transmitting large amounts
of power over great distances is only feasible at high voltages due to
transmission losses. To obtain maximal cost effectiveness, the system is
segmented into hierarchical levels, each with a specific purpose. Extra high
and high voltage grids transmit power over large distances and connect
large-scale industry, cities and larger service areas with large and
medium-sized power plants. The purpose of medium voltage grids, which are fed
by the high voltage grids and smaller plants, is to supply industry,
large-scale commercial areas, public facilities and city or rural districts.
Low voltage grids form the "last mile" to supply residential or commercial
areas and are fed by the medium voltage grids or very small plants such as
private photovoltaic systems.
3.2 Electric mobility
The propulsion technologies in electric mobility can be roughly divided into
three categories: a) hybrid vehicles, which use two different energy sources
for propulsion, most often one gasoline engine and one electric motor;
b) plug-in hybrid vehicles, which are hybrid vehicles that can be connected to
the power grid to charge the battery while parking; and c) battery electric
vehicles, which do not have a gasoline engine (except sometimes a so-called
range extender, which however is not directly connected to the powertrain). In
this work we concentrate on pure battery electric vehicles (BEVs).
Various charging technologies exist for connecting BEVs to the electrical
power grid. The specifications for conductive charging technologies are mainly
defined in IEC 61851. An international as well as a European and German
standard for inductive charging is currently under development (DIN EN 61980,
based on IEC 61980 and others). Because inductive charging technologies are
not yet in frequent use, they are not considered further in this work. Hence,
considering only conductive charging, one main distinction can be made between
alternating current (AC) and direct current (DC) charging. AC charging can be
further subdivided depending on the required maximum power, the number of
phases used and the grid coupler.
Table 1. Charging modes defined by DIN EN 61851
Mode | Definition | Technology | Communication
1 | Direct household socket connection | AC, 1 or 3 phase(s) | none
2 | Same as 1 plus in-cable control and protective device (IC-CPD)/low level control pilot function | Same as 1 | Control Pilot
3 | Dedicated charging station "wallbox" | AC, 1 or 3 phase(s) | Control Pilot
3 | Dedicated charging station | AC, 1 or 3 phase(s) | Powerline (PLC)
4 | Dedicated charging station | DC | Powerline (PLC)
Depending on the kind of charging infrastructure available, one can
distinguish between uncontrolled and controlled charging. Uncontrolled in this
context means that the maximum installed power of the charging station is
available to the connected car for the whole charging process. During the
process there are no external interventions by other entities of the
electrical power system (e.g. the distribution system operator (DSO)), nor are
any load shifting or charging strategies executed. This kind of charging can
be provided by any of the charging modes shown in Table 1. In the controlled
case, the available installed power can be altered within the technical
limits. Specifically, controlled charging can be used to reduce the load on
the electrical grid by shifting the charging process from times with high
overall grid utilization to times with lower grid utilization, or to carry out
specific charging strategies for an electric vehicle fleet. This process can
be carried out either in a centralized (e.g. the DSO executes load curtailment
actions) or a decentralized (e.g. the charging station reduces its power by
itself) way. The centralized approach is only possible if the necessary
communication infrastructure is available. Hence, only charging mode 3 with
PLC or mode 4 is suitable for centralized controlled charging.
3.3 LinUCB algorithm for contextual bandits
Algorithm 1.1 LinUCB according to Li, Chu, Langford and Schapire [9].
Input: $\alpha \in \mathbb{R}^+$, context dimension $d$, arms $A$
for all arms $a \in A$ do
    Initialize context histories $A_a$ and reward histories $b_a$
    [Hybrid] Initialize shared context history $B_a$ and shared reward history $b_0$
end for
for all rounds $t = 1, 2, 3, \ldots, T$ do
    Observe context for all arms $a \in A_t$: $x_{t,a} \in \mathbb{R}^d$
    for all arms $a \in A_t$ do
        Using the context and reward histories $A_a$ and $b_a$, do a ridge
        regression, updating the coefficients $\hat{\theta}_a$. Using the
        coefficients $\hat{\theta}_a$ and the current context vector $x_{t,a}$,
        determine the expected reward $p_{t,a}$.
        [Hybrid] Besides $\hat{\theta}_a$, also consider the shared context
        history $B_a$ and the shared reward history $b_0$, and create shared
        coefficients $\hat{\beta}$.
    end for
    Choose the arm that maximizes the expected reward,
    $a_t = \arg\max_{a \in A_t} p_{t,a}$, and observe the reward $r_t$.
    Update the context history $A_{a_t}$ and the reward history $b_{a_t}$.
    [Hybrid] Update the shared context history $B_a$ and the reward history $b_0$.
end for
In the multi-armed bandit problem, an agent has to make repeated choices
within a limited time frame between various actions, each having a fixed but
unknown reward probability distribution, so as to maximize the expected gain
from the rewards received in each round. As the time frame, or any other
resource, is limited, the agent must constantly balance between the
exploitation of promising actions and the exploration of those actions whose
expected reward it cannot yet estimate well [16]. Various approaches to this
exploration-exploitation dilemma exist. A commonly used one is a family of
algorithms called UCB (for upper confidence bounds). The idea is to maintain a
confidence interval of the expected reward for each possible action and always
choose the action with the highest upper confidence bound [1].
This basic algorithm, which apart from the stored intervals is stateless, can
be extended to include environmental information, leading to the so-called
contextual bandits. In this work we examined one particular implementation of
that algorithm family called LinUCB, as first proposed by Li et al. [9]. Let
us assume an agent is put into an environment in which it has to decide
between various actions (e.g. moving a piece in a game of chess) in discrete
time steps $t = t_0, t_1, \ldots$. In each time step, the agent perceives the
environment (e.g. the positions of the pieces on a chess board) before making
a decision. In LinUCB, this perception at time $t$ is encoded as a context
vector $x_t \in \mathbb{R}^d$. The action at time $t_i$ is chosen by computing
a ridge (linear) regression between the already observed context vectors
$x_{t,a}$ and the resulting reward values $r_t$ for each time step
$t = t_0, \ldots, t_{i-1}$ and each action $a$, thus yielding the expected
reward for choosing each action in the current situation. Exploration is
promoted by adding the standard deviation of this estimate to the expected
reward.
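The selection rule described above can be illustrated with a minimal Python sketch of a LinUCB arm that keeps its own regression (no shared features). This is an illustration under stated assumptions, not the implementation used in the experiments: the class name `DisjointLinUCBArm`, the helper `choose_arm` and the incremental Sherman-Morrison update of the inverse matrix are choices made here for a dependency-free example.

```python
import math


class DisjointLinUCBArm:
    """One arm of a LinUCB bandit with its own ridge regression (sketch)."""

    def __init__(self, d, alpha):
        self.alpha = alpha  # exploration weight
        # Inverse of A = I_d + sum(x x^T); starts as the identity matrix.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
        self.b = [0.0] * d  # sum of reward-weighted context vectors

    def _A_inv_times(self, vec):
        return [sum(row[j] * vec[j] for j in range(len(vec))) for row in self.A_inv]

    def ucb(self, x):
        """Expected reward plus alpha times the standard deviation term."""
        theta = self._A_inv_times(self.b)  # ridge estimate A^{-1} b
        v = self._A_inv_times(x)           # A^{-1} x
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(vi * xi for vi, xi in zip(v, x)), 0.0))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Add one observation; A^{-1} is updated via Sherman-Morrison."""
        v = self._A_inv_times(x)
        denom = 1.0 + sum(vi * xi for vi, xi in zip(v, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= v[i] * v[j] / denom
            self.b[i] += reward * x[i]


def choose_arm(arms, contexts):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x) for arm, x in zip(arms, contexts)]
    return max(range(len(arms)), key=scores.__getitem__)
```

In each round one would call `choose_arm` with the per-arm context vectors, play the returned arm, and pass the observed reward to that arm's `update`.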
Due to the complexity of the algorithm, it cannot be explained in full here;
we therefore only present brief pseudocode in Algorithm 1.1. For details,
please refer to the original paper [9]. Li et al. considered two versions of
the algorithm: one in which each action/arm only considers previous contexts
in which this action was chosen (called the disjoint (context) model) and
another in which the arms additionally share context information (called the
hybrid model). The lines marked with [Hybrid] are only considered in the
hybrid model.
Algorithm 1.2 Q-Learning based on Sutton and Barto [16, p. 149].
Initialize $Q(s, a)$ arbitrarily
for all episodes do
    Initialize $s \leftarrow s_0$
    while state $s$ is not terminal do
        Choose $a$ from $s$ using a policy derived from $Q$, e.g. $\epsilon$-greedy
        Execute action $a$, observe reward and state $r_{t+1}, s_{t+1}$
        $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$
        $s \leftarrow s_{t+1}$
    end while
end for
3.4 Q-Learning
The Q-Learning algorithm computes a mapping from state-action pairs to real
numbers, each representing the value for the agent of taking the action in the
given state.
Initially, all Q-values are set to a problem-specific starting value. Every
time the agent receives a reward $r_{t+1}$ for taking action $a$ in the
current state $s$, the value for this state-action pair is updated:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right], \quad \alpha, \gamma \in [0, 1]$   (2)
$\alpha$ is the learning rate with which new information is incorporated.
$\gamma$ is the factor for discounted rewards. On the basis of the Q-values,
an agent can estimate the approximate 'profitability' of choosing a certain
action in a certain state. For a complete algorithm this mechanism needs to be
extended by an action selection policy. In this work we used a policy called
$\epsilon$-greedy. This policy accepts a probability parameter
$\epsilon \in (0, 1]$. Each time the agent needs to make a decision, this
policy chooses the optimal action (according to the Q-values) with a (high)
probability of $(1 - \epsilon)$ or, with probability $\epsilon$, one of the
non-optimal actions uniformly at random. Randomly choosing a non-optimal
action from time to time promotes the exploration mentioned previously [16].
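The update rule (2) and the described policy can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the dictionary-based Q-table and the default starting value of 0.0 are assumptions made for the example.

```python
import random
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for the transition (s, a, r, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon, otherwise a
    uniformly random non-greedy action, as described in the text."""
    best = max(actions, key=lambda a: Q[(s, a)])
    others = [a for a in actions if a != best]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return best


# Q-table with an assumed starting value of 0.0 for every state-action pair.
Q = defaultdict(float)
```

A training loop would alternate `epsilon_greedy` to select an action and `q_update` to incorporate the observed reward and follow-up state.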
4 Methods
This section briefly describes the frameworks, the input data and the overall
process of the experiments conducted in this work.
[Figure: block diagram; labels include Main Program, Load data, Instructions (TraCI), Battery status, Demand Definition, Load requests, Charging point model, Database, Electric Grid Definition]
Fig. 1. Overview of the various tools used for the experiments.
SUMO, short for Simulation of Urban MObility, is an open source software
package for the simulation of traffic flows [7]. It is microscopic (simulating
each individual vehicle), inter- and multimodal (supporting multiple types of
vehicles and pedestrians), space-continuous and time-discrete. The tool was
chosen because it supports the simulation of energy/fuel consumption and
external online intervention in vehicle behavior via a socket API. The
simulation requires various input definitions for the street network and the
vehicles' mobility demands, which define where and when vehicles enter and
leave the environment. We used the tool a) to accurately measure the energy
consumption of electric vehicles, and therefore the additional demand for the
electric grid, and b) to determine where and when vehicles park near charging
stations.
SIMONA is a multi-agent simulation tool for electric distribution grids [4].
It integrates various heterogeneous grid elements which can react to the
observed power system state. The main purpose of SIMONA is the calculation of
time series under varying future scenarios and of the effect of intelligent
grid elements, for use in grid expansion planning. Due to the agent structure,
the elements can be individually parameterized and can actively communicate
with each other, which is beneficial for considering intelligent control
elements like the one examined in this work. Like SUMO, SIMONA acts in
discrete time steps, simulating loads, generators and other grid elements
bottom-up in the context of weather and other non-electrical data to determine
the power system's state. This state includes the load flow in each time step,
from which the strain/loading put on each grid element can be derived.
Input data The tool chain requires various input definitions as depicted in
Fig. 1. Our goal was to create a scenario that represents typical vehicle
flows in a city. Fortunately, several authors have already created realistic
street network and demand definitions of several European cities. We chose the
Luxembourg SUMO Traffic dataset created by Codecá et al. [3], as the city fits
our needs and the representation is of high quality. The authors showed that
the demands included in the dataset realistically recreate the actual traffic
in Luxembourg. For usage in the experiments, however, two issues arise. The
dataset contains individual trips between two locations, each having a
starting time and a random id, without the possibility to identify which trips
belong to the same vehicle. Each trip generates a vehicle in SUMO that is
spawned at the starting time and removed from the simulation as soon as it
reaches its destination. Furthermore, the dataset contains no parking areas.
As the continuity of individual vehicles, including the preservation of their
battery state, is vital to this work, we took measures to identify trips
belonging to the same logical vehicle. We aimed to find cycles of trips of
length 2 to 4, with each trip starting on the same net edge the previous trip
ended on and the trips being ordered by starting time, meaning the last trip
of the cycle had to start last on a given day. We discovered these cycles by
building a directed graph in which each trip is a node. For each pair of nodes
$(f, g)$, the graph contains an edge $(f \to g)$ if the starting edge of $g$
matches the ending edge of $f$ and the departure time of $g$ is later than
that of $f$. Inside this graph, the trip cycles could be found using a
depth-first search. Using this method, a total of 13934 trip cycles/vehicles
were identified, which use 27567 single trips (12.8% of all original trips).
Fig. 2 shows the normalized distributions of the departures in the original
dataset and in the extracted tours for the electric vehicles.
Fig. 2. Cumulated normalized departures per half-hour in the original Luxembourg
dataset and the extracted tours for electric vehicles.
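The trip-matching procedure described above can be sketched as follows. The sketch makes several assumptions that are not stated in the paper: the `Trip` record and its field names are hypothetical, each trip is assigned to at most one cycle, trips are tried greedily in order of departure time, and the additional condition that the last trip must start last on a given day is omitted for brevity.

```python
from collections import namedtuple

# Hypothetical trip record: id, start edge, end edge and departure time.
Trip = namedtuple("Trip", "id from_edge to_edge depart")


def find_trip_cycles(trips, min_len=2, max_len=4):
    """Greedily extract chains of trips that return to their first start edge."""
    by_start = {}
    for t in trips:
        by_start.setdefault(t.from_edge, []).append(t)
    used, cycles = set(), []

    def dfs(chain):
        last = chain[-1]
        # A chain closes a cycle when it ends where its first trip started.
        if len(chain) >= min_len and last.to_edge == chain[0].from_edge:
            return list(chain)
        if len(chain) == max_len:
            return None
        # Follow graph edges f -> g: g starts where f ended and departs later.
        for nxt in by_start.get(last.to_edge, []):
            if nxt.id in used or nxt.depart <= last.depart:
                continue
            chain.append(nxt)
            found = dfs(chain)
            chain.pop()
            if found:
                return found
        return None

    for first in sorted(trips, key=lambda t: t.depart):
        if first.id in used:
            continue
        cycle = dfs([first])
        if cycle:
            used.update(t.id for t in cycle)
            cycles.append([t.id for t in cycle])
    return cycles
```

For example, a morning trip from a home edge to a work edge and an evening trip back would be merged into one logical vehicle, while a trip without a matching return stays unassigned.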
The energy and charging models for the electric vehicles are based on the
work of Kurczveil, López and Schnieder [8], who developed an integration of
electric vehicles and inductive charging stations into SUMO. We adopted their
computation system for conductive charging. The model was parameterized using
the values (e.g. vehicle mass, front surface area and maximum battery
capacity, among others) of the most popular pure electric vehicle (by stock)
in Germany with a 22 kWh battery [6]. The manufacturer states a realistic
range of about 107 km in winter. We conducted a verification experiment in
which vehicles drove random routes through the simulated city until their
battery was completely depleted. Using the energy model of Kurczveil et al.,
we measured an average range of 104.0 km with a standard deviation of 6.0 km,
leading us to the assumption that the model and its parameterization are
sufficiently realistic.
For the simulation of the electric nets, SIMONA requires a net definition
containing the various elements to be simulated and their parameters. The
CIGRÉ Task Force C6.04.02 created datasets of representative nets with typical
topologies found in Europe and North America [2]. These datasets are suitable
for research purposes. The document contains three different topologies for
European low voltage grids, one each with a residential, commercial and
industrial focus. To determine the count, position and type of the nets, we
used openly available data from OpenStreetMap (OSM), especially the
coordinates of substations and land uses. Each substation position in OSM was
used as the base coordinate of one grid. In the next step, each of the
previously extracted parking areas was assigned to the nearest grid (by
Euclidean distance). After that, the grid was rotated and stretched so as to
minimize the average distance between the net nodes and the respective parking
areas. The grid type was determined by considering the closest land use
definition which had to be either residential,
commercial or industrial in OSM. This process resulted in the creation of 60
Process On startup, the program initializes SUMO and SIMONA with the
parameters stated before and creates the simulated vehicles and parking areas
/ charging stations. The charging stations are associated with their
respective grid nodes in SIMONA. The simulation acts in discrete time steps of
1 second each. In each time step, the vehicles are updated first, either
updating their position if they are underway or checking whether the next
departure time has been reached if they are parked. When a vehicle reaches a
charging station with free capacity, the charging station's decision agent is
updated with current data (the charging point's relative load, the current
load of the substation belonging to the charging point and the load of up to 5
neighboring substations, the current time and the vehicle's current battery
state of charge) and is enabled to take an experiment-specific action (e.g.
changing the station's offered charging price or the offered maximum power).
After that, the vehicle agent can decide whether it starts the charging
process. The charging process cannot be interrupted once started, except when
the vehicle is leaving the parking area. Additionally, the process can only be
started when the vehicle arrives at the station.
When a vehicle leaves a parking area, it receives information about the
current status/offers of charging stations within walking range of its
intended target. The vehicle agent can use this information to divert from its
original target, for example to get a cheaper charging price. The loads of all
charging points are averaged over 5 minutes each and synchronized with the
SIMONA framework every 300 time steps.
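With time steps of 1 second, the 5-minute averaging corresponds to windows of 300 samples. A minimal sketch of this aggregation step is shown below; the function name and the list-based representation of the load series are assumptions for illustration, and how the averaged values are handed over to SIMONA is not covered.

```python
def average_in_windows(loads, window=300):
    """Average a per-second load series over consecutive windows.

    With 1-second time steps, window=300 corresponds to 5 minutes. A shorter
    trailing window is averaged over the samples it contains.
    """
    return [
        sum(loads[i:i + window]) / len(loads[i:i + window])
        for i in range(0, len(loads), window)
    ]
```

Each returned value would then be the load reported for one charging point at one synchronization point.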
5 Experiments
To evaluate the learning algorithms we conducted experiments in which the al-
gorithms were used to control the behaviors of charging stations. In the following
section we define a game that is being played by the charging station agents.
5.1 Game description
Using the definition of Russell and Norvig [12], the game consists of a sequence
of states which can be defined via six components.
S0 (initial state): The simulation starts on 3 January 2011 (Monday) at
    00:00. All vehicles start at the starting edge of their first planned trip
    with a 100% charged battery. Charging prices are initialized to
    0.25 €/kWh (where applicable). The initial state of the electric grid
    results from the first load flow calculation in SIMONA.
ACTIONS(s) (listing of possible actions that the agent can take in state s):
    We examined two different action models and two target variables (changing
    the charging price and changing the offered charging power). The variants
    Variant A:
        The agent can take three different actions:
        · Increase charging price
        · Decrease charging price
        · Keep charging price
        The action "increase charging price" is valid when the price has not
        been decreased in the current time step yet and a defined maximum
        price has not been reached yet. "Decrease charging price" is
        The action "keep charging price" is always valid.
        This variant is only compatible with the 'Price' target variable.
    Variant B:
        The agent can take five different actions:
        · Set charging price/power to 10%, 25%, 50%, 75%, 100%.
        All actions are always valid.
PLAYER(s) (determines which player/agent is choosing the next action): In
    every time step, all agents belonging to charging stations of parking
    areas at which a vehicle arrived in that time step need to decide on an
    action. If this happens for multiple agents at the same time, the order is
    chosen randomly.
RESULT(s, a) (state transition model of action a in state s): Every action
    causes a change of the respective parameter (charging price or maximum
    power). The exact state transitions are determined by the simulation
    environment. The follow-up state of the last issued action in time step t
    arises as soon as the next action is required in a time step (t + x).
TERMINAL-TEST(s) (tests whether the state s is a terminal state, marking the
    end of the simulation run): The terminal state is reached after 864000
    time steps (translating to 240 hours of simulated time).
UTILITY(s, p) (utility function of player p in state s): We examined two
    different utility functions. Let $\bar{l}_t \in [0, 1]$ be the average
    load and $(\max l_t) \in [0, 1]$ the maximum load of the respective
    substation. Let $\bar{c}$ [€/kWh] be the average charging price. Let $m$
    [€] be the average income (charging price × charged energy) of the
    charging pole operator. Let $\gamma \in \mathbb{R}$ be a balancing factor
    which constitutes a simulation parameter. We defined the utility functions
    as follows:
    $(\bar{l}_t \cdot (-1)) + ((\max l_t) \cdot (-1)) + (\gamma \cdot \bar{c} \cdot (-1))$   Variant 'Price' (3)
    $(\bar{l}_t \cdot (-1)) + ((\max l_t) \cdot (-1)) + (\gamma \cdot m)$   Variant 'Income' (4)
    The average and maximum loads affect the reward negatively. The motivation
    of the first variant was to reduce the price as much as possible to
    attract customers without overloading the grid elements. In the second
    variant the pricing has been replaced by the service actually rendered (in
    the form of the generated income), which the charging station operator
    aims to maximize. In both cases there is a conflict of interest between
    the charging station operator (aiming to generate income through high
    power throughput) and the grid operator (aiming to prevent overloads),
    both of which the agent accounts for.
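The two utility functions (3) and (4) translate directly into code. The following sketch uses hypothetical function and parameter names; loads are assumed to be given as fractions in [0, 1], as in the definitions above.

```python
def utility_price(avg_load, max_load, avg_price, gamma):
    """Variant 'Price' (3): loads and the average price are penalized."""
    return -avg_load - max_load - gamma * avg_price


def utility_income(avg_load, max_load, income, gamma):
    """Variant 'Income' (4): loads are penalized, income is rewarded."""
    return -avg_load - max_load + gamma * income
```

The balancing factor `gamma` controls how strongly the price or income term weighs against the two load terms, so its choice determines which side of the conflict of interest dominates the reward.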
Formally, this game definition results in a multi-objective optimization
problem for each net, which involves the minimization of $\bar{l}_t$ and
$(\max l_t)$ and the minimization of $\bar{c}$ or the maximization of $m$,
respectively, with the target variables depending on the set of actions taken
by all charging station agents. For this definition we treat the maximization
of $m$ as equivalent to the minimization of $-m$ to reach a consistent
notation. We formalize the actions taken by each agent as integer numbers, and
each agent's solution to the game as a vector of $k$ possible actions taken in
$t$ possible time steps.
$\min_x \left( \bar{l}_t(x), \max l_t(x), \bar{c}, -m \right)$   (5)
s.t. $x \in X$ with $X \subseteq \mathbb{Z}^{k,t}$
The complete solution to the game would consist of $n$ agents' solutions, with
$n$ being the number of charging station agents participating.
5.2 Strategy profiles
We examined multiple strategy profiles, which are described below. After this
description the profiles are referenced by their shorthand names. The profiles
are determined by the following charging station agent behaviors and the
utility functions (3) and (4) defined in the previous section.
ConstantLoading: The charging point never changes its offered price/power.
    The offered power is always 100% of the maximum value.
WorkloadProportional: The charging point changes its price/power in
    proportion to the load of the respective substation.
Random: The price/power is determined randomly between two set thresholds.
LinUCB Disjunct: The agent uses a LinUCB bandit algorithm with disjoint
    contexts to determine the price/power.
LinUCB Hybrid: Like LinUCB Disjunct, but with hybrid contexts.
QLearning: The agent uses a Q-Learning algorithm to determine the price/power.
The behavior of the vehicles is determined by their charging and diversion
behavior. For the charging behavior we examined these two main variants:
AlwaysLoad: The vehicle always starts the charging process if it has the
    possibility to.
PriceAware: The vehicle keeps a history of the last seen charging prices and
    only starts the charging process if the following condition holds. Let
    $c_{akt}$ [€/kWh] be the currently offered price. Let $C$ be the saved
    price history. Let $b_{SoC}$ [%] be the battery state of charge.
100 |cC:ccakt|
We also examined variants in which the vehicles only charge 'at home' (that
is, at the first charging station of the day). Apart from these behaviors, the
vehicles always charge if the battery state of charge falls below 20%, to meet
basic comfort requirements. Two different diversion behaviors were considered:
DoNotDivert: The vehicles do not change targets.
DivertToCheapest / DivertToHighestPower: The vehicle can change the target to
    an alternative charging station in walking distance to the desired target
    edge.
The DivertToCheapest behavior was always used when the price was the variable
controlled by the charging station. The DoNotDivert behavior was used with the
ConstantLoading behavior to determine the load in the uncontrolled case. In
other experiments the DivertToHighestPower behavior was used.
For simplicity, all parking areas were equipped with charging stations with a
fixed maximum charging power of 11 kW per space. There was no distinction
between private and public charging points in the simulation. We conducted two
series of experiments. In the first one we aimed to determine the effects of
uncontrolled charging on the simulated electrical grid. The simulation was run
once without electric vehicles (to determine the base case) and once with the
profiles ConstantLoading/AlwaysLoad/DoNotDivert. The second series was run to
determine the effects of the learning algorithms with varying configurations.
6 Results
For the first experiment series, our thesis was that the uncontrolled charging of
electric vehicles would lead to problems in at least some grid elements. As Fig. 3
shows, the maximum transformer load over the course of one day comes to
54.4% of the transformer's rated power in the base case but rises to a peak of
over 200% in the uncontrolled charging case. Note that this value is dominated by
one outlier net which receives an exceptional level of traffic. However, transformer
loads between 101.2% and 130.5% were measured in 7 more nets, which means that
8 of the 60 subnets registered at least one overload situation, as can be seen in
Fig. 4. There were no overloads in other net elements such as power lines, and no
significant voltage deviations.
Fig. 3. Maxima and average values over all 5-minute transformer load maxima (i.e.,
the maximum load value of every 5-minute step is taken for each transformer and of
these 60 values the maximum / average is taken) in the base case (without electric
vehicles) and the uncontrolled charging case.
Fig. 4. Transformer load maxima over one day and net type by net in the uncontrolled
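The aggregation behind the reported transformer statistics (the daily peak per transformer, then the maximum and average over all transformers) can be sketched as follows. The function name, the input layout and the example numbers are hypothetical; loads are given in percent of the rated power, and a value above 100% is counted as an overload.

```python
from typing import Dict, List, NamedTuple, Sequence


class PeakSummary(NamedTuple):
    maximum: float          # highest daily peak over all transformers
    average: float          # mean of the daily peaks
    overloaded: List[str]   # transformers whose peak exceeds 100 %


def summarize_daily_peaks(
        loads_percent: Dict[str, Sequence[float]]) -> PeakSummary:
    """Aggregate 5-minute load maxima per transformer (illustrative only)."""
    # One peak per transformer: the largest 5-minute value of the day.
    peaks = {name: max(series) for name, series in loads_percent.items()}
    values = list(peaks.values())
    overloaded = sorted(name for name, peak in peaks.items() if peak > 100.0)
    return PeakSummary(max(values), sum(values) / len(values), overloaded)


# Made-up example data, not simulation results.
example = {
    "trafo_1": [40.0, 54.4, 30.0],
    "trafo_2": [90.0, 120.0, 80.0],
    "trafo_3": [10.0, 20.0, 15.0],
}
summary = summarize_daily_peaks(example)
```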
In the second experiment series we took a deeper look into the effects of
the various control algorithms for charging stations on the grid. Our theses were
that the learning algorithms improve their behavior (measured by the received
reward values) over time, that the bandit algorithm reduces the negative effects
of the vehicle charging loads on the grid, and that the learning algorithms overall
perform better than the simpler ones. Fig. 5 (top) shows the absolute change
between the averaged received rewards of the first and last day over all charging
point agents for various selected profiles. The values in parentheses state the
action variant used (A or B), the utility function (I. = Income, P. = Price) and
the target variable (Po. = Power, Pr. = Price). It can be seen that some variants
of LinUCB and QLearning develop positively (in the sense of received rewards),
which indicates that the learning target is being approached. Note, however, that
many parameter combinations did not perform very well. Agents that used
agent model variant A, or variant B with the price as the controlling variable,
rarely converged towards a meaningful result. Also, in some configurations even
small changes to the balancing factor γ or the LinUCB parameter α had a
significant impact on the overall performance. Fig. 5 (bottom) shows
the development of the received rewards for selected scenarios. The values have
been normalized to [0,1] for better comparability. It is noticeable that the general
trend of the development can usually be seen after just 1 day of simulation.
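To make the role of the LinUCB parameter α concrete, the following is a minimal sketch of the standard disjoint LinUCB rule, in which α scales the confidence width added to the estimated reward of each action. It is not the implementation used in the experiments. The class and method names and the context vector are hypothetical, and the inverse matrices are maintained with the Sherman-Morrison formula to keep the example free of external libraries.

```python
import math
from typing import List, Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def _matvec(m: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    return [_dot(row, x) for row in m]


class DisjointLinUCB:
    """Disjoint LinUCB with one linear model per action (illustrative only)."""

    def __init__(self, n_actions: int, n_features: int,
                 alpha: float = 1.0) -> None:
        self.alpha = alpha
        d = n_features
        # Per action: inverse of A (identity at start) and vector b (zeros).
        self.a_inv = [[[1.0 if i == j else 0.0 for j in range(d)]
                       for i in range(d)] for _ in range(n_actions)]
        self.b = [[0.0] * d for _ in range(n_actions)]

    def scores(self, x: Sequence[float]) -> List[float]:
        result = []
        for a_inv, b in zip(self.a_inv, self.b):
            theta = _matvec(a_inv, b)                 # ridge estimate A^-1 b
            mean = _dot(theta, x)                     # predicted reward
            width = self.alpha * math.sqrt(_dot(x, _matvec(a_inv, x)))
            result.append(mean + width)               # upper confidence bound
        return result

    def choose(self, x: Sequence[float]) -> int:
        s = self.scores(x)
        return max(range(len(s)), key=s.__getitem__)  # first maximum wins ties

    def update(self, action: int, x: Sequence[float], reward: float) -> None:
        a_inv = self.a_inv[action]
        u = _matvec(a_inv, x)                         # A^-1 x (A is symmetric)
        denom = 1.0 + _dot(x, u)
        for i in range(len(x)):
            for j in range(len(x)):
                # Sherman-Morrison update for A <- A + x x^T.
                a_inv[i][j] -= u[i] * u[j] / denom
            self.b[action][i] += reward * x[i]
```

A larger α widens the confidence term and favors exploring rarely tried actions, which is one plausible reason why small changes in α can change the learned behavior noticeably.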
In the next step we examined the transformer loads after 10 days of simulation
using the various charging point behaviors and their respective control
algorithms (Fig. 6). Note that, for a better overview, the plot only shows the
best-performing algorithm variant for each category. The random-choosing profile
has been left out as it did not perform very well. The mean transformer load could
be reduced by up to 12% in comparison to the uncontrolled case, and the maximum
peak load could be reduced from 201% to 70%. The Q-Learning algorithm, while
making some progress reward-wise (Fig. 5 bottom), did not perform very well in
this scenario. The training time was possibly not long enough for it to converge
towards a profitable solution. While the best-performing algorithm was a variant
of LinUCB using action variant B, the "Income" utility function (2) and
a power target variable, a variant of the WorkProportional strategy performed
almost as well. Compared with the bandit algorithms, the simple price-centered
WorkProportional strategy thus performed comparably well.
[Figure 5, top panel: "Change of the average received rewards between the first and the last day"; axis: "Change in %"; legend: LinUCB_Disjunct(A,Pr.), LinUCB_Disjunct(B,I.,Po.), LinUCB_Hybrid(B,I.,Po.) and QLearner(B,I.,Po.) variants; some values out of scale.]
Fig. 5. Top: Change of the average received reward between the first and last day.
Bottom: Average received rewards between the first and last day for selected algorithm
Fig. 6. Mean transformer loads over all nets after 10 days for selected charging point
7 Conclusion
With the proliferation of electric vehicles, the electrical distribution grids are
more prone to overloads. In this paper we provided a literature survey on coun-
termeasures to control charging of electric vehicles such that overloads are pre-
vented. After a brief introduction of the fundamentals, we modeled and studied
an intelligent pricing mechanism based on a reinforcement learning problem. As
context information is crucial in our setting, we tested in particular contextual
bandit learning algorithms to provide incentives for distributing charging load
and prevent network failures.
The presented algorithms were implemented in a framework that combines the
microscopic mobility simulator SUMO with the electrical network simulator SIMONA.
The simulation framework thus produces reliable electrical distribution load values.
Our extensive experiments were carefully conducted under realistic conditions
and reveal that contextual bandit learning outperforms context-free reinforcement
learning algorithms and that our approach is suitable for the given problem.
While we found that the bandit algorithms used were indeed able to reduce the
problematic effects on the grid considerably, we also noticed that some of the
tested variants did not perform very well in the simulation environment. From
this we conclude that, if the algorithm were to be deployed in production,
considerable work would need to be invested in the correct parametrization. Once
this has been accomplished, a learning algorithm should be rapidly deployable
in various target environments.
Due to the rising popularity of electric mobility, charging stations will
become a vital part of future mobility and transportation considerations. As
reinforcement learning algorithms can be adapted rapidly to include new
information, we assume these to be suitable as part of a holistic traffic control
scenario.
In future work, the decision model of the vehicles (passengers) could be
expanded. As the emphasis of this work lay on the charging stations, a simple
vehicle model was chosen. A more complex model that considers the planned
daily routine, retention times or socioeconomic factors could lead to a more
diverse task and thus bring out the advantages of self-adapting charging stations
even better. Future work should also consider differences between private and
public charging points: a charging device owned by the same person as the
vehicle would pursue other goals than a profit-oriented public charging station.
A private charging device could potentially be a useful actor in a holistic smart
home environment, which can also include a privately owned photovoltaic system.
Lastly, the agent system was designed with an emphasis on the independence of
single charging points. Another interesting approach would be to test a mechanism
design in which the agents act towards a common target and are rewarded/graded
as an ensemble and not individually. This would be realistically possible for
operators owning multiple charging stations, as they probably run a centralized
controlling platform.
Part of the work on this paper has been supported by Deutsche
Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876
"Providing Information by Resource-Constrained Analysis", project B4. Thomas Liebig
received funding from the European Union through the Horizon 2020 Programme
under grant agreement number 688380 "VaVeL: Variety, Veracity, VaLue:
Handling the Multiplicity of Urban Sensors".
This work contains results from the master's thesis of Christian Römer
titled "Ladesteuerung von Elektrofahrzeugen mit kontextsensitiven Banditen
unter Berücksichtigung des elektrischen Verteilnetzes" at the TU Dortmund
University.
1. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the
multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
2. Stefano Barsali et al. Benchmark systems for network integration of renewable
and distributed energy resources. Technical Report Cigr´e Task Force C6.04.02,
2014. URL: 273 8-benchmark-systems-for-
18 C. R¨omer et al.
3. Lara Codeca, Raphael Frank, S´ebastien Faye, and Thomas Engel. Luxembourg
sumo traffic (lust) scenario: Traffic demand evaluation. IEEE Intelligent Trans-
portation Systems Magazine, 9(2):52–63, 2017.
4. J Kays, A Seack, and U H¨ager. The potential of using generated time series in the
distribution grid planning process. In Proc. 23rd Int. Conf. Electricity Distribution,
Lyon, France, 2015.
5. Kraftfahrt-Bundesamt. Anzahl der elektroautos in deutschland von 2006 bis 2018.
march 2018.
6. Kraftfahrt-Bundesamt. Bestand an Personenkraftwagen nach Seg-
menten und Modellreihen am 1. Januar 2018 gegen¨uber 1. Januar 2017.
fz12 2018 xls.xls? blob=publicationFile&v=2, 2018. Online; accessed 27 Jun
7. Daniel Krajzewicz, Jakob Erdmann, Michael Behrisch, and Laura Bieker. Recent
development and applications of sumo-simulation of urban mobility. International
Journal On Advances in Systems and Measurements, 5(3&4), 2012.
8. Tam´as Kurczveil, Pablo ´
Alvarez L´opez, and Eckehard Schnieder. Implementation
of an energy model and a charging infrastructure in sumo. In Simulation of Urban
MObility User Conference, pages 33–43. Springer, 2013.
9. Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit
approach to personalized news article recommendation. In Proceedings of the 19th
international conference on World wide web, pages 661–670. ACM, 2010.
10. Thomas Liebig and Maurice Sotzny. On Avoiding Traffic Jams with Dynamic Self-
Organizing Trip Planning. In 13th International Conference on Spatial Information
Theory (COSIT 2017), volume 86 of Leibniz International Proceedings in Infor-
matics (LIPIcs), pages 17:1–17:12, Dagstuhl, Germany, 2017. Schloss Dagstuhl–
Leibniz-Zentrum fuer Informatik.
11. Daniel O’Neill, Marco Levorato, Andrea Goldsmith, and Urbashi Mitra. Residen-
tial demand response using reinforcement learning. In Smart Grid Communications
(SmartGridComm), 2010 First IEEE International Conference on, pages 409–414.
IEEE, 2010.
12. Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach.
Malaysia; Pearson Education Limited,, 2016.
13. Peter-Philipp Schierhorn and N Martensen. ¨
Uberblick zur Bedeutung der Elek-
tromobilit¨at zur Integration von EE-Strom auf Verteilnetzebene. Energynautics
GmbH, Darmstadt, 2015.
14. Statistisches Bundesamt (Destatis). Verkehr aktuell. p. 92. May 2018.
VerkehrAktuellPDF 2080110.pdf? blob=publicationFile, 2018. Online; accessed
27 Jun 2018.
15. Olle Sundstr¨om and Carl Binding. Optimization methods to plan the charging of
electric vehicle fleets. In Proceedings of the international conference on control,
communication and power engineering, pages 28–29. Citeseer, 2010.
16. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction.
17. Roman Uhlig, Nils Neusel-Lange, Markus Zdrallek, Wolfgang Friedrich, Peter
Kl¨oker, and Thomas Rzeznik. Integration of e-mobility into distribution grids
via innovative charging strategies. Wuppertal University, 2014.
18. Stylianos I Vagropoulos and Anastasios G Bakirtzis. Optimal bidding strategy for
electric vehicle aggregators in electricity markets. IEEE Transactions on power
systems, 28(4):4031–4041, 2013.
Title Suppressed Due to Excessive Length 19
19. John G Vlachogiannis and Nikos D Hatziargyriou. Reinforcement learning for
reactive power control. IEEE transactions on power systems, 19(3):1317–1325,
20. Rashid A Waraich, Matthias D Galus, Christoph Dobler, Michael Balmer, G¨oran
Andersson, and Kay W Axhausen. Plug-in hybrid electric vehicles and smart
grids: Investigations based on a microsimulation. Transportation Research Part C:
Emerging Technologies, 28:74–86, 2013.
21. Chenye Wu, Hamed Mohsenian-Rad, and Jianwei Huang. Vehicle-to-aggregator
interaction game. IEEE Transactions on Smart Grid, 3(1):434–442, 2012.