Constrained Thompson Sampling for Real-Time
Electricity Pricing with Grid Reliability Constraints
Nathaniel Tucker, Ahmadreza Moradipari, and Mahnoosh Alizadeh
(The authors contributed equally.)
Abstract—We consider the problem of an aggregator at-
tempting to learn customers’ load flexibility models while
implementing a load shaping program by means of broad-
casting daily dispatch signals. We adopt a multi-armed bandit
formulation to account for the stochastic and unknown nature
of customers’ responses to dispatch signals. We propose a
constrained Thompson sampling heuristic, Con-TS-RTP, that
accounts for various possible aggregator objectives (e.g., to
reduce demand at peak hours, integrate more intermittent
renewable generation, track a desired daily load profile, etc) and
takes into account the operational constraints of a distribution
system to avoid potential grid failures as a result of uncertainty
in the customers’ response. We provide a discussion on the
regret bounds for our algorithm as well as a discussion on the
operational reliability of the distribution system’s constraints
being upheld throughout the learning process.
Index Terms—Constrained optimization, distribution net-
work, multi-armed bandit, real-time pricing, demand response,
Thompson sampling.
I. INTRODUCTION
In order to integrate the increasing volume of intermittent
renewables in modern power grids, aggregators are exploring
various methods to manipulate both residential and commer-
cial loads in real-time. As a result, various demand response
(DR) frameworks are gaining popularity because of their
ability to shape electricity demand by broadcasting time-
varying signals to customers; however, most aggregators have
not implemented complex DR programs beyond peak shaving
and emergency load reduction initiatives. One reason for this
is the customers’ unknown and time-varying responses to
dispatch signals, which can lead to economic uncertainty for
the aggregator and reliability concerns for the grid opera-
tor relying on DR performance [1]. The aggregator could
explicitly request response information from its customers;
however, this process would have a large communication
overhead. More importantly, most customers cannot readily
characterize their response, and even if they could, they might
not be willing to share this private information. With this in
mind, it is evident that future load shaping initiatives for
renewable integration (i.e., more complex objectives than
peak shaving) need to passively learn customers’ response
only from historical data of past interactions [2].
This work was supported by NSF grants #1847096 and #1737565 and
UCOP Grant LFR-18-548175.
N. Tucker, A. Moradipari, and M. Alizadeh are with the Department
of Electrical and Computer Engineering, University of California, Santa
Barbara, CA 93106 USA (email: nathaniel tucker@ucsb.edu)
Recently, much work has been done for aggregators at-
tempting to learn customers’ price responses whilst imple-
menting peak shaving DR programs. The authors of [3]
present a data-driven strategy to estimate customers’ demands
and develop prices for DR. In [4], the authors use linear
regression models to derive estimations of customers’ re-
sponses to DR signals. Similarly, [5] develops a joint online
learning and pricing algorithm based on linear regression.
In [6], the authors present a contract-based DR strategy to
learn customer behavior while broadcasting DR signals. The
authors of [7] present an online learning approach based on
piecewise linear stochastic approximation for an aggregator
to sequentially adjust its DR prices based on the behavior
of the customers in the past. In [8], the authors develop a
risk-averse learning approach for aggregators operating DR
programs. In [9], a learning algorithm for customers’ utility
functions is developed and it is assumed that the aggregator
acts within a two-stage (day-ahead and real-time) electricity
market. Furthermore, a multi-armed bandit (MAB) formula-
tion is used in [10], [11] to determine which customers to
target with load reduction signals for DR programs.
In addition to learning how customers respond to DR
signals, an aggregator must also consider power system
constraints to ensure reliable operation (e.g., nodal voltage,
transformer capacities, and line flow limits). In real distribu-
tion systems, it is critical that these constraints are satisfied at
every time step to ensure customers receive adequate service
and to avoid potential grid failures even without sufficient
knowledge about how customers respond to price signals
(i.e., in early learning stages) [12], [13]. One paper that
considers these realistic constraints, [14], presents a least-
square estimator approach to learn customer sensitivities
and implements DR in a distribution network. However, the
proposed learning approach does not have a regret guarantee
compared to the clairvoyant solution that has full information
of customer sensitivities.
Similar to the aforementioned papers, the work presented
in this manuscript considers the problem of an aggregator
passively learning the customers’ price sensitivities while
running a load shaping program. However, our approach
permits more complex load shaping objectives (e.g., tracking
a daily target load profile) and varies in terms of both
load modeling and learning approach from all the above pa-
pers. Specifically, we present a modified multi-armed bandit
(MAB) heuristic akin to Thompson sampling (TS) to tackle
the trade-off between exploration of untested price signals
and exploitation of well-performing price signals while en-
suring grid reliability (A preliminary version of this work was
published in [15]; however, it did not account for distribution
system constraints). It is important to note that standard TS
cannot guarantee that grid reliability constraints are upheld
during the learning process. As such, we present a modified
version of TS while retaining the fundamental principles
TS is based on. Furthermore, we provide discussion on
how the constraints are upheld (i.e., operational reliability),
discussion on the performance of the heuristics compared to
a clairvoyant solution, and simulation results highlighting the
strengths of the method.
The remainder of the paper is organized as follows:
Section II presents the aggregator’s daily objective as well
as the customers’ load model. Section III describes the
multi-armed bandit formulation for the electricity pricing
problem, presents the modified TS heuristic, and discusses
its performance and reliability. Section IV presents simulation
results that showcase the efficacy of the approach.
II. PROBLEM FORMULATION
A. The Aggregator’s Objective
The aggregator’s main objective is to select dispatch sig-
nals to manipulate customer demand according to a given
optimization objective that varies daily. Specifically, we con-
sider the case where the aggregator broadcasts a dispatch
signal p_τ = [p(t)]_{t=1,...,T} to the population of customers
each day (we use t = 1, ..., T to index time of day and
τ = 1, ..., T to index days). The set of dispatch signals
available for use by the aggregator is denoted as P. In this
paper, without loss of generality, we will assume that the
dispatch signal sent to customers for load shaping purposes
is a real-time pricing (RTP) signal. (The reader should note
that this choice is not fundamental to the development of the
modified learning heuristics we present in this paper. It only
allows us to provide a concrete characterization of the response
to dispatch signals by mathematically modeling the customers
as cost-minimizing agents equipped with home energy
management systems in Section II-C.)
On each day τ, the aggregator's cost function is a fixed
and known nonlinear function f(D_τ(p_τ), V_τ) that depends
on the load profile D_τ(p_τ) of the population in response to
the daily price p_τ and a random exogenous parameter V_τ.
The exogenous and given vector V_τ varies daily and can
correspond to a daily target profile reflecting renewable gen-
eration forecasts, weather predictions, and grid conditions.
Moreover, the aggregator must ensure that the broadcasted
price signals do not result in load profiles that violate
distribution system reliability constraints (e.g., nodal voltage,
transformer capacities, or line flow limits). As such, if the
aggregator had full information about how the population
responds to price signals (i.e., full knowledge of Dτ(pτ)),
the aggregator can solve the following optimization problem
on day τ to select the optimal price p*_τ:

  p*_τ = argmin_{p_τ ∈ P}  f(D_τ(p_τ), V_τ)                       (1)
  s.t.  g_j(D_τ(p_τ)) ≤ 0,   j = 1, ..., J                        (2)

where g_j(·), j = 1, ..., J, represents the reliability constraints for
the distribution system as a function of load injections.
Specifically, these general functions represent distribution
system parameters (i.e., the nodal voltage u_τ(t) and power
flow through distribution lines f_τ(t)) that should obey the
following constraints:

  u_τ(t) ≥ u_min,   ∀t, τ,                                        (3)
  u_τ(t) ≤ u_max,   ∀t, τ,                                        (4)
  f_τ(t) ≤ S_max,   ∀t, τ,                                        (5)

where u_min, u_max, and S_max correspond to the lower
voltage limit, upper voltage limit, and power flow limit, re-
spectively, for the population's connection to the distribution
grid. The power flow model we use to derive u_τ(t) and f_τ(t)
from the load profile D_τ(p_τ) is given in Section IV-B.
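To make the selection in (1)-(2) concrete, the sketch below enumerates a
finite price set P and keeps the cheapest candidate whose induced load
profile satisfies the reliability constraints. The helpers load_response,
cost, and grid_ok are hypothetical placeholders for D_τ(·), f(·,·), and the
checks (3)-(5); they are assumptions for illustration only.

```python
# Illustrative brute-force solution of (1)-(2) over a finite price set P,
# assuming full knowledge of the population response. The helpers
# load_response, cost, and grid_ok are hypothetical placeholders.
import numpy as np

def select_optimal_price(price_set, V, load_response, cost, grid_ok):
    """Return the feasible price signal minimizing f(D(p), V).

    price_set     -- iterable of candidate daily price vectors (the set P)
    V             -- exogenous daily parameter (e.g., a target load profile)
    load_response -- callable p -> D(p), the population load profile
    cost          -- callable (D, V) -> scalar aggregator cost f(D, V)
    grid_ok       -- callable D -> True iff all g_j(D) <= 0 hold, i.e., (3)-(5)
    """
    best_p, best_val = None, np.inf
    for p in price_set:
        D = load_response(p)          # population response to this candidate
        if not grid_ok(D):            # discard prices that violate (2)
            continue
        val = cost(D, V)
        if val < best_val:
            best_p, best_val = p, val
    return best_p                     # None if no candidate is feasible
```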
However, the aggregator cannot simply solve (1). As
explained in the introduction, knowledge of customers’ price
response is unavailable to the aggregator. Recall, 1) the
aggregator does not want to directly query customers for
their response function, 2) most customers cannot readily
characterize their response, and 3) customers might not be
willing to share this private information. Accordingly, the
aggregator needs a method to sequentially choose daily prices
to simultaneously 1) control the daily incurred cost; 2) learn
the customers’ price response models; and 3) ensure the
distribution system constraints are not violated at any time.
B. Load Flexibility Model
It is hard to approach the problem of learning the response
of a population of customers to complex dispatch signals
such as RTP as a complete “black box problem”, i.e., by
just observing the broadcasted price and the load response.
There are many reasons for this, including 1) the existence
of random or exogenous parameters which lead to variability
in the temporal and geographical behavior of electricity
demand; 2) the variability of the control objective on a
daily basis (e.g., due to randomness in renewable generation
outputs, market conditions, or baseload); and 3) the small
size of the set of observations that one can gather compared
to the high dimensional structure of the load (there are
only 365 days in a year, so only 365 sets of prices can be
posted). Hence, in this paper, we will be exploiting the known
physical structure of the problem and making use of our
statistical prior knowledge of how the load behaves to lower
the problem dimensionality.
Specifically, to lower the dimensionality of the learning
problem, we exploit the fact that flexible loads exhibit only a
limited number of “load signatures” (potentially due to the
automated nature of load response through home energy man-
agement systems and the limited types of flexible appliances).
Let us assume that electric appliances can belong to a finite
number of clusters c ∈ C. For each cluster c, we denote the set
of feasible daily power consumption schedules that satisfy the
energy requirements of the corresponding appliances by D_c.
Any power consumption schedule [d_c(t)]_{t=1,...,T} = D_c ∈ D_c
would satisfy the daily power needs of an appliance
in cluster c. To give an example, consider a cluster that
represents plug-in electric vehicles (EVs) that require E_c
kWh in the time interval [t_1, t_2] with a maximum charging
rate of ρ kW. Accordingly, the set D_c of daily feasible power
consumption schedules is given by:

  D_c = { D_c | Σ_{t=t_1}^{t_2} d_c(t) = E_c;  0 ≤ d_c(t) ≤ ρ }.        (6)
For discussion on characterizing the sets Dcfor other flexi-
ble appliances, including interruptible, non-interruptible, and
thermostatically controlled loads, we refer the reader to [16].
C. Price Response Model
In this section, we discuss how the total population’s load
responds to prices given the fact that flexible appliances
belong to a finite number of clusters c ∈ C. The price signal
affects the power consumption in two ways:
1) Automated per cluster response: Within each load
cluster c (i.e., given prespecified preferences such as EV
charging deadlines or AC temperature set points), we assume
that the customer chooses the power consumption profile
D_c ∈ D_c that minimizes their electricity cost dependent on
the daily broadcasted price p_τ. For appliances in cluster c
on day τ, we assume all will choose the same minimum-cost
power consumption profile:

  D̃_{c,τ}(p_τ) = argmin_{D_c ∈ D_c}  Σ_{t=1}^{T} p(t) d_c(t).        (7)
Due to the automated nature of home energy management
systems, each cluster selecting its cost minimizing profile
is a reasonable assumption once the customers have defined
their flexibility preferences, e.g., the desired charge amounts
and deadlines for EVs [17], [18].
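For the EV cluster in (6), the minimization in (7) admits a simple greedy
solution: fill the cheapest slots of the charging window first. The sketch
below is a minimal illustration of that rule under the assumption (made only
for this example) that the slot length is one unit, so per-slot power and
energy coincide.

```python
# Minimal sketch of the per-cluster response (7) for the EV cluster (6):
# fill the cheapest slots in the charging window first. This greedy rule is
# optimal because the feasible set (6) is a box with a single equality.
import numpy as np

def ev_cluster_response(prices, t1, t2, E_c, rho):
    """Cost-minimizing charging profile d(t) for an EV cluster.

    prices -- array of length T with the broadcast price p(t)
    t1, t2 -- charging window (inclusive slot indices)
    E_c    -- required energy over the window (same units as rho per slot)
    rho    -- maximum per-slot charging rate
    """
    d = np.zeros(len(prices))
    remaining = E_c
    for t in sorted(range(t1, t2 + 1), key=lambda t: prices[t]):  # cheapest first
        d[t] = min(rho, remaining)
        remaining -= d[t]
        if remaining <= 0:
            break
    return d  # assumes E_c <= rho * (t2 - t1 + 1), i.e., (6) is feasible

# Example: 6 slots, window covering slots 1..4, 10 units needed at rate 4.
# profile = ev_cluster_response(np.array([0.3, 0.1, 0.2, 0.1, 0.4, 0.3]), 1, 4, 10.0, 4.0)
```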
2) Preference Adjustment: We also consider the fact that
customers may respond to price signals by adjusting their
preferences. For example, some customers are willing to pay
more to keep their AC temperature set points lower, or charge
their EV less. This means that the number of appliances in
each cluster, denoted by ac, also depends on the daily posted
price vector pτ.
Combining the automated per cluster response and prefer-
ence adjustment, we can define the population’s load on day
τ in response to the posted price p_τ as follows:

  D*_τ(p_τ) = Σ_{c∈C} a_c(p_τ) D̃_{c,τ}(p_τ).        (8)
As stated before, if the aggregator has full knowledge of
the customers' price responses, which reduces to having full
knowledge of the preference adjustments a_c(p_τ), then the
aggregator can pick the daily price vector p*_τ in order to
shape the population's power consumption according to (1).
However, the functions a_c(p_τ) are unknown to the aggregator
and also exhibit inherent stochasticities due to variations of
daily customer needs. As such, we will model the a_c(p_τ)'s as
random variables with parameterized distributions, φ_c, based
on the posted price signal p_τ and an unknown but constant
parameter vector θ*. Here, θ* represents the true model for
the customers' sensitivity to the price signals. This allows
for the complex response of the customer population to be
represented through a single unknown vector, thus reducing
the dimensionality of the learning problem. With this in mind,
we would like to highlight three important properties of the
price response model we adopt:
1) The preference adjustment models a_c(p_τ) are stochas-
tic and their distributions φ_c are parameterized by p_τ
and θ*. This is due to exogenous factors outside of
the aggregator's scope that influence customers' power
consumption profiles, resulting in a level of stochasticity
in the responses to prices (i.e., customers will not
respond to prices in the same fashion each day).
2) The probability distributions of a_c(p_τ) (i.e., φ_c) are
unknown to the aggregator, i.e., the aggregator does not
know the true parameter θ* of the stochastic model.
3) The realizations of a_c(p_τ) are not directly observable
by the aggregator. The aggregator can only monitor
the population's total consumption profile D_τ and can-
not observe the decomposed response of each cluster
a_c(p_τ) D̃_{c,τ}(p_τ) independently.
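The following sketch simulates one realization of the aggregate response (8)
consistent with properties 1)-3) above. The Gaussian form of φ_c used here,
with mean β_c θ*·p_τ, is an illustrative assumption (similar in spirit to the
test case of Section IV-D, not a specification from this section); the
aggregator would observe only the returned aggregate profile, not the
individual a_c draws.

```python
# Illustrative simulator of the stochastic population response (8); the
# Gaussian form of phi_c below is an assumption for this sketch only.
import numpy as np

def population_response(prices, theta_star, clusters, rng, sigma=1.0):
    """One random draw of D(p) = sum_c a_c(p) * Dtilde_c(p).

    prices     -- daily price vector p_tau (length T)
    theta_star -- true sensitivity parameter (unknown to the aggregator)
    clusters   -- list of (beta_c, response_c) pairs, where response_c(prices)
                  returns that cluster's cost-minimizing profile, e.g.,
                  ev_cluster_response above
    """
    D = np.zeros(len(prices))
    for beta_c, response_c in clusters:
        mean_c = beta_c * float(np.dot(theta_star, prices))  # assumed mean model
        a_c = max(rng.normal(mean_c, sigma), 0.0)            # appliance-count draw
        D += a_c * response_c(prices)                        # cluster contribution
    return D  # only this aggregate profile is observable to the aggregator
```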
Because we have introduced stochasticity to customers’
price response models, we appropriately alter the aggregator’s
optimization problem for selecting the price signal on day τ
to account for the distributions φc:
  p*_τ = argmin_{p_τ ∈ P}  E_{{φ_c}_{c∈C}} [ f(D_τ(p_τ), V_τ) ]        (9)
  s.t.  P_{{φ_c}_{c∈C}} [ g_j(D_τ(p_τ)) ≤ 0 ] ≥ 1 − µ,   ∀j           (10)

where the distribution system constraints have to be satisfied
with a given probability 1 − µ. In (9), the aggregator now
considers minimizing an expected cost and is subject to
probabilistic reliability constraints in (10) that depend on the
distributions φ_c of the preference adjustment models a_c(p_τ).
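With a stochastic response, the feasibility check in (10) becomes a chance
constraint. One simple way to approximate it (an assumption of this sketch,
not a procedure from the paper) is Monte Carlo: draw many response
realizations under a candidate model and count how often every g_j is
satisfied.

```python
# Monte Carlo check of the chance constraint (10) for one candidate price;
# sample_response and violates are hypothetical helpers for this sketch.
import numpy as np

def chance_constraint_ok(p, sample_response, violates, mu,
                         n_samples=2000, seed=0):
    """Estimate P[g_j(D(p)) <= 0 for all j] and compare it with 1 - mu.

    sample_response -- callable (p, rng) -> one random aggregate load D(p)
    violates        -- callable D -> True if any g_j(D) > 0
    """
    rng = np.random.default_rng(seed)
    bad = sum(violates(sample_response(p, rng)) for _ in range(n_samples))
    return 1.0 - bad / n_samples >= 1.0 - mu
```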
Clearly, the aggregator needs to learn the underlying
parameters of the stochastic models φcof how customers
respond to price signals in order to select price signals for
load shaping initiatives (i.e., the aggregator needs to learn
θ?). Our proposed learning approach and pricing strategy
for an electricity aggregator is detailed in the next section.
III. REAL-TIME PRICING VIA MULTI-ARMED BANDIT
A. Multi-Armed Bandit Overview
We utilize the multi-armed bandit (MAB) framework to
model the iterative decision making procedure of an aggre-
gator implementing a daily load shaping program [19]–[21].
Moreover, the MAB framework exemplifies the exploration-
exploitation trade-off dilemma faced by an aggregator each
day in the electricity pricing problem. Namely, should the
aggregator choose to broadcast untested prices (i.e., explore)
to learn more information about the customers? Or should the
aggregator choose to broadcast well-performing prices (i.e.,
exploit) to manipulate the daily electricity demand?
To evaluate the performance of an algorithm that aims to
tackle the exploration-exploitation trade-off, one commonly
examines the algorithm's regret. Formally, regret is the cu-
mulative difference in cost incurred over T days between
a clairvoyant algorithm (i.e., the optimal strategy that is
aware of the customers' price responses) and any proposed
algorithm that does not know the customers' price responses:

  R_T = Σ_{τ=1}^{T} { E_{{φ_c}_{c∈C}} [f(D_τ(p_τ), V_τ)] − E_{{φ_c}_{c∈C}} [f(D_τ(p*_τ), V_τ)] }.        (11)
Instead of the above more standard definition, an alterna-
tive metric for regret that is easier to bound for more complex
bandits is to count the number of times that suboptimal price
signals are selected over the T days. For this, we introduce
the following notation: let p^{V_τ,*} denote the optimal price
signal for the true model of the population's price response θ*
when the daily exogenous parameter V_τ is observed on day
τ. Any price signal p_τ ≠ p^{V_τ,*} is considered a suboptimal
price. Moreover, we denote N_τ(p, V) as the number of times
up to day τ that the algorithm simultaneously observes the
exogenous parameter V and selects the price signal p. As
such, the total number of times that suboptimal price signals
are selected over T days is:

  Σ_{V∈V} Σ_{p ∈ P\{p^{V,*}}}  N_T(p, V) = Σ_{τ=1}^{T} 𝟙{p_τ ≠ p^{V_τ,*}},        (12)

where 𝟙{·} is the indicator function that is set equal to one
if the criterion is met and zero otherwise. Subsequently, in an
iterative decision making problem such as this, the question
arises: how can an aggregator learn to price electricity with
bounded regret, and what are the regret bounds we can
provide for a proposed algorithm given dynamically changing
grid conditions and reliability constraints? In the following
sections, we present a modified Thompson sampling heuristic
for the electricity pricing problem to simultaneously learn
the true model θ?for the population, select the daily price
signals, ensure grid reliability, and provide a regret guarantee.
B. Thompson Sampling
Thompson sampling is a well-known MAB heuristic for
choosing actions in an iterative decision making problem with
the exploration-exploitation dilemma [22]–[24]. In summary,
the integral characteristic of Thompson sampling is that the
algorithm's knowledge on day τ of the unknown parameter
θ* is represented by the prior distribution π_{τ−1}. Each day,
the algorithm samples θ̃_τ from the prior distribution and
selects a price assuming that the sampled parameter is the
true parameter. The algorithm then makes an observation
dependent on the chosen price and the hidden parameter and
performs a Bayesian update on the parameter's distribution
π_τ based on the new observation. Because TS samples
parameters from the prior distribution, the algorithm has a
chance to explore (i.e., draw new parameters) and can exploit
(i.e., draw parameters that are likely to be the true parameter)
throughout the run of the algorithm.
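For readers unfamiliar with the mechanism, the textbook two-armed Bernoulli
bandit below makes the sample/act/update cycle concrete. It is not the
pricing problem itself; in this paper the Beta-Bernoulli model is replaced
by a prior over the parameter set Θ.

```python
# Textbook Thompson sampling on a 2-armed Bernoulli bandit, only to make the
# sample / act / update cycle concrete; all values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.4, 0.6]                 # hidden success probabilities
alpha = np.ones(2); beta = np.ones(2)   # Beta(1,1) priors on each arm

for day in range(1000):
    samples = rng.beta(alpha, beta)     # sample a model from the prior
    arm = int(np.argmax(samples))       # act optimally for the sampled model
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward                # conjugate Bayesian update
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))           # posterior means concentrate near [0.4, 0.6]
```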
C. Constrained Thompson Sampling
In this section, we present the MAB heuristic titled Con-
TS-RTP, adapted to the electricity pricing problem. Con-TS-
RTP is a modified Thompson sampling algorithm where the
daily optimization problem is subject to constraints (standard
TS algorithms do not have constraints in the daily optimiza-
tion) [25]. Each day, the algorithm observes the daily target
profile V_τ, draws a parameter θ̃_τ from the prior distribution,
broadcasts a price signal to the customers, observes the load
profile of the population in response to the broadcasted price,
and then performs a Bayesian update on the parameter's
distribution π_τ based on the new observation.
The observation on day τ is denoted as Y_τ = D*_τ(p_τ) and
we assume that each Y_τ comes from the observation space
Y that is known a priori. When performing the Bayesian
update, the algorithm makes use of the following likelihood
function: ℓ(Y_τ; p, θ) = P_θ(D*_τ(p_τ) = Y_τ | p_τ = p). This
function calculates the likelihood of observing a specific load
profile when broadcasting price p and the true parameter is θ.
The pseudocode for Con-TS-RTP applied to the constrained
electricity pricing problem is presented in Algorithm 1. The
reader will notice that the optimization in line 4 is presented
under two different sets of constraints. Constraint Set A:
the operational constraints of the distribution system are
formulated with respect to the drawn parameter θ̃_τ (i.e., the
constraints are not necessarily enforced for θ*). Constraint
Set B: the operational constraints of the distribution system
are formulated with respect to the current prior distribution
π_{τ−1} on the unknown parameter θ* (as chance constraints).
We will see that under certain assumptions, this means that
the constraints are upheld for the true unknown parameter θ*
at each round (with high probability).
D. Discussion on Regret Performance of Con-TS-RTP
The regret analysis of Con-TS-RTP is inspired by the
results in [26] for TS with nonlinear cost functions. The
analysis in the aforementioned paper provides bounds on the
total number of times that suboptimal price signals are selected
by the algorithm over T days, as specified in equation (12).
The regret guarantee we provide in this work extends the re-
sult further, allowing for constraints in the daily optimization
that are dependent on the sampled θ̃_τ and on the exogenous
target profiles V_τ. As such, our regret guarantee applies to
the Con-TS-RTP algorithm with constraints as formulated in
Constraint Set A in Algorithm 1.
Algorithm 1 CON-TS-RTP
Input: Parameter set Θ; Price set P; Observation set Y;
Voltage constraints u_min, u_max; Power flow constraint
S_max; Reliability metrics µ, ν.
Initialize π_0.
1: for day index τ = 1, ..., T do
2:   Sample θ̃_τ from distribution π_{τ−1}.
3:   Observe the daily exogenous parameter V_τ.
4:   Broadcast the daily price signal:
       p̂_τ = argmin_{p_τ ∈ P}  E_{{φ_c}_{c∈C}} [ f(D_τ(p_τ), V_τ) | θ = θ̃_τ ]
     Subject to:
     Constraint Set A:
       A.1: P_{{φ_c}_{c∈C}} [ u_τ(t) ≥ u_min | θ = θ̃_τ ] ≥ 1 − µ,  ∀t
       A.2: P_{{φ_c}_{c∈C}} [ u_τ(t) ≤ u_max | θ = θ̃_τ ] ≥ 1 − µ,  ∀t
       A.3: P_{{φ_c}_{c∈C}} [ f_τ(t) ≤ S_max | θ = θ̃_τ ] ≥ 1 − µ,  ∀t
     Constraint Set B:
       B.1: P_{{φ_c}_{c∈C}} [ u_τ(t) ≥ u_min | θ ∼ π_{τ−1} ] ≥ 1 − ν,  ∀t
       B.2: P_{{φ_c}_{c∈C}} [ u_τ(t) ≤ u_max | θ ∼ π_{τ−1} ] ≥ 1 − ν,  ∀t
       B.3: P_{{φ_c}_{c∈C}} [ f_τ(t) ≤ S_max | θ ∼ π_{τ−1} ] ≥ 1 − ν,  ∀t
5:   Observe Y_τ = D*_τ(p̂_τ).
6:   Posterior update:
       ∀S ⊆ Θ:  π_τ(S) = ∫_S ℓ(Y_τ; p̂_τ, θ) π_{τ−1}(dθ) / ∫_Θ ℓ(Y_τ; p̂_τ, θ) π_{τ−1}(dθ)
7: end for
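A minimal sketch of the Con-TS-RTP loop (Algorithm 1 with Constraint Set A)
for a finite particle set Θ, per Assumption 2, is given below. The helpers
expected_cost, constraints_ok, likelihood, observe_load, and draw_V are
placeholders the reader would supply for a concrete model; constraints_ok is
assumed to encode A.1-A.3 (including the 1 − µ level) under the sampled
parameter.

```python
# Sketch of Con-TS-RTP (Algorithm 1, Constraint Set A) with a discrete prior
# over finitely many particles Theta; all helper callables are assumptions.
import numpy as np

def con_ts_rtp(price_set, Theta, prior, num_days, expected_cost,
               constraints_ok, likelihood, observe_load, draw_V, rng):
    pi = np.asarray(prior, dtype=float)
    pi = pi / pi.sum()                                  # pi_0 over the particles
    for tau in range(num_days):
        theta_t = Theta[rng.choice(len(Theta), p=pi)]   # step 2: sample theta_tilde
        V = draw_V()                                    # step 3: observe V_tau
        # Step 4: cheapest price satisfying Constraint Set A under theta_tilde
        # (assumes at least one candidate in price_set is feasible).
        feasible = [p for p in price_set if constraints_ok(p, theta_t)]
        p_hat = min(feasible, key=lambda p: expected_cost(p, V, theta_t))
        Y = observe_load(p_hat)                         # step 5: aggregate load
        # Step 6: pi_tau(theta) propto l(Y; p_hat, theta) * pi_{tau-1}(theta)
        weights = np.array([likelihood(Y, p_hat, th) for th in Theta])
        pi = pi * weights
        pi = pi / pi.sum()
    return pi
```

Swapping the feasibility check so that it is evaluated against the current
prior π_{τ−1} rather than the sampled particle would correspond to Constraint
Set B.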
Assumption 1. (Finitely many price signals, observations,
exogenous vectors). |P|, |Y|, |V| < ∞.
Assumption 2. (Finite prior, “grain of truth”). The prior
distribution π is supported over finitely many particles:
|Θ| < ∞. The true parameter exists within the parameter
space: θ* ∈ Θ. The initial distribution π_0 has non-zero mass
on the true parameter θ* (i.e., P_{π_0}[θ*] > 0).
Assumption 3. The exogenous vectors V are drawn i.i.d.
from a distribution defined on a finite sample space V, with
each outcome drawn with nonzero probability.
Assumption 4. (Unique optimal price signal). There is
a unique optimal price signal p^{V,*} for each exogenous
parameter V ∈ V.
We note that the regret result in Theorem 1 only applies to the Con-TS-
RTP algorithm with the daily optimization (line 4) subject to Constraint
Set A. This is due to Constraint Set B (which is dependent on π_{τ−1})
potentially prohibiting the aggregator from selecting the optimal price signals
throughout the learning process due to constraint violations from parameters
θ ≠ θ*. However, Constraint Set B provides reliability guarantees that
Constraint Set A cannot provide (see Section III-E). A thorough analysis of
the effect of stage-wise reliability constraints such as Constraint Set B on
the growth of regret can be found in [27], but only for the case of stochastic
MABs with linear costs and linear constraints.
Theorem 1. Under Assumptions 1-4 and Constraint Set A
in Algorithm 1, for δ, ε ∈ (0, 1), there exists T* ≥ 0 such
that for all T ≥ T*, with probability 1 − δ:

  Σ_{V∈V} Σ_{p ∈ P\{p^{V,*}}}  N_T(p, V) ≤ B + C(log T),        (13)

where B ≡ B(δ, ε, P, Y, Θ) is a problem-dependent constant
that does not depend on T, and C(log T) depends on T,
the sequence of selected price signals, and the Kullback-
Leibler divergence properties of the bandit problem (i.e.,
the marginal Kullback-Leibler divergences of the observa-
tion distributions KL(ℓ(Y; p, θ*), ℓ(Y; p, θ))). (The complete
description of the C(log T) term is left to the appendix.)
Proof. The proof is in the appendix.
In the next section, we discuss the distribution system
reliability issues that could arise from Constraint Set A and
a modification to the Con-TS-RTP algorithm to ensure the
constraints are enforced on all days (i.e., Constraint Set B).
E. Con-TS-RTP with Improved Reliability Constraints
In order for the aggregator to ensure safe operation of the
distribution grid while running the Con-TS-RTP algorithm,
the reliability constraints need to hold for the true price
response model θ* each day. However, with the constraints
formulated as in Algorithm 1's Constraint Set A, the distribu-
tion system constraints are only enforced for the sampled θ̃_τ
and not necessarily for the true parameter θ*. This entails that
the distributions {φ_c}_{c∈C} are parameterized by the sampled
θ̃_τ; therefore, they are inaccurate whenever a parameter θ̃_τ ≠ θ*
is sampled. This could potentially lead to many constraint
violations throughout the run of the algorithm, resulting in
inadequate service for the customers and grid failures.
Due to the importance of reliable operation of the distri-
bution system, we present a modification to the Con-TS-RTP
algorithm (i.e., replacing Constraint Set A with Constraint Set
B in Algorithm 1) to increase the reliability of the selected
prices and resulting load profiles with respect to the grid
constraints. Specifically, we propose alternate constraints that
depend on the algorithm's current knowledge of the true
parameter, instead of the sampled parameter. In other words,
instead of depending on θ̃_τ, the proposed alternate constraints
depend on the prior distribution π_{τ−1} as follows:

  P_{{φ_c}_{c∈C}} [ u_τ(t) ≥ u_min | θ ∼ π_{τ−1} ] ≥ 1 − ν,   ∀t        (14)
  P_{{φ_c}_{c∈C}} [ u_τ(t) ≤ u_max | θ ∼ π_{τ−1} ] ≥ 1 − ν,   ∀t        (15)
  P_{{φ_c}_{c∈C}} [ f_τ(t) ≤ S_max | θ ∼ π_{τ−1} ] ≥ 1 − ν,   ∀t        (16)

where ν is a small constant (detailed in Proposition 1).
When considering constraints (14)-(16) in Con-TS-RTP, the
algorithm will select more conservative price signals each
day that can guarantee the distribution system's constraints
are met with high probability by using the information in the
updated prior distributions. Before analyzing the modified
algorithm's reliability, we make the following assumption:
Fig. 1. Radial distribution system.
Assumption 5. There exist ξ* > 0, λ ≥ 0, and δ ∈ (0, 1)
such that for all θ ≠ θ*, KL(ℓ(Y; p, θ*), ℓ(Y; p, θ)) ≥ ξ*,
where
  ξ*_{θ,p} = max_{x ∈ Z_{>0}} { λ/x + (4/x) √( (log(|Y||P|/δ) + log x) / 2 ) × Σ_{Y∈Y} log( ℓ(Y; p, θ*) / ℓ(Y; p, θ) ) }

and

  ξ* = max_{θ ∈ Θ, p ∈ P} ξ*_{θ,p}.
Assumption 5 ensures that if the aggregator observes Y_τ on
day τ, the algorithm's Bayesian updates of the prior distri-
bution π_τ will never decrease the mass of the true parameter
θ* below a certain threshold. Specifically, with Assumption
5, it can be shown (as in [26]) that with probability 1 − δ/2
the following holds for all τ ≥ 1:

  π_τ(θ*) ≥ π_0(θ*) e^{−λ|P|},        (17)

where λ ≥ 0 is a chosen parameter (from Assumption 5) that
dictates the minimum reachable mass of the true parameter
via Bayesian updates. With the modified constraints (14)-
(16) and the minimum mass of the true parameter (17), the
reliability of Con-TS-RTP can be characterized as follows:
Proposition 1. Under Assumptions 1-5, with ν in equations
(14)-(16) chosen such that ν = µ π_0(θ*) e^{−λ|P|}, with proba-
bility 1 − δ/2 the Con-TS-RTP algorithm with Constraint Set
B will uphold the probabilistic distribution system constraints
as formulated in (10) for each day τ = 1, ..., T.
Proof. The proof is in the appendix.
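As a quick numerical illustration of Proposition 1 (with assumed values, not
taken from the paper), the margin ν shrinks with the prior mass placed on θ*
and with the size of the price set:

```python
# Assumed example values for mu, pi_0(theta*), lambda, and |P|.
import math

mu, pi0_star, lam, num_prices = 0.05, 0.1, 0.01, 64
nu = mu * pi0_star * math.exp(-lam * num_prices)
print(nu)  # ~0.0026: each constraint in Set B must hold with prob. >= 1 - nu
```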
IV. EXPERIMENTAL EVALUATION
A. Test Setup: Radial Distribution System
In this section we describe the power distribution system
and the corresponding network parameters for the test case.
We consider an actual radial distribution system from the
ComEd service territory in Illinois, USA (adopted from [28]
and shown in Fig. 1), represented by the undirected graph
G, which includes a set of nodes (vertices) N and a set of
power lines (edges) L. In this work, we consider each node as
one population with its own daily load profile; however, each
node could be an aggregation of smaller entities downstream
TABLE I
DISTRIBUTION SYSTEM PARAMETERS

Line  R (10⁻³)  X (10⁻³)  Smax (kVA)    Line  R (10⁻³)  X (10⁻³)  Smax (kVA)
 1      24.2      48.2      54           20    129.5      30.9      10.8
 2     227.3     743.5      84           21     15.1       5.4      14.4
 3      76.3      18.2      10.8         22     50.8      12.1      10.8
 4      43.6     142.7      84           23     69.1      16.5      10.8
 5      25.8      84.4      84           24     31.6      11.2      14.4
 6      10.5      10.7      40.2         25     96.3      23        10.8
 7      23.2      23.6      40.2         26    110.7     112.6      40.2
 8      75.1      26.7      14.4         27      2.1       0.7      14.4
 9     114.4      27.3      10.8         28    242.1      86.2      14.4
10     110.8.3    67.7      14.4         29     27.3      27.8      40.2
11      63.7      22.7      14.4         30    174.6      62.1      16.2
12     278.7      99.2      14.4         31     43        15.3      10.8
13     254.2      10.8.5    14.4         32    207.8      74        10.8
14      21.8       5.2      10.8         33    109.4      38.9      14.4
15      57.3      20.4      14.4         34     50.5      18        14.4
16     126.7      45.1      14.4         35    165.2      58.8      14.4
17      48.6      11.6      10.8         36     49.5      17.6      14.4
18      95.1      22.7      10.8         37      5.8       2.1      14.4
19     137.3      32.8      10.8
of the local distribution connection point. The undirected
graph is organized as a tree, with the root node representing
the distribution system’s substation where it is connected
to the regional transmission system. We denote N as the
total number of nodes in the network excluding the root
node. The nodes are indexed as i = 0, ..., N, and the node
corresponding to i = 0 (i.e., the root node) is the substation.
The power lines are indexed by i = 1, ..., N, where the i-th
line is directly upstream of node i (i.e., line i feeds directly
into node i). In the following, we denote the parent vertex of
node i as A_i and the set of children vertices of node i as K_i.
Furthermore, we assume the aggregator has access to
measurement data at each node's local connection point.
Specifically, the aggregator measures the active and reactive
power demands at each node i at time t on day τ, denoted
as d^P_{i,τ}(t) and d^Q_{i,τ}(t), respectively. In order to ensure the
delivered power is suitable for the electricity customers, the
aggregator also monitors node i's local voltage at time t
on day τ, denoted as v_{i,τ}(t). In the following, we denote
the active power daily load profile of node i on day τ
as D^P_{i,τ} = [d^P_{i,τ}(t)]_{t=1,...,T}. Additionally, the aggregator
records the active and reactive power flows, f^P_{i,τ}(t) and
f^Q_{i,τ}(t), respectively, on each line i ∈ L. Each line in the
distribution system has its own internal resistance denoted
as R_i, reactance denoted as X_i, and apparent power limit
denoted as S^max_i. The parameters for the distribution system
are listed in Table I.
B. Power Flow Model
In order to solve for the power flow and nodal voltages of
the power distribution system, we make use of the LinDis-
tFlow model [29], which is a linear approximation for the
AC power flow model. The LinDistFlow model has been
Fig. 2. Evolution of the prior distribution π_τ for node 10. From left to right: Day 1 (initialized to a uniform distribution, i.e., no knowledge of the true parameter),
Day 15 prior, Day 30 prior, Day 90 prior, and Day 180 prior.
extensively studied and verified to be competitive to the
nonlinear AC flow model on many realistic feeder topologies
including radial [30]–[33]. The LinDistFlow model reduces
computational complexity by making use of the following
linear power flow and voltage equations:
  d^P_{i,τ}(t) − g^P_{i,τ}(t) + Σ_{j∈K_i} f^P_{j,τ}(t) = f^P_{i,τ}(t),   ∀t, τ, i,        (18)
  d^Q_{i,τ}(t) − g^Q_{i,τ}(t) + Σ_{j∈K_i} f^Q_{j,τ}(t) = f^Q_{i,τ}(t),   ∀t, τ, i,        (19)
  u_{A_i,τ}(t) − 2( f^P_{i,τ}(t) R_i + f^Q_{i,τ}(t) X_i ) = u_{i,τ}(t),   ∀t, τ, i.        (20)

In (20) we make use of the substitution u_{i,τ}(t) = v_{i,τ}(t)²
to provide a linear voltage drop relationship across the
distribution system. For the scope of this work, we assume
that the substation connection to the regional transmission
system (node i = 0) is regulated and has a fixed voltage
v_0(t) = 120 V, ∀t, τ.
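The sketch below evaluates the LinDistFlow relations (18)-(20) on a radial
feeder by accumulating line flows from the leaves toward the root and then
propagating squared voltages from the root outward. The node ordering
(parent[i] < i, substation as node 0) is an assumption of the sketch, not a
requirement of the model.

```python
# Sketch of the LinDistFlow equations (18)-(20) on a radial feeder; the
# node ordering (parent[i] < i, substation = node 0) is assumed here.
import numpy as np

def lindistflow(parent, dP, dQ, gP, gQ, R, X, u0):
    """Return (fP, fQ, u): line flows and squared voltage magnitudes.

    parent -- parent[i] is the parent of node i (entry 0 unused), parent[i] < i
    dP, dQ -- nodal active/reactive demands;  gP, gQ -- nodal generations
    R, X   -- resistance/reactance of the line feeding node i (entry 0 unused)
    u0     -- squared voltage magnitude at the substation (node 0)
    """
    n = len(parent)
    fP, fQ, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # (18)-(19): line i carries node i's net load plus all downstream flows.
    for i in range(n - 1, 0, -1):
        fP[i] += dP[i] - gP[i]
        fQ[i] += dQ[i] - gQ[i]
        fP[parent[i]] += fP[i]
        fQ[parent[i]] += fQ[i]
    # fP[0], fQ[0] end up holding the total net power drawn from the substation.
    # (20): u_i = u_parent - 2 (R_i fP_i + X_i fQ_i), propagated root-to-leaf.
    u[0] = u0
    for i in range(1, n):
        u[i] = u[parent[i]] - 2.0 * (R[i] * fP[i] + X[i] * fQ[i])
    return fP, fQ, u
```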
C. Distribution System Operational Constraints
The nodal voltages and line flows calculated in (18)-(20)
should obey the following constraints for reliable operation:

  u_{i,τ}(t) ≥ u^min_i,   ∀t, τ, i ∈ N,        (21)
  u_{i,τ}(t) ≤ u^max_i,   ∀t, τ, i ∈ N,        (22)
  f^P_{i,τ}(t)² + f^Q_{i,τ}(t)² ≤ (S^max_i)²,   ∀t, τ, i ∈ L,        (23)

where (21)-(22) are the nodal voltage constraints and (23) is
the apparent power constraint for each distribution line.
D. Load Model and Multi-armed Bandit Formulation
In this test case, the goal of the aggregator is to integrate
varying levels of intermittent solar generation into the dis-
tribution system (i.e., the aggregator wants the customers
to take advantage of the available renewable energy and
consume all the solar generation each day). To model this,
we consider 10 unique target load profile vectors, with the
daily target profile V_{i,τ} for node i for day τ drawn from a
uniform distribution each morning. Each of the 10 target load
profile vectors corresponds to the forecasted solar generation
in each time slot at node i. In this setup, we consider 6 time
slots each day, each 4 hours long, and the aggregator transmits
daily price signals p_{i,τ} to each node within the system. The
aggregator has a high and a low price for each of the 6 time
slots, resulting in 2⁶ possible daily price signals. Each node
has a cost function that is dependent on the node's demand
as well as the target profile. In this test case, we assume the
cost function is the squared deviation of the node's demand
from the target profile, thus equally penalizing over-usage
and under-usage.
We consider 20 unique load flexibility clusters in this
test case. Each node in the distribution system is comprised
of these 20 load clusters, with its own unique sensitivities
a_{i,c}(p_τ) for each cluster. Each sensitivity parameter is se-
lected as a_{i,c}(p_{i,τ}) ∼ N(β_c θ*_i p_{i,τ}, σ²) each day, where β_c is
a cluster-specific constant known by the aggregator. Each
node's price sensitivity, i.e., the parameter to be learned, θ*_i, is
a vector of length 6, and the set of possible parameters, Θ,
contains 10 unique vectors.
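The snippet below shows how the 2⁶ candidate price signals of this test case
could be enumerated and how one Gaussian sensitivity draw might look. The
numeric price levels and constants are placeholders (not the values used in
the paper), and the mean is interpreted here as the inner product θ*_i·p_{i,τ}.

```python
# Assumed illustration of the Section IV-D setup: 2^6 price signals and one
# Gaussian sensitivity draw. Price levels and constants are placeholders.
import itertools
import numpy as np

p_low, p_high = 0.10, 0.30                            # assumed price levels
price_set = [np.array(p) for p in itertools.product([p_low, p_high], repeat=6)]
print(len(price_set))                                  # 64 daily price signals

rng = np.random.default_rng(0)
theta_i = rng.choice([0.5, 1.0, 1.5], size=6)          # stand-in length-6 parameter
beta_c, sigma = 2.0, 0.5                               # assumed cluster constants
a_ic = rng.normal(beta_c * (theta_i @ price_set[0]), sigma)
```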
E. Results
We simulated the Con-TS-RTP algorithm for 365 days for
an aggregator attempting to learn the sensitivities of the nodes
in the system and shape their demands. In the following, we
highlight the results of the simulation at node 10 of the radial
distribution system. Figure 2 presents the evolution of the
prior distribution for node 10’s hidden parameter.
Figure 3 presents the regret performance (both the cumula-
tive regret and number of suboptimal price signals selected)
of Con-TS-RTP at node 10. As seen in Figure 3, the regret
curve flattens after day 130 as the algorithm never chooses
a suboptimal price signal after this day.
Figure 4 presents node 10’s deviation from a specific daily
target profile. On days 2, 3, 4, 53, and 365 the same target
profile (i.e., V2=V3=V4=V53 =V365) was drawn
and the aggregator selected different price signals to shape
the node’s demand. As seen in Fig. 4, the deviation from the
target profile on day 365 is less than the deviation on the
other days as the algorithm has learned the true parameter
and selects the optimal price signal to shape the load.
In Figure 5, we present the distribution system constraint
violations that were avoided by using Con-TS-RTP instead
Fig. 3. Regret performance of Con-TS-RTP at node 10.
Fig. 4. Deviation of node 10’s demand from a specific daily target profile.
of an unconstrained TS algorithm. Clearly, in the early
learning stages, the unconstrained TS algorithm does not have
accurate knowledge of the hidden parameters and violates the
distribution system constraints often. Con-TS-RTP is more
conservative with its exploration of untested price signals and
avoids the constraint violations made by the unconstrained
TS algorithm. The simulation was implemented with Mat-
lab/CVX on an i7 processor with 16 GB of RAM and ran in under 5 minutes.
V. CONCLUSION
We presented a multi-armed bandit problem formulation
for an electricity aggregator implementing a real-time pricing
program for load shaping (e.g., reduce demand at peak hours,
integrate more intermittent renewables, track a desired daily
load profile, etc). We made use of a constrained Thomp-
son sampling heuristic, Con-TS-RTP, as a solution to the
exploration/exploitation problem of an aggregator passively
learning customers’ price sensitivities while broadcasting
price signals that influence customers to alter their demand.
The Con-TS-RTP algorithm permits day-varying target load
profiles and takes into account the actual operational con-
Fig. 5. Top: Distribution system constraint violations at node 10 avoided
by using Con-TS-RTP instead of an unconstrained TS. Bottom: Distribution
system constraint violations across the entire system avoided by using Con-
TS-RTP.
straints of a distribution system to ensure that the customers
receive adequate service and to avoid potential grid failures.
We discussed a regret guarantee for the proposed Con-TS-
RTP algorithm which bounds the total number of suboptimal
price signals broadcasted by the aggregator. Furthermore, we
discussed an operational reliability guarantee that ensures the
power distribution system constraints are upheld with high
probability throughout the run of the Con-TS-RTP algorithm.
REFERENCES
[1] C. Eid, E. Koliou, M. Valles, J. Reneses, and R. Hakvoort, “Time-
based pricing and electricity demand response: Existing barriers and
next steps,” Utilities Policy, vol. 40, pp. 15 – 25, 2016.
[2] V. Gomez, M. Chertkov, S. Backhaus, and H. J. Kappen, “Learning
price-elasticity of smart consumers in power distribution systems,” in
2012 SmartGridComm. IEEE, 2012, pp. 647–652.
[3] Z. Xu, T. Deng, Z. Hu, Y. Song, and J. Wang, “Data-driven pricing
strategy for demand-side resource aggregators,” IEEE Transactions on
Smart Grid, vol. 9, no. 1, pp. 57–66, 2016.
[4] P. Li and B. Zhang, “Linear estimation of treatment effects in de-
mand response: An experimental design approach,” arXiv preprint
arXiv:1706.09835, 2017.
[5] P. Li, H. Wang, and B. Zhang, “A distributed online pricing strategy
for demand response programs,” IEEE Transactions on Smart Grid,
vol. 10, no. 1, pp. 350–360, 2017.
[6] K. Khezeli, W. Lin, and E. Bitar, “Learning to buy (and sell) demand
response,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 6761–6767, 2017.
[7] L. Jia, L. Tong, and Q. Zhao, “An online learning approach to dynamic
pricing for demand response,” arXiv preprint arXiv:1404.1325, 2014.
[8] K. Khezeli and E. Bitar, “Risk-sensitive learning and pricing for
demand response,” IEEE Transactions on Smart Grid, vol. 9, no. 6,
pp. 6000–6007, 2017.
[9] L. Jia, Q. Zhao, and L. Tong, “Retail pricing for stochastic demand
with unknown parameters: An online machine learning approach,” in
2013 51st Allerton. IEEE, 2013, pp. 1353–1358.
[10] Y. Li, Q. Hu, and N. Li, “Learning and selecting the right customers for
reliability: A multi-armed bandit approach,” in 2018 IEEE Conference
on Decision and Control (CDC). IEEE, 2018, pp. 4869–4874.
[11] D. Kalathil and R. Rajagopal, “Online learning for demand response,”
in 2015 53rd Annual Allerton Conference on Communication, Control,
and Computing (Allerton). IEEE, 2015, pp. 218–222.
[12] R. Mieth and Y. Dvorkin, “Data-driven distributionally robust optimal
power flow for distribution systems,” IEEE Control Systems Letters,
vol. 2, no. 3, pp. 363–368, 2018.
[13] E. Dall'Anese, K. Baker, and T. Summers, “Chance-constrained ac
optimal power flow for distribution systems with renewables,” IEEE
Transactions on Power Systems, vol. 32, no. 5, pp. 3427–3438, 2017.
[14] R. Mieth and Y. Dvorkin, “Online learning for network constrained
demand response pricing in distribution systems,” arXiv:1811.09384,
2018.
[15] A. Moradipari, C. Silva, and M. Alizadeh, “Learning to dynamically
price electricity demand based on multi-armed bandits,” in 2018 IEEE
GlobalSIP, Nov 2018, pp. 917–921.
[16] M. Alizadeh, A. Scaglione, A. Applebaum, G. Kesidis, and K. Levitt,
“Reduced-order load models for large populations of flexible appli-
ances,” IEEE Transactions on Power Systems, vol. 30, no. 4, 2015.
[17] T.-H. Chang, M. Alizadeh, and A. Scaglione, “Coordinated home
energy management for real-time power balancing,” in 2012 IEEE
Power and Energy Society General Meeting. IEEE, 2012, pp. 1–8.
[18] M. Alizadeh and A. Scaglione, “Least laxity first scheduling of
thermostatically controlled loads for regulation services,” in 2013 IEEE
GlobalSIP. IEEE, 2013, pp. 503–506.
[19] A. Krishnamurthy, Z. S. Wu, and V. Syrgkanis, “Semiparametric
contextual bandits,” in International Conference on Machine Learning,
2018, pp. 2781–2790.
[20] D. Foster, A. Agarwal, M. Dudik, H. Luo, and R. Schapire, “Practical
contextual bandits with regression oracles,” Proceedings of Machine
Learning Research, vol. 80, 2018.
[21] T. Xu, Y. Yu, J. Turner, and A. Regan, “Thompson sampling in
dynamic systems for contextual bandit problems,” arXiv preprint
arXiv:1310.5008, 2013.
[22] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen et al., “A
tutorial on thompson sampling,” Foundations and Trends in Machine
Learning, vol. 11, no. 1, pp. 1–96, 2018.
[23] D. Russo and B. Van Roy, “Learning to optimize via posterior
sampling,” Mathematics of Operations Research, vol. 39, no. 4, pp.
1221–1243, 2014.
[24] S. Agrawal and N. Goyal, “Analysis of thompson sampling for the
multi-armed bandit problem,” in Conference on Learning Theory,
2012, pp. 39–1.
[25] V. Saxena, J. Jaldén, J. E. Gonzalez, I. Stoica, and H. Tullberg,
“Constrained thompson sampling for wireless link optimization,” arXiv
preprint arXiv:1902.11102, 2019.
[26] A. Gopalan, S. Mannor, and Y. Mansour, “Thompson sampling for
complex online problems,” in International Conference on Machine
Learning, 2014, pp. 100–108.
[27] S. Amani, M. Alizadeh, and C. Thrampoulidis, “Linear stochastic
bandits under safety constraints,” arXiv:1908.05814, 2019.
[28] P. Andrianesis, M. Caramanis, R. Masiello, R. Tabors, and S. Bahrami-
rad, “Locational marginal value of distributed energy resources as non-
wires alternatives,” IEEE Transactions on Smart Grid, 2019.
[29] M. E. Baran and F. F. Wu, “Optimal capacitor placement on radial
distribution systems,” IEEE Transactions on Power Delivery, vol. 4,
no. 1, pp. 725–734, Jan 1989.
[30] H. J. Liu, “Decentralized optimization approach for power distribution
network and microgrid controls,” Ph.D. dissertation, University of
Illinois at Urbana-Champaign, 2017.
[31] H. Zhu and H. J. Liu, “Fast local voltage control under limited reactive
power: Optimality and stability analysis,” IEEE Transactions on Power
Systems, vol. 31, no. 5, pp. 3794–3803, 2016.
[32] P. Šulc, S. Backhaus, and M. Chertkov, “Optimal distributed control
of reactive power via the alternating direction method of multipliers,”
IEEE Transactions on Energy Conversion, vol. 29, no. 4, 2014.
[33] M. Farivar, L. Chen, and S. Low, “Equilibrium and dynamics of local
voltage control in distribution systems,” in 52nd IEEE Conference on
Decision and Control. IEEE, 2013, pp. 4329–4334.
APPENDIX
The following contains the supplementary material for
the manuscript entitled: Constrained Thompson Sampling
for Real-Time Electricity Pricing with Grid Reliability Con-
straints by N. Tucker, A. Moradipari, and M. Alizadeh.
A. Discussion on Regret Performance
In this section, we describe the necessary background for
Theorem 1 and then present the full version of the Theorem.
In the following, p^{V_τ,*} denotes the optimal price signal for
the true model of the population's price response θ* when
the daily exogenous parameter V_τ is observed on day τ. Any
price signal p_τ ≠ p^{V_τ,*} is considered a suboptimal price.
We now briefly explain how the posterior updates affect
the regret performance. When price p is posted on day τ,
the prior density is updated as

  π_τ(dθ) ∝ exp( −log [ l(Y_τ; p, θ*) / l(Y_τ; p, θ) ] ) π_{τ−1}(dθ).        (24)
Now, denote by KL(θ*_p || θ_p) the marginal Kullback-Leibler
divergence between the distributions {l(Y; p, θ*) : Y ∈ Y}
and {l(Y; p, θ) : Y ∈ Y}. As in [26], we can approximately
write (24) as:

  π_τ(dθ) ∝ exp( −Σ_{p∈P} N_τ(p) KL(θ*_p || θ_p) ) π_{τ−1}(dθ),        (25)

where N_τ(p) = Σ_{V∈V} N_τ(p, V), and N_τ(p, V) is the
number of times up to day τ that the algorithm simultane-
ously observes a target profile V and posts a price p.
Furthermore, we define N_τ = [N_τ(p)]_{p∈P} as a vector
consisting of the number of times each price is posted up to
day τ. We can consider the quantity in the exponent of (25)
as a loss suffered by model θ up to day τ. Since the term
in the exponent of (25) is equal to 0 when θ = θ*, we can
see that Thompson sampling samples θ* and hence posts the
optimal price with at least a constant probability at each day,
i.e., N_τ(p^{V,*}, V) grows linearly with τ for all V.
For each price, we define S_p(V) := {θ ∈ Θ : p_τ =
p | V_τ = V} to be the set of parameters θ ∈ Θ whose
optimal price when observing a daily target load profile
V is p. Furthermore, define S'_p(V) := {θ ∈ S_p(V) :
KL(θ*_{p^{V,*}} || θ_{p^{V,*}}) = 0}, which is the set of models θ that
exactly match θ* in the marginal distribution of Y when the
true model θ* is selected and the optimal price p^{V,*} is posted,
and S''_p(V) := S_p(V) \ S'_p(V).
For each of the models θ in S''_p(V), p ≠
p^{V,*}, KL(θ*_{p^{V,*}} || θ_{p^{V,*}}) > ε > 0. As we have assumed that
the probability of observing any target profile V ∈ V is
bounded away from zero, N_τ(p^{V,*}) grows linearly with
τ for all V ∈ V. Hence, any such model θ is sampled
with probability exponentially decaying in τ in (25) and the
regret from such S''_p(V)-sampling is negligible. We define
the set of all such models as Θ'' = ∪_{V∈V} S''_p(V).
A model θ ∈ S'_p(V) will only face loss whenever the algo-
rithm posted a suboptimal price p for which KL(θ*_p || θ_p) > 0.
For V, a suboptimal price p^V_k ≠ p^{V,*} may still be posted if
any of the set of models in S'_{p^V_k}(V) may still be drawn with
non-negligible probability. Hence, a price will be eliminated
after the probability of drawing all θ ∈ S'_{p^V_k}(V) is negligible.
For each V, suboptimal prices are eliminated one after the
other at times τ^V_k, k = 1, ..., |P| − 1. We refer the reader
to [26] for a full discussion of when a suboptimal price p
is considered statistically eliminated, which is used to write
the constraints in (26) below.
Theorem 1. (Expanded Version) Under Assumptions 1-4
and Constraint Set A in Algorithm 1, for δ, ε ∈ (0, 1), there
exists T* ≥ 0 s.t. for all T ≥ T*, with probability 1 − δ:

  Σ_{V∈V} Σ_{p ∈ P\{p^{V,*}}}  N_T(p, V) ≤ B + C(log T),

where B ≡ B(δ, ε, P, Y, Θ) is a problem-dependent constant
that does not depend on T, and C(log T) depends on T,
the sequence of selected price signals, and the Kullback-
Leibler divergence properties of the bandit problem (i.e.,
the marginal Kullback-Leibler divergences of the observa-
tion distributions KL(ℓ(Y; p, θ*), ℓ(Y; p, θ))). Specifically,
the C(log T) term is defined as follows:

  C(log T) = max  Σ_{V∈V} Σ_{k=1}^{|P|−1} N_{τ^V_k}(p, V)        (26)
  s.t.  ∀V ∈ V, ∀j > 1, 1 ≤ k ≤ |P| − 1:
        min_{θ ∈ S'_{p^V_k}(V) \ Θ''}  ⟨N_{τ^V_k}, KL_θ⟩ ≥ ((1 + ε)/(1 − ε)) log T,
        min_{θ ∈ S'_{p^V_k}(V) \ Θ''}  ⟨N_{τ^V_k} − e(j), KL_θ⟩ < ((1 + ε)/(1 − ε)) log T,

where e(j) denotes the j-th unit vector in finite-dimensional
Euclidean space. The last two constraints ensure that price
p^V_k is eliminated at time τ^V_k (no earlier and no later).
Proof. In Con-TS-RTP with Constraint Set A, the aggrega-
tor's daily objective and constraints are dependent on the
sampled parameter θ̃_τ. The only difference between Con-
TS-RTP and the daily optimization in [15] is the added
constraints. Since the constraints are only enforced for the
sampled parameter, each sampled parameter θ̃_τ still has
a unique optimal price signal, and more importantly, the
constraints do not prohibit the algorithm from selecting
the optimal price for the sampled parameter. As such, the
addition of constraints that depend only on the daily sampled
parameter does not alter the bandit problem, and the regret
analysis follows from [26].
B. Discussion on Operational Reliability
Proposition 1. (Repeated) Under Assumptions 1-5, with ν
in equations (14)-(16) chosen such that ν = µ π_0(θ*) e^{−λ|P|},
with probability 1 − δ/2 the Con-TS-RTP algorithm with
Constraint Set B will uphold the probabilistic distribution
system constraints as formulated in (10) for each day τ =
1, ..., T.
Proof. In [26], it is shown that with probability 1 − δ/2
the mass of the true parameter never decreases below
π_0(θ*) e^{−λ|P|} in the prior distribution during the learning
process. As such, the desired reliability metric on the RHS of
the constraints (14)-(16), i.e., 1 − ν, can be selected such that
the constraints must hold for the true parameter. Let π*_min =
π_0(θ*) e^{−λ|P|} be the minimum reachable mass of the true
parameter in the prior distribution. Furthermore, we abuse
notation and denote P^safe_j = P_{{φ_c}_{c∈C}} [ g_j(D_τ(p_τ)) ≤ 0 ] as
the probability that constraint j is upheld. Now, assuming
the aggregator only has knowledge of the true parameter
given by the prior distribution π_τ on day τ, the aggregator
can calculate the probability of satisfying the constraint as
follows:

  Σ_{θ̂∈Θ} π_τ(θ̂) (P^safe_j | θ = θ̂).        (27)

This can be split into two terms for the true parameter θ*
and all other parameters θ ≠ θ*:

  π_τ(θ*) (P^safe_j | θ = θ*) + (1 − π_τ(θ*)) (P^safe_j | θ ≠ θ*).        (28)

Now, we can rewrite the probability assuming that θ* has
reached the minimum mass π*_min in the prior distribution:

  π*_min (P^safe_j | θ = θ*) + (1 − π*_min) (P^safe_j | θ ≠ θ*).        (29)

Recall, the aggregator wants constraint j to hold with prob-
ability at least 1 − µ for the true parameter θ*, so we
can replace (P^safe_j | θ = θ*) with 1 − µ. Furthermore,
(P^safe_j | θ ≠ θ*) ≤ 1 and we replace it accordingly, yielding:

  π*_min (1 − µ) + (1 − π*_min).        (30)

Now, we want this probability to be the minimum allowable
probability across the prior π for constraint j to hold, so we
set it equal to the reliability metric:

  π*_min (1 − µ) + (1 − π*_min) = 1 − ν,        (31)

which yields

  ν = µ π*_min.        (32)

By selecting ν = µ π*_min, the aggregator ensures that con-
straint j will be upheld with probability at least 1 − µ for
the true parameter θ* (i.e., the total mass of the incorrect
parameters θ ≠ θ* in the prior distribution π_τ can never be
large enough to satisfy the constraint's inequality without the
true parameter also satisfying the constraint).